Compare commits

...

1533 Commits

Author SHA1 Message Date
Avi Kivity
f0a8465345 Update seastar submodule (json crash in describe_ring)
* seastar 0568c231cd...fd0d7c1c9a (1):
  > Merge 'stream_range_as_array: always close output stream' from Benny Halevy

Fixes #10592.
2022-06-08 16:50:51 +03:00
Beni Peled
1e6fe6391f release: prepare for 4.5.7 2022-05-16 15:27:49 +03:00
Benny Halevy
237c67367c table: clear: serialize with ongoing flush
Get all flush permits to serialize with any
ongoing flushes and preventing further flushes
during table::clear, in particular calling
discard_completed_segments for every table and
clearing the memtables in clear_and_add.

Fixes #10423

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit aae532a96b)
2022-05-15 13:44:15 +03:00
Raphael S. Carvalho
f5707399c3 compaction: LCS: don't write to disengaged optional on compaction completion
Dtest triggers the problem by:
1) creating table with LCS
2) disabling regular compaction
3) writing a few sstables
4) running maintenance compaction, e.g. cleanup

Once the maintenance compaction completes, disengaged optional _last_compacted_keys
triggers an exception in notify_completion().

_last_compacted_keys is used by regular for its round-robin file picking
policy. It stores the last compacted key for each level. Meaning it's
irrelevant for any other compaction type.

Regular compaction is responsible for initializing it when it runs for
the first time to pick files. But with it disabled, notify_completion()
will find it uninitialized, therefore resulting in bad_optional_access.

To fix this, the procedure is skipped if _last_compacted_keys is
disengaged. Regular compaction, once re-enabled, will be able to
fill _last_compacted_keys by looking at metadata of the files.

compaction_test.py::TestCompaction::test_disable_autocompaction_doesnt_
block_user_initiated_compactions[CLEANUP-LeveledCompactionStrategy]
now passes.

Fixes #10378.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #10508

(cherry picked from commit 8e99d3912e)
2022-05-15 13:20:40 +03:00
Juliusz Stasiewicz
4c28744bfd CQL: Replace assert by exception on invalid auth opcode
One user observed this assertion fail, but it's an extremely rare event.
The root cause - interlacing of processing STARTUP and OPTIONS messages -
is still there, but now it's harmless enough to leave it as is.

Fixes #10487

Closes #10503

(cherry picked from commit 603dd72f9e)
2022-05-10 14:09:54 +02:00
Eliran Sinvani
3b253e1166 prepared_statements: Invalidate batch statement too
It seams that batch prepared statements always return false for
depends_on, this in turn renders the removal criteria from the
prepared statements cache to always be false which result by the
queries not being evicted.
Here we change the function to return the true state meaning,
they will return true if one of the sub queries is dependant
upon the keyspace and/ or column family.

Fixes #10129

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
(cherry picked from commit 4eb0398457)
2022-05-08 12:37:00 +03:00
Eliran Sinvani
2dedfacabf cql3 statements: Change dependency test API to express better it's
purpose

Cql statements used to have two API functions, depends_on_keyspace and
depends_on_column_family. The former, took as a parameter only a table
name, which makes no sense. There could be multiple tables with the same
name each in a different keyspace and it doesn't make sense to
generalize the test - i.e to ask "Does a statement depend on any table
named XXX?"
In this change we unify the two calls to one - depends on that takes a
keyspace name and optionally also a table name, that way every logical
dependency tests that makes sense is supported by a single API call.

(cherry picked from commit bf50dbd35b)

Ref #10129
2022-05-08 12:36:48 +03:00
Calle Wilund
414e2687d4 cdc: Ensure columns removed from log table are registered as dropped
If we are redefining the log table, we need to ensure any dropped
columns are registered in "dropped_columns" table, otherwise clients will not
be able to read data older than now.
Includes unit test.

Should probably be backported to all CDC enabled versions.

Fixes #10473
Closes #10474

(cherry picked from commit 78350a7e1b)

Backport notes: removed cql-pytest test from the original commit, since
the cql-pytest framework on branch-4.5 is missing the `nodetool`
package.
2022-05-05 11:31:33 +02:00
Tomasz Grabiec
3d63f12d3b loading_cache: Make invalidation take immediate effect
There are two issues with current implementation of remove/remove_if:

  1) If it happens concurrently with get_ptr(), the latter may still
  populate the cache using value obtained from before remove() was
  called. remove() is used to invalidate caches, e.g. the prepared
  statements cache, and the expected semantic is that values
  calculated from before remove() should not be present in the cache
  after invalidation.

  2) As long as there is any active pointer to the cached value
  (obtained by get_ptr()), the old value from before remove() will be
  still accessible and returned by get_ptr(). This can make remove()
  have no effect indefinitely if there is persistent use of the cache.

One of the user-perceived effects of this bug is that some prepared
statements may not get invalidated after a schema change and still use
the old schema (until next invalidation). If the schema change was
modifying UDT, this can cause statement execution failures. CQL
coordinator will try to interpret bound values using old set of
fields. If the driver uses the new schema, the coordinaotr will fail
to process the value with the following exception:

  User Defined Type value contained too many fields (expected 5, got 6)

The patch fixes the problem by making remove()/remove_if() erase old
entries from _loading_values immediately.

The predicate-based remove_if() variant has to also invalidate values
which are concurrently loading to be safe. The predicate cannot be
avaluated on values which are not ready. This may invalidate some
values unnecessarily, but I think it's fine.

Fixes #10117

Message-Id: <20220309135902.261734-1-tgrabiec@scylladb.com>
(cherry picked from commit 8fa704972f)
2022-05-04 15:50:16 +03:00
Vlad Zolotarov
e09f2d0ea0 loading_shared_values/loading_cache: get rid of iterators interface and return value_ptr from find(...) instead
loading_shared_values/loading_cache'es iterators interface is dangerous/fragile because
iterator doesn't "lock" the entry it points to and if there is a
preemption point between aquiring non-end() iterator and its
dereferencing the corresponding cache entry may had already got evicted (for
whatever reason, e.g. cache size constraints or expiration) and then
dereferencing may end up in a use-after-free and we don't have any
protection against it in the value_extractor_fn today.

And this is in addition to #8920.

So, instead of trying to fix the iterator interface this patch kills two
birds in a single shot: we are ditching the iterators interface
completely and return value_ptr from find(...) instead - the same one we
are returning from loading_cache::get_ptr(...) asyncronous APIs.

A similar rework is done to a loading_shared_values loading_cache is
based on: we drop iterators interface and return
loading_shared_values::entry_ptr from find(...) instead.

loading_cache::value_ptr already takes care of "lock"ing the returned value so that it
would relain readable even if it's evicted from the cache by the time
one tries to read it. And of course it also takes care of updating the
last read time stamp and moving the corresponding item to the top of the
MRU list.

Fixes #8920

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20210817222404.3097708-1-vladz@scylladb.com>
(cherry picked from commit 7bd1bcd779)

[avi: prerequisite to backporting #10117]
2022-05-04 15:42:21 +03:00
Avi Kivity
484b23c08e Merge 'replica/database: drop_column_family(): properly cleanup stale querier cache entries' from Botond Dénes
Said method has to evict all querier cache entries, belonging to the to-be-dropped table. This is already the case, but there was a window where new entries could sneak in, causing a stale reference to the table to be de-referenced later when they are evicted due to TTL. This window is now closed, the entries are evicted after the method has waited for all ongoing operations on said table to stop.

Fixes: #10450

Closes #10451

* github.com:scylladb/scylla:
  replica/database: drop_column_family(): drop querier cache entries after waiting for ops
  replica/database: finish coroutinizing drop_column_family()
  replica/database: make remove(const column_family&) private

(cherry picked from commit 7f1e368e92)
2022-05-01 18:41:36 +03:00
Avi Kivity
97bcbd3c1f Update tools/java submodule (bad IPv6 addresses in nodetool)
* tools/java 8700c89b07...42151ec974 (1):
  > CASSANDRA-17581 fix NodeProbe: Malformed IPv6 address at index

Fixes #10442
2022-04-28 11:36:12 +03:00
Yaron Kaikov
85d3fe744b release: prepare for 4.5.6 2022-04-24 14:42:47 +03:00
Tomasz Grabiec
48d4759fad utils/chunked_vector: Fix sigsegv during reserve()
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.

Found during code review, no known user impact.

Fixes #10363.

Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com>
(cherry picked from commit 01eeb33c6e)

[avi: make max_chunk_capacity() public for backport]
2022-04-13 10:30:57 +03:00
Avi Kivity
c4e8d2e761 transport: return correct error codes when downgrading v4 {WRITE,READ}_FAILURE to {WRITE,READ}_TIMEOUT
Protocol v4 added WRITE_FAILURE and READ_FAILURE. When running under v3
we downgrade these exceptions to WRITE_TIMEOUT and READ_TIMEOUT (since
the client won't understand the v4 errors), but we still send the new
error codes. This causes the client to become confused.

Fix by updating the error codes.

A better fix is to move the error code from the constructor parameter
list and hard-code it in the constructor, but that is left for a follow-up
after this minimal fix.

Fixes #5610.

Closes #10362

(cherry picked from commit 987e6533d2)
2022-04-13 09:49:36 +03:00
Avi Kivity
45c93db71f Update seastar submodule (logger deadlock with large messages)
* seastar 4456fcdfc0...0568c231cd (2):
  > log: Fix silencer to be shard-local and logger-global
  > log: Silence logger when logging

Fixes #10336.
2022-04-05 19:50:13 +03:00
Piotr Sarna
bab3afca5e cql3: fix qualifying restrictions with IN for indexing
When a query contains IN restriction on its partition key,
it's currently not eligible for indexing. It was however
erroneously qualified as such, which lead to fetching incorrect
results. This commit fixes the issue by not allowing such queries
to undergo indexing, and comes with a regression test.

Fixes #10300

Closes #10302

(cherry picked from commit c0fd53a9d7)
(cherry picked from commit ded169476b0498aeedeeca40a612f2cebb54348a)
2022-04-04 10:27:25 +02:00
Avi Kivity
8a0f3cc136 Update seastar submodule (pidof command not installed)
* seastar 7487cae646...4456fcdfc0 (1):
  > seastar-cpu-map.sh: switch from pidof to pgrep
Fixes #10238.
2022-03-29 12:41:53 +03:00
Yaron Kaikov
c789a18195 release: prepare for 4.5.5 2022-03-28 10:44:46 +03:00
Benny Halevy
195c8a6f80 atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal
Following up on a57c087c89,
compare_atomic_cell_for_merge should compare the ttl value in the
reverse order since, when comparing two cells that are identical
in all attributes but their ttl, we want to keep the cell with the
smaller ttl value rather than the larger ttl, since it was written
at a later (wall-clock) time, and so would remain longer after it
expires, until purged after gc_grace seconds.

Fixes #10173

Test: mutation_test.test_cell_ordering, unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220306091913.106508-1-bhalevy@scylladb.com>
(cherry picked from commit a085ef74ff)
2022-03-24 18:18:15 +02:00
Benny Halevy
8a263600dd atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal
Unlike atomic_cell_or_collection::equals, compare_atomic_cell_for_merge
currently returns std::strong_ordering::equal if two cells are equal in
every way except their ttl:s.

The problem with that is that the cells' hashes are different and this
will cause repair to keep trying to repair discrepancies caused by the
ttl being different.

This may be triggered by e.g. the spark migrator that computes the ttl
based on the expiry time by subtracting the expiry time from the current
time to produce a respective ttl.

If the cell is migrated multiple times at different times, it will generate
cells that the same expiry (by design) but have different ttl values.

Fixes #10156

Test: mutation_test.test_cell_ordering, unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
(cherry picked from commit a57c087c89)
2022-03-24 18:17:28 +02:00
Avi Kivity
da806c4451 replica, atomic_cell: move atomic_cell merge code from replica module to atomic_cell.cc
compare_atomic_cell_for_merge() was placed in database.cc, before
atomic_cell.cc existed. Move it to its correct place.

Closes #9889

(cherry picked from commit 6c53717a39)

[avi: 4.5 backport: retain pre-shapship-operator code]
2022-03-24 18:11:42 +02:00
Benny Halevy
619bfb7c4e main: shutdown: do not abort on certain system errors
Currently any unhandled error during deferred shutdown
is rethrown in a noexcept context (in ~deferred_action),
generating a core dump.

The core dump is not helpful if the cause of the
error is "environmental", i.e. in the system, rather
than in scylla itself.

This change detects several such errors and calls
_Exit(255) to exit the process early, without leaving
a coredump behind.  Otherwise, call abort() explicitly,
rather than letting terminate() be called implicitly
by the destructor exception handling code.

Fixes #9573

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220227101054.1294368-1-bhalevy@scylladb.com>
(cherry picked from commit 132c9d5933)
2022-03-24 14:50:12 +02:00
Nadav Har'El
06795d29c5 Seastar: backport Seastar fix for missing scring escape in JSON output
Backported Seastar fix:
  > Merge 'json/formatter: Escape strings' from Juliusz Stasiewicz

Fixes #9061

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-03-23 22:32:15 +02:00
Asias He
7e625f3397 storage_service: Generate view update for load and stream
Currently, view will be not updated because the streaming reason is set
to streaming::stream_reason::rebuild. On the receiver side, only
streaming with the reason streaming::stream_reason::repair will trigger
view update.

Change the stream reason to repair to trigger view update for load and
stream. This makes load_and_stream behaves the same as nodetool refresh.

Note: However, this is not very efficient though.

Consider RF = 3, sst1, sst2, sst3 from the older cluster. When sst1 is
loaded, it streams to 3 replica nodes, if we generate view updates, we
will have 3 view updates for this replica (each of the peer nodes finds
its peer and writes the view update to peer). After loading sst2 and
sst3, we will have 9 view updates in total for a single partition.
If we create the view after the load and stream process, we will only
have 3 view updates for a single partition.

If we create the view after the load and stream process, we will only
have 3 view updates for a single partition.

Fixes #9205

Closes #9213

(cherry picked from commit eaf4d2afb4)
2022-03-09 16:37:13 +02:00
Nadav Har'El
865cfebfed cql: INSERT JSON should refuse empty-string partition key
Add the missing partition-key validation in INSERT JSON statements.

Scylla, following the lead of Cassandra, forbids an empty-string partition
key (please note that this is not the same as a null partition key, and
that null clustering keys *are* allowed).

Trying to INSERT, UPDATE or DELETE a partition with an empty string as
the partition key fails with a "Key may not be empty". However, we had a
loophole - you could insert such empty-string partition keys using an
"INSERT ... JSON" statement.

The problem was that the partition key validation was done in one place -
`modification_statement::build_partition_keys()`. The INSERT, UPDATE and
DELETE statements all inherited this same method and got the correct
validation. But the INSERT JSON statement - insert_prepared_json_statement
overrode the build_partition_keys() method and this override forgot to call
the validation function. So in this patch we add the missing validation.

Note that the validation function checks for more than just empty strings -
there is also a length limit for partition keys.

This patch also adds a cql-pytest reproducer for this bug. Before this
patch, the test passed on Cassandra but failed on Scylla.

Reported by @FortTell
Fixes #9853.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220116085216.21774-1-nyh@scylladb.com>
(cherry picked from commit 8fd5041092)
2022-03-02 23:02:01 +02:00
Botond Dénes
1963d1cc25 mutation_writer: feed_writer(): handle exceptions from consume_end_of_stream()
Currently the exception handling code of feed_writer() assumes
consume_end_of_stream() doesn't throw. This is false and an exception
from said method can currently lead to an unclean destroy of the writer
and reader. Fix by also handling exceptions from
consume_end_of_stream() too.

Closes #10147
2022-03-02 14:20:20 +01:00
Takuya ASADA
53b0aaa4e8 scylla_raid_setup: revert workaround patch and stop using mdmonitor
We found that monitor mode of mdadm does not work on RAID0, and it is
not a bug, expected behavior according to RHEL developer.
Therefore, we should revert workaround patch which downgrades mdadm,
and stop enabling mdmonitor since we always use RAID0.

See #9540

Closes #10077
2022-02-17 18:36:42 +02:00
Yaron Kaikov
ebfa2279a4 release: prepare for 4.5.4 2022-02-14 08:24:59 +02:00
Nadav Har'El
5e38a69f6d alternator: allow REMOVE of non-existent nested attribute
DynamoDB allows an UpdateItem operation "REMOVE x.y" when a map x
exists in the item, but x.y doesn't - the removal silently does
nothing. Alternator incorrectly generated an error in this case,
and unfortunately we didn't have a test for this case.

So in this patch we add the missing test (which fails on Alternator
before this patch - and passes on DynamoDB) and then fix the behavior.
After this patch, "REMOVE x.y" will remain an error if "x" doesn't
exist (saying "document paths not valid for this item"), but if "x"
exists and is a map, but "x.y" doesn't, the removal will silently
do nothing and will not be an error.

Fixes #10043.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207133652.181994-1-nyh@scylladb.com>
(cherry picked from commit 9982a28007)
2022-02-08 12:09:38 +02:00
Nadav Har'El
6a54033a63 docker: don't repeat "--alternator-address" option twice
If the Docker startup script is passed both "--alternator-port" and
"--alternator-https-port", a combination which is supposed to be
allowed, it passes to Scylla the "--alternator-address" option twice.
This isn't necessary, and worse - not allowed.

So this patch fixes the scyllasetup.py script to only pass this
parameter once.

Fixes #10016.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220202165814.1700047-1-nyh@scylladb.com>
(cherry picked from commit cb6630040d)
2022-02-03 18:40:03 +02:00
Avi Kivity
ec44412cd9 Update seastar submodule (gratuitous exceptions on allocation failure)
* seastar e26dc6665f...770167e835 (1):
  > core: memory: Avoid current_backtrace() on alloc failure when logging suppressed

Fixes #9982.
2022-01-30 20:05:06 +02:00
Avi Kivity
8f45f65b09 Merge 'scylla_raid_setup: use mdmonitor only when RAID level > 0' from Takuya ASADA
We found that monitor mode of mdadm does not work on RAID0, and it is
not a bug, expected behavior according to RHEL developer.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.

Fixes #9540

----

This reverts 0d8f932 and introduce correct fix.

Closes #9970

* github.com:scylladb/scylla:
  scylla_raid_setup: use mdmonitor only when RAID level > 0
  Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"

(cherry picked from commit df22396a34)
2022-01-27 10:24:53 +02:00
Avi Kivity
5a7324c423 Update tools/java submodule (maxPendingPerConnection default)
* tools/java dbcea78e7d...8700c89b07 (2):
  > Fix NullPointerException in SettingsMode
  > cassandra-stress: Remove maxPendingPerConnection default

Ref #7748.
2022-01-12 21:37:07 +02:00
Nadav Har'El
2eb0ad7b4f alternator: allow Authorization header to be without spaces
The "Authorization" HTTP header is used in DynamoDB API to sign
requests. Our parser for this header, in server::verify_signature(),
required the different components of this header to be separated by
a comma followed by a whitespace - but it turns out that in DynamoDB
both spaces and commas are optional - one of them is enough.

At least one DynamoDB client library - the old "boto" (which predated
boto3) - builds this header without spaces.

In this patch we add a test that shows that an Authorization header
with spaces removed works fine in DynamoDB but didn't work in
Alternator, and after this patch modifies the parsing code for this
header, the test begins to pass (and the other tests show that the
previously-working cases didn't break).

Fixes #9568

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211101214114.35693-1-nyh@scylladb.com>
(cherry picked from commit 56eb994d8f)
2021-12-29 14:40:02 +02:00
Nadav Har'El
5d7064e00e alternator: return the correct Content-Type header
Although the DynamoDB API responses are JSON, additional conventions apply
to these responses - such as how error codes are encoded in JSON. For this
reason, DynamoDB uses the content type `application/x-amz-json-1.0` instead
of the standard `application/json` in its responses.

Until this patch, Scylla used `application/json` in its responses. This
unexpected content-type didn't bother any of the AWS libraries which we
tested, but it does bother the aiodynamo library (see HENNGE/aiodynamo#27).

Moreover, we should return the x-amz-json-1.0 content type for future
proofing: It turns out that AWS already defined x-amz-json-1.1 - see:
https://awslabs.github.io/smithy/1.0/spec/aws/aws-json-1_1-protocol.html
The 1.1 content type differs (only) in how it encodes error replies.
If one day DynamoDB starts to use this new reply format (it doesn't yet)
and if DynamoDB libraries will need to differenciate between the two
reply formats, Alternator better return the right one.

This patch also includes a new test that the Content-Type header is
returned with the expected value. The test passes on DynamoDB, and
after this patch it starts to pass on Alternator as well.

Fixes #9554.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211031094621.1193387-1-nyh@scylladb.com>
(cherry picked from commit 6ae0ea0c48)
2021-12-29 12:22:49 +02:00
Takuya ASADA
b56b9f5ed5 scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8
On CentOS8, mdmonitor.service does not works correctly when using
mdadm-4.1-15.el8.x86_64 and later versions.
Until we find a solution, let's pinning the package version to older one
which does not cause the issue (4.1-14.el8.x86_64).

Fixes #9540

Closes #9782

(cherry picked from commit 0d8f932f0b)
2021-12-28 11:38:24 +02:00
Avi Kivity
c8f14886dc Revert "cql3: Reject updates with NULL key values"
This reverts commit 44c784cb79. It
causes a regression without 6afdc6004c,
and it is too complicated to backport at this time.

Ref #9311.
2021-12-23 13:03:13 +02:00
Botond Dénes
5c8057749b treewide: distinguish truncated frame errors
We have two identical "Truncated frame" errors, at:
* read_frame_size() in serialization_visitors.hh;
* cql_server::connection::read_and_decompress_frame() in
  transport/server.cc;

When such an exception is thrown, it is impossible to tell where was it
thrown from and it doesn't have any further information contained in it
(beyond the basic information it being thrown implies).
This patch solves both problems: it makes the exception messages unique
per location and it adds information about why it was thrown (the
expected vs. real size of the frame).

Ref: #9482

Closes #9520

(cherry picked from commit 9ec55e054d)
2021-12-21 15:27:17 +02:00
Pavel Emelyanov
a35646b874 row-cache: Handle exception (un)safety of rows_entry insertion
The B-tree's insert_before() is throwing operation, its caller
must account for that. When the rows_entry's collection was
switched on B-tree all the risky places were fixed by ee9e1045,
but few places went under the radar.

In the cache_flat_mutation_reader there's a place where a C-pointer
is inserted into the tree, thus potentially leaking the entry.

In the partition_snapshot_row_cursor there are two places that not
only leak the entry, but also leave it in the LRU list. The latter
it quite nasty, because those entry can be evicted, eviction code
tries to get rows_entry iterator from "this", but the hook happens
to be unattached (because insertion threw) and fails the assert.

fixes: #9728

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit ee103636ac)
2021-12-14 16:01:55 +02:00
Pavel Emelyanov
da8708932d partition_snapshot_row_cursor: Shuffle ensure_result creation
Both places get the C-pointer on the freshly allocated rows_entry,
insert it where needed and return back the dereferenced pointer.

The C-pointer is going to become smart-pointer that would go out
of scope before return. This change prepares for that by constructing
the ensure_result from the iterator, that's returned from insertion
of the entry.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 9fd8db318d)

Ref #9728
2021-12-14 16:01:48 +02:00
Nadav Har'El
78a545716a Update Seastar module with additional backports
Backported a Seastar fix:
  > Merge 'metrics: Fix dtest->ulong conversion error' from Benny Halevy

Fixes #9794
2021-12-14 13:18:07 +02:00
Asias He
1488278fc1 storage_service: Wait for seastar::get_units in node_ops
The seastar::get_units returns a future, we have to wait for it.

Fixes #9767

Closes #9768

(cherry picked from commit 9859c76de1)
2021-12-12 18:42:33 +02:00
Nadav Har'El
d9455a910f alternator: add missing BatchGetItem metric
Unfortunately, defining metrics in Scylla requires some code
duplication, with the metrics declared in one place but exported in a
different place in the code. When we duplicated this code in Alternator,
we accidentally dropped the first metric - for BatchGetItem. The metric
was accounted in the code, but not exported to Prometheus.

In addition to fixing the missing metric, this patch also adds a test
that confirms that the BatchGetItem metric increases when the
BatchGetItem operation is used. This test failed before this patch, and
passes with it. The test only currently tests this for BatchGetItem
(and BatchWriteItem) but it can be later expanded to cover all the other
operations as well.

Fixes #9406

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210929121611.373074-1-nyh@scylladb.com>
(cherry picked from commit 5cbe9178fd)
2021-12-06 12:43:49 +02:00
Yaron Kaikov
406b4bce8d release: prepare for 4.5.3 2021-12-05 14:48:33 +02:00
Botond Dénes
417e853b9b mutation_reader: shard_reader: ensure referenced objects are kept alive
The shard reader can outlive its parent reader (the multishard reader).
This creates a problem for lifecycle management: readers take the range
and slice parameters by reference and users keep these alive until the
reader is alive. The shard reader outliving the top-level reader means
that any background read-ahead that it has to wait on will potentially
have stale references to the range and the slice. This was seen in the
wild recently when the evictable reader wrapped by the shard reader hit
a use-after-free while wrapping up a background read-ahead.
This problem was solved by fa43d76 but any previous versions are
susceptible to it.

This patch solves this problem by having the shard reader copy and keep
the range and slice parameters in stable storage, before passing them
further down.

Fixes: #9719

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211202113910.484591-1-bdenes@scylladb.com>
2021-12-02 17:49:36 +02:00
Dejan Mircevski
44c784cb79 cql3: Reject updates with NULL key values
We were silently ignoring INSERTs with NULL values for primary-key
columns, which Cassandra rejects.  Fix it by rejecting any
modification_statement that would operate on empty partition or
clustering range.

This is the most direct fix, because range and slice are calculated in
one place for all modification statements.  It covers not only NULL
cases, but also impossible restrictions like c>0 AND c<0.
Unfortunately, Cassandra doesn't treat all modification statements
consistently, so this fix cannot fully match its behavior.  We err on
the side of tolerance, accepting some DELETE statements that Cassandra
rejects.  We add a TODO for rejecting such DELETEs later.

Fixes #7852.

Tests: unit (dev), cql-pytest against Cassandra 4.0

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9286

(cherry picked from commit 1fdaeca7d0)
2021-11-29 17:19:51 +02:00
Piotr Jastrzebski
ab425a11a8 sstables: Fix writing KA/LA sstables index
Before this patch when writing an index block, the sstables writer was
storing range tombstones that span the boundary of the block in order
of end bounds. This led to a range tombstone being ignored by a reader
if there was a row tombstone inside it.

This patch sorts the range tombstones based on start bound before
writing them to the index file.

The assumption is that writing an index block is rare so we can afford
sortting the tombstones at that point. Additionally this is a writer of
an old format and writing to it will be dropped in the next major
release so it should be rarely used already.

Kudos to Kamil Braun <kbraun@scylladb.com> for finding the reproducer.

Test: unit(dev)

Fixes #9690

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit scylladb/scylla-enterprise@eb093afd6f)
2021-11-26 15:36:03 +01:00
Tomasz Grabiec
2228a1a92a cql: Fix missing data in indexed queries with base table short reads
Indexed queries are using paging over the materialized view
table. Results of the view read are then used to issue reads of the
base table. If base table reads are short reads, the page is returned
to the user and paging state is adjusted accordingly so that when
paging is resumed it will query the view starting from the row
corresponding to the next row in the base which was not yet
returned. However, paging state's "remaining" count was not reset, so
if the view read was exhausted the reading will stop even though the
base table read was short.

Fix by restoring the "remaining" count when adjusting the paging state
on short read.

Tests:

  - index_with_paging_test
  - secondary_index_test

Fixes #9198
Message-Id: <20210818131840.1160267-1-tgrabiec@scylladb.com>

(cherry picked from commit 1e4da2dcce)
2021-11-23 11:22:13 +02:00
Takuya ASADA
36b190a65e docker: add stopwaitsecs
We need stopwaitsecs just like we do TimeoutStpSec=900 on
scylla-server.service, to avoid timeout on scylla-server shutdown.

Fixes #9485

Closes #9545

(cherry picked from commit c9499230c3)
2021-11-15 13:36:40 +02:00
Takuya ASADA
f7e5339c14 scylla_io_setup: support ARM instances on AWS
Add preset parameters for AWS ARM intances.

Fixes #9493

(cherry picked from commit 4e8060ba72)
2021-11-15 13:35:15 +02:00
Asias He
4c4972cb33 gossip: Fix use-after-free in real_mark_alive and mark_dead
In commit 11a8912093 (gossiper:
get_gossip_status: return string_view and make noexcept)
get_gossip_status returns a pointer to an endpoint_state in
endpoint_state_map.

After commit 425e3b1182 (gossip: Introduce
direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive
can yield in the middle of the function. It is possible that
endpoint_state can be removed, causing use-after-free to access it.

To fix, make a copy before we yield.

Fixes #8859

Closes #8862

(cherry picked from commit 7a32cab524)
2021-11-15 13:23:02 +02:00
Takuya ASADA
50ce5bef2c scylla_util.py: On is_gce(), return False when it's on GKE
GKE metadata server does not provide same metadata as GCE, we should not
return True on is_gce().
So try to fetch machine-type from metadata server, return False if it
404 not found.

Fixes #9471

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #9582

(cherry picked from commit 9b4cf8c532)
2021-11-15 13:18:02 +02:00
Asias He
f864eea844 repair: Return HTTP 400 when repiar id is not found
There are two APIs for checking the repair status and they behave
differently in case the id is not found.

```
{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_async/system_auth?id=999", "duration": "1ms",
"status": 400, "bytes": 49, "dump": "HTTP/1.1 400 Bad
Request\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 400}"}

{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_status?id=999&timeout=1", "duration": "0ms",
"status": 500, "bytes": 49, "dump": "HTTP/1.1 500 Internal Server
Error\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 500}"}
```

The correct status code is 400 as this is a parameter error and should
not be retried.

Returning status code 500 makes smarter http clients retry the request
in hopes of server recovering.

After this patch:

curl -X PGET
'http://127.0.0.1:10000/storage_service/repair_async/system_auth?id=9999'
{"message": "unknown repair id 9999", "code": 400}

curl -X GET
'http://127.0.0.1:10000/storage_service/repair_status?id=9999'
{"message": "unknown repair id 9999", "code": 400}

Fixes #9576

Closes #9578

(cherry picked from commit f5f5714aa6)
2021-11-15 13:15:59 +02:00
Calle Wilund
b9735ab079 cdc: fix broken function signature in maybe_back_insert_iterator
Fixes #9103

compare overload was declared as "bool" even though it is a tri-cmp.
causes us to never use the speed-up shortcut (lessen search set),
in turn meaning more overhead for collections.

Closes #9104

(cherry picked from commit 59555fa363)
2021-11-15 13:13:04 +02:00
Takuya ASADA
766e16f19e scylla_io_setup: handle nr_disks on GCP correctly
nr_disks is int, should not be string.

Fixes #9429

Closes #9430

(cherry picked from commit 3b798afc1e)
2021-11-15 13:06:30 +02:00
Michał Chojnowski
e6520df41c utils: fragment_range: fix FragmentedView utils for views with empty fragments
The copying and comparing utilities for FragmentedView are not prepared to
deal with empty fragments in non-empty views, and will fall into an infinite
loop in such case.
But data coming in result_row_view can contain such fragments, so we need to
fix that.

Fixes #8398.

Closes #8397

(cherry picked from commit f23a47e365)
2021-11-15 12:55:25 +02:00
Hagit Segev
26aca7b9f7 release: prepare for 4.5.2 2021-11-14 14:19:34 +02:00
Avi Kivity
103c85a23f build: clobber user/group info from node_exporter tarball
node_exporter is packaged with some random uid/gid in the tarball.
When extracting it as an ordinary user this isn't a problem, since
the uid/gid are reset to the current user, but that doesn't happen
under dbuild since `tar` thinks the current user is root. This causes
a problem if one wants to delete the build directory later, since it
becomes owned by some random user (see /etc/subuid)

Reset the uid/gid infomation so this doesn't happen.

Closes #9579

Fixes #9610.

(cherry picked from commit e1817b536f)
2021-11-10 14:18:56 +02:00
Nadav Har'El
db66b62e80 alternator: fix bug in ReturnValues=ALL_NEW
This patch fixes a bug in UpdateItem's ReturnValues=ALL_NEW, which in
some cases returned the OLD (pre-modification) value of some of the
attributes, instead of its NEW value.

The bug was caused by a confusion in our JSON utility function,
rjson::set(), which sounds like it can set any member of a map, but in
fact may only be used to add a *new* member - if a member with the same
name (key) already existed, the result is undefined (two values for the
same key). In ReturnValues=ALL_NEW we did exactly this: we started with
a copy of the original item, and then used set() to override some of the
members. This is not allowed.

So in this patch, we introduce a new function, rjson::replace(), which
does what we previously thought that rjson::set() does - i.e., replace a
member if it exists, or if not, add it. We call this function in
the ReturnValues=ALL_NEW code.

This patch also adds a test case that reproduces the incorrect ALL_NEW
results - and gets fixed by this patch.

In an upcoming patch, we should rename the confusingly-named set()
functions and audit all their uses. But we don't do this in this patch
yet. We just add some comments to clarify what set() does - but don't
change it, and just add one new function for replace().

Fixes #9542

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211104134937.40797-1-nyh@scylladb.com>
(cherry picked from commit b95e431228)
2021-11-08 13:11:37 +02:00
Asias He
098fcf900f storage_service: Abort restore_replica_count when node is removed from the cluster
Consider the following procedure:

- n1, n2, n3
- n3 is down
- n1 runs nodetool removenode uuid_of_n3 to removenode from n3 the
  cluster
- n1 is down in the middle of removenode operation

Node n1 will set n3 to removing gossip status during removenode
operation. Whenever existing nodes learn a node is in removing gossip
status, they will call restore_replica_count to stream data from other
nodes for the ranges n3 loses if n3 was removed from the cluster. If
the streaming fails, the streaming will sleep and retry. The current
max number of retry attempts is 5. The sleep interval starts at 60
seconds and increases 1.5 times per sleep.

This can leave the cluster in a bad state. For example, nodes can go
out of disk space if the streaming continues.  We need a way to abort
such streaming attempts.

To abort the removenode operation and forcely remove the node, users
can run `nodetool removenode force` on any existing nodes to move the
node from removing gossip status to removed gossip status. However,
the restore_replica_count will not be aborted.

In this patch, a status checker is added in restore_replica_count, so
that once a node is in removed gossip status, restore_replica_count
will be aborted.

This patch is for older releases without the new NODE_OPS_CMD
infrastructure where such abort will happen automatically in case of
error.

Fixes #8651

Closes #8655

(cherry picked from commit 0858619cba)
2021-11-02 17:25:34 +02:00
Benny Halevy
84025f6ce0 large_data_handle: add sstable name to log messages
Although the sstable name is part of the system.large_* records,
it is not printed in the log.
In particular, this is essential for the "too many rows" warning
that currently does not record a row in any large_* table
so we can't correlate it with a sstable.

Fixes #9524

Test: unit(dev)
DTest: wide_rows_test.py

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211027074104.1753093-1-bhalevy@scylladb.com>
(cherry picked from commit a21b1fbb2f)
2021-10-29 10:41:21 +03:00
Asias He
9898a114a6 repair: Handle everywhere_topology in bootstrap_with_repair
The everywhere_topology returns the number of nodes in the cluster as
RF. This makes only streaming from the node losing the range impossible
since no node is losing the range after bootstrap.

Shortcut to stream from all nodes in local dc in case the keyspace is
everywhere_topology.

Fixes #8503

(cherry picked from commit 3c36517598)
2021-10-28 18:56:01 +03:00
Yaron Kaikov
4c0eac0491 release: prepare for 4.5.1 2021-10-24 14:11:07 +03:00
Benny Halevy
c1d8ce7328 date_tiered_manifest: get_now: fix use after free of sstable_list
The sstable_list is destroyed right after the temporary
lw_shared_ptr<sstable_list> returned from `cf.get_sstables()`
is dereferenced.

Fixes #9138

Test: unit(dev)
DTest: resharding_test.py:ReshardingTombstones_with_DateTieredCompactionStrategy.disable_tombstone_removal_during_reshard_test (debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210804075813.42526-1-bhalevy@scylladb.com>
(cherry picked from commit 3ad0067272)
2021-10-20 17:03:49 +03:00
Jan Ciolek
5c5a71d2d7 cql3: Fix need_filtering on indexed table
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes #8991.
Fixes #7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 54149242b4)
2021-10-18 11:31:20 +03:00
Benny Halevy
454ff04ff6 utils: phased_barrier: advance_and_await: make noexcept
As a function returning a future, simplify
its interface by handling any exceptions and
returning an exceptional future instead of
propagating the exception.

In this specific case, throwing from advance_and_await()
will propagate through table::await_pending_* calls
short-circuiting a .finally clause in table::stop().

Also, mark as noexcept methods of class table calling
advance_and_await and table::await_pending_ops that depends on them.

Fixes #8636

A followup patch will convert advance_and_await to a coroutine.
This is done separately to facilitate backporting of this patch.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210511161407.218402-1-bhalevy@scylladb.com>
(cherry picked from commit c0dafa75d9)
2021-10-13 12:26:03 +03:00
Takuya ASADA
f5f5b9a307 scylla_raid_setup: enabling mdmonitor.service on Debian variants
On Debian variants, mdmonitor.service cannnot enable because it missing
[Install] section, so 'systemctl enable mdmonitor.service' will fail,
not able to run mdmonitor after the system restarted.

To force running the service, add Wants=mdmonitor.service on
var-lib-scylla.mount.

Fixes #8494

Closes #8530

(cherry picked from commit c9324634ca)
2021-10-12 13:58:49 +03:00
Avi Kivity
a433c5fe06 Merge 'rjson: Add throwing allocator' from Piotr Sarna
This series adds a wrapper for the default rjson allocator which throws on allocation/reallocation failures. It's done to work around several rapidjson (the underlying JSON parsing library) bugs - in a few cases, malloc/realloc return value is not checked, which results in dereferencing a null pointer (or an arbitrary pointer computed as 0 + `size`, with the `size` parameter being provided by the user). The new allocator will throw an `rjson:error` if it fails to allocate or reallocate memory.
This series comes with unit tests which checks the new allocator behavior and also validates that an internal rapidjson structure which we indirectly rely upon (Stack) is not left in invalid state after throwing. The last part is verified by the fact that its destructor ran without errors.

Fixes #8521
Refs #8515

Tests:
 * unit(release)
 * YCSB: inserting data similar to the one mentioned in #8515 - 1.5MB objects clustered in partitions 30k objects in size - nothing crashed during various YCSB workloads, but nothing also crashed for me locally before this patch, so it's not 100% robust
 relevant YCSB workload config for using 1.5MB objects:
```yaml
fieldcount=150
fieldlength=10000
```

Closes #8529

* github.com:scylladb/scylla:
  test: add a test for rjson allocation
  test: rename alternator_base64_test to alternator_unit_test
  rjson: add a throwing allocator

(cherry picked from commit c36549b22e)
2021-10-12 13:56:59 +03:00
Benny Halevy
7f96ee6689 streaming: stream_session: do not escape curly braces in format strings
Those turn into '{}' in the formatted strings and trigger
a logger error in the following sstlog.warn(err.c_str())
call.

Fixes #8436

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210408173048.124417-1-bhalevy@scylladb.com>
(cherry picked from commit 76cd315c42)
2021-10-12 13:49:13 +03:00
Calle Wilund
ebe196e32d table: ensure memtable is actually in memtable list before erasing
Fixes #8749

if a table::clear() was issued while we were flushing a memtable,
the memtable is already gone from list. We need to check this before
erase. Otherwise we get random memory corruption via
std::vector::erase

v2:
* Make interface more set-like (tolerate non-existance in erase).

Closes #8904

(cherry picked from commit 373fa3fa07)
2021-10-12 13:47:25 +03:00
Benny Halevy
38aa455e83 utils: merge_to_gently: prevent stall in std::copy_if
std::copy_if runs without yielding.

See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480

Note that the standard states that no iterators or references are invalidated
on insert so we can keep inserting before last1 when merging the
remainder of list2 at the tail of list1.

Fixes #8897

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 453e7c8795)
2021-10-12 13:05:45 +03:00
Avi Kivity
a99382a076 main: start background reclaim before bootstrap
We start background reclaim after we bootstrap, so bootstrap doesn't
benefit from it, and sees long stalls.

Fix by moving background reclaim initialization early, before
storage_service::join_cluster().

(storage_service::join_cluster() is quite odd in that main waits
for it synchronously, compared to everything else which is just
a background service that is only initialized in main).

Fixes #8473.

Closes #8474

(cherry picked from commit 935378fa53)
2021-10-12 13:00:49 +03:00
Pavel Emelyanov
6e2d055be3 mutation: Keep range tombstone in tree when consuming
Current code std::move()-s the range tombstone into consumer thus
moving the tombstone's linkage to the containing list as well. As
the result the orignal range tombstone itself leaks as it leaves
the tree and cannot be reached on .clear(). Another danger is that
the iterator pointing to the tombstone becomes invalid while it's
then ++-ed to advance to the next entry.

The immediate fix is to keep the tombstone linked to the list while
moving.

fixes: #9207

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210825100834.3216-1-xemul@scylladb.com>
(cherry picked from commit b012040a76)
2021-10-12 12:57:49 +03:00
Michael Livshin
152f710dec avoid race between compaction and table stop
Also add a debug-only compaction-manager-side assertion that tests
that no new compaction tasks were submitted for a table that is being
removed (debug-only because not constant-time).

Fixes #9448.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20211007110416.159110-1-michael.livshin@scylladb.com>
(cherry picked from commit e88891a8af)
2021-10-12 12:51:34 +03:00
Asias He
14620444a2 storage_service: Fix argument in send_meta_data::do_receive
The extra status print is not needed in the log.

Fixes the following error:

ERROR 2021-08-10 10:54:21,088 [shard 0] storage_service -
service/storage_service.cc:3150 @do_receive: failed to log message:
fmt='send_meta_data: got error code={}, from node={}, status={}':
fmt::v7::format_error (argument not found)

Fixes #9183

Closes #9189

(cherry picked from commit ce8fd051c9)
2021-10-12 12:45:47 +03:00
Yaron Kaikov
47be33a104 release: prepare for 4.5.0 2021-10-06 14:12:31 +03:00
Takuya ASADA
18b8388958 scylla_cpuscaling_setup: add --force option
To building Ubuntu AMI with CPU scaling configuration, we need force
running mode for scylla_cpuscaling_setup, which run setup without
checking scaling_governor support.

See scylladb/scylla-machine-image#204

Closes #9326

(cherry picked from commit f928dced0c)
2021-10-05 16:20:10 +03:00
Takuya ASADA
56b24818ec scylla_cpuscaling_setup: disable ondemand.service on Ubuntu
On Ubuntu, scaling_governor becomes powersave after rebooted, even we configured cpufrequtils.
This is because ondemand.service, it unconditionally change scaling_governor to ondemand or powersave.
cpufrequtils will start before ondemand.service, scaling_governor overwrite by ondemand.service.
To configure scaling_governor correctly, we have to disable this service.

Fixes #9324

Closes #9325

(cherry picked from commit cd7fe9a998)
2021-10-03 14:08:22 +03:00
Raphael S. Carvalho
5ed149b7e1 compaction_manager: prevent unbounded growth of pending tasks
There will be unbounded growth of pending tasks if they are submitted
faster than retiring them. That can potentially happen if memtables
are frequently flushed too early. It was observed that this unbounded
growth caused task queue violations as the queue will be filled
with tons of tasks being reevaluated. By avoiding duplication in
pending task list for a given table T, growth is no longer unbounded
and consequently reevaluation is no longer aggressive.

Refs #9331.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210930125718.41243-1-raphaelsc@scylladb.com>
(cherry picked from commit 52302c3238)
2021-10-03 13:10:37 +03:00
Eliran Sinvani
9265dbd5f7 dist: rpm: Add specific versioning and python3 dependency
The Red Hat packages were missing two things, first the metapackage
wasn't dependant at all in the python3 package and second, the
scylla-server package dependencies didn't contain a version as part
of the dependency which can cause to some problems during upgrade.
Doing both of the things listed here is a bit of an overkill as either
one of them separately would solve the problem described in #XXXX
but both should be applied in order to express the correct concept.

Fixes #8829

Closes #8832

(cherry picked from commit 9bfb2754eb)
2021-09-12 16:01:01 +03:00
Avi Kivity
443fda8fb1 Merge "evictable_readers: don't drop static rows, drop assumption about snapshot isolation" from Botond
"
This mini-series fixes two loosely related bugs around reader recreation
in the evictable reader (related by both being around reader
recreation). A unit test is also added which reproduces both of them and
checks that the fixes indeed work. More details in the patches
themselves.
This series replaces the two independent patches sent before:
* [PATCH v1] evictable_reader: always reset static row drop flag
* [PATCH v1] evictable_reader: relax partition key check on reader
  recreation

As they depend on each other, it is easier to add a test if they are in
a series.

Fixes: #8923
Fixes: #8893

Tests: unit(dev, mutation_reader_test:debug)
"

* 'evictable-reader-recreation-more-bugs/v1' of https://github.com/denesb/scylla:
  test: mutation_reader_test: add more test for reader recreation
  evictable_reader: relax partition key check on reader recreation
  evictable_reader: always reset static row drop flag

(cherry picked from commit 4209dfd753)
2021-09-06 20:35:14 +03:00
Pavel Emelyanov
02da29fd05 btree: Destroy, not drop, node on clone roll-back
The node in this place is not yet attached to its parent, so
in btree::debug::yes (tests only) mode the node::drop()'s parent
checks will access null parent pointer.

However, in non-tesing runtime there's a chance that a linear
node fails to clone one of its keys and gets here. In this case
it will carry both leftmost and rightmost flags and the assertion
in drop will fire.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 1d857d604a)

Ref #9248.
2021-09-06 11:16:29 +03:00
Yaron Kaikov
edead1caf9 release: prepare for 4.5.rc7 2021-09-06 08:40:20 +03:00
Tomasz Grabiec
8dbd4edbb5 Merge 'hints: use token_metadata to tell if node has left the ring' from Piotr Dulikowski
This PR changes the `can_send` function so that it looks at the `token_metadata` in order to tell if the destination node is in the ring. Previously, gossiper state was used for that purpose and required a relatively complicated condition to check. The new logic just uses `token_metadata::is_member` which reduces complexity of the `can_send` function.

Additionally, `storage_service` is slightly modified so that during a removenode operation the `token_metadata` is first updated and only then endpoint lifecycle subscribers are notified. This was done in order to prevent a race just like the one which happened in #5087 - hints manager is a lifecycle subscriber and starts a draining operation when a node is removed, and in order for draining to work correctly, `can_send` should keep returning true for that node.

Tests:

- unit(dev)
- dtest(hintedhandoff_additional_test.py)
- dtest(topology_test.py)

Closes #8387

* github.com:scylladb/scylla:
  hints: clarify docstring comment for can_send
  hints: use token_metadata to tell if node is in the ring
  hints: slightly reogranize "if" statement in can_send
  storage_service: release token_metadata lock before notify_left
  storage_service: notify_left after token_metadata is replicated

(cherry picked from commit 307bd354d2)

Ref #5087.
2021-09-05 17:38:11 +03:00
Pavel Emelyanov
02bb2e1f4c btree: Dont leak kids on clone roll-back
When failed-to-be-cloned node cleans itself it must also clear
all its child nodes. Plain destroy() doesn't do it, it only
frees the provided node.

fixes: #9248

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit d1a1a2dac2)
2021-09-05 17:33:48 +03:00
Benny Halevy
4bae31523d distributed_loader: distributed_loader::get_sstables_from_upload_dir: do not copy vector containing foreign shared sstables
lw_shared_ptr must not be copied on a foreign shard.
Copying the vector on shard 0 tries increases the reference count of
lw_shared_ptr<sstable> elements that were created on other shards,
as seen in https://github.com/scylladb/scylla/issues/9278.

Fixes #9278

DTest: migration_test.py:TestLoadAndStream_with_3_0_md.load_and_stream_increase_cluster_test(debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210902084313.2003328-1-bhalevy@scylladb.com>
(cherry picked from commit 33f579f783)
2021-09-05 17:02:38 +03:00
Michał Chojnowski
55348131f9 utils: compact-radix-tree: fix accidental cache line bouncing
Whenever a node_head_ptr is assigned to nil_root, the _backref inside it is
overwritten. But since nil_root is shared between shards, this causes severe
cache line bouncing. (It was observed to reduce the total write throughput
of Scylla by 90% on a large NUMA machine).

This backreference is never read anyway, so fix this bug by not writing it.

Fixes #9252

Closes #9246

(cherry picked from commit 126baa7850)
2021-08-29 15:45:33 +03:00
Avi Kivity
c1b9de3d5e Revert "messaging_service: Enforce dc/rack membership iff required for non-tls connections"
This reverts commit a0745f9498. It breaks
multiregion clusters on AWS.

Ref #8418.
2021-08-29 15:44:11 +03:00
Avi Kivity
9956bce436 Revert "messaging_service: Bind to listen address, not broadcast"
This reverts commit 6dc7ef512d. It stands
in the way of reverting  a0745f9498, which
is implicated in #8418.
2021-08-29 15:43:29 +03:00
Pavel Solodovnikov
95f32428e4 raft: create system tables only when raft experimental feature is set
Also introduce a tiny function to return raft-enabled db config
for cql testing.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210826091432.279532-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit c0854a0f62)

Ref #9239.
2021-08-29 14:01:36 +03:00
Pavel Solodovnikov
5b3319816a db: add experimental option for raft
Introduce `raft` experimental option.
Adjust the tests accordingly to accomodate the new option.

It's not enabled by default when providing
`--experimental=true` config option and should be
requested explicitly via `--experimental-options=raft`
config option.

Hide the code related to `raft_group_registry` behind
the switch. The service object is still constructed
but no initialization is performed (`init()` is not
called) if the flag is not set.

Later, other raft-related things, such as raft schema
changes, will also use this flag.

Also, don't introduce a corresponding gossiper feature
just yet, because again, it should be done after the
raft schema changes API contract is stabilized.

This will be done in a separate series, probably related to
implementing the feature itself.

Tests: unit(dev)

Ref #9239.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210823121956.167682-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit 22794efc22)
2021-08-29 14:01:33 +03:00
Avi Kivity
8f63a9de31 Update seastar submodule (perftune failure on bond NIC)
* seastar dab10ba6ad...70ea9312a1 (1):
  > perftune.py: instrument bonding tuning flow with 'nic' parameter

Fixes #9225.
2021-08-19 16:58:15 +03:00
Hagit Segev
88314fedfa release: prepare for 4.5.rc6 2021-08-17 14:17:04 +03:00
Calle Wilund
b0edfa6d70 commitlog/config: Make hard size enforcement false by default + add config opt
Refs #9053

Flips default for commitlog disk footprint hard limit enforcement to off due
to observed latency stalls with stress runs. Instead adds an optional flag
"commitlog_use_hard_size_limit" which can be turned on to in fact do enforce it.

Sort of tape and string fix until we can properly tweak the balance between
cl & sstable flush rate.

Closes #9195

(cherry picked from commit 3633c077be)
2021-08-16 10:05:08 +03:00
Takuya ASADA
9338f6b6b8 scylla_cpuscaling_setup: change scaling_governor path
On some environment /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
does not exist even it supported CPU scaling.
Instead, /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor is
avaliable on both environment, so we should switch to it.

Fixes #9191

Closes #9193

(cherry picked from commit e5bb88b69a)
2021-08-12 12:10:04 +03:00
Asias He
28940ef505 table: Fix is_shared assert for load and stream
The reader is used by load and stream to read sstables from the upload
directory which are not guaranteed to belong to the local shard.

Using the make_range_sstable_reader instead of
make_local_shard_sstable_reader.

Tests:

backup_restore_tests.py:TestBackupRestore.load_and_stream_using_snapshot_test
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_2_test
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_1_test
migration_test.py:TestLoadAndStream.load_and_stream_asymmetric_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_decrease_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_frozen_pk_test
migration_test.py:TestLoadAndStream.load_and_stream_increase_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_primary_replica_only_test

Fixes #9173

Closes #9185

(cherry picked from commit 040b626235)
2021-08-12 11:17:33 +03:00
Raphael S. Carvalho
4c03bcce4c compaction: Prevent tons of compaction of fully expired sstable from happening in parallel
Compaction manager can start tons of compaction of fully expired sstable in
parallel, which may consume a significant amount of resources.
This problem is caused by weight being released too early in compaction, after
data is all compacted but before table is called to update its state, like
replacing sstables and so on.
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
releases all the resources.

This happens because, with expired sstable, compaction will release weight
faster than it will update table state, as there's nothing to be compacted.

With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can be easily worse depending on the amount of sstables
with fully expired data, across all tables. This high parallelism can happen
only with a couple of tables, if there are many time windows with expired data,
as they can be compacted in parallel.

Prior to 55a8b6e3c9, weight was released earlier in compaction, before
last sstable was sealed, but right now, there's no need to release weight
earlier. Weight can be released in a much simpler way, after the compaction is
actually done. So such compactions will be serialized from now on.

Fixes #8710.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>

[avi: drop now unneeded storage_service_for_tests]

(cherry picked from commit a7cdd846da)
2021-08-10 18:16:24 +03:00
Piotr Jastrzebski
860e2190a9 api: use proper type to reduce partition count
Partition count is of a type size_t but we use std::plus<int>
to reduce values of partition count in various column families.
This patch changes the argument of std::plus to the right type.
Using std::plus<int> for size_t compiles but does not work as expected.
For example plus<int>(2147483648LL, 1LL) = -2147483647 while the code
would probably want 2147483649.

Fixes #9090

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #9074

(cherry picked from commit 90a607e844)
2021-08-10 18:16:15 +03:00
Nadav Har'El
6cf88812f6 secondary index: fix regression in CREATE INDEX IF NOT EXISTS
The recent commit 0ef0a4c78d added helpful
error messages in case an index cannot be created because the intended
name of its materialized view is already taken - but accidentally broke
the "CREATE INDEX IF NOT EXISTS" feature.

The checking code was correct, but in the wrong place: we need to first
check maybe the index already exists and "IF NOT EXISTS" was chosen -
and only do this new error checking if this is not the case.

This patch also includes a cql-pytest test for reproducing this bug.
The bug is also reproduced by the translated Cassandra unit tests
    cassandra_tests/validation/entities/secondary_index_test.py::
    testCreateAndDropIndex
and this is how I found this bug. After these patch, all these tests
pass.

Fixes #8717.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210526143635.624398-1-nyh@scylladb.com>
(cherry picked from commit 97e827e3e1)
2021-08-10 18:02:02 +03:00
Nadav Har'El
89fbcf9c81 Merge 'Fix index name conflicts with regular tables' from Piotr Sarna
When an index is created without an explicit name, a default name
is chosen. However, there was no check if a table with conflicting
name already exists. The check is now in place and if any conflicts
are found, a new index name is chosen instead.
When an index is created *with* an explicit name and a conflicting
regular table is found, index creation should simply fail.

This series comes with a test.

Fixes #8620
Tests: unit(release)

Closes #8632

* github.com:scylladb/scylla:
  cql-pytest: add regression tests for index creation
  cql3: fail to create an index if there is a name conflict
  database: check for conflicting table names for indexes

(cherry picked from commit cee4c075d2)
2021-08-10 12:33:21 +03:00
Calle Wilund
6dc7ef512d messaging_service: Bind to listen address, not broadcast
Refs #8418

Broadcast can (apparently) be an address not actually on machine, but
on the other side of NAT. Thus binding local side of outgoing
connection there will fail.
Bind instead to listen_address (or broadcast, if listen_to_broadcast),
this will require routing + NAT to make the connection looking
like from broadcast from node connected to, to allow the connection
(if using partial encryption).

Note: this is somewhat verified somewhat limitedly. I would suggest
verifying various multi rack/dc setups before relying on it.

Closes #8974

(cherry picked from commit b8b5f69111)
2021-07-18 14:09:03 +03:00
Asias He
ae39b30ed3 repair: Consider memory bloat when calculate repair parallelism
The repair parallelism is calculated by the number of memory allocated to
repair and memory usage per repair instance. Currently, it does not
consider memory bloat issues (e.g., issue #8640) which cause repair to
use more memory and cause std::bad_alloc.

Be more conservative when calculating the parallelism to avoid repair
using too much memory.

Fixes #8641

Closes #8652

(cherry picked from commit b8749f51cb)
2021-07-15 13:01:43 +03:00
Avi Kivity
d54372d699 Dedicate Scylla 4.5 release 2021-07-14 18:41:02 +03:00
Hagit Segev
7f96719c55 release: prepare for 4.5.rc5 2021-07-11 16:56:02 +03:00
Takuya ASADA
0c33983c71 scylla-fstrim.timer: drop BindsTo=scylla-server.service
To avoid restart scylla-server.service unexpectedly, drop BindsTo=
from scylla-fstrim.timer.

Fixes #8921

Closes #8973

(cherry picked from commit def81807aa)
2021-07-08 10:06:12 +03:00
Calle Wilund
3c51b4066b commitlog: Use defensive copies of segment list in iterations
Fixes #8952

In 5ebf5835b0 we added a segment
prune after flushing, to deal with deadlocks in shutdown.
This means that calls that issue sync/flush-like ops "for-all",
need to operate on a defensive copy of the list.

Closes #8980

(cherry picked from commit ce45ffdffb)
2021-07-08 09:56:14 +03:00
Takuya ASADA
efa4d24deb dist/redhat: fix systemd unit name of scylla-node-exporter
systemd unit name of scylla-node-exporter is
scylla-node-exporter.service, not node-exporter.service.

Fixes #8966

Closes #8967

(cherry picked from commit f19ebe5709)
2021-07-07 18:37:42 +03:00
Takuya ASADA
3e1d608111 dist: stop removing /etc/systemd/system/*.mount on package uninstall
Listing /etc/systemd/system/*.mount as ghost file seems incorrect,
since user may want to keep using RAID volume / coredump directory after
uninstalling Scylla, or user may want to upgrade enterprise version.

Also, we mixed two types of files as ghost file, it should handle differently:
 1. automatically generated by postinst scriptlet
 2. generated by user invoked scylla_setup

The package should remove only 1, since 2 is generated by user decision.

However, just dropping .mount from %files section causes another
problem, rpm will remove these files during upgrade, instead of
uninstall (#8924).

To fix both problem, specify .mount files as "%ghost %config".
It will keep files both package upgrade and package remove.

See scylladb/scylla-enterprise#1780

Closes #8810
Closes #8924

Closes #8959

(cherry picked from commit f71f9786c7)
2021-07-07 18:37:42 +03:00
Pavel Emelyanov
a87bb38c29 hasher: More picky noexcept marking of feed_hash()
Commit 5adb8e555c marked the ::feed_hash() and a visitor lambda of
digester::feed_hash() as noexcept. This was quite recklesl as the
appending_hash<>::operator()s called by ::feed_hash() are not all
marked noexcept. In particular, the appending_hash<row>() is not
such and seem to throw.

The original intent of the mentioned commit was to facilitate the
partition_hasher in repair/ code. The hasher itself had been removed
by the 0af7a22c21, so it no longer needs the feed_hash-s to be
noexcepts.

The fix is to inherit noexcept from the called hashers, but for the
digester::feed_hash part the noexcept is just removed until clang
compilation bug #50994 is fixed.

fixes: #8983
tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210706153608.4299-1-xemul@scylladb.com>
(cherry picked from commit 63a2fed585)
2021-07-07 18:36:00 +03:00
Raphael S. Carvalho
ab3e284e04 LCS/reshape: Don't reshape single sstable in level 0 with strict mode
With strict mode, it could happen that a sstable alone in level 0 is
selected for offstrategy compaction, which means that we could run
into an infinite reshape process.

This is fixed by respecting the offstrategy threshold. Unit test is
added.

Fixes #8573.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210506181324.49636-1-raphaelsc@scylladb.com>
(cherry picked from commit 8480839932)
2021-07-07 14:06:20 +03:00
Raphael S. Carvalho
b0e833d9e5 LCS: reshape: Fix overlapping check when determining if a sstable set is disjoint
Wrong comparison operator is used when checking for overlapping. It
would miss overlapping when last key of a sstable is equal to the first
key of another sstable that comes next in the set, which is sorted by
first key.

Fixes #8531.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 39ecddbd34)
2021-07-07 14:04:04 +03:00
Hagit Segev
9e55d9bd04 release: prepare for 4.5.rc4 2021-07-07 12:17:19 +03:00
Avi Kivity
f89f4e69a0 Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed (#8695)​ (v3)' from Calle Wilund
Fixes #8270

If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk footprint that is over limit causing new segment allocation to stall.

We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
4.) (v2) Must ensure discard/delete routines are executed. Because we can race with background disk syncs, we may need to
    issue segment prunes from end_flush() so we wake up actual file deletion/recycling
5.) (v2) Shutdown must ensure discard/delete is run after we've disabled background task etc, otherwise we might fail waking up replenish and get stuck in gate
6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For same reason as above.
7.) (v3) Signal recycle/delete queues/promise on shutdown (with recognized marker) to handle edge case where we only have a single (allocating) segment in the list, and cannot wake up replenisher in any more civilized way.

Also fix edge case (for tests), when we have too few segment to have an active one (i.e. need flush everything).

New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test.

Closes #8764

* github.com:scylladb/scylla:
  commitlog_test: Add test case for usage/disk size threshold mismatch
  commitlog_test: Improve test assertion
  commitlog: Add waitable future for background sync/flush
  commitlog: abort queues on shutdown
  commitlog: break out "abort" calls into member functions
  commitlog: Do explicit discard+delete in shutdown
  commitlog: Recycle or not should not depend on shutdown state
  commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
  commitlog: Flush all segments if we only have one.
  commitlog: Always force flush if segment allocation is waiting
  commitlog: Include segment wasted (slack) size in footprint check
  commitlog: Adjust (lower) usage threshold

(cherry picked from commit 14252c8b71)
2021-06-27 14:05:36 +03:00
Nadav Har'El
7146646bf4 Merge 'commitlog: make_checked_file for segments, report and ignore other errors on shutdown' from Benny Halevy
Shutdown must never fail, otherwise it may cause hangs
as seen in https://github.com/scylladb/scylla/issues/8577.

This change wraps the file created in `allocate_segment_ex` in `make_checked_file` so that scylla will abort when failing to write to the commitlog files.

In case other errors are seen during shutdown, just log them and continue with shutting down to prevent scylla from hanging.

Fixes #8577

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8578

* github.com:scylladb/scylla:
  commitlog: segment_manager::shutdown: abort on errors
  commitlog: allocate_segment_ex: make_checked_file

(cherry picked from commit 48ff641f67)
2021-06-27 14:04:04 +03:00
Benny Halevy
b3aba49ab0 commitlog: segment_manager: max_size must be aligned
This was triggered by the test_total_space_limit_of_commitlog dtest.
When it passes a very large commitlog_segment_size_in_mb (1/6th of the
free memory size, in mb), segment_manager constructor limits max_size
to std::numeric_limits<position_type>::max() which is 0xffffffff.

This causes allocate_segment_ex to loop forever when writing the segment
file since `dma_write` returns 0 when the count is unaligned (seen 4095).

The fix here is to select a sligtly small maxsize that is aligned
down to a multiple of 1MB.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210407121059.277912-1-bhalevy@scylladb.com>
(cherry picked from commit 705f9c4f79)
2021-06-27 14:03:41 +03:00
Hagit Segev
706de00ef2 release: prepare for 4.5.rc3 2021-06-20 17:14:50 +03:00
Piotr Sarna
cd5b915460 Merge 'view: fix use-after-move when handling view update failures'
Backport of 6726fe79b6.

The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.

Fixes #8830
Tests: unit(release),
       dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)

Closes #8834

* backport-6726fe7:
  view: fix use-after-move when handling view update failures
  db,view: explicitly move the mutation to its helper function
  db,view: pass base token by value to mutate_MV
2021-06-16 13:27:12 +02:00
Piotr Sarna
247d30f075 view: fix use-after-move when handling view update failures
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.

Refs #8830
Tests: unit(release),
       dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
2021-06-16 13:25:50 +02:00
Piotr Sarna
13da17e6fe db,view: explicitly move the mutation to its helper function
The `apply_to_remote_endpoints` helper function used to take
its `mut` parameter by reference, but then moved the value from it,
which is confusing and prone to errors. Since the value is moved-from,
let's pass it to the helper function as rvalue ref explicitly.
2021-06-16 13:25:46 +02:00
Piotr Sarna
6e29d74ab8 db,view: pass base token by value to mutate_MV
The base token is passed cross-continuations, so the current way
of passing it by const reference probably only works because the token
copying is cheap enough to optimize the reference out.
Fix by explicitly taking the token by value.
2021-06-16 13:23:10 +02:00
Tomasz Grabiec
e820e7f3c5 Merge 'Backport for 4.5: Fix replacing node takes writes' from Asias He
This backport fixes the follow issue:

    Cassandra stress fails to achieve consistency during replace node operation #8013

without the NODE_OPS_CMD infrastructure.

The commit c82250e0cf (gossip: Allow deferring advertise of local node to be up) which fixes for

     During replace node operation - replacing node is used to respond to read queries #7312

is already present in 4.5 branch.

Closes #8703

* github.com:scylladb/scylla:
  storage_service: Delay update pending ranges for replacing node
  gossip: Add helper to wait for a node to be up
2021-06-08 23:20:25 +02:00
Nadav Har'El
0cebafd104 alternator: fix equality check of nested document containing a set
In issue #5021 we noticed that the equality check in Alternator's condition
expressions needs to handle sets differently - we need to compare the set's
elements ignoring their order. But the implementation we added to fix that
issue was only correct when the entire attribute was a set... In the
general case, an attribute can be a nested document, with only some
inner set. The equality-checking function needs to tranverse this nested
document, and compare the sets inside it as appropriate. This is what
we do in this patch.

This patch also adds a new test comparing equality of a nested document with
some inner sets. This test passes on DynamoDB, failed on Alternator before
this patch, and passes with this patch.

Refs #5021
Fixes #8514

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419184840.471858-1-nyh@scylladb.com>
(cherry picked from commit dae7528fe5)
2021-06-07 09:10:42 +03:00
Nadav Har'El
9abd4677b1 alternator: fix inequality check of two sets
In issue #5021 we noted that Alternator's equality operator needs to be
fixed for the case of comparing two sets, because the equality check needs
to take into account the possibility of different element order.

Unfortunately, we fixed only the equality check operator, but forgot there
is also an inequality operator!

So in this patch we fix the inequality operator, and also add a test for
it that was previously missing.

The implementation of the inequality operator is trivial - it's just the
negation of the equality test. Our pre-existing tests verify that this is
the correct implementation (e.g., if attribute x doesn't exist, then "x = 3"
is false but "x <> 3" is true).

Refs #5021
Fixes #8513

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419141450.464968-1-nyh@scylladb.com>
(cherry picked from commit 50f3201ee2)
2021-06-07 08:42:46 +03:00
Nadav Har'El
9a07d7ca76 alternator: fix equality check of two unset attributes
When a condition expression (ConditionExpression, FilterExpression, etc.)
checks for equality of two item attributes, i.e., "x = y", and when one of
these attributes was missing we correctly returned false.
However, we also need to return false when *both* attributes are missing in
the item, because this is what DynamoDB does in this case. In other words
an unset attribute is never equal to anything - not even to another unset
attribute. This was not happening before this patch:

When x and y were both missing attributes, Alternator incorrectly returned
true for "x = y", and this patch fixes this case. It also fixes "x <> y"
which should to be true when both x and y are unset (but was false
before this patch).

The other comparison operators - <, <=, >, >=, BETWEEN, were all
implemented correctly even before this patch.

This patch also includes tests for all the two-unset-attribute cases of
all the operators listed above. As usual, we check that these tests pass
on both DynamoDB and Alternator to confirm our new behavior is the correct
one - before this patch, two of the new tests failed on Alternator and
passed on DynamoDB.

Fixes #8511

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419123911.462579-1-nyh@scylladb.com>
(cherry picked from commit 46448b0983)
2021-06-06 15:55:57 +03:00
Takuya ASADA
d92a26636a dist: add DefaultDependencies=no to .mount units
To avoid ordering cycle error on Ubuntu, add DefaultDependencies=no
on .mount units.

Fixes #8482

Closes #8495

(cherry picked from commit 0b01e1a167)
2021-05-31 12:11:51 +03:00
Yaron Kaikov
7445bfec86 install.sh: Setup aio-max-nr upon installation
This is a follow up change to #8512.

Let's add aio conf file during scylla installation process and make sure
we also remove this file when uninstall Scylla

As per Avi Kivity's suggestion, let's set aio value as static
configuration, and make it large enough to work with 500 cpus.

Closes #8650

Refs: #8713

(cherry picked from commit dd453ffe6a)
2021-05-27 14:12:19 +03:00
Yaron Kaikov
b077b198bf scylla_io_setup: configure "aio-max-nr" before iotune
On severl instance types in AWS and Azure, we get the following failure
during scylla_io_setup process:
```
ERROR 2021-04-14 07:50:35,666 [shard 5] seastar - Could not setup Async
I/O: Resource temporarily unavailable. The most common cause is not
enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that
number or reducing the amount of logical CPUs available for your
application
```

We have scylla_prepare:configure_io_slots() running before the
scylla-server.service start, but the scylla_io_setup is taking place
before

1) Let's move configure_io_slots() to scylla_util.py since both
   scylla_io_setup and scylla_prepare are import functions from it
2) cleanup scylla_prepare since we don't need the same function twice
3) Let's use configure_io_slots() during scylla_io_setup to avoid such
failure

Fixes: #8587

Closes #8512

Refs: #8713

(cherry picked from commit 588a065304)
2021-05-27 14:11:17 +03:00
Avi Kivity
44f85d2ba0 Update seastar submodule (httpd handler not reading content)
* seastar dadd299e7d...dab10ba6ad (1):
  > httpd: allow handler to not read an empty content
Fixes #8691.
2021-05-25 11:32:18 +03:00
Asias He
ccfe1d12ea storage_service: Delay update pending ranges for replacing node
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node is changed to postpone
the responding of gossip echo message to avoid other nodes sending read
requests to the replacing node. It works as following:

1) replacing node does not respond echo message to avoid other nodes to
mark replacing node as alive

2) replacing node advertises hibernate state so other nodes knows
replacing node is replacing

3) replacing node responds echo message so other nodes can mark
replacing node as alive

This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.

For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)

```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)

c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```

To solve this problem for older releases without the patch "repair:
Switch to use NODE_OPS_CMD for replace operation", a minimum fix is
implemented in this patch. Once existing nodes learn the replacing node
is in HIBERNATE state, they add the replacing as replacing, but only add
the replacing to the pending list only after the replacing node is
marked as alive.

With this patch, when the existing nodes start to write to the replacing
node, the replacing node is already alive.

Tests: replace_address_test.py:TestReplaceAddress.replace_node_same_ip_test + manual test
Fixes: #8013

Closes #8614

(cherry picked from commit e4872a78b5)
2021-05-25 14:12:31 +08:00
Asias He
b0399a7c3b gossip: Add helper to wait for a node to be up
This patch adds gossiper::wait_alive helper to wait for nodes to be up
on all shards.

Refs #8013

(cherry picked from commit f690f3ee8e)
2021-05-25 14:12:11 +08:00
Takuya ASADA
b81919dbe2 scylla_raid_setup: use /dev/disk/by-uuid to specify filesystem
Currently, var-lib-scylla.mount may fails because it can start before
MDRAID volume initialized.
We may able to add "After=dev-disk-by\x2duuid-<uuid>.device" to wait for
device become available, but systemd manual says it automatically
configure dependency for mount unit when we specify filesystem path by
"absolute path of a device node".

So we need to replace What=UUID=<uuid> to What=/dev/disk/by-uuid/<uuid>.

Fixes #8279

Closes #8681

(cherry picked from commit 3d307919c3)
2021-05-24 17:23:56 +03:00
Takuya ASADA
5651a20ba1 install.sh: apply correct file security context when copying files
Currently, unified installer does not apply correct file security context
while copying files, it causes permission error on scylla-server.service.
We should apply default file security context while copying files, using
'-Z' option on /usr/bin/install.

Also, because install -Z requires normalized path to apply correct security
context, use 'realpath -m <PATH>' on path variables on the script.

Fixes #8589

Closes #8602

(cherry picked from commit 60c0b37a4c)
2021-05-19 12:40:12 +03:00
Takuya ASADA
8c30b83ea4 install.sh: fix not such file or directory on nonroot
Since we have added scylla-node-exporter, we needed to do 'install -d'
for systemd directory and sysconfig directory before copying files.

Fixes #8663

Closes #8664

(cherry picked from commit 6faa8b97ec)
2021-05-19 12:40:12 +03:00
Avi Kivity
fce7eab9ac Merge 'Fix type checking in index paging' from Piotr Sarna
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager -
namely, it throws an error if the base column type is not equal
to the view column type. It usually works correctly, unless the
base column type is a clustering key with DESC clustering order,
in which case the type is actually "reversed". From the point of view
of the paging state generation it's not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.

Tests: unit(release), along with the additional test case
       introduced in this series; the test also passes
       on Cassandra

Fixes #8666

Closes #8667

* github.com:scylladb/scylla:
  test: add a test case for paging with desc clustering order
  cql3: relax a type check for index paging

(cherry picked from commit 593ad4de1e)
2021-05-19 12:40:09 +03:00
Raphael S. Carvalho
ab8eefade7 compaction_manager: Don't swallow exception in procedure used by reshape and resharding
run_custom_job() was swallowing all exceptions, which is definitely
wrong because failure in a resharding or reshape would be incorrectly
interpreted as success, which means upper layer will continue as if
everything is ok. For example, ignoring a failure in resharding could
result in a shared sstable being left unresharded, so when that sstable
reaches a table, scylla would abort as shared ssts are no longer
accepted in the main sstable set.
Let's allow the exception to be propagated, so failure will be
communicated, and resharding and reshape will be all or nothing, as
originally intended.

Fixes #8657.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210515015721.384667-1-raphaelsc@scylladb.com>
(cherry picked from commit 10ae77966c)
2021-05-18 13:00:07 +03:00
Hagit Segev
e2704554b5 release: prepare for 4.5.rc2 2021-05-12 14:55:53 +03:00
Lauro Ramos Venancio
e36e490469 TWCS: initialize _highest_window_seen
The timestamp_type is an int64_t. So, it has to be explicitly
initialized before using it.

This missing inicialization prevented the major compactation
from happening when a time window finishes, as described in #8569.

Fixes #8569

Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>

Closes #8590

(cherry picked from commit 15f72f7c9e)
2021-05-06 08:51:58 +03:00
Pavel Emelyanov
c97005fbb8 tracing: Stop tracing in main's deferred action
Tracing is created in two steps and is destroyed in two too.
The 2nd step doesn't have the corresponding stop part, so here
it is -- defer tracing stop after it was started.

But need to keep in mind, that tracing is also shut down on
drain, so the stopping should handle this.

Fixes #8382
tests: unit(dev), manual(start-stop, aborted-start)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210331092221.1602-1-xemul@scylladb.com>
(cherry picked from commit 887a1b0d3d)
2021-05-05 15:22:05 +03:00
Nadav Har'El
d881d539f3 Update tools/java submodule
Backport sstableloader fix in tools/java submodule.
Fixes #8230.

* tools/java 768a59a6f1...dbcea78e7d (1):
  > sstableloader: Handle non-prepared batches with ":" in identifier names

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-03 10:02:00 +03:00
Avi Kivity
b8a502fab0 Merge '[branch 4.5] Backport reader_permit: always forward resources to the semaphore' from Botond Dénes
This is a backport of 8aaa3a7bb8 to <= branch-4.5. The main conflicts were around Benny's reader close series (fa43d7680), but it also turned out that an additional patch (2f1d65ca11) also has to backported to make sure admission on signaling resources doesn't deadlock.

Refs: https://github.com/scylladb/scylla/issues/8493

Closes #8558

* github.com:scylladb/scylla:
  test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress
  test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units
  reader_concurrency_semaphore: add dump_diagnostics()
  reader_permit: always forward resources
  test: multishard_mutation_query_test: fuzzy-test: don't consume resource up-front
  reader_concurrency_semaphore: make admission conditions consistent
2021-04-29 15:36:03 +03:00
Botond Dénes
f7f2bb482f test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress
This unit test checks that the semaphore doesn't get into a deadlock
when contended, in the presence of many memory-only reads (that don't
wait for admission). This is tested by simulating the 3 kind of reads we
currently have in the system:
* memory-only: reads that don't pass admission and only own memory.
* admitted: reads that pass admission.
* evictable: admitted reads that are furthermore evictable.

The test creates and runs a large number of these reads in parallel,
read kinds being selected randomly, then creates a watchdog which
kills the test if no progress is being made.

(cherry picked from commit 45d580f056)
2021-04-29 15:26:21 +03:00
Botond Dénes
b16db6512c test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units
This unit test passes a read through admission again-and-again, just
like an evictable reader would be during its lifetime. When readmitted
the read sometimes has to wait and sometimes not. This is to check that
the readmitting a previously admitted reader doesn't leak any units.

(cherry picked from commit cadc26de38)
2021-04-29 15:26:21 +03:00
Botond Dénes
1a7c8223fe reader_concurrency_semaphore: add dump_diagnostics()
Allow semaphore related tests to include a diagnostics printout in error
messages to help determine why the test failed.

(cherry picked from commit d246e2df0a)
2021-04-29 15:26:21 +03:00
Botond Dénes
ac6aa66a7b reader_permit: always forward resources
This commit conceptually reverts 4c8ab10. Said commit was meant to
prevent the scenario where memory-only permits -- those that don't pass
admission but still consume memory -- completely prevent the admission
of reads, possibly even causing a deadlock because a permit might even
blocks its own admission. The protection introduced by said commit
however proved to be very problematic. It made the status of resources
on the permit very hard to reason about and created loopholes via which
permits could accumulate without tracking or they could even leak
resources. Instead of continuing to patch this broken system, this
commit does away with this "protection" based on the observation that
deadlocks are now prevented anyway by the admission criteria introduced
by 0fe75571d9, which admits a read anyway when all the initial count
resources are available (meaning no admitted reader is alive),
regardless of availability of memory.
The benefits of this revert is that the semaphore now knows about all
the resources and is able to do its job better as it is not "lied to"
about resource by the permits. Furthermore the status of a permit's
resources is much simpler to reason about, there are no more loopholes
in unexpected state transitions to swallow/leak resources.
To prove that this revert is indeed safe, in the next commit we add
robust tests that stress test admission on a highly contested semaphore.
This patch also does away with the registered/admitted differentiation
of permits, as this doesn't make much sense anymore, instead these two
are unified into a single "active" state. One can always tell whether a
permit was admitted or not from whether it owns count resources anyway.

(cherry picked from commit caaa8ef59a)
2021-04-29 15:26:21 +03:00
Botond Dénes
98a39884c3 test: multishard_mutation_query_test: fuzzy-test: don't consume resource up-front
The fuzzy test consumes a large chunk of resource from the semaphore
up-front to simulate a contested semaphore. This isn't an accurate
simulation, because no permit will have more than 1 units in reality.
Furthermore this can even cause a deadlock since 8aaa3a7 as now we rely
on all count units being available to make forward progress when memory
is scarce.
This patch just cuts out this part of the test, we now have a dedicated
unit test for checking a heavily contested semaphore, that does it
properly, so no need to try to fix this clumsy attempt that is just
making trouble at this point.

Refs: #8493

Tests: release(multishard_mutation_query_test:fuzzy_test)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429084458.40406-1-bdenes@scylladb.com>
(cherry picked from commit 26ae9555d1)
2021-04-29 15:26:21 +03:00
Eliran Sinvani
88192811e7 Materialized views: fix possibly old views comming from other nodes
Migration manager has a function to get a schema (for read or write),
this function queries a peer node and retrieves the schema from it. One
scenario where it can happen is if an old node, queries an old not fixed
index.
This makes a hole through which views that are only adjusted for reading
can slip through.

Here we plug the hole by fixing such views before they are registered.

Closes #8509

(cherry picked from commit 480a12d7b3)

Fixes #8554.
2021-04-29 14:02:01 +03:00
Botond Dénes
32f21f7281 reader_concurrency_semaphore: make admission conditions consistent
Currently there are two places where we check admission conditions:
`do_wait_admission()` and `signal()`. Both use `has_available_units()`
to check resource availability, but the former has some additional
resource related conditions on top (in `may_proceed()`), which lead to
the two paths working with slightly different conditions. To fix, push
down all resource availability related checks to `has_available_units()`
to ensure admission conditions are consistent across all paths.

(cherry picked from commit d90cd6402c)
2021-04-27 18:12:29 +03:00
Piotr Sarna
c9eaf95750 Merge 'commitlog: Fix race and edge condition in delete_segments' from Calle Wilund
Fixes #8363
Fixes #8376

Delete segements has two issues when running with size-limited
commit log and strict adherence to said limit.

1.) It uses parallel processing, with deferral. This means that
    the disk usage variables it looks at might not be fully valid
    - i.e. we might have already issued a file delete that will
    reduce disk footprint such that a segment could instead be
    recycled, but since vars are (and should) only updated
    _post_ delete, we don't know.
2.) It does not take into account edge conditions, when we only
    delete a single segment, and this segment is the border segment
    - i.e. the one pushing us over the limit, yet allocation is
    desperately waiting for recycling. In this case we should
    allow it to live on, and assume that next delete will reduce
    footprint. Note: to ensure exact size limit, make sure
    total size is a multiple of segment size.

if we had an error in recycling (disk rename?), and no elements
are available, we could have waiters hoping they will get segements.
abort the queue (not permanent, but wakes up waiters), and let them
retry. Since we did deletions instead, disk footprint should allow
for new allocs at least. Or more likely, everything is broken, but
we will at least make more noise.

Closes #8372

* github.com:scylladb/scylla:
  commitlog: Add signalling to recycle queue iff we fail to recycle
  commitlog: Fix race and edge condition in delete_segments
  commitlog: coroutinize delete_segments
  commitlog_test: Add test for deadlock in recycle waiter

(cherry picked from commit 8e808a56d2)
2021-04-21 18:01:37 +03:00
Avi Kivity
44c6d0fcf9 Update seastar submodule (low bandwidth disks)
* seastar 72e3baed9c...dadd299e7d (2):
  > io_queue: Honor disks with tiny request rate
  > io_queue: Shuffle fair_group creation

Fixes #8378.
2021-04-21 14:00:35 +03:00
Avi Kivity
4bcc0badb2 Point seastar submodule at scylla-seastar.git
This allows us to backport fixes to seastar selectively.
2021-04-21 14:00:00 +03:00
Tomasz Grabiec
97664e63fe Merge 'Make sure that cache_flat_mutation_reader::do_fill_buffer does not fast forward finished underlying reader' from Piotr Jastrzębski
It is possible that a partition is in cache but is not present in sstables that are underneath.
In such case:
1. cache_flat_mutation_reader will fast forward underlying reader to that partition
2. The underlying reader will enter the state when it's empty and its is_end_of_stream() returns true
3. Previously cache_flat_mutation_reader::do_fill_buffer would try to fast forward such empty underlying reader
4. This PR fixes that

Test: unit(dev)

Fixes #8435
Fixes #8411

Closes #8437

* github.com:scylladb/scylla:
  row_cache: remove redundant check in make_reader
  cache_flat_mutation_reader: fix do_fill_buffer
  read_context: add _partition_exists
  read_context: remove skip_first_fragment arg from create_underlying
  read_context: skip first fragment in ensure_underlying

(cherry picked from commit 163f2be277)
2021-04-20 12:49:10 +03:00
Kamil Braun
204964637a time_series_sstable_set: return partition start if some sstables were ck-filtered out
When a particular partition exists in at least one sstable, the cache
expects any single-partition query to this partition to return a `partition_start`
fragment, even if the result is empty.

In `time_series_sstable_set::create_single_key_sstable_reader` it could
happen that all sstables containing data for the given query get
filtered out and only sstables without the relevant partition are left,
resulting in a reader which immediately returns end-of-stream (while it
should return a `partition_start` and if not in forwarding mode, a
`partition_end`). This commit fixes that.

We do it by extending the reader queue (used by the clustering reader
merger) with a `dummy_reader` which will be returned by the queue as
the very first reader. This reader only emits a `partition_start` and,
if not in forwarding mode, a `partition_end` fragment.

Fixes #8447.

Closes #8448

(cherry picked from commit 5c7ed7a83f)
2021-04-20 12:49:10 +03:00
Kamil Braun
c402abe8e9 clustering_order_reader_merger: handle empty readers
The merger could return end-of-stream if some (but not all) of the
underlying readers were empty (i.e. not even returning a
`partition_start`). This could happen in places where it was used
(`time_series_sstable_set::create_single_key_sstable_reader`) if we
opened an sstable which did not have the queried partition but passed
all the filters (specifically, the bloom filter returned a false
positive for this sstable).

The commit also extends the random tests for the merger to include empty
readers and adds an explicit test case that catches this bug (in a
limited scope: when we merge a single empty reader).

It also modifies `test_twcs_single_key_reader_filtering` (regression
test for #8432) because the time where the clustering key filter is
invoked changes (some invocations move from the constructor of the
merger to operator()). I checked manually that it still catches the bug
when I reintroduce it.

Fixes #8445.

Closes #8446

(cherry picked from commit 7ffb0d826b)
2021-04-20 12:49:10 +03:00
Yaron Kaikov
4a78d6403e release: prepare for 4.5.rc1 2021-04-14 17:02:17 +03:00
Kamil Braun
2f20d52ac7 sstables: fix TWCS single key reader sstable filter
The filter passed to `min_position_reader_queue`, which was used by
`clustering_order_reader_merger`, would incorrectly include sstables as
soon as they passed through the PK (bloom) filter, and would include
sstables which didn't pass the PK filter (if they passed the CK
filter). Fortunately this wouldn't cause incorrect data to be returned,
but it would cause sstables to be opened unnecessarily (these sstables
would immediately return eof), resulting in a performance drop. This commit
fixes the filter and adds a regression test which uses statistics to
check how many times the CK filter was invoked.

Fixes #8432.

Closes #8433

(cherry picked from commit 3687757115)
2021-04-11 12:59:31 +03:00
Raphael S. Carvalho
540439ee46 sstable_set: Implement compound_sstable_set's create_single_key_sstable_reader()
compound set isn't overriding create_single_key_sstable_reader(), so
default implementation is always called. Although default impl will
provide correct behavior, specialized ones which provides better perf,
which currently is only available for TWCS, were being ignored.

compound set impl of single key reader will basically combine single key
readers of all sets managed by it.

Fixes #8415.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210406205009.75020-1-raphaelsc@scylladb.com>
(cherry picked from commit 8e0a1ca866)
2021-04-11 12:59:31 +03:00
Gleb Natapov
a0622e85ab storage_proxy: do not crash on LOCAL_QUORUM access to a DC with zero replication
If a table that is not replicated to a certain DC (rf=0) is accessed
with LOCAL_QUORUM on that DC the current code will crash since the
'targets' array will be empty and read executor does not handle it.
Fix it by replying with empty result.

Fixes #8354

Message-Id: <YGro+l2En3fF80CO@scylladb.com>
(cherry picked from commit cd24dfc7e5)
2021-04-06 18:12:36 +03:00
Nadav Har'El
90741dc62c Update submodule tools/java
Backport for refs #8390.

* tools/java ccc4201ded...768a59a6f1 (1):
  > sstableloader: fix handling of rewritten partition

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-04-05 18:47:49 +03:00
Piotr Jastrzebski
83cfa6a63c config: ignore enable_sstables_mc_format flag
Don't allow users to disable MC sstables format any more.
We would like to retire some old cluster features that has been around
for years. Namely MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES. To do this
we first have to make sure that all existing clusters have them enabled.
It is impossible to know that unless we stop supporting
enable_sstables_mc_format flag.

Test: unit(dev)

Refs #8352

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8360

(cherry picked from commit 57c7964d6c)
2021-04-01 14:04:58 +03:00
Hagit Segev
1816c6df8c release: prepare for 4.5.rc0 2021-03-31 22:25:05 +03:00
Avi Kivity
f9244734f9 Update seastar submodule
* seastar 48376c76a...72e3baed9 (3):
  > file: Add RFW_NOWAIT detection case for AuFS
  > sharded: provide type info on no sharded instance exception
  > iotune: Estimate accuarcy of measurement

Added missing include "database.hh" to api/lsa.cc since seastar::sharded<>
now needs full type information.
2021-03-31 10:40:04 +03:00
Avi Kivity
de10a74a84 Merge 'types: remove linearization from abstract_type::compare' from Wojciech Mitros
This patch is another series on removing big allocations from scylla.
The buffers in `compare_visitor` were replaced with `managed_bytes_view`, similiar change was also needed in tuple_deserializing_iterator and listlike_partial_deserializing_iterator, and was applied as well.

Tests:unit(dev)

Closes #8357

* github.com:scylladb/scylla:
  types: remove linearization from abstract_type::compare
  types: replace buffers in tuple_deserializing_iterator with fragmented ones
  types: make tuple_type_impl::split work with any FragmentedViews
  types: move read_collection_size/value specialization to header file
2021-03-31 08:50:52 +03:00
Wojciech Mitros
f57fa935a2 types: remove linearization from abstract_type::compare
To avoid high latencies caused by large contigous allocations
needed by linearizing, work on fragmented buffers instead.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-31 06:35:10 +02:00
Wojciech Mitros
daa31be37f types: replace buffers in tuple_deserializing_iterator with fragmented ones
In preparation for removing linearization from abstract_type::compare,
add options to avoid linearization in tuple_deserializing_iterator.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-31 06:35:09 +02:00
Wojciech Mitros
823d4c7529 types: make tuple_type_impl::split work with any FragmentedViews
We may want to store a tuple in a fragmented buffer. To split it
into a vector of optional bytes, tuple_type_impl::split can be used.
To split a contiguous buffer(bytes_view), simply pass
single_fragmented_view(bytes_view).

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-31 06:34:37 +02:00
Piotr Sarna
6a2377a233 Merge 'Fast slow query trace doc' from Ivan
Addressed https://github.com/scylladb/scylla/pull/8314#issuecomment-803671234
(write issue: "Tracing: slow query fast mode documentation request")

adds a fast slow queries tracing mode documentation to the docs/guide/tracing.md

patch to the scylla-doc will be dup-ed after this one merged

cc @nyh
cc @vladzcloudius

Closes #8373

* github.com:scylladb/scylla:
  tracing: api: fast mode doc improvement
  tracing: fast slow query tracing mode docs
2021-03-30 17:57:04 +02:00
Ivan Prisyazhnyy
778d9217f3 tracing: api: fast mode doc improvement
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-30 16:22:56 +02:00
Ivan Prisyazhnyy
b3b66fb629 tracing: fast slow query tracing mode docs
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-30 16:22:56 +02:00
Avi Kivity
d2921b5112 Merge 'Clean up > 2-year-old features' from Piotr Sarna
Following the work started in 253a7640e, a new batch of old features is assumed to be always available. They are all still announced via gossip, but the code assumes that the feature is always true, because we only support upgrades from a previous release, and the release window is considerably smaller than 2 years.

Features picked this time via `git blame`, along with the date of their introduction:

* fe4afb1aa3 (Asias He                  2018-09-05 14:52:10 +0800  109) static const sstring ROW_LEVEL_REPAIR = "ROW_LEVEL_REPAIR";
* ff5e541335 (Calle Wilund              2019-02-05 13:06:07 +0000  110) static const sstring TRUNCATION_TABLE = "TRUNCATION_TABLE";
* fefef7b9eb (Tomasz Grabiec            2019-03-05 19:08:07 +0100  111) static const sstring CORRECT_STATIC_COMPACT_IN_MC = "CORRECT_STATIC_COMPACT_IN_MC";

Tests: unit(dev)

Closes #8235

* github.com:scylladb/scylla:
  sstables,test: remove variables depending on old features
  gms: make CORRECT_STATIC_COMPACT_IN_MC ft unconditionally true
  sstables: stop relying on CORRECT_STATIC_COMPACT_IN_MC feature
  gms: make TRUNCATION_TABLE feature unconditionally true
  gms: make ROW_LEVEL_REPAIR feature unconditionally true
  repair: stop relying on ROW_LEVEL_REPAIR feature
2021-03-30 16:13:35 +03:00
Calle Wilund
c0666ea89b commitlog: Fix inner loop condition in allocation pre-fill
Fixes #8369

This was originally found (and fixed) by @gleb-cloudius, but the patch set with
the fix was reverted at some point, and the fix went away. Now the error remains
even in new, nice coroutine code.

We check the wrong var in the inner loop of the pre-fill path of
allocate_segment_ex, often causing us to generate giant writev:s of more or less
the whole file.  Not intended.

Closes #8370
2021-03-30 12:14:55 +02:00
Avi Kivity
c2866f46b5 test: relax quota for tests on machines with small page size
8a8589038c ("test: increase quota for tests to 6GB") increased
the quota for tests from 2GB to 6GB. I later found that the increased
requirement is related to the page size: Address Sanitizer allocates
at least a page per object, and so if the page size is larger the
memory requirement is also larger.

Make use of this by only increasing the quota if the page size
is greater than 4096 (I've only seen 4096 and 65536 in the wild).
This allows greater parallelism when the page size is small.

Closes #8371
2021-03-30 12:13:42 +02:00
Avi Kivity
8785dd62cb tests: use kernel page cache
Tests are short-lived and use a small amount of data. They
are also often run repeatly, and the data is deleted immediately
after the test. This is a good scenario for using the kernel page
cache, as it can cache read-only data from test to test, and avoid
spilling write data to disk if it is deleted quickly.

Acknowledge this by using the new --kernel-page-cache option for
tests.

This is expected to help on large machines, where the disk can be
overloaded. Smaller machines with NVMe disks probably will not see
a difference.

Closes #8347
2021-03-30 12:04:55 +02:00
Piotr Sarna
6de2691bbd sstables,test: remove variables depending on old features
In order to maintain backward compatibility wrt. cluster features,
two boolean variables were kept in sstable writers:
 - correctly_serialize_non_compound_range_tombstones
 - correctly_serialize_static_compact_in_mc

Since these features are assumed to always be present now,
the above variables are no longer needed and can be purged.
2021-03-30 09:37:41 +02:00
Piotr Sarna
e42dee6afb gms: make CORRECT_STATIC_COMPACT_IN_MC ft unconditionally true
The feature is assumed to be true due to being over 2 years old.
It's still advertised in gossip, but it's assumed to always be present.
2021-03-30 09:37:13 +02:00
Piotr Sarna
28c9af6fa5 sstables: stop relying on CORRECT_STATIC_COMPACT_IN_MC feature
The feature bit is going away because it's over 2 years old,
so the code which depended on it becomes unconditional.
2021-03-30 09:37:04 +02:00
Piotr Sarna
08c4350968 gms: make TRUNCATION_TABLE feature unconditionally true
Turns out the feature was not used presently.
Historically, the commit which removed the support is
30a700c5b0 .
2021-03-30 09:36:45 +02:00
Piotr Sarna
c070178c7e gms: make ROW_LEVEL_REPAIR feature unconditionally true
The feature is assumed to be true due to being over 2 years old.
It's still advertised in gossip, but it's assumed to always be present.
2021-03-30 09:36:11 +02:00
Piotr Sarna
80ebedd242 repair: stop relying on ROW_LEVEL_REPAIR feature
The feature is going away because it's over 2 years old,
so the code which depended on it becomes unconditional.
2021-03-30 09:35:40 +02:00
Avi Kivity
c1badc6317 noexcept_traits: convert enable_if to concepts
A little easier to read.

Closes #8329
2021-03-30 09:30:23 +02:00
Avi Kivity
405c4e7af1 serializer: replace enable_if in deserialized_bytes_proxy with constraint
Simpler to read and understand.

Closes #8303
2021-03-30 09:30:06 +02:00
Avi Kivity
7c953f33d5 utils: disk-error-handler: replace enable_if with concepts
Simpler, cleaner. We also replace the deprecated std::result_of_t
with std::invoke_result_t.

Closes #8305
2021-03-30 09:29:46 +02:00
Nadav Har'El
115324f71a Merge 'Add partial admission control to Thrift frontend' from Piotr Sarna
This pull request adds partial admission control to Thrift frontend. The solution is partial mostly because the Thrift layer, aside from allowing Thrift messages, may also be used as a base protocol for CQL messages. Coupling admission control to this one is a little bit more complicated due to how the layer currently works - a Thrift handler, created once per connection, keeps a local `query_state` instance for the occasion of handling CQL requests. However, `query_state` should be kept per query, not per connection, so adding admission control to this aspect of the frontend is left for later.

Finally, the way service permits are passed from the server, via the handler factory, handler and then to queries is hacky. I haven't figured out how to force Thrift to pass custom context per query, so the way it works now is by relying on the fact that the server does not yield (in Seastar sense) between having read the request and launching the proper handler. Due to that, it's possible to just store the service permit in the server itself, pass the reference (address) to it down to the handler, and then read it back from the handling code and claim ownership of it. It works, but if anyone has a better idea, please share.

Refs #4826

Closes #8313

* github.com:scylladb/scylla:
  thrift: add support for max_concurrent_requests_per_shard
  thrift: add metrics for admission control
  thrift: add a counter for in-flight requests
  thrift: add a counter for blocked requests
  thrift: partially add admission control
  service_permit: add a getter for the number of units held
  thrift: coroutinize processing a request
  memory_limiter: add a missing seastarx include
2021-03-29 21:36:50 +03:00
Raphael S. Carvalho
a390f4eb61 sstables: optimize LCS reshape for repair-based operations
LCS reshape is currently inefficient for repair-based operation, because
the disjoint run of 256 sstables is reshaped into bigger L0 files, which
will be then integrated into the main sstable set.
On reshape completion, LCS has to compact those big L0 files onto higher
levels, until last level is reached, producing bad write amplification.

A much better approach is to instead compact that disjoint run into the
best possible level L, which can be figured out with:
	log (base fan_out) of (total_size / max_sstable_size)
This compaction will be essentially a copy operation. It's important
to do it rather than only mutating the level of sstables because we have
to reshape the input run according to LCS parameters like sstable size.

For repair-based bootstrap/replace, the input disjoint run is now efficiently
reshaped into an ideal level L, so there's no compaction backlog once
reshape completes.

This behavior will manifest in the log as this:
LeveledManifest - Reshaping 256 disjoint sstables in level 0 into level 2

For repair-based decommission/removenode though, which reshape wasn't
wired on yet, level L may temporarily hold 2 disjoint runs, which overlap
one another, but LCS itself will incrementally merge them through either
promotion of L-1 into L, or by detecting overlapping in level L and
merging the overlapping sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210329171826.42873-1-raphaelsc@scylladb.com>
2021-03-29 20:22:04 +03:00
Botond Dénes
3c54c990ab test: view_build_test: test_view_update_generator_buffering: fail gracefully
Failures in this test typically happen inside the test consumer object.
These however don't stop the test as the code invoking the consumer
object handles exceptions coming from it. So the test will run to
completion and will fail again when comparing the produced output with
the expected one. This results in distracting failures. The real problem
is not the difference in the output, but the first check that failed,
which is however buried in the noise. To prevent this add an "ok" flag
which is set to false if the consumer fails. In this case the additional
checks are skipped in the end to not generate useless noise.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210326083147.26113-2-bdenes@scylladb.com>
2021-03-29 17:58:28 +03:00
Avi Kivity
a8463cfb37 Merge "reader_permit: signal leaked resources" from Botond
"
When a permit is destroyed we check if it still holds on to any
resources in the destructor. Any resources the permit still holds on are
leaked resources, as users should have released these. Currently we just
invoke `on_internal_error_noexcept()` to handle this, which -- depending
on the configuration -- will result in an error message or an assert. In
the former case, the resources will be leaked for good. This mini-series
fixes this, by signaling back these resources to the semaphore. This
helps avoid an eventual complete dry-up of all semaphore resources and a
subsequent complete shutdown of reads.

Tests: unit(release, debug)
"

* 'reader-permit-signal-leaked-resources/v1' of https://github.com/denesb/scylla:
  reader_permit: signal leaked resources
  test: test_reader_lifecycle_policy: keep semaphores alive until all ops cease
  sstables: generate_summary(): extend the lifecycle of the reader concurrency semaphore
2021-03-29 17:57:31 +03:00
Botond Dénes
9e01c4c667 test: view_build_test: test_view_update_generator_buffering: use separate permit for readers
Said test has two separate logical readers, but they share the same
permit, which is illegal. This didn't cause any problems yet, but soon
the semaphore will start to keep score of active/inactive permits which
will be confused by such sharing, so have them use separate permits.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210326083147.26113-1-bdenes@scylladb.com>
2021-03-29 17:35:51 +03:00
Takuya ASADA
6f678ab7ff aws: initialize self._disks['ebs'] when no EBS disks
Seems like aws_instance.ebs_disks() causes traceback when no EBS disks
available, need to initialize with empty list.

Fixes #8365

Closes #8366
2021-03-29 17:21:14 +03:00
Gleb Natapov
13a3cf62bb raft: move incoming message processing into per state functions
Clean up step() function by moving state specific processing into per
state functions. This way it is easier to see how each state handles
individual messages. No functional changes here.

Message-Id: <YGHCiTWjq+L/jVCB@scylladb.com>
2021-03-29 15:48:43 +02:00
Tomasz Grabiec
43fd322856 Merge 'scylla-gdb.py: Add io-queues command' from Piotr Sarna
The command can be used to inspect IO queues of a local reactor.
Example output:
```
    (gdb) scylla io-queues
        Dev 0:
            Class:                  |shares:         |ptr:
            --------------------------------------------------------------------------------
            "default"               |1               |(seastar::priority_class_data *)0x6000002c6500
            "commitlog"             |1000            |(seastar::priority_class_data *)0x6000003ad940
            "memtable_flush"        |1000            |(seastar::priority_class_data *)0x6000005cb300
            "streaming"             |200             |(seastar::priority_class_data *)0x0
            "query"                 |1000            |(seastar::priority_class_data *)0x600000718580
            "compaction"            |1000            |(seastar::priority_class_data *)0x6000030ef0c0

            Max request size:    2147483647
            Max capacity:        Ticket(weight: 4194303, size: 4194303)
            Capacity tail:       Ticket(weight: 73168384, size: 100561888)
            Capacity head:       Ticket(weight: 77360511, size: 104242143)

            Resources executing: Ticket(weight: 2176, size: 514048)
            Resources queued:    Ticket(weight: 384, size: 98304)
            Handles: (1)
                Class 0x6000005d7278:
                    Ticket(weight: 128, size: 32768)
                    Ticket(weight: 128, size: 32768)
                    Ticket(weight: 128, size: 32768)
            Pending in sink: (0)
```

Created when debugging a core dump. Turned out not to be immediately useful for this use case, but I'm publishing it since it may come in handy in future investigations.

Closes #8362

* github.com:scylladb/scylla:
  scylla-gdb: add io-queues command
  scylla-gdb.py: add parsing std::priority_queue
  scylla-gdb.py: add parsing std::atomic
  scylla-gdb.py: add parsing std::shared_ptr
  scylla-db.py: add parsing intrusive_slist
2021-03-29 15:31:48 +02:00
Piotr Sarna
adf07eb8fb scylla-gdb: add io-queues command
The command can be used to inspect reactor's IO queues. Example output:
(gdb) scylla io-queues
    Dev 0:
	Class:                  |shares:         |ptr:
	--------------------------------------------------------------------------------
        "default"               |1               |(seastar::priority_class_data *)0x6000002c6500
        "commitlog"             |1000            |(seastar::priority_class_data *)0x6000003ad940
        "memtable_flush"        |1000            |(seastar::priority_class_data *)0x6000005cb300
        "streaming"             |200             |(seastar::priority_class_data *)0x0
        "query"                 |1000            |(seastar::priority_class_data *)0x600000718580
        "compaction"            |1000            |(seastar::priority_class_data *)0x6000030ef0c0

        Max request size:    2147483647
        Max capacity:        Ticket(weight: 4194303, size: 4194303)
        Capacity tail:       Ticket(weight: 73168384, size: 100561888)
        Capacity head:       Ticket(weight: 77360511, size: 104242143)

        Resources executing: Ticket(weight: 2176, size: 514048)
        Resources queued:    Ticket(weight: 384, size: 98304)
        Handles: (1)
            Class 0x6000005d7278:
                Ticket(weight: 128, size: 32768)
                Ticket(weight: 128, size: 32768)
                Ticket(weight: 128, size: 32768)
        Pending in sink: (0)
2021-03-29 15:01:25 +02:00
Piotr Sarna
f162423b8a scylla-gdb.py: add parsing std::priority_queue
The parsing assumes that the underlying storage is a vector,
which is often enough the case.
2021-03-29 13:10:36 +02:00
Piotr Sarna
e36c1f1d25 scylla-gdb.py: add parsing std::atomic 2021-03-29 13:10:36 +02:00
Piotr Sarna
0d4d04d3e6 scylla-gdb.py: add parsing std::shared_ptr 2021-03-29 13:10:36 +02:00
Piotr Sarna
c61822bc86 scylla-db.py: add parsing intrusive_slist 2021-03-29 13:10:36 +02:00
Piotr Sarna
bc1c92fd05 Merge 'Improve flat_mutation_reader::consume_pausable' from Piotr Jastrzębski
`flat_mutation_reader::consume_pausable` is widely used in Scylla. Some places
worth mentioning are memtables and combined readers but there are others as
well.

This patchset improves `consume_pausable` in three ways:
1. it removes unnecessary allocation
2. it rearranges ifs to not check the same thing twice
3. for a consumer that returns plain stop_iteration not a future<stop_iteration>
   it reduces the amount of future usage

Test: unit(dev, release, debug)

Combined reader microbenchmark has shown from 2% to 22% improvement in median
execution time while memtable microbenchmark has shown from 3.6% to 7.8%
improvement in median execution time.

Before the change:
```
./build/release/test/perf/perf_mutation_readers --random-seed 3549335083
single run iterations:    0
single run duration:      1.000s
number of runs:           5
number of cores:          16
random seed:              3549335083

test                                      iterations      median         mad         min         max
combined.one_row                             1316234   140.120ns     0.020ns   140.074ns   140.141ns
combined.single_active                          7332    91.484us    31.890ns    91.453us    91.778us
combined.many_overlapping                        945   870.973us   429.720ns   868.625us   871.403us
combined.disjoint_interleaved                   7102    85.989us     7.847ns    85.973us    85.997us
combined.disjoint_ranges                        7129    85.570us     7.840ns    85.562us    85.596us
combined.overlapping_partitions_disjoint_rows        5458   124.787us    56.738ns   124.731us   125.370us
clustering_combined.ranges_generic           1920688   217.940ns     0.184ns   217.742ns   218.275ns
clustering_combined.ranges_specialized       1935318   194.610ns     0.199ns   194.210ns   195.228ns
memtable.one_partition_one_row                624001     1.600us     1.405ns     1.599us     1.605us
memtable.one_partition_many_rows               79551    12.555us     1.829ns    12.549us    12.558us
memtable.many_partitions_one_row               40557    24.748us    77.083ns    24.644us    25.135us
memtable.many_partitions_many_rows              3220   310.429us    57.628ns   310.295us   311.189us
```

After the change:
```
./build/release/test/perf/perf_mutation_readers --random-seed 3549335083
single run iterations:    0
single run duration:      1.000s
number of runs:           5
number of cores:          16
random seed:              3549335083

test                                      iterations      median         mad         min         max
combined.one_row                             1358839   109.222ns     0.122ns   109.089ns   109.348ns
combined.single_active                          7525    87.305us    25.540ns    87.273us    87.362us
combined.many_overlapping                        962   853.195us     1.904us   851.244us   855.142us
combined.disjoint_interleaved                   7310    81.988us    28.877ns    81.949us    82.032us
combined.disjoint_ranges                        7315    81.699us    37.144ns    81.662us    81.874us
combined.overlapping_partitions_disjoint_rows        5591   120.964us    15.294ns   120.949us   121.120us
clustering_combined.ranges_generic           1954722   211.993ns     0.052ns   211.883ns   212.084ns
clustering_combined.ranges_specialized       2042194   187.807ns     0.066ns   187.732ns   188.289ns
memtable.one_partition_one_row                648701     1.542us     0.339ns     1.542us     1.543us
memtable.one_partition_many_rows               85007    11.759us     1.168ns    11.752us    11.782us
memtable.many_partitions_one_row               43893    22.805us    17.147ns    22.782us    22.843us
memtable.many_partitions_many_rows              3441   290.220us    41.720ns   290.172us   290.306us
```

Closes #8359

* github.com:scylladb/scylla:
  flat_mutation_reader: optimize consume_pausable for some consumers
  flat_mutation_reader: special case consumers in consume_pausable
  flat_mutation_reader: Change order of checks in consume_pausable
  flat_mutation_reader: fix indentation in consume_pausable
  flat_mutation_reader: Remove allocation in consume_pausable
  perf: Add benchmarks for large partitions
2021-03-29 13:06:56 +02:00
Piotr Sarna
4c79f132b6 thrift: add support for max_concurrent_requests_per_shard
The Thrift frontend is now capable of limiting the max number
of concurrent in-flight requests. Surplus requests are shed.

Tests: manual
2021-03-29 13:05:16 +02:00
Piotr Sarna
9f53327c9d thrift: add metrics for admission control
The new metrics include information about how many requests
were blocked on memory, how much is still available, etc.
2021-03-29 13:05:16 +02:00
Piotr Sarna
6b021779d2 thrift: add a counter for in-flight requests 2021-03-29 13:05:16 +02:00
Piotr Sarna
9391515461 thrift: add a counter for blocked requests
The counter tracks how many requests were blocked by the
memory estimation based admission control semaphore.
2021-03-29 13:05:16 +02:00
Piotr Sarna
ef1de114f0 thrift: partially add admission control
This commit adds admission control in the form of passing
service permits to the Thrift server.
The support is partial, because Thrift also supports running CQL
queries, and for that purpose a query_state object is kept
in the Thrift handler. However, the handler is generally created
once per connection, not once per query, and the query_state object
is supposed to keep the state of a single query only.
In order to keep this series simpler, the CQL-on-top-of-Thrift
layer is not touched and is left as TODO.
Moreover, the Thrift layer does not make it easy to pass custom
per-query context (like service_permit), so the implementation
uses a trick: the service permit is created on the server
and then passed as reference to its connections and their respective
Thrift handlers. Then, each time a query is read from the socket,
this service permit is overwritten and then read back from the Thrift
handler. This mechanism heavily relies on the fact that there are
zero preemption points between overwriting the service permit
and reading it back by the handler. Otherwise, races may occur.
This assumption was verified by code inspection + empirical tests,
but if somebody is aware that it may not always hold, please speak up.
2021-03-29 13:05:16 +02:00
Nadav Har'El
ccc75bfe2a Merge 'Disable thrift by default' from Piotr Sarna
The Thrift layer is functional, but it's not usually the first-choice protocol for Scylla users, so it's hereby disabled by default.

Fixes #8336

Closes #8338

* github.com:scylladb/scylla:
  docs: mention disabling Thrift by default
  db,config: disable Thrift by default
2021-03-29 12:48:20 +03:00
Piotr Sarna
3388694e69 service_permit: add a getter for the number of units held
The helper function makes debugging considerably easier.
2021-03-29 11:34:18 +02:00
Piotr Sarna
364b921e25 thrift: coroutinize processing a request
While not particularly useful now, it will facilitate
later changes which introduce service permits.
2021-03-29 11:34:18 +02:00
Piotr Sarna
09621e5fc5 memory_limiter: add a missing seastarx include
It's that or declaring everything that belongs to seastar namespace
explicitly, and including "seastarx.hh" is more standard.
2021-03-29 11:34:18 +02:00
Michał Chojnowski
8c45225f21 docs: remove the obsolete IMR design note
IMR, as described in this design note, was removed in 001652815c.
This doc should have been removed back then, but was overlooked.

Closes #8340
2021-03-29 10:58:05 +02:00
Pekka Enberg
aec33c599b Update tools/python3 submodule
* tools/python3 6f3bcbe...ad04e8e (2):
  > dist/debian: fix renaming debian/scylla-* files rule
  > fix license of package build script to AGPL
2021-03-29 11:50:24 +03:00
Pekka Enberg
203b7394d7 Update tools/java submodule
* tools/java 7b66b7a0fc...ccc4201ded (1):
  > dist/debian: fix renaming debian/scylla-* files rule
2021-03-29 11:50:19 +03:00
Tomasz Grabiec
c0ce122f77 Merge "raft: wire up rpc add_server/remove_server for configuration changes" from Pavel Solodovnikov
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are not the part of the most recent
configuration anymore.

The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.

Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.

There is also another problem to be solved: in Raft an instance may
need to communicate with a peer outside its current configuration.
This may happen, e.g., when a follower falls out of sync with the
majority and then a configuration is changed and a leader not present
in the old configuration is elected.

The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.

When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.

An outgoing communication to an unconfigured peer is impossible.

* manmanson/raft_mappings_wiring_v12:
  raft: update README.md with info on RPC server address mappings
  raft: wire up `rpc::add_server` and `rpc::remove_server` for configuration changes
  raft/fsm: add optional `rpc_configuration` field to fsm_output
  raft: maintain current rpc context in `server_impl`
  raft: use `.contains` instead of `.count` for std::set in `raft::configuration::diff`
  raft: unit-tests for `raft_address_map`
  raft: support expiring server address mappings for rpc module
2021-03-29 10:28:45 +02:00
Piotr Jastrzebski
86cf566692 flat_mutation_reader: optimize consume_pausable for some consumers
consumers that return stop_iteration not future<stop_iteration> don't
have to consume a single fragment per each iteration of repeat. They can
consume whole buffer in each iteration.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
26cc4f112d flat_mutation_reader: special case consumers in consume_pausable
consume_pausable works with consumers that return either stop_iteration
or future<stop_iteration>. So far it was calling futurize_invoke for
both. This patch special cases consumers that return
future<stop_iteration> and don't call futurize_invoke for them as this
is unnecessary work.

More importantly, this will allow the following patch to optimize
consumers that return plain stop_iteration.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
164e23d2b1 flat_mutation_reader: Change order of checks in consume_pausable
This way we can avoid checking is_buffer_empty twice.
Compiler might be able to optimize this out but why depend on it
when the alternative is not less readable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
776ba29cec flat_mutation_reader: fix indentation in consume_pausable
Code was left with wrong indentation by the previous commit that
removed do_with call around the code that's currently present.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
9fb0014d72 flat_mutation_reader: Remove allocation in consume_pausable
The allocation was introduced in 515bed90bb but I couldn't figure out
why it's needed. It seems that the consumer can just be captured inside
lambda. Tests seem to support the idea.

Indentation will be fixed in the following commit to make the review
easier.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
3aa7bee5e3 perf: Add benchmarks for large partitions
in perf_mutation_readers.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:48:11 +02:00
Avi Kivity
ec4d91f9eb tools: toolchain: dbuild: improve cgroupv2 detection code
dbuild detects if the kernel is using cgroupv2 by checking if the
cgroup2 filesystem is mounted on /sys/fs/cgroup. However, on Ubuntu
20.10, the cgroup filesystem is mounted on /sys/fs/cgroup and the
cgroup2 filesystem is mounted on /sys/fs/cgroup/unified. This second
mount matches the search expression and gives a false positive.

Fix by adding a space at the end; this will fail to match
/sys/fs/cgroup/unified.

Closes #8355
2021-03-29 09:31:29 +03:00
Pavel Solodovnikov
2d9e94f050 raft: update README.md with info on RPC server address mappings
Describe the high-level scheme of managing RPC mappings and
also expand on the introduction of "expirable" RPC mappings concept
and why these are needed.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:13 +03:00
Pavel Solodovnikov
f61206e483 raft: wire up rpc::add_server and rpc::remove_server for configuration changes
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are not the part of the most recent
configuration anymore.

The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.

Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:09 +03:00
Pavel Solodovnikov
16d9e8e9af raft/fsm: add optional rpc_configuration field to fsm_output
The field is set in `fsm.get_output` whenever
`_log.last_conf_idx()` or the term changes.

Also, add `_last_conf_idx` and `_last_term` to
`fsm::last_observed_state`, they are utilized
in the condition to evaluate current rpc
configuration in `fsm.get_output()`.

This will be used later to update rpc config state
stored in `server_impl` and maintain rpc address map.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:05 +03:00
Tomasz Grabiec
6035fd05b3 Merge "Unify drain() and drain_on_shutdown()" from Pavel Emelyanov
The start-stop code is drifting towards a straightforward
scheme of a bunch of

  service foo;
  foo.start();
  auto stop_foo = defer([&foo] { foo.stop(); });

blocks. The drain_on_shutdown() and its relation to drain()
and decommission() is a big hurdle on the way of this effort.

This set unifies drain() and drain_on_shutdown() so that drain
really becomes just some first steps of the regular shutdown,
i.e. -- what it should be. Some synchronisation bits around it
are still needed, though.

This unification also closes a bunch not-yet-caught bugs when
parts of the system remained running in case shutdown happens
after nodetool drain. In this case the whole drain_on_sutdown()
becomes a noop (just returns drain()'s future) and what's
missing in drain() becomes missing on shutdown.

tests: unit(dev), dtest(simple_boot_shutdown : dev),
       manual(start+stop, start+drain+stop : dev)
refs: #2737

* xemul/br-drain-on-shutdown:
  drain_on_shutdown: Simplify
  drain: Fix indentation
  storage_service: Unify drain and drain_on_shutdown
  storage_proxy: Drain and unsubscribe in main.cc
  migration_manager: Stop it in two phases
  stream_manager: Stop instances on drain
  batchlog_manager: Stop its instances on shutdown
  tracing: Shutdown tracing in drain
  tracing: Stop it in main.cc
  system_distributed_keyspace: Stop it in main.cc
  storage_service: Move (un)subscription to migration events
2021-03-26 18:37:27 +01:00
Pavel Solodovnikov
19cc85b3b6 raft: maintain current rpc context in server_impl
Introduce rpc server_address that represents the
last observed state of address mappings
for RPC module.

It does not correspond to any kind of configuration
in the raft sense, just an artificial construct
corresponding to the largest set of server
addresses coming from both previous and current
raft configurations (to be able to contact both
joining and leaving servers).

This will be used later to update rpc module mappings
when cluster configuration changes.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
8799ccbab0 raft: use .contains instead of .count for std::set in raft::configuration::diff
`std::unordered_set::contains` is introduced in C++20 and provides
clearer semantics to check existence of a given element in a set.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
7c229998e8 raft: unit-tests for raft_address_map
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
3c4d46728d raft: support expiring server address mappings for rpc module
This patch introduces `raft_address_map` class to abstract
the notion of expirable address mappings for a raft rpc module.

In Raft an instance may need to communicate with a peer outside
its current configuration. This may happen, e.g., when a follower
falls out of sync with the majority and then a configuration is
changed and a leader not present in the old configuration is elected.

The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.

When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.

An outgoing communication to an unconfigured peer is impossible.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Emelyanov
d1796ab3dc drain_on_shutdown: Simplify
The modern version of this method doesn't need the
run_with_no_api_lock(), as it's launched on shard 0
anyway, neither it needs logging before and after
as it's done by the deferred action from main that
calls it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00
Pavel Emelyanov
58b47efe16 drain: Fix indentation
Previous patch left it broken for readability.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00
Pavel Emelyanov
8d7ad6de03 storage_service: Unify drain and drain_on_shutdown
Now they only differ in one bit -- compaction manager is
drained on drain and is left running (until regular stop)
on shutdown. So this unification adds a boolean flag for
this case.

Also the indentation is deliberately left broken for the
sake of patch readability.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00
Pavel Emelyanov
b60099c2f8 storage_proxy: Drain and unsubscribe in main.cc
Currently shutdown after drain leaves storage proxy
subscribed on storage_service events and without the
storage_proxy::drain_on_shutdown being called. So it
seems safe if the whole thing is relocated closer to
its starting peers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00
Pavel Emelyanov
9a8f125890 migration_manager: Stop it in two phases
Before the patch the migration manager was stopped in
two ways and one was buggy.

Plain shutdown -- it's just sharded::stop-ed by defer
in main(), but this happens long after the shutdown
of commitlog, which is not correct.
Shutdown after drain -- it's stopped twice, first time
right before the commitlog shutdown, second -- the
same defer in main(). And since the sharded::stop is
reentrable, the 2nd stop works noop.

This patch splits the stop into two phases: first it
stops the instances and does this in _both_ -- plain
shutdown and shutdown after drain. This phase is done
before commitlog shutdown in both cases. Second, the
existring deferred sharded::stop in main.cc.

This changes needs the migration_manager::stop() to
become re-entrable, but that's easily checked with the
help of abort_source the migration_manager has.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
de8a7fe798 stream_manager: Stop instances on drain
It's not seen directly from ths patch itself, but the only
difference between first several calls that drain() makes
and the stop_transport() is the do_stop_stream_manager()
in the latter.

Again, it's partially a bugfix (shutdown after drain leaves
streaming running), partially a must-have thing (streaming
is not expected in the background after drain), partially a
unification of two drains out there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
9a7e2a218b batchlog_manager: Stop its instances on shutdown
It's now stopped (not sharded::stop(), but batchlog_manager::stop)
on plain drain, but plain shutdown leaves it running, so fill this
gap.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
bcc3935ce7 tracing: Shutdown tracing in drain
First of all, shutdown that happens after nodetoo drain leaves
tracing up-n-running, so it's effectively a bugfix. But also
a step towards unified drain and drain_on_shutdown.

Keeping this bit in drain seems to be required because drain
stops transport, flushes column families and shuts commitlog
down. Any tracing activity happening after it looks uncalled
for.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
f1d7804102 tracing: Stop it in main.cc
The tracing::stop() just checks that it was shutdown()-ed and
otherwise a noop, so it's OK to stop tracing later. This brings
drain() and drain_on_shutdown() closer to each other and makes
main.cc look more like it should.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
d7cccec97f system_distributed_keyspace: Stop it in main.cc
It's now stopped in drain_on_shutdown, but since its stop()
method is a noop, it doesn't matter where it is. Keeping it
in main.cc next to related start brings drain_on_shutdown()
closer to drain() and the whole thing closer to the Ideal
start-stop sequence.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
5456174d69 storage_service: Move (un)subscription to migration events
After the patch the subscription effectively happens at
the same time as before, but is now located in main.cc,
so no real change here.

The unsubscription was in the drain_on_shutdown before
the patch, but after it it happens to be a defer next
to its peer, i.e. later, but it shouldn't be disastrous
for two reasons. First -- client services and migration
manager are already stopped. Second -- before the patch
this subscription was _not_ cancelled if shutdown ran
after nodetool drain and it didn't cause troubles.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Botond Dénes
d64b1fdd6a reader_permit: signal leaked resources
When destroying a permit with leaked resources we call
`on_internal_error_noexcept()` in the destructor. This method logs an
error or asserts depending on the configuration. When not asserting, we
need to return the leaked units to the semaphore, otherwise they will be
leaked for good. We can do this because we know exactly how many
resources the user of the permit leaked (never signalled).
2021-03-26 14:23:32 +02:00
Botond Dénes
0f1a72ba59 test: test_reader_lifecycle_policy: keep semaphores alive until all ops cease
To ensure the semaphores outlive all permits created as part of the
tests.
2021-03-26 14:22:43 +02:00
Botond Dénes
f843e3de08 sstables: generate_summary(): extend the lifecycle of the reader concurrency semaphore
Used to produce the needed permits for the index reads, such that it
over-lives all the permits in use.
2021-03-26 11:06:02 +02:00
Tomasz Grabiec
f86896d387 Merge "Iterate range tombstones in partition_snapshot_reader" from Pavel Emelyanov
Currently the guy copies and merges all range tombstones from all partition
versions (that match the given range, but still) when being initialized or
decides to refresh iterators. This is a lot of potentially useless work and
memory, as the reader may be dropped before it emits all the mutations from
the given range(s).

It's better to walk the tombstones step-by-step, like it's done for rows.

fixes: #1671
tests: unit(dev)

* xemul/br-partiion-snapshot-reader-on-demand-range-tombstones-2:
  range_tombstone_stream: Remove unused methods
  partition_snapshot_reader: Emit range tombstones on demand
  partition_snapshot_reader: Introduce maybe_refresh_state
  partition_snapshot_reader: Move range tombstone stream member
  partition_snapshot_reader: Add reset_state method to helper class
  partition_snapshot_reader: Downgrade heap comparator
  partition_snapshot_reader: Use on-demand comparators
  range_tombstone_list: Add new slice() helper
  range_tombstone_list: Introduce iterator_range alias
2021-03-26 01:27:18 +01:00
Pavel Emelyanov
c6a0e0439e files: Construct file_impls properly
Constructors of classes inherited from file_impl copy alignment
values by hands, but miss the overwrite one, thus on a new file
it remains default-initialized.

To fix this and not to forget to properly initalize future fields
from file_impl, use the impl's copy constructor.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210325104830.31923-1-xemul@scylladb.com>
2021-03-26 00:22:11 +01:00
Tomasz Grabiec
ef06a939c4 Merge "raft: seven etcd unit tests ported" from Alejo
Seven etcd unit tests as boost tests.

* alejo/raft-tests-etcd-08-v4-communicate-v5:
  raft: etcd unit tests: test proposal handling scenarios
  raft: etcd unit tests: test old messages ignored
  raft: etcd unit tests: test single node precandidate
  raft: etcd unit tests: test dueling precandidates
  raft: etcd unit tests: test dueling candidates
  raft: etcd unit tests: test cannot commit without new term
  raft: etcd unit tests: test single node commit
  raft: etcd unit tests: update test_leader_election_overwrite_newer_logs
  raft: etcd unit tests: fix test_progress_leader
  raft: testing: log comparison helper functions
  raft: testing: helper to make fsm candidate
  raft: testing: expose log for test verification
  raft: testing: use server_address_set
  raft: testing: add prevote configuration
  raft: testing: make become_follower() available for tests
2021-03-25 20:27:07 +01:00
Alejo Sanchez
ace0ee514f raft: etcd unit tests: test proposal handling scenarios
TestProposal
For multiple scenarios, check proposal handling.

Note, instead of expecting an explicit result for each specified case,
the test automatically checks for expected behavior when quorum is
reached or not.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
77163ea76a raft: etcd unit tests: test old messages ignored
TestOldMessages
Checks an append request from a leader from a previous term is ignored.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
bf65b19803 raft: etcd unit tests: test single node precandidate
TestSingleNodePreCandidate
Checks a single node configuration with precandidate on works to
automatically elect the node.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
de7051467b raft: etcd unit tests: test dueling precandidates
TestDuelingPreCandidates
In a configuration of 3 nodes, two nodes don't see each other and they
compete for leadership. Loser (3) should revert to follower when prevote
is rejected and revert to term 1.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
aa7d23f86b raft: etcd unit tests: test dueling candidates
TestDuelingCandidates
In a configuration of 3 nodes, two nodes don't see each other and they
compete for leadership. Once reconnected, loser should not disrupt.

But note it will remain candidate with current algorithm without
prevoting and other fsms will not bump term.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
1eac94e7d6 raft: etcd unit tests: test cannot commit without new term
TestCannotCommitWithoutNewTermEntry tests the entries cannot be
committed when leader changes, no new proposal comes in and ChangeTerm
proposal is filtered.

NOTE: this doesn't check committed but it's implicit for next round;
      this could also use communicate() providing committed output map

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
b421fe3605 raft: etcd unit tests: test single node commit
Port etcd TestSingleNodeCommit

In a single node configuration elect the node, add 2 entries and check
number of committed entries.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
9b4538476b raft: etcd unit tests: update test_leader_election_overwrite_newer_logs
Make test_leader_election_overwrite_newer_logs use newer communicate()
and other new helpers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
368eec1190 raft: etcd unit tests: fix test_progress_leader
Make implementation follow closer to original test.
Use newer boost test helpers.

NOTE: in etcd it seems a leader's self progress is in PIPELINE state.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:28 -04:00
Alejo Sanchez
ba29970e29 raft: testing: log comparison helper functions
Two helper functions to compare logs. For now only index, term, and data
type are used. Data content comparison does not seem to be necessary for now.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:28 -04:00
Alejo Sanchez
aeab4cf4a9 raft: testing: helper to make fsm candidate
Current election_timeout() helper might bump the term twice.
It's convenient and less error prone to have a more fine grained helper
that stops right when candidate state is reached.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:19 -04:00
Alejo Sanchez
7a6616f1cb raft: testing: expose log for test verification
Let derived classes access the log to verify its contents.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:03:46 -04:00
Alejo Sanchez
05b1f57e67 raft: testing: use server_address_set
Use server_address_set in local namespace for brevity.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:01:12 -04:00
Alejo Sanchez
9d0a7d8ccf raft: testing: add prevote configuration
Provide a generic prevote configuration for tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:00:28 -04:00
Dejan Mircevski
b2a04985f7 cql-pytest: Drop needless INSERT in test_null
One INSERT statement was unnecessary for the test, so delete it.
Another was necessary, so explain it.

Tests: cql-pytest/test_null on both Scylla and Cassandra

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8304
2021-03-25 16:37:00 +01:00
Tomasz Grabiec
7b30d31d77 Merge "raft: test configuration changes" from Kostja
Test raft configuration changes:
a node with empty configuration, transitioning
to an entirely different cluster, transitioning
in presence of down nodes, leader change during
configuration change, stray replies, etc.

* scylla-dev/raft-empty-confchange-v5: (21 commits)
  raft: (testing) stray replies from removed followers
  raft: always return a non-zero configuration index from the log
  raft: (testing) leader change during configuration change
  raft: (testing) test confchange {ABCDE} -> {ABCDEFG}
  raft: (testing) test confchange {ABCDEF} -> {ABCGH}
  raft: (testing) test confchange {ABC} -> {CDE}
  raft: (testing) test confchange {AB} -> {CD}
  raft: (testing) test confchange {A} -> {B}
  raft: (testing) test a server with empty configuration
  raft: (testing) introduce testing utilities
  raft: (testing) simplify id allocation in test
  raft: (testing) add select_leader() helper
  raft: (testing) introduce communicate() helper
  raft: (testing) style cleanup in raft_fsm_test
  raft: (testing) fix bug in election_threshold
  raft: minor style changes & comments
  raft: do not assert when transitioning to empty config
  raft: assert we never apply a snapshot over uncommitted entries (leader)
  raft: improve tracing
  raft: add fsm_output::empty() helper to aid testing
  ...
2021-03-25 14:01:09 +01:00
Wojciech Mitros
b152dc8c86 types: move read_collection_size/value specialization to header file
The template method needs to be specialized in each file that is
using it. To avoid rewriting the specialization into multiple files,
move it to the header file.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-25 12:18:38 +01:00
Avi Kivity
46185d7d82 Update tools/jmx submodule
* tools/jmx 9c687b5...440313e (1):
  > storage_service: Add a generic toppartitions endpoint
2021-03-25 12:36:10 +02:00
Alejo Sanchez
7e6807e8fc raft: testing: make become_follower() available for tests
Some etcd tests need to force a follower with a specific leader.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-24 19:11:09 -04:00
Piotr Wojtczak
c1daf2bb24 column_family: Make toppartitions queries more generic
Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing to specify a list of families.

We provide three ways for filtering in the query parameter "name_list":
    1. A specific column family to include in the form "ks:cf"
    2. A keyspace, telling the server to include all column families in it.
       Specified by omitting the cf name, i.e. "ks:"
    3. All column families, which is represented by an empty list
The list can include any amount of one or both of the 1. and 2. option.

Fixes #4520

Closes #7864
2021-03-24 17:54:05 +02:00
Raphael S. Carvalho
bcbb39999b LCS: Fix terrible write amplification when reshaping level 0
LCS reshape is basically 'major compacting' level 0 until it contains less than
N sstables.

That produces terrible write amplification, because any given byte will be
compacted (initial # of sstables / max_threshold (32)) times. So if L0 initially
contained 256 ssts, there would be a WA of about 8.

This terrible write amplification can be reduced by performing STCS instead on
L0, which will leave L0 in a good shape without hurting WA as it happens
now.

Fixes #8345.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322150655.27011-1-raphaelsc@scylladb.com>
2021-03-24 17:48:50 +02:00
Piotr Sarna
24a43681b4 thrift: handle gate closed exception on retry
During the retry mechanism, it's possible to encounter a gate
closed exception, which should simply be ignored, because
it indicates that the server is shutting down.

Closes #8337
2021-03-24 17:41:58 +02:00
Konstantin Osipov
1a1d7ab662 raft: (testing) stray replies from removed followers 2021-03-24 14:05:55 +03:00
Konstantin Osipov
0295163f6f raft: always return a non-zero configuration index from the log
Return snapshot index for last configuration index if there
is no configuration in the log.
2021-03-24 14:05:55 +03:00
Konstantin Osipov
cec59e53ef raft: (testing) leader change during configuration change 2021-03-24 14:05:36 +03:00
Pavel Emelyanov
37bec6fb76 commitlog: Open files with append_is_unlikely
This open option tells seastar that the file in question
will be truncated to the needed size right at once and all
the subsequent writes will happen within this size. This
hint turns off append optimization in seastar that's not
that cheap and helps so save few cpu cycles.

The option was introduced in seastar by 8bec57bc.

tests: unit(dev), dtest(commitlog:
                        test_batch_commitlog,
                        test_periodic_commitlog,
                        test_commitlog_replay_on_startup)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210323115409.31215-1-xemul@scylladb.com>
2021-03-24 13:05:33 +02:00
Konstantin Osipov
a203c8833f raft: (testing) test confchange {ABCDE} -> {ABCDEFG} 2021-03-24 14:04:18 +03:00
Konstantin Osipov
40e117d36e raft: (testing) test confchange {ABCDEF} -> {ABCGH} 2021-03-24 14:04:18 +03:00
Konstantin Osipov
14b2d5d308 raft: (testing) test confchange {ABC} -> {CDE}
Test leader change during configuration change.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
3c718a175e raft: (testing) test confchange {AB} -> {CD} 2021-03-24 14:04:18 +03:00
Konstantin Osipov
2e30c8540e raft: (testing) test confchange {A} -> {B}
Test non-restart and leader restart scenario.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
e23da06fef raft: (testing) test a server with empty configuration
Try becoming a candidate for such server, or adding it
to an existing configuration.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
b18599c630 raft: (testing) introduce testing utilities
Add a discrete_failure_detector, to be able
to mark a single server dead.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
8d26d24370 raft: (testing) simplify id allocation in test 2021-03-24 14:04:18 +03:00
Konstantin Osipov
322a15ec33 raft: (testing) add select_leader() helper
With leader stepdown extension, leadership transfer can happen
to any follower with long enough log. Add a helper to select that
follower from a list.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
4a00da276d raft: (testing) introduce communicate() helper
Allow to communicate between arbitrary number of FSMs. Drop
messages to FSMs which are not in the argument list.
Stop communication upon predicate.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
7182323ac0 raft: (testing) style cleanup in raft_fsm_test
1) Avoid memory violations on test failure
2) Print better diagnostics on failure (BOOST_CHECK_EQUAL vs
   BOOST_CHECK)
2021-03-24 14:04:18 +03:00
Konstantin Osipov
f0f25bf7fb raft: (testing) fix bug in election_threshold
election_threshold was ticking one extra tick,
causing the follower to become candidate in some cases.
This was rendering tests unstable.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
00d7379bc9 raft: minor style changes & comments
Add comments explaining the rationale from transfer_leadership()
(more PhD quotes), encapsulate stable leader check in tick()
into a lambda and add more detailed comments to it.
2021-03-24 14:04:18 +03:00
Piotr Sarna
06131e21a3 configure.py: add customizing clang inline threshold
Until clang figures things out with the now infamous
`-llvm -inline-threshold X` parameter, let's allow customizing
it to make the compilation of release builds less tiresome.
For instance, scylla's row_level.o object file currently does not compile
for me until I decrease the inline threshold to a low value (e.g. 50).

Message-Id: <54113db9438e3c3371410996f49b7fbe9a1b7257.1616422536.git.sarna@scylladb.com>
2021-03-24 12:09:26 +02:00
Tomasz Grabiec
9272e74e8c sstable: writer: ka/la: Write row marker cell after row tombstone
Row marker has a cell name which sorts after the row tombstone's start
bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.

This was harmeless to our sstable reader, which recognized both as
belonging to the current clustering row fragment, and collects both
fine.

However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries wich violate the cell
name ordering. It's very unlikely to run into in practice, since to
trigger promoted index entries for both atoms, the clustering key
would be so large so that the size of the marker cell exceeds the
desired promoted index block size, which is 64KB by default (but
user-controlled via column_index_size_in_kb option). 64KB is also the
limit on clustering key size accepted by the system.

This was caught by one of our unit tests:

  sstable_conforms_to_mutation_source_test

...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.

The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:

    assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));

It compares mutations read from the same sstable using different
methods, slicing using clustering key restricitons, and fast
forwarding. The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.

After reverting the commit which introduced dynamic adjustments, the
test passes, but both mutations are missing the marker, both are
wrong!

They are wrong because the promoted index contians entries whose
starting positions violate the ordering, so binary search gets confused
and selects the row tombstone's position, which is emitted after the
marker, thus skipping over the row marker.

The explanation for why the test started to fail after dynamic
adjustements is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which cursor touches during binary search depend on
the size of the block read from the file. The commit which enabled
dynamic adjustements causes the block size to be different for
subsequent reads, which allows one of the reads to walk over the
corrupted entries and read the correct data by selecting the entry
corresponding to the row marker.

Fixes #8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>
2021-03-23 16:13:47 +01:00
Tomasz Grabiec
235154cca5 Merge "Teach scylla-gdb new trees in row cache" from Pavel Emelyanov
Clustering rows are now stored in intrusive btree, cells are
now stored in radix tree, but scylla-gdb tries to walk the
intrusive_set and vector/set union respectively.

For the former case -- the btree wrapper is introduced.

For the latter -- compiler optimizes-away too many important
bits and walking the tree turns into a bunch of hard-coded
hacks and reiterpret-casts. Untill better solution is found,
just print the address of the tree root.

* xemul/br-gdb-btree-rows:
  gdb: Show address of the row::_cells tree (or "empty" mark)
  gdb: Add support for intrusive B tree
  gdb: Use helper to get rows from mutation_partition
2021-03-23 12:50:17 +01:00
Pavel Emelyanov
1cd9ec952f gdb: Show address of the row::_cells tree (or "empty" mark)
Currently clang optimizes-out lots of critical stuff from
compact radix tree. Untill we find out the way to walk the
tree in gdb, it's better to at least show where it is in
memory.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-23 13:29:40 +03:00
Pavel Emelyanov
5c85fcb3c9 gdb: Add support for intrusive B tree
Rows inside partition are now stored in an intrusive B-tree,
so here's the helper class that wraps this collection.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-23 12:54:44 +03:00
Pavel Emelyanov
ed38b18a84 gdb: Use helper to get rows from mutation_partition
Preparation for the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-23 12:54:14 +03:00
Avi Kivity
3c292e31af utils: utf8: fix validate_partial() on non-SIMD-optimized architectures
validate_partial() is declared in the internal namespace, but defined
outside it. This causes calls to validate_partial() to be ambiguous
on architectures that haven't been SIMD-optimized yet (e.g. s390x).

Fix by defining it in the internal namespace.

Closes #8268
2021-03-23 09:21:14 +02:00
Avi Kivity
957259fab7 tools: toolchain: prepare: adjust manifest manipulations
The manifest manipulation commands stopped working with podman 3;
the containers-storage: prefix now throws errors.

Switch to `buildah manifest`; since we're building with buildah,
we might as well maintain the manifest with buildah as well.

Closes #8231
2021-03-23 09:18:19 +02:00
Avi Kivity
4dae434f69 utils: crc: fix build with big-endian architectures and 1-byte objects
crc has some code to reverse endianness on big-endian machines, but does
not handle the case of a 1-byte object (which doesn't need any adjustement).
This causes clang to complain that the switch statement doesn't handle that
case.

Fix by adding a no-op case.

Closes #8269
2021-03-23 09:16:20 +02:00
Konstantin Osipov
ce29fb44c3 raft: do not assert when transitioning to empty config
Throw instead, to make this case testable.
2021-03-22 18:55:40 +03:00
Konstantin Osipov
2ee15ad6c7 raft: assert we never apply a snapshot over uncommitted entries (leader) 2021-03-22 18:55:40 +03:00
Konstantin Osipov
c7f7ad2c4e raft: improve tracing
Add tracing to apply_snapshot, request_vote.
2021-03-22 18:55:40 +03:00
Konstantin Osipov
4dd66edae5 raft: add fsm_output::empty() helper to aid testing
Used in testing to implement trivial transport.
2021-03-22 18:55:40 +03:00
Konstantin Osipov
89349f550c raft: aid testing by providing fsm::id() 2021-03-22 18:55:40 +03:00
Botond Dénes
742a33730a scylla-gdb.py: dereference_smart_ptr(): add support for seastar::smart_ptr
Although a seastar::smart_ptr is trivial to dereference manually, so is
adding support for it to dereference_smart_ptr(), avoiding the annoying
(but brief) detour which is currently needed.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210322150149.84534-1-bdenes@scylladb.com>
2021-03-22 17:30:35 +02:00
Piotr Sarna
b774d69ad2 docs: mention disabling Thrift by default
Thrift is no longer enabled by default, so the documentation
should mention that, as well as the suggested way of enabling it
if necessary.
2021-03-22 14:32:51 +01:00
Raphael S. Carvalho
c86dd125a1 sstables: clean up partitioned_sstable_set::insert()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322130227.16805-2-raphaelsc@scylladb.com>
2021-03-22 15:30:32 +02:00
Raphael S. Carvalho
48d8cc261e sstables: don't swallow exception in partitioned_sstable_set::insert()
regression introduced by 02b2df1ea9 (Fri Mar 12 01:22:41 2021 -0300).

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322130227.16805-1-raphaelsc@scylladb.com>
2021-03-22 15:30:31 +02:00
Avi Kivity
50dda795e9 Update seastar submodule
* seastar 83339edb04...48376c76a1 (2):
  > iotune: Warn user about write-back cache mode
  > reactor: add --kernel-page-cache option to disable O_DIRECT
2021-03-22 13:33:08 +02:00
Avi Kivity
74df67776b bytes_ostream: convert write_placeholder from enable_if to concepts
Concepts are easier to read and result in better error messages.

This change also tightens the constraint from "std::is_fundamental" to
"std::integral". The differences are floating point values, nullptr_t,
and void. The latter two are illegal/useless to write, and nobody uses
floating point values for list lengths, so everything still compiles.

Closes #8326
2021-03-22 12:00:07 +01:00
Piotr Sarna
e2443337d9 db,config: disable Thrift by default
It will still be possible to use Thrift once it's enabled
in the yaml file, but it's better to not open this port
by default, since Thrift is definitely not the first choice
for Scylla users.

Fixes #8336
2021-03-22 10:54:26 +01:00
Piotr Sarna
23057dd186 Merge 'Implement RAFT's leader stepdown extension' from Gleb
This series implements leader stepdown extension. See patch 4 for
justification for its existence. First three patches either implement
cleanups to existing code that future patch will touch or fix bugs
that need to be fixed in order for stepdown test to work.

* 'raft-leader-stepdown-v3' of github.com:scylladb/scylla-dev:
  raft: add test for leader stepdown
  raft: introduce leader stepdown procedure
  raft: fix replication when leader is not part of current config
  raft: do not update last election time if current leader is not a part of current configuration
  raft: move log limiting semaphore into the leader state
2021-03-22 09:45:19 +01:00
Avi Kivity
3c44445c07 Merge "Introduce off-strategy compaction for repair-based bootstrap and replace" from Raphael
"
Scylla suffers with aggressive compaction after repair-based operation has initiated. That translates into bad latency and slowness for the operation itself.

This aggressiveness comes from the fact that:
1) new sstables are immediately added to the compaction backlog, so reducing bandwidth available for the operation.
2) new sstables are in bad shape when integrated into the main sstable set, not conforming to the strategy invariant.

To solve this problem, new sstables will be incrementally reshaped, off the compaction strategy, until finally integrated into the main set.

The solution takes advantage there's only one sstable per vnode range, meaning sstables generated by repair-based operations are disjoint.

NOTE: off-strategy for repair-based decommission and removenode will follow this series and require little work as the infrastructure is introduced in this series.

Refs #5226.
"

* 'offstrategy_v7' of github.com:raphaelsc/scylla:
  tests: Add unit test for off-strategy sstable compaction
  table: Wire up off-strategy compaction on repair-based bootstrap and replace
  table: extend add_sstable_and_update_cache() for off-strategy
  sstables/compaction_manager: Add function to submit off-strategy work
  table: Introduce off-strategy compaction on maintenance sstable set
  table: change build_new_sstable_list() to accept other sstable sets
  table: change non_staging_sstables() to filter out off-strategy sstables
  table: Introduce maintenance sstable set
  table: Wire compound sstable set
  table: prepare make_reader_excluding_sstables() to work with compound sstable set
  table: prepare discard_sstables() to work with compound sstable set
  table: extract add_sstable() common code into a function
  sstable_set: Introduce compound sstable set
  reshape: STCS: preserve token contiguity when reshaping disjoint sstables
2021-03-22 10:43:13 +02:00
Gleb Natapov
272cb1c1e6 raft: add test for leader stepdown 2021-03-22 10:31:16 +02:00
Gleb Natapov
9d6bf7f351 raft: introduce leader stepdown procedure
Section 3.10 of the PhD describes two cases for which the extension can
be helpful:

1. Sometimes the leader must step down. For example, it may need to reboot
 for maintenance, or it may be removed from the cluster. When it steps
 down, the cluster will be idle for an election timeout until another
 server times out and wins an election. This brief unavailability can be
 avoided by having the leader transfer its leadership to another server
 before it steps down.

2. In some cases, one or more servers may be more suitable to lead the
 cluster than others. For example, a server with high load would not make
 a good leader, or in a WAN deployment, servers in a primary datacenter
 may be preferred in order to minimize the latency between clients and
 the leader. Other consensus algorithms may be able to accommodate these
 preferences during leader election, but Raft needs a server with a
 sufficiently up-to-date log to become leader, which might not be the
 most preferred one. Instead, a leader in Raft can periodically check
 to see whether one of its available followers would be more suitable,
 and if so, transfer its leadership to that server. (If only human leaders
 were so graceful.)

The patch here implements the extension and employs it automatically
when a leader removes itself from a cluster.
2021-03-22 10:28:43 +02:00
Gleb Natapov
888b52dea1 raft: fix replication when leader is not part of current config
When a leader orchestrates its own removal from a cluster there is a
situation where the leader is still responsible for replication, but it
is no longer part of active configuration. Current code skips replication
in this case though. Fix it by always replicating in the leader state.
2021-03-22 09:52:17 +02:00
Gleb Natapov
1acc8996bc raft: do not update last election time if current leader is not a part of current configuration
Since we use external failure detector instead of relying on empty
AppendRequests from a leader there can be a situation where a node
is no longer part of a certain raft group but is still alive (and also
may be part of other raft groups). In such case last election time
should not be updated even if the node is alive. It is the same as if
it would have stopped to send empty AppendRequests in original raft.
2021-03-22 09:52:17 +02:00
Gleb Natapov
ccf4435759 raft: move log limiting semaphore into the leader state
Log limiting semaphore is used on a leader only, so it should be stored
inside the leader state.
2021-03-22 09:52:17 +02:00
Takuya ASADA
35a14ab22b configure.py: drop compat-python3 targets
Since we switched scylla-python3 build directory to tools/python3/build
on Jenkins, we nolonger need compat-python3 targets, drop them.

Related scylladb/scylla-pkg#1554

Closes #8328
2021-03-21 18:04:27 +02:00
Benny Halevy
f562c9c2f3 test: sstable_datafile_test: tombstone_purge_test: use a longer ttl
As seen in next-3319 unit testing on jenkins
The cell ttl may expire during the test (presuming
that the test machine was overloaded), leading to:
```
INFO  2021-03-21 10:05:23,048 [shard 0] compaction - [Compact tests.tombstone_purge 2fcaf680-8a1c-11eb-b1b9-97020c5d261e] Compacting [/jenkins/workspace/scylla-master/next/scylla/testlog/release/scylla-af8644ec-7f07-4ffe-80bf-6703a942e435/la-17-big-Data.db:level=0:origin=, ]
INFO  2021-03-21 10:05:23,048 [shard 0] compaction - [Compact tests.tombstone_purge 2fcaf680-8a1c-11eb-b1b9-97020c5d261e] Compacted 1 sstables to []. 4kB to 0 bytes (~0% of original) in 0ms = 0 bytes/s. ~128 total partitions merged to 0.
./test/lib/mutation_assertions.hh(108): fatal error: in "tombstone_purge_test": Mutations differ, expected {table: 'tests.tombstone_purge', key: {'id': alpha, token: -7531858254489963}, mutation_partition: {
  rows: [
    {
      cont: true,
      dummy: false,
      position: {
        bound_weight: 0,
      },
      'value': { atomic_cell{1,ts=1616313953,expiry=1616313958,ttl=5} },
    },
  ]
}
}
 ...but got: {table: 'tests.tombstone_purge', key: {'id': alpha, token: -7531858254489963}, mutation_partition: {
  rows: [
    {
      cont: true,
      dummy: false,
      position: {
        bound_weight: 0,
      },
      'value': { atomic_cell{DEAD,ts=1616313953,deletion_time=1616313953} },
    },
  ]
}
}
```

This corresponds to:
```
2395            auto mut2 = make_expiring(alpha, ttl);
2396            auto mut3 = make_insert(beta);
...
2399            auto sst2 = make_sstable_containing(sst_gen, {mut2, mut3});
```

Extend (logical) ttl to 10 seconds to reduce flakiness
due to real-time timing.

Test: sstable_datafile_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210321142931.1226850-1-bhalevy@scylladb.com>
2021-03-21 16:42:00 +02:00
Avi Kivity
1e820687eb Merge "reader_concurrency_semaphore: limit non-admitted inactive reads" from Botond
"
Due to bad interaction of recent changes (913d970 and 4c8ab10) inctive
readers that are not admitted have managed to completely fly under the
radar, avoiding any sort of limitation. The reason is that pre-admission
the permits don't forward their resource cost to the semaphore, to
prevent them possibly blocking their own admission later. However this
meant that if such a reader is registered as inactive, it completely
avoids the normal resource based eviction mechanism and can accumulate
without bounds.
The real solution to this is to move the semaphore before the cache and
make all reads pass admission before they get started (#4758). Although
work has been started towards this, it is still a while until it lands.
In the meanwhile this patchset provides a workaround in the form of a
new inactive state, which -- like admitted -- causes the permit to
forward its cost to the semaphore, making sure these un-admitted
inactive reads are accounted for and evicted if there is too much of
them.

Fixes: #8258

Tests: unit(release), dtest(oppartitions_test.py:TestTopPartitions.test_read_by_gause_key_distribution_for_compound_primary_key_and_large_rows_number)
"

* 'reader-concurrency-semaphore-limit-inactive-reads/v4' of https://github.com/denesb/scylla:
  test: mutation_reader_test: add test for permit cleanup
  test: querier_cache_test: add memory based cache eviction test
  reader_permit: add inactive state
  querier: insert(): account immediately evicted querier as resource based eviction
  reader_concurrency_semaphore: fix clear_inactive_reads()
  reader_concurrency_semaphore: make inactive_read_handle a weak reference
  reader_concurrency_semaphore: make evict() noexcept
  reader_concurrency_semaphore: update out-of-date comments
2021-03-21 16:24:54 +02:00
Nadav Har'El
ab75226626 test/cql-pytest: remove xfail from passing test
After commit 0bd201d3ca ("cql3: Skip indexed
column for CK restrictions") fixed issue #7888, the test
cassandra_tests/validation/entities/frozen_collections_test.py::testClusteringColumnFiltering
began passing, as expected. So we can remove its "xfail" label.

Refs #7888.

cassandra_tests/validation/entities/frozen_collections_test.py::testClusteringColumnFiltering

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210321080522.1831115-1-nyh@scylladb.com>
2021-03-21 16:02:30 +02:00
Avi Kivity
e2cd551880 Update seastar submodule
* seastar ea5e529f30...83339edb04 (21):
  > cmake: filter out -Wno-error=#warnings from pkgconfig (seastar.pc)
  > Merge 'utils/log.cc: fix nested_exception logging (again)' from Vlad Zolotarov
Fixes #8327.
  > file: Add option to refuse the append-challenged file
  > Merge "Teach io-tester to work on block device" from Pavel E
  > Merge "Cleanup files code" from Pavel E
  > install-dependencies: Support rhel-8.3
  > install-dependencies: Add some missing rh packages
  > file, reactor: reinstate RWF_NOWAIT support
  > file: Prevent fsxattr.fsx_extsize from overflow
  > cmake: enable clang's -Wno-error=#warnings if supported
  > cmake: harden seastar_supports_flag aginst inputs with spaces or #
  > cmake: fix seastar_supports_flag failing after first invocation
  > thread: Stop backtraces in main() on s390x architecture
  > intent: Explicitly declare constructors for references
  > test: file_io_test: parallel_overwrite: use testing::local_random_engine
  > util: log-impl: rework log_buf::inserter_iterator
  > rwlock: pass timeout parameter to get_units
  > concepts: require lib support to enable concepts
  > rpc: print more info on bad protocol magic
  > seastar-addr2line: strip input line to restore multiline support
  > log: skip on unknown nested mixing instead of stopping the logging
Ref #8327.
2021-03-21 15:58:10 +02:00
Nadav Har'El
10bf2ba60a cql-pytest: translate Cassandra's reproducers for issue #2962
This is a translation of Cassandra's CQL unit test source file
validation/entities/SecondaryIndexOnMapEntriesTest.java into our
our cql-pytest framework.

This test file checks various features of indexing (with secondary index)
individual entries of maps. All these tests pass on Cassandra, but fail on
Scylla because of issue #2962 - we do not yet support indexing of the content
of unfrozen collections. The failing test currently fail as soon as they
try to create the index, with the message:
"Cannot create secondary index on non-frozen collection or UDT column v".

Refs #2962.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210310124638.1653606-1-nyh@scylladb.com>
2021-03-21 12:30:00 +02:00
Avi Kivity
75da8a8d81 Merge 'Fix the retry mechanism in Thrift frontend' from Piotr Sarna
Thrift used to be quite unsafe with regard to its retry mechanism, which caused very rapid use of resources, namely the number of file descriptors. It was also prone to use-after-free due to spawning futures without guarding the captured objects with anything.
The mechanism is now cleaned up, and a simple exponential backoff replaced previous constant backoff policy.

Fixes #8317
Tests: unit(dev), manual(see #8317 for a simple reproducer)

Closes #8318

* github.com:scylladb/scylla:
  thrift: add exponential backoff for retries
  thrift: fix and simplify retry logic
2021-03-21 12:26:13 +02:00
Avi Kivity
a78f43b071 Merge 'tracing: fast slow query tracing' from Ivan Prisyazhnyy
The set of patches introduces a new tracing mode - `fast slow query tracing`. In this mode, Scylla tracks only tracing sessions and omits all tracing events if the tracing context does not have a `full_tracing` state set.

Fixes #2572

Motivation
---

We want to run production systems with that option always enabled so we could always catch slow queries without an overhead. The next step is we are gonna optimize further the costs of having tracing enabled to minimize session context handling overhead to allow it to be as transparent for the end-user as possible.

Fast tracing mode
---

To read the status do

    $ curl -v http://localhost:10000/storage_service/slow_query

To enable fast slow-query tracing

    $ curl -v --request POST http://localhost:10000/storage_service/slow_query\?fast=true\&enable=true

Potential optimizations
---

- remove tracing::begin(lazy_eval)
- replace tracing::begin(string) for enum to remove copying and memory allocations
- merge parameters allocations
- group parameters check for trace context
- delay formatting
- reuse prepared statement shared_ptr instead of both copying it and copying its query

Performance
---

100% cache hits
---

1 Core:

```
$ SCYLLA_HOME=/home/sitano.public/Projects/scylla build/release/scylla --smp 1 --cpuset 7 --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --workdir /home/sitano.public/Projects/scylla --developer-mode 1 --listen-address 0.0.0.0 --api-address 0.0.0.0 --rpc-address 0.0.0.0 --broadcast-rpc-address 172.18.0.1 --broadcast-address 127.0.0.1

./cassandra-stress write n=100000 no-warmup -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=1 -mode native cql3

curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=false\&enable\=false
for i in $(seq 5); do
  taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3
done

curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=true\&enable\=true
for i in $(seq 5); do
  taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3
done

curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=false\&enable\=true
for i in $(seq 5); do
  taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3
done
```

  | qps |   |   |  
-- | -- | -- | -- | --
  | baseline | fast, slow | nofast, slow | %[1-fastslow/baseline]
  | 29,018 | 26,468 | 23,591 | 8.79%
  | 28,909 | 26,274 | 23,584 | 9.11%
  | 28,900 | 26,547 | 23,598 | 8.14%
  | 28,921 | 26,669 | 23,596 | 7.79%
  | 28,821 | 26,385 | 23,601 | 8.45%
stdev | 70.24030182 | 150.9678774 | 6.670832032 |  
avg | 28,914 | 26,469 | 23,594 |  
stderr | 0.24% | 0.57% | 0.03% |  
%[avg/baseline] |   | **8.46%** | 18.40% |  

8.46% performance degradation in `fast slow query mode` for pure in-memory workload with minimum traces.
18.40%  performance degradation in `original slow query mode` for pure in-memory workload with minimum traces.

0% cache hits
---

1GB memory, 1 Core:

    $ SCYLLA_HOME=/home/sitano.public/Projects/scylla build/release/scylla --memory 1G --smp 1 --cpuset 7 --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --workdir /home/sitano.public/Projects/scylla --developer-mode 1 --listen-address 0.0.0.0 --api-address 0.0.0.0 --rpc-address 0.0.0.0 --broadcast-rpc-address 172.18.0.1 --broadcast-address 127.0.0.1

2.4GB, 10000000 keys data:

    $ ./cassandra-stress write n=10000000 no-warmup -pop seq=1..10000000 -node 127.0.0.1 -log level=verbose -rate threads=4 -mode native cql3
    $ curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=true\&enable\=true

CASSANDRA_STRESS prepared statements with BYPASS CACHE

    $ taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..10000000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3

20000 reads IOPS, 100MB/s from disk

  | qps |   |   |  
-- | -- | -- | -- | --
  | baseline reads | fast, slow reads | %[1-fastslow/baseline] |  
  | 9,575 | 9,054 | 5.44% |  
  | 9,614 | 9,065 | 5.71% |  
  | 9,610 | 9,066 | 5.66% |  
  | 9,611 | 9,062 | 5.71% |  
  | 9,614 | 9,073 | 5.63% |  
stdev | 16.75410397 | 6.892024376 |
avg | 9,605 | 9,064 |
stderr | 0.17% | 0.08% |
%[avg/baseline] |   | **5.63%** |

5.63% performance degradation in `fast slow query mode` for pure on-disk workload with minimum traces.

Closes #8314

* github.com:scylladb/scylla:
  tracing: fast mode unit test
  tracing: rest api for lightweight slow query tracing
  tracing: omit tracing session events and subsessions in fast mode
2021-03-21 12:15:17 +02:00
Dejan Mircevski
318f773d81 types: Unreverse tuple subtype for serialization
When a tuple value is serialized, we go through every element type and
use it to serialize element values.  But an element type can be
reversed, which is artificially different from the type of the value
being read.  This results in a server error due to the type mismatch.
Fix it by unreversing the element type prior to comparing it to the
value type.

Fixes #7902

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8316
2021-03-21 12:07:29 +02:00
Dejan Mircevski
0bd201d3ca cql3: Skip indexed column for CK restrictions
When querying an index table, we assemble clustering-column
restrictions for that query by going over the base table token,
partition columns, and clustering columns.  But if one of those
columns is the indexed column, there is a problem; the indexed column
is the index table's partition key, not clustering key.  We end up
with invalid clustering slice, which can cause problems downstream.

Fix this by skipping the indexed column when assembling the clustering
restrictions.

Tests: unit (dev)

Fixes #7888

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8320
2021-03-21 09:52:06 +02:00
Avi Kivity
58b7f225ab keys: convert trichotomic comparators to return std::strong_ordering
A trichotomic comparator returning an int an easily be mistaken
for a less comparator as the return types are convertible.

Use the new std::strong_ordering instead.

A caller in cql3's update_parameters.hh is also converted, following
the path of least resistance.

Ref #1449.

Test: unit (dev)

Closes #8323
2021-03-21 09:30:43 +02:00
Avi Kivity
29a5047982 utils: error_injection: convert enable_if to concepts
Constrain inject() with a requires clause rather than enable_if,
simplifying the code and compiler diagnostics.

Note that the second instance could not have been called, since
the template argument does not appear in the function parameter
list and thus could not be deduced. This is corrected here.

Closes #8322
2021-03-21 09:28:23 +02:00
Avi Kivity
c28d67dd7f types: time_point_to_string: convert enable_if to concepts
time_point_to_string ensures its input is a time_point with
millisecond resolution (though it neglects to verify the epoch
is what it expects). Change the test from a clunky enable_if to
a nicer concept.

Closes #8321
2021-03-21 09:11:40 +02:00
Tomasz Grabiec
88a019ba21 Merge "raft: respond with snapshot_reply to send_snapshot RPC" from Kostja
Currently send_snapshot is the only two-way RPC used by Raft.
However, the sender (the leader) does not look at the receiver's
reply, other than checks it's not an error. This has the following
issues:

- if the follower has a newer term and rejects the snapshot for
  that reason, the leader will not learn about a newer follower
  term and will not step down
- the send_snapshot message doesn't pass through a single-endpoint
  fsm::step() and thus may not follow the general Raft rules
  which apply for all messages.
- making a general purpose transport that simply calls fsm::step()
  for every message becomes impossible.

Fix it by actually responding with snapshot_reply to send_snapshot
RPC, generating this reply in fsm::step() on the follower,
and feeding into fsm::step() on the leader.

* scylla-dev/raft-send-snapshot-v2:
  raft: pass snapshot_reply into fsm::step()
  raft: respond with snapshot_reply to send_snapshot RPC
  raft: set follower's next_idx when switching to SNAPSHOT mode
  raft: set the current leader upon getting InstallSnapshot
2021-03-19 18:13:40 +01:00
Piotr Sarna
31d3854bb7 thrift: add exponential backoff for retries
The original backoff mechanism which just retries after 1ms
may still lead to rapid resource depletion.
Instead, an exponential backoff is used, with a cap of ~2s.

Tests: manual, with cassandra-stress and browsing logs
2021-03-19 13:16:39 +01:00
Piotr Sarna
f81044d75d thrift: fix and simplify retry logic
The retry logic for Thrift frontend had two bugs:
1. Due to missing break in a switch statement,
   two retry calls were always performed instead of one,
   which acts a little bit like a Seastar forkbomb
2. The delayed action was not guarded with any gate,
   so it was theoretically possible to access a captured `this`
   pointer of an object which already got deallocated.

In order to fix the above, the logic is simplified to always
retry with backoff - it makes very little sense to skip the backoff
and immediate retries are not needed by anyone, while they cause
severe overload risk.

Tests: manual - a simple cassandra-stress invocation was able to crash
       scylla with a segfault:
       $ cassandra-stress write -mode thrift -rate threads=2000

Fixes #8317
2021-03-19 13:15:35 +01:00
Nadav Har'El
abab1d906c Merge 'sstables: convert enable_if to equivalent concepts' from Avi Kivity
enable_if is hard to understand, especially its error messages. Convert
enable_if in sstable code to concepts.

A new concept is introduced, self_describing, for the case of a type
that follows the obj.describe_type() protocol. Otherwise this is quite
straightforward.

Closes #8315

* github.com:scylladb/scylla:
  sstables: vector write: convert to concepts
  sstables: check_truncated_and_assign: convert to concept
  sstables: convert write() to concepts
  sstables: convert write_vint() to concepts
  sstables: vector parse(): convert to concept
  sstables: convert parse() for a self-describing type to concept
  sstables: read_vint(): convert enable_if to concepts
  sstables: add concept for self-describing type
2021-03-18 23:09:34 +02:00
Raphael S. Carvalho
64d78eae6a tests: Add unit test for off-strategy sstable compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 16:56:00 -03:00
Avi Kivity
bf0c7d1340 sstables: vector write: convert to concepts
We have an integral and a non-integral overload, each constrained
with enable_if. We use std::integral to constrain the integral
overload and leave the other unconstrained, as C++ will choose the
more constrained version when applicable.
2021-03-18 19:26:54 +02:00
Avi Kivity
11636563d9 sstables: check_truncated_and_assign: convert to concept
Use std::integral instead of static_assert to reject non-integral
parameters.
2021-03-18 19:26:54 +02:00
Avi Kivity
42e3f33722 sstables: convert write() to concepts
There are three variants: integral, enum, and self-describing
(currently expressed as not integral and not enum). Convert to
concepts by using the standard concepts or the new self_describing
concept.
2021-03-18 19:26:43 +02:00
Avi Kivity
4832041857 sstables: convert write_vint() to concepts
Instead of a maze of deleted functions, enable_if, and static_assert,
use the standard std::integral concept.
2021-03-18 19:24:42 +02:00
Nadav Har'El
0b2cf21932 alternator-test: increase read timeout and avoid retries
By default the boto3 library waits up to 60 second for a response,
and if got no response, it sends the same request again, multiple
times. We already noticed in the past that it retries too many times
thus slowing down failures, so in our test configuration lowered the
number of retries to 3, but the setting of 60-second-timeout plus
3 retries still causes two problems:

  1. When the test machine and the build are extremely slow, and the
     operation is long (usually, CreateTable or DeleteTable involving
     multiple views), the 60 second timeout might not be enough.

  2. If the timeout is reached, boto3 silently retries the same operation.
     This retry may fail because the previous one really succeeded at
     least partially! The symptom is tests which report an error when
     creating a table which already exists, or deleting a table which
     dooesn't exist.

The solution in this patch is first of all to never do retries - if
a query fails on internal server error, or times out, just report this
failure immediately. We don't expect to see transient errors during
local tests, so this is exactly the right behavior.
The second thing we do is to increase the default timeout. If 1 minute
was not enough, let's raise it to 5 minutes. 5 minutes should be enough
for every operation (famous last words...).

Even if 5 minutes is not enough for something, at least we'll now see
the timeout errors instead of some wierd errors caused by retrying an
operation which was already almost done.

Fixes #8135

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210222125630.1325011-1-nyh@scylladb.com>
2021-03-18 18:58:08 +02:00
Avi Kivity
777d48e78d sstables: vector parse(): convert to concept
The two vector parse() overloads select between integral members
and non-integral members. Use std::integral to constrain the
integral overload and leave the other unconstrained; C++ will choose
the more constrained version when it applies.
2021-03-18 18:48:11 +02:00
Avi Kivity
bc42aee7c1 sstables: convert parse() for a self-describing type to concept
This parse() overload uses "not integral and not enum" to reject
non-self-describing types. Express it directly with the self_describing
concept instead.
2021-03-18 18:47:00 +02:00
Avi Kivity
a96b8e8aed sstables: read_vint(): convert enable_if to concepts
Convert read_vint() to a concept. The explicitly deleted version
is no longer needed since wrongly-typed inputs will be rejected
by the constraint. Similarly the static assert can be dropped
for the same reason.
2021-03-18 18:45:05 +02:00
Avi Kivity
bba9c1c616 sstables: add concept for self-describing type
Our sstable parsing and writing code contains a self-describing
type concept, where a type can advertise its members via a
describe_types() member function with a specific protocol.

Formalize that into a C++ concept. This is a little tricky, since
describe_type() accepts a parameter that is itself a template, and
requires clauses only work with concrete type. To handle this problem,
create such a concrete example type and use it in the concept.
2021-03-18 17:52:54 +02:00
Botond Dénes
7980140549 test: test_utils: do_check()/do_require(): tone down log to trace
They are way too noisy to be at debug level.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210318143547.101932-1-bdenes@scylladb.com>
2021-03-18 16:59:59 +02:00
Raphael S. Carvalho
65b09567dd table: Wire up off-strategy compaction on repair-based bootstrap and replace
Now, sstables created by bootstrap and replace will be added to the
maintenance set, and once the operation completes, off-strategy compaction
will be started.

We wait until the end of operation to trigger off-strategy, as reshaping
can be more efficient if we wait for all sstables before deciding what
to compact. Also, waiting for completion is no longer an issue because
we're able to read from new sstables using partitioned_sstable_set and
their existence aren't accounted by the compaction backlog tracker yet.

Refs #5226.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
c45d2e1d27 table: extend add_sstable_and_update_cache() for off-strategy
Function is extended to add sstable to maintenance set if requested
by the caller.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
6ca2ac34ac sstables/compaction_manager: Add function to submit off-strategy work
This new variant will allow its caller to submit off-strategy job
asynchronously on behalf of a given table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
e0e5bf8285 table: Introduce off-strategy compaction on maintenance sstable set
Off-strategy compaction is about incrementally reshaping the off-strategy
sstables in maintenance set, using our existing reshape mechanism, until
the set is ready for integration into the main sstable set.
The whole operation is done in maintenance mode, using the streaming
scheduling group.
We can do it this way because data in maintenance set is disjoint, so
effects on read amplification is avoided by using
partitioned_sstable_set, which is able to efficiently and incrementally
retrieve data from disjoint sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
439e9b6fab table: change build_new_sstable_list() to accept other sstable sets
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
6e95860e09 table: change non_staging_sstables() to filter out off-strategy sstables
SSTables that are off-strategy should be excluded by this function as
it's used to select candidates for regular compaction.
So in addition to only returning candidates from the main set, let's
also rename it to precisely reflect its behavior.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
c64a156c53 table: Introduce maintenance sstable set
This new sstable set will hold sstables created by repair-based
operations. A repair-based op creates 1 sstable per vrange (256),
so sstables added to this new set are disjoint, therefore they
can be efficiently read from using partitioned_sstable_set.

Compound set is changed to include this new set, so sstables in
this new set are automatically included when creating readers,
computing statistics, and so on.
This new set is not backlog tracked, so changes were needed to
prevent a sstable in this set from being added or removed from
the tracker.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:47 -03:00
Raphael S. Carvalho
1e7a444a8b table: Wire compound sstable set
From now own, _sstables  becomes the compound set, and _main_sstables refer
only to the main sstables of the table. In the near future, maintenance
set will be introduced and will also be managed by the compound set.

So add_sstable() and on_compaction_completion() are changed to
explicitly insert and remove sstables from the main set.

By storing compound set in _sstables, functions which used _sstables for
creating reader, computing statistics, etc, will not have to be changed
when we introduce the maintenance set, so code change is a lot minimized
by this approach.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:46:06 -03:00
Raphael S. Carvalho
42b309b43e table: prepare make_reader_excluding_sstables() to work with compound sstable set
Compound set will not be inserted or erased directly, so let's change
this function to build a new set from scratch instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:50 -03:00
Raphael S. Carvalho
4e142458eb table: prepare discard_sstables() to work with compound sstable set
After compound set, discard_sstables() will have to prune each set
individually and later refresh the compound set. So let's change
the function to support multiple sstable sets, taking into account
that a sstable set may not want to be backlog tracked.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:50 -03:00
Raphael S. Carvalho
d25822a030 table: extract add_sstable() common code into a function
The purpose is to allow the code to be eventually reused by maintenance
sstable set, which will be soon introduced.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:50 -03:00
Raphael S. Carvalho
e4b5f5ba33 sstable_set: Introduce compound sstable set
This new sstable set implementation is useful for combining operation of
multiple sstable sets, which can still be referenced individually via
its shared ptr reference.
It will be used when maintenance set is introduced in table, so a
compound set is required to allow both sets to have their operations
efficiently combined.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:49 -03:00
Raphael S. Carvalho
1261519266 reshape: STCS: preserve token contiguity when reshaping disjoint sstables
When reshaping hundreds of disjoint sstables, like on bootstrap,
contiguity wasn't being preserved because the heuristic for picking
candidates didn't take into account their token range, which resulted
in reshape messing with the contiguity that could otherwise be
preserved by respecting the token order of the disjoint sstables.
In other words, sstables with the smallest first tokens should be
compacted first. By doing that, the contiguity is preserved even
across size tiers, after reshape has completed its possible multiple
rounds to get all the data in shape.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:36:18 -03:00
Botond Dénes
ad02f313dd test: mutation_reader_test: add test for permit cleanup
Check that a permit correctly restores the units on the semaphore in
each state it can be destroyed in.
2021-03-18 16:18:22 +02:00
Raphael S. Carvalho
e53cedabb1 LCS: reshape: tolerate more sstables in level 0 with relaxed mode
Relaxed mode, used during initialization, of reshape only tolerates min_threshold
(default: 4) L0 sstables. However, relaxed mode should tolerate more sstables in
level 0, otherwise boot will have to reshape level 0 every time it crosses the
min threshold. So let's make LCS reshape tolerate a max of max_threshold and 32.
This change is beneficial because once table is populated, LCS regular compaction
can decide to merge those sstables in level 0 into level 1 instead, therefore
reducing WA.

Refs #8297.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com>
2021-03-18 15:58:21 +02:00
Botond Dénes
2b7c1bce86 scylla-gdb.py: add variant_member convenience function
Allow conveniently accessing the active member of an `std::variant`
instance.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210318134427.92668-1-bdenes@scylladb.com>
2021-03-18 15:57:51 +02:00
Konstantin Osipov
fcc6e621f8 raft: pass snapshot_reply into fsm::step()
By the time we receive snapshot_reply from a follower
we may no longer be the leader. Follower term may be
different from snapshot term, e.g. the follower may
be aware of a new leader already and have a higher term.

We should pass this information into (possibly ex-) leader FSM via
fsm::step() so that it can correctly change its state, and
not call FSM directly.
2021-03-18 16:56:46 +03:00
Konstantin Osipov
4afa662d62 raft: respond with snapshot_reply to send_snapshot RPC
Raft send_snapshot RPC is actually two-way, the follower
responds with snapshot_reply message. This message until now
was, however, muted by RPC.

Do not mute snapshot_reply any more:
- to make it obvious the RPC is two way
- to feed the follower response directly into leader's FSM and
  thus ensure that FSM testing results produced when using a test
  transport are representative of the real world uses of
  raft::rpc.
2021-03-18 16:56:42 +03:00
Konstantin Osipov
cb3314d756 raft: set follower's next_idx when switching to SNAPSHOT mode
Set follower's next_idx to snapshot index + 1 when switching
it to snapshot mode. If snapshot transfer succeeds, that's the
best match for the follower's next replication index. If it fails,
the leader will send a new probe to find out the follower position
again and re-try sending a possibly newer snapshot.

The change helps reduce protocol state managed outside FSM.
2021-03-18 16:35:11 +03:00
Ivan Prisyazhnyy
f00391af8b tracing: fast mode unit test
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-18 15:05:09 +02:00
Ivan Prisyazhnyy
7cbe2aa9c6 tracing: rest api for lightweight slow query tracing
The patch adds REST API support for the lightweight
slow query tracing (fast) mode that is implemented by
omitting all of the trace events during the tracing.

    $ curl -v http://localhost:10000/storage_service/slow_query
    $ curl -v --request POST http://localhost:10000/storage_service/slow_query\?fast=true\&enable=true

Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-18 15:05:05 +02:00
Ivan Prisyazhnyy
85fbca2049 tracing: omit tracing session events and subsessions in fast mode
If tracing::tracing::_ignore_trace_events is enabled then
the tracing system must ignore all sessions events
for non full_tracing sessions (probability tracing and
user requested) and creating subsessions with the
make_trace_info.

Patch introduces the slow query tracing fast mode that
omits all events during tracing.

Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-18 15:04:47 +02:00
Botond Dénes
c822f0d02a test: querier_cache_test: add memory based cache eviction test
Ensure that the memory consumption of querier cache entries is kept
under the limit.
2021-03-18 14:58:21 +02:00
Botond Dénes
a14bb4ba94 reader_permit: add inactive state
This state will be used for permits that are not in admitted state when
registered as inactive. We can have such reads if a read can be served
entirely from cache/memtables and it doesn't have to go to disk and
hence doesn't go through admission. These permits currently don't
forward their cost to the semaphore so they won't prevent their own
admission creating a deadlock. However, when in inactive state, we do
want to keep tabs on their resource consumption so we don't accumulate
too much of these inactive reads. So introduce a new state for these
non-admitted inactive reads. When entering the inactive state, the
permit registers its cost with the semaphore, and when unregistered as
inactive, it retracts it. This is a workaround (khm hack) until #4758 is
solved and all permits will be admitted on creation.
2021-03-18 14:58:21 +02:00
Botond Dénes
594636ebbf querier: insert(): account immediately evicted querier as resource based eviction
`reader_concurrency_semaphore::register_inactive_read()` drops the
registered inactive read immediately if there is a resource shortage.
This is in effect a resource based eviction, so account it as such in
`querier::insert()`.
2021-03-18 14:57:57 +02:00
Botond Dénes
1a337d0ec1 reader_concurrency_semaphore: fix clear_inactive_reads()
Broken by the move to an intrusive container (9cbbf40), which caused
said method to only clear the container but not destroy the inactive
reads contained therein. This patch restores the previous behaviour and
also adds a call the destructor (to ensure inactive reads are cleaned up
under any circumstances), as well as a unit test.
2021-03-18 14:57:57 +02:00
Botond Dénes
581edc4e4e reader_concurrency_semaphore: make inactive_read_handle a weak reference
Having the handle keep an owning reference to the inactive read lead to
awkward situations, where the inactive read is destroyed during eviction
in certain situations only (querier cache) and not in other cases.
Although the users didn't notice anything from this, it lead to very
brittle code inside the reader concurrency semaphore. Among others, the
inactive read destructor has to be open coded in evict() which already
lead to mistakes.
This patch goes back to the weak pointer paradigm used a while ago,
which is a much more natural fit for this. Inactive reads are still kept
in an intrusive list in the semaphore but the handle now keeps a weak
pointer to them. When destroyed the handler will destroy the inactive
read if it is still alive. When evicting the inactive read, it will
set the pointer in the handle to null.
2021-03-18 14:57:57 +02:00
Botond Dénes
cbc83b8b1b reader_concurrency_semaphore: make evict() noexcept
In the next patch it will be called from a destructor.
2021-03-18 14:57:57 +02:00
Botond Dénes
2d348e0211 reader_concurrency_semaphore: update out-of-date comments 2021-03-18 14:57:57 +02:00
Botond Dénes
3b8220f777 scylla-gdb.py: update w.r.t. storage_proxy::_hints_manager not being optional
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210318110256.50137-1-bdenes@scylladb.com>
2021-03-18 12:47:57 +01:00
Piotr Sarna
2509b7dbde Merge 'dht: convert ring_position and decorated_key to std::strong_ordering' from Avi Kivity
As #1449 notes, trichotomic comparators returning int are dangerous as they
can be mistaken for less comparators. This series converts dht::ring_position
and dht::decorated_key, as well as a few closely related downstream types, to
return std::strong_ordering.

Closes #8225

* github.com:scylladb/scylla:
  dht: ring_position, decorated_key: convert tri_comparators to std::strong_ordering
  pager: rephrase misleading comparison check
  test: total_order_checks: prepare for std::strong_ordering
  test: mutation_test: prepare merge_container for std::strong_ordering
  intrusive_array: prepare for std::strong_ordering
  utils: collection-concepts: prepare for std::strong_ordering
2021-03-18 11:51:54 +01:00
Avi Kivity
378556418c dht: ring_position, decorated_key: convert tri_comparators to std::strong_ordering
Convert tri_comparators to return std::strong_ordering rather than int,
to prevent confusion with less comparators. Downstream users are either
also converted, or adjust the return type back to int, whichever happens
to be simpler; in all cases the change it trivial.
2021-03-18 12:40:05 +02:00
Avi Kivity
4ead1a79ce pager: rephrase misleading comparison check
We check !result_of_tri_compare, which makes it look like we're
checking a boolean predicate, whereas we're really checking for
equality. Change to result_of_tri_compare == 0, which is less likely
to be confusing, and is also compatible with std::strong_ordering.
2021-03-18 12:40:05 +02:00
Avi Kivity
a5f17b9a2d test: total_order_checks: prepare for std::strong_ordering
Adjust the total_order_check template to work with comparators
returning either int (as a temporary compatibility measure) or
std::strong_ordering (for #1449 safety).
2021-03-18 12:40:05 +02:00
Avi Kivity
f0092ae475 test: mutation_test: prepare merge_container for std::strong_ordering
The function merge_container() accepts a trichotomic comparator returning
an int. As #1449 explains, this is dangerous as it could be mistaken for
a less comparator. Switch to std::strong_ordering, but leave a compatible
merge_container() in place as it is still needed (even after this series).
2021-03-18 12:40:05 +02:00
Avi Kivity
fe0f983dfb intrusive_array: prepare for std::strong_ordering
Newer comparators can return std::strong_ordering, so don't
expect an int.
2021-03-18 12:40:05 +02:00
Avi Kivity
9fbe4850c9 utils: collection-concepts: prepare for std::strong_ordering
collection-concepts includes a Comparable concept for a trichotomic
comparator function, used in intrusive btree and double_decker. Prepare
for std::strong_ordering by also allowing std::strong_ordering as a
return type. Once we've cleaned the code base, we can tighten it to
only allow std::strong_ordering.
2021-03-18 12:40:03 +02:00
Piotr Sarna
0bcf584992 docs: mention --no-rebase in maintainer.md
For a default git config it's enough to pull with --no-ff
to ensure that a merge commit is created, but with a custom
configuration, it's better to also explicitly prevent
rebasing.

Message-Id: <7dc6027f1f38fa4db7435592a3b72308b1a08614.1616063525.git.sarna@scylladb.com>
2021-03-18 12:38:29 +02:00
Piotr Sarna
5a852d3812 Merge 'Decouple memory limiter sem from storage service' from Pavel
This set removes few more calls for global storage service and prevents
more of them to happen in thrift that's about to start using the memory
limiter semaphore too.

The set turns this semaphore into a sharded one living in the scope of
main(), makes others use the local instance and removes the no longer
needed bits from storage service.

tests: unit(dev)
branch: https://github.com/xemul/scylla/commits/br-global-memory-limiter-sem

* xemul_drop_memory_limiter:
  storage_service: Drop memory limiter
  memory_limiter: Use main-local instance everyehere
  main: Have local memory limiter and carry where needed
  memory_limiter: Encapsulate memory limiting facility
  cql_server: Remove semaphore getter fn from config
2021-03-18 11:29:32 +01:00
Pavel Emelyanov
dcdd207349 storage_service: Drop memory limiter
Nobody uses it now.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Pavel Emelyanov
f0a79574d4 memory_limiter: Use main-local instance everyehere
The cql_server and alternator both need the limiter, so
patch them to stop using storage service's one and use
the main-local one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Pavel Emelyanov
359e9caf54 main: Have local memory limiter and carry where needed
Prepare memory limiters to have non-global instance of
the service. For now the main-local instance is not
used and (!) is not stopped for real, just like the
storage_service's one is.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Pavel Emelyanov
4ca2ae1341 memory_limiter: Encapsulate memory limiting facility
The storage service carries sempaphore and a size_t value
to facilitate the memory limiting for client services.

This patch encapsulates both fields on a separate helper
class that will be used by whoever needs it without
messing with the storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Pavel Emelyanov
c2f94fb527 cql_server: Remove semaphore getter fn from config
The cql_server() need to get the memory limiter semaphore
from local storage service instance. To make this happen
a callback in introduced on the config structure. The same
can be achieved in a simler manner -- by providing the
local storage service instances directly.

Actually, the storage service will be removed in further
patches from this place, so this patch is mostly to get
rid of the callback from the config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Nadav Har'El
4a7d3175e9 test/alternator: make another test faster
The slowest test in test_streams.py is test_list_streams_paged. It is meant
to test the ListStreams operation with paging. The existing test repeated
its test four times, for four different stream types. However, there is
no reason to suspect that the ListStreams operation might somehow be
different for the four stream types... We already have other tests which
create streams of the four types, and uses these streams - we don't
need the test for ListStreams to also test creating the four types.

By doing this test just once, not four times, we can save around 1.5
seconds of test time.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210318073755.1784349-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Nadav Har'El
79af728335 test/alternator: make tracing test a bit faster
In the test test_tracing.py::test_tracing_all, we do some operations and
then need to wait until they appear in the tracing table.
The current code used an exponentially-increasing delay during this wait,
starting with 0.1 seconds and then doubling the delay until we find what
we're looking for.

However, it turns out that the delay until the data appears in the table
is deliberately chosen by Scylla - and is always around 2 seconds.
In this case, an exponential delay is really bad - we will usually wait
for around 1 seconds too long after the needed wait of 2 seconds.

So in this patch we replace the exponential delay by a constant delay -
we wait 0.3 seconds between each retry.

This change makes the test test_tracing.py::test_tracing_all finish
in a little over 2 seconds, instead of a little over 3 seconds
before this patch. We cannot reduce this 2 second time any further
unless we make the 2-second tracing delay configurable.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210318000040.1782933-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Nadav Har'El
4e87f95b42 test/alternator: remove slow and unhelpful test
The test test_table.py::test_table_streams_on creates tables with various
stream types, and then immediately deletes them without testing anything.
This is a slow test (taking almost a full second on my laptop), and is
redundant because in test_streams.py we have tests which create tables
with streams in the same way - but then actually test that things work
with these streams. So this test might as well be removed, and this is
what we do in this patch.

Removing this test shaves another second from the Alternator test suite's
run time.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317230530.1780849-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Nadav Har'El
879656e3e0 test/alternator: make a test faster, safer and more correct
The test
test_condition_expression.py::test_condition_expression_with_forbidden_rmw
takes half a second to run (dev build, on my laptop), one of the slowest
tests in Alternator's test suite. Part of the reason was that it needlessly
set the same table to forbidden_rmw, multiple times.

Instead of doing that, we switch to using the test_table_s_forbid_rmw
fixture, which is a table like test_table_s but created just once in
forbid_rmw mode.

The result is a faster test (0.05 seconds instead of 0.5 seconds), but
also safer if we ever want to run tests in parallel. It also fixes a
bug in the test: At the end of the test, we intended to double-check
that although the forbid_rmw table forbids read-modify-write operations,
it does allow pure writes. Yet the test did this after clearing the
forbid_rmw mode... So after this patch the test verifies this on the
forbid_rmw table, as intended.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317222703.1779992-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Nadav Har'El
1c2e473e62 test/alternator: make a test faster
The test
test_condition_expression.py::test_condition_expression_with_permissive_write_isolation

Currently takes (on my laptop, dev build) a full two seconds, one of
the slowest tests. It is not surprising it is slow - it runs five other
tests three times each (for three different write isolation modes),
but it doesn't have to be this slow. Before this patch, for each of
the five tests we switch the write isolation mode three times, and
these switches involve schema changes and are fairly slow. So in
this patch we reverse the loop - and switch the write isolation mode
to the outer loop.

This patch halves the runtime of this test - from two seconds to one.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317221045.1779329-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Takuya ASADA
d9a625c842 scylla_setup: don't run node-exporter setup when it's not installed
We need to run package existance check before run setup of
node-exporter.

Fixes #8276

Closes #8278
2021-03-18 11:24:18 +01:00
Avi Kivity
f038d1555c Merge 'Add more context to configure.py' from Piotr Sarna
This series makes configure.py output slightly more helpful in case of incorrect parameters passed to the compiler/linker.

Closes #8267

* github.com:scylladb/scylla:
  configure: print more context if the linking attempt failed
  configure: provide more context on failed ./configure.py run
  configure: add verbose option to try_compile_and_link
2021-03-18 11:24:18 +01:00
Takuya ASADA
0424a41e30 tools/toolchain: stop ignoring error on install-dependencies.sh, run jmx/java script correctly
We should run install-dependencies.sh with -e option to prevent ignoring
error in the script.
Also, need to add tools/jmx/install-dependencies.sh and
tools/java/install-dependencies.sh, to fix 'No such file or directory'
error on them.

Fixes #8293

Closes #8294

[avi: did not regenerate toolchain image, since no new packages are
      installed]
2021-03-18 11:24:18 +01:00
Avi Kivity
b91d6776a0 Update tools/java submodule
* tools/java fdc8fcc22c...7b66b7a0fc (1):
  > dist/redhat: add support SLES
2021-03-18 11:24:18 +01:00
Nadav Har'El
bd742f2951 Merge 'treewide: get rid of incorrect reinterpret casts' from Michał Chojnowski
In some places we use the `*reinterpret_cast<const net::packed<T>*>(&x)`
pattern to reinterpret memory. This is a violation of C++'s aliasing rules,
which invokes undefined behaviour.

The blessed way to correctly reinterpret memory is to copy it into a new
object. Let's do that.

Note: the reinterpret_cast way has no performance advantage. Compilers
recognize the memory copy pattern and optimize it away.

Closes #8241

* github.com:scylladb/scylla:
  treewide: get rid of unaligned_cast
  treewide: get rid of incorrect reinterpret casts
2021-03-18 11:24:18 +01:00
Benny Halevy
7862cad669 sstable_set: partitioned_sstable_set: clone: do clone all sstables
The existing implementation wrongfully shares _all sstables
rather than cloning it. This caused a use-after-free
in `repair_meta::do_estimate_partitions_on_local_shard`
when traversing a shared sstable_set, during which
`table::make_reader_excluding_sstables` erased an entry.
The erase should have happened on a cloned copy
of the sstable_list, not on a shared copy.

The regression was introduced in
c3b8757fa1.

Added a unit test that reproduces the share-on-copy issue
for partitioned_stable_set (sstables::sstable_set).

Fixes #8274

Test: unit(release, debug)
DTest: materialized_views_test.py:TestMaterializedViews.simple_repair_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210317145552.701559-1-bhalevy@scylladb.com>
2021-03-18 11:15:59 +02:00
Piotr Sarna
ea096de1b4 service, transport: avoid using private storage_service fields
... in the transport controller. Instead, simple getters would suffice.

Message-Id: <582a71d0c1b61edf0107f5a2ef96536c395972d0.1615988516.git.sarna@scylladb.com>
2021-03-18 11:15:59 +02:00
Nadav Har'El
42169b2eef Merge 'Alternator: add slow query logging' from Piotr Sarna
This series adds slow query logging capability to alternator. Queries which last longer than the specified threshold are logged in `system_traces.node_slow_log` and traced.

In order to be better prepared for https://github.com/scylladb/scylla/issues/2572, this series also expands the tracing API to allow custom key-value params and adds a custom `alternator_op` parameter to the slow node log. This information can also be deduced from the tracing session id by consulting the system_traces.events table, but https://github.com/scylladb/scylla/issues/2572 's assumption is that this tracing might not always be available in the future.

This series comes with a simple test case which checks if operation logs indeed end up in `system_traces.node_slow_log`.

Tests:
unit(dev, alternator pytest)
manual: verified that no operations are logged if slow query logging is disabled; verified that operations that take less time than the threshold are not logged; verified with test_batch.py::test_batch_write_item_large that a large-enough operation is indeed logged and traced.

Fixes #8292

Example trace:

```cql
cqlsh> select parameters, duration from system_traces.node_slow_log where start_time=b7a44589-8711-11eb-8053-14c6c5faf955;

 parameters                                                                                  | duration
---------------------------------------------------------------------------------------------+----------
 {'alternator_op': 'DeleteTable', 'query': '{"TableName": "alternator_Test_1615979572905"}'} |    75732
```

Closes #8298

* github.com:scylladb/scylla:
  alternator: add test for slow query logging
  alternator: allow enabling slow query logging
  tracing: allow providing a custom session record param
2021-03-18 11:15:59 +02:00
Avi Kivity
de45575ea9 Merge "Allow all supported compaction types to be stopped by nodetool stop" from Raphael
"
All compaction types can now be stopped with the nodetool stop
command, example: nodetool stop SCRUB

Supported types are: COMPACTION, CLEANUP, VALIDATION, SCRUB,
INDEX_BUILD, RESHARD, UPGRADE, RESHAPE.
"

* 'stop_compaction_types_v2' of github.com:raphaelsc/scylla:
  compaction: Allow all supported compaction types to be stopped
  compaction: introduce function to map compaction name to respective type
  compaction: refactor mapping of compaction type to string
  compaction: move compaction_name() out of line
2021-03-18 11:15:59 +02:00
Botond Dénes
981699ae76 sstables: move promoted_index_blocks_reader into own header
index_entry.hh (the current home of `promoted_index_blocks_reader`) is
included in `sstables.hh` and thus in half our code-base. All that code
really doesn't need the definition of the promoted index blocks reader
which also pulls in the sstables parser mechanism. Move it into its own
header and only include it where it is actually needed: the promoted
index cursor implementations.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210317093654.34196-1-bdenes@scylladb.com>
2021-03-18 11:15:59 +02:00
Botond Dénes
5859195b36 sstables: mx/parser.hh: add missing include
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210317093806.34858-1-bdenes@scylladb.com>
2021-03-18 11:15:59 +02:00
Benny Halevy
2e7677f76b sstables: sstable_set_impl: include mutation_reader.hh
To make sstables/sstable_set_impl.hh self-sufficient
mutation_reader.hh provides position_reader_queue,
needed by time_series_sstable_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210317094223.590067-1-bhalevy@scylladb.com>
2021-03-18 11:15:59 +02:00
Konstantin Osipov
66c729da66 raft: set the current leader upon getting InstallSnapshot
If the current leader is set, the follower will not vote
for another candidate. This is also known as "sticky leadership" rule.

Before this change, the rule was enacted only upon receiving
AppendEntries RPC from the leader. Turn it on also upon receiving
InstallSnapshot RPC.
2021-03-18 08:36:57 +03:00
Michał Chojnowski
5c3385730b treewide: get rid of unaligned_cast
unaligned_cast violates strict aliasing rules. Replace it with
safe equivalents.
2021-03-17 17:00:41 +01:00
Michał Chojnowski
4e35befcf2 treewide: get rid of incorrect reinterpret casts
In some places we use the `*reinterpret_cast<const net::packed<T>*>(&x)`
pattern to reinterpret memory. This is a violation of C++'s aliasing rules,
which invokes undefined behaviour.

The blessed way to correctly reinterpret memory is to copy it into a new
object. Let's do that.

Note: the reinterpret_cast way has no performance advantage. Compilers
recognize the memory copy pattern and optimize it away.
2021-03-17 17:00:38 +01:00
Piotr Sarna
efe734c575 alternator: add test for slow query logging
The test checks whether slow queries are properly logged
in the system_traces.node_slow_log system table.
The test is deterministic because it uses the threshold of 0ms
to qualify a query as slow, which effectively makes all queries
"slow enough".
2021-03-17 13:24:26 +01:00
Benny Halevy
6846319e65 partitioned_sstables_set: insert: propagate exception
Do not swallow the caught exception.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210316170821.496218-1-bhalevy@scylladb.com>
2021-03-17 13:29:03 +02:00
Piotr Sarna
f9adee70d2 alternator: allow enabling slow query logging
Alternator is now aware of the slow query logging configuration
and can start tracing slow queries.
2021-03-17 11:20:42 +01:00
Piotr Sarna
5386739354 tracing: allow providing a custom session record param
The mechanism of session record params is currently only used
to store query strings and a couple more params like consistency level,
but since we now have more frontends than just CQL and Thrift,
it would be nice to also allow the users to put custom parameters in
there.
An immediate first user of this mechanism would be alternator,
which is going to put the operation type under the "alternator_op" key.
The operation type is not part of the query string due to how DynamoDB's
protocol works - the op type is stored separately in the HTTP header.
While it's possible to extract the operation type from the session_id,
it might not be the case once #2572 is implemented.
2021-03-17 11:14:28 +01:00
Gleb Natapov
32d386d0d8 raft: fix use after free during logging in append_entries_reply()
As the existing comment explains a progress can be deleted at the point
of logging. The logging should only be done if the progress still
exists.

Message-Id: <YFDFVRQU1iVYhFdM@scylladb.com>
2021-03-17 09:59:22 +02:00
Dejan Mircevski
8db24fc03b cql3/expr: Handle IN ? bound to null
Previously, we crashed when the IN marker is bound to null.  Throw
invalid_request_exception instead.

Fixes #8265

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8287
2021-03-17 09:59:22 +02:00
Avi Kivity
1afd6fbe06 hashing: appending_hash: convert from enable_if to concepts
A little simpler to understand.

Closes #8288
2021-03-17 09:59:22 +02:00
Piotr Sarna
7961a28835 Merge 'storage_proxy: Include counter writes in...
...  `writes_coordinator_outside_replica_set`' from Juliusz Stasiewicz

With this change, coordinator prefers himself as the "counter leader", so if
another endpoint is chosen as the leader, we know that coordinator was not a
member of replica set. With this guarantee we can increment
`scylla_storage_proxy_coordinator_writes_coordinator_outside_replica_set` metric
after electing different leader (that metric used to neglect the counter
updates).

The motivation for this change is to have more reliable way of counting
non-token-aware queries.

Fixes #4337
Closes #8282

* github.com:scylladb/scylla:
  storage_proxy: Include counter writes in `writes_coordinator_outside_replica_set`
  counters: Favor coordinator as leader
2021-03-17 09:59:22 +02:00
Avi Kivity
972ea9900c Merge 'commitlog: Make pre-allocation drop O_DSYNC while pre-filling' from Calle Wilund
Refs #7794

Iff we need to pre-fill segment file ni O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.

Closes #8250

* github.com:scylladb/scylla:
  commitlog: Make pre-allocation drop O_DSYNC while pre-filling
  commitlog: coroutinize allocate_segment_ex
2021-03-17 09:59:22 +02:00
Dejan Mircevski
992d5c6184 cql3/expr: Improve column printing
Before this change, we would print an expression like this:

((ColumnDefinition{name=c, type=org.apache.cassandra.db.marshal.Int32Type, kind=CLUSTERING_COLUMN, componentIndex=0, droppedAt=-9223372036854775808}) = 0000007b)

Now, we print the same expression like this:

(c = 0000007b)

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8285
2021-03-17 09:59:22 +02:00
Tomasz Grabiec
40121621f6 Merge "Kill some get_local_migration_manager() calls" from Pavel Emelyanov
There are a bunch of such calls in schema altering statements and
there's currently no way to obtain the migration manager for such
statements, so a relatively big rework needed.

The solution in this set is -- all statements' execute() methods are
called with query processor as first argument (now the storage proxy
is there), query processor references and provides migration manager
for statements. Those statements that need proxy can already get it
from the query processor.

Afterwards table_helper and thrift code can also stop using the global
migration manager instance, since they both have query processor in
needed places. While patching them a couple of calls to global storage
proxy also go away.

The new query processor -> migration manager dependency fits into
current start-stop sequence: the migration manager is started early,
the query processor is started after it. On stop the query processor
remains alive, but the migration manager stops. But since no code
currently (should) call get_local_migration_manager() it will _not_
call the query_processor::get_migration_manager() either, so this
dangling reference is ugly, but safe.

Another option could be to make storage proxy reference migration
manager, but this dependency doesn't look correct -- migration manager
is higher-level service than the storage proxy is, it is migration
manager who currently calls storage proxy, but not the vice versa.

* xemul/br-kill-some-migration-managers-2:
  cql3: Get database directly from query processor
  thrift: Use query_processor::get_migration_manager()
  table_helper: Use query_processor::get_migration_manager()
  cql3: Use query_processor::get_migration_manager() (lambda captures cases)
  cql3: Use query_processor::get_migration_manager() (alter_type statement)
  cql3: Use query_processor::get_migration_manager() (trivial cases)
  query_processor: Keep migration manager onboard
  cql3: Pass query processor to announce_migration:s
  cql3: Switch to qp (almost) in schema-altering-stmt
  cql3: Change execute()'s 1st arg to query_processor
2021-03-17 09:59:22 +02:00
Raphael S. Carvalho
2065e2c912 partitioned_sstable_set: adjust select_sstable_runs() to work with compound set
compound set will select runs from all of its managed sets, so let's
adjust select_sstable_runs() to only return runs which belong to it.
without this adjustment, selection of runs would fail because
function would try to unconditionally retrieve the run which may
live somewhere else.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312042255.111060-3-raphaelsc@scylladb.com>
2021-03-17 09:59:22 +02:00
Raphael S. Carvalho
02b2df1ea9 sstable_set: move select_sstable_runs() into partitioned_sstable_set
after compound set is introduced, select_sstable_runs() will no longer
work because the sstable runs live in sstable_set, but they should
actually live in the sstable_set being written to.

Given that runs is a concept that belongs only to strategies which
use partitioned_sstable_set, let's move the implementation of
select_sstable_runs() to it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312042255.111060-2-raphaelsc@scylladb.com>
2021-03-17 09:59:22 +02:00
Avi Kivity
11308c05f4 Update tools/jmx submodule
* tools/jmx 15c1d4f...9c687b5 (1):
  > dist/redhat: add support SLES
2021-03-17 09:59:22 +02:00
Calle Wilund
a0745f9498 messaging_service: Enforce dc/rack membership iff required for non-tls connections
When internode_encryption is "rack" or "dc", we should enforce incoming
connections are from the appropriate address spaces iff answering on
non-tls socket.

This is implemented by having two protocol handlers. One for tls/full notls,
and one for mixed (needs checking) connections. The latter will ask
snitch if remote address is kosher, and refuse the connection otherwise.

Note: requires seastar patches:
"rpc: Make is possible for rpc server instance to refuse connection"
"RPC: (client) retain local address and use on stream creation"

Note that ip-level checks are not exhaustive. If a user is also using
"require_client_auth" with dc/rack tls setting we should warn him that
there is a possibility that someone could spoof himself pass the
authentication.

Closes #8051
2021-03-17 09:59:22 +02:00
Avi Kivity
bcd41cb32d Merge 'Support installing our rpm to SLES' from Takuya ASADA
Basically SLES support is already done in f20736d93d, but it was for offline installer.
This fixes few more problems to install our rpm to SLES.
After this change, we can just install our rpm for both CentOS/RHEL and SLES in single image, like unified deb.
SLES uses original package manager called 'zypper', but it does support yum repository so no need to change required for repo.

Closes #8277

* github.com:scylladb/scylla:
  scylla_coredump_setup: support SLES
  scylla_setup: use rpm to check package availability for SLES
  dist: install optional packages for SLES
2021-03-17 09:59:22 +02:00
Tomasz Grabiec
cc0bb92afe Merge "raft: provide a ticker for each raft server" from Pavel Solodovnikov
Automatically initialize and start a timer in
`raft_services::add_server` for each raft server instance created.

The patch set also changes several other things in order
for tickers to work:

1. A bug in `raft_sys_table_storage` which caused an exception
   if `raft::server::start` is called without any persisted state.
2. `raft_services::add_server` now automatically calls
   `raft::server::start()` since a server instance should be started
   before any of its methods can be called.
3. Raft servers can now start with initial term = 0. There was an
   artificial restriction which is now lifted.
4. Raft schema state machine now returns a ready future instead of
   throwing "not implemented" exception in `abort()`.

* github.com/ManManson/scylla.git/raft_services_tickers_v9_next_rebase:
  raft/raft_services: provide a ticker for each raft server
  raft/raft_services: switch from plain `throw` to `on_internal_error`
  raft/raft_services: start server instance automatically in `add_server`
  raft: return ready future instead of throwing in schema_raft_state_machine
  raft: allow raft server to start with initial term 0
  raft/raft_sys_table_storage: fix loading term/vote and snapshot from empty state
2021-03-17 09:59:22 +02:00
Nadav Har'El
e344f74858 Merge 'logalloc: improve background reclaim shares management' from Avi Kivity
The log structured allocator's background reclaimer tries to
allocate CPU power proportional to memory demand, but a
bug made that not happen. Fix the bug, add some logging,
and future-proof the timer. Also, harden the test against
overcommitted test machines.

Fixes #8234.

Test: logalloc_test(dev), 20 concurrent runs on 2 cores (1 hyperthread each)

Closes #8281

* github.com:scylladb/scylla:
  test: logalloc_test: harden background reclain test against cpu overcommit
  logalloc: background reclaim: use default scheduling group for adjusting shares
  logalloc: background reclaim: log shares adjustment under trace level
  logalloc: background reclaim: fix shares not updated by periodic timer
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
aaea8c6c7d raft/raft_services: provide a ticker for each raft server
Automatically initialize a ticker for each raft server
instance when `raft_services::add_server` is called.
A ticker is a timer which regularly calls `raft::server::tick`
in order to tick its raft protocol state machine.

Note that the timer should start after the server calls
its `start()` method, because otherwise it would crash
since fsm is not initialized yet.

Currently, the tick interval is hardcoded to be 100ms.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
1496a3559f raft/raft_services: switch from plain throw to on_internal_error
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
975c9a8021 raft/raft_services: start server instance automatically in add_server
Raft server instance cannot be used in any way prior
to calling the `start()` method, which initializes
its internal state, e.g. raft protocol state machine.
Otherwise, it will likely result in a crash.

Also, properly stop the servers on shutdown via
`raft_services::stop_servers()`.

In case some exception happened inside `add_server`,
the `init` function will de-initialize what it already
initialized, i.e. raft rpc verbs. This is important
since otherwise it would break further initialization
process and, what is more important, will prevent raft
rpc verbs deinitialization. This will cause a crash in
`messaging_service` uninit procedure, because raft rpc
handlers would still be initialized.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
0b3dba07bd raft: return ready future instead of throwing in schema_raft_state_machine
The current implementation throws an exception, which will cause
a crash when stopping scylla. This will be used in the next patch.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
93c565a1bf raft: allow raft server to start with initial term 0
Prior to the fix there was an assert to check in
`raft::server_impl::start` that the initial term is not 0.

This restriction is completely artificial and can be lifted
without any problems, which will be described below.

The only place that is dependent on this corner case is in
`server_impl::io_fiber`. Whenever term or vote has changed,
they will be both set in `fsm::get_output`. `io_fiber` checks
whether it needs to persist term and vote by validating that
the term field is set (by actually executing a `term != 0`
condition).

This particular check is based on an unobvious fact that the
term will never be 0 in case `fsm::get_output` saves
term and vote values, indicating that they need to be
persisted.

Vote and term can change independently of each other, so that
checking only for term obscures what is happening and why
even more.

In either case term will never be 0, because:

1. If the term has changed, then it's naturally greater than 0,
   since it's a monotonically increasing value.
2. If the vote has changed, it means that we received
   a vote request message. In such case we have already updated
   our term to the requester's term.

Switch to using an explicit optional in `fsm_output` so that
a reader don't have to think about the motivation behind this `if`
and just checks that `term_and_vote` optional is engaged.

Given the motivation described above, the corresponding

    assert(_fsm->get_current_term() != term_t(0));

in `server_impl::start` is removed.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
ae5f26adec raft/raft_sys_table_storage: fix loading term/vote and snapshot from empty state
When a raft server is started for the first time and there isn't
any persisted state yet, provide default return values for
`load_term_and_vote` and `load_snapshot`. The code currently
does not handle this corner case correctly and fail with an
exception.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Juliusz Stasiewicz
f77d0f5439 storage_proxy: Include counter writes in writes_coordinator_outside_replica_set
Coordinator prefers himself as the "counter leader", so if another
endpoint is chosen as the leader, we know that coordinator was
not a member of replica set. We can use this information to
increment relevant metric (which used to neglect the counters
completely).

Fixes #4337
2021-03-16 12:07:16 +01:00
Juliusz Stasiewicz
5689106b92 counters: Favor coordinator as leader
This not only reduces internode traffic but is also needed for a
later change in this PR: metrics for non-token-aware writes
including counter updates.
2021-03-16 12:07:13 +01:00
Pavel Emelyanov
a7a5ad4ded range_tombstone_stream: Remove unused methods
Both methods apply a list of tombstones to the stream. One
was unused even before the set, the other one became unused
after previous patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:08:18 +03:00
Pavel Emelyanov
2e6255c499 partition_snapshot_reader: Emit range tombstones on demand
Currently the reader gets all range tombstones from the given
range and places them into a stream. When filling the buffer
with fragments the range tombstones are extracted from the
stream one by one.

This is memory consuming, the reader's memory usage shouldn't
depend on the number of inhabitants in the partition range.

The patch implements the heap-based cursor for range tombstones
almost like it's done for rows.

The heap contains range_tombstone_list::iterator_ranges, the
tombstones are popped from the heap when needed, are applied
into the stream and then are emitted from it into the buffer.
The refresh_state() is called on each new range to set up the
iterators, and when lsa reports references invalidation to
refresh the iterators. To let the refresh_state revalidate the
iterators, the position at which the last range tombstone was
emitted is maintained.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:08:18 +03:00
Pavel Emelyanov
ef61f84426 partition_snapshot_reader: Introduce maybe_refresh_state
The existing refresh_state() is supposed to setup or revalidate
iterators to rows inside partition versions if needed. It will
be called in more than one place soon, so here's the helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:08:18 +03:00
Pavel Emelyanov
5e0a8130d4 partition_snapshot_reader: Move range tombstone stream member
The lsa_partition_reader is the helper sub-class for
partition_snapshot_reader that, among other things, is
responsible for filling the stream of range tombstones,
that's then used by the reader itself.

Next patches will change the way range tombstones are
emitted by the reader, so hide the stream inside the
helper subclass in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:08:18 +03:00
Pavel Emelyanov
755d993031 partition_snapshot_reader: Add reset_state method to helper class
This method "notifies" the lsa_reader helper class when the owning
reader moves to a new range. This method is now empty, but will be
used by next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:07:20 +03:00
Pavel Emelyanov
a387fbd984 partition_snapshot_reader: Downgrade heap comparator
Next patch will extend the comparator to manage heap of
range tombstones. Not to add yet another comparator to
it (and not to create another heap comparator class) just
use the comparator that's common for both -- rows and range
tombstones.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:06:19 +03:00
Pavel Emelyanov
2179014efa partition_snapshot_reader: Use on-demand comparators
There are already two raii-sh comparators on reader, next patch will
need to add the third. This just bloats the reader, the comparators
in question are state-less and can be created on demand for free.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:04:47 +03:00
Pavel Emelyanov
c8b2079705 range_tombstone_list: Add new slice() helper
There are two of them now -- one to return iterator_range that
covers the given query::clustering_range, the other to return
it for two given positions.

In the next patch the 3rd one is needed -- the slice() to get
iterator_range that's

a) starts strictly after a given position
b) ends after the given clustering_range's end

It will be used to refresh the range tombstones iterators after
some of them will have been emitted. The same thing is currently
done by partition_snapshot_reader's refresh_state wrt rows:

    if (last_row)
        start = rows.upper_bound(last_row) // continuation
    else
        start = rows.lower_bound(range.start) // initial

    end = rows.upper_bound(range.end) // end is the same in
                                      // either case

Respectively for range tombstones the goal is the same.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 11:55:28 +03:00
Pavel Emelyanov
7e1170ecb9 range_tombstone_list: Introduce iterator_range alias
The range_tombstone_list::slice() set of methods return
back pair of iterators represending a range. In the next
patches this pair will be actively used, and it's handy
to have a shorter alias for it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 11:55:28 +03:00
Piotr Sarna
2201c9b146 configure: print more context if the linking attempt failed
Previously, when a linking attempt failed, configure.py immediately
printed that neither lld nor gold was found, which might be misleading
if the linkers are installed, but the compilation failed anyway.
The printed information is now more specific, and combined with the
previous commit, it will also provide more information why the
compilation attempt failed.
2021-03-16 07:39:05 +01:00
Piotr Sarna
f86b879933 configure: provide more context on failed ./configure.py run
If the configuration step failed, it used to only inform that
it must be due to the wrong GCC version, which can be misleading.
For instance, trying to compile on clang with incorrect flags
also resulted in an "wrong GCC version" message.
Now, the message is more generic, but it also prints the stderr
output from the miscompilation, which may help pinpoint the problem:

$ ./configure.py --mode release --cflags='-fhello -fcolor-diagnostics -mllvm -opt-bisect-limit=10000' --compiler=clang++ --c-compiler=clang
Note: neither lld nor gold found; using default system linker
Compilation failed: clang++ -x c++ -o build/tmp/tmp1177gojf /home/sarna/repo/scylla/build/tmp/tmp_u3voys6 -fhello -fcolor-diagnostics -mllvm -opt-bisect-limit=10000 []

// clang pretends to be gcc (defined __GNUC__), so we
// must check it first
\#ifdef __clang__

\#if __clang_major__ < 10
    #error "MAJOR"
\#endif

\#elif defined(__GNUC__)

\#if __GNUC__ < 10
    #error "MAJOR"
\#elif __GNUC__ == 10
    #if __GNUC_MINOR__ < 1
        #error "MINOR"
    #elif __GNUC_MINOR__ == 1
        #if __GNUC_PATCHLEVEL__ < 1
            #error "PATCHLEVEL"
        #endif
    #endif
\#endif

\#else

\#error "Unrecognized compiler"

\#endif

int main() { return 0; }

clang-11: error: unknown argument: '-fhello'
distcc[4085341] ERROR: compile (null) on localhost failed

Wrong compiler version or incorrect flags. Scylla needs GCC >= 10.1.1 with coroutines (-fcoroutines) or clang >= 10.0.0 to compile.
2021-03-16 07:39:03 +01:00
Piotr Sarna
6389246d6e configure: add verbose option to try_compile_and_link
Which will be useful later for providing more context
why a ./configure.py run failed.
2021-03-16 07:35:16 +01:00
Pavel Emelyanov
12e4269dce cql3: Get database directly from query processor
After previous patches some places in cql3 code take a
long path to get database reference:

  query processor -> storage proxy -> database

The query processor can provide the database reference
by itself, so take this chance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:36:04 +03:00
Pavel Emelyanov
fb49550943 thrift: Use query_processor::get_migration_manager()
Thrift needs migration manager to call announce_<something> on
it and currently it grabs blobak migration manager instance.

Since thrift handler has query processor rerefence onboard and
the query processor can provide the migration manager reference,
it's time to remove few more globals from thrift code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:59 +03:00
Pavel Emelyanov
6dc9a16b4e table_helper: Use query_processor::get_migration_manager()
After the migration manager can be obtained from the query
processor the table heler can also benefit from it and not
call for global migration manager instance any longer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:53 +03:00
Pavel Emelyanov
a9646dd779 cql3: Use query_processor::get_migration_manager() (lambda captures cases)
There are few schema altering statements that need to have
the query processor inside lambda continuations. Fortunately,
they all are continuations of make_ready_future<>()s, so the
query processor can be simply captured by reference and used.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:48 +03:00
Pavel Emelyanov
50e4eacd08 cql3: Use query_processor::get_migration_manager() (alter_type statement)
This statement needs the query processor one step below the
stack from its .announce_migration method. So here's the
dedicated patch for it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:43 +03:00
Pavel Emelyanov
464e58abf7 cql3: Use query_processor::get_migration_manager() (trivial cases)
Most of the schema altering statements implementations can now
stop calling for global migration manager instance and get it
from the query processor.

Here are the trivial cases when the query processor is just
avaiable at the place where it's needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:36 +03:00
Pavel Emelyanov
1de235f4da query_processor: Keep migration manager onboard
The query processor sits upper than the migration manager,
in the services layering, it's started after and (will be)
stopped before the migration manager.

The migration manager is needed in schema altering statements
which are called with query processor argument. They will
later get the migration manager from the query processor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:58 +03:00
Pavel Emelyanov
1e8f0963f9 cql3: Pass query processor to announce_migration:s
Now when the only call to .announce_migration gas the
query processor at hands -- pass it to the real statements.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:33 +03:00
Pavel Emelyanov
470928dd94 cql3: Switch to qp (almost) in schema-altering-stmt
The schema altering statements are all inherited from the same
base class which delcares a pure virtual .announce_migration()
method. All the real statements are called with storage proxy
argument, while the need the migration manager. So like in the
previous patch -- replace storage proxy with query processor.

While doing the replacement also get the database instance from
the querty processor, not from proxy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:33 +03:00
Pavel Emelyanov
26c115f379 cql3: Change execute()'s 1st arg to query_processor
Currently the statement's execute() method accepts storage
proxy as the first argument. This is enough for all of them
but schema altering ones, because the latter need to call
migration manager's announce.

To provide the migration manager to those who need it it's
needed to have some higher-level service that the proxy. The
query processor seems to be good candidate for it.

Said that -- all the .execute()s now accept the querty
processor instead of the proxy and get the proxy itself from
the query processor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:33 +03:00
Avi Kivity
65fea203d2 test: logalloc_test: harden background reclain test against cpu overcommit
Use thread CPU time instead of real time to avoid an overcommitted
machine from not being able to supply enough CPU for the test.
2021-03-15 13:54:49 +02:00
Avi Kivity
290897ddbc logalloc: background reclaim: use default scheduling group for adjusting shares
If the shares are currently low, we might not get enough CPU time to
adjust the shares in time.

This is currently no-op, since Seastar runs the callback outside
scheduling groups (and only uses the scheduling group for inherited
continuations); but better be insulated against such details.
2021-03-15 13:54:49 +02:00
Avi Kivity
a87f6498c3 logalloc: background reclaim: log shares adjustment under trace level
Useful when debugging, but too noisy at any other time.
2021-03-15 13:54:49 +02:00
Avi Kivity
ce1b1d6ec4 logalloc: background reclaim: fix shares not updated by periodic timer
adjust_shares() thinks it needs to do nothing if the main loop
is running, but in reality it can only avoid waking the main loop;
it still needs to adjust the shares unconditionally. Otherwise,
the background reclaim shares can get locked into a low value.

Fix by splitting the conditional into two.
2021-03-15 13:54:37 +02:00
Tomasz Grabiec
bf6c4e0b24 Merge "raft: consolidate tests in raft directory" from Alejo
Move boost tests to tests/raft and factor out common helpers.

* alejo/raft-tests-reorg-5-rebase-next-2:
  raft: tests: move common helpers to header
  raft: tests: move boost tests to tests/raft
2021-03-15 11:59:16 +01:00
Takuya ASADA
e8cfd5114f scylla_coredump_setup: support SLES
SLES requires to install systemd-coredump package and enable
systemd-coredump.socket to use systemd-coredump.
2021-03-15 19:19:56 +09:00
Takuya ASADA
13871ff1f8 scylla_setup: use rpm to check package availability for SLES
Use rpm to check scylla packages installed on SLES.
2021-03-15 19:18:44 +09:00
Takuya ASADA
e3b5ffcf14 dist: install optional packages for SLES
Support SUSE original package manager 'zypper' for pkg_install()
function.
2021-03-15 19:17:48 +09:00
Alejo Sanchez
88063b6e3e raft: tests: move common helpers to header
Move common test helper functions and data structures to a common
helpers.hh header.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-15 06:16:58 -04:00
Alejo Sanchez
6139ad6337 raft: tests: move boost tests to tests/raft
Move raft boost tests to test/raft directory.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-15 06:16:58 -04:00
Calle Wilund
48ca01c3ab commitlog: Make pre-allocation drop O_DSYNC while pre-filling
Refs #7794

Iff we need to pre-fill segment file ni O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.

v2:
* More comment
v3:
* Add missing flush
v4:
* comment
v5:
* Split coroutine and fix into separate patches
2021-03-15 09:35:45 +00:00
Calle Wilund
ae3b8e6fdf commitlog: coroutinize allocate_segment_ex
To make further changes here easier to write and read.
2021-03-15 09:35:37 +00:00
Avi Kivity
f326a2253c Update tools/java submodule
* tools/java 2c6110500c...fdc8fcc22c (1):
  > sstableloader: Use compound "where" restrictions for clustering
2021-03-15 11:19:22 +02:00
Raphael S. Carvalho
7171244844 compaction_manager: Fix performance of cleanup compaction due to unlimited parallelism
Prior to 463d0ab, only one table could be cleaned up at a time on a given shard.
Since then, all tables belonging to a given keyspace are cleaned up in parallel.
Cleanup serialization on each shard was enforced with a semaphore, which was
incorrectly removed by the patch aforementioned.

So space requirement for cleanup to succeed can be up to the size of keyspace,
increasing the chances of node running out of space.

Node could also run out of memory if there are tons of tables in the keyspace.
Memory requirement is at least #_of_tables * 128k (not taking into account write
behind, etc). With 5k tables, it's ~0.64G per shard.

Also all tables being cleaned up in parallel will compete for the same
disk and cpu bandwidth, so making them all much slower, and consequently
the operation time is significantly higher.

This problem was detected with cleanup, but scrub and upgrade go through the
same rewrite procedure, so they're affected by exact the same problem.

Fixes #8247.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
2021-03-14 14:31:26 +02:00
Nadav Har'El
d73934372d storage_service: correct missing exception in logging rebuild failure
When failing to rebuild a node, we would print the error with the useless
explanation "<no exception>". The problem was a typo in the logging command
which used std::current_exception() - which wasn't relevant in that point -
instead of "ep".

Refs #8089

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>
2021-03-14 14:11:11 +02:00
Tomasz Grabiec
f2ecb4617e Merge "raft: implement prevoting stage in leader election" from Gleb
This is how PhD explain the need for prevoting stage:

  One downside of Raft's leader election algorithm is that a server that
  has been partitioned from the cluster is likely to cause a disruption
  when it regains connectivity. When a server is partitioned, it will
  not receive heartbeats. It will soon increment its term to start
  an election, although it won't be able to collect enough votes to
  become leader. When the server regains connectivity sometime later, its
  larger term number will propagate to the rest of the cluster (either
  through the server's RequestVote requests or through its AppendEntries
  response). This will force the cluster leader to step down, and a new
  election will have to take place to select a new leader.

  Prevoting stage is addressing that. In the Prevote algorithm, a
  candidate only increments its term if it first learns from a majority of
  the cluster that they would be willing to grant the candidate their votes
  (if the candidate's log is sufficiently up-to-date, and the voters have
  not received heartbeats from a valid leader for at least a baseline
  election timeout).

  The Prevote algorithm solves the issue of a partitioned server disrupting
  the cluster when it rejoins. While a server is partitioned, it won't
  be able to increment its term, since it can't receive permission
  from a majority of the cluster. Then, when it rejoins the cluster, it
  still won't be able to increment its term, since the other servers
  will have been receiving regular heartbeats from the leader. Once the
  server receives a heartbeat from the leader itself, it will return to
  the follower state(in the same term).

In our implementation we have "stable leader" extension that prevents
spurious RequestVote to dispose an active leader, but AppendEntries with
higher term will still do that, so prevoting extension is also required.

* scylla-dev/raft-prevote-v5:
  raft: store leader and candidate state in state variant
  raft: add boost tests for prevoting
  raft: implement prevoting stage in leader election
  raft: reset the leader on entering candidate state
  raft: use modern unordered_set::contains instead of find in become_candidate
2021-03-12 11:15:51 +01:00
Gleb Natapov
e231186a7b raft: store leader and candidate state in state variant
We already have server state dependant state in fsm, so there is no need
to maintain "voters" and "tracker" optionals as well. The upside is that
optional and variant sates cannot drift apart now.
2021-03-12 11:12:57 +02:00
Gleb Natapov
e17e7d57bd raft: add boost tests for prevoting 2021-03-12 11:12:57 +02:00
Gleb Natapov
1f868d516e raft: implement prevoting stage in leader election
This is how PhD explain the need for prevoting stage:

  One downside of Raft's leader election algorithm is that a server that
  has been partitioned from the cluster is likely to cause a disruption
  when it regains connectivity. When a server is partitioned, it will
  not receive heartbeats. It will soon increment its term to start
  an election, although it won't be able to collect enough votes to
  become leader. When the server regains connectivity sometime later, its
  larger term number will propagate to the rest of the cluster (either
  through the server's RequestVote requests or through its AppendEntries
  response). This will force the cluster leader to step down, and a new
  election will have to take place to select a new leader.

  Prevoting stage is addressing that. In the Prevote algorithm, a
  candidate only increments its term if it first learns from a majority of
  the cluster that they would be willing to grant the candidate their votes
  (if the candidate's log is sufficiently up-to-date, and the voters have
  not received heartbeats from a valid leader for at least a baseline
  election timeout).

  The Prevote algorithm solves the issue of a partitioned server disrupting
  the cluster when it rejoins. While a server is partitioned, it won't
  be able to increment its term, since it can't receive permission
  from a majority of the cluster. Then, when it rejoins the cluster, it
  still won't be able to increment its term, since the other servers
  will have been receiving regular heartbeats from the leader. Once the
  server receives a heartbeat from the leader itself, it will return to
  the follower state(in the same term).

In our implementation we have "stable leader" extension that prevents
spurious RequestVote to dispose an active leader, but AppendEntries with
higher term will still do that, so prevoting extension is also required.
2021-03-12 11:09:21 +02:00
Raphael S. Carvalho
f6fc32c8da table: use new sstable_set::for_each_sstable
for_each_sstable() is preferred over all() because it's guaranteed to
perform no copy.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210311163009.42210-2-raphaelsc@scylladb.com>
2021-03-11 18:47:17 +02:00
Raphael S. Carvalho
e7a6f3926a sstable_set: introduce for_each_sstable()
This new method is preferred over all() for iterations purposes, because
all() may have to copy sstables into a temporary.
For example, all() implementation of the upcoming compound_sstable_set
will have no choice but to merge all sstables from N managed sets into
a temporary.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210311163009.42210-1-raphaelsc@scylladb.com>
2021-03-11 18:47:16 +02:00
Avi Kivity
486f6bf29c Merge "sstables: move format specific reader code to kl/, mx/" from Botond
"
Currently the sstable reader code is scattered across several source
files as following (paths are relative to sstables/):
* partition.cc - generic reader code;
* row.hh - format specific code related to building mutation fragments
  from cells;
* mp_row_consumer.hh - format specific code related to parsing the raw
  byte stream;

This is a strange organization scheme given that the generic sstable
reader is a template and as such it doesn't itself depend on the other
headers where the consumer and context implementations live. Yet these
are all included in partition.cc just so the reader factory function can
instantiate the sstable reader template with the format specific
objects.

This patchset reorganizes this code such that the generic sstable reader
is exposed in a header. Furthermore, format specific code is moved to
the kl/ and mx/ directories respectively. Each directory has a
reader.hh with a single factory function which creates the reader, all
the format specific code is hidden from sight. The added benefit is that
now reader code specific to a format is centralized in the format
specific folder, just like the writer code.

This patchset only moves code around, no logical changes are made.

Tests: unit(dev)
"

* 'sstable-reader-separation/v1' of https://github.com/denesb/scylla:
  sstables: get rid of mp_row_consumer.{hh,cc}
  sstables: get rid of row.hh
  sstables/mp_row_consumer.hh: remove unused struct new_mutation
  sstables: move mx specific context and consumer to mx/reader.cc
  sstables: move kl specific context and consumer to kl/reader.cc
  sstables: mv partition.cc sstable_mutation_reader.hh
2021-03-11 16:57:54 +02:00
Raphael S. Carvalho
6ff8bb4eac compaction: Allow all supported compaction types to be stopped
Let's make stop_compaction() use sstables::to_compaction_type(),
so all supported compaction types can now be aborted.

Refs #7738.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-11 09:30:11 -03:00
Raphael S. Carvalho
f1b8d5f20f compaction: introduce function to map compaction name to respective type
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-11 09:29:59 -03:00
Raphael S. Carvalho
a44bc233f5 compaction: refactor mapping of compaction type to string
This will make it easier to introduce new type and also to map type to
string and vice-versa, using reverse lookup.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-11 09:29:53 -03:00
Raphael S. Carvalho
503a0ea928 compaction: move compaction_name() out of line
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-11 09:29:46 -03:00
Botond Dénes
361ba473c7 sstables: get rid of mp_row_consumer.{hh,cc}
Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which
will serve as the collection point of utility stuff needed by all reader
implementations.
2021-03-11 12:17:13 +02:00
Botond Dénes
3ba782bddd sstables: get rid of row.hh
Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which
will serve as the collection point of utility stuff needed by all reader
implementations.
2021-03-11 12:17:13 +02:00
Botond Dénes
f5b0657fa5 sstables/mp_row_consumer.hh: remove unused struct new_mutation 2021-03-11 12:17:13 +02:00
Botond Dénes
cecc7f8064 sstables: move mx specific context and consumer to mx/reader.cc
Move all the mx format specific context and consumer code to
mx/reader.cc and add a factory function `mx::make_reader()` which takes
over the job of instantiating the `sstable_mutation_reader` with the mx
specific context and consumer.
2021-03-11 12:17:13 +02:00
Botond Dénes
4e3ae9d913 sstables: move kl specific context and consumer to kl/reader.cc
Move all the kl format specific context and consumer code to
kl/reader* and add a factory function `kl::make_reader()` which takes
over the job of instantiating the `sstable_mutation_reader` with the kl
specific context and consumer. Code which is used by test is moved to
kl/reader_impl.hh, while code that can be hidden us moved to
kl/reader.cc. Users who just want to create a reader only have to
include kl/reader.hh.
2021-03-11 12:17:13 +02:00
Botond Dénes
0ec040921d sstables: mv partition.cc sstable_mutation_reader.hh
The sstable reader currently knows the definition of all the different
consumers and contexts. But it doesn't really need to, as it is a
template. Exploit this and prepare for a organization scheme where the
consumers and contexts live hidden in a cc file which includes and
instantiates the sstable reader template. As a first step expose
`sstable_mutation_reader` in a header.
2021-03-11 12:17:13 +02:00
Avi Kivity
a49c4ab754 Update tools/java submodule
* tools/java c5d9e8513e...2c6110500c (1):
  > cassandra.in.sh: Add path to rack/dc properties file to classpath

Fixes #7930.
2021-03-11 12:03:01 +02:00
Asias He
d5e6ba1ff1 repair: Shortcut when no followers to repair with
- 3 nodes in the cluster with rf = 3
- run repair on node1 with ignore_nodes to ignore node2 and node3
- node1 has no followers to repair with

However, currently node1 will walk through the repair procedure to read
data from disk and calculate hashes which are unnecessary.

This patch fixes this issue, so that in case there are no followers, we
skip the range and avoid the unnecessary work.

Before:
   $ curl -X POST http://127.0.0.1:10000/storage_service/repair_async/myks3?ignore_nodes="127.0.0.2,127.0.0.3"

   repair - repair id [id=1, uuid=ff39151b-2ce9-4885-b7e9-89158b14b5c2] on shard 0 stats:
   repair_reason=repair, keyspace=myks3, tables={standard1},
   ranges_nr=769, sub_ranges_nr=769, round_nr=1456,
   round_nr_fast_path_already_synced=1456,
   round_nr_fast_path_same_combined_hashes=0,
   round_nr_slow_path=0, rpc_call_nr=0, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.19 seconds,
   tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
   row_from_disk_bytes={{127.0.0.1, 2822972}},
   row_from_disk_nr={{127.0.0.1, 6218}},
   row_from_disk_bytes_per_sec={{127.0.0.1, 14.1695}} MiB/s,
   row_from_disk_rows_per_sec={{127.0.0.1, 32726.3}} Rows/s,
   tx_row_nr_peer={}, rx_row_nr_peer={}

Data was read from disk.

After:
   $ curl -X POST http://127.0.0.1:10000/storage_service/repair_async/myks3?ignore_nodes="127.0.0.2,127.0.0.3"

   repair - repair id [id=1, uuid=c6df8b23-bd3b-4ebc-8d4c-a11d1ebcca39] on shard 0 stats:
   repair_reason=repair, keyspace=myks3, tables={standard1}, ranges_nr=769,
   sub_ranges_nr=0, round_nr=0, round_nr_fast_path_already_synced=0,
   round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
   rpc_call_nr=0, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.0 seconds,
   tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
   row_from_disk_bytes={},
   row_from_disk_nr={},
   row_from_disk_bytes_per_sec={} MiB/s,
   row_from_disk_rows_per_sec={} Rows/s,
   tx_row_nr_peer={}, rx_row_nr_peer={}

No data was read from disk.

Fixes #8256

Closes #8257
2021-03-11 11:53:22 +02:00
Avi Kivity
c8f692e526 Merge 'cql3: Rewrite get_clustering_bounds() using expressions' from Dejan Mircevski
Instead of using the `restrictions` class hierarchy, calculate the clustering slice using the `expr::expression` representation of the WHERE clause.  This will allow us to eventually drop the `restrictions` hierarchy altogether.

Tests: unit (dev, debug)

Closes #8227

* github.com:scylladb/scylla:
  cql3: Make get_clustering_bounds() use expressions
  cql3/expr: Add is_multi_column()
  cql3/expr: Add more operators to needs_filtering
  cql3: Replace CK-bound mode with comparison_order
  cql3/expr: Make to_range globally visible
  cql3: Gather slice-defining WHERE expressions
  cql3: Add statement_restrictions::_where
  test: Add unit tests for get_clustering_bounds
2021-03-11 11:46:52 +02:00
Gleb Natapov
a849246cfc raft: reset the leader on entering candidate state
Not resetting a leader causes vote requests to be ignored instead of
rejected which will make voting round to take more time to fail and may
slow down new leader election.
2021-03-11 10:36:43 +02:00
Gleb Natapov
20d6bb36cd raft: use modern unordered_set::contains instead of find in become_candidate 2021-03-11 10:36:43 +02:00
Dejan Mircevski
990de02d28 cql3: Make get_clustering_bounds() use expressions
Use expressions instead of _clustering_columns_restrictions.  This is
a step towards replacing the entire restrictions class hierarchy with
expressions.

Update some expected results in unit tests to reflect the new code.
These new results are equivalent to the old ones in how
storage_proxy::query() will process them (details:
bound_view::from_range() returns the same result for an empty-prefix
singular as for (-inf,+inf)).

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
8dac132581 cql3/expr: Add is_multi_column()
It will come in handy when we start using expressions to calculate the
clustering slice.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
1f591bd16e cql3/expr: Add more operators to needs_filtering
Omitting these operators didn't cause bugs, because needs_filtering()
is never invoked on them.  But that will likely change in the future,
so add them now to prevent problems down the road.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
c0c93982d0 cql3: Replace CK-bound mode with comparison_order
Instead of defining this enum in multi_column_restriction::slice, put
it in the expr namespace and add it to binary_operator.  We will need
it when we switch bounds calculation from multi_column_restriction to
expr classes.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
7dfe471b5a cql3/expr: Make to_range globally visible
It will be used in statement_restrictions for calculating clustering
bounds.  And it will come in handy elsewhere in the future, I'm sure.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
28b5a372f8 cql3: Gather slice-defining WHERE expressions
Add statement_restrictions::_clustering_prefix_restrictions and fill
it with relevant expressions.  Explain how to find all such
expressions in the WHERE clause.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
da096bfdce cql3: Add statement_restrictions::_where
... and collect all restrictions' expressions into it.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
2525759027 test: Add unit tests for get_clustering_bounds
... as guardrails for the upcoming rewrite.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:17:26 -05:00
Calle Wilund
f44420f2c9 snapshot: Add filter to check for existing snapshot
Fixes #8212

Some snapshotting operations call in on a single table at a time.
When checking for existing snapshots in this case, we should not
bother with snapshots in other tables. Add an optional "filter"
to check routine, which if non-empty includes tables to check.

Use case is "scrub" which calls with a limited set of tables
to snapshot.

Closes #8240
2021-03-10 20:21:38 +02:00
Benny Halevy
ff5b42a0fa bytes_ostream: max_chunk_size: account for chunk header
Currently, if the data_size is greater than
max_chunk_size - sizeof(chunk), we end up
allocating up to max_chunk_size + sizeof(chunk) bytes,
exceeding buf.max_chunk_size().

This may lead to allocation failures, as seen in
https://github.com/scylladb/scylla/issues/7950,
where we couldn't allocate 131088 (= 128K + 16) bytes.

This change adjusted the expose max_chunk_size()
to be max_alloc_size (128KB) - sizeof(chunk)
so that the allocated chunks would normally be allocated
in 128KB chunks in the write() path.

Added a unit test - test_large_placeholder that
stresses the chunk allocation path from the
write_place_holder(size) entry point to make
sure it handles large chunk allocations correctly.

Refs #7950
Refs #8081

Test: unit(release), bytes_ostream_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210303143413.902968-1-bhalevy@scylladb.com>
2021-03-10 19:54:12 +02:00
Asias He
268fa9d9fe main: Lower shares for main scheduling group     The main scheduling group has the shares of 1000, which is as high as   the statement group.     From time to time, we see unexpected scheduling group leaking to the main group, which causes the drop of the query performance.   This patch reduce the main scheduling shares to 200, which is the same as the maintenance scheduling group. It is a safer default in case code leaks to the main scheduling group.         Refs: #7720
Closes #8243
2021-03-10 19:34:45 +02:00
Takuya ASADA
af8eae317b scylla_coredump_setup: avoid coredump failure when hard limit of coredump is set to zero
On the environment hard limit of coredump is set to zero, coredump test
script will fail since the system does not generate coredump.
To avoid such issue, set ulimit -c 0 before generating SEGV on the script.

Note that scylla-server.service can generate coredump even ulimit -c 0
because we set LimitCORE=infinity on its systemd unit file.

Fixes #8238

Closes #8245
2021-03-10 19:28:10 +02:00
Avi Kivity
5342d79461 Merge "Preparatory work in sstable_set for the upcoming compound_sstable_set_impl" from Raphael
* 'preparatory_work_for_compound_set' of github.com:raphaelsc/scylla:
  sstable_set: move all() implementation into sstable_set_impl
  sstable_set: preparatory work to change sstable_set::all() api
  sstables: remove bag_sstable_set
2021-03-10 19:19:26 +02:00
Botond Dénes
cf28552357 mutation_test: test_mutation_diff_with_random_generator: compact input mutations
This test checks that `mutation_partition::difference()` works correctly.
One of the checks it does is: m1 + m2 == m1 + (m2 - m1).
If the two mutations are identical but have compactable data, e.g. a
shadowable tombstone shadowed by a row marker, the apply will collapse
these, causing the above equality check to fail (as m2 - m1 is null).
To prevent this, compact the two input mutations.

Fixes: #8221
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210310141118.212538-1-bdenes@scylladb.com>
2021-03-10 16:28:14 +01:00
Raphael S. Carvalho
c3b8757fa1 sstable_set: move all() implementation into sstable_set_impl
The main motivation behind this is that by moving all() impl into
sstable_set_impl, sstable_set no longer needs to maintain a list
with all sstables, which in turn may disagree with the respective
sstable_set_impl.

This will be very important for compound_sstable_set_impl which
will be built from existing sets, and will implement all() by
combining the all() of its managed sets.
Without this patch, we'd have to insert the same sstable at
both compound set and also the set managed by it, to guarantee
all() of compound set would return the correct data, which would
be expensive and error prone.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-10 12:02:13 -03:00
Raphael S. Carvalho
05b07c7161 sstable_set: preparatory work to change sstable_set::all() api
users of sstable_set::all() rely on the set itself keeping a reference
to the returned list, so user can iterate through the list assuming
that it is alive all the way through.

this will change in the future though, because there will be a
compound set impl which will have to merge the all() of multiple
managed sets, and the result is a temporary value.

so even range-based loops on all() have to keep a ref to the returned
list, to avoid the list from being prematurely destroyed.

so the following code
	for (auto& sst : *sstable_set.all()) { ...}
becomes
	for (auto sstables = sstable_set.all(); auto& sst : *sstables) { ... }

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-10 12:02:12 -03:00
Avi Kivity
746798fd56 Merge "sstables: get rid of data_consume_context" from Botond
"
This class is basically a wrapper around a unique pointer and a few
short convenience methods, but is otherwise a distraction in trying to
untangle the maze that is the sstable reader class hierachy.
So this patchset folds it into its only real user: the sstable reader.
"

* 'data_consume_context_bye' of https://github.com/denesb/scylla:
  sstable: move data_consume_* factory methods to row.hh
  sstables: fold data_consume_context: into its users
  sstables: partition.cc: remove data_consume_* forward declarations
2021-03-10 16:45:32 +02:00
Nadav Har'El
a1725217e1 Merge 'alternator: coroutinize handle_api_request' from Piotr Sarna
The indentation level is significantly reduced, and so is the number
of allocations. The function signature is changed from taking an rvalue
ref to taking the unique_ptr by value, because otherwise the coroutine
captures the request as a reference, which results in use-after-free.

Tests: unit(dev)

Closes #8249

* github.com:scylladb/scylla:
  alternator: drop read_content_and_verify_signature
  alternator: coroutinize handle_api_request
2021-03-10 16:08:08 +02:00
Piotr Sarna
ba264e7199 alternator: drop read_content_and_verify_signature
The only use of this helper function was inlined in a bigger
coroutine, so it's no longer needed.
2021-03-10 14:42:53 +01:00
Piotr Sarna
35da51879f alternator: coroutinize handle_api_request
The indentation level is significantly reduced, and so is the number
of allocations.
The function signature is changed from taking an rvalue ref to taking
the unique_ptr by value, because otherwise the coroutine captures
the request as a reference, which results in use-after-free.
2021-03-10 14:42:52 +01:00
Botond Dénes
1aa2424dcf sstable: move data_consume_* factory methods to row.hh 2021-03-10 15:40:50 +02:00
Botond Dénes
a06465a8f3 sstables: fold data_consume_context: into its users
`data_consume_context` is a thin wrapper over the real context object
and it does little more than forward method calls to it. The few
methods doing more then mere forwarding can be folded into its single
real user: `sstable_reader`.
2021-03-10 15:38:58 +02:00
Botond Dénes
37eb547224 sstables: partition.cc: remove data_consume_* forward declarations
They don't seem to serve any purpose, everything builds fine without
them.
2021-03-10 15:23:54 +02:00
Raphael S. Carvalho
f7cc431477 compaction_manager: Fix use-after-free in rewrite_sstables()
Use-after-free introduced by 2cf0c4bbf1.
That's because compacting is moved into then_wrapped() lambda, so it's
potentially freed on the next iteration of repeat().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com>
2021-03-10 13:18:38 +02:00
Nadav Har'El
f41dac2a3a alternator: avoid large contiguous allocation for request body
Alternator request sizes can be up to 16 MB, but the current implementation
had the Seastar HTTP server read the entire request as a contiguous string,
and then processed it. We can't avoid reading the entire request up-front -
we want to verify its integrity before doing any additional processing on it.
But there is no reason why the entire request needs to be stored in one big
*contiguous* allocation. This always a bad idea. We should use a non-
contiguous buffer, and that's the goal of this patch.

We use a new Seastar HTTPD feature where we can ask for an input stream,
instead of a string, for the request's body. We then begin the request
handling by reading lthe content of this stream into a
vector<temporary_buffer<char>> (which we alias "chunked_content"). We then
use this non-contiguous buffer to verify the request's signature and
if successful - parse the request JSON and finally execute it.

Beyond avoiding contiguous allocations, another benefit of this patch is
that while parsing a long request composed of chunks, we free each chunk
as soon as its parsing completed. This reduces the peak amount of memory
used by the query - we no longer need to store both unparsed and parsed
versions of the request at the same time.

Although we already had tests with requests of different lengths, most
of them were short enough to only have one chunk, and only a few had
2 or 3 chunks. So we also add a test which makes a much longer request
(a BatchWriteItem with large items), which in my experiment had 17 chunks.
The goal of this test is to verify that the new signature and JSON parsing
code which needs to cross chunk boundaries work as expected.

Fixes #7213.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210309222525.1628234-1-nyh@scylladb.com>
2021-03-10 09:22:34 +01:00
Juliusz Stasiewicz
382545a614 docs: explain SSL/non-SSL and shard-aware CQL ports
I added short description of shard-aware ports + updated the rules
for disabling ports and enabling SSL introduced by #7992.

Fixes #8146

Closes #8152
2021-03-09 22:48:30 +02:00
Tomasz Grabiec
c9c2beabc0 Merge "raft: replication tests as individual boost tests" from Alejo
* alejo/raft-tests-replication-boost-5:
  raft: replication test: use Seastar random generator
  raft: replication test: rename drop_replication
  raft: replication test: change to Boost test
  raft: replication test: id helper functions
  raft: replication test: improve handling connectivity
  raft: replication test: parametrize snapshots
  raft: replication test: parametrize drop_replication
  raft: replication test: remove unused configuration
  raft: replication test: add license
2021-03-09 17:58:59 +01:00
Pavel Emelyanov
096e452db9 test: Fix exit condition of row_cache_test::test_eviction_from_invalidated
The test populates the cache, then invalidates it, then tries to push
huge (10x times the segment size) chunks into seastar memory hoping that
the invalid entries will be evicted. The exit condition on the last
stage is -- total memory of the region (sum of both -- used and free)
becomes less than the size of one chunk.

However, the condition is wrong, because cache usually contains a dummy
entry that's not necessarily on lru and on some test iteration it may
happen that

  evictable size < chunk size < evictable size + dummy size

In this case test fails with bad_alloc being unable to evict the memory
from under the dummy.

fixes: #7959
tests: unit(row_cache_test), unit(the failing case with the triggering
       seed from the issue + 200 times more with random seeds)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210309134138.28099-1-xemul@scylladb.com>
2021-03-09 17:57:52 +01:00
Alejo Sanchez
f67b85e2b3 raft: replication test: use Seastar random generator
Use the random generator provided by Seastar test suite.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:52:07 -04:00
Alejo Sanchez
1bf10a87c6 raft: replication test: rename drop_replication
Rename drop_replication to packet_drops for readability.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:52:07 -04:00
Alejo Sanchez
6e193ee3bf raft: replication test: change to Boost test
Change test/raft directory to Boost test type.

Run replication_test cases with their own test.

RAFT_TEST_CASE macro creates 2 test cases, one with random 20% packet
loss named name_drops.

The directory test/raft is changed to host Boost tests instead of unit.

While there improve the documentation.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:52:07 -04:00
Alejo Sanchez
8d9c797954 raft: replication test: id helper functions
In raft the UUID 0 is a special case so server ids start at 1.
Add two helper functions. Convert local 0-based id to raft 1-based
UUID. And from UUID to raft_id.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:50:12 -04:00
Alejo Sanchez
0ffa450222 raft: replication test: improve handling connectivity
Change global map of disconnected servers to a more intuitive class
connected. The class is callable for the most common case
connected(id).

Methods connect(), disconnect(), and all() are provided for readability
instead of directly calling map methods (insert, erase, clear). They
also support both numerical (0 based) and server_id (UUID, 1 based) ids.

The actual shared map is kept in a lw_shared_ptr.

The class is passed around to be copy-constructed which is practically
just creating a new lw_shared_ptr.

Internally it tracks disconnected servers but externally it's more
intuitive to use connect instead of disconnect. So it reads
"connected id" and "not disconnected id", without double negatives.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:39:29 -04:00
Alejo Sanchez
7a644f37d3 raft: replication test: parametrize snapshots
Snapshots and persisted snapshots created per test instead of globals.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 11:58:20 -04:00
Alejo Sanchez
f72e89fcfe raft: replication test: parametrize drop_replication
Pass drop_replication down instead of keeping it global.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 11:58:20 -04:00
Alejo Sanchez
5a03670f91 raft: replication test: remove unused configuration
Remove test case configuration as it's not implemented yet.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 11:58:20 -04:00
Alejo Sanchez
efc6681cd6 raft: replication test: add license
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 11:58:20 -04:00
Piotr Sarna
d473bc9b06 Merge 'Fix inconsistencies in MV and SI (reworked)' from Eliran Sinvani
This is a reworked submission of #7686 which has been reverted.  This series
fixes some race conditions in MV/SI schema creation and load, we spotted some
places where a schema without a base table reference can sneak into the
registry. This can cause to an unrecoverable error since write commands with
those schemas can't be issued from other nodes. Most of those cases can occur on
2 main and uncommon cases, in a mixed cluster (during an upgrade) and in a small
window after a view or base table altering.

Fixes #7709

Closes #8091

* github.com:scylladb/scylla:
  database: Fix view schemas in place when loading
  global_schema_ptr: add support for view's base table
  materialized views: create view schemas with proper base table reference.
  materialized views: Extract fix legacy schema into its own logic
2021-03-09 16:27:34 +01:00
Asias He
61ac8d03b9 repair: Add ignore_nodes option
In some cases, user may want to repair the cluster, ignoring the node
that is down. For example, run repair before run removenode operation to
remove a dead node.

Currently, repair will ignore the dead node and keep running repair
without the dead node but report the repair is partial and report the
repair is failed. It is hard to tell if the repair is failed only due to
the dead node is not present or some other errors.

In order to exclude the dead node, one can use the hosts option. But it
is hard to understand and use, because one needs to list all the "good"
hosts including the node itself. It will be much simpler, if one can
just specify the node to exclude explicitly.

In addition, we support ignore nodes option in other node operations
like removenode. This change makes the interface to ignore a node
explicitly more consistent.

Refs: #7806

Closes #8233
2021-03-09 16:03:13 +01:00
Gleb Natapov
2a41ad0b57 raft: add testing for non-voting members
Add tests to check if quorum (for leader election and commit index
purposes) is calculated correctly in the presence of non-voting members.
Message-Id: <20210304101158.1237480-3-gleb@scylladb.com>
2021-03-09 13:51:09 +01:00
Gleb Natapov
dd6ba3d507 raft: add non-voting member support
This patch adds a support for non-voting members. Non voting member is a
member which vote is not counted for leader election purposes and commit
index calculation purposes and it cannot become a leader. But otherwise
it is a normal raft node. The state is needed to let new nodes to catch
up their log without disturbing a cluster.

All kind of transitions are allowed. A node may be added as a voting member
directly or it may be added as non-voting and then changed to be voting
one through additional configuration change. A node can be demoted from
voting to non-voting member through a configuration change as well.
Message-Id: <20210304101158.1237480-2-gleb@scylladb.com>
2021-03-09 13:47:48 +01:00
Raphael S. Carvalho
863b95aa34 sstables: remove bag_sstable_set
bag_sstable_set can be replaced with partitioned_sstable_set, which
will provide the same functionality, given that L0 sstables go to
a "bag" rather than interval map. STCS, for example, will only
have L0 sstables, so it will get exact the same behavior with
partitioned_sstable_set.

it also gives us the benefit of keeping the leveled sstables in
the interval map if user has switched from LCS to STCS, until
they're all compacted into size-tiered ssts.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-09 08:39:48 -03:00
Avi Kivity
9038a81317 treewide: drop SEASTAR_CONCEPT
Since Scylla requires C++20, there is no need to protect
concept definitions or usages with SEASTAR_CONCEPT; it just
clutters the code. This patch therefore removes all uses.

Closes #8236
2021-03-08 16:04:20 +01:00
Asias He
dc40184faa gossip: Handle timeout error in gossiper::do_shadow_round
Currently, the rpc timeout error for the GOSSIP_GET_ENDPOINT_STATES verb
is not handled in gossiper::do_shadow_round. If the
GOSSIP_GET_ENDPOINT_STATES rpc call to any of the remote nodes goes
timeout, gossiper::do_shadow_round will throw an exception and fail the
whole boot up process.

It is fine that some of the remote nodes timeout in shadow round. It is
not a must to talk to all nodes.

This patch fixes an issue we saw recently in our sct tests:

```
INFO    | scylla[1579]: [shard 0] init - Shutting down gossiping
INFO    | scylla[1579]: [shard 0] gossip - gossip is already stopped
INFO    | scylla[1579]: [shard 0] init - Shutting down gossiping was successful
...

ERR     | scylla[1579]: [shard 0] init - Startup failed: seastar::rpc::timeout_error (rpc call timed out)
```

Fixes #8187

Closes #8213
2021-03-08 13:03:41 +01:00
Nadav Har'El
28804a50f7 alternator-test: test that index can't be a name reference (#xyz)
We already have a test which shows verify DynamoDB and Alternator
do not allow an index in an attribute path - like a[0].b - to be
a value reference - a[:xyz].b. We forgot to verify that the index
also can't be a name reference - a[#xyz].b is a syntax error. So here
we add a test which confirms that this is indeed the case - DynamoDB
doesn't allow it, and neither does Alternator.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210219123310.1240271-1-nyh@scylladb.com>
2021-03-08 10:17:19 +01:00
Avi Kivity
938761f49f types.cc: drop unused #include "compaction_garbage_collector.hh"
Garbage-collect unused #includes.

Closes #8232
2021-03-08 06:44:03 +01:00
Takuya ASADA
2d9feaacea scylla_raid_setup: don't abort using raiddev when array_state is 'clear'
On Ubuntu 20.04 AMI, scylla_raid_setup --raiddev /dev/md0 causes
'/dev/md0 is already using' (issue #7627).
So we merged the patch to find free mdX (587b909).

However, look into /proc/mdstat of the AMI, it actually says no active md device available:

ubuntu@ip-10-0-0-43:~$ cat /proc/mdstat
Personalities :
unused devices: <none>

We currently decide mdX is used when os.path.exists('/sys/block/mdX/md/array_state') == True,
but according to kernel doc, the file may available even array is STOPPED:

    clear

        No devices, no size, no level
        Writing is equivalent to STOP_ARRAY ioctl
https://www.kernel.org/doc/html/v4.15/admin-guide/md.html

So we should also check array_state != 'clear', not just array_state
existance.

Fixes #8219

Closes #8220
2021-03-07 18:30:11 +02:00
Avi Kivity
1287a5e1d0 test: index_reader_assertions: fix misuse of trichotomic comparator in has_monotonic_positions
has_monotonic_positions() wants to check for a greater-than-or-equal-to
relation, but actually tests for not-equal, since it treats a
trichotomic comparator as a less-than comparator. This is clearly seen
in the BOOST_FAIL message just below.

Fix by aligning the test with the intended invariant. Luckily, the tests
still pass.

Ref #1449.

Closes #8222
2021-03-07 13:44:37 +02:00
Eliran Sinvani
0220786710 database: Fix view schemas in place when loading
On restart the view schemas are loaded and might contain old
views with an unmarked computed column. We already have code to
update the schema, but before we do it we load the view as is. This
is not desired since once registered, this view version can be used
for writes which is forbidden since we will spot a none computed
column which is in the view's primary key but not in the base table
at all. To solve this, in addition to altering the persistent schema,
we fix the view's loaded schema in place. This is safe since computed
column is just involved in generating a value for this column when
creating a view update so the effect of this manipulation stays
internal.
The second stage of the in place fixing is to persist the
changes made in the in place fixing so the view is ready for
the next node restart in particular the `computed_columns` table.
2021-03-07 12:57:16 +02:00
Eliran Sinvani
04de770566 global_schema_ptr: add support for view's base table
Up until now, the global_schema_ptr object was a crack
through which a view schema with an uninitialized base
reference could sneak. Even if the schema itself contained a
base reference, the base schema didn't carry over to shards
different than the shard on which the global_schema_ptr was
created.
Since once the schema is in the registry it might be used for
everything (reads and writes), we also need to make sure that
global schemas for an incomplete view schemas will not be created.
2021-03-07 12:50:42 +02:00
Eliran Sinvani
9162748b18 materialized views: create view schemas with proper base table
reference.

Newly created view schemas don't always have their base info,
this is bad since such schemas don't support read nor write.
This leaves us vulnerable to a race condition where there is
an attempt to use this schema for read or write. Here we initialize
the base reference and also reconfigure the view to conform to the
new computed column type, which makes it usable for write and not only
reads. We do it for views created in the migration manager following
announcements and also for copied schemas.
2021-03-07 12:50:42 +02:00
Eliran Sinvani
39cd9dae4e materialized views: Extract fix legacy schema into its own logic
We extract the logic for fixing the view schema into it's own
logic as we will need to use it in more places in the code.
This makes 'maybe_update_legacy_secondary_index_mv_schema' redundant since
it becomes a two liner wrapper for this logic. We also
remove it here and replace the call to it with the equivalent code.
2021-03-07 12:50:42 +02:00
Takuya ASADA
53c7600da8 dist: increase fs.aio-max-nr value for other apps
Current fs.aio-max-nr value cpu_count() * 11026 is exact size of scylla
uses, if other apps on the environment also try to use aio, aio slot
will be run out.
So increase value +65536 for other apps.

Related #8133

Closes #8228
2021-03-07 12:11:36 +02:00
Piotr Sarna
7106ca27e6 service: reduce continuation length for paxos pruning
A pair of (finally, handle_exception) is reduced to a single
use of then_wrapped(), which saves an allocation.

Message-Id: <01949e286db93397209435a85fcc46a8beef6d24.1614937462.git.sarna@scylladb.com>
2021-03-07 11:59:10 +02:00
Nadav Har'El
ad563c6279 Update tools/java submodule
Fixes an sstableloader bug where we quoted twice column names that
had to be quoted, and therefore failed on such tables - and in particular
Alternator tables which always have a column called ":attrs".

Fixes #8229

* tools/java 142f517a23...c5d9e8513e (1):
  > sstableloader: Only escape column names once
2021-03-07 10:33:49 +02:00
Botond Dénes
debaae41f9 mutation_partition: operator<<(mutation_partition::printer)
Include row tombstones in the row printout.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210305094106.210249-1-bdenes@scylladb.com>
2021-03-05 14:39:39 +02:00
Botond Dénes
45471419d0 multishard_mutation_query: re-enable reverse queries
034cb81323 and 0f0c3be disallowed reverse partition-range scans based on
the observation that the CQL frontend disallows them, assuming that
other client APIs also disallow them. As it turns out this is not true
and there it at least one client API (Thrift) which does allows reverse
range scans. So re-enable them.

Fixes: #8211

Tests: unit(release), dtest(thrift_tests.py)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210304142249.164247-1-bdenes@scylladb.com>
2021-03-04 17:06:16 +02:00
Nadav Har'El
acfa180766 cql-pytest: recognize when Scylla crashes
Before this patch, if Scylla crashes during some test in cql-pytest, all
tests after it will fail because they can't connect to Scylla - and we can
get a report on hundreds of failures without a clear sign of where the real
problem was.

This patch introduces an autouse fixture (i.e., a fixture automatically
used by every test) which tries to run a do-nothing CQL command after each
test.  If this CQL command fails, we conclude that Scylla crashed and
report the test in which this happened - and exist pytest instead of failing
a hundred more tests.

Fixes #8080

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210304132804.1527977-1-nyh@scylladb.com>
2021-03-04 16:00:00 +02:00
Raphael S. Carvalho
1226fc755f compaction_manager: Increase cleanup compaction resilience when low on disk space
In a scenario where node is running out of disk space, which is a common
cause of cluster expansion, it's very important to clean up the smallest
files first to increase the chances of success when the biggest files are
reached down the road. That's possible given that cleanup operates on a
single file at a time, and that the smaller the file the smaller the
space requirement.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210303165520.55563-1-raphaelsc@scylladb.com>
2021-03-04 15:11:06 +02:00
Botond Dénes
25367deb01 mutation_partition: make row::consume_with() exception safe
This function currently eagerly decrements `_size`, before `func()` is
invoked. If `func()` throws the consumption fails but the size remains
decremented. If this happens right at the last element in the row, the
`row::empty()` will incorrectly return `true`, even though there is
still one cell left in it. Move the decrement after the `func()`
invocation to avoid this by only decrementing if the consumption
was successful.

Fixes: #8154

Tests: unit(mutation_test:release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210304125318.143323-1-bdenes@scylladb.com>
2021-03-04 15:07:15 +02:00
Piotr Sarna
added53b7d Merge 'hints: use a soft disk space limit in hints commitlog' from Piotr Dulikowski
A recent change to the commitlog (4082f57) caused its configurable size limit to
be strictly enforced - after reaching the limit, new segments wouldn't be
allocated until some of the previous segments are freed. This flow can work for
the regular commitlog, however the hints commitlog does not delete the segments
itself - instead, hints manager recreates its commitlog every 10 seconds, picks
up segments left by the previous instance and deletes each segment manually only
after all hints are sent out from a segment.

Because of the non-standard flow, it is possible that the hints commitlog fills
up and stops accepting more hints. Hints manager uses a relatively low limit for
each commitlog instance (128MB divided by shard count), so it's not hard to fill
it up. What's worse, hints manager tries to acquire file_update_mutex in
exclusive mode before re-creating the commitlog, while hints waiting to be
written acquire this lock in shared mode - which causes hints flushing to
completely deadlock and no more hints be admitted to the commitlog. The queue of
hints waiting to be admitted grows very quickly and soon all writes which could
result in a hint being generated are rejected with OverloadedException.

To solve this problem, it is now possible to bring back the soft disk space
limit by setting a flag in commitlog's configuration.

Tests:
- unit(dev)
- wrote hints for 15 minutes in order to see if it gets stuck again

Fixes #8137

Closes #8206

* github.com:scylladb/scylla:
  hints_manager: don't use commitlog hard space limit
  commitlog: add an option to allow going over size limit
2021-03-04 12:24:05 +01:00
Tomasz Grabiec
d6a94a7db1 Merge 'Make dht::token tri_compare safer' from Avi Kivity
tri_compare() returns an int, which is dangerous as a tri_compare can
be misused where a less_compare is expected. To prevent such misuse,
convert the interval<> template to accept comparators that return
std::strong_ordering, and then convert dht::token's comparator to do
the same.

Ref #1449.

Closes #8181

* github.com:scylladb/scylla:
  dht: convert token tri_compare to std::strong_ordering
  interval: support C++20 three-way comparisons
2021-03-04 11:55:08 +01:00
Nadav Har'El
3e66a5cd43 Merge 'More Redis cleanups' from Pekka Enberg
This pull request removes seastar namespace imports from the header
files. There are some additional cleanups to make that easier and to
remove some commented out code.

Closes #8202

* github.com:scylladb/scylla:
  redis: Remove seastar namespace import from query_processor.hh
  redis: Switch to seastar::sharded<> in query_procesor.hh
  redis: Remove seastar namespace import from query_utils.hh
  redis: Remove seastar namespace import from reply.hh
  redis: Remove commented out code from options.hh
  redis: Remove seastar namespace import from options.hh
  redis: Remove seastar namespace import from service.hh
  redis: Switch to seastar::sharded<> in service.{hh,cc}
  redis: Remove unneeded include from keyspace_utils.hh
  redis: Remove seastar namespace import from keyspace_utils.hh
  redis: Remove seastar namespace import from command_factory.hh
  redis: Fix include path in command_factory.hh
  redis: Remove unneeded includes from command_factory.hh
2021-03-04 11:08:24 +02:00
Pekka Enberg
6066db7c90 Update tools/jmx submodule
* tools/jmx bac7d0b...15c1d4f (2):
  > StorageService: Add a method to return the uptime
  > Bump Jackson version in scylla-apiclient
2021-03-04 10:56:37 +02:00
Nadav Har'El
e12e57c915 Merge 'Fix alternator streams management regression' from Calle Wilund
Refs: #8012
Fixes: #8210

With the update to CDC generation management, the way we retrieve and process these changed.
One very bad bug slipped through though; the code for getting versioned streams did not take into
account the late-in-pr change to make clustering of CDC gen timestamps reversed. So our alternator
shard info became quite rump-stumped, leading to more or less no data depending on when generations
changed w.r. data.

Also, the way we track the above timestamps changed, so we should utilize this for our end-of-iterator check.

Closes #8209

* github.com:scylladb/scylla:
  alternator::streams: Use better method for generation timestamp
  system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp
  system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range
2021-03-04 09:43:56 +02:00
Pekka Enberg
1d8a94f941 Update tools/jmx submodule
* tools/jmx c2fc96b...bac7d0b (1):
  > Merge 'Fix locking in APIBuilder.remove()' from Pekka Enberg
2021-03-03 18:30:48 +02:00
Calle Wilund
8bbc976ff1 alternator::streams: Use better method for generation timestamp
Get timestamp via system_distributed, instead of local gen.
2021-03-03 15:46:38 +00:00
Calle Wilund
5da0129775 system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp
Since we have a table of cdc version timestamps, conviniently sorted
reversed, we can just query this and get the latest known gen ts.
2021-03-03 15:44:54 +00:00
Calle Wilund
5a69250d7e system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range
With the new scheme for cdc generation management, one of the last
changes was to make the time ordering of the stream timestamps reversed.

However, cdc_get_versioned_streams forgot to take this into account
when sifting out timestamp ranges for stream retrieval (based on
low mark).

Fixed by doing reverse iteration.
2021-03-03 15:41:42 +00:00
Tomasz Grabiec
3cb01f218f Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja
Test log consistency after apply_snapshot() is called.
Ensure log::last_term() log::last_conf_index() and log::size()
work as expected.

Misc cleanups.

* scylla-dev.git/raft-confchange-test-v4:
  raft: fix spelling
  raft: add a unit test for voting
  raft: do not account for the same vote twice
  raft: remove fsm::set_configuration()
  raft: consistently use configuration from the log
  raft: add ostream serialization for enum vote_result
  raft: advance commit index right after leaving joint configuration
  raft: add tracker test
  raft: tidy up follower_progress API
  raft: update raft::log::apply_snapshot() assert
  raft: add a unit test for raft::log
  raft: rename log::non_snapshoted_length() to log::in_memory_size()
  raft: inline raft::log::truncate_tail()
  raft: ignore AppendEntries RPC with a very old term
  raft: remove log::start_idx()
  raft: return a correct last term on an empty log
  raft: do not use raft::log::start_idx() outside raft::log()
  raft: rename progress.hh to tracker.hh
  raft: extend single_node_is_quiet test
2021-03-03 16:29:40 +01:00
Tomasz Grabiec
0dc57db248 Revert "Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja"
This reverts commit f94f70cda8, reversing
changes made to 5206a97915.

Not the latest version of the series was merged. Rvert prior to
merging the latest one.
2021-03-03 16:29:02 +01:00
Avi Kivity
facc7c370e Update tools/jmx submodule
* tools/jmx 8073af6...c2fc96b (1):
  > APIBuilder: Remove RW-lock in JMX server repository wrapper

Fixes #7991.
2021-03-03 15:41:09 +02:00
Avi Kivity
aae43e1a20 Merge 'Untyped_result_set: make non-copying and fragment retaining' from Calle Wilund
Refs #7961
Fixes #8014

The "untyped_result_set" object was created for small, internal access to cql-stored metadata.
It is nowadays used for rather more than that (cdc).
This has the potential of mixing badly with the fact that the type does deep copying of data
and linearizes all (not to mention handles multiple rows rather inefficiently).

Instead of doing a deep copy of input, we keep assume ownership and build
rows of the views therein, potentially retaining fragmented data as-is
avoiding premature linearization.

Note that this is not all sugar and flowers though. Any data access will
by nature be more expensive, and the view collections we create are
potentially just as expensive as copying for small cells.

Otoh, it allows writing code using this that avoids data copying,
depending on destination.

v2:
* Fixed wrong collection reserved in visitor
* Changed row index from shared ptr to ref
* Moved typedef
* Removed non-existing constructors
* Added const ref to index build
* Fixed raft usage after rebase

v3:
* Changed shared_ptr to unique

Closes #8015

* github.com:scylladb/scylla:
  untyped_result_set: Do not copy data from input store (retain fragmented views)
  result_generator: make visitor callback args explicit optionals
  listlike_partial_deserializing_iterator: expose templated collection routines
2021-03-03 13:13:18 +02:00
Nadav Har'El
4e3db5297a cql-pytest: rework tests for filtering leaving out most rows
Previously, we had two tests demonstrating issue #7966. But since then,
our understanding of this issue has improved which resulted in issue #8203,
so this patch improves those tests and makes them reproduce the new issue.

Importantly, we now know that this problem is not specific to a full-table
scan, and also happens in a single-partition scan, so we fix the test to
demonstrate this (instead of the old test, which missed the problem so
the test passed).

Both tests pass on Cassandra, and fail on Scylla.

Refs #8203.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210302224020.1498868-1-nyh@scylladb.com>
2021-03-03 11:22:08 +01:00
Calle Wilund
e4d6c8904f untyped_result_set: Do not copy data from input store (retain fragmented views)
Refs #7961
Fixes #8014

Instead of doing a deep copy of input, we keep assume ownership and build
rows of the views therein, potentially retaining fragmented data as-is
avoiding premature linearization.

Note that this is not all sugar and flowers though. Any data access will
by nature be more expensive, and the view collections we create are
potentially just as expensive as copying for small cells.

Otoh, it allows writing code using this that avoids data copying,
depending on destination.

v2:
* Fixed wrong collection reserved in visitor
* Changed row index from shared ptr to ref
* Moved typedef
* Removed non-existing constructors
* Added const ref to index build
* Fixed raft usage after rebase

v3:
* Changed shared_ptr to unique
2021-03-03 10:19:46 +00:00
Calle Wilund
353730d4bb result_generator: make visitor callback args explicit optionals
This allows a visitor to separate temporaries (non-optional views)
from store backed views (optionals) when traversing.
2021-03-03 10:19:46 +00:00
Calle Wilund
bba43ce31a listlike_partial_deserializing_iterator: expose templated collection routines
To allow using fragmented types as input.
2021-03-03 10:19:46 +00:00
Nadav Har'El
0fea089b37 Merge 'Fix reading whole requests during shedding' from Piotr Sarna
When shedding requests (e.g. due to their size or number exceeding the
limits), errors were returned right after parsing their headers, which
resulted in their bodies lingering in the socket. The server always
expects a correct request header when reading from the socket after the
processing of a single request is finished, so shedding the requests
should also take care of draining their bodies from the socket.

Fixes #8193

Closes #8194

* github.com:scylladb/scylla:
  cql-pytest: add a shedding test
  transport: return error on correct stream during size shedding
  transport: return error on correct stream during shedding
  transport: skip the whole request if it is too large
  transport: skip the whole request during shedding
2021-03-03 08:52:48 +02:00
Piotr Sarna
4499f89916 cql-pytest: add a shedding test
This scylla-only test case tries to push a too-large request
to Scylla, and then retries with a smaller request, expecting
a success this time.

Refs #8193
2021-03-03 07:08:55 +01:00
Pekka Enberg
310b5c9592 redis: Fix license text in server.hh
The search and replace pattern went bit overboard. Let's fix up the
license text.
Message-Id: <20210302171150.3346-1-penberg@scylladb.com>
2021-03-03 07:06:45 +01:00
Dejan Mircevski
05497fe14d cql3/maps: Drop redundant if condition
Accidentally introduced in 9eed26ca3d, it can never be true due to
code above it.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8201
2021-03-03 07:06:45 +01:00
Nadav Har'El
d6335b7fda test/alternator: better tests of oversized requests
Like DynamoDB, Alternator rejects requests larger than some fixed maximum
size (16MB). We had a test for this feature - test_too_large_request,
but it was too blunt, and missed two issues:

Refs #8195
Refs #8196

So this patch adds two better tests that reproduce these two issues:

First, test_too_large_request_chunked verifies that an oversized request
is detected even if the body is sent with chunked encoding.

Second, both tests - test_too_large_request_chunked and
test_too_large_request_content_length - verify that the rather limited
(and arguably buggy) Python HTTP client is able to read the 413 status
code - and doesn't report some generic I/O error.

Both tests pass on DynamoDB, but fail on Alternator because of these two
open issues.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210302154555.1488812-1-nyh@scylladb.com>
2021-03-03 07:06:45 +01:00
Nadav Har'El
c6ca1ec643 cql-pytest: add reproducers for two filtering-related issues
The main goal of this patch is to add a reproducer for issue #7966, where
partition-range scan with filtering that begins with a long string of
non-matches aborts the query prematurely - but the same thing is fine with
a single-partition scan. The test, test_filtering_with_few_matches, is
marked as "xfail" because it still fails on Scylla. It passes on Cassandra.

I put a lot of effort into making this reproducer *fast* - the dev-build
test takes 0.4 seconds on my laptop. Earlier reproducers for the same
problem took as much as 30 seconds, but 0.4 seconds turns this test into
a viable regression test.

We also add a test, test_filter_on_unset, reproduces issue #6295 (or
the duplicate #8122), which was already solved so this test passes.

Refs #6295
Refs #7966
Refs #8122

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210301170451.1470824-1-nyh@scylladb.com>
2021-03-03 07:06:45 +01:00
Calle Wilund
58489dc003 cql3::restrictions: Add SCYLLA_CLUSTERING_BOUND keyword for sstableloader
Refs #8093
Refs /scylladb/scylla-tools-java#218

Adds keyword that can preface value tuples in (a, b, c) > (1, 2, 3)
expressions, forcing the restriction to bypass column sort order
treatment, and instead just create the raw ck bounds accordningly.

This is a very limited, and simple version, but since we only need
to cover this above exact syntax, this should be sufficient.

v2:
* Add small cql test
v3:
* Added comment in multi_column_restriction::slice, on what "mode" means and is for
* Added small document of our internal CQL extension keywords, including this.
v4:
* Added a few more cases to tests to verify multi-column restrictions
* Reworded docs a bit
v5:
* Fixed copy-paste error in comment
v6:
* Added negative (error) test cases
v7:
* Added check + reject of trying to combine SCYLLA_CLUST... slice and
  normal one

Closes #8094
2021-03-03 07:06:45 +01:00
Pekka Enberg
9d54a3e743 redis: Remove seastar namespace import from query_processor.hh 2021-03-02 18:39:30 +02:00
Pekka Enberg
27c5041c86 redis: Switch to seastar::sharded<> in query_procesor.hh 2021-03-02 18:38:41 +02:00
Pekka Enberg
ee8fe53b3c redis: Remove seastar namespace import from query_utils.hh 2021-03-02 18:37:31 +02:00
Pekka Enberg
c90c1ccd44 redis: Remove seastar namespace import from reply.hh 2021-03-02 18:36:30 +02:00
Pekka Enberg
1075d72780 redis: Remove commented out code from options.hh 2021-03-02 18:34:46 +02:00
Pekka Enberg
1c222fda65 redis: Remove seastar namespace import from options.hh 2021-03-02 18:34:30 +02:00
Pekka Enberg
d452c8f42e redis: Remove seastar namespace import from service.hh 2021-03-02 18:33:31 +02:00
Pekka Enberg
d0594a86aa redis: Switch to seastar::sharded<> in service.{hh,cc} 2021-03-02 18:30:39 +02:00
Avi Kivity
ee9db75210 Merge 'Clean up Redis transport layer' from Pekka Enberg
The Redis transport layer seems to have originated as a copy-paste of
the CQL transport layer. This pull request removes bunch of unused
and commented out bits of code, and also does some minor cleanups
like organizing includes, to make the code more readable.

Closes #8198

* github.com:scylladb/scylla:
  redis: Remove unused to_bytes_view() function from server.cc
  redis: Remove unused tracing_request_type enum
  redis: Remove unneeded connection friend declaration
  redis: Remove unused process_request_executor friend declaration
  redis: Remove unused _request_cpu class member
  redis: Remove commented out code from server.hh
  redis: Remove duplicate request.hh include
  redis: Remove unused db::config forward declaration
  redis: Remove unused fmt_visitor forward declaration
  redis: Organize includes in server.{cc,hh}
  redis: Switch to seastar::sharded<>
  redis: Remove redundant access modifiers from server.hh
2021-03-02 18:27:38 +02:00
Pekka Enberg
097aaa6dc2 redis: Remove unneeded include from keyspace_utils.hh 2021-03-02 18:16:29 +02:00
Pekka Enberg
7f4de3f915 redis: Remove seastar namespace import from keyspace_utils.hh 2021-03-02 18:15:37 +02:00
Pekka Enberg
bf47b58b8a redis: Remove seastar namespace import from command_factory.hh 2021-03-02 18:13:49 +02:00
Pekka Enberg
92e257d5bd redis: Fix include path in command_factory.hh 2021-03-02 18:13:08 +02:00
Pekka Enberg
ac4b8e4534 redis: Remove unneeded includes from command_factory.hh 2021-03-02 18:12:30 +02:00
Piotr Dulikowski
376da49cf4 hints_manager: don't use commitlog hard space limit
This commit disables the hard space limit applied by commitlogs created
to store hints. The hard limit causes problems for hints because they
use small-sized commitlogs to store hints (128MB, currently). Instead of
letting the commitlog delete the segments itself, it recreates the
commitlog every 10 seconds and manually deletes old segments after all
hints are sent out from them.

If the 128MB limit is reached, the hints manager will get stuck. A
future which puts hint into commitlog holds a shared lock, and commitlog
recreation needs to get an exclusive lock, which results in a deadlock.
No more hints will be admitted, and eventually we will start rejecting
writes with OverloadedException due to too many hints waiting to be
admitted to the commitlog.

By disabling the hard limit for hints commitlog, the old behavior is
brought back - commitlog becomes more conservative with the space used
after going over its size limit, but does not block until some of its
segments are deleted.
2021-03-02 16:53:50 +01:00
Piotr Sarna
8635094144 transport: return error on correct stream during size shedding
When a request is shed due to being too large, its response
was sent with stream id 0 instead of the stream id that matches
the communication lane. That in turn confused the client,
which is no longer the case.
2021-03-02 15:10:46 +01:00
Piotr Sarna
d6ea6937ee transport: return error on correct stream during shedding
When a request is shed due to exceeding the max number of concurrent
requests, its response was sent with stream id 0 instead of
the stream id that matches the communication lane.
That in turn confused the client, which is no longer the case.
2021-03-02 15:10:46 +01:00
Pekka Enberg
01a785f561 redis: Remove unused to_bytes_view() function from server.cc 2021-03-02 14:29:52 +02:00
Pekka Enberg
fb6eecfae2 redis: Remove unused tracing_request_type enum 2021-03-02 14:29:52 +02:00
Pekka Enberg
8d79deb973 redis: Remove unneeded connection friend declaration 2021-03-02 14:29:51 +02:00
Pekka Enberg
ff81f7bc23 redis: Remove unused process_request_executor friend declaration 2021-03-02 14:29:51 +02:00
Pekka Enberg
87c5968602 redis: Remove unused _request_cpu class member 2021-03-02 14:29:51 +02:00
Pekka Enberg
11fa32e8c9 redis: Remove commented out code from server.hh 2021-03-02 14:29:51 +02:00
Pekka Enberg
ddab15c47f redis: Remove duplicate request.hh include 2021-03-02 14:29:51 +02:00
Pekka Enberg
07bd125a59 redis: Remove unused db::config forward declaration 2021-03-02 14:29:51 +02:00
Pekka Enberg
5a7e6b6c09 redis: Remove unused fmt_visitor forward declaration 2021-03-02 14:29:51 +02:00
Pekka Enberg
298bf19981 redis: Organize includes in server.{cc,hh} 2021-03-02 14:29:51 +02:00
Pekka Enberg
23c2f47054 redis: Switch to seastar::sharded<> 2021-03-02 14:29:51 +02:00
Pekka Enberg
7bd4ff9d75 redis: Remove redundant access modifiers from server.hh 2021-03-02 14:13:45 +02:00
Avi Kivity
5f4bf18387 Revert "Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros"
This reverts commit 31909515b3, reversing
changes made to ef97adc72a. It shows many
serious regressions in dtest.

Fixes #8197.
2021-03-02 13:21:22 +02:00
Takuya ASADA
870c3a28c1 scylla_setup: strip spaces of comma separated list
On RAID prompt, we can type disk list something like this:
 /dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1

However, if the list has spaces in the list, it doesn't work:
 /dev/sda1, /dev/sdb1, /dev/sdc1, /dev/sdd1

Because the script mistakenly recognize the space part of a device path.
So we need strip() the input for each item.

Fixes #8174

Closes #8190
2021-03-02 12:48:18 +02:00
Piotr Sarna
4a24d7dca0 transport: skip the whole request if it is too large
When a request is shed due to being too large, only the header
was actually read, and the body was still stuck in the socket
- and would be read in the next iteration, which would expect
to actually read a new request header.
Instead, the whole message is now skipped, so that a new request
can be correctly read and parsed.

Fixes #8193
2021-03-02 10:10:19 +01:00
Piotr Sarna
3eb7e768cb transport: skip the whole request during shedding
When a request is shed due to exceeding the number of max concurrent
requests, only its header was actually read, and the body was still
stuck in the socket - and would be read in the next iteration,
which would expect to actually read a new request header.
Instead, the whole message is now skipped, so that a new request
can be correctly read and parsed.

Refs #8193
2021-03-02 10:10:19 +01:00
Avi Kivity
10364fca6e Merge "Build query::result directly in range scan queries" from Botond
"
Currently range scans build their results on the replica in the
`reconcilable_result` format, that -- as its name suggests -- is
normally used for reconciliation (read repair). As such this result
format is quite inefficient for normal queries: it contains all columns
and all tombstones in the requested range. These are all unnecessary for
normal queries which only want live data and only those columns that are
requested by the user.
Furthermore, as the coordinator works in terms of `query::result` for
normal queries anyway, this intermediate result has to be converted to
the final `query::result` format adding an unnecessary intermediate
conversion step.
This series gets rid of this problem by introducing
`query_data_on_all_shards()`, a variant of
`query_mutations_on_all_shards()` that builds `query::result` directly.
Reverse queries still use the old intermediate method behind the scenes.

Fixes #8061
Refs #7434

Tests: unit(release, debug)
"

* 'range-scan-data-variant/v5-rebased' of https://github.com/denesb/scylla:
  cql_query_test: add unit test for the more efficient range scan result format
  test/cql_test_env: do_with_cql_test_env(): add thread_attributes parameter
  cql_query_test: test_query_limit: clean up scheduling groups
  storage_proxy: use query_data_on_all_shards() for data range scan queries
  query: partition_slice: add range_scan_data_variant option
  gms: add RANGE_SCAN_DATA_VARIANT cluster feature
  multishard_mutation_query: query_mutations_on_all_shards(): refuse reverse queries
  multishard_mutation_query: add query_data_on_all_shards()
  mutation_partition.cc: fix indentation
  query_result_builder: make it a public type
  multishard_mutation_query: generalize query code w.r.t. the result builder used
  multishard_mutation_query: query_mutations_on_all_shards(): extract logic into new method
  multishard_mutation_query: query_mutations_on_all_shards(): convert to coroutine
  multishar_mutation_query: do_query_mutations(): convert to coroutine
  multishard_mutation_query: read_page(): convert to coroutine
  multishard_mutation_query: extract page reading logic into separate method
2021-03-02 08:54:41 +02:00
Botond Dénes
257c295cff cql_query_test: add unit test for the more efficient range scan result format
The most user-visible aspect of this change is range scans which select
a small subset of the columns. These queries work as the user expects
them to work: unselected columns are not included in determining the
size of the result (or that of the page). This is the aspect this test
is checking for. While at it, also test single partition queries too.
2021-03-02 08:01:53 +02:00
Botond Dénes
af0a23e75c test/cql_test_env: do_with_cql_test_env(): add thread_attributes parameter
To allow conveniently setting the scheduling group `func` is to be run
in.
2021-03-02 07:53:53 +02:00
Botond Dénes
fe280271a6 cql_query_test: test_query_limit: clean up scheduling groups
Destroy scheduling groups created for this test, so other tests can
create scheduling groups with the same name, without conflicts.
2021-03-02 07:53:53 +02:00
Botond Dénes
f8ce168c8e storage_proxy: use query_data_on_all_shards() for data range scan queries
Currently range scans build their result using the `reconcilable_result`
format and then convert it to `query::result`. This is inefficient for
multiple reasons:
1) it introduces an additional intermediate result format and a
   subsequent conversion to the final one;
2) the reconcilable result format was designed for reconciliation so it
   contains all data, including columns unselected by the query, dead
   rows and tombstones, which takes much more memory to build;

There is no reason to go through all this trouble, if there ever was one
in the past it doesn't stand anymore. So switch to the newly introduced
`query_data_on_all_shards()` when doing normal data range scans, but
only if all the nodes in the cluster supports it, to avoid artificial
differences in page sizes due to how reconcilable result and
query::result calculates result size and the consequent false-positive
read repair.
The transition to this new more efficient method is coordinated by a
cluster feature and whether to use it is decided by the coordinator
(instead of each replica individually). This is to avoid needless
reconciliation due to the different page sizes the two formats will
produce.
2021-03-02 07:53:53 +02:00
Botond Dénes
f15551d23a query: partition_slice: add range_scan_data_variant option
Switching to the data variant of range scans have to be coordinated by
the coordinator to avoid replicas noticing the availability of the
respective feature in different time, resulting in some using the
mutation variant, some using the data variant.
So the plan is that it will be the coordinator's job to check the
cluster feature and set the option in the partition slice which will
tell the replicas to use the data variant for the query.
2021-03-02 07:53:53 +02:00
Botond Dénes
5c84aa52db gms: add RANGE_SCAN_DATA_VARIANT cluster feature
To control the transition to the data variant of range scans. As there
is a difference in how the data and mutation variants calculate pages
sizes, the transition to the former has to happen in a controlled
manner, when all nodes in the cluster support it, to avoid artificial
differences in page content and subsequently triggering false-positive
read repair.
2021-03-02 07:53:53 +02:00
Botond Dénes
0f0c3be63e multishard_mutation_query: query_mutations_on_all_shards(): refuse reverse queries
Refuse reverse queries just like in the new
`query_data_on_all_shards()`. The reason is the same, reverse range
scans are not supported on the client API level and hence they are
underspecified and more importantly: not tested.
2021-03-02 07:53:53 +02:00
Botond Dénes
034cb81323 multishard_mutation_query: add query_data_on_all_shards()
A data query variant of the existing `query_mutations_on_all_shards()`.
This variant builds a `query::result`, instead of `reconcilable_result`.
This is actually the result format coordinators want when executing
range scans, the reason for using the reconcilable result for these
queries is historic, and it just introduces an unnecessary intermediate
format.
This new method allows the storage proxy to skip this intermediate
format and the associated conversion to `query::result`, just like we do
for single partition queries.

Reverse queries are refused because they are not supported on the client
API (CQL) level anyway and hence it is unspecified how they should work
and more importantly: they are not tested.
2021-03-02 07:53:53 +02:00
Botond Dénes
df0f501ba2 mutation_partition.cc: fix indentation
Left broken from the previous patch.
2021-03-02 07:53:53 +02:00
Botond Dénes
950150c6df query_result_builder: make it a public type
We will want to use it in multishard_mutation_query.cc.
2021-03-02 07:53:53 +02:00
Botond Dénes
f19ab5cff1 multishard_mutation_query: generalize query code w.r.t. the result builder used
We want to add support to building `query::result` directly and reuse
the code path we use to build reconcilable result currently for it.
So templatize said code path on the result builder used. Since the
different result builders don't have a source level compatible interface
an adaptor class is used.
2021-03-02 07:53:53 +02:00
Botond Dénes
bddb0d35d6 multishard_mutation_query: query_mutations_on_all_shards(): extract logic into new method
In the next patches we are going to generalize the query logic w.r.t.
the result builder used, so query_mutations_on_all_shards() will be just
a facade parametrizing the actual query code with the right result
builder.
2021-03-02 07:53:53 +02:00
Botond Dénes
b0b620b501 multishard_mutation_query: query_mutations_on_all_shards(): convert to coroutine
In preparation to generalizing it w.r.t. the result builder used.
This change will be much simpler with the coroutine code.
2021-03-02 07:53:53 +02:00
Botond Dénes
5d85615698 multishar_mutation_query: do_query_mutations(): convert to coroutine
In preparation to generalizing it w.r.t. the result builder used.
This change will be much simpler with the coroutine code.
2021-03-02 07:53:53 +02:00
Botond Dénes
8138bdb434 multishard_mutation_query: read_page(): convert to coroutine
In preparation to generalizing it w.r.t. the result builder used. This
change will be much simpler with the coroutine code.
2021-03-02 07:53:53 +02:00
Botond Dénes
29195f67f1 multishard_mutation_query: extract page reading logic into separate method
The block of code moved also coincides with the scope in which the
reader has to be alive, making the code more clear.
2021-03-02 07:53:53 +02:00
Benny Halevy
baf5d05631 storage_service: use atomic_vector for lifecycle_subscribers
So it can be modified while walked to dispatch
subscribed event notifications.

In #8143, there is a race between scylla shutdown and
notify_down(), causing use-after-free of cql_server.

Using an atomic vector itstead and futurizing
unregister_subscriber allows deleting from _lifecycle_subscribers
while walked using atomic_vector::for_each.

Fixes #8143

Test: unit(release)
DTest:
  update_cluster_layout_tests:TestUpdateClusterLayout.add_node_with_large_partition4_test(release)
  materialized_views_test.py:TestMaterializedViews.double_node_failure_during_mv_insert_4_nodes_test(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-2-bhalevy@scylladb.com>
2021-03-01 20:34:42 +02:00
Benny Halevy
1ed04affab cql_server: event_notifier: unregister_subscriber in stop
Move unregister_subscriber from the destructor to stop
as preparation for moving storage_service lifescyle_subscribers
to atomic_vector and futurizing unregister_subscriber.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-1-bhalevy@scylladb.com>
2021-03-01 20:34:42 +02:00
Avi Kivity
fe8f9039a2 Update seastar submodule
* seastar 803e790598...ea5e529f30 (3):
  > Merge "Teach io_tester to generate YAML output" from Pavel E
  > bitset: set_range: mark constructor constexpr
  > Update dpdk submodule
2021-03-01 20:34:35 +02:00
Avi Kivity
8747c684e0 Merge 'Move timeouts to client state' from Piotr Sarna
This series is extracted from #7913 as it may prove useful to other series as well, and #7913 might take a while until its merged, given that it also depends on other unmerged pull requests.

The idea of this series is to move timeouts to the client state, which will allow changing them independently for each session - e.g. by setting per-service-level timeouts and initializing the values from attached service levels (see #7867).

Closes #8140

* github.com:scylladb/scylla:
  treewide: remove timeout config from query options
  cql3: use timeout config from client state instead of query options
  cql3: use timeout config from client state instead of query options
  cql3: use timeout config from client state instead of query options
  service: add timeout config to client state
2021-03-01 20:34:35 +02:00
Raphael S. Carvalho
2cf0c4bbf1 compaction: Prevent cleanup and regular from compacting the same sstable
Due to regression introduced by 463d0ab, regular can compact in parallel a sstable
being compacted by cleanup, scrub or upgrade.

This redundancy causes resources to be wasted, write amplification is increased
and so does the operation time, etc.

That's a potential source of data resurrection because the now-owned data from
a sstable being compacted by both cleanup and regular will still exist in the
node afterwards, so resurrection can happen if node regains ownership.

Fixes #8155.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com>
2021-03-01 20:34:35 +02:00
Tomasz Grabiec
cb0b8d1903 row_cache: Zap dummy entries when populating or reading a range
This will prevent accumulation of unnecessary dummy entries.

A single-partition populating scan with clustering key restrictions
will insert dummy entries positioned at the boundaries of the
clustering query range to mark the newly populated range as
continuous.

Those dummy entries may accumulate with time, increasing the cost of
the scan, which needs to walk over them.

In some workloads we could prevent this. If a populating query
overlaps with dummy entries, we could erase the old dummy entry since
it will not be needed, it will fall inside a broader continuous
range. This will be the case for time series worklodas which scan with
a decreasing (newest) lower bound.

Refs #8153.

_last_row is now updated atomically with _next_row. Before, _last_row
was moved first. If exception was thrown and the section was retried,
this could cause the wrong entry to be removed (new next instead of
old last) by the new algorithm. I don't think this was causing
problems before this patch.

The problem is not solved for all the cases. After this patch, we
remove dummies only when there is a single MVCC version. We could
patch apply_monotonically() to also do it, so that dummies which are
inside continuous ranges are eventually removed, but this is left for
later.

perf_row_cache_reads output after that patch shows that the second
scan touches no dummies:

$ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 265320
Scanning
read: 142.621613 [ms], preemption: {count: 639, 99%: 0.545791 [ms], max: 0.526929 [ms]}, cache: 0/0 [MB]
read: 0.023197 [ms], preemption: {count: 1, 99%: 0.035425 [ms], max: 0.032736 [ms]}, cache: 0/0 [MB]

Message-Id: <20210226172801.800264-1-tgrabiec@scylladb.com>
2021-03-01 20:34:35 +02:00
Tomasz Grabiec
761f89e55e api: Introduce system/drop_sstable_caches RESTful API
Evicts objects from caches which reflect sstable content, like the row
cache. In the future, it will also drop the page cache
and sstable index caches.

Unlike lsa/compact, doesn't cause reactor stalls.

The old lsa/compact call invokes memory reclamation, which is
non-preemptible. It also compacts LSA segments, so does more
work. Some use cases don't need to compact LSA segments, just want the
row cache to be wiped.

Message-Id: <20210301120211.36195-1-tgrabiec@scylladb.com>
2021-03-01 16:13:04 +02:00
Piotr Dulikowski
aa2df75321 commitlog: add an option to allow going over size limit
This commit adds an option which, when turned on, allows the commitlog
to go over configured size limit. After reaching the limit, commitlog
will be more conservative with its usage of the disk space - for
example, it won't increase the segment reserve size or reuse recycled
segments. Most importantly, it won't block writes until the space used
by the commitlog goes down.

This change is necessary for hinted handoff to keep its current
behavior. Hinted handoff does not let the commitlog free segments
itself - instead, it re-creates it every 10 seconds and manually deletes
segments after all hints are sent from a segment.
2021-03-01 14:16:05 +01:00
Takuya ASADA
d0297c599a dist: tune fs.aio-max-nr based on the number of cpus
Current aio-max-nr is set up statically to 1048576 in
/etc/sysctl.d/99-scylla-aio.conf.
This is sufficient for most use cases, but falls short on larger machines
such as i3en.24xlarge on AWS that has 96 vCPUs.

We need to tune the parameter based on the number of cpus, instead of
static setting.

Fixes #8133

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #8188
2021-03-01 14:18:24 +02:00
Avi Kivity
31909515b3 Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that keeps all its
versions that are referenced somewhere and provides a way of getting
a reference to an immutable version of the set.
Each sstable in the set is associated with the versions it is alive in,
and is removed when all such versions don't have references anymore.
To avoid copying, the object holding all sstables in the set version is
changed to a new structure, sstable_list, which was previously an alias
for std::unordered_set<shared_sstable>, and which implements most of the
methods of an unordered_set, but its iterator uses the actual set with
all sstables from all referenced versions and iterates over those
sstables that belong to the captured version.
The methods that modify the sets contents give strong exception guarantee
by trying to insert new sstables to its containers, and erasing them in
the case of an caught exception.
To release shared_sstables as soon as possible (i.e. when all references
to versions that contain them die), each time a version is removed, all
sstables that were referenced exclusively by this version are erased. We
are able to find these sstables efficiently by storing, for each version,
all sstables that were added and erased in it, and, when a version is
removed, merging it with the next one. When a version that adds an sstable
gets merged with a version that removes it, this sstable is erased.

Fixes #2622

Signed-off-by: Wojciech Mitros wojciech.mitros@scylladb.com

Closes #8111

* github.com:scylladb/scylla:
  sstables: add test for checking the latency of updating the sstable_set in a table
  sstables: move column_family_test class from test/boost to test/lib
  sstables: use fast copying of the sstable_set instead of rebuilding it
  sstables: replace the sstable_set with a versioned structure
  sstables: remove potential ub
  sstables: make sstable_set constructor less error-prone
2021-03-01 14:16:36 +02:00
Avi Kivity
ef97adc72a Merge "Validate token monotonicity on the sstable write path" from Botond
"
We have recently seen out-of-order partitions getting into sstables
causing major disruption later on. Given the damage caused, it was again
raised that we should enable partition key monotonicity validation
unconditionally in the sstable write path. This was also raised in the
past but dismissed as key validation was suspected (but not measured) to
add considerable per-fragment overhead. One of the problems was that the
key monotonicity validation was all or nothing. It either validated all
(clustering and partition) key monotonicity or none of it.
This series takes a second look at this and solves the all-or-nothing
problem by making the configuration of the key monotonicity check more
fine grained, allowing for enabling just token monotonicity validation
separately, then enables it unconditionally.

Refs: #7623

Tests: unit(release)
"

* 'sstable-writer-validate-partition-keys-unconditionally/v3' of https://github.com/denesb/scylla:
  sstables: enable token monotonicity validation by default
  mutation_fragment_stream_validator: add token validation level
  mutation_fragment_stream_validating_filter: make validation levels more fine-grained
2021-03-01 11:23:51 +02:00
Amnon Heiman
0595596172 api/compaction_manager: add the compaction id in get_compaction
This patch adds the compaction id to the get_compaction structure.
While it was supported, it was not used and up until now wasn't needed.

After this patch a call to curl -X GET 'http://localhost:10000/compaction_manager/compactions'
will include the compaction id.

Relates to #7927

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #8186
2021-03-01 10:51:31 +02:00
Piotr Sarna
7936652322 db,view: improve verbosity of errors coming from view updates
The error now contains information about the view table that failed,
as well as base and view tokens.
Example:
view - Error applying view update to 127.0.0.1 (view: ks.testme_v_idx_index,
        base token: -4069959284402364209, view token: -3248873570005575792): std::runtime_error (manually injected error)

Fixes #8177

Closes #8178
2021-03-01 10:46:14 +02:00
Avi Kivity
86d8977c96 Update tools/python3 submodule
* tools/python3 199ac90...6f3bcbe (2):
  > Add support pip modules
  > create-relocatable-package.py: add support python libraries in /usr/local
2021-03-01 10:10:13 +02:00
Avi Kivity
8ac0d6d15d Update tools/jmx submodule
* tools/jmx bf8bb16...8073af6 (1):
  > CompactionManager: add the compaction id when available

Fixes #7927.
2021-03-01 10:09:16 +02:00
Takuya ASADA
4cf9b6988e scylla_coredump_setup: don't run apt-get when systemd-coredump is already installed
Check systemd-coredump existance before running apt-get install
systemd-coredump.

Closes #8185
2021-03-01 09:38:51 +02:00
Botond Dénes
f0b284dab8 sstables: enable token monotonicity validation by default
Partition key order validation in data written to sstables can be very
disruptive. All our components in the storage layers assume that
partitions are in order, which means that reading out-of-order
partitions triggers undefined behaviour. Computer scientists often joke
that undefined behaviour can erase your hard drive and in this case the
damage done by undefined behaviour caused by out-of-order partitions is
very close to that. The corruption is known to mutate causing crashes,
corrupting more data and even loose data. For this reason it is
imperative that out-of-order partitions cannot get into sstables. This
patch enables token monotonicity validation unconditionally in
the sstable writer. As partition key monotonicity checks involve a key
copy per partition, which might have an impact on the performance, we do
the next best thing instead and enable only token monotonicity
validation.
2021-03-01 07:49:23 +02:00
Botond Dénes
727bc0f5d4 mutation_fragment_stream_validator: add token validation level
In some cases the full-blown partition key validation and especially the
associated key copy per partition might be deemed too costly. As a next
best thing this patch adds a token only validation, which should cover
99% (number pulled out of my sleeve) of the cases. Let's hope no one
gets unlucky.
2021-03-01 07:49:23 +02:00
Botond Dénes
694f8a4ec6 mutation_fragment_stream_validating_filter: make validation levels more fine-grained
Currently key order validation for the mutation fragment stream
validating filter is all or nothing. Either no keys (partition or
clustering) are validated or all of them. As we suspect that clustering
key order validation would add a significant overhead, this discourages
turning key validation on, which means we miss out on partition key
monotonicity validation which has a much more moderate cost.
This patch makes this configurable in a more fine-grained fashion,
providing separate levels for partition and clustering key monotonicity
validation.

As the choice for the default validation level is not as clear-cut as
before, the default value for the validation level is removed in the
validating filter's constructor.
2021-03-01 07:49:23 +02:00
Avi Kivity
3cd2f00438 dht: convert token tri_compare to std::strong_ordering
Change token's tri_compare functions to return std::strong_ordering,
which is not convertible to bool and therefore not suspect to
being misused where a less-compare is expected.

Two of the users (ring_position and decorated_key) have to undo
the conversion, since they still return int. A follow up will
convert them too.

Ref #1449.
2021-02-28 21:03:59 +02:00
Avi Kivity
d3d7698502 interval: support C++20 three-way comparisons
Allow the tri-comparator input to range functions to return
std::strong_ordering, e.g. the result of operator<=>. An int
input is still allowed, and coerced to std::strong_ordering by
tri-comparing it against zero. Once all users are converted, this
will be disallowed.

The clever code that performs boundary comparisons unfortunately
has to be dumbed down to conditionals. A helper
require_ordering_and_on_equal_return() is introduced that accepts
a comparison result between bound values, an expected comparison
result, and what to return if the bound value matches (this depends
on whether individual bounds are exclusive or inclusive, on
whether the bounds are start bounds or end bounds, and on the
sense of the comparison).

Unfortunately, the code is somewhat pessimized, and there is no
way to pessimize it as the enum underlying std::strong_ordering
is hidden.
2021-02-28 21:03:25 +02:00
Avi Kivity
d980f550d1 Merge 'row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows' from Tomasz Grabiec
fill_buffer() will keep scanning until _lower_bound_changed is true,
even if preemption is signaled, so that the reader makes forward
progress.

Before the patch, we did not update _lower_bound on touching a dummy
entry. The read will not respect preemption until we hit a non-dummy
row. If there is a lot of dummy rows, that can cause reactor stalls.

Fix that by updating _lower_bound on dummy entries as well.

Refs #8153.

Tested with perf_row_cache_reads:

```
$ build/release/test/perf/perf_row_cache_reads -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB]
read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB]
```

Notice that max preemption latency is low in the second "read:" line.

Closes #8167

* github.com:scylladb/scylla:
  row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows
  tests: perf: Introduce perf_row_cache_reads
  row_cache: Add metric for dummy row hits
2021-02-28 21:00:20 +02:00
Botond Dénes
1d9b5911fe time_series_sstable_set::create_single_key_sstable_reader(): fix use-after-free
The optimal path of said method mistakenly captures `pos` (a local
variable) in its reader factory method and passes a temporary range
implicitly constructed from said `pos` as the range parameter to the
sstable reader. This will lead to the sstable reader using a dangling
range and will result in returning no result for queries. This patch
fixes this bug and adds a unit test to cover this code path.

Fixes #8138.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226143111.104591-2-bdenes@scylladb.com>
2021-02-26 23:57:25 +02:00
Botond Dénes
dd5a601aaa result_memory_accounter: abort unpaged queries hitting the global limit
The `result_memory_accounter` terminates a query if it reaches either
the global or shard-local limit. This used to be so only for paged
queries, unpaged ones could grow indefinitely (until the node OOM'd).
This was changed in fea5067 which enforces the local limit on unpaged
queries as well, by aborting them. However a loophole remained in the
code: `result_memory_accounter::check_and_update()` has another stop
condition, besides `check_local_limit()`, it also checks the global
limit. This stop condition was not updated to enforce itself on unpaged
queries by aborting them, instead it silently terminated them, causing
them to return less data then requested. This was masked by most queries
reaching the local limit first.
This patch fixes this by aborting unpaged mutation queries when they hit
the global limit.

Fixes: #8162

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226102202.51275-1-bdenes@scylladb.com>
2021-02-26 23:43:16 +02:00
Botond Dénes
bc1fcd3db2 multishard_combining_reader: only read from needed shards
The multishard combining reader currently assumes that all shards have
data for the read range. This however is not always true and in extreme
cases (like reading a single token) it can lead to huge read
amplification. Avoid this by not pushing shards to
`_shard_selection_min_heap` if the first token they are expected to
produce falls outside of the read range. Also change the read ahead
algorithm to select the shards from `_shard_selection_min_heap`, instead
of walking them in shard order. This was wrong in two ways:
* Shards may be ordered differently with respect to the first partition
  they will produce; reading ahead on the next shard in shard order
  might not bring in data on the next shard the read will continue on.
  Shard order is only correct when starting a new range and shards are
  iterated over in the order they own tokens according to the sharding
  algorithm.
* Shards that may not have data relevant to the read range are also
  considered for read ahead.

After this patch, the multishard reader will only read from shards that
have data relevant to the read range, both in the case of normal reads
and also for read-ahead.

Fixes: #8161

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226132536.85438-1-bdenes@scylladb.com>
2021-02-26 23:29:20 +02:00
Piotr Sarna
0e0282cdf1 Merge ' cdc: move (most of) CDC generation management to a new service' from Kamil Braun
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.

This PR introduces a new service crafted to handle CDC generation
management: listening and reacting to generation changes in the cluster.

We plug the service in, initializing it in main and test code,
passing a reference to storage_service and having storage_service call
the service (using the `after_join` method): the service only starts
doing its job after the node joins the token ring (either on bootstrap
or restart).

Some parts of generation management still remain in storage_service:
the bootstrap procedure, which happens inside storage_service,
must also do some initialization regarding CDC generations,
for example: on restart it must retrieve the latest known generation
timestamp from disk; on bootstrap it must create a new generation
and announce it to other nodes. The order of these operations w.r.t
the rest of the startup procedure is important, hence the startup
procedure is the only right place for them. We may try decoupling
these services even more in follow-up PRs, but that requires a bit
of careful reasoning. What this PR does is a low-hanging fruit.

Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these handling functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.

This PR is a prerequisite to fixing #7985. The fact that all the CDC generation
management code was in storage_service is technical debt. It will be easier
to modify the management algorithms when they sit in their own module.

Tests: unit (dev) and cdc_tests.py dtest (dev), and local replication test using scylla-cdc-java

Closes #8172

* github.com:scylladb/scylla:
  cdc: move (most of) CDC generation management code to the new service
  cdc: coroutinize make_new_cdc_generation
  cdc: coroutinize update_streams_description
  cdc: introduce cdc::generation_service
  main: move cdc_service initialization just prior to storage_service initialization
2021-02-26 12:42:27 +01:00
Kamil Braun
e2f03e4aba cdc: move (most of) CDC generation management code to the new service
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.

Previous commits have introduced a new service for managing CDC
generations. This code moves most of the relevant code to this new
service.

However, some part still remains in storage_service: the bootstrap
procedure, which happens inside storage_service, must also do some
initialization regarding CDC generations, for example: on restart it
must retrieve the latest known generation timestamp from disk; on
bootstrap it must create a new generation and announce it to other
nodes. The order of these operations w.r.t the rest of the startup
procedure is important, hence the startup procedure is the only right
place for them.

Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.
2021-02-26 12:06:12 +01:00
Tomasz Grabiec
b9c3b6c10f row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows
fill_buffer() will keep scanning until _lower_bound_chnaged is true,
even if preemption is signalled, so that the reader makes forward
progress.

Before the patch, we did not update _lower_bound on touching a dummy
entry. The read will not respect preemption until we hit a non-dummy
row. If there is a lot of dummy rows, that can cause reactor stalls.

Fix that by updating _lower_bound on dummy entries as well.

Refs #8153.

Tested with perf_row_cache_reads:

$ build/release/test/perf/perf_row_cache_reads -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB]
read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB]

Notice that max preemption latency is low in the second "read:" line.
2021-02-26 01:20:38 +01:00
Tomasz Grabiec
52e411df36 tests: perf: Introduce perf_row_cache_reads
Tests performance of various read patterns from the row cache.

Example:

$ build/release/test/perf/perf_row_cache_reads_g  -c1 -m200M
Filling memtable
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 156.288986 [ms], preemption: {count: 702, 99%: 0.545791 [ms], max: 0.537537 [ms]}, cache: 99/100 [MB]
read: 106.480766 [ms], preemption: {count: 6, 99%: 0.006866 [ms], max: 106.496168 [ms]}, cache: 99/100 [MB]
2021-02-26 01:20:38 +01:00
Tomasz Grabiec
f0a3272a5f row_cache: Add metric for dummy row hits
This will help to diagnose performance problems related to the read
having to walk through a lot of dummy rows to fill the buffer.

Refs #8153
2021-02-25 18:26:01 +01:00
Piotr Sarna
c5214eb096 treewide: remove timeout config from query options
Timeout config is now stored in each connection, so there's no point
in tracking it inside each query as well. This patch removes
timeout_config from query_options and follows by removing now
unnecessary parameters of many functions and constructors.
2021-02-25 17:20:27 +01:00
Piotr Sarna
f973e09454 cql3: use timeout config from client state instead of query options
... in batch statement, in order to be able to remove the timeout
from query options later.
2021-02-25 17:20:27 +01:00
Piotr Sarna
639d90d2d6 cql3: use timeout config from client state instead of query options
... in modification statement, in order to be able to remove the timeout
from query options later.
2021-02-25 17:20:27 +01:00
Piotr Sarna
b71665efe8 cql3: use timeout config from client state instead of query options
... in select statement, in order to be able to remove the timeout
from query options later.
2021-02-25 17:20:27 +01:00
Piotr Sarna
7ceafda70a service: add timeout config to client state
Future patches will use this per-connection timeout config
to allow setting different timeouts for each session,
based on roles.
2021-02-25 17:20:26 +01:00
Takuya ASADA
aabc67e386 dist/debian: don't run dh_installinit for scylla-node-exporter when service name == package name
dh_installinit --name <service> is for forcing install debian/*.service
and debian/*.default that does not matches with package name.
And if we have subpackages, packager has responsibility to rename
debian/*.service to debian/<subpackage>.*service.

However, we currently mistakenly running
dh_installinit --name scylla-node-exporter for
debian/scylla-node-exporeter.service,
the packaging system tries to find destination package for the .service,
and does not find subpackage name on it, so it will pick first
subpackage ordered by name, scylla-conf.

To solve the issue, we just need to run dh_installinit without --name
when $product == 'scylla'.

Fixes #8163

Closes #8164
2021-02-25 17:05:20 +02:00
Avi Kivity
032fdfe855 Update seastar submodule
* seastar e53a1059f9...803e790598 (9):
  > io_queue: Count total time spent in the queue
  > io_queue: Fix "delay" metrics
Fixes #8166.
  > file: expose disk offset alignment for overwrites
Ref #7663.
  > RPC: (client) retain local address and use on stream creation
  > rpc: sink_impl: align _{last,next}_seq_num to cache-line size
  > reactor: Fix outdated comment
  > fair_queue: Remove now dead ticket strictly_less method
  > io_queue: Double max request size
  > bitsets: set_iterator: correctly implement pre- and post-increment operators
2021-02-25 16:58:06 +02:00
Takuya ASADA
f3a82f4685 scylla_setup: allow running scylla_setup with strict umask setting
We currently deny running scylla_setup when umask != 0022.
To remove this limitation, run os.chmod(0o644) on every file creation
to allow reading from scylla user.

Note that perftune.yaml is not really needed to set 0644 since perftune.py is
running in root user, but setting it to align permission with other files.

Fixes #8049

Closes #8119
2021-02-25 16:42:45 +02:00
Nadav Har'El
750d7903be cql-pytest: fix some comments in util.py
Fix some incorrect comments, pasted from other files or mentioning
wrong names. No other changes except comments

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210225133237.1403891-1-nyh@scylladb.com>
2021-02-25 16:00:20 +02:00
Raphael S. Carvalho
7bf0744d36 reshape/TWCS: Fix off-by-one in threshold check
A given time bucket should also be reshaped if its # of sstables
has reached the threshold.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210223182634.570648-1-raphaelsc@scylladb.com>
2021-02-24 15:12:40 +02:00
Raphael S. Carvalho
21608bd677 sstables: Fix TWCS reshape for windows with at least min_threshold sstables
TWCS reshape was silently ignoring windows which contain at least
min_threshold sstables (can happen with data segregation).
When resizing candidates, size of multi_window was incorrectly used and
it was always empty in this path, which means candidates was always
cleared.

Fixes #8147.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>
2021-02-24 15:11:19 +02:00
Tomasz Grabiec
ecb6c56a2a Merge 'lsa: background reclaim' from Avi Kivity
This series adds background reclaim to lsa, with the goal
that most large allocations can be satisfied from available
free memory, and and reclaim work can be done from a preemptible
context.

If the workload has free cpu, then background reclaim will
utilize that free cpu, reducing latency for the main workload.
Otherwise, background reclaim will compete with the main
workload, but since that work needs to happen anyway,
throughput will not be reduced.

A unit test is added to verify it works.

Fixes #1634.

Closes #8044

* github.com:scylladb/scylla:
  test: logalloc_test: test background reclaim
  logalloc: reduce gap between std min_free and logalloc min_free
  logalloc: background reclaim
  logalloc: preemptible reclaim
2021-02-24 13:23:30 +01:00
Piotr Sarna
25f47561cb transport: fix an outdated comment
The comment mentions calling a lambda in-place, but the lambda
is no longer there since 2019!

Message-Id: <3903c84d5c151415409f28935e328b552dd548f8.1614155567.git.sarna@scylladb.com>
2021-02-24 11:14:01 +02:00
Avi Kivity
15d3797e97 test: logalloc_test: test background reclaim
Test that the background reclaimer is able to compete with a
fake load and reclaim 10 MB/s. The test is quite stressful as the "LRU"
is fully randomized.

If the background reclaimer is disabled, the test fails as soon as the
20MB "gap" is exhausted. With the reclaimer enabled, it is able to
free memory ahead of the allocations.
2021-02-23 19:42:42 +02:00
Nadav Har'El
d905e71a90 Alternator: add support for CORS protocol
This patch adds to Alternator support for the CORS (Cross-Origin Resource
Sharing) protocol - a simple extension over the HTTP protocol which
browsers use when Javascript code contacts HTTP-based servers.

Although we usually think of Alternator as being used in a three-tier
application, in some setups there is no middle layer and the user's
browser, running Javascript code, wants to communicate directly with the
database. However, for security reasons, by default Javascript loaded
from domain X is not allowed to communicate with different domains Y.
The CORS protocol is meant to allow this, and Alternator needs to
participate in this protocol if it is to be used directly from Javascript
in browsers.

To implement CORS, Alternator needs to respond to the OPTIONS method
which it didn't allow before - with certain headers based on the
input headers. It also needs to do some of these things for the
regular methods (mostly, POST). The patch includes a comprehensive
test that runs against both Alternator and DynamoDB and shows that
Alternator handles these headers and methods the same as DynamoDB.

Additionally, I tested manually a Javascript DynamoDB client - which
didn't work prior to this patch (the browser reported CORS errors),
and works after this patch.

Fixes #8025.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210217222027.1219319-1-nyh@scylladb.com>
2021-02-23 13:15:03 +01:00
Asias He
7018377bd7 messaging_service: Move gossip ack message verb to gossip group
Fix a scheduling group leak:

INFO [shard 0] gossip - gossiper::run sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack_msg sg=statement
INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip

After the fix:

INFO [shard 0] gossip - gossiper::run sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip

Fixes #7986

Closes #8129
2021-02-23 10:10:00 +02:00
Tomasz Grabiec
fb1d3fe2cf table: Fix schema mismatch between memtable reader and sstable writer
The schema used to create the sstable writer has to be the same as the
schema used by the reader, as the former is used to intrpret mutation
fragments produced by the reader.

Commit 9124a70 intorduced a deferring point between reader creation
and writer creation which can result in schema mismatch if there was a
concurrent alter.

This could lead to the sstable write to crash, or generate a corrupted
sstable.

Fixes #7994

Message-Id: <20210222153149.289308-1-tgrabiec@scylladb.com>
2021-02-22 17:51:00 +02:00
Raphael S. Carvalho
81d773e5d8 compaction_manager: Redefine weight for better control of parallel compactions
Compaction manager allows compaction of different weights to proceed in
parallel. For example, a small-sized compaction job can happen in parallel to a
large-sized one, but similar-sized jobs are serialized.

The problem is the current definition of weight, which is the log (base 4) of
total size (size of all sstables) of a job.

This is what we get with the current weight definition:
    weight=5	for sizes=[1K, 3K]
    weight=6	for sizes=[4K, 15K]
    weight=7	for sizes=[16K, 63K]
    weight=8	for sizes=[64K, 255K]
    weight=9	for sizes=[258K, 1019K]
    weight=10	for sizes=[1M, 3M]
    weight=11	for sizes=[4M, 15M]
    weight=12	for sizes=[16M, 63M]
    weight=13	for sizes=[64M, 254M]
    weight=14	for sizes=[256M, 1022M]
    weight=15	for sizes=[1033M, 4078M]
    weight=16	for sizes=[4119M, 10188M]
    total weights: 12

Note that for jobs smaller than 1MB, we have 5 different weights, meaning 5
jobs smaller than 1MB could proceed in parallel. High number of parallel
compactions can be observed after repair, which potentially produces tons of
small sstables of varying sizes. That causes compaction to use a significant
amount of resources.

To fix this problem, let's add a fixed tax to the size before taking the log,
so that jobs smaller than 1M will all have the same weight.

Look at what we get with the new weight definition:
    weight=10	for sizes=[1K, 2M]
    weight=11	for sizes=[3M, 14M]
    weight=12	for sizes=[15M, 62M]
    weight=13	for sizes=[63M, 254M]
    weight=14	for sizes=[256M, 1022M]
    weight=15	for sizes=[1033M, 4078M]
    weight=16	for sizes=[4119M, 10188M]
    total weights: 7

Fixes #8124.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210217123022.241724-1-raphaelsc@scylladb.com>
2021-02-22 15:50:29 +02:00
Asias He
554ab035dd main: Run init_server and join_cluster inside maintenance scheduling group
Currently, init_server and join_cluster which initiate the bootstrap and
replace operations on the new node run inside the main scheduling group.
We should run them inside the maintenance scheduling group to reduce the
impact on the user workload.

This patch fixes a scheduling group leak for bootstrap and replace operation.

Before:
[shard 0] storage_service - storage_service::bootstrap sg=main
[shard 0] repair - bootstrap_with_repair sg=main

After:
[shard 0] storage_service - storage_service::bootstrap sg=streaming
[shard 0] repair - bootstrap_with_repair sg=streaming

Fixes #8130

Closes #8131
2021-02-22 14:55:02 +02:00
Michał Chojnowski
a24f83852e atomic_cell: fix operator<< for atomic_cell_or_collection
operator<< used the wrong criterium for deciding whether the data is
stored as atomic_cell or collection_mutation, resulting in
catastrophical failure if it was used with frozen collections or UDTs.
Since frozen collections and UDTs are stored as atomic_cell, not
collection_mutation, the correct criterium is not is_collection(),
but is_multi_cell().

Closes #8134
2021-02-22 14:45:34 +02:00
Kamil Braun
022d7773f4 cdc: coroutinize make_new_cdc_generation 2021-02-22 12:47:44 +01:00
Kamil Braun
26ca9d6c33 cdc: coroutinize update_streams_description 2021-02-22 12:46:53 +01:00
Kamil Braun
d4937daaea cdc: introduce cdc::generation_service
This commit introduces a new service crafted to handle CDC generation
management: listening and reacting to generation changes in the cluster.

The implementation is a stub for now, the service reacts to generation
changes by simply logging the event.

The commit plugs the service in, initializing it in main and test code,
passing a reference to storage_service and having storage_service start
the service (using the `after_join` method): the service only starts
doing its job after the node joins the token ring (either on bootstrap
or restart).
2021-02-22 12:45:43 +01:00
Kamil Braun
8e72c33d7c main: move cdc_service initialization just prior to storage_service
initialization

As a preparation for introducing CDC generation management service.

cdc_service will depend on the generation service.
But the generation service needs some other services to work
properly. In particular, it uses the local database, so it should be
initialized after the local database.

The only service that will need the cdc generation service is
storage_service, so we can place the generation service initialization
code right before storage_service initialization code. So the order will
be cdc_generation_service -> cdc_service -> storage_service.
2021-02-22 12:43:10 +01:00
Liu Lan
d2378129a3 docs: fix invalid path in README.mds
Signed-off-by: Liu Lan <liulan_yewu@cmss.chinamobile.com>

Closes #8126
2021-02-21 13:49:12 +02:00
Konstantin Osipov
95ee8e1b90 raft: fix spelling
Fix spelling of a few comments.
2021-02-19 22:56:26 +03:00
Pekka Enberg
d483922671 Update tools/java submodule
* tools/java 0187829d5e...142f517a23 (2):
  > nodetool: Enable resetlocalschema
  > sstableloader: Make progress printout less eager.
2021-02-19 12:37:04 +02:00
Avi Kivity
78d1afeabd Merge "Use radix tree to store cells on a row" from Pavel E
"
Current storage of cells in a row is a union of vector and set. The
vector holds 5 cell_and_hash's inline, up to 32 ones in the external
storage and then it's switched to std::set. Once switched, the whole
union becomes the waste of space, as it's size is

   sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes

and only 3 pointers from it are used (std::set header). Also the
overhead to keep cell_and_hash as a set entry is more then the size
of the structure itself.

Column ids are 32-bit integers that most likely come sequentialy.
For this kind of a search key a radix tree (with some care for
non-sequential cases) can be beneficial.

This set introduces a compact radix tree, that uses 7-bit sub values
from the search key to index on each node and compacts the nodes
themselves for better memory usage. Then the row::_storage is replaced
with the new tree.

The most notable result is the memory footprint decrease, for wide
rows down to 2x times. The performance of micro-benchmarks is a bit
lower for small rows and (!) higer for longer (8+ cells). The numbers
are in patch #12 (spoiler: they are better than for v2)

v3:
- trimmed size of radix down to 7 bits
- simplified the nodes layouts, now there are 2 of them (was 4)
- enhanced perf_mutation to test N-cells schema
- added AVX intra-nodes search for medium-sized nodes
- added .clone_from() method that helped to improve perf_mutation
- minor
  - changed functions not to return values via refs-arguments
  - fixed nested classes to properly use language constructors
  - renamed index_to to key_t to distinguish from node_index_t
  - improved recurring variadic templates not to use sentinel argument
  - use standard concepts

v2:
- fixed potential mis-compilation due to strict-aliasing violation
- added oracle test (radix tree is compared with std::map)
- added radix to perf_collection
- cosmetic changes (concepts, comments, names)

A note on item 1 from v2 changelog. The nodes are no longer packed
perfectly, each has grown 3 bytes. But it turned out that when used
as cells container most of this growth drowned in lsa alignments.

next todo:
- aarch64 version of 16-keys node search

tests: unit(dev), unit(debug for radix*), pref(dev)
"

* 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla:
  test/memory_footpring: Print radix tree node sizes
  row: Remove old storages
  row: Prepare row::equal for switch
  row: Prepare row::difference for switch
  row: Introduce radix tree storage type
  row-equal: Re-declare the cells_equal lambda
  test: Add tests for radix tree
  utils: Compact radix tree
  array-search: Add helpers to search for a byte in array
  test/perf_collection: Add callback to check the speed of clone
  test/perf_mutation: Add option to run with more than 1 columns
  test/perf_mutation: Prepare to have several regular columns
  test/perf_mutation: Use builder to build schema
2021-02-18 21:19:14 +02:00
Nadav Har'El
02dde2aca1 cql-pytest: port Cassandra's unit test validation/entities/json_test
In this patch, we port validation/entities/json_test.java, containing
21 tests for various JSON-related operations - SELECT JSON, INSERT JSON,
and the fromJson() and toJson() functions.

In porting these tests, I uncovered 19 (!!) previously unknown bugs in
Scylla:

Refs #7911: Failed fromJson() should result in FunctionFailure error, not
            an internal error.
Refs #7912: fromJson() should allow null parameter.
Refs #7914: fromJson() integer overflow should cause an error, not silent
            wrap-around.
Refs #7915: fromJson() should accept "true" and "false" also as strings.
Refs #7944: fromJson() should not accept the empty string "" as a number.
Refs #7949: fromJson() fails to set a map<ascii, int>.
Refs #7954: fromJson() fails to set null tuple elements.
Refs #7972: toJson() truncates some doubles to integers.
Refs #7988: toJson() produces invalid JSON for columns with "time" type.
Refs #7997: toJson() is missing a timezone on timestamp.
Refs #8001: Documented unit "µs" not supported for assigning a "duration"
            type.
Refs #8002: toJson() of decimal type doesn't use exponents so can produce
            huge output.
Refs #8077: SELECT JSON output for function invocations should be
            compatible with Cassandra.
Refs #8078: SELECT JSON ignores the "AS" specification.
Refs #8085: INSERT JSON with bad arguments should yield InvalidRequest
            error, not internal error.
Refs #8086: INSERT JSON cannot handle user-defined types with case-
            sensitive component names.
Refs #8087: SELECT JSON incorrectly quotes strings inside map keys.
Refs #8092: SELECT JSON missing null component after adding field to
            UDT definition.
Refs #8100: SELECT JSON with IN and ORDER BY does not obey the ORDER BY.

Due to these bugs, 8 out of the 21 tests here currently xfail and one
has to be skipped (issue #8100 causes the sanitizer to detect a use
after free, and crash Scylla).

As usual in these sort of tests, all 21 tests pass when running against
Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210217130732.1202811-1-nyh@scylladb.com>
2021-02-18 20:44:04 +02:00
Takuya ASADA
32d4ec6b8a scylla_util.py: resolve /dev/root to get actual device on aws
When psutil.disk_paritions() reports / is /dev/root, aws_instance mistakenly
reports root partition is part of ephemeral disks, and RAID construction will
fail.
This prevents the error and reports correct free disks.

Fixes #8055

Closes #8040
2021-02-18 20:25:45 +02:00
Avi Kivity
90a7f76fb6 Merge 'cdc: log: fix a use-after-free in process_bytes_visitor' from Michał Chojnowski
Due to small value optimization used in `bytes`, views to `bytes` stored
in `vector` can be invalidated when the vector resizes, resulting in
use-after-free and data corruption. Fix that.

Closes #8105

* github.com:scylladb/scylla:
  cdc: log: avoid an unnecessary copy
  cdc: log: fix use-after-free in process_bytes_visitor
2021-02-18 20:23:41 +02:00
Michał Chojnowski
96c22cf3f8 cdc: log: avoid an unnecessary copy
There is no need to copy `bytes_view` into `bytes` here.
2021-02-18 14:08:18 +01:00
Michał Chojnowski
8cc4f39472 cdc: log: fix use-after-free in process_bytes_visitor
Due to small value optimization used in `bytes`, views to `bytes` stored
in `vector` can be invalidated when the vector resizes, resulting in
use-after-free and data corruption. Fix that.

Fixes #8117
2021-02-18 14:08:17 +01:00
Konstantin Osipov
32952a744a raft: add a unit test for voting
Test duplicate votes, votes from non-members and voting
in joint configuration.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
e49d5f89a5 raft: do not account for the same vote twice
While a duplicate vote from the same server is not possible by a
conforming Raft implementation, Raft assumptions on network permit
duplicates.
So, in theory, it is possible that a vote message is delivered
multiple times.

The current voting implementation does reject votes from non-members,
but doesn't check for duplicate votes.

Keep track of who has voted yet, and reject duplicate votes.

A unit test follows.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
7ea064ac04 raft: remove fsm::set_configuration()
Set either tracker or votes configuration explicitly.
This saves a few lines and simplifies unit tests.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
4083026b65 raft: consistently use configuration from the log 2021-02-18 16:04:44 +03:00
Konstantin Osipov
c4552ffb9a raft: add ostream serialization for enum vote_result 2021-02-18 16:04:44 +03:00
Konstantin Osipov
2ae04d8a47 raft: advance commit index right after leaving joint configuration
Imagine the cluster is in joint configuration {{A, B}, {A, B, C, D, E}}.
The leader's view of stable indexes is:

Server  Match Index
A       5
B       5
C       6
D       7
E       8

The commit index would be 5 if we use joint configuration, and 6
if we assume we left it. Let it happen without an extra FSM
step.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
132db931da raft: add tracker test 2021-02-18 16:04:44 +03:00
Konstantin Osipov
6e3932bbc7 raft: tidy up follower_progress API
Make the API More explicit so it's available for testing.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
ed65a8635e raft: update raft::log::apply_snapshot() assert
apply_snapshot() doesn't support applying the same snapshot
twice. The caller must check the current snapshot before
applying.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
e58a3e42ca raft: add a unit test for raft::log 2021-02-18 16:04:44 +03:00
Konstantin Osipov
51c968bcb4 raft: rename log::non_snapshoted_length() to log::in_memory_size()
The old name was incorrect, in case apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.

Fix spelling in related comments.

Rename fsm::wait() to fsm::wait_max_log_size(), it's a more
specific name. Rename max_log_length to max_log_size to use
'size' rather than 'length' consistently for log size.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
cfe407b402 raft: inline raft::log::truncate_tail()
It's the core of apply_snapshot() work and is only used in it.

Now that truncate_tail is inline, rename truncate_head()
to truncate_uncommitted().
2021-02-18 16:04:44 +03:00
Konstantin Osipov
e0011c6e4d raft: ignore AppendEntries RPC with a very old term
Do not assert on an outdated message.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
805d52eb16 raft: remove log::start_idx()
Replace it with a private _first_idx, which is maintained
along with the rest of class log state.
_first_idx is a name consistent with counterpart last_idx().

Do not use a function since going forward we may want
to remove Raft index from struct log_entry, so should rely
less on it.

This fixes a bug when _last_conf_idx was not reset
after apply_snapshot() because start_idx() was pointing
to a non-existent entry.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
af8770da63 raft: return a correct last term on an empty log
If the log is empty, we must use snapshot's term,
since the log could be right after taking a snapshot
when no trailing entries were kept.

This fixes a rare possible bug when a log matching
rule could be violated during elections by a follower
with a log which was just truncated after a snapshot.

A separate unit test for the issue will follow.
2021-02-18 16:04:43 +03:00
Konstantin Osipov
cb035a7c8d raft: do not use raft::log::start_idx() outside raft::log()
raft::log::start_idx() is currently not meaningful
in case the log is empty.

Avoid using it in fsm::replicate_to() and avoid manual search for
previous log term, instead encapsulate the search in log::term_for().

As a side effect we currently return a correct term (0)
when log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.

This change happens to reduce the overall line count.

While at it, improve the comments in raft::replicate_to().
2021-02-18 16:04:43 +03:00
Konstantin Osipov
04b4d97d6a raft: rename progress.hh to tracker.hh
class tracker is the main class of this module.
2021-02-18 16:04:43 +03:00
Konstantin Osipov
97a16c0f77 raft: extend single_node_is_quiet test 2021-02-18 16:04:43 +03:00
Avi Kivity
f0950e023d Merge 'Split CDC streams table partitions into clustered rows ' from Kamil Braun
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments

This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.

The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.

Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.

---

Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.

However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. We add an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).

Closes #8116

* github.com:scylladb/scylla:
  tests: add a simple CDC cql pytest
  cdc: add config option to disable streams rewriting
  cdc: rewrite streams to the new description table
  cql3: query_processor: improve internal paged query API
  cdc: introduce no_generation_data_exception exception type
  docs: cdc: mention system.cdc_local table
  cdc: coroutinize do_update_streams_description
  sys_dist_ks: split CDC streams table partitions into clustered rows
  cdc: use chunked_vector for streams in streams_version
  cdc: remove `streams_version::expired` field
  system_distributed_keyspace: use mutation API to insert CDC streams
  storage_service: don't use `sys_dist_ks` before it is started
2021-02-18 12:49:43 +02:00
Kamil Braun
4bf28aad7a tests: add a simple CDC cql pytest 2021-02-18 11:44:59 +01:00
Kamil Braun
841f07e9b7 cdc: add config option to disable streams rewriting
Rewriting stream descriptions is a long, expensive, and prone-to-failure
operation. Due to #8061 it may consume a lot of memory. In general, it
may keep failing (and being retried) endlessly, straining the cluster.
As a backdoor we add this flag for potential future needs of admins or
field engineers.

I don't expect it will ever be used, but it won't hurt and may save us
some work in the worst case scenario.
2021-02-18 11:44:59 +01:00
Kamil Braun
9bdd000e97 cdc: rewrite streams to the new description table
Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.

However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. This commit adds an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).
2021-02-18 11:44:59 +01:00
Kamil Braun
4ef736a0a3 cql3: query_processor: improve internal paged query API
The `query_processor::query` method allowed internal paged queries.
However, it was quite limited, hardcoding a number of parameters:
consistency level, timeout config, page size.

This commit does the following improvements:
1. Rename `query` to `query_internal` to make it obvious that this API
   is supposed to be used for internal queries only
2. Extend the method to take consistency level, timeout config, and page
   size as parameters
3. Remove unused overloads of `query_internal`
4. Fix a bunch of typos / grammar issues in the docstring
2021-02-18 11:44:59 +01:00
Kamil Braun
7c91894ddf cdc: introduce no_generation_data_exception exception type 2021-02-18 11:44:59 +01:00
Kamil Braun
99cc9b8051 docs: cdc: mention system.cdc_local table 2021-02-18 11:44:59 +01:00
Kamil Braun
44aab61aea cdc: coroutinize do_update_streams_description 2021-02-18 11:44:59 +01:00
Kamil Braun
67d4e5576d sys_dist_ks: split CDC streams table partitions into clustered rows
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments

This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.

The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.

Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.
2021-02-18 11:44:59 +01:00
Kamil Braun
ba920361b3 cdc: use chunked_vector for streams in streams_version
The vector may get quite long (say... 1,6M stream IDs). We prevent a
large allocation by using utils::chunked_vector.
2021-02-18 11:44:59 +01:00
Kamil Braun
9ae4467970 cdc: remove streams_version::expired field
This field was not used anywhere.
2021-02-18 11:44:59 +01:00
Kamil Braun
3d7b990300 system_distributed_keyspace: use mutation API to insert CDC streams
The `storage_proxy::mutate` low-level API is much more powerful than
the CQL API. This power is not needed for this commit but for the next.
2021-02-18 11:44:59 +01:00
Kamil Braun
0df15ca8cc storage_service: don't use sys_dist_ks before it is started
It could happen that system_distributed_keyspace was used by
storage_service before it was fully started (inside
`handle_cdc_generation`), i.e. before sys_dist_ks' `start()` returned
(on shard 0). It only checked whether `local_is_initialized()` returns
true, so it only ensured that the service is constructed.

Currently, sys_dist_ks' `start` only announces migrations, so this was
mostly harmless. More concretely: it could result in the node trying to
send CQL requests using a table that it didn't yet recognize by calling
sys_dist_ks' methods before the `announce_migration` call inside `start`
has returned. This would result in an exception; however, the exception
would be catched by the caller and the procedure would be retried,
succeeding eventually. See `handle_cdc_generation` for details.

Still, the initial intention of the code was to wait for the sys_dist_ks
service to be fully started before it was used. This commit fixes that.
2021-02-18 11:44:59 +01:00
Tomasz Grabiec
f94f70cda8 Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja
Test log consistency after apply_snapshot() is called.
Ensure log::last_term() log::last_conf_index() and log::size()
work as expected.

Misc cleanups.

* scylla-dev/raft-confchange-test:
  raft: add a unit test for voting
  raft: do not account for the same vote twice
  raft: remove fsm::set_configuration()
  raft: consistently use configuration from the log
  raft: add ostream serialization for enum vote_result
  raft: advance commit index right after leaving joint configuration
  raft: add tracker test
  raft: tidy up follower_progress API
  raft: update raft::log::apply_snapshot() assert
  raft: add a unit test for raft::log
  raft: rename log::non_snapshoted_length() to log::length()
  raft: inline raft::log::truncate_tail()
  raft: ignore AppendEntries RPC with a very old term
  raft: remove log::start_idx()
  raft: return a correct last term on an empty log
  raft: do not use raft::log::start_idx() outside raft::log()
  raft: rename progress.hh to tracker.hh
  raft: extend single_node_is_quiet test
2021-02-18 10:55:59 +01:00
Raphael S. Carvalho
5206a97915 compaction: Fix leak of expired sstable in the backlog tracker
expired sstables are skipped in the compaction setup phase, because they don't
need to be actually compacted, but rather only deleted at the end.
that is causing such sstables to not be removed from the backlog tracker,
meaning that backlog caused by expired sstables will not be removed even after
their deletion, which means shares will be higher than needed, making compaction
potentially more aggressive than it have to.

to fix this bug, let's manually register these sstables into the monitor,
such that they'll be removed from the tracker once compaction completes.

Fixes #6054.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210216203700.189362-1-raphaelsc@scylladb.com>
2021-02-18 11:12:00 +02:00
Takuya ASADA
d7f202f900 dist/debian: fix renaming debian/scylla-* files rule
Current renaming rule of debian/scylla-* files is buggy, it fails to
install some .service files when custom product name specified.

Introduce regex based rewriting instead of adhoc renaming, and fixed
wrong renaming rule.

Fixes #8113

Closes #8114
2021-02-18 10:35:19 +02:00
Pekka Enberg
843bf57c3c Update tools/jmx submodule
* tools/jmx 949cefc...bf8bb16 (1):
  > Merge 'dist/debian: fix renaming debian/scylla-* files rule' from Takuya ASADA
2021-02-18 10:35:00 +02:00
Botond Dénes
c3b4c3f451 evictable_reader: reset _range_override after fast-forwarding
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader it's read range
is reduced to account for partitions it already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.

Fixes: #8059

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
2021-02-17 19:11:00 +02:00
Benny Halevy
4b46793c19 row_cache: scanning_and_populating_reader: add _read_next_partition flag
Instead of resetting _reader in scanning_and_populating_reader::fill_buffer
in the `reader_finished` case, use a gentler, _read_next_partition flag
on which `read_next_partition` will be called in the next iteration.

Then, read_next_partition can close _reader only before overwriting it
with a new reader.  Otherwise, if _reader is always closed in the
``reader_finished` case, we end up hitting premature end_of_stream.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-30-bhalevy@scylladb.com>
2021-02-17 19:06:21 +02:00
Benny Halevy
57540dae42 mutation_query: mark reconcilable_result_builder constructor noexcept
With result_memory_accounter begin nothrow move constructible
reconcilable_result_builder does not throw.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-67-bhalevy@scylladb.com>
2021-02-17 18:56:12 +02:00
Benny Halevy
92e0e84ee5 database: futurize remove
In preparation for futurizing the querier_cache api.

Coroutinize drop_column_family while at it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-61-bhalevy@scylladb.com>
2021-02-17 18:52:53 +02:00
Benny Halevy
5263ab0e9d row_cache: read_context: use query-request is_single_partition helper
Rather than hand-coding the same logic.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-32-bhalevy@scylladb.com>
2021-02-17 18:29:39 +02:00
Benny Halevy
35256d1b92 treewide: explicitly use flat_mutation_reader_opt
Unlike flat_mutation_reader_opt that is defined using
optimized_optional<flat_mutation_reader>, std::optional<T> does not evaluate
to `false` after being moved, only after it is explicitly reset.

Use flat_mutation_reader_opt rather than std::optional<flat_mutation_reader>
to make it easier to check if it was closed before it's destroyed
or being assigned-over.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-6-bhalevy@scylladb.com>
2021-02-17 17:57:34 +02:00
Avi Kivity
c63e26e26f Merge 'cdc: Limit size of topology description' from Piotr Jastrzębski
Currently, whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.

This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.

This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.

This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.

For more details see #7961, #7993 and #7985.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8048

* github.com:scylladb/scylla:
  cdc: Limit size of topology description
  cdc: Extract create_stream_ids from topology_description_generator
2021-02-17 15:43:53 +02:00
Piotr Jastrzebski
649f254863 cdc: Limit size of topology description
Currently, whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.

This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.

This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.

This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.

For more details see #7961, #7993 and #7985.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-02-17 13:24:40 +01:00
Avi Kivity
001652815c Merge 'imr: switch back to open-coded description of structures' from Michał Chojnowski
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), will speed up migrating other components to this
IMR format, as the IMR infrastructure reduces code bloat, makes the code
more readable via declarative type descriptions as well as safer.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.

This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because related code evolved a lot since the merge. Also, going
back to the previous code would mean we regress as we'd revert the move
to fragmented buffers. So this revert is only conceptual, it changes the
underlying infrastructure back to the previous open-coded one, but keeps
the fragmented buffers, as well as the interface of the related
components (to the extent possible).

Fixes: #5578

Closes #8106

* github.com:scylladb/scylla:
  imr: switch back to open-coded description of structures
  utils: managed_bytes: add a few trivial helper methods
  utils: fragment_range: move FragmentedView helpers to fragment_range.hh
  utils: fragment_range: add single_fragmented_mutable_view
  utils: fragment_range: implement FragmentRange for fragment_range
  utils: mutable_view: add front()
  types: remove an unused helper function
  test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions
  test: mutation_test: remove an obsolete assertion
  test: mutation_test: initialize an uninitialized variable
  test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test
2021-02-17 13:40:16 +02:00
Botond Dénes
ba7a9d2ac3 imr: switch back to open-coded description of structures
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), will speed up migrating other components to this
IMR format, as the IMR infrastructure reduces code bloat, makes the code
more readable via declarative type descriptions as well as safer.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.

This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because related code evolved a lot since the merge. Also, going
back to the previous code would mean we regress as we'd revert the move
to fragmented buffers. So this revert is only conceptual, it changes the
underlying infrastructure back to the previous open-coded one, but keeps
the fragmented buffers, as well as the interface of the related
components (to the extent possible).

Fixes: #5578
2021-02-16 23:43:07 +01:00
Michał Chojnowski
25a9569cc4 utils: managed_bytes: add a few trivial helper methods
We will use them in the upcoming IMR removal patch.
2021-02-16 23:43:07 +01:00
Michał Chojnowski
3f248ca7cc utils: fragment_range: move FragmentedView helpers to fragment_range.hh
In the upcoming IMR removal patch we will need read_simple() and similar helpers
for FragmentedView outside of types.hh. For now, let's move them to
fragment_range.hh, where FragmentedView is defined. Since it's a widely included
header, we should consider moving them to a more specialized header later.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
8a06a576aa utils: fragment_range: add single_fragmented_mutable_view
We will use it later in the upcoming IMR removal patch.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
7b662b9315 utils: fragment_range: implement FragmentRange for fragment_range
This will allow us to pass FragmentedView instances to places where
FragmentRange is expected.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
f972f90193 utils: mutable_view: add front()
We will use it in the upcoming patches.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
9e591c6634 types: remove an unused helper function 2021-02-16 21:35:14 +01:00
Michał Chojnowski
6b8a69e01f test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions
The off-by-one error would cause
test_multishard_combining_reader_non_strictly_monotonic_positions to fail if
the added range_tombstones filled the buffer exactly to the end.
In such situation, with the old loop condition,
make_fragments_with_non_monotonic_positions would add one range_tombstone too
many to the deque, violating the test assumptions.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
5b79d6ca4c test: mutation_test: remove an obsolete assertion
Due to small value optimizations, the removed assertions are not true in
general. Until now, atomic_cell did not use small value optimizations, but
it will after upcoming changes.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
aa60f28a09 test: mutation_test: initialize an uninitialized variable
It was assumed to be zero-initialized, but C++ does not guarantee that.
It has to be initialized explicitly.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
52bd190bb3 test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test
sstable_run_based_compaction_test assumed that sstables are freed immediately
after they are fully processed.
Hovewer, since commit b524f96a74,
mutation_reader_merger releases sstables in batches of 4, which breaks the
assumption. This fix adjusts the test accordingly.

Until now, the test only kept working by chance: by coincidence, the number of
test sstables processed by merging_reader in a single fill_buffer() call was
divisible by 4. Since the test checks happen between those calls,
the test never witnessed a situation when an sstable was fully processed,
but not released yet.

The error was noticed during the work on an upcoming patch which changes the
size of mutation_fragment, and reduces the number of test sstables processed
in a single fill_buffer() call, which breaks the test.
2021-02-16 21:35:14 +01:00
Konstantin Osipov
d293966366 raft: add a unit test for voting
Test duplicate votes, votes from non-members and voting
in joint configuration.
2021-02-16 23:15:16 +03:00
Konstantin Osipov
3478389d60 raft: do not account for the same vote twice
While duplicate votes are not allowed by Raft rules, it is possible
that a vote message is delivered multiple times.

The current voting implementation does reject votes from non-members,
but doesn't check for duplicate votes.

Keep track of who has voted yet, and reject duplicate votes.

A unit test follows.
2021-02-16 23:15:16 +03:00
Konstantin Osipov
ffd38de5fe raft: remove fsm::set_configuration()
Set either tracker or votes configuration explicitly.
This saves a few lines and simplifies unit tests.
2021-02-16 23:15:16 +03:00
Konstantin Osipov
b941ca9bae raft: consistently use configuration from the log 2021-02-16 23:15:16 +03:00
Konstantin Osipov
75eddaf493 raft: add ostream serialization for enum vote_result 2021-02-16 23:15:16 +03:00
Konstantin Osipov
e099003c7c raft: advance commit index right after leaving joint configuration
Imagine the cluster is in joint configuration {{A, B}, {A, B, C, D, E}}.
Server stable indexes are:

Server  Stable Index
A       5
B       5
C       6
D       7
E       8

The commit index would be 5 if we use joint configuration, and 6
if we assume we left it. Left it happen without an extra FSM
step.
2021-02-16 23:15:16 +03:00
Konstantin Osipov
1bdb3fc8a9 raft: add tracker test 2021-02-16 23:15:16 +03:00
Konstantin Osipov
63965f46f4 raft: tidy up follower_progress API
Make the API More explicit so it's available for testing.
2021-02-16 23:15:16 +03:00
Konstantin Osipov
74879fab09 raft: update raft::log::apply_snapshot() assert
apply_snapshot() doesn't support applying the same snapshot
twice. The caller must check the current snapshot before
applying.
2021-02-16 23:15:12 +03:00
Konstantin Osipov
6ee3aedcc2 raft: add a unit test for raft::log 2021-02-16 23:12:01 +03:00
Konstantin Osipov
c35f029be1 raft: rename log::non_snapshoted_length() to log::length()
The old name was incorrect, in case apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.

Fix spelling in related comments.

Rename fsm::wait() to fsm::wait_max_log_length(), it's a more
specific name.
2021-02-16 23:12:01 +03:00
Konstantin Osipov
9e1a652805 raft: inline raft::log::truncate_tail()
It's the core of apply_snapshot() work and is only used in it.

Now that truncate_tail is inline, truncate_head() can be
called simply truncate().
2021-02-16 23:10:58 +03:00
Konstantin Osipov
f7fb788edf raft: ignore AppendEntries RPC with a very old term
Do not assert on an outdated message.
2021-02-16 23:07:58 +03:00
Konstantin Osipov
7236f081c1 raft: remove log::start_idx()
Replace it with a private _first_idx, which is maintained
along with the rest of class log state.
_first_idx is a name consistent with counterpart last_idx().

Do not use a function since going forward we may want
to remove Raft index from struct log_entry, so should rely
less on it.

This fixes a bug when _last_conf_idx was not reset
after apply_snapshot() because start_idx() was pointing
to a non-existent entry.
2021-02-16 23:06:23 +03:00
Konstantin Osipov
59ea383c7d raft: return a correct last term on an empty log
If the log is empty, we must use snapshot's term,
since the log could be right after taking a snapshot
when no trailing entries were kept.

This fixes a rare possible bug when a log matching
rule could be violated during elections by a follower
with a log which was just truncated after a snapshot.

A separate unit test for the issue will follow.
2021-02-16 21:07:05 +03:00
Konstantin Osipov
6c14775b20 raft: do not use raft::log::start_idx() outside raft::log()
raft::log::start_idx() is currently not meaningful
in case the log is empty.

Avoid using it in fsm::replicate_to() and avoid manual search for
previous log term, instead encapsulate the search in log::term_for().

As a side effect we currently return a correct term (0)
when log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.

This change happens to reduce the overall line count.

While at it, improve the comments in raft::replicate_to().
2021-02-16 21:05:44 +03:00
Nadav Har'El
946e63ee6e cql-pytest: remove "xfail" tag from two passing tests
Issue #7595 was already fixed last week, in commit
b6fb5ee912, so the two tests which failed
because of this issue no longer fail and their "xfail" tag can be removed.

Refs #7595.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210216160606.1172855-1-nyh@scylladb.com>
2021-02-16 19:17:22 +02:00
Nadav Har'El
737c1c6cc7 cql-pytest: Additional JSON tests
This patch adds several additional tests o test/cql-pytest/test_json.py
to reproduce additional bugs or clarify some non-bugs.

First, it adds a reproducer for issue #8087, where SELECT JSON may create
invalid JSON - because it doesn't quote a string which is part of a map's
key. As usual for these reproducers, the test passes on Cassandra, and fails
on Scylla (so marked xfail).

We have a bigger test translated from Cassandra's unit tests,
cassandra_tests/validation/entities/json_test.py::testInsertJsonSyntaxWithNonNativeMapKeys
which demonstrates the same problem, but the test added in this patch is much
shorter and focuses on demonstrating exactly where the problem is.

Second, this patch adds a test test verifies that SELECT JSON works correctly
for UDTs or tuples where one of their components was never set - in such a
case the SELECT JSON should also output this component, with a "null" value.
And this test works (i.e., produces the same result in Cassandra and Scylla).
This test is interesting because it shows that issue #8092 is specific to the
case of an altered UDT, and doesn't happen for every case of null
component in a UDT.

Refs #8087
Refs #8092

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210216150329.1167335-1-nyh@scylladb.com>
2021-02-16 16:05:31 +01:00
Avi Kivity
2f3b265dac Update seastar submodule
* seastar 76cff58964...e53a1059f9 (18):
  > rpc: streaming sink: order outgoing messages
Fixes #7552.
  > http: fix compilation issues when using clang++
  > http/file_handler: normalize file-type for mime detection
  > http/mime_types: add support for svg+xml
  > reactor: simplify get_sched_stats()
  > Merge "output_stream: make api noexcept" from Benny
  > Merge " input_stream: make api noexcept" from Benny
  > rpc: mark 'protocol' class as final
  > tls: reloadable_certificate inotify flag is wrong
Fixes #8082.
  > cli: Ignore the --num-io-queues option
  > io_queue: Do not carry start time in lambda capture
  > fstream: Cancel all IO-s on file_data_source_impl close
  > http: add "Transfer-Encoding: chunked" handling
  > http: add ragel parsers for chunks used in messages with Transfer-Encoding: chunked
  > http: add request content streaming
  > http: add reading/skipping all bytes in an input_stream
  > Merge "Reduce per-io-queue container for prio classes" from Pavel Emelyanov
  > seastar-addr2line: split multiple addresses on the same line
2021-02-16 16:19:26 +02:00
Avi Kivity
789233228b messaging: don't inherit from seastar::rpc::protocol
messaging_service's rpc_protocol_server_wrapper inherits from
seastar::rpc::protocol::server as a way to avoid a
is unfortunate, as protocol.hh wasn't designed for inheritance, and
is not marked final.

Avoid this inheritance by hiding the class as a member. This causes
a lot of boilerplate code, which is unfortunate, but this random
inheritance is bad practice and should be avoided.

Closes #8084
2021-02-16 16:04:44 +02:00
Gleb Natapov
c9392095ce cql3: store cf_prop_defs as optional instead of shared_ptr
It been a shard_ptr is a remnant of translation from Java.
Message-Id: <20210216123931.80280-3-gleb@scylladb.com>
2021-02-16 15:58:38 +02:00
Gleb Natapov
805da054e7 cql3: store cf_name as optional in cf_statement instead of shared_ptr
It been a shard_ptr is a remnant of translation from Java.
Message-Id: <20210216123931.80280-2-gleb@scylladb.com>
2021-02-16 15:58:37 +02:00
Gleb Natapov
6335af625e cql3: assert that unengaged optional is not accessed in keyspace_element_name::get_keyspace()
Message-Id: <20210216085545.54753-2-gleb@scylladb.com>
2021-02-16 15:36:00 +02:00
Gleb Natapov
200ca974c3 Do not access potentially unengaged optional in keyspace_element_name
Currently there are places that call
keyspace_element_name::get_keyspace() without checking that _ks_name is
engaged. Fix those places.
Message-Id: <20210216085545.54753-1-gleb@scylladb.com>
2021-02-16 15:35:59 +02:00
Botond Dénes
4d309fc34a repair: row_level: invoke on_internal_error() on out-of-order partitions
repair_writer::do_write(): already has a partition compare for each
mutation fragment written, do determine whether the fragment belongs to
another partition or not. This equal compare can be converted to a
tri_compare at no extra cost allowing for detecting out-of-order
partitions, in which case `on_internal_error()` is called.

Refs: #7623
Refs: #7552

Test: dtest(RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test:debug)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210216074523.318217-1-bdenes@scylladb.com>
2021-02-16 15:31:40 +02:00
Benny Halevy
50ca693a02 main: disable stall detector during startup
We see long reactor stalls from `logalloc::prime_segment_pool`
in debug mode yet the stall detector's purpose is to detect
reactor stalls during normal operation where they can increase
the latency of other queries running in parallel.

Since this change doesn't actually fix the stalls but rather
hides them, the following annotations will just refrence
the respective github issues rather than auto-close them.

Refs #7150
Refs #5192
Refs #5960

Restore blocked_reactor_notify_ms right before
starting storage_proxy.  Once storage_proxy is up, this node
affects cluster latency, and so stalls should be reported so
they can be fixed.

Test: secondary_index_test --blocked-reactor-notify-ms 1 (release)
DTest: CASSANDRA_DIR=../scylla/build/release SCYLLA_EXT_OPTS="--blocked-reactor-notify-ms 2" ./scripts/run_test.sh materialized_views_test:TestMaterializedViews.interrupt_build_process_with_resharding_half_to_max_test

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210216112052.27672-1-bhalevy@scylladb.com>
2021-02-16 13:28:31 +02:00
Tomasz Grabiec
446ea07ac6 Merge "raft: server instance init and raft RPC handlers" from Pavel Solodovnikov
This series provides a `raft_services` class to create and store
a raft schema changes server instances, and also wires up the RPC handlers
for Raft RPC verbs.

* manmanson/raft-api-server-handlers-v10:
  raft: share `raft_gossip_failure_detector` instance across multiple raft rpc instances
  raft: move server address handling from `raft_rpc` to `raft_services` class
  raft: wire up schema Raft RPC handlers
  raft: raft_rpc: provide `update_address_mapping` and dispatcher functions
  raft: pass `group_id` as an argument to raft rpc messages
  raft: use a named constant for pre-defined schema raft group
2021-02-16 11:14:50 +01:00
Pavel Solodovnikov
1ada0abf81 raft: share raft_gossip_failure_detector instance across multiple raft rpc instances
Store an instance inside `raft_services` and reuse it for
all raft groups created and managed by `raft_services` instance.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-16 13:09:12 +03:00
Pavel Solodovnikov
8c2a904dc8 raft: move server address handling from raft_rpc to raft_services class
This allows to decouple `raft_gossip_failure_detector` from being
dependent on a particular rpc instance and thus makes it possible
to share the same failure detector instance among all raft servers
since they are managed in a centralized way by a `raft_services`
instance.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-16 13:09:06 +03:00
Pavel Solodovnikov
63cdf4694d raft: wire up schema Raft RPC handlers
This patch adds registration and de-registration of the
corresponding Raft RPC verbs handlers.

There is a new `raft_services` class that
is responsible for initializing the raft RPC verbs and
managing raft server instances.

The service inherits `seastar::peering_sharded_service<T>`,
because we need to route the request to the appropriate shard
which is handled by the `shard_for_group` function (currently
only handling schema raft group to land on shard 0, otherwise
throws an exception).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-16 13:08:59 +03:00
Nadav Har'El
1e1cbaf589 docs/alternator: clean up description of DynamoDB compatibility
We had Alternator's current compatibility with DynamoDB described in
two places - alternator.md and compatibility.md. This duplication was
not only unnecessary, in some places it led to inconsistent claims.

In general, the better description was in compatibility.md, so in
this patch we remove the compatibility section from alternator.md
and instead link to compatibility.md. There was a bit of information
that was missing in compatibility.md, so this patch adds it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210215203057.1132162-1-nyh@scylladb.com>
2021-02-16 08:48:28 +01:00
Pavel Emelyanov
9baf1226dc test/memory_footpring: Print radix tree node sizes
After switching cells storage onto compact radix tree it
becomes useful to know the tree nodes' sizes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:41:09 +03:00
Pavel Emelyanov
1bdfa355ea row: Remove old storages
Now when the 3rd storage type (radix tree) is all in, old
storage can be safely removed.  The result is:

1. memory footprint

sizeof(class row):  112 => 16 bytes
sizeof(rows_entry): 126 => 120 bytes

the "in cache" value depends on the number of cells:

num of cells     master       patch
         1       752         656
         2       808         712
         3       864         768
         4       920         824
         5       968         936
         6      1136         992
         ...
         16     1840        1672
         17     1904        1992  (+88)
         18     1976        2048  (+72)
         19     2048        2104  (+56)
         20     2120        2160  (+40)
         21     2184        2208  (+24)
         22     2256        2264  ( +8)
         23     2328        2320
         ...
         32     2960        2808

After 32 cells the storage switches into rbtree with
24-bytes per-cell overhead and the radix tree improvement
rocketlaunches

           64     7872        6056
           128   15040        9512
           256   29376       18568

2. perf_mutation test is enhanced by this series and the
   results differ depending on the number of columns used

                    tps value
--column-count    master   patch
          1       59.9k    57.6k  (-3.8%)
          2       59.9k    57.5k
          4       59.8k    57.6k
          8       57.6k    57.7k  <- eq
         16       56.3k    57.6k
         32       53.2k    57.4k  (+7.9%)

A note on this. Last time 1-column test was ~5% worse which
was explained by inline storage of 5 cells that's present on
current implementation and was absent in radix tree.

An attempt to make inline storage for small radix trees
resulted in complete loss of memory footprint gain, but gave
fraction of percent to perf_mutation performance. So this
version doesn't have inline nodes.

The 1.2% improvement from v2 surprisingly came from the
tree::clone_from() which in v2 was work-around-ed by slow
walk+emplace sequence while this version has the optimized
API call for cloning.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:35:06 +03:00
Pavel Emelyanov
2053b1c202 row: Prepare row::equal for switch
Same as the previous patch, re-implement the row::equal to use
the radix_tree iterator for comparison of two index:cell sequences.

The std::equal() doesn't work here, since the predicate-fn needs
to look at both iterators to call it.key() on (radix tree API
feature), while std::equal provides only the T&s in it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:31:52 +03:00
Pavel Emelyanov
b5527b3635 row: Prepare row::difference for switch
The method effectively walks two pairs of <colun_id, cell> and
applies the difference to separare row instance. The code added
is the copy of the same code below this hunk with the mechanical
substitution:

 c.first -> c.key()
 c.second -> c->cell
 it->first -> it.key()
 it->second -> it.cell

because first-s are column_id-s reported by radix tree iterator
.key() method and second-s are cells, that were referenced by
current code in get_..._vector() from boost::irange and are now
directly pointed to by raidx tree iterator.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:27:00 +03:00
Pavel Emelyanov
f006acc853 row: Introduce radix tree storage type
Currently class row uses a union of a vector and a set to keep
the cells and switches between them. Add the 3rd type with the
radix tree, but never switch to it, just to show how the operations
would look like. Later on vector and set will be removed and the
whole row will be immediately switched to the radix tree storage.

NB: All the added places have indentation deliberately broken, so
that next patch will just remove the surrounding (old) code away
and (most of) the new one will happen in its place instantly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:27:00 +03:00
Pavel Emelyanov
5f276b279e row-equal: Re-declare the cells_equal lambda
For further patching it's handy to have this helper to accept
column_id and atomic_cell_or_collection arguments, instead of
an std::pair of these two.

This is to facilitate next patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:27:00 +03:00
Pavel Emelyanov
aa85bc790b test: Add tests for radix tree
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:27:00 +03:00
Pavel Emelyanov
a5bd68ae5d utils: Compact radix tree
The tree uses integral type as a search key. On each level the local index
is next 7 bits from the key, respectively for 32-bit key we have 5 levels.
The tree uses 2 memory packing techniques -- prefix compaction and growing
node layouts.

The prefix compaction is used when a node has only one child. In this case
such a node is replaced in its parent with this only child and the child in
question keeps "prefix:length" pair on board, that's used to check if the
short-cut lookup took the correct path.

The growing node layouts makes the nodes occupy as much memory as needed
to keep the _present_ keys and there are 2 kinds of layouts.

Direct layout is array, intra-node search is plain indexing. The layout
storage grows in vector-like manner, but there's a special case for the
maximum-sized layout that helps avoiding some boundary checks.

Indirect layout keeps two arrays on board -- with values and with indices.
The intra-node search is thus a lookup in the latter array first. This
layout is used to save memory for sparse keys. Lookup is optimized with
SIMD instructions.

Inner nodes use direct layouts, as they occupy ~1% of memory and thus
need not to be memory efficient. At the same time lookup of a key in the
tree potentially walks several inner nodes, so speeding up search for
them is beneficial.

Leaf nodes are indirect, since they are 99% of memory and thus need to
be packed well. The single indirect lookup when searching in the tree
doesn't slow things down notably even on insertion stress test.

Said that
 * inner nodes are: header + 4 / 8 / 16 / 32 / 64 / 128 pointers
 * leaf nodes are : header + 4 / 8 / 16 / 32 bytes + <same nr> objects
                 or header + 16 bytes bitmap + 128 objects

The header is
 - backreference (8 bytes)
 - prefix (4 bytes)
 - size, layout, capacity (1 byte each)

The iterator is one-direction (for simplicity) but it enough for its main
target -- the sparse array of cells on a row. Also the iterator has an
.index() method that reports back the index of the entry at which it points.
This greatly simplifies the tree scans by the class row further.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 19:25:10 +03:00
Piotr Sarna
495b7b5596 alternator: use unique_ptr for storing attribute paths
Previous commit eliminated the only copying of the attribute paths,
so it's now safe to make the object noncopyable.
Message-Id: <5468e8c17d3d42a03c1dd33706bbaac0c58959ce.1613398751.git.sarna@scylladb.com>
2021-02-15 18:22:59 +02:00
Piotr Sarna
7e1641224c alternator: batch: pass attrs_to_get by a shared pointer
The attrs_to_get object was previously copied, but it's quite
a heavyweight operations, since this object may contain an
instance of std::map or std::unordered_map.
To avoid copying whole maps, the object is wrapped in a shared
const pointer.
Message-Id: <75ad810de16c630b65ae8d319cb4b37e1de8085f.1613398751.git.sarna@scylladb.com>
2021-02-15 18:22:56 +02:00
Tomasz Grabiec
f86108aef1 Merge "raft: move ticking to external code" from Alejo
As Gleb suggested in a previous review, remove ticker from raft and
leave calling tick() to external code.

While there, tick faster to speed up tests.

* https://github.com/alecco/scylla/tree/tests-17-remove-ticker:
  raft: replication test: reduce ticker from 100ms to 1ms
  raft: drop ticker from raft
2021-02-15 18:14:03 +02:00
Botond Dénes
c24f350846 scylla-gdb.py: nonwrapping_interval_printer: fix compatibility with 4.2+
Use the `_interval` member instead of the old `_range` field, but stay
compatible with pre 4.2 releases, falling back to `_range` when
`_interval` doesn't exist.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210215104008.166746-1-bdenes@scylladb.com>
2021-02-15 18:14:03 +02:00
Pavel Emelyanov
d43ad8738c array-search: Add helpers to search for a byte in array
The radix tree code will need the code to find 8-bit value
in an array of some fixed size, so here are the helpers.

Those that allow for SIMD implementations are such for x86_64

TODO: Add aarch64

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:47:59 +03:00
Pavel Emelyanov
0ad361b380 test/perf_collection: Add callback to check the speed of clone
In some places scylla clones collections of objects, so it's
sometimes needed to measure the speed of this operation.

This patch adds a placeholder for it, but no implementations
for any supported collections. It will be added soon for radix
tree.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:46:37 +03:00
Pavel Emelyanov
767253fe24 test/perf_mutation: Add option to run with more than 1 columns
The --column-count makes the test generate schema with
the given numbers of columns and make mutation maker
fill random column with the value on each iteration.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:45:42 +03:00
Pavel Emelyanov
fc84ab3418 test/perf_mutation: Prepare to have several regular columns
Teach the schema builder and test itself to work on more
than one regular column, but for now only use 1, as before.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:44:34 +03:00
Pavel Emelyanov
21adff2a41 test/perf_mutation: Use builder to build schema
The test will be taught to use more than one regular
column, so switch to builder in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:44:06 +03:00
Piotr Sarna
cbbb7f08a0 Merge 'Alternator: support nested attribute paths...
in all expressions' from Nadav Har'El.

This series fixes #5024 - which is about adding support for nested attribute
paths (e.g., a.b.c[2]) to Alternator.  The series adds complete support for this
feature in ProjectionExpression, ConditionExpression, FilterExpression and
UpdateExpression - and also its combination with ReturnValues. Many relevant
tests - and also some new tests added in this series - now pass.

The first patch in the series fixes #8043 a bug in some error cases in
conditions, which was discovered while working in this series, and is
conceptually separate from the rest of the series.

Closes #8066

* github.com:scylladb/scylla:
  alternator: correct implemention of UpdateItem with nested attributes and ReturnValues
  alternator: fix bug in ReturnValues=UPDATED_NEW
  alternator: implemented nested attribute paths in UpdateExpression
  alternator: limit the depth of nested paths
  alternator: prepare for UpdateItem nested attribute paths
  alternator: overhaul ProjectionExpression hierarchy implementation
  alternator: make parsed::path object printable
  alternator-test: a few more ProjectionExpression conflict test cases
  alternator-test: improve tests for nested attributes in UpdateExpression
  alternator: support attribute paths in ConditionExpression, FilterExpression
  alternator-test: improve tests for nested attributes in ConditionExpression
  alternator: support attribute paths in ProjectionExpression
  alternator: overhaul attrs_to_get handling
  alternator-test: additional tests for attribute paths in ProjectionExpression
  alternator-test: harden attribute-path tests for ProjectionExpression
  alternator: fix ValidationException in FilterExpression - and more
2021-02-15 15:45:49 +02:00
Tomasz Grabiec
508f928220 tests: sstables: Test sstable write fails on missing partition_end mid-stream
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210115163055.74398-1-tgrabiec@scylladb.com>
2021-02-15 15:45:49 +02:00
Benny Halevy
e532585126 test: sstables::test_env: do_with: futurize_invoke func
Otherwise, if `func` throws, test_env isn't stopped, as it should.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210214190157.211858-1-bhalevy@scylladb.com>
2021-02-15 15:45:49 +02:00
Wojciech Mitros
1819be5ebc canonical_mutation: make the data type non-contiguous
The canonical_mutation type can contain a large mutation, particularly
when the mutation is a result of converting a big schema. Its data
was stored in a field of type 'bytes', which is non-contiguous and
may cause a large allocation.
This is fixed by simply changing the type to 'bytes_ostream', which is
fragmented. The change is compatible because the idl type 'bytes' is compatible
with 'bytes_ostream' as a result of dcf794b, and all canonical_mutations's
methods use the field as an input stream (ser::as_input_stream), which can
be used on 'bytes_ostream' too.

Fixes #8074

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #8075
2021-02-15 10:24:47 +01:00
Nadav Har'El
f884104eed cql-pytest: add more JSON tests
This patch adds several more tests reproducing bugs in toJson() and
SELECT JSON.

First add two xfailing tests reproducing two toJson() issues - #7988
and #8002. The first is that toJson() incorrectly formats values of the
"time" type - it should be a string but Scylla forgets the quotes.
The second is that toJson() format "decimal" values as JSON numbers
without using an exponent, resulting in memory allocation failure
for numbers with high exponents, like 1e1000000000.

The actual test for 1e1000000000 has to be skipped because in
debug build mode we get a crash trying this huge allocation.
So instead, we check 1e1000 - this generates a string of 1000
characters, which is much too much (should just be "1e1000")
but doesn't crash.

Then we add a reproducing test for issue #8077: When using SELECT JSON
on a function, such as count(*), ttl(v) or intAsBlob(v), Cassandra has
a specific way how it formats the result in JSON, and Scylla should do
it the same way unless we have a good reason not to.

As usual, the new tests passes on Cassandra, fails on Scylla, so is marked
xfail.

Refs #7988
Refs #8002
Refs #8077.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210214210727.1098388-1-nyh@scylladb.com>
2021-02-15 10:55:43 +02:00
Nadav Har'El
9e029f09e5 docs: improve CONTRIBUTING.md
Start improving CONTRIBUTING.md, as suggested in issue #8037:

1. Incorporate the few lines we had in coding-style.md into CONTRIBUTING.md.
   This was mostly a pointer to Seastar's coding style anyway, so it's not
   helpful to have a separate file which hopeful developers will not find
   anyway.

2. Mention the Scylla developers mailing list, not just the Scylla users
   mailing list. The Scylla developers mailing list is where all the action
   happens, and it's very odd not to mention it.

3. The decisions that github pull requests are forbidden was retracted
   a long time ago, so change the explanation on pull requests.

4. Some smaller phrasing changes.

Refs #8037.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210214152752.1071313-1-nyh@scylladb.com>
2021-02-14 22:09:24 +02:00
Nadav Har'El
8c9935359c build: Stir -O levels
Merged patch series from Pavel Emelyanov:

The default -O<> levels are considered to produce slow and tedious to
test code, so it's tempting to increase the level. On the other hand,
there was some complains about re-compile-mostrly work that would suffer
from slower builds.

This set tries to find a reasonable compromise -- raise the default opt
levels and provide the ability to configure one if needed.

* 'br-cxx-o-levels-2' of github.com:xemul/scylla:
	configure: Switch debug build from -O0 to -Og
	configure: Switch dev build from -O1 to -O2
	configure: Make -O flag configurable
2021-02-14 22:09:24 +02:00
Avi Kivity
cb4e1bb0b9 logalloc: reduce gap between std min_free and logalloc min_free
With the larger gap, logalloc reserved more memory for std than
the background reclaim threshold for running, so it was triggered
rarely.

With the gap reduced, background reclaim is constantly running in
an allocating workload (e.g. cache misses).
2021-02-14 19:09:29 +02:00
Avi Kivity
ca0c006b37 logalloc: background reclaim
Set up a coroutine in a new scheduling group to ensure there is
a "cushion" of free memory. It reclaims in preemptible mode in
order to reduce reactor stalls (constrast with synchronous reclaim
that cannot preempt until it achieved its goal).

The free memory target is arbitrarily set at 60MB. The reclaimer's
shares are proportional to the distance from the free memory target;
so a workload that allocates memory rapidly will have the background
reclaimer working harder.

I rolled my own condition variable here, mostly as an experiment.
seastar::condition_variable requires several allocations, while
the one here requires none. We should formalize it after we gain
more experience with it.
2021-02-14 19:09:29 +02:00
Avi Kivity
35076dd2d3 logalloc: preemptible reclaim
Add an option (currently unused by all callers) to preempt
reclaim. If reclaim is preempted, it just stops what it is
doing, even if it reclaimed nothing. This is useful for background
reclaim.

Currently, preemption checks are on segment granularity. This is
probably too coarse, and should be refined later, but is already
better than the current granularity which does not allow preemption
until the entire requested memory size was reclaimed.
2021-02-14 19:09:29 +02:00
Alejo Sanchez
5e49650146 raft: replication test: reduce ticker from 100ms to 1ms
To speed up replication test reduce the tick time from 100ms to 1ms

Speed up: debug 3.7 to 2.5, dev 2.9 to 2.1 seconds

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-14 09:59:06 -04:00
Alejo Sanchez
b41a6822e8 raft: drop ticker from raft
Remove ticker callbacks from raft::server.
External code should periodically call raft::server::tick().

Update replication_test accordingly.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-14 09:41:42 -04:00
Nadav Har'El
ea338db581 cql-pytest: reproduce bug in setting time column with integer
This test reproduces issue #7987, where Scylla cannot set a time column
with an integer - wheres the documentation says this should be possible
and it also works in Cassandra.

The test file includes tests for both ways of setting a time column
(using an integer and a string), with both prepared and unprepared
statements, and demonstrates that only one combination fails in Scylla -
an unprepared statement with an integer. This test xfails on Scylla
and passes on Cassandra, and the rest pass on both.

Refs #7987.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210128215103.370723-1-nyh@scylladb.com>
2021-02-14 15:09:38 +02:00
Nadav Har'El
49cd9b3fd5 alternator: correct implemention of UpdateItem with nested attributes and ReturnValues
This patch fixes the last missing part of nested attribute support in
UpdateItem - returning the correct attributes when ReturnValues is requested.
When the expression says "a.b = :val" and ReturnValues is set to UPDATED_OLD
or UPDATED_NEW, only the actual updated attribute a.b should be returned, not
the entire top-level attribute a as we did before this patch.

This patch was made very simple because our existing hierarchy_filter()
function already does exactly the right thing, and can trivially be made to
accept any attribute_path_map<T> (in our case attribute_path_map<action>),
not just attrs_to_get as it did until now.

This patch also adds several more checks to the test in test_returnvalues.py
to improve the test's coverage even more. Interestingly, I discovered two
esoteric cases where DynamoDB does something which makes little sense, but
apparently simplified their implementation - but the beautiful thing is that
it also simplifies our implementation! See long comments about these two
cases in the test code.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
964500e47a alternator: fix bug in ReturnValues=UPDATED_NEW
Commit 0c460927bf broke UpdateItem's
ReturnValues=UPDATED_NEW by moving previous_item while it is still
needed. None of the existing tests broke because none of them needed
previous_item after it was moved - but it started to break when we
add support for nested attribute paths, which need this previous_item.

So this patch returns the move to a copy, as it was before the
aforementioned patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
33685a683e alternator: implemented nested attribute paths in UpdateExpression
This patch adds full support for nested attribute paths (e.g., a.b[3].c)
in UpdateExpression. After in previous patches we already added such
support for ProjectionExpression, ConditionExpression and FilterExpression
this means the nested attribute paths feature is now complete, so we
remove the warning from the documents. However, there is one last loose
end to tie and we will do it in the next patch: After this patch, the
combination of UpdateExpression with nested attributes and ReturnValues
is still wrong, and the test for it in test_returnvalues.py still xfails.

Note that previous patches already implemented support for attribute paths
in expression evaluations - i.e., the right-hand side of UpdateExpression
actions, and in this patch we just needed to implement the left hand side:
When an update action is on an attribute a.b we need to read the entire
content of the top-level a (an RWM operation), modify just the b part of
its json with the result of the action, and finally write back the entire
content of a. Of course everything gets complicated by the fact that we
can have multiple actions on multiple pieces of the same JSON, and we also
need to detect overlapping and conflicting actions (we already have this
detection in the attribute_path_map<> class we introduced in a previous
patch).

I decided to leave one small esoteric difference, reproduced by the xfailing
test_update_expression.py::test_nested_attribute_remove_from_missing_item:
As expected, "SET x.y = :val" fails for an item if its attribute x doesn't
exist or the item itself does not exist. For the update expression
"REMOVE x.y", DynamoDB fails if the attribute x doesn't exist, but oddly
silently passes if the entire item doesn't exist. Alternator does not
currently reproduce this oddity - it will fail this write as well.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
7789606545 alternator: limit the depth of nested paths
DynamoDB limits the depth of a nested path in expressions (e.g. "a.b.c.d")
to 32 levels. This patch adds the same limit also to Alternator.

The exact value of this limit is less important (although it did make
sense to choose the same limit as DynamoDB does), but it's important
to have *some* limit: It's often convenient to handle paths with a
recursive algorithm, and if we allow unlimited path depth, it can
result in unlimited recursion depth, and a crash. Let's avoid this
possibility.

We detect the over-long path while building the parsed::path object
in the parser, and generate a parse error.

This patch also includes a test that verifies that both Alternator
and DynamoDB have the same 32-level nesting limit on paths.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
4c7e27c688 alternator: prepare for UpdateItem nested attribute paths
This patch prepares UpdateItem for updating of nested attribute paths
(e.g., "SET a.b = :val"), but does not yet support them.

Instead of _update_expression holding an unsorted list of "actions",
we change it to hold a attribute_path_map of actions. This will allow
us to process all the actions on a top-level attribute together, and
moreover gets us "for free" the correct checking for overlapping and
conflicting updates - exactly the same checking we already had in
attribute_path_map for ProjectionExpression. Other than this change,
most of this patch is just code movement, not functional changes.

After this patch, the tests for update path overlap and conflict pass:
test_update_expression_multi_overlap_nested and
test_update_expression_multi_conflict_nested.

We can also mark test_update_expression_nested_attribute_rhs as passing -
this test involves an attribute path in the right-hand-side of an update,
but the left-hand-side is still a top-level attribute, so it works (it
actually worked before this patch - it started working when we implemented
attribute paths in expressions, for ConditionExpression and
FilterExpression).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
7c5db2da83 alternator: overhaul ProjectionExpression hierarchy implementation
For ProjectionExpression we implemented a hierarchical filter object which
can be used to hold a tree of attribute paths groups by a the top-level
attributes, and also detect overlapping and conflicting entries.

For UpdateExpression, we need almost exactly the same object: We need to
group update actions (e.g., SET a.b=3) by the top-level attribute, and
also detect and fail overlapping or conflicting paths.

So in this patch we rewrite the data structure we had for ProjectionExpression
in a more genric manner, using the template attribute_path_map<T> - which
holds data of type T for each attribute path. We also implement a template
function attribute_path_map_add() to add a path/value pair to this map,
and includes all the overlap and conflict detecting logic.

There shouldn't be functional changes in this patch. The ProjectionExpression
code uses the new generic code instead of the specific code, but should work
the same. In the next patch we can use the new generic code to implement
UpdateExpression as well.

The only somewhat functional change is better error messages for
conflicting or overlapping paths - which now include one of the
conflicting paths.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
f78d33dd73 alternator: make parsed::path object printable
Make the parsed::path object printable - which is useful for error messages.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
c2f18e56ea alternator-test: a few more ProjectionExpression conflict test cases
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
de62a8c2d3 alternator-test: improve tests for nested attributes in UpdateExpression
We already had many tests for nested attributes in UpdateExpression, but
this patch adds even more:

 * Test nested attribute in right-hand-side in assignment: z = a.c.x.
 * Test for making multiple changes to the same and different top-level
   attributes in the same update.
 * Additional cases of overlap between multiple changes.
 * Tests for conflict between multiple changes.
 * Tests for writing to a nested path on a non-existent attribute or item.
 * A stronger test for array append sorts the added items.

As this feature was not yet implemented, these tests fail on Alternator,
and pass on DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:24 +02:00
Pavel Solodovnikov
2ed445bfdd raft: raft_rpc: provide update_address_mapping and dispatcher functions
Provide several utility functions which will be used in rpc message
handlers:

1. `update_address_mapping` -- add a new (server_id -> inet_address)
   mapping for a `raft_rpc` instance.
   This is used to update rpc module with a caller address
   upon receiving an rpc message from a yet unknown server.
2. A set of dispatcher functions for every rpc call that forward calls
   to an appropriate `raft::rpc_server` instance (for which `raft::rpc`
   has a back-pointer).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-12 17:55:48 +03:00
Pavel Emelyanov
ffc9cc9aec range-streamer: Remove global storage service reference
The reference is used by range streamer and (!) storage
service itself to find out if the consistent_rangemovement
option is ON/OFF.

Both places already have the database with config at hands
and can be simplified.

v2: spellchecking

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210212095403.22662-1-xemul@scylladb.com>
2021-02-12 15:50:30 +01:00
Tomasz Grabiec
26aa000493 Merge "raft: replication test fixes" from Alejo
Fix rare debug mode hang and a minor fix.

* alejo/tests-16-fix-debug-hang-disruptive-ticks-master-v3:
  raft: replication test: fix debug mode hangs
  raft: replication test: remove unnecessary param
2021-02-11 20:35:35 +01:00
Nadav Har'El
a03a8a89a9 cql-pytest: fix flaky timeuuid_test.py
The test timeuuid_test.py::testTimeuuid sporadically failed, and it turns out
the reason was a bug in the test - which this patch fixes.

The buggy test created a timeuuid and then compared the time stored in it
to the result of the dateOf() CQL function. The problem is that dateOf()
returns a CQL "timestamp", which has millisecond resolution, while the
timeuuid *may* have finer than millisecond resolution. The reason why this
test rarely failed is that in our implementation, the timeuuid almost
always gets a millisecond-resolution timestamp. Only if now() gets called
more than once in one millisecond, does it pick a higher time incremented
by less than a millisecond.

What this patch does is to truncate the time read from the timeuuid to
millisecond resolution, and only then compare it to the result of dateOf().
We cannot hope for more.

Fixes #8060

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210211165046.878371-1-nyh@scylladb.com>
2021-02-11 19:03:58 +02:00
Alejo Sanchez
97338ab53f raft: replication test: fix debug mode hangs
For certain situations where barely enough nodes to elect a new leader
are connected a disruptive candidate can occassionally block the
election.

For example having servers A B C D E and only A B C are active in a
partition. If the test wants to elect A, it has to first make all 3
servers reach election timeout threshold (to make B and C receptive).
Then A is ticked till it becomes a candidate and has to send vote
requests to the other servers.

But all servers have a timer (_ticker) calling their periodic tick()
functions. If one of the other servers, say B, gets its timer tick
before A sends vote requests, B becomes a (disruptive) candidate and
will refuse to vote for A. In our case of only having 3 out of 5 servers
connected a single missing vote can hang the election.

This patch disables timer ticks for all servers when running custom
elections and partitioning.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-11 11:42:31 -04:00
Pavel Solodovnikov
d8dfdfba1e raft: pass group_id as an argument to raft rpc messages
This will be used later to filter the requests which belong
to the schema raft group and route them to shard 0.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-11 16:25:33 +03:00
Pavel Solodovnikov
3b50cdf1ed raft: use a named constant for pre-defined schema raft group
Introduce a static `schema_raft_state_machine::group_id` constant,
which denotes the raft group id for the schema changes server.

Also fix the comment on the state machine class declaration
to emphasize that the instance will be managed by shard 0.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-11 16:24:39 +03:00
Tomasz Grabiec
234f9dbe85 Merge 'Fix mixed cluster schema sync' from Eliran Sinvani
When a table is altered in a mixed cluster by a node with a more
recent version, the request can fail if there is a difference in
schema_features between the two versions.  This miniset handles the
two problems that prevents the sync.

Closes #8011

* github.com:scylladb/scylla:
  schema: recalculate digest when computed_columns feature is enabled
  schema tables: Remove mutations to unknown tables when adapting schema mutations
  schema tables: Register 'scylla_tables' versions that were sent to other nodes
2021-02-11 13:03:38 +01:00
Eliran Sinvani
63b794d104 schema: recalculate digest when computed_columns feature is enabled
The schema digest is affected by the computed_columns feature, this
means that we have to recalculate our schema digest when this feature is
enabled.
2021-02-11 13:48:58 +02:00
Eliran Sinvani
178ced9014 schema tables: Remove mutations to unknown tables when adapting schema
mutations

Whenever an alter table occurs, the mutations for the just altered table
are sent over to all of the replicas from the coordinator.
In a mixed cluster the mutations should be adapted to a specific version
of the schema. However, the adaptation that happens today doesn't omit
mutations to newly added schema tables, to be more specific, mutations
to the `computed_columns` table which doesn't exist for example in
version 2019.1
This makes altering a table during a rolling upgrade from 2019.1 to
2020.1 dangerous.
2021-02-11 13:48:55 +02:00
Eliran Sinvani
ff1ba9bc2b schema tables: Register 'scylla_tables' versions that were sent to other
nodes

In a mixed cluster there can be a situation where `scylla_tables` needs
to be  sent over to another node because a schema sync or because the
node pulls it because it is referenced by a frozen_mutation. The former
is not a problem since the sending node chooses the version to send.
However, the former is problematic since `scylla_tables` versions are not
registered anywhere.
This registers every `scylla_tables` schema version which is used to adapted
mutations since after this happens a schema pull for this version might
follow.
2021-02-11 13:47:16 +02:00
Takuya ASADA
856fe12e13 dist/debian: install scylla-node-exporter.service correctly
node-exporter systemd unit name is "scylla-node-exporter.service", not
"node-exporter.service".

Fixes #8054

Closes #8053
2021-02-11 12:19:38 +02:00
Benny Halevy
d01e7e7b58 stream_session: prepare: fix missing string format argument
As seen in
mv_populating_from_existing_data_during_node_decommission_test dtest:
```
ERROR 2021-02-11 06:01:32,804 [shard 0] stream_session - failed to log message: fmt::v7::format_error (argument not found)
```

Fixes #8067

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210211100158.543952-1-bhalevy@scylladb.com>
2021-02-11 12:05:32 +02:00
Wojciech Mitros
17634d141b sstables: add test for checking the latency of updating the sstable_set in a table
Added a test which measures the time it takes to replace sstables in a table's
sstable_set, using the leveled compaction strategy.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
693b4e0fcd sstables: move column_family_test class from test/boost to test/lib
Column_family_test allows performing private methods on column_family's
sstable_set. It may be useful not only in the boost tests, so it's moved
from test/boost/sstable_test.hh to test/lib/sstable_test_env.hh.
sstable_test.hh includes sstable_test_env.hh, so no includes need to be
changed.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
0feff8712e sstables: use fast copying of the sstable_set instead of rebuilding it
The sstable_set enables copying without iterating over all its elements,
so it's faster to copy a set and modify it than copy all its elements
while filtering the ones that were erased.

The modifications are done on a temporary version of the set, so that
if an operation fails the base version remains unchanged

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
aa0cd940d6 sstables: replace the sstable_set with a versioned structure
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that allows copying
without actually copying all the sstables in the set, while providing
the same methods(and some extra) without majorly decreasing their speed.
This is achieved by associating all copies with sstable_set versions
which hold the changes that were performed in them, and references to
the versions that were copied, a.k.a. their parents. The set represented
by a version is the result of combining all changes of its ancestors.

This causes most methods of the version to have a time complexity
dependent on the number of its ancestors. To limit this number, versions
that represent copies that have already been deleted are merged with its
descendants.

The strategy used for deciding when and with which of its children
should a version be merged heavily depends on the use case of sstable_sets:
there is a main copy of the set in a table class which undergoes many
insertions and deletions, and there are copies of it in compaction or
mutation readers which are further copied or edited few or zero times.
It's worth to mention, that when a copy is made, the copied set should not
be modified anymore, because it would also modify the results given by the
copy. In order to still allow modifying the copied set, if a change is
to be performed on it, the version assiociated with this set is replaced
with a new version depending on the previous one.
As we can see, in our use case there is a main chain of versions(with
changes from the table), and smaller branches of versions that start
from a version from this chain, but are deleted soon after.
In such case we can merge a version when it has exactly one descendant,
as this limits the number of concurrent ancestors of a version to the
number of copies of its ancestors are concurrently used. During each
such merge, the parent version is removed and the child version is
modified so that all operations on it give the same results.

In order to preserve the same interface, the sstable_set still contains a
lw_shared_ptr<sstable_list>, but sstable_list (previously an alias for
unordered_set<shared_sstable>) is now a new structure. Each sstable_set
contains a sstable_list but not every sstable_list has to be contained
by a sstable_set, and we also want to allow fast copying of sstable_lists,
so the reference to the sstable_set_version is kept by the sstable_lists
and the sstable_set can access the sstable_set_version it's associated
with through its sstable_list.

Accessing sstables that are elements of a certain sstable_set copy(so
the select, select_sstable_runs and sstable_list's iterator) get results
from containers that hold all sstables from all versions(which are stored
in a single, shared "versioned_sstable_set_data" structure), and then
filter out these sstables that aren't present in the version in question.
This version of the sstable_set allows adding and erasing the same sstable
repeatedly. Inserting and erasing from the set modifies the containers in
a version only when it has an actual effect: if an sstable has been added
in the parent version, and hasn't been erased in the child version, adding
it again will have no effect. This ensures that when merging versions, the
versions have disjoint sets of added, and erased sstables (an sstable can
still be added in one and erased in the second). It's worth noting hat if
an sstable has been added in one of the merged sets and erased in the
second, the version that remains after merging doesn't need to have any
info about the sstable's inclusion in the set - it can be inferred from
the changes in previous versions (and it doesn't matter if the sstable has
been erased before or after being added).

To release pointers to sstables as soon as possible (i.e. when all references
to versions that contain them die), if an sstable is added/erased in all
child versions that are based on a version which has no external references,
this change gets removed from these versions and added to the parent version.
If an sstable's insertion gets overwritten as a result, we might be able
to remove the sstable completely from the set. We know how many times this
needs to happen by counting, for each sstable, in how many different verisions
has it been added. When a change that adds an sstable gets merged with a change
that removes it, or when a such a change simply gets deleted alongside its
associated version, this count is reduced, and when an sstable gets added to a
version that doesn't already contain it, this count is increased.

The methods that modify the sets contents give strong exception guarantee
by trying to insert new sstables to its containers, and erasing them in
the case of an caught exception.

Fixes #2622

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
48153a1e2c sstables: remove potential ub
If the range expression in a range based for loop returns a temporary,
its lifetime is extended until the end of the loop. The same can't be said
about temporaries created within the range expression. In our case,
*t->get_sstables_including_compacted_undeleted() returns a reference to a
const sstable_list, but the t->get_sstables_including_compacted_undeleted()
is a temporary lw_shared_ptr, so its lifetime may not be prolonged until the
end of the loop, and it may be the sole owner of the referenced sstable_list,
so the referenced sstable_list may be already deleted inside the loop too.
Fix by creating a local copy of the lw_shared_ptr, and get reference from it
in the loop.

Fixes #7605

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
e1b494633b sstables: make sstable_set constructor less error-prone
Adding an non-empty set of sstables as the set of all sstables in
an sstable_set could cause inconsistencies with the values returned
by select_sstable_runs because the _all_runs map would still be
initialized empty. For similar reasons, the provided sstable_set_impl
should also be empty.

Dispel doubts by removing the unordered_set from the constructor, and
adding a check of emptiness of the sstable_set_impl.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Shlomi Livne
718976e794 scylla_io_setup did not configure pre tuned gce instances correctly
scylla_io_setup condition for nr_disks was using the bitwise operator
(&) instead of logical and operator (and) causing the io_properties
files to have incorrect values

Fixes #7341

Reviewed-by: Lubos Kosco <lubos@scylladb.com>
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>

Closes #8019
2021-02-11 11:06:00 +02:00
Avi Kivity
9cbbf40710 Merge "register_inactive_read: error handling" from Benny
"
Currently, register_inactive_read accepts an eviction_notify_handler
to be called when the inactive_read is evicted.

However, in case there was an error in register_inactive_read
the notification function isn't called leaving behind
state that needs to be cleaned up.

This series separates the register_inactive_reader interface
into 2 parts:

1. register_inactive_reader(flat_mutation_reader) - which just registers
the reader and return an inactive_read_handle, *if permitted*.
Otherwise, the notification handler is not called (it is not known yet)
and the caller is not expected to do anything fance at this point
that will require cleanup.

This optimizes the server when overloaded since we do less work
that we'd need to undo in case the reader_concurrecy_semaphore
runs out of resources.

2. After register_inactive_reader succeeded to return a valid
inactive_read_handle, the caller sets up its local state
and may call `set_notify_handler` to set the optional
notify_handler and ttl on the o_r_h.

After this state, the notify_handler will be called when
the inactive_reader is evicted, for any reason.

querier_cache::insert_querier was modified to use the
above procedure and to handle (and log/ignore) any error
in the process.

inactive_read_handle and inactive_read keeping track of each other
was simplified by keeping an iterator in the handle and a backpointer
in the inactive_read object.  The former is used to evict the reader
and to set the notify_handler and/or ttl without having to lookup the i_r.
The latter is used to invalidate the i_r_h when the i_r is destroyed.

Test: unit(release), querier_cache_test(debug)
"

* tag 'register_inactive_read-error-handling-v6' of github.com:bhalevy/scylla:
  querier_cache: insert_querier: ignore errors to register inactive reader
  querier_cache: insert_querier: handle errors
  querier_utils: mark functions noexcept
  reader_concurrency_semaphore: register_inactive_read: make noexcept
  reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader
  reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional
  reader_concurrency_semaphore: inactive_read: use intrusive list
  reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read
  reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason
  reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error
  reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged
  reader_concurrency_semaphore: inactive_read_handle: swap definition order
  reader_lifecycle_policy: retire low level try_resume method
  reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader
2021-02-10 19:09:21 +02:00
Alejo Sanchez
941eceb9c8 raft: replication test: remove unnecessary param
Remove unnecessary param from wait_log()

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-10 09:11:17 -04:00
Piotr Sarna
2aa4631148 test: fix a flaky timeout test depending on TTL
One of the USING TIMEOUT tests relied on a specific TTL value,
but that's fragile if the test runs on the boundary of 2 seconds.
Instead, the test case simply checks if the TTL value is present
and is greater than 0, which makes the test robust unless its execution
lasts for more than 1 million seconds, which is highly unlikely.

Fixes #8062

Closes #8063
2021-02-10 14:20:02 +02:00
Piotr Sarna
8f98c0585f failure_detector: add a missing const qualifier
The mean() method is effectively const, so it should be marked as such.
Message-Id: <14dd39e8419136909fcf10508c34de3752faa7fe.1612953601.git.sarna@scylladb.com>
2021-02-10 13:04:37 +02:00
Piotr Sarna
aa39130a20 bounded_stats_queue: add missing const qualifiers
Most of the methods of this utility are effectively const.
Message-Id: <ed376ab74b6323cf770cc0a1314edbae0b16111e.1612953601.git.sarna@scylladb.com>
2021-02-10 13:04:35 +02:00
Piotr Jastrzebski
390cef6a96 cdc: Extract create_stream_ids from topology_description_generator
This new function will be used in the following patches in additional
places.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-02-10 10:24:06 +01:00
Gleb Natapov
d06d21bfae database: remove add_keyspace() function
It is not longer used.
Message-Id: <20210209175931.1796263-2-gleb@scylladb.com>
2021-02-10 00:36:02 +01:00
Nadav Har'El
54785604b4 Merge 'Add max concurrent requests to alternator' from Piotr Sarna
Previous version, merged and dequeued due to a dependency bug: https://github.com/scylladb/scylla/pull/7297

Note: this pull request is temporarily created against /next, because it depends on https://github.com/scylladb/scylla/pull/7279.

This series adds support for `max_concurrent_requests_per_shard` config variable to alternator. Excessive requests are shed and RequestLimitExceeded is sent back to the client.

Tested manually by reloading Scylla multiple times and editing the config, while bombarding alternator with many concurrent requests. Observed excepted failures are:
`botocore.errorfactory.RequestLimitExceeded: An error occurred (RequestLimitExceeded) when calling the CreateTable operation: too many in-flight requests: 17
`

Fixes #7294

Closes #8039

* github.com:scylladb/scylla:
  alternator: server: return api_error instead of throwing
  alternator: add requests_shed metrics
  alternator: add handling max_concurrent_requests_per_shard
  alternator: add RequestLimitExceeded error
2021-02-09 19:55:31 +02:00
Gleb Natapov
d8345c67d9 Consolidate system and non system keyspace creation
The code that creates system keyspace open code a lot of things from
database::create_keyspace(). The patch makes create_keyspace() suitable
for both system and non system keyspaces and uses it to create system
keyspaces as well.
Message-Id: <20210209160506.1711177-1-gleb@scylladb.com>
2021-02-09 17:18:04 +01:00
Gleb Natapov
51037e94ec lwt: handle an error during prune operation
The error is benign but if it is not handled "unhandled exception" error
will be printed in the logs.

Message-Id: <20210209150313.GA1708015@scylladb.com>
2021-02-09 16:26:00 +01:00
Tomasz Grabiec
3dd9c5596a Merge 'Minor tweaks to the failure detector interface' from Piotr Sarna
The interface of the failure detector service is cleaned up a little:
 - an unimplemented method is removed (is_alive)
 - a return type of another method is fixed (arrival_samples)
 - a getter for the most recent successful update is added (last_update)

This code was tested manually during various overload protection
experiments, which check if the failure detector can be used to reject
requests which have a very small chance of succeeding within their
timeout.

Closes #8052

* github.com:scylladb/scylla:
  failure_detector: add getting last update time point
  failure_detector: return arrival samples by const reference
  failure_detector: remove unimplemented is_alive method
2021-02-09 15:23:09 +01:00
Konstantin Osipov
86dec79c1b raft: rename progress.hh to tracker.hh
class tracker is the main class of this module.
2021-02-09 17:07:25 +03:00
Konstantin Osipov
41387225c3 raft: extend single_node_is_quiet test 2021-02-09 17:04:13 +03:00
Piotr Sarna
4acc6fecf0 Merge 'locator: Check DC names in NetworkTopologyStrategy' from Juliusz Stasiewicz
The same trick is used as in C*:
79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)

The edited CQL test relied on quietly accepting non-existing DCs, so it had to
be removed. Also, one boost-test referred to nonexistent `datacenter2` and had
to be removed.

Fixes #7595

Closes #8056

* github.com:scylladb/scylla:
  tests: Adjusted tests for DC checking in NTS
  locator: Check DC names in NTS
2021-02-09 14:45:20 +02:00
Botond Dénes
3d001b5587 query: use local limit for non-limited queries in mixed cluster
Since fea5067df we enforce a limit on the memory consumption of
otherwise non-limited queries like reverse and non-paged queries. This
limit is sent down to the replicas by the coordinator, ensuring that
each replica is working with the same limit. This however doesn't work
in a mixed cluster, when upgrading from a version which doesn't have
this series. This has been worked around by falling back to the old
max_result_size constant of 1MB in mixed clusters. This however resulted
in a regression when upgrading from a pre fea5067df to a post fea5067df
one. Pre fea5067df already had a limit for reverse queries, which was
generalized to also cover non-paged ones too by fea5067df.
The regression manifested in previously working reverse queries being
aborted. This happened because even though the user has set a generous
limit for them before the upgrade, in the mix cluster replicas fall back
to the much stricter 1MB limit temporarily ignoring the configured limit
if the coordinator is an old node. This patch solves this problem by
using the locally configured limit instead of the max_result_size
constant. This means that the user has to take extra care to configure
the same limit on all replicas, but at least they will have working
reverse queries during the upgrade.

Fixes: #8022

Tests: unit(release), manual test by user who reported the issue
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210209075947.1004164-1-bdenes@scylladb.com>
2021-02-09 14:45:20 +02:00
Avi Kivity
2f50bf2029 Update seastar submodule
* seastar 4c7c5c7c4...76cff5896 (6):
  > rpc: Make is possible for rpc server instance to refuse connection
  > reactor: expose cumulative tasks processed statistic
  > fair_queue: add missing #include <optional>
  > reactor: optimize need_preempt() thread-local-storage access
  > Merge " Use reference for backend->reactor link" from Pavel E
  > test: coroutines: failed coroutine does not throw
2021-02-09 14:45:20 +02:00
Avi Kivity
37b41d7764 test: add missing #include <fstream>
std::ofstream is used, but there is no direct include for it. This
fails the build with libstdc++ 11.

Closes #8050
2021-02-09 14:45:20 +02:00
Juliusz Stasiewicz
97bb15b2f2 tests: Adjusted tests for DC checking in NTS
CQL test relied on quietly acceptiong non-existing DCs, so it had
to be removed. Also, one boost-test referred to nonexisting
`datacenter2` and had to be removed.
2021-02-09 08:29:35 +01:00
Juliusz Stasiewicz
b6fb5ee912 locator: Check DC names in NTS
The same trick is used as in C*:
79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)

Fixes #7595
2021-02-09 07:04:17 +01:00
Benny Halevy
d2b8b3041d querier_cache: insert_querier: ignore errors to register inactive reader
Since the reader may normally dropped upon
registration, hitting an error is equivalent to having
it evicted at any time, so just log the exception
and ignore it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
9bdb8190ce querier_cache: insert_querier: handle errors
Make insert_querier exception safe.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
b8f935457a querier_utils: mark functions noexcept
They all are trivially noexcept.
Mark them so to simplify error handing assumptions in the
next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
6e92f07630 reader_concurrency_semaphore: register_inactive_read: make noexcept
Catch error to allocate an inactive_read and just log them.
Return an empty inactive_read_handle in
this case, as if the inactive reader was evicted due to
lack of resources.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
46c2229b78 reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader
Register the inactive reader first with no
evict_notify_handler and ttl.

Those can be set later, only if registration succeeded.
Otherwise, as in the querier example, there is no need
to to place the querier in the index and erase it
on eviction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
d752ea7e91 reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional
By default it will be unarmed and with no callback
so there's no need to wrap it in a std::optional.

This saves an allocation and another potential
error case.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
a12c9638b6 reader_concurrency_semaphore: inactive_read: use intrusive list
To simplify insertion and eviction into the inactive_reads container,
use an intrusive list thta requires a single allocation for the
inactive_read object itself.

This allows passing a reference to the inactive_read
to evict it.

Note that the reader will be unlinked automatically from
the inactive_readers list if the inactive_read_handle is destroyed.
This is okay since there is no need to track the inactive_read
if the caller loses the i_r_h (e.g. if an error is thrown).

It is also safe to evict the inactive_reader while the
i_r_h is alive.  In this case the i_r will be unlinked
after the flat_mutation_reader it holds is moved out of it.

bi::auto_unlink will detect that it's alredy unlinked
when destroyed and do nothing.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
f751e42bf9 reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:52:16 +02:00
Benny Halevy
81cd3d0c51 reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason
So try_evict_one_inactive_read could be used also in do_wait_admission
in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
e072199b8d reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error
Calling unregister_inactive_read on the wrong semaphore is a blatant
bug so better call on_internal_error so it'd be easier to catch and fix.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
9c9b4c85ae reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged
There is no need to lookup the inactive_read if the i_r_h
is disengaged, it should not be registered so just return
quickly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
769dff6c54 reader_concurrency_semaphore: inactive_read_handle: swap definition order
For using boost::intrusive::list for _inactive_reads.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
d565e3fb57 reader_lifecycle_policy: retire low level try_resume method
The caller can now just call sem.unregister_inactive_read(irh) directly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
4e8f29ef14 reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader
There's no need to hold a unique_ptr<flat_mutation_reader> as
flat_mutation_reader itself holds a unique_ptr<flat_mutation_reader::impl>
and functions as a unique ptr via flat_mutation_reader_opt.

With that, unregister_inactive_read was modified to return a
flat_mutation_reader_opt rather than a std::unique_ptr<flat_mutation_reader>,
keeping exactly the same semantics.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Nadav Har'El
e52785be08 alternator: support attribute paths in ConditionExpression, FilterExpression
This patch fully implements support for attribute paths (e.g. a.b.c, a.d[3])
for the ConditionExpression in conditional updates, and FilterExpression in
queries and scans. After this patch, all previously-xfailing tests in
test_projection_expression.py and test_filter_expression.py now pass.

The fix is simple: Both ConditionExpression and FilterExpression use the
function calculate_value() to calculate the value of the expression. When
this function calculates the value of a path, it mustn't just take the
top-level attribute - it needs to walk into the specific sub-object as
specified by the attribute path.

This is not the end of attribute path support, UpdateExpression and
ReturnValues are not yet fully supported. This will come in following
patches.

Refs #5024

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 19:19:09 +02:00
Nadav Har'El
579c7b8dae alternator-test: improve tests for nested attributes in ConditionExpression
Strengthen the tests in test_condition_expression.py for nested attribute
paths (e.g., b.y[1]):

1. The test test_update_condition_nested_attributes only tested successful
   conditions involving nested attributes. Let's also add an *unsuccessful*
   condition, to verify we don't accidentally pass every condition involving
   a nested attribute.

2. Test a case where a non-existant nested attribute is involved in the
   condition.

3. In the test for an attribute path with references - "#name1.#name2",
   make sure the test doesn't pass if #name2 is silently ignored.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 19:19:09 +02:00
Piotr Sarna
faca59efa6 failure_detector: add getting last update time point
It can be useful to use the information how long ago an endpoint
responded to heartbeat.
2021-02-08 16:45:58 +01:00
Tomasz Grabiec
c16e4a0423 migration_manager: Propagate schema changes with reads like we do on writes
This fixes the problem where the cordinator already knows about the
new schema and issues a read which uses new objects, but the replica
doesn't know those objects yet. The read will fail in this case. We
can avoid this if we propagate schema changes with reads, like we
already do for writes.

Message-Id: <20210205163422.414275-1-tgrabiec@scylladb.com>
2021-02-08 16:49:55 +02:00
Avi Kivity
4082f57edc Merge 'Make commitlog disk limit a hard limit.' from Calle Wilund
Refs #6148

Commitlog disk limit was previously a "soft" limit, in that we allowed allocating new segments, even if we were over
disk usage max. This would also cause us sometimes to create new segments and delete old ones, if badly timed in
needing and releasing segments, in turn causing useless disk IO for pre-allocation/zeroing.

This patch set does:
* Make limit a hard limit. If we have disk usage > max, we wait for delete or recycle.
* Make flush threshold configurable. Default is ask for flush when over 50% usage. (We do not wait for results)
* Make flush "partial". We flush X% of the used space (used - thres/2), and make the rp limit accordingly. This means we will try to clear the N oldest segments, not all. I.e. "lighter" flush. Of course, if the CL is wholly dominated by a single CF, this will not really help much. But when > 1 cf is used, it means we can skip those not having unflushed data < req rp.
* Force more eager flush/recycle if we're out of segments

Note: flush threshold is not exposed in scylla config (yet). Because I am unsure of wording, and even if it should.
Note: testing is sparse, esp. in regard to latency/timeouts added in high usage scenarios. While I can fairly easily provoke "stalls" (i.e. forced waiting for segments to free up) with simple C-S, it is hard to say exactly where in a more sane config (I set my limits looow) latencies will start accumulating.

Closes #7879

* github.com:scylladb/scylla:
  commitlog: Force earlier cycle/flush iff segment reserve is empty
  commitlog: Make segment allocation wait iff disk usage > max
  commitlog: Do partial (memtable) flushing based on threshold
  commitlog: Make flush threshold configurable
  table: Add a flush RP mark to table, and shortcut if not above
2021-02-08 16:44:05 +02:00
Avi Kivity
af2d1fa0de Update abseil submodule
Compiles with newer compilers.

Added new library wyhash.a to configure.py.

* abseil 1e3d25b...9c6a50f (51):
  > Export of internal Abseil changes
  >  Do not set mvsc linker flags for clang-cl (fixes #874) (#891)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add support for Elbrus 2000 (e2k) (#889)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add missing word 'library' in the 'status' description (#868)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Include the status library into the main README. (#863)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > fix build dll (#797)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Fix stacktrace on aarch64 architecture. Fixes #805 (#827)
  > moved deleted functions to public for better compiler errors. (#828)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
2021-02-08 15:41:46 +02:00
Gleb Natapov
b9a5aff7a6 distributed_loader: drop execute_futures function
execute_futures() is just a local reimplementation of
when_all_succeed(). Use the former directly.

Message-Id: <20210208114816.GA1658725@scylladb.com>
2021-02-08 13:24:19 +01:00
Nadav Har'El
104ef5242b alternator: support attribute paths in ProjectionExpression
This patch fully implements support for attribute paths (e.g. a.b.c, a.d[3])
for the ProjectionExpression in the various operations where this parameter
is supported - GetItem, BatchGetItem, Query and Scan. After this patch, all
xfailing tests in test_projection_expression.py now pass.

In the previous patch we remembered in the "attrs_to_get" object not only
the top-level attributes to read from the table, but also how to filter
from it only the desired pieces of the nested document. In this patch we
add a filter() function to do this filtering, and call it in the right
places to post-process the JSON objects we read from the table.

We also had to fix reference resolution in paths to resolve all the
components of the path (e.g., #name1.#name2) and not just the top-level
attribute.

This is not the end of attribute path support, there are still other
expressions (ConditionExpression, UpdateExpression, FilterExpression,
ReturnValues) where they are not yet supported. This will come in following
patches.

Refs #5024

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:40 +02:00
Nadav Har'El
6340619e69 alternator: overhaul attrs_to_get handling
In the existing code, the variable "attrs_to_get" is a list of top-level
attributes to fetch for an item. It is used to implement features like
ProjectionExpression or AttributesToGet in GetItem and other places.

However, to support attribute paths (e.g., a.b.c[2]) in ProjectionExpression,
i.e., issue #5024, we need more than that. We still need to know the top-
level attribute "a", because this is the granularity we have in the Scylla
table (all the content inside "a" is serialized as a single JSON); But we
also need to remember exactly which parts *inside* "a" we will need to
extract and return.

So in this patch we add a new type, "attrs_to_get", which is more than
just a list of top-level attributes. Instead, it is a *map*, whose keys
are the top-level attributes, and the value for each of them is a
"hierarchy_filter", an object which describes which part of the attribute
is needed.

This patch includes the code which converts the AttributesToGet and
ProjectionExpression into the new attrs_to_get structure. During this
conversion, we recognize two kinds of errors which DynamoDB complains
about: We recognize "overlapping" attributes (e.g., requesting both
a.b and a.b.c) and "conflicting" attributes (e.g, requesting both
a.b and a[1]). After this, two xfailing tests we had for detecting
these overlap and conflicts finally pass and their "xfail" label is
removed.

After this patch, we have the attrs_to_get object which can allow us
to filter only the requested pieces of the top-level attributes, but
we don't use it yet - so this patch is not enough for complete support
of attribute paths in ProjectionExpression. We will complete this
support in the next patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:40 +02:00
Nadav Har'El
b2dbd56a3a alternator-test: additional tests for attribute paths in ProjectionExpression
This patch adds more tests for attribute paths in ProjectionExpression,
that deal with document paths which do not fit the content of the item -
e.g., trying to ask for "a.b[3]" when a.b is not a list but rather an
integer or a dictionary.

Moreover, we note that if you try to ask for "a.b, a[2]", DynamoDB
fails this request as a "conflict". The reasoning is that no single
item can ever have both a.b and a[2] (the first is only valid for
dictionaries, the second for lists). It's not clear to me why we
still can't return whichever of the two actually is relevant, but
the fact is that DynamoDB does not allow it.

The new tests fail on Alternator (marked xfailed) and pass on DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:40 +02:00
Nadav Har'El
2a2c5563ba alternator-test: harden attribute-path tests for ProjectionExpression
We have 7 xfailing tests for usage of nested attribute paths (e.g.,
"a.b.c[7]") in a ProjectionExpression. But some of these tests were too
"easy" to pass - a trivial and *wrong* implementation that just ignores
the path and uses the top level attribute (in the above example, "a"),
would cause some of them to start passing.

So this patch strengthens these tests. They still pass on AWS DynamoDB,
and now continue to fail with the aforementioned broken implementation.

Refs #5024.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:40 +02:00
Nadav Har'El
653610f4bc alternator: fix ValidationException in FilterExpression - and more
The first condition expressions we implemented in Alternator were the old
"Expected" syntax of conditional updates. That implementation had some
specific assumptions on how it handles errors: For example, in the "LT"
operator in "Expected", the second operand is always part of the query, so
an error in it (e.g., an unsupported type) resulted it a ValidationException
error.

When we implemented ConditionExpression and FilterExpression, we wrongly
used the same functions check_compare(), check_BETWEEN(), etc., to implement
them. This results in some inaccurate error handling. The worst example is
what happens when you use a FilterExpression with an expression such as
"x < y" - this filter is supposed to silently skip items whose "x" and "y"
attributes have unsupported or different types, but in our implementation
a bad type (e.g., a list) for y resulted in a ValidationException which
aborted the entire scan! Interestingly, in once case (that of BEGINS_WITH)
we actually noticed the slightly different behavior needed and implemented
the same operator twice - with ugly code duplication. But in other operators
we missed this problem completely.

This patch first adds extensive tests of how the different expressions
(Expected, QueryFilter, FilterExpression, ConditionExpression) and the
different operators handle various input errors - unsupported types,
missing items, incompatible types, etc. Importantly, the tests demonstrate
that there is often different behavior depending on whether the bad
input comes from the query, or from the item. Some of the new tests
fail before this patch, but others pass and were useful to verify that
the patch doesn't break anything that already worked correctly previously.
As usual, all the tests pass on Cassandra.

Finally, this patch *fixes* all these problems. The comparison functions
like check_compare() and check_BETWEEN() now not only take the operands,
they also take booleans saying if each of the operands came from the
query or from an item. The old-syntax caller (Expected or QueryFilter)
always say that the first operand is from the item and the second is
from the query - but in the new-syntax caller (ConditionExpression or
FilterExpression) any or all of the operands can come from the query
and need verification.

The old duplicated code for check_BEGINS_WITH() - which a TODO to remove
it - is finally removed. Instead we use the same idea of passing booleans
saying if each of its operands came from an item or from the query.

Fixes #8043

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:30 +02:00
Pavel Emelyanov
a05adb8538 database: Remove global storage proxy reference
The db::update_keyspace() needs sharded<storage_proxy>
reference, but the only caller of it already has it and
can pass one as argument.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210205175611.13464-3-xemul@scylladb.com>
2021-02-08 12:59:46 +01:00
Pavel Emelyanov
8490c9ff6a transport: Remove global storage service reference
On start the transport controller keeps the storage service
on server config's lambda just to let the server grab a
database config option.

The same can be achieved by passing the sharded database
reference to sharded<server>::start, so that each server
instance get local database with config.

As an nice side effect transport::server's config looks
more like a config with simple values and without methods
and/or lambdas on board.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210205175611.13464-1-xemul@scylladb.com>
2021-02-08 12:58:49 +01:00
Piotr Sarna
d23584c8f7 failure_detector: return arrival samples by const reference
There's no point in always returning the whole map by value - callers
can decide to copy the map of their own if need be.
2021-02-08 11:50:32 +01:00
Piotr Sarna
445e6e44f4 failure_detector: remove unimplemented is_alive method
The method was never implemented, so it makes no sense to keep
it in the header.
2021-02-08 11:49:50 +01:00
Amnon Heiman
4498bb0a48 API: Fix aggregation in column_familiy
Few method in column_familiy API were doing the aggregation wrong,
specifically, bloom filter disk size.

The issue is not always visible, it happens when there are multiple
filter files per shard.

Fixes #4513

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #8007
2021-02-08 12:11:30 +02:00
Raphael S. Carvalho
e1261d10f1 table: Avoid useless allocations when updating cache on memtable flush completion
we're unconditionally using make_combined_mutation_source(), which causes extra
allocations, even if memtable was flushed into a single sstable, which is the
most common case. memtable will only be flushed into more than one sstable if
TWCS is used and memtable had old data written into it due to out-of-order
writes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210205182028.439948-1-raphaelsc@scylladb.com>
2021-02-06 20:03:33 +02:00
Pavel Emelyanov
7e68ed6a5d configure: Switch debug build from -O0 to -Og
Previous patch changed the -O flag for dev builds. This
had no effect on unit tests compile+run time, and was
aimed at improving the individual tests, dtest, stress-
and other tests runtimes.

This change is mainly focused on imprving the debug-mode
full unit tests running, while keeping the debuggability:
the compile+run time gets ~10 minutes shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-05 19:46:28 +03:00
Pavel Emelyanov
4fd5ef92ae configure: Switch dev build from -O1 to -O2
Based on the original patch from Nadav.

The -O1-generated code is too slow. Raising the opt level
slows compilation down ~9%, but greatly improves the
testing time. E.g. running the alternator test alone is
2.5 times faster with -O2 (118 vs 48 seconds).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-05 19:46:28 +03:00
Pavel Emelyanov
7ced07d22c configure: Make -O flag configurable
It was noticed, that current optimization levels do not
generate fast enough code for dev builds. On the other
hand just increasing the default optimization level will
make re-compile-mostly work much more frustrating.

The new configure.py option allows to select the desired
-O option value by hands. Current hard-coded values are
used as defaults.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-05 19:46:28 +03:00
Botond Dénes
7910e745bc scylla-gdb.py: std_list: restore python2 compatibility
std_list has an iterator object which provides the python3 `__next__()`
method only. Python2 wants a method called `next()`. As it is trivial to
provide both, do that to allow debugging on centos7.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210205073549.734362-1-bdenes@scylladb.com>
2021-02-05 12:47:53 +01:00
Gleb Natapov
8dbe222331 raft: compile raft by default 2021-02-05 12:40:20 +01:00
Konstantin Osipov
adc87aa278 raft: re-lookup progress object after a configuration change
Fix raft_fsm_test failure in debug mode. ASAN complained
that follower_progress is used in append_entries_reply()
after it was destroyed. This could happen if in maybe_commit()
we switched to a new configuration and destroyed old progress
objects.

The fix is to lookup the object one more time after maybe_commit().
2021-02-05 12:40:19 +01:00
Piotr Sarna
d7848750d8 alternator: server: return api_error instead of throwing
Throwing a C++ exception creates unnecessary overhead, so when
an unsupported operation is encountered, the api error is directly
returned instead of being thrown.
2021-02-04 17:23:41 +01:00
Piotr Sarna
868e04e8e2 alternator: add requests_shed metrics
The counter shows the total number of requests shed due to overload.
2021-02-04 17:23:41 +01:00
Piotr Sarna
1b8c946ad7 alternator: add handling max_concurrent_requests_per_shard
The config value is already used to set an upper limit of concurrent
CQL requests, and now it's also abided by alternator.
Excessive requests result in returning RequestLimitExceeded error
to the client.

Tests: manual
Running multiple concurrent requests via the test suite results in:
botocore.errorfactory.RequestLimitExceeded: An error occurred (RequestLimitExceeded) when calling the CreateTable operation: too many in-flight requests: 17
2021-02-04 17:23:41 +01:00
Piotr Sarna
32dc692b8b alternator: add RequestLimitExceeded error
The error code is used when requests are shed due to crossing
the user-defined threshold of the rate of incoming requests.
2021-02-04 17:14:21 +01:00
Avi Kivity
7f3083739f Merge "sstables: Share partition index pages between readers" from Tomasz
"
Before this patch, each index reader had its own cache of partition
index pages. Now there is a shared cache, owned by the sstable object.
This allows concurrent reads to share partition index pages and thus
reduce the amount of I/O.

It used to be like that a few years ago, but we moved to per-reader
cache to implement incremental promoted index parsing, to avoid OOMs
with large partitions. At that time, the solution involved caching
input streams inside partition index entries, which couldn't be reused
between readers. This could have been solved differently. Instead of
caching input streams, we can cache information needed to created them
(temporary_buffer<>). This solution takes this approach.

This series is also needed before we can implement promoted index
caching. That's because before the promoted index can be shared by
readers, the partition index entries, which hold the promoted index,
must also be shareable.

The pages live as long as there is at least one index reader
referencing them. So it only helps when there is concurrent access. In
the future we will keep them for longer and evict on memory pressure.

Promoted index cursor is no longer created when the partition index
entry is parsed, by it's created on-demand when the top-level cursor
enters the partition. The promoted index cursor is owned by the
top-level cursor, not by the partition index entry.

Below are the results of an experiment performed on my laptop which
demonstrates the improvement in performance.

Load driver command line:

  ./scylla-bench                   \
       -workload uniform           \
       -mode read                  \
       --partition-count=10        \
       -clustering-row-count=1     \
       -concurrency 100

Scylla command line:

  scylla --developer-mode=1 -c1 -m1G --enable-cache=0

The workload is IO-bound.
Before, we needed 2 I/O per read, now we need 1 (amortized).
The throughput is ~70% higher.

Before:

 time   ops/s  rows/s errors max    99.9th 99th   95th   90th   median mean
   1s    4706    4706      0 35ms   30ms   27ms   25ms   24ms   21ms   21ms
   2s    4646    4646      0 42ms   31ms   31ms   27ms   25ms   21ms   22ms
 3.1s    4670    4670      0 40ms   27ms   26ms   25ms   25ms   21ms   21ms
 4.1s    4581    4581      0 39ms   33ms   33ms   27ms   26ms   21ms   22ms
 5.1s    4345    4345      0 40ms   37ms   35ms   32ms   31ms   21ms   23ms
 6.1s    4328    4328      0 49ms   40ms   34ms   32ms   31ms   22ms   23ms
 7.1s    4198    4198      0 45ms   36ms   35ms   31ms   30ms   22ms   24ms
 8.2s    3913    3913      0 51ms   50ms   50ms   39ms   35ms   24ms   26ms
 9.2s    4524    4524      0 34ms   31ms   30ms   28ms   27ms   21ms   22ms

After:

 time   ops/s  rows/s errors max    99.9th 99th   95th   90th   median mean
   1s    7913    7913      0 25ms   25ms   20ms   15ms   14ms   12ms   13ms
   2s    7913    7913      0 18ms   18ms   18ms   16ms   14ms   12ms   13ms
   3s    8125    8125      0 20ms   20ms   17ms   15ms   14ms   12ms   12ms
   4s    5609    5609      0 41ms   35ms   29ms   28ms   27ms   13ms   18ms
 5.1s    8020    8020      0 18ms   17ms   17ms   15ms   14ms   12ms   13ms
 6.1s    7102    7102      0 27ms   27ms   24ms   19ms   18ms   13ms   14ms
 7.1s    5780    5780      0 26ms   26ms   26ms   23ms   22ms   17ms   18ms
 8.1s    6530    6530      0 37ms   34ms   26ms   22ms   20ms   15ms   15ms
 9.1s    7937    7937      0 19ms   19ms   17ms   17ms   16ms   12ms   13ms

Tests:

  - unit [release]
  - scylla-bench
"

* tag 'share-partition-index-v1' of github.com:tgrabiec/scylla:
  sstables: Share partition index pages between readers
  sstables: index_reader: Drop now unnecessary index_entry::close_pi_stream()
  sstables: index_reader: Do not store cluster index cursor inside partition indexes
2021-02-04 17:27:49 +02:00
Nadav Har'El
1953b1b006 alternator-test: increase timeout in tracing test
Our test for tracing Alternator requests can't be sure when tracing a request
finished, because tracing is asynchronous and has no official ending signal.
So before we can conclude that tracing failed, we need to wait until a
timeout, which in the current code was roughly 6.4 seconds (the timeout
logic is unnecessarily convoluted, but to make a long story short it has
exponential sleeps starting with 0.1 second and ending with 3.2 seconds,
totaling 6.4 seconds).

It turns out that sporadically, in test runs on overcommitted test machines
with the very slow debug build, we fail this test with this timeout.
So this patch increases the timeout to 51.2 seconds. It should be more
than enough for everyone. Famous last words :-)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210204151554.582260-1-nyh@scylladb.com>
2021-02-04 17:17:07 +02:00
Tomasz Grabiec
63188abb87 sstables: Share partition index pages between readers
Before this patch, each index reader had its own cache of partition
index pages. Now there is a shared cache, owned by the sstable object.
This allows concurrent reads to share partition index pages and thus
reduce the amount of I/O.

This change is also needed before we can implement promoted index caching.
That's because before the promoted index can be shared by readers, the
partition index entries, which hold the promoted index, must also be
shareable.

The pages live as long as there is at least one index reader
referencing them. So it only helps when there is concurrent access. In
the future we will keep them for longer and evict on memory pressure.

Promoted index cursor is no longer created when the partition index entry
is parsed, by it's created on-demand when the top-level cursor enters
the partition. The promoted index cursor is owned by the top-level cursor,
not by the partition index entry.
2021-02-04 15:24:07 +01:00
Tomasz Grabiec
c232d71fc8 sstables: index_reader: Drop now unnecessary index_entry::close_pi_stream() 2021-02-04 15:24:07 +01:00
Tomasz Grabiec
5ed559c8c6 sstables: index_reader: Do not store cluster index cursor inside partition indexes
Currently, the partition index page parser will create and store
promoted index cursors for each entry. The assumption is that
partition index pages are not shared by readers so each promoted index
cursor will be used by a single index_reader (the top-level cursor).

In order to be able to share partition index entries we must make the
entries immutable and thus move the cursor outside. The promoted index
cursor is now created and owned by each index_reader. There is at most
one such active cursor per index_reader bound (lower/upper).
2021-02-04 15:23:55 +01:00
Avi Kivity
713a159600 tools: toolchain: add simplified procedure for creating dbuild images
The current procedure for building images is complicated, as it
requires access to x86_64, aarch64, and s390x machines. Add an alternative
procedure that is fully automated, as it relies on emulation on a single
machine.

It is slow, but requires less attention.

Closes #8024
2021-02-04 15:37:36 +02:00
Avi Kivity
bd7fbcc0cf tools: toolchain: dbuild: keep original user's groups
The supplementary groups are removed by default, so add them back.
Supplementary groups are useful for group-shared directories like
ccache.

I added them to the podman-only branch since I don't know if this
works for docker. If a docker user verifies it works there too,
we can move it to the generic code.

Closes #8020
2021-02-04 15:36:55 +02:00
Gleb Natapov
e9043565b3 raft: add counters to raft server
The patch adds set of counters for various events inside raft
implementation to facilitate monitoring and debugging.

Message-Id: <20210204125313.GA1513786@scylladb.com>
2021-02-04 14:19:54 +01:00
Benny Halevy
f5fe8283cc test: reader_permit: do not include reader_concurrency_semaphore.hh in header file
We can do with a forward declaration instead to reduce
the dependency, and include reader_concurrency_semaphore.hh
in test/lib/reader_permit.cc instead.

We need to include "../../reader_permit.hh" to get the
definition of class reader_permit. We need the include
path to prevent recursive include (or rename test/lib/reader_permit.hh
but this creates a lot of code churn).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204122002.1041808-1-bhalevy@scylladb.com>
2021-02-04 15:02:16 +02:00
Benny Halevy
338c190842 reader_concurrency_semaphore: inactive_read_handle: mark methods noexcept
All are trivially noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204113327.1027792-1-bhalevy@scylladb.com>
2021-02-04 13:57:42 +02:00
Benny Halevy
ba4b8dd6e5 sstables: row.hh: no need to include reader_concurrency_semaphore.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204113413.1027893-1-bhalevy@scylladb.com>
2021-02-04 13:42:06 +02:00
Tomasz Grabiec
2e3f6a9622 tests: perf_fast_forward: Print outpout directory
Message-Id: <20210203180053.230627-1-tgrabiec@scylladb.com>
2021-02-04 10:39:41 +02:00
Tomasz Grabiec
e0ceb454c0 tests: perf_fast_forward: Print error hints to stdout
They point to lines printed to stdout, so should be aligned with them.
Message-Id: <20210203180016.230547-1-tgrabiec@scylladb.com>
2021-02-04 10:39:41 +02:00
Avi Kivity
fcd48adcc4 Update seastar submodule
* seastar b5b2ee53d...4c7c5c7c4 (1):
  > Merge "add support for printing backtraces on one line" from Benny

Fixes #5464.
2021-02-03 14:01:45 +02:00
Benny Halevy
ca6f5cb0bc test: commitlog_test: test_allocation_failure: fill memory using smaller allocations
commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output).
So if there are discontiguous small memory blocks, they can be used to satisfy
an allocation even if no contiguous memory blocks are available.

To prevent that, as Avi suggested, this change allocates in 128K blocks
and frees the last one to succeed (so that we won't fail on allocating continuations).

Fixes #8028

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>
2021-02-03 12:21:20 +02:00
Pavel Solodovnikov
856b0b3a58 raft: introduce raft_gossip_failure_detector class
This is an implementation of `raft::failure_detector` for Scylla
that uses gms::gossiper to query `is_alive` state for a given
raft server id.

Server ids are translated to `gms::inet_address` to be consumed
by `gms::gossiper` with the help of `raft_rpc` class,
which manages the mapping.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210129223109.2142072-1-pa.solodovnikov@scylladb.com>
2021-02-03 10:45:18 +01:00
Tomasz Grabiec
f8ae46f294 Merge "raft: RPC module implementation" from Pavel Solodovnikov
This series provides additional RPC verbs and corresponding methods in
`messaging_service` class, as well as a scylla-specific Raft RPC module
implementation that uses `netw::messaging_service` under the hood to
dispatch RPC messages.

* https://github.com/ManManson/scylla/commits/raft-api-rpc-impl-v6:
  raft: introduce `raft_rpc` class
  raft: add Raft RPC verbs to `messaging_service` and wire up the RPC calls
  configure.py: compile serializer.cc
2021-02-03 10:43:58 +01:00
Benny Halevy
55e3df8a72 dist: scylla_util: prevent IndexError when no ephemeral_disks were found
Currently we call firstNvmeSize before checking that we have enough
(at least 1) ephemeral disks.  When none are found, we hit the following
error (see #7971):
```
File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in
if idata.is_recommended_instance():
File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance
diskSize = self.firstNvmeSize
File "/opt/scylladb/scripts/scylla_util.py", line 291, in firstNvmeSize
firstDisk = ephemeral_disks[0]
IndexError: list index out of range
```

This change reverses the order and first checks that we found
enough disks before getting the fist disk size.

Fixes #7971

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8027
2021-02-03 11:30:18 +02:00
Avi Kivity
10606aadb5 Update tools/java submodule
* tools/java 78c8ef4f54...0187829d5e (1):
  > nodetool: alternate way to specify table name which includes a dot

Fixes #6521.
2021-02-03 11:27:33 +02:00
Botond Dénes
46b795b5fd mutation: consume(): add reverse mode
`mutation::consume()` is used by range scans to convert the immediate
`reconcilable_result` to the final `query::result` format. When the
range scan is in reverse, `mutation::consume()` has to feed the
clustering fragments to the consumer in reverse order, but currently
`mutation::consume()` always uses the natural order, breaking reverse
range scans.
This patch fixes this by adding a `consume_in_reverse` parameter to
`mutation::consume()`, and consequently support for consuming clustering
fragments in reverse order.

Fixes: #8000

Tests: unit(release, debug),
dtest(thrift_tests.py:TestMutations.test_get_range_slice)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210203081659.622424-1-bdenes@scylladb.com>
2021-02-03 11:00:47 +02:00
Piotr Sarna
c03363b520 README: fix a dead link for building instructions
The link was outdated, since its destination was moved to
a subdirectory.

Message-Id: <b0e0eedaea4f26acf050a91ab9eed1ca37a838bb.1612338584.git.sarna@scylladb.com>
2021-02-03 10:59:50 +02:00
Avi Kivity
913d970c64 Merge "Unify inactive readers" from Botond
"
Currently inactive readers are stored in two different places:
* reader concurrency semaphore
* querier cache
With the latter registering its inactive readers with the former. This
is an unnecessarily complex (and possibly surprising) setup that we want
to move away from. This series solves this by moving the responsibility
if storing of inactive reads solely to the reader concurrency semaphore,
including all supported eviction policies. The querier cache is now only
responsible for indexing queriers and maintaining relevant stats.
This makes the ownership of the inactive readers much more clear,
hopefully making Benny's work on introducing close() and abort() a
little bit easier.

Tests: unit(release, debug:v1)
"

* 'unify-inactive-readers/v2' of https://github.com/denesb/scylla:
  reader_concurrency_semaphore: store inactive readers directly
  querier_cache: store readers in the reader concurrency semaphore directly
  querier_cache: retire memory based cache eviction
  querier_cache: delegate expiry to the reader_concurrency_semaphore
  reader_concurrency_semaphore: introduce ttl for inactive reads
  querier_cache: use new eviction notify mechanism to maintain stats
  reader_concurrency_semaphore: add eviction notification facility
  reader_concurrency_semaphore: extract evict code into method evict()
2021-02-03 10:59:04 +02:00
Piotr Sarna
d395305ddd api: fix retrieving replied RPC messages
The API call referred to a nonexistent callback,
which is now renamed to better match the API path
and actually implemented.

Message-Id: <3d0dbb42f67e1584999a58da9aa9cc722487fda1.1612279443.git.sarna@scylladb.com>
2021-02-03 09:42:17 +02:00
Pekka Enberg
5670276163 Update seastar submodule
* seastar cb3aaf07...b5b2ee53 (1):
  > perftune.py: fix assignment after extend and add asserts
Fixes #8008
2021-02-02 15:27:13 +02:00
Tomasz Grabiec
873e732042 Merge "Switch partition rows onto B-tree" from Pavel Emelyanov
This is the continuaiton of the row-cache performance
improvements, this time -- the rework of clustering keys part.

The goal is to solve the same set of problems:
- logN eviction complexity
- deep and sparse tree

Unlike partitions, this cache has one big feature that makes it
impossible to just use existing B+ tree:

  There's no copyable key at hands. The clustering key is the
  managed_bytes() that is not nothrow-copy-constructibe, neither
  it's hash-able for lookup due to prefix lookup.

Thus the choice is the B-tree, which is also N-ary one, but
doesn't copy keys around.

B-trees are like B+, but can have key:data pairs in inner nodes,
thus those nodes may be significantly bigger then B+ ones, that
have data-s only in leaf trees. Not to make the memory footprint
worse, the tree assumes that keys and data live on the same object
(the rows_entry one), and the tree itself manages only the key
pointers.

Not to invalidate iterators on insert/remove the tree nodes keep
pointers on keys, not the keys themselves.

The tree uses tri-compare instead of less-compare. This makes the
.find and .lower_bound methods do ~10% less comparisons on random
insert/lookup test.

Numbers:

- memory_footprint: B-tree       master
  rows_entry size:  216          232

  1 row
   in-cache:        968          960     (because of dummy entry)
   in-memtable:     1006         1022

  100 rows
   in-cache:        50774        50856
   in-memtable:     50620        50918

- mutation_test:    B-tree       master
   tps.average:     891177       833896

- simple_query:     B-tree       master
   tps.median:      71807        71656
   tps.maximum:     71847        71708

* xemul/clustering-cache-over-btree-4:
  mutation_partition: Save one keys comparison
  partition_snapshot_row_cursor: Remove rows pointer
  mutation_partition: Use B-tree insertion sugar
  perf-test : Print B-tree sizes
  mutation_partition: Switch cache of rows onto B-tree
  partition_snapshot_reader: Rename cmp to less for explicity
  mutation_partition: Make insertion bullet-proof
  mutation_partition: Use tri-compare in non-set places
  flat_mutation_reader: Use clear() in destroy_current_mutation()
  rows_entry: Generalize compare
  utils: Intrusive B-tree (with tests)
  tests: Generalize bptree compaction test
  tests: Generalize bptree stress test
2021-02-02 12:26:02 +01:00
Tomasz Grabiec
75eb97b12c Merge 'Commitlog multi-entry write' from Calle Wilund
Fixes #7615

Makes the CL writer interface N-valued (though still 1 for the "old" paths). Adds a new write path to input N mutations -> N rp_handles.
Guarantees that all entries are written or none are, and that they will be flushed to disk together.

Small test included.

Closes #7616

* github.com:scylladb/scylla:
  commitlog_test: Add multi-entry write test
  commitlog: Add "add_entries" call to allow inputting N mutations
  commitlog: Make commitlog entries optionally multi-entry
  commitlog: Move entry_writer definition to cc file
2021-02-02 12:23:19 +01:00
Tomasz Grabiec
7b17969a6e Merge 'sstable: reader: preempt after every fragment' from Avi Kivity
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().

Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
is the partition has no rows.

Unlike the previous attempt, push_ready_fragments() is not touched.

The extra preemption opportunities triggered a preexisting bug in
clustering_ranges_walker; it is fixed in the first patch of the series.

I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.

Test: unit (dev, debug, release)

Fixes #7883.

Closes #7928

* github.com:scylladb/scylla:
  sstable: reader: preempt after every fragment
  clustering_range_walker: fix false discontiguity detected after a static row
2021-02-02 12:21:58 +01:00
Benny Halevy
0fecc78d88 user_function: throw on_internal_error if executed outside a seastar thread
Rather than asserting, as seen in #7977.
This shouldn't crash the server in production.

Add unit test that reproduces this scenario
and verifies the internal error exception.

Fixes #7977

Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210201163051.1775536-1-bhalevy@scylladb.com>
2021-02-02 13:03:39 +02:00
Calle Wilund
720a47fe8a commitlog_test: Add multi-entry write test 2021-02-02 10:41:08 +00:00
Calle Wilund
c5f6125039 commitlog: Add "add_entries" call to allow inputting N mutations
Fixes #7615

Allows N mutations to be written "atomically" (i.e. in the same
call). Either all are added to segement, or none.

Returns rp_handle vector corresponding to the call vector.
2021-02-02 10:41:08 +00:00
Calle Wilund
5fcc2066ed commitlog: Make commitlog entries optionally multi-entry
Allows writing more than one blob of data using a single
"add" call into segment. The old call sites will still
just provide a single entry.

To ensure we can determine the health of all the entries
as a unit, we need to wrap them in a "parent" entry.
For this, we bump the commitlog segment format and
introduce a magic marker, which if present, means
we have entries in entry, totalling "size" bytes.
We checksum the entra header, and also checksum
the individual checksums of each sub-entry (faster).
This is added as a post-word.

When parsing/replaying, if v2+ and marker, we have to
read all entries + checksums into memory, verify, and
_then_ we can actually send the info to caller.
2021-02-02 10:41:08 +00:00
Calle Wilund
6bef3f9cc3 commitlog: Move entry_writer definition to cc file
Should not be public/visible
2021-02-02 10:32:44 +00:00
Juliusz Stasiewicz
29e4737a9b transport: Fix abort on certain configurations of native_transport_port(_ssl)
The reason was accessing the `configs` table out of index. Also,
native_transport_port-s can no longer be disabled by setting to 0,
as per the table below.

Rules for port/encryption (the same apply to shard_aware counterpart):

np  := native_transport_port.is_set()
nps := native_transport_port_ssl.is_set()
ceo := ceo.at("enabled") == "true"
eq  := native_transport_port_ssl() == native_transport_port()

+-----+-----+-----+-----+
|  np | nps | ceo |  eq |
+-----+-----+-----+-----+
|  0  |  0  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  0  |  0  |  1  |  *  |   =>   listen on native_transport_port, encrypted
|  0  |  1  |  0  |  *  |   =>   nonsense, don't listen
|  0  |  1  |  1  |  *  |   =>   listen on native_transport_port_ssl, encrypted
|  1  |  0  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  1  |  0  |  1  |  *  |   =>   listen on native_transport_port, encrypted
|  1  |  1  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  1  |  1  |  1  |  0  |   =>   listen on native_transport_port, unencrypted + native_transport_port_ssl, encrypted
|  1  |  1  |  1  |  1  |   =>   native_transport_port(_ssl), encrypted
+-----+-----+-----+-----+

Fixes #7783
Fixes #7866

Closes #7992
2021-02-02 11:32:31 +02:00
Avi Kivity
285303b131 Update tools/jmx submodule
* tools/jmx 2c95650...949cefc (2):
  > dist/redhat: stop using systemd macros, call systemctl directly
  > Remove obsolete FIXME

See scylladb/scylla-jmx#94.
2021-02-02 11:29:36 +02:00
Takuya ASADA
7b310c591e dist/redhat: stop using systemd macros, call systemctl directly
Fedora version of systemd macros does not work correctly on CentOS7,
since CentOS7 does not support "file trigger" feature.
To fix the issue we need to stop using systemd macros, call systemctl
directly.

See scylladb/scylla-jmx#94

Closes #8005
2021-02-02 11:28:07 +02:00
Avi Kivity
da4fa0629a Merge "sstables: add sstable_origin to scylla_metadata" from Benny
"
This series extends the scylla_metadata sstable component
to hold an optional testual description of the sstable origin.
It describes where the sstables originated from
(e.g. memtable, repair, streaming, compaction, etc.)

The origin string is provided by the sstable writer via
sstable_writer_config, written to the scylla_metadata component,
and loaded on sstable::load().

A get_origin() method was added to class sstable to retrieve
its origin.  It returns an empty string by default if the origin
is missing.

Compaction now logs the sstable origin for each sstable it
compacts, and it generates the sstable origin for all sstables
in generates.  Regular compaction origin is simply set to "compaction"
while other compaction types are mentioned by name, as
"cleanup", "resharding", "reshaping", etc.

A unit test was added to test the sstable_origin by writing either
an empty origin and a random string, and then comparing
the origin retrieved by sstable::load to the one written.

Test: unit(release)

Fixes #7880
"

* tag 'sstable-origin-v2' of github.com:bhalevy/scylla:
  compaction: log sstable origin
  sstables: scylla_metadata: add support for sstable_origin
  sstables: sstable_writer_config: add origin member
2021-02-02 10:35:11 +02:00
Pavel Emelyanov
54ddb5a70a mutation_partition: Save one keys comparison
The apply_monotonically checks if the cursor is behind the source
position to decide whether or not to push it forward (with the
lower_bound call). The 2nd comparison is done to check if either
the cursor was ahead or if lower_bound result actually hit the key.

This 2nd comparison can be avoided:

- the 1st case needs B-tree lower_bound API extention that reports
  if the bound is match or not.
- the 2nd one is covered with reusing tri-compare result from the
  1st comparison

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
4ccce97396 partition_snapshot_row_cursor: Remove rows pointer
The pointer is needed to erase an element by its iterator from the
rows container. The B-tree has this method on iterator and it does
NOT need to walk up the tree to find its root, so the complexity
is still amortized constant.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
8e7c1e049b mutation_partition: Use B-tree insertion sugar
The B-tree .insert methods accept unique pointers and release them

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
a92eb2f7a9 perf-test : Print B-tree sizes
After the switch from BST to B-tree the memory foorprint includes inner/leaf nodes
from the B-tree, so it's useful to know their sizes too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
5c0f9a8180 mutation_partition: Switch cache of rows onto B-tree
The switch is pretty straightforward, and consists of

- change less-compare into tri-compare

- rename insert/insert_check into insert_before_hint

- use tree::key_grabber in mutation_partition::apply_monotonically to
  exception-safely transfer a row from one tree to another

- explicitly erase the row from tree in rows_entry::on_evicted, there's
  a O(1) tree::iterator method for this

- rewrite rows_entry -> cache_entry transofrmation in the on_evicted to
  fit the B-tree API

- include the B-tree's external memory usage into stats

That's it. The number of keys per node was is set to 12 with linear search
and linear extention of 20 because

- experimenting with tree shows that numbers 8 through 10 keys with linear
  search show the best performance on stress tests for insert/find-s of
  keys that are memcmp-able arrays of bytes (which is an approximation of
  current clustring key compare). More keys work slower, but still better
  than any bigger value with any type of search up to 64 keys per node

- having 12 keys per nodes is the threshold at which the memory footprint
  for B-tree becomes smaller than for boost::intrusive::set for partitions
  with 32+ keys

- 20 keys for linear root eats the first-split peak and still performs
  well in linear search

As a result the footpring for B tree is bigger than the one for BST only for
trees filled with 21...32 keys by 0.1...0.7 bytes per key.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
165255e2bd partition_snapshot_reader: Rename cmp to less for explicity
This is less comparator, cmp is used as a sign of tri-compare in this set.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
ee9e104541 mutation_partition: Make insertion bullet-proof
The bi::intrusive::set::insert-s are non-throwing, so it's safe to add
new entry like this

  auto* ne = new entry;
  set.insert(ne);

and not worry about memory leak. B-tree's insert will be throwing, so we
need some way to free the new entries in case of exception. There's alreay
a way for this:

   std::unique_ptr<entry> ne = std::make_unique<entry>();
   set.insert(*ne);
   ne.release();

so make every insertion into the set work this way in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
926f748a3d mutation_partition: Use tri-compare in non-set places
The mutation_partition::_rows  will be switched on B-tree with tri
comparator, so to clearly identify not affected by it places, switch
them onto tri-compare in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
bfcd6a4bb7 flat_mutation_reader: Use clear() in destroy_current_mutation()
Currently the code uses a look of unlink_leftmost_without_rebalance
calls. B-tree does have it, but plain clearing of the tree is a bit
faster with clear().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
306c40939b rows_entry: Generalize compare
Turn the rows_entry less-comparator's calls into a template as
they are nothing but wrappers on top of rows_entyry tri-comparator.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
2f7c03d84c utils: Intrusive B-tree (with tests)
The design of the tree goes from the row-cache needs, which are

1. Insert/Remove do not invalidate iterators
2. Elements are LSA-manageable
3. Low key overhead
4. External tri-comparator
5. As little actions on insert/remove as possible

With the above the design is

Two types of nodes -- inner and leaf. Both types keep pointer on parent nodes
and N pointers on keys (not keys themselves). Two differences: inner nodes have
array of pointers on kids, leaf nodes keep pointer on the tree (to update left-
and rightmost tree pointers on node move).

Nodes do not keep pointers/references on trees, thus we have O(1) move of any
object, but O(logN) to get the tree size. Fortunately, with big keys-per-node
value this won't result in too many steps.

In turn, the tree has 3 pointers -- root, left- and rightmost leaves. The latter
is for constant-time begin() and end().

Keys are managed by user with the help of embeddable member_hook instance,
which is 1 pointer in size.

The code was copied from the B+ tree one, then heavily reworked, the internal
algorythms turned out to differ quite significantly.

For the sake of mutation_partition::apply_monotonically(), which needs to move
an element from one tree into another, there's a key_grabber helping wrapper
that allows doing this move respecting the exception-safety requirement.

As measured by the perf_collections test the B-tree with 8 keys is faster, than
the std::set, but slower than the B+tree:

            vs set        vs b+tree
   fill:     +13%           -6%
   find:     +23%          -35%

Another neat thing is that 1-key insertion-removal is ~40% faster than
for BST (the same number of allocations, but the key object is smaller,
less pointers to set-up and less instructions to execute when linking
node with root).

v4:
- equip insertion methods with on_alloc_point() calls to catch
  potential exception guarantees violations eariler

- add unlink_leftmost_without_rebalance. The method is borrowed from
  boost intrusive set, and is added to kill two birds -- provide it,
  as it turns out to be popular, and use a bit faster step-by-step
  tree destruction than plain begin+erase loop

v3:
- introduce "inline" root node that is embedded into tree object and in
  which the 1st key is inserted. This greatly improves the 1-key-tree
  performance, which is pretty common case for rows cache

v2:
- introduce "linear" root leaf that grows on demand

  This improves the memory consumption for small trees. This linear node may
  and should over-grow the NodeSize parameter. This comes from the fact that
  there are two big per-key memory spikes on small trees -- 1-key root leaf
  and the first split, when the tree becomes 1-key root with two half-filled
  leaves. If the linear extention goes above NodeSize it can flatten even the
  2nd peak

- mitigate the keys indirection a bit

  Prefetching the keys while doing the intra-node linear scan and the nodes
  while descending the tree gives ~+5% of fill and find

- generalize stress tests for B and B+ trees

- cosmetic changes

TODO:

- fix few inefficincies in the core code (walks the sub-tree twice sometimes)
- try to optimize the leaf nodes, that are not lef-/righmost not to carry
  unused tree pointer on board

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:29 +03:00
Pavel Emelyanov
6d63bdbefe tests: Generalize bptree compaction test
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:28:59 +03:00
Pavel Emelyanov
8bdad0bb28 tests: Generalize bptree stress test
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:28:57 +03:00
Avi Kivity
db4b9215dd sstable: reader: preempt after every fragment
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().

Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
is the partition has no rows.

Unlike the previous attempt, push_ready_fragments() is not touched.

I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.

Test: unit (dev)

Fixes #7883.
2021-02-01 19:32:07 +02:00
Avi Kivity
7634a90dd2 clustering_range_walker: fix false discontiguity detected after a static row
clustering_range_walker detects when we jump from one row range to another. When
a static row is included in the query, the constructor sets up the first before/after
bounds to be exactly that static row. That creates an artificial range crossing if
the first clustering range is contiguous with the static row.

This can cause the index to be consulted needlessly if we happen to fall back
to sstable_mutation_reader after reading the static row.

A unit test is added.

Ref #7883.
2021-02-01 19:32:07 +02:00
Pavel Solodovnikov
9d17a654a6 raft: use null_sharder for raft tables
Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210201105300.110210-1-pa.solodovnikov@scylladb.com>
2021-02-01 18:52:04 +02:00
Gleb Natapov
382ee066bf database: drop duplicated function
The database lass have to duplicated functions keyspaces() and
get_keyspaces(). Drop the former since it is used in one place only.

Message-Id: <20210201135333.GA1403508@scylladb.com>
2021-02-01 18:52:04 +02:00
Tomasz Grabiec
eac9c1d80a Merge "raft: configuration changes with joint consensus" from Kostja
Support configuration changes based on joint consensus.
When a user adds a configuration entry, commit an interim "joint
consensus" configuration to the log first, and transition to the
final configuration once both C_old and C_new configurations
accept the joint entry.

Misc cleanups.

* scylla-dev/raft-config-changes-v2:
  raft: update README.md
  raft: add a simple test for configuration changes
  raft: joint consensus, wire up configuration changes in the API
  raft: joint consensus, count votes using joint config
  raft: joint consensus, wire up configuration changes in FSM
  raft: joint consensus, update progress tracker with joint configuration
  raft: joint consensus, don't store configuration in FSM
  raft: joint consensus, keep track of the last confchange index in the log
  raft: joint consensus, implement helpers in class configuration
  raft: joint consensus, use unordered_set for server_address list
  raft: joint consensus, switch configuration to joint
  raft: rename check_committed() to maybe_commit()
  raft: fix spelling and add comments
2021-02-01 18:52:04 +02:00
Benny Halevy
4b309e0829 compaction: log sstable origin
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Benny Halevy
77328a936a sstables: scylla_metadata: add support for sstable_origin
Add new scylla_metadata_type::SSTableOrigin.
Store and retrive a sstring to the scylla metadata component.
Pass sstable_writer_config::origin from the mx sstable writer
and ignore it in the k_l writer.

Add unit test to verify the sstable_origin extension
using both empty and a random string.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Benny Halevy
22f6023ac3 sstables: sstable_writer_config: add origin member
Add a string describing where the sstables originated
from (e.g. memtable, repair, streaming, compaction, etc.)

If configure_writer is called with a nullptr, the origin
will be equal to an empty string.

Introduce test_env_sstables_manager that provides an overload
of configure_writer with no parmeters that calls the base-class'
configure_writer with "test" origin.  This was to reduce the
code churn in this patch and to keep the tests simple.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Nadav Har'El
75a4281bff cql-pytest: test the units supposed to be usable for "duration" type
This patch adds a test for the different units which are supposed to
be usable for assigning a "duration" type in CQL. It turns out that
all documented units are supported correctly except µs (with a unicode
mu), so the test reproduces issue #8001.

The test xfails on Scylla (because µs is not supported) and passes
on Cassandra.

Refs: #8001.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210131192220.407481-1-nyh@scylladb.com>
2021-02-01 11:05:10 +01:00
Avi Kivity
bb202db1ff Merge 'dist/offline_installer/redhat: fix umask error' from Takuya ASADA
Since makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need use latest version of makeself, and specfiy --keep-umask
option.

Fixes #6243

Closes #6244

* github.com:scylladb/scylla:
  dist/offline_redhat: fix umask error
  dist/offline_installer/redhat: support cross build
2021-01-31 18:47:27 +02:00
Takuya ASADA
49e4f318a0 dist/offline_redhat: fix umask error
Since makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need use latest version of makeself, and specfiy --keep-umask
option.

Fixes #6243
2021-01-31 21:37:49 +09:00
Takuya ASADA
74d7e31576 dist/offline_installer/redhat: support cross build
Supported cross build by running CentOS7 on docker, now it's able to build
on Fedora.
It also supported switch container image, tested on Oracle Linux 7 and
CentOS 7/8.
2021-01-31 21:37:49 +09:00
Avi Kivity
9271e4bf6e Update seastar submodule
* seastar 52d41277a...cb3aaf07e (2):
  > tls: reloadable_credentials_base: add_dir_watch: fix root dir detection
  > scripts/perftune.py: convert nic option in old perftune.yaml to list for compatibility
2021-01-31 13:28:45 +02:00
Raphael S. Carvalho
298d54ceb0 utils/fragment_temporary_buffer: don't push empty fragment if data size is fragment-aligned
last fragment is unconditionally pushed to set of fragments, so if data
size is fragment-aligned, an empty fragment will be needlessly pushed to
the back of the fragment set.

note: i haven't tested if empty fragment at back of set will cause issues,
i think it won't, but this should be avoided anyway.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-3-raphaelsc@scylladb.com>
2021-01-30 20:54:20 +02:00
Raphael S. Carvalho
e745f1e697 utils/fragmented_temporary_buffer: avoid reallocations by reserving upfront
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-2-raphaelsc@scylladb.com>
2021-01-30 20:54:20 +02:00
Raphael S. Carvalho
08e838d4b5 utils/fragmented_temporary_buffer: simplify allocate_to_fit()
1) reuse default_fragment_size for knowledge of max fragment size
2) fragments_count is not a good name as it doesn't include last non-full
fragment (if present), so rename it.
3) simplify calculation of last fragment size

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-1-raphaelsc@scylladb.com>
2021-01-30 20:54:20 +02:00
Pavel Solodovnikov
b9a280161d raft: introduce raft_rpc class
The patch contains a skeleton implementation for the Scylla-specific
Raft RPC module.

It uses `netw::messaging_service` as underlying mechanism to send
RPC messages.

The instance is supposed to be bound to a single raft group.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-30 01:12:35 +03:00
Pavel Solodovnikov
1a979dbba2 raft: add Raft RPC verbs to messaging_service and wire up the RPC calls
All RPC module APIs except for `send_snapshot` should resolve as
soon as the message is sent, so these messages are passed via
`send_message_oneway_timeout`.

`send_snapshot` message is sent via `send_message_timeout` and
returns a `future<>`, which resolves when snapshot transfer
finishes or fails with an exception.

All necessary functions to wire the new Raft RPC verbs are also
provided (such as `register` and `unregister` handlers).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-30 01:11:17 +03:00
Pavel Solodovnikov
e30a55ba2f configure.py: compile serializer.cc
This file was not added to the configure.py,
which `raft_sys_table_storage` series was supposed to do.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-30 01:09:32 +03:00
Konstantin Osipov
a8f2fa7fa0 raft: update README.md 2021-01-29 22:07:08 +03:00
Konstantin Osipov
b7692af8bc raft: add a simple test for configuration changes
Test adding, removing replacing a node.

With fix-ups by Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-29 22:07:08 +03:00
Konstantin Osipov
c7b5a60320 raft: joint consensus, wire up configuration changes in the API
Now that we've implemented joint consensus based configuration changes,
replace add_server()/remove_server() with a more general set_configuration().
2021-01-29 22:07:08 +03:00
Konstantin Osipov
afadc7c0a1 raft: joint consensus, count votes using joint config
Send RequestVote to a joint config.

We need to exclude self from the list of peers
if we're not part of the current configuration.
Avoid disrupting the cluster in this case.

Maintain separate status for previous and current config when counting
votes.
2021-01-29 22:07:08 +03:00
Konstantin Osipov
8b86d91754 raft: joint consensus, wire up configuration changes in FSM
When add_entry() with new configuraiton is submitted,
create a joint configuration and switch to it immediately.
Refuse to enter joint configuration if a configuration
change is already in progress.
When the leader it committed an entry with joint configuration,
append a new entry with final configuration and switch to it.

Resign leadership if the current leader is not part of a new
configuration.

When we change from A, B, C to B, C, D and the leader is A,
then, when C_new starts to be used, the leader is not part of
the current configuration, so it doesn't have to be in the tracker.
Do not try to find & advance leader progress unconditionally then.
2021-01-29 22:07:08 +03:00
Konstantin Osipov
18a684ba11 raft: joint consensus, update progress tracker with joint configuration
The leader doesn't have to be part of the current
configuration, so add a way to access follower_progress for the leader
only if it is present.

Upon configuration changes, preserve progress information
for intact nodes, remove for removed, and create a new progress
object for added nodes.

When tracking commit progress in joint configuration mode,
calculate two commit indexes for two configurations, and
choose the smallest one.
2021-01-29 22:07:08 +03:00
Konstantin Osipov
20df1955b2 raft: joint consensus, don't store configuration in FSM
In follower state, FSM doesn't know the current cluster
configuration.  Instead of trying to watch the follower log for
configuration changes to keep FSM copy up to date, remove it from
FSM altogether since the follower doesn't need it anyway.

When entering candidate or leader state, fetch the most recent
configuration from the log and initialize the state specific
state with it.
2021-01-29 22:07:07 +03:00
Konstantin Osipov
b29181875c raft: joint consensus, keep track of the last confchange index in the log
When initializing the log, find the most recent configuration
change index, if present.
Maintain the most recent configuration change index when
the log is truncated or entries are appended to it.
The last configuration change index will be used by FSM when it enters
candidate or leader state to fetch the current configuration.

We never truncate beyond a single in-progress configuration
change, so storing the previous value of last_conf_idx
helps avoid log backward scan on truncation in 100% of cases.

Remove all unused log constructors.
2021-01-29 22:07:07 +03:00
Konstantin Osipov
6e128aa357 raft: joint consensus, implement helpers in class configuration 2021-01-29 22:07:07 +03:00
Konstantin Osipov
1ca738d9a2 raft: joint consensus, use unordered_set for server_address list 2021-01-29 22:07:07 +03:00
Konstantin Osipov
df944f953c raft: joint consensus, switch configuration to joint
In order to work correctly in transitional configuration,
participants must enter it after crashes, restarts and
state changes.

This means it must be stored in Raft log and snapshot
on the leader and followers.

This is most easily done if transitional configuration
is just a flavour of standard configuration.

In FSM, rename _current_config to _configuration,
it now contains both current and future configuration
at all times.
2021-01-29 22:07:07 +03:00
Konstantin Osipov
076e46af9e raft: rename check_committed() to maybe_commit()
This is what the function does, and it's the name
used in other implementations.
2021-01-29 22:07:07 +03:00
Gleb Natapov
aad0209b1c raft: fix spelling and add comments
Fix spelling errors in a few comments,
improve comments.

With fix-ups by Gleb Natapov <gleb@scylladb.com>
2021-01-29 22:07:07 +03:00
Pavel Emelyanov
575c992a35 test: Bring test_apply_monotonically_is_monotonic back to work
The idea of the monotonicity checking test is: try to apply
one one random partition to another random one sequentually
failing allocations. Each time allocation fails (with the
bad_alloc exception) -- check the exception guarantee is
respected, then apply (!) the very same two partitions to
each other. At the end of the test we make sure, that an
exception may pop up at any point of application and it
will be safe.

This idea is flawed currently. When verifying the guarantee
the test moves the 2nd partition and leaves it empty for the
next loop iteration. So right on the 2nd attempt to apply
partitions it becomes a no-op, doesn't fail and no more
exceptions arise.

Fix by restoring both partitions at the end of each check.
Broken since 74db08165d.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210129153641.5449-1-xemul@scylladb.com>
2021-01-29 18:47:15 +01:00
Tomasz Grabiec
16eb4c6ce2 Merge "raft: system table backed persistency module" from Pavel Solodovnikov
This series contains an initial implementation of raft persistency module
that uses `raft` system table as the underlying storage model.

"system.raft" table will be used as a backend storage for implementing
raft persistence module in Scylla. It combines both raft log,
persisted vote and term, and snapshot info.

The table is partitioned by group id, thus allowing multi-raft
operation. The rest of the table structure mirrors the fields of
corresponding core raft structures defined in `raft.hh`, such as
`raft::log_entry`.

The raft table stores the only the latest snapshot id while
the actual snapshot will be available in a separate table
called `system.raft_snapshots`. The schema of `raft_snapshots`
mirrors the fields of `raft::snapshot` structure.

IDL definitions are also added for every raft struct so that we
automatically provide serialization and deserialization facilities
needed both for persistency module and for future RPC implmementation.

The first patch is a side-change needed to provide complete
serialization/deserialization for `bytes_ostream`, which we
need when persisting the raft log in the table (since `data`
is a variant containing `raft::command` (aka `bytes_ostream`)
among others).
`bytes_ostream` was lacking `deserialize` function, which is
added in the patch.

The second patch provides serializer for `lw_shared_ptr<T>`
which will be used for `raft::append_entries`, which has
a field with `std::vector<const lw_shared_ptr<raft::log_entry>>`
type.

There is also a patch to extend `fragmented_temporary_buffer`
with a static function `allocate_to_fit` that allocates an
instance of the fragmented buffer that has a specified size.
Individual fragment size is limited to 128kb.

The patch-set also contains the test suite covering basic
functionality of the persistency module.

* manmanson/raft-api-impl-v11:
  raft/sys_table_storage: add basic tests for raft_sys_table_storage
  raft: introduce `raft_sys_table_storage` class
  utils: add `fragmented_temporary_buffer::allocate_to_fit`
  raft: add IDL definitions for raft types
  raft: create `system.raft` and `system.raft_snapshots` tables
  serializer: add `serializer<lw_shared_ptr<T>>` specialization
  serializer: add `deserialize` function overload for `bytes_ostream`
2021-01-29 11:40:39 +02:00
Pavel Solodovnikov
e309502c42 raft/sys_table_storage: add basic tests for raft_sys_table_storage
The test suite covers the most basic use cases for the system table
backed raft persistency module:
 * store/load vote and term
 * store/load snapshot
 * store snapshot with log tail truncation
 * store/load log entries
 * log truncation

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 02:00:27 +03:00
Pavel Solodovnikov
aebb1987b5 raft: introduce raft_sys_table_storage class
This is the implementation of raft persistency module that
uses `raft` system table as the underlying storage model.

The instance is supposed to be bound to a single raft group.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 02:00:12 +03:00
Pavel Solodovnikov
d14dc030ac utils: add fragmented_temporary_buffer::allocate_to_fit
Introduce `fragmented_temporary_buffer::allocate_to_fit` static
function returning an instance of the buffer of a specified size.

The allocated buffer fragments have a size of at most 128kb.
`bytes_ostream` has the same hard-coded limit, so just use the
same here.

This patch will be later needed for `raft::log_entry` raw data
serialization when writing to the underlying persistent storage.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 01:59:16 +03:00
Pavel Solodovnikov
e1504bbf0e raft: add IDL definitions for raft types
Changes to the `configuration` and `tagged_uint64` classes are needed
to overcome limitations of the IDL compiler tool, i.e. we need to
supply a constructor to the struct initializing all the
members (raft::configuration) and also need to make an accessor
function for private members (in case of raft::tagged_uint64).

All other structs mirror raft definitions in exactly the same way
they are declared in `raft.hh`.

`tagged_id` and `tagged_uint64` are used directly instead of their
typedef-ed companions defined in `raft.hh` since we don't want
to introduce indirect dependencies. In such case it can be guaranteed
that no accidental changes made outside of the idl file will affect idl
definitions.

This patch also fixes a minor typo in `snapshot_id_tag` struct used
in `snapshot_id` typedef.
2021-01-29 01:59:10 +03:00
Pavel Solodovnikov
cf5b8c4b79 raft: create system.raft and system.raft_snapshots tables
System raft table will be used as a backend storage for implementing
raft persistence module in Scylla. It combines both raft log,
persisted vote and term, and snapshot info.

The table is partitioned by group id, thus allowing multi-raft
operation. The rest of the table structure mirrors the fields of
corresponding core raft structures defined in `raft.hh`, such as
`raft::log_entry`.

The raft table stores the only the latest snapshot id while
the actual snapshot will be available in a separate table
called `system.raft_snapshots`. The schema of `raft_snapshots`
mirrors the fields of `raft::snapshot` structure.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 01:59:04 +03:00
Pavel Solodovnikov
83c26e542d serializer: add serializer<lw_shared_ptr<T>> specialization
This one works similar to `serializer<optional<T>>` and will be
later needed for serializing `raft::append_request`, which has
a field containing `lw_shared_ptr`.

Users to be warned, though: this code assumes that the pointer
is never null. This is done to mirror the serialize implementation
for `lw_shared_ptr:s` in the messaging_service.cc, which is
subject to being deleted in favor of the impl in the
`serializer_impl.hh`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 01:58:46 +03:00
Avi Kivity
b32ece6975 Update tools/java submodule
* tools/java 4a55b81941...78c8ef4f54 (1):
  > nodetool: do no treat table name with dot as a secondary index

Fixes #6521.
2021-01-28 16:16:47 +02:00
Kamil Braun
bf115e7d69 schema_tables: put schema tables on shard 0
We use a custom sharder for all schema tables: every table under
the `system_schema` keyspace, plus `system.scylla_table_schema_history`.
This sharder puts all data on shard 0.

To achieve this, we hardcode the sharder in initial schema object
definitions. Furthermore - since the sharder is not stored inside schema
mutations yet - whenever we deserialize schema objects from mutations,
we modify the sharder based on the schema's keyspace and table names.

A regression test is added to ensure no one forgets to set the special
sharder for newly added schema tables. This test assumes that all newly
added schema tables will end up in the `system_schema` keyspace (other
tables may go unnoticed, unfortunately).

Closes #7947
2021-01-28 13:28:22 +02:00
Avi Kivity
32cdcc0c8b Merge "sstables: consolidate reader factory methods" from Botond
"
Currently there are three different methods for creating an sstable
reader:
* one for single key reads
* one for ranged reads
* and one nobody uses

This patch-set consolidates all these into a single `make_reader()`
method, which behind the scenes uses the same logic to dispatch to the
right sstable reader constructor that `sstables::as_mutation_source()`
uses.

This patch-set is part of an effort to clean up the jungle that is the
various reader creation methods. The next step is to clean up the
sstable_set, which has even more methods.

One very sad discovery I made while working on this patch-set is that
we
still default `mutation_reader::forwarding` to `yes` in the sstable
range reader creator method and in the
`mutation_source::make_reader()`.
I couldn't assume that all callers are passing what they mean as the
value for that parameter. I found many sites in tests that create
forwardable single partition readers. This is also something we should
address soon.

Tests: unit(release, debug:v3)
"

* 'sstables-consolidate-reader-factory-methods-v4' of https://github.com/denesb/scylla:
  cql_query_test: add unit test covering the non-optimal TWCS sstable read path
  sstable_mutation_reader: consolidate constructors
  tests: don't pass temporary ranges to readers
  sstables: sstable_mutation_reader: remove now unused whole sstable constructor
  sstables: stats: remove now unused sstable_partition_reads counter
  sstable: remove read_.*row.*_flat() methods
  tree-wide: use sstables::make_reader() instead of the read_.*row.*_flat() methods
  sstables: pass partition_range to create_single_key_sstable_reader()
  sstables: sstable: add make_reader()
2021-01-28 12:05:06 +02:00
Botond Dénes
1e9ce62ee6 cql_query_test: add unit test covering the non-optimal TWCS sstable read path
The sstable read path for TWCS tables takes a different path when the
optimized read path cannot be used. This path was found to be not
covered at all by unit tests which allowed a trivial use-after-free to
slip in. Add a unit test to cover this path as well, so ASAN can catch
such bugs in the future.
2021-01-28 11:34:03 +02:00
Avi Kivity
55609f2033 Update seastar submodule
* seastar a287bb1a3...52d41277a (8):
  > fair_queue: Preempted requests got re-queued too far
  > scripts/perftune.py: remove repeated items after merging options from file
  > file.hh: Remove fair_queue.hh
  > Merge "Reloadable TLS certificate tolerance" from Calle
  > Merge "Cancellable IO" from Pavel E
  > abort-source: Improve the subscriptions management
  > fair_queue: Improve requests preemption while in pending state
  > http: add support for Default handler (/*)
2021-01-28 08:45:33 +01:00
Konstantin Osipov
b4f875f08e uuid: reduce code dependency on UUID_gen.hh
Do not include UUID_gen.hh in trace_state.hh and lists.hh
to reduce header level dependency on it.

Message-Id: <20210127173114.725761-2-kostja@scylladb.com>
2021-01-27 20:08:29 +02:00
Botond Dénes
6024ef5dad sstable_mutation_reader: consolidate constructors
The two remaining sstable constructor are very similar apart from the
content of the initialize lambda. Speaking of which, the two remaining
initializer lambdas can be easily merged into one too. So this patch
does just that, consolidates the two constructors one and moves
consolidates as well as extracts the initializer method into a member
method. This means we have to store the previously captured variables as
members, but this is actually a good thing: when debugging we can see
the range and slice the reader is reading, and we are not actually
paying for it either -- they were already stored, just out of sight.
2021-01-27 17:38:17 +02:00
Botond Dénes
dd26a96e63 tests: don't pass temporary ranges to readers
The sstable_mutation_reader, like all other mutation readers expects
that the partition-range passed to it is kept alive by its creator
for the duration of its lifetime. However, the single-key constructor
of the sstable reader was more tolerant, as it only extracted the key
from the range, essentially requiring only the key to be kept alive (but
not the containing range). Naturally in time some code come to rely on
it and ended up passing temporary ranges to the reader. This behaviour
will no longer be acceptable as we are about to consolidate the various
sstable reader constructors, uniformly requiring that the range is kept
alive. So this patch fixes up the tests so they work with this stricter
requirement. Only two occurences were found.
2021-01-27 17:38:17 +02:00
Botond Dénes
43ad64db78 sstables: sstable_mutation_reader: remove now unused whole sstable constructor 2021-01-27 17:38:17 +02:00
Botond Dénes
ec6c540c30 sstables: stats: remove now unused sstable_partition_reads counter 2021-01-27 17:38:17 +02:00
Botond Dénes
5f18e9eb37 sstable: remove read_.*row.*_flat() methods 2021-01-27 17:38:17 +02:00
Botond Dénes
c3b4e990a2 tree-wide: use sstables::make_reader() instead of the read_.*row.*_flat() methods 2021-01-27 17:38:17 +02:00
Botond Dénes
080bc2ffec sstables: pass partition_range to create_single_key_sstable_reader()
We want to unify the various sstable reader creation methods and this
method taking a ring position instead of a partition range like
everybody else stands in the way of that.

This is effect reverts 68663d0de.
2021-01-27 17:38:14 +02:00
Wojciech Mitros
a1f93e4297 api: use a list instead of a vector to remove a large allocation in api handler
Follow-up to #7917

The size of an cf::column_family_info is 224 bytes, so an std::vector that
contains one for each column family may be very large, causing allocations
of over 1MB.
Considering the vector is used only for iteration, it can be changed to
a non-contiguous list instead.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #7973
2021-01-27 16:02:07 +02:00
Avi Kivity
aec231ba2e Merge "Unify query paths" from Botond
"
Currently we have two parallel query paths:
* database::query() -> table::query() -> data_query()
* mutation::query()

The former is used by single partition queries, the latter by range
scans, as mutation::query() is used to convert reconcilable_result to
query::result (which means it is also used in single partition queries
if it triggers read repair). This is a rather unfortunate situation as
we have two parallel implementation of the query code, which means they
are prone to diverge, and in fact they already have -- more on that
later.

This patchset aims to remedy this situation by retiring
`mutation::query()` and migrating users to an implementation based on
the "standard" query path, in other words one using the same building
blocks as the `database::query()` path. This means using
`compact_mutation` for compacting and `query_result_builder` for result
building. These components however were created to work with
`flat_mutation_reader`, however introducing a reader into this pipeline
would mean that we'd have to make all the related APIs asynchronous,
which would cause an insane amount of churn. To avoid this, this
patchset adds an API compatible `consume()` method to `mutation`, which
can accept a `compact_mutation` instance as-is. This allows an elegant
and succinct reimplementation. So far so good.

Like mentioned above, the two implementations have diverged in time, or
have been different from the start. The difference manifest when
calculating digests, more precisely in which tombstones are included in
the digest. The retired `mutation::query()` path incorporates only
non-purgeable tombstones in the digest. The standard query path however
incorporates all tombstones, even those that can be purged. After some
scrutiny however this difference proved to be completely theoretical,
as
the code path where this would matter -- converting reconcilable result
to query result -- passes min timestamp as the query time to the
compaction, so nothing is compacted and hence the difference has no
chance to manifest.

This patch-set was motivated by the desire to provide a single solution
to #7434, instead of two, one for each path.

Tests: unit(release:v2, debug:v2, dev:v3)
"

* 'unified-query-path/v3' of https://github.com/denesb/scylla:
  mutation: remove now unused query() and query_compacted()
  treewide: use query_mutations() instead of mutation::query()
  mutation_test: test_query_digest: ensure digest is produced consistently
  mutation_query: introduce query_mutation()
  mutation_query: to_data_query_result(): migrate to standard query code
  mutation_query: move to_data_query_result() to mutation_partition.cc
  mutation: add consume()
  flat_mutation_reader: move mutation consumer concepts to separate header
  mutation compactor: query compaction: ignore purgeable tombstones
2021-01-27 15:58:47 +02:00
Botond Dénes
a5a8037f6e sstables: sstable: add make_reader()
This will be the only method to create sstable readers with. For now we
leave the other variants, they as well as their users will be removed in
a following patch.
2021-01-27 15:20:06 +02:00
Nadav Har'El
2113849a2b cql-pytest: reproducer for toJson() bug with doubles
This patch adds a cql-pytest, test_json.py::test_tojson_double(),
which reproduces issue #7972 - where toJson() prints some doubles
incorrectly - truncated to integers, but some it prints fine (I
still don't know why, this will need to be debugged).

The test is marked xfail: It fails on Scylla, and passes on Cassandra.

Refs #7972.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210127124338.297544-1-nyh@scylladb.com>
2021-01-27 14:00:25 +01:00
Pavel Solodovnikov
10b117aada raft: create dummy impl for schema changes state machine
This patch introduces `schema_raft_state_machine` class
which is currently just a dummy implementation throwing a
"not implemented" exceptions for every call.

Will be needed later to construct an instance of `raft::server`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210126193413.1520948-1-pa.solodovnikov@scylladb.com>
2021-01-27 12:33:27 +01:00
Pavel Solodovnikov
223c823963 serializer: add deserialize function overload for bytes_ostream
For some reason we had a distinct specialization of `serialize`
function to handle `bytes_ostream` but not `deserialize`.

This will be used in the following patches.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-26 23:21:15 +03:00
Asias He
c82250e0cf gossip: Allow deferring advertise of local node to be up
Currently the replacing node sets the status as STATUS_UNKNOWN when it
starts gossip service for the first time before it sets the status to
HIBERNATE to start the replacing operation. This introduces the
following race:

1) Replacing node using the same IP address of the node to be replaced
starts gossip service without setting the gossip STATUS (will be seen as
STATUS_UNKNOWN by other nodes)

2) Replacing node waits for gossip to settle and learns status and
tokens of existing nodes

3) Replacing node announces the HIBERNATE STATUS.

After Step 1 and before Step 3, existing nodes will mark the replacing
node as UP, but haven't marked the replacing node as doing replacing
yet. As a result, the replacing node will not be excluded from the read
replicas and will be considered a target node to serve CQL reads.

To fix, we make the replacing node avoid responding echo message when it is not
ready.

Fixes #7312

Closes #7714
2021-01-26 19:02:11 +01:00
Pekka Enberg
9fc83ac627 Update tools/java submodule
* tools/java 8080009794...4a55b81941 (1):
  > cassandra.in.sh: remove debug message
2021-01-26 15:56:58 +02:00
Avi Kivity
90a6c3bd7a build: reduce release mode inline tuning on aarch64
I see a miscompile on aarch64 where a call to format("{}", uuid)
translates a function pointer to -1. When called, this crashes.

Reduce the inline threshold from 2500 to 600. This doesn't guarantee
no miscompiles but all the tests pass with this parameter.

Closes #7953
2021-01-26 11:14:42 +02:00
Tomasz Grabiec
90f6bb754e Merge "raft: replication tests: fixes for debug mode" from Alejo
The following patches fix issues seen occasionally in debug mode.

Notes:
    - In debug mode there's still the UB nullptr arithmetic warning.

* https://github.com/alecco/scylla/tree/raft-ale-tests-07h-wait-propagation:
  raft: replication test: wait for log propagation
  raft: replication test: move wait for log to a function
  raft: replication test: remove unused member
  raft: replication test: use later()
  raft: testing: remove election wait time and just yield
2021-01-26 11:14:42 +02:00
Avi Kivity
f58151d191 test: mutation_test: fix initialization order bug with thread local storage
test_cell_external_memory_usage uses with_allocator() to observe how some
types allocate memory. However, compiler reordering (observed with clang 11
on aarch64) can move the various thread-local CQL type object initialization
into the with_allocator() scope; so any managed object allocated as part of
this initialization also gets measured, and the test fails. The code movement
is legal, as far as I can tell.

Fix this by initializing the type object early; use an atomic_thread_fence
as an optimization barrier so the compiler doesn't eliminate the or move
the early initialization.

Closes #7951
2021-01-26 11:14:42 +02:00
Nadav Har'El
356250f720 cql-pytest: tests for fromJson() failing to set tuple elements to null
This patch adds a test for trying to set a tuple element to null with
fromJson(), which works on Cassandra but fails on Scylla. So the test
xfails on Scylla. Reproduces issue #7954.

Refs #7954.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210124082311.126300-1-nyh@scylladb.com>
2021-01-26 11:14:42 +02:00
Avi Kivity
05c435dddc Merge "mutation readers: remove next_partition() workarounds" from Botond
"
`next_partition()` used to return void, so readers that had to call
future returning code had to work around this. Now that
`next_partition()` returns a future, we can get rid of these
workarounds.

Tests: unit(release, debug)
"

* 'next-partition-cross-shard-readers/v1' of https://github.com/denesb/scylla:
  mutation_reader: reader_lifecycle_policy::stopped_reader: drop pending_next_partition flag
  mutation_reader: evictable_reader: remove next_partition() workaround
  mutation_reader: shard_reader: remove next_partition() workaround
  mutation_reader: foreign_reader: remove next_partition() workaround
2021-01-26 11:14:42 +02:00
Nadav Har'El
067330c08f Merge 'redis: support large redis message' from Takuya ASADA
If the message is larger than current buffer size, we need to consume
more data until we reach to tail of the message.
To do so, we need to return nullptr when it's not on the tail.

Fixes #7273

Closes #7903

* github.com:scylladb/scylla:
  redis: rename _args_size/_size_left There are two types of numerical parameter in redis protocol:  - *[0-9]+ defined array size  - $[0-9]+ defined string size
  redis: fix large message handling
2021-01-25 10:11:17 +02:00
Takuya ASADA
229940aaff redis: rename _args_size/_size_left
There are two types of numerical parameter in redis protocol:
 - *[0-9]+ defined array size
 - $[0-9]+ defined string size

Currently, array size is stored to args_count, and string size is
stored to _arg_size / _size_left.
It's bit hard to understand since both uses same word "arg(s)", let's
rename string size variables to _bytes_count / _bytes_left.
2021-01-25 10:26:37 +09:00
Takuya ASADA
7a6ee9858f redis: fix large message handling
If the message is larger than current buffer size, we need to consume
more data until we reach to tail of the message.
To do so, we need to return nullptr when it's not on the tail.

Fixes #7273
2021-01-25 10:26:37 +09:00
Alejo Sanchez
0d694990cf raft: replication test: wait for log propagation
Wait until entries propagate after adding and before changing leader
using the same code as done for partitioning.

This fixes occasional hangs in debug mode when a test switches to a
different leader without leaving enough time for full propagation.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:33:54 -04:00
Alejo Sanchez
4d1ec88f90 raft: replication test: move wait for log to a function
Move wait for log propagation to its own function for reuse.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:25:48 -04:00
Alejo Sanchez
72f9b108e3 raft: replication test: remove unused member
Initial state doesn't need to specify total entries anymore.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:25:48 -04:00
Alejo Sanchez
db95d6e7f1 raft: replication test: use later()
Instead of sleep 1us use later()

Also use later to yield after sending append entries in rpc test impl.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:25:48 -04:00
Alejo Sanchez
f875ff72c9 raft: testing: remove election wait time and just yield
Replace sleep time for elect_me_leader with yield to speed things up.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:25:48 -04:00
Pekka Enberg
8258556832 Update tools/python3 submodule
* tools/python3 c579207...199ac90 (1):
  > dist: debian: adjust .orig tarball name for .rc releases
2021-01-24 21:30:59 +02:00
Gleb Natapov
020da49c89 storage_proxy: remove no longer needed range_slice_read_executor
After support for mixed cluster compatibility feature
DIGEST_MULTIPARTITION_READ was dropped in 854a44ff9b
range_slice_read_executor and never_speculating_read_executor become
identical, so remove the former for good.

Message-Id: <20210124122731.GA1122499@scylladb.com>
2021-01-24 14:45:22 +02:00
Benny Halevy
088f92e574 paxos_state: learn: fix injected error description
It was copy-pasted from another injection point.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201220091439.3604201-1-bhalevy@scylladb.com>
2021-01-24 11:51:23 +02:00
Takuya ASADA
5d527bd17e scylla_ntp_setup: use chrony on all distributions
To simplify scylla_ntp_setup, use chrony on all distributions.

Closes #7922
2021-01-24 11:45:58 +02:00
Takuya ASADA
984dc44ebf dist: drop /etc/security/limits.d/scylla.conf
Drop limits.d conf file, since we don't use it.
We set these parameters via systemd unit file instead.

Fixes #7925

Closes #7941
2021-01-24 11:43:39 +02:00
Benny Halevy
1847d49971 test: test_env: pick the highest sstable version by default
If possible, test the highest sstable format version,
as it's the mostly used.

If there pre-written sstables we need to load from the
test directory from an older version, either specify their
version explicitly, or use the new test_env::reusable_sst
method that looks up the latest sstable version in the
given directory and generation.

Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201210161822.2833510-1-bhalevy@scylladb.com>
2021-01-24 10:38:55 +02:00
Botond Dénes
226088d12e mutation_reader: reader_lifecycle_policy::stopped_reader: drop pending_next_partition flag
Its not used anymore.
2021-01-22 16:18:59 +02:00
Botond Dénes
4eb65b12a0 mutation_reader: evictable_reader: remove next_partition() workaround
`next_partition()` now returns a future<>, so we can forward it to the
remote shard in the scope of the next partition call, remove the
now obsolete workaround for the synchronous next partition.
2021-01-22 16:18:30 +02:00
Botond Dénes
febd2feb4c mutation_reader: shard_reader: remove next_partition() workaround
`next_partition()` now returns a future<>, so we can forward it to the
remote shard in the scope of the next partition call, remove the
now obsolete workaround for the synchronous next partition.
2021-01-22 15:53:05 +02:00
Botond Dénes
9c96d74b72 mutation: remove now unused query() and query_compacted() 2021-01-22 15:36:37 +02:00
Botond Dénes
1a3ee71b39 treewide: use query_mutations() instead of mutation::query()
We want to retire the latter.
2021-01-22 15:36:37 +02:00
Botond Dénes
81da6b756f mutation_reader: foreign_reader: remove next_partition() workaround
`next_partition()` now returns a future<>, so we can forward it to the
remote shard in the scope of the next partition call, remove the
now obsolete workaround for the synchronous next partition.
2021-01-22 15:30:36 +02:00
Nadav Har'El
cb9e2ee00a cql-pytest: tests for fromJson() setting a map<ascii, int>
The fromJson() function can take a map JSON and use it to set a map column.
However, the specific example of a map<ascii, int> doesn't work in Scylla
(it does work in Cassandra). The xfailing tests in this patch demonstrate
this. Although the tests use perfectly legal ASCII, scylla fails the
fromJson() function, with a misleading error.

Refs #7949.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210121233855.100640-1-nyh@scylladb.com>
2021-01-22 14:29:25 +01:00
Botond Dénes
a9d726c7ba mutation_test: test_query_digest: ensure digest is produced consistently
Before we retire the mutation::query() code, expand the digest test to
check that the new code replacing it produces identical digest on all
possible equivalent mutations.
2021-01-22 15:27:48 +02:00
Botond Dénes
821ed96e0e mutation_query: introduce query_mutation()
This is a replacement of `mutation::query()`, but with an implementation
based on the standard query result building code.
This will allow us to migrate the remaining `mutation::query()` users
off of said method, which in turn will allow us to retire it finally.
2021-01-22 15:27:48 +02:00
Botond Dénes
c4f12221b8 mutation_query: to_data_query_result(): migrate to standard query code
Reimplement in terms of the standard query result building code. We want
to retire the alternative query result code in `mutation::query()` and
`to_data_query_result()` is one of the main users.
2021-01-22 15:27:48 +02:00
Botond Dénes
164582f33b mutation_query: move to_data_query_result() to mutation_partition.cc
We want to rewrite the above mentioned method's implementation in terms
of the standard query result building code (that of the `data_query()`
path), in order to retire the alternative query code in the mutation
class.
The `data_query()` code uses classes private to `mutation_partition.cc`
and instead of making these public, just move `to_data_query_result()`
to `mutation_partition.cc`.
2021-01-22 15:27:48 +02:00
Botond Dénes
d0c5f550a9 mutation: add consume()
This consume method accepts a `FlattenedConsumer`, the same one that the
name-sake `flat_mutation_reader::consume()` does. Indeed the main
purpose of this method is to allow using the standard query result
building stack with a mutation, the same way said stack is used with
mutation readers currently. This will allow us to replace the parallel
query result building code that currently exists in the
`mutation::query()` and friends, with the standard one.
2021-01-22 15:27:48 +02:00
Botond Dénes
9153f63135 flat_mutation_reader: move mutation consumer concepts to separate header
In the next patch we will want to use these concepts in `mutation.hh`. To
avoid pulling in the entire `flat_mutation_reader.hh` just for these,
and create a circular dependency in doing so, move them to a dedicated
header instead.
2021-01-22 15:27:48 +02:00
Botond Dénes
73808c12eb mutation compactor: query compaction: ignore purgeable tombstones
This behaviour is makes query result building sensitive to whether the
data was recently compacted or not, in particular different digests will
be produced depending on whether purgeable tombstones happened to be
compacted (and thus purged) or not. This means that two replicas can
produce different digests for the same data if has compacted some
purgeable tombstones and the other not.

To avoid this, drop purgeable tombstones during query compaction as
well.
2021-01-22 15:27:48 +02:00
Pavel Emelyanov
90d445464b compaction: Remove compaction_manager::enabled()
This method was marked with 'FIXME -- should not be public'
when it was introduced. Since then it has stopped being used
and can even be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210122083146.5886-1-xemul@scylladb.com>
2021-01-22 14:07:38 +02:00
Kamil Braun
570d15c7bc multishard_combining_reader: do not use smp::count
`multishard_combining_reader` currently only works under the assumption
that every table uses the same sharder configured using the node's number
of shards. But we could potentially specify a different sharder for a chosen table,
e.g. one that puts everything on shard 0.
Then this assumption will be broken and the reader causes a segfault.

Fixes #7945.
2021-01-21 18:28:18 +02:00
Nadav Har'El
328be1ca7c cql-pytest: tests for fromJson() not accepting empty string as integer
When writing to an integer column, Cassandra's fromJson() function allows
not just JSON number constants, it also allows a string containing a
number. Strings which do not hold a number fail with a FunctionFailure.
In particular, the empty string "" is an invalid number, and should fail.

The tests in this patch check this for two integer types: int and
varint.

Curiously, Cassandra and Scylla have opposite bugs here: Scylla fails
to recognize the error for varint, while Cassandra fails to recognize
the error for int. The tests in this patch reproduce these bugs.

The tests demonstrating Scylla's bug are marked xfail, and the tests
demonstrating Cassandra's bug is marked "cassandra_bug" (which means
it is marked xfail only when running against Cassandra, but expected
to succeed on Scylla.

Refs #7944.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210121133833.66075-1-nyh@scylladb.com>
2021-01-21 15:24:48 +01:00
Nadav Har'El
702b1b97bf cql: fix error return from execution of fromJson() and other functions
As reproduced in cql-pytest/test_json.py and reported in issue #7911,
failing fromJson() calls should return a FUNCTION_FAILURE error, but
currently produce a generic SERVER_ERROR, which can lead the client
to think the server experienced some unknown internal error and the
query can be retried on another server.

This patch adds a new cassandra_exception subclass that we were missing -
function_execution_exception - properly formats this error message (as
described in the CQL protocol documentation), and uses this exception
in two cases:

1. Parse errors in fromJson()'s parameters are converted into a
   function_execution_exception.

2. Any exceptions during the execute() of a native_scalar_function_for
   function is converted into a function_execution_exception.
   In particular, fromJson() uses a native_scalar_function_for.

   Note, however, that functions which already took care to produce
   a specific Cassandra error, this error is passed through and not
   converted to a function_execution_exception. An example is
   the blobAsText() which can return an invalid_request error, so
   it is left as such and not converted. This also happens in Cassandra.

All relevant tests in cql-pytest/test_json.py now pass, and are
no longer marked xfail. This patch also includes a few more improvements
to test_json.py.

Fixes #7911

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210118140114.4149997-1-nyh@scylladb.com>
2021-01-21 15:21:13 +01:00
Nadav Har'El
49440d67ad Merge: Fix multiple issues with timeuuid type
Merged patch series by Konstantin Osipov:

"These series improve uniqueness of generated timeuuids and change
list append/prepend logic to use client/LWT timestamp in timeuuids
generated for list keys. Timeuuid compare functions are
optimized.

The test coverage is extended for all of the above."

  uuid: add a comment warning against UUID::operator<
  uuid: replace slow versions of timeuiid compare with optimized/tested versions.
  test: add tests for legacy uuid compare & msb monotonicity
  test: add a test case for append/prepend limit
  test: add a test case for monotonicity of timeuuid least significant bits
  uuid: implement optimized timeuuid compare
  test: add a test case for list prepend/append with custom timestamp
  lists: rewrite list prepend to use append machinery
  lists: use query timestamp for list cell values during append
  uuid: fill in UUID node identifier part of UUID
  test: add a CQL test for list append/prepend operations
2021-01-21 13:20:07 +02:00
Konstantin Osipov
e18e2cb9f2 uuid: add a comment warning against UUID::operator< 2021-01-21 13:03:59 +03:00
Konstantin Osipov
845f6c667b uuid: replace slow versions of timeuiid compare with optimized/tested versions. 2021-01-21 13:03:59 +03:00
Konstantin Osipov
56d8d166cb test: add tests for legacy uuid compare & msb monotonicity 2021-01-21 13:03:59 +03:00
Konstantin Osipov
257c5b0879 test: add a test case for append/prepend limit 2021-01-21 13:03:59 +03:00
Konstantin Osipov
d6e65a3735 test: add a test case for monotonicity of timeuuid least significant bits
Ensure that timeuuid least significant bits are compared correctly.
2021-01-21 13:03:59 +03:00
Konstantin Osipov
0af3758aff uuid: implement optimized timeuuid compare
Introduce uint64_t based comparator for serialized timeuuids.

Respect Cassandra legacy for timeuuid compare order.

Scylla uses two versions of timeuuid compare:
- one for timeuuid values stored in uuid columns
- a different one for timeuuid values stored in timeuuid columns.

This commit re-implements the implementations of these comparators in
types.cc and deprecates the respective implementations types.cc. They
will be removed in a following patch.

A micro-benchmark at https://github.com/alecco/timeuuid-bench/
shows 2-4x speed up of the new comparators.
2021-01-21 13:03:59 +03:00
Konstantin Osipov
b4500a55c7 test: add a test case for list prepend/append with custom timestamp
Scylla now takes a custom timestamp into account when
executing list append/prepend operations. Test the new
semantics.
2021-01-21 13:03:59 +03:00
Konstantin Osipov
232ce6f611 lists: rewrite list prepend to use append machinery
Rewrite list prepend to use the same machinery
as append, and thus produce correct results when used in LWT.

After this patch, list prepend begins to honor user supplied timestamps.

If a user supplied timestamp for prepend is less than 2010-01-01 00:00:00
an exception is thrown.

Fixes #7611
2021-01-21 13:03:59 +03:00
Konstantin Osipov
2b8ce83eea lists: use query timestamp for list cell values during append
Scylla list cells are represented internally as a map of
timeuuid => value. To append a new value to a list
the coordinator generates a timeuuid reflecting the current time as key
and adds a value to the map using this key.

Before this patch, Scylla always generated a timeuuid for a new
value, even if the query had a user supplied or LWT timestamp.
This could break LWT linearizability. User supplied timestamps were
ignored.

This is reported as https://github.com/scylladb/scylla/issues/7611

A statement which appended multiple values to a list or a BATCH
generated an own microsecond-resolution timeuuid for each value:

BEGIN BATCH
  UPDATE ... SET a = a + [3]
  UPDATE ... SET a = a + [4]
APPLY BATCH

UPDATE ... SET a = a + [3, 4]

To fix the bug, it's necessary to preserve monotonicity of
timeuuids within a batch or multi-value append, but make sure
they all use the microsecond time, as is set by LWT or user.

To explain the fix, it's first necessary to recall the structure
of time-based UUIDs:

60 bits: time since start of GMT epoch, year 1582, represented
         in 100-nanosecond units
4 bits:  version
14 bits: clock sequence, a random number to avoid duplicates
         in case system clock is adjusted
2 bits:  type
48 bits: MAC address (or other hardware address)

The purpose of clockseq bits is as defined in
https://tools.ietf.org/html/rfc4122#section-4.1.5
is to reduce the probability of UUID collision in case clock
goes back in time or node id changes. The implementation should reset it
whenever one of these events may occur.

Since LWT microsecond time is guaranteed to be
unique by Paxos, the RFC provisioning for clockseq and MAC
slots becomes excessive.

The fix thus changes timeuuid slot content in the following way:
- time component now contains the same microsecond time for all
  values of a statement or a batch. The time is unique and monotonic in
  case of LWT. Otherwise it's most always monotonic, but may not be
  unique if two timestamps are created on different coordinators.
- clockseq component is used to store a sequence number which is
  unique and monotonic for all values within the statement/batch.
- to protect against time back-adjustments and duplicates
  if time is auto-generated, MAC component contains a random (spoof)
  MAC address, re-created on each restart. The address is different
  at each shard.

The change is made for all sources of time: user, generated, LWT.
Conditioning the list key generation algorithm on the source of
time would unnecessarily complicate the code while not increase
quality (uniqueness) of created list keys.

Since 14 bits of clockseq provide us with only 16383 distinct slots
per statement or batch, 3 extra bits in nanosecond part of the time
are used to extend the range to 131071 values per statement/batch.
If the rang is exceeded beyond the limit, an exception is produced.

A twist on the use of clockseq to extend timeuuid uniqueness is
that Scylla, like Cassandra, uses int8 compare to compare lower
bits of timeuuid for ordering. The patch takes this into account
and sign-complements the clockseq value to make it monotonic
according to the legacy compare function.

Fixes #7611

test: unit (dev)
2021-01-21 13:03:59 +03:00
Konstantin Osipov
6d1781be36 uuid: fill in UUID node identifier part of UUID
Before this patch, UUID generation code was not creating
sufficiently unique IDs: the 6 byte node identifier was mostly
empty, i.e. only containing shard id. This could lead to
collisions between queries executed concurrently at different
coordinators, and, since timeuuid is used as key in list append
and prepend operations, lead to lost updates.

To generate a unique node id, the patch uses a combination of
hardware MAC address (or a random number if no hardware address is
available) and the current shard id.

The shard id is mixed into higher bits of MAC, to reduce the
chances on NIC collision within the same network.

With sufficiently unique timeuuids as list cell keys, such updates
are no longer lost, but multi-value update can still be "merged"
with another multi-value update.

E.g. if node A executes SET l = l + [4, 5] and node B executes SET
l  = l + [6, 7], the list value could be any of [4, 5, 6, 7], [4,
6, 5, 7], [6, 4, 5, 7] and so on.

At least we are now less likely to get any value lost.

Fixes #6208.

@todo: initialize UUID subsystem explicitly in main()
and switch to using seastar::engine().net().network_interfaces()

test: unit (dev)
2021-01-21 13:03:53 +03:00
Avi Kivity
4cfaab208e allocation_strategy: set preferred max contiguous allocation to 128k for standard allocations
Now that managed_bytes and its users do not assume that a managed_bytes
instance allocated using standard_allocation_strategy is non-fragmented,
we can set the preferred max contiguous allocation to 128k. This causes
managed_bytes to fragment instances that are larger than this size.

Note that managed_bytes is the only user.

Closes #7943
2021-01-21 11:15:13 +02:00
Tomasz Grabiec
f08a3e3fd8 Merge "raft: test fixes, etcd tests, simplification" from Alejo
This patch set adds etcd unit tests for raft.

It also includes a fix for replication test in debug mode and a
simplification for append_request.

Tests: unit ({dev}), unit ({debug}), unit ({release})

*  https://github.com/alecco/scylla/tree/raft-ale-tests-09b:
  raft: etcd unit tests: test log replication
  raft: boost test etcd: test fsm can vote from any state
  raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
  raft: replication test: add etcd test for cycling leaders
  raft: testing: provide primitives to wait for log propagation
  raft: etcd unit tests: initial boost tests
  raft: combine append_request _receive and _send
2021-01-21 10:41:33 +02:00
Pekka Enberg
7d98e05923 Update tools/python3 submodule
* tools/python3 1763a1a...c579207 (1):
  > dist/debian: handle rc version correctly
2021-01-21 10:41:33 +02:00
Avi Kivity
daa0e964fc dbuild: avoid --pids-limit with podman and cgroupsv1
Podman doesn't correctly support --pids-limit with cgroupsv1. Some
versions ignore it, and some versions reject the option.

To avoid the error, don't supply --pids-limit if cgroupsv2 is not
available (detected by its presence in /proc/filesystems). The user
is required to configure the pids limit in
/etc/containers/containers.conf.

Fixes #7938.

Closes #7939
2021-01-21 10:41:33 +02:00
Botond Dénes
4d581f1bb3 docs/README.md: guides: also mention running and debugging
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210120083304.36447-1-bdenes@scylladb.com>
2021-01-20 16:07:29 +02:00
Avi Kivity
f11a0700a8 Merge "mutation_writer: explicitly close writers" from Benny
"
_consumer_fut is expected to return an exception
on the abort path.  Wait for it and drop any exception
so it won't be abandoned as seen in #7904.

A future<> close() method was added to return
_consumer_fut.  It is called both after abort()
in the error path, and after consume_end_of_stream,
on the success path.

With that, consume_end_of_stream was made void
as it doesn't return a future<> anymore.

Fixes #7904
Test: unit(release)
"

* tag 'close-bucket-writer-v5' of github.com:bhalevy/scylla:
  mutation_writer: bucket_writer: add close
  mutation_writer/feed_writers: refactor bucket/shard writers
  mutation_writer: update bucket/shard writers consume_end_of_stream
2021-01-20 16:07:29 +02:00
Pekka Enberg
6cc981d089 scylla: Add "--build-mode" command line option
This adds a "--build-mode" command line option to "scylla" executable:

$ ./build/dev/scylla --build-mode
dev

This allows you to discover the build mode of a "scylla" executable
without resorting to "readelf", for example, to verify that you are
looking at the correct executable while debugging packaging issues.

Closes #7865
2021-01-20 16:07:29 +02:00
Botond Dénes
7eb8c71342 tools/scylla-types: add link to cql3-type-mapping.md
Just like scylla-sstable-index, scylla-types accepts types in (short)
cassandra class name notation. The mapping from the clq3 type names to
the class names is not straight-forward in all cases, so provide a link
to a table which lists the cassandra class name of all supported types
(and more).

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210120083816.37774-2-bdenes@scylladb.com>
2021-01-20 10:50:33 +02:00
Botond Dénes
882ade7c6a types/scylla-sstable-index: update URL to cql3-type-mapping.md
Said document was recently moved but the URL was not updated.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210120083816.37774-1-bdenes@scylladb.com>
2021-01-20 10:50:33 +02:00
Avi Kivity
114da51d73 Revert "commitlog: fix size of a write used to zero a segment"
This reverts commit df2f67626b. The fix
is correct, but has an unfortunate side effect with O_DSYNC: each
128k write also needs to flush the XFS log. This translates to
32MB/128k = 256 flushes, compared to one flush with the original code.

A better fix would be to prezero without O_DSYNC, then reopen the file
with O_DSYNC, but we can do that later.

Reopens #5857.
2021-01-20 10:23:43 +02:00
Avi Kivity
586f16bf79 Merge "Cut snitch -> storage service dependency" from Pavel E
"
Currently storage service and snitch implicitly depend on each
other. Storage service gossips snitch data on start, snitch
kicks the storage service when its configuration changes.

This interdependency is relaxed:

- snitch gossips all its state itself without using the
  storage service as a mediator
- storage service listens for snitch updates with the help
  of self-breaking subscription

Both changes make snitch independent from storage service,
remove yet another call for global storage service from the
codebase and make the storage service -> snitch reference
robust against dagling pointers/references

tests: unit(dev), dtest.rebuild.TestRebuild.simple_rebuild(dev)
"

* 'br-snitch-gossip-2' of https://github.com/xemul/scylla:
  storage-service: Subscribe to snitch to update topology
  snitch: Introduce reconfiguration signal
  snitch: Always gossip snitch info itself
  snitch: Do gossip DC and RACK itself
  snitch: Add generic gossiping helper
2021-01-20 10:23:43 +02:00
Pavel Solodovnikov
041072b59f raft: rename storage to persistence
The new naming scheme more clearly communicates to the client of
the raft library that the `persistence` interface implements
persistency layer of the fsm that is powering the raft
protocol itself rather than the client-side workflow and
user-provided `state_machine`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>
2021-01-20 10:23:43 +02:00
Gleb Natapov
248449816b raft: fix snapshot transfer with existing log prefix
Current code that checks when snapshot has to be transferred does not
take in account the case where there can be log entries preceding the
snapshot. Fix the code to correctly test for snapshot transfer
condition.

Message-Id: <20210117095801.GB733394@scylladb.com>
2021-01-20 10:23:43 +02:00
Gleb Natapov
1ab262e86b raft: test: change replication_test to submit one entry at a time
replication_test's state machine is not commutative, so if commands are
applied in different order the states will be different as well. Since
the preemption check was added into co_await in seastar even waiting for
a ready future can preempt which will cause reordering of simultaneously
submitted entries in debug mode. For a long time we tried to keep entries
submission parallel in the test, but with the above seastar change it
is no longer possible to maintain it without changing the state machine
to be commutative. The patch changes the test to submit entries one by
one.

Message-Id: <20210117095147.GA733394@scylladb.com>
2021-01-20 10:23:43 +02:00
Benny Halevy
f29732573a mutation_writer: bucket_writer: add close
bucket_writer::close waits for the _consumer_fut.
It is called both after consume_end_of_stream()
and after abort().

_consumer_fut is expected to return an exception
on the abort path.  Wait for it and drop any exception
so it won't be abandoned as seen in #7904.

With that moved to close() time, consume_end_of_stream
doesn't need to return a future and is made void
all the way in the stack.  This is ok since
queue_reader_handle::push_end_of_stream is synchronous too.

Added a unit test that aborts the reader consumer
during `segregate_by_timestamp`, reproducing the
Exceptional future ignored issue without the fix.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-19 19:03:58 +02:00
Benny Halevy
fc3f9a57ff mutation_writer/feed_writers: refactor bucket/shard writers
Consolidate shard_based_splitting_writer::shard_writer
and timestamp_based_splitting_writer::bucket_writer
common code into mutation_writer::bucket_writer.

This provides a common place to handle consume_end_of_stream()
and abort(), and in particular the handling of the underlying
_conmsumer_fut.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-19 18:48:01 +02:00
Benny Halevy
a9d91a2d09 mutation_writer: update bucket/shard writers consume_end_of_stream
After 61520a33d6
feed_writers doesn't call consume_end_of_stream
after abort() so no need to test
            if (!_handle.is_terminated()) {

and consume_end_of_stream is now called in then_wrapped
rather than `finally` so it's ok if it throws.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-19 18:44:40 +02:00
Kamil Braun
1a8630e6a7 transport: silence "broken pipe" and "connection reset by peer" errors
The code would already silence broken pipe exceptions since it's
expected when the other side closes the connection or when we shutdown the
socket during Scylla shutdown, but the code wouldn't handle the following:
1. "Connection reset by peer" errors: these can also happen in the
   aforementioned two scenarios; the conditions that determine which of
   the two types of errors occur are unclear.
2. The scenarios would sometimes result in a `seastar::nested_exception`,
   mainly during shutdown. The errors could happen once when trying to send
   a response to a request (`_write_buf.write(...)/flush(...)`) and then
   again when trying to close the connection in a `finally` block. These
   nested exceptions were not silenced.

The commit handles each of these cases.
Closes #7907.

Closes #7931
2021-01-19 10:30:17 +02:00
Tomasz Grabiec
94749b01eb Merge "futurize flat_mutation_reader::next_partition" from Benny
The main motivation for this patchset is to prepare
for adding a async close() method to flat_mutation_reader.

In order to close the reader before destroying it
in all paths we need to make next_partition asynchronous
so it can asynchronously close a current reader before
destoring it, e.g. by reassignment of flat_mutation_reader_opt,
as done in scanning_reader::next_partition.

Test: unit(release, debug)

* git@github.com:bhalevy/scylla.git futurize-next-partition-v1:
  flat_mutation_reader: return future from next_partition
  multishard_mutation_query: read_context: save_reader: destroy reader_meta from the calling shard
  mutation_reader: filtering_reader: fill_buffer: futurize inner loop
  flat_mutation_reader::impl: consumer_adapter: futurize handle_result
  flat_mutation_reader: consume_pausable/in_thread: futurize_invoke consumer
  flat_mutation_reader: FlatMutationReaderConsumer: support also async consumer
  flat_mutation_reader:impl: get rid of _consume_done member
2021-01-19 10:19:03 +02:00
Alejo Sanchez
8a61e7defc raft: etcd unit tests: test log replication
etcd TestLogReplication

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
417b18aaad raft: boost test etcd: test fsm can vote from any state
etcd TestVoteFromAnyState

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
5a75c0e06a raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
Log truncation of follower when node re-gains leadership.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
f14c44c686 raft: replication test: add etcd test for cycling leaders
This test cycles 3 nodes as leaders without adding entries.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
f627972186 raft: testing: provide primitives to wait for log propagation
For tests to be able to transition in a consistent state, in some cases
it's needed to allow the followers to catch up with the leader.

This prevents occasional hangs in debug mode for incoming tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
948ae813e4 raft: etcd unit tests: initial boost tests
First batch of ported etcd raft unit tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:12 -04:00
Gleb Natapov
6d47a535b9 raft: combine append_request _receive and _send
Combine structs for append request send and receive into a single
struct.

Author:    Gleb Natapov <gleb@scylladb.com>
Date:      Mon Nov 23 14:33:14 2020 +0200
2021-01-18 12:24:13 -04:00
Konstantin Osipov
bf1a031bd6 test: add a CQL test for list append/prepend operations
Test single- and multi- value list append, prepend,
append and prepend in a batch, conditional statements.

This covers the parts of Cassandra which are working as documented
and which we intend to preserve compatibility with.
2021-01-18 17:32:00 +03:00
Jenkins
faf71c6f75 release: prepare for 4.5.dev 2021-01-18 16:05:25 +02:00
Avi Kivity
df3ef800c2 Merge 'Introduce load and stream feature' from Asias He
storage_service: Introduce load_and_stream

=== Introduction ===

This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.

From example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.

This can make restores and migrations much easier.

=== Performance ===

I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB

Assume 1TB sstables per node, each shard can do 40MB/s, each node has 32
shards, we can finish the load and stream 1TB of data in 13 mins on each
node.

1TB / 40 MB per shard * 32 shard / 60 s = 13 mins

=== Tests ===

backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then use
load_and_stream to restore to a 2 nodes cluster.

=== Usage ===

curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true

=== Notes ===

Btw, with the old nodetool refresh, the node will not pick up the data
that does not belong to this node but it will not delete it either. One
has to run nodetool cleanup to remove those data manually which is a
surprise to me and probably to users as well. With load and stream, the
process will delete the sstables once it finishes stream, so no nodetool
cleanup is needed.

The name of this feature load and stream follows load and store in CPU world.

Fixes #7831

Closes #7846

* github.com:scylladb/scylla:
  storage_service: Introduce load_and_stream
  distributed_loader: Add get_sstables_from_upload_dir
  table: Add make_streaming_reader for given sstables set
2021-01-18 15:08:19 +02:00
Avi Kivity
60f5ec3644 Merge 'managed_bytes: switch to explicit linearization' from Michał Chojnowski
This is a revival of #7490.

Quoting #7490:

The managed_bytes class now uses implicit linearization: outside LSA, data is never fragmented, and within LSA, data is linearized on-demand, as long as the code is running within with_linearized_managed_bytes() scope.

We would like to stop linearizing managed_bytes and keep it fragmented at all times, since linearization can require large contiguous chunks. Large contiguous allocations are hard to satisfy and cause latency spikes.

As a first step towards that, we remove all implicitly linearizing accessors and replace them with an explicit linearization accessor, with_linearized().

Some of the linearization happens long before use, by creating a bytes_view of the managed_bytes object and passing it onwards, perhaps storing it for later use. This does not work with with_linearized(), which creates a temporary linearized view, and does not work towards the longer term goal of never linearizing. As a substitute a managed_bytes_view class is introduced that acts as a view for managed_bytes (for interoperability it can also be a view for bytes and is compatible with bytes_view).

By the end of the series, all linearizations are temporary, within the scope of a with_linearized() call and can be converted to fragmented consumption of the data at leisure.

This has limited practical value directly, as current uses of managed_bytes are limited to keys (which are limited to 64k). However, it enables converting the atomic_cell layer back to managed_bytes (so we can remove IMR) and the CQL layer to managed_bytes/managed_bytes_view, removing contiguous allocations from the coordinator.

Closes #7820

* github.com:scylladb/scylla:
  test: add hashers_test
  memtable: fix accounting of managed_bytes in partition_snapshot_accounter
  test: add managed_bytes_test
  utils: fragment_range: add a fragment iterator for FragmentedView
  keys: update comments after changes and remove an unused method
  mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator
  row_cache: more indentation fixes
  utils: remove unused linearization facilities in `managed_bytes` class
  misc: fix indentation
  treewide: remove remaining `with_linearized_managed_bytes` uses
  memtable, row_cache: remove `with_linearized_managed_bytes` uses
  utils: managed_bytes: remove linearizing accessors
  keys, compound: switch from bytes_view to managed_bytes_view
  sstables: writer: add write_* helpers for managed_bytes_view
  compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view
  types: change equal() to accept managed_bytes_view
  types: add parallel interfaces for managed_bytes_view
  types: add to_managed_bytes(const sstring&)
  serializer_impl: handle managed_bytes without linearizing
  utils: managed_bytes: add managed_bytes_view::operator[]
  utils: managed_bytes: introduce managed_bytes_view
  utils: fragment_range: add serialization helpers for FragmentedMutableView
  bytes: implement std::hash using appending_hash
  utils: mutable_view: add substr()
  utils: fragment_range: add compare_unsigned
  utils: managed_bytes: make the constructors from bytes and bytes_view explicit
  utils: managed_bytes: introduce with_linearized()
  utils: managed_bytes: constrain with_linearized_managed_bytes()
  utils: managed_bytes: avoid internal uses of managed_bytes::data()
  utils: managed_bytes: extract do_linearize_pure()
  thrift: do not depend on implicit conversion of keys to bytes_view
  clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view
  cql3: expression: linearize get_value_from_mutation() eariler
  bytes: add to_bytes(bytes)
  cql3: expression: mark do_get_value() as static
2021-01-18 11:01:28 +02:00
Asias He
4d32d03172 storage_service: Introduce load_and_stream
=== Introduction ===

This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.

From example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.

This can make restores and migrations much easier.

=== Performance ===

I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB

Assume 1TB sstables per node, each shard can do 40MB/s, each node has 32
shards, we can finish the load and stream 1TB of data in 13 mins on each
node.

1TB / 40 MB per shard * 32 shard / 60 s = 13 mins

=== Tests ===

backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then use
load_and_stream to restore to a 2 nodes cluster.

=== Usage ===

curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true

=== Notes ===

Btw, with the old nodetool refresh, the node will not pick up the data
that does not belong to this node but it will not delete it either. One
has to run nodetool cleanup to remove those data manually which is a
surprise to me and probably to users as well. With load and stream, the
process will delete the sstables once it finishes stream, so no nodetool
cleanup is needed.

The name of this feature load and stream follows load and store in CPU world.

Fixes #7831
2021-01-18 16:32:33 +08:00
Avi Kivity
ab44464911 Revert "docker: remove sshd from the image"
This reverts commit 32fd38f349. Some
tests (in scylla-cluster-tests) depend on it.
2021-01-17 14:34:40 +02:00
Raphael S. Carvalho
00c29e1e24 table: Move notify_bootstrap_or_replace_*() out of line
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210117045747.69891-9-raphaelsc@scylladb.com>
2021-01-17 10:36:13 +02:00
Asias He
28007f13f8 distributed_loader: Add get_sstables_from_upload_dir
This function scans sstables under the upload directory and return a list of
sstables for each shard.

Refs #7831
2021-01-16 20:03:17 +08:00
Michał Chojnowski
5b72fb65ae test: add hashers_test
This test is a sanity check. It verifies that our wrappers over well known
hashes (xxhash, md5, sha256) actually calculate exactly those hashes.
It also checks that the `update()` methods of used hashers are linear with
respect to concatenation: that is, `update(a + b)` must be equivalent to
`update(a); update(b)`. This wasn't relied on before, but now we need to
confirm that hashing fragmented keys without linearizing them won't break
backward compatibility.
2021-01-15 18:28:24 +01:00
Michał Chojnowski
85048b349b memtable: fix accounting of managed_bytes in partition_snapshot_accounter
managed_bytes has a small overhead per each fragment. Due to that, managed_bytes
containing the same data can have different total memory usage in different
allocators. The smaller the preferred max allocation size setting is, the more
fragments are needed and the greater total per-fragment overhead is.
In particular, managed_bytes allocated in the LSA could grow in
memory usage when copied to the standard allocator, if the standard allocator
had a preferred max allocation setting smaller than the LSA.

partition_snapshot_accounter calculates the amount of memory used by
mutation fragments in the memtable (where they are allocated with LSA) based
on the memory usage after they are copied to the standard allocator.
This could result in an overestimation, as explained above.
But partition_snapshot_accounter must not overestimate the amount of freed
memory, as doing otherwise might result in OOM situations.

This patch prevents the overaccounting by adding minimal_external_memory_usage():
a new version of external_memory_usage(), which ignores allocator-dependent
overhead. In particular, it includes the per-fragment overhead in managed_bytes
only once, no matter how many fragments there are.
2021-01-15 18:21:13 +01:00
Michał Chojnowski
d31771c0b2 test: add managed_bytes_test 2021-01-15 18:21:13 +01:00
Michał Chojnowski
72ecbd6936 utils: fragment_range: add a fragment iterator for FragmentedView
A stylistic change. Iterators are the idiomatic way to iterate in C++.
2021-01-15 14:05:44 +01:00
Michał Chojnowski
2e38647a95 keys: update comments after changes and remove an unused method
The comments were outdated after the latest changes (bytes_view vs
managed_bytes_view).
compound_view_wrapper::get_component() is unused, so we remove it.
2021-01-15 14:05:44 +01:00
Piotr Sarna
6ae94d31c1 treewide: remove shared pointer usage from the pager
The pager interface doesn't really need to be virtual,
so the next step could be to remove the need for pointers
entirely, but migrating from shared_ptr to unique_ptr
is a low-hanging fruit.

Message-Id: <a5bdecb17ae58e914da020fb58a41f4574565c66.1610709560.git.sarna@scylladb.com>
2021-01-15 15:03:14 +02:00
Avi Kivity
f20736d93d Merge 'Support unofficial distributions' from Takuya ASADA
Since we introduced relocatable package and offline installer, scylla binary itself can run almost any distributions.
However, setup scripts are not designed to run in unsupported distributions, it causes error on such environment.
This PR adds minimal support to run offline installation on unsupported distributions, tested on SLES, Arch Linux and Gentoo.

Closes #7858

* github.com:scylladb/scylla:
  dist: use sysconfig_parser to parse gentoo config file
  dist: add package name translation
  dist: support SLES/OpenSUSE
  install.sh: add systemd existance check
  install.sh: ignore error missing sysctl entries
  dist: show warning on unsupported distributions
  dist: drop Ubuntu 14.04 code
  dist: move back is_amzn2() to scylla_util.py
  dist: rename is_gentoo_variant() to is_gentoo()
  dist: support Arch Linux
  dist: make sysconfig directory detectable
2021-01-14 16:59:49 +02:00
Raphael S. Carvalho
97e076365e Fix stalls on Memtable flush by preempting across fragment generation if needed
Flush is facing stalls because partition_snapshot_flat_reader::fill_buffer()
generates mutation fragment until buffer is full[1] without yielding.

this is the code path:
    flush_reader::fill_buffer()      <---------|
        flat_mutation_reader::consume_pausable()      <--------|
                partition_snapshot_flat_reader::fill_buffer() -|

[1]: https://github.com/scylladb/scylla/blob/6cfc949e/partition_snapshot_reader.hh#L261

This is fixed by breaking the loop in do_fill_buffer() if preemption is needed,
allowing do_until() to yield in sequence, and when it resumes, continue from
where it left off, until buffer is full.

Fixes #7885.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210114141417.285175-1-raphaelsc@scylladb.com>
2021-01-14 16:30:55 +02:00
Ivan Prisyazhnyy
32fd38f349 docker: remove sshd from the image
implicit revert of 6322293263

sshd previosly was used by the scylla manager 1.0.
new version does not need it. there is no point of
having it currently. it also confuses everyone.

Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>

Closes #7921
2021-01-14 12:52:24 +02:00
Pavel Emelyanov
2b31be0daa client-state,cdc: Remove call for storage_service from permissions check
The client_state::check_access() calls for global storage service
to get the features from it and check if the CDC feature is on.
The latter is needed to perform CDC-specific checks.

However it was noticed, that the check for the feature is excessive
as all the guarded if-s will resolve to false in case CDC is off
and the check_access will effectively work as it would with the
feature check.

With that observation, it's possible to ditch one more global storage
service reference.

tests: unit(dev), dtest(dev, auth)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210105063651.7081-1-xemul@scylladb.com>
2021-01-14 12:52:24 +02:00
Benny Halevy
29002e3b48 flat_mutation_reader: return future from next_partition
To allow it to asynchronously close underlying readers
on next_partition().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
ff931c2ecc multishard_mutation_query: read_context: save_reader: destroy reader_meta from the calling shard
The reader_meta in _readers[shard] is created on shard 0 and must
be destroyed on it as well.

A following patch changes next_partition() to return a future<>
thus it introduces a continuation that requires access to `rm`.

We cannot move it down to the conuation safely, since it will be
wrongly destroyed in the invoked shard, so use do_with to hold it
in the scope of the calling shard until the invoked function
completes.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
75c0c05f71 mutation_reader: filtering_reader: fill_buffer: futurize inner loop
Prepare for futurizing next_partition().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
cd4d082e51 flat_mutation_reader::impl: consumer_adapter: futurize handle_result
Prepare for futurizing next_partition.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
d8ae6d7591 flat_mutation_reader: consume_pausable/in_thread: futurize_invoke consumer
To support both sync and async consumers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
fdb3c59e35 flat_mutation_reader: FlatMutationReaderConsumer: support also async consumer
So that consumer_adapter and other consumers in the future
may return a future from consumer().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
515bed90bb flat_mutation_reader:impl: get rid of _consume_done member
It is only used in consume_pausable, that can easily do
without it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Pavel Emelyanov
d3ee8774ad storage-service: Subscribe to snitch to update topology
Currently snitch explicitly calls storage service (if
it's initialized) to update topology on snitch data
change.

Instead of it -- make storage service subscribe on the
snitch reconfigure signal upon creation.

This finally makes snitch fully independent from storage
service.

In tests the snitch instance is not created, so check
for it before subscribing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
d1a2d0f894 snitch: Introduce reconfiguration signal
Add a notifier to snitch_base that gets triggered when the
snitch configuration changes to which others may subscribe.

For now only the gossiping-file-snitch triggers it when it
re-reads its config file. Other existing snitches are kinda
static in this sense.

The subscribe-trigger engine is based on scoped connection
from boost::signals2.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
ca336409d7 snitch: Always gossip snitch info itself
The gossiping_property_file_snitch updates the gossip RACK and DC
values upon config change. Right now this is done with the help
of storage service, but the needed code to gossip rack and dc is
already available in the snitch itself.

Said that -- gossip snitch info by snitch helper and remove the
storage_service's one. This makes the 2nd step decoupling snitch
and storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
99e71bd1f6 snitch: Do gossip DC and RACK itself
This is the 2nd step in generalizing the snitch data gossiping
and at the same the 1st step in decoupling storage service and
snitch.

During start storage service starts gossiper, which notifies the
snicth with .gossiper_starting() call, then the storage service
calls gossip_snitch_info.

This patch makes snitch itself do the last step.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
bc1a3a358d snitch: Add generic gossiping helper
Nowadays some snitch implementations gossip the INTERNAL_IP
value and storage_service gossip RACK and DC for all of them.

This functionality is going to be generalized and the first
step is in making a common method for a snitch to gossip its
data.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Takuya ASADA
7a74f8cd2e dist: use sysconfig_parser to parse gentoo config file
Use sysconfig_parser instead of regex, to improve code readability.
2021-01-13 21:34:23 +09:00
Takuya ASADA
2a4d293841 dist: add package name translation
Translate package name from CentOS package to different distribution
package name, to use single package name for pkg_install().
2021-01-13 21:27:14 +09:00
Takuya ASADA
0a9843842d dist: support SLES/OpenSUSE
Add support SLES/OpenSUSE on setup script.
2021-01-13 19:32:46 +09:00
Takuya ASADA
a34edf8169 install.sh: add systemd existance check
offline installer can run in non-systemd distributions, but it won't
work since we only have systemd units.
So check systemd existance and print error message.
2021-01-13 19:32:45 +09:00
Takuya ASADA
b8c35772b3 install.sh: ignore error missing sysctl entries
On some kernel may not have specified sysctl parameter, so we should
ignore the error.
2021-01-13 19:32:45 +09:00
Takuya ASADA
e8f74e800c dist: show warning on unsupported distributions
Add warning message on unsupported distributions, for
scylla_cpuscaling_setup and scylla_ntp_setup.
2021-01-13 19:32:45 +09:00
Takuya ASADA
2f344cf50d dist: drop Ubuntu 14.04 code
We don't support Ubuntu 14.04 anymore, drop them
2021-01-13 19:32:45 +09:00
Takuya ASADA
8e59f70080 dist: move back is_amzn2() to scylla_util.py
Distribution detection functions should be placed same place,
so move back it to scylla_util.py
2021-01-13 19:32:45 +09:00
Takuya ASADA
921b1676c0 dist: rename is_gentoo_variant() to is_gentoo()
is_redhat_variant() is the function to detect RHEL/CentOS/Fedora/OEL,
and is_debian_variant() is the function to detect Debian/Ubuntu.
Unlike these functions, is_gentoo_variant() does not detect "Gentoo variants",
we should rename it to is_gentoo().
2021-01-13 19:32:45 +09:00
Takuya ASADA
fffa8f5ded dist: support Arch Linux
Add support Arch Linux on setup script.
2021-01-13 19:32:45 +09:00
Takuya ASADA
0d11f9463d dist: make sysconfig directory detectable
Currently, install.sh provide a way to customize sysconfig directory,
but sysconfig directory is hardcoded on script.
Also, /etc/sysconfig seems correct to use default value, but current
code specify /etc/default as non-redhat distributions.

Instead of hardcoding, generate generate python script in install.sh
to save specified sysconfig directory path in python code.
2021-01-13 19:32:45 +09:00
Wojciech Mitros
93613e20a3 api: remove potential large allocation in /column_family/ GET request handler
The reply to a /column_family/ GET request contains info about all
column families. Currently, all this info is stored in a single
string when replying, and this string may require a big allocation
when there are many column families.
To avoid that allocation, instead of a single string, use a
body_writer function, which writes chunks of the message content
to the output stream.

Fixes #7916

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #7917
2021-01-13 12:04:18 +02:00
Avi Kivity
ed53b3347e Merge 'idl: remove the large allocation in mutation_partition_view::rows()' from Wojciech Mitros
After these changes the generated code deserializes the stream into a chunked vector, instead of an contiguous one, so even if there are many fields in it, there won't be any big allocations.

I haven't run the scylla cluster test with it yet but it passes the unit tests.

Closes #7919

* github.com:scylladb/scylla:
  idl: change the type of mutation_partition_view::rows() to a chunked_vector
  idl-compiler: allow fields of type utils::chunked_vector
2021-01-13 11:07:29 +02:00
Nadav Har'El
711b311d47 cql-pytest: tests for fromJson() integer overflow
Numbers in JSON are not limited in range, so when the fromJson() function
converts a number to a limited-range integer column in Scylla, this
conversion can overflow. The following tests check that this conversion
should result in an error (FunctionFailure), not silent trunction.

Scylla today does silently wrap around the number, so these tests
xfail. They pass on Cassandra.

Refs #7914.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112151041.3940361-1-nyh@scylladb.com>
2021-01-13 11:07:29 +02:00
Nadav Har'El
617e1be1b6 cql-pytest: expand tests for fromJson() failures
This patch adds more (failing) tests for issue #7911, where fromJson()
failures should be reported as a clean FunctionFailure error, not an
internal server error.

The previous tests we had were about JSON parse failures, but a
different type of error we should support is valid JSON which returned
the wrong type - e.g., the JSON returning a string when an integer
was expected, or the JSON returning a string with non-ASCII characters
when ASCII was expected. So this patch adds more such tests. All of
them xfail on Scylla, and pass on Cassandra.

Refs #7911.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112122211.3932201-1-nyh@scylladb.com>
2021-01-13 11:07:29 +02:00
Nadav Har'El
2ebe8055ee cql-pytest: add test for fromJson() null parameter.
This patch adds a reproducer test for issue #7912, which is about passing
a null parameter to the fromJson() function supposed to be legal (and
return a null value), and is legal in Cassandra, but isn't allowed in
Scylla.

There are two tests - for a prepared and unprepared statement - which
fail in different ways. The issue is still open so the tests xfail on
Scylla - and pass on Cassandra.

Refs #7912.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112114254.3927671-1-nyh@scylladb.com>
2021-01-13 11:07:29 +02:00
dgarcia360
78e9f45214 docs: update url
Related issue scylladb/sphinx-scylladb-theme#88

Once this commit is merged, the docs will be published under the new domain name https://scylla.docs.scylladb.com

Frequently asked questions:

    Should we change the links in the README/docs folder?

GitHub automatically handles the redirections. For example, https://scylladb.github.io/sphinx-scylladb-theme/stable/examples/index.html redirects to https://sphinx-theme.scylladb.com/stable/examples/index.html
Nevertheless, it would be great to change URLs progressively to avoid the 301 redirections.

    Do I need to add this new domain in the custom dns domain section on GitHub settings?

It is not necessary. We have already edited the DNS for this domain and the theme creates programmatically the required CNAME file. If everything goes well, GitHub should detect the new URL after this PR is merged.

    The DNS doesn't seem to have the right SSL certificates

GitHub handles the certificate provisioning but is not aware of the subdomain for this repo yet. make multi-version will create a new file "CNAME". This is published in gh-pages branch, therefore GitHub should create the missing cert.

Closes #7877
2021-01-13 11:07:29 +02:00
Avi Kivity
d508a63d4b row_cache: linearize key in cache_entry::do_read()
do_read() does not linearize cache_entry::_key; this can cause a crash
with keys larger than 13k.

Fixes #7897.

Closes #7898
2021-01-13 11:07:29 +02:00
dgarcia360
36f8d35812 docs: added multiversion_regex_builder
Fixed makefile

Added path

Closes #7876
2021-01-13 11:07:29 +02:00
Benny Halevy
5e41228fe8 test: everywhere: use seastar::testing::local_random_engine
Use the thread_local seastar::testing::local_random_engine
in all seastar tests so they can be reproduced using
the --random-seed option.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210112103713.578301-2-bhalevy@scylladb.com>
2021-01-13 11:07:29 +02:00
Benny Halevy
43ab094c88 configure: add utf8_test to pure_boost_tests
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210112103713.578301-1-bhalevy@scylladb.com>
2021-01-13 11:07:29 +02:00
Dejan Mircevski
d79c2cab63 cql3: Use correct comparator in timeuuid min/max
The min/max aggregators use aggregate_type_for comparators, and the
aggregate_type_for<timeuuid> is regular uuid.  But that yields wrong
results; timeuuids should be compared as timestamps.

Fix it by changing aggregate_type_for<timeuuid> from uuid to timeuuid,
so aggregators can distinguish betwen the two.  Then specialize the
aggregation utilities for timeuuid.

Add a cql-pytest and change some unit tests, which relied on naive
uuid comparators.

Fixes #7729.

Tests: unit (dev, debug)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #7910
2021-01-13 11:07:29 +02:00
Avi Kivity
96d64b7a1f Merge "Wire interposer consumer for memtable flush" from Raphael
"
Without interposer consumer on flush, it could happen that a new sstable,
produced by memtable flush, will not conform to the strategy invariant.
For example, with TWCS, this new sstable could span multiple time windows,
making it hard for the strategy to purge expired data. If interposer is
enabled, the data will be correctly segregated into different sstables,
each one spanning a single window.

Fixes #4617.

tests:
    - mode(dev).
    - manually tested it by forcing a flush of memtable spanning many windows
"

* 'segregation_on_flush_v2' of github.com:raphaelsc/scylla:
  test: Add test for TWCS interposer on memtable flush
  table: Wire interposer consumer for memtable flush
  table: Add write_memtable_to_sstable variant which accepts flat_mutation_reader
  table: Allow sstable write permit to be shared across monitors
  memtable: Track min timestamp
  table: Extend cache update to operate a memtable split into multiple sstables
2021-01-13 11:07:29 +02:00
Nadav Har'El
8164c52871 cql-pytest: add test for fromJson() parse error
This patch adds a reproducer test for issue #7911, which is about a parse
error in JSON string passed to the fromJson() function causing an
internal error instead of the expected FunctionFailure error.

The issue is still open so the test xfails on Scylla (and passes on
Cassandra).

Refs #7911.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112094629.3920472-1-nyh@scylladb.com>
2021-01-13 11:07:29 +02:00
Pavel Solodovnikov
10e3da692f lwt: validate paxos_grace_seconds table option
The option can only take integer values >= 0, since negative
TTL is meaningless and is expected to fail the query when used
with `USING TTL` clause.

It's better to fail early on `CREATE TABLE` and `ALTER TABLE`
statement with a descriptive message rather than catch the
error during the first lwt `INSERT` or `UPDATE` while trying
to insert to system.paxos table with the desired TTL.

Tests: unit(dev)
Fixes: #7906

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210111202942.69778-1-pa.solodovnikov@scylladb.com>
2021-01-13 11:07:29 +02:00
Gleb Natapov
51bf5f5846 raft: test: do not check snapshot during backpressure test
Unfortunately snapshot checking still does not work in the presence of
log entries reordering. It is impossible to know when exactly the
snapshot will be taken and if it is taken before all smaller than
snapshot idx entries are applied the check will fail since it assumes
that.

This patch disabled snapshot checking for SUM state machine that is used
in backpressure test.

Message-Id: <20201126122349.GE1655743@scylladb.com>
2021-01-13 11:07:29 +02:00
Wojciech Mitros
59769efd3b idl: change the type of mutation_partition_view::rows() to a chunked_vector
The value of mutation_partition_view::rows() may be very large, but is
used almost exclusively for iteration, so in order to avoid a big allocation
for an std::vector, we change its type to an utils::chunked_vector.

Fixes #7918

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-01-13 04:25:53 +01:00
Wojciech Mitros
88e750f379 idl-compiler: allow fields of type utils::chunked_vector
The utils::chunked_vector has practically the same methods
as a std::vector, so the same code can be generated for it.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-01-13 04:09:18 +01:00
Avi Kivity
ccd09f1398 Update seastar submodule
* seastar 6b36e84c3...a287bb1a3 (1):
  > merge: file: correct dma alignment for odd filesystems

Ref #7794.
2021-01-11 20:38:59 +02:00
Tomasz Grabiec
6cfc949e62 Merge "sstables: validate the writer's input with the mutation fragment stream validator" from Botond
We have recently seen a suspected corrupt mutation fragment stream to get
into an sstable undetected, causing permanent corruption. One of the
suspected ways this could happen is the compaction sstable write path not
being covered with a validator. To prevent events like this in the future
make sure all sstable write paths are validated by embedding the validator
right into the sstable writer itself.

Refs: #7623
Refs: #7640

Tests: unit(release)

* https://github.com/denesb/scylla.git sstable-writer-fragment-stream-validation/v2:
  sstable_writer: add validation
  test/boost/sstable_datafile_test: sstable_scrub_test: disable key validation
  mutation_fragment_stream_validator: make it easier to validate concrete fragment types
  flat_mutation_reader: extract fragment stream validator into its own header
2021-01-11 14:57:48 +01:00
Calle Wilund
4be718ebfa commitlog: Force earlier cycle/flush iff segment reserve is empty
Attempt to hurry flushing/segment delete/recycle if we are trying
to get a segment for allocation, and reserve is empty when above
disk threshold. This is minimize time waited in allocation semaphore.
2021-01-11 12:45:36 +00:00
Calle Wilund
be8c359a62 commitlog: Make segment allocation wait iff disk usage > max
Instead of allowing new segments to be added, explicitly wait
for either disk delete or recycle to happen iff current disk
usage is larger than limit.
2021-01-11 12:45:36 +00:00
Calle Wilund
ab55a1b4e6 commitlog: Do partial (memtable) flushing based on threshold
Instead of asking to flush data for all segments, just request
up to an RP where we get comfortably below disk usage threshold.
2021-01-11 12:45:10 +00:00
Pekka Enberg
42806c6f40 Update seastar submodule
* seastar ed345cdb...6b36e84c (3):
  > perftune.py: Don't print nic driver name to avoid
Fixes #7905
  > io_tester: Make file sizes configurable
  > io_queue: Limit tickets for oversized requests
2021-01-11 14:12:06 +02:00
Pavel Solodovnikov
0981b786a8 db/query_options: specify serial consistency for DEFAULT specific_options
Cassandra constructs `QueryOptions.SpecificOptions` in the same
way that we do (by not providing `serial_constency`), but they
do have a user-defined constructor which does the following thing:

	this.serialConsistency = serialConsistency == null ? ConsistencyLevel.SERIAL : serialConsistency;

This effectively means that DEFAULT `SpecificOptions` always
have `SerialConsistency` set to `SERIAL`, while we leave this
`std::nullopt`, since we don't have a constructor for
`specific_options` which does this.

Supply `db::consistency_level::SERIAL` explicitly to the
`specific_options::DEFAULT` value.

Tests: unit(dev)
Fixes: #7850

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201231104018.362270-1-pa.solodovnikov@scylladb.com>
2021-01-11 12:12:29 +02:00
Nadav Har'El
a3f9bd9c3f cql-pytest: add xfailing reproducer for issue #7888
This adds a simple reproducer for a bug involving a CONTAINS relation on
frozen collection clustering columns when the query is restricted to a
single partition - resulting in a strange "marshalling error".

This bug still exists, so the test is marked xfail.

Refs #7888.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210107191417.3775319-1-nyh@scylladb.com>
2021-01-11 08:49:16 +01:00
Nadav Har'El
678da50a10 cql-pytest: add reproducers for reversed frozen collection bugs
We add a reproducer for issues #7868 and #7875 which are about bugs when
a table has a frozen collection as its clustering key, and it is sorted
in *reverse order*: If we tried to insert an item to such a table using an
unprepared statement, it failed with a wrong error ("invalid set literal"),
but if we try to set up a prepared statement, the result is even worse -
an assertion failure and a crash.

Interestingly, neither of these problems happen without reversed sort order
(WITH CLUSTERING ORDER BY (b DESC)), and we also add a test which
demonstrates that with default (increasing) order, everything works fine.

All tests pass successfully when run against Cassandra.

The fix for both issues was already committed, so I verified these tests
reproduced the bug before that commit, and pass now.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210110232312.3844408-1-nyh@scylladb.com>
2021-01-11 08:48:30 +01:00
Nadav Har'El
f32c34d8ad cql-pytest: port Cassandra's unit test validation/entities/frozen_collections_test
In this patch, we port validation/entities/frozen_collections_test.java,
containing 33 tests for frozen collections of all types, including
nesting collections.

In porting these tests, I uncovered four previously unknown bugs in Scylla:

Refs #7852: Inserting a row with a null key column should be forbidden.
Refs #7868: Assertion failure (crash) when clustering key is a frozen
            collection and reverse order.
Refs #7888: Certain combination of filtering, index, and frozen collection,
            causes "marshalling error" failure.
Refs #7902: Failed SELECT with tuple of reversed-ordered frozen collections.

These tests also provide two more reproducers for an already known bug:

Refs #7745: Length of map keys and set items are incorrectly limited to
            64K in unprepared CQL.

Due to these bugs, 7 out of the 33 tests here currently xfail. We actually
had more failing tests, but we fixed issue #7868 before this patch went in,
so its tests are passing at the time of this submission.

As usual in these sort of tests, all 33 pass when running against Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210110231350.3843686-1-nyh@scylladb.com>
2021-01-11 08:48:08 +01:00
Nadav Har'El
0516cd1609 alternator test: de-duplicate some duplicate code
In test_streams.py we had some code to get a list of shards and iterators
duplicated three times. Put it in a function, shards_and_latest_iterators(),
to reduce this duplication.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201006112421.426096-1-nyh@scylladb.com>
2021-01-11 08:47:25 +01:00
Botond Dénes
cb4d92aae4 sstable_writer: add validation
Add a mutation_fragment_stream_validating_filter to
sstables::writer_impl and use it in sstable_writer to validate the
fragment stream passed down to the writer implementation. This ensures
that all fragment streams written to disk are validated, and we don't
have to worry about validating each source separately.

The current validator from sstable::write_components() is removed. This
covers only part of the write paths. Ad-hoc validations in the reader
implementations are removed as well as they are now redundant.
2021-01-11 09:12:56 +02:00
Botond Dénes
4b254a26ab test/boost/sstable_datafile_test: sstable_scrub_test: disable key validation
The test violates clustering key order on purpose to produce a corrupt
sstable (to test scrub). Disable key validation so when we move the
validator into the writer itself in the next patch it doesn't abort the
test.
2021-01-11 09:12:56 +02:00
Botond Dénes
8dae6152bf mutation_fragment_stream_validator: make it easier to validate concrete fragment types
The current API is tailored to the `mutation_fragment` type. In
the next patch we will want to use the validator from a context where
the mutation fragments are already decomposed into their respective
concrete types, e.g. static_row, clustering_row, etc. To avoid having to
reconstruct a mutation fragment type just to use the validator, add an
API which allows validating these concrete types conveniently too.
2021-01-11 08:07:42 +02:00
Botond Dénes
495f9d54ba flat_mutation_reader: extract fragment stream validator into its own header
To allow using it without pulling in the huge `flat_mutation_reader.hh`.
2021-01-11 08:07:42 +02:00
Dejan Mircevski
3aa80f47fe abstract_type: Rework unreversal methods
Replace two methods for unreversal (`as` and `self_or_reversed`) with
a new one (`without_reversed`).  More flexible and better named.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #7889
2021-01-10 19:30:12 +02:00
Tomasz Grabiec
15b5b286d9 Merge "frozen_mutation: better diagnostics for out-of-order and duplicate rows" from Botond
Currently, frozen mutations, that contain partitions with out-of-order
or duplicate rows will trigger (if they even do) an assert in
`row::append_cell()`. However, this results in poor diagnostics (if at
all) as the context doesn't contain enough information on what exactly
went wrong. This results in a cryptic error message and an investigation
that can only start after looking at a coredump.

This series remedies this problem by explicitly checking for
out-of-order and duplicate rows, as early as possible, when the
supposedly empty row is created. If the row already existed (is a
duplicate) or it is not the last row in the partition (out-of-order row)
an exception is thrown and the deserialization is aborted. To further
improve diagnostics, the partition context is also added to the
exception.

Tests: unit(release)

* botond/frozen-mutation-bad-row-diagnostics/v3:
  frozen_mutation: add partition context to errors coming from deserializing
  partition_builder: accept_row(): use append_clustering_row()
  mutation_partition: add append_clustered_row()
2021-01-10 19:30:12 +02:00
Pekka Enberg
e5fe0acd15 Update seastar submodule
* seastar 56cfe179...ed345cdb (1):
  > perftune.py: Fix the dump options after adding multiple nics option
Refs #6266
2021-01-08 18:13:26 +01:00
Benny Halevy
60bde99e8e flat_mutation_reader: consume_in_thread: always filter.on_end_of_stream on return
Since we're calling _consumer.consume_end_of_stream()
unconditionally when consume_pausable_in_thread returns.

Refs #7623
Refs #7640

Test: unit(dev)
Dtest: materialized_views_test.py:TestMaterializedViews.interrupt_build_process_with_resharding_low_to_half_test

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210106103024.3494569-1-bhalevy@scylladb.com>
2021-01-08 18:13:26 +01:00
Michał Chojnowski
f317b3c39f mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator
measuring_allocator is a wrapper around standard_allocator, but it exposed
the default preferred_max_contiguous_allocation, not the one from
standard_allocator. Thus managed_bytes allocated in those two allocators
had fragments of different size, and their total memory usage differed,
causing test_external_memory_usage to fail if
standard_allocator::preferred_max_contiguous_allocation was changed from the
default. Fix that.
2021-01-08 14:16:08 +01:00
Pavel Solodovnikov
907b73a652 row_cache: more indentation fixes
Fixup indentation issues introduced in recent patches.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-08 14:16:08 +01:00
Pavel Solodovnikov
eb523d4ac8 utils: remove unused linearization facilities in managed_bytes class
Remove the following bits of `managed_bytes` since they are unused:
* `with_linearized_managed_bytes` function template
* `linearization_context_guard` RAII wrapper class for managing
  `linearization_context` instances.
* `do_linearize` function
* `linearization_context` class

Since there is no more public or private methods in `managed_class`
to linearize the value except for explicit `with_linearized()`,
which doesn't use any of aforementioned parts, we can safely remove
these.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-08 14:16:08 +01:00
Pavel Solodovnikov
8709844566 misc: fix indentation
The patch fixes indentation issues introduced in previous patches
related to removing `with_linearized_managed_bytes` uses from the
code tree.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-08 14:16:08 +01:00
Pavel Solodovnikov
e04eb68a9c treewide: remove remaining with_linearized_managed_bytes uses
There is no point in calling the wrapper since linearization code
is private in `managed_bytes` class and there is no one to call
`managed_bytes::data` because it was deleted recently.

This patch is a prerequisite for removing
`with_linearized_managed_bytes` function completely, alongside with
the corresponding parts of implementation in `managed_bytes`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-08 14:16:08 +01:00
Pavel Solodovnikov
bf8b138b42 memtable, row_cache: remove with_linearized_managed_bytes uses
Since `managed_bytes::data()` is deleted as well as other public
APIs of `managed_bytes` which would linearize stored values except
for explicit `with_linearized`, there is no point
invoking `with_linearized_managed_bytes` hack which would trigger
automatic linearization under the hood of managed_bytes.

Remove useless `with_linearized_managed_bytes` wrapper from
memtable and row_cache code.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-08 14:16:08 +01:00
Avi Kivity
3bf6b78668 utils: managed_bytes: remove linearizing accessors
Accessor that require linearization, such as data(), begin(),
and casting to bytes_view, are no longer used and are now removed.
2021-01-08 14:16:08 +01:00
Michał Chojnowski
dbcf987231 keys, compound: switch from bytes_view to managed_bytes_view
The keys classes (partition_key et al) already use managed_bytes,
but they assume the data is not fragmented and make liberal use
of that by casting to bytes_view. The view classes use bytes_view.

Change that to managed_bytes_view, and adjust return values
to managed_bytes/managed_bytes_view.

The callers are adjusted. In some places linearization (to_bytes())
is needed, but this isn't too bad as keys are always <= 64k and thus
will not be fragmented when out of LSA. We can remove this
linearization later.

The serialize_value() template is called from a long chain, and
can be reached with either bytes_view or managed_bytes_view.
Rather than trace and adjust all the callers, we patch it now
with constexpr if.

operator bytes_view (in keys) is converted to operator
managed_bytes_view, allowing callers to defer or avoid
linearization.
2021-01-08 14:16:08 +01:00
Michał Chojnowski
a1a0839164 sstables: writer: add write_* helpers for managed_bytes_view
We will use them in the upcoming patch where we transition keys from bytes_view
to mutable_bytes_view.
2021-01-08 14:16:08 +01:00
Michał Chojnowski
45c1b90eb5 compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view
The underlying view will change from bytes_view to managed_bytes_view in the
next commits, so we prepare for that.
2021-01-08 14:16:08 +01:00
Avi Kivity
d9fcc4f4ef types: change equal() to accept managed_bytes_view
bytes_view can convert to managed_bytes_view, so the change
is compatible with the existing representation and the next
patches, which change compound types to use managed_bytes_view.
2021-01-08 14:16:08 +01:00
Michał Chojnowski
1de0b9a425 types: add parallel interfaces for managed_bytes_view
We will need those to transition keys and compound from bytes_view to
managed_bytes_view.
2021-01-08 14:16:08 +01:00
Avi Kivity
d1f354f5fb types: add to_managed_bytes(const sstring&)
This is a helper for tests (similar to to_bytes(const sstring&)).
2021-01-08 14:16:08 +01:00
Michał Chojnowski
c6eb485675 serializer_impl: handle managed_bytes without linearizing
With managed_bytes_view implemented, it's easy to de/serialize managed_bytes
without linearization.
2021-01-08 14:16:08 +01:00
Michał Chojnowski
bf0ec63e34 utils: managed_bytes: add managed_bytes_view::operator[]
This operator has a single purpose: an easier port of legacy_compound_view
from bytes_view to managed_bytes_view.
It is inefficient and should be removed as soon as legacy_compound_view stops
using operator[].
2021-01-08 14:16:08 +01:00
Michał Chojnowski
778269151a utils: managed_bytes: introduce managed_bytes_view
managed_bytes_view is a non-owning view into managed_bytes.
It can also be implicitly constructed from bytes_view.

It conforms to the FragmentedView concept and is mainly used through that
interface.

It will be used as a replacement for bytes_view occurrences currently
obtained by linearizing managed_bytes.
2021-01-08 14:16:08 +01:00
Michał Chojnowski
cf7d25b98d utils: fragment_range: add serialization helpers for FragmentedMutableView
We will use them to write to managed_bytes_view in an upcoming patch,
to avoid linearization in compound_type::serialize_value.
2021-01-08 14:16:07 +01:00
Michał Chojnowski
75898ee44e bytes: implement std::hash using appending_hash
This is a preparation for the upcoming introduction of managed_bytes_view,
intended as a fragmented replacement for bytes_view.
To ease the transition, we want both types to give equal hashes for equal
contents.
2021-01-08 13:17:46 +01:00
Michał Chojnowski
4822730752 utils: mutable_view: add substr()
Analogous to bytes_view::substr.
This bit of functionality will be used to implement managed_bytes_mutable_view.
2021-01-08 13:17:46 +01:00
Dejan Mircevski
9eed26ca3d cql3: Fix maps::setter_by_key for unset values
Unset values for key and value were not handled.  Handle them in a
manner matching Cassandra.

This fixes all cases in testMapWithUnsetValues, so re-enable it (and
fix a comment typo in it).

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-01-07 13:22:20 +02:00
Dejan Mircevski
4515a49d4d cql3: Fix IN ? for unset values
When the right-hand side of IN is an unset value, we must report an
error, like Cassandra does.

This fixes testListWithUnsetValues, so re-enable it.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-01-07 13:22:20 +02:00
Dejan Mircevski
5bee97fa51 cql3: Fix handling of scalar unset value
Make the bind() operation of the scalar marker handle the unset-value
case (which it previously didn't).

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-01-07 13:22:20 +02:00
Dejan Mircevski
8b2f459622 cql3: Fix crash when removing unset_value from set
Avoid crash described in #7740 by ignoring the update when the
element-to-remove is UNSET_VALUE.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-01-07 13:22:20 +02:00
Pekka Enberg
e81f4caf67 Update seastar submodule
* seastar a2fc9d72...56cfe179 (1):
  > perftune.py: Fix nic_is_bond_iface() and other function signatures
Refs #6266
2021-01-07 13:22:20 +02:00
Takuya ASADA
10184ba64f redis: implement parse error, reply error message correctly
Since we haven't implemented parse error on redis protocol parser,
reply message is broken at parse error.
Implemented parse error, reply error message correctly.

Fixes #7861
Fixes #7114

Closes #7862
2021-01-07 13:22:20 +02:00
Dejan Mircevski
176ff0238a cql3: Fix handling of reverse-order maps
When the clustering order is reversed on a map column, the column type
is reversed_type_impl, not map_type_impl.  Therefore, we have to check
for both reversed type and map type in some places.

This patch handles reverse types in enough places to make
test_clustering_key_reverse_frozen_map pass.  However, it leaves
other places (invocations of is_map() and *_cast<map_type_impl>())
as they currently are; some are protected by callers from being
invoked on reverse types, but some are quite possibly bugs untriggered
by existing tests.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-01-07 13:22:20 +02:00
Dejan Mircevski
6bb10fcf36 cql3: Fix handling of reverse-order lists
When the clustering order is reversed on a list column, the column type
is reversed_type_impl, not list_type_impl.  Therefore, we have to check
for both reversed type and list type in some places.

This patch handles reverse types in enough places to make
test_clustering_key_reverse_frozen_list pass.  However, it leaves
other places (invocations of is_list() and *_cast<list_type_impl>())
as they currently are; some are protected by callers from being
invoked on reverse types, but some are quite possibly bugs untriggered
by existing tests.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-01-07 13:22:20 +02:00
Dejan Mircevski
14fa39cfa6 cql3: Fix handling of reverse-order sets
When the clustering order is reversed on a set column, the column type
is reversed_type_impl, not set_type_impl.  Therefore, we have to check
for both reversed type and set type in some places.

To make such checks easier, add convenience methods self_or_reversed()
and as() to abstract_type.  Invoke those methods (instead of is_set()
and casts) enough to make test_clustering_key_reverse_frozen_set pass.
Leave other invocations of is_set() and *_cast<set_type_impl>() as
they are; some are protected by callers from being invoked on reverse
types, but some are quite possibly bugs untriggered by existing tests.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-01-07 13:22:20 +02:00
Calle Wilund
7c84b16cd8 commitlog: Make flush threshold configurable 2021-01-05 18:16:09 +00:00
Calle Wilund
c3d95811da table: Add a flush RP mark to table, and shortcut if not above
Adds a second RP to table, marking where we flushed last.
If a new flush request comes in that is below this mark, we
can skip a second flush.

This is to (in future) support incremental CL flush.
2021-01-05 18:16:09 +00:00
Raphael Carvalho
28a2aca627 Fix doc for building pkgs for a specific build mode
Closes #7878
2021-01-05 18:56:21 +02:00
Tomasz Grabiec
1d717f37e2 vint-serialization: Reference the correct spec
We are not using the protobol buffers format for vint.

Message-Id: <1609865471-22292-1-git-send-email-tgrabiec@scylladb.com>
2021-01-05 18:54:09 +02:00
Vojtech Havel
d858c57357 cql3: allow SELECTs restricted by "IN" to retrieve collections
This patch enables select cql statements where collection columns are
selected  columns in queries where clustering column is restricted by
"IN" cql operator. Such queries are accepted by cassandra since v4.0.

The internals actually provide correct support for this feature already,
this patch simply removes relevant cql query check.

Tests: cql-pytest (testInRestrictionWithCollection)

Fixes #7743
Fixes #4251

Signed-off-by: Vojtech Havel <vojtahavel@gmail.com>
Message-Id: <20210104223422.81519-1-vojtahavel@gmail.com>
2021-01-05 14:39:18 +02:00
Pekka Enberg
e54cc078a1 Update seastar submodule
* seastar d1b5d41b...a2fc9d72 (6):
  > perftune.py: support passing multiple --nic options to tune multiple interfaces at once
  > perftune.py recognize and sort IRQs for Mellanox NICs
  > perftune.py: refactor getting of driver name into __get_driver_name()
Fixes #6266
  > install-dependencies: support Manjaro
  > append_challenged_posix_file_impl: optimize_queue: use max of sloppy_size_hint and speculative_size
  > future: do_until: handle exception in stop condition
2021-01-05 13:32:21 +02:00
Avi Kivity
43a2636229 Merge "Remove proxy from size-estimates reader" from Pavel E
"
The size_estimates_mutation_reader call for global proxy
to get database from. The database is used to find keyspaces
to work with. However, it's safe to keep the local database
refernece on the reader itself.

tests: unit(debug)
"

* 'br-no-proxy-in-size-estimate-reader' of https://github.com/xemul/scylla:
  size_estimate_reader: Use local db reference not global
  size_estimate_reader: Keep database reference on mutation reader
  size_estimate_reader: Keep database reference on virtual_reader
2021-01-05 11:28:09 +02:00
Pavel Emelyanov
9632af5d6b schema_tables: Drop unused merge_schema overload
After the d3aa1759 one of them became unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210105051724.5249-1-xemul@scylladb.com>
2021-01-05 11:25:22 +02:00
Michał Chojnowski
6c97027f85 utils: fragment_range: add compare_unsigned
We will use it to compare fragmented buffers (mainly managed_bytes_view in
types, compound, and tests) without linearization.
2021-01-04 22:50:45 +01:00
Michał Chojnowski
2d28471a59 utils: managed_bytes: make the constructors from bytes and bytes_view explicit
Conversions from views to owners have no business being implicit.
Besides, they would also cause various ambiguity problems when adding
managed_bytes_view.
2021-01-04 22:22:12 +01:00
Raphael S. Carvalho
d265bb9bdb test: Add test for TWCS interposer on memtable flush
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 16:55:06 -03:00
Raphael S. Carvalho
9124a708f1 table: Wire interposer consumer for memtable flush
From now on, memtable flush will use the strategy's interposer consumer
iff split_during_flush is enabled (disabled by default).
It has effect only for TWCS users as TWCS it's the only strategy that
goes on to implement this interposer consumer, which consists of
segregating data according to the window configuration.

Fixes #4617.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 16:26:07 -03:00
Raphael S. Carvalho
c926a948e5 table: Add write_memtable_to_sstable variant which accepts flat_mutation_reader
This new variant will be needed for interposer consumer.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 16:23:00 -03:00
Raphael S. Carvalho
32acb44fec table: Allow sstable write permit to be shared across monitors
As a preparation for interposer on flush, let's allow database write monitor
to store a shared sstable write permit, which will be released as soon as
any of the sstable writers reach the sealing stage.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 14:46:43 -03:00
Nadav Har'El
ed31dd1742 cql-pytest: port Cassandra's unit test validation/entities/counters_test
In this patch, we port validation/entities/collection_test.java, containing
7 tests for CQL counters. Happily, these tests did not uncover any bugs in
Scylla and all pass on both Cassandra and Scylla.

There is one small difference that I decided to ignore instead of reporting
a bug. If you try a CREATE TABLE with both counter and non-counter columns,
Scylla gives a ConfigurationException error, while Cassandra gives a more
reasonable InvalidRequest. The ported test currently allows both.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201223181325.3148928-1-nyh@scylladb.com>
2021-01-04 18:25:48 +01:00
Nadav Har'El
05d6eff850 cql-pytest: add tests for non-support of unicode equivalence`
In issue #7843 there were questions raised on how much does Scylla
support the notion of Unicode Equivalence, a.k.a. Unicode normalization.

Consider the Spanish letter ñ - it can be represented by a single Unicode
character 00F1, but can also be represented as a 006E (lowercase "n")
followed by a 0303 ("combining tilde"). Unicode specifies that these
two representations should be considered "equivalent" for purposes of
sorting or searching. But the following tests demonstrates that this
is not, in fact, supported in Scylla or Cassandra:

1. If you use one representation as the key, then looking up the other one
   will not find the row. Scylla (and Cassandra) do *not* consider
   the two strings equivalent.

2. The LIKE operator (a Scylla-only extension) doesn't know that
   the single-character ñ begins with an n, or that the two-character
   ñ is just a single character.
   This is despite the thinking on #7843 which by using ICU in the
   implementation of LIKE, we somehow got support for this. We didn't.

Refs #7843

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201229125330.3401954-1-nyh@scylladb.com>
2021-01-04 18:25:28 +01:00
Nadav Har'El
feb028c97e cql-pytest: add reproducer for issue 7856
This patch adds a reproducer for issue #7856, which is about frozen sets
and how we can in Scylla (but not in Cassandra), insert one in the "wrong"
order, but only in very specific circumstances which this reproducer
demonstrates: The bug can only be reproduced in a nested frozen collection,
and using prepared statements.

Refs #7856

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201231085500.3514263-1-nyh@scylladb.com>
2021-01-04 18:25:12 +01:00
Raphael S. Carvalho
738049cba2 memtable: Track min timestamp
Tracking both min and max timestamp will be required for memtable flush
to short-circuit interposer consumer if needed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 13:24:43 -03:00
Raphael S. Carvalho
5519fdba72 table: Extend cache update to operate a memtable split into multiple sstables
This extension is needed for future work where a memtable will be segregated
during flush into one sstable or more. So now multiple sstables can be added
to the set after a memtable flush, and compaction is only triggered at the
end.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 13:24:10 -03:00
Piotr Sarna
d5da455d95 schema_tables: describe calculate_schema_digest better
- the mystical `accept_predicate` is renamed to `accept_keyspace`
   to be more self-descriptive
 - a short comment is added to the original calculate_schema_digest
   function header, mentioning that it computes schema digest
   for non-system keyspaces

Refs #7854

Message-Id: <04f1435952940c64afd223bd10a315c3681b1bef.1609763443.git.sarna@scylladb.com>
2021-01-04 14:46:17 +02:00
Amos Kong
8b231a3bd9 install.sh: switch to use realpath for EnvironmentFile
In scylla-jmx, we fixed a hardcode sysconfdir in EnvironmentFile path,
realpath was used to convert the path. This patch changed to use
realpath in scylla repo to make it consistent with scylla-jmx.

Suggested-by: Pekka Enberg <penberg@scylladb.com>
Signed-off-by: Amos Kong <amos@scylladb.com>

Closes #7860
2021-01-04 12:45:17 +02:00
Avi Kivity
33ee07a9d8 Merge 'Skip internal distributed tables in schema_change_test' from Piotr Sarna
The original idea for `schema_change_test` was to ensure that **if** schema hasn't changed, the digest also remained unchanged. However, a cumbersome side effect of adding an internal distributed table (or altering one) is that all digests in `schema_change_test` are immediately invalid, because the schema changed.
Until now, each time a distributed system table was added/amended, a new test case for `schema_change_test` was generated, but this effort is not worth the effect - when a distributed system table is added, it will always propagate on its own, so generating a new test case does not bring any tangible new test coverage - it's just a pain.

To avoid this pain, `schema_change_test` now explicitly skips all internal keyspaces - which includes internal distributed tables - when calculating schema digest. That way, patches which change the way of computing the digest itself will still require adding a new test case, which is good, but, at the same time, changes to distributed tables will not force the developers to introduce needless schema features just for the sake of this test.

Tests:
 * unit(dev)
 * manual(rebasing on top of a change which adds two distributed system tables - all tests still passed)

Refs #7617

Closes #7854

* github.com:scylladb/scylla:
  schema_change_test: skip distributed system tables in digest
  schema_tables: allow custom predicates in schema digest calc
  alternator: drop unneeded sstring creation
  system_keyspace: migrate helper functions to string_view
  database: migrate find_keyspace to string views
2021-01-04 12:44:03 +02:00
Piotr Sarna
e26aa836a9 schema_change_test: skip distributed system tables in digest
With previous design of the schema change test, a regeneration
was necessary each time a new distributed system table was added.
It was not the original purpose of the test to keep track of new
distributed tables which simply propagate on their own,
so the test case is now modified: internal distributed tables
are not part of the schema digest anymore, which means that
changes inside them will not cause mismatches.

This change involves a one-shot regeneration of all digests,
which due to historical reasons included internal distributed
tables in the digest, but no further regenerations should ever
be necessary when a new internal distributed table is added.
2021-01-04 10:24:40 +01:00
Piotr Sarna
13a60b02ea schema_tables: allow custom predicates in schema digest calc
For testing purposes it would be useful to be able to skip computing
schema for certain tables (namely, internal distributed tables).
In order to allow that, a function which accepts a custom predicate
is added.
2021-01-04 10:11:41 +01:00
Piotr Sarna
12b5184933 alternator: drop unneeded sstring creation
It's now possible to use string views to check if a particular
table is a system table, so it's no longer needed to explicitly
create an sstring instance.
2021-01-04 09:47:01 +01:00
Piotr Sarna
f293c59a46 system_keyspace: migrate helper functions to string_view
Functions for checking if the keyspace is system/internal were based
on sstring references, which is impractical compared to string views
and may lead to unnecessary creation of sstring instances.
2021-01-04 09:47:01 +01:00
Piotr Sarna
aba9772eff database: migrate find_keyspace to string views
... in order to avoid creating unnecessary sstring instances
just to compare strings.
2021-01-04 09:47:01 +01:00
Gleb Natapov
d3aa17591c migration_manager: drop announce_locally flag
It looks like the history of the flag begins in Cassandra's
https://issues.apache.org/jira/browse/CASSANDRA-7327 where it is
introduced to speedup tests by not needing to start the gossiper.
The thing is we always start gossiper in our cql tests, so the flag only
introduce noise. And, of course, since we want to move schema to use raft
it goes against the nature of the raft to be able to apply modification only
locally, so we better get rid of the capability ASAP.

Tests: units(dev, debug)
Message-Id: <20201230111101.4037543-2-gleb@scylladb.com>
2021-01-03 13:58:09 +02:00
Gleb Natapov
491f10bb70 schema-tables: make schema update global when fixing legacy SI tables
When a node notice that it uses legacy SI tables it converts them to use
new format, but it update only local schema. It will only cause schema
discrepancy between nodes, there schema change should propagate
globally.

Fixes #7857.

Message-Id: <20201230111101.4037543-1-gleb@scylladb.com>
2021-01-03 13:57:46 +02:00
Raphael S. Carvalho
d55d65d77c compaction: Enable filtering reader only on behalf of cleanup compaction
After 13fa2bec4c, every compaction will be performed through a filtering
reader because consumers cannot do the filtering if interposer consumer is
enabled.

It turns out that filtering_reader is adding significant overhead when regular
compactions are running. As no other compaction type need to actually do
any filtering, let's limit filtering_reader to cleanup compaction.
Alternatively, we could disable interposer consumer on behalf of cleanup,
or add support for the consumers to do the filtering themselves but that
would add lots of complexity.

Fixes #7748.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201230194516.848347-2-raphaelsc@scylladb.com>
2021-01-03 12:02:43 +02:00
Raphael S. Carvalho
e42d277805 compaction: Drop needless partition filter for regular compaction
This filter is used to discard data that doesn't belong to current
shard, but scylla will only make a sstable available to regular
compaction after it was resharded on either boot or refresh.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201230194516.848347-1-raphaelsc@scylladb.com>
2021-01-03 12:02:42 +02:00
Pekka Enberg
5872b754e0 Revert "dist/docker: Remove 'epel-release' from Docker image"
This reverts commit ceb67e7728. The
"epel-release" package is needed to install the "supervisord"
package, which I somehow missed in testing...

Fixes #7851
2021-01-02 12:49:12 +02:00
Nadav Har'El
93a2c52338 cql-pytest: add tests for inserting rows with missing key columns
This patch adds two simple tests for what happens when a user tries to
insert a row with one of the key column missing. The first tests confirms
that if the column is completely missing, we correctly print an error
(this was issue #3665, that was already marked fixed).

However, the second test demonstrates that we still have a bug when
the key column appears on the command, but with a null value.
In this case, instead of failing the insert (as Cassandra does),
we silently ignore it. This is the proper behavior for UNSET_VALUE,
but not for null. So the second test is marked xfail, and I opened
issue #7852 about it.

Refs #3665
Refs #7852

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201230132350.3463906-1-nyh@scylladb.com>
2020-12-30 18:20:01 +01:00
Nadav Har'El
10fbef5bff cql-pytest: clean up test_using_timeout.py
In a previous version of test_using_timeout.py, we had tables pre-filled
with some content labled "everything". The current version of the tests
don't use it, so drop it completely.

One test, test_per_query_timeout_large_enough, still had code that did
	res = list(cql.execute(f"SELECT * FROM {table} USING TIMEOUT 24h"))
	assert res == everything
this was a bug - it only works as expected if this test is run before
anything other test is run, and will fail if we ever reorder or parallelize
these tests. So drop these two lines.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201229145435.3421185-1-nyh@scylladb.com>
2020-12-30 09:16:25 +01:00
Asias He
84f482bde4 table: Add make_streaming_reader for given sstables set
Add a streaming reader that streams from a given sstables set.

Refs #7831
2020-12-30 08:32:42 +08:00
Nadav Har'El
5f24ff9187 Merge 'Coroutinize alternator tagging requests' from Piotr Sarna
This miniseries rewrites two alternator request handlers from seastar threads to coroutines - since these handlers are not on a hot path and using seastar threads is way too heavy for such a simple routine.

NOTE: this pull request obviously has to wait until coroutines are fully supported in Seastar/Scylla.

Closes #7453

* github.com:scylladb/scylla:
  alternator: coroutinize untagging a resource
  alternator: coroutinize tagging a resource
2020-12-29 23:36:25 +02:00
Avi Kivity
700ddd1914 Merge 'scylla_setup: enable node_exporter for offline installation' from Amos Kong
node_exporter had been added to scylla-server package by commit
95197a09c9.

So we can enable it by default for offline installation.

Closes #7832

* github.com:scylladb/scylla:
  scylla_setup: cleanup if judgments
  scylla_setup: enable node_exporter for offline installation
2020-12-28 22:07:36 +02:00
Avi Kivity
1716359455 Update tools/jmx submodule
* tools/jmx 20469bf...2c95650 (1):
  > install.sh: set a valid WorkingDirectory for nonroot offline install
2020-12-28 21:19:04 +02:00
Avi Kivity
f7b731bc46 Merge 'Fix potential reactor stall on LCS compaction completion' from Raphael Carvalho
On every compaction completion, sstable set is rebuilt from scratch.
With LCS and ~160G of data per shard, it means we'll have to create
a new sstable set with ~1000 entries whenever compaction completes,
which will likely result in reactor stalling for a significant
amount of time.

Fixes #7758.

Closes #7842

* github.com:scylladb/scylla:
  table: Fix potential reactor stall on LCS compaction completion
  table: decouple preparation from execution when updating sstable set
  table: change rebuild_sstable_list to return new sstable set
  row_cache: allow external updater to decouple preparation from execution
2020-12-28 21:16:17 +02:00
Pavel Emelyanov
7ac435f67c test: Enhance test for range_tombstone_list de-overlapping
The range_tombstone_list always (unless misused?) contains de-overlapped
entries. There's a test_add_random that checks this, but it suffers from
several problems:

- generated "random" ranges are sequential and may only overlap on
  their borders
- test uses the keys of the same prefix length

Enhance the generator part to produce a purely random sequence of ranges
with bound keys of arbitrary length. Just pay attention to generate the
"valid" individual ranges, whose start is not ahead of the end.

Also -- rename the test to reflect what it's doing and increase the
number of iterations.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201228115525.20327-1-xemul@scylladb.com>
2020-12-28 18:26:48 +02:00
Raphael S. Carvalho
8dd7280107 table: Fix potential reactor stall on LCS compaction completion
On every compaction completion, sstable set is rebuilt from scratch.
With LCS and ~160G of data per shard, it means we'll have to create
a new sstable set with ~1000 entries whenever compaction completes,
which will likely result in reactor stalling for a significant
amount of time.

This is fixed by futurizing build_new_sstable_list(), so it will
yield whenever needed.

Fixes #7758.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-12-28 13:17:50 -03:00
Raphael S. Carvalho
6082da4703 table: decouple preparation from execution when updating sstable set
row cache now allows updater to first prepare the work, and then execute
the update atomically as the last step. let's do that when rebuilding
the set, so now new set is created in the preparation phase, and the
new set replaces the old one in the execution phase, satisfying the
atomicity requirement of row cache.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-12-28 13:17:48 -03:00
Raphael S. Carvalho
43f0200b8f table: change rebuild_sstable_list to return new sstable set
procedure is changed to return the new set, so caller will be responsible
for replacing the old set with the new one. this will allow our future
work where building new set and enabling it will be decoupled.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-12-28 13:17:47 -03:00
Raphael S. Carvalho
198b87503f row_cache: allow external updater to decouple preparation from execution
External updater may do some preparatory work like constructing a new sstable list,
and at the end atomically replace the old list by the new one.

Decoupling the preparation from execution will give us the following benefits:
- the preparation step can now yield if needed to avoid reactor stalls, as it's
been futurized.
- the execution step will now be able to provide strong exception guarantees, as
it's now decoupled from the preparation step which can be non-exception-safe.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-12-28 13:17:45 -03:00
Avi Kivity
3325960486 Update seastar submodule
* seastar 1f5e3d3419...d1b5d41b6d (1):
  > append_challenged_posix_file_impl: adjust sloppy_size only in optimize_queue

Fixes #7439 (the coredump part).
2020-12-28 13:00:04 +02:00
Nadav Har'El
7eda6b1e90 cql-pytest: increase default request timeout
The CQL tests in test/cql-pytest use the Python CQL driver's default
timeout for execute(), which is 10 seconds. This usually more than
enough. However, in extreme cases like noted in issue #7838, 10
seconds may not be enough. In that issue, we run a very slow debug
build on a very slow test machine, and encounter a very slow request
(a DROP KEYSPACE that needs to drop multiple tables).

So this patch increases the default timeout to an even larger
120 seconds. We don't care that this timeout is ridiculously
large - under normal operations it will never be reached, there
is no code which loops for this amount of time for example.

Tested that this patch fixes #7838 by choosing a much lower timeout
(1 second) and reproducing test failures caused by timeouts.

Fixes #7838.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201228090847.3234862-1-nyh@scylladb.com>
2020-12-28 11:19:37 +02:00
Amos Kong
8723b0ce86 leveled_compaction_strategy: fix boundary of maximum sstable level
The MAX_LEVELS is the levels count, but sstable level (index) starts
from 0. So the maximum and valid level is MAX_LEVELS - 1.

Signed-off-by: Amos Kong <amos@scylladb.com>

Closes #7833
2020-12-27 18:59:54 +02:00
Benny Halevy
8a745a0ee0 compaction: compaction_writer: destroy shared_sstable after the sstable_writer
sstable_writer may depend on the sstable throughout its whole lifecycle.
If the sstable is freed before the sstable_writer we might hit use-after-free
as in the follwing case:
```
std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator+=(long) at /usr/include/c++/10/bits/stl_deque.h:240
 (inlined by) std::operator+(std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*> const&, long) at /usr/include/c++/10/bits/stl_deque.h:378
 (inlined by) std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator[](long) const at /usr/include/c++/10/bits/stl_deque.h:252
 (inlined by) std::deque<sstables::compression::segmented_offsets::bucket, std::allocator<sstables::compression::segmented_offsets::bucket> >::operator[](unsigned long) at /usr/include/c++/10/bits/stl_deque.h:1327
 (inlined by) sstables::compression::segmented_offsets::push_back(unsigned long, sstables::compression::segmented_offsets::state&) at ./sstables/compress.cc:214
sstables::compression::segmented_offsets::writer::push_back(unsigned long) at ./sstables/compress.hh:123
 (inlined by) compressed_file_data_sink_impl<crc32_utils, (compressed_checksum_mode)1>::put(seastar::temporary_buffer<char>) at ./sstables/compress.cc:519
seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at table.cc:?
 (inlined by) seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at ././seastar/include/seastar/core/iostream-impl.hh:432
seastar::output_stream<char>::flush() at table.cc:?
seastar::output_stream<char>::close() at table.cc:?
sstables::file_writer::close() at sstables.cc:?
sstables::mc::writer::~writer() at writer.cc:?
 (inlined by) sstables::mc::writer::~writer() at ./sstables/mx/writer.cc:790
sstables::mc::writer::~writer() at writer.cc:?
flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at compaction.cc:?
 (inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_destroy() at /usr/include/c++/10/optional:260
 (inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_reset() at /usr/include/c++/10/optional:280
 (inlined by) std::_Optional_payload<sstables::compaction_writer, false, false, false>::~_Optional_payload() at /usr/include/c++/10/optional:401
 (inlined by) std::_Optional_base<sstables::compaction_writer, false, false>::~_Optional_base() at /usr/include/c++/10/optional:474
 (inlined by) std::optional<sstables::compaction_writer>::~optional() at /usr/include/c++/10/optional:659
 (inlined by) sstables::compacting_sstable_writer::~compacting_sstable_writer() at ./sstables/compaction.cc:229
 (inlined by) compact_mutation<(emit_only_live_rows)0, (compact_for_sstables)1, sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_mutation() at ././mutation_compactor.hh:468
 (inlined by) compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_for_compaction() at ././mutation_compactor.hh:538
 (inlined by) std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::operator()(compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>*) const at /usr/include/c++/10/bits/unique_ptr.h:85
 (inlined by) std::unique_ptr<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>, std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~unique_ptr() at /usr/include/c++/10/bits/unique_ptr.h:361
 (inlined by) stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::~stable_flattened_mutations_consumer() at ././mutation_reader.hh:342
 (inlined by) flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at ././flat_mutation_reader.hh:201
auto flat_mutation_reader::impl::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:272
 (inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:383
 (inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:389
 (inlined by) seastar::future<void> sstables::compaction::setup<noop_compacted_fragments_consumer>(noop_compacted_fragments_consumer)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader)::{lambda()#1}::operator()() at ./sstables/compaction.cc:612
```

What happens here is that:

    compressed_file_data_sink_impl(output_stream<char> out, sstables::compression* cm, sstables::local_compression lc)
            : _out(std::move(out))
            , _compression_metadata(cm)
            , _offsets(_compression_metadata->offsets.get_writer())
            , _compression(lc)
            , _full_checksum(ChecksumType::init_checksum())

_compression_metadata points to a buffer held by the sstable object.
and _compression_metadata->offsets.get_writer returns a writer that keeps
a reference to the segmented_offsets in the sstables::compression
that is used in the ~writer -> close path.

Fixes #7821

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201227145726.33319-1-bhalevy@scylladb.com>
2020-12-27 17:02:13 +02:00
Pavel Emelyanov
387889315e mutation-partition: Relax putting a dummy entry into a continuous range
When applying a mutation partition to another if a dummy entry
from the source falls into a destination continuous range, it
can be just dropped. However, current implementation still
inserts it and then instantly removes.

Relax this code-flow by dropping the unwanted entry without
tossing it.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224130438.11389-1-xemul@scylladb.com>
2020-12-27 14:47:32 +02:00
Amos Kong
9adc6f68ee scylla_setup: cleanup if judgments
This patch merged two nested if judgments.

Signed-off-by: Amos Kong <amos@scylladb.com>
2020-12-26 04:45:25 +08:00
Amos Kong
632b01ce4e scylla_setup: enable node_exporter for offline installation
node_exporter had been added to scylla-server package by commit
95197a09c9.

So we can enable it by default for offline installation.

Signed-off-by: Amos Kong <amos@scylladb.com>
2020-12-25 10:54:31 +08:00
Pavel Emelyanov
72c2482f73 mutation-partition: Construct rows_entry directly from clustering_row
When a rows_entry is added to row_cache it's constructed from
clustering_row  by unpacking all its internals and putting
them into the rows_entry's deletable_row. There's a shorter
way -- the clustering_row already has the deletale_row onboard
from which rows_entry can copy-construct its.

This lets keeping the rows_entry and deletable_row set of
constructors a bit shorter.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224161112.20394-1-xemul@scylladb.com>
2020-12-24 18:13:44 +02:00
Avi Kivity
8f06a687b4 Merge "idl: minor improvements to idl compiler" from Pavel S
"
This series does a lot of cleanups, dead code removal, and most
importantly fixes the following things in IDL compiler tool:
 * The grammar now rejects invalid identifiers, which,
   in some cases, allowed to write things like `std:vector`.
 * Error reporting is improved significantly and failures are
   now pointing to the place of failure much more accurately.
   This is done by restricting rule backtracing on those rules
   which don't need it.
"

* 'idl-compiler-minor-fixes-v4' of https://github.com/ManManson/scylla:
  idl: move enum and class serializer code writers to the corresponding AST classes
  idl: extract writer functions for `write`, `read` and `skip` impls for classes and enums
  idl: minor fixes and code simplification
  idl: change argument name from `hout` to `cout` in all dependencies of `add_visitors` fn
  idl: fix parsing of basic types and discard unneeded terminals
  idl: remove unused functions
  idl: improve error tracing in the grammar and tighten-up some grammar rules
  idl: remove redundant `set_namespace` function
  idl: remove unused `declare_class` function
  idl: slightly change `str` and `repr` for AST types
  idl: place directly executed init code into if __name__=="__main__"
2020-12-24 15:14:09 +02:00
Takuya ASADA
95197a09c9 dist: add node_exporter to scylla-server package
To connection-less environment, we need to add node_exporter binary
to scylla-server package, not downloading it from internet.

Related #7765
Fixes #2190

Closes #7796
2020-12-24 11:44:13 +02:00
Pavel Solodovnikov
219ac2bab5 large_data_handler: fix segmentation fault when constructing data_value from a nullptr
It turns out that `cql_table_large_data_handler::record_large_rows`
and `cql_table_large_data_handler::record_large_cells` were broken
for reporting static cells and static rows from the very beginning:

In case a large static cell or a large static row is encountered,
it tries to execute `db::try_record` with `nullptr` additional values,
denoting that there is no clustering key to be recorded.

These values are next passed to `qctx.execute_cql()`, which
creates `data_value` instances for each statement parameter,
hence invoking `data_value(nullptr)`.

This uses `const char*` overload which delegates to
`std::string_view` ctor overload. It is UB to pass `nullptr`
pointer to `std::string_view` ctor. Hence leading to
segmentation faults in the aforementioned large data reporting
code.

What we want here is to make a null `data_value` instead, so
just add an overload specifically for `std::nullptr_t`, which
will create a null `data_value` with `text` type.

A regression test is provided for the issue (written in
`cql-pytest` framework).

Tests: test/cql-pytest/test_large_cells_rows.py

Fixes: #6780

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201223204552.61081-1-pa.solodovnikov@scylladb.com>
2020-12-24 11:37:43 +02:00
Nadav Har'El
79faaa34c7 alternator test: confirm that list index can't be a reference
In Alternator's expression parser in alternator/expressions.g, a list can be
indexed by a '[' INTEGER ']'. I had doubts whether maybe a value-reference
for the index, e.g., "something[:xyz]", should also work. So this patch adds
a test that checks whether "something[:xyz]" works, and confirms that both
DynamoDB and Alternator don't accept it and consider it a syntax error.

So Alternator's parser is correct to insist that the index be a literal
integer.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201214100302.2807647-1-nyh@scylladb.com>
2020-12-24 11:37:29 +02:00
Piotr Sarna
b62457d5b0 test: add verification to using timeout prepared statements
Previously the test cases only verified that the queries
did not time out with sufficiently large timeout, but now
they also check that appropriate data is inserted and can be read.
Message-Id: <8bc979434fce977c30d8516dc82789d4fe317696.1608734455.git.sarna@scylladb.com>
2020-12-24 11:37:29 +02:00
Piotr Sarna
1577e6f632 test: add cases for using timeout with batches
The test suite for USING TIMEOUT already included SELECT,
INSERT and UPDATE statements, but missed batches. The suite
is now updated to include batch tests.

Tests: unit(dev)
Message-Id: <a6738d2ed3d62681615523d01109362766c90325.1608734455.git.sarna@scylladb.com>
2020-12-24 11:37:29 +02:00
Piotr Sarna
4eb41b7d56 test: use random keys in tests for USING TIMEOUT
Since the tables are written to and it's possible to run
mutliple test cases concurrently, the cases now use pseudorandom
keys instead of hardcoded values.
Message-Id: <d864dbb096360c17cdc2ebd8e79bfd983c19910e.1608734455.git.sarna@scylladb.com>
2020-12-24 11:37:29 +02:00
Avi Kivity
0bbd78037f Update seastar submodule
* seastar 2bd8c8d088...1f5e3d3419 (5):
  > Merge "Avoid fair-queue rovers overflow if not configured" from Pavel E
  > doc: add a coroutines section to the tutorial
  > Merge "tests/perf: add random-seed config option" from Benny
  > iotune: Print parameters affecting the measurement results
  > cook: Add patch cmd for ragel build (signed char confusion on aarch64)
2020-12-24 11:37:29 +02:00
Piotr Sarna
3b26fc01c2 alternator: coroutinize untagging a resource
Historically, a seastar thread was used for this request
because it's not on a critical path, but a coroutine makes
the code simpler.
2020-12-23 15:53:57 +01:00
Piotr Sarna
1ca39cc8c1 alternator: coroutinize tagging a resource
Historically, a seastar thread was used for this request
because it's not on a critical path, but a coroutine makes
the code simpler.
2020-12-23 15:53:57 +01:00
Pavel Solodovnikov
3a91f1127d idl: move enum and class serializer code writers to the corresponding AST classes
Expand the role of AST classes to also supply methods for actually
generating the code. More changes will follow eventually until
all generation code is handled by these classes.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-22 23:23:12 +03:00
dgarcia360
fd5f0c3034 docs: add organization
Closes #7818
2020-12-22 15:33:31 +02:00
Pekka Enberg
ceb67e7728 dist/docker: Remove 'epel-release' from Docker image
We no longer need the 'epel-release' package for anything as our
scylla-server package bundles all the necessary dependencies.

Closes #7823
2020-12-22 14:55:17 +02:00
Avi Kivity
e2dfa24540 Merge "token_metadata: add clear_gently" from Benny
"
We've encountered a number of reactor stalls
related to token_metadata that were fixed
in 052a8d036d.

This is a follow-up series that adds a clear_gently
method to token_metadata that uses continuations
to prevent reactor stalls when destroying token_metadata
objects.

Test: unit(dev), {network_topology_strategy,storage_proxy}_test(debug)
"

* tag 'token_metadata_clear_gently-v3' of github.com:bhalevy/scylla:
  token_metadata: add clear_gently
  token_metadata: shared_token_metadata: add mutate_token_metadata
  token_metdata: futurize update_normal_tokens
  abstract_replication_strategy: get_pending_address_ranges: invoke clone_only_token_map if can_yield
  repair: replace_with_repair: convert to coroutine
2020-12-22 13:23:31 +02:00
Nadav Har'El
f2978e1873 cql-pytest: port Cassandra's collection_test.py
A previous patch added test/cql-pytest/cassandra_tests - a framework for
porting Cassandra's unit tests to Python - but only ported two tiny test
files with just 3 tests.  In this patch, we finally port a much larger
test file validation/entities/collection_test.java. This file includes
50 separate tests, which cover a lot of aspects of collection support,
as well as how other stuff interact with collections.

As of now, 23 (!) of these 50 tests fail, and exposed six new issues
in Scylla which I carefully documented:

Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Refs #7740: CQL prepared statements incomplete support for "unset" values
Refs #7743: Restrictions missing support for "IN" on tables with
            collections, added in Cassandra 4.0
Refs #7745: Length of map keys and set items are incorrectly limited to 64K
            in unprepared CQL
Refs #7747: Handling of multiple list updates in a single request differs
            from recent Cassandra
Refs #7751: Allow selecting map values and set elements, like in
            Cassandra 4.0

These issues vary in severity - some are simply new Cassandra 4.0 features
that Scylla never implemented, but one (#7740) is an old Cassandra 2.2
feature which it seems we did not implement correctly in some cases that
involve collections.

Note that there are some things that the ported tests do not include.
In a handful of places there are things which the Python driver checks,
before sending a request - not giving us an opportunity to check how
the server handles such errors. Another notable change in this port is
that the original tests repeated a lot of tests with and without a
"nodetool flush". In this port I chose to stub the flush() function -
it does NOT flush. I think the point of these tests is to check the
correctness of the CQL features - *not* to verify that memtable flush
works correctly. Doing a real memtable flush is not only slow, it also
doesn't really check much (Scylla may still serve data from cache,
not sstables). So I decided it is pointless.

An important goal of this patch is that all 50 tests (except three
skipped tests because Python has client-side checking), pass when
run on Cassandra (with test/cql-pytest/run-cassandra). This is very
important: It was very easy to make mistakes while porting the tests,
and I did make many such mistakes; But running the against Cassandra
allowed me to fix those mistakes - because the correct tests should
pass on Cassandra. And now they do.

Unfortunately, the new tests are significantly slower than what we've
been accustomed in Alternator/CQL tests. The 50 tests create more than a
hundred tables, udfs, udts, and similar slow operations - they do not
reuse anything via fixtures. The total time for these 50 tests (in dev
build mode) is around 18 seconds. Just one test - testMapWithLargePartition
is responsibe for almost half (!) of that time - we should consider in
the future whether it's worth it or can be made smaller.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201215155802.2867386-1-nyh@scylladb.com>
2020-12-22 13:22:09 +02:00
Avi Kivity
5a33ce58a7 Update seastar submodule
* seastar 3b8903d406...2bd8c8d088 (8):
  > core: remove unused chrono.h reference
  > cmake: force cxx standard if dialect is specified
  > queue: add front()
  > coroutine: deprecate coroutine forwarding
  > memory: Use 2^n sizes when searching for preferred span size
  > shared_ptr: define debug_shared_ptr_counter_type constructor as noexcept
  > install-dependencies: add pkg-config to Debian/Ubuntu packages
  > log: do_log: prevent garbling due to context switch
2020-12-22 13:22:09 +02:00
Benny Halevy
322aa2f8b5 token_metadata: add clear_gently
clear_gently gently clears the token_metadata members.
It uses continuations to allow yielding if needed
to prevent reactor stalls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 11:22:21 +02:00
Benny Halevy
56aa49ca81 token_metadata: shared_token_metadata: add mutate_token_metadata
mutate_token_metadata acquires the shared_token_metadata lock,
clones the token_metadata (using clone_async)
and calls an asynchronous functor on
the cloned copy of the token_metadata to mutate it.

If the functor is successful, the mutated clone
is set back to to the shared_token_metadata,
otherwise, the clone is destroyed.

With that, get rid of shared_token_metadata::clone

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 11:22:19 +02:00
Benny Halevy
e089c22ec1 token_metdata: futurize update_normal_tokens
The function complexity if O(#tokens) in the worst case
as for each endpoint token to traverses _token_to_endpoint_map
lineraly to erase the endpoint mapping if it exists.

This change renames the current implementation of
update_normal_tokens to update_normal_tokens_sync
and clones the code as a coroutine that returns a future
and may yield if needed.

Eventually we should futurize the whole token_metadata
and abstract_replication_strategy interface and get rid
of the synchronous functions.  Until then the sync
version is still required from call sites that
are neither returning a future nor run in a seastar thread.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 10:35:15 +02:00
Benny Halevy
e7f4cd89a9 abstract_replication_strategy: get_pending_address_ranges: invoke clone_only_token_map if can_yield
Optimize the can_yield case by invoking the futurized version
of clone_only_token_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 09:49:08 +02:00
Benny Halevy
55316df6bf repair: replace_with_repair: convert to coroutine
Prepare to futurizing update_normal_tokens.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 09:49:08 +02:00
Piotr Sarna
da7e87dc56 test: add cases for using timeout with bind markers
The test suite for USING TIMEOUT already included binding
the timeout value, but only for wildcard (?). The test case
is now extended with named bind markers.

Tests: unit(dev)
Message-Id: <b5344f40d26d90b36e90a04c2474127728535eaa.1608573624.git.sarna@scylladb.com>
2020-12-22 09:03:56 +02:00
Pekka Enberg
961b9e8390 install.sh: Add seastar-cpu-map.sh to $PATH
Add the seastar-cpu-map.sh to the SBINFILES variable, which is used to
create symbolic links to scripts so that they appear in $PATH.

Please note that there are additional Python scripts (like perftune.py),
which are not in $PATH. That's because Python scripts are handled
separately in "install.sh" and no Python script has a "sbin" symlink. We
might want to change this in the future, though.

Fixes #6731

Closes #7809
2020-12-21 14:12:27 +02:00
Avi Kivity
0f7b6dd180 utils: managed_bytes: introduce with_linearized()
This is a temporary scaffold for weaning ourselves off
linearization. It differs from with_linearized_managed_bytes in
that it does not rely on the environment (linearization_context)
and so is easier to remove.
2020-12-20 15:14:44 +01:00
Avi Kivity
c37e495958 utils: managed_bytes: constrain with_linearized_managed_bytes()
The passed function must be called with a no parameters; document
and enforce it.
2020-12-20 15:14:44 +01:00
Avi Kivity
a1df1b3c34 utils: managed_bytes: avoid internal uses of managed_bytes::data()
We use managed_bytes::data() in a few places when we know the
data is non-fragmented (such as when the small buffer optimization
is in use). We'd like to remove managed_bytes::data() as linearization
is bad, so in preparation for that, replace internal uses of data()
with the equivalent direct access.
2020-12-20 15:14:44 +01:00
Avi Kivity
72a2554a86 utils: managed_bytes: extract do_linearize_pure()
do_linearize() is an impure function as it changes state
in linearization_context. Extract the pure parts into a new
do_linearize_pure(). This will be used to linearize managed_bytes
without a linearization_context, during the transition period where
fragmented and non-fragmented values coexist.
2020-12-20 15:14:44 +01:00
Avi Kivity
4b3f0fd7c0 thrift: do not depend on implicit conversion of keys to bytes_view
This implicit conversion will soon be gone, as it is dangerous.
Ask for the representation explicitly.
2020-12-20 15:14:44 +01:00
Avi Kivity
8521248955 clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view
This implicit conversion will soon be gone, as it is dangerous.
Ask for the representation explicitly.
2020-12-20 15:14:44 +01:00
Avi Kivity
1dd6d7029a cql3: expression: linearize get_value_from_mutation() eariler
do_get_value() is careful to return a fragmented view, but
its only caller get_value_from_mutation() linearizes it immediately
afterwards. Linearize it sooner; this prevents mixing in
fragmented values from cells (now via IMR) and fragmented values
from partition/clustering keys. It only works now because
keys are not fragmented outside LSA, and value_view has a special
case for single-fragment values.

This helps when keys become fragmented.
2020-12-20 15:14:44 +01:00
Avi Kivity
b59a21967c bytes: add to_bytes(bytes)
Converting from bytes to bytes is nonsensical, but it helps
when transitioning to other types (managed_bytes/managed_bytes_view),
and these types will have to_bytes() conversions.
2020-12-20 15:14:44 +01:00
Avi Kivity
28126257c2 cql3: expression: mark do_get_value() as static
It is used only later in this file.
2020-12-20 15:14:44 +01:00
Avi Kivity
b3e39d81aa Merge 'Avoid scanning sstables in parallel for TWCS single-partition queries' from Kamil Braun
We introduce a new single-key sstable reader for sstables created by `TimeWindowCompactionStrategy`.

The reader uses the fact that sstables created by TWCS are mostly disjoint with respect to the contained `position_in_partition`s in order to avoid having multiple sstable readers opened at the same time unnecessarily. In case there are overlapping ranges (for example, in the current time-window), it performs the necessary merging (it uses `clustering_order_reader_merger`, introduced recently).

The reader uses min/max clustering key metadata present in `md` sstables in order to decide when to open or close a sstable reader.

The following experiment was performed:
1. create a TWCS table with 1 minute windows
2. fill the table with 8 equal windows of data
   (each window flushed to a separate sstable)
3. perform `select * from ks.t where pk = 0 limit 1` query
   with and without the change

The expectation is that with the commit, only one sstable will be opened
to fetch that one row; without the commit all 8 sstables would be opened at once.
The difference in the value of `scylla_reactor_aio_bytes_read` was measured
(value after the query minus value before the query), both with and without the commit.

With the commit, the difference was 67584.
Without the commit, the difference was 528384.
528384 / 67584 ~= 7.8.

Fixes #6418.

Closes #7437

* github.com:scylladb/scylla:
  sstables: gather clustering key filtering statistics in TWCS single key reader
  sstables: use time_series_sstable_set in time_window_compaction_strategy
  sstable_set: new reader for TWCS single partition queries
  mutation_reader_test: test clustering_order_reader_merger with time_series_sstable_set
  sstable_set: introduce min_position_reader_queue
  sstable_set: introduce time_series_sstable_set
  sstables: add min_position and max_position accessors
  sstable_set: make create_single_key_sstable_reader a virtual method
  clustering_order_reader_merger: fix the 0 readers case
2020-12-19 23:53:18 +02:00
Kamil Braun
53414558a1 sstables: gather clustering key filtering statistics in TWCS single key reader 2020-12-18 16:33:27 +01:00
Kamil Braun
4f2d45001c sstables: use time_series_sstable_set in time_window_compaction_strategy
The following experiment was performed:
1. create a TWCS table with 1 minute windows
2. fill the table with 8 windows of data
   (each window flushed to a separate sstable)
3. perform `select * from ks.t where pk = 0 limit 1` query
   with and without the change

The expectation is that with the commit, only one sstable will be opened
to fetch that one row; without the commit all 8 sstables would be opened at once.
The difference in the value of `scylla_reactor_aio_bytes_read` was measured
(value after the query minus value before the query), both with and without the commit.

With the commit, the difference was 67584.
Without the commit, the difference was 528384.
528384 / 67584 ~= 7.8.

Fixes https://github.com/scylladb/scylla/issues/6418.
2020-12-18 16:33:27 +01:00
Kamil Braun
f0842ba34e sstable_set: new reader for TWCS single partition queries
This commit introduces a new implementation of `create_single_key_sstable_reader`
in `time_series_sstable_set` dedicated for TWCS-created sstables.

It uses the fact that such sstables are mostly disjoint with respect to
contained `position_in_partition`s in order to decrease the number of
sstable readers that are opened at the same time.

The implementation uses `clustering_order_reader_merger` under the hood.

The reader assumes that the schema does not have static columns and none
of the queried sstable contain partition tombstones; also, it assumes
that the sstables have the min/max clustering key metadata in order for
the implementation to be efficient. Thus, if we detect that some of
these assumptions aren't true, we fall back to the old implementation.
2020-12-18 16:33:27 +01:00
Kamil Braun
b41139a07f mutation_reader_test: test clustering_order_reader_merger with time_series_sstable_set 2020-12-18 16:33:27 +01:00
Kamil Braun
d0548aa77f sstable_set: introduce min_position_reader_queue
This is a queue of readers of sstables in a time_series_sstable_set,
returning the readers in order of the smallest position_in_partition
that the sstables have. It uses the min/max clustering key sstable
metadata.

The readers are opened lazily, at the moment of being returned.
2020-12-18 16:33:27 +01:00
Kamil Braun
52697022b0 sstable_set: introduce time_series_sstable_set
At this moment it is a slightly less efficient version of
bag_sstable_set, but in following commits we will use the new data
structures to gain advantage in single partition queries
for sstables created by TimeWindowCompactionStrategy.
2020-12-18 16:33:27 +01:00
Kamil Braun
2a160dd909 sstables: add min_position and max_position accessors
The methods return a lower-bound and an upper-bound for the
position-in-partitions appearing in a given sstable.
2020-12-18 16:33:27 +01:00
Kamil Braun
fe26da82ba sstable_set: make create_single_key_sstable_reader a virtual method
... of sstable_set_impl.

Soon we shall provide a specialized implementation in one of the
`sstable_set_impl` derived classes.

The existing implementation is used as the default one.
2020-12-18 12:31:16 +01:00
Kamil Braun
5e846b33b8 clustering_order_reader_merger: fix the 0 readers case
With 0 readers the merger would produce a `partition_end` fragment
when it should immediately return `end_of_stream` instead.
2020-12-18 12:30:40 +01:00
Gleb Natapov
85cffd1aeb lwt: rewrite storage_proxy::cas using coroutings
Makes code much simpler to understand.

Message-Id: <20201201160213.GW1655743@scylladb.com>
2020-12-17 18:15:35 +01:00
Avi Kivity
a60c81b615 Merge 'cql3: Fix handling of impossible restrictions on a primary-key column' from Dejan Mircevski
There were two problems with handling conflicting equalities on the same PK column (eg, c=1 AND c=0):
1. When the column is indexed, Scylla crashed (#7772)
2. Computing ranges and slices was throwing an exception

This series fixes them both; it also happens to resolve some old TODOs from restriction_test.

Tests: unit (dev, debug)

Closes #7804

* github.com:scylladb/scylla:
  cql3: Fix value_for when restriction is impossible
  cql3: Fix range computation for p=1 AND p=1
2020-12-17 12:01:36 +02:00
Dejan Mircevski
46b4b59945 cql3: Fix value_for when restriction is impossible
Previously, single_column_restrictions::value_for() assumed that a
column's restriction specifies exactly one value for the column.  But
since 37ebe521e3, multiple equalities on the same column are allowed,
so the restriction could be a conjunction of conflicting
equalities (eg, c=1 AND c=0).  That violates an assert and crashes
Scylla.

This patch fixes value_for() by gracefully handling the
impossible-restriction case.

Fixes #7772

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-12-16 15:00:29 -05:00
Dejan Mircevski
4bb1107652 cql3: Fix range computation for p=1 AND p=1
Previously compute_bounds was assuming that primary-key columns are
restricted by exactly one equality, resulting in the following error:

query 'select p from t where p=1 and p=1' failed:
 std::bad_variant_access (std::get: wrong index for variant)

This patch removes that assumption and deals correctly with the
multiple-equalities case.  As a byproduct, it also stops raising
"invalid null value" exceptions for null RHS values.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-12-16 14:46:48 -05:00
Pavel Solodovnikov
edf9ccee48 idl: extract writer functions for write, read and skip impls for classes and enums
Split `write`, `read` and `skip` serializer function writers to
separate functions in `handle_class` and `handle_enum` functions,
which slightly improves readability.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 20:33:55 +03:00
Pavel Solodovnikov
8049cb0f91 idl: minor fixes and code simplification
* Introduce `ns_qualified_name` and `template_params_str` functions
  to simplify code a little bit in `handle_enum` and `handle_class`
  functions.
* Previously each serializer had a separate namespace open-close
  statements, unify them into a single namespace scope.
* Fix a few more `hout` -> `cout` argument names.
* Rename `template` pattern to `template_decl` to improve clarity.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:32:08 +03:00
Pavel Solodovnikov
0de96426db idl: change argument name from hout to cout in all dependencies of add_visitors fn
Prior to the patch all functions that are called from `add_visitors`
and this function itself declared the argument denoting the output
file as `hout`. Though, this was quite misleading since `hout`
is meant to be header file with declarations, while `cout` is an
implementation file.

These functions write to implmentation file hence `hout` should
be changed to `cout` to avoid confusion.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:32:03 +03:00
Pavel Solodovnikov
0defb52855 idl: fix parsing of basic types and discard unneeded terminals
Prior to the patch `btype` production was using `with_colon`
rule, which accidentally supported parsing both numbers and
identifiers (along with other invalid inputs, such as "123asd").

It was changed to use `ns_qualified_ident` and those places
which can accept numeric constants, are explicitly listing
it as an alternative, e.g. template parameter list.

Unfortunately, I had to make TemplateType to explicitly construct
`BasicType` instances from numeric constants in template arguments
list. This is exactly the way it was handled before, though.

But nonetheless, this should be addressed sometime later.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:31:57 +03:00
Pavel Solodovnikov
0cc87ead3d idl: remove unused functions
Remove the following functions since they are not used:
* `open_namespaces`
* `close_namespaces`
* `flat_template`

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:31:51 +03:00
Pavel Solodovnikov
bea965a0a7 idl: improve error tracing in the grammar and tighten-up some grammar rules
This patch replaces use of some handwritten rules to use their
alternatives already defined in `pyparsing.pyparsing_common`
class, i.e.: `number`, `identifier` productions.

Changed ignore patterns for comments to use pre-defined
`pp.cppStyleComment` instead of hand-written combination of
'//'-style and C-style comment rules.

Operator '-' is now used whenever possible to improve debugging
experience: it disables default backtracking for productions
so that compiler fails earlier and can now point more precisely
to a place in the input string where it failed instead of
backtracking to the top-level rule and reporting error there.

Template names and class names now use `ns_qualified_ident`
rule instead of `with_colon` which prevents grammar from
matching invalid identifiers, such as `std:vector`.

Many places are using the updated `identifier` production, which
is working correctly unlike its predecessor: now inputs
such as `1ident` are considered invalid.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:31:46 +03:00
Pavel Solodovnikov
3a037bc5b6 idl: remove redundant set_namespace function
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:31:40 +03:00
Pavel Solodovnikov
e76e8aec0e idl: remove unused declare_class function
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:31:35 +03:00
Pavel Solodovnikov
745f4ac23b idl: slightly change str and repr for AST types
Surround string representation with angle brackets. This improves
readability when printing debug output.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:31:20 +03:00
Pavel Solodovnikov
4a61270701 idl: place directly executed init code into if __name__=="__main__"
Since idl compiler is not intended to be used as a module to other
python build scripts, move initialization code under an if checking
that current module name is "__main__".

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:30:33 +03:00
Gleb Natapov
37368726c9 migration_manager: remove unused announce() variant
Message-Id: <20201216153150.GG3244976@scylladb.com>
2020-12-16 18:14:07 +02:00
Konstantin Osipov
2c46938c2a commitlog: avoid a syscall in a most common case of segment recycle
When recycling a segment in O_DSYNC mode if the size of the segment
is neither shrunk nor grown, avoid calling file::truncate() or
file::allocate().

Message-Id: <20201215182332.1017339-2-kostja@scylladb.com>
2020-12-16 14:57:36 +02:00
Avi Kivity
fdb47c954d Merge "idl: allow IDL compiler to parse const specifiers for template arguments" from Pavel S
"
This patch series consists of the following patches:

1. The first one turned out to be a massive rewrite of almost
everything in `idl-compiler.py`. It aims to decouple parser
structures from the internal representation which is used
in the code-generation itself.

Prior to the patch everything was working with raw token lists and
the code was extremely fragile and hard to understand and modify.

Moreover, every change in the parser code caused a cascade effect
of breaking things at many different places, since they were relying
on the exact format of output produced by parsing rules.

Now there is a bunch of supplementary AST structures which provide
hierarchical and strongly typed structure as the output of parsing
routine.
It is much easier to verify (by the means of `isinstance`, for example)
and extend since the internal structures used in code-generation are
decoupled from the structure of parsing rules, which are now controlled
by custom parse actions providing high-level abstractions.

It is tested manually by checking that the old code produces exactly
the same autogenerated sources for all Scylla IDLs as the new one.

2 and 3. Cosmetics changes only: fixed a few typos and moved from
old-fashioned `string.Template` to python f-strings.

This improves readability of the idl-compiler code by a lot.

Only one non-functional whitespace change introduced.

4. This patch adds a very basic support for the parser to
understand `const` specifier in case it's used with a template
parameter for a data member in a class, e.g.

    struct my_struct {
        std::vector<const raft::log_entry> entries;
    };

It actually does two things:
* Adjusts `static_asserts` in corresponding serializer methods
  to match const-ness of fields.
* Defines a second serializer specialization for const type in
  `.dist.hh` right next to non-const one.

This seems to be sufficient for raft-related uses for now.
Please note there is no support for the following cases, though:

    const std::vector<raft::log_entry> entries;
    const raft::term_t term;

None of the existing IDLs are affected by the change, so that
we can gradually improve on the feature and write the idl
unit-tests to increase test coverage with time.

5. A basic unit-test that writes a test struct with an
`std::vector<S<const T>>` field and reads it back to verify
that serialization works correctly.

6. Basic documentation for AST classes.
TODO: should also update the docs in `docs/IDL.md`. But it is already
quite outdated, and some changes would even be out of scope for this
patch set.
"

* 'idl-compiler-refactor-v5' of https://github.com/ManManson/scylla:
  idl: add docstrings for AST classes
  idl: add unit-test for `const` specifiers feature
  idl: allow to parse `const` specifiers for template arguments
  idl: fix a few typos in idl-compiler
  idl: switch from `string.Template` to python f-strings and format string in idl-compiler
  idl: Decouple idl-compiler data structures from grammar structure
2020-12-16 14:05:33 +02:00
Gleb Natapov
61520a33d6 mutation_writer: pass exceptions through feed_writer
feed_writer() eats exception and transforms it into an end of stream
instead. Downstream validators hate when this happens.

Fixes #7482
Message-Id: <20201216090038.GB3244976@scylladb.com>
2020-12-16 13:18:19 +02:00
Pavel Solodovnikov
8b8dce15c3 idl: add docstrings for AST classes
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 09:03:39 +03:00
Botond Dénes
978ec7a4bb tools: introduce scylla-sstable-index
A tool which lists all partitions contained in an sstable index. As all
partitions in an sstable are indexed, this tool can be used to find out
what partitions are contained in a given sstable.

The printout has the following format:
$pos: $human_readable_value (pk{$raw_hex_value})

Where:
* $pos: the position of the partition in the (decompressed) data file
* $human_readable_value: the human readable partition key
* $raw_hex_value: the raw hexadecimal value of the binary representation
  of the partition key

For now the tool requires the types making up the partition key to be
specified on the command line, using the `--type|-t` command line
argument, using the Cassandra type class name notation for types.
As these are not assumed to be widely known, this patch includes a
document mapping all cql3 types to their Cassandra type class name
equivalent (but not just).

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201208092323.101349-1-bdenes@scylladb.com>
2020-12-15 18:46:47 +02:00
Calle Wilund
71c5dc82df database: Verify iff we actually are writing memtables to disk in truncate
Fixes #7732

When truncating with auto_snapshot on, we try to verify the low rp mark
from the CF against the sstables discarded by the truncation timestamp.
However, in a scenario like:

Fill memtables
Flush
Truncate with snapshot A
Fill memtables some more
Truncate
Move snapshot A to upload + refresh (load old tables)
Truncate

The last op will assert, because while we have sstables loaded, which
will be discarded now, we did not in fact generate any _new_ ones
(since memtables are empty), and the RP we get back from discard is
one from an earlier generation set.

(Any permutation of events that create the situation "empty memtable" +
"non-empty sstables with only old tables" will generate the same error).

Added a check that before flushing checks if we actually have any
data, and if not, does not uphold the RP relation assert.

Closes #7799
2020-12-15 16:24:36 +02:00
Avi Kivity
7636799b18 Merge 'Add waiting for flushes on table drops' from Piotr Sarna
This series makes sure that before the table is dropped, all pending memtable flushes related to its memtables would finish.
Normally, flushes are not problematic in Scylla, because all tables are by default `auto_snapshot=true`, which also implies that a table is flushed before being dropped. However, with `auto_snapshot=false` the flush is not attempted at all. It leads to the following race:
1. Run a node with `auto_snapshot=false`
2. Schedule a memtable flush  (e.g. via nodetool)
3. Get preempted in the middle of the flush
4. Drop the table
5. The flush that already started wakes up and starts operating on freed memory, which causes a segfault

Tests: manual(artificially preempting for a long time in bullet point 2. to ensure that the race occurs; segfaults were 100% reproducible before the series and do not happen anymore after the series is applied)

Fixes #7792

Closes #7798

* github.com:scylladb/scylla:
  database: add flushes to waiting for pending operations
  table: unify waiting for pending operations
  database: add a phaser for flush operations
  database: add waiting for pending streams on table drop
2020-12-15 16:02:47 +02:00
Pavel Solodovnikov
1e6df841a5 idl: add unit-test for const specifiers feature
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-15 16:03:18 +03:00
Pavel Solodovnikov
facf27dbe4 idl: allow to parse const specifiers for template arguments
This patch introduces very limited support for declaring `const`
template parameters in data members.

It's not covering all the cases, e.g.
`const type member_variable` and `const template_def<T1, T2, ...>`
syntax is not supported at the moment.

Though the changes are enough for raft-related use: this makes it
possible to declare `std::vector<raft::log_entries_ptr>` (aka
`std::vector<lw_shared_ptr<const raft::log_entry>>`) in the IDL.

Existing IDL files are not affected in any way.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-15 16:03:11 +03:00
Pavel Solodovnikov
f02703fcd7 idl: fix a few typos in idl-compiler
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-15 16:02:55 +03:00
Pavel Solodovnikov
28b602833f idl: switch from string.Template to python f-strings and format string in idl-compiler
Move to a modern and lightweight syntax of f-strings
introduced in python 3.6. It improves readability and provides
greater flexibility.

A few places are now using format strings instead, though.

In case when multiline substitution variable is used, the template
string should be first re-indented and only after that the
formatting should be applied, or we can end up with screwed
indentation the in generated sources.

This change introduces one invisible whitespace change
in `query.dist.impl.hh`, otherwise all generated code is exactly
the same.

Tests: build(dev) and diff genetated IDL sources by hand

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-15 16:01:17 +03:00
Pavel Solodovnikov
4ab1f7f55d idl: Decouple idl-compiler data structures from grammar structure
Instead of operating on the raw lists of tokens, transform them into
typed structures representation, which makes the code by many orders of
magnitude simpler to read, understand and extend.

This includes sweeping changes throughout the whole source code of the
tool, because almost every function was tightly coupled to the way
data was passed down from the parser right to the code generation
routines.

Tested manually by checking that old generated sources are precisely
the same as the new generated sources.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-15 15:59:17 +03:00
Piotr Sarna
b1208d0fcc database: add flushes to waiting for pending operations
In order to prevent races with table drops, the helper function
which waits for all pending operations to finish now also
waits for pending flushes.
2020-12-15 13:11:33 +01:00
Piotr Sarna
cd1e351dc1 table: unify waiting for pending operations
In order to reduce code duplication which already caused a bug,
waiting for pending operations is now unified with a single helper
function.
2020-12-15 13:11:25 +01:00
Piotr Sarna
df3204426d database: add a phaser for flush operations
Pending flushes can participate in races when a table
with auto_snapshot==false is dropped. The race is as follows:
1. A flush of table T is initiated
2. The flush operation is preempted
3. Table T is dropped without flushing, because it has auto_snapshot off
4. The flush operation from (2.) wakes up and continues
   working on table T, which is already dropped
5. Segfault/memory corruption

To prevent such races, a phaser for pending flushes is introduced
2020-12-15 12:59:36 +01:00
Piotr Sarna
57d63ca036 database: add waiting for pending streams on table drop
We already wait for pending reads and writes, so for completeness
we should also wait for all pending stream operations to finish
before dropping the table to avoid inconsistencies.
2020-12-15 12:55:45 +01:00
Takuya ASADA
ebc4076fa5 tools: toolchain: add node_exporter
Download node_exporter in frozen image to prepare adding node_exporter
to relocatable pacakge.

Related #2190

Closes #7765

[avi: updated toolchain, x86_64/aarch64/s390x]
2020-12-14 20:34:17 +02:00
Piotr Sarna
13317f7698 alternator: ensure correct isolation level in tracing tests
Taking advantage of the fact that isolation level can be defined
for a table with a tag, the tracing test that relies on CAS
can now be sure to have a correct isolation level.
Message-Id: <43f005ab9d566c7d3d55ce93c553127b1df9e87f.1607954739.git.sarna@scylladb.com>
2020-12-14 17:37:55 +02:00
Piotr Sarna
7081e361cc test: add isolation level requirement message to tracing tests
Alternator tracing tests require the cluster to have the 'always'
isolation level configured to work properly. If that's not the case,
the tests will fail due to not having CAS-related traces present
in the logs. In order to help the users fix their configuration,
a helper message is printed before the test case is performed.
Automatic tests do not need this, because they are all ran with
matching isolation level, but this message could greatly improve
the user experience for manual tests.
Message-Id: <62bcbf60e674f57a55c9573852b6a28f99cbf408.1607949754.git.sarna@scylladb.com>
2020-12-14 14:53:58 +02:00
Piotr Sarna
4b0303d8ae tests: make alternator tracing tests idempotent
The outcome of alternator tracing tests was that tracing probability
was always set to 0 after the test was finished. That makes sense
for most test runs, but manual tests can work on existing clusters
with tracing probability set to some other value. Due to preserve
previous trace probability, the value is now extracted and stored,
so that it can be restored after the test is done.
Message-Id: <94f829b63f92847b4abb3b16f228bf9870f90c2e.1607949754.git.sarna@scylladb.com>
2020-12-14 14:53:23 +02:00
Avi Kivity
19ff528ef3 Update seastar submodule
* seastar 2de43eb6bf...3b8903d406 (3):
  > coroutines: check preemption flag in co_await
  > memory: consider span freelist objects in small pool diagnostics
  > util: noncopyable_function: avoid gcc uninitialized error in move constructor
2020-12-14 12:50:32 +02:00
Pekka Enberg
8d00c16feb transport/server: Code cleanups
Fix up some coding style issues spotted while reading the code:

- Fix indentation to be 4 spaces

- Remove superfluous semicolons

Closes #7793
2020-12-14 12:48:05 +02:00
Konstantin Osipov
b6c6cc275f commitlog: align input of dma_write() during segment recycle
Normally a file size should be aligned around block size, since
we never write to it any unaligned size. However, we're not
protected against partial writes.

Just to be safe, align up the amount of bytes to zerofill
when recycling a segment.

Message-Id: <20201211142628.608269-4-kostja@scylladb.com>
2020-12-14 12:16:18 +02:00
Konstantin Osipov
ad6817bcde commitlog: fix typo in a comment
Message-Id: <20201211142628.608269-2-kostja@scylladb.com>
2020-12-14 12:16:14 +02:00
Benny Halevy
0e79e0f215 test: mutation_diff: extend section markers
When the different mutations are printed via
BOOST_REQUIRE_EQUAL, we don't get the "expect {} but got {}"
section markers.  Instead, the parts we're interested in
are bracketed like "critical check X == Y has failed [{} != {}]"

Test: with both formats:
- https://github.com/scylladb/scylla/files/3890627/test_concurrent_reads_and_eviction.log
- https://github.com/scylladb/scylla/files/4303117/flat_mutation_reader_test.118.log
- https://github.com/scylladb/scylla/files/5687372/flat_mutation_reader_test.172.log.gz

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201214100521.3814909-1-bhalevy@scylladb.com>
2020-12-14 12:11:34 +02:00
Nadav Har'El
72cb3e9255 alternator test: add missing wait for update_table to finish
Three tests in test_streams.py run update_table() on a table without
waiting for it to complete, and then call update_table() on the same
table or delete it. This always works in Scylla, and usually works in
AWS, but if we reach the second call, it may fail because the previous
update_table() did not take effect yet. We sometimes see these failures
when running the Alternator test suite against AWS.

So in this patch, after an each update_table() we wait for the table
to return from UPDATING to ACTIVE status.

The entire Alternator test suite now passes (or skipped) on AWS,
so: Fixes #7778.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213164931.2767236-1-nyh@scylladb.com>
2020-12-14 09:18:38 +01:00
Nadav Har'El
43ce0aef3d alternator test: fix test wrongly failing on AWS
The test test_query_filter.py::test_query_filter_paging fails on AWS
and shouldn't fail, so this patch fixes the test. Note that this is
only a test problem - no fix is needed for Alternator itself.

The test reads 20 results with 1-result pages, and assumed that
21 pages are returned. The 21st page may happen because when the
server returns the 20th, it might not yet know there will be no
additional results, so another page is needed - and will be empty.
Still a different implementation might notice that the last page
completed the iteration, and not return an extra empty page. This is
perfectly fine, and this is what AWS DynamoDB does today - and should
not be considered an error.

Refs #7778

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213143612.2761943-1-nyh@scylladb.com>
2020-12-14 09:18:31 +01:00
Nadav Har'El
4ab98a4c68 alternator: use a more specific error when Authorization header is missing
When request signature checking is enabled in Alternator, each request
should come with the appropriate Authorization header. Most errors in
this preparing this header will result in an InvalidSignatureException
response; But DynamoDB returns a more specific error when this header is
completely missing: MissingAuthenticationTokenException. We should do the
same, but before this patch we return InvalidSignatureException also for
a missing header.

The test test_authorization.py::test_no_authorization_header used to
enshrine our wrong error message, and failed when run against AWS.
After this patch, we fix the error message and the test - which now
passes against both Alternator and AWS.

Refs #7778.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213133825.2759357-1-nyh@scylladb.com>
2020-12-14 09:18:24 +01:00
Avi Kivity
39afe14ad4 Merge 'Add per query timeout' from Piotr Sarna
This series allows setting per-query timeout via CQL. It's possible via the existing `USING` clause, which is extended to be available for `SELECT` statement as well. This parameter accepts a duration and can also be provided as a marker.
The parameter acts as a regular part of the `USING` clause, which means that it can be used along with `USING TIMESTAMP` and `USING TTL` without issues.
The series comes with a pytest test suite.

Examples:
```cql
        SELECT * FROM t USING TIMEOUT 200ms;
```
```cql
        INSERT INTO t(a,b,c) VALUES (1,2,3) USING TIMESTAMP 42 AND TIMEOUT 50ms;
```

Working with prepared statements works as usual - the timeout parameter can be
explicitly defined or provided as a marker:

```cql
        SELECT * FROM t USING TIMEOUT ?;
```
```cql
        INSERT INTO t(a,b,c) VALUES (?,?,?) USING TIMESTAMP 42 AND TIMEOUT 50ms;
```

Tests: unit(dev)
Fixes #7777

Closes #7781

* github.com:scylladb/scylla:
  test: add prepared statement tests to USING TIMEOUT suite
  docs: add an entry about USING TIMEOUT
  test: add a test suite for USING TIMEOUT
  storage_proxy: start propagating local timeouts as timeouts
  cql3: allow USING clause for SELECT statement
  cql3: add TIMEOUT attribute to the parser
  cql3: add per-query timeout to select statement
  cql3: add per-query timeout to batch statement
  cql3: add per-query timeout to modification statement
  cql3: add timeout to cql attributes
2020-12-14 09:46:46 +02:00
Piotr Sarna
d6e7e36280 test: add prepared statement tests to USING TIMEOUT suite 2020-12-14 07:50:40 +01:00
Piotr Sarna
da77ab832b docs: add an entry about USING TIMEOUT
The paragraph describes how USING TIMEOUT clause can be used
along with some simple examples.
2020-12-14 07:50:40 +01:00
Piotr Sarna
0148b41a02 test: add a test suite for USING TIMEOUT
The test suite is based on cql-pytest and checks if USING TIMEOUT
works as expected.
2020-12-14 07:50:40 +01:00
Piotr Sarna
27fba35832 storage_proxy: start propagating local timeouts as timeouts
A local timeout was previously propagated to the client as WriteFailure,
while there exists a more concrete error type for that: WriteTimeout.
2020-12-14 07:50:40 +01:00
Piotr Sarna
ddd9cb1b2a cql3: allow USING clause for SELECT statement
In order to be able to specify a timeout for SELECT statements,
it's now possible to use the USING clause with it.
2020-12-14 07:50:40 +01:00
Piotr Sarna
d3896a209b cql3: add TIMEOUT attribute to the parser
It's now possible to specify TIMEOUT as part of the USING clause.
2020-12-14 07:50:40 +01:00
Piotr Sarna
157be33b89 cql3: add per-query timeout to select statement
First of all, select statement is extended with an 'attrs' field,
which keeps the per-query attributes. Currently, only TIMEOUT
parameter is legal to use, since TIMESTAMP and TTL bear no meaning
for reads.

Secondly, if TIMEOUT attribute is set, it will be used as the effective
timeout for a particular query.
2020-12-14 07:50:40 +01:00
Piotr Sarna
20dedd0df7 cql3: add per-query timeout to batch statement
If TIMEOUT attribute is set, it will be used as the effective
timeout for a particular query.
2020-12-14 07:50:40 +01:00
Piotr Sarna
3c49b6bd88 cql3: add per-query timeout to modification statement
If TIMEOUT attribute is set, it will be used as the effective
timeout for a particular query.
2020-12-14 07:50:40 +01:00
Piotr Sarna
5bbd0b049b cql3: add timeout to cql attributes
This attribute will be used later to specify per-query timeout.
2020-12-14 07:50:40 +01:00
Benny Halevy
c60da2e90d cdc: remove _token_metadata from db_context
1. It's unused since cbe510d1b8
2. It's unsafe to keep a reference to token_metadata&
potentially across yield points.

The higher-level motivation is to make
storage_service::get_token_metadata() private so we
can control better how it's used.

For cdc, if the token_metadata is going to be needed
to the future, it'd be better get it from
db_context::_proxy.get_token_metadata_ptr().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201213162351.52224-2-bhalevy@scylladb.com>
2020-12-13 18:32:17 +02:00
Avi Kivity
0f967f911d Merge "storage_service: get_token_metadata_ptr to hold on to token_metadata" from Benny
"
This series fixes use-after-free via token_metadata&

We may currently get a token_metadata& via get_token_metadata() and
use it across yield points in a couple of sites:
- do_decommission_removenode_with_repair
- get_new_source_ranges

To fix that, get_token_metadata_ptr and hold on to it
across yielding.

Fixes #7790

Dtest: update_cluster_layout_tests:TestUpdateClusterLayout.simple_removenode_2_test(debug)
Test: unit(dev)
"

* tag 'storage_service-token_metadata_ptr-v2' of github.com:bhalevy/scylla:
  storage_service: get_new_source_ranges: don't hold token_metadata& across yield point
  storage_service: get_changed_ranges_for_leaving: no need to maybe_yield for each token_range
  storage_service: get_changed_ranges_for_leaving: release token_metadata_ptr sooner
  storage_service: get_changed_ranges_for_leaving: don't hold token_metadata& across yield
2020-12-13 17:37:24 +02:00
Aleksandr Bykov
e74dc311e7 dist: scylla_util: fix aws_instance.ebs_disks method
aws_instance.ebs_disks() method should return ebs disk
instead of ephemeral

Signed-off-by: Aleksandr Bykov <alex.bykov@scylladb.com>

Closes #7780
2020-12-13 17:33:37 +02:00
Benny Halevy
1fbc831dae storage_service: get_new_source_ranges: don't hold token_metadata& across yield point
Provide the token_metadata& to get_new_source_ranges by the caller,
who keeps it valid throughout the call.

Note that there is no need to clone_only_token_map
since the token_metadata_ptr is immutable and can be
used just as well for calling strat.get_range_addresses.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-13 16:42:00 +02:00
Benny Halevy
f13913d251 storage_service: get_changed_ranges_for_leaving: no need to maybe_yield for each token_range
Now that we pass can_yield::yes to calculate_natural_endpoints
for each token_range.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-13 16:42:00 +02:00
Benny Halevy
89ed0705e8 storage_service: get_changed_ranges_for_leaving: release token_metadata_ptr sooner
No need to hold on to the shared token_metadata_ptr
after we got clone_after_all_left().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-13 16:42:00 +02:00
Benny Halevy
684c4143df storage_service: get_changed_ranges_for_leaving: don't hold token_metadata& across yield
When yielding in clone_only_token_map or clone_after_all_left
the token_metadata got with get_token_metadata() may go away.

Use get_token_metadata_ptr() instead to hold on to it.

And with that, we don't need to clone_only_token_map.
`metadata` is not modified by calculate_natural_endpoints, so we
can just refer to the immutable copy retrieved with
get_token_metadata_ptr.

Fixes #7790

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-13 16:41:58 +02:00
Avi Kivity
65a0244614 Update tools/jmx submodule
* tools/jmx 6174a47...20469bf (1):
  > column_family: Return proper cardinality for toppartitions requests
2020-12-13 13:51:38 +02:00
Avi Kivity
9265b87610 Merge "Remove get_local_storage_proxy from validation" from Pavel E
"
The validate_column_family() helper uses the global proxy
reference to get database from. Fortunatelly, all the callers
of it can provide one via argument.

tests: unit(dev)
"

* 'br-no-proxy-in-validate' of https://github.com/xemul/scylla:
  validation: Remove get_local_storage_proxy call
  client_state: Call validate_column_family() with database arg
  client_state: Add database& arg to has_column_family_access
  storage_proxy: Add .local_db() getters
  validate: Mark database argument const
2020-12-13 13:12:57 +02:00
Avi Kivity
19aaf8eb83 Merge "Remove global storage service from index manager" from Pavel E
"
The initial intent was to remove call for global storage service from
secondary index manager's create_view_for_index(), but while fixing it
one of intermediate schema table's helper managed to benefit from it
by re-using the database reference flying by.

The cleanup is done by simply pushing the database reference along the
stack from the code that already has it down the create_view_for_index().

tests: unit(dev)
"

* 'br-no-storages-in-index-and-schema' of https://github.com/xemul/scylla:
  schema-tables: Use db from make_update_table_mutations in make_update_indices_mutations
  schema-tables: Add database argument to make_update_table_mutations
  schema-tables: Factor out calls getting database instance
  index-manager: Move feature evaluation one level up
2020-12-13 12:41:51 +02:00
Benny Halevy
aae3991246 repair: do_decommission_removenode_with_repair: don't deref ops when null
`ops` might be passed as a disengaged shared_ptr when called
from `decommission_with_repair`.

In this case we need to propagate to sync_data_using_repair a
disengaged std::optional<utils::UUID>.

Fixes #7788

DTest: update_cluster_layout_tests:TestUpdateClusterLayout.verify_latest_copy_decommission_node_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201213073743.331253-1-bhalevy@scylladb.com>
2020-12-13 12:37:18 +02:00
Avi Kivity
18be57a4e5 Update seastar submodule
* seastar 8b400c7b45...2de43eb6bf (3):
  > core: show span free sizes correctly in diagnostics
  > Merge "IO queues to share capacities" from Pavel E
  > file: make_file_impl: determine blockdev using st_mode
2020-12-12 21:57:01 +02:00
Pekka Enberg
c990f2bd34 Merge 'Reinstate [[nodiscard]] support' from Avi Kivity
The switch to clang disabled the clang-specific -Wunused-value
since it generated some harmless warnings. Unfortunately, that also
prevent [[nodiscard]] violations from warning.

Fix by clearing all instances of the warning (including [[nodiscard]]
violations that crept in while it was disabled) and reinstating the warning.

Closes #7767

* github.com:scylladb/scylla:
  build: reinstate -Wunused-value warning for [[nodiscard]]
  test: lib: don't ignore future in compare_readers()
  test: mutation_test: check both ranges when comparing summaries
  serialializer: silence unused value warning in variant deserializer
2020-12-12 09:54:05 +02:00
Avi Kivity
615b8e8184 dist: rpm: uninstall tuned when installing scylla-kernel-conf
tuned 2.11.0-9 and later writes to kerned.sched_wakeup_granularity_ns
and other sysctl tunables that we so laboriously tuned, dropping
performance by a factor of 5 (due to increased latency). Fix by
obsoleting tuned during install (in effect, we are a better tuned,
at least for us).

Not needed for .deb, since debian/ubunto do not install tuned by
default.

Fixes #7696

Closes #7776
2020-12-12 09:54:05 +02:00
Pavel Emelyanov
3a025cfa52 schema-tables: Use db from make_update_table_mutations in make_update_indices_mutations
Two halves of the tunnel finally connect -- the
latter helper needs the local database instance and
is only called by the former one which already has it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 21:23:53 +03:00
Pavel Emelyanov
89fd524c5a schema-tables: Add database argument to make_update_table_mutations
There are 3 callers of this helper (cdc, migration manager and tests)
and all of them already have the database object at hands.

The argument will be used by next patch to remove call for global
storage proxy instance from make_update_indices_mutations.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 21:21:22 +03:00
Pavel Emelyanov
1bcef04c7a schema-tables: Factor out calls getting database instance
The make_update_indices_mutations gets database instance
for two things -- to find the cf to work with and to get
the value of a feature for index view creation.

To suit both and to remove calls for global storage proxy
and service instances get the database once in the
function entrance. Next patch will clean this further.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 21:17:11 +03:00
Pavel Emelyanov
6dd10e771d index-manager: Move feature evaluation one level up
The create_view_for_index needs to know the state of the
correct-idx-token-in-secondary-index feature. To get one
it takes quite a long route through global storage service
instance.

Since there's only one caller of the method in question,
and the method is called in a loop, it's a bit faster to
get the feature value in caller and pass it in argument.

This will also help to get rid of the call for global
storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 21:14:12 +03:00
Pavel Emelyanov
3a3ee45488 size_estimate_reader: Use local db reference not global
The get_next_partition uses global proxy instance to get
the local database reference. Now it's available in the
reader object itself, so it's possible to remove this
call for global storage proxy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 20:38:21 +03:00
Pavel Emelyanov
107dcbfbd6 size_estimate_reader: Keep database reference on mutation reader
This reader uses local databse instance in its get_next_partition
method to find keyspaces to work with

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 20:34:54 +03:00
Pavel Emelyanov
48e494fb62 size_estimate_reader: Keep database reference on virtual_reader
The database will be then used to create the mutation reader

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 20:31:35 +03:00
Pavel Emelyanov
83073f4e8b validation: Remove get_local_storage_proxy call
It is used in validate_column_family. The last caller of it was removed by
previous patch, so we may kill the helper itself

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:52:42 +03:00
Pavel Emelyanov
12cc539835 client_state: Call validate_column_family() with database arg
The previous patch brought the databse reference arg. And since
the currently called validate_column_family() overload _just_
gets the database from global proxy, it's better to shortcut.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:50:49 +03:00
Pavel Emelyanov
b0c4a9087d client_state: Add database& arg to has_column_family_access
It is called from cql3/statements' check_access methods and from thrift
handlers. The former have proxy argument from which they can get the
database. The latter already have the database itself on board.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:49:16 +03:00
Pavel Emelyanov
4c7bc8a3d1 storage_proxy: Add .local_db() getters
To facilitate the next patching

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:48:02 +03:00
Avi Kivity
a11ecfe231 Merge 'types: don't linearize in validate()' from Michał Chojnowski
A sequel to #7692.

This series gets rid of linearization when validating collections and tuple types. (Other types were already validated without linearizing).
The necessary helpers for reading from fragmented buffers were introduced in #7692. All this series does is put them to use in `validate()`.

Refs: #6138

Closes #7770

* github.com:scylladb/scylla:
  types: add single-fragment optimization in validate()
  utils: fragment_range: add with_simplified()
  cql3: statements: select_statement: remove unnecessary use of with_linearized
  cql3: maps: remove unnecessary use of with_linearized
  cql3: lists: remove unnecessary use of with_linearized
  cql3: tuples: remove unnecessary use of with_linearized
  cql3: sets: remove unnecessary use of with_linearized
  cql3: tuples: remove unnecessary use of with_linearized
  cql3: attributes: remove unnecessary uses of with_linearized
  types: validate lists without linearizing
  types: validate tuples without linearizing
  types: validate sets without linearizing
  types: validate maps without linearizing
  types: template abstract_type::validate on FragmentedView
  types: validate_visitor: transition from FragmentRange to FragmentedView
  utils: fragmented_temporary_buffer: add empty() to FragmentedView
  utils: fragmented_temporary_buffer: don't add to null pointer
2020-12-11 17:33:59 +02:00
Pavel Emelyanov
563b466227 validate: Mark database argument const
They are indeed used like that

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:27:45 +03:00
Michał Chojnowski
150473f074 types: add single-fragment optimization in validate()
Manipulating fragmented views is costlier that manipulating contiguous views,
so let's detect the common situation when the fragmented view is actually
contiguous underneath, and make use of that.

Note: this optimization is only useful for big types. For trivial types,
validation usually only checks the size of the view.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
e2d17879fc utils: fragment_range: add with_simplified()
Reading from contiguous memory (bytes_view) is significantly simpler
runtime-wise than reading from a fragmented view, due to less state and less
branching, so we often want to convert a fragmented view to a simple view before
processing it, if the fragmented view contains at most one fragment, which is
common. with_simplified() does just that.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
51ca5fa4c5 cql3: statements: select_statement: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
72186bee69 cql3: maps: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
3f3a10c588 cql3: lists: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
efa036329d cql3: tuples: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
4f359a7a99 cql3: sets: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
281417917b cql3: tuples: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
d1d1a00311 cql3: attributes: remove unnecessary uses of with_linearized
We can validate and deserialize directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
0581b3ff31 types: validate lists without linearizing
We can validate collections directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
4fe41b69fd types: validate tuples without linearizing
We can validate tuples directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
a7dd736d03 types: validate sets without linearizing
We can validate collections directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
1459608375 types: validate maps without linearizing
We can validate collections directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
82befbe8c0 types: template abstract_type::validate on FragmentedView
This is primarily a stylistic change. It makes the interface more consistent
with deserialize(). It will also allow us to call `validate()` for collection
elements in `validate_aux()`.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
15dbe00e8a types: validate_visitor: transition from FragmentRange to FragmentedView
This will allow us to easily get rid of linearizations when validating
collections and tuples, because the helpers used in validate_aux() already
have FragmentedView overloads.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
3647c0ba47 utils: fragmented_temporary_buffer: add empty() to FragmentedView
It's redundant with size_bytes(), but sometimes empty() is more readable and
reduces churn when replacing other types with FragmentedView.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
b4dd5d3bdb utils: fragmented_temporary_buffer: don't add to null pointer
When fragmented_temporary_buffer::view is created from a bytes_view,
_current is null. In that case, in remove_current(), null pointer offset
happens, and ubsan complains. Fix that.
2020-12-11 09:53:07 +01:00
Raphael S. Carvalho
e4b55f40f3 sstables: Fix sstable reshaping for STCS
The heuristic of STCS reshape is correct, and it built the compaction
descriptor correctly, but forgot to return it to the caller, so no
reshape was ever done on behalf of STCS even when the strategy
needed it.

Fixes #7774.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201209175044.1609102-1-raphaelsc@scylladb.com>
2020-12-10 12:45:25 +02:00
Asias He
829b4c1438 repair: Make removenode safe by default
Currently removenode works like below:

- The coordinator node advertises the node to be removed in
  REMOVING_TOKEN status in gossip

- Existing nodes learn the node in REMOVING_TOKEN status

- Existing nodes sync data for the range it owns

- Existing nodes send notification to the coordinator

- The coordinator node waits for notification and announce the node in
  REMOVED_TOKEN

Current problems:

- Existing nodes do not tell the coordinator if the data sync is ok or failed.

- The coordinator can not abort the removenode operation in case of error

- Failed removenode operation will make the node to be removed in
  REMOVING_TOKEN forever.

- The removenode runs in best effort mode which may cause data
  consistency issues.

  It means if a node that owns the range after the removenode
  operation is down during the operation, the removenode node operation
  will continue to succeed without requiring that node to perform data
  syncing. This can cause data consistency issues.

  For example, Five nodes in the cluster, RF = 3, for a range, n1, n2,
  n3 is the old replicas, n2 is being removed, after the removenode
  operation, the new replicas are n1, n5, n3. If n3 is down during the
  removenode operation, only n1 will be used to sync data with the new
  owner n5. This will break QUORUM read consistency if n1 happens to
  miss some writes.

Improvements in this patch:

- This patch makes the removenode safe by default.

We require all nodes in the cluster to participate in the removenode operation and
sync data if needed. We fail the removenode operation if any of them is down or
fails.

If the user want the removenode operation to succeed even if some of the nodes
are not available, the user has to explicitly pass a list of nodes that can be
skipped for the operation.

$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id>

Example restful api:

$ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5"

- The coordinator can abort data sync on existing nodes

For example, if one of the nodes fails to sync data. It makes no sense for
other nodes to continue to sync data because the whole operation will
fail anyway.

- The coordinator can decide which nodes to ignore and pass the decision
  to other nodes

Previously, there is no way for the coordinator to tell existing nodes
to run in strict mode or best effort mode. Users will have to modify
config file or run a restful api cmd on all the nodes to select strict
or best effort mode. With this patch, the cluster wide configuration is
eliminated.

Fixes #7359

Closes #7626
2020-12-10 10:14:39 +02:00
Piotr Sarna
20bdeb315a Merge ' types: add constraint on lexicographical_tri_compare()' from Avi Kivity
Verify that the input types are iterators and their value types are compatible
with the compare function.

Because some of the inputs were not actually valid iterators, they are adjusted
too.

Closes #7631

* github.com:scylladb/scylla:
  types: add constraint on lexicographical_tri_compare()
  composite: make composite::iterator a real input_iterator
  compound: make compount_type::iterator a real input_iterator
2020-12-09 18:48:01 +01:00
Nadav Har'El
a8fdbf31cd alternator: fix UpdateItem ADD for non-existent attribute
UpdateItem's "ADD" operation usually adds elements to an existing set
or adds a number to an existing counter. But it can *also* be used
to create a new set or counter (as if adding to an empty set or zero).

We unfortunately did not have a test for this case (creating a new set
or counter), and when I wrote such a test now, I discovered the
implementation was missing. So this patch adds both the test and the
implementation. The new test used to fail before this patch, and passes
with it - and passes on DynamoDB.

Note that we only had this bug for the newer UpdateItem syntax.
For the old AttributeUpdates syntax, we already support ADD actions
on missing attributes, and already tested it in test_update_item_add().
I just forgot to test the same thing for the newer syntax, so I missed
this bug :-(

Fixes #7763.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207085135.2551845-1-nyh@scylladb.com>
2020-12-09 18:44:30 +01:00
Juliusz Stasiewicz
b150906d39 gossip: Added SNITCH_NAME to application_state
Snitch name needs to be exchanged within cluster once, on shadow
round, so joining nodes cannot use wrong snitch. The snitch names
are compared on bootstrap and on normal node start.

If the cluster already used mixed snitches, the upgrade to this
version will fail. In this case customer needs to add a node with
correct snitch for every node with the wrong snitch, then put
down the nodes with the wrong snitch and only then do the upgrade.

Fixes #6832

Closes #7739
2020-12-09 15:45:25 +02:00
Nadav Har'El
781f9d9aca alternator: make default timeout configurable
Whereas in CQL the client can pass a timeout parameter to the server, in
the DynamoDB API there is no such feature; The server needs to choose
reasonable timeouts for its own internal operations - e.g., writes to disk,
querying other replicas, etc.

Until now, Alternator had a fixed timeout of 10 seconds for its
requests. This choice was reasonable - it is much higher than we expect
during normal operations, and still lower than the client-side timeouts
that some DynamoDB libraries have (boto3 has a one-minute timeout).
However, there's nothing holy about this number of 10 seconds, some
installations might want to change this default.

So this patch adds a configuration option, "--alternator-timeout-in-ms",
to choose this timeout. As before, it defaults to 10 seconds (10,000ms).

In particular, some test runs are unusually slow - consider for example
testing a debug build (which is already very slow) in an extremely
over-comitted test host. In some cases (see issue #7706) we noticed
the 10 second timeout was not enough. So in this patch we increase the
default timeout chosen in the "test/alternator/run" script to 30 seconds.

Please note that as the code is structured today, this timeout only
applies to some operations, such as GetItem, UpdateItem or Scan, but
does not apply to CreateTable, for example. This is a pre-existing
issue that this patch does not change.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207122758.2570332-1-nyh@scylladb.com>
2020-12-09 14:30:43 +01:00
Avi Kivity
f802356572 Revert "Revert "Merge "raft: fix replication if existing log on leader" from Gleb""
This reverts commit dc77d128e9. It was reverted
due to a strange and unexplained diff, which is now explained. The
HEAD on the working directory being pulled from was set back, so git
thought it was merging the intended commits, plus all the work that was
committed from HEAD to master. So it is safe to restore it.
2020-12-08 19:19:55 +02:00
Avi Kivity
1badd315ef Merge "Speed up devel tests 10 times" from Pavel E
"
The multishard_mutation_query test is toooo slow when built
with clang in dev mode. By reducing the number of scans it's
possible to shrink the full suite run time from half an hour
down to ~3 minutes.

tests: unit(dev)
"

* 'br-devel-mode-tests' of https://github.com/xemul/scylla:
  test: Make multishard_mutation_query test do less scans
  configure: Add -DDEVEL to dev build flags
2020-12-08 15:42:12 +02:00
Pavel Emelyanov
b837cf25b1 test: Make multishard_mutation_query test do less scans
When built by clang this dev-mode test takes ~30 minutes to
complete. Let's reduce this time by reducing the scale of
the test if DEVEL is set.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-08 15:55:04 +03:00
Pavel Emelyanov
703451311f configure: Add -DDEVEL to dev build flags
To let source code tell debug, dev and release builds
from each other.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-08 15:54:30 +03:00
Avi Kivity
461c9826de Merge 'scylla_setup: fix wrong command suggestion' from Takuya ASADA
scylla_setup command suggestion does not shows an argument of --io-setup,
because we mistakely stores bool value on it (recognized as 'store_true').
We always need to print '--io-setup X' on the suggestion instead.

Also, --nic is currently ignored by command suggestion, need to print it just like other options.

Related #7395

Closes #7724

* github.com:scylladb/scylla:
  scylla_setup: print --swap-directory and --swap-size on command suggestion
  scylla_setup: print --nic on command suggestion
  scylla_setup: fix wrong command suggestion on --io-setup scylla_setup command suggestion does not shows an argument of --io-setup, because we mistakely stores bool value on it (recognized as 'store_true'). We always need to print '--io-setup X' on the suggestion instead.
2020-12-08 13:58:55 +02:00
Avi Kivity
98271a5c57 Merge 'types: don't linearize in serialize_for_cql()' from Michał Chojnowski
A sequel to #7692.

This series gets rid of linearization in `serialize_for_cql`, which serializes collections and user types from `collection_mutation_view` to CQL. We switch from `bytes` to `bytes_ostream` as the intermediate buffer type.

The only user of of `serialize_for_cql` immediately copies the result to another `bytes_ostream`. We could avoid some copies and allocations by writing to the final `bytes_ostream` directly, but it's currently hidden behind a template.

Before this series, `serialize_for_cql_aux()` delegated the actual writing to `collection_type_impl::pack` and `tuple_type_impl::build_value`, by passing them an intermediate `vector`. After this patch, the writing is done directly in `serialize_for_cql_aux()`. Pros: we avoid the overhead of creating an intermediate vector, without bloating the source code (because creating that intermediate vector requires just as much code as serializing the values right away). Cons: we duplicate the CQL collection format knowledge contained in `collection_type_impl::pack` and `tuple_type_impl::build_value`.

Refs: #6138

Closes #7771

* github.com:scylladb/scylla:
  types: switch serialize_for_cql from bytes to bytes_ostream
  types: switch serialize_for_cql_aux from bytes to bytes_ostream
  types: serialize user types to bytes_ostream
  types: serialize lists to bytes_ostream
  types: serialize sets to bytes_ostream
  types: serialize maps to bytes_ostream
  utils: fragment_range: use range-based for loop instead of boost::for_each
  types: add write_collection_value() overload for bytes_ostream and value_view
2020-12-08 12:38:36 +02:00
Lubos Kosco
a0b1474bba scylla_util.py: Increase disk to ram ratio for GCP
Increase accepted disk-to-RAM ratio to 105 to accomodate even 7.5GB of
RAM for one NVMe log various reasons for not recommending the instance
type.

Fixes #7587

Closes #7600
2020-12-08 11:20:30 +02:00
Piotr Wojtczak
c09ab3b869 api: Add cardinality to toppartitions results
This change enhances the toppartitions api to also return
the cardinality of the read and write sample sets. It now uses
the size() method of space_saving_top_k class, counting the unique
operations in the sampled set for up to the given capacity.

Fixes #4089
Closes #7766
2020-12-08 09:38:59 +01:00
Nadav Har'El
86779664f4 alternator: fix broken Scan/Query paging with bytes keys
When an Alternator table has partition keys or sort keys of type "bytes"
(blobs), a Scan or Query which required paging used to fail - we used
an incorrect function to output LastEvaluatedKey (which tells the user
where to continue at the next page), and this incorrect function was
correct for strings and numbers - but NOT for bytes (for bytes, we
need to encode them as base-64).

This patch also includes two tests - for bytes partition key and
for bytes sort key - that failed before this patch and now pass.
The test test_fetch_from_system_tables also used to fail after a
Limit was added to it, because one of the tables it scans had a bytes
key. That test is also fixed by this patch.

Fixes #7768

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207175957.2585456-1-nyh@scylladb.com>
2020-12-08 09:38:23 +01:00
Eliran Sinvani
70770ff7fa debian pkg: Make deb packages explicitly depend on versioned components
Up until now, Scylla's debian packages dependencies versions were
unspecified. This was due to a technical difficulty to determine
the version of the dependent upon packages (such as scylla-python3
or scylla-jmx). Now, when those packages are also built as part of
this repo and are built with a version identical to the server package
itself we can depend all of our packages with explicit versions.
The motivation for this change is that if a user tries to install
a specific Scylla version by installing a specific meta package,
it will silently drag in the latest components instead of the ones
of the requested versions.
The expected change in behavior is that after this change an attempt
to install a metapackage with version which is not the latest will fail
with an explicit error hinting the user what other packages of the same
version should be explicitly included in the command line.

Fixes #5514

Closes #7727
2020-12-07 18:58:15 +02:00
Michał Chojnowski
d43fd456cd types: switch serialize_for_cql from bytes to bytes_ostream
Now we can serialize collections from collection_mutation_view_description
without linearizations.
2020-12-07 17:55:36 +01:00
Michał Chojnowski
81a55b032d types: switch serialize_for_cql_aux from bytes to bytes_ostream
We will switch serialize_for_cql itself to bytes_ostream soon.
2020-12-07 17:55:35 +01:00
Michał Chojnowski
71183cf0bd types: serialize user types to bytes_ostream
Avoids linearization by serializing to a fragmented type.
It's still linearized at the very end, this will be changed in the near future.
2020-12-07 17:52:06 +01:00
Michał Chojnowski
41b889d0c8 types: serialize lists to bytes_ostream
Avoids linearization by serializing to a fragmented type.
It's still linearized at the very end, this will be changed in the near future.
2020-12-07 17:49:21 +01:00
Michał Chojnowski
2b3d2c193d types: serialize sets to bytes_ostream
Avoids linearization by serializing to a fragmented type.
It's still linearized at the very end, this will be changed in the near future.
2020-12-07 17:47:49 +01:00
Michał Chojnowski
35823d12db types: serialize maps to bytes_ostream
Avoids linearization by serializing to a fragmented type.
It's still linearized at the very end, this will be changed in the near future.
2020-12-07 17:47:12 +01:00
Botond Dénes
ba7cf2f5fd tools/scylla-types: update name in description to use - instead of _
The executable was rename from using _ to using - to at one point but
apparently the description wasn't updated.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201207161626.79013-1-bdenes@scylladb.com>
2020-12-07 18:34:52 +02:00
Avi Kivity
7580a93ec8 build: reinstate -Wunused-value warning for [[nodiscard]]
The switch to clang disabled the clang-specific -Wunused-value
since it generated some harmless warnings. Unfortunately, that also
prevent [[nodiscard]] violations from warning.

Fix by reinstating the warning, now that all instances of the warning
have been fixed.
2020-12-07 16:51:19 +02:00
Avi Kivity
8fc0bbd487 test: lib: don't ignore future in compare_readers()
A fast_forward_to() call is not waited on in compare_readers(). Since
this is called in a thread, add a future::get() call to wait for it.
2020-12-07 16:50:20 +02:00
Avi Kivity
732d83dc0e test: mutation_test: check both ranges when comparing summaries
A copy/paste error means we ignore the termination of one of the
ranges. Change the comma expression to a disjunction to avoid
the unused value warning from clang.

The code is not perfect, since if the two ranges are not the same
size we'll invoke undefined behavior, but it is no worse than before
(where we ignored the comparison completely).
2020-12-07 16:47:52 +02:00
Avi Kivity
fc0a45af5f serialializer: silence unused value warning in variant deserializer
The variant deserializer uses a fold expression to implement
an if-tree with a short-circuit, producing an intermediate boolean
value to terminate evaluation. This intermediate value is unneeded,
but evokes a warning from clang when -Wunused-value is enabled.

Since we want to enable the warning, add a cast to void to ignore
the intermediate value.
2020-12-07 16:45:20 +02:00
Michał Chojnowski
60a3cecfea utils: fragment_range: use range-based for loop instead of boost::for_each
We want to pass bytes_ostream to this loop in later commits.
bytes_ostream does not conform to some boost concepts required by
boost::for_each, so let's just use C++'s native loop.
2020-12-07 12:50:36 +01:00
Piotr Sarna
1cc4ed50c1 db: fix getting local ranges for size estimates table
When getting local ranges, an assumption is made that
if a range does not contain an end or when its end is a maximum token,
then it must contain a start. This assumption proven not true
during manual tests, so it's now fortified with an additional check.

Here's a gdb output for a set of local ranges which causes an assertion
failure when calling `get_local_ranges` on it:

(gdb) p ranges
$1 = std::vector of length 2, capacity 2 = {{_interval = {_start = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {_kind = dht::token_kind::before_all_keys,
            _data = 0}, _inclusive = false}}, _end = std::optional<interval_bound<dht::token>> [no contained value], _singular = false}}, {_interval = {
      _start = std::optional<interval_bound<dht::token>> [no contained value], _end = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {
            _kind = dht::token_kind::before_all_keys, _data = 0}, _inclusive = true}}, _singular = false}}}

Closes #7764
2020-12-07 12:08:31 +02:00
Takuya ASADA
c3abba1913 scylla_setup: print --swap-directory and --swap-size on command suggestion
We need to print --swap-directory and --swap-size on command suggestion just like other options.

Related #7395
2020-12-07 18:40:59 +09:00
Takuya ASADA
582a3ffb2f scylla_setup: print --nic on command suggestion
We need to print --nic on command suggestion just like other options.

Related #7395
2020-12-07 18:40:59 +09:00
Nadav Har'El
220d6dde17 alternator, test: make test_fetch_from_system_tables faster
The test test_fetch_from_system_tables tests Alternator's system-table
feature by reading from all system tables. The intention was to confirm
we don't crash reading any of them - as they have different schemas and
can run into different problems (we had such problems in the initial
implementation). The intention was not to read *a lot* from each table -
we only make a single "Scan" call on each, to read one page of data.
However, the Scan call did not set a Limit, so the single page can get
pretty big.

This is not normally a problem, but in extremely slow runs - such as when
running the debug build on an extremely overcommitted test machine (e.g.,
issue #7706) reading this large page may take longer than our default
timeout. I'll send a separate patch for the timeout issue, but for now,
there is really no reason why we need to read a big page. It is good
enough to just read 50 rows (with Limit=50). This will still read all
the different types and make the test faster.

As an example, in the debug run on my laptop, this test spent 2.4
seconds to read the "compaction_history" table before this patch,
and only 0.1 seconds after this patch. 2.4 seconds is close to our
default timeout (10 seconds), 0.1 is very far.

Fixes #7706

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207075112.2548178-1-nyh@scylladb.com>
2020-12-07 08:52:31 +01:00
Michał Chojnowski
1fe7490970 types: add write_collection_value() overload for bytes_ostream and value_view
We will use it to serialize collections to bytes_ostream in serialize_for_cql().
2020-12-07 08:48:31 +01:00
Nadav Har'El
0cd05dd0fd cql-pytest: add tests for ALLOW FILTERING
The original goal of this patch was to replace the two single-node dtests
allow_filtering_test and allow_filtering_secondary_indexes_test, which
recently caused us problems when we wanted to change the ALLOW FILTERING
behavior but the tests were outside the tree. I'm hoping that after this
patch, those two tests could be removed from dtest.

But this patch actually tests more cases then those original dtest, and
moreover tests not just whether ALLOW FILTERING is required or not, but
also that the results of the filtering is correct.

Currently, four of the included tests are expected to fail ("xfail") on
Scylla, reproducing two issues:

1. Refs #5545:
   "WHERE x IN ..." on indexed column x wrongly requires ALLOW FILTERING
2. Refs #7608:
   "WHERE c=1" on clustering key c should require ALLOW FILTERING, but
   doesn't.

All tests, except the one for issue #5545, pass on Cassandra. That one
fails on Cassandra because doesn't support IN on an indexed column at all
(regardless of whether ALLOW FILTERING is used or not).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201115124631.1224888-1-nyh@scylladb.com>
2020-12-06 19:51:25 +02:00
Pavel Solodovnikov
56c0fcfcb2 cql_query_test: handle bounce_to_shard msg in test_null_value_tuple_floating_types_and_uuids
Use `prepared_on_shard` helper function to handle `bounce_to_shard`
messages that can happen when using LWT statements.

Fixes: #7757
Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201204172944.601730-1-pa.solodovnikov@scylladb.com>
2020-12-06 19:34:13 +02:00
Amos Kong
6b1659ee80 schema.cc/describe: fix invalid compaction options in schema
There is a typo in schema.cql of snapshot, lack of comma after
compaction strategy. It will fail to restore schema by the file.

    AND compaction = {'class': 'SizeTieredCompactionStrategy''max_compaction_threshold': '32'}

map_as_cql_param() function has a `first` parameter to smartly add
comma, the compaction_strategy_options is always not the first.

Fixes #7741

Signed-off-by: Amos Kong <amos@scylladb.com>

Closes #7734
2020-12-06 17:40:05 +02:00
Avi Kivity
ca950e6f08 Merge "Remove get_local_storage_service() from counters" from Pavel E
"
The storage service is called there to get the cached value
of db::system_keyspace::get_local_host_id(). Keeping the value
on database decouples it from storage service and kills one
more global storage service reference.

tests: unit(dev)
"

* 'br-remove-storage-service-from-counters-2' of https://github.com/xemul/scylla:
  counters: Drop call to get_local_storage_service and related
  counters: Use local id arg in transform_counter_update_to_shards
  database: Have local id arg in transform_counter_updates_to_shards()
  storage_service: Keep local host id to database
2020-12-06 16:15:21 +02:00
Avi Kivity
6e460e121a Merge 'docs: Add Sphinx and ScyllaDB theme' from David Garcia
This PR adds the Sphinx documentation generator and the custom theme ``sphinx-scylladb-theme``. Once merged, the GitHub Actions workflow should automatically publish the developer notes stored under ``docs`` directory on http://scylladb.github.io/scylla

1. Run the command ``make preview`` from the ``docs`` directory.
3. Check the terminal where you have executed the previous command. It should not raise warnings.
3. Open in a new browser tab http://127.0.0.1:5500/ to see the generated documentation pages.

The table of contents displays the files sorted as they appear on GitHub. In a subsequent iteration, @lauranovich and I will submit an additional PR proposing a new folder organization structure.

Closes #7752

* github.com:scylladb/scylla:
  docs: fixed warnings
  docs: added theme
2020-12-06 15:26:57 +02:00
Benny Halevy
64a4ffc579 large_data_handler: do not delete records in the absence of large_data_stats
The previous way of deleting records based on the whole
sstatble data_size causes overzealous deletions (#7668)
and inefficiency in the rows cache due to the large number
of range tombstones created.

Therefore we'd be better of by juts letting the
records expire using he 30 days TTL.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201206083725.1386249-1-bhalevy@scylladb.com>
2020-12-06 11:34:37 +02:00
Avi Kivity
dc77d128e9 Revert "Merge "raft: fix replication if existing log on leader" from Gleb"
This reverts commit 0aa1f7c70a, reversing
changes made to 72c59e8000. The diff is
strange, including unrelated commits. There is no understanding of the
cause, so to be safe, revert and try again.
2020-12-06 11:34:19 +02:00
Pavel Emelyanov
df0e26035f counters: Drop call to get_local_storage_service and related
The local host id is now passed by argument, so we don't
need the counter_id::local() and some other methods that
call or are called by it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-04 16:31:12 +03:00
Pavel Emelyanov
914613b3c3 counters: Use local id arg in transform_counter_update_to_shards
Only few places in it need the uuid. And since it's only 16 bytes
it's possibvle to safely capture it by value in the called lambdas.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-04 16:30:31 +03:00
Pavel Emelyanov
62214e2258 database: Have local id arg in transform_counter_updates_to_shards()
There are two places that call it -- database code itself and
tests. The former already has the local host id, so just pass
one.

The latter are a bit trickier. Currently they use the value from
storage_service created by storage_service_for_tests, but since
this version of service doesn't pass through prepare_to_join()
the local_host_id value there is default-initialized, so just
default-initialize the needed argument in place.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-04 15:09:30 +03:00
Pavel Emelyanov
5a286ee8d4 storage_service: Keep local host id to database
The value in question is cached from db::system_keyspace
for places that want to have it without waiting for
futures. So far the only place is database counters code,
so keep the value on database itself. Next patches will
make use of it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-04 15:09:29 +03:00
Piotr Sarna
2015988373 Merge 'types: get rid of linearization in deserialize()' from Michał Chojnowski
Citing #6138: > In the past few years we have converted most of our codebase to
work in terms of fragmented buffers, instead of linearised ones, to help avoid
large allocations that put large pressure on the memory allocator.  > One
prominent component that still works exclusively in terms of linearised buffers
is the types hierarchy, more specifically the de/serialization code to/from CQL
format. Note that for most types, this is the same as our internal format,
notable exceptions are non-frozen collections and user types.  > > Most types
are expected to contain reasonably small values, but texts, blobs and especially
collections can get very large. Since the entire hierarchy shares a common
interface we can either transition all or none to work with fragmented buffers.

This series gets rid of intermediate linearizations in deserialization. The next
steps are removing linearizations from serialization, validation and comparison
code.

Series summary:
- Fix a bug in `fragmented_temporary_buffer::view::remove_prefix`. (Discovered
  while testing. Since it wasn't discovered earlier, I guess it doesn't occur in
  any code path in master.)
- Add a `FragmentedView` concept to allow uniform handling of various types of
  fragmented buffers (`bytes_view`, `temporary_fragmented_buffer::view`,
  `ser::buffer_view` and likely `managed_bytes_view` in the future).
- Implement `FragmentedView` for relevant fragmented buffer types.
- Add helper functions for reading from `FragmentedView`.
- Switch `deserialize()` and all its helpers from `bytes_view` to
  `FragmentedView`.
- Remove `with_linearized()` calls which just became unnecessary.
- Add an optimization for single-fragment cases.

The addition of `FragmentedView` might be controversial, because another concept
meant for the same purpose - `FragmentRange` - is already used. Unfortunately,
it lacks the functionality we need. The main (only?) thing we want to do with a
fragmented buffer is to extract a prefix from it and `FragmentRange` gives us no
way to do that, because it's immutable by design. We can work around that by
wrapping it into a mutable view which will track the offset into the immutable
`FragmentRange`, and that's exactly what `linearizing_input_stream` is. But it's
wasteful. `linearizing_input_stream` is a heavy type, unsuitable for passing
around as a view - it stores a pair of fragment iterators, a fragment view and a
size (11 words) to conform to the iterator-based design of `FragmentRange`, when
one fragment iterator (4 words) already contains all needed state, just hidden.
I suggest we replace `FragmentRange` with `FragmentedView` (or something
similar) altogether.

Refs: #6138

Closes #7692

* github.com:scylladb/scylla:
  types: collection: add an optimization for single-fragment buffers in deserialize
  types: add an optimization for single-fragment buffers in deserialize
  cql3: tuples: don't linearize in in_value::from_serialized
  cql3: expr: expression: replace with_linearize with linearized
  cql3: constants: remove unneeded uses of with_linearized
  cql3: update_parameters: don't linearize in prefetch_data_builder::add_cell
  cql3: lists: remove unneeded use of with_linearized
  query-result-set: don't linearize in result_set_builder::deserialize
  types: remove unneeded collection deserialization overloads
  types: switch collection_type_impl::deserialize from bytes_view to FragmentedView
  cql3: sets: don't linearize in value::from_serialized
  cql3: lists: don't linearize in value::from_serialized
  cql3: maps: don't linearize in value::from_serialized
  types: remove unused deserialize_aux
  types: deserialize: don't linearize tuple elements
  types: deserialize: don't linearize collection elements
  types: switch deserialize from bytes_view to FragmentedView
  types: deserialize tuple types from FragmentedView
  types: deserialize set type from FragmentedView
  types: deserialize map type from FragmentedView
  types: deserialize list type from FragmentedView
  types: add FragmentedView versions of read_collection_size and read_collection_value
  types: deserialize varint type from FragmentedView
  types: deserialize floating point types from FragmentedView
  types: deserialize decimal type from FragmentedView
  types: deserialize duration type from FragmentedView
  types: deserialize IP address types from FragmentedView
  types: deserialize uuid types from FragmentedView
  types: deserialize timestamp type from FragmentedView
  types: deserialize simple date type from FragmentedView
  types: deserialize time type from FragmentedView
  types: deserialize boolean type from FragmentedView
  types: deserialize integer types from FragmentedView
  types: deserialize string types from FragmentedView
  types: remove unused read_simple_opt
  types: implement read_simple* versions for FragmentedView
  utils: fragmented_temporary_buffer: implement FragmentedView for view
  utils: fragment_range: add single_fragmented_view
  serializer: implement FragmentedView for buffer_view
  utils: fragment_range: add linearized and with_linearized for FragmentedView
  utils: fragment_range: add FragmentedView
  utils: fragmented_temporary_buffer: fix view::remove_prefix
2020-12-04 09:46:20 +01:00
Michał Chojnowski
a1f7fabb3d types: collection: add an optimization for single-fragment buffers in deserialize
Helpers parametrized with single_fragmented_view should compile to better code,
so let's use them when possible.
2020-12-04 09:21:05 +01:00
Michał Chojnowski
08c394726e types: add an optimization for single-fragment buffers in deserialize
Values usually come in a single fragment, but we pay the cost of fragmented
deserialization nevertheless: bigger view objects (4 words instead of 2 words)
more state to keep updated (i.e. total view size in addition to current fragment
size) and more branches.

This patch adds a special case for single-fragment buffers to
abstract_type::deserialize. They are converted to a single_fragmented_view
before doing anything else. Templates instantiated with single_fragmented_view
should compile to better code than their multi-fragmented counterparts. If
abstract_type::deserialize is inlined, this patch should completely prevent any
performance penalties for switching from with_linearized to fragmented
deserialization.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
f75db1fcf5 cql3: tuples: don't linearize in in_value::from_serialized
We can deserialize directly from fragmented buffers now.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
68177a6721 cql3: expr: expression: replace with_linearize with linearized
with_linearized creates an additional internal `bytes` when the input is
fragmented. linearized copies the data directly to the output `bytes`, so it's
more efficient.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
5ffe40d5a2 cql3: constants: remove unneeded uses of with_linearized
We can deserialize directly from fragmented buffers now.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
3c98806df9 cql3: update_parameters: don't linearize in prefetch_data_builder::add_cell
We can deserialize directly from fragmented buffers now.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
c43ef3951b cql3: lists: remove unneeded use of with_linearized
We can deserialize directly from fragmented buffers now.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
0d5c5b8645 query-result-set: don't linearize in result_set_builder::deserialize
We can deserialize directly from fragmented buffers now.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
04786dee30 types: remove unneeded collection deserialization overloads
Inherit the method from base class rather than reimplementing it in every child.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
c08419e28d types: switch collection_type_impl::deserialize from bytes_view to FragmentedView
Devirtualizes collection_type_impl::deserialize (so it can be templated) and
adds a FragmentedView overload. This will allow us to deserialize collections
with explicit cql_serialization_format directly from fragmented buffers.
2020-12-04 09:19:37 +01:00
dgarcia360
1304f6a0bb docs: fixed warnings
docs: fixed warnings
2020-12-03 17:40:34 +01:00
dgarcia360
a340b46a79 docs: added theme 2020-12-03 17:37:18 +01:00
Michał Chojnowski
d731b34d95 cql3: sets: don't linearize in value::from_serialized
We can deserialize directly from fragmented buffers now.
2020-12-03 10:57:07 +01:00
Michał Chojnowski
64e64fd2b3 cql3: lists: don't linearize in value::from_serialized
We can deserialize directly from fragmented buffers now.
2020-12-03 10:57:07 +01:00
Michał Chojnowski
536a2f8c8d cql3: maps: don't linearize in value::from_serialized
We can deserialize directly from fragmented buffers now.
2020-12-03 10:57:07 +01:00
Michał Chojnowski
58d9f52363 types: remove unused deserialize_aux
Dead code.
2020-12-03 10:57:07 +01:00
Michał Chojnowski
8440279130 types: deserialize: don't linearize tuple elements
We can deserialize directly from fragmented buffers now.
2020-12-03 10:57:07 +01:00
Michał Chojnowski
a216b0545f types: deserialize: don't linearize collection elements
We can deserialize directly from fragmented buffers now.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
1ccdfc7a90 types: switch deserialize from bytes_view to FragmentedView
The final part of the transition of deserialize from bytes_view to
FragmentedView.
Adds a FragmentedView overload to abstract_type::deserialize and
switches deserialize_visitor from bytes_view to FragmentedView, allowing
deserialization of all types with no intermediate linearization.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
898cea4cde types: deserialize tuple types from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
507883f808 types: deserialize set type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
9b211a7285 types: deserialize map type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
5f1939554c types: deserialize list type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
ad7ab73cd0 types: add FragmentedView versions of read_collection_size and read_collection_value
We will need those to deserialize collections from FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
495bf5c431 types: deserialize varint type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
0f8ad89740 types: deserialize floating point types from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
0bb0291e50 types: deserialize decimal type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
760bc5fd60 types: deserialize duration type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
75a56f439b types: deserialize IP address types from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
9f668929db types: deserialize uuid types from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
3e1a24ca0d types: deserialize timestamp type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
a4bc43ab19 types: deserialize simple date type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
24bd986aea types: deserialize time type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
c03ad52513 types: deserialize boolean type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
2f351928e2 types: deserialize integer types from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
28b727082f types: deserialize string types from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
426308f526 types: remove unused read_simple_opt
Dead code.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
e1145fe410 types: implement read_simple* versions for FragmentedView
We will need those to switch deserialize() from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Botond Dénes
71722d8b41 frozen_mutation: add partition context to errors coming from deserializing 2020-12-02 15:08:49 +02:00
Botond Dénes
8d944ff755 partition_builder: accept_row(): use append_clustering_row()
The partition builder doesn't expect the looked-up row to exist. In fact
it already existing is a sign of a bug. Currently bugs resulting in
duplicate rows will manifest by tripping an assert in
`row::append_cell()`. This however results in poor diagnostics, so we
want to catch these errors sooner to be able to provide higher level
diagnostics. To this end, switch to the freshly introduced
`append_clustering_row()` so that duplicate rows are found early and in
a context where their identity is known.
2020-12-02 15:08:49 +02:00
Botond Dénes
63ea36e277 mutation_partition: add append_clustered_row()
A variant of `clutered_row()` which throws if the row already exists, or
if any greater row already exists.
2020-12-02 15:08:32 +02:00
Benny Halevy
c7311d1080 docs: sstable-scylla-format: document large_data_type in more details
This adds details about large_data_type on top of
ca5184052d
and introduces structured indentation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201202110539.634880-1-bhalevy@scylladb.com>
2020-12-02 13:25:49 +02:00
Avi Kivity
a95c2a946c Merge 'mutation_reader: introduce clustering_order_reader_merger' from Kamil Braun
This abstraction is used to merge the output of multiple readers, each
opened for a single partition query, into a non-decreasing stream
of mutation_fragments.

It is similar to `mutation_reader_merger`,
but an important difference is that the new merger may select new readers
in the middle of a partition after it already returned some fragments
from that partition.  It uses the new `position_reader_queue` abstraction
to select new readers. It doesn't support multi-partition (ring range) queries.

The new merger will be later used when reading from sstable sets created
by TimeWindowCompactionStrategy. This strategy creates many sstables
that are mostly disjoint w.r.t the contained clustering keys, so we can
delay opening sstable readers when querying a partition until after we have
processed all mutation fragments with positions before the keys
contained by these sstables.

A microbenchmark was added that compares the existing combining reader
(which uses `mutation_reader_merger` underneath) with a new combining reader
built using the new `clustering_order_reader_merger` and a simple queue of readers
that returns readers from some supplied set. The used set of readers is built from the following
ranges of keys (each range corresponds to a single reader):
`[0, 31]`, `[30, 61]`, `[60, 91]`, `[90, 121]`, `[120, 151]`.
The microbenchmark runs the reader and divides the result by the number of mutation fragments.
The results on my laptop were:
```
$ build/release/test/perf/perf_mutation_readers -t clustering_combined.* -r 10
single run iterations:    0
single run duration:      1.000s
number of runs:           10

test                                      iterations      median         mad         min         max
clustering_combined.ranges_generic           2911678   117.598ns     0.685ns   116.175ns   119.482ns
clustering_combined.ranges_specialized       3005618   111.015ns     0.349ns   110.063ns   111.840ns
```
`ranges_generic` denotes the existing combining reader, `ranges_specialized` denotes the new reader.

Split from https://github.com/scylladb/scylla/pull/7437.

Closes #7688

* github.com:scylladb/scylla:
  tests: mutation_source_test for clustering_order_reader_merger
  perf: microbenchmark for clustering_order_reader_merger
  mutation_reader_test: test clustering_order_reader_merger in memory
  test: generalize `random_subset` and move to header
  mutation_reader: introduce clustering_order_reader_merger
2020-12-02 12:15:35 +02:00
Kamil Braun
502ed2e9f7 tests: mutation_source_test for clustering_order_reader_merger 2020-12-02 11:13:58 +01:00
Nadav Har'El
fae2ba60e9 cql-pytest: start to port Cassandra's CQL unit tests
In issue #7722, it was suggested that we should port Cassandra's CQL unit
tests into our own repository, by translating the Java tests into Python
using the new cql-pytest framework. Cassandra's CQL unit test framework is
orders of magnitude faster than dtest, and in-tree, so Cassandra have been
moving many CQL correctness tests there, and we can also benefit from their
test cases.

In this patch, we take the first step in a long journey:

1. I created a subdirectory, test/cql-pytest/cassandra_tests, where all the
   translated Cassandra tests will reside. The structure of this directory
   will mirror that of the test/unit/org/apache/cassandra/cql3 directory in
   the Cassandra repository.
   pytest conveniently looks for test files recursively, so when all the
   cql-pytest are run, the cassandra_tests files will be run as well.
   As usual, one can also run only a subset of all the tests, e.g.,
   "test/cql-pytest/run -vs cassandra_tests" runs only the tests in the
   cassandra_tests subdirectory (and its subdirectories).

2. I translated into Python two of the smallest test files -
   validation/entities/{TimeuuidTest,DataTypeTest}.java - containing just
   three test functions.
   The plan is to translate entire Java test files one by one, and to mirror
   their original location in our own repository, so it will be easier
   to remember what we already translated and what remains to be done.

3. I created a small library, porting.py, of functions which resemble the
   common functions of the Java tests (CQLTester.java). These functions aim
   to make porting the tests easier. Despite the resemblence, the ported code
   is not 100% identical (of course) and some effort is still required in
   this porting. As we continue this porting effort, we'll probably need
   more of these functions, can can also continue to improve them to reduce
   the porting effort.

Refs #7722.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201201192142.2285582-1-nyh@scylladb.com>
2020-12-02 09:29:22 +01:00
Avi Kivity
77466177ab Merge 'Use large_data_counters in scylla_metadata to decide when to delete large_data records' from Benny Halevy
This series introduces a `large_data_counters` element to `scylla_metadata` component to explicitly count the number of `large_{partitions,rows,cells}` and `too_many_rows` in the sstable.  These are accounted for in the sstable writer whenever the respective large data entry is encountered.

It is taken into account in `large_data_handler::maybe_delete_large_data_entries`, when engaged.
Otherwise, if deleting a legacy sstable that has no such entry in `scylla_metadata`, just revert to using the current method of comparing the sstable's `data_size` to the various thresholds.

Fixes #7668

Test: unit(dev)
Dtest: wide_rows_test.py (in progress)

Closes #7669

* github.com:scylladb/scylla:
  docs: sstable-scylla-format: add large_data_stats subcomponent
  large_data_handler: maybe_delete_large_data_entries: use sstable large data stats
  large_data_handler: maybe_delete_large_data_entries: accept shared_sstable
  large_data_handler: maybe_delete_large_data_entries: move out of line
  sstables: load large_data_stats from scylla_metadata
  sstables: store large_data_stats in scylla_metadata
  sstables: writer: keep track of large data stats
  large_data_handler: expose methods to get threshold
  sstables: kl/writer: never record too many rows
  large_data_handler: indicate recording of large data entries
  large_data_handler: move constructor out of line
2020-12-02 10:08:18 +02:00
Nadav Har'El
5c08489569 cql-pytest: don't run tests if Scylla boot timed out
In test/cql-pytest/run.py we have a 200 second timeout to boot Scylla.
I never expected to reach this timeout - it normally takes (in dev
build mode) around 2 seconds, but in one run on Jenkins we did reach it.
It turns out that the code does not recognize this timeout correctly,
thought that Scylla booted correctly - and then failed all the
subtests when they fail to connect to Scylla.

This patch fixes the timeout logic. After the timeout, if Scylla's
CQL port is still not responsive, the test run is failed - without
trying to run many individual tests.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201201150927.2272077-1-nyh@scylladb.com>
2020-12-02 08:48:44 +02:00
Kamil Braun
2da723b9c8 cdc: produce postimage when inserting with no regular columns
When a row was inserted into a table with no regular columns, and no
such row existed in the first place, postimage would not be produced.
Fix this.

Fixes #7716.

Closes #7723
2020-12-01 18:01:23 +02:00
Benny Halevy
ca5184052d docs: sstable-scylla-format: add large_data_stats subcomponent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:19:42 +02:00
Benny Halevy
4406a2514e large_data_handler: maybe_delete_large_data_entries: use sstable large data stats
If the sstable has scylla_metadata::large_data_stats use them
to determine whether to delete the corresponding large data records.

Otherwise, defer to the current method of comparing the sstable
data_size to the respective thresholds.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:19:42 +02:00
Benny Halevy
8cebe7776f large_data_handler: maybe_delete_large_data_entries: accept shared_sstable
Since the actual deletion if the large data entries
is done in the background, and we don't captures the shared_sstable,
we can safely pass it to maybe_delete_large_data_entries when
deleting the sstable in sstable::unlink and it will be release
as soon as maybe_delete_large_data_entries returns.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:19:42 +02:00
Benny Halevy
f7d0ae3d10 large_data_handler: maybe_delete_large_data_entries: move out of line
It is called on the cold path, when the sstable is deleted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:19:42 +02:00
Benny Halevy
be4a58c34c sstables: load large_data_stats from scylla_metadata
Load the large data stats from the scylla_metadata component
if they are present. Otherwise, if we're opening a legacy sstable
that has scylla_metadata_type::LargeDataStats, leave
sstable::_large_data_stats disengaged.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:19:42 +02:00
Benny Halevy
92443ed71c sstables: store large_data_stats in scylla_metadata
Store the large data statistics in the scylla_metadata component.

These will be retrieved when loading the sstable and be
used for determining whether to delete the corresponding
large data entries upon sstable deletion.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:19:42 +02:00
Benny Halevy
79c19a166c sstables: writer: keep track of large data stats
In the next patch, this is will be written to the
sstable's scylla_metadata component.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:19:41 +02:00
Benny Halevy
8ab053bd44 large_data_handler: expose methods to get threshold
To be used for keeping large_data statistics in sstable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:18:14 +02:00
Benny Halevy
f1257dfdc0 sstables: kl/writer: never record too many rows
rows_count is not tracked prior to the mc format.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:18:14 +02:00
Benny Halevy
dd7422a713 large_data_handler: indicate recording of large data entries
Return true from the maybe_{record,log}_* methods if
a large data record or log entry were emitted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:18:14 +02:00
Benny Halevy
873107821b large_data_handler: move constructor out of line
No need for it to be inlined.

Also, add debug logging to the large data handler options.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:18:14 +02:00
Dejan Mircevski
e45af3b9b8 index: Ensure restriction is supported in find_idx
Previously, statement_restrictions::find_idx() would happily return an
index for a non-EQ restriction (because it checked only the column
name, not the operator).  This is incorrect: when the selected index
is for a non-EQ restriction, it is impossible to query that index
table.

Fixes #7659.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #7665
2020-12-01 15:16:48 +02:00
Avi Kivity
df572b41ae Update seastar submodule
* seastar 010fb0df1e...8b400c7b45 (6):
  > append_challenged_posix_file_impl::read_dma: allow iovec to cross _logical_size
  > Merge "Extend per task-queue timing statistics" from Pavel E
  > tls_test: Create test certs at build time
  > cook: upgrade hwloc version
  > memory: rate-limit diagnostics messages
  > util/log: add rate-limited version of writer version of log()
2020-12-01 15:12:25 +02:00
Tomasz Grabiec
0c5d23d274 thrift: Validate cell names when constructing clustering keys
Currently, if the user provides a cell name with too many components,
we will accept it and construct an invalid clusterin key. This may
result in undefined behavior down the stream.

It was caught by ASAN in a debug build when executing dtest
cql_tests.py:MiscellaneousCQLTester.cql3_insert_thrift_test with
nodetool flush manually added after the write. Triggered during
sstable writing to an MC-format sstable:

   seastar::shared_ptr<abstract_type const>::operator*() const at ././seastar/include/seastar/core/shared_ptr.hh:577
   sstables::mc::clustering_blocks_input_range::next() const at ./sstables/mx/writer.cc:180

To prevent corrupting the state in this way, we should fail
early. This patch addds validation which will fail thrift requests
which attempt to create invalid clustering keys.

Fixes #7568.

Example error:

  Internal server error: Cell name of ks.test has too many components, expected 1 got 2 in 0x0004000000040000017600

Message-Id: <1605550477-24810-1-git-send-email-tgrabiec@scylladb.com>
2020-12-01 15:12:08 +02:00
Avi Kivity
2fd895a367 Merge 'dist/common/scripts/scylla_setup: Optionally config rsyslog destination' from Amnon Heiman
This patch adds an option to scylla_setup to configure an rsyslog destination.

The monitoring stack has an option to get information from rsyslog it
requires that rsyslog on the scylla machines will send the trace line to
it.

The configuration will be in a Scylla configuration file, so it is safe to run it multiple times.

Fixes #7589

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #7634

* github.com:scylladb/scylla:
  dist/common/scripts/scylla_setup: Optionally config rsyslog destination
  Adding dist/common/scripts/scylla_rsyslog_setup utility
2020-12-01 13:12:32 +02:00
Amnon Heiman
4036cecdea dist/common/scripts/scylla_setup: Optionally config rsyslog destination
This patch adds an option to scylla_setup to configure an rsyslog
destination.

The monitoring stack has an option to get information from rsyslog, it
requires that rsyslog on the scylla machines will send the trace line to
it.

If the /etc/rsyslog.d/ directory exists (that means the current system
runs rsyslog) it will ask if to add rsyslog configuration and if yes, it
would run scylla_rsyslog_setup.

Fixes #7589

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-12-01 12:33:37 +02:00
Takuya ASADA
572d6b2a4e scylla_setup: fix wrong command suggestion on --io-setup
scylla_setup command suggestion does not shows an argument of --io-setup,
because we mistakely stores bool value on it (recognized as 'store_true').
We always need to print '--io-setup X' on the suggestion instead.

Related #7395
2020-12-01 07:23:55 +09:00
Tomasz Grabiec
f8f81ec322 Merge "raft: various snapshot fixes" from Gleb
* scylla-dev/snapshot_fixes_v1:
  raft: ignore append_reply from a peer in SNAPSHOT state
  raft: Ignore outdated snapshots
  raft: set next_idx to correct value after snapshot transfer
2020-11-30 21:34:31 +01:00
Alejo Sanchez
72a64b05ea raft: replication test: fix total entries for initial snapshot
Since now total expected entries are updated by load snapshot, do not
trim the total entries expected values with the initial snapshot on
test state machine initialization.

reported by @gleb

Branch URL: https://github.com/alecco/scylla/tree/raft-ale-tests-06-snapshot-total-entries

Tests: unit ({dev}), unit ({debug}), unit ({release})

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20201125171232.321992-1-alejo.sanchez@scylladb.com>
2020-11-30 21:34:31 +01:00
Kamil Braun
af49a95627 perf: microbenchmark for clustering_order_reader_merger 2020-11-30 11:55:44 +01:00
Kamil Braun
4f7e2bf920 mutation_reader_test: test clustering_order_reader_merger in memory 2020-11-30 11:55:44 +01:00
Kamil Braun
b22aa6dbde test: generalize random_subset and move to header 2020-11-30 11:55:44 +01:00
Kamil Braun
0b36c5e116 mutation_reader: introduce clustering_order_reader_merger
This abstraction is used to merge the output of multiple readers, each
opened for a single partition query, into a non-decreasing stream
of mutation_fragments.

It is similar to `mutation_reader_merger`,
an important difference is that the new merger may select new readers
in the middle of a partition after it already returned some fragments
from that partition.  It uses the new `position_reader_queue` abstraction
to select new readers. It doesn't support multi-partition (ring range) queries.

The new merger will be later used when reading from sstable sets created
by TimeWindowCompactionStrategy. This strategy creates many sstables
that are mostly disjoint w.r.t the contained clustering keys, so we can
delay opening sstable readers when querying a partition until after we have
processed all mutation fragments with positions before the keys
contained by these sstables.
2020-11-30 11:55:44 +01:00
Avi Kivity
ea9c058be3 Merge 'Don't use secondary indices for multi-column restrictions' from Dejan Mircevski
Fix #7680 by never using secondary index for multi-column restrictions.

Modify expr::is_supported_by() to handle multi-column correctly.

Tests: unit (dev)

Closes #7699

* github.com:scylladb/scylla:
  cql3/expr: Clarify multi-column doesn't use indexing
  cql3: Don't use index for multi-column restrictions
  test: Add eventually_require_rows
2020-11-30 12:38:26 +02:00
Avi Kivity
12c20c4101 Merge 'test/cql-pytest: tests for string validation (UTF-8 and ASCII)' from Nadav Har'El
The first two patches in this series are small improvements to cql-pytest to prepare for the third and main patch. This third patch adds cql-pytest tests which check that we fail CQL queries that try to inject non-ASCII and non-UTF-8 strings for ascii and text columns, respectively.

The tests do not discover any unknown bug in Scylla, however, they do show that Scylla is more strict in its definition of "valid UTF-8" compared to Cassandra.

Closes #7719

* github.com:scylladb/scylla:
  test/cql-pytest: add tests for validation of inserted strings
  test/cql-pytest: add "scylla_only" fixture
  test/cpy-pytest: enable experimental features
2020-11-30 12:26:25 +02:00
Piotr Wojtczak
3560acd311 cql_metrics: Add metrics for CQL errors
This change adds tracking of all the CQL errors that can be
raised in response to a CQL message from a client, as described
in the CQL v4 protocol and with Scylla's CDC_WRITE_FAILUREs
included.

Fixes #5859

Closes #7604
2020-11-30 12:18:37 +02:00
Takuya ASADA
6238d105d9 dist/redhat: drop Conflicts with older kernel
We have "Conflicts: kernel < 3.10.0-514" on rpm package to make sure
the environment is running newer kernel.
However, user may use non-standard kernel which has different package name,
like kernel-ml or kernel-uek.
On such environment Conflicts tag does not works correctly.
Even the system running with newer kernel, rpm only checks "kernel" package
version number.

To avoid such issue, we need to drop Conflicts tag.

Fixes #7675
2020-11-30 11:38:42 +02:00
Nadav Har'El
48c78ade33 test/cql-pytest: add tests for validation of inserted strings
This patch adds comprehensive cql-pytest tests for checking the validation
of strings - ASCII or UTF-8 - in CQL. Strings can be represented in CQL
using several methods - a strings can be a string literal as
part of the statement, can be encoded as a blob (0x...), or
can be a binding parameter for a prepared statement, or returned
by user-defined functions - and these tests check all of them.

We already have low-level unit tests for UTF-8 parsing in
test/boost/utf8_test.cc, but the new tests here confirms that we really
call these low-level functions in the correct way. Moreover, since these
are CQL tests, they can also be run against Cassandra, and doing that
demonstrated that Scylla's UTF-8 parsing is *stricter* than Cassandra's -
Scylla's UTF-8 parser rejects the following sequences which Cassandra's
accepts:

 1. \xC0\x80 as another non-minimal representation of null. Note that other
    non-minimal encodings are rejected by Cassandra, as expected.
 2. Characters beyond the official Unicode range (or what Scylla considers
    the end of the range).
 3. UTF-16 surrogates - these are not considered valid UTF-8, but Cassandra
    accepts them, and Scylla does not.

In the future, we should consider whether Scylla is more correct than
Cassandra here (so we're fine), or whether compatibility is more important
than correctness (so this exposed a bug).

The ASCII tests reproduces issue #5421 - that trying to insert a
non-ASCII string into an "ascii" column should produce an error on
insert - not later when fetching the string. This test now passes,
because issue 5421 was already fixed.

These tests did not exposed any bug in Scylla (other than the differences
with Cassandra mentioned a bug), so all of them pass on Scylla. Two
of the tests fail on Cassandra, because Cassandra does not recognize
some invalid UTF-8 (according to Scylla's definition) as invalid.

Refs #5421.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-11-29 17:43:20 +02:00
Dejan Mircevski
5bc7e31284 restrictions: Forbid mixing ck=0 and (ck)=(0)
Reject the previously accepted case where the multi-column restriction
applied to just a single column, as it causes a crash downstream.  The
user can drop the parentheses to avoid the rejection.

Fixes #7710

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #7712
2020-11-29 17:06:41 +02:00
Avi Kivity
0584db1eb3 Merge "Unstall cleanup_compaction::get_ranges_for_invalidation" from Benny
"
This series adds maybe_yield called from
cleanup_compaction::get_ranges_for_invalidation
to avoid reactor stalls.

To achieve that, we first extract bool_class can_yield
to utils/maybe_yield.hh, and add a convience helper:
utils::maybe_yield(can_yield) that conditionally calls
seastar::thread::maybe_yield if it can (when called in a
seastar thread).

With that, we add a can_yield parameter to dht::to_partition_ranges
and dht::partition_range::deoverlap (defaults to false), and
use it from cleanup_compaction::get_ranges_for_invalidation,
as the latter is always called from `consume_in_thread`.

Fixes #7674

Test: unit(dev)
"

* tag 'unstall-get_ranges_for_invalidation-v2' of github.com:bhalevy/scylla:
  compaction: cleanup_compaction: get_ranges_for_invalidation: add yield points
  dht/i_partitioner: to_partition_ranges: support yielding
  locator: extract can_yield to utils/maybe_yield.hh
2020-11-29 14:10:39 +02:00
Asias He
0a3a2a82e1 api: Add force_remove_endpoint for gossip
It is used to force remove a node from gossip membership if something
goes wrong.

Note: run the force_remove_endpoint api at the same time on _all_ the
nodes in the cluster in order to prevent the removed nodes come back.
Becasue nodes without running the force_remove_endpoint api cmd can
gossip around the removed node information to other nodes in 2 *
ring_delay (2 * 30 seconds by default) time.

For instance, in a 3 nodes cluster, node 3 is decommissioned, to remove
node 3 from gossip membership prior the auto removal (3 days by
default), run the api cmd on both node 1 and node 2 at the same time.

$ curl -X POST --header "Accept: application/json"
"http://127.0.0.1:10000/gossiper/force_remove_endpoint/127.0.0.3"
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.2:10000/gossiper/force_remove_endpoint/127.0.0.3"

Then run 'nodetool gossipinfo' on all the nodes to check the removed nodes
are not present.

Fixes #2134

Closes #5436
2020-11-29 13:58:46 +02:00
Nadav Har'El
0864933d4d test/cql-pytest: add "scylla_only" fixture
This patch adds a fixture "scylla_only" which can be used to mark tests
for Scylla-specific features. These tests are skipped when running against
other CQL servers - like Apache Cassandra.

We recognize Scylla by looking at whether any system table exists with
the name "scylla" in its name - Scylla has several of those, and Cassandra
has none.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-11-29 10:18:58 +02:00
Nadav Har'El
91ccb2afb5 test/cpy-pytest: enable experimental features
Enable experimental features, and in particular UDF, so we can test those
features in our tests.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-11-29 10:18:58 +02:00
Michał Chojnowski
fcb258cb01 utils: fragmented_temporary_buffer: implement FragmentedView for view
fragmented_temporary_buffer::view is one of the types we want to directly
deserialize from.
2020-11-27 15:26:13 +01:00
Michał Chojnowski
f6cc2b6a48 utils: fragment_range: add single_fragmented_view
bytes_view is one of the types we want to deserialize from (at least for now),
so we want to be able to pass it to deserialize() after it's transitioned to
FragmentView.

single_fragmented_view is a wrapper implementing FragmentedView for bytes_view.
It's constructed from bytes_view explicitly, because it's typically used in
context where we want to phase linearization (and by extension, bytes_view) out.
2020-11-27 15:26:13 +01:00
Michał Chojnowski
0b20c7ef65 serializer: implement FragmentedView for buffer_view
buffer_view is one of the types we want to directly deserialize from.
2020-11-27 15:26:13 +01:00
Michał Chojnowski
2008c0f62f utils: fragment_range: add linearized and with_linearized for FragmentedView
We would like those helpers to disappear one day but for now we still need them
until everything can handle fragmented buffers.
2020-11-27 15:26:13 +01:00
Michał Chojnowski
fc90bd5190 utils: fragment_range: add FragmentedView
This patch introduces FragmentedView - a concept intented as a general-purpose
interface for fragmented buffers.
Another concept made for this purpose, FragmentedRange, already exists in the
codebase. However, it's unwieldy. The iterator-based design of FragmentRange is
harder to implement and requires more code, but more importantly it makes
FragmentRange immutable.
Usually we want to read the beginning of the buffer and pass the rest of it
elsewhere. This is impossible with FragmentRange.
FragmentedView can do everything FragmentRange can do and more, except for
playing nicely with iterator-based collection methods, but those are useless for
fragmented buffers anyway.
2020-11-27 15:26:13 +01:00
Lubos Kosco
4d0587ed11 scylla_util.py: fix metadata gcp call for disks to get details
disk parsing expects output from recursive listing of GCP
metadata REST call, the method used to do it by default,
but now it requires a boolean flag to run in recursive mode

Fixes #7684

Closes #7685
2020-11-27 15:20:56 +02:00
Pekka Enberg
c84754a634 Update tools/java submodule
* tools/java ad48b44a26...8080009794 (1):
  > sstableloader: Fix command line parsing of "ignore-missing-columns"
2020-11-27 15:19:48 +02:00
Avi Kivity
390e07d591 dist: sysctl: configure more inotify instances
Since f3bcd4d205 ("Merge 'Support SSL Certificate Hot
Reloading' from Calle"), we reload certificates as they are
modified on disk. This uses inotify, which is limited by a
sysctl fs.inotify.max_user_instances, with a default of 128.

This is enough for 64 shards only, if both rpc and cql are
encrypted; above that startup fails.

Increase to 1200, which is enough for 6 instances * 200 shards.

Fixes #7700.

Closes #7701
2020-11-26 23:44:48 +02:00
Takuya ASADA
5f81f97773 install.sh: apply sysctl.d files on non-packaging installation
We don't apply sysctl.d files on non-packaging installation, apply them
just like rpm/deb taking care of that.

Fixes #7702

Closes #7705
2020-11-26 09:52:14 +02:00
Takuya ASADA
ba4d54efa3 dist/redhat: packaging dependencies.conf as normal file, not ghost
When we introduced dependencies.conf, we mistakenly added it on rpm as %ghost,
but it should be normal file, should be installed normally on package installation.

Fixes #7703

Closes #7704
2020-11-26 09:50:05 +02:00
Dejan Mircevski
7f8ed811c1 cql3/expr: Clarify multi-column doesn't use indexing
Although not currently used, the old code was wrong and confusing to
readers.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-11-25 10:59:13 -05:00
Avi Kivity
956f031a68 Merge 'Add missing shaded<>::stop in exceptional startup code for CQL/redis' from Calle Wilund
Fixes #7211

If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.

Closes #7697

* github.com:scylladb/scylla:
  redis::service: Shut down sharded<> subobject on startup exception
  transport::controller: Shut down distributed object on startup exception
2020-11-25 17:57:53 +02:00
Calle Wilund
55acf09662 redis::service: Shut down sharded<> subobject on startup exception
Refs #7211

If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
2020-11-25 15:52:47 +00:00
Calle Wilund
ae4d5a60ca transport::controller: Shut down distributed object on startup exception
Fixes #7211

If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
2020-11-25 15:52:47 +00:00
Dejan Mircevski
db63b40347 cql3: Don't use index for multi-column restrictions
The downstream code expects a single-column restriction when using an
index.  We could fix it, but we'd still have to filter the rows
fetched from the index table, unlike the code that queries the base
table directly.  For instance, WHERE (c1,c2,c3) = (1,2,3) with an
index on c3 can fetch just the right rows from the base table but all
the c3=3 rows from the index table.

Fixes #7680

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-11-25 10:39:04 -05:00
Dejan Mircevski
ab7aa57b24 test: Add eventually_require_rows
Makes it easier to combine eventually{assert_that} with useful error
messages.

Refs #7573.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-11-25 10:34:44 -05:00
Benny Halevy
e1fe1f18c7 compaction: cleanup_compaction: get_ranges_for_invalidation: add yield points
Avoid reactor stalls by allowing yielding in long-running loops
as seen in #7674.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-25 13:46:32 +02:00
Gleb Natapov
be6119b350 raft: ignore append_reply from a peer in SNAPSHOT state
If append_reply is received from a node that currently gets snapshot
transferred to it ignore it, it is a stray reply.
2020-11-25 12:36:41 +02:00
Gleb Natapov
851e3000c4 raft: Ignore outdated snapshots
Do not try to install snapshots that are older than current one.
2020-11-25 12:36:41 +02:00
Gleb Natapov
2ce9473037 raft: set next_idx to correct value after snapshot transfer
After snapshot is transferred progress::next_idx is set to its index,
but the code uses current snapshot to set it instead of the snapshot
that was transferred. Those can be different snapshots.
2020-11-25 11:34:49 +02:00
Tomasz Grabiec
0aa1f7c70a Merge "raft: fix replication if existing log on leader" from Gleb
* scylla-dev/add_dummy_v2:
  raft: test: replication works on leader change without adding an entry
  raft: commit a dummy entry after leader change
  raft: test: fix snapshot correctness check
  sstables: add `may_have_partition_tombstones` method
2020-11-24 11:35:18 +01:00
Gleb Natapov
51d1d20687 raft: test: replication works on leader change without adding an entry
Check that a newly elected leader commits all the entries in its log
without waiting for more entries to be submitted.
2020-11-24 11:35:18 +01:00
Gleb Natapov
6130fb8b39 raft: commit a dummy entry after leader change
After a node becomes leader it needs to do two things: send an append
message to establish its leadership and commit one entry to make sure
all previous entries with smaller terms are committed as well.
2020-11-24 11:35:18 +01:00
Gleb Natapov
e3a886738b raft: test: fix snapshot correctness check
Snapshot index cannot be used to check snapshot correctness since some
entries may not be command and thus do not affect snapshot value. Lest
use applied entries count instead.
2020-11-24 11:35:18 +01:00
Benny Halevy
37e971ad87 dht/i_partitioner: to_partition_ranges: support yielding
Allow yielding to prevent reactor stalls
when called with a long vector of ranges.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-24 12:23:56 +02:00
Benny Halevy
157a964a63 locator: extract can_yield to utils/maybe_yield.hh
Move the definition of bool_class can_yield to a standalone
header file and define there a maybe_yield(can_yield) helper.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-24 12:23:56 +02:00
Asias He
1b2155eb1d repair: Use same description for the same metric
In commit 9b28162f88 (repair: Use label
for node ops metrics), we switched to use label for different node
operations. We should use the same description for the same metric name.

Fixes #7681

Closes #7682
2020-11-24 09:35:39 +02:00
Avi Kivity
e8ff77c05f Merge 'sstables: a bunch of refactors' from Kamil Braun
1. sstables: move `sstable_set` implementations to a separate module

    All the implementations were kept in sstables/compaction_strategy.cc
    which is quite large even without them. `sstable_set` already had its
    own header file, now it gets its own implementation file.

    The declarations of implementation classes and interfaces (`sstable_set_impl`,
    `bag_sstable_set`, and so on) were also exposed in a header file,
    sstable_set_impl.hh, for the purposes of potential unit testing.

2. mutation_reader: move `mutation_reader::forwarding` to flat_mutation_reader.hh

    Files which need this definition won't have to include
    mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are
    in total smaller; mutation_reader.hh includes flat_mutation_reader.hh).

3. sstables: move sstable reader creation functions to `sstable_set`

    Lower level functions such as `create_single_key_sstable_reader`
    were made methods of `sstable_set`.

    The motivation is that each concrete sstable_set
    may decide to use a better sstable reading algorithm specific to the
    data structures used by this sstable_set. For this it needs to access
    the set's internals.

    A nice side effect is that we moved some code out of table.cc
    and database.hh which are huge files.

4. sstables: pass `ring_position` to `create_single_key_sstable_reader`

    instead of `partition_range`.

    It would be best to pass `partition_key` or `decorated_key` here.
    However, the implementation of this function needs a `partition_range`
    to pass into `sstable_set::select`, and `partition_range` must be
    constructed from `ring_position`s. We could create the `ring_position`
    internally from the key but that would involve a copy which we want to
    avoid.

5. sstable_set: refactor `filter_sstable_for_reader_by_pk`

    Introduce a `make_pk_filter` function, which given a ring position,
    returns a boolean function (a filter) that given a sstable, tells
    whether the sstable may contain rows with the given position.

    The logic has been extracted from `filter_sstable_for_reader_by_pk`.

Split from #7437.

Closes #7655

* github.com:scylladb/scylla:
  sstable_set: refactor filter_sstable_for_reader_by_pk
  sstables: pass ring_position to create_single_key_sstable_reader
  sstables: move sstable reader creation functions to `sstable_set`
  mutation_reader: move mutation_reader::forwarding to flat_mutation_reader.hh
  sstables: move sstable_set implementations to a separate module
2020-11-24 09:23:57 +02:00
Michał Chojnowski
9bceaac44c utils: fragmented_temporary_buffer: fix view::remove_prefix
This piece of logic was wrong for two unrelated reasons:
1. When fragmented_temporary_buffer::view is constructed from bytes_view,
_current is null. When remove_prefix was used on such view, null pointer
dereference happened.
2. It only worked for the first remove_prefix call. A second call would put a
wrong value in _current_position.
2020-11-24 03:05:13 +01:00
Kamil Braun
6c8b0af505 sstable_set: refactor filter_sstable_for_reader_by_pk
Introduce a `make_pk_filter` function, which given a ring position,
returns a boolean function (a filter) that given a sstable, tells
whether the sstable may contain rows with the given position.

The logic has been extracted from `filter_sstable_for_reader_by_pk`.
2020-11-23 12:35:10 +01:00
Kamil Braun
68663d0de0 sstables: pass ring_position to create_single_key_sstable_reader
instead of partition_range.

It would be best to pass `partition_key` or `decorated_key` here.
However, the implementation of this function needs a `partition_range`
to pass into `sstable_set::select`, and `partition_range` must be
constructed from `ring_position`s. We could create the `ring_position`
internally from the key but that would involve a copy which we want to
avoid.
2020-11-23 12:33:24 +01:00
Amnon Heiman
9e116d136e Adding dist/common/scripts/scylla_rsyslog_setup utility
scylla_rsyslog_setup adds a configuration file to rsyslog to forward the
trances to a remote server.

It will override any existing file, so it is safe to run it multiple
times.

It takes an ip, or ip and port from the users for that configuration, if
no port is provided, the default port of Scylla-Monitoring promtail is
used.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-11-22 15:48:48 +02:00
Kamil Braun
40d8bfa394 sstables: move sstable reader creation functions to sstable_set
Lower level functions such as `create_single_key_sstable_reader`
were made methods of `sstable_set`.

The motivation is that each concrete sstable_set
may decide to use a better sstable reading algorithm specific to the
data structures used by this sstable_set. For this it needs to access
the set's internals.

A nice side effect is that we moved some code out of table.cc
and database.hh which are huge files.
2020-11-19 17:52:39 +01:00
Kamil Braun
708093884c mutation_reader: move mutation_reader::forwarding to flat_mutation_reader.hh
Files which need this definition won't have to include
mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are
in total smaller; mutation_reader.hh includes flat_mutation_reader.hh).
2020-11-19 17:52:39 +01:00
Kamil Braun
b02b441c2e sstables: move sstable_set implementations to a separate module
All the implementations were kept in sstables/compaction_strategy.cc
which is quite large even without them. `sstable_set` already had its
own header file, now it gets its own implementation file.

The declarations of implementation classes and interfaces (`sstable_set_impl`,
`bag_sstable_set`, and so on) were also exposed in a header file,
sstable_set_impl.hh, for the purposes of potential unit testing.
2020-11-19 17:52:37 +01:00
Avi Kivity
1cf02cb9d8 types: add constraint on lexicographical_tri_compare()
Verify that the input types are iterators and their value
types are compatible with the compare function.
2020-11-17 15:19:46 +02:00
Avi Kivity
71e93d63c5 composite: make composite::iterator a real input_iterator
Iterators require a default constructor, so add one. This helps
a later patch use std::input_iterator to constrain template
parameters.
2020-11-17 15:19:46 +02:00
Avi Kivity
867b41b124 compound: make compount_type::iterator a real input_iterator
Iterators require a default constructor, so add one. This helps
a later patch use std::input_iterator to constrain template
parameters.
2020-11-17 15:19:38 +02:00
1488 changed files with 49886 additions and 19032 deletions

33
.github/workflows/pages.yml vendored Normal file
View File

@@ -0,0 +1,33 @@
name: "CI Docs"
on:
push:
branches:
- master
paths:
- 'docs/**'
jobs:
release:
name: Build
runs-on: ubuntu-latest
env:
LATEST_VERSION: master
steps:
- name: Checkout
uses: actions/checkout@v2
with:
persist-credentials: false
fetch-depth: 0
- name: Set up Python
uses: actions/setup-python@v1
with:
python-version: 3.7
- name: Build docs
run: |
export PATH=$PATH:~/.local/bin
cd docs
make multiversion
- name: Deploy
run : ./docs/_utils/deploy.sh
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

2
.gitignore vendored
View File

@@ -25,3 +25,5 @@ tags
testlog
test/*/*.reject
.vscode
docs/_build
docs/poetry.lock

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -498,6 +498,7 @@ set(scylla_sources
mutation_writer/multishard_writer.cc
mutation_writer/shard_based_splitting_writer.cc
mutation_writer/timestamp_based_splitting_writer.cc
mutation_writer/feed_writers.cc
partition_slice_builder.cc
partition_version.cc
querier.cc

View File

@@ -1,11 +1,18 @@
# Asking questions or requesting help
# Contributing to Scylla
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
## Asking questions or requesting help
# Reporting an issue
Use the [Scylla Users mailing list](https://groups.google.com/g/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to report issues. Fill in as much information as you can in the issue template, especially for performance problems.
Join the [Scylla Developers mailing list](https://groups.google.com/g/scylladb-dev) for deeper technical discussions and to discuss your ideas for contributions.
# Contributing Code to Scylla
## Reporting an issue
To contribute code to Scylla, you need to sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.
Please use the [issue tracker](https://github.com/scylladb/scylla/issues/) to report issues or to suggest features. Fill in as much information as you can in the issue template, especially for performance problems.
## Contributing code to Scylla
Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form cla@scylladb.com. You can then submit your changes as patches to the to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).
If you need help formatting or sending patches, [check out these instructions](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches).
The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md) so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.

1
DEDICATION.txt Normal file
View File

@@ -0,0 +1 @@
Dedicated to the memory of Alberto José Araújo, a coworker and a friend.

View File

@@ -5,3 +5,5 @@ It includes files from https://github.com/antonblanchard/crc32-vpmsum (author An
These files are located in utils/arch/powerpc/crc32-vpmsum. Their license may be found in licenses/LICENSE-crc32-vpmsum.TXT.
It includes modified code from https://gitbox.apache.org/repos/asf?p=cassandra-dtest.git (owned by The Apache Software Foundation)
It includes modified tests from https://github.com/etcd-io/etcd.git (owned by The etcd Authors)

View File

@@ -42,7 +42,7 @@ For further information, please see:
* [Docker image build documentation] for information on how to build Docker images.
[developer documentation]: HACKING.md
[build documentation]: docs/building.md
[build documentation]: docs/guides/building.md
[docker image build documentation]: dist/docker/redhat/README.md
## Running Scylla
@@ -65,7 +65,7 @@ $ ./tools/toolchain/dbuild ./build/release/scylla --help
## Testing
See [test.py manual](docs/testing.md).
See [test.py manual](docs/guides/testing.md).
## Scylla APIs and compatibility
By default, Scylla is compatible with Apache Cassandra and its APIs - CQL and
@@ -78,10 +78,7 @@ and the current compatibility of this feature as well as Scylla-specific extensi
## Documentation
Documentation can be found in [./docs](./docs) and on the
[wiki](https://github.com/scylladb/scylla/wiki). There is currently no clear
definition of what goes where, so when looking for something be sure to check
both.
Documentation can be found [here](https://scylla.docs.scylladb.com).
Seastar documentation can be found [here](http://docs.seastar.io/master/index.html).
User documentation can be found [here](https://docs.scylladb.com/).

View File

@@ -1,7 +1,7 @@
#!/bin/sh
PRODUCT=scylla
VERSION=4.4.dev
VERSION=4.5.7
if test -f version
then

2
abseil

Submodule abseil updated: 1e3d25b265...9c6a50fdd8

View File

@@ -62,6 +62,14 @@ static std::string apply_sha256(std::string_view msg) {
return to_hex(hasher.finalize());
}
static std::string apply_sha256(const std::vector<temporary_buffer<char>>& msg) {
sha256_hasher hasher;
for (const temporary_buffer<char>& buf : msg) {
hasher.update(buf.get(), buf.size());
}
return to_hex(hasher.finalize());
}
static std::string format_time_point(db_clock::time_point tp) {
time_t time_point_repr = db_clock::to_time_t(tp);
std::string time_point_str;
@@ -91,7 +99,7 @@ void check_expiry(std::string_view signature_date) {
std::string get_signature(std::string_view access_key_id, std::string_view secret_access_key, std::string_view host, std::string_view method,
std::string_view orig_datestamp, std::string_view signed_headers_str, const std::map<std::string_view, std::string_view>& signed_headers_map,
std::string_view body_content, std::string_view region, std::string_view service, std::string_view query_string) {
const std::vector<temporary_buffer<char>>& body_content, std::string_view region, std::string_view service, std::string_view query_string) {
auto amz_date_it = signed_headers_map.find("x-amz-date");
if (amz_date_it == signed_headers_map.end()) {
throw api_error::invalid_signature("X-Amz-Date header is mandatory for signature verification");
@@ -129,8 +137,7 @@ future<std::string> get_key_from_roles(cql3::query_processor& qp, std::string us
auth::meta::roles_table::qualified_name, auth::meta::roles_table::role_col_name);
auto cl = auth::password_authenticator::consistency_for_user(username);
auto& timeout = auth::internal_distributed_timeout_config();
return qp.execute_internal(query, cl, timeout, {sstring(username)}, true).then_wrapped([username = std::move(username)] (future<::shared_ptr<cql3::untyped_result_set>> f) {
return qp.execute_internal(query, cl, auth::internal_distributed_query_state(), {sstring(username)}, true).then_wrapped([username = std::move(username)] (future<::shared_ptr<cql3::untyped_result_set>> f) {
auto res = f.get0();
auto salted_hash = std::optional<sstring>();
if (res->empty()) {

View File

@@ -39,7 +39,7 @@ using key_cache = utils::loading_cache<std::string, std::string>;
std::string get_signature(std::string_view access_key_id, std::string_view secret_access_key, std::string_view host, std::string_view method,
std::string_view orig_datestamp, std::string_view signed_headers_str, const std::map<std::string_view, std::string_view>& signed_headers_map,
std::string_view body_content, std::string_view region, std::string_view service, std::string_view query_string);
const std::vector<temporary_buffer<char>>& body_content, std::string_view region, std::string_view service, std::string_view query_string);
future<std::string> get_key_from_roles(cql3::query_processor& qp, std::string username);

View File

@@ -123,7 +123,7 @@ struct rjson_engaged_ptr_comp {
// as internally they're stored in an array, and the order of elements is
// not important in set equality. See issue #5021
static bool check_EQ_for_sets(const rjson::value& set1, const rjson::value& set2) {
if (set1.Size() != set2.Size()) {
if (!set1.IsArray() || !set2.IsArray() || set1.Size() != set2.Size()) {
return false;
}
std::set<const rjson::value*, rjson_engaged_ptr_comp> set1_raw;
@@ -137,45 +137,107 @@ static bool check_EQ_for_sets(const rjson::value& set1, const rjson::value& set2
}
return true;
}
// Moreover, the JSON being compared can be a nested document with outer
// layers of lists and maps and some inner set - and we need to get to that
// inner set to compare it correctly with check_EQ_for_sets() (issue #8514).
static bool check_EQ(const rjson::value* v1, const rjson::value& v2);
static bool check_EQ_for_lists(const rjson::value& list1, const rjson::value& list2) {
if (!list1.IsArray() || !list2.IsArray() || list1.Size() != list2.Size()) {
return false;
}
auto it1 = list1.Begin();
auto it2 = list2.Begin();
while (it1 != list1.End()) {
// Note: Alternator limits an item's depth (rjson::parse() limits
// it to around 37 levels), so this recursion is safe.
if (!check_EQ(&*it1, *it2)) {
return false;
}
++it1;
++it2;
}
return true;
}
static bool check_EQ_for_maps(const rjson::value& list1, const rjson::value& list2) {
if (!list1.IsObject() || !list2.IsObject() || list1.MemberCount() != list2.MemberCount()) {
return false;
}
for (auto it1 = list1.MemberBegin(); it1 != list1.MemberEnd(); ++it1) {
auto it2 = list2.FindMember(it1->name);
if (it2 == list2.MemberEnd() || !check_EQ(&it1->value, it2->value)) {
return false;
}
}
return true;
}
// Check if two JSON-encoded values match with the EQ relation
static bool check_EQ(const rjson::value* v1, const rjson::value& v2) {
if (!v1) {
return false;
}
if (v1->IsObject() && v1->MemberCount() == 1 && v2.IsObject() && v2.MemberCount() == 1) {
if (v1 && v1->IsObject() && v1->MemberCount() == 1 && v2.IsObject() && v2.MemberCount() == 1) {
auto it1 = v1->MemberBegin();
auto it2 = v2.MemberBegin();
if ((it1->name == "SS" && it2->name == "SS") || (it1->name == "NS" && it2->name == "NS") || (it1->name == "BS" && it2->name == "BS")) {
return check_EQ_for_sets(it1->value, it2->value);
if (it1->name != it2->name) {
return false;
}
if (it1->name == "SS" || it1->name == "NS" || it1->name == "BS") {
return check_EQ_for_sets(it1->value, it2->value);
} else if(it1->name == "L") {
return check_EQ_for_lists(it1->value, it2->value);
} else if(it1->name == "M") {
return check_EQ_for_maps(it1->value, it2->value);
} else {
// Other, non-nested types (number, string, etc.) can be compared
// literally, comparing their JSON representation.
return it1->value == it2->value;
}
} else {
// If v1 and/or v2 are missing (IsNull()) the result should be false.
// In the unlikely case that the object is malformed (issue #8070),
// let's also return false.
return false;
}
return *v1 == v2;
}
// Check if two JSON-encoded values match with the NE relation
static bool check_NE(const rjson::value* v1, const rjson::value& v2) {
return !v1 || *v1 != v2; // null is unequal to anything.
return !check_EQ(v1, v2);
}
// Check if two JSON-encoded values match with the BEGINS_WITH relation
static bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2) {
// BEGINS_WITH requires that its single operand (v2) be a string or
// binary - otherwise it's a validation error. However, problems with
// the stored attribute (v1) will just return false (no match).
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error::validation(format("BEGINS_WITH operator encountered malformed AttributeValue: {}", v2));
}
auto it2 = v2.MemberBegin();
if (it2->name != "S" && it2->name != "B") {
throw api_error::validation(format("BEGINS_WITH operator requires String or Binary type in AttributeValue, got {}", it2->name));
}
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (v1_from_query) {
throw api_error::validation("begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v1->MemberBegin()->name != "S" && v1->MemberBegin()->name != "B") {
if (v1_from_query) {
throw api_error::validation(format("begins_with supports only string or binary type, got: {}", *v1));
} else {
bad = true;
}
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
if (v2_from_query) {
throw api_error::validation("begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
if (v2_from_query) {
throw api_error::validation(format("begins_with() supports only string or binary type, got: {}", v2));
} else {
bad = true;
}
}
if (bad) {
return false;
}
auto it1 = v1->MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name != it2->name) {
return false;
}
@@ -279,24 +341,40 @@ static bool check_NOT_NULL(const rjson::value* val) {
return val != nullptr;
}
// Only types S, N or B (string, number or bytes) may be compared by the
// various comparion operators - lt, le, gt, ge, and between.
// Note that in particular, if the value is missing (v->IsNull()), this
// check returns false.
static bool check_comparable_type(const rjson::value& v) {
if (!v.IsObject() || v.MemberCount() != 1) {
return false;
}
const rjson::value& type = v.MemberBegin()->name;
return type == "S" || type == "N" || type == "B";
}
// Check if two JSON-encoded values match with cmp.
template <typename Comparator>
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp) {
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error::validation(
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !check_comparable_type(*v1)) {
if (v1_from_query) {
throw api_error::validation(format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
}
const auto& kv2 = *v2.MemberBegin();
if (kv2.name != "S" && kv2.name != "N" && kv2.name != "B") {
throw api_error::validation(
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
if (!check_comparable_type(v2)) {
if (v2_from_query) {
throw api_error::validation(format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
}
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (bad) {
return false;
}
const auto& kv1 = *v1->MemberBegin();
const auto& kv2 = *v2.MemberBegin();
if (kv1.name != kv2.name) {
return false;
}
@@ -310,7 +388,8 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
if (kv1.name == "B") {
return cmp(base64_decode(kv1.value), base64_decode(kv2.value));
}
clogger.error("check_compare panic: LHS type equals RHS type, but one is in {N,S,B} while the other isn't");
// cannot reach here, as check_comparable_type() verifies the type is one
// of the above options.
return false;
}
@@ -341,56 +420,71 @@ struct cmp_gt {
static constexpr const char* diagnostic = "GT operator";
};
// True if v is between lb and ub, inclusive. Throws if lb > ub.
// True if v is between lb and ub, inclusive. Throws or returns false
// (depending on bounds_from_query parameter) if lb > ub.
template <typename T>
static bool check_BETWEEN(const T& v, const T& lb, const T& ub) {
static bool check_BETWEEN(const T& v, const T& lb, const T& ub, bool bounds_from_query) {
if (cmp_lt()(ub, lb)) {
throw api_error::validation(
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
if (bounds_from_query) {
throw api_error::validation(
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
} else {
return false;
}
}
return cmp_ge()(v, lb) && cmp_le()(v, ub);
}
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub) {
if (!v) {
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub,
bool v_from_query, bool lb_from_query, bool ub_from_query) {
if ((v && v_from_query && !check_comparable_type(*v)) ||
(lb_from_query && !check_comparable_type(lb)) ||
(ub_from_query && !check_comparable_type(ub))) {
throw api_error::validation("between allow only the types String, Number, or Binary");
}
if (!v || !v->IsObject() || v->MemberCount() != 1 ||
!lb.IsObject() || lb.MemberCount() != 1 ||
!ub.IsObject() || ub.MemberCount() != 1) {
return false;
}
if (!v->IsObject() || v->MemberCount() != 1) {
throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", *v));
}
if (!lb.IsObject() || lb.MemberCount() != 1) {
throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", lb));
}
if (!ub.IsObject() || ub.MemberCount() != 1) {
throw api_error::validation(format("BETWEEN operator encountered malformed AttributeValue: {}", ub));
}
const auto& kv_v = *v->MemberBegin();
const auto& kv_lb = *lb.MemberBegin();
const auto& kv_ub = *ub.MemberBegin();
bool bounds_from_query = lb_from_query && ub_from_query;
if (kv_lb.name != kv_ub.name) {
throw api_error::validation(
if (bounds_from_query) {
throw api_error::validation(
format("BETWEEN operator requires the same type for lower and upper bound; instead got {} and {}",
kv_lb.name, kv_ub.name));
} else {
return false;
}
}
if (kv_v.name != kv_lb.name) { // Cannot compare different types, so v is NOT between lb and ub.
return false;
}
if (kv_v.name == "N") {
const char* diag = "BETWEEN operator";
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag));
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag), bounds_from_query);
}
if (kv_v.name == "S") {
return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),
std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()));
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()),
bounds_from_query);
}
if (kv_v.name == "B") {
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value));
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value), bounds_from_query);
}
throw api_error::validation(
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
if (v_from_query) {
throw api_error::validation(
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
kv_lb.name));
} else {
return false;
}
}
// Verify one Expect condition on one attribute (whose content is "got")
@@ -437,19 +531,19 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NE(got, (*attribute_value_list)[0]);
case comparison_operator_type::LT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_lt{});
return check_compare(got, (*attribute_value_list)[0], cmp_lt{}, false, true);
case comparison_operator_type::LE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_le{});
return check_compare(got, (*attribute_value_list)[0], cmp_le{}, false, true);
case comparison_operator_type::GT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_gt{});
return check_compare(got, (*attribute_value_list)[0], cmp_gt{}, false, true);
case comparison_operator_type::GE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_ge{});
return check_compare(got, (*attribute_value_list)[0], cmp_ge{}, false, true);
case comparison_operator_type::BEGINS_WITH:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_BEGINS_WITH(got, (*attribute_value_list)[0]);
return check_BEGINS_WITH(got, (*attribute_value_list)[0], false, true);
case comparison_operator_type::IN:
verify_operand_count(attribute_value_list, nonempty(), *comparison_operator);
return check_IN(got, *attribute_value_list);
@@ -461,7 +555,8 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NOT_NULL(got);
case comparison_operator_type::BETWEEN:
verify_operand_count(attribute_value_list, exact_size(2), *comparison_operator);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1]);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1],
false, true, true);
case comparison_operator_type::CONTAINS:
{
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
@@ -573,7 +668,8 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Wrong number of values {} in BETWEEN primitive_condition", cond._values.size()));
}
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2]);
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2],
cond._values[0].is_constant(), cond._values[1].is_constant(), cond._values[2].is_constant());
case parsed::primitive_condition::type::IN:
return check_IN(calculated_values);
case parsed::primitive_condition::type::VALUE:
@@ -604,13 +700,17 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
case parsed::primitive_condition::type::NE:
return check_NE(&calculated_values[0], calculated_values[1]);
case parsed::primitive_condition::type::GT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::GE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::LT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::LE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{},
cond._values[0].is_constant(), cond._values[1].is_constant());
default:
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Unknown type {} in primitive_condition object", (int)(cond._op)));

View File

@@ -52,6 +52,7 @@ bool verify_expected(const rjson::value& req, const rjson::value* previous_item)
bool verify_condition(const rjson::value& condition, bool require_all, const rjson::value* previous_item);
bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2);
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2, bool v1_from_query, bool v2_from_query);
bool verify_condition_expression(
const parsed::condition_expression& condition_expression,

View File

@@ -59,6 +59,9 @@ public:
static api_error invalid_signature(std::string msg) {
return api_error("InvalidSignatureException", std::move(msg));
}
static api_error missing_authentication_token(std::string msg) {
return api_error("MissingAuthenticationTokenException", std::move(msg));
}
static api_error unrecognized_client(std::string msg) {
return api_error("UnrecognizedClientException", std::move(msg));
}
@@ -77,6 +80,9 @@ public:
static api_error trimmed_data_access_exception(std::string msg) {
return api_error("TrimmedDataAccessException", std::move(msg));
}
static api_error request_limit_exceeded(std::string msg) {
return api_error("RequestLimitExceeded", std::move(msg));
}
static api_error internal(std::string msg) {
return api_error("InternalServerError", std::move(msg), reply::status_type::internal_server_error);
}

View File

@@ -55,7 +55,7 @@
#include "schema.hh"
#include "alternator/tags_extension.hh"
#include "alternator/rmw_operation.hh"
#include <seastar/core/coroutine.hh>
#include <boost/range/adaptors.hpp>
logging::logger elogger("alternator-executor");
@@ -202,7 +202,7 @@ static schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& r
if (!schema) {
// if we get here then the name was missing, since syntax or missing actual CF
// checks throw. Slow path, but just call get_table_name to generate exception.
get_table_name(request);
get_table_name(request);
}
return schema;
}
@@ -220,7 +220,7 @@ static std::tuple<bool, std::string_view, std::string_view> try_get_internal_tab
std::string_view ks_name = table_name.substr(0, delim);
table_name.remove_prefix(ks_name.size() + 1);
// Only internal keyspaces can be accessed to avoid leakage
if (!is_internal_keyspace(sstring(ks_name))) {
if (!is_internal_keyspace(ks_name)) {
return {false, "", ""};
}
return {true, ks_name, table_name};
@@ -476,8 +476,8 @@ future<executor::request_return_type> executor::delete_table(client_state& clien
return make_ready_future<request_return_type>(api_error::resource_not_found(
format("Requested resource not found: Table: {} not found", table_name)));
}
return _mm.announce_column_family_drop(keyspace_name, table_name, false, service::migration_manager::drop_views::yes).then([this, keyspace_name] {
return _mm.announce_keyspace_drop(keyspace_name, false);
return _mm.announce_column_family_drop(keyspace_name, table_name, service::migration_manager::drop_views::yes).then([this, keyspace_name] {
return _mm.announce_keyspace_drop(keyspace_name);
}).then([table_name = std::move(table_name)] {
// FIXME: need more attributes?
rjson::value table_description = rjson::empty_object();
@@ -704,52 +704,48 @@ static void update_tags_map(const rjson::value& tags, std::map<sstring, sstring>
static future<> update_tags(service::migration_manager& mm, schema_ptr schema, std::map<sstring, sstring>&& tags_map) {
schema_builder builder(schema);
builder.add_extension(tags_extension::NAME, ::make_shared<tags_extension>(std::move(tags_map)));
return mm.announce_column_family_update(builder.build(), false, std::vector<view_ptr>(), false);
return mm.announce_column_family_update(builder.build(), false, std::vector<view_ptr>());
}
future<executor::request_return_type> executor::tag_resource(client_state& client_state, service_permit permit, rjson::value request) {
_stats.api_operations.tag_resource++;
return seastar::async([this, &client_state, request = std::move(request)] () mutable -> request_return_type {
const rjson::value* arn = rjson::find(request, "ResourceArn");
if (!arn || !arn->IsString()) {
return api_error::access_denied("Incorrect resource identifier");
}
schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
const rjson::value* tags = rjson::find(request, "Tags");
if (!tags || !tags->IsArray()) {
return api_error::validation("Cannot parse tags");
}
if (tags->Size() < 1) {
return api_error::validation("The number of tags must be at least 1") ;
}
update_tags_map(*tags, tags_map, update_tags_action::add_tags);
update_tags(_mm, schema, std::move(tags_map)).get();
return json_string("");
});
const rjson::value* arn = rjson::find(request, "ResourceArn");
if (!arn || !arn->IsString()) {
co_return api_error::access_denied("Incorrect resource identifier");
}
schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
const rjson::value* tags = rjson::find(request, "Tags");
if (!tags || !tags->IsArray()) {
co_return api_error::validation("Cannot parse tags");
}
if (tags->Size() < 1) {
co_return api_error::validation("The number of tags must be at least 1") ;
}
update_tags_map(*tags, tags_map, update_tags_action::add_tags);
co_await update_tags(_mm, schema, std::move(tags_map));
co_return json_string("");
}
future<executor::request_return_type> executor::untag_resource(client_state& client_state, service_permit permit, rjson::value request) {
_stats.api_operations.untag_resource++;
return seastar::async([this, &client_state, request = std::move(request)] () -> request_return_type {
const rjson::value* arn = rjson::find(request, "ResourceArn");
if (!arn || !arn->IsString()) {
return api_error::access_denied("Incorrect resource identifier");
}
const rjson::value* tags = rjson::find(request, "TagKeys");
if (!tags || !tags->IsArray()) {
return api_error::validation(format("Cannot parse tag keys"));
}
const rjson::value* arn = rjson::find(request, "ResourceArn");
if (!arn || !arn->IsString()) {
co_return api_error::access_denied("Incorrect resource identifier");
}
const rjson::value* tags = rjson::find(request, "TagKeys");
if (!tags || !tags->IsArray()) {
co_return api_error::validation(format("Cannot parse tag keys"));
}
schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
update_tags_map(*tags, tags_map, update_tags_action::delete_tags);
update_tags(_mm, schema, std::move(tags_map)).get();
return json_string("");
});
std::map<sstring, sstring> tags_map = get_tags_of_table(schema);
update_tags_map(*tags, tags_map, update_tags_action::delete_tags);
co_await update_tags(_mm, schema, std::move(tags_map));
co_return json_string("");
}
future<executor::request_return_type> executor::list_tags_of_resource(client_state& client_state, service_permit permit, rjson::value request) {
@@ -985,7 +981,7 @@ future<executor::request_return_type> executor::create_table(client_state& clien
return create_keyspace(keyspace_name).handle_exception_type([] (exceptions::already_exists_exception&) {
// Ignore the fact that the keyspace may already exist. See discussion in #6340
}).then([this, table_name, request = std::move(request), schema, view_builders = std::move(view_builders), tags_map = std::move(tags_map)] () mutable {
return futurize_invoke([&] { return _mm.announce_new_column_family(schema, false); }).then([this, table_info = std::move(request), schema, view_builders = std::move(view_builders), tags_map = std::move(tags_map)] () mutable {
return futurize_invoke([&] { return _mm.announce_new_column_family(schema); }).then([this, table_info = std::move(request), schema, view_builders = std::move(view_builders), tags_map = std::move(tags_map)] () mutable {
return parallel_for_each(std::move(view_builders), [this, schema] (schema_builder builder) {
return _mm.announce_new_view(view_ptr(builder.build()));
}).then([this, table_info = std::move(table_info), schema, tags_map = std::move(tags_map)] () mutable {
@@ -1241,10 +1237,16 @@ mutation put_or_delete_item::build(schema_ptr schema, api::timestamp_type ts) co
return m;
}
// The DynamoDB API doesn't let the client control the server's timeout.
// Let's pick something reasonable:
// The DynamoDB API doesn't let the client control the server's timeout, so
// we have a global default_timeout() for Alternator requests. The value of
// default_timeout is overwritten by main.cc based on the
// "alternator_timeout_in_ms" configuration parameter.
db::timeout_clock::duration executor::s_default_timeout = 10s;
void executor::set_default_timeout(db::timeout_clock::duration timeout) {
s_default_timeout = timeout;
}
db::timeout_clock::time_point executor::default_timeout() {
return db::timeout_clock::now() + 10s;
return db::timeout_clock::now() + s_default_timeout;
}
static future<std::unique_ptr<rjson::value>> get_previous_item(
@@ -1880,18 +1882,182 @@ static std::string get_item_type_string(const rjson::value& v) {
return it->name.GetString();
}
// attrs_to_get saves for each top-level attribute an attrs_to_get_node,
// a hierarchy of subparts that need to be kept. The following function
// takes a given JSON value and drops its parts which weren't asked to be
// kept. It modifies the given JSON value, or returns false to signify that
// the entire object should be dropped.
// Note that The JSON value is assumed to be encoded using the DynamoDB
// conventions - i.e., it is really a map whose key has a type string,
// and the value is the real object.
template<typename T>
static bool hierarchy_filter(rjson::value& val, const attribute_path_map_node<T>& h) {
if (!val.IsObject() || val.MemberCount() != 1) {
// This shouldn't happen. We shouldn't have stored malformed objects.
// But today Alternator does not validate the structure of nested
// documents before storing them, so this can happen on read.
throw api_error::internal(format("Malformed value object read: {}", val));
}
const char* type = val.MemberBegin()->name.GetString();
rjson::value& v = val.MemberBegin()->value;
if (h.has_members()) {
const auto& members = h.get_members();
if (type[0] != 'M' || !v.IsObject()) {
// If v is not an object (dictionary, map), none of the members
// can match.
return false;
}
rjson::value newv = rjson::empty_object();
for (auto it = v.MemberBegin(); it != v.MemberEnd(); ++it) {
std::string attr = it->name.GetString();
auto x = members.find(attr);
if (x != members.end()) {
if (x->second) {
// Only a part of this attribute is to be filtered, do it.
if (hierarchy_filter(it->value, *x->second)) {
rjson::set_with_string_name(newv, attr, std::move(it->value));
}
} else {
// The entire attribute is to be kept
rjson::set_with_string_name(newv, attr, std::move(it->value));
}
}
}
if (newv.MemberCount() == 0) {
return false;
}
v = newv;
} else if (h.has_indexes()) {
const auto& indexes = h.get_indexes();
if (type[0] != 'L' || !v.IsArray()) {
return false;
}
rjson::value newv = rjson::empty_array();
const auto& a = v.GetArray();
for (unsigned i = 0; i < v.Size(); i++) {
auto x = indexes.find(i);
if (x != indexes.end()) {
if (x->second) {
if (hierarchy_filter(a[i], *x->second)) {
rjson::push_back(newv, std::move(a[i]));
}
} else {
// The entire attribute is to be kept
rjson::push_back(newv, std::move(a[i]));
}
}
}
if (newv.Size() == 0) {
return false;
}
v = newv;
}
return true;
}
// Add a path to a attribute_path_map. Throws a validation error if the path
// "overlaps" with one already in the filter (one is a sub-path of the other)
// or "conflicts" with it (both a member and index is requested).
template<typename T>
void attribute_path_map_add(const char* source, attribute_path_map<T>& map, const parsed::path& p, T value = {}) {
using node = attribute_path_map_node<T>;
// The first step is to look for the top-level attribute (p.root()):
auto it = map.find(p.root());
if (it == map.end()) {
if (p.has_operators()) {
it = map.emplace(p.root(), node {std::nullopt}).first;
} else {
(void) map.emplace(p.root(), node {std::move(value)}).first;
// Value inserted for top-level node. We're done.
return;
}
} else if(!p.has_operators()) {
// If p is top-level and we already have it or a part of it
// in map, it's a forbidden overlapping path.
throw api_error::validation(format(
"Invalid {}: two document paths overlap at {}", source, p.root()));
} else if (it->second.has_value()) {
// If we're here, it != map.end() && p.has_operators && it->second.has_value().
// This means the top-level attribute already has a value, and we're
// trying to add a non-top-level value. It's an overlap.
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p.root()));
}
node* h = &it->second;
// The second step is to walk h from the top-level node to the inner node
// where we're supposed to insert the value:
for (const auto& op : p.operators()) {
std::visit(overloaded_functor {
[&] (const std::string& member) {
if (h->is_empty()) {
*h = node {typename node::members_t()};
} else if (h->has_indexes()) {
throw api_error::validation(format("Invalid {}: two document paths conflict at {}", source, p));
} else if (h->has_value()) {
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
}
typename node::members_t& members = h->get_members();
auto it = members.find(member);
if (it == members.end()) {
it = members.insert({member, std::make_unique<node>()}).first;
}
h = it->second.get();
},
[&] (unsigned index) {
if (h->is_empty()) {
*h = node {typename node::indexes_t()};
} else if (h->has_members()) {
throw api_error::validation(format("Invalid {}: two document paths conflict at {}", source, p));
} else if (h->has_value()) {
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
}
typename node::indexes_t& indexes = h->get_indexes();
auto it = indexes.find(index);
if (it == indexes.end()) {
it = indexes.insert({index, std::make_unique<node>()}).first;
}
h = it->second.get();
}
}, op);
}
// Finally, insert the value in the node h.
if (h->is_empty()) {
*h = node {std::move(value)};
} else {
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
}
}
// A very simplified version of the above function for the special case of
// adding only top-level attribute. It's not only simpler, we also use a
// different error message, referring to a "duplicate attribute"instead of
// "overlapping paths". DynamoDB also has this distinction (errors in
// AttributesToGet refer to duplicates, not overlaps, but errors in
// ProjectionExpression refer to overlap - even if it's an exact duplicate).
template<typename T>
void attribute_path_map_add(const char* source, attribute_path_map<T>& map, const std::string& attr, T value = {}) {
using node = attribute_path_map_node<T>;
auto it = map.find(attr);
if (it == map.end()) {
map.emplace(attr, node {std::move(value)});
} else {
throw api_error::validation(format(
"Invalid {}: Duplicate attribute: {}", source, attr));
}
}
// calculate_attrs_to_get() takes either AttributesToGet or
// ProjectionExpression parameters (having both is *not* allowed),
// and returns the list of cells we need to read, or an empty set when
// *all* attributes are to be returned.
// In our current implementation, only top-level attributes are stored
// as cells, and nested documents are stored serialized as JSON.
// So this function currently returns only the the top-level attributes
// but we also need to add, after the query, filtering to keep only
// the parts of the JSON attributes that were chosen in the paths'
// operators. Because we don't have such filtering yet (FIXME), we fail here
// if the requested paths are anything but top-level attributes.
std::unordered_set<std::string> calculate_attrs_to_get(const rjson::value& req, std::unordered_set<std::string>& used_attribute_names) {
// However, in our current implementation, only top-level attributes are
// stored as separate cells - a nested document is stored serialized together
// (as JSON) in the same cell. So this function return a map - each key is the
// top-level attribute we will need need to read, and the value for each
// top-level attribute is the partial hierarchy (struct hierarchy_filter)
// that we will need to extract from that serialized JSON.
// For example, if ProjectionExpression lists a.b and a.c[2], we
// return one top-level attribute name, "a", with the value "{b, c[2]}".
static attrs_to_get calculate_attrs_to_get(const rjson::value& req, std::unordered_set<std::string>& used_attribute_names) {
const bool has_attributes_to_get = req.HasMember("AttributesToGet");
const bool has_projection_expression = req.HasMember("ProjectionExpression");
if (has_attributes_to_get && has_projection_expression) {
@@ -1900,9 +2066,9 @@ std::unordered_set<std::string> calculate_attrs_to_get(const rjson::value& req,
}
if (has_attributes_to_get) {
const rjson::value& attributes_to_get = req["AttributesToGet"];
std::unordered_set<std::string> ret;
attrs_to_get ret;
for (auto it = attributes_to_get.Begin(); it != attributes_to_get.End(); ++it) {
ret.insert(it->GetString());
attribute_path_map_add("AttributesToGet", ret, it->GetString());
}
return ret;
} else if (has_projection_expression) {
@@ -1915,24 +2081,13 @@ std::unordered_set<std::string> calculate_attrs_to_get(const rjson::value& req,
throw api_error::validation(e.what());
}
resolve_projection_expression(paths_to_get, expression_attribute_names, used_attribute_names);
std::unordered_set<std::string> seen_column_names;
auto ret = boost::copy_range<std::unordered_set<std::string>>(paths_to_get |
boost::adaptors::transformed([&] (const parsed::path& p) {
if (p.has_operators()) {
// FIXME: this check will need to change when we support non-toplevel attributes
throw api_error::validation("Non-toplevel attributes in ProjectionExpression not yet implemented");
}
if (!seen_column_names.insert(p.root()).second) {
// FIXME: this check will need to change when we support non-toplevel attributes
throw api_error::validation(
format("Invalid ProjectionExpression: two document paths overlap with each other: {} and {}.",
p.root(), p.root()));
}
return p.root();
}));
attrs_to_get ret;
for (const parsed::path& p : paths_to_get) {
attribute_path_map_add("ProjectionExpression", ret, p);
}
return ret;
}
// An empty set asks to read everything
// An empty map asks to read everything
return {};
}
@@ -1953,7 +2108,7 @@ std::unordered_set<std::string> calculate_attrs_to_get(const rjson::value& req,
*/
void executor::describe_single_item(const cql3::selection::selection& selection,
const std::vector<bytes_opt>& result_row,
const std::unordered_set<std::string>& attrs_to_get,
const attrs_to_get& attrs_to_get,
rjson::value& item,
bool include_all_embedded_attributes)
{
@@ -1974,7 +2129,16 @@ void executor::describe_single_item(const cql3::selection::selection& selection,
std::string attr_name = value_cast<sstring>(entry.first);
if (include_all_embedded_attributes || attrs_to_get.empty() || attrs_to_get.contains(attr_name)) {
bytes value = value_cast<bytes>(entry.second);
rjson::set_with_string_name(item, attr_name, deserialize_item(value));
rjson::value v = deserialize_item(value);
auto it = attrs_to_get.find(attr_name);
if (it != attrs_to_get.end()) {
// attrs_to_get may have asked for only part of this attribute:
if (hierarchy_filter(v, it->second)) {
rjson::set_with_string_name(item, attr_name, std::move(v));
}
} else {
rjson::set_with_string_name(item, attr_name, std::move(v));
}
}
}
}
@@ -1986,7 +2150,7 @@ std::optional<rjson::value> executor::describe_single_item(schema_ptr schema,
const query::partition_slice& slice,
const cql3::selection::selection& selection,
const query::result& query_result,
const std::unordered_set<std::string>& attrs_to_get) {
const attrs_to_get& attrs_to_get) {
rjson::value item = rjson::empty_object();
cql3::selection::result_set_builder builder(selection, gc_clock::now(), cql_serialization_format::latest());
@@ -2022,8 +2186,16 @@ static bool check_needs_read_before_write(const parsed::value& v) {
}, v._value);
}
static bool check_needs_read_before_write(const parsed::update_expression& update_expression) {
return boost::algorithm::any_of(update_expression.actions(), [](const parsed::update_expression::action& action) {
static bool check_needs_read_before_write(const attribute_path_map<parsed::update_expression::action>& update_expression) {
return boost::algorithm::any_of(update_expression, [](const auto& p) {
if (!p.second.has_value()) {
// If the action is not on the top-level attribute, we need to
// read the old item: we change only a part of the top-level
// attribute, and write the full top-level attribute back.
return true;
}
// Otherwise, the action p.second.get_value() is just on top-level
// attribute. Check if it needs read-before-write:
return std::visit(overloaded_functor {
[&] (const parsed::update_expression::action::set& a) -> bool {
return check_needs_read_before_write(a._rhs._v1) || (a._rhs._op != 'v' && check_needs_read_before_write(a._rhs._v2));
@@ -2037,7 +2209,7 @@ static bool check_needs_read_before_write(const parsed::update_expression& updat
[&] (const parsed::update_expression::action::del& a) -> bool {
return true;
}
}, action._action);
}, p.second.get_value()._action);
});
}
@@ -2046,7 +2218,11 @@ public:
// Some information parsed during the constructor to check for input
// errors, and cached to be used again during apply().
rjson::value* _attribute_updates;
parsed::update_expression _update_expression;
// Instead of keeping a parsed::update_expression with an unsorted list
// list of actions, we keep them in an attribute_path_map which groups
// them by top-level attribute, and detects forbidden overlaps/conflicts.
attribute_path_map<parsed::update_expression::action> _update_expression;
parsed::condition_expression _condition_expression;
update_item_operation(service::storage_proxy& proxy, rjson::value&& request);
@@ -2077,16 +2253,22 @@ update_item_operation::update_item_operation(service::storage_proxy& proxy, rjso
throw api_error::validation("UpdateExpression must be a string");
}
try {
_update_expression = parse_update_expression(update_expression->GetString());
resolve_update_expression(_update_expression,
parsed::update_expression expr = parse_update_expression(update_expression->GetString());
resolve_update_expression(expr,
expression_attribute_names, expression_attribute_values,
used_attribute_names, used_attribute_values);
if (expr.empty()) {
throw api_error::validation("Empty expression in UpdateExpression is not allowed");
}
for (auto& action : expr.actions()) {
// Unfortunately we need to copy the action's path, because
// we std::move the action object.
auto p = action._path;
attribute_path_map_add("UpdateExpression", _update_expression, p, std::move(action));
}
} catch(expressions_syntax_error& e) {
throw api_error::validation(e.what());
}
if (_update_expression.empty()) {
throw api_error::validation("Empty expression in UpdateExpression is not allowed");
}
}
_attribute_updates = rjson::find(_request, "AttributeUpdates");
if (_attribute_updates) {
@@ -2128,6 +2310,187 @@ update_item_operation::needs_read_before_write() const {
(_returnvalues != returnvalues::NONE && _returnvalues != returnvalues::UPDATED_NEW);
}
// action_result() returns the result of applying an UpdateItem action -
// this result is either a JSON object or an unset optional which indicates
// the action was a deletion. The caller (update_item_operation::apply()
// below) will either write this JSON as the content of a column, or
// use it as a piece in a bigger top-level attribute.
static std::optional<rjson::value> action_result(
const parsed::update_expression::action& action,
const rjson::value* previous_item) {
return std::visit(overloaded_functor {
[&] (const parsed::update_expression::action::set& a) -> std::optional<rjson::value> {
return calculate_value(a._rhs, previous_item);
},
[&] (const parsed::update_expression::action::remove& a) -> std::optional<rjson::value> {
return std::nullopt;
},
[&] (const parsed::update_expression::action::add& a) -> std::optional<rjson::value> {
parsed::value base;
parsed::value addition;
base.set_path(action._path);
addition.set_constant(a._valref);
rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item);
rjson::value v2 = calculate_value(addition, calculate_value_caller::UpdateExpression, previous_item);
rjson::value result;
// An ADD can be used to create a new attribute (when
// v1.IsNull()) or to add to a pre-existing attribute:
if (v1.IsNull()) {
std::string v2_type = get_item_type_string(v2);
if (v2_type == "N" || v2_type == "SS" || v2_type == "NS" || v2_type == "BS") {
result = v2;
} else {
throw api_error::validation(format("An operand in the update expression has an incorrect data type: {}", v2));
}
} else {
std::string v1_type = get_item_type_string(v1);
if (v1_type == "N") {
if (get_item_type_string(v2) != "N") {
throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = number_add(v1, v2);
} else if (v1_type == "SS" || v1_type == "NS" || v1_type == "BS") {
if (get_item_type_string(v2) != v1_type) {
throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = set_sum(v1, v2);
} else {
throw api_error::validation(format("An operand in the update expression has an incorrect data type: {}", v1));
}
}
return result;
},
[&] (const parsed::update_expression::action::del& a) -> std::optional<rjson::value> {
parsed::value base;
parsed::value subset;
base.set_path(action._path);
subset.set_constant(a._valref);
rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item);
rjson::value v2 = calculate_value(subset, calculate_value_caller::UpdateExpression, previous_item);
if (!v1.IsNull()) {
return set_diff(v1, v2);
}
// When we return nullopt here, we ask to *delete* this attribute,
// which is unnecessary because we know the attribute does not
// exist anyway. This is a waste, but a small one. Note that also
// for the "remove" action above we don't bother to check if the
// previous_item add anything to remove.
return std::nullopt;
}
}, action._action);
}
// Print an attribute_path_map_node<action> as the list of paths it contains:
static std::ostream& operator<<(std::ostream& out, const attribute_path_map_node<parsed::update_expression::action>& h) {
if (h.has_value()) {
out << " " << h.get_value()._path;
} else if (h.has_members()) {
for (auto& member : h.get_members()) {
out << *member.second;
}
} else if (h.has_indexes()) {
for (auto& index : h.get_indexes()) {
out << *index.second;
}
}
return out;
}
// Apply the hierarchy of actions in an attribute_path_map_node<action> to a
// JSON object which uses DynamoDB's serialization conventions. The complete,
// unmodified, previous_item is also necessary for the right-hand sides of the
// actions. Modifies obj in-place or returns false if it is to be removed.
static bool hierarchy_actions(
rjson::value& obj,
const attribute_path_map_node<parsed::update_expression::action>& h,
const rjson::value* previous_item)
{
if (!obj.IsObject() || obj.MemberCount() != 1) {
// This shouldn't happen. We shouldn't have stored malformed objects.
// But today Alternator does not validate the structure of nested
// documents before storing them, so this can happen on read.
throw api_error::validation(format("Malformed value object read: {}", obj));
}
const char* type = obj.MemberBegin()->name.GetString();
rjson::value& v = obj.MemberBegin()->value;
if (h.has_value()) {
// Action replacing everything in this position in the hierarchy
std::optional<rjson::value> newv = action_result(h.get_value(), previous_item);
if (newv) {
obj = std::move(*newv);
} else {
return false;
}
} else if (h.has_members()) {
if (type[0] != 'M' || !v.IsObject()) {
// A .something on a non-map doesn't work.
throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));
}
for (const auto& member : h.get_members()) {
std::string attr = member.first;
const attribute_path_map_node<parsed::update_expression::action>& subh = *member.second;
rjson::value *subobj = rjson::find(v, attr);
if (subobj) {
if (!hierarchy_actions(*subobj, subh, previous_item)) {
rjson::remove_member(v, attr);
}
} else {
// When a.b does not exist, setting a.b itself (i.e.
// subh.has_value()) is fine, but setting a.b.c is not.
if (subh.has_value()) {
std::optional<rjson::value> newv = action_result(subh.get_value(), previous_item);
if (newv) {
rjson::set_with_string_name(v, attr, std::move(*newv));
} else {
// Removing a.b when a is a map but a.b doesn't exist
// is silently ignored. It's not considered an error.
}
} else {
throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));
}
}
}
} else if (h.has_indexes()) {
if (type[0] != 'L' || !v.IsArray()) {
// A [i] on a non-list doesn't work.
throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));
}
unsigned nremoved = 0;
for (const auto& index : h.get_indexes()) {
unsigned i = index.first - nremoved;
const attribute_path_map_node<parsed::update_expression::action>& subh = *index.second;
if (i < v.Size()) {
if (!hierarchy_actions(v[i], subh, previous_item)) {
v.Erase(v.Begin() + i);
// If we have the actions "REMOVE a[1] SET a[3] = :val",
// the index 3 refers to the original indexes, before any
// items were removed. So we offset the next indexes
// (which are guaranteed to be higher than i - indexes is
// a sorted map) by an increased "nremoved".
nremoved++;
}
} else {
// If a[7] does not exist, setting a[7] itself (i.e.
// subh.has_value()) is fine - and appends an item, though
// not necessarily with index 7. But setting a[7].b will
// not work.
if (subh.has_value()) {
std::optional<rjson::value> newv = action_result(subh.get_value(), previous_item);
if (newv) {
rjson::push_back(v, std::move(*newv));
} else {
// Removing a[7] when the list has fewer elements is
// silently ignored. It's not considered an error.
}
} else {
throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));
}
}
}
}
return true;
}
std::optional<mutation>
update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const {
if (!verify_expected(_request, previous_item.get()) ||
@@ -2142,17 +2505,37 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
auto& row = m.partition().clustered_row(*_schema, _ck);
attribute_collector attrs_collector;
bool any_updates = false;
auto do_update = [&] (bytes&& column_name, const rjson::value& json_value) {
auto do_update = [&] (bytes&& column_name, const rjson::value& json_value,
const attribute_path_map_node<parsed::update_expression::action>* h = nullptr) {
any_updates = true;
if (_returnvalues == returnvalues::ALL_NEW ||
_returnvalues == returnvalues::UPDATED_NEW) {
rjson::set_with_string_name(_return_attributes,
to_sstring_view(column_name), rjson::copy(json_value));
if (_returnvalues == returnvalues::ALL_NEW) {
rjson::replace_with_string_name(_return_attributes,
to_sstring_view(column_name), rjson::copy(json_value));
} else if (_returnvalues == returnvalues::UPDATED_NEW) {
rjson::value&& v = rjson::copy(json_value);
if (h) {
// If the operation was only on specific attribute paths,
// leave only them in _return_attributes.
if (hierarchy_filter(v, *h)) {
rjson::set_with_string_name(_return_attributes,
to_sstring_view(column_name), std::move(v));
}
} else {
rjson::set_with_string_name(_return_attributes,
to_sstring_view(column_name), std::move(v));
}
} else if (_returnvalues == returnvalues::UPDATED_OLD && previous_item) {
std::string_view cn = to_sstring_view(column_name);
const rjson::value* col = rjson::find(*previous_item, cn);
if (col) {
rjson::set_with_string_name(_return_attributes, cn, rjson::copy(*col));
rjson::value&& v = rjson::copy(*col);
if (h) {
if (hierarchy_filter(v, *h)) {
rjson::set_with_string_name(_return_attributes, cn, std::move(v));
}
} else {
rjson::set_with_string_name(_return_attributes, cn, std::move(v));
}
}
}
const column_definition* cdef = _schema->get_column_definition(column_name);
@@ -2194,7 +2577,7 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
// can just move previous_item later, when we don't need it any more.
if (_returnvalues == returnvalues::ALL_NEW) {
if (previous_item) {
_return_attributes = std::move(*previous_item);
_return_attributes = rjson::copy(*previous_item);
} else {
// If there is no previous item, usually a new item is created
// and contains they given key. This may be cancelled at the end
@@ -2207,77 +2590,44 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
}
if (!_update_expression.empty()) {
std::unordered_set<std::string> seen_column_names;
for (auto& action : _update_expression.actions()) {
if (action._path.has_operators()) {
// FIXME: implement this case
throw api_error::validation("UpdateItem support for nested updates not yet implemented");
}
std::string column_name = action._path.root();
for (auto& actions : _update_expression) {
// The actions of _update_expression are grouped by top-level
// attributes. Here, all actions in actions.second share the same
// top-level attribute actions.first.
std::string column_name = actions.first;
const column_definition* cdef = _schema->get_column_definition(to_bytes(column_name));
if (cdef && cdef->is_primary_key()) {
throw api_error::validation(
format("UpdateItem cannot update key column {}", column_name));
throw api_error::validation(format("UpdateItem cannot update key column {}", column_name));
}
// DynamoDB forbids multiple updates in the same expression to
// modify overlapping document paths. Updates of one expression
// have the same timestamp, so it's unclear which would "win".
// FIXME: currently, without full support for document paths,
// we only check if the paths' roots are the same.
if (!seen_column_names.insert(column_name).second) {
throw api_error::validation(
format("Invalid UpdateExpression: two document paths overlap with each other: {} and {}.",
column_name, column_name));
}
std::visit(overloaded_functor {
[&] (const parsed::update_expression::action::set& a) {
auto value = calculate_value(a._rhs, previous_item.get());
do_update(to_bytes(column_name), value);
},
[&] (const parsed::update_expression::action::remove& a) {
if (actions.second.has_value()) {
// An action on a top-level attribute column_name. The single
// action is actions.second.get_value(). We can simply invoke
// the action and replace the attribute with its result:
std::optional<rjson::value> result = action_result(actions.second.get_value(), previous_item.get());
if (result) {
do_update(to_bytes(column_name), *result);
} else {
do_delete(to_bytes(column_name));
},
[&] (const parsed::update_expression::action::add& a) {
parsed::value base;
parsed::value addition;
base.set_path(action._path);
addition.set_constant(a._valref);
rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item.get());
rjson::value v2 = calculate_value(addition, calculate_value_caller::UpdateExpression, previous_item.get());
rjson::value result;
std::string v1_type = get_item_type_string(v1);
if (v1_type == "N") {
if (get_item_type_string(v2) != "N") {
throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = number_add(v1, v2);
} else if (v1_type == "SS" || v1_type == "NS" || v1_type == "BS") {
if (get_item_type_string(v2) != v1_type) {
throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = set_sum(v1, v2);
} else {
throw api_error::validation(format("An operand in the update expression has an incorrect data type: {}", v1));
}
do_update(to_bytes(column_name), result);
},
[&] (const parsed::update_expression::action::del& a) {
parsed::value base;
parsed::value subset;
base.set_path(action._path);
subset.set_constant(a._valref);
rjson::value v1 = calculate_value(base, calculate_value_caller::UpdateExpression, previous_item.get());
rjson::value v2 = calculate_value(subset, calculate_value_caller::UpdateExpression, previous_item.get());
if (!v1.IsNull()) {
std::optional<rjson::value> result = set_diff(v1, v2);
if (result) {
do_update(to_bytes(column_name), *result);
} else {
do_delete(to_bytes(column_name));
}
}
}
}, action._action);
} else {
// We have actions on a path or more than one path in the same
// top-level attribute column_name - but not on the top-level
// attribute as a whole. We already read the full top-level
// attribute (see check_needs_read_before_write()), and now we
// need to modify pieces of it and write back the entire
// top-level attribute.
if (!previous_item) {
throw api_error::validation(format("UpdateItem cannot update nested document path on non-existent item"));
}
const rjson::value *toplevel = rjson::find(*previous_item, column_name);
if (!toplevel) {
throw api_error::validation(format("UpdateItem cannot update document path: missing attribute {}",
column_name));
}
rjson::value result = rjson::copy(*toplevel);
hierarchy_actions(result, actions.second, previous_item.get());
do_update(to_bytes(column_name), std::move(result), &actions.second);
}
}
}
if (_returnvalues == returnvalues::ALL_OLD && previous_item) {
@@ -2395,7 +2745,7 @@ static rjson::value describe_item(schema_ptr schema,
const query::partition_slice& slice,
const cql3::selection::selection& selection,
const query::result& query_result,
const std::unordered_set<std::string>& attrs_to_get) {
const attrs_to_get& attrs_to_get) {
std::optional<rjson::value> opt_item = executor::describe_single_item(std::move(schema), slice, selection, std::move(query_result), attrs_to_get);
if (!opt_item) {
// If there is no matching item, we're supposed to return an empty
@@ -2467,7 +2817,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
struct table_requests {
schema_ptr schema;
db::consistency_level cl;
std::unordered_set<std::string> attrs_to_get;
::shared_ptr<const attrs_to_get> attrs_to_get;
struct single_request {
partition_key pk;
clustering_key ck;
@@ -2482,7 +2832,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
tracing::add_table_name(trace_state, sstring(executor::KEYSPACE_NAME_PREFIX) + rs.schema->cf_name(), rs.schema->cf_name());
rs.cl = get_read_consistency(it->value);
std::unordered_set<std::string> used_attribute_names;
rs.attrs_to_get = calculate_attrs_to_get(it->value, used_attribute_names);
rs.attrs_to_get = ::make_shared<const attrs_to_get>(calculate_attrs_to_get(it->value, used_attribute_names));
verify_all_are_used(request, "ExpressionAttributeNames", used_attribute_names, "GetItem");
auto& keys = (it->value)["Keys"];
for (const rjson::value& key : keys.GetArray()) {
@@ -2512,7 +2862,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
future<std::tuple<std::string, std::optional<rjson::value>>> f = _proxy.query(rs.schema, std::move(command), std::move(partition_ranges), rs.cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), permit, client_state, trace_state)).then(
[schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get] (service::storage_proxy::coordinator_query_result qr) mutable {
std::optional<rjson::value> json = describe_single_item(schema, partition_slice, *selection, *qr.query_result, std::move(attrs_to_get));
std::optional<rjson::value> json = describe_single_item(schema, partition_slice, *selection, *qr.query_result, *attrs_to_get);
return make_ready_future<std::tuple<std::string, std::optional<rjson::value>>>(
std::make_tuple(schema->cf_name(), std::move(json)));
});
@@ -2681,7 +3031,7 @@ void filter::for_filters_on(const noncopyable_function<void(std::string_view)>&
class describe_items_visitor {
typedef std::vector<const column_definition*> columns_t;
const columns_t& _columns;
const std::unordered_set<std::string>& _attrs_to_get;
const attrs_to_get& _attrs_to_get;
std::unordered_set<std::string> _extra_filter_attrs;
const filter& _filter;
typename columns_t::const_iterator _column_it;
@@ -2690,7 +3040,7 @@ class describe_items_visitor {
size_t _scanned_count;
public:
describe_items_visitor(const columns_t& columns, const std::unordered_set<std::string>& attrs_to_get, filter& filter)
describe_items_visitor(const columns_t& columns, const attrs_to_get& attrs_to_get, filter& filter)
: _columns(columns)
, _attrs_to_get(attrs_to_get)
, _filter(filter)
@@ -2739,6 +3089,12 @@ public:
std::string attr_name = value_cast<sstring>(entry.first);
if (_attrs_to_get.empty() || _attrs_to_get.contains(attr_name) || _extra_filter_attrs.contains(attr_name)) {
bytes value = value_cast<bytes>(entry.second);
// Even if _attrs_to_get asked to keep only a part of a
// top-level attribute, we keep the entire attribute
// at this stage, because the item filter might still
// need the other parts (it was easier for us to keep
// extra_filter_attrs at top-level granularity). We'll
// filter the unneeded parts after item filtering.
rjson::set_with_string_name(_item, attr_name, deserialize_item(value));
}
}
@@ -2749,11 +3105,24 @@ public:
void end_row() {
if (_filter.check(_item)) {
// As noted above, we kept entire top-level attributes listed in
// _attrs_to_get. We may need to only keep parts of them.
for (const auto& attr: _attrs_to_get) {
// If !attr.has_value() it means we were asked not to keep
// attr entirely, but just parts of it.
if (!attr.second.has_value()) {
rjson::value* toplevel= rjson::find(_item, attr.first);
if (toplevel && !hierarchy_filter(*toplevel, attr.second)) {
rjson::remove_member(_item, attr.first);
}
}
}
// Remove the extra attributes _extra_filter_attrs which we had
// to add just for the filter, and not requested to be returned:
for (const auto& attr : _extra_filter_attrs) {
rjson::remove_member(_item, attr);
}
rjson::push_back(_items, std::move(_item));
}
_item = rjson::empty_object();
@@ -2769,7 +3138,7 @@ public:
}
};
static rjson::value describe_items(schema_ptr schema, const query::partition_slice& slice, const cql3::selection::selection& selection, std::unique_ptr<cql3::result_set> result_set, std::unordered_set<std::string>&& attrs_to_get, filter&& filter) {
static rjson::value describe_items(schema_ptr schema, const query::partition_slice& slice, const cql3::selection::selection& selection, std::unique_ptr<cql3::result_set> result_set, attrs_to_get&& attrs_to_get, filter&& filter) {
describe_items_visitor visitor(selection.get_columns(), attrs_to_get, filter);
result_set->visit(visitor);
auto scanned_count = visitor.get_scanned_count();
@@ -2788,7 +3157,7 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
for (const column_definition& cdef : schema.partition_key_columns()) {
rjson::set_with_string_name(last_evaluated_key, std::string_view(cdef.name_as_text()), rjson::empty_object());
rjson::value& key_entry = last_evaluated_key[cdef.name_as_text()];
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), rjson::parse(to_json_string(*cdef.type, *exploded_pk_it)));
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), json_key_column_value(*exploded_pk_it, cdef));
++exploded_pk_it;
}
auto ck = paging_state.get_clustering_key();
@@ -2798,7 +3167,7 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
for (const column_definition& cdef : schema.clustering_key_columns()) {
rjson::set_with_string_name(last_evaluated_key, std::string_view(cdef.name_as_text()), rjson::empty_object());
rjson::value& key_entry = last_evaluated_key[cdef.name_as_text()];
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), rjson::parse(to_json_string(*cdef.type, *exploded_ck_it)));
rjson::set_with_string_name(key_entry, type_to_string(cdef.type), json_key_column_value(*exploded_ck_it, cdef));
++exploded_ck_it;
}
}
@@ -2810,7 +3179,7 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
const rjson::value* exclusive_start_key,
dht::partition_range_vector&& partition_ranges,
std::vector<query::clustering_range>&& ck_bounds,
std::unordered_set<std::string>&& attrs_to_get,
attrs_to_get&& attrs_to_get,
uint32_t limit,
db::consistency_level cl,
filter&& filter,
@@ -2845,12 +3214,12 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
auto query_state_ptr = std::make_unique<service::query_state>(client_state, trace_state, std::move(permit));
command->slice.options.set<query::partition_slice::option::allow_short_read>();
auto query_options = std::make_unique<cql3::query_options>(cl, infinite_timeout_config, std::vector<cql3::raw_value>{});
auto query_options = std::make_unique<cql3::query_options>(cl, std::vector<cql3::raw_value>{});
query_options = std::make_unique<cql3::query_options>(std::move(query_options), std::move(paging_state));
auto p = service::pager::query_pagers::pager(schema, selection, *query_state_ptr, *query_options, command, std::move(partition_ranges), nullptr);
return p->fetch_page(limit, gc_clock::now(), executor::default_timeout()).then(
[p, schema, cql_stats, partition_slice = std::move(partition_slice),
[p = std::move(p), schema, cql_stats, partition_slice = std::move(partition_slice),
selection = std::move(selection), query_state_ptr = std::move(query_state_ptr),
attrs_to_get = std::move(attrs_to_get),
query_options = std::move(query_options),
@@ -3536,26 +3905,10 @@ future<> executor::create_keyspace(std::string_view keyspace_name) {
}
auto opts = get_network_topology_options(rf);
auto ksm = keyspace_metadata::new_keyspace(keyspace_name_str, "org.apache.cassandra.locator.NetworkTopologyStrategy", std::move(opts), true);
return _mm.announce_new_keyspace(ksm, api::new_timestamp(), false);
return _mm.announce_new_keyspace(ksm, api::new_timestamp());
});
}
static tracing::trace_state_ptr create_tracing_session() {
tracing::trace_state_props_set props;
props.set<tracing::trace_state_props::full_tracing>();
return tracing::tracing::get_local_tracing_instance().create_session(tracing::trace_type::QUERY, props);
}
tracing::trace_state_ptr executor::maybe_trace_query(client_state& client_state, sstring_view op, sstring_view query) {
tracing::trace_state_ptr trace_state;
if (tracing::tracing::get_local_tracing_instance().trace_next_query()) {
trace_state = create_tracing_session();
tracing::add_query(trace_state, query);
tracing::begin(trace_state, format("Alternator {}", op), client_state.get_client_address());
}
return trace_state;
}
future<> executor::start() {
// Currently, nothing to do on initialization. We delay the keyspace
// creation (create_keyspace()) until a table is actually created.

View File

@@ -53,6 +53,10 @@ namespace service {
class storage_service;
}
namespace cdc {
class metadata;
}
namespace alternator {
class rmw_operation;
@@ -70,11 +74,77 @@ public:
std::string to_json() const override;
};
namespace parsed {
class path;
};
// An attribute_path_map object is used to hold data for various attributes
// paths (parsed::path) in a hierarchy of attribute paths. Each attribute path
// has a root attribute, and then modified by member and index operators -
// for example in "a.b[2].c" we have "a" as the root, then ".b" member, then
// "[2]" index, and finally ".c" member.
// Data can be added to an attribute_path_map using the add() function, but
// requires that attributes with data not be *overlapping* or *conflicting*:
//
// 1. Two attribute paths which are identical or an ancestor of one another
// are considered *overlapping* and not allowed. If a.b.c has data,
// we can't add more data in a.b.c or any of its descendants like a.b.c.d.
//
// 2. Two attribute paths which need the same parent to have both a member and
// an index are considered *conflicting* and not allowed. E.g., if a.b has
// data, you can't add a[1]. The meaning of adding both would be that the
// attribute a is both a map and an array, which isn't sensible.
//
// These two requirements are common to the two places where Alternator uses
// this abstraction to describe how a hierarchical item is to be transformed:
//
// 1. In ProjectExpression: for filtering from a full top-level attribute
// only the parts for which user asked in ProjectionExpression.
//
// 2. In UpdateExpression: for taking the previous value of a top-level
// attribute, and modifying it based on the instructions in the user
// wrote in UpdateExpression.
template<typename T>
class attribute_path_map_node {
public:
using data_t = T;
// We need the extra unique_ptr<> here because libstdc++ unordered_map
// doesn't work with incomplete types :-(
using members_t = std::unordered_map<std::string, std::unique_ptr<attribute_path_map_node<T>>>;
// The indexes list is sorted because DynamoDB requires handling writes
// beyond the end of a list in index order.
using indexes_t = std::map<unsigned, std::unique_ptr<attribute_path_map_node<T>>>;
// The prohibition on "overlap" and "conflict" explained above means
// That only one of data, members or indexes is non-empty.
std::optional<std::variant<data_t, members_t, indexes_t>> _content;
bool is_empty() const { return !_content; }
bool has_value() const { return _content && std::holds_alternative<data_t>(*_content); }
bool has_members() const { return _content && std::holds_alternative<members_t>(*_content); }
bool has_indexes() const { return _content && std::holds_alternative<indexes_t>(*_content); }
// get_members() assumes that has_members() is true
members_t& get_members() { return std::get<members_t>(*_content); }
const members_t& get_members() const { return std::get<members_t>(*_content); }
indexes_t& get_indexes() { return std::get<indexes_t>(*_content); }
const indexes_t& get_indexes() const { return std::get<indexes_t>(*_content); }
T& get_value() { return std::get<T>(*_content); }
const T& get_value() const { return std::get<T>(*_content); }
};
template<typename T>
using attribute_path_map = std::unordered_map<std::string, attribute_path_map_node<T>>;
using attrs_to_get_node = attribute_path_map_node<std::monostate>;
using attrs_to_get = attribute_path_map<std::monostate>;
class executor : public peering_sharded_service<executor> {
service::storage_proxy& _proxy;
service::migration_manager& _mm;
db::system_distributed_keyspace& _sdks;
service::storage_service& _ss;
cdc::metadata& _cdc_metadata;
// An smp_service_group to be used for limiting the concurrency when
// forwarding Alternator request between shards - if necessary for LWT.
smp_service_group _ssg;
@@ -87,8 +157,8 @@ public:
static constexpr auto KEYSPACE_NAME_PREFIX = "alternator_";
static constexpr std::string_view INTERNAL_TABLE_PREFIX = ".scylla.alternator.";
executor(service::storage_proxy& proxy, service::migration_manager& mm, db::system_distributed_keyspace& sdks, service::storage_service& ss, smp_service_group ssg)
: _proxy(proxy), _mm(mm), _sdks(sdks), _ss(ss), _ssg(ssg) {}
executor(service::storage_proxy& proxy, service::migration_manager& mm, db::system_distributed_keyspace& sdks, service::storage_service& ss, cdc::metadata& cdc_metadata, smp_service_group ssg)
: _proxy(proxy), _mm(mm), _sdks(sdks), _ss(ss), _cdc_metadata(cdc_metadata), _ssg(ssg) {}
future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
@@ -117,10 +187,12 @@ public:
future<> create_keyspace(std::string_view keyspace_name);
static tracing::trace_state_ptr maybe_trace_query(client_state& client_state, sstring_view op, sstring_view query);
static sstring table_name(const schema&);
static db::timeout_clock::time_point default_timeout();
static void set_default_timeout(db::timeout_clock::duration timeout);
private:
static db::timeout_clock::duration s_default_timeout;
public:
static schema_ptr find_table(service::storage_proxy&, const rjson::value& request);
private:
@@ -136,16 +208,14 @@ public:
const query::partition_slice&,
const cql3::selection::selection&,
const query::result&,
const std::unordered_set<std::string>&);
const attrs_to_get&);
static void describe_single_item(const cql3::selection::selection&,
const std::vector<bytes_opt>&,
const std::unordered_set<std::string>&,
const attrs_to_get&,
rjson::value&,
bool = false);
void add_stream_options(const rjson::value& stream_spec, schema_builder&) const;
void supplement_table_info(rjson::value& descr, const schema& schema) const;
void supplement_table_stream_info(rjson::value& descr, const schema& schema) const;

View File

@@ -130,6 +130,27 @@ void condition_expression::append(condition_expression&& a, char op) {
}, _expression);
}
void path::check_depth_limit() {
if (1 + _operators.size() > depth_limit) {
throw expressions_syntax_error(format("Document path exceeded {} nesting levels", depth_limit));
}
}
std::ostream& operator<<(std::ostream& os, const path& p) {
os << p.root();
for (const auto& op : p.operators()) {
std::visit(overloaded_functor {
[&] (const std::string& member) {
os << '.' << member;
},
[&] (unsigned index) {
os << '[' << index << ']';
}
}, op);
}
return os;
}
} // namespace parsed
// The following resolve_*() functions resolve references in parsed
@@ -151,10 +172,9 @@ void condition_expression::append(condition_expression&& a, char op) {
// we need to resolve the expression just once but then use it many times
// (once for each item to be filtered).
static void resolve_path(parsed::path& p,
static std::optional<std::string> resolve_path_component(const std::string& column_name,
const rjson::value* expression_attribute_names,
std::unordered_set<std::string>& used_attribute_names) {
const std::string& column_name = p.root();
if (column_name.size() > 0 && column_name.front() == '#') {
if (!expression_attribute_names) {
throw api_error::validation(
@@ -166,7 +186,30 @@ static void resolve_path(parsed::path& p,
format("ExpressionAttributeNames missing entry '{}' required by expression", column_name));
}
used_attribute_names.emplace(column_name);
p.set_root(std::string(rjson::to_string_view(*value)));
return std::string(rjson::to_string_view(*value));
}
return std::nullopt;
}
static void resolve_path(parsed::path& p,
const rjson::value* expression_attribute_names,
std::unordered_set<std::string>& used_attribute_names) {
std::optional<std::string> r = resolve_path_component(p.root(), expression_attribute_names, used_attribute_names);
if (r) {
p.set_root(std::move(*r));
}
for (auto& op : p.operators()) {
std::visit(overloaded_functor {
[&] (std::string& s) {
r = resolve_path_component(s, expression_attribute_names, used_attribute_names);
if (r) {
s = std::move(*r);
}
},
[&] (unsigned index) {
// nothing to resolve
}
}, op);
}
}
@@ -603,52 +646,8 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);
// TODO: There's duplication here with check_BEGINS_WITH().
// But unfortunately, the two functions differ a bit.
// If one of v1 or v2 is malformed or has an unsupported type
// (not B or S), what we do depends on whether it came from
// the user's query (is_constant()), or the item. Unsupported
// values in the query result in an error, but if they are in
// the item, we silently return false (no match).
bool bad = false;
if (!v1.IsObject() || v1.MemberCount() != 1) {
bad = true;
if (f._parameters[0].is_constant()) {
throw api_error::validation(format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v1));
}
} else if (v1.MemberBegin()->name != "S" && v1.MemberBegin()->name != "B") {
bad = true;
if (f._parameters[0].is_constant()) {
throw api_error::validation(format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v1));
}
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
bad = true;
if (f._parameters[1].is_constant()) {
throw api_error::validation(format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v2));
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
bad = true;
if (f._parameters[1].is_constant()) {
throw api_error::validation(format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v2));
}
}
bool ret = false;
if (!bad) {
auto it1 = v1.MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name == it2->name) {
if (it2->name == "S") {
std::string_view val1 = rjson::to_string_view(it1->value);
std::string_view val2 = rjson::to_string_view(it2->value);
ret = val1.starts_with(val2);
} else /* it2->name == "B" */ {
ret = base64_begins_with(rjson::to_string_view(it1->value), rjson::to_string_view(it2->value));
}
}
}
return to_bool_json(ret);
return to_bool_json(check_BEGINS_WITH(v1.IsNull() ? nullptr : &v1, v2,
f._parameters[0].is_constant(), f._parameters[1].is_constant()));
}
},
{"contains", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
@@ -667,6 +666,55 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
},
};
// Given a parsed::path and an item read from the table, extract the value
// of a certain attribute path, such as "a" or "a.b.c[3]". Returns a null
// value if the item or the requested attribute does not exist.
// Note that the item is assumed to be encoded in JSON using DynamoDB
// conventions - each level of a nested document is a map with one key -
// a type (e.g., "M" for map) - and its value is the representation of
// that value.
static rjson::value extract_path(const rjson::value* item,
const parsed::path& p, calculate_value_caller caller) {
if (!item) {
return rjson::null_value();
}
const rjson::value* v = rjson::find(*item, p.root());
if (!v) {
return rjson::null_value();
}
for (const auto& op : p.operators()) {
if (!v->IsObject() || v->MemberCount() != 1) {
// This shouldn't happen. We shouldn't have stored malformed
// objects. But today Alternator does not validate the structure
// of nested documents before storing them, so this can happen on
// read.
throw api_error::validation(format("{}: malformed item read: {}", *item));
}
const char* type = v->MemberBegin()->name.GetString();
v = &(v->MemberBegin()->value);
std::visit(overloaded_functor {
[&] (const std::string& member) {
if (type[0] == 'M' && v->IsObject()) {
v = rjson::find(*v, member);
} else {
v = nullptr;
}
},
[&] (unsigned index) {
if (type[0] == 'L' && v->IsArray() && index < v->Size()) {
v = &(v->GetArray()[index]);
} else {
v = nullptr;
}
}
}, op);
if (!v) {
return rjson::null_value();
}
}
return rjson::copy(*v);
}
// Given a parsed::value, which can refer either to a constant value from
// ExpressionAttributeValues, to the value of some attribute, or to a function
// of other values, this function calculates the resulting value.
@@ -684,21 +732,12 @@ rjson::value calculate_value(const parsed::value& v,
auto function_it = function_handlers.find(std::string_view(f._function_name));
if (function_it == function_handlers.end()) {
throw api_error::validation(
format("UpdateExpression: unknown function '{}' called.", f._function_name));
format("{}: unknown function '{}' called.", caller, f._function_name));
}
return function_it->second(caller, previous_item, f);
},
[&] (const parsed::path& p) -> rjson::value {
if (!previous_item) {
return rjson::null_value();
}
std::string update_path = p.root();
if (p.has_operators()) {
// FIXME: support this
throw api_error::validation("Reading attribute paths not yet implemented");
}
const rjson::value* previous_value = rjson::find(*previous_item, update_path);
return previous_value ? rjson::copy(*previous_value) : rjson::null_value();
return extract_path(previous_item, p, caller);
}
}, v._value);
}

View File

@@ -49,15 +49,23 @@ class path {
// dot (e.g., ".xyz").
std::string _root;
std::vector<std::variant<std::string, unsigned>> _operators;
// It is useful to limit the depth of a user-specified path, because is
// allows us to use recursive algorithms without worrying about recursion
// depth. DynamoDB officially limits the length of paths to 32 components
// (including the root) so let's use the same limit.
static constexpr unsigned depth_limit = 32;
void check_depth_limit();
public:
void set_root(std::string root) {
_root = std::move(root);
}
void add_index(unsigned i) {
_operators.emplace_back(i);
check_depth_limit();
}
void add_dot(std::string(name)) {
_operators.emplace_back(std::move(name));
check_depth_limit();
}
const std::string& root() const {
return _root;
@@ -65,6 +73,13 @@ public:
bool has_operators() const {
return !_operators.empty();
}
const std::vector<std::variant<std::string, unsigned>>& operators() const {
return _operators;
}
std::vector<std::variant<std::string, unsigned>>& operators() {
return _operators;
}
friend std::ostream& operator<<(std::ostream&, const path&);
};
// When an expression is first parsed, all constants are references, like

View File

@@ -22,6 +22,8 @@
#include "alternator/server.hh"
#include "log.hh"
#include <seastar/http/function_handlers.hh>
#include <seastar/http/short_streams.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/json/json_elements.hh>
#include "seastarx.hh"
#include "error.hh"
@@ -59,6 +61,40 @@ inline std::vector<std::string_view> split(std::string_view text, char separator
return tokens;
}
// Handle CORS (Cross-origin resource sharing) in the HTTP request:
// If the request has the "Origin" header specifying where the script which
// makes this request comes from, we need to reply with the header
// "Access-Control-Allow-Origin: *" saying that this (and any) origin is fine.
// Additionally, if preflight==true (i.e., this is an OPTIONS request),
// the script can also "request" in headers that the server allows it to use
// some HTTP methods and headers in the followup request, and the server
// should respond by "allowing" them in the response headers.
// We also add the header "Access-Control-Expose-Headers" to let the script
// access additional headers in the response.
// This handle_CORS() should be used when handling any HTTP method - both the
// usual GET and POST, and also the "preflight" OPTIONS method.
static void handle_CORS(const request& req, reply& rep, bool preflight) {
if (!req.get_header("origin").empty()) {
rep.add_header("Access-Control-Allow-Origin", "*");
// This is the list that DynamoDB returns for expose headers. I am
// not sure why not just return "*" here, what's the risk?
rep.add_header("Access-Control-Expose-Headers", "x-amzn-RequestId,x-amzn-ErrorType,x-amzn-ErrorMessage,Date");
if (preflight) {
sstring s = req.get_header("Access-Control-Request-Headers");
if (!s.empty()) {
rep.add_header("Access-Control-Allow-Headers", std::move(s));
}
s = req.get_header("Access-Control-Request-Method");
if (!s.empty()) {
rep.add_header("Access-Control-Allow-Methods", std::move(s));
}
// Our CORS response never change anyway, let the browser cache it
// for two hours (Chrome's maximum):
rep.add_header("Access-Control-Max-Age", "7200");
}
}
}
// DynamoDB HTTP error responses are structured as follows
// https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html
// Our handlers throw an exception to report an error. If the exception
@@ -93,6 +129,10 @@ public:
[&] (const json::json_return_type& json_return_value) {
slogger.trace("api_handler success case");
if (json_return_value._body_writer) {
// Unfortunately, write_body() forces us to choose
// from a fixed and irrelevant list of "mime-types"
// at this point. But we'll override it with the
// one (application/x-amz-json-1.0) below.
rep->write_body("json", std::move(json_return_value._body_writer));
} else {
rep->_content += json_return_value._res;
@@ -105,14 +145,16 @@ public:
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
});
}), _type("json") { }
}) { }
api_handler(const api_handler&) = default;
future<std::unique_ptr<reply>> handle(const sstring& path,
std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
handle_CORS(*req, *rep, false);
return _f_handle(std::move(req), std::move(rep)).then(
[this](std::unique_ptr<reply> rep) {
rep->done(_type);
rep->set_mime_type("application/x-amz-json-1.0");
rep->done();
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
});
}
@@ -126,7 +168,6 @@ protected:
}
future_handler_function _f_handle;
sstring _type;
};
class gated_handler : public handler_base {
@@ -146,6 +187,7 @@ public:
health_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}
protected:
virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
handle_CORS(*req, *rep, false);
rep->set_status(reply::status_type::ok);
rep->write_body("txt", format("healthy: {}", req->get_header("Host")));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
@@ -178,7 +220,23 @@ protected:
}
};
future<> server::verify_signature(const request& req) {
// The CORS (Cross-origin resource sharing) protocol can send an OPTIONS
// request before ("pre-flight") the main request. The response to this
// request can be empty, but needs to have the right headers (which we
// fill with handle_CORS())
class options_handler : public gated_handler {
public:
options_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}
protected:
virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
handle_CORS(*req, *rep, true);
rep->set_status(reply::status_type::ok);
rep->write_body("txt", sstring(""));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
}
};
future<> server::verify_signature(const request& req, const chunked_content& content) {
if (!_enforce_authorization) {
slogger.debug("Skipping authorization");
return make_ready_future<>();
@@ -189,27 +247,34 @@ future<> server::verify_signature(const request& req) {
}
auto authorization_it = req._headers.find("Authorization");
if (authorization_it == req._headers.end()) {
throw api_error::invalid_signature("Authorization header is mandatory for signature verification");
throw api_error::missing_authentication_token("Authorization header is mandatory for signature verification");
}
std::string host = host_it->second;
std::vector<std::string_view> credentials_raw = split(authorization_it->second, ' ');
std::string_view authorization_header = authorization_it->second;
auto pos = authorization_header.find_first_of(' ');
if (pos == std::string_view::npos || authorization_header.substr(0, pos) != "AWS4-HMAC-SHA256") {
throw api_error::invalid_signature(format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header));
}
authorization_header.remove_prefix(pos+1);
std::string credential;
std::string user_signature;
std::string signed_headers_str;
std::vector<std::string_view> signed_headers;
for (std::string_view entry : credentials_raw) {
do {
// Either one of a comma or space can mark the end of an entry
pos = authorization_header.find_first_of(" ,");
std::string_view entry = authorization_header.substr(0, pos);
if (pos != std::string_view::npos) {
authorization_header.remove_prefix(pos + 1);
}
if (entry.empty()) {
continue;
}
std::vector<std::string_view> entry_split = split(entry, '=');
if (entry_split.size() != 2) {
if (entry != "AWS4-HMAC-SHA256") {
throw api_error::invalid_signature(format("Only AWS4-HMAC-SHA256 algorithm is supported. Found: {}", entry));
}
continue;
}
std::string_view auth_value = entry_split[1];
// Commas appear as an additional (quite redundant) delimiter
if (auth_value.back() == ',') {
auth_value.remove_suffix(1);
}
if (entry_split[0] == "Credential") {
credential = std::string(auth_value);
} else if (entry_split[0] == "Signature") {
@@ -219,7 +284,8 @@ future<> server::verify_signature(const request& req) {
signed_headers = split(auth_value, ';');
std::sort(signed_headers.begin(), signed_headers.end());
}
}
} while (pos != std::string_view::npos);
std::vector<std::string_view> credential_split = split(credential, '/');
if (credential_split.size() != 5) {
throw api_error::validation(format("Incorrect credential information format: {}", credential));
@@ -246,7 +312,7 @@ future<> server::verify_signature(const request& req) {
auto cache_getter = [&qp = _qp] (std::string username) {
return get_key_from_roles(qp, std::move(username));
};
return _key_cache.get_ptr(user, cache_getter).then([this, &req,
return _key_cache.get_ptr(user, cache_getter).then([this, &req, &content,
user = std::move(user),
host = std::move(host),
datestamp = std::move(datestamp),
@@ -256,7 +322,7 @@ future<> server::verify_signature(const request& req) {
service = std::move(service),
user_signature = std::move(user_signature)] (key_cache::value_ptr key_ptr) {
std::string signature = get_signature(user, *key_ptr, std::string_view(host), req._method,
datestamp, signed_headers_str, signed_headers_map, req.content, region, service, "");
datestamp, signed_headers_str, signed_headers_map, content, region, service, "");
if (signature != std::string_view(user_signature)) {
_key_cache.remove(user);
@@ -265,43 +331,91 @@ future<> server::verify_signature(const request& req) {
});
}
future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request>&& req) {
static tracing::trace_state_ptr create_tracing_session(tracing::tracing& tracing_instance) {
tracing::trace_state_props_set props;
props.set<tracing::trace_state_props::full_tracing>();
props.set_if<tracing::trace_state_props::log_slow_query>(tracing_instance.slow_query_tracing_enabled());
return tracing_instance.create_session(tracing::trace_type::QUERY, props);
}
// truncated_content_view() prints a potentially long chunked_content for
// debugging purposes. In the common case when the content is not excessively
// long, it just returns a view into the given content, without any copying.
// But when the content is very long, it is truncated after some arbitrary
// max_len (or one chunk, whichever comes first), with "<truncated>" added at
// the end. To do this modification to the string, we need to create a new
// std::string, so the caller must pass us a reference to one, "buf", where
// we can store the content. The returned view is only alive for as long this
// buf is kept alive.
static std::string_view truncated_content_view(const chunked_content& content, std::string& buf) {
constexpr size_t max_len = 1024;
if (content.empty()) {
return std::string_view();
} else if (content.size() == 1 && content.begin()->size() <= max_len) {
return std::string_view(content.begin()->get(), content.begin()->size());
} else {
buf = std::string(content.begin()->get(), std::min(content.begin()->size(), max_len)) + "<truncated>";
return std::string_view(buf);
}
}
static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_state, sstring_view op, const chunked_content& query) {
tracing::trace_state_ptr trace_state;
tracing::tracing& tracing_instance = tracing::tracing::get_local_tracing_instance();
if (tracing_instance.trace_next_query() || tracing_instance.slow_query_tracing_enabled()) {
trace_state = create_tracing_session(tracing_instance);
std::string buf;
tracing::add_session_param(trace_state, "alternator_op", op);
tracing::add_query(trace_state, truncated_content_view(query, buf));
tracing::begin(trace_state, format("Alternator {}", op), client_state.get_client_address());
}
return trace_state;
}
future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request> req) {
_executor._stats.total_operations++;
sstring target = req->get_header(TARGET);
std::vector<std::string_view> split_target = split(target, '.');
//NOTICE(sarna): Target consists of Dynamo API version followed by a dot '.' and operation type (e.g. CreateTable)
std::string op = split_target.empty() ? std::string() : std::string(split_target.back());
slogger.trace("Request: {} {} {}", op, req->content, req->_headers);
return verify_signature(*req).then([this, op, req = std::move(req)] () mutable {
auto callback_it = _callbacks.find(op);
if (callback_it == _callbacks.end()) {
_executor._stats.unsupported_operations++;
throw api_error::unknown_operation(format("Unsupported operation {}", op));
}
return with_gate(_pending_requests, [this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] () mutable {
//FIXME: Client state can provide more context, e.g. client's endpoint address
// We use unique_ptr because client_state cannot be moved or copied
return do_with(std::make_unique<executor::client_state>(executor::client_state::internal_tag()),
[this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] (std::unique_ptr<executor::client_state>& client_state) mutable {
tracing::trace_state_ptr trace_state = executor::maybe_trace_query(*client_state, op, req->content);
tracing::trace(trace_state, op);
// JSON parsing can allocate up to roughly 2x the size of the raw document, + a couple of bytes for maintenance.
// FIXME: by this time, the whole HTTP request was already read, so some memory is already occupied.
// Once HTTP allows working on streams, we should grab the permit *before* reading the HTTP payload.
size_t mem_estimate = req->content.size() * 3 + 8000;
auto units_fut = get_units(*_memory_limiter, mem_estimate);
if (_memory_limiter->waiters()) {
++_executor._stats.requests_blocked_memory;
}
return units_fut.then([this, callback_it = std::move(callback_it), &client_state, trace_state, req = std::move(req)] (semaphore_units<> units) mutable {
return _json_parser.parse(req->content).then([this, callback_it = std::move(callback_it), &client_state, trace_state,
units = std::move(units), req = std::move(req)] (rjson::value json_request) mutable {
return callback_it->second(_executor, *client_state, trace_state, make_service_permit(std::move(units)), std::move(json_request), std::move(req)).finally([trace_state] {});
});
});
});
});
});
// JSON parsing can allocate up to roughly 2x the size of the raw
// document, + a couple of bytes for maintenance.
// TODO: consider the case where req->content_length is missing. Maybe
// we need to take the content_length_limit and return some of the units
// when we finish read_content_and_verify_signature?
size_t mem_estimate = req->content_length * 2 + 8000;
auto units_fut = get_units(*_memory_limiter, mem_estimate);
if (_memory_limiter->waiters()) {
++_executor._stats.requests_blocked_memory;
}
auto units = co_await std::move(units_fut);
assert(req->content_stream);
chunked_content content = co_await httpd::read_entire_stream(*req->content_stream);
co_await verify_signature(*req, content);
if (slogger.is_enabled(log_level::trace)) {
std::string buf;
slogger.trace("Request: {} {} {}", op, truncated_content_view(content, buf), req->_headers);
}
auto callback_it = _callbacks.find(op);
if (callback_it == _callbacks.end()) {
_executor._stats.unsupported_operations++;
co_return api_error::unknown_operation(format("Unsupported operation {}", op));
}
if (_pending_requests.get_count() >= _max_concurrent_requests) {
_executor._stats.requests_shed++;
co_return api_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}", _pending_requests.get_count()));
}
_pending_requests.enter();
auto leave = defer([this] { _pending_requests.leave(); });
//FIXME: Client state can provide more context, e.g. client's endpoint address
// We use unique_ptr because client_state cannot be moved or copied
executor::client_state client_state{executor::client_state::internal_tag()};
tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, op, content);
tracing::trace(trace_state, op);
rjson::value json_request = co_await _json_parser.parse(std::move(content));
co_return co_await callback_it->second(_executor, client_state, trace_state,
make_service_permit(std::move(units)), std::move(json_request), std::move(req));
}
void server::set_routes(routes& r) {
@@ -323,6 +437,7 @@ void server::set_routes(routes& r) {
// scan an entire subnet for nodes responding to the health request,
// or even just scan for open ports.
r.put(operation_type::GET, "/localnodes", new local_nodelist_handler(_pending_requests));
r.put(operation_type::OPTIONS, "/", new options_handler(_pending_requests));
}
//FIXME: A way to immediately invalidate the cache should be considered,
@@ -405,9 +520,10 @@ server::server(executor& exec, cql3::query_processor& qp)
}
future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
bool enforce_authorization, semaphore* memory_limiter) {
bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
_memory_limiter = memory_limiter;
_enforce_authorization = enforce_authorization;
_max_concurrent_requests = std::move(max_concurrent_requests);
if (!port && !https_port) {
return make_exception_future<>(std::runtime_error("Either regular port or TLS port"
" must be specified in order to init an alternator HTTP server instance"));
@@ -419,12 +535,14 @@ future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std:
if (port) {
set_routes(_http_server._routes);
_http_server.set_content_length_limit(server::content_length_limit);
_http_server.set_content_streaming(true);
_http_server.listen(socket_address{addr, *port}).get();
_enabled_servers.push_back(std::ref(_http_server));
}
if (https_port) {
set_routes(_https_server._routes);
_https_server.set_content_length_limit(server::content_length_limit);
_https_server.set_content_streaming(true);
_https_server.set_tls_credentials(creds->build_reloadable_server_credentials([](const std::unordered_set<sstring>& files, std::exception_ptr ep) {
if (ep) {
slogger.warn("Exception loading {}: {}", files, ep);
@@ -462,7 +580,7 @@ server::json_parser::json_parser() : _run_parse_json_thread(async([this] {
return;
}
try {
_parsed_document = rjson::parse_yieldable(_raw_document);
_parsed_document = rjson::parse_yieldable(std::move(_raw_document));
_current_exception = nullptr;
} catch (...) {
_current_exception = std::current_exception();
@@ -472,12 +590,12 @@ server::json_parser::json_parser() : _run_parse_json_thread(async([this] {
})) {
}
future<rjson::value> server::json_parser::parse(std::string_view content) {
future<rjson::value> server::json_parser::parse(chunked_content&& content) {
if (content.size() < yieldable_parsing_threshold) {
return make_ready_future<rjson::value>(rjson::parse(content));
return make_ready_future<rjson::value>(rjson::parse(std::move(content)));
}
return with_semaphore(_parsing_sem, 1, [this, content] {
_raw_document = content;
return with_semaphore(_parsing_sem, 1, [this, content = std::move(content)] () mutable {
_raw_document = std::move(content);
_document_waiting.signal();
return _document_parsed.wait().then([this] {
if (_current_exception) {

View File

@@ -28,10 +28,13 @@
#include <optional>
#include "alternator/auth.hh"
#include "utils/small_vector.hh"
#include "utils/updateable_value.hh"
#include <seastar/core/units.hh>
namespace alternator {
using chunked_content = rjson::chunked_content;
class server {
static constexpr size_t content_length_limit = 16*MB;
using alternator_callback = std::function<future<executor::request_return_type>(executor&, executor::client_state&,
@@ -50,10 +53,11 @@ class server {
alternator_callbacks_map _callbacks;
semaphore* _memory_limiter;
utils::updateable_value<uint32_t> _max_concurrent_requests;
class json_parser {
static constexpr size_t yieldable_parsing_threshold = 16*KB;
std::string_view _raw_document;
chunked_content _raw_document;
rjson::value _parsed_document;
std::exception_ptr _current_exception;
semaphore _parsing_sem{1};
@@ -63,7 +67,10 @@ class server {
future<> _run_parse_json_thread;
public:
json_parser();
future<rjson::value> parse(std::string_view content);
// Moving a chunked_content into parse() allows parse() to free each
// chunk as soon as it is parsed, so when chunks are relatively small,
// we don't need to store the sum of unparsed and parsed sizes.
future<rjson::value> parse(chunked_content&& content);
future<> stop();
};
json_parser _json_parser;
@@ -72,12 +79,12 @@ public:
server(executor& executor, cql3::query_processor& qp);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
bool enforce_authorization, semaphore* memory_limiter);
bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
future<> stop();
private:
void set_routes(seastar::httpd::routes& r);
future<> verify_signature(const seastar::httpd::request& r);
future<executor::request_return_type> handle_api_request(std::unique_ptr<request>&& req);
future<> verify_signature(const seastar::httpd::request&, const chunked_content&);
future<executor::request_return_type> handle_api_request(std::unique_ptr<request> req);
};
}

View File

@@ -38,6 +38,7 @@ stats::stats() : api_operations{} {
#define OPERATION_LATENCY(name, CamelCaseName) \
seastar::metrics::make_histogram("op_latency", \
seastar::metrics::description("Latency histogram of an operation via Alternator API"), {op(CamelCaseName)}, [this]{return to_metrics_histogram(api_operations.name);}),
OPERATION(batch_get_item, "BatchGetItem")
OPERATION(batch_write_item, "BatchWriteItem")
OPERATION(create_backup, "CreateBackup")
OPERATION(create_global_table, "CreateGlobalTable")
@@ -96,6 +97,8 @@ stats::stats() : api_operations{} {
seastar::metrics::description("number writes that had to be bounced from this shard because of LWT requirements")),
seastar::metrics::make_total_operations("requests_blocked_memory", requests_blocked_memory,
seastar::metrics::description("Counts a number of requests blocked due to memory pressure.")),
seastar::metrics::make_total_operations("requests_shed", requests_shed,
seastar::metrics::description("Counts a number of requests shed due to overload.")),
seastar::metrics::make_total_operations("filtered_rows_read_total", cql_stats.filtered_rows_read_total,
seastar::metrics::description("number of rows read during filtering operations")),
seastar::metrics::make_total_operations("filtered_rows_matched_total", cql_stats.filtered_rows_matched_total,

View File

@@ -92,6 +92,7 @@ public:
uint64_t write_using_lwt = 0;
uint64_t shard_bounce_for_lwt = 0;
uint64_t requests_blocked_memory = 0;
uint64_t requests_shed = 0;
// CQL-derived stats
cql3::cql_stats cql_stats;
private:

View File

@@ -34,6 +34,7 @@
#include "cdc/log.hh"
#include "cdc/generation.hh"
#include "cdc/cdc_options.hh"
#include "cdc/metadata.hh"
#include "db/system_distributed_keyspace.hh"
#include "utils/UUID_gen.hh"
#include "cql3/selection/selection.hh"
@@ -470,8 +471,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
auto status = "DISABLED";
if (opts.enabled()) {
auto& metadata = _ss.get_cdc_metadata();
if (!metadata.streams_available()) {
if (!_cdc_metadata.streams_available()) {
status = "ENABLING";
} else {
status = "ENABLED";
@@ -499,19 +499,11 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
// TODO: creation time
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
// cannot really "resume" query, must iterate all data. because we cannot query neither "time" (pk) > something,
// or on expired...
// TODO: maybe add secondary index to topology table to enable this?
return _sdks.cdc_get_versioned_streams({ normal_token_owners }).then([this, &db, schema, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc), ttl](std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {
// filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)
auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);
// filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)
auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);
auto i = topologies.lower_bound(low_ts);
// need first gen _intersecting_ the timestamp.
if (i != topologies.begin()) {
i = std::prev(i);
}
return _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners }).then([this, &db, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc)] (std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {
auto e = topologies.end();
auto prev = e;
@@ -519,9 +511,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
std::optional<shard_id> last;
// i is now at the youngest generation we include. make a mark of it.
auto first = i;
auto i = topologies.begin();
// if we're a paged query, skip to the generation where we left of.
if (shard_start) {
i = topologies.find(shard_start->time);
@@ -547,7 +537,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
};
// need a prev even if we are skipping stuff
if (i != first) {
if (i != topologies.begin()) {
prev = std::prev(i);
}
@@ -855,16 +845,18 @@ future<executor::request_return_type> executor::get_records(client_state& client
static const bytes op_column_name = cdc::log_meta_column_name_bytes("operation");
static const bytes eor_column_name = cdc::log_meta_column_name_bytes("end_of_batch");
auto key_names = boost::copy_range<std::unordered_set<std::string>>(
auto key_names = boost::copy_range<attrs_to_get>(
boost::range::join(std::move(base->partition_key_columns()), std::move(base->clustering_key_columns()))
| boost::adaptors::transformed([&] (const column_definition& cdef) { return cdef.name_as_text(); })
| boost::adaptors::transformed([&] (const column_definition& cdef) {
return std::make_pair<std::string, attrs_to_get_node>(cdef.name_as_text(), {}); })
);
// Include all base table columns as values (in case pre or post is enabled).
// This will include attributes not stored in the frozen map column
auto attr_names = boost::copy_range<std::unordered_set<std::string>>(base->regular_columns()
auto attr_names = boost::copy_range<attrs_to_get>(base->regular_columns()
// this will include the :attrs column, which we will also force evaluating.
// But not having this set empty forces out any cdc columns from actual result
| boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.name_as_text(); })
| boost::adaptors::transformed([] (const column_definition& cdef) {
return std::make_pair<std::string, attrs_to_get_node>(cdef.name_as_text(), {}); })
);
std::vector<const column_definition*> columns;
@@ -1028,7 +1020,9 @@ future<executor::request_return_type> executor::get_records(client_state& client
}
// ugh. figure out if we are and end-of-shard
return cdc::get_local_streams_timestamp().then([this, iter, high_ts, start_time, ret = std::move(ret)](db_clock::time_point ts) mutable {
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
return _sdks.cdc_current_generation_timestamp({ normal_token_owners }).then([this, iter, high_ts, start_time, ret = std::move(ret)](db_clock::time_point ts) mutable {
auto& shard = iter.shard;
if (shard.time < ts && ts < high_ts) {

View File

@@ -2925,6 +2925,10 @@
"id":"toppartitions_query_results",
"description":"nodetool toppartitions query results",
"properties":{
"read_cardinality":{
"type":"long",
"description":"Number of the unique operations in the sample set"
},
"read":{
"type":"array",
"items":{
@@ -2932,6 +2936,10 @@
},
"description":"Read results"
},
"write_cardinality":{
"type":"long",
"description":"Number of the unique operations in the sample set"
},
"write":{
"type":"array",
"items":{

View File

@@ -148,6 +148,30 @@
]
}
]
},
{
"path":"/gossiper/force_remove_endpoint/{addr}",
"operations":[
{
"method":"POST",
"summary":"Force remove an endpoint from gossip",
"type":"void",
"nickname":"force_remove_endpoint",
"produces":[
"application/json"
],
"parameters":[
{
"name":"addr",
"description":"The endpoint address",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
}
]
}

View File

@@ -76,7 +76,7 @@
"items":{
"type":"message_counter"
},
"nickname":"get_completed_messages",
"nickname":"get_replied_messages",
"produces":[
"application/json"
],

View File

@@ -104,6 +104,68 @@
}
]
},
{
"path":"/storage_service/toppartitions/",
"operations":[
{
"method":"GET",
"summary":"Toppartitions query",
"type":"toppartitions_query_results",
"nickname":"toppartitions_generic",
"produces":[
"application/json"
],
"parameters":[
{
"name":"table_filters",
"description":"Optional list of table name filters in keyspace:name format",
"required":false,
"allowMultiple":false,
"type":"array",
"items":{
"type":"string"
},
"paramType":"query"
},
{
"name":"keyspace_filters",
"description":"Optional list of keyspace filters",
"required":false,
"allowMultiple":false,
"type":"array",
"items":{
"type":"string"
},
"paramType":"query"
},
{
"name":"duration",
"description":"Duration (in milliseconds) of monitoring operation",
"required":true,
"allowMultiple":false,
"type": "long",
"paramType":"query"
},
{
"name":"list_size",
"description":"number of the top partitions to list",
"required":false,
"allowMultiple":false,
"type": "long",
"paramType":"query"
},
{
"name":"capacity",
"description":"capacity of stream summary: determines amount of resources used in query processing",
"required":false,
"allowMultiple":false,
"type": "long",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/nodes/leaving",
"operations":[
@@ -970,6 +1032,14 @@
"type":"string",
"paramType":"query"
},
{
"name":"ignore_nodes",
"description":"Which hosts are to ignore in this repair. Multiple hosts can be listed separated by commas.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"trace",
"description":"If the value is the string 'true' with any capitalization, enable tracing of the repair.",
@@ -1105,6 +1175,14 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"ignore_nodes",
"description":"List of dead nodes to ingore in removenode operation",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
@@ -1756,6 +1834,22 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"load_and_stream",
"description":"Load the sstables and stream to all replica nodes that owns the data",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"primary_replica_only",
"description":"Load the sstables and stream to primary replica node that owns the data. Repair is needed after the load and stream process",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
@@ -1866,6 +1960,14 @@
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"fast",
"description":"Lightweight tracing mode: if true, slow queries tracing records only session headers",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
},
@@ -2364,6 +2466,10 @@
"threshold":{
"type":"long",
"description":"The slow query logging threshold in microseconds. Queries that takes longer, will be logged"
},
"fast":{
"type":"boolean",
"description":"Is lightweight tracing mode enabled. In that mode tracing ignore events and tracks only sessions."
}
}
},

View File

@@ -52,6 +52,22 @@
}
]
},
{
"path":"/system/drop_sstable_caches",
"operations":[
{
"method":"POST",
"summary":"Drop in-memory caches for data which is in sstables",
"type":"void",
"nickname":"drop_sstable_caches",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/system/uptime_ms",
"operations":[

View File

@@ -28,6 +28,7 @@
#include <algorithm>
#include "db/system_keyspace_view_types.hh"
#include "db/data_listeners.hh"
#include "storage_service.hh"
extern logging::logger apilog;
@@ -180,7 +181,7 @@ static future<json::json_return_type> get_cf_unleveled_sstables(http_context& ct
static int64_t min_partition_size(column_family& cf) {
int64_t res = INT64_MAX;
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res = std::min(res, i->get_stats_metadata().estimated_partition_size.min());
}
return (res == INT64_MAX) ? 0 : res;
@@ -188,7 +189,7 @@ static int64_t min_partition_size(column_family& cf) {
static int64_t max_partition_size(column_family& cf) {
int64_t res = 0;
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res = std::max(i->get_stats_metadata().estimated_partition_size.max(), res);
}
return res;
@@ -196,7 +197,7 @@ static int64_t max_partition_size(column_family& cf) {
static integral_ratio_holder mean_partition_size(column_family& cf) {
integral_ratio_holder res;
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
auto c = i->get_stats_metadata().estimated_partition_size.count();
res.sub += i->get_stats_metadata().estimated_partition_size.mean() * c;
res.total += c;
@@ -274,7 +275,7 @@ public:
static double get_compression_ratio(column_family& cf) {
sum_ratio<double> result;
for (auto i : *cf.get_sstables()) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
auto compression_ratio = i->get_compression_ratio();
if (compression_ratio != sstables::metadata_collector::NO_COMPRESSION_RATIO) {
result(compression_ratio);
@@ -310,8 +311,8 @@ void set_column_family(http_context& ctx, routes& r) {
return res;
});
cf::get_column_family.set(r, [&ctx] (const_req req){
vector<cf::column_family_info> res;
cf::get_column_family.set(r, [&ctx] (std::unique_ptr<request> req){
std::list<cf::column_family_info> res;
for (auto i: ctx.db.local().get_column_families_mapping()) {
cf::column_family_info info;
info.ks = i.first.first;
@@ -319,7 +320,7 @@ void set_column_family(http_context& ctx, routes& r) {
info.type = "ColumnFamilies";
res.push_back(info);
}
return res;
return make_ready_future<json::json_return_type>(json::stream_range_as_array(std::move(res), std::identity()));
});
cf::get_column_family_name_keyspace.set(r, [&ctx] (const_req req){
@@ -331,15 +332,15 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], 0, [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], uint64_t{0}, [](column_family& cf) {
return cf.active_memtable().partition_count();
}, std::plus<int>());
}, std::plus<>());
});
cf::get_all_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, 0, [](column_family& cf) {
return map_reduce_cf(ctx, uint64_t{0}, [](column_family& cf) {
return cf.active_memtable().partition_count();
}, std::plus<int>());
}, std::plus<>());
});
cf::get_memtable_on_heap_size.set(r, [] (const_req req) {
@@ -424,7 +425,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_estimated_row_size_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
utils::estimated_histogram res(0);
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res.merge(i->get_stats_metadata().estimated_partition_size);
}
return res;
@@ -436,7 +437,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_estimated_row_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {
uint64_t res = 0;
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res += i->get_stats_metadata().estimated_partition_size.count();
}
return res;
@@ -447,7 +448,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_estimated_column_count_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
utils::estimated_histogram res(0);
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res.merge(i->get_stats_metadata().estimated_cells_count);
}
return res;
@@ -599,7 +600,8 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_false_positive();
});
}, std::plus<uint64_t>());
@@ -607,7 +609,8 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_false_positive();
});
}, std::plus<uint64_t>());
@@ -615,7 +618,8 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_recent_false_positive();
});
}, std::plus<uint64_t>());
@@ -623,7 +627,8 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_recent_false_positive();
});
}, std::plus<uint64_t>());
@@ -655,48 +660,54 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_size();
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
});
cf::get_all_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_size();
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
});
cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_memory_size();
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
cf::get_all_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_memory_size();
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->get_summary().memory_footprint();
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});
cf::get_all_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->get_summary().memory_footprint();
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});
@@ -973,42 +984,20 @@ void set_column_family(http_context& ctx, routes& r) {
});
});
cf::toppartitions.set(r, [&ctx] (std::unique_ptr<request> req) {
auto name_param = req->param["name"];
auto [ks, cf] = parse_fully_qualified_cf_name(name_param);
auto name = req->param["name"];
auto [ks, cf] = parse_fully_qualified_cf_name(name);
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
api::req_param<unsigned> capacity(*req, "capacity", 256);
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: name={} duration={} list_size={} capacity={}",
name_param, duration.param, list_size.param, capacity.param);
name, duration.param, list_size.param, capacity.param);
return seastar::do_with(db::toppartitions_query(ctx.db, ks, cf, duration.value, list_size, capacity), [&ctx](auto& q) {
return q.scatter().then([&q] {
return sleep(q.duration()).then([&q] {
return q.gather(q.capacity()).then([&q] (auto topk_results) {
apilog.debug("toppartitions query: processing results");
cf::toppartitions_query_results results;
for (auto& d: topk_results.read.top(q.list_size())) {
cf::toppartitions_record r;
r.partition = sstring(d.item);
r.count = d.count;
r.error = d.error;
results.read.push(r);
}
for (auto& d: topk_results.write.top(q.list_size())) {
cf::toppartitions_record r;
r.partition = sstring(d.item);
r.count = d.count;
r.error = d.error;
results.write.push(r);
}
return make_ready_future<json::json_return_type>(results);
});
});
});
return seastar::do_with(db::toppartitions_query(ctx.db, {{ks, cf}}, {}, duration.value, list_size, capacity), [&ctx] (db::toppartitions_query& q) {
return run_toppartitions_query(q, ctx, true);
});
});

View File

@@ -116,4 +116,7 @@ future<json::json_return_type> get_cf_stats(http_context& ctx, const sstring& n
future<json::json_return_type> get_cf_stats(http_context& ctx,
int64_t column_family_stats::*f);
std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name);
}

View File

@@ -58,6 +58,7 @@ void set_compaction_manager(http_context& ctx, routes& r) {
for (const auto& c : cm.get_compactions()) {
cm::summary s;
s.id = c->compaction_uuid.to_sstring();
s.ks = c->ks_name;
s.cf = c->cf_name;
s.unit = "keys";

View File

@@ -66,6 +66,13 @@ void set_gossiper(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(json_void());
});
});
httpd::gossiper_json::force_remove_endpoint.set(r, [](std::unique_ptr<request> req) {
gms::inet_address ep(req->param["addr"]);
return gms::get_local_gossiper().force_remove_endpoint(ep).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
}
}

View File

@@ -26,6 +26,7 @@
#include <seastar/http/exception.hh>
#include "utils/logalloc.hh"
#include "log.hh"
#include "database.hh"
namespace api {

View File

@@ -96,6 +96,10 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging
return c.get_stats().sent_messages;
}));
get_replied_messages.set(r, get_client_getter(ms, [](const shard_info& c) {
return c.get_stats().replied;
}));
get_dropped_messages.set(r, get_client_getter(ms, [](const shard_info& c) {
// We don't have the same drop message mechanism
// as origin has.
@@ -155,6 +159,7 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging
void unset_messaging_service(http_context& ctx, routes& r) {
get_timeout_messages.unset(r);
get_sent_messages.unset(r);
get_replied_messages.unset(r);
get_dropped_messages.unset(r);
get_exception_messages.unset(r);
get_pending_messages.unset(r);

View File

@@ -23,10 +23,13 @@
#include "api/api-doc/storage_service.json.hh"
#include "db/config.hh"
#include "db/schema_tables.hh"
#include <optional>
#include "utils/hash.hh"
#include <sstream>
#include <time.h>
#include <boost/range/adaptor/map.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include <boost/algorithm/string/trim_all.hpp>
#include <boost/functional/hash.hpp>
#include "service/storage_service.hh"
#include "service/load_meter.hh"
#include "db/commitlog/commitlog.hh"
@@ -46,6 +49,9 @@
#include "transport/controller.hh"
#include "thrift/controller.hh"
#include "locator/token_metadata.hh"
#include "cdc/generation_service.hh"
extern logging::logger apilog;
namespace api {
@@ -94,6 +100,37 @@ static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
};
}
seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request) {
namespace cf = httpd::column_family_json;
return q.scatter().then([&q, legacy_request] {
return sleep(q.duration()).then([&q, legacy_request] {
return q.gather(q.capacity()).then([&q, legacy_request] (auto topk_results) {
apilog.debug("toppartitions query: processing results");
cf::toppartitions_query_results results;
results.read_cardinality = topk_results.read.size();
results.write_cardinality = topk_results.write.size();
for (auto& d: topk_results.read.top(q.list_size())) {
cf::toppartitions_record r;
r.partition = (legacy_request ? "" : "(" + d.item.schema->ks_name() + ":" + d.item.schema->cf_name() + ") ") + sstring(d.item);
r.count = d.count;
r.error = d.error;
results.read.push(r);
}
for (auto& d: topk_results.write.top(q.list_size())) {
cf::toppartitions_record r;
r.partition = (legacy_request ? "" : "(" + d.item.schema->ks_name() + ":" + d.item.schema->cf_name() + ") ") + sstring(d.item);
r.count = d.count;
r.error = d.error;
results.write.push(r);
}
return make_ready_future<json::json_return_type>(results);
});
});
});
}
future<json::json_return_type> set_tables_autocompaction(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
if (tables.empty()) {
tables = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
@@ -159,7 +196,7 @@ void unset_rpc_controller(http_context& ctx, routes& r) {
void set_repair(http_context& ctx, routes& r, sharded<netw::messaging_service>& ms) {
ss::repair_async.set(r, [&ctx, &ms](std::unique_ptr<request> req) {
static std::vector<sstring> options = {"primaryRange", "parallelism", "incremental",
"jobThreads", "ranges", "columnFamilies", "dataCenters", "hosts", "trace",
"jobThreads", "ranges", "columnFamilies", "dataCenters", "hosts", "ignore_nodes", "trace",
"startToken", "endToken" };
std::unordered_map<sstring, sstring> options_map;
for (auto o : options) {
@@ -225,7 +262,7 @@ void set_repair(http_context& ctx, routes& r, sharded<netw::messaging_service>&
try {
res = fut.get0();
} catch (std::exception& e) {
return make_exception_future<json::json_return_type>(httpd::server_error_exception(e.what()));
return make_exception_future<json::json_return_type>(httpd::bad_param_exception(e.what()));
}
return make_ready_future<json::json_return_type>(json::json_return_type(res));
});
@@ -287,6 +324,56 @@ void set_storage_service(http_context& ctx, routes& r) {
}));
});
ss::toppartitions_generic.set(r, [&ctx] (std::unique_ptr<request> req) {
bool filters_provided = false;
std::unordered_set<std::tuple<sstring, sstring>, utils::tuple_hash> table_filters {};
if (req->query_parameters.contains("table_filters")) {
filters_provided = true;
auto filters = req->get_query_param("table_filters");
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
table_filters.emplace(parse_fully_qualified_cf_name(filter));
}
}
std::unordered_set<sstring> keyspace_filters {};
if (req->query_parameters.contains("keyspace_filters")) {
filters_provided = true;
auto filters = req->get_query_param("keyspace_filters");
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
keyspace_filters.emplace(std::move(filter));
}
}
// when the query is empty return immediately
if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {
apilog.debug("toppartitions query: processing results");
httpd::column_family_json::toppartitions_query_results results;
results.read_cardinality = 0;
results.write_cardinality = 0;
return make_ready_future<json::json_return_type>(results);
}
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
api::req_param<unsigned> capacity(*req, "capacity", 256);
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: #table_filters={} #keyspace_filters={} duration={} list_size={} capacity={}",
!table_filters.empty() ? std::to_string(table_filters.size()) : "all", !keyspace_filters.empty() ? std::to_string(keyspace_filters.size()) : "all", duration.param, list_size.param, capacity.param);
return seastar::do_with(db::toppartitions_query(ctx.db, std::move(table_filters), std::move(keyspace_filters), duration.value, list_size, capacity), [&ctx] (db::toppartitions_query& q) {
return run_toppartitions_query(q, ctx);
});
});
ss::get_leaving_nodes.set(r, [&ctx](const_req req) {
return container_to_vec(ctx.get_token_metadata().get_leaving_endpoints());
});
@@ -400,7 +487,7 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::cdc_streams_check_and_repair.set(r, [&ctx] (std::unique_ptr<request> req) {
return service::get_local_storage_service().check_and_repair_cdc_streams().then([] {
return service::get_local_storage_service().get_cdc_generation_service().check_and_repair_cdc_streams().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
@@ -496,7 +583,22 @@ void set_storage_service(http_context& ctx, routes& r) {
ss::remove_node.set(r, [](std::unique_ptr<request> req) {
auto host_id = req->get_query_param("host_id");
return service::get_local_storage_service().removenode(host_id).then([] {
std::vector<sstring> ignore_nodes_strs= split(req->get_query_param("ignore_nodes"), ",");
auto ignore_nodes = std::list<gms::inet_address>();
for (std::string n : ignore_nodes_strs) {
try {
std::replace(n.begin(), n.end(), '\"', ' ');
std::replace(n.begin(), n.end(), '\'', ' ');
boost::trim_all(n);
if (!n.empty()) {
auto node = gms::inet_address(n);
ignore_nodes.push_back(node);
}
} catch (...) {
throw std::runtime_error(format("Failed to parse ignore_nodes parameter: ignore_nodes={}, node={}", ignore_nodes_strs, n));
}
}
return service::get_local_storage_service().removenode(host_id, std::move(ignore_nodes)).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
@@ -716,11 +818,19 @@ void set_storage_service(http_context& ctx, routes& r) {
ss::load_new_ss_tables.set(r, [&ctx](std::unique_ptr<request> req) {
auto ks = validate_keyspace(ctx, req->param);
auto cf = req->get_query_param("cf");
auto stream = req->get_query_param("load_and_stream");
auto primary_replica = req->get_query_param("primary_replica_only");
boost::algorithm::to_lower(stream);
boost::algorithm::to_lower(primary_replica);
bool load_and_stream = stream == "true" || stream == "1";
bool primary_replica_only = primary_replica == "true" || primary_replica == "1";
// No need to add the keyspace, since all we want is to avoid always sending this to the same
// CPU. Even then I am being overzealous here. This is not something that happens all the time.
auto coordinator = std::hash<sstring>()(cf) % smp::count;
return service::get_storage_service().invoke_on(coordinator, [ks = std::move(ks), cf = std::move(cf)] (service::storage_service& s) {
return s.load_new_sstables(ks, cf);
return service::get_storage_service().invoke_on(coordinator,
[ks = std::move(ks), cf = std::move(cf),
load_and_stream, primary_replica_only] (service::storage_service& s) {
return s.load_new_sstables(ks, cf, load_and_stream, primary_replica_only);
}).then_wrapped([] (auto&& f) {
if (f.failed()) {
auto msg = fmt::format("Failed to load new sstables: {}", f.get_exception());
@@ -776,6 +886,7 @@ void set_storage_service(http_context& ctx, routes& r) {
res.enable = tracing::tracing::get_local_tracing_instance().slow_query_tracing_enabled();
res.ttl = tracing::tracing::get_local_tracing_instance().slow_query_record_ttl().count() ;
res.threshold = tracing::tracing::get_local_tracing_instance().slow_query_threshold().count();
res.fast = tracing::tracing::get_local_tracing_instance().ignore_trace_events_enabled();
return res;
});
@@ -783,8 +894,9 @@ void set_storage_service(http_context& ctx, routes& r) {
auto enable = req->get_query_param("enable");
auto ttl = req->get_query_param("ttl");
auto threshold = req->get_query_param("threshold");
auto fast = req->get_query_param("fast");
try {
return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold] (auto& local_tracing) {
return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold, fast] (auto& local_tracing) {
if (threshold != "") {
local_tracing.set_slow_query_threshold(std::chrono::microseconds(std::stol(threshold.c_str())));
}
@@ -794,6 +906,9 @@ void set_storage_service(http_context& ctx, routes& r) {
if (enable != "") {
local_tracing.set_slow_query_enabled(strcasecmp(enable.c_str(), "true") == 0);
}
if (fast != "") {
local_tracing.set_ignore_trace_events(strcasecmp(fast.c_str(), "true") == 0);
}
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -963,7 +1078,7 @@ void set_storage_service(http_context& ctx, routes& r) {
tst.keyspace = schema->ks_name();
tst.table = schema->cf_name();
for (auto sstable : *t->get_sstables_including_compacted_undeleted()) {
for (auto sstables = t->get_sstables_including_compacted_undeleted(); auto sstable : *sstables) {
auto ts = db_clock::to_time_t(sstable->data_file_write_time());
::tm t;
::gmtime_r(&ts, &t);

View File

@@ -23,6 +23,7 @@
#include <seastar/core/sharded.hh>
#include "api.hh"
#include "db/data_listeners.hh"
namespace cql_transport { class controller; }
class thrift_controller;
@@ -40,5 +41,6 @@ void set_rpc_controller(http_context& ctx, routes& r, thrift_controller& ctl);
void unset_rpc_controller(http_context& ctx, routes& r);
void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_ctl);
void unset_snapshot(http_context& ctx, routes& r);
seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request = false);
}

View File

@@ -25,6 +25,9 @@
#include <seastar/core/reactor.hh>
#include <seastar/http/exception.hh>
#include "log.hh"
#include "database.hh"
extern logging::logger apilog;
namespace api {
@@ -70,6 +73,16 @@ void set_system(http_context& ctx, routes& r) {
}
return json::json_void();
});
hs::drop_sstable_caches.set(r, [&ctx](std::unique_ptr<request> req) {
apilog.info("Dropping sstable caches");
return ctx.db.invoke_on_all([] (database& db) {
return db.drop_caches();
}).then([] {
apilog.info("Caches dropped");
return json::json_return_type(json::json_void());
});
});
}
}

View File

@@ -24,142 +24,130 @@
#include "counters.hh"
#include "types.hh"
/// LSA mirator for cells with irrelevant type
///
///
const data::type_imr_descriptor& no_type_imr_descriptor() {
static thread_local data::type_imr_descriptor state(data::type_info::make_variable_size());
return state;
}
atomic_cell atomic_cell::make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_dead(timestamp, deletion_time), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_dead(timestamp, deletion_time);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, single_fragment_range(value));
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value, atomic_cell::collection_member cm) {
return atomic_cell_type::make_live(timestamp, fragment_range(value));
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, value);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value, collection_member cm)
{
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, value);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, single_fragment_range(value), expiry, ttl);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
return atomic_cell_type::make_live(timestamp, fragment_range(value), expiry, ttl);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, value, expiry, ttl);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member cm)
{
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, value, expiry, ttl);
}
atomic_cell atomic_cell::make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live_counter_update(timestamp, value), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live_counter_update(timestamp, value);
}
atomic_cell atomic_cell::make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live_uninitialized(imr_data.type_info(), timestamp, size), &imr_data.lsa_migrator())
);
}
static imr::utils::object<data::cell::structure> copy_cell(const data::type_imr_descriptor& imr_data, const uint8_t* ptr)
{
using imr_object_type = imr::utils::object<data::cell::structure>;
// If the cell doesn't own any memory it is trivial and can be copied with
// memcpy.
auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
if (!f.template get<data::cell::tags::external_data>()) {
data::cell::context ctx(f, imr_data.type_info());
// XXX: We may be better off storing the total cell size in memory. Measure!
auto size = data::cell::structure::serialized_object_size(ptr, ctx);
return imr_object_type::make_raw(size, [&] (uint8_t* dst) noexcept {
std::copy_n(ptr, size, dst);
}, &imr_data.lsa_migrator());
}
return imr_object_type::make(data::cell::copy_fn(imr_data.type_info(), ptr), &imr_data.lsa_migrator());
return atomic_cell_type::make_live_uninitialized(timestamp, size);
}
atomic_cell::atomic_cell(const abstract_type& type, atomic_cell_view other)
: atomic_cell(type.imr_state().type_info(),
copy_cell(type.imr_state(), other._view.raw_pointer()))
{ }
: _data(other._view) {
set_view(_data);
}
// Based on:
// - org.apache.cassandra.db.AbstractCell#reconcile()
// - org.apache.cassandra.db.BufferExpiringCell#reconcile()
// - org.apache.cassandra.db.BufferDeletedCell#reconcile()
int
compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {
if (left.timestamp() != right.timestamp()) {
return left.timestamp() > right.timestamp() ? 1 : -1;
}
if (left.is_live() != right.is_live()) {
return left.is_live() ? -1 : 1;
}
if (left.is_live()) {
auto c = compare_unsigned(left.value(), right.value());
if (c != 0) {
return c;
}
if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) {
// prefer expiring cells.
return left.is_live_and_has_ttl() ? 1 : -1;
}
if (left.is_live_and_has_ttl()) {
if (left.expiry() != right.expiry()) {
return left.expiry() < right.expiry() ? -1 : 1;
} else {
// prefer the cell that was written later,
// so it survives longer after it expires, until purged.
if (left.ttl() != right.ttl()) {
return left.ttl() < right.ttl() ? 1 : -1;
} else {
return 0;
}
}
}
} else {
// Both are deleted
if (left.deletion_time() != right.deletion_time()) {
// Origin compares big-endian serialized deletion time. That's because it
// delegates to AbstractCell.reconcile() which compares values after
// comparing timestamps, which in case of deleted cells will hold
// serialized expiry.
return (uint64_t) left.deletion_time().time_since_epoch().count()
< (uint64_t) right.deletion_time().time_since_epoch().count() ? -1 : 1;
}
}
return 0;
}
atomic_cell_or_collection atomic_cell_or_collection::copy(const abstract_type& type) const {
if (!_data.get()) {
if (_data.empty()) {
return atomic_cell_or_collection();
}
auto& imr_data = type.imr_state();
return atomic_cell_or_collection(
copy_cell(imr_data, _data.get())
);
return atomic_cell_or_collection(managed_bytes(_data));
}
atomic_cell_or_collection::atomic_cell_or_collection(const abstract_type& type, atomic_cell_view acv)
: _data(copy_cell(type.imr_state(), acv._view.raw_pointer()))
: _data(acv._view)
{
}
bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_cell_or_collection& other) const
{
auto ptr_a = _data.get();
auto ptr_b = other._data.get();
if (!ptr_a || !ptr_b) {
return !ptr_a && !ptr_b;
if (_data.empty() || other._data.empty()) {
return _data.empty() && other._data.empty();
}
if (type.is_atomic()) {
auto a = atomic_cell_view::from_bytes(type.imr_state().type_info(), _data);
auto b = atomic_cell_view::from_bytes(type.imr_state().type_info(), other._data);
auto a = atomic_cell_view::from_bytes(type, _data);
auto b = atomic_cell_view::from_bytes(type, other._data);
if (a.timestamp() != b.timestamp()) {
return false;
}
@@ -191,28 +179,7 @@ bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_c
size_t atomic_cell_or_collection::external_memory_usage(const abstract_type& t) const
{
if (!_data.get()) {
return 0;
}
auto ctx = data::cell::context(_data.get(), t.imr_state().type_info());
auto view = data::cell::structure::make_view(_data.get(), ctx);
auto flags = view.get<data::cell::tags::flags>();
size_t external_value_size = 0;
if (flags.get<data::cell::tags::external_data>()) {
if (flags.get<data::cell::tags::collection>()) {
external_value_size = as_collection_mutation().data.size_bytes();
} else {
auto cell_view = data::cell::atomic_cell_view(t.imr_state().type_info(), view);
external_value_size = cell_view.value_size();
}
// Add overhead of chunk headers. The last one is a special case.
external_value_size += (external_value_size - 1) / data::cell::effective_external_chunk_length * data::cell::external_chunk_overhead;
external_value_size += data::cell::external_last_chunk_overhead;
}
return data::cell::structure::serialized_object_size(_data.get(), ctx)
+ imr_object_type::size_overhead + external_value_size;
return _data.external_memory_usage();
}
std::ostream&
@@ -221,7 +188,7 @@ operator<<(std::ostream& os, const atomic_cell_view& acv) {
return fmt_print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",
acv.is_counter_update()
? "counter_update_value=" + to_sstring(acv.counter_update_value())
: to_hex(acv.value().linearize()),
: to_hex(to_bytes(acv.value())),
acv.timestamp(),
acv.is_live_and_has_ttl() ? acv.expiry().time_since_epoch().count() : -1,
acv.is_live_and_has_ttl() ? acv.ttl().count() : 0);
@@ -247,12 +214,11 @@ operator<<(std::ostream& os, const atomic_cell_view::printer& acvp) {
cell_value_string_builder << "counter_update_value=" << acv.counter_update_value();
} else {
cell_value_string_builder << "shards: ";
counter_cell_view::with_linearized(acv, [&cell_value_string_builder] (counter_cell_view& ccv) {
cell_value_string_builder << ::join(", ", ccv.shards());
});
auto ccv = counter_cell_view(acv);
cell_value_string_builder << ::join(", ", ccv.shards());
}
} else {
cell_value_string_builder << type.to_string(acv.value().linearize());
cell_value_string_builder << type.to_string(to_bytes(acv.value()));
}
return fmt_print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",
cell_value_string_builder.str(),
@@ -271,12 +237,11 @@ operator<<(std::ostream& os, const atomic_cell::printer& acp) {
}
std::ostream& operator<<(std::ostream& os, const atomic_cell_or_collection::printer& p) {
if (!p._cell._data.get()) {
if (p._cell._data.empty()) {
return os << "{ null atomic_cell_or_collection }";
}
using dc = data::cell;
os << "{ ";
if (dc::structure::get_member<dc::tags::flags>(p._cell._data.get()).get<dc::tags::collection>()) {
if (p._cdef.type->is_multi_cell()) {
os << "collection ";
auto cmv = p._cell.as_collection_mutation();
os << collection_mutation_view::printer(*p._cdef.type, cmv);

View File

@@ -26,12 +26,12 @@
#include "tombstone.hh"
#include "gc_clock.hh"
#include "utils/managed_bytes.hh"
#include "utils/fragment_range.hh"
#include <seastar/net//byteorder.hh>
#include <seastar/util/bool_class.hh>
#include <cstdint>
#include <iosfwd>
#include "data/cell.hh"
#include "data/schema_info.hh"
#include "imr/utils.hh"
#include <concepts>
#include "utils/fragmented_temporary_buffer.hh"
#include "serializer.hh"
@@ -40,41 +40,191 @@ class abstract_type;
class collection_type_impl;
class atomic_cell_or_collection;
using atomic_cell_value_view = data::value_view;
using atomic_cell_value_mutable_view = data::value_mutable_view;
using atomic_cell_value = managed_bytes;
template <mutable_view is_mutable>
using atomic_cell_value_basic_view = managed_bytes_basic_view<is_mutable>;
using atomic_cell_value_view = atomic_cell_value_basic_view<mutable_view::no>;
using atomic_cell_value_mutable_view = atomic_cell_value_basic_view<mutable_view::yes>;
template <typename T>
requires std::is_trivial_v<T>
static void set_field(atomic_cell_value_mutable_view& out, unsigned offset, T val) {
auto out_view = managed_bytes_mutable_view(out);
out_view.remove_prefix(offset);
write<T>(out_view, val);
}
template <typename T>
requires std::is_trivial_v<T>
static void set_field(atomic_cell_value& out, unsigned offset, T val) {
auto out_view = atomic_cell_value_mutable_view(out);
set_field(out_view, offset, val);
}
template <FragmentRange Buffer>
static void set_value(managed_bytes& b, unsigned value_offset, const Buffer& value) {
auto v = managed_bytes_mutable_view(b).substr(value_offset, value.size_bytes());
for (auto frag : value) {
write_fragmented(v, single_fragmented_view(frag));
}
}
template <typename T, FragmentedView Input>
requires std::is_trivial_v<T>
static T get_field(Input in, unsigned offset = 0) {
in.remove_prefix(offset);
return read_simple<T>(in);
}
/*
* Represents atomic cell layout. Works on serialized form.
*
* Layout:
*
* <live> := <int8_t:flags><int64_t:timestamp>(<int64_t:expiry><int32_t:ttl>)?<value>
* <dead> := <int8_t: 0><int64_t:timestamp><int64_t:deletion_time>
*/
class atomic_cell_type final {
private:
static constexpr int8_t LIVE_FLAG = 0x01;
static constexpr int8_t EXPIRY_FLAG = 0x02; // When present, expiry field is present. Set only for live cells
static constexpr int8_t COUNTER_UPDATE_FLAG = 0x08; // Cell is a counter update.
static constexpr unsigned flags_size = 1;
static constexpr unsigned timestamp_offset = flags_size;
static constexpr unsigned timestamp_size = 8;
static constexpr unsigned expiry_offset = timestamp_offset + timestamp_size;
static constexpr unsigned expiry_size = 8;
static constexpr unsigned deletion_time_offset = timestamp_offset + timestamp_size;
static constexpr unsigned deletion_time_size = 8;
static constexpr unsigned ttl_offset = expiry_offset + expiry_size;
static constexpr unsigned ttl_size = 4;
friend class counter_cell_builder;
private:
static bool is_counter_update(atomic_cell_value_view cell) {
return cell.front() & COUNTER_UPDATE_FLAG;
}
static bool is_live(atomic_cell_value_view cell) {
return cell.front() & LIVE_FLAG;
}
static bool is_live_and_has_ttl(atomic_cell_value_view cell) {
return cell.front() & EXPIRY_FLAG;
}
static bool is_dead(atomic_cell_value_view cell) {
return !is_live(cell);
}
// Can be called on live and dead cells
static api::timestamp_type timestamp(atomic_cell_value_view cell) {
return get_field<api::timestamp_type>(cell, timestamp_offset);
}
static void set_timestamp(atomic_cell_value_mutable_view& cell, api::timestamp_type ts) {
set_field(cell, timestamp_offset, ts);
}
// Can be called on live cells only
private:
template <mutable_view is_mutable>
static managed_bytes_basic_view<is_mutable> do_get_value(managed_bytes_basic_view<is_mutable> cell) {
auto expiry_field_size = bool(cell.front() & EXPIRY_FLAG) * (expiry_size + ttl_size);
auto value_offset = flags_size + timestamp_size + expiry_field_size;
cell.remove_prefix(value_offset);
return cell;
}
public:
static atomic_cell_value_view value(managed_bytes_view cell) {
return do_get_value(cell);
}
static atomic_cell_value_mutable_view value(managed_bytes_mutable_view cell) {
return do_get_value(cell);
}
// Can be called on live counter update cells only
static int64_t counter_update_value(atomic_cell_value_view cell) {
return get_field<int64_t>(cell, flags_size + timestamp_size);
}
// Can be called only when is_dead() is true.
static gc_clock::time_point deletion_time(atomic_cell_value_view cell) {
assert(is_dead(cell));
return gc_clock::time_point(gc_clock::duration(get_field<int64_t>(cell, deletion_time_offset)));
}
// Can be called only when is_live_and_has_ttl() is true.
static gc_clock::time_point expiry(atomic_cell_value_view cell) {
assert(is_live_and_has_ttl(cell));
auto expiry = get_field<int64_t>(cell, expiry_offset);
return gc_clock::time_point(gc_clock::duration(expiry));
}
// Can be called only when is_live_and_has_ttl() is true.
static gc_clock::duration ttl(atomic_cell_value_view cell) {
assert(is_live_and_has_ttl(cell));
return gc_clock::duration(get_field<int32_t>(cell, ttl_offset));
}
static managed_bytes make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
managed_bytes b(managed_bytes::initialized_later(), flags_size + timestamp_size + deletion_time_size);
b[0] = 0;
set_field(b, timestamp_offset, timestamp);
set_field(b, deletion_time_offset, static_cast<int64_t>(deletion_time.time_since_epoch().count()));
return b;
}
template <FragmentRange Buffer>
static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size_bytes());
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_value(b, value_offset, value);
return b;
}
static managed_bytes make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + sizeof(value));
b[0] = LIVE_FLAG | COUNTER_UPDATE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, value_offset, value);
return b;
}
template <FragmentRange Buffer>
static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value, gc_clock::time_point expiry, gc_clock::duration ttl) {
auto value_offset = flags_size + timestamp_size + expiry_size + ttl_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size_bytes());
b[0] = EXPIRY_FLAG | LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, expiry_offset, static_cast<int64_t>(expiry.time_since_epoch().count()));
set_field(b, ttl_offset, static_cast<int32_t>(ttl.count()));
set_value(b, value_offset, value);
return b;
}
static managed_bytes make_live_uninitialized(api::timestamp_type timestamp, size_t size) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + size);
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
return b;
}
template <mutable_view is_mutable>
friend class basic_atomic_cell_view;
friend class atomic_cell;
};
/// View of an atomic cell
template<mutable_view is_mutable>
class basic_atomic_cell_view {
protected:
data::cell::basic_atomic_cell_view<is_mutable> _view;
friend class atomic_cell;
public:
using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const uint8_t*, uint8_t*>;
managed_bytes_basic_view<is_mutable> _view;
friend class atomic_cell;
protected:
explicit basic_atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> v)
: _view(std::move(v)) { }
basic_atomic_cell_view(const data::type_info& ti, pointer_type ptr)
: _view(data::cell::make_atomic_cell_view(ti, ptr))
{ }
void set_view(managed_bytes_basic_view<is_mutable> v) {
_view = v;
}
basic_atomic_cell_view() = default;
explicit basic_atomic_cell_view(managed_bytes_basic_view<is_mutable> v) : _view(std::move(v)) { }
friend class atomic_cell_or_collection;
public:
operator basic_atomic_cell_view<mutable_view::no>() const noexcept {
return basic_atomic_cell_view<mutable_view::no>(_view);
}
void swap(basic_atomic_cell_view& other) noexcept {
using std::swap;
swap(_view, other._view);
}
bool is_counter_update() const {
return _view.is_counter_update();
return atomic_cell_type::is_counter_update(_view);
}
bool is_live() const {
return _view.is_live();
return atomic_cell_type::is_live(_view);
}
bool is_live(tombstone t, bool is_counter) const {
return is_live() && !is_covered_by(t, is_counter);
@@ -83,73 +233,72 @@ public:
return is_live() && !is_covered_by(t, is_counter) && !has_expired(now);
}
bool is_live_and_has_ttl() const {
return _view.is_expiring();
return atomic_cell_type::is_live_and_has_ttl(_view);
}
bool is_dead(gc_clock::time_point now) const {
return !is_live() || has_expired(now);
return atomic_cell_type::is_dead(_view) || has_expired(now);
}
bool is_covered_by(tombstone t, bool is_counter) const {
return timestamp() <= t.timestamp || (is_counter && t.timestamp != api::missing_timestamp);
}
// Can be called on live and dead cells
api::timestamp_type timestamp() const {
return _view.timestamp();
return atomic_cell_type::timestamp(_view);
}
void set_timestamp(api::timestamp_type ts) {
_view.set_timestamp(ts);
atomic_cell_type::set_timestamp(_view, ts);
}
// Can be called on live cells only
data::basic_value_view<is_mutable> value() const {
return _view.value();
atomic_cell_value_basic_view<is_mutable> value() const {
return atomic_cell_type::value(_view);
}
// Can be called on live cells only
size_t value_size() const {
return _view.value_size();
return atomic_cell_type::value(_view).size();
}
bool is_value_fragmented() const {
return _view.is_value_fragmented();
return _view.is_fragmented();
}
// Can be called on live counter update cells only
int64_t counter_update_value() const {
return _view.counter_update_value();
return atomic_cell_type::counter_update_value(_view);
}
// Can be called only when is_dead(gc_clock::time_point)
gc_clock::time_point deletion_time() const {
return !is_live() ? _view.deletion_time() : expiry() - ttl();
return !is_live() ? atomic_cell_type::deletion_time(_view) : expiry() - ttl();
}
// Can be called only when is_live_and_has_ttl()
gc_clock::time_point expiry() const {
return _view.expiry();
return atomic_cell_type::expiry(_view);
}
// Can be called only when is_live_and_has_ttl()
gc_clock::duration ttl() const {
return _view.ttl();
return atomic_cell_type::ttl(_view);
}
// Can be called on live and dead cells
bool has_expired(gc_clock::time_point now) const {
return is_live_and_has_ttl() && expiry() <= now;
}
bytes_view serialize() const {
return _view.serialize();
managed_bytes_view serialize() const {
return _view;
}
};
class atomic_cell_view final : public basic_atomic_cell_view<mutable_view::no> {
atomic_cell_view(const data::type_info& ti, const uint8_t* data)
: basic_atomic_cell_view<mutable_view::no>(ti, data) {}
atomic_cell_view(managed_bytes_view v)
: basic_atomic_cell_view(v) {}
template<mutable_view is_mutable>
atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> view)
: basic_atomic_cell_view<mutable_view::no>(view) { }
atomic_cell_view(basic_atomic_cell_view<is_mutable> view)
: basic_atomic_cell_view<mutable_view::no>(view) {}
friend class atomic_cell;
public:
static atomic_cell_view from_bytes(const data::type_info& ti, const imr::utils::object<data::cell::structure>& data) {
return atomic_cell_view(ti, data.get());
static atomic_cell_view from_bytes(const abstract_type& t, managed_bytes_view v) {
return atomic_cell_view(v);
}
static atomic_cell_view from_bytes(const data::type_info& ti, bytes_view bv) {
return atomic_cell_view(ti, reinterpret_cast<const uint8_t*>(bv.begin()));
static atomic_cell_view from_bytes(const abstract_type& t, bytes_view v) {
return atomic_cell_view(managed_bytes_view(v));
}
friend std::ostream& operator<<(std::ostream& os, const atomic_cell_view& acv);
@@ -164,11 +313,11 @@ public:
};
class atomic_cell_mutable_view final : public basic_atomic_cell_view<mutable_view::yes> {
atomic_cell_mutable_view(const data::type_info& ti, uint8_t* data)
: basic_atomic_cell_view<mutable_view::yes>(ti, data) {}
atomic_cell_mutable_view(managed_bytes_mutable_view data)
: basic_atomic_cell_view(data) {}
public:
static atomic_cell_mutable_view from_bytes(const data::type_info& ti, imr::utils::object<data::cell::structure>& data) {
return atomic_cell_mutable_view(ti, data.get());
static atomic_cell_mutable_view from_bytes(const abstract_type& t, managed_bytes_mutable_view v) {
return atomic_cell_mutable_view(v);
}
friend class atomic_cell;
@@ -177,26 +326,31 @@ public:
using atomic_cell_ref = atomic_cell_mutable_view;
class atomic_cell final : public basic_atomic_cell_view<mutable_view::yes> {
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
atomic_cell(const data::type_info& ti, imr::utils::object<data::cell::structure>&& data)
: basic_atomic_cell_view<mutable_view::yes>(ti, data.get()), _data(std::move(data)) {}
managed_bytes _data;
atomic_cell(managed_bytes b) : _data(std::move(b)) {
set_view(_data);
}
public:
class collection_member_tag;
using collection_member = bool_class<collection_member_tag>;
atomic_cell(atomic_cell&&) = default;
atomic_cell& operator=(const atomic_cell&) = delete;
atomic_cell& operator=(atomic_cell&&) = default;
void swap(atomic_cell& other) noexcept {
basic_atomic_cell_view<mutable_view::yes>::swap(other);
_data.swap(other._data);
atomic_cell(atomic_cell&& o) noexcept : _data(std::move(o._data)) {
set_view(_data);
}
operator atomic_cell_view() const { return atomic_cell_view(_view); }
atomic_cell& operator=(const atomic_cell&) = delete;
atomic_cell& operator=(atomic_cell&& o) {
_data = std::move(o._data);
set_view(_data);
return *this;
}
operator atomic_cell_view() const { return atomic_cell_view(managed_bytes_view(_data)); }
atomic_cell(const abstract_type& t, atomic_cell_view other);
static atomic_cell make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
@@ -208,6 +362,8 @@ public:
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, managed_bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,

View File

@@ -52,9 +52,7 @@ struct appending_hash<atomic_cell_view> {
feed_hash(h, cell.timestamp());
if (cell.is_live()) {
if (cdef.is_counter()) {
counter_cell_view::with_linearized(cell, [&] (counter_cell_view ccv) {
::feed_hash(h, ccv);
});
::feed_hash(h, counter_cell_view(cell));
return;
}
if (cell.is_live_and_has_ttl()) {

View File

@@ -26,20 +26,14 @@
#include "schema.hh"
#include "hashing.hh"
#include "imr/utils.hh"
// A variant type that can hold either an atomic_cell, or a serialized collection.
// Which type is stored is determined by the schema.
// Has an "empty" state.
// Objects moved-from are left in an empty state.
class atomic_cell_or_collection final {
// FIXME: This has made us lose small-buffer optimisation. Unfortunately,
// due to the changed cell format it would be less effective now, anyway.
// Measure the actual impact because any attempts to fix this will become
// irrelevant once rows are converted to the IMR as well, so maybe we can
// live with this like that.
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
managed_bytes _data;
private:
atomic_cell_or_collection(imr::utils::object<data::cell::structure>&& data) : _data(std::move(data)) {}
atomic_cell_or_collection(managed_bytes&& data) : _data(std::move(data)) {}
public:
atomic_cell_or_collection() = default;
atomic_cell_or_collection(atomic_cell_or_collection&&) = default;
@@ -49,20 +43,16 @@ public:
atomic_cell_or_collection(atomic_cell ac) : _data(std::move(ac._data)) {}
atomic_cell_or_collection(const abstract_type& at, atomic_cell_view acv);
static atomic_cell_or_collection from_atomic_cell(atomic_cell data) { return { std::move(data._data) }; }
atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_ref as_atomic_cell_ref(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(*cdef.type, _data); }
atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(*cdef.type, _data); }
atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm._data)) { }
atomic_cell_or_collection copy(const abstract_type&) const;
explicit operator bool() const {
return bool(_data);
return !_data.empty();
}
static constexpr bool can_use_mutable_view() {
return true;
}
void swap(atomic_cell_or_collection& other) noexcept {
_data.swap(other._data);
}
static atomic_cell_or_collection from_collection_mutation(collection_mutation data) { return std::move(data._data); }
collection_mutation_view as_collection_mutation() const;
bytes_view serialize() const;
@@ -82,12 +72,3 @@ public:
};
friend std::ostream& operator<<(std::ostream&, const printer&);
};
namespace std {
inline void swap(atomic_cell_or_collection& a, atomic_cell_or_collection& b) noexcept
{
a.swap(b);
}
}

View File

@@ -82,7 +82,7 @@ static future<> create_metadata_table_if_missing_impl(
b.set_uuid(uuid);
schema_ptr table = b.build();
return ignore_existing([&mm, table = std::move(table)] () {
return mm.announce_new_column_family(table, false);
return mm.announce_new_column_family(table);
});
}
@@ -108,7 +108,7 @@ future<> wait_for_schema_agreement(::service::migration_manager& mm, const datab
});
}
const timeout_config& internal_distributed_timeout_config() noexcept {
::service::query_state& internal_distributed_query_state() noexcept {
#ifdef DEBUG
// Give the much slower debug tests more headroom for completing auth queries.
static const auto t = 30s;
@@ -116,7 +116,9 @@ const timeout_config& internal_distributed_timeout_config() noexcept {
static const auto t = 5s;
#endif
static const timeout_config tc{t, t, t, t, t, t, t};
return tc;
static thread_local ::service::client_state cs(::service::client_state::internal_tag{}, tc);
static thread_local ::service::query_state qs(cs, empty_service_permit());
return qs;
}
}

View File

@@ -35,6 +35,7 @@
#include "log.hh"
#include "seastarx.hh"
#include "utils/exponential_backoff_retry.hh"
#include "service/query_state.hh"
using namespace std::chrono_literals;
@@ -87,6 +88,6 @@ future<> wait_for_schema_agreement(::service::migration_manager&, const database
///
/// Time-outs for internal, non-local CQL queries.
///
const timeout_config& internal_distributed_timeout_config() noexcept;
::service::query_state& internal_distributed_query_state() noexcept;
}

View File

@@ -103,7 +103,6 @@ future<bool> default_authorizer::any_granted() const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{},
true).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return !results->empty();
@@ -116,8 +115,7 @@ future<> default_authorizer::migrate_legacy_metadata() const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config).then([this](::shared_ptr<cql3::untyped_result_set> results) {
db::consistency_level::LOCAL_ONE).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
return do_with(
row.get_as<sstring>("username"),
@@ -197,7 +195,6 @@ default_authorizer::authorize(const role_or_anonymous& maybe_role, const resourc
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{*maybe_role.name, r.name()}).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return permissions::NONE;
@@ -226,7 +223,7 @@ default_authorizer::modify(
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{permissions::to_strings(set), sstring(role_name), resource.name()}).discard_result();
});
}
@@ -251,7 +248,7 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
std::vector<permission_details> all_details;
@@ -278,7 +275,7 @@ future<> default_authorizer::revoke_all(std::string_view role_name) const {
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name)}).discard_result().handle_exception([role_name](auto ep) {
try {
std::rethrow_exception(ep);
@@ -298,7 +295,6 @@ future<> default_authorizer::revoke_all(const resource& resource) const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{resource.name()}).then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
@@ -315,7 +311,6 @@ future<> default_authorizer::revoke_all(const resource& resource) const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{r.get_as<sstring>(ROLE_NAME), resource.name()}).discard_result().handle_exception(
[resource](auto ep) {
try {

View File

@@ -114,7 +114,7 @@ future<> password_authenticator::migrate_legacy_metadata() const {
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_query_state()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
auto username = row.get_as<sstring>("username");
auto salted_hash = row.get_as<sstring>(SALTED_HASH);
@@ -122,7 +122,7 @@ future<> password_authenticator::migrate_legacy_metadata() const {
return _qp.execute_internal(
update_row_query(),
consistency_for_user(username),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{std::move(salted_hash), username}).discard_result();
}).finally([results] {});
}).then([] {
@@ -139,7 +139,7 @@ future<> password_authenticator::create_default_if_missing() const {
return _qp.execute_internal(
update_row_query(),
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt), DEFAULT_USER_NAME}).then([](auto&&) {
plogger.info("Created default superuser authentication record.");
});
@@ -236,7 +236,7 @@ future<authenticated_user> password_authenticator::authenticate(
return _qp.execute_internal(
query,
consistency_for_user(username),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{username},
true);
}).then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
@@ -270,7 +270,7 @@ future<> password_authenticator::create(std::string_view role_name, const authen
return _qp.execute_internal(
update_row_query(),
consistency_for_user(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
}
@@ -287,7 +287,7 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
return _qp.execute_internal(
query,
consistency_for_user(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
}
@@ -299,7 +299,7 @@ future<> password_authenticator::drop(std::string_view name) const {
return _qp.execute_internal(
query, consistency_for_user(name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(name)}).discard_result();
}

View File

@@ -68,14 +68,13 @@ future<bool> default_role_row_satisfies(
return qp.execute_internal(
query,
db::consistency_level::ONE,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
@@ -100,7 +99,7 @@ future<bool> any_nondefault_role_row_satisfies(
return qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_query_state()).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return false;
}

View File

@@ -154,7 +154,7 @@ future<> service::create_keyspace_if_missing(::service::migration_manager& mm) c
// We use min_timestamp so that default keyspace metadata will loose with any manual adjustments.
// See issue #2129.
return mm.announce_new_keyspace(ksm, api::min_timestamp, false);
return mm.announce_new_keyspace(ksm, api::min_timestamp);
}
return make_ready_future<>();
@@ -210,7 +210,6 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.execute_internal(
default_user_query,
db::consistency_level::ONE,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
if (!results->empty()) {
@@ -220,7 +219,6 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.execute_internal(
default_user_query,
db::consistency_level::QUORUM,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
if (!results->empty()) {
@@ -229,8 +227,7 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.execute_internal(
all_users_query,
db::consistency_level::QUORUM,
infinite_timeout_config).then([](auto results) {
db::consistency_level::QUORUM).then([](auto results) {
return make_ready_future<bool>(!results->empty());
});
});

View File

@@ -86,7 +86,7 @@ static future<std::optional<record>> find_record(cql3::query_processor& qp, std:
return qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name)},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
@@ -165,7 +165,7 @@ future<> standard_role_manager::create_default_role_if_missing() const {
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{meta::DEFAULT_SUPERUSER_NAME}).then([](auto&&) {
log.info("Created default superuser role '{}'.", meta::DEFAULT_SUPERUSER_NAME);
return make_ready_future<>();
@@ -192,7 +192,7 @@ future<> standard_role_manager::migrate_legacy_metadata() const {
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_query_state()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
role_config config;
config.is_superuser = row.get_or<bool>("super", false);
@@ -253,7 +253,7 @@ future<> standard_role_manager::create_or_replace(std::string_view role_name, co
return _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name), c.is_superuser, c.can_login},
true).discard_result();
}
@@ -296,7 +296,7 @@ standard_role_manager::alter(std::string_view role_name, const role_config_updat
build_column_assignments(u),
meta::roles_table::role_col_name),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name)}).discard_result();
});
}
@@ -315,7 +315,7 @@ future<> standard_role_manager::drop(std::string_view role_name) const {
return _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name)}).then([this, role_name](::shared_ptr<cql3::untyped_result_set> members) {
return parallel_for_each(
members->begin(),
@@ -354,7 +354,7 @@ future<> standard_role_manager::drop(std::string_view role_name) const {
return _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name)}).discard_result();
};
@@ -381,7 +381,7 @@ standard_role_manager::modify_membership(
return _qp.execute_internal(
query,
consistency_for_role(grantee_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{role_set{sstring(role_name)}, sstring(grantee_name)}).discard_result();
};
@@ -392,7 +392,7 @@ standard_role_manager::modify_membership(
format("INSERT INTO {} (role, member) VALUES (?, ?)",
meta::role_members_table::qualified_name),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
case membership_change::remove:
@@ -400,7 +400,7 @@ standard_role_manager::modify_membership(
format("DELETE FROM {} WHERE role = ? AND member = ?",
meta::role_members_table::qualified_name),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
}
@@ -503,7 +503,7 @@ future<role_set> standard_role_manager::query_all() const {
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_query_state()).then([](::shared_ptr<cql3::untyped_result_set> results) {
role_set roles;
std::transform(

View File

@@ -28,6 +28,7 @@
#include <iosfwd>
#include <functional>
#include "utils/mutable_view.hh"
#include <xxhash.h>
using bytes = basic_sstring<int8_t, uint32_t, 31, false>;
using bytes_view = std::basic_string_view<int8_t>;
@@ -35,6 +36,10 @@ using bytes_mutable_view = basic_mutable_view<bytes_view::value_type>;
using bytes_opt = std::optional<bytes>;
using sstring_view = std::string_view;
inline bytes to_bytes(bytes&& b) {
return std::move(b);
}
inline sstring_view to_sstring_view(bytes_view view) {
return {reinterpret_cast<const char*>(view.data()), view.size()};
}
@@ -43,17 +48,6 @@ inline bytes_view to_bytes_view(sstring_view view) {
return {reinterpret_cast<const int8_t*>(view.data()), view.size()};
}
namespace std {
template <>
struct hash<bytes_view> {
size_t operator()(bytes_view v) const {
return hash<sstring_view>()({reinterpret_cast<const char*>(v.begin()), v.size()});
}
};
}
struct fmt_hex {
bytes_view& v;
fmt_hex(bytes_view& v) noexcept : v(v) {}
@@ -94,6 +88,30 @@ struct appending_hash<bytes_view> {
}
};
struct bytes_view_hasher : public hasher {
XXH64_state_t _state;
bytes_view_hasher(uint64_t seed = 0) noexcept {
XXH64_reset(&_state, seed);
}
void update(const char* ptr, size_t length) noexcept {
XXH64_update(&_state, ptr, length);
}
size_t finalize() {
return static_cast<size_t>(XXH64_digest(&_state));
}
};
namespace std {
template <>
struct hash<bytes_view> {
size_t operator()(bytes_view v) const {
bytes_view_hasher h;
appending_hash<bytes_view>{}(h, v);
return h.finalize();
}
};
} // namespace std
inline int32_t compare_unsigned(bytes_view v1, bytes_view v2) {
auto size = std::min(v1.size(), v2.size());
if (size) {

View File

@@ -24,9 +24,10 @@
#include <boost/range/iterator_range.hpp>
#include "bytes.hh"
#include <seastar/core/unaligned.hh>
#include "hashing.hh"
#include <seastar/core/simple-stream.hh>
#include <concepts>
/**
* Utility for writing data into a buffer when its final size is not known up front.
*
@@ -39,7 +40,7 @@ public:
using size_type = bytes::size_type;
using value_type = bytes::value_type;
using fragment_type = bytes_view;
static constexpr size_type max_chunk_size() { return 128 * 1024; }
static constexpr size_type max_chunk_size() { return max_alloc_size() - sizeof(chunk); }
private:
static_assert(sizeof(value_type) == 1, "value_type is assumed to be one byte long");
struct chunk {
@@ -59,6 +60,7 @@ private:
void operator delete(void* ptr) { free(ptr); }
};
static constexpr size_type default_chunk_size{512};
static constexpr size_type max_alloc_size() { return 128 * 1024; }
private:
std::unique_ptr<chunk> _begin;
chunk* _current;
@@ -132,16 +134,15 @@ private:
return _current->size - _current->offset;
}
// Figure out next chunk size.
// - must be enough for data_size
// - must be enough for data_size + sizeof(chunk)
// - must be at least _initial_chunk_size
// - try to double each time to prevent too many allocations
// - do not exceed max_chunk_size
// - should not exceed max_alloc_size, unless data_size requires so
size_type next_alloc_size(size_t data_size) const {
auto next_size = _current
? _current->size * 2
: _initial_chunk_size;
next_size = std::min(next_size, max_chunk_size());
// FIXME: check for overflow?
next_size = std::min(next_size, max_alloc_size());
return std::max<size_type>(next_size, data_size + sizeof(chunk));
}
// Makes room for a contiguous region of given size.
@@ -233,9 +234,9 @@ public:
};
// Returns a place holder for a value to be written later.
template <typename T>
template <std::integral T>
inline
std::enable_if_t<std::is_fundamental<T>::value, place_holder<T>>
place_holder<T>
write_place_holder() {
return place_holder<T>{alloc(sizeof(T))};
}

View File

@@ -102,7 +102,7 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
// Points to the underlying reader conforming to _schema,
// either to *_underlying_holder or _read_context->underlying().underlying().
flat_mutation_reader* _underlying = nullptr;
std::optional<flat_mutation_reader> _underlying_holder;
flat_mutation_reader_opt _underlying_holder;
future<> do_fill_buffer(db::timeout_clock::time_point);
future<> ensure_underlying(db::timeout_clock::time_point);
@@ -112,6 +112,7 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
void move_to_next_range();
void move_to_range(query::clustering_row_ranges::const_iterator);
void move_to_next_entry();
void maybe_drop_last_entry() noexcept;
void add_to_buffer(const partition_snapshot_row_cursor&);
void add_clustering_row_to_buffer(mutation_fragment&&);
void add_to_buffer(range_tombstone&&);
@@ -122,6 +123,7 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
bool can_populate() const;
// Marks the range between _last_row (exclusive) and _next_row (exclusive) as continuous,
// provided that the underlying reader still matches the latest version of the partition.
// Invalidates _last_row.
void maybe_update_continuity();
// Tries to ensure that the lower bound of the current population range exists.
// Returns false if it failed and range cannot be populated.
@@ -163,11 +165,12 @@ public:
cache_flat_mutation_reader(const cache_flat_mutation_reader&) = delete;
cache_flat_mutation_reader(cache_flat_mutation_reader&&) = delete;
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override;
virtual void next_partition() override {
virtual future<> next_partition() override {
clear_buffer_to_next_partition();
if (is_buffer_empty()) {
_end_of_stream = true;
}
return make_ready_future<>();
}
virtual future<> fast_forward_to(const dht::partition_range&, db::timeout_clock::time_point timeout) override {
clear_buffer();
@@ -264,6 +267,9 @@ future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_poin
}
_state = state::reading_from_underlying;
_population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema);
if (!_read_context->partition_exists()) {
return read_from_underlying(timeout);
}
auto end = _next_row_in_range ? position_in_partition(_next_row.position())
: position_in_partition(_upper_bound);
return _underlying->fast_forward_to(position_range{_lower_bound, std::move(end)}, timeout).then([this, timeout] {
@@ -328,7 +334,6 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
}
if (_next_row_in_range) {
maybe_update_continuity();
_last_row = _next_row;
add_to_buffer(_next_row);
try {
move_to_next_entry();
@@ -341,14 +346,14 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
if (no_clustering_row_between(*_schema, _upper_bound, _next_row.position())) {
this->maybe_update_continuity();
} else if (can_populate()) {
rows_entry::compare less(*_schema);
rows_entry::tri_compare cmp(*_schema);
auto& rows = _snp->version()->partition().clustered_rows();
if (query::is_single_row(*_schema, *_ck_ranges_curr)) {
with_allocator(_snp->region().allocator(), [&] {
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(_ck_ranges_curr->start()->value()));
// Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.
auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);
auto insert_result = rows.insert_before_hint(_next_row.get_iterator_in_latest_version(), *e, cmp);
auto inserted = insert_result.second;
auto it = insert_result.first;
if (inserted) {
@@ -364,7 +369,7 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_schema, _upper_bound, is_dummy::yes, is_continuous::yes));
// Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.
auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);
auto insert_result = rows.insert_before_hint(_next_row.get_iterator_in_latest_version(), *e, cmp);
auto inserted = insert_result.second;
if (inserted) {
clogger.trace("csm {}: inserted dummy at {}", fmt::ptr(this), _upper_bound);
@@ -374,6 +379,7 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
clogger.trace("csm {}: mark {} as continuous", fmt::ptr(this), insert_result.first->position());
insert_result.first->set_continuous(true);
}
maybe_drop_last_entry();
});
}
} else {
@@ -404,12 +410,12 @@ bool cache_flat_mutation_reader::ensure_population_lower_bound() {
if (!_last_row.is_in_latest_version()) {
with_allocator(_snp->region().allocator(), [&] {
auto& rows = _snp->version()->partition().clustered_rows();
rows_entry::compare less(*_schema);
rows_entry::tri_compare cmp(*_schema);
// FIXME: Avoid the copy by inserting an incomplete clustering row
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_schema, *_last_row));
e->set_continuous(false);
auto insert_result = rows.insert_check(rows.end(), *e, less);
auto insert_result = rows.insert_before_hint(rows.end(), *e, cmp);
auto inserted = insert_result.second;
if (inserted) {
clogger.trace("csm {}: inserted lower bound dummy at {}", fmt::ptr(this), e->position());
@@ -427,6 +433,7 @@ void cache_flat_mutation_reader::maybe_update_continuity() {
with_allocator(_snp->region().allocator(), [&] {
rows_entry& e = _next_row.ensure_entry_in_latest().row;
e.set_continuous(true);
maybe_drop_last_entry();
});
} else {
_read_context->cache().on_mispopulate();
@@ -455,17 +462,17 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
clogger.trace("csm {}: populate({})", fmt::ptr(this), clustering_row::printer(*_schema, cr));
_lsa_manager.run_in_update_section_with_allocator([this, &cr] {
mutation_partition& mp = _snp->version()->partition();
rows_entry::compare less(*_schema);
rows_entry::tri_compare cmp(*_schema);
if (_read_context->digest_requested()) {
cr.cells().prepare_hash(*_schema, column_kind::regular_column);
}
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_schema, cr.key(), cr.tomb(), cr.marker(), cr.cells()));
current_allocator().construct<rows_entry>(*_schema, cr.key(), cr.as_deletable_row()));
new_entry->set_continuous(false);
auto it = _next_row.iterators_valid() ? _next_row.get_iterator_in_latest_version()
: mp.clustered_rows().lower_bound(cr.key(), less);
auto insert_result = mp.clustered_rows().insert_check(it, *new_entry, less);
: mp.clustered_rows().lower_bound(cr.key(), cmp);
auto insert_result = mp.clustered_rows().insert_before_hint(it, *new_entry, cmp);
if (insert_result.second) {
_snp->tracker()->insert(*new_entry);
new_entry.release();
@@ -521,7 +528,6 @@ void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
// We add the row to the buffer even when it's full.
// This simplifies the code. For more info see #3139.
if (_next_row_in_range) {
_last_row = _next_row;
add_to_buffer(_next_row);
move_to_next_entry();
} else {
@@ -570,8 +576,8 @@ void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::con
clogger.trace("csm {}: insert dummy at {}", fmt::ptr(this), _lower_bound);
auto it = with_allocator(_lsa_manager.region().allocator(), [&] {
auto& rows = _snp->version()->partition().clustered_rows();
auto new_entry = current_allocator().construct<rows_entry>(*_schema, _lower_bound, is_dummy::yes, is_continuous::no);
return rows.insert_before(_next_row.get_iterator_in_latest_version(), *new_entry);
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(current_allocator().construct<rows_entry>(*_schema, _lower_bound, is_dummy::yes, is_continuous::no));
return rows.insert_before(_next_row.get_iterator_in_latest_version(), std::move(new_entry));
});
_snp->tracker()->insert(*it);
_last_row = partition_snapshot_row_weakref(*_snp, it, true);
@@ -583,6 +589,38 @@ void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::con
}
}
// Drops _last_row entry when possible without changing logical contents of the partition.
// Call only when _last_row and _next_row are valid.
// Calling after ensure_population_lower_bound() is ok.
// _next_row must have a greater position than _last_row.
// Invalidates references but keeps the _next_row valid.
inline
void cache_flat_mutation_reader::maybe_drop_last_entry() noexcept {
// Drop dummy entry if it falls inside a continuous range.
// This prevents unnecessary dummy entries from accumulating in cache and slowing down scans.
//
// Eviction can happen only from oldest versions to preserve the continuity non-overlapping rule
// (See docs/design-notes/row_cache.md)
//
if (_last_row
&& _last_row->dummy()
&& _last_row->continuous()
&& _snp->at_latest_version()
&& _snp->at_oldest_version()) {
with_allocator(_snp->region().allocator(), [&] {
_last_row->on_evicted(_read_context->cache()._tracker);
});
_last_row = nullptr;
// There could be iterators pointing to _last_row, invalidate them
_snp->region().allocator().invalidate_references();
// Don't invalidate _next_row, move_to_next_entry() expects it to be still valid.
_next_row.force_valid();
}
}
// _next_row must be inside the range.
inline
void cache_flat_mutation_reader::move_to_next_entry() {
@@ -590,14 +628,18 @@ void cache_flat_mutation_reader::move_to_next_entry() {
if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) {
move_to_next_range();
} else {
auto new_last_row = partition_snapshot_row_weakref(_next_row);
if (!_next_row.next()) {
move_to_end();
return;
}
_last_row = std::move(new_last_row);
_next_row_in_range = !after_current_range(_next_row.position());
clogger.trace("csm {}: next={}, cont={}, in_range={}", fmt::ptr(this), _next_row.position(), _next_row.continuous(), _next_row_in_range);
if (!_next_row.continuous()) {
start_reading_from_underlying();
} else {
maybe_drop_last_entry();
}
}
}
@@ -618,6 +660,13 @@ void cache_flat_mutation_reader::add_to_buffer(const partition_snapshot_row_curs
if (!row.dummy()) {
_read_context->cache().on_row_hit();
add_clustering_row_to_buffer(mutation_fragment(*_schema, _permit, row.row(_read_context->digest_requested())));
} else {
position_in_partition::less_compare less(*_schema);
if (less(_lower_bound, row.position())) {
_lower_bound = row.position();
_lower_bound_changed = true;
}
_read_context->cache()._tracker.on_dummy_row_hit();
}
}

View File

@@ -37,7 +37,7 @@
#include "idl/mutation.dist.impl.hh"
#include <iostream>
canonical_mutation::canonical_mutation(bytes data)
canonical_mutation::canonical_mutation(bytes_ostream data)
: _data(std::move(data))
{ }
@@ -45,8 +45,7 @@ canonical_mutation::canonical_mutation(const mutation& m)
{
mutation_partition_serializer part_ser(*m.schema(), m.partition());
bytes_ostream out;
ser::writer_of_canonical_mutation<bytes_ostream> wr(out);
ser::writer_of_canonical_mutation<bytes_ostream> wr(_data);
std::move(wr).write_table_id(m.schema()->id())
.write_schema_version(m.schema()->version())
.write_key(m.key())
@@ -54,7 +53,6 @@ canonical_mutation::canonical_mutation(const mutation& m)
.partition([&] (auto wr) {
part_ser.write(std::move(wr));
}).end_canonical_mutation();
_data = to_bytes(out.linearize());
}
utils::UUID canonical_mutation::column_family_id() const {

View File

@@ -32,9 +32,9 @@
// Safe to access from other shards via const&.
// Safe to pass serialized across nodes.
class canonical_mutation {
bytes _data;
bytes_ostream _data;
public:
explicit canonical_mutation(bytes);
explicit canonical_mutation(bytes_ostream);
explicit canonical_mutation(const mutation&);
canonical_mutation(canonical_mutation&&) = default;
@@ -51,7 +51,7 @@ public:
utils::UUID column_family_id() const;
const bytes& representation() const { return _data; }
const bytes_ostream& representation() const { return _data; }
friend std::ostream& operator<<(std::ostream& os, const canonical_mutation& cm);
};

View File

@@ -22,10 +22,13 @@
#include <boost/type.hpp>
#include <random>
#include <unordered_set>
#include <algorithm>
#include <seastar/core/sleep.hh>
#include <seastar/core/coroutine.hh>
#include "keys.hh"
#include "schema_builder.hh"
#include "database.hh"
#include "db/config.hh"
#include "db/system_keyspace.hh"
#include "db/system_distributed_keyspace.hh"
@@ -36,6 +39,8 @@
#include "gms/gossiper.hh"
#include "cdc/generation.hh"
#include "cdc/cdc_options.hh"
#include "cdc/generation_service.hh"
extern logging::logger cdc_log;
@@ -174,10 +179,29 @@ bool topology_description::operator==(const topology_description& o) const {
return _entries == o._entries;
}
const std::vector<token_range_description>& topology_description::entries() const {
const std::vector<token_range_description>& topology_description::entries() const& {
return _entries;
}
std::vector<token_range_description>&& topology_description::entries() && {
return std::move(_entries);
}
static std::vector<stream_id> create_stream_ids(
size_t index, dht::token start, dht::token end, size_t shard_count, uint8_t ignore_msb) {
std::vector<stream_id> result;
result.reserve(shard_count);
dht::sharder sharder(shard_count, ignore_msb);
for (size_t shard_idx = 0; shard_idx < shard_count; ++shard_idx) {
auto t = dht::find_first_token_for_shard(sharder, start, end, shard_idx);
// compose the id from token and the "index" of the range end owning vnode
// as defined by token sort order. Basically grouping within this
// shard set.
result.emplace_back(stream_id(t, index));
}
return result;
}
class topology_description_generator final {
const db::config& _cfg;
const std::unordered_set<dht::token>& _bootstrap_tokens;
@@ -217,18 +241,9 @@ class topology_description_generator final {
desc.token_range_end = end;
auto [shard_count, ignore_msb] = get_sharding_info(end);
desc.streams.reserve(shard_count);
desc.streams = create_stream_ids(index, start, end, shard_count, ignore_msb);
desc.sharding_ignore_msb = ignore_msb;
dht::sharder sharder(shard_count, ignore_msb);
for (size_t shard_idx = 0; shard_idx < shard_count; ++shard_idx) {
auto t = dht::find_first_token_for_shard(sharder, start, end, shard_idx);
// compose the id from token and the "index" of the range end owning vnode
// as defined by token sort order. Basically grouping within this
// shard set.
desc.streams.emplace_back(stream_id(t, index));
}
return desc;
}
public:
@@ -294,8 +309,39 @@ future<db_clock::time_point> get_local_streams_timestamp() {
});
}
// Run inside seastar::async context.
db_clock::time_point make_new_cdc_generation(
// non-static for testing
size_t limit_of_streams_in_topology_description() {
// Each stream takes 16B and we don't want to exceed 4MB so we can have
// at most 262144 streams but not less than 1 per vnode.
return 4 * 1024 * 1024 / 16;
}
// non-static for testing
topology_description limit_number_of_streams_if_needed(topology_description&& desc) {
int64_t streams_count = 0;
for (auto& tr_desc : desc.entries()) {
streams_count += tr_desc.streams.size();
}
size_t limit = std::max(limit_of_streams_in_topology_description(), desc.entries().size());
if (limit >= streams_count) {
return std::move(desc);
}
size_t streams_per_vnode_limit = limit / desc.entries().size();
auto entries = std::move(desc).entries();
auto start = entries.back().token_range_end;
for (size_t idx = 0; idx < entries.size(); ++idx) {
auto end = entries[idx].token_range_end;
if (entries[idx].streams.size() > streams_per_vnode_limit) {
entries[idx].streams =
create_stream_ids(idx, start, end, streams_per_vnode_limit, entries[idx].sharding_ignore_msb);
}
start = end;
}
return topology_description(std::move(entries));
}
future<db_clock::time_point> make_new_cdc_generation(
const db::config& cfg,
const std::unordered_set<dht::token>& bootstrap_tokens,
const locator::token_metadata_ptr tmptr,
@@ -306,13 +352,25 @@ db_clock::time_point make_new_cdc_generation(
using namespace std::chrono;
auto gen = topology_description_generator(cfg, bootstrap_tokens, tmptr, g).generate();
// If the cluster is large we may end up with a generation that contains
// large number of streams. This is problematic because we store the
// generation in a single row. For a generation with large number of rows
// this will lead to a row that can be as big as 32MB. This is much more
// than the limit imposed by commitlog_segment_size_in_mb. If the size of
// the row that describes a new generation grows above
// commitlog_segment_size_in_mb, the write will fail and the new node won't
// be able to join. To avoid such problem we make sure that such row is
// always smaller than 4MB. We do that by removing some CDC streams from
// each vnode if the total number of streams is too large.
gen = limit_number_of_streams_if_needed(std::move(gen));
// Begin the race.
auto ts = db_clock::now() + (
(!add_delay || ring_delay == milliseconds(0)) ? milliseconds(0) : (
2 * ring_delay + duration_cast<milliseconds>(generation_leeway)));
sys_dist_ks.insert_cdc_topology_description(ts, std::move(gen), { tmptr->count_normal_token_owners() }).get();
co_await sys_dist_ks.insert_cdc_topology_description(ts, std::move(gen), { tmptr->count_normal_token_owners() });
return ts;
co_return ts;
}
std::optional<db_clock::time_point> get_streams_timestamp_for(const gms::inet_address& endpoint, const gms::gossiper& g) {
@@ -321,63 +379,581 @@ std::optional<db_clock::time_point> get_streams_timestamp_for(const gms::inet_ad
return gms::versioned_value::cdc_streams_timestamp_from_string(streams_ts_string);
}
// Run inside seastar::async context.
static void do_update_streams_description(
static future<> do_update_streams_description(
db_clock::time_point streams_ts,
db::system_distributed_keyspace& sys_dist_ks,
db::system_distributed_keyspace::context ctx) {
if (sys_dist_ks.cdc_desc_exists(streams_ts, ctx).get0()) {
cdc_log.debug("update_streams_description: description of generation {} already inserted", streams_ts);
return;
if (co_await sys_dist_ks.cdc_desc_exists(streams_ts, ctx)) {
cdc_log.info("Generation {}: streams description table already updated.", streams_ts);
co_return;
}
// We might race with another node also inserting the description, but that's ok. It's an idempotent operation.
auto topo = sys_dist_ks.read_cdc_topology_description(streams_ts, ctx).get0();
auto topo = co_await sys_dist_ks.read_cdc_topology_description(streams_ts, ctx);
if (!topo) {
throw std::runtime_error(format("could not find streams data for timestamp {}", streams_ts));
throw no_generation_data_exception(streams_ts);
}
std::set<cdc::stream_id> streams_set;
for (auto& entry: topo->entries()) {
streams_set.insert(entry.streams.begin(), entry.streams.end());
}
std::vector<cdc::stream_id> streams_vec(streams_set.begin(), streams_set.end());
sys_dist_ks.create_cdc_desc(streams_ts, streams_vec, ctx).get();
co_await sys_dist_ks.create_cdc_desc(streams_ts, *topo, ctx);
cdc_log.info("CDC description table successfully updated with generation {}.", streams_ts);
}
void update_streams_description(
future<> update_streams_description(
db_clock::time_point streams_ts,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
try {
do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
} catch(...) {
co_await do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
} catch (...) {
cdc_log.warn(
"Could not update CDC description table with generation {}: {}. Will retry in the background.",
streams_ts, std::current_exception());
// It is safe to discard this future: we keep system distributed keyspace alive.
(void)seastar::async([
streams_ts, sys_dist_ks, get_num_token_owners = std::move(get_num_token_owners), &abort_src
] {
(void)(([] (db_clock::time_point streams_ts,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) -> future<> {
while (true) {
sleep_abortable(std::chrono::seconds(60), abort_src).get();
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
try {
do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
return;
co_await do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
co_return;
} catch (...) {
cdc_log.warn(
"Could not update CDC description table with generation {}: {}. Will try again.",
streams_ts, std::current_exception());
}
}
});
})(streams_ts, std::move(sys_dist_ks), std::move(get_num_token_owners), abort_src));
}
}
static db_clock::time_point as_timepoint(const utils::UUID& uuid) {
return db_clock::time_point{std::chrono::milliseconds(utils::UUID_gen::get_adjusted_timestamp(uuid))};
}
static future<std::vector<db_clock::time_point>> get_cdc_desc_v1_timestamps(
db::system_distributed_keyspace& sys_dist_ks,
abort_source& abort_src,
const noncopyable_function<unsigned()>& get_num_token_owners) {
while (true) {
try {
co_return co_await sys_dist_ks.get_cdc_desc_v1_timestamps({ get_num_token_owners() });
} catch (...) {
cdc_log.warn(
"Failed to retrieve generation timestamps for rewriting: {}. Retrying in 60s.",
std::current_exception());
}
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
}
}
// Contains a CDC log table's creation time (extracted from its schema's id)
// and its CDC TTL setting.
struct time_and_ttl {
db_clock::time_point creation_time;
int ttl;
};
/*
* See `maybe_rewrite_streams_descriptions`.
* This is the long-running-in-the-background part of that function.
* It returns the timestamp of the last rewritten generation (if any).
*/
static future<std::optional<db_clock::time_point>> rewrite_streams_descriptions(
std::vector<time_and_ttl> times_and_ttls,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
cdc_log.info("Retrieving generation timestamps for rewriting...");
auto tss = co_await get_cdc_desc_v1_timestamps(*sys_dist_ks, abort_src, get_num_token_owners);
cdc_log.info("Generation timestamps retrieved.");
// Find first generation timestamp such that some CDC log table may contain data before this timestamp.
// This predicate is monotonic w.r.t the timestamps.
auto now = db_clock::now();
std::sort(tss.begin(), tss.end());
auto first = std::partition_point(tss.begin(), tss.end(), [&] (db_clock::time_point ts) {
// partition_point finds first element that does *not* satisfy the predicate.
return std::none_of(times_and_ttls.begin(), times_and_ttls.end(),
[&] (const time_and_ttl& tat) {
// In this CDC log table there are no entries older than the table's creation time
// or (now - the table's ttl). We subtract 10s to account for some possible clock drift.
// If ttl is set to 0 then entries in this table never expire. In that case we look
// only at the table's creation time.
auto no_entries_older_than =
(tat.ttl == 0 ? tat.creation_time : std::max(tat.creation_time, now - std::chrono::seconds(tat.ttl)))
- std::chrono::seconds(10);
return no_entries_older_than < ts;
});
});
// Find first generation timestamp such that some CDC log table may contain data in this generation.
// This and all later generations need to be written to the new streams table.
if (first != tss.begin()) {
--first;
}
if (first == tss.end()) {
cdc_log.info("No generations to rewrite.");
co_return std::nullopt;
}
cdc_log.info("First generation to rewrite: {}", *first);
bool each_success = true;
co_await max_concurrent_for_each(first, tss.end(), 10, [&] (db_clock::time_point ts) -> future<> {
while (true) {
try {
co_return co_await do_update_streams_description(ts, *sys_dist_ks, { get_num_token_owners() });
} catch (const no_generation_data_exception& e) {
cdc_log.error("Failed to rewrite streams for generation {}: {}. Giving up.", ts, e);
each_success = false;
co_return;
} catch (...) {
cdc_log.warn("Failed to rewrite streams for generation {}: {}. Retrying in 60s.", ts, std::current_exception());
}
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
}
});
if (each_success) {
cdc_log.info("Rewriting stream tables finished successfully.");
} else {
cdc_log.info("Rewriting stream tables finished, but some generations could not be rewritten (check the logs).");
}
if (first != tss.end()) {
co_return *std::prev(tss.end());
}
co_return std::nullopt;
}
future<> maybe_rewrite_streams_descriptions(
const database& db,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
if (!db.has_schema(sys_dist_ks->NAME, sys_dist_ks->CDC_DESC_V1)) {
// This cluster never went through a Scylla version which used this table
// or the user deleted the table. Nothing to do.
co_return;
}
if (co_await db::system_keyspace::cdc_is_rewritten()) {
co_return;
}
if (db.get_config().cdc_dont_rewrite_streams()) {
cdc_log.warn("Stream rewriting disabled. Manual administrator intervention may be required...");
co_return;
}
// For each CDC log table get the TTL setting (from CDC options) and the table's creation time
std::vector<time_and_ttl> times_and_ttls;
for (auto& [_, cf] : db.get_column_families()) {
auto& s = *cf->schema();
auto base = cdc::get_base_table(db, s.ks_name(), s.cf_name());
if (!base) {
// Not a CDC log table.
continue;
}
auto& cdc_opts = base->cdc_options();
if (!cdc_opts.enabled()) {
// This table is named like a CDC log table but it's not one.
continue;
}
times_and_ttls.push_back(time_and_ttl{as_timepoint(s.id()), cdc_opts.ttl()});
}
if (times_and_ttls.empty()) {
// There's no point in rewriting old generations' streams (they don't contain any data).
cdc_log.info("No CDC log tables present, not rewriting stream tables.");
co_return co_await db::system_keyspace::cdc_set_rewritten(std::nullopt);
}
// It's safe to discard this future: the coroutine keeps system_distributed_keyspace alive
// and the abort source's lifetime extends the lifetime of any other service.
(void)(([_times_and_ttls = std::move(times_and_ttls), _sys_dist_ks = std::move(sys_dist_ks),
_get_num_token_owners = std::move(get_num_token_owners), &_abort_src = abort_src] () mutable -> future<> {
auto times_and_ttls = std::move(_times_and_ttls);
auto sys_dist_ks = std::move(_sys_dist_ks);
auto get_num_token_owners = std::move(_get_num_token_owners);
auto& abort_src = _abort_src;
// This code is racing with node startup. At this point, we're most likely still waiting for gossip to settle
// and some nodes that are UP may still be marked as DOWN by us.
// Let's sleep a bit to increase the chance that the first attempt at rewriting succeeds (it's still ok if
// it doesn't - we'll retry - but it's nice if we succeed without any warnings).
co_await sleep_abortable(std::chrono::seconds(10), abort_src);
cdc_log.info("Rewriting stream tables in the background...");
auto last_rewritten = co_await rewrite_streams_descriptions(
std::move(times_and_ttls),
std::move(sys_dist_ks),
std::move(get_num_token_owners),
abort_src);
co_await db::system_keyspace::cdc_set_rewritten(last_rewritten);
})());
}
static void assert_shard_zero(const sstring& where) {
if (this_shard_id() != 0) {
on_internal_error(cdc_log, format("`{}`: must be run on shard 0", where));
}
}
class and_reducer {
private:
bool _result = true;
public:
future<> operator()(bool value) {
_result = value && _result;
return make_ready_future<>();
}
bool get() {
return _result;
}
};
class or_reducer {
private:
bool _result = false;
public:
future<> operator()(bool value) {
_result = value || _result;
return make_ready_future<>();
}
bool get() {
return _result;
}
};
class generation_handling_nonfatal_exception : public std::runtime_error {
using std::runtime_error::runtime_error;
};
constexpr char could_not_retrieve_msg_template[]
= "Could not retrieve CDC streams with timestamp {} upon gossip event. Reason: \"{}\". Action: {}.";
generation_service::generation_service(
const db::config& cfg, gms::gossiper& g, sharded<db::system_distributed_keyspace>& sys_dist_ks,
abort_source& abort_src, const locator::shared_token_metadata& stm)
: _cfg(cfg), _gossiper(g), _sys_dist_ks(sys_dist_ks), _abort_src(abort_src), _token_metadata(stm) {
}
future<> generation_service::stop() {
if (this_shard_id() == 0) {
co_await _gossiper.unregister_(shared_from_this());
}
_stopped = true;
}
generation_service::~generation_service() {
assert(_stopped);
}
future<> generation_service::after_join(std::optional<db_clock::time_point>&& startup_gen_ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
assert(db::system_keyspace::bootstrap_complete());
_gen_ts = std::move(startup_gen_ts);
_gossiper.register_(shared_from_this());
_joined = true;
// Retrieve the latest CDC generation seen in gossip (if any).
co_await scan_cdc_generations();
}
void generation_service::on_join(gms::inet_address ep, gms::endpoint_state ep_state) {
assert_shard_zero(__PRETTY_FUNCTION__);
auto val = ep_state.get_application_state_ptr(gms::application_state::CDC_STREAMS_TIMESTAMP);
if (!val) {
return;
}
on_change(ep, gms::application_state::CDC_STREAMS_TIMESTAMP, *val);
}
void generation_service::on_change(gms::inet_address ep, gms::application_state app_state, const gms::versioned_value& v) {
assert_shard_zero(__PRETTY_FUNCTION__);
if (app_state != gms::application_state::CDC_STREAMS_TIMESTAMP) {
return;
}
auto ts = gms::versioned_value::cdc_streams_timestamp_from_string(v.value);
cdc_log.debug("Endpoint: {}, CDC generation timestamp change: {}", ep, ts);
handle_cdc_generation(ts).get();
}
future<> generation_service::check_and_repair_cdc_streams() {
if (!_joined) {
throw std::runtime_error("check_and_repair_cdc_streams: node not initialized yet");
}
auto latest = _gen_ts;
const auto& endpoint_states = _gossiper.get_endpoint_states();
for (const auto& [addr, state] : endpoint_states) {
if (!_gossiper.is_normal(addr)) {
throw std::runtime_error(format("All nodes must be in NORMAL state while performing check_and_repair_cdc_streams"
" ({} is in state {})", addr, _gossiper.get_gossip_status(state)));
}
const auto ts = get_streams_timestamp_for(addr, _gossiper);
if (!latest || (ts && *ts > *latest)) {
latest = ts;
}
}
bool should_regenerate = false;
std::optional<topology_description> gen;
static const auto timeout_msg = "Timeout while fetching CDC topology description";
static const auto topology_read_error_note = "Note: this is likely caused by"
" node(s) being down or unreachable. It is recommended to check the network and"
" restart/remove the failed node(s), then retry checkAndRepairCdcStreams command";
static const auto exception_translating_msg = "Translating the exception to `request_execution_exception`";
const auto tmptr = _token_metadata.get();
auto sys_dist_ks = get_sys_dist_ks();
try {
gen = co_await sys_dist_ks->read_cdc_topology_description(
*latest, { tmptr->count_normal_token_owners() });
} catch (exceptions::request_timeout_exception& e) {
cdc_log.error("{}: \"{}\". {}.", timeout_msg, e.what(), exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::READ_TIMEOUT,
format("{}. {}.", timeout_msg, topology_read_error_note));
} catch (exceptions::unavailable_exception& e) {
static const auto unavailable_msg = "Node(s) unavailable while fetching CDC topology description";
cdc_log.error("{}: \"{}\". {}.", unavailable_msg, e.what(), exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::UNAVAILABLE,
format("{}. {}.", unavailable_msg, topology_read_error_note));
} catch (...) {
const auto ep = std::current_exception();
if (is_timeout_exception(ep)) {
cdc_log.error("{}: \"{}\". {}.", timeout_msg, ep, exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::READ_TIMEOUT,
format("{}. {}.", timeout_msg, topology_read_error_note));
}
// On exotic errors proceed with regeneration
cdc_log.error("Exception while reading CDC topology description: \"{}\". Regenerating streams anyway.", ep);
should_regenerate = true;
}
if (!gen) {
cdc_log.error(
"Could not find CDC generation with timestamp {} in distributed system tables (current time: {}),"
" even though some node gossiped about it.",
latest, db_clock::now());
should_regenerate = true;
} else {
std::unordered_set<dht::token> gen_ends;
for (const auto& entry : gen->entries()) {
gen_ends.insert(entry.token_range_end);
}
for (const auto& metadata_token : tmptr->sorted_tokens()) {
if (!gen_ends.contains(metadata_token)) {
cdc_log.warn("CDC generation {} missing token {}. Regenerating.", latest, metadata_token);
should_regenerate = true;
break;
}
}
}
if (!should_regenerate) {
if (latest != _gen_ts) {
co_await do_handle_cdc_generation(*latest);
}
cdc_log.info("CDC generation {} does not need repair", latest);
co_return;
}
const auto new_gen_ts = co_await make_new_cdc_generation(_cfg,
{}, std::move(tmptr), _gossiper, *sys_dist_ks,
std::chrono::milliseconds(_cfg.ring_delay_ms()), true /* add delay */);
// Need to artificially update our STATUS so other nodes handle the timestamp change
auto status = _gossiper.get_application_state_ptr(
utils::fb_utilities::get_broadcast_address(), gms::application_state::STATUS);
if (!status) {
cdc_log.error("Our STATUS is missing");
cdc_log.error("Aborting CDC generation repair due to missing STATUS");
co_return;
}
// Update _gen_ts first, so that do_handle_cdc_generation (which will get called due to the status update)
// won't try to update the gossiper, which would result in a deadlock inside add_local_application_state
_gen_ts = new_gen_ts;
co_await _gossiper.add_local_application_state({
{ gms::application_state::CDC_STREAMS_TIMESTAMP, gms::versioned_value::cdc_streams_timestamp(new_gen_ts) },
{ gms::application_state::STATUS, *status }
});
co_await db::system_keyspace::update_cdc_streams_timestamp(new_gen_ts);
}
future<> generation_service::handle_cdc_generation(std::optional<db_clock::time_point> ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
if (!ts) {
co_return;
}
if (!db::system_keyspace::bootstrap_complete() || !_sys_dist_ks.local_is_initialized()
|| !_sys_dist_ks.local().started()) {
// The service should not be listening for generation changes until after the node
// is bootstrapped. Therefore we would previously assume that this condition
// can never become true and call on_internal_error here, but it turns out that
// it may become true on decommission: the node enters NEEDS_BOOTSTRAP
// state before leaving the token ring, so bootstrap_complete() becomes false.
// In that case we can simply return.
co_return;
}
if (co_await container().map_reduce(and_reducer(), [ts = *ts] (generation_service& svc) {
return !svc._cdc_metadata.prepare(ts);
})) {
co_return;
}
bool using_this_gen = false;
try {
using_this_gen = co_await do_handle_cdc_generation_intercept_nonfatal_errors(*ts);
} catch (generation_handling_nonfatal_exception& e) {
cdc_log.warn(could_not_retrieve_msg_template, ts, e.what(), "retrying in the background");
async_handle_cdc_generation(*ts);
co_return;
} catch (...) {
cdc_log.error(could_not_retrieve_msg_template, ts, std::current_exception(), "not retrying");
co_return; // Exotic ("fatal") exception => do not retry
}
if (using_this_gen) {
cdc_log.info("Starting to use generation {}", *ts);
co_await update_streams_description(*ts, get_sys_dist_ks(),
[tmptr = _token_metadata.get()] { return tmptr->count_normal_token_owners(); },
_abort_src);
}
}
void generation_service::async_handle_cdc_generation(db_clock::time_point ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
(void)(([] (db_clock::time_point ts, shared_ptr<generation_service> svc) -> future<> {
while (true) {
co_await sleep_abortable(std::chrono::seconds(5), svc->_abort_src);
try {
bool using_this_gen = co_await svc->do_handle_cdc_generation_intercept_nonfatal_errors(ts);
if (using_this_gen) {
cdc_log.info("Starting to use generation {}", ts);
co_await update_streams_description(ts, svc->get_sys_dist_ks(),
[tmptr = svc->_token_metadata.get()] { return tmptr->count_normal_token_owners(); },
svc->_abort_src);
}
co_return;
} catch (generation_handling_nonfatal_exception& e) {
cdc_log.warn(could_not_retrieve_msg_template, ts, e.what(), "continuing to retry in the background");
} catch (...) {
cdc_log.error(could_not_retrieve_msg_template, ts, std::current_exception(), "not retrying anymore");
co_return; // Exotic ("fatal") exception => do not retry
}
if (co_await svc->container().map_reduce(and_reducer(), [ts] (generation_service& svc) {
return svc._cdc_metadata.known_or_obsolete(ts);
})) {
co_return;
}
}
})(ts, shared_from_this()));
}
future<> generation_service::scan_cdc_generations() {
assert_shard_zero(__PRETTY_FUNCTION__);
std::optional<db_clock::time_point> latest;
for (const auto& ep: _gossiper.get_endpoint_states()) {
auto ts = get_streams_timestamp_for(ep.first, _gossiper);
if (!latest || (ts && *ts > *latest)) {
latest = ts;
}
}
if (latest) {
cdc_log.info("Latest generation seen during startup: {}", *latest);
co_await handle_cdc_generation(latest);
} else {
cdc_log.info("No generation seen during startup.");
}
}
future<bool> generation_service::do_handle_cdc_generation_intercept_nonfatal_errors(db_clock::time_point ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
try {
co_return co_await do_handle_cdc_generation(ts);
} catch (exceptions::request_timeout_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (exceptions::unavailable_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (exceptions::read_failure_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (...) {
const auto ep = std::current_exception();
if (is_timeout_exception(ep)) {
throw generation_handling_nonfatal_exception(format("{}", ep));
}
throw;
}
}
future<bool> generation_service::do_handle_cdc_generation(db_clock::time_point ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
auto sys_dist_ks = get_sys_dist_ks();
auto gen = co_await sys_dist_ks->read_cdc_topology_description(
ts, { _token_metadata.get()->count_normal_token_owners() });
if (!gen) {
throw std::runtime_error(format(
"Could not find CDC generation with timestamp {} in distributed system tables (current time: {}),"
" even though some node gossiped about it.",
ts, db_clock::now()));
}
// If we're not gossiping our own generation timestamp (because we've upgraded from a non-CDC/old version,
// or we somehow lost it due to a byzantine failure), start gossiping someone else's timestamp.
// This is to avoid the upgrade check on every restart (see `should_propose_first_cdc_generation`).
// And if we notice that `ts` is higher than our timestamp, we will start gossiping it instead,
// so if the node that initially gossiped `ts` leaves the cluster while `ts` is still the latest generation,
// the cluster will remember.
if (!_gen_ts || *_gen_ts < ts) {
_gen_ts = ts;
co_await db::system_keyspace::update_cdc_streams_timestamp(ts);
co_await _gossiper.add_local_application_state(
gms::application_state::CDC_STREAMS_TIMESTAMP, gms::versioned_value::cdc_streams_timestamp(ts));
}
// Return `true` iff the generation was inserted on any of our shards.
co_return co_await container().map_reduce(or_reducer(), [ts, &gen] (generation_service& svc) {
auto gen_ = *gen;
return svc._cdc_metadata.insert(ts, std::move(gen_));
});
}
shared_ptr<db::system_distributed_keyspace> generation_service::get_sys_dist_ks() {
assert_shard_zero(__PRETTY_FUNCTION__);
if (!_sys_dist_ks.local_is_initialized()) {
throw std::runtime_error("system distributed keyspace not initialized");
}
return _sys_dist_ks.local_shared();
}
} // namespace cdc

View File

@@ -41,6 +41,7 @@
#include "db_clock.hh"
#include "dht/token.hh"
#include "locator/token_metadata.hh"
#include "utils/chunked_vector.hh"
namespace seastar {
class abort_source;
@@ -65,6 +66,7 @@ public:
stream_id() = default;
stream_id(bytes);
stream_id(dht::token, size_t);
bool is_set() const;
bool operator==(const stream_id&) const;
@@ -78,9 +80,6 @@ public:
partition_key to_partition_key(const schema& log_schema) const;
static int64_t token_from_bytes(bytes_view);
private:
friend class topology_description_generator;
stream_id(dht::token, size_t);
};
/* Describes a mapping of tokens to CDC streams in a token range.
@@ -113,7 +112,8 @@ public:
topology_description(std::vector<token_range_description> entries);
bool operator==(const topology_description&) const;
const std::vector<token_range_description>& entries() const;
const std::vector<token_range_description>& entries() const&;
std::vector<token_range_description>&& entries() &&;
};
/**
@@ -122,14 +122,19 @@ public:
*/
class streams_version {
public:
std::vector<stream_id> streams;
utils::chunked_vector<stream_id> streams;
db_clock::time_point timestamp;
std::optional<db_clock::time_point> expired;
streams_version(std::vector<stream_id> s, db_clock::time_point ts, std::optional<db_clock::time_point> exp)
streams_version(utils::chunked_vector<stream_id> s, db_clock::time_point ts)
: streams(std::move(s))
, timestamp(ts)
, expired(std::move(exp))
{}
};
class no_generation_data_exception : public std::runtime_error {
public:
no_generation_data_exception(db_clock::time_point generation_ts)
: std::runtime_error(format("could not find generation data for timestamp {}", generation_ts))
{}
};
@@ -162,7 +167,7 @@ future<db_clock::time_point> get_local_streams_timestamp();
* (not guaranteed in the current implementation, but expected to be the common case;
* we assume that `ring_delay` is enough for other nodes to learn about the new generation).
*/
db_clock::time_point make_new_cdc_generation(
future<db_clock::time_point> make_new_cdc_generation(
const db::config& cfg,
const std::unordered_set<dht::token>& bootstrap_tokens,
const locator::token_metadata_ptr tmptr,
@@ -185,13 +190,22 @@ std::optional<db_clock::time_point> get_streams_timestamp_for(const gms::inet_ad
*
* Returning from this function does not mean that the table update was successful: the function
* might run an asynchronous task in the background.
*
* Run inside seastar::async context.
*/
void update_streams_description(
future<> update_streams_description(
db_clock::time_point,
shared_ptr<db::system_distributed_keyspace>,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source&);
/* Part of the upgrade procedure. Useful in case where the version of Scylla that we're upgrading from
* used the "cdc_streams_descriptions" table. This procedure ensures that the new "cdc_streams_descriptions_v2"
* table contains streams of all generations that were present in the old table and may still contain data
* (i.e. there exist CDC log tables that may contain rows with partition keys being the stream IDs from
* these generations). */
future<> maybe_rewrite_streams_descriptions(
const database&,
shared_ptr<db::system_distributed_keyspace>,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source&);
} // namespace cdc

138
cdc/generation_service.hh Normal file
View File

@@ -0,0 +1,138 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* Modified by ScyllaDB
* Copyright (C) 2021 ScyllaDB
*
*/
#pragma once
#include "cdc/metadata.hh"
#include "gms/i_endpoint_state_change_subscriber.hh"
namespace db {
class system_distributed_keyspace;
}
namespace gms {
class gossiper;
}
namespace cdc {
class generation_service : public peering_sharded_service<generation_service>
, public async_sharded_service<generation_service>
, public gms::i_endpoint_state_change_subscriber {
bool _stopped = false;
// The node has joined the token ring. Set to `true` on `after_join` call.
bool _joined = false;
const db::config& _cfg;
gms::gossiper& _gossiper;
sharded<db::system_distributed_keyspace>& _sys_dist_ks;
abort_source& _abort_src;
const locator::shared_token_metadata& _token_metadata;
/* Maintains the set of known CDC generations used to pick streams for log writes (i.e., the partition keys of these log writes).
* Updated in response to certain gossip events (see the handle_cdc_generation function).
*/
cdc::metadata _cdc_metadata;
/* The latest known generation timestamp and the timestamp that we're currently gossiping
* (as CDC_STREAMS_TIMESTAMP application state).
*
* Only shard 0 manages this, hence it will be std::nullopt on all shards other than 0.
* This timestamp is also persisted in the system.cdc_local table.
*
* On shard 0 this may be nullopt only in one special case: rolling upgrade, when we upgrade
* from an old version of Scylla that didn't support CDC. In that case one node in the cluster
* will create the first generation and start gossiping it; it may be us, or it may be some
* different node. In any case, eventually - after one of the nodes gossips the first timestamp
* - we'll catch on and this variable will be updated with that generation.
*/
std::optional<db_clock::time_point> _gen_ts;
public:
generation_service(const db::config&, gms::gossiper&,
sharded<db::system_distributed_keyspace>&, abort_source&, const locator::shared_token_metadata&);
future<> stop();
~generation_service();
/* After the node bootstraps and creates a new CDC generation, or restarts and loads the last
* known generation timestamp from persistent storage, this function should be called with
* that generation timestamp moved in as the `startup_gen_ts` parameter.
* This passes the responsibility of managing generations from the node startup code to this service;
* until then, the service remains dormant.
* At the time of writing this comment, the startup code is in `storage_service::join_token_ring`, hence
* `after_join` should be called at the end of that function.
* Precondition: the node has completed bootstrapping and system_distributed_keyspace is initialized.
* Must be called on shard 0 - that's where the generation management happens.
*/
future<> after_join(std::optional<db_clock::time_point>&& startup_gen_ts);
cdc::metadata& get_cdc_metadata() {
return _cdc_metadata;
}
virtual void before_change(gms::inet_address, gms::endpoint_state, gms::application_state, const gms::versioned_value&) override {}
virtual void on_alive(gms::inet_address, gms::endpoint_state) override {}
virtual void on_dead(gms::inet_address, gms::endpoint_state) override {}
virtual void on_remove(gms::inet_address) override {}
virtual void on_restart(gms::inet_address, gms::endpoint_state) override {}
virtual void on_join(gms::inet_address, gms::endpoint_state) override;
virtual void on_change(gms::inet_address, gms::application_state, const gms::versioned_value&) override;
future<> check_and_repair_cdc_streams();
private:
/* Retrieve the CDC generation which starts at the given timestamp (from a distributed table created for this purpose)
* and start using it for CDC log writes if it's not obsolete.
*/
future<> handle_cdc_generation(std::optional<db_clock::time_point>);
/* If `handle_cdc_generation` fails, it schedules an asynchronous retry in the background
* using `async_handle_cdc_generation`.
*/
void async_handle_cdc_generation(db_clock::time_point);
/* Wrapper around `do_handle_cdc_generation` which intercepts timeout/unavailability exceptions.
* Returns: do_handle_cdc_generation(ts). */
future<bool> do_handle_cdc_generation_intercept_nonfatal_errors(db_clock::time_point);
/* Returns `true` iff we started using the generation (it was not obsolete or already known),
* which means that this node might write some CDC log entries using streams from this generation. */
future<bool> do_handle_cdc_generation(db_clock::time_point);
/* Scan CDC generation timestamps gossiped by other nodes and retrieve the latest one.
* This function should be called once at the end of the node startup procedure
* (after the node is started and running normally, it will retrieve generations on gossip events instead).
*/
future<> scan_cdc_generations();
/* generation_service code might be racing with system_distributed_keyspace deinitialization
* (the deinitialization order is broken).
* Therefore, whenever we want to access sys_dist_ks in a background task,
* we need to check if the instance is still there. Storing the shared pointer will keep it alive.
*/
shared_ptr<db::system_distributed_keyspace> get_sys_dist_ks();
};
} // namespace cdc

View File

@@ -32,6 +32,7 @@
#include "cdc/split.hh"
#include "cdc/cdc_options.hh"
#include "cdc/change_visitor.hh"
#include "cdc/metadata.hh"
#include "bytes.hh"
#include "database.hh"
#include "db/config.hh"
@@ -48,6 +49,9 @@
#include "cql3/untyped_result_set.hh"
#include "log.hh"
#include "utils/rjson.hh"
#include "utils/UUID_gen.hh"
#include "utils/managed_bytes.hh"
#include "utils/fragment_range.hh"
#include "types.hh"
#include "concrete_types.hh"
#include "types/listlike_partial_deserializing_iterator.hh"
@@ -70,7 +74,7 @@ using namespace std::chrono_literals;
logging::logger cdc_log("cdc");
namespace cdc {
static schema_ptr create_log_schema(const schema&, std::optional<utils::UUID> = {});
static schema_ptr create_log_schema(const schema&, std::optional<utils::UUID> = {}, schema_ptr = nullptr);
}
static constexpr auto cdc_group_name = "cdc";
@@ -217,10 +221,10 @@ public:
return;
}
auto new_log_schema = create_log_schema(new_schema, log_schema ? std::make_optional(log_schema->id()) : std::nullopt);
auto new_log_schema = create_log_schema(new_schema, log_schema ? std::make_optional(log_schema->id()) : std::nullopt, log_schema);
auto log_mut = log_schema
? db::schema_tables::make_update_table_mutations(keyspace.metadata(), log_schema, new_log_schema, timestamp, false)
? db::schema_tables::make_update_table_mutations(db, keyspace.metadata(), log_schema, new_log_schema, timestamp, false)
: db::schema_tables::make_create_table_mutations(keyspace.metadata(), new_log_schema, timestamp)
;
@@ -277,8 +281,8 @@ private:
}
};
cdc::cdc_service::cdc_service(service::storage_proxy& proxy)
: cdc_service(db_context::builder(proxy).build())
cdc::cdc_service::cdc_service(service::storage_proxy& proxy, cdc::metadata& cdc_metadata)
: cdc_service(db_context::builder(proxy, cdc_metadata).build())
{}
cdc::cdc_service::cdc_service(db_context ctxt)
@@ -486,7 +490,7 @@ bytes log_data_column_deleted_elements_name_bytes(const bytes& column_name) {
return to_bytes(cdc_deleted_elements_column_prefix) + column_name;
}
static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID> uuid) {
static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID> uuid, schema_ptr old) {
schema_builder b(s.ks_name(), log_name(s.cf_name()));
b.with_partitioner("com.scylladb.dht.CDCPartitioner");
b.set_compaction_strategy(sstables::compaction_strategy_type::time_window);
@@ -567,11 +571,25 @@ static schema_ptr create_log_schema(const schema& s, std::optional<utils::UUID>
b.set_uuid(*uuid);
}
/**
* #10473 - if we are redefining the log table, we need to ensure any dropped
* columns are registered in "dropped_columns" table, otherwise clients will not
* be able to read data older than now.
*/
if (old) {
// not super efficient, but we don't do this often.
for (auto& col : old->all_columns()) {
if (!b.has_column({col.name(), col.name_as_text() })) {
b.without_column(col.name_as_text(), col.type, api::new_timestamp());
}
}
}
return b.build();
}
db_context::builder::builder(service::storage_proxy& proxy)
: _proxy(proxy)
db_context::builder::builder(service::storage_proxy& proxy, cdc::metadata& cdc_metadata)
: _proxy(proxy), _cdc_metadata(cdc_metadata)
{}
db_context::builder& db_context::builder::with_migration_notifier(service::migration_notifier& migration_notifier) {
@@ -579,22 +597,11 @@ db_context::builder& db_context::builder::with_migration_notifier(service::migra
return *this;
}
db_context::builder& db_context::builder::with_token_metadata(const locator::token_metadata& token_metadata) {
_token_metadata = token_metadata;
return *this;
}
db_context::builder& db_context::builder::with_cdc_metadata(cdc::metadata& cdc_metadata) {
_cdc_metadata = cdc_metadata;
return *this;
}
db_context db_context::builder::build() {
return db_context{
_proxy,
_migration_notifier ? _migration_notifier->get() : service::get_local_storage_service().get_migration_notifier(),
_token_metadata ? _token_metadata->get() : service::get_local_storage_service().get_token_metadata(),
_cdc_metadata ? _cdc_metadata->get() : service::get_local_storage_service().get_cdc_metadata(),
_cdc_metadata,
};
}
@@ -672,6 +679,14 @@ void collection_iterator<bytes_view>::parse() {
_current = k;
}
template<>
void collection_iterator<managed_bytes_view>::parse() {
assert(_rem > 0);
_next = _v;
auto k = read_collection_value(_next, cql_serialization_format::internal());
_current = k;
}
template<typename Container, typename T>
class maybe_back_insert_iterator : public std::back_insert_iterator<Container> {
const abstract_type& _type;
@@ -715,16 +730,16 @@ private:
}
return false;
}
bool compare(const T&, const value_type& v);
int32_t compare(const T&, const value_type& v);
};
template<>
bool maybe_back_insert_iterator<std::vector<std::pair<bytes_view, bytes_view>>, bytes_view>::compare(const bytes_view& t, const value_type& v) {
int32_t maybe_back_insert_iterator<std::vector<std::pair<bytes_view, bytes_view>>, bytes_view>::compare(const bytes_view& t, const value_type& v) {
return _type.compare(t, v.first);
}
template<>
bool maybe_back_insert_iterator<std::vector<bytes_view>, bytes_view>::compare(const bytes_view& t, const value_type& v) {
int32_t maybe_back_insert_iterator<std::vector<bytes_view>, bytes_view>::compare(const bytes_view& t, const value_type& v) {
return _type.compare(t, v);
}
@@ -772,18 +787,18 @@ static bytes merge(const set_type_impl& ctype, const bytes_opt& prev, const byte
return set_type_impl::serialize_partially_deserialized_form(res, cql_serialization_format::internal());
}
static bytes merge(const user_type_impl& type, const bytes_opt& prev, const bytes_opt& next, const bytes_opt& deleted) {
std::vector<bytes_view_opt> res(type.size());
udt_for_each(prev, [&res, i = res.begin()](bytes_view_opt k) mutable {
std::vector<managed_bytes_view_opt> res(type.size());
udt_for_each(prev, [&res, i = res.begin()](managed_bytes_view_opt k) mutable {
*i++ = k;
});
udt_for_each(next, [&res, i = res.begin()](bytes_view_opt k) mutable {
udt_for_each(next, [&res, i = res.begin()](managed_bytes_view_opt k) mutable {
if (k) {
*i = k;
}
++i;
});
collection_iterator<bytes_view> e, d(deleted);
std::for_each(d, e, [&res](bytes_view k) {
collection_iterator<managed_bytes_view> e, d(deleted);
std::for_each(d, e, [&res](managed_bytes_view k) {
auto index = deserialize_field_index(k);
res[index] = std::nullopt;
});
@@ -822,13 +837,13 @@ static bytes_opt get_preimage_col_value(const column_definition& cdef, const cql
auto v = pirow->get_view(cdef.name_as_text());
auto f = cql_serialization_format::internal();
auto n = read_collection_size(v, f);
std::vector<bytes_view> tmp;
std::vector<bytes> tmp;
tmp.reserve(n);
while (n--) {
tmp.emplace_back(read_collection_value(v, f)); // key
tmp.emplace_back(read_collection_value(v, f).linearize()); // key
read_collection_value(v, f); // value. ignore.
}
return set_type_impl::serialize_partially_deserialized_form(tmp, f);
return set_type_impl::serialize_partially_deserialized_form({tmp.begin(), tmp.end()}, f);
},
[&] (const abstract_type& o) -> bytes {
return pirow->get_blob(cdef.name_as_text());
@@ -984,13 +999,13 @@ private:
};
static bytes get_bytes(const atomic_cell_view& acv) {
return acv.value().linearize();
return to_bytes(acv.value());
}
static bytes_view get_bytes_view(const atomic_cell_view& acv, std::vector<bytes>& buf) {
static bytes_view get_bytes_view(const atomic_cell_view& acv, std::forward_list<bytes>& buf) {
return acv.value().is_fragmented()
? bytes_view{buf.emplace_back(acv.value().linearize())}
: acv.value().first_fragment();
? bytes_view{buf.emplace_front(to_bytes(acv.value()))}
: acv.value().current_fragment();
}
static ttl_opt get_ttl(const atomic_cell_view& acv) {
@@ -1143,10 +1158,10 @@ struct process_row_visitor {
_touched_parts.set<stats::part_type::UDT>();
struct udt_visitor : public collection_visitor {
std::vector<bytes_opt> _added_cells;
std::vector<bytes>& _buf;
std::vector<bytes_view_opt> _added_cells;
std::forward_list<bytes>& _buf;
udt_visitor(ttl_opt& ttl_column, size_t num_keys, std::vector<bytes>& buf)
udt_visitor(ttl_opt& ttl_column, size_t num_keys, std::forward_list<bytes>& buf)
: collection_visitor(ttl_column), _added_cells(num_keys), _buf(buf) {}
void live_collection_cell(bytes_view key, const atomic_cell_view& cell) {
@@ -1155,7 +1170,7 @@ struct process_row_visitor {
}
};
std::vector<bytes> buf;
std::forward_list<bytes> buf;
udt_visitor v(_ttl_column, type.size(), buf);
visit_collection(v);
@@ -1174,9 +1189,9 @@ struct process_row_visitor {
struct map_or_list_visitor : public collection_visitor {
std::vector<std::pair<bytes_view, bytes_view>> _added_cells;
std::vector<bytes>& _buf;
std::forward_list<bytes>& _buf;
map_or_list_visitor(ttl_opt& ttl_column, std::vector<bytes>& buf)
map_or_list_visitor(ttl_opt& ttl_column, std::forward_list<bytes>& buf)
: collection_visitor(ttl_column), _buf(buf) {}
void live_collection_cell(bytes_view key, const atomic_cell_view& cell) {
@@ -1185,7 +1200,7 @@ struct process_row_visitor {
}
};
std::vector<bytes> buf;
std::forward_list<bytes> buf;
map_or_list_visitor v(_ttl_column, buf);
visit_collection(v);
@@ -1297,6 +1312,13 @@ struct process_change_visitor {
_clustering_row_states, _generate_delta_values);
visit_row_cells(v);
if (_enable_updating_state) {
// #7716: if there are no regular columns, our visitor would not have visited any cells,
// hence it would not have created a row_state for this row. In effect, postimage wouldn't be produced.
// Ensure that the row state exists.
_clustering_row_states.try_emplace(ckey);
}
_builder.set_operation(log_ck, v._cdc_op);
_builder.set_ttl(log_ck, v._ttl_column);
}
@@ -1648,13 +1670,7 @@ public:
try {
return _ctx._proxy.query(_schema, std::move(command), std::move(partition_ranges), select_cl, service::storage_proxy::coordinator_query_options(default_timeout(), empty_service_permit(), client_state)).then(
[s = _schema, partition_slice = std::move(partition_slice), selection = std::move(selection)] (service::storage_proxy::coordinator_query_result qr) -> lw_shared_ptr<cql3::untyped_result_set> {
cql3::selection::result_set_builder builder(*selection, gc_clock::now(), cql_serialization_format::latest());
query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *s, *selection));
auto result_set = builder.build();
if (!result_set || result_set->empty()) {
return {};
}
return make_lw_shared<cql3::untyped_result_set>(*result_set);
return make_lw_shared<cql3::untyped_result_set>(*s, std::move(qr.query_result), *selection, partition_slice);
});
} catch (exceptions::unavailable_exception& e) {
// `query` can throw `unavailable_exception`, which is seen by clients as ~ "NoHostAvailable".
@@ -1698,7 +1714,7 @@ public:
// as there will be no clustering row data to load into the state.
return;
}
ck_parts.emplace_back(*v);
ck_parts.emplace_back(v->linearize());
}
auto ck = clustering_key::from_exploded(std::move(ck_parts));

View File

@@ -80,7 +80,7 @@ class cdc_service final : public async_sharded_service<cdc::cdc_service> {
std::unique_ptr<impl> _impl;
public:
future<> stop();
cdc_service(service::storage_proxy&);
cdc_service(service::storage_proxy&, cdc::metadata&);
cdc_service(db_context);
~cdc_service();
@@ -100,20 +100,16 @@ public:
struct db_context final {
service::storage_proxy& _proxy;
service::migration_notifier& _migration_notifier;
const locator::token_metadata& _token_metadata;
cdc::metadata& _cdc_metadata;
class builder final {
service::storage_proxy& _proxy;
cdc::metadata& _cdc_metadata;
std::optional<std::reference_wrapper<service::migration_notifier>> _migration_notifier;
std::optional<std::reference_wrapper<const locator::token_metadata>> _token_metadata;
std::optional<std::reference_wrapper<cdc::metadata>> _cdc_metadata;
public:
builder(service::storage_proxy& proxy);
builder(service::storage_proxy& proxy, cdc::metadata&);
builder& with_migration_notifier(service::migration_notifier& migration_notifier);
builder& with_token_metadata(const locator::token_metadata& token_metadata);
builder& with_cdc_metadata(cdc::metadata&);
db_context build();
};

View File

@@ -51,7 +51,8 @@ static cdc::stream_id get_stream(
return entry.streams[shard_id];
}
static cdc::stream_id get_stream(
// non-static for testing
cdc::stream_id get_stream(
const std::vector<cdc::token_range_description>& entries,
dht::token tok) {
if (entries.empty()) {

View File

@@ -31,10 +31,7 @@ class checked_file_impl : public file_impl {
public:
checked_file_impl(const io_error_handler& error_handler, file f)
: _error_handler(error_handler), _file(f) {
_memory_dma_alignment = f.memory_dma_alignment();
_disk_read_dma_alignment = f.disk_read_dma_alignment();
_disk_write_dma_alignment = f.disk_write_dma_alignment();
: file_impl(*get_file_impl(f)), _error_handler(error_handler), _file(f) {
}
virtual future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) override {

View File

@@ -67,8 +67,8 @@ public:
int operator()(const clustering_key_prefix& p1, int32_t w1, const clustering_key_prefix& p2, int32_t w2) const {
auto type = _s.get().clustering_key_prefix_type();
auto res = prefix_equality_tri_compare(type->types().begin(),
type->begin(p1), type->end(p1),
type->begin(p2), type->end(p2),
type->begin(p1.representation()), type->end(p1.representation()),
type->begin(p2.representation()), type->end(p2.representation()),
::tri_compare);
if (res) {
return res;

View File

@@ -65,6 +65,11 @@ private:
_current_start = position_in_partition_view::for_range_start(_current_range.front());
_current_end = position_in_partition_view::for_range_end(_current_range.front());
}
} else {
// If the first range is contiguous with the static row, then advance _current_end as much as we can
if (_current_range && !_current_range.front().start()) {
_current_end = position_in_partition_view::for_range_end(_current_range.front());
}
}
}

View File

@@ -22,7 +22,6 @@
#include "types/collection.hh"
#include "types/user.hh"
#include "concrete_types.hh"
#include "atomic_cell_or_collection.hh"
#include "mutation_partition.hh"
#include "compaction_garbage_collector.hh"
#include "combine.hh"
@@ -30,40 +29,28 @@
#include "collection_mutation.hh"
collection_mutation::collection_mutation(const abstract_type& type, collection_mutation_view v)
: _data(imr_object_type::make(data::cell::make_collection(v.data), &type.imr_state().lsa_migrator())) {}
: _data(v.data) {}
collection_mutation::collection_mutation(const abstract_type& type, const bytes_ostream& data)
: _data(imr_object_type::make(data::cell::make_collection(fragment_range_view(data)), &type.imr_state().lsa_migrator())) {}
static collection_mutation_view get_collection_mutation_view(const uint8_t* ptr)
{
auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
auto ti = data::type_info::make_collection();
data::cell::context ctx(f, ti);
auto view = data::cell::structure::get_member<data::cell::tags::cell>(ptr).as<data::cell::tags::collection>(ctx);
auto dv = data::cell::variable_value::make_view(view, f.get<data::cell::tags::external_data>());
return collection_mutation_view { dv };
}
collection_mutation::collection_mutation(const abstract_type& type, managed_bytes data)
: _data(std::move(data)) {}
collection_mutation::operator collection_mutation_view() const
{
return get_collection_mutation_view(_data.get());
return collection_mutation_view{managed_bytes_view(_data)};
}
collection_mutation_view atomic_cell_or_collection::as_collection_mutation() const {
return get_collection_mutation_view(_data.get());
return collection_mutation_view{managed_bytes_view(_data)};
}
bool collection_mutation_view::is_empty() const {
auto in = collection_mutation_input_stream(data);
auto in = collection_mutation_input_stream(fragment_range(data));
auto has_tomb = in.read_trivial<bool>();
return !has_tomb && in.read_trivial<uint32_t>() == 0;
}
template <typename F>
requires std::is_invocable_r_v<const data::type_info&, F, collection_mutation_input_stream&>
static bool is_any_live(const atomic_cell_value_view& data, tombstone tomb, gc_clock::time_point now, F&& read_cell_type_info) {
auto in = collection_mutation_input_stream(data);
bool collection_mutation_view::is_any_live(const abstract_type& type, tombstone tomb, gc_clock::time_point now) const {
auto in = collection_mutation_input_stream(fragment_range(data));
auto has_tomb = in.read_trivial<bool>();
if (has_tomb) {
auto ts = in.read_trivial<api::timestamp_type>();
@@ -73,9 +60,10 @@ static bool is_any_live(const atomic_cell_value_view& data, tombstone tomb, gc_c
auto nr = in.read_trivial<uint32_t>();
for (uint32_t i = 0; i != nr; ++i) {
auto& type_info = read_cell_type_info(in);
auto key_size = in.read_trivial<uint32_t>();
in.skip(key_size);
auto vsize = in.read_trivial<uint32_t>();
auto value = atomic_cell_view::from_bytes(type_info, in.read(vsize));
auto value = atomic_cell_view::from_bytes(type, in.read(vsize));
if (value.is_live(tomb, now, false)) {
return true;
}
@@ -84,33 +72,8 @@ static bool is_any_live(const atomic_cell_value_view& data, tombstone tomb, gc_c
return false;
}
bool collection_mutation_view::is_any_live(const abstract_type& type, tombstone tomb, gc_clock::time_point now) const {
return visit(type, make_visitor(
[&] (const collection_type_impl& ctype) {
auto& type_info = ctype.value_comparator()->imr_state().type_info();
return ::is_any_live(data, tomb, now, [&type_info] (collection_mutation_input_stream& in) -> const data::type_info& {
auto key_size = in.read_trivial<uint32_t>();
in.skip(key_size);
return type_info;
});
},
[&] (const user_type_impl& utype) {
return ::is_any_live(data, tomb, now, [&utype] (collection_mutation_input_stream& in) -> const data::type_info& {
auto key_size = in.read_trivial<uint32_t>();
auto key = in.read(key_size);
return utype.type(deserialize_field_index(key))->imr_state().type_info();
});
},
[&] (const abstract_type& o) -> bool {
throw std::runtime_error(format("collection_mutation_view::is_any_live: unknown type {}", o.name()));
}
));
}
template <typename F>
requires std::is_invocable_r_v<const data::type_info&, F, collection_mutation_input_stream&>
static api::timestamp_type last_update(const atomic_cell_value_view& data, F&& read_cell_type_info) {
auto in = collection_mutation_input_stream(data);
api::timestamp_type collection_mutation_view::last_update(const abstract_type& type) const {
auto in = collection_mutation_input_stream(fragment_range(data));
api::timestamp_type max = api::missing_timestamp;
auto has_tomb = in.read_trivial<bool>();
if (has_tomb) {
@@ -120,39 +83,16 @@ static api::timestamp_type last_update(const atomic_cell_value_view& data, F&& r
auto nr = in.read_trivial<uint32_t>();
for (uint32_t i = 0; i != nr; ++i) {
auto& type_info = read_cell_type_info(in);
const auto key_size = in.read_trivial<uint32_t>();
in.skip(key_size);
auto vsize = in.read_trivial<uint32_t>();
auto value = atomic_cell_view::from_bytes(type_info, in.read(vsize));
auto value = atomic_cell_view::from_bytes(type, in.read(vsize));
max = std::max(value.timestamp(), max);
}
return max;
}
api::timestamp_type collection_mutation_view::last_update(const abstract_type& type) const {
return visit(type, make_visitor(
[&] (const collection_type_impl& ctype) {
auto& type_info = ctype.value_comparator()->imr_state().type_info();
return ::last_update(data, [&type_info] (collection_mutation_input_stream& in) -> const data::type_info& {
auto key_size = in.read_trivial<uint32_t>();
in.skip(key_size);
return type_info;
});
},
[&] (const user_type_impl& utype) {
return ::last_update(data, [&utype] (collection_mutation_input_stream& in) -> const data::type_info& {
auto key_size = in.read_trivial<uint32_t>();
auto key = in.read(key_size);
return utype.type(deserialize_field_index(key))->imr_state().type_info();
});
},
[&] (const abstract_type& o) -> api::timestamp_type {
throw std::runtime_error(format("collection_mutation_view::last_update: unknown type {}", o.name()));
}
));
}
std::ostream& operator<<(std::ostream& os, const collection_mutation_view::printer& cmvp) {
fmt::print(os, "{{collection_mutation_view ");
cmvp._cmv.with_deserialized(cmvp._type, [&os, &type = cmvp._type] (const collection_mutation_view_description& cmvd) {
@@ -278,28 +218,31 @@ static collection_mutation serialize_collection_mutation(
auto size = accumulate(cells, (size_t)4, element_size);
size += 1;
if (tomb) {
size += sizeof(tomb.timestamp) + sizeof(tomb.deletion_time);
size += sizeof(int64_t) + sizeof(int64_t);
}
bytes_ostream ret;
ret.reserve(size);
auto out = ret.write_begin();
*out++ = bool(tomb);
managed_bytes ret(managed_bytes::initialized_later(), size);
managed_bytes_mutable_view out(ret);
write<uint8_t>(out, uint8_t(bool(tomb)));
if (tomb) {
write(out, tomb.timestamp);
write(out, tomb.deletion_time.time_since_epoch().count());
write<int64_t>(out, tomb.timestamp);
write<int64_t>(out, tomb.deletion_time.time_since_epoch().count());
}
auto writeb = [&out] (bytes_view v) {
serialize_int32(out, v.size());
out = std::copy_n(v.begin(), v.size(), out);
auto writek = [&out] (bytes_view v) {
write<int32_t>(out, v.size());
write_fragmented(out, single_fragmented_view(v));
};
auto writev = [&out] (managed_bytes_view v) {
write<int32_t>(out, v.size());
write_fragmented(out, v);
};
// FIXME: overflow?
serialize_int32(out, boost::distance(cells));
write<int32_t>(out, boost::distance(cells));
for (auto&& kv : cells) {
auto&& k = kv.first;
auto&& v = kv.second;
writeb(k);
writek(k);
writeb(v.serialize());
writev(v.serialize());
}
return collection_mutation(type, ret);
}
@@ -448,13 +391,12 @@ deserialize_collection_mutation(const abstract_type& type, collection_mutation_i
return visit(type, make_visitor(
[&] (const collection_type_impl& ctype) {
// value_comparator(), ugh
auto& type_info = ctype.value_comparator()->imr_state().type_info();
return deserialize_collection_mutation(in, [&type_info] (collection_mutation_input_stream& in) {
return deserialize_collection_mutation(in, [&ctype] (collection_mutation_input_stream& in) {
// FIXME: we could probably avoid the need for size
auto ksize = in.read_trivial<uint32_t>();
auto key = in.read(ksize);
auto vsize = in.read_trivial<uint32_t>();
auto value = atomic_cell_view::from_bytes(type_info, in.read(vsize));
auto value = atomic_cell_view::from_bytes(*ctype.value_comparator(), in.read(vsize));
return std::make_pair(key, value);
});
},
@@ -464,8 +406,7 @@ deserialize_collection_mutation(const abstract_type& type, collection_mutation_i
auto ksize = in.read_trivial<uint32_t>();
auto key = in.read(ksize);
auto vsize = in.read_trivial<uint32_t>();
auto value = atomic_cell_view::from_bytes(
utype.type(deserialize_field_index(key))->imr_state().type_info(), in.read(vsize));
auto value = atomic_cell_view::from_bytes(*utype.type(deserialize_field_index(key)), in.read(vsize));
return std::make_pair(key, value);
});
},

View File

@@ -31,7 +31,6 @@
#include <iosfwd>
class abstract_type;
class bytes_ostream;
class compaction_garbage_collector;
class row_tombstone;
@@ -70,7 +69,7 @@ struct collection_mutation_view_description {
collection_mutation serialize(const abstract_type&) const;
};
using collection_mutation_input_stream = utils::linearizing_input_stream<atomic_cell_value_view, marshal_exception>;
using collection_mutation_input_stream = utils::linearizing_input_stream<fragment_range<managed_bytes_view>, marshal_exception>;
// Given a linearized collection_mutation_view, returns an auxiliary struct allowing the inspection of each cell.
// The struct is an observer of the data given by the collection_mutation_view and is only valid while the
@@ -80,7 +79,7 @@ collection_mutation_view_description deserialize_collection_mutation(const abstr
class collection_mutation_view {
public:
atomic_cell_value_view data;
managed_bytes_view data;
// Is this a noop mutation?
bool is_empty() const;
@@ -97,7 +96,7 @@ public:
// calls it on the corresponding description of `this`.
template <typename F>
inline decltype(auto) with_deserialized(const abstract_type& type, F f) const {
auto stream = collection_mutation_input_stream(data);
auto stream = collection_mutation_input_stream(fragment_range(data));
return f(deserialize_collection_mutation(type, stream));
}
@@ -122,12 +121,11 @@ public:
// The mutation may also contain a collection-wide tombstone.
class collection_mutation {
public:
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
managed_bytes _data;
collection_mutation() {}
collection_mutation(const abstract_type&, collection_mutation_view);
collection_mutation(const abstract_type& type, const bytes_ostream& data);
collection_mutation(const abstract_type&, managed_bytes);
operator collection_mutation_view() const;
};
@@ -136,4 +134,4 @@ collection_mutation merge(const abstract_type&, collection_mutation_view, collec
collection_mutation difference(const abstract_type&, collection_mutation_view, collection_mutation_view);
// Serializes the given collection of cells to a sequence of bytes ready to be sent over the CQL protocol.
bytes serialize_for_cql(const abstract_type&, collection_mutation_view, cql_serialization_format);
bytes_ostream serialize_for_cql(const abstract_type&, collection_mutation_view, cql_serialization_format);

View File

@@ -43,7 +43,7 @@ public:
const ::schema& schema() const {
return *_schema;
}
friend int tri_compare(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
friend std::strong_ordering tri_compare(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
return dht::ring_position_tri_compare(*x._schema, *x._rpv, *y._rpv);
}
friend bool operator<(const compatible_ring_position_view& x, const compatible_ring_position_view& y) {
@@ -83,7 +83,7 @@ public:
const ::schema& schema() const {
return *_schema;
}
friend int tri_compare(const compatible_ring_position& x, const compatible_ring_position& y) {
friend std::strong_ordering tri_compare(const compatible_ring_position& x, const compatible_ring_position& y) {
return dht::ring_position_tri_compare(*x._schema, *x._rp, *y._rp);
}
friend bool operator<(const compatible_ring_position& x, const compatible_ring_position& y) {
@@ -133,7 +133,7 @@ public:
};
return std::visit(rpv_accessor{}, *_crp_or_view);
}
friend int tri_compare(const compatible_ring_position_or_view& x, const compatible_ring_position_or_view& y) {
friend std::strong_ordering tri_compare(const compatible_ring_position_or_view& x, const compatible_ring_position_or_view& y) {
struct schema_accessor {
const ::schema& operator()(const compatible_ring_position& crp) {
return crp.schema();

View File

@@ -73,12 +73,19 @@ private:
* <len(value1)><value1><len(value2)><value2>...<len(value_n)><value_n>
*
*/
template<typename RangeOfSerializedComponents, typename CharOutputIterator>
static void serialize_value(RangeOfSerializedComponents&& values, CharOutputIterator& out) {
template<typename RangeOfSerializedComponents, FragmentedMutableView Out>
static void serialize_value(RangeOfSerializedComponents&& values, Out out) {
for (auto&& val : values) {
assert(val.size() <= std::numeric_limits<size_type>::max());
write<size_type>(out, size_type(val.size()));
out = std::copy(val.begin(), val.end(), out);
using val_type = std::remove_cvref_t<decltype(val)>;
if constexpr (FragmentedView<val_type>) {
write_fragmented(out, val);
} else if constexpr (std::same_as<val_type, managed_bytes>) {
write_fragmented(out, managed_bytes_view(val));
} else {
write_fragmented(out, single_fragmented_view(val));
}
}
}
template <typename RangeOfSerializedComponents>
@@ -90,25 +97,27 @@ private:
return len;
}
public:
bytes serialize_single(bytes&& v) const {
managed_bytes serialize_single(managed_bytes&& v) const {
return serialize_value({std::move(v)});
}
managed_bytes serialize_single(bytes&& v) const {
return serialize_value({std::move(v)});
}
template<typename RangeOfSerializedComponents>
static bytes serialize_value(RangeOfSerializedComponents&& values) {
static managed_bytes serialize_value(RangeOfSerializedComponents&& values) {
auto size = serialized_size(values);
if (size > std::numeric_limits<size_type>::max()) {
throw std::runtime_error(format("Key size too large: {:d} > {:d}", size, std::numeric_limits<size_type>::max()));
}
bytes b(bytes::initialized_later(), size);
auto i = b.begin();
serialize_value(values, i);
managed_bytes b(managed_bytes::initialized_later(), size);
serialize_value(values, managed_bytes_mutable_view(b));
return b;
}
template<typename T>
static bytes serialize_value(std::initializer_list<T> values) {
static managed_bytes serialize_value(std::initializer_list<T> values) {
return serialize_value(boost::make_iterator_range(values.begin(), values.end()));
}
bytes serialize_optionals(const std::vector<bytes_opt>& values) const {
managed_bytes serialize_optionals(const std::vector<bytes_opt>& values) const {
return serialize_value(values | boost::adaptors::transformed([] (const bytes_opt& bo) -> bytes_view {
if (!bo) {
throw std::logic_error("attempted to create key component from empty optional");
@@ -116,7 +125,7 @@ public:
return *bo;
}));
}
bytes serialize_value_deep(const std::vector<data_value>& values) const {
managed_bytes serialize_value_deep(const std::vector<data_value>& values) const {
// TODO: Optimize
std::vector<bytes> partial;
partial.reserve(values.size());
@@ -127,25 +136,26 @@ public:
}
return serialize_value(partial);
}
bytes decompose_value(const value_type& values) const {
managed_bytes decompose_value(const value_type& values) const {
return serialize_value(values);
}
class iterator {
public:
using iterator_category = std::input_iterator_tag;
using value_type = const bytes_view;
using value_type = const managed_bytes_view;
using difference_type = std::ptrdiff_t;
using pointer = const bytes_view*;
using reference = const bytes_view&;
using pointer = const value_type*;
using reference = const value_type&;
private:
bytes_view _v;
bytes_view _current;
managed_bytes_view _v;
managed_bytes_view _current;
size_t _remaining = 0;
private:
void read_current() {
_remaining = _v.size_bytes();
size_type len;
{
if (_v.empty()) {
_v = bytes_view(nullptr, 0);
return;
}
len = read_simple<size_type>(_v);
@@ -153,15 +163,16 @@ public:
throw_with_backtrace<marshal_exception>(format("compound_type iterator - not enough bytes, expected {:d}, got {:d}", len, _v.size()));
}
}
_current = bytes_view(_v.begin(), len);
_v.remove_prefix(len);
_current = _v.prefix(len);
_v.remove_prefix(_current.size_bytes());
}
public:
struct end_iterator_tag {};
iterator(const bytes_view& v) : _v(v) {
iterator(const managed_bytes_view& v) : _v(v) {
read_current();
}
iterator(end_iterator_tag, const bytes_view& v) : _v(nullptr, 0) {}
iterator(end_iterator_tag, const managed_bytes_view& v) : _v() {}
iterator() {}
iterator& operator++() {
read_current();
return *this;
@@ -173,29 +184,40 @@ public:
}
const value_type& operator*() const { return _current; }
const value_type* operator->() const { return &_current; }
bool operator!=(const iterator& i) const { return _v.begin() != i._v.begin(); }
bool operator==(const iterator& i) const { return _v.begin() == i._v.begin(); }
bool operator==(const iterator& i) const { return _remaining == i._remaining; }
};
static iterator begin(const bytes_view& v) {
static iterator begin(managed_bytes_view v) {
return iterator(v);
}
static iterator end(const bytes_view& v) {
static iterator end(managed_bytes_view v) {
return iterator(typename iterator::end_iterator_tag(), v);
}
static boost::iterator_range<iterator> components(const bytes_view& v) {
static boost::iterator_range<iterator> components(managed_bytes_view v) {
return { begin(v), end(v) };
}
value_type deserialize_value(bytes_view v) const {
value_type deserialize_value(managed_bytes_view v) const {
std::vector<bytes> result;
result.reserve(_types.size());
std::transform(begin(v), end(v), std::back_inserter(result), [] (auto&& v) {
return bytes(v.begin(), v.end());
return to_bytes(v);
});
return result;
}
bool less(managed_bytes_view b1, managed_bytes_view b2) const {
return with_linearized(b1, [&] (bytes_view bv1) {
return with_linearized(b2, [&] (bytes_view bv2) {
return less(bv1, bv2);
});
});
}
bool less(bytes_view b1, bytes_view b2) const {
return compare(b1, b2) < 0;
}
size_t hash(managed_bytes_view v) const{
return with_linearized(v, [&] (bytes_view v) {
return hash(v);
});
}
size_t hash(bytes_view v) const {
if (_byte_order_equal) {
return std::hash<bytes_view>()(v);
@@ -208,6 +230,13 @@ public:
}
return h;
}
int compare(managed_bytes_view b1, managed_bytes_view b2) const {
return with_linearized(b1, [&] (bytes_view bv1) {
return with_linearized(b2, [&] (bytes_view bv2) {
return compare(bv1, bv2);
});
});
}
int compare(bytes_view b1, bytes_view b2) const {
if (_byte_order_comparable) {
if (_is_reversed) {
@@ -222,15 +251,21 @@ public:
});
}
// Retruns true iff given prefix has no missing components
bool is_full(bytes_view v) const {
bool is_full(managed_bytes_view v) const {
assert(AllowPrefixes == allow_prefixes::yes);
return std::distance(begin(v), end(v)) == (ssize_t)_types.size();
}
bool is_empty(managed_bytes_view v) const {
return v.empty();
}
bool is_empty(const managed_bytes& v) const {
return v.empty();
}
bool is_empty(bytes_view v) const {
return begin(v) == end(v);
}
void validate(bytes_view v) const {
std::vector<bytes_view> values(begin(v), end(v));
void validate(managed_bytes_view v) const {
std::vector<managed_bytes_view> values(begin(v), end(v));
if (AllowPrefixes == allow_prefixes::no && values.size() < _types.size()) {
throw marshal_exception(fmt::format("compound::validate(): non-prefixable compound cannot be a prefix"));
}
@@ -243,6 +278,13 @@ public:
_types[i]->validate(values[i], cql_serialization_format::internal());
}
}
bool equal(managed_bytes_view v1, managed_bytes_view v2) const {
return with_linearized(v1, [&] (bytes_view bv1) {
return with_linearized(v2, [&] (bytes_view bv2) {
return equal(bv1, bv2);
});
});
}
bool equal(bytes_view v1, bytes_view v2) const {
if (_byte_order_equal) {
return compare_unsigned(v1, v2) == 0;

View File

@@ -54,9 +54,9 @@ template <typename CompoundType>
class legacy_compound_view {
static_assert(!CompoundType::is_prefixable, "Legacy view not defined for prefixes");
CompoundType& _type;
bytes_view _packed;
managed_bytes_view _packed;
public:
legacy_compound_view(CompoundType& c, bytes_view packed)
legacy_compound_view(CompoundType& c, managed_bytes_view packed)
: _type(c)
, _packed(packed)
{ }
@@ -147,18 +147,18 @@ public:
{ }
// @k1 and @k2 must be serialized using @type, which was passed to the constructor.
int operator()(bytes_view k1, bytes_view k2) const {
int operator()(managed_bytes_view k1, managed_bytes_view k2) const {
if (_type.is_singular()) {
return compare_unsigned(*_type.begin(k1), *_type.begin(k2));
}
return lexicographical_tri_compare(
_type.begin(k1), _type.end(k1),
_type.begin(k2), _type.end(k2),
[] (const bytes_view& c1, const bytes_view& c2) -> int {
[] (const managed_bytes_view& c1, const managed_bytes_view& c2) -> int {
if (c1.size() != c2.size() || !c1.size()) {
return c1.size() < c2.size() ? -1 : c1.size() ? 1 : 0;
}
return memcmp(c1.begin(), c2.begin(), c1.size());
return compare_unsigned(c1, c2);
});
}
};
@@ -188,7 +188,7 @@ public:
// @packed is assumed to be serialized using supplied @type.
template <typename CompoundType>
static inline
bytes to_legacy(CompoundType& type, bytes_view packed) {
bytes to_legacy(CompoundType& type, managed_bytes_view packed) {
legacy_compound_view<CompoundType> lv(type, packed);
bytes legacy_form(bytes::initialized_later(), lv.size());
std::copy(lv.begin(), lv.end(), legacy_form.begin());
@@ -264,6 +264,12 @@ private:
static void write_value(Value&& val, CharOutputIterator& out) {
out = std::copy(val.begin(), val.end(), out);
}
template<typename CharOutputIterator>
static void write_value(managed_bytes_view val, CharOutputIterator& out) {
for (bytes_view frag : fragment_range(val)) {
out = std::copy(frag.begin(), frag.end(), out);
}
}
template <typename CharOutputIterator>
static void write_value(const data_value& val, CharOutputIterator& out) {
val.serialize(out);
@@ -405,6 +411,7 @@ public:
iterator(end_iterator_tag) : _v(nullptr, 0) {}
public:
iterator() : iterator(end_iterator_tag()) {}
iterator& operator++() {
read_current();
return *this;

View File

@@ -99,8 +99,8 @@ listen_address: localhost
# listen_on_broadcast_address: false
# port for the CQL native transport to listen for clients on
# For security reasons, you should not expose this port to the internet. Firewall it if needed.
# To disable the CQL native transport, set this option to 0.
# For security reasons, you should not expose this port to the internet. Firewall it if needed.
# To disable the CQL native transport, remove this option and configure native_transport_port_ssl.
native_transport_port: 9042
# Like native_transport_port, but clients are forwarded to specific shards, based on the

View File

@@ -59,6 +59,9 @@ i18n_xlat = {
}
python3_dependencies = subprocess.run('./install-dependencies.sh --print-python3-runtime-packages', shell=True, capture_output=True, encoding='utf-8').stdout.strip()
node_exporter_filename = subprocess.run('./install-dependencies.sh --print-node-exporter-filename', shell=True, capture_output=True, encoding='utf-8').stdout.strip()
node_exporter_dirname = os.path.basename(node_exporter_filename).rstrip('.tar.gz')
def pkgname(name):
if name in i18n_xlat:
@@ -123,18 +126,21 @@ def ensure_tmp_dir_exists():
os.makedirs(tempfile.tempdir)
def try_compile_and_link(compiler, source='', flags=[]):
def try_compile_and_link(compiler, source='', flags=[], verbose=False):
ensure_tmp_dir_exists()
with tempfile.NamedTemporaryFile() as sfile:
ofile = tempfile.mktemp()
try:
sfile.file.write(bytes(source, 'utf-8'))
sfile.file.flush()
# We can't write to /dev/null, since in some cases (-ftest-coverage) gcc will create an auxiliary
# output file based on the name of the output file, and "/dev/null.gcsa" is not a good name
return subprocess.call([compiler, '-x', 'c++', '-o', ofile, sfile.name] + args.user_cflags.split() + flags,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL) == 0
ret = subprocess.run([compiler, '-x', 'c++', '-o', ofile, sfile.name] + args.user_cflags.split() + flags,
capture_output=True)
if verbose:
print(f"Compilation failed: {compiler} -x c++ -o {ofile} {sfile.name} {args.user_cflags} {flags}")
print(source)
print(ret.stdout.decode('utf-8'))
print(ret.stderr.decode('utf-8'))
return ret.returncode == 0
finally:
if os.path.exists(ofile):
os.unlink(ofile)
@@ -161,7 +167,21 @@ def linker_flags(compiler):
link_flags.append(threads_flag)
return ' '.join(link_flags)
else:
print('Note: neither lld nor gold found; using default system linker')
linker = ''
try:
subprocess.call(["gold", "-v"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
linker = 'gold'
except:
pass
try:
subprocess.call(["lld", "-v"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
linker = 'lld'
except:
pass
if linker:
print(f'Linker {linker} found, but the compilation attempt failed, defaulting to default system linker')
else:
print('Note: neither lld nor gold found; using default system linker')
return ''
@@ -255,29 +275,34 @@ modes = {
'cxxflags': '-DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxx_ld_flags': '',
'stack-usage-threshold': 1024*40,
'optimization-level': 'g',
},
'release': {
'cxxflags': '-O3 -ffunction-sections -fdata-sections ',
'cxxflags': '-ffunction-sections -fdata-sections ',
'cxx_ld_flags': '-Wl,--gc-sections',
'stack-usage-threshold': 1024*13,
'optimization-level': '3',
},
'dev': {
'cxxflags': '-O1 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxxflags': '-DDEVEL -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxx_ld_flags': '',
'stack-usage-threshold': 1024*21,
'optimization-level': '2',
},
'sanitize': {
'cxxflags': '-Os -DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxxflags': '-DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
'cxx_ld_flags': '',
'stack-usage-threshold': 1024*50,
'optimization-level': 's',
}
}
scylla_tests = set([
'test/boost/UUID_test',
'test/boost/cdc_generation_test',
'test/boost/aggregate_fcts_test',
'test/boost/allocation_strategy_test',
'test/boost/alternator_base64_test',
'test/boost/alternator_unit_test',
'test/boost/anchorless_list_test',
'test/boost/auth_passwords_test',
'test/boost/auth_resource_test',
@@ -329,8 +354,8 @@ scylla_tests = set([
'test/boost/gossip_test',
'test/boost/gossiping_property_file_snitch_test',
'test/boost/hash_test',
'test/boost/hashers_test',
'test/boost/idl_test',
'test/boost/imr_test',
'test/boost/input_stream_test',
'test/boost/json_cql_query_test',
'test/boost/json_test',
@@ -344,10 +369,10 @@ scylla_tests = set([
'test/boost/estimated_histogram_test',
'test/boost/logalloc_test',
'test/boost/managed_vector_test',
'test/boost/managed_bytes_test',
'test/boost/intrusive_array_test',
'test/boost/map_difference_test',
'test/boost/memtable_test',
'test/boost/meta_test',
'test/boost/multishard_mutation_query_test',
'test/boost/murmur_hash_test',
'test/boost/mutation_fragment_test',
@@ -372,6 +397,7 @@ scylla_tests = set([
'test/boost/schema_change_test',
'test/boost/schema_registry_test',
'test/boost/secondary_index_test',
'test/boost/tracing',
'test/boost/index_with_paging_test',
'test/boost/serialization_test',
'test/boost/serialized_action_test',
@@ -386,6 +412,7 @@ scylla_tests = set([
'test/boost/sstable_directory_test',
'test/boost/sstable_test',
'test/boost/sstable_move_test',
'test/boost/statement_restrictions_test',
'test/boost/storage_proxy_test',
'test/boost/top_k_test',
'test/boost/transport_test',
@@ -401,9 +428,13 @@ scylla_tests = set([
'test/boost/vint_serialization_test',
'test/boost/virtual_reader_test',
'test/boost/bptree_test',
'test/boost/btree_test',
'test/boost/radix_tree_test',
'test/boost/double_decker_test',
'test/boost/stall_free_test',
'test/boost/imr_test',
'test/boost/raft_address_map_test',
'test/boost/raft_sys_table_storage_test',
'test/boost/sstable_set_test',
'test/manual/ec2_snitch_test',
'test/manual/enormous_table_scan_test',
'test/manual/gce_snitch_test',
@@ -422,6 +453,7 @@ scylla_tests = set([
'test/perf/perf_mutation',
'test/perf/perf_collection',
'test/perf/perf_row_cache_update',
'test/perf/perf_row_cache_reads',
'test/perf/perf_simple_query',
'test/perf/perf_sstable',
'test/unit/lsa_async_eviction_test',
@@ -429,7 +461,11 @@ scylla_tests = set([
'test/unit/row_cache_alloc_stress_test',
'test/unit/row_cache_stress_test',
'test/unit/bptree_stress_test',
'test/unit/btree_stress_test',
'test/unit/bptree_compaction_test',
'test/unit/btree_compaction_test',
'test/unit/radix_tree_stress_test',
'test/unit/radix_tree_compaction_test',
])
perf_tests = set([
@@ -443,13 +479,15 @@ perf_tests = set([
raft_tests = set([
'test/raft/replication_test',
'test/boost/raft_fsm_test',
'test/raft/fsm_test',
'test/raft/etcd_test',
])
apps = set([
'scylla',
'test/tools/cql_repl',
'tools/scylla-types',
'tools/scylla-sstable-index',
])
tests = scylla_tests | perf_tests | raft_tests
@@ -489,6 +527,8 @@ arg_parser.add_argument('--dpdk-target', action='store', dest='dpdk_target', def
help='Path to DPDK SDK target location (e.g. <DPDK SDK dir>/x86_64-native-linuxapp-gcc)')
arg_parser.add_argument('--debuginfo', action='store', dest='debuginfo', type=int, default=1,
help='Enable(1)/disable(0)compiler debug information generation')
arg_parser.add_argument('--optimization-level', action='append', dest='mode_o_levels', metavar='MODE=LEVEL', default=[],
help=f'Override default compiler optimization level for mode (defaults: {" ".join([x+"="+modes[x]["optimization-level"] for x in modes])})')
arg_parser.add_argument('--static-stdc++', dest='staticcxx', action='store_true',
help='Link libgcc and libstdc++ statically')
arg_parser.add_argument('--static-thrift', dest='staticthrift', action='store_true',
@@ -511,25 +551,34 @@ arg_parser.add_argument('--with-antlr3', dest='antlr3_exec', action='store', def
help='path to antlr3 executable')
arg_parser.add_argument('--with-ragel', dest='ragel_exec', action='store', default='ragel',
help='path to ragel executable')
arg_parser.add_argument('--build-raft', dest='build_raft', action='store_true', default=False,
help='build raft code')
add_tristate(arg_parser, name='stack-guards', dest='stack_guards', help='Use stack guards')
arg_parser.add_argument('--verbose', dest='verbose', action='store_true',
help='Make configure.py output more verbose (useful for debugging the build process itself)')
arg_parser.add_argument('--test-repeat', dest='test_repeat', action='store', type=str, default='1',
help='Set number of times to repeat each unittest.')
arg_parser.add_argument('--test-timeout', dest='test_timeout', action='store', type=str, default='7200')
arg_parser.add_argument('--clang-inline-threshold', action='store', type=int, dest='clang_inline_threshold', default=-1,
help="LLVM-specific inline threshold compilation parameter")
args = arg_parser.parse_args()
if not args.build_raft:
all_artifacts.difference_update(raft_tests)
tests.difference_update(raft_tests)
defines = ['XXH_PRIVATE_API',
'SEASTAR_TESTING_MAIN',
]
extra_cxxflags = {}
extra_cxxflags = {
'debug': {},
'dev': {},
'release': {},
'sanitize': {}
}
scylla_raft_core = [
'raft/raft.cc',
'raft/server.cc',
'raft/fsm.cc',
'raft/tracker.cc',
'raft/log.cc',
]
scylla_core = (['database.cc',
'absl-flat_hash_map.cc',
@@ -572,14 +621,16 @@ scylla_core = (['database.cc',
'counters.cc',
'compress.cc',
'zstd.cc',
'sstables/mp_row_consumer.cc',
'sstables/sstables.cc',
'sstables/sstables_manager.cc',
'sstables/sstable_set.cc',
'sstables/mx/reader.cc',
'sstables/mx/writer.cc',
'sstables/kl/reader.cc',
'sstables/kl/writer.cc',
'sstables/sstable_version.cc',
'sstables/compress.cc',
'sstables/partition.cc',
'sstables/sstable_mutation_reader.cc',
'sstables/compaction.cc',
'sstables/compaction_strategy.cc',
'sstables/size_tiered_compaction_strategy.cc',
@@ -836,7 +887,6 @@ scylla_core = (['database.cc',
'vint-serialization.cc',
'utils/arch/powerpc/crc32-vpmsum/crc32_wrapper.cc',
'querier.cc',
'data/cell.cc',
'mutation_writer/multishard_writer.cc',
'multishard_mutation_query.cc',
'reader_concurrency_semaphore.cc',
@@ -847,8 +897,16 @@ scylla_core = (['database.cc',
'utils/error_injection.cc',
'mutation_writer/timestamp_based_splitting_writer.cc',
'mutation_writer/shard_based_splitting_writer.cc',
'mutation_writer/feed_writers.cc',
'lua.cc',
] + [Antlr3Grammar('cql3/Cql.g')] + [Thrift('interface/cassandra.thrift', 'Cassandra')]
'service/raft/schema_raft_state_machine.cc',
'service/raft/raft_sys_table_storage.cc',
'serializer.cc',
'service/raft/raft_rpc.cc',
'service/raft/raft_gossip_failure_detector.cc',
'service/raft/raft_services.cc',
] + [Antlr3Grammar('cql3/Cql.g')] + [Thrift('interface/cassandra.thrift', 'Cassandra')] \
+ scylla_raft_core
)
api = ['api/api.cc',
@@ -944,6 +1002,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/view.idl.hh',
'idl/messaging_service.idl.hh',
'idl/paxos.idl.hh',
'idl/raft.idl.hh',
]
headers = find_headers('.', excluded_dirs=['idl', 'build', 'seastar', '.git'])
@@ -968,20 +1027,14 @@ scylla_tests_dependencies = scylla_core + idls + scylla_tests_generic_dependenci
'test/lib/random_schema.cc',
]
scylla_raft_dependencies = [
'raft/raft.cc',
'raft/server.cc',
'raft/fsm.cc',
'raft/progress.cc',
'raft/log.cc',
'utils/uuid.cc'
]
scylla_raft_dependencies = scylla_raft_core + ['utils/uuid.cc']
deps = {
'scylla': idls + ['main.cc', 'release.cc', 'utils/build_id.cc'] + scylla_core + api + alternator + redis,
'test/tools/cql_repl': idls + ['test/tools/cql_repl.cc'] + scylla_core + scylla_tests_generic_dependencies,
#FIXME: we don't need all of scylla_core here, only the types module, need to modularize scylla_core.
'tools/scylla-types': idls + ['tools/scylla-types.cc'] + scylla_core,
'tools/scylla-sstable-index': idls + ['tools/scylla-sstable-index.cc'] + scylla_core,
}
pure_boost_tests = set([
@@ -1001,13 +1054,13 @@ pure_boost_tests = set([
'test/boost/dynamic_bitset_test',
'test/boost/enum_option_test',
'test/boost/enum_set_test',
'test/boost/hashers_test',
'test/boost/idl_test',
'test/boost/json_test',
'test/boost/keys_test',
'test/boost/like_matcher_test',
'test/boost/linearizing_input_stream_test',
'test/boost/map_difference_test',
'test/boost/meta_test',
'test/boost/nonwrapping_range_test',
'test/boost/observable_test',
'test/boost/range_test',
@@ -1017,11 +1070,13 @@ pure_boost_tests = set([
'test/boost/top_k_test',
'test/boost/vint_serialization_test',
'test/boost/bptree_test',
'test/boost/utf8_test',
'test/boost/btree_test',
'test/manual/streaming_histogram_test',
])
tests_not_using_seastar_test_framework = set([
'test/boost/alternator_base64_test',
'test/boost/alternator_unit_test',
'test/boost/small_vector_test',
'test/manual/gossip',
'test/manual/message',
@@ -1036,7 +1091,11 @@ tests_not_using_seastar_test_framework = set([
'test/unit/lsa_sync_eviction_test',
'test/unit/row_cache_alloc_stress_test',
'test/unit/bptree_stress_test',
'test/unit/btree_stress_test',
'test/unit/bptree_compaction_test',
'test/unit/btree_compaction_test',
'test/unit/radix_tree_stress_test',
'test/unit/radix_tree_compaction_test',
'test/manual/sstable_scan_footprint_test',
]) | pure_boost_tests
@@ -1079,8 +1138,6 @@ deps['test/boost/estimated_histogram_test'] = ['test/boost/estimated_histogram_t
deps['test/boost/anchorless_list_test'] = ['test/boost/anchorless_list_test.cc']
deps['test/perf/perf_fast_forward'] += ['release.cc']
deps['test/perf/perf_simple_query'] += ['release.cc']
deps['test/boost/meta_test'] = ['test/boost/meta_test.cc']
deps['test/boost/imr_test'] = ['test/boost/imr_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['test/boost/reusable_buffer_test'] = [
"test/boost/reusable_buffer_test.cc",
"test/lib/log.cc",
@@ -1095,10 +1152,11 @@ deps['test/boost/linearizing_input_stream_test'] = [
]
deps['test/boost/duration_test'] += ['test/lib/exception_utils.cc']
deps['test/boost/alternator_base64_test'] += ['alternator/base64.cc']
deps['test/boost/alternator_unit_test'] += ['alternator/base64.cc']
deps['test/raft/replication_test'] = ['test/raft/replication_test.cc'] + scylla_raft_dependencies
deps['test/boost/raft_fsm_test'] = ['test/boost/raft_fsm_test.cc', 'test/lib/log.cc'] + scylla_raft_dependencies
deps['test/raft/fsm_test'] = ['test/raft/fsm_test.cc', 'test/lib/log.cc'] + scylla_raft_dependencies
deps['test/raft/etcd_test'] = ['test/raft/etcd_test.cc', 'test/lib/log.cc'] + scylla_raft_dependencies
deps['utils/gz/gen_crc_combine_table'] = ['utils/gz/gen_crc_combine_table.cc']
@@ -1139,7 +1197,6 @@ warnings = [
'-Wno-delete-non-abstract-non-virtual-dtor',
'-Wno-unknown-attributes',
'-Wno-braced-scalar-init',
'-Wno-unused-value',
'-Wno-range-loop-construct',
'-Wno-unused-function',
'-Wno-implicit-int-float-conversion',
@@ -1155,9 +1212,18 @@ warnings = [w
warnings = ' '.join(warnings + ['-Wno-error=deprecated-declarations'])
def clang_inline_threshold():
if args.clang_inline_threshold != -1:
return args.clang_inline_threshold
elif platform.machine() == 'aarch64':
# we see miscompiles with 1200 and above with format("{}", uuid)
return 600
else:
return 2500
optimization_flags = [
'--param inline-unit-growth=300', # gcc
'-mllvm -inline-threshold=2500', # clang
f'-mllvm -inline-threshold={clang_inline_threshold()}', # clang
]
optimization_flags = [o
for o in optimization_flags
@@ -1168,6 +1234,15 @@ if flag_supported(flag='-Wstack-usage=4096', compiler=args.cxx):
for mode in modes:
modes[mode]['cxxflags'] += f' -Wstack-usage={modes[mode]["stack-usage-threshold"]} -Wno-error=stack-usage='
for mode_level in args.mode_o_levels:
( mode, level ) = mode_level.split('=', 2)
if mode not in modes:
raise Exception(f'Mode {mode} is missing, cannot configure optimization level for it')
modes[mode]['optimization-level'] = level
for mode in modes:
modes[mode]['cxxflags'] += f' -O{modes[mode]["optimization-level"]}'
linker_flags = linker_flags(compiler=args.cxx)
dbgflag = '-g -gz' if args.debuginfo else ''
@@ -1233,7 +1308,8 @@ compiler_test_src = '''
int main() { return 0; }
'''
if not try_compile_and_link(compiler=args.cxx, source=compiler_test_src):
print('Wrong GCC version. Scylla needs GCC >= 10.1.1 to compile.')
try_compile_and_link(compiler=args.cxx, source=compiler_test_src, verbose=True)
print('Wrong compiler version or incorrect flags. Scylla needs GCC >= 10.1.1 with coroutines (-fcoroutines) or clang >= 10.0.0 to compile.')
sys.exit(1)
if not try_compile(compiler=args.cxx, source='#include <boost/version.hpp>'):
@@ -1284,7 +1360,9 @@ scylla_release = file.read().strip()
file = open(f'{outdir}/SCYLLA-PRODUCT-FILE', 'r')
scylla_product = file.read().strip()
extra_cxxflags["release.cc"] = "-DSCYLLA_VERSION=\"\\\"" + scylla_version + "\\\"\" -DSCYLLA_RELEASE=\"\\\"" + scylla_release + "\\\"\""
for m in ['debug', 'release', 'sanitize', 'dev']:
cxxflags = "-DSCYLLA_VERSION=\"\\\"" + scylla_version + "\\\"\" -DSCYLLA_RELEASE=\"\\\"" + scylla_release + "\\\"\" -DSCYLLA_BUILD_MODE=\"\\\"" + m + "\\\"\""
extra_cxxflags[m]["release.cc"] = cxxflags
for m in ['debug', 'release', 'sanitize']:
modes[m]['cxxflags'] += ' ' + dbgflag
@@ -1438,6 +1516,7 @@ abseil_libs = ['absl/' + lib for lib in [
'numeric/libabsl_int128.a',
'hash/libabsl_city.a',
'hash/libabsl_hash.a',
'hash/libabsl_wyhash.a',
'base/libabsl_malloc_internal.a',
'base/libabsl_spinlock_wait.a',
'base/libabsl_base.a',
@@ -1458,9 +1537,6 @@ libs = ' '.join([maybe_static(args.staticyamlcpp, '-lyaml-cpp'), '-latomic', '-l
if not args.staticboost:
args.user_cflags += ' -DBOOST_TEST_DYN_LINK'
if build_raft:
args.user_cflags += ' -DENABLE_SCYLLA_RAFT'
# thrift version detection, see #4538
proc_res = subprocess.run(["thrift", "-version"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
proc_res_output = proc_res.stdout.decode("utf-8")
@@ -1741,8 +1817,8 @@ with open(buildfile_tmp, 'w') as f:
for obj in compiles:
src = compiles[obj]
f.write('build {}: cxx.{} {} || {} {}\n'.format(obj, mode, src, seastar_dep, gen_headers_dep))
if src in extra_cxxflags:
f.write(' cxxflags = {seastar_cflags} $cxxflags $cxxflags_{mode} {extra_cxxflags}\n'.format(mode=mode, extra_cxxflags=extra_cxxflags[src], **modeval))
if src in extra_cxxflags[mode]:
f.write(' cxxflags = {seastar_cflags} $cxxflags $cxxflags_{mode} {extra_cxxflags}\n'.format(mode=mode, extra_cxxflags=extra_cxxflags[mode][src], **modeval))
for swagger in swaggers:
hh = swagger.headers(gen_dir)[0]
cc = swagger.sources(gen_dir)[0]
@@ -1798,7 +1874,7 @@ with open(buildfile_tmp, 'w') as f:
f.write(textwrap.dedent('''\
build $builddir/{mode}/iotune: copy $builddir/{mode}/seastar/apps/iotune/iotune
''').format(**locals()))
f.write('build $builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz: package $builddir/{mode}/scylla $builddir/{mode}/iotune $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE $builddir/debian/debian | always\n'.format(**locals()))
f.write('build $builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz: package $builddir/{mode}/scylla $builddir/{mode}/iotune $builddir/SCYLLA-RELEASE-FILE $builddir/SCYLLA-VERSION-FILE $builddir/debian/debian $builddir/node_exporter | always\n'.format(**locals()))
f.write(' mode = {mode}\n'.format(**locals()))
f.write(f'build $builddir/dist/{mode}/redhat: rpmbuild $builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz\n')
f.write(f' mode = {mode}\n')
@@ -1807,7 +1883,7 @@ with open(buildfile_tmp, 'w') as f:
f.write(f'build dist-server-{mode}: phony $builddir/dist/{mode}/redhat $builddir/dist/{mode}/debian\n')
f.write(f'build dist-jmx-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-jmx-package.tar.gz dist-jmx-rpm dist-jmx-deb\n')
f.write(f'build dist-tools-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-tools-package.tar.gz dist-tools-rpm dist-tools-deb\n')
f.write(f'build dist-python3-{mode}: phony dist-python3-tar dist-python3-rpm dist-python3-deb compat-python3-rpm compat-python3-deb\n')
f.write(f'build dist-python3-{mode}: phony dist-python3-tar dist-python3-rpm dist-python3-deb\n')
f.write(f'build dist-unified-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-unified-package-{scylla_version}.{scylla_release}.tar.gz\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-package-{scylla_version}.{scylla_release}.tar.gz: unified $builddir/{mode}/dist/tar/{scylla_product}-package.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-python3-package.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-jmx-package.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-tools-package.tar.gz | always\n')
f.write(f' mode = {mode}\n')
@@ -1873,22 +1949,6 @@ with open(buildfile_tmp, 'w') as f:
build dist-tools-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-tools-package.tar.gz'.format(mode=mode, scylla_product=scylla_product) for mode in build_modes])}
build dist-tools: phony dist-tools-tar dist-tools-rpm dist-tools-deb
rule compat-python3-reloc
command = mkdir -p $builddir/release && ln -f $dir/$artifact $builddir/release/
rule compat-python3-rpm
command = cd $dir && ./reloc/build_rpm.sh --reloc-pkg $artifact --builddir ../../build/redhat
rule compat-python3-deb
command = cd $dir && ./reloc/build_deb.sh --reloc-pkg $artifact --builddir ../../build/debian
build $builddir/release/{scylla_product}-python3-package.tar.gz: compat-python3-reloc tools/python3/build/{scylla_product}-python3-package.tar.gz
dir = tools/python3
artifact = $builddir/{scylla_product}-python3-package.tar.gz
build compat-python3-rpm: compat-python3-rpm tools/python3/build/{scylla_product}-python3-package.tar.gz
dir = tools/python3
artifact = $builddir/{scylla_product}-python3-package.tar.gz
build compat-python3-deb: compat-python3-deb tools/python3/build/{scylla_product}-python3-package.tar.gz
dir = tools/python3
artifact = $builddir/{scylla_product}-python3-package.tar.gz
build tools/python3/build/{scylla_product}-python3-package.tar.gz: build-submodule-reloc
reloc_dir = tools/python3
args = --packages "{python3_dependencies}"
@@ -1899,7 +1959,7 @@ with open(buildfile_tmp, 'w') as f:
dir = tools/python3
artifact = $builddir/{scylla_product}-python3-package.tar.gz
build dist-python3-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-python3-package.tar.gz'.format(mode=mode, scylla_product=scylla_product) for mode in build_modes])}
build dist-python3: phony dist-python3-tar dist-python3-rpm dist-python3-deb $builddir/release/{scylla_product}-python3-package.tar.gz compat-python3-rpm compat-python3-deb
build dist-python3: phony dist-python3-tar dist-python3-rpm dist-python3-deb $builddir/release/{scylla_product}-python3-package.tar.gz
build dist-deb: phony dist-server-deb dist-python3-deb dist-jmx-deb dist-tools-deb
build dist-rpm: phony dist-server-rpm dist-python3-rpm dist-jmx-rpm dist-tools-rpm
build dist-tar: phony dist-unified-tar dist-server-tar dist-python3-tar dist-jmx-tar dist-tools-tar
@@ -1957,6 +2017,9 @@ with open(buildfile_tmp, 'w') as f:
rule debian_files_gen
command = ./dist/debian/debian_files_gen.py
build $builddir/debian/debian: debian_files_gen | always
rule extract_node_exporter
command = tar -C build -xvpf {node_exporter_filename} --no-same-owner && rm -rfv build/node_exporter && mv -v build/{node_exporter_dirname} build/node_exporter
build $builddir/node_exporter: extract_node_exporter | always
''').format(**globals()))
os.rename(buildfile_tmp, buildfile)

View File

@@ -36,9 +36,9 @@ converting_mutation_partition_applier::upgrade_cell(const abstract_type& new_typ
atomic_cell::collection_member cm) {
if (cell.is_live() && !old_type.is_counter()) {
if (cell.is_live_and_has_ttl()) {
return atomic_cell::make_live(new_type, cell.timestamp(), cell.value().linearize(), cell.expiry(), cell.ttl(), cm);
return atomic_cell::make_live(new_type, cell.timestamp(), cell.value(), cell.expiry(), cell.ttl(), cm);
}
return atomic_cell::make_live(new_type, cell.timestamp(), cell.value().linearize(), cm);
return atomic_cell::make_live(new_type, cell.timestamp(), cell.value(), cm);
} else {
return atomic_cell(new_type, cell);
}

View File

@@ -19,16 +19,10 @@
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "service/storage_service.hh"
#include "counters.hh"
#include "mutation.hh"
#include "combine.hh"
counter_id counter_id::local()
{
return counter_id(service::get_local_storage_service().get_local_id());
}
std::ostream& operator<<(std::ostream& os, const counter_id& id) {
return os << id.to_uuid();
}
@@ -124,16 +118,14 @@ void counter_cell_view::apply(const column_definition& cdef, atomic_cell_or_coll
assert(!dst_ac.is_counter_update());
assert(!src_ac.is_counter_update());
with_linearized(dst_ac, [&] (counter_cell_view dst_ccv) {
with_linearized(src_ac, [&] (counter_cell_view src_ccv) {
auto src_ccv = counter_cell_view(src_ac);
auto dst_ccv = counter_cell_view(dst_ac);
if (dst_ccv.shard_count() >= src_ccv.shard_count()) {
auto dst_amc = dst.as_mutable_atomic_cell(cdef);
auto src_amc = src.as_mutable_atomic_cell(cdef);
if (!dst_amc.is_value_fragmented() && !src_amc.is_value_fragmented()) {
if (apply_in_place(cdef, dst_amc, src_amc)) {
return;
}
if (apply_in_place(cdef, dst_amc, src_amc)) {
return;
}
}
@@ -148,8 +140,6 @@ void counter_cell_view::apply(const column_definition& cdef, atomic_cell_or_coll
auto cell = result.build(std::max(dst_ac.timestamp(), src_ac.timestamp()));
src = std::exchange(dst, atomic_cell_or_collection(std::move(cell)));
});
});
}
std::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, atomic_cell_view b)
@@ -164,8 +154,8 @@ std::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, ato
return { };
}
return with_linearized(a, [&] (counter_cell_view a_ccv) {
return with_linearized(b, [&] (counter_cell_view b_ccv) {
auto a_ccv = counter_cell_view(a);
auto b_ccv = counter_cell_view(b);
auto a_shards = a_ccv.shards();
auto b_shards = b_ccv.shards();
@@ -192,15 +182,13 @@ std::optional<atomic_cell> counter_cell_view::difference(atomic_cell_view a, ato
diff = atomic_cell::make_live(*counter_type, a.timestamp(), bytes_view());
}
return diff;
});
});
}
void transform_counter_updates_to_shards(mutation& m, const mutation* current_state, uint64_t clock_offset) {
void transform_counter_updates_to_shards(mutation& m, const mutation* current_state, uint64_t clock_offset, utils::UUID local_id) {
// FIXME: allow current_state to be frozen_mutation
auto transform_new_row_to_shards = [&s = *m.schema(), clock_offset] (column_kind kind, auto& cells) {
auto transform_new_row_to_shards = [&s = *m.schema(), clock_offset, local_id] (column_kind kind, auto& cells) {
cells.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto& cdef = s.column_at(kind, id);
auto acv = ac_o_c.as_atomic_cell(cdef);
@@ -208,7 +196,7 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
return; // continue -- we are in lambda
}
auto delta = acv.counter_update_value();
auto cs = counter_shard(counter_id::local(), delta, clock_offset + 1);
auto cs = counter_shard(counter_id(local_id), delta, clock_offset + 1);
ac_o_c = counter_cell_builder::from_single_shard(acv.timestamp(), cs);
});
};
@@ -223,7 +211,7 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
clustering_key::less_compare cmp(*m.schema());
auto transform_row_to_shards = [&s = *m.schema(), clock_offset] (column_kind kind, auto& transformee, auto& state) {
auto transform_row_to_shards = [&s = *m.schema(), clock_offset, local_id] (column_kind kind, auto& transformee, auto& state) {
std::deque<std::pair<column_id, counter_shard>> shards;
state.for_each_cell([&] (column_id id, const atomic_cell_or_collection& ac_o_c) {
auto& cdef = s.column_at(kind, id);
@@ -231,14 +219,13 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
if (!acv.is_live()) {
return; // continue -- we are in lambda
}
counter_cell_view::with_linearized(acv, [&] (counter_cell_view ccv) {
auto cs = ccv.local_shard();
auto ccv = counter_cell_view(acv);
auto cs = ccv.get_shard(counter_id(local_id));
if (!cs) {
return; // continue
}
shards.emplace_back(std::make_pair(id, counter_shard(*cs)));
});
});
transformee.for_each_cell([&] (column_id id, atomic_cell_or_collection& ac_o_c) {
auto& cdef = s.column_at(kind, id);
@@ -253,7 +240,7 @@ void transform_counter_updates_to_shards(mutation& m, const mutation* current_st
auto delta = acv.counter_update_value();
if (shards.empty() || shards.front().first > id) {
auto cs = counter_shard(counter_id::local(), delta, clock_offset + 1);
auto cs = counter_shard(counter_id(local_id), delta, clock_offset + 1);
ac_o_c = counter_cell_builder::from_single_shard(acv.timestamp(), cs);
} else {
auto& cs = shards.front().second;

View File

@@ -61,8 +61,6 @@ public:
return !(*this == other);
}
public:
static counter_id local();
// For tests.
static counter_id generate_random() {
return counter_id(utils::make_random_uuid());
@@ -83,21 +81,20 @@ class basic_counter_shard_view {
total_size = unsigned(logical_clock) + sizeof(int64_t),
};
private:
using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const signed char*, signed char*>;
pointer_type _base;
managed_bytes_basic_view<is_mutable> _base;
private:
template<typename T>
T read(offset off) const {
T value;
std::copy_n(_base + static_cast<unsigned>(off), sizeof(T), reinterpret_cast<signed char*>(&value));
return value;
auto v = _base;
v.remove_prefix(size_t(off));
return read_simple_native<T>(v);
}
public:
static constexpr auto size = size_t(offset::total_size);
public:
basic_counter_shard_view() = default;
explicit basic_counter_shard_view(pointer_type ptr) noexcept
: _base(ptr) { }
explicit basic_counter_shard_view(managed_bytes_basic_view<is_mutable> v) noexcept
: _base(v) { }
counter_id id() const { return read<counter_id>(offset::id); }
int64_t value() const { return read<int64_t>(offset::value); }
@@ -108,15 +105,24 @@ public:
static constexpr size_t size = size_t(offset::total_size) - off;
signed char tmp[size];
std::copy_n(_base + off, size, tmp);
std::copy_n(other._base + off, size, _base + off);
std::copy_n(tmp, size, other._base + off);
auto tmp_view = single_fragmented_mutable_view(bytes_mutable_view(std::data(tmp), std::size(tmp)));
managed_bytes_mutable_view this_view = _base.substr(off, size);
managed_bytes_mutable_view other_view = other._base.substr(off, size);
copy_fragmented_view(tmp_view, this_view);
copy_fragmented_view(this_view, other_view);
copy_fragmented_view(other_view, tmp_view);
}
void set_value_and_clock(const basic_counter_shard_view& other) noexcept {
static constexpr size_t off = size_t(offset::value);
static constexpr size_t size = size_t(offset::total_size) - off;
std::copy_n(other._base + off, size, _base + off);
managed_bytes_mutable_view this_view = _base.substr(off, size);
managed_bytes_mutable_view other_view = other._base.substr(off, size);
copy_fragmented_view(this_view, other_view);
}
bool operator==(const basic_counter_shard_view& other) const {
@@ -142,11 +148,6 @@ class counter_shard {
counter_id _id;
int64_t _value;
int64_t _logical_clock;
private:
template<typename T>
static void write(const T& value, bytes::iterator& out) {
out = std::copy_n(reinterpret_cast<const signed char*>(&value), sizeof(T), out);
}
private:
// Shared logic for applying counter_shards and counter_shard_views.
// T is either counter_shard or basic_counter_shard_view<U>.
@@ -197,10 +198,10 @@ public:
static constexpr size_t serialized_size() {
return counter_shard_view::size;
}
void serialize(bytes::iterator& out) const {
write(_id, out);
write(_value, out);
write(_logical_clock, out);
void serialize(atomic_cell_value_mutable_view& out) const {
write_native<counter_id>(out, _id);
write_native<int64_t>(out, _value);
write_native<int64_t>(out, _logical_clock);
}
};
@@ -237,7 +238,7 @@ public:
size_t serialized_size() const {
return _shards.size() * counter_shard::serialized_size();
}
void serialize(bytes::iterator& out) const {
void serialize(atomic_cell_value_mutable_view& out) const {
for (auto&& cs : _shards) {
cs.serialize(out);
}
@@ -248,31 +249,18 @@ public:
}
atomic_cell build(api::timestamp_type timestamp) const {
// If we can assume that the counter shards never cross fragment boundaries
// the serialisation code gets much simpler.
static_assert(data::cell::maximum_external_chunk_length % counter_shard::serialized_size() == 0);
auto ac = atomic_cell::make_live_uninitialized(*counter_type, timestamp, serialized_size());
auto dst_it = ac.value().begin();
auto dst_current = *dst_it++;
auto dst = ac.value();
for (auto&& cs : _shards) {
if (dst_current.empty()) {
dst_current = *dst_it++;
}
assert(!dst_current.empty());
auto value_dst = dst_current.data();
cs.serialize(value_dst);
dst_current.remove_prefix(counter_shard::serialized_size());
cs.serialize(dst);
}
return ac;
}
static atomic_cell from_single_shard(api::timestamp_type timestamp, const counter_shard& cs) {
// We don't really need to bother with fragmentation here.
static_assert(data::cell::maximum_external_chunk_length >= counter_shard::serialized_size());
auto ac = atomic_cell::make_live_uninitialized(*counter_type, timestamp, counter_shard::serialized_size());
auto dst = ac.value().first_fragment().begin();
auto dst = ac.value();
cs.serialize(dst);
return ac;
}
@@ -311,12 +299,7 @@ public:
template<mutable_view is_mutable>
class basic_counter_cell_view {
protected:
using linearized_value_view = std::conditional_t<is_mutable == mutable_view::no,
bytes_view, bytes_mutable_view>;
using pointer_type = std::conditional_t<is_mutable == mutable_view::no,
bytes_view::const_pointer, bytes_mutable_view::pointer>;
basic_atomic_cell_view<is_mutable> _cell;
linearized_value_view _value;
private:
class shard_iterator {
public:
@@ -326,12 +309,12 @@ private:
using pointer = basic_counter_shard_view<is_mutable>*;
using reference = basic_counter_shard_view<is_mutable>&;
private:
pointer_type _current;
managed_bytes_basic_view<is_mutable> _current;
basic_counter_shard_view<is_mutable> _current_view;
size_t _pos = 0;
public:
shard_iterator() = default;
shard_iterator(pointer_type ptr) noexcept
: _current(ptr), _current_view(ptr) { }
shard_iterator(managed_bytes_basic_view<is_mutable> v, size_t offset) noexcept
: _current(v), _current_view(_current), _pos(offset) { }
basic_counter_shard_view<is_mutable>& operator*() noexcept {
return _current_view;
@@ -340,8 +323,8 @@ private:
return &_current_view;
}
shard_iterator& operator++() noexcept {
_current += counter_shard_view::size;
_current_view = basic_counter_shard_view<is_mutable>(_current);
_pos += counter_shard_view::size;
_current_view = basic_counter_shard_view<is_mutable>(_current.substr(_pos, counter_shard_view::size));
return *this;
}
shard_iterator operator++(int) noexcept {
@@ -350,8 +333,8 @@ private:
return it;
}
shard_iterator& operator--() noexcept {
_current -= counter_shard_view::size;
_current_view = basic_counter_shard_view<is_mutable>(_current);
_pos -= counter_shard_view::size;
_current_view = basic_counter_shard_view<is_mutable>(_current.substr(_pos, counter_shard_view::size));
return *this;
}
shard_iterator operator--(int) noexcept {
@@ -360,31 +343,29 @@ private:
return it;
}
bool operator==(const shard_iterator& other) const noexcept {
return _current == other._current;
}
bool operator!=(const shard_iterator& other) const noexcept {
return !(*this == other);
return _pos == other._pos;
}
};
public:
boost::iterator_range<shard_iterator> shards() const {
auto begin = shard_iterator(_value.data());
auto end = shard_iterator(_value.data() + _value.size());
auto value = _cell.value();
auto begin = shard_iterator(value, 0);
auto end = shard_iterator(value, value.size());
return boost::make_iterator_range(begin, end);
}
size_t shard_count() const {
return _cell.value().size_bytes() / counter_shard_view::size;
return _cell.value().size() / counter_shard_view::size;
}
protected:
public:
// ac must be a live counter cell
explicit basic_counter_cell_view(basic_atomic_cell_view<is_mutable> ac, linearized_value_view vv) noexcept
: _cell(ac), _value(vv)
explicit basic_counter_cell_view(basic_atomic_cell_view<is_mutable> ac) noexcept
: _cell(ac)
{
assert(_cell.is_live());
assert(!_cell.is_counter_update());
}
public:
api::timestamp_type timestamp() const { return _cell.timestamp(); }
static data_type total_value_type() { return long_type; }
@@ -405,11 +386,6 @@ public:
return *it;
}
std::optional<counter_shard_view> local_shard() const {
// TODO: consider caching local shard position
return get_shard(counter_id::local());
}
bool operator==(const basic_counter_cell_view& other) const {
return timestamp() == other.timestamp() && boost::equal(shards(), other.shards());
}
@@ -418,14 +394,6 @@ public:
struct counter_cell_view : basic_counter_cell_view<mutable_view::no> {
using basic_counter_cell_view::basic_counter_cell_view;
template<typename Function>
static decltype(auto) with_linearized(basic_atomic_cell_view<mutable_view::no> ac, Function&& fn) {
return ac.value().with_linearized([&] (bytes_view value_view) {
counter_cell_view ccv(ac, value_view);
return fn(ccv);
});
}
// Reversibly applies two counter cells, at least one of them must be live.
static void apply(const column_definition& cdef, atomic_cell_or_collection& dst, atomic_cell_or_collection& src);
@@ -440,9 +408,8 @@ struct counter_cell_mutable_view : basic_counter_cell_view<mutable_view::yes> {
using basic_counter_cell_view::basic_counter_cell_view;
explicit counter_cell_mutable_view(atomic_cell_mutable_view ac) noexcept
: basic_counter_cell_view<mutable_view::yes>(ac, ac.value().first_fragment())
: basic_counter_cell_view<mutable_view::yes>(ac)
{
assert(!ac.value().is_fragmented());
}
void set_timestamp(api::timestamp_type ts) { _cell.set_timestamp(ts); }
@@ -451,7 +418,7 @@ struct counter_cell_mutable_view : basic_counter_cell_view<mutable_view::yes> {
// Transforms mutation dst from counter updates to counter shards using state
// stored in current_state.
// If current_state is present it has to be in the same schema as dst.
void transform_counter_updates_to_shards(mutation& dst, const mutation* current_state, uint64_t clock_offset);
void transform_counter_updates_to_shards(mutation& dst, const mutation* current_state, uint64_t clock_offset, utils::UUID local_id);
template<>
struct appending_hash<counter_shard_view> {

View File

@@ -394,6 +394,7 @@ selectStatement returns [std::unique_ptr<raw::select_statement> expr]
bool allow_filtering = false;
bool is_json = false;
bool bypass_cache = false;
auto attrs = std::make_unique<cql3::attributes::raw>();
}
: K_SELECT (
( K_JSON { is_json = true; } )?
@@ -408,11 +409,12 @@ selectStatement returns [std::unique_ptr<raw::select_statement> expr]
( K_LIMIT rows=intValue { limit = rows; } )?
( K_ALLOW K_FILTERING { allow_filtering = true; } )?
( K_BYPASS K_CACHE { bypass_cache = true; })?
( usingClause[attrs] )?
{
auto params = make_lw_shared<raw::select_statement::parameters>(std::move(orderings), is_distinct, allow_filtering, is_json, bypass_cache);
$expr = std::make_unique<raw::select_statement>(std::move(cf), std::move(params),
std::move(sclause), std::move(wclause), std::move(limit), std::move(per_partition_limit),
std::move(gbcolumns));
std::move(gbcolumns), std::move(attrs));
}
;
@@ -521,6 +523,7 @@ usingClause[std::unique_ptr<cql3::attributes::raw>& attrs]
usingClauseObjective[std::unique_ptr<cql3::attributes::raw>& attrs]
: K_TIMESTAMP ts=intValue { attrs->timestamp = ts; }
| K_TTL t=intValue { attrs->time_to_live = t; }
| K_TIMEOUT to=term { attrs->timeout = to; }
;
/**
@@ -928,7 +931,7 @@ alterKeyspaceStatement returns [std::unique_ptr<cql3::statements::alter_keyspace
alterTableStatement returns [std::unique_ptr<alter_table_statement> expr]
@init {
alter_table_statement::type type;
auto props = make_shared<cql3::statements::cf_prop_defs>();
auto props = cql3::statements::cf_prop_defs();
std::vector<alter_table_statement::column_change> column_changes;
std::vector<std::pair<shared_ptr<cql3::column_identifier::raw>, shared_ptr<cql3::column_identifier::raw>>> renames;
}
@@ -944,7 +947,7 @@ alterTableStatement returns [std::unique_ptr<alter_table_statement> expr]
| '(' id1=cident { column_changes.emplace_back(alter_table_statement::column_change{id1}); }
(',' idn=cident { column_changes.emplace_back(alter_table_statement::column_change{idn}); } )* ')'
)
| K_WITH properties[*props] { type = alter_table_statement::type::opts; }
| K_WITH properties[props] { type = alter_table_statement::type::opts; }
| K_RENAME { type = alter_table_statement::type::rename; }
id1=cident K_TO toId1=cident { renames.emplace_back(id1, toId1); }
( K_AND idn=cident K_TO toIdn=cident { renames.emplace_back(idn, toIdn); } )*
@@ -984,9 +987,9 @@ alterTypeStatement returns [std::unique_ptr<alter_type_statement> expr]
*/
alterViewStatement returns [std::unique_ptr<alter_view_statement> expr]
@init {
auto props = make_shared<cql3::statements::cf_prop_defs>();
auto props = cql3::statements::cf_prop_defs();
}
: K_ALTER K_MATERIALIZED K_VIEW cf=columnFamilyName K_WITH properties[*props]
: K_ALTER K_MATERIALIZED K_VIEW cf=columnFamilyName K_WITH properties[props]
{
$expr = std::make_unique<alter_view_statement>(std::move(cf), std::move(props));
}
@@ -1121,7 +1124,7 @@ dataResource returns [uninitialized<auth::resource> res]
: K_ALL K_KEYSPACES { $res = auth::resource(auth::resource_kind::data); }
| K_KEYSPACE ks = keyspaceName { $res = auth::make_data_resource($ks.id); }
| ( K_COLUMNFAMILY )? cf = columnFamilyName
{ $res = auth::make_data_resource($cf.name->get_keyspace(), $cf.name->get_column_family()); }
{ $res = auth::make_data_resource($cf.name.has_keyspace() ? $cf.name.get_keyspace() : "", $cf.name.get_column_family()); }
;
roleResource returns [uninitialized<auth::resource> res]
@@ -1258,8 +1261,8 @@ ident returns [shared_ptr<cql3::column_identifier> id]
// Keyspace & Column family names
keyspaceName returns [sstring id]
@init { auto name = make_shared<cql3::cf_name>(); }
: ksName[*name] { $id = name->get_keyspace(); }
@init { auto name = cql3::cf_name(); }
: ksName[name] { $id = name.get_keyspace(); }
;
indexName returns [::shared_ptr<cql3::index_name> name]
@@ -1267,9 +1270,9 @@ indexName returns [::shared_ptr<cql3::index_name> name]
: (ksName[*name] '.')? idxName[*name]
;
columnFamilyName returns [::shared_ptr<cql3::cf_name> name]
@init { $name = ::make_shared<cql3::cf_name>(); }
: (ksName[*name] '.')? cfName[*name]
columnFamilyName returns [cql3::cf_name name]
@init { $name = cql3::cf_name(); }
: (ksName[name] '.')? cfName[name]
;
userTypeName returns [uninitialized<cql3::ut_name> name]
@@ -1546,6 +1549,10 @@ relation[std::vector<cql3::relation_ptr>& clauses]
{
$clauses.emplace_back(cql3::multi_column_relation::create_non_in_relation(ids, type, literal));
}
| type=relationType K_SCYLLA_CLUSTERING_BOUND literal=tupleLiteral /* (a, b, c) > (1, 2, 3) or (a, b, c) > (?, ?, ?) */
{
$clauses.emplace_back(cql3::multi_column_relation::create_scylla_clustering_bound_non_in_relation(ids, type, literal));
}
| type=relationType tupleMarker=markerForTuple /* (a, b, c) >= ? */
{ $clauses.emplace_back(cql3::multi_column_relation::create_non_in_relation(ids, type, tupleMarker)); }
)
@@ -1761,6 +1768,7 @@ basic_unreserved_keyword returns [sstring str]
| K_PER
| K_PARTITION
| K_GROUP
| K_TIMEOUT
) { $str = $k.text; }
;
@@ -1911,11 +1919,15 @@ K_PARTITION: P A R T I T I O N;
K_SCYLLA_TIMEUUID_LIST_INDEX: S C Y L L A '_' T I M E U U I D '_' L I S T '_' I N D E X;
K_SCYLLA_COUNTER_SHARD_LIST: S C Y L L A '_' C O U N T E R '_' S H A R D '_' L I S T;
K_SCYLLA_CLUSTERING_BOUND: S C Y L L A '_' C L U S T E R I N G '_' B O U N D;
K_GROUP: G R O U P;
K_LIKE: L I K E;
K_TIMEOUT: T I M E O U T;
// Case-insensitive alpha characters
fragment A: ('a'|'A');
fragment B: ('b'|'B');

View File

@@ -70,11 +70,11 @@ abstract_marker::raw::raw(int32_t bind_index)
::shared_ptr<term> abstract_marker::raw::prepare(database& db, const sstring& keyspace, lw_shared_ptr<column_specification> receiver) const
{
if (receiver->type->is_collection()) {
if (receiver->type->get_kind() == abstract_type::kind::list) {
if (receiver->type->without_reversed().is_list()) {
return ::make_shared<lists::marker>(_bind_index, receiver);
} else if (receiver->type->get_kind() == abstract_type::kind::set) {
} else if (receiver->type->without_reversed().is_set()) {
return ::make_shared<sets::marker>(_bind_index, receiver);
} else if (receiver->type->get_kind() == abstract_type::kind::map) {
} else if (receiver->type->without_reversed().is_map()) {
return ::make_shared<maps::marker>(_bind_index, receiver);
}
assert(0);

View File

@@ -44,12 +44,13 @@
namespace cql3 {
std::unique_ptr<attributes> attributes::none() {
return std::unique_ptr<attributes>{new attributes{{}, {}}};
return std::unique_ptr<attributes>{new attributes{{}, {}, {}}};
}
attributes::attributes(::shared_ptr<term>&& timestamp, ::shared_ptr<term>&& time_to_live)
attributes::attributes(::shared_ptr<term>&& timestamp, ::shared_ptr<term>&& time_to_live, ::shared_ptr<term>&& timeout)
: _timestamp{std::move(timestamp)}
, _time_to_live{std::move(time_to_live)}
, _timeout{std::move(timeout)}
{ }
bool attributes::is_timestamp_set() const {
@@ -60,6 +61,10 @@ bool attributes::is_time_to_live_set() const {
return bool(_time_to_live);
}
bool attributes::is_timeout_set() const {
return bool(_timeout);
}
int64_t attributes::get_timestamp(int64_t now, const query_options& options) {
if (!_timestamp) {
return now;
@@ -72,14 +77,12 @@ int64_t attributes::get_timestamp(int64_t now, const query_options& options) {
if (tval.is_unset_value()) {
return now;
}
return with_linearized(*tval, [&] (bytes_view val) {
try {
data_type_for<int64_t>()->validate(val, options.get_cql_serialization_format());
data_type_for<int64_t>()->validate(*tval, options.get_cql_serialization_format());
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception("Invalid timestamp value");
}
return value_cast<int64_t>(data_type_for<int64_t>()->deserialize(val));
});
return value_cast<int64_t>(data_type_for<int64_t>()->deserialize(*tval));
}
int32_t attributes::get_time_to_live(const query_options& options) {
@@ -93,16 +96,15 @@ int32_t attributes::get_time_to_live(const query_options& options) {
if (tval.is_unset_value()) {
return 0;
}
auto ttl = with_linearized(*tval, [&] (bytes_view val) {
try {
data_type_for<int32_t>()->validate(val, options.get_cql_serialization_format());
data_type_for<int32_t>()->validate(*tval, options.get_cql_serialization_format());
}
catch (marshal_exception& e) {
throw exceptions::invalid_request_exception("Invalid TTL value");
}
auto ttl = value_cast<int32_t>(data_type_for<int32_t>()->deserialize(*tval));
return value_cast<int32_t>(data_type_for<int32_t>()->deserialize(val));
});
if (ttl < 0) {
throw exceptions::invalid_request_exception("A TTL must be greater or equal to 0");
}
@@ -115,6 +117,25 @@ int32_t attributes::get_time_to_live(const query_options& options) {
return ttl;
}
db::timeout_clock::duration attributes::get_timeout(const query_options& options) const {
auto timeout = _timeout->bind_and_get(options);
if (timeout.is_null() || timeout.is_unset_value()) {
throw exceptions::invalid_request_exception("Timeout value cannot be unset/null");
}
cql_duration duration = value_cast<cql_duration>(duration_type->deserialize(*timeout));
if (duration.months || duration.days) {
throw exceptions::invalid_request_exception("Timeout values cannot be expressed in days/months");
}
if (duration.nanoseconds % 1'000'000 != 0) {
throw exceptions::invalid_request_exception("Timeout values cannot have granularity finer than milliseconds");
}
if (duration.nanoseconds < 0) {
throw exceptions::invalid_request_exception("Timeout values must be non-negative");
}
return std::chrono::duration_cast<db::timeout_clock::duration>(std::chrono::nanoseconds(duration.nanoseconds));
}
void attributes::collect_marker_specification(variable_specifications& bound_names) const {
if (_timestamp) {
_timestamp->collect_marker_specification(bound_names);
@@ -122,12 +143,16 @@ void attributes::collect_marker_specification(variable_specifications& bound_nam
if (_time_to_live) {
_time_to_live->collect_marker_specification(bound_names);
}
if (_timeout) {
_timeout->collect_marker_specification(bound_names);
}
}
std::unique_ptr<attributes> attributes::raw::prepare(database& db, const sstring& ks_name, const sstring& cf_name) const {
auto ts = !timestamp ? ::shared_ptr<term>{} : timestamp->prepare(db, ks_name, timestamp_receiver(ks_name, cf_name));
auto ttl = !time_to_live ? ::shared_ptr<term>{} : time_to_live->prepare(db, ks_name, time_to_live_receiver(ks_name, cf_name));
return std::unique_ptr<attributes>{new attributes{std::move(ts), std::move(ttl)}};
auto to = !timeout ? ::shared_ptr<term>{} : timeout->prepare(db, ks_name, timeout_receiver(ks_name, cf_name));
return std::unique_ptr<attributes>{new attributes{std::move(ts), std::move(ttl), std::move(to)}};
}
lw_shared_ptr<column_specification> attributes::raw::timestamp_receiver(const sstring& ks_name, const sstring& cf_name) const {
@@ -138,4 +163,8 @@ lw_shared_ptr<column_specification> attributes::raw::time_to_live_receiver(const
return make_lw_shared<column_specification>(ks_name, cf_name, ::make_shared<column_identifier>("[ttl]", true), data_type_for<int32_t>());
}
lw_shared_ptr<column_specification> attributes::raw::timeout_receiver(const sstring& ks_name, const sstring& cf_name) const {
return make_lw_shared<column_specification>(ks_name, cf_name, ::make_shared<column_identifier>("[timeout]", true), duration_type);
}
}

View File

@@ -54,31 +54,39 @@ class attributes final {
private:
const ::shared_ptr<term> _timestamp;
const ::shared_ptr<term> _time_to_live;
const ::shared_ptr<term> _timeout;
public:
static std::unique_ptr<attributes> none();
private:
attributes(::shared_ptr<term>&& timestamp, ::shared_ptr<term>&& time_to_live);
attributes(::shared_ptr<term>&& timestamp, ::shared_ptr<term>&& time_to_live, ::shared_ptr<term>&& timeout);
public:
bool is_timestamp_set() const;
bool is_time_to_live_set() const;
bool is_timeout_set() const;
int64_t get_timestamp(int64_t now, const query_options& options);
int32_t get_time_to_live(const query_options& options);
db::timeout_clock::duration get_timeout(const query_options& options) const;
void collect_marker_specification(variable_specifications& bound_names) const;
class raw final {
public:
::shared_ptr<term::raw> timestamp;
::shared_ptr<term::raw> time_to_live;
::shared_ptr<term::raw> timeout;
std::unique_ptr<attributes> prepare(database& db, const sstring& ks_name, const sstring& cf_name) const;
private:
lw_shared_ptr<column_specification> timestamp_receiver(const sstring& ks_name, const sstring& cf_name) const;
lw_shared_ptr<column_specification> time_to_live_receiver(const sstring& ks_name, const sstring& cf_name) const;
lw_shared_ptr<column_specification> timeout_receiver(const sstring& ks_name, const sstring& cf_name) const;
};
};

View File

@@ -35,6 +35,28 @@ struct authorized_prepared_statements_cache_size {
class authorized_prepared_statements_cache_key {
public:
using cache_key_type = std::pair<auth::authenticated_user, typename cql3::prepared_cache_key_type::cache_key_type>;
struct view {
const auth::authenticated_user& user_ref;
const cql3::prepared_cache_key_type& prep_cache_key_ref;
};
struct view_hasher {
size_t operator()(const view& kv) {
return cql3::authorized_prepared_statements_cache_key::hash(kv.user_ref, kv.prep_cache_key_ref.key());
}
};
struct view_equal {
bool operator()(const authorized_prepared_statements_cache_key& k1, const view& k2) {
return k1.key().first == k2.user_ref && k1.key().second == k2.prep_cache_key_ref.key();
}
bool operator()(const view& k2, const authorized_prepared_statements_cache_key& k1) {
return operator()(k1, k2);
}
};
private:
cache_key_type _key;
@@ -100,10 +122,12 @@ private:
public:
using key_type = cache_key_type;
using key_view_type = typename key_type::view;
using key_view_hasher = typename key_type::view_hasher;
using key_view_equal = typename key_type::view_equal;
using value_type = checked_weak_ptr;
using entry_is_too_big = typename cache_type::entry_is_too_big;
using iterator = typename cache_type::iterator;
using value_ptr = typename cache_type::value_ptr;
private:
cache_type _cache;
logging::logger& _logger;
@@ -124,38 +148,12 @@ public:
}).discard_result();
}
iterator find(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
struct key_view {
const auth::authenticated_user& user_ref;
const cql3::prepared_cache_key_type& prep_cache_key_ref;
};
struct hasher {
size_t operator()(const key_view& kv) {
return cql3::authorized_prepared_statements_cache_key::hash(kv.user_ref, kv.prep_cache_key_ref.key());
}
};
struct equal {
bool operator()(const key_type& k1, const key_view& k2) {
return k1.key().first == k2.user_ref && k1.key().second == k2.prep_cache_key_ref.key();
}
bool operator()(const key_view& k2, const key_type& k1) {
return operator()(k1, k2);
}
};
return _cache.find(key_view{user, prep_cache_key}, hasher(), equal());
}
iterator end() {
return _cache.end();
value_ptr find(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
return _cache.find(key_view_type{user, prep_cache_key}, key_view_hasher(), key_view_equal());
}
void remove(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {
iterator it = find(user, prep_cache_key);
_cache.remove(it);
_cache.remove(key_view_type{user, prep_cache_key}, key_view_hasher(), key_view_equal());
}
size_t size() const {

View File

@@ -192,9 +192,12 @@ public:
virtual ::shared_ptr<terminal> bind(const query_options& options) override {
auto bytes = bind_and_get(options);
if (!bytes) {
if (bytes.is_null()) {
return ::shared_ptr<terminal>{};
}
if (bytes.is_unset_value()) {
return UNSET_VALUE;
}
return ::make_shared<constants::value>(std::move(cql3::raw_value::make_value(to_bytes(*bytes))));
}
};
@@ -227,9 +230,7 @@ public:
} else if (value.is_unset_value()) {
return;
}
auto increment = with_linearized(*value, [] (bytes_view value_view) {
return value_cast<int64_t>(long_type->deserialize_value(value_view));
});
auto increment = value_cast<int64_t>(long_type->deserialize_value(*value));
m.set_cell(prefix, column, make_counter_update_cell(increment, params));
}
};
@@ -244,9 +245,7 @@ public:
} else if (value.is_unset_value()) {
return;
}
auto increment = with_linearized(*value, [] (bytes_view value_view) {
return value_cast<int64_t>(long_type->deserialize_value(value_view));
});
auto increment = value_cast<int64_t>(long_type->deserialize_value(*value));
if (increment == std::numeric_limits<int64_t>::min()) {
throw exceptions::invalid_request_exception(format("The negation of {:d} overflows supported counter precision (signed 8 bytes integer)", increment));
}

View File

@@ -59,6 +59,8 @@ class result_message;
namespace cql3 {
class query_processor;
class metadata;
shared_ptr<const metadata> make_empty_metadata();
@@ -99,11 +101,9 @@ public:
* @param options options for this query (consistency, variables, pageSize, ...)
*/
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute(service::storage_proxy& proxy, service::query_state& state, const query_options& options) const = 0;
execute(query_processor& qp, service::query_state& state, const query_options& options) const = 0;
virtual bool depends_on_keyspace(const sstring& ks_name) const = 0;
virtual bool depends_on_column_family(const sstring& cf_name) const = 0;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const = 0;
virtual shared_ptr<const metadata> get_result_metadata() const = 0;

View File

@@ -27,7 +27,9 @@
#include <fmt/ostream.h>
#include <unordered_map>
#include "cql3/constants.hh"
#include "cql3/lists.hh"
#include "cql3/statements/request_validations.hh"
#include "cql3/tuples.hh"
#include "index/secondary_index_manager.hh"
#include "types/list.hh"
@@ -43,7 +45,8 @@ using boost::adaptors::transformed;
namespace {
std::optional<atomic_cell_value_view> do_get_value(const schema& schema,
static
bytes_opt do_get_value(const schema& schema,
const column_definition& cdef,
const partition_key& key,
const clustering_key_prefix& ckey,
@@ -51,9 +54,9 @@ std::optional<atomic_cell_value_view> do_get_value(const schema& schema,
gc_clock::time_point now) {
switch (cdef.kind) {
case column_kind::partition_key:
return atomic_cell_value_view(key.get_component(schema, cdef.component_index()));
return to_bytes(key.get_component(schema, cdef.component_index()));
case column_kind::clustering_key:
return atomic_cell_value_view(ckey.get_component(schema, cdef.component_index()));
return to_bytes(ckey.get_component(schema, cdef.component_index()));
default:
auto cell = cells.find_cell(cdef.id);
if (!cell) {
@@ -61,7 +64,7 @@ std::optional<atomic_cell_value_view> do_get_value(const schema& schema,
}
assert(cdef.is_atomic());
auto c = cell->as_atomic_cell(cdef);
return c.is_dead(now) ? std::nullopt : std::optional<atomic_cell_value_view>(c.value());
return c.is_dead(now) ? std::nullopt : bytes_opt(to_bytes(c.value()));
}
}
@@ -138,9 +141,8 @@ bytes_opt get_value_from_partition_slice(
/// Returns col's value from a mutation.
bytes_opt get_value_from_mutation(const column_value& col, row_data_from_mutation data) {
const auto v = do_get_value(
return do_get_value(
data.schema_, *col.col, data.partition_key_, data.clustering_key_, data.other_columns, data.now);
return v ? v->linearize() : bytes_opt();
}
/// Returns col's value from the fetched data.
@@ -154,7 +156,7 @@ bytes_opt get_value(const column_value& col, const column_value_eval_bag& bag) {
/// Type for comparing results of get_value().
const abstract_type* get_value_comparator(const column_definition* cdef) {
return cdef->type->is_reversed() ? cdef->type->underlying_type().get() : cdef->type.get();
return &cdef->type->without_reversed();
}
/// Type for comparing results of get_value().
@@ -355,16 +357,12 @@ bytes_opt next_value(query::result_row_view::iterator_type& iter, const column_d
if (cdef->type->is_multi_cell()) {
auto cell = iter.next_collection_cell();
if (cell) {
return cell->with_linearized([] (bytes_view data) {
return bytes(data.cbegin(), data.cend());
});
return linearized(*cell);
}
} else {
auto cell = iter.next_atomic_cell();
if (cell) {
return cell->value().with_linearized([] (bytes_view data) {
return bytes(data.cbegin(), data.cend());
});
return linearized(cell->value());
}
}
return std::nullopt;
@@ -417,6 +415,8 @@ bool is_one_of(const column_value& col, term& rhs, const column_value_eval_bag&
} else if (auto mkr = dynamic_cast<lists::marker*>(&rhs)) {
// This is `a IN ?`. RHS elements are values representable as bytes_opt.
const auto values = static_pointer_cast<lists::value>(mkr->bind(bag.options));
statements::request_validations::check_not_null(
values, "Invalid null value for column %s", col.col->name_as_text());
return boost::algorithm::any_of(values->get_elements(), [&] (const bytes_opt& b) {
return equal(b, col, bag);
});
@@ -568,7 +568,8 @@ const auto deref = boost::adaptors::transformed([] (const bytes_opt& b) { return
/// Returns possible values from t, which must be RHS of IN.
value_list get_IN_values(
const ::shared_ptr<term>& t, const query_options& options, const serialized_compare& comparator) {
const ::shared_ptr<term>& t, const query_options& options, const serialized_compare& comparator,
sstring_view column_name) {
// RHS is prepared differently for different CQL cases. Cast it dynamically to discern which case this is.
if (auto dv = dynamic_pointer_cast<lists::delayed_value>(t)) {
// Case `a IN (1,2,3)`.
@@ -578,8 +579,12 @@ value_list get_IN_values(
return to_sorted_vector(std::move(result_range), comparator);
} else if (auto mkr = dynamic_pointer_cast<lists::marker>(t)) {
// Case `a IN ?`. Collect all list-element values.
const auto val = static_pointer_cast<lists::value>(mkr->bind(options));
return to_sorted_vector(val->get_elements() | non_null | deref, comparator);
const auto val = mkr->bind(options);
if (val == constants::UNSET_VALUE) {
throw exceptions::invalid_request_exception(format("Invalid unset value for column {}", column_name));
}
statements::request_validations::check_not_null(val, "Invalid null value for column %s", column_name);
return to_sorted_vector(static_pointer_cast<lists::value>(val)->get_elements() | non_null | deref, comparator);
}
throw std::logic_error(format("get_IN_values(single column) on invalid term {}", *t));
}
@@ -606,22 +611,6 @@ value_list get_IN_values(const ::shared_ptr<term>& t, size_t k, const query_opti
static constexpr bool inclusive = true, exclusive = false;
/// A range of all X such that X op val.
nonwrapping_range<bytes> to_range(oper_t op, const bytes& val) {
switch (op) {
case oper_t::GT:
return nonwrapping_range<bytes>::make_starting_with(interval_bound(val, exclusive));
case oper_t::GTE:
return nonwrapping_range<bytes>::make_starting_with(interval_bound(val, inclusive));
case oper_t::LT:
return nonwrapping_range<bytes>::make_ending_with(interval_bound(val, exclusive));
case oper_t::LTE:
return nonwrapping_range<bytes>::make_ending_with(interval_bound(val, inclusive));
default:
throw std::logic_error(format("to_range: unknown comparison operator {}", op));
}
}
} // anonymous namespace
expression make_conjunction(expression a, expression b) {
@@ -650,7 +639,7 @@ bool is_satisfied_by(
std::vector<bytes_opt> first_multicolumn_bound(
const expression& restr, const query_options& options, statements::bound bnd) {
auto found = find_atom(restr, [bnd] (const binary_operator& oper) {
return matches(oper.op, bnd) && std::holds_alternative<std::vector<column_value>>(oper.lhs);
return matches(oper.op, bnd) && is_multi_column(oper);
});
if (found) {
return static_pointer_cast<tuples::value>(found->rhs->bind(options))->get_elements();
@@ -659,6 +648,27 @@ std::vector<bytes_opt> first_multicolumn_bound(
}
}
template<typename T>
nonwrapping_range<T> to_range(oper_t op, const T& val) {
static constexpr bool inclusive = true, exclusive = false;
switch (op) {
case oper_t::EQ:
return nonwrapping_range<T>::make_singular(val);
case oper_t::GT:
return nonwrapping_range<T>::make_starting_with(interval_bound(val, exclusive));
case oper_t::GTE:
return nonwrapping_range<T>::make_starting_with(interval_bound(val, inclusive));
case oper_t::LT:
return nonwrapping_range<T>::make_ending_with(interval_bound(val, exclusive));
case oper_t::LTE:
return nonwrapping_range<T>::make_ending_with(interval_bound(val, inclusive));
default:
throw std::logic_error(format("to_range: unknown comparison operator {}", op));
}
}
template nonwrapping_range<clustering_key_prefix> to_range(oper_t, const clustering_key_prefix&);
value_set possible_lhs_values(const column_definition* cdef, const expression& expr, const query_options& options) {
const auto type = cdef ? get_value_comparator(cdef) : long_type.get();
return std::visit(overloaded_functor{
@@ -686,7 +696,7 @@ value_set possible_lhs_values(const column_definition* cdef, const expression& e
return oper.op == oper_t::EQ ? value_set(value_list{*val})
: to_range(oper.op, *val);
} else if (oper.op == oper_t::IN) {
return get_IN_values(oper.rhs, options, type->as_less_comparator());
return get_IN_values(oper.rhs, options, type->as_less_comparator(), cdef->name_as_text());
}
throw std::logic_error(format("possible_lhs_values: unhandled operator {}", oper));
},
@@ -776,9 +786,11 @@ bool is_supported_by(const expression& expr, const secondary_index::index& idx)
return idx.supports_expression(*col.col, oper.op);
},
[&] (const std::vector<column_value>& cvs) {
return boost::algorithm::any_of(cvs, [&] (const column_value& c) {
return idx.supports_expression(*c.col, oper.op);
});
if (cvs.size() == 1) {
return idx.supports_expression(*cvs[0].col, oper.op);
}
// We don't use index table for multi-column restrictions, as it cannot avoid filtering.
return false;
},
[&] (const token&) { return false; },
}, oper.lhs);
@@ -800,7 +812,7 @@ bool has_supporting_index(
}
std::ostream& operator<<(std::ostream& os, const column_value& cv) {
os << *cv.col;
os << cv.col->name_as_text();
if (cv.sub) {
os << '[' << *cv.sub << ']';
}
@@ -815,10 +827,10 @@ std::ostream& operator<<(std::ostream& os, const expression& expr) {
std::visit(overloaded_functor{
[&] (const token& t) { os << "TOKEN"; },
[&] (const column_value& col) {
fmt::print(os, "({})", col);
fmt::print(os, "{}", col);
},
[&] (const std::vector<column_value>& cvs) {
fmt::print(os, "(({}))", fmt::join(cvs, ","));
fmt::print(os, "({})", fmt::join(cvs, ","));
},
}, opr.lhs);
os << ' ' << opr.op << ' ' << *opr.rhs;

View File

@@ -73,11 +73,18 @@ struct token {};
enum class oper_t { EQ, NEQ, LT, LTE, GTE, GT, IN, CONTAINS, CONTAINS_KEY, IS_NOT, LIKE };
/// Describes the nature of clustering-key comparisons. Useful for implementing SCYLLA_CLUSTERING_BOUND.
enum class comparison_order : char {
cql, ///< CQL order. (a,b)>(1,1) is equivalent to a>1 OR (a=1 AND b>1).
clustering, ///< Table's clustering order. (a,b)>(1,1) means any row past (1,1) in storage.
};
/// Operator restriction: LHS op RHS.
struct binary_operator {
std::variant<column_value, std::vector<column_value>, token> lhs;
oper_t op;
::shared_ptr<term> rhs;
comparison_order order = comparison_order::cql;
};
/// A conjunction of restrictions.
@@ -131,6 +138,10 @@ extern value_set possible_lhs_values(const column_definition*, const expression&
/// Turns value_set into a range, unless it's a multi-valued list (in which case this throws).
extern nonwrapping_range<bytes> to_range(const value_set&);
/// A range of all X such that X op val.
template<typename T>
nonwrapping_range<T> to_range(oper_t op, const T& val);
/// True iff the index can support the entire expression.
extern bool is_supported_by(const expression&, const secondary_index::index&);
@@ -182,7 +193,8 @@ inline const binary_operator* find(const expression& e, oper_t op) {
}
inline bool needs_filtering(oper_t op) {
return (op == oper_t::CONTAINS) || (op == oper_t::CONTAINS_KEY) || (op == oper_t::LIKE);
return (op == oper_t::CONTAINS) || (op == oper_t::CONTAINS_KEY) || (op == oper_t::LIKE) ||
(op == oper_t::IS_NOT) || (op == oper_t::NEQ) ;
}
inline auto find_needs_filtering(const expression& e) {
@@ -211,6 +223,10 @@ inline bool is_compare(oper_t op) {
}
}
inline bool is_multi_column(const binary_operator& op) {
return holds_alternative<std::vector<column_value>>(op.lhs);
}
inline bool has_token(const expression& e) {
return find_atom(e, [] (const binary_operator& o) { return std::holds_alternative<token>(o.lhs); });
}
@@ -219,6 +235,14 @@ inline bool has_slice_or_needs_filtering(const expression& e) {
return find_atom(e, [] (const binary_operator& o) { return is_slice(o.op) || needs_filtering(o.op); });
}
inline bool is_clustering_order(const binary_operator& op) {
return op.order == comparison_order::clustering;
}
inline auto find_clustering_order(const expression& e) {
return find_atom(e, is_clustering_order);
}
/// True iff binary_operator involves a collection.
extern bool is_on_collection(const binary_operator&);

View File

@@ -219,7 +219,7 @@ struct aggregate_type_for<simple_date_native_type> {
template<>
struct aggregate_type_for<timeuuid_native_type> {
using type = timeuuid_native_type::primary_type;
using type = timeuuid_native_type;
};
template<>
@@ -227,6 +227,7 @@ struct aggregate_type_for<time_native_type> {
using type = time_native_type::primary_type;
};
// WARNING: never invoke this on temporary values; it will return a dangling reference.
template <typename Type>
const Type& max_wrapper(const Type& t1, const Type& t2) {
using std::max;
@@ -241,6 +242,10 @@ inline const net::inet_address& max_wrapper(const net::inet_address& t1, const n
return std::memcmp(t1.data(), t2.data(), len) >= 0 ? t1 : t2;
}
inline const timeuuid_native_type& max_wrapper(const timeuuid_native_type& t1, const timeuuid_native_type& t2) {
return t1.uuid.timestamp() > t2.uuid.timestamp() ? t1 : t2;
}
template <typename Type>
class impl_max_function_for final : public aggregate_function::aggregate {
std::optional<typename aggregate_type_for<Type>::type> _max{};
@@ -323,6 +328,7 @@ make_max_function() {
return make_shared<max_function_for<Type>>();
}
// WARNING: never invoke this on temporary values; it will return a dangling reference.
template <typename Type>
const Type& min_wrapper(const Type& t1, const Type& t2) {
using std::min;
@@ -337,6 +343,10 @@ inline const net::inet_address& min_wrapper(const net::inet_address& t1, const n
return std::memcmp(t1.data(), t2.data(), len) <= 0 ? t1 : t2;
}
inline timeuuid_native_type min_wrapper(timeuuid_native_type t1, timeuuid_native_type t2) {
return t1.uuid.timestamp() < t2.uuid.timestamp() ? t1 : t2;
}
template <typename Type>
class impl_min_function_for final : public aggregate_function::aggregate {
std::optional<typename aggregate_type_for<Type>::type> _min{};

View File

@@ -24,6 +24,7 @@
#include "error_injection_fcts.hh"
#include "utils/error_injection.hh"
#include "types/list.hh"
#include <seastar/core/map_reduce.hh>
namespace cql3
{

View File

@@ -54,7 +54,7 @@ std::ostream& operator<<(std::ostream& os, const std::vector<data_type>& arg_typ
namespace cql3 {
namespace functions {
static logging::logger log("cql3_fuctions");
logging::logger log("cql3_fuctions");
bool abstract_function::requires_thread() const { return false; }
@@ -181,13 +181,18 @@ inline
shared_ptr<function>
make_from_json_function(database& db, const sstring& keyspace, data_type t) {
return make_native_scalar_function<true>("fromjson", t, {utf8_type},
[&db, &keyspace, t](cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
rjson::value json_value = rjson::parse(utf8_type->to_string(parameters[0].value()));
bytes_opt parsed_json_value;
if (!json_value.IsNull()) {
parsed_json_value.emplace(from_json_object(*t, json_value, sf));
[&db, keyspace, t](cql_serialization_format sf, const std::vector<bytes_opt>& parameters) -> bytes_opt {
try {
rjson::value json_value = rjson::parse(utf8_type->to_string(parameters[0].value()));
bytes_opt parsed_json_value;
if (!json_value.IsNull()) {
parsed_json_value.emplace(from_json_object(*t, json_value, sf));
}
return parsed_json_value;
} catch(rjson::error& e) {
throw exceptions::function_execution_exception("fromJson",
format("Failed parsing fromJson parameter: {}", e.what()), keyspace, {t->name()});
}
return parsed_json_value;
});
}

View File

@@ -78,7 +78,22 @@ public:
return Pure;
}
virtual bytes_opt execute(cql_serialization_format sf, const std::vector<bytes_opt>& parameters) override {
return _func(sf, parameters);
try {
return _func(sf, parameters);
} catch(exceptions::cassandra_exception&) {
// If the function's code took the time to produce an official
// cassandra_exception, pass it through. Otherwise, below we will
// wrap the unknown exception in a function_execution_exception.
throw;
} catch(...) {
std::vector<sstring> args;
args.reserve(arg_types().size());
for (const data_type& a : arg_types()) {
args.push_back(a->name());
}
throw exceptions::function_execution_exception(name().name,
format("Failed execution of function {}: {}", name(), std::current_exception()), name().keyspace, std::move(args));
}
}
};

View File

@@ -21,9 +21,13 @@
#include "user_function.hh"
#include "lua.hh"
#include "log.hh"
namespace cql3 {
namespace functions {
extern logging::logger log;
user_function::user_function(function_name name, std::vector<data_type> arg_types, std::vector<sstring> arg_names,
sstring body, sstring language, data_type return_type, bool called_on_null_input, sstring bitcode,
lua::runtime_config cfg)
@@ -56,7 +60,9 @@ bytes_opt user_function::execute(cql_serialization_format sf, const std::vector<
}
values.push_back(bytes ? type->deserialize(*bytes) : data_value::make_null(type));
}
if (!seastar::thread::running_in_thread()) {
on_internal_error(log, "User function cannot be executed in this context");
}
return lua::run_script(lua::bitcode_view{_bitcode}, values, return_type(), _cfg).get0();
}
}

View File

@@ -53,11 +53,11 @@ const sstring& index_name::get_idx() const
return _idx_name;
}
::shared_ptr<cf_name> index_name::get_cf_name() const
cf_name index_name::get_cf_name() const
{
auto cf = ::make_shared<cf_name>();
cf_name cf;
if (has_keyspace()) {
cf->set_keyspace(get_keyspace(), true);
cf.set_keyspace(get_keyspace(), true);
}
return cf;
}

View File

@@ -55,7 +55,7 @@ public:
const sstring& get_idx() const;
::shared_ptr<cf_name> get_cf_name() const;
cf_name get_cf_name() const;
virtual sstring to_string() const override;
};

View File

@@ -55,6 +55,7 @@ bool keyspace_element_name::has_keyspace() const
const sstring& keyspace_element_name::get_keyspace() const
{
assert(_ks_name);
return *_ks_name;
}

View File

@@ -25,8 +25,8 @@
#include "cql3_type.hh"
#include "constants.hh"
#include <boost/iterator/transform_iterator.hpp>
#include <boost/range/adaptor/reversed.hpp>
#include "types/list.hh"
#include "utils/UUID_gen.hh"
namespace cql3 {
@@ -40,7 +40,7 @@ lw_shared_ptr<column_specification>
lists::value_spec_of(const column_specification& column) {
return make_lw_shared<column_specification>(column.ks_name, column.cf_name,
::make_shared<column_identifier>(format("value({})", *column.name), true),
dynamic_pointer_cast<const list_type_impl>(column.type)->get_elements_type());
dynamic_cast<const list_type_impl&>(column.type->without_reversed()).get_elements_type());
}
lw_shared_ptr<column_specification>
@@ -87,7 +87,7 @@ lists::literal::prepare(database& db, const sstring& keyspace, lw_shared_ptr<col
void
lists::literal::validate_assignable_to(database& db, const sstring keyspace, const column_specification& receiver) const {
if (!dynamic_pointer_cast<const list_type_impl>(receiver.type)) {
if (!receiver.type->without_reversed().is_list()) {
throw exceptions::invalid_request_exception(format("Invalid list literal for {} of type {}",
*receiver.name, receiver.type->as_cql3_type()));
}
@@ -125,18 +125,11 @@ lists::literal::to_string() const {
lists::value
lists::value::from_serialized(const fragmented_temporary_buffer::view& val, const list_type_impl& type, cql_serialization_format sf) {
return with_linearized(val, [&] (bytes_view v) {
return from_serialized(v, type, sf);
});
}
lists::value
lists::value::from_serialized(bytes_view v, const list_type_impl& type, cql_serialization_format sf) {
try {
// Collections have this small hack that validate cannot be called on a serialized object,
// but compose does the validation (so we're fine).
// FIXME: deserializeForNativeProtocol()?!
auto l = value_cast<list_type_impl::native_type>(type.deserialize(v, sf));
auto l = value_cast<list_type_impl::native_type>(type.deserialize(val, sf));
std::vector<bytes_opt> elements;
elements.reserve(l.size());
for (auto&& element : l) {
@@ -227,17 +220,15 @@ lists::delayed_value::bind(const query_options& options) {
::shared_ptr<terminal>
lists::marker::bind(const query_options& options) {
const auto& value = options.get_value_at(_bind_index);
auto& ltype = static_cast<const list_type_impl&>(*_receiver->type);
auto& ltype = dynamic_cast<const list_type_impl&>(_receiver->type->without_reversed());
if (value.is_null()) {
return nullptr;
} else if (value.is_unset_value()) {
return constants::UNSET_VALUE;
} else {
try {
return with_linearized(*value, [&] (bytes_view v) {
ltype.validate(v, options.get_cql_serialization_format());
return make_shared<lists::value>(value::from_serialized(v, ltype, options.get_cql_serialization_format()));
});
ltype.validate(*value, options.get_cql_serialization_format());
return make_shared<lists::value>(value::from_serialized(*value, ltype, options.get_cql_serialization_format()));
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception(
format("Exception while binding column {:s}: {:s}", _receiver->name->to_cql_string(), e.what()));
@@ -245,20 +236,6 @@ lists::marker::bind(const query_options& options) {
}
}
constexpr db_clock::time_point lists::precision_time::REFERENCE_TIME;
thread_local lists::precision_time lists::precision_time::_last = {db_clock::time_point::max(), 0};
lists::precision_time
lists::precision_time::get_next(db_clock::time_point millis) {
// FIXME: and if time goes backwards?
assert(millis <= _last.millis);
auto next = millis < _last.millis
? precision_time{millis, 9999}
: precision_time{millis, std::max(0, _last.nanos - 1)};
_last = next;
return next;
}
void
lists::setter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
auto value = _t->bind(params._options);
@@ -308,9 +285,7 @@ lists::setter_by_index::execute(mutation& m, const clustering_key_prefix& prefix
return;
}
auto idx = with_linearized(*index, [] (bytes_view v) {
return value_cast<int32_t>(data_type_for<int32_t>()->deserialize(v));
});
auto idx = value_cast<int32_t>(data_type_for<int32_t>()->deserialize(*index));
auto&& existing_list_opt = params.get_prefetched_list(m.key(), prefix, column);
if (!existing_list_opt) {
throw exceptions::invalid_request_exception("Attempted to set an element on a list which is null");
@@ -398,10 +373,18 @@ lists::do_append(shared_ptr<term> value,
collection_mutation_description appended;
appended.cells.reserve(to_add.size());
for (auto&& e : to_add) {
auto uuid1 = utils::UUID_gen::get_time_UUID_bytes();
auto uuid = bytes(reinterpret_cast<const int8_t*>(uuid1.data()), uuid1.size());
// FIXME: can e be empty?
appended.cells.emplace_back(std::move(uuid), params.make_cell(*ltype->value_comparator(), *e, atomic_cell::collection_member::yes));
try {
auto uuid1 = utils::UUID_gen::get_time_UUID_bytes_from_micros_and_submicros(
params.timestamp(),
params._options.next_list_append_seq());
auto uuid = bytes(reinterpret_cast<const int8_t*>(uuid1.data()), uuid1.size());
// FIXME: can e be empty?
appended.cells.emplace_back(
std::move(uuid),
params.make_cell(*ltype->value_comparator(), *e, atomic_cell::collection_member::yes));
} catch (utils::timeuuid_submicro_out_of_range) {
throw exceptions::invalid_request_exception("Too many list values per single CQL statement or batch");
}
}
m.set_cell(prefix, column, appended.serialize(*ltype));
} else {
@@ -425,20 +408,42 @@ lists::prepender::execute(mutation& m, const clustering_key_prefix& prefix, cons
auto&& lvalue = dynamic_pointer_cast<lists::value>(std::move(value));
assert(lvalue);
auto time = precision_time::REFERENCE_TIME - (db_clock::now() - precision_time::REFERENCE_TIME);
// For prepend we need to be able to generate a unique but decreasing
// timeuuid. We achieve that by by using a time in the past which
// is 2x the distance between the original timestamp (it
// would be the current timestamp, user supplied timestamp, or
// unique monotonic LWT timestsamp, whatever is in query
// options) and a reference time of Jan 1 2010 00:00:00.
// E.g. if query timestamp is Jan 1 2020 00:00:00, the prepend
// timestamp will be Jan 1, 2000, 00:00:00.
// 2010-01-01T00:00:00+00:00 in api::timestamp_time format (microseconds)
static constexpr int64_t REFERENCE_TIME_MICROS = 1262304000L * 1000 * 1000;
int64_t micros = params.timestamp();
if (micros > REFERENCE_TIME_MICROS) {
micros = REFERENCE_TIME_MICROS - (micros - REFERENCE_TIME_MICROS);
} else {
// Scylla, unlike Cassandra, respects user-supplied timestamps
// in prepend, but there is nothing useful it can do with
// a timestamp less than Jan 1, 2010, 00:00:00.
throw exceptions::invalid_request_exception("List prepend custom timestamp must be greater than Jan 1 2010 00:00:00");
}
collection_mutation_description mut;
mut.cells.reserve(lvalue->get_elements().size());
// We reverse the order of insertion, so that the last element gets the lastest time
// (lists are sorted by time)
auto ltype = static_cast<const list_type_impl*>(column.type.get());
for (auto&& v : lvalue->_elements | boost::adaptors::reversed) {
auto&& pt = precision_time::get_next(time);
auto uuid = utils::UUID_gen::get_time_UUID_bytes(pt.millis.time_since_epoch().count(), pt.nanos);
mut.cells.emplace_back(bytes(uuid.data(), uuid.size()), params.make_cell(*ltype->value_comparator(), *v, atomic_cell::collection_member::yes));
int clockseq = params._options.next_list_prepend_seq(lvalue->_elements.size(), utils::UUID_gen::SUBMICRO_LIMIT);
for (auto&& v : lvalue->_elements) {
try {
auto uuid = utils::UUID_gen::get_time_UUID_bytes_from_micros_and_submicros(micros, clockseq++);
mut.cells.emplace_back(bytes(uuid.data(), uuid.size()), params.make_cell(*ltype->value_comparator(), *v, atomic_cell::collection_member::yes));
} catch (utils::timeuuid_submicro_out_of_range) {
throw exceptions::invalid_request_exception("Too many list values per single CQL statement or batch");
}
}
// now reverse again, to get the original order back
std::reverse(mut.cells.begin(), mut.cells.end());
m.set_cell(prefix, column, mut.serialize(*ltype));
}

View File

@@ -43,7 +43,6 @@
#include "cql3/abstract_marker.hh"
#include "to_string.hh"
#include "utils/UUID_gen.hh"
#include "operation.hh"
namespace cql3 {
@@ -73,7 +72,6 @@ public:
};
class value : public multi_item_terminal, collection_terminal {
static value from_serialized(bytes_view v, const list_type_impl& type, cql_serialization_format sf);
public:
std::vector<bytes_opt> _elements;
public:
@@ -122,28 +120,6 @@ public:
virtual ::shared_ptr<terminal> bind(const query_options& options) override;
};
/*
* For prepend, we need to be able to generate unique but decreasing time
* UUID, which is a bit challenging. To do that, given a time in milliseconds,
* we adds a number representing the 100-nanoseconds precision and make sure
* that within the same millisecond, that number is always decreasing. We
* do rely on the fact that the user will only provide decreasing
* milliseconds timestamp for that purpose.
*/
private:
class precision_time {
public:
// Our reference time (1 jan 2010, 00:00:00) in milliseconds.
static constexpr db_clock::time_point REFERENCE_TIME{std::chrono::milliseconds(1262304000000)};
private:
static thread_local precision_time _last;
public:
db_clock::time_point millis;
int32_t nanos;
static precision_time get_next(db_clock::time_point millis);
};
public:
class setter : public operation {
public:

View File

@@ -55,14 +55,14 @@ lw_shared_ptr<column_specification>
maps::key_spec_of(const column_specification& column) {
return make_lw_shared<column_specification>(column.ks_name, column.cf_name,
::make_shared<column_identifier>(format("key({})", *column.name), true),
dynamic_pointer_cast<const map_type_impl>(column.type)->get_keys_type());
dynamic_cast<const map_type_impl&>(column.type->without_reversed()).get_keys_type());
}
lw_shared_ptr<column_specification>
maps::value_spec_of(const column_specification& column) {
return make_lw_shared<column_specification>(column.ks_name, column.cf_name,
::make_shared<column_identifier>(format("value({})", *column.name), true),
dynamic_pointer_cast<const map_type_impl>(column.type)->get_values_type());
dynamic_cast<const map_type_impl&>(column.type->without_reversed()).get_values_type());
}
::shared_ptr<term>
@@ -88,7 +88,9 @@ maps::literal::prepare(database& db, const sstring& keyspace, lw_shared_ptr<colu
values.emplace(k, v);
}
delayed_value value(static_pointer_cast<const map_type_impl>(receiver->type)->get_keys_type()->as_less_comparator(), values);
delayed_value value(
dynamic_cast<const map_type_impl&>(receiver->type->without_reversed()).get_keys_type()->as_less_comparator(),
values);
if (all_terminal) {
return value.bind(query_options::DEFAULT);
} else {
@@ -98,7 +100,7 @@ maps::literal::prepare(database& db, const sstring& keyspace, lw_shared_ptr<colu
void
maps::literal::validate_assignable_to(database& db, const sstring& keyspace, const column_specification& receiver) const {
if (!dynamic_pointer_cast<const map_type_impl>(receiver.type)) {
if (!receiver.type->without_reversed().is_map()) {
throw exceptions::invalid_request_exception(format("Invalid map literal for {} of type {}", *receiver.name, receiver.type->as_cql3_type()));
}
auto&& key_spec = maps::key_spec_of(receiver);
@@ -158,15 +160,13 @@ maps::value::from_serialized(const fragmented_temporary_buffer::view& fragmented
// Collections have this small hack that validate cannot be called on a serialized object,
// but compose does the validation (so we're fine).
// FIXME: deserialize_for_native_protocol?!
return with_linearized(fragmented_value, [&] (bytes_view value) {
auto m = value_cast<map_type_impl::native_type>(type.deserialize(value, sf));
auto m = value_cast<map_type_impl::native_type>(type.deserialize(fragmented_value, sf));
std::map<bytes, bytes, serialized_compare> map(type.get_keys_type()->as_less_comparator());
for (auto&& e : m) {
map.emplace(type.get_keys_type()->decompose(e.first),
type.get_values_type()->decompose(e.second));
}
return maps::value { std::move(map) };
});
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception(e.what());
}
@@ -263,14 +263,16 @@ maps::marker::bind(const query_options& options) {
return constants::UNSET_VALUE;
}
try {
with_linearized(*val, [&] (bytes_view value) {
_receiver->type->validate(value, options.get_cql_serialization_format());
});
_receiver->type->validate(*val, options.get_cql_serialization_format());
} catch (marshal_exception& e) {
throw exceptions::invalid_request_exception(
format("Exception while binding column {:s}: {:s}", _receiver->name->to_cql_string(), e.what()));
}
return ::make_shared<maps::value>(maps::value::from_serialized(*val, static_cast<const map_type_impl&>(*_receiver->type), options.get_cql_serialization_format()));
return ::make_shared<maps::value>(
maps::value::from_serialized(
*val,
dynamic_cast<const map_type_impl&>(_receiver->type->without_reversed()),
options.get_cql_serialization_format()));
}
void
@@ -305,6 +307,12 @@ maps::setter_by_key::execute(mutation& m, const clustering_key_prefix& prefix, c
assert(column.type->is_multi_cell()); // "Attempted to set a value for a single key on a frozen map"m
auto key = _k->bind_and_get(params._options);
auto value = _t->bind_and_get(params._options);
if (value.is_unset_value()) {
return;
}
if (key.is_unset_value()) {
throw invalid_request_exception("Invalid unset map key");
}
if (!key) {
throw invalid_request_exception("Invalid null map key");
}

View File

@@ -59,30 +59,32 @@ namespace cql3 {
* - SELECT ... WHERE (a, b) IN ?
*/
class multi_column_relation final : public relation {
public:
using mode = expr::comparison_order;
private:
std::vector<shared_ptr<column_identifier::raw>> _entities;
shared_ptr<term::multi_column_raw> _values_or_marker;
std::vector<shared_ptr<term::multi_column_raw>> _in_values;
shared_ptr<tuples::in_raw> _in_marker;
mode _mode;
public:
multi_column_relation(std::vector<shared_ptr<column_identifier::raw>> entities,
expr::oper_t relation_type, shared_ptr<term::multi_column_raw> values_or_marker,
std::vector<shared_ptr<term::multi_column_raw>> in_values, shared_ptr<tuples::in_raw> in_marker)
std::vector<shared_ptr<term::multi_column_raw>> in_values, shared_ptr<tuples::in_raw> in_marker, mode m = mode::cql)
: relation(relation_type)
, _entities(std::move(entities))
, _values_or_marker(std::move(values_or_marker))
, _in_values(std::move(in_values))
, _in_marker(std::move(in_marker))
, _mode(m)
{ }
static shared_ptr<multi_column_relation> create_multi_column_relation(
std::vector<shared_ptr<column_identifier::raw>> entities, expr::oper_t relation_type,
shared_ptr<term::multi_column_raw> values_or_marker, std::vector<shared_ptr<term::multi_column_raw>> in_values,
shared_ptr<tuples::in_raw> in_marker) {
shared_ptr<tuples::in_raw> in_marker, mode m = mode::cql) {
return ::make_shared<multi_column_relation>(std::move(entities), relation_type, std::move(values_or_marker),
std::move(in_values), std::move(in_marker));
std::move(in_values), std::move(in_marker), m);
}
/**
@@ -99,6 +101,15 @@ public:
return create_multi_column_relation(std::move(entities), relation_type, std::move(values_or_marker), {}, {});
}
/**
* Same as above, but sets the magic mode that causes us to treat the restrictions as "raw" clustering bounds
*/
static shared_ptr<multi_column_relation> create_scylla_clustering_bound_non_in_relation(std::vector<shared_ptr<column_identifier::raw>> entities,
expr::oper_t relation_type, shared_ptr<term::multi_column_raw> values_or_marker) {
assert(relation_type != expr::oper_t::IN);
return create_multi_column_relation(std::move(entities), relation_type, std::move(values_or_marker), {}, {}, mode::clustering);
}
/**
* Creates a multi-column IN relation with a list of IN values or markers.
* For example: "SELECT ... WHERE (a, b) IN ((0, 1), (2, 3))"
@@ -191,7 +202,7 @@ protected:
return cs->column_specification;
});
auto t = to_term(col_specs, *get_value(), db, schema->ks_name(), bound_names);
return ::make_shared<restrictions::multi_column_restriction::slice>(schema, rs, bound, inclusive, t);
return ::make_shared<restrictions::multi_column_restriction::slice>(schema, rs, bound, inclusive, t, _mode);
}
virtual shared_ptr<restrictions::restriction> new_contains_restriction(database& db, schema_ptr schema,

View File

@@ -102,13 +102,7 @@ private:
using cache_key_type = typename prepared_cache_key_type::cache_key_type;
using cache_type = utils::loading_cache<cache_key_type, prepared_cache_entry, utils::loading_cache_reload_enabled::no, prepared_cache_entry_size, utils::tuple_hash, std::equal_to<cache_key_type>, prepared_cache_stats_updater>;
using cache_value_ptr = typename cache_type::value_ptr;
using cache_iterator = typename cache_type::iterator;
using checked_weak_ptr = typename statements::prepared_statement::checked_weak_ptr;
struct value_extractor_fn {
checked_weak_ptr operator()(prepared_cache_entry& e) const {
return e->checked_weak_from_this();
}
};
public:
static const std::chrono::minutes entry_expiry;
@@ -116,12 +110,9 @@ public:
using key_type = prepared_cache_key_type;
using value_type = checked_weak_ptr;
using statement_is_too_big = typename cache_type::entry_is_too_big;
/// \note both iterator::reference and iterator::value_type are checked_weak_ptr
using iterator = boost::transform_iterator<value_extractor_fn, cache_iterator>;
private:
cache_type _cache;
value_extractor_fn _value_extractor_fn;
public:
prepared_statements_cache(logging::logger& logger, size_t size)
@@ -135,16 +126,12 @@ public:
});
}
iterator find(const key_type& key) {
return boost::make_transform_iterator(_cache.find(key.key()), _value_extractor_fn);
}
iterator end() {
return boost::make_transform_iterator(_cache.end(), _value_extractor_fn);
}
iterator begin() {
return boost::make_transform_iterator(_cache.begin(), _value_extractor_fn);
value_type find(const key_type& key) {
cache_value_ptr vp = _cache.find(key.key());
if (vp) {
return (*vp)->checked_weak_from_this();
}
return value_type();
}
template <typename Pred>

View File

@@ -42,20 +42,21 @@
#include "cql3/cql_config.hh"
#include "query_options.hh"
#include "version.hh"
#include "db/consistency_level_type.hh"
namespace cql3 {
const cql_config default_cql_config;
thread_local const query_options::specific_options query_options::specific_options::DEFAULT{-1, {}, {}, api::missing_timestamp};
thread_local const query_options::specific_options query_options::specific_options::DEFAULT{
-1, {}, db::consistency_level::SERIAL, api::missing_timestamp};
thread_local query_options query_options::DEFAULT{default_cql_config,
db::consistency_level::ONE, infinite_timeout_config, std::nullopt,
db::consistency_level::ONE, std::nullopt,
std::vector<cql3::raw_value_view>(), false, query_options::specific_options::DEFAULT, cql_serialization_format::latest()};
query_options::query_options(const cql_config& cfg,
db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
std::vector<cql3::raw_value_view> value_views,
@@ -64,7 +65,6 @@ query_options::query_options(const cql_config& cfg,
cql_serialization_format sf)
: _cql_config(cfg)
, _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values(std::move(values))
, _value_views(value_views)
@@ -76,7 +76,6 @@ query_options::query_options(const cql_config& cfg,
query_options::query_options(const cql_config& cfg,
db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
bool skip_metadata,
@@ -84,7 +83,6 @@ query_options::query_options(const cql_config& cfg,
cql_serialization_format sf)
: _cql_config(cfg)
, _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values(std::move(values))
, _value_views()
@@ -97,7 +95,6 @@ query_options::query_options(const cql_config& cfg,
query_options::query_options(const cql_config& cfg,
db::consistency_level consistency,
const ::timeout_config& timeout_config,
std::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value_view> value_views,
bool skip_metadata,
@@ -105,7 +102,6 @@ query_options::query_options(const cql_config& cfg,
cql_serialization_format sf)
: _cql_config(cfg)
, _consistency(consistency)
, _timeout_config(timeout_config)
, _names(std::move(names))
, _values()
, _value_views(std::move(value_views))
@@ -115,12 +111,11 @@ query_options::query_options(const cql_config& cfg,
{
}
query_options::query_options(db::consistency_level cl, const ::timeout_config& timeout_config, std::vector<cql3::raw_value> values,
query_options::query_options(db::consistency_level cl, std::vector<cql3::raw_value> values,
specific_options options)
: query_options(
default_cql_config,
cl,
timeout_config,
{},
std::move(values),
false,
@@ -133,7 +128,6 @@ query_options::query_options(db::consistency_level cl, const ::timeout_config& t
query_options::query_options(std::unique_ptr<query_options> qo, lw_shared_ptr<service::pager::paging_state> paging_state)
: query_options(qo->_cql_config,
qo->_consistency,
qo->get_timeout_config(),
std::move(qo->_names),
std::move(qo->_values),
std::move(qo->_value_views),
@@ -146,7 +140,6 @@ query_options::query_options(std::unique_ptr<query_options> qo, lw_shared_ptr<se
query_options::query_options(std::unique_ptr<query_options> qo, lw_shared_ptr<service::pager::paging_state> paging_state, int32_t page_size)
: query_options(qo->_cql_config,
qo->_consistency,
qo->get_timeout_config(),
std::move(qo->_names),
std::move(qo->_values),
std::move(qo->_value_views),
@@ -158,7 +151,7 @@ query_options::query_options(std::unique_ptr<query_options> qo, lw_shared_ptr<se
query_options::query_options(std::vector<cql3::raw_value> values)
: query_options(
db::consistency_level::ONE, infinite_timeout_config, std::move(values))
db::consistency_level::ONE, std::move(values))
{}
void query_options::prepare(const std::vector<lw_shared_ptr<column_specification>>& specs)

View File

@@ -51,7 +51,6 @@
#include "cql3/column_identifier.hh"
#include "cql3/values.hh"
#include "cql_serialization_format.hh"
#include "timeout_config.hh"
namespace cql3 {
@@ -75,7 +74,6 @@ public:
private:
const cql_config& _cql_config;
const db::consistency_level _consistency;
const timeout_config& _timeout_config;
const std::optional<std::vector<sstring_view>> _names;
std::vector<cql3::raw_value> _values;
std::vector<cql3::raw_value_view> _value_views;
@@ -83,7 +81,16 @@ private:
const specific_options _options;
cql_serialization_format _cql_serialization_format;
std::optional<std::vector<query_options>> _batch_options;
// We must use the same microsecond-precision timestamp for
// all cells created by an LWT statement or when a statement
// has a user-provided timestamp. In case the statement or
// a BATCH appends many values to a list, each value should
// get a unique and monotonic timeuuid. This sequence is
// used to make all time-based UUIDs:
// 1) share the same microsecond,
// 2) monotonic
// 3) unique.
mutable int _list_append_seq = 0;
private:
/**
* @brief Batch query_options constructor.
@@ -109,7 +116,6 @@ public:
explicit query_options(const cql_config& cfg,
db::consistency_level consistency,
const timeout_config& timeouts,
std::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
bool skip_metadata,
@@ -117,7 +123,6 @@ public:
cql_serialization_format sf);
explicit query_options(const cql_config& cfg,
db::consistency_level consistency,
const timeout_config& timeouts,
std::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value> values,
std::vector<cql3::raw_value_view> value_views,
@@ -126,7 +131,6 @@ public:
cql_serialization_format sf);
explicit query_options(const cql_config& cfg,
db::consistency_level consistency,
const timeout_config& timeouts,
std::optional<std::vector<sstring_view>> names,
std::vector<cql3::raw_value_view> value_views,
bool skip_metadata,
@@ -158,13 +162,10 @@ public:
// forInternalUse
explicit query_options(std::vector<cql3::raw_value> values);
explicit query_options(db::consistency_level, const timeout_config& timeouts,
std::vector<cql3::raw_value> values, specific_options options = specific_options::DEFAULT);
explicit query_options(db::consistency_level, std::vector<cql3::raw_value> values, specific_options options = specific_options::DEFAULT);
explicit query_options(std::unique_ptr<query_options>, lw_shared_ptr<service::pager::paging_state> paging_state);
explicit query_options(std::unique_ptr<query_options>, lw_shared_ptr<service::pager::paging_state> paging_state, int32_t page_size);
const timeout_config& get_timeout_config() const { return _timeout_config; }
db::consistency_level get_consistency() const {
return _consistency;
}
@@ -241,6 +242,39 @@ public:
return _cql_config;
}
// Generate a next unique list sequence for list append, e.g.
// a = a + [val1, val2, ...]
int next_list_append_seq() const {
return _list_append_seq++;
}
// To preserve prepend monotonicity within a batch, each next
// value must get a timestamp that's smaller than the previous one:
// BEGIN BATCH
// UPDATE t SET l = [1, 2] + l WHERE pk = 0;
// UPDATE t SET l = [3] + l WHERE pk = 0;
// UPDATE t SET l = [4] + l WHERE pk = 0;
// APPLY BATCH
// SELECT l FROM t WHERE pk = 0;
// l
// ------------
// [4, 3, 1, 2]
//
// This function reserves the given number of prepend entries
// and returns an id for the first prepended entry (it
// got to be the smallest one, to preserve the order of
// a multi-value append).
//
// @retval sequence number of the first entry of a multi-value
// append. To get the next value, add 1.
int next_list_prepend_seq(int num_entries, int max_entries) const {
if (_list_append_seq + num_entries < max_entries) {
_list_append_seq += num_entries;
return max_entries - _list_append_seq;
}
return max_entries;
}
void prepare(const std::vector<lw_shared_ptr<column_specification>>& specs);
private:
void fill_value_views();
@@ -258,7 +292,7 @@ query_options::query_options(query_options&& o, std::vector<OneMutationDataRange
std::vector<query_options> tmp;
tmp.reserve(values_ranges.size());
std::transform(values_ranges.begin(), values_ranges.end(), std::back_inserter(tmp), [this](auto& values_range) {
return query_options(_cql_config, _consistency, _timeout_config, {}, std::move(values_range), _skip_metadata, _options, _cql_serialization_format);
return query_options(_cql_config, _consistency, {}, std::move(values_range), _skip_metadata, _options, _cql_serialization_format);
});
_batch_options = std::move(tmp);
}

View File

@@ -84,11 +84,12 @@ public:
}
};
query_processor::query_processor(service::storage_proxy& proxy, database& db, service::migration_notifier& mn, query_processor::memory_config mcfg, cql_config& cql_cfg)
query_processor::query_processor(service::storage_proxy& proxy, database& db, service::migration_notifier& mn, service::migration_manager& mm, query_processor::memory_config mcfg, cql_config& cql_cfg)
: _migration_subscriber{std::make_unique<migration_subscriber>(this)}
, _proxy(proxy)
, _db(db)
, _mnotifier(mn)
, _mm(mm)
, _cql_config(cql_cfg)
, _internal_state(new internal_state())
, _prepared_cache(prep_cache_log, mcfg.prepared_statment_cache_size)
@@ -527,7 +528,7 @@ query_processor::process_authorized_statement(const ::shared_ptr<cql_statement>
statement->validate(_proxy, client_state);
auto fut = statement->execute(_proxy, query_state, options);
auto fut = statement->execute(*this, query_state, options);
return fut.then([statement] (auto msg) {
if (msg) {
@@ -619,7 +620,6 @@ query_options query_processor::make_internal_options(
const statements::prepared_statement::checked_weak_ptr& p,
const std::initializer_list<data_value>& values,
db::consistency_level cl,
const timeout_config& timeout_config,
int32_t page_size) const {
if (p->bound_names.size() != values.size()) {
throw std::invalid_argument(
@@ -643,11 +643,10 @@ query_options query_processor::make_internal_options(
api::timestamp_type ts = api::missing_timestamp;
return query_options(
cl,
timeout_config,
bound_values,
cql3::query_options::specific_options{page_size, std::move(paging_state), serial_consistency, ts});
}
return query_options(cl, timeout_config, bound_values);
return query_options(cl, bound_values);
}
statements::prepared_statement::checked_weak_ptr query_processor::prepare_internal(const sstring& query_string) {
@@ -668,10 +667,13 @@ struct internal_query_state {
bool more_results = true;
};
::shared_ptr<internal_query_state> query_processor::create_paged_state(const sstring& query_string,
const std::initializer_list<data_value>& values, int32_t page_size) {
::shared_ptr<internal_query_state> query_processor::create_paged_state(
const sstring& query_string,
db::consistency_level cl,
const std::initializer_list<data_value>& values,
int32_t page_size) {
auto p = prepare_internal(query_string);
auto opts = make_internal_options(p, values, db::consistency_level::ONE, infinite_timeout_config, page_size);
auto opts = make_internal_options(p, values, cl, page_size);
::shared_ptr<internal_query_state> res = ::make_shared<internal_query_state>(
internal_query_state{
query_string,
@@ -750,7 +752,7 @@ future<> query_processor::for_each_cql_result(
future<::shared_ptr<untyped_result_set>>
query_processor::execute_paged_internal(::shared_ptr<internal_query_state> state) {
return state->p->statement->execute(_proxy, *_internal_state, *state->opts).then(
return state->p->statement->execute(*this, *_internal_state, *state->opts).then(
[state, this](::shared_ptr<cql_transport::messages::result_message> msg) mutable {
class visitor : public result_message::visitor_base {
::shared_ptr<internal_query_state> _state;
@@ -789,7 +791,16 @@ future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(
const sstring& query_string,
db::consistency_level cl,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& values,
bool cache) {
return execute_internal(query_string, cl, *_internal_state, values, cache);
}
future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(
const sstring& query_string,
db::consistency_level cl,
service::query_state& query_state,
const std::initializer_list<data_value>& values,
bool cache) {
@@ -797,13 +808,13 @@ query_processor::execute_internal(
log.trace("execute_internal: {}\"{}\" ({})", cache ? "(cached) " : "", query_string, ::join(", ", values));
}
if (cache) {
return execute_with_params(prepare_internal(query_string), cl, timeout_config, values);
return execute_with_params(prepare_internal(query_string), cl, query_state, values);
} else {
auto p = parse_statement(query_string)->prepare(_db, _cql_stats);
p->statement->raw_cql_statement = query_string;
p->statement->validate(_proxy, *_internal_state);
auto checked_weak_ptr = p->checked_weak_from_this();
return execute_with_params(std::move(checked_weak_ptr), cl, timeout_config, values).finally([p = std::move(p)] {});
return execute_with_params(std::move(checked_weak_ptr), cl, query_state, values).finally([p = std::move(p)] {});
}
}
@@ -811,11 +822,11 @@ future<::shared_ptr<untyped_result_set>>
query_processor::execute_with_params(
statements::prepared_statement::checked_weak_ptr p,
db::consistency_level cl,
const timeout_config& timeout_config,
service::query_state& query_state,
const std::initializer_list<data_value>& values) {
auto opts = make_internal_options(p, values, cl, timeout_config);
return do_with(std::move(opts), [this, p = std::move(p)](auto & opts) {
return p->statement->execute(_proxy, *_internal_state, opts).then([](auto msg) {
auto opts = make_internal_options(p, values, cl);
return do_with(std::move(opts), [this, &query_state, p = std::move(p)](auto & opts) {
return p->statement->execute(*this, query_state, opts).then([](auto msg) {
return make_ready_future<::shared_ptr<untyped_result_set>>(::make_shared<untyped_result_set>(msg));
});
});
@@ -843,7 +854,7 @@ query_processor::execute_batch(
}
log.trace("execute_batch({}): {}", batch->get_statements().size(), oss.str());
}
return batch->execute(_proxy, query_state, options);
return batch->execute(*this, query_state, options);
});
});
}
@@ -932,20 +943,22 @@ bool query_processor::migration_subscriber::should_invalidate(
sstring ks_name,
std::optional<sstring> cf_name,
::shared_ptr<cql_statement> statement) {
return statement->depends_on_keyspace(ks_name) && (!cf_name || statement->depends_on_column_family(*cf_name));
return statement->depends_on(ks_name, cf_name);
}
future<> query_processor::query(
future<> query_processor::query_internal(
const sstring& query_string,
db::consistency_level cl,
const std::initializer_list<data_value>& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f) {
return for_each_cql_result(create_paged_state(query_string, values), std::move(f));
return for_each_cql_result(create_paged_state(query_string, cl, values, page_size), std::move(f));
}
future<> query_processor::query(
future<> query_processor::query_internal(
const sstring& query_string,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f) {
return for_each_cql_result(create_paged_state(query_string, {}), std::move(f));
return query_internal(query_string, db::consistency_level::ONE, {}, 1000, std::move(f));
}
}

View File

@@ -58,6 +58,10 @@
#include "service/query_state.hh"
#include "transport/messages/result_message.hh"
namespace service {
class migration_manager;
}
namespace cql3 {
namespace statements {
@@ -115,6 +119,7 @@ private:
service::storage_proxy& _proxy;
database& _db;
service::migration_notifier& _mnotifier;
service::migration_manager& _mm;
const cql_config& _cql_config;
struct stats {
@@ -149,7 +154,7 @@ public:
static std::unique_ptr<statements::raw::parsed_statement> parse_statement(const std::string_view& query);
query_processor(service::storage_proxy& proxy, database& db, service::migration_notifier& mn, memory_config mcfg, cql_config& cql_cfg);
query_processor(service::storage_proxy& proxy, database& db, service::migration_notifier& mn, service::migration_manager& mm, memory_config mcfg, cql_config& cql_cfg);
~query_processor();
@@ -165,16 +170,19 @@ public:
return _proxy;
}
const service::migration_manager& get_migration_manager() const noexcept { return _mm; }
service::migration_manager& get_migration_manager() noexcept { return _mm; }
cql_stats& get_cql_stats() {
return _cql_stats;
}
statements::prepared_statement::checked_weak_ptr get_prepared(const std::optional<auth::authenticated_user>& user, const prepared_cache_key_type& key) {
if (user) {
auto it = _authorized_prepared_cache.find(*user, key);
if (it != _authorized_prepared_cache.end()) {
auto vp = _authorized_prepared_cache.find(*user, key);
if (vp) {
try {
return it->get()->checked_weak_from_this();
return vp->get()->checked_weak_from_this();
} catch (seastar::checked_ptr_is_null_exception&) {
// If the prepared statement got invalidated - remove the corresponding authorized_prepared_statements_cache entry as well.
_authorized_prepared_cache.remove(*user, key);
@@ -185,11 +193,7 @@ public:
}
statements::prepared_statement::checked_weak_ptr get_prepared(const prepared_cache_key_type& key) {
auto it = _prepared_cache.find(key);
if (it == _prepared_cache.end()) {
return statements::prepared_statement::checked_weak_ptr();
}
return *it;
return _prepared_cache.find(key);
}
future<::shared_ptr<cql_transport::messages::result_message>>
@@ -215,8 +219,7 @@ public:
// creating namespaces, etc) is explicitly forbidden via this interface.
future<::shared_ptr<untyped_result_set>>
execute_internal(const sstring& query_string, const std::initializer_list<data_value>& values = { }) {
return execute_internal(query_string, db::consistency_level::ONE,
infinite_timeout_config, values, true);
return execute_internal(query_string, db::consistency_level::ONE, values, true);
}
statements::prepared_statement::checked_weak_ptr prepare_internal(const sstring& query);
@@ -224,75 +227,49 @@ public:
/*!
* \brief iterate over all cql results using paging
*
* You Create a statement with optional paraemter and pass
* a function that goes over the results.
* You create a statement with optional parameters and pass
* a function that goes over the result rows.
*
* The passed function would be called for all the results, return stop_iteration::yes
* to stop during iteration.
* The passed function would be called for all rows; return future<stop_iteration::yes>
* to stop iteration.
*
* For example:
return query("SELECT * from system.compaction_history",
[&history] (const cql3::untyped_result_set::row& row) mutable {
....
....
return stop_iteration::no;
});
* You can use place holder in the query, the prepared statement will only be done once.
*
*
* query_string - the cql string, can contain place holder
* f - a function to be run on each of the query result, if the function return false the iteration would stop
* args - arbitrary number of query parameters
*/
template<typename... Args>
future<> query(
const sstring& query_string,
std::function<stop_iteration(const cql3::untyped_result_set_row&)>&& f,
Args&&... args) {
return for_each_cql_result(
create_paged_state(query_string, { data_value(std::forward<Args>(args))... }), std::move(f));
}
/*!
* \brief iterate over all cql results using paging
*
* You Create a statement with optional paraemter and pass
* a function that goes over the results.
*
* The passed function would be called for all the results, return future<stop_iteration::yes>
* to stop during iteration.
*
* For example:
return query("SELECT * from system.compaction_history",
[&history] (const cql3::untyped_result_set::row& row) mutable {
return query_internal(
"SELECT * from system.compaction_history",
db::consistency_level::ONE,
{},
[&history] (const cql3::untyped_result_set::row& row) mutable {
....
....
return make_ready_future<stop_iteration>(stop_iteration::no);
});
* You can use place holder in the query, the prepared statement will only be done once.
* You can use placeholders in the query, the statement will only be prepared once.
*
*
* query_string - the cql string, can contain place holder
* values - query parameters value
* f - a function to be run on each of the query result, if the function return stop_iteration::no the iteration
* would stop
* query_string - the cql string, can contain placeholders
* cl - consistency level of the query
* values - values to be substituted for the placeholders in the query
* page_size - maximum page size
* f - a function to be run on each row of the query result,
* if the function returns stop_iteration::yes the iteration will stop
*/
future<> query(
future<> query_internal(
const sstring& query_string,
db::consistency_level cl,
const std::initializer_list<data_value>& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f);
/*
* \brief iterate over all cql results using paging
* An overload of the query with future function without query parameters.
* An overload of query_internal without query parameters
* using CL = ONE, no timeout, and page size = 1000.
*
* query_string - the cql string, can contain place holder
* f - a function to be run on each of the query result, if the function return stop_iteration::no the iteration
* would stop
* query_string - the cql string, can contain placeholders
* f - a function to be run on each row of the query result,
* if the function returns stop_iteration::yes the iteration will stop
*/
future<> query(
future<> query_internal(
const sstring& query_string,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)>&& f);
@@ -305,14 +282,19 @@ public:
future<::shared_ptr<untyped_result_set>> execute_internal(
const sstring& query_string,
db::consistency_level,
const timeout_config& timeout_config,
const std::initializer_list<data_value>& = { },
bool cache = false);
future<::shared_ptr<untyped_result_set>> execute_internal(
const sstring& query_string,
db::consistency_level,
service::query_state& query_state,
const std::initializer_list<data_value>& = { },
bool cache = false);
future<::shared_ptr<untyped_result_set>> execute_with_params(
statements::prepared_statement::checked_weak_ptr p,
db::consistency_level,
const timeout_config& timeout_config,
service::query_state& query_state,
const std::initializer_list<data_value>& = { });
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
@@ -341,7 +323,6 @@ private:
const statements::prepared_statement::checked_weak_ptr& p,
const std::initializer_list<data_value>&,
db::consistency_level,
const timeout_config& timeout_config,
int32_t page_size = -1) const;
future<::shared_ptr<cql_transport::messages::result_message>>
@@ -354,8 +335,9 @@ private:
*/
::shared_ptr<internal_query_state> create_paged_state(
const sstring& query_string,
const std::initializer_list<data_value>& = { },
int32_t page_size = 1000);
db::consistency_level,
const std::initializer_list<data_value>&,
int32_t page_size);
/*!
* \brief run a query using paging

View File

@@ -377,21 +377,24 @@ protected:
class multi_column_restriction::slice final : public multi_column_restriction {
using restriction_shared_ptr = ::shared_ptr<clustering_key_restrictions>;
private:
using mode = expr::comparison_order;
term_slice _slice;
mode _mode;
slice(schema_ptr schema, std::vector<const column_definition*> defs, term_slice slice)
slice(schema_ptr schema, std::vector<const column_definition*> defs, term_slice slice, mode m)
: multi_column_restriction(schema, std::move(defs))
, _slice(slice)
, _mode(m)
{ }
public:
slice(schema_ptr schema, std::vector<const column_definition*> defs, statements::bound bound, bool inclusive, shared_ptr<term> term)
: slice(schema, defs, term_slice::new_instance(bound, inclusive, term))
slice(schema_ptr schema, std::vector<const column_definition*> defs, statements::bound bound, bool inclusive, shared_ptr<term> term, mode m = mode::cql)
: slice(schema, defs, term_slice::new_instance(bound, inclusive, term), m)
{
expression = expr::binary_operator{
std::vector<expr::column_value>(defs.cbegin(), defs.cend()),
expr::pick_operator(bound, inclusive),
std::move(term)};
std::move(term),
m};
}
virtual bool is_supported_by(const secondary_index::index& index) const override {
@@ -404,7 +407,7 @@ public:
}
virtual std::vector<bounds_range_type> bounds_ranges(const query_options& options) const override {
if (!is_mixed_order()) {
if (_mode == mode::clustering || !is_mixed_order()) {
return bounds_ranges_unified_order(options);
} else {
return bounds_ranges_mixed_order(options);
@@ -440,6 +443,11 @@ public:
get_columns_in_commons(other));
auto other_slice = static_pointer_cast<slice>(other);
static auto mode2str = [](auto m) { return m == mode::cql ? "plain" : "SCYLLA_CLUSTERING_BOUND"; };
check_true(other_slice->_mode == this->_mode,
"Invalid combination of restrictions (%s / %s)",
mode2str(this->_mode), mode2str(other_slice->_mode)
);
check_false(_slice.has_bound(statements::bound::START) && other_slice->_slice.has_bound(statements::bound::START),
"More than one restriction was found for the start bound on %s",
get_columns_in_commons(other));
@@ -484,7 +492,7 @@ private:
auto end_prefix = clustering_key_prefix::from_optional_exploded(*_schema, end_components);
end_bound = bounds_range_type::bound(std::move(end_prefix), _slice.is_inclusive(statements::bound::END));
}
if (!is_asc_order()) {
if (_mode == mode::cql && !is_asc_order()) {
std::swap(start_bound, end_bound);
}
auto range = bounds_range_type(start_bound, end_bound);

Some files were not shown because too many files have changed in this diff Show More