Compare commits

3401 Commits

Author SHA1 Message Date
Avi Kivity
f0a8465345 Update seastar submodule (json crash in describe_ring)
* seastar 0568c231cd...fd0d7c1c9a (1):
  > Merge 'stream_range_as_array: always close output stream' from Benny Halevy

Fixes #10592.
2022-06-08 16:50:51 +03:00
Beni Peled
1e6fe6391f release: prepare for 4.5.7 2022-05-16 15:27:49 +03:00
Benny Halevy
237c67367c table: clear: serialize with ongoing flush
Get all flush permits to serialize with any
ongoing flushes and prevent further flushes
during table::clear, in particular calling
discard_completed_segments for every table and
clearing the memtables in clear_and_add.

Fixes #10423

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit aae532a96b)
2022-05-15 13:44:15 +03:00
Raphael S. Carvalho
f5707399c3 compaction: LCS: don't write to disengaged optional on compaction completion
Dtest triggers the problem by:
1) creating table with LCS
2) disabling regular compaction
3) writing a few sstables
4) running maintenance compaction, e.g. cleanup

Once the maintenance compaction completes, disengaged optional _last_compacted_keys
triggers an exception in notify_completion().

_last_compacted_keys is used by regular compaction for its round-robin file picking
policy. It stores the last compacted key for each level. Meaning it's
irrelevant for any other compaction type.

Regular compaction is responsible for initializing it when it runs for
the first time to pick files. But with it disabled, notify_completion()
will find it uninitialized, therefore resulting in bad_optional_access.

To fix this, the procedure is skipped if _last_compacted_keys is
disengaged. Regular compaction, once re-enabled, will be able to
fill _last_compacted_keys by looking at metadata of the files.
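
A minimal Python sketch of the guard described above, modelling the disengaged optional as `None`. The class and method bodies are illustrative assumptions, not Scylla's C++ implementation; only the names `_last_compacted_keys` and `notify_completion()` come from the commit message.

```python
from typing import List, Optional


class LeveledStrategyState:
    """Toy model of the per-level bookkeeping (hypothetical class)."""

    def __init__(self) -> None:
        # Stays None until regular compaction runs for the first time.
        self.last_compacted_keys: Optional[List[str]] = None

    def init_for_regular_compaction(self, levels: int) -> None:
        self.last_compacted_keys = [""] * levels

    def notify_completion(self, level: int, last_key: str) -> bool:
        # Skip the update when the state was never initialized, e.g. after
        # a cleanup that ran while regular compaction was disabled.
        if self.last_compacted_keys is None:
            return False
        self.last_compacted_keys[level] = last_key
        return True
```

Without the `None` check, the assignment would fail on the uninitialized state, which is the Python analogue of the bad_optional_access mentioned above.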

compaction_test.py::TestCompaction::test_disable_autocompaction_doesnt_
block_user_initiated_compactions[CLEANUP-LeveledCompactionStrategy]
now passes.

Fixes #10378.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #10508

(cherry picked from commit 8e99d3912e)
2022-05-15 13:20:40 +03:00
Juliusz Stasiewicz
4c28744bfd CQL: Replace assert by exception on invalid auth opcode
One user observed this assertion fail, but it's an extremely rare event.
The root cause - interlacing of processing STARTUP and OPTIONS messages -
is still there, but now it's harmless enough to leave it as is.

Fixes #10487

Closes #10503

(cherry picked from commit 603dd72f9e)
2022-05-10 14:09:54 +02:00
Eliran Sinvani
3b253e1166 prepared_statements: Invalidate batch statement too
It seems that batch prepared statements always return false for
depends_on. This in turn makes the removal criterion for the
prepared statements cache always false, which results in the
queries not being evicted.
Here we change the function to return the true state, meaning
it will return true if one of the sub-queries is dependent
upon the keyspace and/or column family.
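
A minimal Python sketch of the behaviour described above, assuming a `depends_on` check that takes a keyspace name and an optional table name. The classes are hypothetical stand-ins for the C++ statement types.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Statement:
    keyspace: str
    table: str

    def depends_on(self, ks: str, cf: Optional[str] = None) -> bool:
        # Depends on the keyspace, and on the table when one is given.
        return self.keyspace == ks and (cf is None or self.table == cf)


@dataclass
class BatchStatement:
    statements: List[Statement] = field(default_factory=list)

    def depends_on(self, ks: str, cf: Optional[str] = None) -> bool:
        # Report the true state: any dependent sub-statement counts.
        return any(s.depends_on(ks, cf) for s in self.statements)
```

A batch that always answered `False` here would never match the cache's removal predicate, so it would survive a schema change.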

Fixes #10129

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
(cherry picked from commit 4eb0398457)
2022-05-08 12:37:00 +03:00
Eliran Sinvani
2dedfacabf cql3 statements: Change dependency test API to better express its
purpose

CQL statements used to have two API functions, depends_on_keyspace and
depends_on_column_family. The latter took as a parameter only a table
name, which makes no sense. There could be multiple tables with the same
name, each in a different keyspace, and it doesn't make sense to
generalize the test, i.e. to ask "Does a statement depend on any table
named XXX?"
In this change we unify the two calls into one, depends_on, which takes a
keyspace name and optionally also a table name. That way every logical
dependency test that makes sense is supported by a single API call.

(cherry picked from commit bf50dbd35b)

Ref #10129
2022-05-08 12:36:48 +03:00
Calle Wilund
414e2687d4 cdc: Ensure columns removed from log table are registered as dropped
If we are redefining the log table, we need to ensure any dropped
columns are registered in "dropped_columns" table, otherwise clients will not
be able to read data older than now.
Includes unit test.

Should probably be backported to all CDC enabled versions.

Fixes #10473
Closes #10474

(cherry picked from commit 78350a7e1b)

Backport notes: removed cql-pytest test from the original commit, since
the cql-pytest framework on branch-4.5 is missing the `nodetool`
package.
2022-05-05 11:31:33 +02:00
Tomasz Grabiec
3d63f12d3b loading_cache: Make invalidation take immediate effect
There are two issues with current implementation of remove/remove_if:

  1) If it happens concurrently with get_ptr(), the latter may still
  populate the cache using value obtained from before remove() was
  called. remove() is used to invalidate caches, e.g. the prepared
  statements cache, and the expected semantic is that values
  calculated from before remove() should not be present in the cache
  after invalidation.

  2) As long as there is any active pointer to the cached value
  (obtained by get_ptr()), the old value from before remove() will be
  still accessible and returned by get_ptr(). This can make remove()
  have no effect indefinitely if there is persistent use of the cache.

One of the user-perceived effects of this bug is that some prepared
statements may not get invalidated after a schema change and still use
the old schema (until next invalidation). If the schema change was
modifying UDT, this can cause statement execution failures. CQL
coordinator will try to interpret bound values using old set of
fields. If the driver uses the new schema, the coordinator will fail
to process the value with the following exception:

  User Defined Type value contained too many fields (expected 5, got 6)

The patch fixes the problem by making remove()/remove_if() erase old
entries from _loading_values immediately.

The predicate-based remove_if() variant has to also invalidate values
which are concurrently loading to be safe. The predicate cannot be
evaluated on values which are not ready. This may invalidate some
values unnecessarily, but I think it's fine.
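
The first issue can be illustrated with a toy asyncio cache. This sketch uses a per-key generation counter, which is a substitute technique chosen for brevity and not what the patch does (the patch erases entries from _loading_values). All names below are hypothetical.

```python
import asyncio


class LoadingCache:
    """Toy async cache; a generation counter stands in for the real fix."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._generation = {}  # key -> number of remove() calls so far

    async def get(self, key):
        if key in self._values:
            return self._values[key]
        generation = self._generation.get(key, 0)
        value = await self._loader(key)  # remove() may run while we wait
        # Populate the cache only if no invalidation happened meanwhile.
        if self._generation.get(key, 0) == generation:
            self._values[key] = value
        return value

    def remove(self, key):
        self._values.pop(key, None)
        self._generation[key] = self._generation.get(key, 0) + 1

    def cached(self, key):
        return key in self._values


async def demo():
    gate = asyncio.Event()

    async def slow_loader(key):
        await gate.wait()
        return "old schema"

    cache = LoadingCache(slow_loader)
    pending = asyncio.create_task(cache.get("stmt"))
    await asyncio.sleep(0)  # let the load start
    cache.remove("stmt")    # invalidate while the load is in flight
    gate.set()
    await pending
    return cache.cached("stmt")
```

Running `asyncio.run(demo())` shows that a value computed before `remove()` is handed to the waiting caller but is not left behind in the cache.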

Fixes #10117

Message-Id: <20220309135902.261734-1-tgrabiec@scylladb.com>
(cherry picked from commit 8fa704972f)
2022-05-04 15:50:16 +03:00
Vlad Zolotarov
e09f2d0ea0 loading_shared_values/loading_cache: get rid of iterators interface and return value_ptr from find(...) instead
The iterators interface of loading_shared_values/loading_cache is dangerous/fragile because
an iterator doesn't "lock" the entry it points to. If there is a
preemption point between acquiring a non-end() iterator and
dereferencing it, the corresponding cache entry may have already been evicted (for
whatever reason, e.g. cache size constraints or expiration), and then
dereferencing may end up in a use-after-free. We don't have any
protection against it in the value_extractor_fn today.

And this is in addition to #8920.

So, instead of trying to fix the iterator interface, this patch kills two
birds with one stone: we are ditching the iterators interface
completely and returning value_ptr from find(...) instead - the same one we
are returning from the loading_cache::get_ptr(...) asynchronous APIs.

A similar rework is done to the loading_shared_values that loading_cache is
based on: we drop the iterators interface and return
loading_shared_values::entry_ptr from find(...) instead.

loading_cache::value_ptr already takes care of "lock"ing the returned value so that it
would remain readable even if it's evicted from the cache by the time
one tries to read it. And of course it also takes care of updating the
last read time stamp and moving the corresponding item to the top of the
MRU list.

Fixes #8920

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20210817222404.3097708-1-vladz@scylladb.com>
(cherry picked from commit 7bd1bcd779)

[avi: prerequisite to backporting #10117]
2022-05-04 15:42:21 +03:00
Avi Kivity
484b23c08e Merge 'replica/database: drop_column_family(): properly cleanup stale querier cache entries' from Botond Dénes
Said method has to evict all querier cache entries, belonging to the to-be-dropped table. This is already the case, but there was a window where new entries could sneak in, causing a stale reference to the table to be de-referenced later when they are evicted due to TTL. This window is now closed, the entries are evicted after the method has waited for all ongoing operations on said table to stop.

Fixes: #10450

Closes #10451

* github.com:scylladb/scylla:
  replica/database: drop_column_family(): drop querier cache entries after waiting for ops
  replica/database: finish coroutinizing drop_column_family()
  replica/database: make remove(const column_family&) private

(cherry picked from commit 7f1e368e92)
2022-05-01 18:41:36 +03:00
Avi Kivity
97bcbd3c1f Update tools/java submodule (bad IPv6 addresses in nodetool)
* tools/java 8700c89b07...42151ec974 (1):
  > CASSANDRA-17581 fix NodeProbe: Malformed IPv6 address at index

Fixes #10442
2022-04-28 11:36:12 +03:00
Yaron Kaikov
85d3fe744b release: prepare for 4.5.6 2022-04-24 14:42:47 +03:00
Tomasz Grabiec
48d4759fad utils/chunked_vector: Fix sigsegv during reserve()
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.

Found during code review, no known user impact.

Fixes #10363.

Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com>
(cherry picked from commit 01eeb33c6e)

[avi: make max_chunk_capacity() public for backport]
2022-04-13 10:30:57 +03:00
Avi Kivity
c4e8d2e761 transport: return correct error codes when downgrading v4 {WRITE,READ}_FAILURE to {WRITE,READ}_TIMEOUT
Protocol v4 added WRITE_FAILURE and READ_FAILURE. When running under v3
we downgrade these exceptions to WRITE_TIMEOUT and READ_TIMEOUT (since
the client won't understand the v4 errors), but we still send the new
error codes. This causes the client to become confused.

Fix by updating the error codes.

A better fix is to move the error code from the constructor parameter
list and hard-code it in the constructor, but that is left for a follow-up
after this minimal fix.
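
A minimal Python sketch of the downgrade, using the numeric error codes from the CQL binary protocol specification. The function name and the table-driven shape are illustrative assumptions; the actual fix lives in the C++ exception constructors.

```python
# Error codes as defined by the CQL binary protocol specification.
WRITE_TIMEOUT = 0x1100
READ_TIMEOUT = 0x1200
READ_FAILURE = 0x1300
WRITE_FAILURE = 0x1500

# v4-only errors and the v3 errors they are downgraded to.
_V3_DOWNGRADE = {WRITE_FAILURE: WRITE_TIMEOUT, READ_FAILURE: READ_TIMEOUT}


def error_code_for_client(protocol_version: int, code: int) -> int:
    """Return the code to put on the wire for the given protocol version."""
    if protocol_version < 4:
        return _V3_DOWNGRADE.get(code, code)
    return code
```

The bug amounted to changing the exception type for v3 clients while still sending the v4 number, so the code and the message body disagreed.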

Fixes #5610.

Closes #10362

(cherry picked from commit 987e6533d2)
2022-04-13 09:49:36 +03:00
Avi Kivity
45c93db71f Update seastar submodule (logger deadlock with large messages)
* seastar 4456fcdfc0...0568c231cd (2):
  > log: Fix silencer to be shard-local and logger-global
  > log: Silence logger when logging

Fixes #10336.
2022-04-05 19:50:13 +03:00
Piotr Sarna
bab3afca5e cql3: fix qualifying restrictions with IN for indexing
When a query contains IN restriction on its partition key,
it's currently not eligible for indexing. It was however
erroneously qualified as such, which led to fetching incorrect
results. This commit fixes the issue by not allowing such queries
to undergo indexing, and comes with a regression test.

Fixes #10300

Closes #10302

(cherry picked from commit c0fd53a9d7)
(cherry picked from commit ded169476b0498aeedeeca40a612f2cebb54348a)
2022-04-04 10:27:25 +02:00
Avi Kivity
8a0f3cc136 Update seastar submodule (pidof command not installed)
* seastar 7487cae646...4456fcdfc0 (1):
  > seastar-cpu-map.sh: switch from pidof to pgrep
Fixes #10238.
2022-03-29 12:41:53 +03:00
Yaron Kaikov
c789a18195 release: prepare for 4.5.5 2022-03-28 10:44:46 +03:00
Benny Halevy
195c8a6f80 atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal
Following up on a57c087c89,
compare_atomic_cell_for_merge should compare the ttl value in the
reverse order since, when comparing two cells that are identical
in all attributes but their ttl, we want to keep the cell with the
smaller ttl value rather than the larger ttl, since it was written
at a later (wall-clock) time, and so would remain longer after it
expires, until purged after gc_grace seconds.
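
A simplified Python sketch of the tie-break described above. It covers only the expiry and ttl comparison of two live, expiring cells; the earlier comparison steps (timestamp, liveness, value) are omitted, and the assumption that the later expiry wins is mine, not stated in this message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    expiry: int  # absolute expiry time, in seconds
    ttl: int     # time-to-live, in seconds


def compare_expiring_cells(left: Cell, right: Cell) -> int:
    """Return >0 if left wins the merge, <0 if right wins, 0 if equal."""
    if left.expiry != right.expiry:
        return 1 if left.expiry > right.expiry else -1  # assumed ordering
    if left.ttl != right.ttl:
        # Reverse order: the smaller ttl was written later, so it wins.
        return 1 if left.ttl < right.ttl else -1
    return 0
```

With equal expiry, a cell with ttl 600 was written 300 seconds after one with ttl 900, which is why the smaller value is kept.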

Fixes #10173

Test: mutation_test.test_cell_ordering, unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220306091913.106508-1-bhalevy@scylladb.com>
(cherry picked from commit a085ef74ff)
2022-03-24 18:18:15 +02:00
Benny Halevy
8a263600dd atomic_cell: compare_atomic_cell_for_merge: compare ttl if expiry is equal
Unlike atomic_cell_or_collection::equals, compare_atomic_cell_for_merge
currently returns std::strong_ordering::equal if two cells are equal in
every way except their ttl:s.

The problem with that is that the cells' hashes are different and this
will cause repair to keep trying to repair discrepancies caused by the
ttl being different.

This may be triggered by e.g. the spark migrator that computes the ttl
based on the expiry time by subtracting the current time from the expiry
time to produce a respective ttl.

If the cell is migrated multiple times at different times, it will generate
cells that have the same expiry (by design) but different ttl values.
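
The arithmetic can be checked with a short Python example. The function name and the numbers are made up for illustration; only the rule "ttl is derived from the expiry time and the current time" comes from the text above.

```python
def migrated_ttl(expiry: int, now: int) -> int:
    """ttl a migrator would write so that the cell still expires at `expiry`."""
    return expiry - now


# The same source cell, migrated twice at different wall-clock times.
expiry = 1000
first_ttl = migrated_ttl(expiry, now=100)   # 900
second_ttl = migrated_ttl(expiry, now=400)  # 600
```

Both copies expire at the same moment (`100 + 900 == 400 + 600`), yet their ttl fields differ, so their hashes differ and repair keeps seeing a mismatch unless the merge comparison orders them.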

Fixes #10156

Test: mutation_test.test_cell_ordering, unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
(cherry picked from commit a57c087c89)
2022-03-24 18:17:28 +02:00
Avi Kivity
da806c4451 replica, atomic_cell: move atomic_cell merge code from replica module to atomic_cell.cc
compare_atomic_cell_for_merge() was placed in database.cc, before
atomic_cell.cc existed. Move it to its correct place.

Closes #9889

(cherry picked from commit 6c53717a39)

[avi: 4.5 backport: retain pre-spaceship-operator code]
2022-03-24 18:11:42 +02:00
Benny Halevy
619bfb7c4e main: shutdown: do not abort on certain system errors
Currently any unhandled error during deferred shutdown
is rethrown in a noexcept context (in ~deferred_action),
generating a core dump.

The core dump is not helpful if the cause of the
error is "environmental", i.e. in the system, rather
than in scylla itself.

This change detects several such errors and calls
_Exit(255) to exit the process early, without leaving
a coredump behind.  Otherwise, call abort() explicitly,
rather than letting terminate() be called implicitly
by the destructor exception handling code.

Fixes #9573

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220227101054.1294368-1-bhalevy@scylladb.com>
(cherry picked from commit 132c9d5933)
2022-03-24 14:50:12 +02:00
Nadav Har'El
06795d29c5 Seastar: backport Seastar fix for missing string escape in JSON output
Backported Seastar fix:
  > Merge 'json/formatter: Escape strings' from Juliusz Stasiewicz

Fixes #9061

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-03-23 22:32:15 +02:00
Asias He
7e625f3397 storage_service: Generate view update for load and stream
Currently, views will not be updated because the streaming reason is set
to streaming::stream_reason::rebuild. On the receiver side, only
streaming with the reason streaming::stream_reason::repair will trigger
view update.

Change the stream reason to repair to trigger view update for load and
stream. This makes load_and_stream behave the same as nodetool refresh.

Note: this is not very efficient, though.

Consider RF = 3, sst1, sst2, sst3 from the older cluster. When sst1 is
loaded, it streams to 3 replica nodes, if we generate view updates, we
will have 3 view updates for this replica (each of the peer nodes finds
its peer and writes the view update to peer). After loading sst2 and
sst3, we will have 9 view updates in total for a single partition.
If we create the view after the load and stream process, we will only
have 3 view updates for a single partition.
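
The counts above follow from simple multiplication, sketched here in Python. The function names are hypothetical, and the model assumes the partition is present in every loaded sstable.

```python
def updates_when_view_exists(replication_factor: int, sstables: int) -> int:
    # Each loaded sstable is streamed to every replica, and every replica
    # generates one view update for the partition.
    return replication_factor * sstables


def updates_when_view_created_later(replication_factor: int) -> int:
    # Each replica generates a single update from the already merged data.
    return replication_factor
```

For RF = 3 and three sstables this gives 9 versus 3 view updates for the partition.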

Fixes #9205

Closes #9213

(cherry picked from commit eaf4d2afb4)
2022-03-09 16:37:13 +02:00
Nadav Har'El
865cfebfed cql: INSERT JSON should refuse empty-string partition key
Add the missing partition-key validation in INSERT JSON statements.

Scylla, following the lead of Cassandra, forbids an empty-string partition
key (please note that this is not the same as a null partition key, and
that null clustering keys *are* allowed).

Trying to INSERT, UPDATE or DELETE a partition with an empty string as
the partition key fails with a "Key may not be empty". However, we had a
loophole - you could insert such empty-string partition keys using an
"INSERT ... JSON" statement.

The problem was that the partition key validation was done in one place -
`modification_statement::build_partition_keys()`. The INSERT, UPDATE and
DELETE statements all inherited this same method and got the correct
validation. But the INSERT JSON statement - insert_prepared_json_statement
overrode the build_partition_keys() method and this override forgot to call
the validation function. So in this patch we add the missing validation.

Note that the validation function checks for more than just empty strings -
there is also a length limit for partition keys.
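
A minimal Python sketch of the bug pattern: a subclass overrides a method and has to repeat the validation that the base method performs. The class names, the JSON layout, and the 65535-byte limit are assumptions for illustration; only the "Key may not be empty" message comes from the text above.

```python
import json

MAX_KEY_SIZE = 65535  # assumed limit, for illustration only


class InvalidRequest(Exception):
    pass


def validate_partition_key(key: bytes) -> None:
    if len(key) == 0:
        raise InvalidRequest("Key may not be empty")
    if len(key) > MAX_KEY_SIZE:
        raise InvalidRequest("Key is too long")


class ModificationStatement:
    def build_partition_keys(self, pk: str) -> list:
        key = pk.encode()
        validate_partition_key(key)
        return [key]


class InsertJsonStatement(ModificationStatement):
    def build_partition_keys(self, doc: str) -> list:
        key = json.loads(doc)["pk"].encode()
        validate_partition_key(key)  # the call the override had omitted
        return [key]
```

Removing the marked line reproduces the loophole: `'{"pk": ""}'` would be accepted through the JSON path while the plain path rejects it.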

This patch also adds a cql-pytest reproducer for this bug. Before this
patch, the test passed on Cassandra but failed on Scylla.

Reported by @FortTell
Fixes #9853.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220116085216.21774-1-nyh@scylladb.com>
(cherry picked from commit 8fd5041092)
2022-03-02 23:02:01 +02:00
Botond Dénes
1963d1cc25 mutation_writer: feed_writer(): handle exceptions from consume_end_of_stream()
Currently the exception handling code of feed_writer() assumes
consume_end_of_stream() doesn't throw. This is false and an exception
from said method can currently lead to an unclean destroy of the writer
and reader. Fix by also handling exceptions from
consume_end_of_stream().

Closes #10147
2022-03-02 14:20:20 +01:00
Takuya ASADA
53b0aaa4e8 scylla_raid_setup: revert workaround patch and stop using mdmonitor
We found that the monitor mode of mdadm does not work on RAID0, and that this is
not a bug but expected behavior, according to a RHEL developer.
Therefore, we should revert the workaround patch which downgrades mdadm,
and stop enabling mdmonitor since we always use RAID0.

See #9540

Closes #10077
2022-02-17 18:36:42 +02:00
Yaron Kaikov
ebfa2279a4 release: prepare for 4.5.4 2022-02-14 08:24:59 +02:00
Nadav Har'El
5e38a69f6d alternator: allow REMOVE of non-existent nested attribute
DynamoDB allows an UpdateItem operation "REMOVE x.y" when a map x
exists in the item, but x.y doesn't - the removal silently does
nothing. Alternator incorrectly generated an error in this case,
and unfortunately we didn't have a test for this case.

So in this patch we add the missing test (which fails on Alternator
before this patch - and passes on DynamoDB) and then fix the behavior.
After this patch, "REMOVE x.y" will remain an error if "x" doesn't
exist (saying "document paths not valid for this item"), but if "x"
exists and is a map, but "x.y" doesn't, the removal will silently
do nothing and will not be an error.
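
The resulting semantics can be sketched in Python for a two-level path such as `x.y`. The function and exception names are hypothetical, deeper paths are not handled, and treating a non-map `x` as an error is an assumption.

```python
class ValidationException(Exception):
    pass


def remove_nested(item: dict, path: str) -> None:
    """Apply 'REMOVE top.leaf' to an item represented as nested dicts."""
    top, _, leaf = path.partition(".")
    if top not in item or not isinstance(item[top], dict):
        raise ValidationException("document paths not valid for this item")
    # A missing leaf is silently ignored.
    item[top].pop(leaf, None)
```

Calling `remove_nested({"x": {"a": 1}}, "x.y")` leaves the item unchanged, while the same call on an item without `x` raises.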

Fixes #10043.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207133652.181994-1-nyh@scylladb.com>
(cherry picked from commit 9982a28007)
2022-02-08 12:09:38 +02:00
Nadav Har'El
6a54033a63 docker: don't repeat "--alternator-address" option twice
If the Docker startup script is passed both "--alternator-port" and
"--alternator-https-port", a combination which is supposed to be
allowed, it passes to Scylla the "--alternator-address" option twice.
This isn't necessary, and worse - not allowed.

So this patch fixes the scyllasetup.py script to only pass this
parameter once.
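
A minimal Python sketch of building the argument list so that the address option appears once. The function name and the `--option=value` spelling are assumptions; this is not the actual scyllasetup.py code.

```python
from typing import List, Optional


def alternator_args(address: str,
                    port: Optional[int] = None,
                    https_port: Optional[int] = None) -> List[str]:
    args = []
    # Emit the address once if either port is requested.
    if port is not None or https_port is not None:
        args.append(f"--alternator-address={address}")
    if port is not None:
        args.append(f"--alternator-port={port}")
    if https_port is not None:
        args.append(f"--alternator-https-port={https_port}")
    return args
```

The bug pattern is appending the address inside each of the two port branches, which duplicates it when both ports are given.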

Fixes #10016.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220202165814.1700047-1-nyh@scylladb.com>
(cherry picked from commit cb6630040d)
2022-02-03 18:40:03 +02:00
Avi Kivity
ec44412cd9 Update seastar submodule (gratuitous exceptions on allocation failure)
* seastar e26dc6665f...770167e835 (1):
  > core: memory: Avoid current_backtrace() on alloc failure when logging suppressed

Fixes #9982.
2022-01-30 20:05:06 +02:00
Avi Kivity
8f45f65b09 Merge 'scylla_raid_setup: use mdmonitor only when RAID level > 0' from Takuya ASADA
We found that the monitor mode of mdadm does not work on RAID0, and that this is
not a bug but expected behavior, according to a RHEL developer.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.

Fixes #9540

----

This reverts 0d8f932 and introduces the correct fix.

Closes #9970

* github.com:scylladb/scylla:
  scylla_raid_setup: use mdmonitor only when RAID level > 0
  Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"

(cherry picked from commit df22396a34)
2022-01-27 10:24:53 +02:00
Avi Kivity
5a7324c423 Update tools/java submodule (maxPendingPerConnection default)
* tools/java dbcea78e7d...8700c89b07 (2):
  > Fix NullPointerException in SettingsMode
  > cassandra-stress: Remove maxPendingPerConnection default

Ref #7748.
2022-01-12 21:37:07 +02:00
Nadav Har'El
2eb0ad7b4f alternator: allow Authorization header to be without spaces
The "Authorization" HTTP header is used in DynamoDB API to sign
requests. Our parser for this header, in server::verify_signature(),
required the different components of this header to be separated by
a comma followed by a whitespace - but it turns out that in DynamoDB
both spaces and commas are optional - one of them is enough.

At least one DynamoDB client library - the old "boto" (which predated
boto3) - builds this header without spaces.

In this patch we add a test that shows that an Authorization header
with spaces removed works fine in DynamoDB but didn't work in
Alternator, and after this patch modifies the parsing code for this
header, the test begins to pass (and the other tests show that the
previously-working cases didn't break).
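
A tolerant split of the header components can be sketched in Python as follows. The function name is hypothetical and this is not the parsing code in server::verify_signature(); the header layout follows the usual AWS Signature Version 4 form.

```python
import re
from typing import Dict, Tuple


def parse_authorization(header: str) -> Tuple[str, Dict[str, str]]:
    """Split 'SCHEME k1=v1, k2=v2' where commas and spaces are both optional."""
    scheme, _, rest = header.strip().partition(" ")
    # Accept a comma, whitespace, or any mix of them as the separator.
    parts = [p for p in re.split(r"[,\s]+", rest) if p]
    return scheme, dict(p.split("=", 1) for p in parts)
```

With this split, `Credential=..., SignedHeaders=...` and `Credential=...,SignedHeaders=...` parse to the same fields.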

Fixes #9568

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211101214114.35693-1-nyh@scylladb.com>
(cherry picked from commit 56eb994d8f)
2021-12-29 14:40:02 +02:00
Nadav Har'El
5d7064e00e alternator: return the correct Content-Type header
Although the DynamoDB API responses are JSON, additional conventions apply
to these responses - such as how error codes are encoded in JSON. For this
reason, DynamoDB uses the content type `application/x-amz-json-1.0` instead
of the standard `application/json` in its responses.

Until this patch, Scylla used `application/json` in its responses. This
unexpected content-type didn't bother any of the AWS libraries which we
tested, but it does bother the aiodynamo library (see HENNGE/aiodynamo#27).

Moreover, we should return the x-amz-json-1.0 content type for future
proofing: It turns out that AWS already defined x-amz-json-1.1 - see:
https://awslabs.github.io/smithy/1.0/spec/aws/aws-json-1_1-protocol.html
The 1.1 content type differs (only) in how it encodes error replies.
If one day DynamoDB starts to use this new reply format (it doesn't yet)
and if DynamoDB libraries need to differentiate between the two
reply formats, Alternator had better return the right one.

This patch also includes a new test that the Content-Type header is
returned with the expected value. The test passes on DynamoDB, and
after this patch it starts to pass on Alternator as well.

Fixes #9554.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211031094621.1193387-1-nyh@scylladb.com>
(cherry picked from commit 6ae0ea0c48)
2021-12-29 12:22:49 +02:00
Takuya ASADA
b56b9f5ed5 scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8
On CentOS8, mdmonitor.service does not work correctly when using
mdadm-4.1-15.el8.x86_64 and later versions.
Until we find a solution, let's pin the package version to an older one
which does not cause the issue (4.1-14.el8.x86_64).

Fixes #9540

Closes #9782

(cherry picked from commit 0d8f932f0b)
2021-12-28 11:38:24 +02:00
Avi Kivity
c8f14886dc Revert "cql3: Reject updates with NULL key values"
This reverts commit 44c784cb79. It
causes a regression without 6afdc6004c,
and that commit is too complicated to backport at this time.

Ref #9311.
2021-12-23 13:03:13 +02:00
Botond Dénes
5c8057749b treewide: distinguish truncated frame errors
We have two identical "Truncated frame" errors, at:
* read_frame_size() in serialization_visitors.hh;
* cql_server::connection::read_and_decompress_frame() in
  transport/server.cc;

When such an exception is thrown, it is impossible to tell where it was
thrown from, and it doesn't contain any further information
(beyond the basic information that its being thrown implies).
This patch solves both problems: it makes the exception messages unique
per location and it adds information about why it was thrown (the
expected vs. real size of the frame).
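
A minimal Python sketch of the idea: one exception type whose message names the location and the sizes involved. The message wording and helper names are invented for illustration and do not match the strings used in the C++ code.

```python
class TruncatedFrameError(Exception):
    def __init__(self, where: str, expected: int, actual: int):
        super().__init__(
            f"Truncated frame in {where}: expected {expected} bytes, got {actual}"
        )


def check_frame(where: str, expected: int, payload: bytes) -> bytes:
    """Return the frame body, or raise an error that identifies the caller."""
    if len(payload) < expected:
        raise TruncatedFrameError(where, expected, len(payload))
    return payload[:expected]
```

Two call sites passing different `where` values now produce distinguishable messages for the same failure.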

Ref: #9482

Closes #9520

(cherry picked from commit 9ec55e054d)
2021-12-21 15:27:17 +02:00
Pavel Emelyanov
a35646b874 row-cache: Handle exception (un)safety of rows_entry insertion
The B-tree's insert_before() is a throwing operation; its caller
must account for that. When the rows_entry's collection was
switched to B-tree, all the risky places were fixed by ee9e1045,
but a few places went under the radar.

In the cache_flat_mutation_reader there's a place where a C-pointer
is inserted into the tree, thus potentially leaking the entry.

In the partition_snapshot_row_cursor there are two places that not
only leak the entry, but also leave it in the LRU list. The latter
is quite nasty, because those entries can be evicted; the eviction code
tries to get the rows_entry iterator from "this", but the hook happens
to be unattached (because insertion threw) and the assert fails.

fixes: #9728

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit ee103636ac)
2021-12-14 16:01:55 +02:00
Pavel Emelyanov
da8708932d partition_snapshot_row_cursor: Shuffle ensure_result creation
Both places get the C-pointer on the freshly allocated rows_entry,
insert it where needed and return back the dereferenced pointer.

The C-pointer is going to become a smart pointer that would go out
of scope before return. This change prepares for that by constructing
the ensure_result from the iterator, that's returned from insertion
of the entry.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 9fd8db318d)

Ref #9728
2021-12-14 16:01:48 +02:00
Nadav Har'El
78a545716a Update Seastar module with additional backports
Backported a Seastar fix:
  > Merge 'metrics: Fix dtest->ulong conversion error' from Benny Halevy

Fixes #9794
2021-12-14 13:18:07 +02:00
Asias He
1488278fc1 storage_service: Wait for seastar::get_units in node_ops
The seastar::get_units returns a future, we have to wait for it.

Fixes #9767

Closes #9768

(cherry picked from commit 9859c76de1)
2021-12-12 18:42:33 +02:00
Nadav Har'El
d9455a910f alternator: add missing BatchGetItem metric
Unfortunately, defining metrics in Scylla requires some code
duplication, with the metrics declared in one place but exported in a
different place in the code. When we duplicated this code in Alternator,
we accidentally dropped the first metric - for BatchGetItem. The metric
was accounted in the code, but not exported to Prometheus.

In addition to fixing the missing metric, this patch also adds a test
that confirms that the BatchGetItem metric increases when the
BatchGetItem operation is used. This test failed before this patch, and
passes with it. The test only currently tests this for BatchGetItem
(and BatchWriteItem) but it can be later expanded to cover all the other
operations as well.

Fixes #9406

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210929121611.373074-1-nyh@scylladb.com>
(cherry picked from commit 5cbe9178fd)
2021-12-06 12:43:49 +02:00
Yaron Kaikov
406b4bce8d release: prepare for 4.5.3 2021-12-05 14:48:33 +02:00
Botond Dénes
417e853b9b mutation_reader: shard_reader: ensure referenced objects are kept alive
The shard reader can outlive its parent reader (the multishard reader).
This creates a problem for lifecycle management: readers take the range
and slice parameters by reference, and users keep these alive as long as the
reader is alive. The shard reader outliving the top-level reader means
that any background read-ahead that it has to wait on will potentially
have stale references to the range and the slice. This was seen in the
wild recently when the evictable reader wrapped by the shard reader hit
a use-after-free while wrapping up a background read-ahead.
This problem was solved by fa43d76 but any previous versions are
susceptible to it.

This patch solves this problem by having the shard reader copy and keep
the range and slice parameters in stable storage, before passing them
further down.

Fixes: #9719

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211202113910.484591-1-bdenes@scylladb.com>
2021-12-02 17:49:36 +02:00
Dejan Mircevski
44c784cb79 cql3: Reject updates with NULL key values
We were silently ignoring INSERTs with NULL values for primary-key
columns, which Cassandra rejects.  Fix it by rejecting any
modification_statement that would operate on empty partition or
clustering range.

This is the most direct fix, because range and slice are calculated in
one place for all modification statements.  It covers not only NULL
cases, but also impossible restrictions like c>0 AND c<0.
Unfortunately, Cassandra doesn't treat all modification statements
consistently, so this fix cannot fully match its behavior.  We err on
the side of tolerance, accepting some DELETE statements that Cassandra
rejects.  We add a TODO for rejecting such DELETEs later.

Fixes #7852.

Tests: unit (dev), cql-pytest against Cassandra 4.0

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9286

(cherry picked from commit 1fdaeca7d0)
2021-11-29 17:19:51 +02:00
Piotr Jastrzebski
ab425a11a8 sstables: Fix writing KA/LA sstables index
Before this patch when writing an index block, the sstables writer was
storing range tombstones that span the boundary of the block in order
of end bounds. This led to a range tombstone being ignored by a reader
if there was a row tombstone inside it.

This patch sorts the range tombstones based on start bound before
writing them to the index file.
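
The reordering can be sketched in Python with range tombstones reduced to `(start, end)` integer pairs. The type alias and function name are hypothetical; real bounds are clustering positions, not integers.

```python
from typing import List, Tuple

RangeTombstone = Tuple[int, int]  # (start bound, end bound), simplified


def order_for_index_block(tombstones: List[RangeTombstone]) -> List[RangeTombstone]:
    """Order open range tombstones by start bound before writing them."""
    return sorted(tombstones, key=lambda rt: rt[0])
```

For `[(5, 6), (1, 10)]`, ordering by end bound keeps the narrow tombstone first, while ordering by start bound puts the enclosing `(1, 10)` first.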

The assumption is that writing an index block is rare so we can afford
sorting the tombstones at that point. Additionally this is a writer of
an old format and writing to it will be dropped in the next major
release so it should be rarely used already.

Kudos to Kamil Braun <kbraun@scylladb.com> for finding the reproducer.

Test: unit(dev)

Fixes #9690

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit scylladb/scylla-enterprise@eb093afd6f)
2021-11-26 15:36:03 +01:00
Tomasz Grabiec
2228a1a92a cql: Fix missing data in indexed queries with base table short reads
Indexed queries are using paging over the materialized view
table. Results of the view read are then used to issue reads of the
base table. If base table reads are short reads, the page is returned
to the user and paging state is adjusted accordingly so that when
paging is resumed it will query the view starting from the row
corresponding to the next row in the base which was not yet
returned. However, paging state's "remaining" count was not reset, so
if the view read was exhausted the reading will stop even though the
base table read was short.

Fix by restoring the "remaining" count when adjusting the paging state
on short read.

Tests:

  - index_with_paging_test
  - secondary_index_test

Fixes #9198
Message-Id: <20210818131840.1160267-1-tgrabiec@scylladb.com>

(cherry picked from commit 1e4da2dcce)
2021-11-23 11:22:13 +02:00
Takuya ASADA
36b190a65e docker: add stopwaitsecs
We need stopwaitsecs, just like we set TimeoutStopSec=900 on
scylla-server.service, to avoid timeout on scylla-server shutdown.

Fixes #9485

Closes #9545

(cherry picked from commit c9499230c3)
2021-11-15 13:36:40 +02:00
Takuya ASADA
f7e5339c14 scylla_io_setup: support ARM instances on AWS
Add preset parameters for AWS ARM instances.

Fixes #9493

(cherry picked from commit 4e8060ba72)
2021-11-15 13:35:15 +02:00
Asias He
4c4972cb33 gossip: Fix use-after-free in real_mark_alive and mark_dead
In commit 11a8912093 (gossiper:
get_gossip_status: return string_view and make noexcept)
get_gossip_status returns a pointer to an endpoint_state in
endpoint_state_map.

After commit 425e3b1182 (gossip: Introduce
direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive
can yield in the middle of the function. The endpoint_state can be
removed in the meantime, so accessing it afterwards is a use-after-free.

To fix, make a copy before we yield.

Fixes #8859

Closes #8862

(cherry picked from commit 7a32cab524)
2021-11-15 13:23:02 +02:00
Takuya ASADA
50ce5bef2c scylla_util.py: On is_gce(), return False when it's on GKE
The GKE metadata server does not provide the same metadata as GCE, so
is_gce() should not return True there.
So try to fetch machine-type from the metadata server, and return False if
the response is 404 Not Found.
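
A minimal sketch of the check described above, not the actual scylla_util.py
code. The `fetch_status` callable is a hypothetical stand-in for the HTTP
request (it takes a metadata path and returns the status code), and treating
any status other than 200 as "not GCE" is an assumption.

```python
def is_gce(fetch_status):
    """Return True only when the metadata server answers the machine-type query.

    fetch_status is a hypothetical helper: it receives a metadata path and
    returns the HTTP status code of the response.
    """
    # machine-type attribute of the instance on the metadata server
    status = fetch_status("computeMetadata/v1/instance/machine-type")
    # Per the commit message, GKE answers 404 here; only a 200 counts as GCE.
    return status == 200


on_gce = is_gce(lambda path: 200)
on_gke = is_gce(lambda path: 404)
```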

Fixes #9471

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #9582

(cherry picked from commit 9b4cf8c532)
2021-11-15 13:18:02 +02:00
Asias He
f864eea844 repair: Return HTTP 400 when repair id is not found
There are two APIs for checking the repair status and they behave
differently in case the id is not found.

```
{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_async/system_auth?id=999", "duration": "1ms",
"status": 400, "bytes": 49, "dump": "HTTP/1.1 400 Bad
Request\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 400}"}

{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_status?id=999&timeout=1", "duration": "0ms",
"status": 500, "bytes": 49, "dump": "HTTP/1.1 500 Internal Server
Error\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 500}"}
```

The correct status code is 400 as this is a parameter error and should
not be retried.

Returning status code 500 makes smarter http clients retry the request
in hopes of server recovering.

After this patch:

curl -X GET
'http://127.0.0.1:10000/storage_service/repair_async/system_auth?id=9999'
{"message": "unknown repair id 9999", "code": 400}

curl -X GET
'http://127.0.0.1:10000/storage_service/repair_status?id=9999'
{"message": "unknown repair id 9999", "code": 400}
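
To illustrate why the status code matters, here is a sketch of the kind of
retry policy a "smarter" HTTP client might apply. The function name and the
exact policy are assumptions for illustration, not taken from any specific
client.

```python
def should_retry(status_code):
    """Retry only on server-side (5xx) errors.

    A 4xx status says the request itself is wrong (for example an unknown
    repair id), so sending it again cannot succeed.
    """
    return 500 <= status_code <= 599


retry_before_patch = should_retry(500)  # repair_status used to answer 500
retry_after_patch = should_retry(400)   # both APIs now answer 400
```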

Fixes #9576

Closes #9578

(cherry picked from commit f5f5714aa6)
2021-11-15 13:15:59 +02:00
Calle Wilund
b9735ab079 cdc: fix broken function signature in maybe_back_insert_iterator
Fixes #9103

The compare overload was declared as returning "bool" even though it is a
tri-compare. This causes us to never use the speed-up shortcut (lessening the
search set), which in turn means more overhead for collections.

Closes #9104

(cherry picked from commit 59555fa363)
2021-11-15 13:13:04 +02:00
Takuya ASADA
766e16f19e scylla_io_setup: handle nr_disks on GCP correctly
nr_disks is an int and should not be a string.

Fixes #9429

Closes #9430

(cherry picked from commit 3b798afc1e)
2021-11-15 13:06:30 +02:00
Michał Chojnowski
e6520df41c utils: fragment_range: fix FragmentedView utils for views with empty fragments
The copying and comparing utilities for FragmentedView are not prepared to
deal with empty fragments in non-empty views, and will fall into an infinite
loop in such case.
But data coming in result_row_view can contain such fragments, so we need to
fix that.
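
A sketch of a comparison that tolerates empty fragments, written in Python
rather than the C++ of the actual utilities. The helper names are
hypothetical. The key point is that an empty fragment is skipped instead of
being "consumed" zero bytes at a time, which is what makes a naive loop spin
forever.

```python
def next_nonempty(fragments):
    """Return the next non-empty fragment from the iterator, or b"" at the end."""
    for frag in fragments:
        if frag:
            return frag
    return b""


def compare_fragmented(a, b):
    """Three-way compare two fragmented byte views; returns -1, 0 or 1."""
    ia, ib = iter(a), iter(b)
    fa, fb = next_nonempty(ia), next_nonempty(ib)
    while fa and fb:
        n = min(len(fa), len(fb))  # always > 0 because both are non-empty
        if fa[:n] != fb[:n]:
            return -1 if fa[:n] < fb[:n] else 1
        # Drop the compared prefix; fetch a new fragment when one runs out.
        fa = fa[n:] or next_nonempty(ia)
        fb = fb[n:] or next_nonempty(ib)
    if fa:
        return 1   # a has bytes left, so it is longer
    if fb:
        return -1  # b has bytes left, so it is longer
    return 0


same = compare_fragmented([b"ab", b"", b"c"], [b"a", b"bc"])
```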

Fixes #8398.

Closes #8397

(cherry picked from commit f23a47e365)
2021-11-15 12:55:25 +02:00
Hagit Segev
26aca7b9f7 release: prepare for 4.5.2 2021-11-14 14:19:34 +02:00
Avi Kivity
103c85a23f build: clobber user/group info from node_exporter tarball
node_exporter is packaged with some random uid/gid in the tarball.
When extracting it as an ordinary user this isn't a problem, since
the uid/gid are reset to the current user, but that doesn't happen
under dbuild since `tar` thinks the current user is root. This causes
a problem if one wants to delete the build directory later, since it
becomes owned by some random user (see /etc/subuid)

Reset the uid/gid information so this doesn't happen.

Closes #9579

Fixes #9610.

(cherry picked from commit e1817b536f)
2021-11-10 14:18:56 +02:00
Nadav Har'El
db66b62e80 alternator: fix bug in ReturnValues=ALL_NEW
This patch fixes a bug in UpdateItem's ReturnValues=ALL_NEW, which in
some cases returned the OLD (pre-modification) value of some of the
attributes, instead of its NEW value.

The bug was caused by a confusion in our JSON utility function,
rjson::set(), which sounds like it can set any member of a map, but in
fact may only be used to add a *new* member - if a member with the same
name (key) already existed, the result is undefined (two values for the
same key). In ReturnValues=ALL_NEW we did exactly this: we started with
a copy of the original item, and then used set() to override some of the
members. This is not allowed.

So in this patch, we introduce a new function, rjson::replace(), which
does what we previously thought that rjson::set() does - i.e., replace a
member if it exists, or if not, add it. We call this function in
the ReturnValues=ALL_NEW code.
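
The difference between the two operations can be sketched with a toy model.
This is not the rjson API: the object is modelled as a list of (key, value)
pairs, and the "first match wins" lookup is an assumption chosen to show one
possible outcome of the duplicate key.

```python
def add_member(obj, key, value):
    """Append a member without checking for an existing key (add-only)."""
    obj.append((key, value))


def replace_member(obj, key, value):
    """Overwrite the member if the key exists, otherwise append it."""
    for i, (k, _) in enumerate(obj):
        if k == key:
            obj[i] = (key, value)
            return
    obj.append((key, value))


def find_member(obj, key):
    """Return the first value stored under key, or None if absent."""
    for k, v in obj:
        if k == key:
            return v
    return None


item = [("a", "old")]
add_member(item, "a", "new")          # now two members named "a"
after_add = find_member(item, "a")    # the stale value is found first

item = [("a", "old")]
replace_member(item, "a", "new")      # still a single member named "a"
after_replace = find_member(item, "a")
```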

This patch also adds a test case that reproduces the incorrect ALL_NEW
results - and gets fixed by this patch.

In an upcoming patch, we should rename the confusingly-named set()
functions and audit all their uses. But we don't do this in this patch
yet. We just add some comments to clarify what set() does - but don't
change it, and just add one new function for replace().

Fixes #9542

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211104134937.40797-1-nyh@scylladb.com>
(cherry picked from commit b95e431228)
2021-11-08 13:11:37 +02:00
Asias He
098fcf900f storage_service: Abort restore_replica_count when node is removed from the cluster
Consider the following procedure:

- n1, n2, n3
- n3 is down
- n1 runs nodetool removenode uuid_of_n3 to remove n3 from the
  cluster
- n1 is down in the middle of removenode operation

Node n1 will set n3 to removing gossip status during removenode
operation. Whenever existing nodes learn a node is in removing gossip
status, they will call restore_replica_count to stream data from other
nodes for the ranges n3 loses if n3 was removed from the cluster. If
the streaming fails, the streaming will sleep and retry. The current
max number of retry attempts is 5. The sleep interval starts at 60
seconds and increases 1.5 times per sleep.
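
For a sense of scale, the retry schedule above can be worked out as follows.
This sketch assumes one sleep per attempt and that "increases 1.5 times"
means each interval is multiplied by 1.5. The function name is hypothetical.

```python
def retry_sleep_intervals(attempts=5, first_sleep=60.0, factor=1.5):
    """Return the sleep interval (in seconds) preceding each retry."""
    intervals = []
    sleep = first_sleep
    for _ in range(attempts):
        intervals.append(sleep)
        sleep *= factor
    return intervals


intervals = retry_sleep_intervals()
total_sleep = sum(intervals)  # time spent sleeping, excluding the streaming itself
```

Under these assumptions the sleeps alone add up to roughly 13 minutes, during
which the failed streaming keeps being retried.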

This can leave the cluster in a bad state. For example, nodes can go
out of disk space if the streaming continues.  We need a way to abort
such streaming attempts.

To abort the removenode operation and forcibly remove the node, users
can run `nodetool removenode force` on any existing node to move the
node from removing gossip status to removed gossip status. However,
the restore_replica_count will not be aborted.

In this patch, a status checker is added in restore_replica_count, so
that once a node is in removed gossip status, restore_replica_count
will be aborted.

This patch is for older releases without the new NODE_OPS_CMD
infrastructure where such abort will happen automatically in case of
error.

Fixes #8651

Closes #8655

(cherry picked from commit 0858619cba)
2021-11-02 17:25:34 +02:00
Benny Halevy
84025f6ce0 large_data_handle: add sstable name to log messages
Although the sstable name is part of the system.large_* records,
it is not printed in the log.
In particular, this is essential for the "too many rows" warning
that currently does not record a row in any large_* table
so we can't correlate it with a sstable.

Fixes #9524

Test: unit(dev)
DTest: wide_rows_test.py

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211027074104.1753093-1-bhalevy@scylladb.com>
(cherry picked from commit a21b1fbb2f)
2021-10-29 10:41:21 +03:00
Asias He
9898a114a6 repair: Handle everywhere_topology in bootstrap_with_repair
The everywhere_topology returns the number of nodes in the cluster as
RF. This makes streaming only from the node losing the range impossible,
since no node is losing the range after bootstrap.

Shortcut to stream from all nodes in local dc in case the keyspace is
everywhere_topology.

Fixes #8503

(cherry picked from commit 3c36517598)
2021-10-28 18:56:01 +03:00
Yaron Kaikov
4c0eac0491 release: prepare for 4.5.1 2021-10-24 14:11:07 +03:00
Benny Halevy
c1d8ce7328 date_tiered_manifest: get_now: fix use after free of sstable_list
The sstable_list is destroyed right after the temporary
lw_shared_ptr<sstable_list> returned from `cf.get_sstables()`
is dereferenced.

Fixes #9138

Test: unit(dev)
DTest: resharding_test.py:ReshardingTombstones_with_DateTieredCompactionStrategy.disable_tombstone_removal_during_reshard_test (debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210804075813.42526-1-bhalevy@scylladb.com>
(cherry picked from commit 3ad0067272)
2021-10-20 17:03:49 +03:00
Jan Ciolek
5c5a71d2d7 cql3: Fix need_filtering on indexed table
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes #8991.
Fixes #7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 54149242b4)
2021-10-18 11:31:20 +03:00
Benny Halevy
454ff04ff6 utils: phased_barrier: advance_and_await: make noexcept
As a function returning a future, simplify
its interface by handling any exceptions and
returning an exceptional future instead of
propagating the exception.

In this specific case, throwing from advance_and_await()
will propagate through table::await_pending_* calls
short-circuiting a .finally clause in table::stop().

Also, mark as noexcept methods of class table calling
advance_and_await and table::await_pending_ops that depends on them.

Fixes #8636

A followup patch will convert advance_and_await to a coroutine.
This is done separately to facilitate backporting of this patch.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210511161407.218402-1-bhalevy@scylladb.com>
(cherry picked from commit c0dafa75d9)
2021-10-13 12:26:03 +03:00
Takuya ASADA
f5f5b9a307 scylla_raid_setup: enabling mdmonitor.service on Debian variants
On Debian variants, mdmonitor.service cannot be enabled because it is missing
an [Install] section, so 'systemctl enable mdmonitor.service' will fail and
mdmonitor will not run after the system is restarted.

To force running the service, add Wants=mdmonitor.service on
var-lib-scylla.mount.

Fixes #8494

Closes #8530

(cherry picked from commit c9324634ca)
2021-10-12 13:58:49 +03:00
Avi Kivity
a433c5fe06 Merge 'rjson: Add throwing allocator' from Piotr Sarna
This series adds a wrapper for the default rjson allocator which throws on allocation/reallocation failures. It's done to work around several rapidjson (the underlying JSON parsing library) bugs - in a few cases, malloc/realloc return value is not checked, which results in dereferencing a null pointer (or an arbitrary pointer computed as 0 + `size`, with the `size` parameter being provided by the user). The new allocator will throw an `rjson::error` if it fails to allocate or reallocate memory.
This series comes with unit tests which check the new allocator behavior and also validate that an internal rapidjson structure which we indirectly rely upon (Stack) is not left in an invalid state after throwing. The last part is verified by the fact that its destructor ran without errors.

Fixes #8521
Refs #8515

Tests:
 * unit(release)
 * YCSB: inserting data similar to the one mentioned in #8515 - 1.5MB objects clustered in partitions 30k objects in size - nothing crashed during various YCSB workloads, but nothing also crashed for me locally before this patch, so it's not 100% robust
 relevant YCSB workload config for using 1.5MB objects:
```yaml
fieldcount=150
fieldlength=10000
```

Closes #8529

* github.com:scylladb/scylla:
  test: add a test for rjson allocation
  test: rename alternator_base64_test to alternator_unit_test
  rjson: add a throwing allocator

(cherry picked from commit c36549b22e)
2021-10-12 13:56:59 +03:00
Benny Halevy
7f96ee6689 streaming: stream_session: do not escape curly braces in format strings
Those turn into '{}' in the formatted strings and trigger
a logger error in the following sstlog.warn(err.c_str())
call.
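
The failure mode can be reproduced with Python's str.format, which uses the
same brace-escaping convention. This is only an analogue of the C++ logger
behaviour: `log_warn` is a hypothetical stand-in for a logger that treats its
first argument as a format string, and the message text is made up.

```python
template = "stream failed for range {{}}: {}"
# First formatting pass: the escaped braces collapse to a literal "{}".
message = template.format("timeout")


def log_warn(fmt, *args):
    """Stand-in for a logger that formats its first argument."""
    return fmt.format(*args)


try:
    # Second pass: the message now contains "{}" but no argument is given.
    log_warn(message)
    failed = False
except IndexError:
    failed = True
```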

Fixes #8436

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210408173048.124417-1-bhalevy@scylladb.com>
(cherry picked from commit 76cd315c42)
2021-10-12 13:49:13 +03:00
Calle Wilund
ebe196e32d table: ensure memtable is actually in memtable list before erasing
Fixes #8749

if a table::clear() was issued while we were flushing a memtable,
the memtable is already gone from list. We need to check this before
erase. Otherwise we get random memory corruption via
std::vector::erase

v2:
* Make interface more set-like (tolerate non-existence in erase).
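
A sketch of the set-like erase semantics, in Python rather than C++. The
function name and the plain-list representation are assumptions. The point is
that erasing an element that is no longer in the list becomes a no-op instead
of touching an invalid position.

```python
def erase(memtables, target):
    """Remove target from the list if present; report whether it was found."""
    for i, m in enumerate(memtables):
        if m is target:
            del memtables[i]
            return True
    # Already gone (e.g. the list was cleared during the flush): do nothing.
    return False


m1, m2 = object(), object()
memtables = [m1, m2]
memtables.clear()                       # models table::clear() racing with a flush
removed = erase(memtables, m1)          # the flushed memtable is no longer listed
```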

Closes #8904

(cherry picked from commit 373fa3fa07)
2021-10-12 13:47:25 +03:00
Benny Halevy
38aa455e83 utils: merge_to_gently: prevent stall in std::copy_if
std::copy_if runs without yielding.

See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480

Note that the standard states that no iterators or references are invalidated
on insert so we can keep inserting before last1 when merging the
remainder of list2 at the tail of list1.

Fixes #8897

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 453e7c8795)
2021-10-12 13:05:45 +03:00
Avi Kivity
a99382a076 main: start background reclaim before bootstrap
We start background reclaim after we bootstrap, so bootstrap doesn't
benefit from it, and sees long stalls.

Fix by moving background reclaim initialization early, before
storage_service::join_cluster().

(storage_service::join_cluster() is quite odd in that main waits
for it synchronously, compared to everything else which is just
a background service that is only initialized in main).

Fixes #8473.

Closes #8474

(cherry picked from commit 935378fa53)
2021-10-12 13:00:49 +03:00
Pavel Emelyanov
6e2d055be3 mutation: Keep range tombstone in tree when consuming
Current code std::move()-s the range tombstone into consumer thus
moving the tombstone's linkage to the containing list as well. As
a result, the original range tombstone itself leaks as it leaves
the tree and cannot be reached on .clear(). Another danger is that
the iterator pointing to the tombstone becomes invalid while it's
then ++-ed to advance to the next entry.

The immediate fix is to keep the tombstone linked to the list while
moving.

fixes: #9207

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210825100834.3216-1-xemul@scylladb.com>
(cherry picked from commit b012040a76)
2021-10-12 12:57:49 +03:00
Michael Livshin
152f710dec avoid race between compaction and table stop
Also add a debug-only compaction-manager-side assertion that tests
that no new compaction tasks were submitted for a table that is being
removed (debug-only because not constant-time).

Fixes #9448.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20211007110416.159110-1-michael.livshin@scylladb.com>
(cherry picked from commit e88891a8af)
2021-10-12 12:51:34 +03:00
Asias He
14620444a2 storage_service: Fix argument in send_meta_data::do_receive
The extra status print is not needed in the log.

Fixes the following error:

ERROR 2021-08-10 10:54:21,088 [shard 0] storage_service -
service/storage_service.cc:3150 @do_receive: failed to log message:
fmt='send_meta_data: got error code={}, from node={}, status={}':
fmt::v7::format_error (argument not found)

Fixes #9183

Closes #9189

(cherry picked from commit ce8fd051c9)
2021-10-12 12:45:47 +03:00
Yaron Kaikov
47be33a104 release: prepare for 4.5.0 2021-10-06 14:12:31 +03:00
Takuya ASADA
18b8388958 scylla_cpuscaling_setup: add --force option
To build an Ubuntu AMI with CPU scaling configured, we need a forced
running mode for scylla_cpuscaling_setup, which runs the setup without
checking for scaling_governor support.

See scylladb/scylla-machine-image#204

Closes #9326

(cherry picked from commit f928dced0c)
2021-10-05 16:20:10 +03:00
Takuya ASADA
56b24818ec scylla_cpuscaling_setup: disable ondemand.service on Ubuntu
On Ubuntu, scaling_governor becomes powersave after a reboot, even if we configured cpufrequtils.
This is because ondemand.service unconditionally changes scaling_governor to ondemand or powersave.
cpufrequtils starts before ondemand.service, so scaling_governor is overwritten by ondemand.service.
To configure scaling_governor correctly, we have to disable this service.

Fixes #9324

Closes #9325

(cherry picked from commit cd7fe9a998)
2021-10-03 14:08:22 +03:00
Raphael S. Carvalho
5ed149b7e1 compaction_manager: prevent unbounded growth of pending tasks
There will be unbounded growth of pending tasks if they are submitted
faster than they are retired. That can potentially happen if memtables
are frequently flushed too early. It was observed that this unbounded
growth caused task queue violations as the queue will be filled
with tons of tasks being reevaluated. By avoiding duplication in
pending task list for a given table T, growth is no longer unbounded
and consequently reevaluation is no longer aggressive.
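
A sketch of the deduplication idea, not the compaction manager's actual data
structure. The class and method names are hypothetical. A table that already
has a pending entry is not queued again, so the queue length is bounded by the
number of tables.

```python
from collections import deque


class PendingTasks:
    """Pending-task queue that holds at most one entry per table."""

    def __init__(self):
        self._queue = deque()
        self._queued = set()

    def submit(self, table):
        """Queue a task for table unless one is already pending."""
        if table in self._queued:
            return False
        self._queued.add(table)
        self._queue.append(table)
        return True

    def pop(self):
        """Take the oldest pending task; its table may be submitted again."""
        table = self._queue.popleft()
        self._queued.discard(table)
        return table

    def __len__(self):
        return len(self._queue)


pending = PendingTasks()
results = [pending.submit("T") for _ in range(1000)]  # frequent early flushes
```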

Refs #9331.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210930125718.41243-1-raphaelsc@scylladb.com>
(cherry picked from commit 52302c3238)
2021-10-03 13:10:37 +03:00
Eliran Sinvani
9265dbd5f7 dist: rpm: Add specific versioning and python3 dependency
The Red Hat packages were missing two things: first, the metapackage
did not depend on the python3 package at all, and second, the
scylla-server package dependencies didn't contain a version as part
of the dependency, which can cause problems during upgrade.
Doing both of the things listed here is a bit of an overkill, as either
one of them separately would solve the problem described in #XXXX,
but both should be applied in order to express the correct concept.

Fixes #8829

Closes #8832

(cherry picked from commit 9bfb2754eb)
2021-09-12 16:01:01 +03:00
Avi Kivity
443fda8fb1 Merge "evictable_readers: don't drop static rows, drop assumption about snapshot isolation" from Botond
"
This mini-series fixes two loosely related bugs around reader recreation
in the evictable reader (related by both being around reader
recreation). A unit test is also added which reproduces both of them and
checks that the fixes indeed work. More details in the patches
themselves.
This series replaces the two independent patches sent before:
* [PATCH v1] evictable_reader: always reset static row drop flag
* [PATCH v1] evictable_reader: relax partition key check on reader
  recreation

As they depend on each other, it is easier to add a test if they are in
a series.

Fixes: #8923
Fixes: #8893

Tests: unit(dev, mutation_reader_test:debug)
"

* 'evictable-reader-recreation-more-bugs/v1' of https://github.com/denesb/scylla:
  test: mutation_reader_test: add more test for reader recreation
  evictable_reader: relax partition key check on reader recreation
  evictable_reader: always reset static row drop flag

(cherry picked from commit 4209dfd753)
2021-09-06 20:35:14 +03:00
Pavel Emelyanov
02da29fd05 btree: Destroy, not drop, node on clone roll-back
The node in this place is not yet attached to its parent, so
in btree::debug::yes (tests only) mode the node::drop()'s parent
checks will access null parent pointer.

However, in non-testing runtime there's a chance that a linear
node fails to clone one of its keys and gets here. In this case
it will carry both leftmost and rightmost flags and the assertion
in drop will fire.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 1d857d604a)

Ref #9248.
2021-09-06 11:16:29 +03:00
Yaron Kaikov
edead1caf9 release: prepare for 4.5.rc7 2021-09-06 08:40:20 +03:00
Tomasz Grabiec
8dbd4edbb5 Merge 'hints: use token_metadata to tell if node has left the ring' from Piotr Dulikowski
This PR changes the `can_send` function so that it looks at the `token_metadata` in order to tell if the destination node is in the ring. Previously, gossiper state was used for that purpose and required a relatively complicated condition to check. The new logic just uses `token_metadata::is_member` which reduces complexity of the `can_send` function.

Additionally, `storage_service` is slightly modified so that during a removenode operation the `token_metadata` is first updated and only then endpoint lifecycle subscribers are notified. This was done in order to prevent a race just like the one which happened in #5087 - hints manager is a lifecycle subscriber and starts a draining operation when a node is removed, and in order for draining to work correctly, `can_send` should keep returning true for that node.

Tests:

- unit(dev)
- dtest(hintedhandoff_additional_test.py)
- dtest(topology_test.py)

Closes #8387

* github.com:scylladb/scylla:
  hints: clarify docstring comment for can_send
  hints: use token_metadata to tell if node is in the ring
  hints: slightly reorganize "if" statement in can_send
  storage_service: release token_metadata lock before notify_left
  storage_service: notify_left after token_metadata is replicated

(cherry picked from commit 307bd354d2)

Ref #5087.
2021-09-05 17:38:11 +03:00
Pavel Emelyanov
02bb2e1f4c btree: Dont leak kids on clone roll-back
When failed-to-be-cloned node cleans itself it must also clear
all its child nodes. Plain destroy() doesn't do it, it only
frees the provided node.

fixes: #9248

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit d1a1a2dac2)
2021-09-05 17:33:48 +03:00
Benny Halevy
4bae31523d distributed_loader: distributed_loader::get_sstables_from_upload_dir: do not copy vector containing foreign shared sstables
lw_shared_ptr must not be copied on a foreign shard.
Copying the vector on shard 0 increases the reference count of
lw_shared_ptr<sstable> elements that were created on other shards,
as seen in https://github.com/scylladb/scylla/issues/9278.

Fixes #9278

DTest: migration_test.py:TestLoadAndStream_with_3_0_md.load_and_stream_increase_cluster_test(debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210902084313.2003328-1-bhalevy@scylladb.com>
(cherry picked from commit 33f579f783)
2021-09-05 17:02:38 +03:00
Michał Chojnowski
55348131f9 utils: compact-radix-tree: fix accidental cache line bouncing
Whenever a node_head_ptr is assigned to nil_root, the _backref inside it is
overwritten. But since nil_root is shared between shards, this causes severe
cache line bouncing. (It was observed to reduce the total write throughput
of Scylla by 90% on a large NUMA machine).

This backreference is never read anyway, so fix this bug by not writing it.

Fixes #9252

Closes #9246

(cherry picked from commit 126baa7850)
2021-08-29 15:45:33 +03:00
Avi Kivity
c1b9de3d5e Revert "messaging_service: Enforce dc/rack membership iff required for non-tls connections"
This reverts commit a0745f9498. It breaks
multiregion clusters on AWS.

Ref #8418.
2021-08-29 15:44:11 +03:00
Avi Kivity
9956bce436 Revert "messaging_service: Bind to listen address, not broadcast"
This reverts commit 6dc7ef512d. It stands
in the way of reverting  a0745f9498, which
is implicated in #8418.
2021-08-29 15:43:29 +03:00
Pavel Solodovnikov
95f32428e4 raft: create system tables only when raft experimental feature is set
Also introduce a tiny function to return raft-enabled db config
for cql testing.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210826091432.279532-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit c0854a0f62)

Ref #9239.
2021-08-29 14:01:36 +03:00
Pavel Solodovnikov
5b3319816a db: add experimental option for raft
Introduce `raft` experimental option.
Adjust the tests accordingly to accommodate the new option.

It's not enabled by default when providing
`--experimental=true` config option and should be
requested explicitly via `--experimental-options=raft`
config option.

Hide the code related to `raft_group_registry` behind
the switch. The service object is still constructed
but no initialization is performed (`init()` is not
called) if the flag is not set.

Later, other raft-related things, such as raft schema
changes, will also use this flag.

Also, don't introduce a corresponding gossiper feature
just yet, because again, it should be done after the
raft schema changes API contract is stabilized.

This will be done in a separate series, probably related to
implementing the feature itself.

Tests: unit(dev)

Ref #9239.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210823121956.167682-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit 22794efc22)
2021-08-29 14:01:33 +03:00
Avi Kivity
8f63a9de31 Update seastar submodule (perftune failure on bond NIC)
* seastar dab10ba6ad...70ea9312a1 (1):
  > perftune.py: instrument bonding tuning flow with 'nic' parameter

Fixes #9225.
2021-08-19 16:58:15 +03:00
Hagit Segev
88314fedfa release: prepare for 4.5.rc6 2021-08-17 14:17:04 +03:00
Calle Wilund
b0edfa6d70 commitlog/config: Make hard size enforcement false by default + add config opt
Refs #9053

Flips default for commitlog disk footprint hard limit enforcement to off due
to observed latency stalls with stress runs. Instead adds an optional flag
"commitlog_use_hard_size_limit" which can be turned on to actually enforce it.

Sort of tape and string fix until we can properly tweak the balance between
cl & sstable flush rate.

Closes #9195

(cherry picked from commit 3633c077be)
2021-08-16 10:05:08 +03:00
Takuya ASADA
9338f6b6b8 scylla_cpuscaling_setup: change scaling_governor path
On some environments /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
does not exist even though CPU scaling is supported.
Instead, /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor is
available in both environments, so we should switch to it.

Fixes #9191

Closes #9193

(cherry picked from commit e5bb88b69a)
2021-08-12 12:10:04 +03:00
Asias He
28940ef505 table: Fix is_shared assert for load and stream
The reader is used by load and stream to read sstables from the upload
directory which are not guaranteed to belong to the local shard.

Use make_range_sstable_reader instead of
make_local_shard_sstable_reader.

Tests:

backup_restore_tests.py:TestBackupRestore.load_and_stream_using_snapshot_test
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_2_test
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_1_test
migration_test.py:TestLoadAndStream.load_and_stream_asymmetric_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_decrease_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_frozen_pk_test
migration_test.py:TestLoadAndStream.load_and_stream_increase_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_primary_replica_only_test

Fixes #9173

Closes #9185

(cherry picked from commit 040b626235)
2021-08-12 11:17:33 +03:00
Raphael S. Carvalho
4c03bcce4c compaction: Prevent tons of compaction of fully expired sstable from happening in parallel
Compaction manager can start tons of compactions of fully expired sstables in
parallel, which may consume a significant amount of resources.
This problem is caused by weight being released too early in compaction, after
data is all compacted but before table is called to update its state, like
replacing sstables and so on.
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
releases all the resources.

This happens because, with expired sstable, compaction will release weight
faster than it will update table state, as there's nothing to be compacted.

With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can be easily worse depending on the amount of sstables
with fully expired data, across all tables. This high parallelism can happen
only with a couple of tables, if there are many time windows with expired data,
as they can be compacted in parallel.

Prior to 55a8b6e3c9, weight was released earlier in compaction, before
last sstable was sealed, but right now, there's no need to release weight
earlier. Weight can be released in a much simpler way, after the compaction is
actually done. So such compactions will be serialized from now on.

Fixes #8710.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>

[avi: drop now unneeded storage_service_for_tests]

(cherry picked from commit a7cdd846da)
2021-08-10 18:16:24 +03:00
Piotr Jastrzebski
860e2190a9 api: use proper type to reduce partition count
Partition count is of a type size_t but we use std::plus<int>
to reduce values of partition count in various column families.
This patch changes the argument of std::plus to the right type.
Using std::plus<int> for size_t compiles but does not work as expected.
For example plus<int>(2147483648LL, 1LL) = -2147483647 while the code
would probably want 2147483649.
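
The arithmetic above can be checked with a small simulation. It uses ctypes to
mimic fixed-width integers and assumes a 32-bit two's-complement int and a
64-bit size_t. The function names are hypothetical.

```python
import ctypes


def plus_int(a, b):
    """Mimic std::plus<int>: operands are narrowed to 32-bit int, then added."""
    a32 = ctypes.c_int32(a).value  # 2147483648 wraps to -2147483648
    b32 = ctypes.c_int32(b).value
    return ctypes.c_int32(a32 + b32).value


def plus_size_t(a, b):
    """Mimic std::plus<size_t> with a 64-bit unsigned type."""
    return ctypes.c_uint64(a + b).value


narrow = plus_int(2147483648, 1)
wide = plus_size_t(2147483648, 1)
```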

Fixes #9090

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #9074

(cherry picked from commit 90a607e844)
2021-08-10 18:16:15 +03:00
Nadav Har'El
6cf88812f6 secondary index: fix regression in CREATE INDEX IF NOT EXISTS
The recent commit 0ef0a4c78d added helpful
error messages in case an index cannot be created because the intended
name of its materialized view is already taken - but accidentally broke
the "CREATE INDEX IF NOT EXISTS" feature.

The checking code was correct, but in the wrong place: we first need to
check whether the index already exists and "IF NOT EXISTS" was chosen -
and only do this new error checking if this is not the case.

This patch also includes a cql-pytest test for reproducing this bug.
The bug is also reproduced by the translated Cassandra unit tests
    cassandra_tests/validation/entities/secondary_index_test.py::
    testCreateAndDropIndex
and this is how I found this bug. After this patch, all these tests
pass.

Fixes #8717.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210526143635.624398-1-nyh@scylladb.com>
(cherry picked from commit 97e827e3e1)
2021-08-10 18:02:02 +03:00
Nadav Har'El
89fbcf9c81 Merge 'Fix index name conflicts with regular tables' from Piotr Sarna
When an index is created without an explicit name, a default name
is chosen. However, there was no check if a table with conflicting
name already exists. The check is now in place and if any conflicts
are found, a new index name is chosen instead.
When an index is created *with* an explicit name and a conflicting
regular table is found, index creation should simply fail.
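
The two rules above can be sketched as follows. This is not the cql3 code: the
function name is hypothetical, and the numeric-suffix scheme used to pick a new
default name is an assumption made only for illustration.

```python
def choose_index_name(existing_tables, default_base, explicit_name=None):
    """Pick an index name that does not clash with an existing table.

    An explicit name that conflicts is an error. A default name is replaced
    by the first free "<base>_<n>" candidate (assumed scheme).
    """
    if explicit_name is not None:
        if explicit_name in existing_tables:
            raise ValueError(f"name {explicit_name!r} conflicts with an existing table")
        return explicit_name
    candidate = default_base
    suffix = 0
    while candidate in existing_tables:
        suffix += 1
        candidate = f"{default_base}_{suffix}"
    return candidate


tables = {"t_v_idx", "t_v_idx_1"}
default_name = choose_index_name(tables, "t_v_idx")
```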

This series comes with a test.

Fixes #8620
Tests: unit(release)

Closes #8632

* github.com:scylladb/scylla:
  cql-pytest: add regression tests for index creation
  cql3: fail to create an index if there is a name conflict
  database: check for conflicting table names for indexes

(cherry picked from commit cee4c075d2)
2021-08-10 12:33:21 +03:00
Calle Wilund
6dc7ef512d messaging_service: Bind to listen address, not broadcast
Refs #8418

The broadcast address can (apparently) be an address that is not actually on
the machine, but on the other side of NAT. Binding the local side of an
outgoing connection to it will therefore fail.
Bind instead to listen_address (or broadcast, if listen_to_broadcast).
This relies on routing + NAT to make the connection appear, to the node
being connected to, to come from the broadcast address, so that the
connection is allowed (if using partial encryption).

Note: this has been verified only to a limited extent. I would suggest
verifying various multi rack/dc setups before relying on it.

Closes #8974

(cherry picked from commit b8b5f69111)
2021-07-18 14:09:03 +03:00
Asias He
ae39b30ed3 repair: Consider memory bloat when calculate repair parallelism
The repair parallelism is calculated from the amount of memory allocated to
repair and the memory usage per repair instance. Currently, it does not
consider memory bloat issues (e.g., issue #8640) which cause repair to
use more memory and cause std::bad_alloc.

Be more conservative when calculating the parallelism to avoid repair
using too much memory.
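
As a back-of-the-envelope illustration in Python (not the actual repair
code), a more conservative calculation could scale the per-instance estimate
by a safety factor. The function name and the `bloat_factor` parameter are
assumptions for the example; the commit message does not say how the real
margin is chosen.

```python
def repair_parallelism(memory_budget, memory_per_instance, bloat_factor=2):
    """Number of repair instances to run in parallel.

    `bloat_factor` is a hypothetical safety margin: each instance is assumed
    to use up to `bloat_factor` times its nominal memory.
    """
    effective_per_instance = memory_per_instance * bloat_factor
    # Always allow at least one instance so that repair can make progress.
    return max(1, memory_budget // effective_per_instance)
```

With a budget of 1024 units and 32 units per instance, the naive result is 32
instances, while a factor of 2 halves it to 16.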

Fixes #8641

Closes #8652

(cherry picked from commit b8749f51cb)
2021-07-15 13:01:43 +03:00
Avi Kivity
d54372d699 Dedicate Scylla 4.5 release 2021-07-14 18:41:02 +03:00
Hagit Segev
7f96719c55 release: prepare for 4.5.rc5 2021-07-11 16:56:02 +03:00
Takuya ASADA
0c33983c71 scylla-fstrim.timer: drop BindsTo=scylla-server.service
To avoid restarting scylla-server.service unexpectedly, drop BindsTo=
from scylla-fstrim.timer.

Fixes #8921

Closes #8973

(cherry picked from commit def81807aa)
2021-07-08 10:06:12 +03:00
Calle Wilund
3c51b4066b commitlog: Use defensive copies of segment list in iterations
Fixes #8952

In 5ebf5835b0 we added a segment
prune after flushing, to deal with deadlocks in shutdown.
This means that calls that issue sync/flush-like ops "for-all",
need to operate on a defensive copy of the list.
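
The hazard is language independent. The Python sketch below uses
hypothetical `Segment` and `Manager` classes (not the commitlog code) in
which flushing a segment prunes it from the live list, and compares iterating
the live list with iterating a copy.

```python
class Segment:
    def __init__(self, name, manager):
        self.name = name
        self.manager = manager

    def flush(self):
        # Hypothetical post-flush prune: the segment drops out of the live list.
        self.manager.segments.remove(self)


class Manager:
    def __init__(self, names):
        self.segments = [Segment(n, self) for n in names]

    def flush_all_unsafe(self):
        visited = []
        for s in self.segments:        # the list shrinks while we iterate
            visited.append(s.name)
            s.flush()
        return visited

    def flush_all(self):
        visited = []
        for s in list(self.segments):  # defensive copy of the list
            visited.append(s.name)
            s.flush()
        return visited
```

In Python the unsafe variant silently skips every other segment; in C++ the
equivalent mistake would typically invalidate the iterator instead.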

Closes #8980

(cherry picked from commit ce45ffdffb)
2021-07-08 09:56:14 +03:00
Takuya ASADA
efa4d24deb dist/redhat: fix systemd unit name of scylla-node-exporter
systemd unit name of scylla-node-exporter is
scylla-node-exporter.service, not node-exporter.service.

Fixes #8966

Closes #8967

(cherry picked from commit f19ebe5709)
2021-07-07 18:37:42 +03:00
Takuya ASADA
3e1d608111 dist: stop removing /etc/systemd/system/*.mount on package uninstall
Listing /etc/systemd/system/*.mount as ghost files seems incorrect,
since the user may want to keep using the RAID volume / coredump directory after
uninstalling Scylla, or may want to upgrade to the enterprise version.

Also, we mixed two types of files as ghost files, which should be handled differently:
 1. automatically generated by postinst scriptlet
 2. generated by user invoked scylla_setup

The package should remove only 1, since 2 is generated by user decision.

However, just dropping .mount from the %files section causes another
problem: rpm will remove these files during upgrade, instead of
uninstall (#8924).

To fix both problems, specify .mount files as "%ghost %config".
This keeps the files on both package upgrade and package removal.

See scylladb/scylla-enterprise#1780

Closes #8810
Closes #8924

Closes #8959

(cherry picked from commit f71f9786c7)
2021-07-07 18:37:42 +03:00
Pavel Emelyanov
a87bb38c29 hasher: More picky noexcept marking of feed_hash()
Commit 5adb8e555c marked the ::feed_hash() and a visitor lambda of
digester::feed_hash() as noexcept. This was quite reckless as the
appending_hash<>::operator()s called by ::feed_hash() are not all
marked noexcept. In particular, the appending_hash<row>() is not
such and seems to throw.

The original intent of the mentioned commit was to facilitate the
partition_hasher in repair/ code. The hasher itself had been removed
by 0af7a22c21, so it no longer needs the feed_hash-s to be
noexcept.

The fix is to inherit noexcept from the called hashers, but for the
digester::feed_hash part the noexcept is just removed until clang
compilation bug #50994 is fixed.

fixes: #8983
tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210706153608.4299-1-xemul@scylladb.com>
(cherry picked from commit 63a2fed585)
2021-07-07 18:36:00 +03:00
Raphael S. Carvalho
ab3e284e04 LCS/reshape: Don't reshape single sstable in level 0 with strict mode
With strict mode, it could happen that a sstable alone in level 0 is
selected for offstrategy compaction, which means that we could run
into an infinite reshape process.

This is fixed by respecting the offstrategy threshold. Unit test is
added.

Fixes #8573.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210506181324.49636-1-raphaelsc@scylladb.com>
(cherry picked from commit 8480839932)
2021-07-07 14:06:20 +03:00
Raphael S. Carvalho
b0e833d9e5 LCS: reshape: Fix overlapping check when determining if a sstable set is disjoint
Wrong comparison operator is used when checking for overlapping. It
would miss overlapping when last key of a sstable is equal to the first
key of another sstable that comes next in the set, which is sorted by
first key.
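
A minimal Python sketch of the corrected check, assuming each sstable is
modelled as a (first_key, last_key) pair with inclusive bounds. The function
name and the tuple representation are illustrative, not the actual
implementation.

```python
def is_disjoint(sstables):
    """Return True if no two key ranges in `sstables` overlap.

    `sstables` is a list of (first_key, last_key) tuples with inclusive bounds.
    """
    ordered = sorted(sstables, key=lambda r: r[0])  # sort by first key
    for prev, nxt in zip(ordered, ordered[1:]):
        # ">=" rather than ">": a shared boundary key is an overlap too.
        if prev[1] >= nxt[0]:
            return False
    return True
```

With a strict ">" comparison, the ranges (1, 5) and (5, 9) would wrongly be
reported as disjoint even though both contain key 5.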

Fixes #8531.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 39ecddbd34)
2021-07-07 14:04:04 +03:00
Hagit Segev
9e55d9bd04 release: prepare for 4.5.rc4 2021-07-07 12:17:19 +03:00
Avi Kivity
f89f4e69a0 Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed (#8695) (v3)' from Calle Wilund
Fixes #8270

If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk footprint that is over limit causing new segment allocation to stall.

We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
4.) (v2) Must ensure discard/delete routines are executed. Because we can race with background disk syncs, we may need to
    issue segment prunes from end_flush() so we wake up actual file deletion/recycling
5.) (v2) Shutdown must ensure discard/delete is run after we've disabled background task etc, otherwise we might fail waking up replenish and get stuck in gate
6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For same reason as above.
7.) (v3) Signal recycle/delete queues/promise on shutdown (with recognized marker) to handle edge case where we only have a single (allocating) segment in the list, and cannot wake up replenisher in any more civilized way.
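
Points 1 and 2 above can be illustrated with a small Python model. This is a
sketch under stated assumptions, not the commitlog code: the `Segment` fields,
the rule that a closed segment counts with its full capacity, and the function
names are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    capacity: int  # bytes the segment occupies on disk
    used: int      # bytes actually holding mutations
    closed: bool   # no further writes will go to this segment


def footprint(segments):
    # A closed segment counts in full: its slack ("wasted" space) can never
    # be filled, yet it still occupies disk until the segment is released.
    return sum(s.capacity if s.closed else s.used for s in segments)


def should_flush(segments, threshold, allocation_waiting=False):
    # Point 2: a stalled segment allocation forces a flush right away.
    if allocation_waiting:
        return True
    # Point 1: compare the footprint, slack included, against the threshold.
    return footprint(segments) >= threshold
```

For one closed segment of capacity 100 with 40 bytes used plus an open
segment with 10 bytes used, the used bytes total 50 but the footprint is 110,
so a threshold of 100 triggers a flush only when slack is counted.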

Also fix edge case (for tests), when we have too few segments to have an active one (i.e. need to flush everything).

New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test.

Closes #8764

* github.com:scylladb/scylla:
  commitlog_test: Add test case for usage/disk size threshold mismatch
  commitlog_test: Improve test assertion
  commitlog: Add waitable future for background sync/flush
  commitlog: abort queues on shutdown
  commitlog: break out "abort" calls into member functions
  commitlog: Do explicit discard+delete in shutdown
  commitlog: Recycle or not should not depend on shutdown state
  commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
  commitlog: Flush all segments if we only have one.
  commitlog: Always force flush if segment allocation is waiting
  commitlog: Include segment wasted (slack) size in footprint check
  commitlog: Adjust (lower) usage threshold

(cherry picked from commit 14252c8b71)
2021-06-27 14:05:36 +03:00
Nadav Har'El
7146646bf4 Merge 'commitlog: make_checked_file for segments, report and ignore other errors on shutdown' from Benny Halevy
Shutdown must never fail, otherwise it may cause hangs
as seen in https://github.com/scylladb/scylla/issues/8577.

This change wraps the file created in `allocate_segment_ex` in `make_checked_file` so that scylla will abort when failing to write to the commitlog files.

In case other errors are seen during shutdown, just log them and continue with shutting down to prevent scylla from hanging.

Fixes #8577

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8578

* github.com:scylladb/scylla:
  commitlog: segment_manager::shutdown: abort on errors
  commitlog: allocate_segment_ex: make_checked_file

(cherry picked from commit 48ff641f67)
2021-06-27 14:04:04 +03:00
Benny Halevy
b3aba49ab0 commitlog: segment_manager: max_size must be aligned
This was triggered by the test_total_space_limit_of_commitlog dtest.
When it passes a very large commitlog_segment_size_in_mb (1/6th of the
free memory size, in mb), segment_manager constructor limits max_size
to std::numeric_limits<position_type>::max() which is 0xffffffff.

This causes allocate_segment_ex to loop forever when writing the segment
file since `dma_write` returns 0 when the count is unaligned (seen 4095).

The fix here is to select a slightly smaller max_size that is aligned
down to a multiple of 1MB.
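
The arithmetic can be checked with a few lines of Python. The helper names
are illustrative and the 4096-byte write alignment in the usage note is an
assumption (inferred from the observed count of 4095); only the 0xffffffff
limit and the 1MB alignment come from the description above.

```python
MB = 1 << 20
POSITION_MAX = 0xFFFFFFFF  # std::numeric_limits<position_type>::max() per the message


def align_down(value, alignment):
    """Largest multiple of `alignment` that is <= `value`."""
    return value - (value % alignment)


def clamp_max_size(requested_bytes):
    """Limit the segment size to an aligned maximum (illustrative helper)."""
    return min(requested_bytes, align_down(POSITION_MAX, MB))
```

Here align_down(0xFFFFFFFF, MB) is 0xFFF00000, which is a multiple of 4096,
whereas 0xFFFFFFFF leaves a remainder of 4095 when divided by 4096.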

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210407121059.277912-1-bhalevy@scylladb.com>
(cherry picked from commit 705f9c4f79)
2021-06-27 14:03:41 +03:00
Hagit Segev
706de00ef2 release: prepare for 4.5.rc3 2021-06-20 17:14:50 +03:00
Piotr Sarna
cd5b915460 Merge 'view: fix use-after-move when handling view update failures'
Backport of 6726fe79b6.

The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.

Fixes #8830
Tests: unit(release),
       dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)

Closes #8834

* backport-6726fe7:
  view: fix use-after-move when handling view update failures
  db,view: explicitly move the mutation to its helper function
  db,view: pass base token by value to mutate_MV
2021-06-16 13:27:12 +02:00
Piotr Sarna
247d30f075 view: fix use-after-move when handling view update failures
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.

Refs #8830
Tests: unit(release),
       dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
2021-06-16 13:25:50 +02:00
Piotr Sarna
13da17e6fe db,view: explicitly move the mutation to its helper function
The `apply_to_remote_endpoints` helper function used to take
its `mut` parameter by reference, but then moved the value from it,
which is confusing and prone to errors. Since the value is moved-from,
let's pass it to the helper function as rvalue ref explicitly.
2021-06-16 13:25:46 +02:00
Piotr Sarna
6e29d74ab8 db,view: pass base token by value to mutate_MV
The base token is passed cross-continuations, so the current way
of passing it by const reference probably only works because the token
copying is cheap enough to optimize the reference out.
Fix by explicitly taking the token by value.
2021-06-16 13:23:10 +02:00
Tomasz Grabiec
e820e7f3c5 Merge 'Backport for 4.5: Fix replacing node takes writes' from Asias He
This backport fixes the following issue:

    Cassandra stress fails to achieve consistency during replace node operation #8013

without the NODE_OPS_CMD infrastructure.

The commit c82250e0cf (gossip: Allow deferring advertise of local node to be up), which fixes

     During replace node operation - replacing node is used to respond to read queries #7312

is already present in 4.5 branch.

Closes #8703

* github.com:scylladb/scylla:
  storage_service: Delay update pending ranges for replacing node
  gossip: Add helper to wait for a node to be up
2021-06-08 23:20:25 +02:00
Nadav Har'El
0cebafd104 alternator: fix equality check of nested document containing a set
In issue #5021 we noticed that the equality check in Alternator's condition
expressions needs to handle sets differently - we need to compare the set's
elements ignoring their order. But the implementation we added to fix that
issue was only correct when the entire attribute was a set... In the
general case, an attribute can be a nested document, with only some
inner set. The equality-checking function needs to traverse this nested
document, and compare the sets inside it as appropriate. This is what
we do in this patch.
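
The idea can be sketched in Python over DynamoDB-style JSON values such as
{"S": "x"}, {"SS": [...]}, {"L": [...]} and {"M": {...}}. This is an
illustration of the traversal only, not Alternator's C++ implementation; the
function name is made up, and numeric normalization (e.g. "1" versus "1.0")
is deliberately left out.

```python
SET_TYPES = {"SS", "NS", "BS"}


def values_equal(a, b):
    """Compare two DynamoDB-JSON values, ignoring element order inside sets."""
    type_a, val_a = next(iter(a.items()))
    type_b, val_b = next(iter(b.items()))
    if type_a != type_b:
        return False
    if type_a in SET_TYPES:
        # Sets: same elements, any order.
        return len(val_a) == len(val_b) and set(val_a) == set(val_b)
    if type_a == "L":
        # Lists: order matters, but elements may contain sets.
        return len(val_a) == len(val_b) and all(
            values_equal(x, y) for x, y in zip(val_a, val_b))
    if type_a == "M":
        # Maps: same keys, and each value compared recursively.
        return val_a.keys() == val_b.keys() and all(
            values_equal(val_a[k], val_b[k]) for k in val_a)
    return val_a == val_b
```

A map containing the set ["x", "y"] then compares equal to the same map
containing ["y", "x"], while two lists with reordered elements do not.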

This patch also adds a new test comparing equality of a nested document with
some inner sets. This test passes on DynamoDB, failed on Alternator before
this patch, and passes with this patch.

Refs #5021
Fixes #8514

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419184840.471858-1-nyh@scylladb.com>
(cherry picked from commit dae7528fe5)
2021-06-07 09:10:42 +03:00
Nadav Har'El
9abd4677b1 alternator: fix inequality check of two sets
In issue #5021 we noted that Alternator's equality operator needs to be
fixed for the case of comparing two sets, because the equality check needs
to take into account the possibility of different element order.

Unfortunately, we fixed only the equality check operator, but forgot there
is also an inequality operator!

So in this patch we fix the inequality operator, and also add a test for
it that was previously missing.

The implementation of the inequality operator is trivial - it's just the
negation of the equality test. Our pre-existing tests verify that this is
the correct implementation (e.g., if attribute x doesn't exist, then "x = 3"
is false but "x <> 3" is true).

Refs #5021
Fixes #8513

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419141450.464968-1-nyh@scylladb.com>
(cherry picked from commit 50f3201ee2)
2021-06-07 08:42:46 +03:00
Nadav Har'El
9a07d7ca76 alternator: fix equality check of two unset attributes
When a condition expression (ConditionExpression, FilterExpression, etc.)
checks for equality of two item attributes, i.e., "x = y", and when one of
these attributes was missing we correctly returned false.
However, we also need to return false when *both* attributes are missing in
the item, because this is what DynamoDB does in this case. In other words
an unset attribute is never equal to anything - not even to another unset
attribute. This was not happening before this patch:

When x and y were both missing attributes, Alternator incorrectly returned
true for "x = y", and this patch fixes this case. It also fixes "x <> y"
which should be true when both x and y are unset (but was false
before this patch).

The other comparison operators - <, <=, >, >=, BETWEEN, were all
implemented correctly even before this patch.
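
The intended semantics of "=" and "<>" can be written down in a few lines of
Python. The item is modelled as a plain dict and the function names are made
up for the illustration; this is not how Alternator represents items.

```python
_MISSING = object()  # sentinel, so that a stored None is not treated as unset


def attr_equal(item, x, y):
    """Semantics of "x = y" for two attribute names of one item."""
    vx = item.get(x, _MISSING)
    vy = item.get(y, _MISSING)
    # An unset attribute is never equal to anything, not even to another
    # unset attribute.
    if vx is _MISSING or vy is _MISSING:
        return False
    return vx == vy


def attr_not_equal(item, x, y):
    """Semantics of "x <> y": the negation of the equality check."""
    return not attr_equal(item, x, y)
```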

This patch also includes tests for all the two-unset-attribute cases of
all the operators listed above. As usual, we check that these tests pass
on both DynamoDB and Alternator to confirm our new behavior is the correct
one - before this patch, two of the new tests failed on Alternator and
passed on DynamoDB.

Fixes #8511

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419123911.462579-1-nyh@scylladb.com>
(cherry picked from commit 46448b0983)
2021-06-06 15:55:57 +03:00
Takuya ASADA
d92a26636a dist: add DefaultDependencies=no to .mount units
To avoid ordering cycle error on Ubuntu, add DefaultDependencies=no
on .mount units.

Fixes #8482

Closes #8495

(cherry picked from commit 0b01e1a167)
2021-05-31 12:11:51 +03:00
Yaron Kaikov
7445bfec86 install.sh: Setup aio-max-nr upon installation
This is a follow up change to #8512.

Let's add the aio conf file during the Scylla installation process and make sure
we also remove this file when uninstalling Scylla.

As per Avi Kivity's suggestion, let's set aio value as static
configuration, and make it large enough to work with 500 cpus.

Closes #8650

Refs: #8713

(cherry picked from commit dd453ffe6a)
2021-05-27 14:12:19 +03:00
Yaron Kaikov
b077b198bf scylla_io_setup: configure "aio-max-nr" before iotune
On several instance types in AWS and Azure, we get the following failure
during the scylla_io_setup process:
```
ERROR 2021-04-14 07:50:35,666 [shard 5] seastar - Could not setup Async
I/O: Resource temporarily unavailable. The most common cause is not
enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that
number or reducing the amount of logical CPUs available for your
application
```

We have scylla_prepare:configure_io_slots() running before
scylla-server.service starts, but scylla_io_setup takes place
before that.

1) Let's move configure_io_slots() to scylla_util.py, since both
   scylla_io_setup and scylla_prepare import functions from it
2) Clean up scylla_prepare, since we don't need the same function twice
3) Let's use configure_io_slots() during scylla_io_setup to avoid such
   failures

Fixes: #8587

Closes #8512

Refs: #8713

(cherry picked from commit 588a065304)
2021-05-27 14:11:17 +03:00
Avi Kivity
44f85d2ba0 Update seastar submodule (httpd handler not reading content)
* seastar dadd299e7d...dab10ba6ad (1):
  > httpd: allow handler to not read an empty content
Fixes #8691.
2021-05-25 11:32:18 +03:00
Asias He
ccfe1d12ea storage_service: Delay update pending ranges for replacing node
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node is changed to postpone
responding to the gossip echo message, to avoid other nodes sending read
requests to the replacing node. It works as follows:

1) the replacing node does not respond to the echo message, so that other
nodes do not mark the replacing node as alive

2) the replacing node advertises hibernate state so other nodes know
the replacing node is replacing

3) the replacing node responds to the echo message so other nodes can mark
the replacing node as alive

This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.

For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)

```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)

c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```

To solve this problem for older releases without the patch "repair:
Switch to use NODE_OPS_CMD for replace operation", a minimum fix is
implemented in this patch. Once existing nodes learn the replacing node
is in HIBERNATE state, they record it as replacing, but add it to the
pending list only after the replacing node is marked as alive.

With this patch, when the existing nodes start to write to the replacing
node, the replacing node is already alive.
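
The rule can be modelled in a few lines of Python. The class and method names
are hypothetical and the model ignores everything else that gossip and
storage_service do; it only shows that a replacing node becomes a pending
endpoint once it is both known to be replacing and marked alive.

```python
class ReplaceTracker:
    """Toy model of what an existing node knows about replacing nodes."""

    def __init__(self):
        self.replacing = set()  # nodes seen in HIBERNATE state
        self.alive = set()      # nodes that answered the gossip echo

    def on_hibernate(self, node):
        self.replacing.add(node)

    def on_alive(self, node):
        self.alive.add(node)

    def pending_endpoints(self):
        # Writes go only to replacing nodes that are already alive.
        return self.replacing & self.alive
```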

Tests: replace_address_test.py:TestReplaceAddress.replace_node_same_ip_test + manual test
Fixes: #8013

Closes #8614

(cherry picked from commit e4872a78b5)
2021-05-25 14:12:31 +08:00
Asias He
b0399a7c3b gossip: Add helper to wait for a node to be up
This patch adds gossiper::wait_alive helper to wait for nodes to be up
on all shards.

Refs #8013

(cherry picked from commit f690f3ee8e)
2021-05-25 14:12:11 +08:00
Takuya ASADA
b81919dbe2 scylla_raid_setup: use /dev/disk/by-uuid to specify filesystem
Currently, var-lib-scylla.mount may fail because it can start before
the MDRAID volume is initialized.
We may be able to add "After=dev-disk-by\x2duuid-<uuid>.device" to wait for
the device to become available, but the systemd manual says it automatically
configures dependencies for a mount unit when we specify the filesystem path as an
"absolute path of a device node".

So we need to replace What=UUID=<uuid> with What=/dev/disk/by-uuid/<uuid>.

Fixes #8279

Closes #8681

(cherry picked from commit 3d307919c3)
2021-05-24 17:23:56 +03:00
Takuya ASADA
5651a20ba1 install.sh: apply correct file security context when copying files
Currently, the unified installer does not apply the correct file security context
while copying files, which causes a permission error on scylla-server.service.
We should apply the default file security context while copying files, using
the '-Z' option of /usr/bin/install.

Also, because install -Z requires a normalized path to apply the correct security
context, use 'realpath -m <PATH>' on path variables in the script.

Fixes #8589

Closes #8602

(cherry picked from commit 60c0b37a4c)
2021-05-19 12:40:12 +03:00
Takuya ASADA
8c30b83ea4 install.sh: fix no such file or directory on nonroot
Since we have added scylla-node-exporter, we needed to do 'install -d'
for systemd directory and sysconfig directory before copying files.

Fixes #8663

Closes #8664

(cherry picked from commit 6faa8b97ec)
2021-05-19 12:40:12 +03:00
Avi Kivity
fce7eab9ac Merge 'Fix type checking in index paging' from Piotr Sarna
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager -
namely, it throws an error if the base column type is not equal
to the view column type. It usually works correctly, unless the
base column type is a clustering key with DESC clustering order,
in which case the type is actually "reversed". From the point of view
of the paging state generation it's not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.
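
A Python sketch of the relaxed check, using made-up `Type` and `ReversedType`
classes as stand-ins for the real type hierarchy. Only the base type is
unwrapped, matching the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Type:
    name: str


@dataclass(frozen=True)
class ReversedType:
    """Hypothetical wrapper for a column with DESC clustering order."""
    underlying: Type


def underlying(t):
    return t.underlying if isinstance(t, ReversedType) else t


def check_paging_types(base_type, view_type):
    # Relaxed check: a reversed base type deserializes the same way as its
    # underlying type, so compare against the underlying type.
    if underlying(base_type) != view_type:
        raise TypeError(f"Mismatched types: {base_type} vs {view_type}")
```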

Tests: unit(release), along with the additional test case
       introduced in this series; the test also passes
       on Cassandra

Fixes #8666

Closes #8667

* github.com:scylladb/scylla:
  test: add a test case for paging with desc clustering order
  cql3: relax a type check for index paging

(cherry picked from commit 593ad4de1e)
2021-05-19 12:40:09 +03:00
Raphael S. Carvalho
ab8eefade7 compaction_manager: Don't swallow exception in procedure used by reshape and resharding
run_custom_job() was swallowing all exceptions, which is definitely
wrong because failure in a resharding or reshape would be incorrectly
interpreted as success, which means upper layer will continue as if
everything is ok. For example, ignoring a failure in resharding could
result in a shared sstable being left unresharded, so when that sstable
reaches a table, scylla would abort as shared ssts are no longer
accepted in the main sstable set.
Let's allow the exception to be propagated, so failure will be
communicated, and resharding and reshape will be all or nothing, as
originally intended.
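
The difference between swallowing and propagating can be shown with a short
Python sketch. The function names mirror run_custom_job() only loosely; the
`on_done` callback and the `reshard` driver are assumptions for the example.

```python
def run_custom_job_swallowing(job):
    """Pre-fix shape: any failure is silently dropped."""
    try:
        job()
    except Exception:
        pass  # the caller cannot tell that the job failed


def run_custom_job(job, on_done=lambda: None):
    """Post-fix shape: bookkeeping still runs, but the exception propagates."""
    try:
        job()
    finally:
        on_done()


def reshard(jobs):
    """All or nothing: the first failing job aborts the whole operation."""
    for job in jobs:
        run_custom_job(job)
    return "resharded"
```

With the swallowing variant, `reshard` would return "resharded" even if a job
had raised, and the upper layer would continue as if everything were ok.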

Fixes #8657.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210515015721.384667-1-raphaelsc@scylladb.com>
(cherry picked from commit 10ae77966c)
2021-05-18 13:00:07 +03:00
Hagit Segev
e2704554b5 release: prepare for 4.5.rc2 2021-05-12 14:55:53 +03:00
Lauro Ramos Venancio
e36e490469 TWCS: initialize _highest_window_seen
The timestamp_type is an int64_t. So, it has to be explicitly
initialized before using it.

This missing initialization prevented the major compaction
from happening when a time window finishes, as described in #8569.

Fixes #8569

Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>

Closes #8590

(cherry picked from commit 15f72f7c9e)
2021-05-06 08:51:58 +03:00
Pavel Emelyanov
c97005fbb8 tracing: Stop tracing in main's deferred action
Tracing is created in two steps and is destroyed in two as well.
The 2nd step doesn't have a corresponding stop part, so here
it is: defer the tracing stop after it was started.

Keep in mind that tracing is also shut down on
drain, so the stopping should handle this.

Fixes #8382
tests: unit(dev), manual(start-stop, aborted-start)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210331092221.1602-1-xemul@scylladb.com>
(cherry picked from commit 887a1b0d3d)
2021-05-05 15:22:05 +03:00
Nadav Har'El
d881d539f3 Update tools/java submodule
Backport sstableloader fix in tools/java submodule.
Fixes #8230.

* tools/java 768a59a6f1...dbcea78e7d (1):
  > sstableloader: Handle non-prepared batches with ":" in identifier names

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-05-03 10:02:00 +03:00
Avi Kivity
b8a502fab0 Merge '[branch 4.5] Backport reader_permit: always forward resources to the semaphore' from Botond Dénes
This is a backport of 8aaa3a7bb8 to <= branch-4.5. The main conflicts were around Benny's reader close series (fa43d7680), but it also turned out that an additional patch (2f1d65ca11) has to be backported as well to make sure admission on signaling resources doesn't deadlock.

Refs: https://github.com/scylladb/scylla/issues/8493

Closes #8558

* github.com:scylladb/scylla:
  test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress
  test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units
  reader_concurrency_semaphore: add dump_diagnostics()
  reader_permit: always forward resources
  test: multishard_mutation_query_test: fuzzy-test: don't consume resource up-front
  reader_concurrency_semaphore: make admission conditions consistent
2021-04-29 15:36:03 +03:00
Botond Dénes
f7f2bb482f test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress
This unit test checks that the semaphore doesn't get into a deadlock
when contended, in the presence of many memory-only reads (that don't
wait for admission). This is tested by simulating the 3 kinds of reads we
currently have in the system:
* memory-only: reads that don't pass admission and only own memory.
* admitted: reads that pass admission.
* evictable: admitted reads that are furthermore evictable.

The test creates and runs a large number of these reads in parallel,
read kinds being selected randomly, then creates a watchdog which
kills the test if no progress is being made.

(cherry picked from commit 45d580f056)
2021-04-29 15:26:21 +03:00
Botond Dénes
b16db6512c test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units
This unit test passes a read through admission again and again, just
like an evictable reader would be during its lifetime. When readmitted,
the read sometimes has to wait and sometimes not. This is to check that
readmitting a previously admitted reader doesn't leak any units.

(cherry picked from commit cadc26de38)
2021-04-29 15:26:21 +03:00
Botond Dénes
1a7c8223fe reader_concurrency_semaphore: add dump_diagnostics()
Allow semaphore related tests to include a diagnostics printout in error
messages to help determine why the test failed.

(cherry picked from commit d246e2df0a)
2021-04-29 15:26:21 +03:00
Botond Dénes
ac6aa66a7b reader_permit: always forward resources
This commit conceptually reverts 4c8ab10. Said commit was meant to
prevent the scenario where memory-only permits -- those that don't pass
admission but still consume memory -- completely prevent the admission
of reads, possibly even causing a deadlock because a permit might even
block its own admission. The protection introduced by said commit
however proved to be very problematic. It made the status of resources
on the permit very hard to reason about and created loopholes via which
permits could accumulate without tracking or they could even leak
resources. Instead of continuing to patch this broken system, this
commit does away with this "protection" based on the observation that
deadlocks are now prevented anyway by the admission criteria introduced
by 0fe75571d9, which admits a read anyway when all the initial count
resources are available (meaning no admitted reader is alive),
regardless of availability of memory.
The benefit of this revert is that the semaphore now knows about all
the resources and is able to do its job better as it is not "lied to"
about resources by the permits. Furthermore the status of a permit's
resources is much simpler to reason about, there are no more loopholes
in unexpected state transitions to swallow/leak resources.
To prove that this revert is indeed safe, in the next commit we add
robust tests that stress test admission on a highly contested semaphore.
This patch also does away with the registered/admitted differentiation
of permits, as this doesn't make much sense anymore, instead these two
are unified into a single "active" state. One can always tell whether a
permit was admitted or not from whether it owns count resources anyway.
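
The admission criteria referred to above can be sketched in Python. The
`Resources` class and `can_admit` function are illustrative names; the
ordinary "enough count and enough memory" branch is an assumption about the
normal path, while the first branch follows the description of 0fe75571d9.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    count: int
    memory: int


def can_admit(available, initial_count, needed):
    """Decide whether a read may be admitted by the semaphore."""
    # All count units are free, so no admitted reader is alive: admit
    # regardless of memory. This is what rules out the deadlock.
    if available.count == initial_count:
        return True
    # Otherwise (assumed normal path) both count and memory must suffice.
    return available.count >= needed.count and available.memory >= needed.memory
```

For example, with 10 of 10 count units free a read is admitted even if
memory-only permits have pushed available memory below zero.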

(cherry picked from commit caaa8ef59a)
2021-04-29 15:26:21 +03:00
Botond Dénes
98a39884c3 test: multishard_mutation_query_test: fuzzy-test: don't consume resource up-front
The fuzzy test consumes a large chunk of resource from the semaphore
up-front to simulate a contested semaphore. This isn't an accurate
simulation, because no permit will have more than 1 unit in reality.
Furthermore this can even cause a deadlock since 8aaa3a7 as now we rely
on all count units being available to make forward progress when memory
is scarce.
This patch just cuts out this part of the test, we now have a dedicated
unit test for checking a heavily contested semaphore, that does it
properly, so no need to try to fix this clumsy attempt that is just
making trouble at this point.

Refs: #8493

Tests: release(multishard_mutation_query_test:fuzzy_test)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429084458.40406-1-bdenes@scylladb.com>
(cherry picked from commit 26ae9555d1)
2021-04-29 15:26:21 +03:00
Eliran Sinvani
88192811e7 Materialized views: fix possibly old views coming from other nodes
Migration manager has a function to get a schema (for read or write);
this function queries a peer node and retrieves the schema from it. One
scenario where this can happen is when an old node queries an old, not yet
fixed index.
This makes a hole through which views that are only adjusted for reading
can slip through.

Here we plug the hole by fixing such views before they are registered.

Closes #8509

(cherry picked from commit 480a12d7b3)

Fixes #8554.
2021-04-29 14:02:01 +03:00
Botond Dénes
32f21f7281 reader_concurrency_semaphore: make admission conditions consistent
Currently there are two places where we check admission conditions:
`do_wait_admission()` and `signal()`. Both use `has_available_units()`
to check resource availability, but the former has some additional
resource related conditions on top (in `may_proceed()`), which lead to
the two paths working with slightly different conditions. To fix, push
down all resource availability related checks to `has_available_units()`
to ensure admission conditions are consistent across all paths.

(cherry picked from commit d90cd6402c)
2021-04-27 18:12:29 +03:00
Piotr Sarna
c9eaf95750 Merge 'commitlog: Fix race and edge condition in delete_segments' from Calle Wilund
Fixes #8363
Fixes #8376

delete_segments has two issues when running with a size-limited
commit log and strict adherence to said limit.

1.) It uses parallel processing, with deferral. This means that
    the disk usage variables it looks at might not be fully valid
    - i.e. we might have already issued a file delete that will
    reduce disk footprint such that a segment could instead be
    recycled, but since vars are (and should) only updated
    _post_ delete, we don't know.
2.) It does not take into account edge conditions, when we only
    delete a single segment, and this segment is the border segment
    - i.e. the one pushing us over the limit, yet allocation is
    desperately waiting for recycling. In this case we should
    allow it to live on, and assume that next delete will reduce
    footprint. Note: to ensure exact size limit, make sure
    total size is a multiple of segment size.

If we had an error in recycling (disk rename?), and no elements
are available, we could have waiters hoping they will get segments.
Abort the queue (not permanent, but wakes up waiters), and let them
retry. Since we did deletions instead, disk footprint should allow
for new allocs at least. Or more likely, everything is broken, but
we will at least make more noise.

Closes #8372

* github.com:scylladb/scylla:
  commitlog: Add signalling to recycle queue iff we fail to recycle
  commitlog: Fix race and edge condition in delete_segments
  commitlog: coroutinize delete_segments
  commitlog_test: Add test for deadlock in recycle waiter

(cherry picked from commit 8e808a56d2)
2021-04-21 18:01:37 +03:00
Avi Kivity
44c6d0fcf9 Update seastar submodule (low bandwidth disks)
* seastar 72e3baed9c...dadd299e7d (2):
  > io_queue: Honor disks with tiny request rate
  > io_queue: Shuffle fair_group creation

Fixes #8378.
2021-04-21 14:00:35 +03:00
Avi Kivity
4bcc0badb2 Point seastar submodule at scylla-seastar.git
This allows us to backport fixes to seastar selectively.
2021-04-21 14:00:00 +03:00
Tomasz Grabiec
97664e63fe Merge 'Make sure that cache_flat_mutation_reader::do_fill_buffer does not fast forward finished underlying reader' from Piotr Jastrzębski
It is possible that a partition is in cache but is not present in sstables that are underneath.
In such case:
1. cache_flat_mutation_reader will fast forward underlying reader to that partition
2. The underlying reader will enter the state when it's empty and its is_end_of_stream() returns true
3. Previously cache_flat_mutation_reader::do_fill_buffer would try to fast forward such empty underlying reader
4. This PR fixes that

Test: unit(dev)

Fixes #8435
Fixes #8411

Closes #8437

* github.com:scylladb/scylla:
  row_cache: remove redundant check in make_reader
  cache_flat_mutation_reader: fix do_fill_buffer
  read_context: add _partition_exists
  read_context: remove skip_first_fragment arg from create_underlying
  read_context: skip first fragment in ensure_underlying

(cherry picked from commit 163f2be277)
2021-04-20 12:49:10 +03:00
Kamil Braun
204964637a time_series_sstable_set: return partition start if some sstables were ck-filtered out
When a particular partition exists in at least one sstable, the cache
expects any single-partition query to this partition to return a `partition_start`
fragment, even if the result is empty.

In `time_series_sstable_set::create_single_key_sstable_reader` it could
happen that all sstables containing data for the given query get
filtered out and only sstables without the relevant partition are left,
resulting in a reader which immediately returns end-of-stream (while it
should return a `partition_start` and if not in forwarding mode, a
`partition_end`). This commit fixes that.

We do it by extending the reader queue (used by the clustering reader
merger) with a `dummy_reader` which will be returned by the queue as
the very first reader. This reader only emits a `partition_start` and,
if not in forwarding mode, a `partition_end` fragment.
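
A minimal sketch of the idea, using lists of `(kind, value)` tuples as
stand-ins for fragment streams. The `merge_readers` helper is hypothetical
and far simpler than the real clustering reader merger; only the
`dummy_reader` behaviour follows the description above.

```python
def dummy_reader(partition_key, forwarding=False):
    """Emit only the partition markers, as described above."""
    fragments = [("partition_start", partition_key)]
    if not forwarding:
        fragments.append(("partition_end", partition_key))
    return fragments


def merge_readers(readers):
    """Toy merger: one partition_start, sorted rows, one partition_end."""
    start = end = None
    rows = []
    for reader in readers:
        for kind, value in reader:
            if kind == "partition_start":
                start = value
            elif kind == "partition_end":
                end = value
            else:
                rows.append(value)
    if start is None:
        return []  # nothing but empty readers: immediate end-of-stream
    merged = [("partition_start", start)]
    merged.extend(("row", row) for row in sorted(rows))
    if end is not None:
        merged.append(("partition_end", end))
    return merged
```

With only empty readers the merge yields nothing, which is the behaviour
the cache does not expect. Placing the dummy reader first guarantees the
markers are present even when every remaining sstable lacks the partition.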

Fixes #8447.

Closes #8448

(cherry picked from commit 5c7ed7a83f)
2021-04-20 12:49:10 +03:00
Kamil Braun
c402abe8e9 clustering_order_reader_merger: handle empty readers
The merger could return end-of-stream if some (but not all) of the
underlying readers were empty (i.e. not even returning a
`partition_start`). This could happen in places where it was used
(`time_series_sstable_set::create_single_key_sstable_reader`) if we
opened an sstable which did not have the queried partition but passed
all the filters (specifically, the bloom filter returned a false
positive for this sstable).

The commit also extends the random tests for the merger to include empty
readers and adds an explicit test case that catches this bug (in a
limited scope: when we merge a single empty reader).

It also modifies `test_twcs_single_key_reader_filtering` (regression
test for #8432) because the time where the clustering key filter is
invoked changes (some invocations move from the constructor of the
merger to operator()). I checked manually that it still catches the bug
when I reintroduce it.

Fixes #8445.

Closes #8446

(cherry picked from commit 7ffb0d826b)
2021-04-20 12:49:10 +03:00
Yaron Kaikov
4a78d6403e release: prepare for 4.5.rc1 2021-04-14 17:02:17 +03:00
Kamil Braun
2f20d52ac7 sstables: fix TWCS single key reader sstable filter
The filter passed to `min_position_reader_queue`, which was used by
`clustering_order_reader_merger`, would incorrectly include sstables as
soon as they passed through the PK (bloom) filter, and would include
sstables which didn't pass the PK filter (if they passed the CK
filter). Fortunately this wouldn't cause incorrect data to be returned,
but it would cause sstables to be opened unnecessarily (these sstables
would immediately return eof), resulting in a performance drop. This commit
fixes the filter and adds a regression test which uses statistics to
check how many times the CK filter was invoked.

Fixes #8432.

Closes #8433

(cherry picked from commit 3687757115)
2021-04-11 12:59:31 +03:00
Raphael S. Carvalho
540439ee46 sstable_set: Implement compound_sstable_set's create_single_key_sstable_reader()
compound set isn't overriding create_single_key_sstable_reader(), so
default implementation is always called. Although default impl will
provide correct behavior, specialized ones that provide better perf,
currently available only for TWCS, were being ignored.

compound set impl of single key reader will basically combine single key
readers of all sets managed by it.

Fixes #8415.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210406205009.75020-1-raphaelsc@scylladb.com>
(cherry picked from commit 8e0a1ca866)
2021-04-11 12:59:31 +03:00
Gleb Natapov
a0622e85ab storage_proxy: do not crash on LOCAL_QUORUM access to a DC with zero replication
If a table that is not replicated to a certain DC (rf=0) is accessed
with LOCAL_QUORUM on that DC the current code will crash since the
'targets' array will be empty and read executor does not handle it.
Fix it by replying with empty result.
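
The shape of the fix can be sketched as follows. All names here
(`local_quorum_read`, `replicas_by_dc`, `read_from`) are hypothetical and
only illustrate the early return on an empty target list.

```python
def local_quorum_read(replicas_by_dc, local_dc, read_from):
    """Read from a quorum of local replicas, or return an empty result.

    replicas_by_dc maps a DC name to its replica list for the table;
    a DC with rf=0 has an empty (or missing) list.
    """
    targets = replicas_by_dc.get(local_dc, [])
    if not targets:
        # No replicas in this DC: reply with an empty result
        # instead of handing an empty target list to the executor.
        return []
    quorum = len(targets) // 2 + 1
    return read_from(targets[:quorum])
```

The guard runs before the quorum arithmetic, so the read path never sees
an empty `targets` list.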

Fixes #8354

Message-Id: <YGro+l2En3fF80CO@scylladb.com>
(cherry picked from commit cd24dfc7e5)
2021-04-06 18:12:36 +03:00
Nadav Har'El
90741dc62c Update submodule tools/java
Backport for refs #8390.

* tools/java ccc4201ded...768a59a6f1 (1):
  > sstableloader: fix handling of rewritten partition

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-04-05 18:47:49 +03:00
Piotr Jastrzebski
83cfa6a63c config: ignore enable_sstables_mc_format flag
Don't allow users to disable MC sstables format any more.
We would like to retire some old cluster features that have been around
for years. Namely MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES. To do this
we first have to make sure that all existing clusters have them enabled.
It is impossible to know that unless we stop supporting
enable_sstables_mc_format flag.

Test: unit(dev)

Refs #8352

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8360

(cherry picked from commit 57c7964d6c)
2021-04-01 14:04:58 +03:00
Hagit Segev
1816c6df8c release: prepare for 4.5.rc0 2021-03-31 22:25:05 +03:00
Avi Kivity
f9244734f9 Update seastar submodule
* seastar 48376c76a...72e3baed9 (3):
  > file: Add RFW_NOWAIT detection case for AuFS
  > sharded: provide type info on no sharded instance exception
  > iotune: Estimate accuracy of measurement

Added missing include "database.hh" to api/lsa.cc since seastar::sharded<>
now needs full type information.
2021-03-31 10:40:04 +03:00
Avi Kivity
de10a74a84 Merge 'types: remove linearization from abstract_type::compare' from Wojciech Mitros
This patch is another series on removing big allocations from scylla.
The buffers in `compare_visitor` were replaced with `managed_bytes_view`; a similar change was also needed in tuple_deserializing_iterator and listlike_partial_deserializing_iterator, and was applied as well.

Tests:unit(dev)

Closes #8357

* github.com:scylladb/scylla:
  types: remove linearization from abstract_type::compare
  types: replace buffers in tuple_deserializing_iterator with fragmented ones
  types: make tuple_type_impl::split work with any FragmentedViews
  types: move read_collection_size/value specialization to header file
2021-03-31 08:50:52 +03:00
Wojciech Mitros
f57fa935a2 types: remove linearization from abstract_type::compare
To avoid high latencies caused by large contiguous allocations
needed by linearizing, work on fragmented buffers instead.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-31 06:35:10 +02:00
Wojciech Mitros
daa31be37f types: replace buffers in tuple_deserializing_iterator with fragmented ones
In preparation for removing linearization from abstract_type::compare,
add options to avoid linearization in tuple_deserializing_iterator.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-31 06:35:09 +02:00
Wojciech Mitros
823d4c7529 types: make tuple_type_impl::split work with any FragmentedViews
We may want to store a tuple in a fragmented buffer. To split it
into a vector of optional bytes, tuple_type_impl::split can be used.
To split a contiguous buffer(bytes_view), simply pass
single_fragmented_view(bytes_view).

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-31 06:34:37 +02:00
Piotr Sarna
6a2377a233 Merge 'Fast slow query trace doc' from Ivan
Addressed https://github.com/scylladb/scylla/pull/8314#issuecomment-803671234
(write issue: "Tracing: slow query fast mode documentation request")

Adds documentation for the fast slow-query tracing mode to docs/guide/tracing.md.

A patch to scylla-doc will be duplicated after this one is merged.

cc @nyh
cc @vladzcloudius

Closes #8373

* github.com:scylladb/scylla:
  tracing: api: fast mode doc improvement
  tracing: fast slow query tracing mode docs
2021-03-30 17:57:04 +02:00
Ivan Prisyazhnyy
778d9217f3 tracing: api: fast mode doc improvement
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-30 16:22:56 +02:00
Ivan Prisyazhnyy
b3b66fb629 tracing: fast slow query tracing mode docs
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-30 16:22:56 +02:00
Avi Kivity
d2921b5112 Merge 'Clean up > 2-year-old features' from Piotr Sarna
Following the work started in 253a7640e, a new batch of old features is assumed to be always available. They are all still announced via gossip, but the code assumes that the feature is always true, because we only support upgrades from a previous release, and the release window is considerably smaller than 2 years.

Features picked this time via `git blame`, along with the date of their introduction:

* fe4afb1aa3 (Asias He                  2018-09-05 14:52:10 +0800  109) static const sstring ROW_LEVEL_REPAIR = "ROW_LEVEL_REPAIR";
* ff5e541335 (Calle Wilund              2019-02-05 13:06:07 +0000  110) static const sstring TRUNCATION_TABLE = "TRUNCATION_TABLE";
* fefef7b9eb (Tomasz Grabiec            2019-03-05 19:08:07 +0100  111) static const sstring CORRECT_STATIC_COMPACT_IN_MC = "CORRECT_STATIC_COMPACT_IN_MC";

Tests: unit(dev)

Closes #8235

* github.com:scylladb/scylla:
  sstables,test: remove variables depending on old features
  gms: make CORRECT_STATIC_COMPACT_IN_MC ft unconditionally true
  sstables: stop relying on CORRECT_STATIC_COMPACT_IN_MC feature
  gms: make TRUNCATION_TABLE feature unconditionally true
  gms: make ROW_LEVEL_REPAIR feature unconditionally true
  repair: stop relying on ROW_LEVEL_REPAIR feature
2021-03-30 16:13:35 +03:00
Calle Wilund
c0666ea89b commitlog: Fix inner loop condition in allocation pre-fill
Fixes #8369

This was originally found (and fixed) by @gleb-cloudius, but the patch set with
the fix was reverted at some point, and the fix went away. Now the error remains
even in new, nice coroutine code.

We check the wrong var in the inner loop of the pre-fill path of
allocate_segment_ex, often causing us to generate giant writev:s of more or less
the whole file.  Not intended.

Closes #8370
2021-03-30 12:14:55 +02:00
Avi Kivity
c2866f46b5 test: relax quota for tests on machines with small page size
8a8589038c ("test: increase quota for tests to 6GB") increased
the quota for tests from 2GB to 6GB. I later found that the increased
requirement is related to the page size: Address Sanitizer allocates
at least a page per object, and so if the page size is larger the
memory requirement is also larger.

Make use of this by only increasing the quota if the page size
is greater than 4096 (I've only seen 4096 and 65536 in the wild).
This allows greater parallelism when the page size is small.
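
A small sketch of the rule described above. The 2GB and 6GB figures and
the 4096 threshold come from this message; the function names and the
parallelism helper are assumptions, not the actual test runner code.

```python
import os

GIB = 1 << 30


def memory_quota_for_tests(page_size):
    """Per-test memory quota: raised only when pages are larger than 4096."""
    return 6 * GIB if page_size > 4096 else 2 * GIB


def max_parallel_tests(total_memory, page_size):
    """Hypothetical helper: how many tests fit into the given memory."""
    return max(1, total_memory // memory_quota_for_tests(page_size))


def current_page_size():
    """Page size of the running machine (Unix only)."""
    return os.sysconf("SC_PAGE_SIZE")
```

On a machine with 4096-byte pages and 64GB of memory, this toy calculation
allows three times as many parallel tests as the flat 6GB quota would.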

Closes #8371
2021-03-30 12:13:42 +02:00
Avi Kivity
8785dd62cb tests: use kernel page cache
Tests are short-lived and use a small amount of data. They
are also often run repeatedly, and the data is deleted immediately
after the test. This is a good scenario for using the kernel page
cache, as it can cache read-only data from test to test, and avoid
spilling write data to disk if it is deleted quickly.

Acknowledge this by using the new --kernel-page-cache option for
tests.

This is expected to help on large machines, where the disk can be
overloaded. Smaller machines with NVMe disks probably will not see
a difference.

Closes #8347
2021-03-30 12:04:55 +02:00
Piotr Sarna
6de2691bbd sstables,test: remove variables depending on old features
In order to maintain backward compatibility wrt. cluster features,
two boolean variables were kept in sstable writers:
 - correctly_serialize_non_compound_range_tombstones
 - correctly_serialize_static_compact_in_mc

Since these features are assumed to always be present now,
the above variables are no longer needed and can be purged.
2021-03-30 09:37:41 +02:00
Piotr Sarna
e42dee6afb gms: make CORRECT_STATIC_COMPACT_IN_MC ft unconditionally true
The feature is assumed to be true due to being over 2 years old.
It's still advertised in gossip, but it's assumed to always be present.
2021-03-30 09:37:13 +02:00
Piotr Sarna
28c9af6fa5 sstables: stop relying on CORRECT_STATIC_COMPACT_IN_MC feature
The feature bit is going away because it's over 2 years old,
so the code which depended on it becomes unconditional.
2021-03-30 09:37:04 +02:00
Piotr Sarna
08c4350968 gms: make TRUNCATION_TABLE feature unconditionally true
Turns out the feature was not used presently.
Historically, the commit which removed the support is
30a700c5b0 .
2021-03-30 09:36:45 +02:00
Piotr Sarna
c070178c7e gms: make ROW_LEVEL_REPAIR feature unconditionally true
The feature is assumed to be true due to being over 2 years old.
It's still advertised in gossip, but it's assumed to always be present.
2021-03-30 09:36:11 +02:00
Piotr Sarna
80ebedd242 repair: stop relying on ROW_LEVEL_REPAIR feature
The feature is going away because it's over 2 years old,
so the code which depended on it becomes unconditional.
2021-03-30 09:35:40 +02:00
Avi Kivity
c1badc6317 noexcept_traits: convert enable_if to concepts
A little easier to read.

Closes #8329
2021-03-30 09:30:23 +02:00
Avi Kivity
405c4e7af1 serializer: replace enable_if in deserialized_bytes_proxy with constraint
Simpler to read and understand.

Closes #8303
2021-03-30 09:30:06 +02:00
Avi Kivity
7c953f33d5 utils: disk-error-handler: replace enable_if with concepts
Simpler, cleaner. We also replace the deprecated std::result_of_t
with std::invoke_result_t.

Closes #8305
2021-03-30 09:29:46 +02:00
Nadav Har'El
115324f71a Merge 'Add partial admission control to Thrift frontend' from Piotr Sarna
This pull request adds partial admission control to Thrift frontend. The solution is partial mostly because the Thrift layer, aside from allowing Thrift messages, may also be used as a base protocol for CQL messages. Coupling admission control to this one is a little bit more complicated due to how the layer currently works - a Thrift handler, created once per connection, keeps a local `query_state` instance for the occasion of handling CQL requests. However, `query_state` should be kept per query, not per connection, so adding admission control to this aspect of the frontend is left for later.

Finally, the way service permits are passed from the server, via the handler factory, handler and then to queries is hacky. I haven't figured out how to force Thrift to pass custom context per query, so the way it works now is by relying on the fact that the server does not yield (in Seastar sense) between having read the request and launching the proper handler. Due to that, it's possible to just store the service permit in the server itself, pass the reference (address) to it down to the handler, and then read it back from the handling code and claim ownership of it. It works, but if anyone has a better idea, please share.

Refs #4826

Closes #8313

* github.com:scylladb/scylla:
  thrift: add support for max_concurrent_requests_per_shard
  thrift: add metrics for admission control
  thrift: add a counter for in-flight requests
  thrift: add a counter for blocked requests
  thrift: partially add admission control
  service_permit: add a getter for the number of units held
  thrift: coroutinize processing a request
  memory_limiter: add a missing seastarx include
2021-03-29 21:36:50 +03:00
Raphael S. Carvalho
a390f4eb61 sstables: optimize LCS reshape for repair-based operations
LCS reshape is currently inefficient for repair-based operation, because
the disjoint run of 256 sstables is reshaped into bigger L0 files, which
will be then integrated into the main sstable set.
On reshape completion, LCS has to compact those big L0 files onto higher
levels, until last level is reached, producing bad write amplification.

A much better approach is to instead compact that disjoint run into the
best possible level L, which can be figured out with:
	log (base fan_out) of (total_size / max_sstable_size)
This compaction will be essentially a copy operation. It's important
to do it rather than only mutating the level of sstables because we have
to reshape the input run according to LCS parameters like sstable size.
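
The level formula above can be worked through with a short sketch. The
rounding direction (up), the default fan-out of 10 and the 160MB sstable
size used in the usage note are assumptions for illustration; the loop uses
integer arithmetic to avoid floating-point logarithms.

```python
def ideal_level(total_size, max_sstable_size, fan_out=10):
    """Smallest level L with fan_out**L >= total_size / max_sstable_size."""
    # Number of max-sized sstables needed to hold the run, rounded up.
    sstables_needed = max(1, -(-total_size // max_sstable_size))
    level, capacity = 0, 1
    while capacity < sstables_needed:
        capacity *= fan_out
        level += 1
    return level
```

Under these assumptions, a run that needs 100 sstables of 160MB each lands
in level 2, and a run that fits in a single sstable stays in level 0.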

For repair-based bootstrap/replace, the input disjoint run is now efficiently
reshaped into an ideal level L, so there's no compaction backlog once
reshape completes.

This behavior will manifest in the log as this:
LeveledManifest - Reshaping 256 disjoint sstables in level 0 into level 2

For repair-based decommission/removenode though, for which reshape isn't
wired in yet, level L may temporarily hold 2 disjoint runs, which overlap
one another, but LCS itself will incrementally merge them through either
promotion of L-1 into L, or by detecting overlapping in level L and
merging the overlapping sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210329171826.42873-1-raphaelsc@scylladb.com>
2021-03-29 20:22:04 +03:00
Botond Dénes
3c54c990ab test: view_build_test: test_view_update_generator_buffering: fail gracefully
Failures in this test typically happen inside the test consumer object.
These however don't stop the test as the code invoking the consumer
object handles exceptions coming from it. So the test will run to
completion and will fail again when comparing the produced output with
the expected one. This results in distracting failures. The real problem
is not the difference in the output, but the first check that failed,
which is however buried in the noise. To prevent this add an "ok" flag
which is set to false if the consumer fails. In this case the additional
checks are skipped in the end to not generate useless noise.
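
The pattern can be sketched in a few lines. Everything here is a
hypothetical stand-in for the real test: `drive()` plays the role of the
code under test, which handles exceptions thrown by the consumer.

```python
class CheckingConsumer:
    """Test consumer that records whether any of its checks failed."""

    def __init__(self):
        self.ok = True
        self.seen = []

    def consume(self, item):
        if item < 0:
            self.ok = False  # remember the first real failure
            raise ValueError(f"unexpected negative item {item}")
        self.seen.append(item)


def drive(consumer, items):
    """Stand-in for the code under test: it handles consumer exceptions."""
    for item in items:
        try:
            consumer.consume(item)
        except ValueError:
            pass  # the caller carries on, so the test runs to completion


def run_test(items, expected):
    consumer = CheckingConsumer()
    drive(consumer, items)
    if not consumer.ok:
        return "failed in consumer"  # skip the noisy output comparison
    return "passed" if consumer.seen == expected else "output mismatch"
```

With the flag in place, the only reported failure is the one raised inside
the consumer, rather than a follow-up mismatch in the final comparison.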

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210326083147.26113-2-bdenes@scylladb.com>
2021-03-29 17:58:28 +03:00
Avi Kivity
a8463cfb37 Merge "reader_permit: signal leaked resources" from Botond
"
When a permit is destroyed we check if it still holds on to any
resources in the destructor. Any resources the permit still holds on to are
leaked resources, as users should have released these. Currently we just
invoke `on_internal_error_noexcept()` to handle this, which -- depending
on the configuration -- will result in an error message or an assert. In
the former case, the resources will be leaked for good. This mini-series
fixes this, by signaling back these resources to the semaphore. This
helps avoid an eventual complete dry-up of all semaphore resources and a
subsequent complete shutdown of reads.

Tests: unit(release, debug)
"
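
A toy model of the behaviour described in the cover letter above. The
class and method names are hypothetical, and `close()` stands in for the
C++ destructor, since Python has no deterministic equivalent.

```python
class UnitPool:
    """Toy semaphore that tracks available units and reported leaks."""

    def __init__(self, units):
        self.available = units
        self.leak_reports = []

    def signal(self, units):
        self.available += units


class ToyPermit:
    def __init__(self, pool):
        self.pool = pool
        self.held = 0

    def consume(self, units):
        self.pool.available -= units
        self.held += units

    def release(self, units):
        self.held -= units
        self.pool.signal(units)

    def close(self):
        """Report any units still held, then give them back to the pool."""
        leaked = self.held
        if leaked:
            self.pool.leak_reports.append(leaked)  # the error is still reported
            self.pool.signal(leaked)               # but the units are not lost
            self.held = 0
        return leaked
```

The leak is still reported, but the pool recovers its units, so repeated
leaks cannot drain it completely.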

* 'reader-permit-signal-leaked-resources/v1' of https://github.com/denesb/scylla:
  reader_permit: signal leaked resources
  test: test_reader_lifecycle_policy: keep semaphores alive until all ops cease
  sstables: generate_summary(): extend the lifecycle of the reader concurrency semaphore
2021-03-29 17:57:31 +03:00
Botond Dénes
9e01c4c667 test: view_build_test: test_view_update_generator_buffering: use separate permit for readers
Said test has two separate logical readers, but they share the same
permit, which is illegal. This didn't cause any problems yet, but soon
the semaphore will start to keep score of active/inactive permits which
will be confused by such sharing, so have them use separate permits.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210326083147.26113-1-bdenes@scylladb.com>
2021-03-29 17:35:51 +03:00
Takuya ASADA
6f678ab7ff aws: initialize self._disks['ebs'] when no EBS disks
Seems like aws_instance.ebs_disks() causes a traceback when no EBS disks
are available; it needs to be initialized with an empty list.

Fixes #8365

Closes #8366
2021-03-29 17:21:14 +03:00
Gleb Natapov
13a3cf62bb raft: move incoming message processing into per state functions
Clean up step() function by moving state specific processing into per
state functions. This way it is easier to see how each state handles
individual messages. No functional changes here.

Message-Id: <YGHCiTWjq+L/jVCB@scylladb.com>
2021-03-29 15:48:43 +02:00
Tomasz Grabiec
43fd322856 Merge 'scylla-gdb.py: Add io-queues command' from Piotr Sarna
The command can be used to inspect IO queues of a local reactor.
Example output:
```
    (gdb) scylla io-queues
        Dev 0:
            Class:                  |shares:         |ptr:
            --------------------------------------------------------------------------------
            "default"               |1               |(seastar::priority_class_data *)0x6000002c6500
            "commitlog"             |1000            |(seastar::priority_class_data *)0x6000003ad940
            "memtable_flush"        |1000            |(seastar::priority_class_data *)0x6000005cb300
            "streaming"             |200             |(seastar::priority_class_data *)0x0
            "query"                 |1000            |(seastar::priority_class_data *)0x600000718580
            "compaction"            |1000            |(seastar::priority_class_data *)0x6000030ef0c0

            Max request size:    2147483647
            Max capacity:        Ticket(weight: 4194303, size: 4194303)
            Capacity tail:       Ticket(weight: 73168384, size: 100561888)
            Capacity head:       Ticket(weight: 77360511, size: 104242143)

            Resources executing: Ticket(weight: 2176, size: 514048)
            Resources queued:    Ticket(weight: 384, size: 98304)
            Handles: (1)
                Class 0x6000005d7278:
                    Ticket(weight: 128, size: 32768)
                    Ticket(weight: 128, size: 32768)
                    Ticket(weight: 128, size: 32768)
            Pending in sink: (0)
```

Created when debugging a core dump. Turned out not to be immediately useful for this use case, but I'm publishing it since it may come in handy in future investigations.

Closes #8362

* github.com:scylladb/scylla:
  scylla-gdb: add io-queues command
  scylla-gdb.py: add parsing std::priority_queue
  scylla-gdb.py: add parsing std::atomic
  scylla-gdb.py: add parsing std::shared_ptr
  scylla-gdb.py: add parsing intrusive_slist
2021-03-29 15:31:48 +02:00
Piotr Sarna
adf07eb8fb scylla-gdb: add io-queues command
The command can be used to inspect reactor's IO queues. Example output:
(gdb) scylla io-queues
    Dev 0:
	Class:                  |shares:         |ptr:
	--------------------------------------------------------------------------------
        "default"               |1               |(seastar::priority_class_data *)0x6000002c6500
        "commitlog"             |1000            |(seastar::priority_class_data *)0x6000003ad940
        "memtable_flush"        |1000            |(seastar::priority_class_data *)0x6000005cb300
        "streaming"             |200             |(seastar::priority_class_data *)0x0
        "query"                 |1000            |(seastar::priority_class_data *)0x600000718580
        "compaction"            |1000            |(seastar::priority_class_data *)0x6000030ef0c0

        Max request size:    2147483647
        Max capacity:        Ticket(weight: 4194303, size: 4194303)
        Capacity tail:       Ticket(weight: 73168384, size: 100561888)
        Capacity head:       Ticket(weight: 77360511, size: 104242143)

        Resources executing: Ticket(weight: 2176, size: 514048)
        Resources queued:    Ticket(weight: 384, size: 98304)
        Handles: (1)
            Class 0x6000005d7278:
                Ticket(weight: 128, size: 32768)
                Ticket(weight: 128, size: 32768)
                Ticket(weight: 128, size: 32768)
        Pending in sink: (0)
2021-03-29 15:01:25 +02:00
Piotr Sarna
f162423b8a scylla-gdb.py: add parsing std::priority_queue
The parsing assumes that the underlying storage is a vector,
which is often enough the case.
2021-03-29 13:10:36 +02:00
Piotr Sarna
e36c1f1d25 scylla-gdb.py: add parsing std::atomic 2021-03-29 13:10:36 +02:00
Piotr Sarna
0d4d04d3e6 scylla-gdb.py: add parsing std::shared_ptr 2021-03-29 13:10:36 +02:00
Piotr Sarna
c61822bc86 scylla-gdb.py: add parsing intrusive_slist 2021-03-29 13:10:36 +02:00
Piotr Sarna
bc1c92fd05 Merge 'Improve flat_mutation_reader::consume_pausable' from Piotr Jastrzębski
`flat_mutation_reader::consume_pausable` is widely used in Scylla. Some places
worth mentioning are memtables and combined readers but there are others as
well.

This patchset improves `consume_pausable` in three ways:
1. it removes unnecessary allocation
2. it rearranges ifs to not check the same thing twice
3. for a consumer that returns plain stop_iteration rather than a future<stop_iteration>
   it reduces the amount of future usage

Test: unit(dev, release, debug)

Combined reader microbenchmark has shown from 2% to 22% improvement in median
execution time while memtable microbenchmark has shown from 3.6% to 7.8%
improvement in median execution time.

Before the change:
```
./build/release/test/perf/perf_mutation_readers --random-seed 3549335083
single run iterations:    0
single run duration:      1.000s
number of runs:           5
number of cores:          16
random seed:              3549335083

test                                      iterations      median         mad         min         max
combined.one_row                             1316234   140.120ns     0.020ns   140.074ns   140.141ns
combined.single_active                          7332    91.484us    31.890ns    91.453us    91.778us
combined.many_overlapping                        945   870.973us   429.720ns   868.625us   871.403us
combined.disjoint_interleaved                   7102    85.989us     7.847ns    85.973us    85.997us
combined.disjoint_ranges                        7129    85.570us     7.840ns    85.562us    85.596us
combined.overlapping_partitions_disjoint_rows        5458   124.787us    56.738ns   124.731us   125.370us
clustering_combined.ranges_generic           1920688   217.940ns     0.184ns   217.742ns   218.275ns
clustering_combined.ranges_specialized       1935318   194.610ns     0.199ns   194.210ns   195.228ns
memtable.one_partition_one_row                624001     1.600us     1.405ns     1.599us     1.605us
memtable.one_partition_many_rows               79551    12.555us     1.829ns    12.549us    12.558us
memtable.many_partitions_one_row               40557    24.748us    77.083ns    24.644us    25.135us
memtable.many_partitions_many_rows              3220   310.429us    57.628ns   310.295us   311.189us
```

After the change:
```
./build/release/test/perf/perf_mutation_readers --random-seed 3549335083
single run iterations:    0
single run duration:      1.000s
number of runs:           5
number of cores:          16
random seed:              3549335083

test                                      iterations      median         mad         min         max
combined.one_row                             1358839   109.222ns     0.122ns   109.089ns   109.348ns
combined.single_active                          7525    87.305us    25.540ns    87.273us    87.362us
combined.many_overlapping                        962   853.195us     1.904us   851.244us   855.142us
combined.disjoint_interleaved                   7310    81.988us    28.877ns    81.949us    82.032us
combined.disjoint_ranges                        7315    81.699us    37.144ns    81.662us    81.874us
combined.overlapping_partitions_disjoint_rows        5591   120.964us    15.294ns   120.949us   121.120us
clustering_combined.ranges_generic           1954722   211.993ns     0.052ns   211.883ns   212.084ns
clustering_combined.ranges_specialized       2042194   187.807ns     0.066ns   187.732ns   188.289ns
memtable.one_partition_one_row                648701     1.542us     0.339ns     1.542us     1.543us
memtable.one_partition_many_rows               85007    11.759us     1.168ns    11.752us    11.782us
memtable.many_partitions_one_row               43893    22.805us    17.147ns    22.782us    22.843us
memtable.many_partitions_many_rows              3441   290.220us    41.720ns   290.172us   290.306us
```

Closes #8359

* github.com:scylladb/scylla:
  flat_mutation_reader: optimize consume_pausable for some consumers
  flat_mutation_reader: special case consumers in consume_pausable
  flat_mutation_reader: Change order of checks in consume_pausable
  flat_mutation_reader: fix indentation in consume_pausable
  flat_mutation_reader: Remove allocation in consume_pausable
  perf: Add benchmarks for large partitions
2021-03-29 13:06:56 +02:00
Piotr Sarna
4c79f132b6 thrift: add support for max_concurrent_requests_per_shard
The Thrift frontend is now capable of limiting the max number
of concurrent in-flight requests. Surplus requests are shed.
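
Shedding surplus requests can be sketched as a simple counter check. The
class below is a hypothetical illustration; only the option name
`max_concurrent_requests_per_shard` comes from this commit.

```python
class RequestLimiter:
    """Toy per-shard limiter: requests over the limit are shed, not queued."""

    def __init__(self, max_concurrent_requests_per_shard):
        self.limit = max_concurrent_requests_per_shard
        self.in_flight = 0
        self.shed = 0

    def try_start(self):
        """Return True if the request may proceed, False if it is shed."""
        if self.in_flight >= self.limit:
            self.shed += 1
            return False
        self.in_flight += 1
        return True

    def finish(self):
        """Mark one in-flight request as completed."""
        self.in_flight -= 1
```

A shed request is rejected immediately instead of waiting, which keeps the
number of in-flight requests bounded by the configured limit.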

Tests: manual
2021-03-29 13:05:16 +02:00
Piotr Sarna
9f53327c9d thrift: add metrics for admission control
The new metrics include information about how many requests
were blocked on memory, how much is still available, etc.
2021-03-29 13:05:16 +02:00
Piotr Sarna
6b021779d2 thrift: add a counter for in-flight requests 2021-03-29 13:05:16 +02:00
Piotr Sarna
9391515461 thrift: add a counter for blocked requests
The counter tracks how many requests were blocked by the
memory estimation based admission control semaphore.
2021-03-29 13:05:16 +02:00
Piotr Sarna
ef1de114f0 thrift: partially add admission control
This commit adds admission control in the form of passing
service permits to the Thrift server.
The support is partial, because Thrift also supports running CQL
queries, and for that purpose a query_state object is kept
in the Thrift handler. However, the handler is generally created
once per connection, not once per query, and the query_state object
is supposed to keep the state of a single query only.
In order to keep this series simpler, the CQL-on-top-of-Thrift
layer is not touched and is left as TODO.
Moreover, the Thrift layer does not make it easy to pass custom
per-query context (like service_permit), so the implementation
uses a trick: the service permit is created on the server
and then passed as reference to its connections and their respective
Thrift handlers. Then, each time a query is read from the socket,
this service permit is overwritten and then read back from the Thrift
handler. This mechanism heavily relies on the fact that there are
zero preemption points between overwriting the service permit
and reading it back by the handler. Otherwise, races may occur.
This assumption was verified by code inspection + empirical tests,
but if somebody is aware that it may not always hold, please speak up.
2021-03-29 13:05:16 +02:00
Nadav Har'El
ccc75bfe2a Merge 'Disable thrift by default' from Piotr Sarna
The Thrift layer is functional, but it's not usually the first-choice protocol for Scylla users, so it's hereby disabled by default.

Fixes #8336

Closes #8338

* github.com:scylladb/scylla:
  docs: mention disabling Thrift by default
  db,config: disable Thrift by default
2021-03-29 12:48:20 +03:00
Piotr Sarna
3388694e69 service_permit: add a getter for the number of units held
The helper function makes debugging considerably easier.
2021-03-29 11:34:18 +02:00
Piotr Sarna
364b921e25 thrift: coroutinize processing a request
While not particularly useful now, it will facilitate
later changes which introduce service permits.
2021-03-29 11:34:18 +02:00
Piotr Sarna
09621e5fc5 memory_limiter: add a missing seastarx include
It's that or declaring everything that belongs to seastar namespace
explicitly, and including "seastarx.hh" is more standard.
2021-03-29 11:34:18 +02:00
Michał Chojnowski
8c45225f21 docs: remove the obsolete IMR design note
IMR, as described in this design note, was removed in 001652815c.
This doc should have been removed back then, but was overlooked.

Closes #8340
2021-03-29 10:58:05 +02:00
Pekka Enberg
aec33c599b Update tools/python3 submodule
* tools/python3 6f3bcbe...ad04e8e (2):
  > dist/debian: fix renaming debian/scylla-* files rule
  > fix license of package build script to AGPL
2021-03-29 11:50:24 +03:00
Pekka Enberg
203b7394d7 Update tools/java submodule
* tools/java 7b66b7a0fc...ccc4201ded (1):
  > dist/debian: fix renaming debian/scylla-* files rule
2021-03-29 11:50:19 +03:00
Tomasz Grabiec
c0ce122f77 Merge "raft: wire up rpc add_server/remove_server for configuration changes" from Pavel Solodovnikov
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are no longer part of the most recent
configuration.

The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.

Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.

There is also another problem to be solved: in Raft an instance may
need to communicate with a peer outside its current configuration.
This may happen, e.g., when a follower falls out of sync with the
majority and then a configuration is changed and a leader not present
in the old configuration is elected.

The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.

When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.

An outgoing communication to an unconfigured peer is impossible.

* manmanson/raft_mappings_wiring_v12:
  raft: update README.md with info on RPC server address mappings
  raft: wire up `rpc::add_server` and `rpc::remove_server` for configuration changes
  raft/fsm: add optional `rpc_configuration` field to fsm_output
  raft: maintain current rpc context in `server_impl`
  raft: use `.contains` instead of `.count` for std::set in `raft::configuration::diff`
  raft: unit-tests for `raft_address_map`
  raft: support expiring server address mappings for rpc module
2021-03-29 10:28:45 +02:00
Piotr Jastrzebski
86cf566692 flat_mutation_reader: optimize consume_pausable for some consumers
Consumers that return stop_iteration rather than future<stop_iteration>
don't have to consume a single fragment per iteration of repeat. They can
consume the whole buffer in each iteration.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
26cc4f112d flat_mutation_reader: special case consumers in consume_pausable
consume_pausable works with consumers that return either stop_iteration
or future<stop_iteration>. So far it was calling futurize_invoke for
both. This patch special cases consumers that return
future<stop_iteration> and don't call futurize_invoke for them as this
is unnecessary work.

More importantly, this will allow the following patch to optimize
consumers that return plain stop_iteration.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
164e23d2b1 flat_mutation_reader: Change order of checks in consume_pausable
This way we can avoid checking is_buffer_empty twice.
Compiler might be able to optimize this out but why depend on it
when the alternative is not less readable.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
776ba29cec flat_mutation_reader: fix indentation in consume_pausable
Code was left with wrong indentation by the previous commit that
removed do_with call around the code that's currently present.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
9fb0014d72 flat_mutation_reader: Remove allocation in consume_pausable
The allocation was introduced in 515bed90bb but I couldn't figure out
why it's needed. It seems that the consumer can just be captured inside
lambda. Tests seem to support the idea.

Indentation will be fixed in the following commit to make the review
easier.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:55:14 +02:00
Piotr Jastrzebski
3aa7bee5e3 perf: Add benchmarks for large partitions
in perf_mutation_readers.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-03-29 09:48:11 +02:00
Avi Kivity
ec4d91f9eb tools: toolchain: dbuild: improve cgroupv2 detection code
dbuild detects if the kernel is using cgroupv2 by checking if the
cgroup2 filesystem is mounted on /sys/fs/cgroup. However, on Ubuntu
20.10, the cgroup filesystem is mounted on /sys/fs/cgroup and the
cgroup2 filesystem is mounted on /sys/fs/cgroup/unified. This second
mount matches the search expression and gives a false positive.

Fix by adding a space at the end; this will fail to match
/sys/fs/cgroup/unified.
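
The effect of the trailing space can be sketched as a plain substring search. This is only an illustration in C++ (dbuild itself is a script): the helper names and the sample mount lines are assumptions, laid out like /proc/mounts entries, and are not taken from dbuild.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: true if any mount-table line contains `needle`.
bool any_line_contains(const std::vector<std::string>& lines, const std::string& needle) {
    assert(!needle.empty()); // an empty needle would match every line
    for (const auto& line : lines) {
        if (line.find(needle) != std::string::npos) {
            return true;
        }
    }
    return false;
}

// Illustrative line for a host where cgroup2 sits on /sys/fs/cgroup/unified.
std::vector<std::string> hybrid_mounts() {
    return {"cgroup2 /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec 0 0"};
}

// Illustrative line for a host where cgroup2 sits directly on /sys/fs/cgroup.
std::vector<std::string> unified_mounts() {
    return {"cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec 0 0"};
}
```

Searching `hybrid_mounts()` for "cgroup2 /sys/fs/cgroup" matches (the false positive), while the same search with a trailing space does not; `unified_mounts()` matches either way.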

Closes #8355
2021-03-29 09:31:29 +03:00
Pavel Solodovnikov
2d9e94f050 raft: update README.md with info on RPC server address mappings
Describe the high-level scheme of managing RPC mappings and
also expand on the introduction of "expirable" RPC mappings concept
and why these are needed.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:13 +03:00
Pavel Solodovnikov
f61206e483 raft: wire up rpc::add_server and rpc::remove_server for configuration changes
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are no longer part of the most recent
configuration.

The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.

Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:09 +03:00
Pavel Solodovnikov
16d9e8e9af raft/fsm: add optional rpc_configuration field to fsm_output
The field is set in `fsm.get_output` whenever
`_log.last_conf_idx()` or the term changes.

Also, add `_last_conf_idx` and `_last_term` to
`fsm::last_observed_state`, they are utilized
in the condition to evaluate current rpc
configuration in `fsm.get_output()`.

This will be used later to update rpc config state
stored in `server_impl` and maintain rpc address map.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:05 +03:00
Tomasz Grabiec
6035fd05b3 Merge "Unify drain() and drain_on_shutdown()" from Pavel Emelyanov
The start-stop code is drifting towards a straightforward
scheme of a bunch of

  service foo;
  foo.start();
  auto stop_foo = defer([&foo] { foo.stop(); });

blocks. The drain_on_shutdown() and its relation to drain()
and decommission() is a big hurdle on the way of this effort.
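
The blocks above rely on deferred actions running in reverse order of their creation, so services stop in the opposite order to which they started. A minimal standard-C++ sketch of that property follows; `deferred`, `service` and `run_lifecycle` are hypothetical stand-ins for illustration, not the seastar `defer()` implementation or real Scylla services.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a deferred action: runs `fn` on scope exit.
template <typename Fn>
class deferred {
    Fn _fn;
public:
    explicit deferred(Fn fn) : _fn(std::move(fn)) {}
    deferred(const deferred&) = delete;
    deferred& operator=(const deferred&) = delete;
    ~deferred() { _fn(); }
};

// Hypothetical service that records its lifecycle events in a shared log.
struct service {
    std::string name;
    std::vector<std::string>& log;
    bool running = false;
    void start() { running = true; log.push_back("start " + name); }
    void stop() { assert(running); running = false; log.push_back("stop " + name); }
};

// Starts two services and lets the guards stop them when the scope ends.
std::vector<std::string> run_lifecycle() {
    std::vector<std::string> log;
    {
        service foo{"foo", log};
        foo.start();
        deferred stop_foo([&foo] { foo.stop(); });

        service bar{"bar", log};
        bar.start();
        deferred stop_bar([&bar] { bar.stop(); });
    } // guards run here, in reverse order of construction
    return log;
}
```

In this sketch `bar` is stopped before `foo`, which is the ordering a shutdown path built from such blocks gets for free.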

This set unifies drain() and drain_on_shutdown() so that drain
really becomes just some first steps of the regular shutdown,
i.e. -- what it should be. Some synchronisation bits around it
are still needed, though.

This unification also closes a bunch of not-yet-caught bugs where
parts of the system remained running in case shutdown happens
after nodetool drain. In this case the whole drain_on_shutdown()
becomes a noop (just returns drain()'s future) and what's
missing in drain() becomes missing on shutdown.

tests: unit(dev), dtest(simple_boot_shutdown : dev),
       manual(start+stop, start+drain+stop : dev)
refs: #2737

* xemul/br-drain-on-shutdown:
  drain_on_shutdown: Simplify
  drain: Fix indentation
  storage_service: Unify drain and drain_on_shutdown
  storage_proxy: Drain and unsubscribe in main.cc
  migration_manager: Stop it in two phases
  stream_manager: Stop instances on drain
  batchlog_manager: Stop its instances on shutdown
  tracing: Shutdown tracing in drain
  tracing: Stop it in main.cc
  system_distributed_keyspace: Stop it in main.cc
  storage_service: Move (un)subscription to migration events
2021-03-26 18:37:27 +01:00
Pavel Solodovnikov
19cc85b3b6 raft: maintain current rpc context in server_impl
Introduce rpc server_address that represents the
last observed state of address mappings
for RPC module.

It does not correspond to any kind of configuration
in the raft sense, just an artificial construct
corresponding to the largest set of server
addresses coming from both previous and current
raft configurations (to be able to contact both
joining and leaving servers).

This will be used later to update rpc module mappings
when cluster configuration changes.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
8799ccbab0 raft: use .contains instead of .count for std::set in raft::configuration::diff
`std::unordered_set::contains` is introduced in C++20 and provides
clearer semantics to check existence of a given element in a set.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
7c229998e8 raft: unit-tests for raft_address_map
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
3c4d46728d raft: support expiring server address mappings for rpc module
This patch introduces `raft_address_map` class to abstract
the notion of expirable address mappings for a raft rpc module.

In Raft an instance may need to communicate with a peer outside
its current configuration. This may happen, e.g., when a follower
falls out of sync with the majority and then a configuration is
changed and a leader not present in the old configuration is elected.

The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.

When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.

An outgoing communication to an unconfigured peer is impossible.
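
The idea of expirable mappings can be sketched as a map whose entries optionally carry a deadline. Everything below is an assumption made for illustration: the class name, the integer server ids, the string addresses and the rule that an expiring update never replaces a permanent mapping are not taken from `raft_address_map`.

```cpp
#include <cassert>
#include <chrono>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

using sketch_clock = std::chrono::steady_clock;

// Hypothetical sketch of an address map with "expirable" entries.
class expiring_address_map {
    struct entry {
        std::string address;
        std::optional<sketch_clock::time_point> expires_at; // nullopt: never expires
    };
    std::unordered_map<int, entry> _entries;
public:
    // Mapping for a configured server: stays until erased explicitly.
    void set_permanent(int id, std::string address) {
        _entries[id] = entry{std::move(address), std::nullopt};
    }
    // Mapping learned from a message sent by an unknown peer.
    // Sketch-level choice: never downgrade an existing permanent mapping.
    void set_expiring(int id, std::string address,
                      sketch_clock::time_point now, sketch_clock::duration ttl) {
        assert(ttl > sketch_clock::duration::zero());
        auto it = _entries.find(id);
        if (it != _entries.end() && !it->second.expires_at) {
            return;
        }
        _entries[id] = entry{std::move(address), now + ttl};
    }
    void erase(int id) { _entries.erase(id); }
    // Returns the address if known and not expired; drops expired entries lazily.
    std::optional<std::string> find(int id, sketch_clock::time_point now) {
        auto it = _entries.find(id);
        if (it == _entries.end()) {
            return std::nullopt;
        }
        if (it->second.expires_at && *it->second.expires_at <= now) {
            _entries.erase(it);
            return std::nullopt;
        }
        return it->second.address;
    }
};
```

Passing the current time explicitly keeps the sketch deterministic; a real implementation would more likely use timers to expire entries.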

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Emelyanov
d1796ab3dc drain_on_shutdown: Simplify
The modern version of this method doesn't need the
run_with_no_api_lock(), as it's launched on shard 0
anyway, nor does it need logging before and after
as it's done by the deferred action from main that
calls it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00
Pavel Emelyanov
58b47efe16 drain: Fix indentation
Previous patch left it broken for readability.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00
Pavel Emelyanov
8d7ad6de03 storage_service: Unify drain and drain_on_shutdown
Now they only differ in one bit -- compaction manager is
drained on drain and is left running (until regular stop)
on shutdown. So this unification adds a boolean flag for
this case.

Also the indentation is deliberately left broken for the
sake of patch readability.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00
Pavel Emelyanov
b60099c2f8 storage_proxy: Drain and unsubscribe in main.cc
Currently shutdown after drain leaves storage proxy
subscribed on storage_service events and without the
storage_proxy::drain_on_shutdown being called. So it
seems safe if the whole thing is relocated closer to
its starting peers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:46 +03:00
Pavel Emelyanov
9a8f125890 migration_manager: Stop it in two phases
Before the patch the migration manager was stopped in
two ways and one was buggy.

Plain shutdown -- it's just sharded::stop-ed by defer
in main(), but this happens long after the shutdown
of commitlog, which is not correct.
Shutdown after drain -- it's stopped twice, first time
right before the commitlog shutdown, second -- the
same defer in main(). And since the sharded::stop is
re-entrant, the 2nd stop is a noop.

This patch splits the stop into two phases: first it
stops the instances and does this in _both_ -- plain
shutdown and shutdown after drain. This phase is done
before commitlog shutdown in both cases. Second, the
existing deferred sharded::stop in main.cc.

This change needs the migration_manager::stop() to
become re-entrant, but that's easily checked with the
help of abort_source the migration_manager has.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
de8a7fe798 stream_manager: Stop instances on drain
It's not seen directly from this patch itself, but the only
difference between first several calls that drain() makes
and the stop_transport() is the do_stop_stream_manager()
in the latter.

Again, it's partially a bugfix (shutdown after drain leaves
streaming running), partially a must-have thing (streaming
is not expected in the background after drain), partially a
unification of two drains out there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
9a7e2a218b batchlog_manager: Stop its instances on shutdown
It's now stopped (not sharded::stop(), but batchlog_manager::stop)
on plain drain, but plain shutdown leaves it running, so fill this
gap.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
bcc3935ce7 tracing: Shutdown tracing in drain
First of all, shutdown that happens after nodetool drain leaves
tracing up-n-running, so it's effectively a bugfix. But also
a step towards unified drain and drain_on_shutdown.

Keeping this bit in drain seems to be required because drain
stops transport, flushes column families and shuts commitlog
down. Any tracing activity happening after it looks uncalled
for.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
f1d7804102 tracing: Stop it in main.cc
The tracing::stop() just checks that it was shutdown()-ed and
is otherwise a noop, so it's OK to stop tracing later. This brings
drain() and drain_on_shutdown() closer to each other and makes
main.cc look more like it should.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
d7cccec97f system_distributed_keyspace: Stop it in main.cc
It's now stopped in drain_on_shutdown, but since its stop()
method is a noop, it doesn't matter where it is. Keeping it
in main.cc next to related start brings drain_on_shutdown()
closer to drain() and the whole thing closer to the Ideal
start-stop sequence.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Pavel Emelyanov
5456174d69 storage_service: Move (un)subscription to migration events
After the patch the subscription effectively happens at
the same time as before, but is now located in main.cc,
so no real change here.

The unsubscription was in the drain_on_shutdown before
the patch, but after it it happens to be a defer next
to its peer, i.e. later, but it shouldn't be disastrous
for two reasons. First -- client services and migration
manager are already stopped. Second -- before the patch
this subscription was _not_ cancelled if shutdown ran
after nodetool drain and it didn't cause troubles.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-26 18:58:45 +03:00
Botond Dénes
d64b1fdd6a reader_permit: signal leaked resources
When destroying a permit with leaked resources we call
`on_internal_error_noexcept()` in the destructor. This method logs an
error or asserts depending on the configuration. When not asserting, we
need to return the leaked units to the semaphore, otherwise they will be
leaked for good. We can do this because we know exactly how many
resources the user of the permit leaked (never signalled).
2021-03-26 14:23:32 +02:00
Botond Dénes
0f1a72ba59 test: test_reader_lifecycle_policy: keep semaphores alive until all ops cease
To ensure the semaphores outlive all permits created as part of the
tests.
2021-03-26 14:22:43 +02:00
Botond Dénes
f843e3de08 sstables: generate_summary(): extend the lifecycle of the reader concurrency semaphore
Used to produce the needed permits for the index reads, such that it
outlives all the permits in use.
2021-03-26 11:06:02 +02:00
Tomasz Grabiec
f86896d387 Merge "Iterate range tombstones in partition_snapshot_reader" from Pavel Emelyanov
Currently the reader copies and merges all range tombstones from all partition
versions (that match the given range, but still) when being initialized or
decides to refresh iterators. This is a lot of potentially useless work and
memory, as the reader may be dropped before it emits all the mutations from
the given range(s).

It's better to walk the tombstones step-by-step, like it's done for rows.

fixes: #1671
tests: unit(dev)

* xemul/br-partiion-snapshot-reader-on-demand-range-tombstones-2:
  range_tombstone_stream: Remove unused methods
  partition_snapshot_reader: Emit range tombstones on demand
  partition_snapshot_reader: Introduce maybe_refresh_state
  partition_snapshot_reader: Move range tombstone stream member
  partition_snapshot_reader: Add reset_state method to helper class
  partition_snapshot_reader: Downgrade heap comparator
  partition_snapshot_reader: Use on-demand comparators
  range_tombstone_list: Add new slice() helper
  range_tombstone_list: Introduce iterator_range alias
2021-03-26 01:27:18 +01:00
Pavel Emelyanov
c6a0e0439e files: Construct file_impls properly
Constructors of classes inherited from file_impl copy alignment
values by hand, but miss the overwrite one, thus on a new file
it remains default-initialized.

To fix this and not to forget to properly initialize future fields
from file_impl, use the impl's copy constructor.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210325104830.31923-1-xemul@scylladb.com>
2021-03-26 00:22:11 +01:00
Tomasz Grabiec
ef06a939c4 Merge "raft: seven etcd unit tests ported" from Alejo
Seven etcd unit tests as boost tests.

* alejo/raft-tests-etcd-08-v4-communicate-v5:
  raft: etcd unit tests: test proposal handling scenarios
  raft: etcd unit tests: test old messages ignored
  raft: etcd unit tests: test single node precandidate
  raft: etcd unit tests: test dueling precandidates
  raft: etcd unit tests: test dueling candidates
  raft: etcd unit tests: test cannot commit without new term
  raft: etcd unit tests: test single node commit
  raft: etcd unit tests: update test_leader_election_overwrite_newer_logs
  raft: etcd unit tests: fix test_progress_leader
  raft: testing: log comparison helper functions
  raft: testing: helper to make fsm candidate
  raft: testing: expose log for test verification
  raft: testing: use server_address_set
  raft: testing: add prevote configuration
  raft: testing: make become_follower() available for tests
2021-03-25 20:27:07 +01:00
Alejo Sanchez
ace0ee514f raft: etcd unit tests: test proposal handling scenarios
TestProposal
For multiple scenarios, check proposal handling.

Note, instead of expecting an explicit result for each specified case,
the test automatically checks for expected behavior when quorum is
reached or not.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
77163ea76a raft: etcd unit tests: test old messages ignored
TestOldMessages
Checks an append request from a leader from a previous term is ignored.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
bf65b19803 raft: etcd unit tests: test single node precandidate
TestSingleNodePreCandidate
Checks that a single node configuration with precandidate enabled
automatically elects the node.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
de7051467b raft: etcd unit tests: test dueling precandidates
TestDuelingPreCandidates
In a configuration of 3 nodes, two nodes don't see each other and they
compete for leadership. Loser (3) should revert to follower when prevote
is rejected and revert to term 1.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
aa7d23f86b raft: etcd unit tests: test dueling candidates
TestDuelingCandidates
In a configuration of 3 nodes, two nodes don't see each other and they
compete for leadership. Once reconnected, loser should not disrupt.

But note it will remain candidate with current algorithm without
prevoting and other fsms will not bump term.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
1eac94e7d6 raft: etcd unit tests: test cannot commit without new term
TestCannotCommitWithoutNewTermEntry tests the entries cannot be
committed when leader changes, no new proposal comes in and ChangeTerm
proposal is filtered.

NOTE: this doesn't check committed but it's implicit for next round;
      this could also use communicate() providing committed output map

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
b421fe3605 raft: etcd unit tests: test single node commit
Port etcd TestSingleNodeCommit

In a single node configuration elect the node, add 2 entries and check
number of committed entries.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
9b4538476b raft: etcd unit tests: update test_leader_election_overwrite_newer_logs
Make test_leader_election_overwrite_newer_logs use newer communicate()
and other new helpers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
368eec1190 raft: etcd unit tests: fix test_progress_leader
Make implementation follow closer to original test.
Use newer boost test helpers.

NOTE: in etcd it seems a leader's self progress is in PIPELINE state.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:28 -04:00
Alejo Sanchez
ba29970e29 raft: testing: log comparison helper functions
Two helper functions to compare logs. For now only index, term, and data
type are used. Data content comparison does not seem to be necessary for now.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:28 -04:00
Alejo Sanchez
aeab4cf4a9 raft: testing: helper to make fsm candidate
Current election_timeout() helper might bump the term twice.
It's convenient and less error prone to have a more fine grained helper
that stops right when candidate state is reached.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:19 -04:00
Alejo Sanchez
7a6616f1cb raft: testing: expose log for test verification
Let derived classes access the log to verify its contents.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:03:46 -04:00
Alejo Sanchez
05b1f57e67 raft: testing: use server_address_set
Use server_address_set in local namespace for brevity.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:01:12 -04:00
Alejo Sanchez
9d0a7d8ccf raft: testing: add prevote configuration
Provide a generic prevote configuration for tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:00:28 -04:00
Dejan Mircevski
b2a04985f7 cql-pytest: Drop needless INSERT in test_null
One INSERT statement was unnecessary for the test, so delete it.
Another was necessary, so explain it.

Tests: cql-pytest/test_null on both Scylla and Cassandra

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8304
2021-03-25 16:37:00 +01:00
Tomasz Grabiec
7b30d31d77 Merge "raft: test configuration changes" from Kostja
Test raft configuration changes:
a node with empty configuration, transitioning
to an entirely different cluster, transitioning
in presence of down nodes, leader change during
configuration change, stray replies, etc.

* scylla-dev/raft-empty-confchange-v5: (21 commits)
  raft: (testing) stray replies from removed followers
  raft: always return a non-zero configuration index from the log
  raft: (testing) leader change during configuration change
  raft: (testing) test confchange {ABCDE} -> {ABCDEFG}
  raft: (testing) test confchange {ABCDEF} -> {ABCGH}
  raft: (testing) test confchange {ABC} -> {CDE}
  raft: (testing) test confchange {AB} -> {CD}
  raft: (testing) test confchange {A} -> {B}
  raft: (testing) test a server with empty configuration
  raft: (testing) introduce testing utilities
  raft: (testing) simplify id allocation in test
  raft: (testing) add select_leader() helper
  raft: (testing) introduce communicate() helper
  raft: (testing) style cleanup in raft_fsm_test
  raft: (testing) fix bug in election_threshold
  raft: minor style changes & comments
  raft: do not assert when transitioning to empty config
  raft: assert we never apply a snapshot over uncommitted entries (leader)
  raft: improve tracing
  raft: add fsm_output::empty() helper to aid testing
  ...
2021-03-25 14:01:09 +01:00
Wojciech Mitros
b152dc8c86 types: move read_collection_size/value specialization to header file
The template method needs to be specialized in each file that is
using it. To avoid rewriting the specialization into multiple files,
move it to the header file.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-03-25 12:18:38 +01:00
Avi Kivity
46185d7d82 Update tools/jmx submodule
* tools/jmx 9c687b5...440313e (1):
  > storage_service: Add a generic toppartitions endpoint
2021-03-25 12:36:10 +02:00
Alejo Sanchez
7e6807e8fc raft: testing: make become_follower() available for tests
Some etcd tests need to force a follower with a specific leader.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-24 19:11:09 -04:00
Piotr Wojtczak
c1daf2bb24 column_family: Make toppartitions queries more generic
Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing to specify a list of families.

We provide three ways for filtering in the query parameter "name_list":
    1. A specific column family to include in the form "ks:cf"
    2. A keyspace, telling the server to include all column families in it.
       Specified by omitting the cf name, i.e. "ks:"
    3. All column families, which is represented by an empty list
The list can include any number of entries of forms 1. and 2., in any combination.
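
The three filtering forms can be sketched as follows. The `table_filter` type and both functions are hypothetical, written only to illustrate the "ks:cf", "ks:" and empty-list cases; they are not the code added by this change.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical representation of one "name_list" entry.
struct table_filter {
    std::string keyspace;
    std::optional<std::string> table; // nullopt: every column family in the keyspace
};

// Parses "ks:cf" or "ks:"; returns nullopt for input without a keyspace or colon.
std::optional<table_filter> parse_filter(const std::string& item) {
    auto colon = item.find(':');
    if (colon == std::string::npos || colon == 0) {
        return std::nullopt;
    }
    table_filter f;
    f.keyspace = item.substr(0, colon);
    assert(!f.keyspace.empty()); // guaranteed by the colon == 0 check above
    auto cf = item.substr(colon + 1);
    if (!cf.empty()) {
        f.table = cf;
    }
    return f;
}

// True if ks.cf is selected; an empty filter list selects all column families.
bool selects(const std::vector<table_filter>& filters,
             const std::string& ks, const std::string& cf) {
    if (filters.empty()) {
        return true;
    }
    for (const auto& f : filters) {
        if (f.keyspace == ks && (!f.table || *f.table == cf)) {
            return true;
        }
    }
    return false;
}
```

With the filters parsed from "ks1:cf1" and "ks2:", this sketch selects ks1.cf1 and every column family in ks2, and nothing else.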

Fixes #4520

Closes #7864
2021-03-24 17:54:05 +02:00
Raphael S. Carvalho
bcbb39999b LCS: Fix terrible write amplification when reshaping level 0
LCS reshape is basically 'major compacting' level 0 until it contains less than
N sstables.

That produces terrible write amplification, because any given byte will be
compacted (initial # of sstables / max_threshold (32)) times. So if L0 initially
contained 256 ssts, there would be a WA of about 8.
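
The estimate above is a single division; a tiny sketch with a hypothetical function name makes the arithmetic explicit.

```cpp
#include <cassert>

// Rough estimate from the text: each byte is rewritten about
// (initial sstable count / max_threshold) times.
double reshape_write_amp(unsigned initial_sstables, unsigned max_threshold) {
    assert(max_threshold > 0);
    return static_cast<double>(initial_sstables) / max_threshold;
}
```

For 256 sstables and a max_threshold of 32 this gives 256 / 32 = 8.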

This terrible write amplification can be reduced by performing STCS instead on
L0, which will leave L0 in a good shape without hurting WA as it happens
now.

Fixes #8345.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322150655.27011-1-raphaelsc@scylladb.com>
2021-03-24 17:48:50 +02:00
Piotr Sarna
24a43681b4 thrift: handle gate closed exception on retry
During the retry mechanism, it's possible to encounter a gate
closed exception, which should simply be ignored, because
it indicates that the server is shutting down.

Closes #8337
2021-03-24 17:41:58 +02:00
Konstantin Osipov
1a1d7ab662 raft: (testing) stray replies from removed followers 2021-03-24 14:05:55 +03:00
Konstantin Osipov
0295163f6f raft: always return a non-zero configuration index from the log
Return snapshot index for last configuration index if there
is no configuration in the log.
2021-03-24 14:05:55 +03:00
Konstantin Osipov
cec59e53ef raft: (testing) leader change during configuration change 2021-03-24 14:05:36 +03:00
Pavel Emelyanov
37bec6fb76 commitlog: Open files with append_is_unlikely
This open option tells seastar that the file in question
will be truncated to the needed size right at once and all
the subsequent writes will happen within this size. This
hint turns off append optimization in seastar that's not
that cheap and helps to save a few CPU cycles.

The option was introduced in seastar by 8bec57bc.

tests: unit(dev), dtest(commitlog:
                        test_batch_commitlog,
                        test_periodic_commitlog,
                        test_commitlog_replay_on_startup)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210323115409.31215-1-xemul@scylladb.com>
2021-03-24 13:05:33 +02:00
Konstantin Osipov
a203c8833f raft: (testing) test confchange {ABCDE} -> {ABCDEFG} 2021-03-24 14:04:18 +03:00
Konstantin Osipov
40e117d36e raft: (testing) test confchange {ABCDEF} -> {ABCGH} 2021-03-24 14:04:18 +03:00
Konstantin Osipov
14b2d5d308 raft: (testing) test confchange {ABC} -> {CDE}
Test leader change during configuration change.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
3c718a175e raft: (testing) test confchange {AB} -> {CD} 2021-03-24 14:04:18 +03:00
Konstantin Osipov
2e30c8540e raft: (testing) test confchange {A} -> {B}
Test non-restart and leader restart scenario.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
e23da06fef raft: (testing) test a server with empty configuration
Try becoming a candidate for such server, or adding it
to an existing configuration.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
b18599c630 raft: (testing) introduce testing utilities
Add a discrete_failure_detector, to be able
to mark a single server dead.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
8d26d24370 raft: (testing) simplify id allocation in test 2021-03-24 14:04:18 +03:00
Konstantin Osipov
322a15ec33 raft: (testing) add select_leader() helper
With leader stepdown extension, leadership transfer can happen
to any follower with long enough log. Add a helper to select that
follower from a list.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
4a00da276d raft: (testing) introduce communicate() helper
Allow communication between an arbitrary number of FSMs. Drop
messages to FSMs which are not in the argument list.
Stop communication upon predicate.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
7182323ac0 raft: (testing) style cleanup in raft_fsm_test
1) Avoid memory violations on test failure
2) Print better diagnostics on failure (BOOST_CHECK_EQUAL vs
   BOOST_CHECK)
2021-03-24 14:04:18 +03:00
Konstantin Osipov
f0f25bf7fb raft: (testing) fix bug in election_threshold
election_threshold was ticking one extra tick,
causing the follower to become candidate in some cases.
This was rendering tests unstable.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
00d7379bc9 raft: minor style changes & comments
Add comments explaining the rationale from transfer_leadership()
(more PhD quotes), encapsulate stable leader check in tick()
into a lambda and add more detailed comments to it.
2021-03-24 14:04:18 +03:00
Piotr Sarna
06131e21a3 configure.py: add customizing clang inline threshold
Until clang figures things out with the now infamous
`-llvm -inline-threshold X` parameter, let's allow customizing
it to make the compilation of release builds less tiresome.
For instance, scylla's row_level.o object file currently does not compile
for me until I decrease the inline threshold to a low value (e.g. 50).

Message-Id: <54113db9438e3c3371410996f49b7fbe9a1b7257.1616422536.git.sarna@scylladb.com>
2021-03-24 12:09:26 +02:00
Tomasz Grabiec
9272e74e8c sstable: writer: ka/la: Write row marker cell after row tombstone
Row marker has a cell name which sorts after the row tombstone's start
bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.

This was harmless to our sstable reader, which recognizes both as
belonging to the current clustering row fragment and collects both
fine.

However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries which violate the cell
name ordering. This is very unlikely to happen in practice, since to
trigger promoted index entries for both atoms, the clustering key
would have to be so large that the size of the marker cell exceeds the
desired promoted index block size, which is 64KB by default (but
user-controlled via column_index_size_in_kb option). 64KB is also the
limit on clustering key size accepted by the system.

This was caught by one of our unit tests:

  sstable_conforms_to_mutation_source_test

...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.

The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:

    assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));

It compares mutations read from the same sstable using different
methods: slicing using clustering key restrictions, and fast
forwarding. The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.

After reverting the commit which introduced dynamic adjustments, the
test passes, but both mutations are missing the marker, both are
wrong!

They are wrong because the promoted index contains entries whose
starting positions violate the ordering, so binary search gets confused
and selects the row tombstone's position, which is emitted after the
marker, thus skipping over the row marker.
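
The effect can be reproduced with a toy model. The sketch below is
illustrative Python, not Scylla code: the numeric positions, the
one-atom-per-block layout and the helper names are assumptions made up
for this example. It binary-searches the block start positions and shows
that, with the mis-ordered index, the read starts at the row tombstone's
block and never returns the marker.

```python
from bisect import bisect_right

# Hypothetical positions: the row tombstone's start bound sorts before the marker.
ROW_TOMBSTONE, ROW_MARKER = 30, 31

# Each block holds one atom: (atom name, start position of the block).
correct_index = [("prev", 10), ("row tombstone", ROW_TOMBSTONE),
                 ("row marker", ROW_MARKER), ("next", 50)]
# Old writer: marker written first, so start positions are not ascending.
buggy_index = [("prev", 10), ("row marker", ROW_MARKER),
               ("row tombstone", ROW_TOMBSTONE), ("next", 50)]


def first_block_to_read(block_starts, target):
    """Binary search for the last block whose start position is <= target."""
    return max(bisect_right(block_starts, target) - 1, 0)


def atoms_read(index, target):
    """Atoms returned when reading forward from the block chosen by the search."""
    starts = [start for _, start in index]
    first = first_block_to_read(starts, target)
    return [name for name, _ in index[first:]]


print(atoms_read(correct_index, ROW_TOMBSTONE))
print(atoms_read(buggy_index, ROW_TOMBSTONE))
```

With the ordered index the read returns the tombstone followed by the
marker; with the mis-ordered one the search lands past the marker.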

The explanation for why the test started to fail after dynamic
adjustments is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which the cursor touches during binary search depend on
the size of the block read from the file. The commit which enabled
dynamic adjustments causes the block size to be different for
subsequent reads, which allows one of the reads to walk over the
corrupted entries and read the correct data by selecting the entry
corresponding to the row marker.

Fixes #8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>
2021-03-23 16:13:47 +01:00
Tomasz Grabiec
235154cca5 Merge "Teach scylla-gdb new trees in row cache" from Pavel Emelyanov
Clustering rows are now stored in intrusive btree, cells are
now stored in radix tree, but scylla-gdb tries to walk the
intrusive_set and vector/set union respectively.

For the former case -- the btree wrapper is introduced.

For the latter -- the compiler optimizes away too many important
bits and walking the tree turns into a bunch of hard-coded
hacks and reinterpret-casts. Until a better solution is found,
just print the address of the tree root.

* xemul/br-gdb-btree-rows:
  gdb: Show address of the row::_cells tree (or "empty" mark)
  gdb: Add support for intrusive B tree
  gdb: Use helper to get rows from mutation_partition
2021-03-23 12:50:17 +01:00
Pavel Emelyanov
1cd9ec952f gdb: Show address of the row::_cells tree (or "empty" mark)
Currently clang optimizes out lots of critical stuff from the
compact radix tree. Until we find a way to walk the
tree in gdb, it's better to at least show where it is in
memory.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-23 13:29:40 +03:00
Pavel Emelyanov
5c85fcb3c9 gdb: Add support for intrusive B tree
Rows inside partition are now stored in an intrusive B-tree,
so here's the helper class that wraps this collection.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-23 12:54:44 +03:00
Pavel Emelyanov
ed38b18a84 gdb: Use helper to get rows from mutation_partition
Preparation for the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-23 12:54:14 +03:00
Avi Kivity
3c292e31af utils: utf8: fix validate_partial() on non-SIMD-optimized architectures
validate_partial() is declared in the internal namespace, but defined
outside it. This causes calls to validate_partial() to be ambiguous
on architectures that haven't been SIMD-optimized yet (e.g. s390x).

Fix by defining it in the internal namespace.

Closes #8268
2021-03-23 09:21:14 +02:00
Avi Kivity
957259fab7 tools: toolchain: prepare: adjust manifest manipulations
The manifest manipulation commands stopped working with podman 3;
the containers-storage: prefix now throws errors.

Switch to `buildah manifest`; since we're building with buildah,
we might as well maintain the manifest with buildah as well.

Closes #8231
2021-03-23 09:18:19 +02:00
Avi Kivity
4dae434f69 utils: crc: fix build with big-endian architectures and 1-byte objects
crc has some code to reverse endianness on big-endian machines, but does
not handle the case of a 1-byte object (which doesn't need any adjustment).
This causes clang to complain that the switch statement doesn't handle that
case.

Fix by adding a no-op case.

Closes #8269
2021-03-23 09:16:20 +02:00
Konstantin Osipov
ce29fb44c3 raft: do not assert when transitioning to empty config
Throw instead, to make this case testable.
2021-03-22 18:55:40 +03:00
Konstantin Osipov
2ee15ad6c7 raft: assert we never apply a snapshot over uncommitted entries (leader) 2021-03-22 18:55:40 +03:00
Konstantin Osipov
c7f7ad2c4e raft: improve tracing
Add tracing to apply_snapshot, request_vote.
2021-03-22 18:55:40 +03:00
Konstantin Osipov
4dd66edae5 raft: add fsm_output::empty() helper to aid testing
Used in testing to implement trivial transport.
2021-03-22 18:55:40 +03:00
Konstantin Osipov
89349f550c raft: aid testing by providing fsm::id() 2021-03-22 18:55:40 +03:00
Botond Dénes
742a33730a scylla-gdb.py: dereference_smart_ptr(): add support for seastar::smart_ptr
Although a seastar::smart_ptr is trivial to dereference manually, so is
adding support for it to dereference_smart_ptr(), avoiding the annoying
(but brief) detour which is currently needed.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210322150149.84534-1-bdenes@scylladb.com>
2021-03-22 17:30:35 +02:00
Piotr Sarna
b774d69ad2 docs: mention disabling Thrift by default
Thrift is no longer enabled by default, so the documentation
should mention that, as well as the suggested way of enabling it
if necessary.
2021-03-22 14:32:51 +01:00
Raphael S. Carvalho
c86dd125a1 sstables: clean up partitioned_sstable_set::insert()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322130227.16805-2-raphaelsc@scylladb.com>
2021-03-22 15:30:32 +02:00
Raphael S. Carvalho
48d8cc261e sstables: don't swallow exception in partitioned_sstable_set::insert()
regression introduced by 02b2df1ea9 (Fri Mar 12 01:22:41 2021 -0300).

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322130227.16805-1-raphaelsc@scylladb.com>
2021-03-22 15:30:31 +02:00
Avi Kivity
50dda795e9 Update seastar submodule
* seastar 83339edb04...48376c76a1 (2):
  > iotune: Warn user about write-back cache mode
  > reactor: add --kernel-page-cache option to disable O_DIRECT
2021-03-22 13:33:08 +02:00
Avi Kivity
74df67776b bytes_ostream: convert write_placeholder from enable_if to concepts
Concepts are easier to read and result in better error messages.

This change also tightens the constraint from "std::is_fundamental" to
"std::integral". The differences are floating point values, nullptr_t,
and void. The latter two are illegal/useless to write, and nobody uses
floating point values for list lengths, so everything still compiles.

Closes #8326
2021-03-22 12:00:07 +01:00
Piotr Sarna
e2443337d9 db,config: disable Thrift by default
It will still be possible to use Thrift once it's enabled
in the yaml file, but it's better to not open this port
by default, since Thrift is definitely not the first choice
for Scylla users.

Fixes #8336
2021-03-22 10:54:26 +01:00
Piotr Sarna
23057dd186 Merge 'Implement RAFT's leader stepdown extension' from Gleb
This series implements leader stepdown extension. See patch 4 for
justification for its existence. First three patches either implement
cleanups to existing code that future patch will touch or fix bugs
that need to be fixed in order for stepdown test to work.

* 'raft-leader-stepdown-v3' of github.com:scylladb/scylla-dev:
  raft: add test for leader stepdown
  raft: introduce leader stepdown procedure
  raft: fix replication when leader is not part of current config
  raft: do not update last election time if current leader is not a part of current configuration
  raft: move log limiting semaphore into the leader state
2021-03-22 09:45:19 +01:00
Avi Kivity
3c44445c07 Merge "Introduce off-strategy compaction for repair-based bootstrap and replace" from Raphael
"
Scylla suffers from aggressive compaction after a repair-based operation has been initiated. That translates into bad latency and slowness for the operation itself.

This aggressiveness comes from the fact that:
1) new sstables are immediately added to the compaction backlog, reducing the bandwidth available for the operation.
2) new sstables are in bad shape when integrated into the main sstable set, not conforming to the strategy invariant.

To solve this problem, new sstables will be incrementally reshaped, off the compaction strategy, until finally integrated into the main set.

The solution takes advantage of the fact that there's only one sstable per vnode range, meaning sstables generated by repair-based operations are disjoint.

NOTE: off-strategy for repair-based decommission and removenode will follow this series and require little work as the infrastructure is introduced in this series.

Refs #5226.
"

* 'offstrategy_v7' of github.com:raphaelsc/scylla:
  tests: Add unit test for off-strategy sstable compaction
  table: Wire up off-strategy compaction on repair-based bootstrap and replace
  table: extend add_sstable_and_update_cache() for off-strategy
  sstables/compaction_manager: Add function to submit off-strategy work
  table: Introduce off-strategy compaction on maintenance sstable set
  table: change build_new_sstable_list() to accept other sstable sets
  table: change non_staging_sstables() to filter out off-strategy sstables
  table: Introduce maintenance sstable set
  table: Wire compound sstable set
  table: prepare make_reader_excluding_sstables() to work with compound sstable set
  table: prepare discard_sstables() to work with compound sstable set
  table: extract add_sstable() common code into a function
  sstable_set: Introduce compound sstable set
  reshape: STCS: preserve token contiguity when reshaping disjoint sstables
2021-03-22 10:43:13 +02:00
Gleb Natapov
272cb1c1e6 raft: add test for leader stepdown 2021-03-22 10:31:16 +02:00
Gleb Natapov
9d6bf7f351 raft: introduce leader stepdown procedure
Section 3.10 of the PhD describes two cases for which the extension can
be helpful:

1. Sometimes the leader must step down. For example, it may need to reboot
 for maintenance, or it may be removed from the cluster. When it steps
 down, the cluster will be idle for an election timeout until another
 server times out and wins an election. This brief unavailability can be
 avoided by having the leader transfer its leadership to another server
 before it steps down.

2. In some cases, one or more servers may be more suitable to lead the
 cluster than others. For example, a server with high load would not make
 a good leader, or in a WAN deployment, servers in a primary datacenter
 may be preferred in order to minimize the latency between clients and
 the leader. Other consensus algorithms may be able to accommodate these
 preferences during leader election, but Raft needs a server with a
 sufficiently up-to-date log to become leader, which might not be the
 most preferred one. Instead, a leader in Raft can periodically check
 to see whether one of its available followers would be more suitable,
 and if so, transfer its leadership to that server. (If only human leaders
 were so graceful.)

The patch here implements the extension and employs it automatically
when a leader removes itself from a cluster.
2021-03-22 10:28:43 +02:00
Gleb Natapov
888b52dea1 raft: fix replication when leader is not part of current config
When a leader orchestrates its own removal from a cluster there is a
situation where the leader is still responsible for replication, but it
is no longer part of active configuration. Current code skips replication
in this case though. Fix it by always replicating in the leader state.
2021-03-22 09:52:17 +02:00
Gleb Natapov
1acc8996bc raft: do not update last election time if current leader is not a part of current configuration
Since we use an external failure detector instead of relying on empty
AppendRequests from a leader, there can be a situation where a node
is no longer part of a certain raft group but is still alive (and may
also be part of other raft groups). In such a case the last election time
should not be updated even if the node is alive. It is the same as if
it had stopped sending empty AppendRequests in original raft.
2021-03-22 09:52:17 +02:00
Gleb Natapov
ccf4435759 raft: move log limiting semaphore into the leader state
Log limiting semaphore is used on a leader only, so it should be stored
inside the leader state.
2021-03-22 09:52:17 +02:00
Takuya ASADA
35a14ab22b configure.py: drop compat-python3 targets
Since we switched scylla-python3 build directory to tools/python3/build
on Jenkins, we no longer need compat-python3 targets, so drop them.

Related scylladb/scylla-pkg#1554

Closes #8328
2021-03-21 18:04:27 +02:00
Benny Halevy
f562c9c2f3 test: sstable_datafile_test: tombstone_purge_test: use a longer ttl
As seen in next-3319 unit testing on jenkins
The cell ttl may expire during the test (presuming
that the test machine was overloaded), leading to:
```
INFO  2021-03-21 10:05:23,048 [shard 0] compaction - [Compact tests.tombstone_purge 2fcaf680-8a1c-11eb-b1b9-97020c5d261e] Compacting [/jenkins/workspace/scylla-master/next/scylla/testlog/release/scylla-af8644ec-7f07-4ffe-80bf-6703a942e435/la-17-big-Data.db:level=0:origin=, ]
INFO  2021-03-21 10:05:23,048 [shard 0] compaction - [Compact tests.tombstone_purge 2fcaf680-8a1c-11eb-b1b9-97020c5d261e] Compacted 1 sstables to []. 4kB to 0 bytes (~0% of original) in 0ms = 0 bytes/s. ~128 total partitions merged to 0.
./test/lib/mutation_assertions.hh(108): fatal error: in "tombstone_purge_test": Mutations differ, expected {table: 'tests.tombstone_purge', key: {'id': alpha, token: -7531858254489963}, mutation_partition: {
  rows: [
    {
      cont: true,
      dummy: false,
      position: {
        bound_weight: 0,
      },
      'value': { atomic_cell{1,ts=1616313953,expiry=1616313958,ttl=5} },
    },
  ]
}
}
 ...but got: {table: 'tests.tombstone_purge', key: {'id': alpha, token: -7531858254489963}, mutation_partition: {
  rows: [
    {
      cont: true,
      dummy: false,
      position: {
        bound_weight: 0,
      },
      'value': { atomic_cell{DEAD,ts=1616313953,deletion_time=1616313953} },
    },
  ]
}
}
```

This corresponds to:
```
2395            auto mut2 = make_expiring(alpha, ttl);
2396            auto mut3 = make_insert(beta);
...
2399            auto sst2 = make_sstable_containing(sst_gen, {mut2, mut3});
```

Extend (logical) ttl to 10 seconds to reduce flakiness
due to real-time timing.

Test: sstable_datafile_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210321142931.1226850-1-bhalevy@scylladb.com>
2021-03-21 16:42:00 +02:00
Avi Kivity
1e820687eb Merge "reader_concurrency_semaphore: limit non-admitted inactive reads" from Botond
"
Due to a bad interaction of recent changes (913d970 and 4c8ab10), inactive
readers that are not admitted have managed to completely fly under the
radar, avoiding any sort of limitation. The reason is that pre-admission
the permits don't forward their resource cost to the semaphore, to
prevent them possibly blocking their own admission later. However this
meant that if such a reader is registered as inactive, it completely
avoids the normal resource based eviction mechanism and can accumulate
without bounds.
The real solution to this is to move the semaphore before the cache and
make all reads pass admission before they get started (#4758). Although
work has been started towards this, it is still a while until it lands.
In the meanwhile this patchset provides a workaround in the form of a
new inactive state, which -- like admitted -- causes the permit to
forward its cost to the semaphore, making sure these un-admitted
inactive reads are accounted for and evicted if there are too many of
them.

Fixes: #8258

Tests: unit(release), dtest(oppartitions_test.py:TestTopPartitions.test_read_by_gause_key_distribution_for_compound_primary_key_and_large_rows_number)
"

* 'reader-concurrency-semaphore-limit-inactive-reads/v4' of https://github.com/denesb/scylla:
  test: mutation_reader_test: add test for permit cleanup
  test: querier_cache_test: add memory based cache eviction test
  reader_permit: add inactive state
  querier: insert(): account immediately evicted querier as resource based eviction
  reader_concurrency_semaphore: fix clear_inactive_reads()
  reader_concurrency_semaphore: make inactive_read_handle a weak reference
  reader_concurrency_semaphore: make evict() noexcept
  reader_concurrency_semaphore: update out-of-date comments
2021-03-21 16:24:54 +02:00
Nadav Har'El
ab75226626 test/cql-pytest: remove xfail from passing test
After commit 0bd201d3ca ("cql3: Skip indexed
column for CK restrictions") fixed issue #7888, the test
cassandra_tests/validation/entities/frozen_collections_test.py::testClusteringColumnFiltering
began passing, as expected. So we can remove its "xfail" label.

Refs #7888.

cassandra_tests/validation/entities/frozen_collections_test.py::testClusteringColumnFiltering

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210321080522.1831115-1-nyh@scylladb.com>
2021-03-21 16:02:30 +02:00
Avi Kivity
e2cd551880 Update seastar submodule
* seastar ea5e529f30...83339edb04 (21):
  > cmake: filter out -Wno-error=#warnings from pkgconfig (seastar.pc)
  > Merge 'utils/log.cc: fix nested_exception logging (again)' from Vlad Zolotarov
Fixes #8327.
  > file: Add option to refuse the append-challenged file
  > Merge "Teach io-tester to work on block device" from Pavel E
  > Merge "Cleanup files code" from Pavel E
  > install-dependencies: Support rhel-8.3
  > install-dependencies: Add some missing rh packages
  > file, reactor: reinstate RWF_NOWAIT support
  > file: Prevent fsxattr.fsx_extsize from overflow
  > cmake: enable clang's -Wno-error=#warnings if supported
  > cmake: harden seastar_supports_flag aginst inputs with spaces or #
  > cmake: fix seastar_supports_flag failing after first invocation
  > thread: Stop backtraces in main() on s390x architecture
  > intent: Explicitly declare constructors for references
  > test: file_io_test: parallel_overwrite: use testing::local_random_engine
  > util: log-impl: rework log_buf::inserter_iterator
  > rwlock: pass timeout parameter to get_units
  > concepts: require lib support to enable concepts
  > rpc: print more info on bad protocol magic
  > seastar-addr2line: strip input line to restore multiline support
  > log: skip on unknown nested mixing instead of stopping the logging
Ref #8327.
2021-03-21 15:58:10 +02:00
Nadav Har'El
10bf2ba60a cql-pytest: translate Cassandra's reproducers for issue #2962
This is a translation of Cassandra's CQL unit test source file
validation/entities/SecondaryIndexOnMapEntriesTest.java into our
cql-pytest framework.

This test file checks various features of indexing (with secondary index)
individual entries of maps. All these tests pass on Cassandra, but fail on
Scylla because of issue #2962 - we do not yet support indexing of the content
of unfrozen collections. The failing tests currently fail as soon as they
try to create the index, with the message:
"Cannot create secondary index on non-frozen collection or UDT column v".

Refs #2962.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210310124638.1653606-1-nyh@scylladb.com>
2021-03-21 12:30:00 +02:00
Avi Kivity
75da8a8d81 Merge 'Fix the retry mechanism in Thrift frontend' from Piotr Sarna
Thrift used to be quite unsafe with regard to its retry mechanism, which caused very rapid use of resources, namely the number of file descriptors. It was also prone to use-after-free due to spawning futures without guarding the captured objects with anything.
The mechanism is now cleaned up, and a simple exponential backoff replaced the previous constant backoff policy.

Fixes #8317
Tests: unit(dev), manual(see #8317 for a simple reproducer)

Closes #8318

* github.com:scylladb/scylla:
  thrift: add exponential backoff for retries
  thrift: fix and simplify retry logic
2021-03-21 12:26:13 +02:00
Avi Kivity
a78f43b071 Merge 'tracing: fast slow query tracing' from Ivan Prisyazhnyy
The set of patches introduces a new tracing mode - `fast slow query tracing`. In this mode, Scylla tracks only tracing sessions and omits all tracing events if the tracing context does not have a `full_tracing` state set.

Fixes #2572

Motivation
---

We want to run production systems with this option always enabled, so that we can always catch slow queries without overhead. The next step is to further reduce the cost of having tracing enabled by minimizing the session context handling overhead, so that it is as transparent to the end user as possible.

Fast tracing mode
---

To read the status do

    $ curl -v http://localhost:10000/storage_service/slow_query

To enable fast slow-query tracing

    $ curl -v --request POST http://localhost:10000/storage_service/slow_query\?fast=true\&enable=true

Potential optimizations
---

- remove tracing::begin(lazy_eval)
- replace tracing::begin(string) for enum to remove copying and memory allocations
- merge parameters allocations
- group parameters check for trace context
- delay formatting
- reuse prepared statement shared_ptr instead of both copying it and copying its query

Performance
---

100% cache hits
---

1 Core:

```
$ SCYLLA_HOME=/home/sitano.public/Projects/scylla build/release/scylla --smp 1 --cpuset 7 --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --workdir /home/sitano.public/Projects/scylla --developer-mode 1 --listen-address 0.0.0.0 --api-address 0.0.0.0 --rpc-address 0.0.0.0 --broadcast-rpc-address 172.18.0.1 --broadcast-address 127.0.0.1

./cassandra-stress write n=100000 no-warmup -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=1 -mode native cql3

curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=false\&enable\=false
for i in $(seq 5); do
  taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3
done

curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=true\&enable\=true
for i in $(seq 5); do
  taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3
done

curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=false\&enable\=true
for i in $(seq 5); do
  taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3
done
```

  | qps |   |   |  
-- | -- | -- | -- | --
  | baseline | fast, slow | nofast, slow | %[1-fastslow/baseline]
  | 29,018 | 26,468 | 23,591 | 8.79%
  | 28,909 | 26,274 | 23,584 | 9.11%
  | 28,900 | 26,547 | 23,598 | 8.14%
  | 28,921 | 26,669 | 23,596 | 7.79%
  | 28,821 | 26,385 | 23,601 | 8.45%
stdev | 70.24030182 | 150.9678774 | 6.670832032 |  
avg | 28,914 | 26,469 | 23,594 |  
stderr | 0.24% | 0.57% | 0.03% |  
%[1-avg/baseline] |   | **8.46%** | 18.40% |  

8.46% performance degradation in `fast slow query mode` for pure in-memory workload with minimum traces.
18.40%  performance degradation in `original slow query mode` for pure in-memory workload with minimum traces.
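
The summary figures can be re-derived from the per-run numbers in the
table above. The following is a small Python sketch of that arithmetic
(the function and variable names are made up for illustration); it
computes the degradation as 1 - avg(mode) / avg(baseline).

```python
from statistics import mean

# Per-run qps values copied from the table above.
baseline = [29018, 28909, 28900, 28921, 28821]
fast_slow = [26468, 26274, 26547, 26669, 26385]
nofast_slow = [23591, 23584, 23598, 23596, 23601]


def degradation_pct(mode_runs, baseline_runs):
    """Throughput lost relative to the baseline, in percent."""
    return (1 - mean(mode_runs) / mean(baseline_runs)) * 100


print(f"fast slow query mode:     {degradation_pct(fast_slow, baseline):.2f}%")
print(f"original slow query mode: {degradation_pct(nofast_slow, baseline):.2f}%")
```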

0% cache hits
---

1GB memory, 1 Core:

    $ SCYLLA_HOME=/home/sitano.public/Projects/scylla build/release/scylla --memory 1G --smp 1 --cpuset 7 --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --workdir /home/sitano.public/Projects/scylla --developer-mode 1 --listen-address 0.0.0.0 --api-address 0.0.0.0 --rpc-address 0.0.0.0 --broadcast-rpc-address 172.18.0.1 --broadcast-address 127.0.0.1

2.4GB, 10000000 keys data:

    $ ./cassandra-stress write n=10000000 no-warmup -pop seq=1..10000000 -node 127.0.0.1 -log level=verbose -rate threads=4 -mode native cql3
    $ curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=true\&enable\=true

CASSANDRA_STRESS prepared statements with BYPASS CACHE

    $ taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..10000000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3

20000 reads IOPS, 100MB/s from disk

  | qps |   |   |  
-- | -- | -- | -- | --
  | baseline reads | fast, slow reads | %[1-fastslow/baseline] |  
  | 9,575 | 9,054 | 5.44% |  
  | 9,614 | 9,065 | 5.71% |  
  | 9,610 | 9,066 | 5.66% |  
  | 9,611 | 9,062 | 5.71% |  
  | 9,614 | 9,073 | 5.63% |  
stdev | 16.75410397 | 6.892024376 |
avg | 9,605 | 9,064 |
stderr | 0.17% | 0.08% |
%[1-avg/baseline] |   | **5.63%** |

5.63% performance degradation in `fast slow query mode` for pure on-disk workload with minimum traces.

Closes #8314

* github.com:scylladb/scylla:
  tracing: fast mode unit test
  tracing: rest api for lightweight slow query tracing
  tracing: omit tracing session events and subsessions in fast mode
2021-03-21 12:15:17 +02:00
Dejan Mircevski
318f773d81 types: Unreverse tuple subtype for serialization
When a tuple value is serialized, we go through every element type and
use it to serialize element values.  But an element type can be
reversed, which is artificially different from the type of the value
being read.  This results in a server error due to the type mismatch.
Fix it by unreversing the element type prior to comparing it to the
value type.

Fixes #7902

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8316
2021-03-21 12:07:29 +02:00
Dejan Mircevski
0bd201d3ca cql3: Skip indexed column for CK restrictions
When querying an index table, we assemble clustering-column
restrictions for that query by going over the base table token,
partition columns, and clustering columns.  But if one of those
columns is the indexed column, there is a problem; the indexed column
is the index table's partition key, not clustering key.  We end up
with invalid clustering slice, which can cause problems downstream.

Fix this by skipping the indexed column when assembling the clustering
restrictions.

Tests: unit (dev)

Fixes #7888

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8320
2021-03-21 09:52:06 +02:00
Avi Kivity
58b7f225ab keys: convert trichotomic comparators to return std::strong_ordering
A trichotomic comparator returning an int can easily be mistaken
for a less comparator as the return types are convertible.

Use the new std::strong_ordering instead.

A caller in cql3's update_parameters.hh is also converted, following
the path of least resistance.

Ref #1449.

Test: unit (dev)

Closes #8323
2021-03-21 09:30:43 +02:00
Avi Kivity
29a5047982 utils: error_injection: convert enable_if to concepts
Constrain inject() with a requires clause rather than enable_if,
simplifying the code and compiler diagnostics.

Note that the second instance could not have been called, since
the template argument does not appear in the function parameter
list and thus could not be deduced. This is corrected here.

Closes #8322
2021-03-21 09:28:23 +02:00
Avi Kivity
c28d67dd7f types: time_point_to_string: convert enable_if to concepts
time_point_to_string ensures its input is a time_point with
millisecond resolution (though it neglects to verify the epoch
is what it expects). Change the test from a clunky enable_if to
a nicer concept.

Closes #8321
2021-03-21 09:11:40 +02:00
Tomasz Grabiec
88a019ba21 Merge "raft: respond with snapshot_reply to send_snapshot RPC" from Kostja
Currently send_snapshot is the only two-way RPC used by Raft.
However, the sender (the leader) does not look at the receiver's
reply, other than checking that it's not an error. This has the following
issues:

- if the follower has a newer term and rejects the snapshot for
  that reason, the leader will not learn about a newer follower
  term and will not step down
- the send_snapshot message doesn't pass through a single-endpoint
  fsm::step() and thus may not follow the general Raft rules
  which apply for all messages.
- making a general purpose transport that simply calls fsm::step()
  for every message becomes impossible.

Fix it by actually responding with snapshot_reply to send_snapshot
RPC, generating this reply in fsm::step() on the follower,
and feeding into fsm::step() on the leader.
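
The first issue above can be illustrated with a toy model. This is a
Python sketch only: the class names and fields are hypothetical and do
not mirror the actual raft code. It shows why the reply has to reach
step(): the general Raft rule that a message carrying a newer term
demotes the receiver only fires if the reply is actually fed in.

```python
from dataclasses import dataclass


@dataclass
class SnapshotReply:
    """Hypothetical reply message for this sketch."""
    current_term: int  # term of the follower that answered
    success: bool      # whether the follower accepted the snapshot


class ToyLeader:
    """Minimal stand-in for a state machine that is currently leader."""

    def __init__(self, term):
        self.term = term
        self.is_leader = True

    def step(self, reply):
        # General Raft rule: seeing a newer term means stepping down.
        if reply.current_term > self.term:
            self.term = reply.current_term
            self.is_leader = False


leader = ToyLeader(term=3)
# The follower rejected the snapshot because it is already in term 5.
leader.step(SnapshotReply(current_term=5, success=False))
print(leader.is_leader, leader.term)
```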

* scylla-dev/raft-send-snapshot-v2:
  raft: pass snapshot_reply into fsm::step()
  raft: respond with snapshot_reply to send_snapshot RPC
  raft: set follower's next_idx when switching to SNAPSHOT mode
  raft: set the current leader upon getting InstallSnapshot
2021-03-19 18:13:40 +01:00
Piotr Sarna
31d3854bb7 thrift: add exponential backoff for retries
The original backoff mechanism which just retries after 1ms
may still lead to rapid resource depletion.
Instead, an exponential backoff is used, with a cap of ~2s.
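
A capped exponential backoff of this kind can be sketched as follows.
This is illustrative Python, not the patch itself: the 1ms starting
delay comes from the old policy described above, while the doubling
factor and the exact cap of 2048ms are assumptions chosen to match
"~2s".

```python
from itertools import islice


def backoff_delays_ms(initial_ms=1, cap_ms=2048, factor=2):
    """Yield retry delays that grow geometrically until they hit the cap."""
    delay = initial_ms
    while True:
        yield delay
        # Assumed policy: multiply on every retry, never exceed the cap.
        delay = min(delay * factor, cap_ms)


# First 13 delays: the 12th reaches the cap and later ones stay there.
print(list(islice(backoff_delays_ms(), 13)))
```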

Tests: manual, with cassandra-stress and browsing logs
2021-03-19 13:16:39 +01:00
Piotr Sarna
f81044d75d thrift: fix and simplify retry logic
The retry logic for Thrift frontend had two bugs:
1. Due to missing break in a switch statement,
   two retry calls were always performed instead of one,
   which acts a little bit like a Seastar forkbomb
2. The delayed action was not guarded with any gate,
   so it was theoretically possible to access a captured `this`
   pointer of an object which already got deallocated.

In order to fix the above, the logic is simplified to always
retry with backoff - it makes very little sense to skip the backoff
and immediate retries are not needed by anyone, while they cause
severe overload risk.
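
To see why two retries per failure behaves like a forkbomb, here is a toy count in Python. It assumes every retry fails again, which is the worst case; pending_retries is a hypothetical helper, not a function in the code base.

```python
def pending_retries(rounds, retries_per_failure):
    """Number of in-flight retries after `rounds` consecutive failures."""
    pending = 1
    for _ in range(rounds):
        # Every failed attempt schedules `retries_per_failure` new attempts.
        pending *= retries_per_failure
    return pending


buggy = pending_retries(10, 2)  # fall-through: two retries per failure
fixed = pending_retries(10, 1)  # one retry per failure
print(buggy, fixed)
```

Under that assumption the buggy variant doubles the number of outstanding retries each round, while the fixed one stays constant.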

Tests: manual - a simple cassandra-stress invocation was able to crash
       scylla with a segfault:
       $ cassandra-stress write -mode thrift -rate threads=2000

Fixes #8317
2021-03-19 13:15:35 +01:00
Nadav Har'El
abab1d906c Merge 'sstables: convert enable_if to equivalent concepts' from Avi Kivity
enable_if is hard to understand, especially its error messages. Convert
enable_if in sstable code to concepts.

A new concept is introduced, self_describing, for the case of a type
that follows the obj.describe_type() protocol. Otherwise this is quite
straightforward.

Closes #8315

* github.com:scylladb/scylla:
  sstables: vector write: convert to concepts
  sstables: check_truncated_and_assign: convert to concept
  sstables: convert write() to concepts
  sstables: convert write_vint() to concepts
  sstables: vector parse(): convert to concept
  sstables: convert parse() for a self-describing type to concept
  sstables: read_vint(): convert enable_if to concepts
  sstables: add concept for self-describing type
2021-03-18 23:09:34 +02:00
Raphael S. Carvalho
64d78eae6a tests: Add unit test for off-strategy sstable compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 16:56:00 -03:00
Avi Kivity
bf0c7d1340 sstables: vector write: convert to concepts
We have an integral and a non-integral overload, each constrained
with enable_if. We use std::integral to constrain the integral
overload and leave the other unconstrained, as C++ will choose the
more constrained version when applicable.
2021-03-18 19:26:54 +02:00
Avi Kivity
11636563d9 sstables: check_truncated_and_assign: convert to concept
Use std::integral instead of static_assert to reject non-integral
parameters.
2021-03-18 19:26:54 +02:00
Avi Kivity
42e3f33722 sstables: convert write() to concepts
There are three variants: integral, enum, and self-describing
(currently expressed as not integral and not enum). Convert to
concepts by using the standard concepts or the new self_describing
concept.
2021-03-18 19:26:43 +02:00
Avi Kivity
4832041857 sstables: convert write_vint() to concepts
Instead of a maze of deleted functions, enable_if, and static_assert,
use the standard std::integral concept.
2021-03-18 19:24:42 +02:00
Nadav Har'El
0b2cf21932 alternator-test: increase read timeout and avoid retries
By default the boto3 library waits up to 60 seconds for a response,
and if it gets no response, it sends the same request again, multiple
times. We already noticed in the past that it retries too many times
thus slowing down failures, so in our test configuration lowered the
number of retries to 3, but the setting of 60-second-timeout plus
3 retries still causes two problems:

  1. When the test machine and the build are extremely slow, and the
     operation is long (usually, CreateTable or DeleteTable involving
     multiple views), the 60 second timeout might not be enough.

  2. If the timeout is reached, boto3 silently retries the same operation.
     This retry may fail because the previous one really succeeded at
     least partially! The symptom is tests which report an error when
     creating a table which already exists, or deleting a table which
     doesn't exist.

The solution in this patch is first of all to never do retries - if
a query fails on internal server error, or times out, just report this
failure immediately. We don't expect to see transient errors during
local tests, so this is exactly the right behavior.
The second thing we do is to increase the default timeout. If 1 minute
was not enough, let's raise it to 5 minutes. 5 minutes should be enough
for every operation (famous last words...).

Even if 5 minutes is not enough for something, at least we'll now see
the timeout errors instead of some weird errors caused by retrying an
operation which was already almost done.
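
Problem 2 can be reproduced with a toy model. The sketch below uses only the Python standard library; FakeServer and both client functions are hypothetical stand-ins and do not use boto3 or Alternator. It assumes the server completes the operation but the response is lost once.

```python
class FakeServer:
    """Toy stand-in for a server that creates tables."""

    def __init__(self):
        self.tables = set()

    def create_table(self, name):
        if name in self.tables:
            raise RuntimeError("table already exists")
        self.tables.add(name)


def create_with_silent_retry(server, name, responses_lost):
    """Resend the request whenever the response is lost."""
    while True:
        server.create_table(name)   # the server does the work...
        if responses_lost > 0:
            responses_lost -= 1     # ...but the client times out waiting
            continue                # and silently sends the request again
        return "ok"


def create_without_retry(server, name, responses_lost):
    """Report the timeout immediately instead of retrying."""
    server.create_table(name)
    if responses_lost > 0:
        raise TimeoutError("read timeout")
    return "ok"


def outcome(client):
    try:
        return client(FakeServer(), "t", 1)
    except Exception as e:
        return type(e).__name__ + ": " + str(e)


with_retry = outcome(create_with_silent_retry)
without_retry = outcome(create_without_retry)
print(with_retry)
print(without_retry)
```

With the silent retry the caller sees a confusing "already exists" error; without it, the caller sees the timeout that really happened.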

Fixes #8135

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210222125630.1325011-1-nyh@scylladb.com>
2021-03-18 18:58:08 +02:00
Avi Kivity
777d48e78d sstables: vector parse(): convert to concept
The two vector parse() overloads select between integral members
and non-integral members. Use std::integral to constrain the
integral overload and leave the other unconstrained; C++ will choose
the more constrained version when it applies.
2021-03-18 18:48:11 +02:00
Avi Kivity
bc42aee7c1 sstables: convert parse() for a self-describing type to concept
This parse() overload uses "not integral and not enum" to reject
non-self-describing types. Express it directly with the self_describing
concept instead.
2021-03-18 18:47:00 +02:00
Avi Kivity
a96b8e8aed sstables: read_vint(): convert enable_if to concepts
Convert read_vint() to a concept. The explicitly deleted version
is no longer needed since wrongly-typed inputs will be rejected
by the constraint. Similarly the static assert can be dropped
for the same reason.
2021-03-18 18:45:05 +02:00
Avi Kivity
bba9c1c616 sstables: add concept for self-describing type
Our sstable parsing and writing code contains a self-describing
type concept, where a type can advertise its members via a
describe_type() member function with a specific protocol.

Formalize that into a C++ concept. This is a little tricky, since
describe_type() accepts a parameter that is itself a template, and
requires clauses only work with concrete types. To handle this problem,
create such a concrete example type and use it in the concept.
2021-03-18 17:52:54 +02:00
Botond Dénes
7980140549 test: test_utils: do_check()/do_require(): tone down log to trace
They are way too noisy to be at debug level.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210318143547.101932-1-bdenes@scylladb.com>
2021-03-18 16:59:59 +02:00
Raphael S. Carvalho
65b09567dd table: Wire up off-strategy compaction on repair-based bootstrap and replace
Now, sstables created by bootstrap and replace will be added to the
maintenance set, and once the operation completes, off-strategy compaction
will be started.

We wait until the end of operation to trigger off-strategy, as reshaping
can be more efficient if we wait for all sstables before deciding what
to compact. Also, waiting for completion is no longer an issue because
we're able to read from new sstables using partitioned_sstable_set and
their existence isn't accounted for by the compaction backlog tracker yet.

Refs #5226.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
c45d2e1d27 table: extend add_sstable_and_update_cache() for off-strategy
Function is extended to add sstable to maintenance set if requested
by the caller.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
6ca2ac34ac sstables/compaction_manager: Add function to submit off-strategy work
This new variant will allow its caller to submit off-strategy job
asynchronously on behalf of a given table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
e0e5bf8285 table: Introduce off-strategy compaction on maintenance sstable set
Off-strategy compaction is about incrementally reshaping the off-strategy
sstables in maintenance set, using our existing reshape mechanism, until
the set is ready for integration into the main sstable set.
The whole operation is done in maintenance mode, using the streaming
scheduling group.
We can do it this way because data in maintenance set is disjoint, so
the effect on read amplification is avoided by using
partitioned_sstable_set, which is able to efficiently and incrementally
retrieve data from disjoint sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
439e9b6fab table: change build_new_sstable_list() to accept other sstable sets
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
6e95860e09 table: change non_staging_sstables() to filter out off-strategy sstables
SSTables that are off-strategy should be excluded by this function as
it's used to select candidates for regular compaction.
So in addition to only returning candidates from the main set, let's
also rename it to precisely reflect its behavior.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
c64a156c53 table: Introduce maintenance sstable set
This new sstable set will hold sstables created by repair-based
operations. A repair-based op creates 1 sstable per vrange (256),
so sstables added to this new set are disjoint, therefore they
can be efficiently read from using partitioned_sstable_set.

Compound set is changed to include this new set, so sstables in
this new set are automatically included when creating readers,
computing statistics, and so on.
This new set is not backlog tracked, so changes were needed to
prevent a sstable in this set from being added or removed from
the tracker.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:47 -03:00
Raphael S. Carvalho
1e7a444a8b table: Wire compound sstable set
From now on, _sstables becomes the compound set, and _main_sstables refers
only to the main sstables of the table. In the near future, maintenance
set will be introduced and will also be managed by the compound set.

So add_sstable() and on_compaction_completion() are changed to
explicitly insert and remove sstables from the main set.

By storing compound set in _sstables, functions which used _sstables for
creating reader, computing statistics, etc, will not have to be changed
when we introduce the maintenance set, so the code change is greatly reduced
by this approach.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:46:06 -03:00
Raphael S. Carvalho
42b309b43e table: prepare make_reader_excluding_sstables() to work with compound sstable set
Compound set will not be inserted or erased directly, so let's change
this function to build a new set from scratch instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:50 -03:00
Raphael S. Carvalho
4e142458eb table: prepare discard_sstables() to work with compound sstable set
After compound set, discard_sstables() will have to prune each set
individually and later refresh the compound set. So let's change
the function to support multiple sstable sets, taking into account
that a sstable set may not want to be backlog tracked.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:50 -03:00
Raphael S. Carvalho
d25822a030 table: extract add_sstable() common code into a function
The purpose is to allow the code to be eventually reused by maintenance
sstable set, which will be soon introduced.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:50 -03:00
Raphael S. Carvalho
e4b5f5ba33 sstable_set: Introduce compound sstable set
This new sstable set implementation is useful for combining operation of
multiple sstable sets, which can still be referenced individually via
its shared ptr reference.
It will be used when maintenance set is introduced in table, so a
compound set is required to allow both sets to have their operations
efficiently combined.
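
The idea can be sketched in Python as below. SimpleSet and CompoundSet are hypothetical names used only for illustration; the real sstable_set interface is C++ and is richer than this.

```python
class SimpleSet:
    """Minimal stand-in for one sstable set."""

    def __init__(self, names=()):
        self.sstables = list(names)

    def insert(self, name):
        self.sstables.append(name)


class CompoundSet:
    """Combines several sets without copying them.

    The compound set keeps references to the underlying sets, so a
    change made through an individual set is visible here as well.
    """

    def __init__(self, *sets):
        self._sets = sets

    def all(self):
        return [s for sub in self._sets for s in sub.sstables]


main_set = SimpleSet(["a", "b"])
maintenance_set = SimpleSet()
compound = CompoundSet(main_set, maintenance_set)

# Inserts go to an individual set, never to the compound set itself.
maintenance_set.insert("m1")
combined = compound.all()
print(combined)
```

Readers that iterate over the compound set pick up sstables from both underlying sets without being aware of the split.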

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:49 -03:00
Raphael S. Carvalho
1261519266 reshape: STCS: preserve token contiguity when reshaping disjoint sstables
When reshaping hundreds of disjoint sstables, like on bootstrap,
contiguity wasn't being preserved because the heuristic for picking
candidates didn't take into account their token range, which resulted
in reshape messing with the contiguity that could otherwise be
preserved by respecting the token order of the disjoint sstables.
In other words, sstables with the smallest first tokens should be
compacted first. By doing that, the contiguity is preserved even
across size tiers, after reshape has completed its possible multiple
rounds to get all the data in shape.
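
A small Python sketch of the ordering rule follows. It ignores size tiers entirely and only shows why picking by smallest first token keeps a batch contiguous; SSTable, pick_reshape_batch and is_contiguous are hypothetical names.

```python
from collections import namedtuple

SSTable = namedtuple("SSTable", "name first_token last_token")


def pick_reshape_batch(sstables, max_batch):
    """Pick up to max_batch sstables with the smallest first tokens."""
    return sorted(sstables, key=lambda s: s.first_token)[:max_batch]


def is_contiguous(batch, all_sstables):
    """True if no sstable outside the batch lies inside the batch's range."""
    lo = min(s.first_token for s in batch)
    hi = max(s.last_token for s in batch)
    return all(
        s in batch or s.last_token < lo or s.first_token > hi
        for s in all_sstables
    )


# Six disjoint sstables, listed in arbitrary (non-token) order.
tables = [
    SSTable("d", 30, 39), SSTable("a", 0, 9), SSTable("f", 50, 59),
    SSTable("b", 10, 19), SSTable("e", 40, 49), SSTable("c", 20, 29),
]

ordered_batch = pick_reshape_batch(tables, 3)
arbitrary_batch = tables[:3]
ordered_ok = is_contiguous(ordered_batch, tables)
arbitrary_ok = is_contiguous(arbitrary_batch, tables)
print([s.name for s in ordered_batch], ordered_ok, arbitrary_ok)
```

The batch picked in token order covers one unbroken range, while the arbitrary batch spans ranges owned by sstables it does not include.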

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:36:18 -03:00
Botond Dénes
ad02f313dd test: mutation_reader_test: add test for permit cleanup
Check that a permit correctly restores the units on the semaphore in
each state it can be destroyed in.
2021-03-18 16:18:22 +02:00
Raphael S. Carvalho
e53cedabb1 LCS: reshape: tolerate more sstables in level 0 with relaxed mode
Relaxed mode, used during initialization, of reshape only tolerates min_threshold
(default: 4) L0 sstables. However, relaxed mode should tolerate more sstables in
level 0, otherwise boot will have to reshape level 0 every time it crosses the
min threshold. So let's make LCS reshape tolerate a max of max_threshold and 32.
This change is beneficial because once table is populated, LCS regular compaction
can decide to merge those sstables in level 0 into level 1 instead, therefore
reducing WA.

Refs #8297.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com>
2021-03-18 15:58:21 +02:00
Botond Dénes
2b7c1bce86 scylla-gdb.py: add variant_member convenience function
Allow conveniently accessing the active member of an `std::variant`
instance.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210318134427.92668-1-bdenes@scylladb.com>
2021-03-18 15:57:51 +02:00
Konstantin Osipov
fcc6e621f8 raft: pass snapshot_reply into fsm::step()
By the time we receive snapshot_reply from a follower
we may no longer be the leader. Follower term may be
different from snapshot term, e.g. the follower may
be aware of a new leader already and have a higher term.

We should pass this information into (possibly ex-) leader FSM via
fsm::step() so that it can correctly change its state, and
not call FSM directly.
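
A toy Python model of the intended behaviour is shown below. The class Fsm and the SnapshotReply fields are assumptions for the example and are much simpler than the real raft::fsm.

```python
from dataclasses import dataclass


@dataclass
class SnapshotReply:
    # Field names are assumptions for this sketch.
    current_term: int
    success: bool


class Fsm:
    """Very small state machine that starts out as a leader."""

    def __init__(self, term):
        self.term = term
        self.role = "leader"
        self.last_snapshot_ok = None

    def step(self, msg):
        # General Raft rule: any message with a higher term demotes us.
        if msg.current_term > self.term:
            self.term = msg.current_term
            self.role = "follower"
            return
        if self.role == "leader" and isinstance(msg, SnapshotReply):
            self.last_snapshot_ok = msg.success


fsm = Fsm(term=3)
# The follower already knows about a newer term and rejects the snapshot.
fsm.step(SnapshotReply(current_term=5, success=False))
print(fsm.role, fsm.term)
```

Because the reply goes through step(), the higher term is noticed by the same rule that applies to every other message.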
2021-03-18 16:56:46 +03:00
Konstantin Osipov
4afa662d62 raft: respond with snapshot_reply to send_snapshot RPC
Raft send_snapshot RPC is actually two-way, the follower
responds with snapshot_reply message. This message until now
was, however, muted by RPC.

Do not mute snapshot_reply any more:
- to make it obvious the RPC is two way
- to feed the follower response directly into leader's FSM and
  thus ensure that FSM testing results produced when using a test
  transport are representative of the real world uses of
  raft::rpc.
2021-03-18 16:56:42 +03:00
Konstantin Osipov
cb3314d756 raft: set follower's next_idx when switching to SNAPSHOT mode
Set follower's next_idx to snapshot index + 1 when switching
it to snapshot mode. If snapshot transfer succeeds, that's the
best match for the follower's next replication index. If it fails,
the leader will send a new probe to find out the follower position
again and re-try sending a possibly newer snapshot.

The change helps reduce protocol state managed outside FSM.
2021-03-18 16:35:11 +03:00
Ivan Prisyazhnyy
f00391af8b tracing: fast mode unit test
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-18 15:05:09 +02:00
Ivan Prisyazhnyy
7cbe2aa9c6 tracing: rest api for lightweight slow query tracing
The patch adds REST API support for the lightweight
slow query tracing (fast) mode that is implemented by
omitting all of the trace events during the tracing.

    $ curl -v http://localhost:10000/storage_service/slow_query
    $ curl -v --request POST http://localhost:10000/storage_service/slow_query\?fast=true\&enable=true

Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-18 15:05:05 +02:00
Ivan Prisyazhnyy
85fbca2049 tracing: omit tracing session events and subsessions in fast mode
If tracing::tracing::_ignore_trace_events is enabled then
the tracing system must ignore all session events
for non full_tracing sessions (probability tracing and
user requested) and skip creating subsessions with
make_trace_info.

Patch introduces the slow query tracing fast mode that
omits all events during tracing.

Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-18 15:04:47 +02:00
Botond Dénes
c822f0d02a test: querier_cache_test: add memory based cache eviction test
Ensure that the memory consumption of querier cache entries is kept
under the limit.
2021-03-18 14:58:21 +02:00
Botond Dénes
a14bb4ba94 reader_permit: add inactive state
This state will be used for permits that are not in admitted state when
registered as inactive. We can have such reads if a read can be served
entirely from cache/memtables and it doesn't have to go to disk and
hence doesn't go through admission. These permits currently don't
forward their cost to the semaphore so they won't prevent their own
admission creating a deadlock. However, when in inactive state, we do
want to keep tabs on their resource consumption so we don't accumulate
too many of these inactive reads. So introduce a new state for these
non-admitted inactive reads. When entering the inactive state, the
permit registers its cost with the semaphore, and when unregistered as
inactive, it retracts it. This is a workaround (khm hack) until #4758 is
solved and all permits will be admitted on creation.
2021-03-18 14:58:21 +02:00
Botond Dénes
594636ebbf querier: insert(): account immediately evicted querier as resource based eviction
`reader_concurrency_semaphore::register_inactive_read()` drops the
registered inactive read immediately if there is a resource shortage.
This is in effect a resource based eviction, so account it as such in
`querier::insert()`.
2021-03-18 14:57:57 +02:00
Botond Dénes
1a337d0ec1 reader_concurrency_semaphore: fix clear_inactive_reads()
Broken by the move to an intrusive container (9cbbf40), which caused
said method to only clear the container but not destroy the inactive
reads contained therein. This patch restores the previous behaviour and
also adds a call to the destructor (to ensure inactive reads are cleaned up
under any circumstances), as well as a unit test.
2021-03-18 14:57:57 +02:00
Botond Dénes
581edc4e4e reader_concurrency_semaphore: make inactive_read_handle a weak reference
Having the handle keep an owning reference to the inactive read led to
awkward situations, where the inactive read is destroyed during eviction
in certain situations only (querier cache) and not in other cases.
Although the users didn't notice anything from this, it led to very
brittle code inside the reader concurrency semaphore. Among others, the
inactive read destructor has to be open coded in evict() which already
led to mistakes.
This patch goes back to the weak pointer paradigm used a while ago,
which is a much more natural fit for this. Inactive reads are still kept
in an intrusive list in the semaphore but the handle now keeps a weak
pointer to them. When destroyed, the handle will destroy the inactive
read if it is still alive. When evicting the inactive read, it will
set the pointer in the handle to null.
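
The ownership model can be sketched in Python as follows. All class and method names are hypothetical, and since Python has no ownership, "owning" here only means being responsible for keeping the read in the list. release() stands in for the handle's destructor.

```python
class InactiveRead:
    def __init__(self, name):
        self.name = name
        self.handle = None  # back-pointer to the handle, if any


class Handle:
    """Non-owning reference to an inactive read."""

    def __init__(self, semaphore, read):
        self._semaphore = semaphore
        self._read = read
        read.handle = self

    def __bool__(self):
        return self._read is not None

    def _invalidate(self):
        self._read = None

    def release(self):
        """Models handle destruction: drop the read if still alive."""
        if self._read is not None:
            self._semaphore._remove(self._read)
            self._read = None


class Semaphore:
    def __init__(self):
        self._inactive = []  # the semaphore keeps the inactive reads

    def register_inactive_read(self, name):
        read = InactiveRead(name)
        self._inactive.append(read)
        return Handle(self, read)

    def _remove(self, read):
        self._inactive.remove(read)
        read.handle = None

    def evict_one(self):
        """Evict the oldest read and null the pointer in its handle."""
        read = self._inactive.pop(0)
        read.handle._invalidate()
        read.handle = None

    def inactive_count(self):
        return len(self._inactive)


sem = Semaphore()
h1 = sem.register_inactive_read("r1")
h2 = sem.register_inactive_read("r2")
sem.evict_one()                 # evicts r1 and invalidates h1
h1_valid, h2_valid = bool(h1), bool(h2)
h1.release()                    # no-op, the read is already gone
h2.release()                    # removes r2 from the semaphore
remaining = sem.inactive_count()
print(h1_valid, h2_valid, remaining)
```

Eviction and handle destruction each clean up exactly once, regardless of which happens first.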
2021-03-18 14:57:57 +02:00
Botond Dénes
cbc83b8b1b reader_concurrency_semaphore: make evict() noexcept
In the next patch it will be called from a destructor.
2021-03-18 14:57:57 +02:00
Botond Dénes
2d348e0211 reader_concurrency_semaphore: update out-of-date comments
2021-03-18 14:57:57 +02:00
Botond Dénes
3b8220f777 scylla-gdb.py: update w.r.t. storage_proxy::_hints_manager not being optional
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210318110256.50137-1-bdenes@scylladb.com>
2021-03-18 12:47:57 +01:00
Piotr Sarna
2509b7dbde Merge 'dht: convert ring_position and decorated_key to std::strong_ordering' from Avi Kivity
As #1449 notes, trichotomic comparators returning int are dangerous as they
can be mistaken for less comparators. This series converts dht::ring_position
and dht::decorated_key, as well as a few closely related downstream types, to
return std::strong_ordering.

Closes #8225

* github.com:scylladb/scylla:
  dht: ring_position, decorated_key: convert tri_comparators to std::strong_ordering
  pager: rephrase misleading comparison check
  test: total_order_checks: prepare for std::strong_ordering
  test: mutation_test: prepare merge_container for std::strong_ordering
  intrusive_array: prepare for std::strong_ordering
  utils: collection-concepts: prepare for std::strong_ordering
2021-03-18 11:51:54 +01:00
Avi Kivity
378556418c dht: ring_position, decorated_key: convert tri_comparators to std::strong_ordering
Convert tri_comparators to return std::strong_ordering rather than int,
to prevent confusion with less comparators. Downstream users are either
also converted, or adjust the return type back to int, whichever happens
to be simpler; in all cases the change is trivial.
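
The confusion this guards against can be shown with a Python stand-in (the real code is C++; tri_compare and less are hypothetical names). An int-returning trichotomic result is truthy for both "smaller" and "greater", so it silently passes where a less predicate is expected. std::strong_ordering has no conversion to bool, so the same mistake fails to compile.

```python
def tri_compare(a, b):
    """Trichotomic comparator: negative, zero, or positive."""
    return (a > b) - (a < b)


def less(a, b):
    return a < b


# Misused as a "less" predicate, any nonzero result counts as true,
# so both orderings appear to satisfy a < b.
misused = [bool(tri_compare(1, 2)), bool(tri_compare(2, 1))]
correct = [less(1, 2), less(2, 1)]

# Equality is best spelled as an explicit comparison with zero.
equal_explicit = tri_compare(3, 3) == 0
print(misused, correct, equal_explicit)
```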
2021-03-18 12:40:05 +02:00
Avi Kivity
4ead1a79ce pager: rephrase misleading comparison check
We check !result_of_tri_compare, which makes it look like we're
checking a boolean predicate, whereas we're really checking for
equality. Change to result_of_tri_compare == 0, which is less likely
to be confusing, and is also compatible with std::strong_ordering.
2021-03-18 12:40:05 +02:00
Avi Kivity
a5f17b9a2d test: total_order_checks: prepare for std::strong_ordering
Adjust the total_order_check template to work with comparators
returning either int (as a temporary compatibility measure) or
std::strong_ordering (for #1449 safety).
2021-03-18 12:40:05 +02:00
Avi Kivity
f0092ae475 test: mutation_test: prepare merge_container for std::strong_ordering
The function merge_container() accepts a trichotomic comparator returning
an int. As #1449 explains, this is dangerous as it could be mistaken for
a less comparator. Switch to std::strong_ordering, but leave a compatible
merge_container() in place as it is still needed (even after this series).
2021-03-18 12:40:05 +02:00
Avi Kivity
fe0f983dfb intrusive_array: prepare for std::strong_ordering
Newer comparators can return std::strong_ordering, so don't
expect an int.
2021-03-18 12:40:05 +02:00
Avi Kivity
9fbe4850c9 utils: collection-concepts: prepare for std::strong_ordering
collection-concepts includes a Comparable concept for a trichotomic
comparator function, used in intrusive btree and double_decker. Prepare
for std::strong_ordering by also allowing std::strong_ordering as a
return type. Once we've cleaned the code base, we can tighten it to
only allow std::strong_ordering.
2021-03-18 12:40:03 +02:00
Piotr Sarna
0bcf584992 docs: mention --no-rebase in maintainer.md
For a default git config it's enough to pull with --no-ff
to ensure that a merge commit is created, but with a custom
configuration, it's better to also explicitly prevent
rebasing.

Message-Id: <7dc6027f1f38fa4db7435592a3b72308b1a08614.1616063525.git.sarna@scylladb.com>
2021-03-18 12:38:29 +02:00
Piotr Sarna
5a852d3812 Merge 'Decouple memory limiter sem from storage service' from Pavel
This set removes a few more calls for global storage service and prevents
more of them from happening in thrift that's about to start using the memory
limiter semaphore too.

The set turns this semaphore into a sharded one living in the scope of
main(), makes others use the local instance and removes the no longer
needed bits from storage service.

tests: unit(dev)
branch: https://github.com/xemul/scylla/commits/br-global-memory-limiter-sem

* xemul_drop_memory_limiter:
  storage_service: Drop memory limiter
  memory_limiter: Use main-local instance everyehere
  main: Have local memory limiter and carry where needed
  memory_limiter: Encapsulate memory limiting facility
  cql_server: Remove semaphore getter fn from config
2021-03-18 11:29:32 +01:00
Pavel Emelyanov
dcdd207349 storage_service: Drop memory limiter
Nobody uses it now.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Pavel Emelyanov
f0a79574d4 memory_limiter: Use main-local instance everyehere
The cql_server and alternator both need the limiter, so
patch them to stop using storage service's one and use
the main-local one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Pavel Emelyanov
359e9caf54 main: Have local memory limiter and carry where needed
Prepare memory limiters to have non-global instance of
the service. For now the main-local instance is not
used and (!) is not stopped for real, just like the
storage_service's one is.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Pavel Emelyanov
4ca2ae1341 memory_limiter: Encapsulate memory limiting facility
The storage service carries a semaphore and a size_t value
to facilitate the memory limiting for client services.

This patch encapsulates both fields in a separate helper
class that will be used by whoever needs it without
messing with the storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Pavel Emelyanov
c2f94fb527 cql_server: Remove semaphore getter fn from config
The cql_server() needs to get the memory limiter semaphore
from local storage service instance. To make this happen
a callback is introduced on the config structure. The same
can be achieved in a simpler manner -- by providing the
local storage service instances directly.

Actually, the storage service will be removed in further
patches from this place, so this patch is mostly to get
rid of the callback from the config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-18 11:28:45 +01:00
Nadav Har'El
4a7d3175e9 test/alternator: make another test faster
The slowest test in test_streams.py is test_list_streams_paged. It is meant
to test the ListStreams operation with paging. The existing test repeated
its test four times, for four different stream types. However, there is
no reason to suspect that the ListStreams operation might somehow be
different for the four stream types... We already have other tests which
create streams of the four types, and use these streams - we don't
need the test for ListStreams to also test creating the four types.

By doing this test just once, not four times, we can save around 1.5
seconds of test time.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210318073755.1784349-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Nadav Har'El
79af728335 test/alternator: make tracing test a bit faster
In the test test_tracing.py::test_tracing_all, we do some operations and
then need to wait until they appear in the tracing table.
The current code used an exponentially-increasing delay during this wait,
starting with 0.1 seconds and then doubling the delay until we find what
we're looking for.

However, it turns out that the delay until the data appears in the table
is deliberately chosen by Scylla - and is always around 2 seconds.
In this case, an exponential delay is really bad - we will usually wait
for around 1 second too long after the needed wait of 2 seconds.

So in this patch we replace the exponential delay by a constant delay -
we wait 0.3 seconds between each retry.

This change makes the test test_tracing.py::test_tracing_all finish
in a little over 2 seconds, instead of a little over 3 seconds
before this patch. We cannot reduce this 2 second time any further
unless we make the 2-second tracing delay configurable.
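
The arithmetic behind these numbers can be checked with a short Python sketch. It assumes the test sleeps first and then polls, that a poll costs no time, and that the data appears exactly 2 seconds in; time_until_seen_ms is a hypothetical helper.

```python
def time_until_seen_ms(ready_at_ms, delays_ms):
    """Total time slept before the first poll at or after ready_at_ms."""
    elapsed = 0
    for delay in delays_ms:
        elapsed += delay
        if elapsed >= ready_at_ms:
            return elapsed
    raise RuntimeError("ran out of retries")


exponential = [100 * 2**i for i in range(10)]  # 0.1 s, then doubling
constant = [300] * 10                          # 0.3 s between retries

exp_total = time_until_seen_ms(2000, exponential)
const_total = time_until_seen_ms(2000, constant)
print(exp_total, const_total)
```

The exponential schedule polls at 0.1, 0.3, 0.7, 1.5 and 3.1 seconds, overshooting by about a second, while the constant schedule first succeeds at 2.1 seconds.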

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210318000040.1782933-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Nadav Har'El
4e87f95b42 test/alternator: remove slow and unhelpful test
The test test_table.py::test_table_streams_on creates tables with various
stream types, and then immediately deletes them without testing anything.
This is a slow test (taking almost a full second on my laptop), and is
redundant because in test_streams.py we have tests which create tables
with streams in the same way - but then actually test that things work
with these streams. So this test might as well be removed, and this is
what we do in this patch.

Removing this test shaves another second from the Alternator test suite's
run time.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317230530.1780849-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Nadav Har'El
879656e3e0 test/alternator: make a test faster, safer and more correct
The test
test_condition_expression.py::test_condition_expression_with_forbidden_rmw
takes half a second to run (dev build, on my laptop), one of the slowest
tests in Alternator's test suite. Part of the reason was that it needlessly
set the same table to forbidden_rmw, multiple times.

Instead of doing that, we switch to using the test_table_s_forbid_rmw
fixture, which is a table like test_table_s but created just once in
forbid_rmw mode.

The result is a faster test (0.05 seconds instead of 0.5 seconds), but
also safer if we ever want to run tests in parallel. It also fixes a
bug in the test: At the end of the test, we intended to double-check
that although the forbid_rmw table forbids read-modify-write operations,
it does allow pure writes. Yet the test did this after clearing the
forbid_rmw mode... So after this patch the test verifies this on the
forbid_rmw table, as intended.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317222703.1779992-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Nadav Har'El
1c2e473e62 test/alternator: make a test faster
The test
test_condition_expression.py::test_condition_expression_with_permissive_write_isolation
currently takes (on my laptop, dev build) a full two seconds, one of
the slowest tests. It is not surprising it is slow - it runs five other
tests three times each (for three different write isolation modes),
but it doesn't have to be this slow. Before this patch, for each of
the five tests we switch the write isolation mode three times, and
these switches involve schema changes and are fairly slow. So in
this patch we reverse the loop - and switch the write isolation mode
to the outer loop.

This patch halves the runtime of this test - from two seconds to one.
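
The effect of reversing the loops can be counted with a small Python sketch. The mode and test names are placeholders, and count_switches is a hypothetical helper that only counts how often the mode changes between consecutive runs.

```python
def count_switches(order):
    """Count how many times the write isolation mode has to be changed."""
    switches = 0
    current = None
    for mode, _test in order:
        if mode != current:
            switches += 1
            current = mode
    return switches


modes = ["mode1", "mode2", "mode3"]
tests = ["t1", "t2", "t3", "t4", "t5"]

# Before: tests in the outer loop, so the mode changes on every run.
before = [(m, t) for t in tests for m in modes]
# After: modes in the outer loop, so each mode is set only once.
after = [(m, t) for m in modes for t in tests]

switches_before = count_switches(before)
switches_after = count_switches(after)
print(switches_before, switches_after)
```

Both orders run the same 15 combinations, but the second one needs only three schema-changing mode switches instead of fifteen.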

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317221045.1779329-1-nyh@scylladb.com>
2021-03-18 11:24:18 +01:00
Takuya ASADA
d9a625c842 scylla_setup: don't run node-exporter setup when it's not installed
We need to check that the package exists before running the setup of
node-exporter.

Fixes #8276

Closes #8278
2021-03-18 11:24:18 +01:00
Avi Kivity
f038d1555c Merge 'Add more context to configure.py' from Piotr Sarna
This series makes configure.py output slightly more helpful in case of incorrect parameters passed to the compiler/linker.

Closes #8267

* github.com:scylladb/scylla:
  configure: print more context if the linking attempt failed
  configure: provide more context on failed ./configure.py run
  configure: add verbose option to try_compile_and_link
2021-03-18 11:24:18 +01:00
Takuya ASADA
0424a41e30 tools/toolchain: stop ignoring error on install-dependencies.sh, run jmx/java script correctly
We should run install-dependencies.sh with the -e option so that errors
in the script are not ignored.
Also, we need to add tools/jmx/install-dependencies.sh and
tools/java/install-dependencies.sh, to fix the 'No such file or directory'
error on them.

Fixes #8293

Closes #8294

[avi: did not regenerate toolchain image, since no new packages are
      installed]
2021-03-18 11:24:18 +01:00
Avi Kivity
b91d6776a0 Update tools/java submodule
* tools/java fdc8fcc22c...7b66b7a0fc (1):
  > dist/redhat: add support SLES
2021-03-18 11:24:18 +01:00
Nadav Har'El
bd742f2951 Merge 'treewide: get rid of incorrect reinterpret casts' from Michał Chojnowski
In some places we use the `*reinterpret_cast<const net::packed<T>*>(&x)`
pattern to reinterpret memory. This is a violation of C++'s aliasing rules,
which invokes undefined behaviour.

The blessed way to correctly reinterpret memory is to copy it into a new
object. Let's do that.

Note: the reinterpret_cast way has no performance advantage. Compilers
recognize the memory copy pattern and optimize it away.

Closes #8241

* github.com:scylladb/scylla:
  treewide: get rid of unaligned_cast
  treewide: get rid of incorrect reinterpret casts
2021-03-18 11:24:18 +01:00
Benny Halevy
7862cad669 sstable_set: partitioned_sstable_set: clone: do clone all sstables
The existing implementation wrongfully shares _all sstables
rather than cloning it. This caused a use-after-free
in `repair_meta::do_estimate_partitions_on_local_shard`
when traversing a shared sstable_set, during which
`table::make_reader_excluding_sstables` erased an entry.
The erase should have happened on a cloned copy
of the sstable_list, not on a shared copy.
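
The share-versus-clone difference can be illustrated with a minimal sketch. The `toy_sstable_set` type and its members are hypothetical stand-ins and are not the real `partitioned_sstable_set` code; the sketch only assumes the list is held behind a shared pointer.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical toy model of a set that holds its list behind a shared pointer.
struct toy_sstable_set {
    std::shared_ptr<std::vector<int>> all = std::make_shared<std::vector<int>>();

    // Problematic variant: the copy aliases the same underlying list,
    // so an erase through the copy is visible to the original.
    toy_sstable_set share() const {
        return *this;
    }

    // Intended variant: the copy owns an independent list.
    toy_sstable_set clone() const {
        toy_sstable_set copy;
        copy.all = std::make_shared<std::vector<int>>(*all);
        return copy;
    }
};
```

With `share()`, erasing from the copy while someone traverses the original invalidates the traversal; with `clone()`, the original list is untouched.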

The regression was introduced in
c3b8757fa1.

Added a unit test that reproduces the share-on-copy issue
for partitioned_sstable_set (sstables::sstable_set).

Fixes #8274

Test: unit(release, debug)
DTest: materialized_views_test.py:TestMaterializedViews.simple_repair_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210317145552.701559-1-bhalevy@scylladb.com>
2021-03-18 11:15:59 +02:00
Piotr Sarna
ea096de1b4 service, transport: avoid using private storage_service fields
... in the transport controller. Instead, simple getters would suffice.

Message-Id: <582a71d0c1b61edf0107f5a2ef96536c395972d0.1615988516.git.sarna@scylladb.com>
2021-03-18 11:15:59 +02:00
Nadav Har'El
42169b2eef Merge 'Alternator: add slow query logging' from Piotr Sarna
This series adds slow query logging capability to alternator. Queries which last longer than the specified threshold are logged in `system_traces.node_slow_log` and traced.

In order to be better prepared for https://github.com/scylladb/scylla/issues/2572, this series also expands the tracing API to allow custom key-value params and adds a custom `alternator_op` parameter to the slow node log. This information can also be deduced from the tracing session id by consulting the system_traces.events table, but https://github.com/scylladb/scylla/issues/2572 's assumption is that this tracing might not always be available in the future.

This series comes with a simple test case which checks if operation logs indeed end up in `system_traces.node_slow_log`.

Tests:
unit(dev, alternator pytest)
manual: verified that no operations are logged if slow query logging is disabled; verified that operations that take less time than the threshold are not logged; verified with test_batch.py::test_batch_write_item_large that a large-enough operation is indeed logged and traced.

Fixes #8292

Example trace:

```cql
cqlsh> select parameters, duration from system_traces.node_slow_log where start_time=b7a44589-8711-11eb-8053-14c6c5faf955;

 parameters                                                                                  | duration
---------------------------------------------------------------------------------------------+----------
 {'alternator_op': 'DeleteTable', 'query': '{"TableName": "alternator_Test_1615979572905"}'} |    75732
```

Closes #8298

* github.com:scylladb/scylla:
  alternator: add test for slow query logging
  alternator: allow enabling slow query logging
  tracing: allow providing a custom session record param
2021-03-18 11:15:59 +02:00
Avi Kivity
de45575ea9 Merge "Allow all supported compaction types to be stopped by nodetool stop" from Raphael
"
All compaction types can now be stopped with the nodetool stop
command, example: nodetool stop SCRUB

Supported types are: COMPACTION, CLEANUP, VALIDATION, SCRUB,
INDEX_BUILD, RESHARD, UPGRADE, RESHAPE.
"

* 'stop_compaction_types_v2' of github.com:raphaelsc/scylla:
  compaction: Allow all supported compaction types to be stopped
  compaction: introduce function to map compaction name to respective type
  compaction: refactor mapping of compaction type to string
  compaction: move compaction_name() out of line
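
A name-to-type mapping of the kind this series describes might look roughly like the sketch below. The enum, its enumerator names and the `to_compaction_type` function are assumptions made for illustration; only the accepted strings are taken from the list above.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical enum; the real type and enumerator names may differ.
enum class compaction_type {
    Compaction, Cleanup, Validation, Scrub, Index_build, Reshard, Upgrade, Reshape
};

// Maps a name as typed on the nodetool command line to a type,
// or returns std::nullopt for an unknown name.
std::optional<compaction_type> to_compaction_type(std::string_view name) {
    struct entry { std::string_view name; compaction_type type; };
    static constexpr entry table[] = {
        {"COMPACTION", compaction_type::Compaction},
        {"CLEANUP", compaction_type::Cleanup},
        {"VALIDATION", compaction_type::Validation},
        {"SCRUB", compaction_type::Scrub},
        {"INDEX_BUILD", compaction_type::Index_build},
        {"RESHARD", compaction_type::Reshard},
        {"UPGRADE", compaction_type::Upgrade},
        {"RESHAPE", compaction_type::Reshape},
    };
    for (const auto& e : table) {
        if (e.name == name) {
            return e.type;
        }
    }
    return std::nullopt;
}
```

Keeping both directions of the mapping in one table is one way to avoid the name list and the type list drifting apart.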
2021-03-18 11:15:59 +02:00
Botond Dénes
981699ae76 sstables: move promoted_index_blocks_reader into own header
index_entry.hh (the current home of `promoted_index_blocks_reader`) is
included in `sstables.hh` and thus in half our code-base. All that code
really doesn't need the definition of the promoted index blocks reader
which also pulls in the sstables parser mechanism. Move it into its own
header and only include it where it is actually needed: the promoted
index cursor implementations.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210317093654.34196-1-bdenes@scylladb.com>
2021-03-18 11:15:59 +02:00
Botond Dénes
5859195b36 sstables: mx/parser.hh: add missing include
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210317093806.34858-1-bdenes@scylladb.com>
2021-03-18 11:15:59 +02:00
Benny Halevy
2e7677f76b sstables: sstable_set_impl: include mutation_reader.hh
To make sstables/sstable_set_impl.hh self-sufficient
mutation_reader.hh provides position_reader_queue,
needed by time_series_sstable_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210317094223.590067-1-bhalevy@scylladb.com>
2021-03-18 11:15:59 +02:00
Konstantin Osipov
66c729da66 raft: set the current leader upon getting InstallSnapshot
If the current leader is set, the follower will not vote
for another candidate. This is also known as "sticky leadership" rule.

Before this change, the rule was enacted only upon receiving
AppendEntries RPC from the leader. Turn it on also upon receiving
InstallSnapshot RPC.
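
A minimal sketch of the rule, assuming a toy follower that tracks only the current leader. The type and method names are hypothetical, and term and log checks that a real raft implementation performs are left out on purpose.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

using server_id = std::uint64_t;

// Hypothetical toy follower; not the real raft::fsm.
struct toy_follower {
    std::optional<server_id> current_leader;

    // Both RPCs come from the leader, so both record who the leader is.
    void on_append_entries(server_id from) { current_leader = from; }
    void on_install_snapshot(server_id from) { current_leader = from; }

    // Losing contact with the leader clears the sticky state.
    void on_election_timeout() { current_leader.reset(); }

    // Sticky leadership: refuse to vote while a leader is known.
    bool grants_vote(server_id candidate) const {
        (void)candidate; // term and log comparison omitted in this sketch
        return !current_leader.has_value();
    }
};
```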
2021-03-18 08:36:57 +03:00
Michał Chojnowski
5c3385730b treewide: get rid of unaligned_cast
unaligned_cast violates strict aliasing rules. Replace it with
safe equivalents.
2021-03-17 17:00:41 +01:00
Michał Chojnowski
4e35befcf2 treewide: get rid of incorrect reinterpret casts
In some places we use the `*reinterpret_cast<const net::packed<T>*>(&x)`
pattern to reinterpret memory. This is a violation of C++'s aliasing rules,
which invokes undefined behaviour.

The blessed way to correctly reinterpret memory is to copy it into a new
object. Let's do that.

Note: the reinterpret_cast way has no performance advantage. Compilers
recognize the memory copy pattern and optimize it away.
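
The copy-based pattern can be sketched as follows. `read_unaligned` is a hypothetical helper name used for illustration and is not necessarily what the tree uses.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Reads a T from a possibly unaligned address by copying its bytes
// into a fresh object, which is well-defined for trivially copyable types.
template <typename T>
T read_unaligned(const void* p) {
    static_assert(std::is_trivially_copyable_v<T>, "T must be trivially copyable");
    T value;
    std::memcpy(&value, p, sizeof(T));
    return value;
}
```

For example, `read_unaligned<std::uint32_t>(buf + 1)` reads four bytes starting at an odd offset without forming a misaligned or aliasing pointer.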
2021-03-17 17:00:38 +01:00
Piotr Sarna
efe734c575 alternator: add test for slow query logging
The test checks whether slow queries are properly logged
in the system_traces.node_slow_log system table.
The test is deterministic because it uses the threshold of 0ms
to qualify a query as slow, which effectively makes all queries
"slow enough".
2021-03-17 13:24:26 +01:00
Benny Halevy
6846319e65 partitioned_sstables_set: insert: propagate exception
Do not swallow the caught exception.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210316170821.496218-1-bhalevy@scylladb.com>
2021-03-17 13:29:03 +02:00
Piotr Sarna
f9adee70d2 alternator: allow enabling slow query logging
Alternator is now aware of the slow query logging configuration
and can start tracing slow queries.
2021-03-17 11:20:42 +01:00
Piotr Sarna
5386739354 tracing: allow providing a custom session record param
The mechanism of session record params is currently only used
to store query strings and a couple more params like consistency level,
but since we now have more frontends than just CQL and Thrift,
it would be nice to also allow the users to put custom parameters in
there.
An immediate first user of this mechanism would be alternator,
which is going to put the operation type under the "alternator_op" key.
The operation type is not part of the query string due to how DynamoDB's
protocol works - the op type is stored separately in the HTTP header.
While it's possible to extract the operation type from the session_id,
it might not be the case once #2572 is implemented.
2021-03-17 11:14:28 +01:00
Gleb Natapov
32d386d0d8 raft: fix use after free during logging in append_entries_reply()
As the existing comment explains a progress can be deleted at the point
of logging. The logging should only be done if the progress still
exists.

Message-Id: <YFDFVRQU1iVYhFdM@scylladb.com>
2021-03-17 09:59:22 +02:00
Dejan Mircevski
8db24fc03b cql3/expr: Handle IN ? bound to null
Previously, we crashed when the IN marker is bound to null.  Throw
invalid_request_exception instead.

Fixes #8265

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8287
2021-03-17 09:59:22 +02:00
Avi Kivity
1afd6fbe06 hashing: appending_hash: convert from enable_if to concepts
A little simpler to understand.

Closes #8288
2021-03-17 09:59:22 +02:00
Piotr Sarna
7961a28835 Merge 'storage_proxy: Include counter writes in...
...  `writes_coordinator_outside_replica_set`' from Juliusz Stasiewicz

With this change, coordinator prefers himself as the "counter leader", so if
another endpoint is chosen as the leader, we know that coordinator was not a
member of replica set. With this guarantee we can increment
`scylla_storage_proxy_coordinator_writes_coordinator_outside_replica_set` metric
after electing different leader (that metric used to neglect the counter
updates).

The motivation for this change is to have more reliable way of counting
non-token-aware queries.

Fixes #4337
Closes #8282

* github.com:scylladb/scylla:
  storage_proxy: Include counter writes in `writes_coordinator_outside_replica_set`
  counters: Favor coordinator as leader
2021-03-17 09:59:22 +02:00
Avi Kivity
972ea9900c Merge 'commitlog: Make pre-allocation drop O_DSYNC while pre-filling' from Calle Wilund
Refs #7794

Iff we need to pre-fill the segment file in O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.
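
The sequence can be sketched with plain POSIX calls. This is only an illustration under that assumption: the real code goes through Seastar's asynchronous file API, and `preallocate_segment` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

// Fills `path` with `size` zero bytes without O_DSYNC, then re-opens it
// with O_DSYNC for regular use. Returns the new fd, or -1 on failure.
int preallocate_segment(const char* path, std::size_t size) {
    // Step 1: open in "normal" mode so each write does not force a flush.
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        return -1;
    }
    // Step 2: pre-fill with zeros in fixed-size chunks.
    std::vector<char> zeros(4096, 0);
    std::size_t written = 0;
    while (written < size) {
        std::size_t chunk = std::min(zeros.size(), size - written);
        ssize_t n = ::write(fd, zeros.data(), chunk);
        if (n <= 0) {
            ::close(fd);
            return -1;
        }
        written += static_cast<std::size_t>(n);
    }
    // Step 3: flush once, close, and re-open with O_DSYNC.
    ::fsync(fd);
    ::close(fd);
    return ::open(path, O_WRONLY | O_DSYNC);
}
```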

Closes #8250

* github.com:scylladb/scylla:
  commitlog: Make pre-allocation drop O_DSYNC while pre-filling
  commitlog: coroutinize allocate_segment_ex
2021-03-17 09:59:22 +02:00
Dejan Mircevski
992d5c6184 cql3/expr: Improve column printing
Before this change, we would print an expression like this:

((ColumnDefinition{name=c, type=org.apache.cassandra.db.marshal.Int32Type, kind=CLUSTERING_COLUMN, componentIndex=0, droppedAt=-9223372036854775808}) = 0000007b)

Now, we print the same expression like this:

(c = 0000007b)

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8285
2021-03-17 09:59:22 +02:00
Tomasz Grabiec
40121621f6 Merge "Kill some get_local_migration_manager() calls" from Pavel Emelyanov
There are a bunch of such calls in schema altering statements and
there's currently no way to obtain the migration manager for such
statements, so a relatively big rework is needed.

The solution in this set is -- all statements' execute() methods are
called with query processor as first argument (now the storage proxy
is there), query processor references and provides migration manager
for statements. Those statements that need proxy can already get it
from the query processor.

Afterwards table_helper and thrift code can also stop using the global
migration manager instance, since they both have query processor in
needed places. While patching them a couple of calls to global storage
proxy also go away.

The new query processor -> migration manager dependency fits into
current start-stop sequence: the migration manager is started early,
the query processor is started after it. On stop the query processor
remains alive, but the migration manager stops. But since no code
currently (should) call get_local_migration_manager() it will _not_
call the query_processor::get_migration_manager() either, so this
dangling reference is ugly, but safe.

Another option could be to make storage proxy reference migration
manager, but this dependency doesn't look correct -- migration manager
is a higher-level service than the storage proxy; it is the migration
manager that currently calls the storage proxy, not vice versa.

* xemul/br-kill-some-migration-managers-2:
  cql3: Get database directly from query processor
  thrift: Use query_processor::get_migration_manager()
  table_helper: Use query_processor::get_migration_manager()
  cql3: Use query_processor::get_migration_manager() (lambda captures cases)
  cql3: Use query_processor::get_migration_manager() (alter_type statement)
  cql3: Use query_processor::get_migration_manager() (trivial cases)
  query_processor: Keep migration manager onboard
  cql3: Pass query processor to announce_migration:s
  cql3: Switch to qp (almost) in schema-altering-stmt
  cql3: Change execute()'s 1st arg to query_processor
2021-03-17 09:59:22 +02:00
Raphael S. Carvalho
2065e2c912 partitioned_sstable_set: adjust select_sstable_runs() to work with compound set
compound set will select runs from all of its managed sets, so let's
adjust select_sstable_runs() to only return runs which belong to it.
without this adjustment, selection of runs would fail because
function would try to unconditionally retrieve the run which may
live somewhere else.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312042255.111060-3-raphaelsc@scylladb.com>
2021-03-17 09:59:22 +02:00
Raphael S. Carvalho
02b2df1ea9 sstable_set: move select_sstable_runs() into partitioned_sstable_set
after compound set is introduced, select_sstable_runs() will no longer
work because the sstable runs live in sstable_set, but they should
actually live in the sstable_set being written to.

Given that runs is a concept that belongs only to strategies which
use partitioned_sstable_set, let's move the implementation of
select_sstable_runs() to it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312042255.111060-2-raphaelsc@scylladb.com>
2021-03-17 09:59:22 +02:00
Avi Kivity
11308c05f4 Update tools/jmx submodule
* tools/jmx 15c1d4f...9c687b5 (1):
  > dist/redhat: add support SLES
2021-03-17 09:59:22 +02:00
Calle Wilund
a0745f9498 messaging_service: Enforce dc/rack membership iff required for non-tls connections
When internode_encryption is "rack" or "dc", we should enforce that incoming
connections are from the appropriate address spaces iff answering on
a non-tls socket.

This is implemented by having two protocol handlers. One for tls/full notls,
and one for mixed (needs checking) connections. The latter will ask
snitch if remote address is kosher, and refuse the connection otherwise.
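
The check done for the mixed handler can be sketched as a pure function. All names here are hypothetical, and the snitch lookup is assumed to have already produced the dc and rack of both ends.

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of the internode_encryption setting.
enum class encrypt_what { none, rack, dc, all };

struct location {
    std::string dc;
    std::string rack;
};

// Decides whether a connection arriving on the non-tls socket is acceptable.
bool accept_plaintext(encrypt_what mode, const location& local, const location& remote) {
    switch (mode) {
    case encrypt_what::none:
        return true; // nothing is encrypted, so plaintext is always fine
    case encrypt_what::dc:
        return local.dc == remote.dc; // only same-dc peers may skip tls
    case encrypt_what::rack:
        return local.dc == remote.dc && local.rack == remote.rack;
    case encrypt_what::all:
        return false; // assumption: no plaintext peers are allowed at all
    }
    return false;
}
```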

Note: requires seastar patches:
"rpc: Make is possible for rpc server instance to refuse connection"
"RPC: (client) retain local address and use on stream creation"

Note that ip-level checks are not exhaustive. If a user is also using
"require_client_auth" with the dc/rack tls setting, we should warn them that
someone could spoof their way past the
authentication.

Closes #8051
2021-03-17 09:59:22 +02:00
Avi Kivity
bcd41cb32d Merge 'Support installing our rpm to SLES' from Takuya ASADA
Basically SLES support is already done in f20736d93d, but it was for the offline installer.
This fixes a few more problems with installing our rpm on SLES.
After this change, we can install our rpm on both CentOS/RHEL and SLES in a single image, like the unified deb.
SLES uses its own package manager called 'zypper', but it does support yum repositories, so no change is required for the repo.

Closes #8277

* github.com:scylladb/scylla:
  scylla_coredump_setup: support SLES
  scylla_setup: use rpm to check package availability for SLES
  dist: install optional packages for SLES
2021-03-17 09:59:22 +02:00
Tomasz Grabiec
cc0bb92afe Merge "raft: provide a ticker for each raft server" from Pavel Solodovnikov
Automatically initialize and start a timer in
`raft_services::add_server` for each raft server instance created.

The patch set also changes several other things in order
for tickers to work:

1. A bug in `raft_sys_table_storage` which caused an exception
   if `raft::server::start` is called without any persisted state.
2. `raft_services::add_server` now automatically calls
   `raft::server::start()` since a server instance should be started
   before any of its methods can be called.
3. Raft servers can now start with initial term = 0. There was an
   artificial restriction which is now lifted.
4. Raft schema state machine now returns a ready future instead of
   throwing "not implemented" exception in `abort()`.

* github.com/ManManson/scylla.git/raft_services_tickers_v9_next_rebase:
  raft/raft_services: provide a ticker for each raft server
  raft/raft_services: switch from plain `throw` to `on_internal_error`
  raft/raft_services: start server instance automatically in `add_server`
  raft: return ready future instead of throwing in schema_raft_state_machine
  raft: allow raft server to start with initial term 0
  raft/raft_sys_table_storage: fix loading term/vote and snapshot from empty state
2021-03-17 09:59:22 +02:00
Nadav Har'El
e344f74858 Merge 'logalloc: improve background reclaim shares management' from Avi Kivity
The log structured allocator's background reclaimer tries to
allocate CPU power proportional to memory demand, but a
bug made that not happen. Fix the bug, add some logging,
and future-proof the timer. Also, harden the test against
overcommitted test machines.

Fixes #8234.

Test: logalloc_test(dev), 20 concurrent runs on 2 cores (1 hyperthread each)

Closes #8281

* github.com:scylladb/scylla:
  test: logalloc_test: harden background reclaim test against cpu overcommit
  logalloc: background reclaim: use default scheduling group for adjusting shares
  logalloc: background reclaim: log shares adjustment under trace level
  logalloc: background reclaim: fix shares not updated by periodic timer
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
aaea8c6c7d raft/raft_services: provide a ticker for each raft server
Automatically initialize a ticker for each raft server
instance when `raft_services::add_server` is called.
A ticker is a timer which regularly calls `raft::server::tick`
in order to tick its raft protocol state machine.

Note that the timer should start after the server calls
its `start()` method, because otherwise it would crash
since fsm is not initialized yet.

Currently, the tick interval is hardcoded to be 100ms.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
1496a3559f raft/raft_services: switch from plain throw to on_internal_error
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
975c9a8021 raft/raft_services: start server instance automatically in add_server
Raft server instance cannot be used in any way prior
to calling the `start()` method, which initializes
its internal state, e.g. raft protocol state machine.
Otherwise, it will likely result in a crash.

Also, properly stop the servers on shutdown via
`raft_services::stop_servers()`.

In case some exception happened inside `add_server`,
the `init` function will de-initialize what it already
initialized, i.e. raft rpc verbs. This is important
since otherwise it would break further initialization
process and, what is more important, will prevent raft
rpc verbs deinitialization. This will cause a crash in
`messaging_service` uninit procedure, because raft rpc
handlers would still be initialized.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
0b3dba07bd raft: return ready future instead of throwing in schema_raft_state_machine
The current implementation throws an exception, which will cause
a crash when stopping scylla. This will be used in the next patch.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
93c565a1bf raft: allow raft server to start with initial term 0
Prior to the fix there was an assert to check in
`raft::server_impl::start` that the initial term is not 0.

This restriction is completely artificial and can be lifted
without any problems, which will be described below.

The only place that is dependent on this corner case is in
`server_impl::io_fiber`. Whenever term or vote has changed,
they will be both set in `fsm::get_output`. `io_fiber` checks
whether it needs to persist term and vote by validating that
the term field is set (by actually executing a `term != 0`
condition).

This particular check is based on an unobvious fact that the
term will never be 0 in case `fsm::get_output` saves
term and vote values, indicating that they need to be
persisted.

Vote and term can change independently of each other, so that
checking only for term obscures what is happening and why
even more.

In either case term will never be 0, because:

1. If the term has changed, then it's naturally greater than 0,
   since it's a monotonically increasing value.
2. If the vote has changed, it means that we received
   a vote request message. In such case we have already updated
   our term to the requester's term.

Switch to using an explicit optional in `fsm_output` so that
a reader doesn't have to think about the motivation behind this `if`
and can just check that the `term_and_vote` optional is engaged.

Given the motivation described above, the corresponding

    assert(_fsm->get_current_term() != term_t(0));

in `server_impl::start` is removed.
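
The idea of replacing the `term != 0` sentinel with an engaged optional can be sketched like this. The types are simplified, hypothetical stand-ins; only the `term_and_vote` field name comes from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>

using term_t = std::uint64_t;
using server_id = std::uint64_t;

// Simplified output record: the optional is engaged only when
// term and vote need to be persisted.
struct fsm_output {
    std::optional<std::pair<term_t, server_id>> term_and_vote;
};

// Hypothetical toy persistence layer that counts stores.
struct toy_persistence {
    int stores = 0;
    term_t term = 0;
    server_id vote = 0;
};

// One step of the io loop: persist iff the optional is engaged,
// regardless of the term value (term 0 is now legal).
void io_step(const fsm_output& out, toy_persistence& p) {
    if (out.term_and_vote) {
        p.term = out.term_and_vote->first;
        p.vote = out.term_and_vote->second;
        ++p.stores;
    }
}
```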

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Pavel Solodovnikov
ae5f26adec raft/raft_sys_table_storage: fix loading term/vote and snapshot from empty state
When a raft server is started for the first time and there isn't
any persisted state yet, provide default return values for
`load_term_and_vote` and `load_snapshot`. The code currently
does not handle this corner case correctly and fails with an
exception.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-17 09:59:21 +02:00
Juliusz Stasiewicz
f77d0f5439 storage_proxy: Include counter writes in writes_coordinator_outside_replica_set
Coordinator prefers himself as the "counter leader", so if another
endpoint is chosen as the leader, we know that coordinator was
not a member of replica set. We can use this information to
increment relevant metric (which used to neglect the counters
completely).
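
A minimal sketch of the reasoning, with hypothetical names. The fallback choice of `replicas.front()` is an assumption for illustration; the real leader selection among remote replicas is more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using endpoint = std::string;

// Hypothetical stand-in for the storage_proxy stats.
struct toy_metrics {
    std::uint64_t writes_coordinator_outside_replica_set = 0;
};

// Picks the counter leader, preferring the coordinator itself.
// Precondition: `replicas` is not empty.
endpoint pick_counter_leader(const endpoint& coordinator,
                             const std::vector<endpoint>& replicas,
                             toy_metrics& metrics) {
    if (std::find(replicas.begin(), replicas.end(), coordinator) != replicas.end()) {
        return coordinator;
    }
    // A different leader implies the coordinator is not a replica.
    ++metrics.writes_coordinator_outside_replica_set;
    return replicas.front();
}
```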

Fixes #4337
2021-03-16 12:07:16 +01:00
Juliusz Stasiewicz
5689106b92 counters: Favor coordinator as leader
This not only reduces internode traffic but is also needed for a
later change in this PR: metrics for non-token-aware writes
including counter updates.
2021-03-16 12:07:13 +01:00
Pavel Emelyanov
a7a5ad4ded range_tombstone_stream: Remove unused methods
Both methods apply a list of tombstones to the stream. One
was unused even before the set, the other one became unused
after previous patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:08:18 +03:00
Pavel Emelyanov
2e6255c499 partition_snapshot_reader: Emit range tombstones on demand
Currently the reader gets all range tombstones from the given
range and places them into a stream. When filling the buffer
with fragments the range tombstones are extracted from the
stream one by one.

This consumes memory; the reader's memory usage shouldn't
depend on the number of inhabitants in the partition range.

The patch implements the heap-based cursor for range tombstones
almost like it's done for rows.

The heap contains range_tombstone_list::iterator_ranges, the
tombstones are popped from the heap when needed, are applied
into the stream and then are emitted from it into the buffer.
The refresh_state() is called on each new range to set up the
iterators, and when lsa reports references invalidation to
refresh the iterators. To let the refresh_state revalidate the
iterators, the position at which the last range tombstone was
emitted is maintained.
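
The heap-of-iterator-ranges idea can be sketched with plain integers in place of range tombstones. `heap_cursor` is a hypothetical name, and the sketch leaves out the lsa invalidation and refresh handling entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using list_t = std::vector<int>;
using iterator_range = std::pair<list_t::const_iterator, list_t::const_iterator>;

// Emits elements from several sorted lists in global order, one at a time,
// keeping only one iterator pair per list in memory.
// The lists must outlive the cursor.
class heap_cursor {
    std::vector<iterator_range> _heap;

    // Orders ranges so that the one with the smallest head is at the heap top.
    static bool later(const iterator_range& a, const iterator_range& b) {
        return *a.first > *b.first;
    }
public:
    explicit heap_cursor(const std::vector<list_t>& versions) {
        for (const auto& v : versions) {
            if (!v.empty()) {
                _heap.emplace_back(v.begin(), v.end());
            }
        }
        std::make_heap(_heap.begin(), _heap.end(), later);
    }

    bool empty() const { return _heap.empty(); }

    // Pops the smallest head; precondition: !empty().
    int pop() {
        std::pop_heap(_heap.begin(), _heap.end(), later);
        iterator_range& r = _heap.back();
        int value = *r.first++;
        if (r.first == r.second) {
            _heap.pop_back(); // this list is exhausted
        } else {
            std::push_heap(_heap.begin(), _heap.end(), later);
        }
        return value;
    }
};
```

Memory use is proportional to the number of lists, not to the number of elements in the range.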

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:08:18 +03:00
Pavel Emelyanov
ef61f84426 partition_snapshot_reader: Introduce maybe_refresh_state
The existing refresh_state() is supposed to setup or revalidate
iterators to rows inside partition versions if needed. It will
be called in more than one place soon, so here's the helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:08:18 +03:00
Pavel Emelyanov
5e0a8130d4 partition_snapshot_reader: Move range tombstone stream member
The lsa_partition_reader is the helper sub-class for
partition_snapshot_reader that, among other things, is
responsible for filling the stream of range tombstones,
that's then used by the reader itself.

Next patches will change the way range tombstones are
emitted by the reader, so hide the stream inside the
helper subclass in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:08:18 +03:00
Pavel Emelyanov
755d993031 partition_snapshot_reader: Add reset_state method to helper class
This method "notifies" the lsa_reader helper class when the owning
reader moves to a new range. This method is now empty, but will be
used by next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:07:20 +03:00
Pavel Emelyanov
a387fbd984 partition_snapshot_reader: Downgrade heap comparator
Next patch will extend the comparator to manage heap of
range tombstones. Not to add yet another comparator to
it (and not to create another heap comparator class) just
use the comparator that's common for both -- rows and range
tombstones.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:06:19 +03:00
Pavel Emelyanov
2179014efa partition_snapshot_reader: Use on-demand comparators
There are already two raii-sh comparators on reader, next patch will
need to add the third. This just bloats the reader, the comparators
in question are state-less and can be created on demand for free.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 12:04:47 +03:00
Pavel Emelyanov
c8b2079705 range_tombstone_list: Add new slice() helper
There are two of them now -- one to return iterator_range that
covers the given query::clustering_range, the other to return
it for two given positions.

In the next patch a 3rd one is needed -- a slice() to get
an iterator_range that

a) starts strictly after a given position
b) ends after the given clustering_range's end

It will be used to refresh the range tombstones iterators after
some of them will have been emitted. The same thing is currently
done by partition_snapshot_reader's refresh_state wrt rows:

    if (last_row)
        start = rows.upper_bound(last_row) // continuation
    else
        start = rows.lower_bound(range.start) // initial

    end = rows.upper_bound(range.end) // end is the same in
                                      // either case

Respectively for range tombstones the goal is the same.
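
The pseudocode above translates to the following sketch over a `std::set<int>`. The `refresh` function and its parameters are hypothetical and only illustrate the bound selection, not the real range_tombstone_list API.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <utility>

using rows_t = std::set<int>;
using iterator_range = std::pair<rows_t::const_iterator, rows_t::const_iterator>;

// Returns the part of [range_start, range_end] that still has to be emitted:
// everything strictly after `last_emitted` if set, else from the range start.
iterator_range refresh(const rows_t& rows, int range_start, int range_end,
                       std::optional<int> last_emitted) {
    auto start = last_emitted
        ? rows.upper_bound(*last_emitted)   // continuation
        : rows.lower_bound(range_start);    // initial
    auto end = rows.upper_bound(range_end); // same in either case
    return {start, end};
}
```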

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 11:55:28 +03:00
Pavel Emelyanov
7e1170ecb9 range_tombstone_list: Introduce iterator_range alias
The range_tombstone_list::slice() set of methods returns
a pair of iterators representing a range. In the next
patches this pair will be actively used, and it's handy
to have a shorter alias for it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-16 11:55:28 +03:00
Piotr Sarna
2201c9b146 configure: print more context if the linking attempt failed
Previously, when a linking attempt failed, configure.py immediately
printed that neither lld nor gold was found, which might be misleading
if the linkers are installed, but the compilation failed anyway.
The printed information is now more specific, and combined with the
previous commit, it will also provide more information why the
compilation attempt failed.
2021-03-16 07:39:05 +01:00
Piotr Sarna
f86b879933 configure: provide more context on failed ./configure.py run
If the configuration step failed, it used to only inform that
it must be due to the wrong GCC version, which can be misleading.
For instance, trying to compile on clang with incorrect flags
also resulted in a "wrong GCC version" message.
Now, the message is more generic, but it also prints the stderr
output from the miscompilation, which may help pinpoint the problem:

$ ./configure.py --mode release --cflags='-fhello -fcolor-diagnostics -mllvm -opt-bisect-limit=10000' --compiler=clang++ --c-compiler=clang
Note: neither lld nor gold found; using default system linker
Compilation failed: clang++ -x c++ -o build/tmp/tmp1177gojf /home/sarna/repo/scylla/build/tmp/tmp_u3voys6 -fhello -fcolor-diagnostics -mllvm -opt-bisect-limit=10000 []

// clang pretends to be gcc (defined __GNUC__), so we
// must check it first
#ifdef __clang__

#if __clang_major__ < 10
    #error "MAJOR"
#endif

#elif defined(__GNUC__)

#if __GNUC__ < 10
    #error "MAJOR"
#elif __GNUC__ == 10
    #if __GNUC_MINOR__ < 1
        #error "MINOR"
    #elif __GNUC_MINOR__ == 1
        #if __GNUC_PATCHLEVEL__ < 1
            #error "PATCHLEVEL"
        #endif
    #endif
#endif

#else

#error "Unrecognized compiler"

#endif

int main() { return 0; }

clang-11: error: unknown argument: '-fhello'
distcc[4085341] ERROR: compile (null) on localhost failed

Wrong compiler version or incorrect flags. Scylla needs GCC >= 10.1.1 with coroutines (-fcoroutines) or clang >= 10.0.0 to compile.
2021-03-16 07:39:03 +01:00
Piotr Sarna
6389246d6e configure: add verbose option to try_compile_and_link
Which will be useful later for providing more context
why a ./configure.py run failed.
2021-03-16 07:35:16 +01:00
Pavel Emelyanov
12e4269dce cql3: Get database directly from query processor
After previous patches some places in cql3 code take a
long path to get database reference:

  query processor -> storage proxy -> database

The query processor can provide the database reference
by itself, so take this chance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:36:04 +03:00
Pavel Emelyanov
fb49550943 thrift: Use query_processor::get_migration_manager()
Thrift needs the migration manager to call announce_<something> on
it, and currently it grabs the global migration manager instance.

Since the thrift handler has a query processor reference onboard and
the query processor can provide the migration manager reference,
it's time to remove a few more globals from thrift code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:59 +03:00
Pavel Emelyanov
6dc9a16b4e table_helper: Use query_processor::get_migration_manager()
Now that the migration manager can be obtained from the query
processor, the table helper can also benefit from it and no longer
call for the global migration manager instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:53 +03:00
Pavel Emelyanov
a9646dd779 cql3: Use query_processor::get_migration_manager() (lambda captures cases)
There are a few schema altering statements that need to have
the query processor inside lambda continuations. Fortunately,
they all are continuations of make_ready_future<>()s, so the
query processor can be simply captured by reference and used.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:48 +03:00
Pavel Emelyanov
50e4eacd08 cql3: Use query_processor::get_migration_manager() (alter_type statement)
This statement needs the query processor one step below the
stack from its .announce_migration method. So here's the
dedicated patch for it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:43 +03:00
Pavel Emelyanov
464e58abf7 cql3: Use query_processor::get_migration_manager() (trivial cases)
Most of the schema altering statements implementations can now
stop calling for global migration manager instance and get it
from the query processor.

Here are the trivial cases when the query processor is just
available at the place where it's needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:35:36 +03:00
Pavel Emelyanov
1de235f4da query_processor: Keep migration manager onboard
The query processor sits above the migration manager
in the services layering; it's started after and (will be)
stopped before the migration manager.

The migration manager is needed in schema altering statements
which are called with query processor argument. They will
later get the migration manager from the query processor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:58 +03:00
Pavel Emelyanov
1e8f0963f9 cql3: Pass query processor to announce_migration:s
Now that the only call to .announce_migration has the
query processor at hand, pass it to the real statements.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:33 +03:00
Pavel Emelyanov
470928dd94 cql3: Switch to qp (almost) in schema-altering-stmt
The schema altering statements all inherit from the same
base class, which declares a pure virtual .announce_migration()
method. All the real statements are called with a storage proxy
argument, while they need the migration manager. So, like in the
previous patch, replace the storage proxy with the query processor.

While doing the replacement, also get the database instance from
the query processor, not from the proxy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:33 +03:00
Pavel Emelyanov
26c115f379 cql3: Change execute()'s 1st arg to query_processor
Currently the statement's execute() method accepts storage
proxy as the first argument. This is enough for all of them
but schema altering ones, because the latter need to call
migration manager's announce.

To provide the migration manager to those who need it, we
need some higher-level service than the proxy. The
query processor seems to be a good candidate for it.

That said, all the .execute()s now accept the query
processor instead of the proxy and get the proxy itself from
the query processor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-15 19:00:33 +03:00
Avi Kivity
65fea203d2 test: logalloc_test: harden background reclaim test against cpu overcommit
Use thread CPU time instead of real time, so that an overcommitted
machine that cannot supply enough CPU does not fail the test.
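
A minimal sketch of the idea, assuming a POSIX system: measure the CPU time consumed by the calling thread rather than wall-clock time, so time spent descheduled on a busy machine is not counted. The helper names `thread_cpu_time` and `spin_for_cpu_time` are hypothetical and are not taken from the test itself.

```cpp
#include <cassert>
#include <chrono>
#include <time.h>

// CPU time consumed so far by the calling thread (POSIX per-thread clock).
std::chrono::nanoseconds thread_cpu_time() {
    timespec ts{};
    int rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    assert(rc == 0);
    (void)rc;
    return std::chrono::seconds(ts.tv_sec) + std::chrono::nanoseconds(ts.tv_nsec);
}

// Hypothetical helper: busy-loops until this thread has consumed `budget`
// of CPU time. Time during which the thread is not running does not count,
// which is the property an overcommitted machine needs.
unsigned long spin_for_cpu_time(std::chrono::nanoseconds budget) {
    const auto start = thread_cpu_time();
    unsigned long iterations = 0;
    while (thread_cpu_time() - start < budget) {
        ++iterations;
    }
    return iterations;
}
```

A deadline expressed this way only advances while the thread is actually scheduled, whereas a wall-clock deadline can expire while the thread waits for a CPU.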
2021-03-15 13:54:49 +02:00
Avi Kivity
290897ddbc logalloc: background reclaim: use default scheduling group for adjusting shares
If the shares are currently low, we might not get enough CPU time to
adjust the shares in time.

This is currently no-op, since Seastar runs the callback outside
scheduling groups (and only uses the scheduling group for inherited
continuations); but better be insulated against such details.
2021-03-15 13:54:49 +02:00
Avi Kivity
a87f6498c3 logalloc: background reclaim: log shares adjustment under trace level
Useful when debugging, but too noisy at any other time.
2021-03-15 13:54:49 +02:00
Avi Kivity
ce1b1d6ec4 logalloc: background reclaim: fix shares not updated by periodic timer
adjust_shares() thinks it needs to do nothing if the main loop
is running, but in reality it can only avoid waking the main loop;
it still needs to adjust the shares unconditionally. Otherwise,
the background reclaim shares can get locked into a low value.

Fix by splitting the conditional into two.
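
The shape of the fix can be sketched as follows. This is an illustration under assumptions, not the logalloc code: `reclaimer_sketch`, its fields, and both method names are hypothetical stand-ins for the background reclaimer.

```cpp
#include <cassert>

// Hypothetical stand-in for the background reclaimer state.
struct reclaimer_sketch {
    bool main_loop_running = false;
    int shares = 1;
    int wakeups = 0;

    // Shape of the bug: one conditional guards both actions, so the
    // shares are never updated while the main loop is running.
    void adjust_shares_before(int wanted) {
        if (!main_loop_running) {
            shares = wanted;
            ++wakeups;
        }
    }

    // Shape of the fix: the shares are adjusted unconditionally and
    // only the wake-up is skipped while the main loop is running.
    void adjust_shares_after(int wanted) {
        shares = wanted;
        if (!main_loop_running) {
            ++wakeups;
        }
    }
};
```

With `main_loop_running` set, the first variant leaves `shares` at its old (possibly low) value, while the second updates it and skips only the wake-up.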
2021-03-15 13:54:37 +02:00
Tomasz Grabiec
bf6c4e0b24 Merge "raft: consolidate tests in raft directory" from Alejo
Move boost tests to tests/raft and factor out common helpers.

* alejo/raft-tests-reorg-5-rebase-next-2:
  raft: tests: move common helpers to header
  raft: tests: move boost tests to tests/raft
2021-03-15 11:59:16 +01:00
Takuya ASADA
e8cfd5114f scylla_coredump_setup: support SLES
SLES requires installing the systemd-coredump package and enabling
systemd-coredump.socket to use systemd-coredump.
2021-03-15 19:19:56 +09:00
Takuya ASADA
13871ff1f8 scylla_setup: use rpm to check package availability for SLES
Use rpm to check scylla packages installed on SLES.
2021-03-15 19:18:44 +09:00
Takuya ASADA
e3b5ffcf14 dist: install optional packages for SLES
Support SUSE original package manager 'zypper' for pkg_install()
function.
2021-03-15 19:17:48 +09:00
Alejo Sanchez
88063b6e3e raft: tests: move common helpers to header
Move common test helper functions and data structures to a common
helpers.hh header.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-15 06:16:58 -04:00
Alejo Sanchez
6139ad6337 raft: tests: move boost tests to tests/raft
Move raft boost tests to test/raft directory.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-15 06:16:58 -04:00
Calle Wilund
48ca01c3ab commitlog: Make pre-allocation drop O_DSYNC while pre-filling
Refs #7794

Iff we need to pre-fill segment file in O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.

v2:
* More comment
v3:
* Add missing flush
v4:
* comment
v5:
* Split coroutine and fix into separate patches
2021-03-15 09:35:45 +00:00
Calle Wilund
ae3b8e6fdf commitlog: coroutinize allocate_segment_ex
To make further changes here easier to write and read.
2021-03-15 09:35:37 +00:00
Avi Kivity
f326a2253c Update tools/java submodule
* tools/java 2c6110500c...fdc8fcc22c (1):
  > sstableloader: Use compound "where" restrictions for clustering
2021-03-15 11:19:22 +02:00
Raphael S. Carvalho
7171244844 compaction_manager: Fix performance of cleanup compaction due to unlimited parallelism
Prior to 463d0ab, only one table could be cleaned up at a time on a given shard.
Since then, all tables belonging to a given keyspace are cleaned up in parallel.
Cleanup serialization on each shard was enforced with a semaphore, which was
incorrectly removed by the patch aforementioned.

So space requirement for cleanup to succeed can be up to the size of keyspace,
increasing the chances of node running out of space.

Node could also run out of memory if there are tons of tables in the keyspace.
Memory requirement is at least #_of_tables * 128k (not taking into account write
behind, etc). With 5k tables, it's ~0.64G per shard.
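
The estimate above is simple arithmetic and can be written down as a sketch. The 128k per-table figure is the one quoted in the message; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption taken from the message: each table being rewritten needs at
// least 128k of memory (write-behind and other overheads are ignored).
constexpr std::uint64_t per_table_k = 128;

// Lower bound, in k, on the memory needed on one shard when
// `tables_in_parallel` tables are rewritten at the same time.
constexpr std::uint64_t min_cleanup_memory_k(std::uint64_t tables_in_parallel) {
    return tables_in_parallel * per_table_k;
}
```

For 5000 tables cleaned up in parallel this gives 640000k, roughly 0.64G per shard, while one table at a time needs only 128k by the same estimate.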

Also all tables being cleaned up in parallel will compete for the same
disk and cpu bandwidth, so making them all much slower, and consequently
the operation time is significantly higher.

This problem was detected with cleanup, but scrub and upgrade go through the
same rewrite procedure, so they're affected by exactly the same problem.

Fixes #8247.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
2021-03-14 14:31:26 +02:00
Nadav Har'El
d73934372d storage_service: correct missing exception in logging rebuild failure
When failing to rebuild a node, we would print the error with the useless
explanation "<no exception>". The problem was a typo in the logging command
which used std::current_exception() - which wasn't relevant in that point -
instead of "ep".

Refs #8089

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>
2021-03-14 14:11:11 +02:00
Tomasz Grabiec
f2ecb4617e Merge "raft: implement prevoting stage in leader election" from Gleb
This is how the Raft PhD dissertation explains the need for a prevoting stage:

  One downside of Raft's leader election algorithm is that a server that
  has been partitioned from the cluster is likely to cause a disruption
  when it regains connectivity. When a server is partitioned, it will
  not receive heartbeats. It will soon increment its term to start
  an election, although it won't be able to collect enough votes to
  become leader. When the server regains connectivity sometime later, its
  larger term number will propagate to the rest of the cluster (either
  through the server's RequestVote requests or through its AppendEntries
  response). This will force the cluster leader to step down, and a new
  election will have to take place to select a new leader.

  Prevoting stage is addressing that. In the Prevote algorithm, a
  candidate only increments its term if it first learns from a majority of
  the cluster that they would be willing to grant the candidate their votes
  (if the candidate's log is sufficiently up-to-date, and the voters have
  not received heartbeats from a valid leader for at least a baseline
  election timeout).

  The Prevote algorithm solves the issue of a partitioned server disrupting
  the cluster when it rejoins. While a server is partitioned, it won't
  be able to increment its term, since it can't receive permission
  from a majority of the cluster. Then, when it rejoins the cluster, it
  still won't be able to increment its term, since the other servers
  will have been receiving regular heartbeats from the leader. Once the
  server receives a heartbeat from the leader itself, it will return to
  the follower state (in the same term).

In our implementation we have the "stable leader" extension, which prevents
a spurious RequestVote from deposing an active leader, but an AppendEntries
with a higher term will still do that, so the prevoting extension is also required.

* scylla-dev/raft-prevote-v5:
  raft: store leader and candidate state in state variant
  raft: add boost tests for prevoting
  raft: implement prevoting stage in leader election
  raft: reset the leader on entering candidate state
  raft: use modern unordered_set::contains instead of find in become_candidate
2021-03-12 11:15:51 +01:00
Gleb Natapov
e231186a7b raft: store leader and candidate state in state variant
We already keep server-state-dependent state in the fsm, so there is no need
to maintain the "voters" and "tracker" optionals as well. The upside is that
the optional and variant states cannot drift apart now.
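
A minimal sketch of why a variant rules out drift, assuming simplified state types. None of the names below (`fsm_sketch`, `candidate_sketch`, and so on) are taken from the raft code; they only illustrate keeping per-state data inside the state alternative itself.

```cpp
#include <cassert>
#include <set>
#include <variant>

// Hypothetical per-state payloads.
struct follower_sketch {};
struct candidate_sketch { std::set<int> votes; };    // plays the role of "voters"
struct leader_sketch { std::set<int> tracked; };     // plays the role of "tracker"

struct fsm_sketch {
    // Exactly one alternative is alive at a time, so candidate-only or
    // leader-only data cannot outlive the state it belongs to.
    std::variant<follower_sketch, candidate_sketch, leader_sketch> state;

    bool is_candidate() const { return std::holds_alternative<candidate_sketch>(state); }
    bool is_leader() const { return std::holds_alternative<leader_sketch>(state); }

    void become_candidate() { state = candidate_sketch{}; }
    void become_leader() { state = leader_sketch{}; }
    void become_follower() { state = follower_sketch{}; }
};
```

Switching state replaces the payload in the same assignment, so there is no separate optional that could be left engaged or disengaged by mistake.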
2021-03-12 11:12:57 +02:00
Gleb Natapov
e17e7d57bd raft: add boost tests for prevoting 2021-03-12 11:12:57 +02:00
Gleb Natapov
1f868d516e raft: implement prevoting stage in leader election
This is how the Raft PhD dissertation explains the need for a prevoting stage:

  One downside of Raft's leader election algorithm is that a server that
  has been partitioned from the cluster is likely to cause a disruption
  when it regains connectivity. When a server is partitioned, it will
  not receive heartbeats. It will soon increment its term to start
  an election, although it won't be able to collect enough votes to
  become leader. When the server regains connectivity sometime later, its
  larger term number will propagate to the rest of the cluster (either
  through the server's RequestVote requests or through its AppendEntries
  response). This will force the cluster leader to step down, and a new
  election will have to take place to select a new leader.

  Prevoting stage is addressing that. In the Prevote algorithm, a
  candidate only increments its term if it first learns from a majority of
  the cluster that they would be willing to grant the candidate their votes
  (if the candidate's log is sufficiently up-to-date, and the voters have
  not received heartbeats from a valid leader for at least a baseline
  election timeout).

  The Prevote algorithm solves the issue of a partitioned server disrupting
  the cluster when it rejoins. While a server is partitioned, it won't
  be able to increment its term, since it can't receive permission
  from a majority of the cluster. Then, when it rejoins the cluster, it
  still won't be able to increment its term, since the other servers
  will have been receiving regular heartbeats from the leader. Once the
  server receives a heartbeat from the leader itself, it will return to
  the follower state (in the same term).

In our implementation we have the "stable leader" extension, which prevents
a spurious RequestVote from deposing an active leader, but an AppendEntries
with a higher term will still do that, so the prevoting extension is also required.
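
The rule quoted above can be sketched as two small functions. This is a simplified illustration under assumptions, not the fsm code from this series: the struct, the function names, and the convention that the candidate counts its own grant are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical view of what a voter knows when asked for a prevote.
struct voter_view {
    unsigned long last_log_term;
    unsigned long last_log_index;
    bool heard_from_leader_recently;  // within a baseline election timeout
};

// Voter side: grant only if the candidate's log is at least as up to date
// as ours and we have not recently heard from a valid leader.
bool would_grant_prevote(const voter_view& v,
                         unsigned long cand_last_term,
                         unsigned long cand_last_index) {
    const bool log_ok = cand_last_term > v.last_log_term ||
        (cand_last_term == v.last_log_term && cand_last_index >= v.last_log_index);
    return log_ok && !v.heard_from_leader_recently;
}

// Candidate side: the term is incremented only after a majority of the
// cluster (counting the candidate itself) says it would grant a vote.
unsigned long term_after_prevote(unsigned long current_term,
                                 std::size_t grants_including_self,
                                 std::size_t cluster_size) {
    const bool majority = grants_including_self > cluster_size / 2;
    return majority ? current_term + 1 : current_term;
}
```

A server partitioned from a 3-node cluster collects only its own grant, so its term stays put and it cannot disturb the leader when it rejoins.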
2021-03-12 11:09:21 +02:00
Raphael S. Carvalho
f6fc32c8da table: use new sstable_set::for_each_sstable
for_each_sstable() is preferred over all() because it's guaranteed to
perform no copy.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210311163009.42210-2-raphaelsc@scylladb.com>
2021-03-11 18:47:17 +02:00
Raphael S. Carvalho
e7a6f3926a sstable_set: introduce for_each_sstable()
This new method is preferred over all() for iterations purposes, because
all() may have to copy sstables into a temporary.
For example, all() implementation of the upcoming compound_sstable_set
will have no choice but to merge all sstables from N managed sets into
a temporary.
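
A rough sketch of the difference, assuming a compound set that simply holds several inner lists. The types and the `make_sst` helper are hypothetical and much simpler than the real sstable_set interfaces.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

struct sstable_stub { int generation; };
using sstable_ptr = std::shared_ptr<sstable_stub>;

// Hypothetical helper for building test data.
sstable_ptr make_sst(int generation) {
    return std::make_shared<sstable_stub>(sstable_stub{generation});
}

// Hypothetical compound set made of N managed sets.
struct compound_set_sketch {
    std::vector<std::vector<sstable_ptr>> managed;

    // all() has no choice but to merge every managed set into a temporary.
    std::vector<sstable_ptr> all() const {
        std::vector<sstable_ptr> merged;
        for (const auto& set : managed) {
            merged.insert(merged.end(), set.begin(), set.end());
        }
        return merged;
    }

    // for_each_sstable() visits the elements in place, without copying.
    void for_each_sstable(const std::function<void(const sstable_ptr&)>& func) const {
        for (const auto& set : managed) {
            for (const auto& sst : set) {
                func(sst);
            }
        }
    }
};
```

Callers that only iterate can use the visitor form and avoid building the merged list altogether.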

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210311163009.42210-1-raphaelsc@scylladb.com>
2021-03-11 18:47:16 +02:00
Avi Kivity
486f6bf29c Merge "sstables: move format specific reader code to kl/, mx/" from Botond
"
Currently the sstable reader code is scattered across several source
files as following (paths are relative to sstables/):
* partition.cc - generic reader code;
* row.hh - format specific code related to building mutation fragments
  from cells;
* mp_row_consumer.hh - format specific code related to parsing the raw
  byte stream;

This is a strange organization scheme given that the generic sstable
reader is a template and as such it doesn't itself depend on the other
headers where the consumer and context implementations live. Yet these
are all included in partition.cc just so the reader factory function can
instantiate the sstable reader template with the format specific
objects.

This patchset reorganizes this code such that the generic sstable reader
is exposed in a header. Furthermore, format specific code is moved to
the kl/ and mx/ directories respectively. Each directory has a
reader.hh with a single factory function which creates the reader, all
the format specific code is hidden from sight. The added benefit is that
now reader code specific to a format is centralized in the format
specific folder, just like the writer code.

This patchset only moves code around, no logical changes are made.

Tests: unit(dev)
"

* 'sstable-reader-separation/v1' of https://github.com/denesb/scylla:
  sstables: get rid of mp_row_consumer.{hh,cc}
  sstables: get rid of row.hh
  sstables/mp_row_consumer.hh: remove unused struct new_mutation
  sstables: move mx specific context and consumer to mx/reader.cc
  sstables: move kl specific context and consumer to kl/reader.cc
  sstables: mv partition.cc sstable_mutation_reader.hh
2021-03-11 16:57:54 +02:00
Raphael S. Carvalho
6ff8bb4eac compaction: Allow all supported compaction types to be stopped
Let's make stop_compaction() use sstables::to_compaction_type(),
so all supported compaction types can now be aborted.

Refs #7738.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-11 09:30:11 -03:00
Raphael S. Carvalho
f1b8d5f20f compaction: introduce function to map compaction name to respective type
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-11 09:29:59 -03:00
Raphael S. Carvalho
a44bc233f5 compaction: refactor mapping of compaction type to string
This will make it easier to introduce new type and also to map type to
string and vice-versa, using reverse lookup.
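
One way to get both directions from a single table is sketched below. The enum, the spelled-out names, and the function names are assumptions made for illustration and may differ from what the patch defines.

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical subset of compaction types.
enum class compaction_type_sketch { compaction, cleanup, scrub, upgrade };

// Single source of truth: adding a type means adding one row here.
constexpr std::array<std::pair<compaction_type_sketch, std::string_view>, 4> type_names{{
    {compaction_type_sketch::compaction, "COMPACTION"},
    {compaction_type_sketch::cleanup, "CLEANUP"},
    {compaction_type_sketch::scrub, "SCRUB"},
    {compaction_type_sketch::upgrade, "UPGRADE"},
}};

// Forward lookup: type to name.
std::string_view name_of(compaction_type_sketch t) {
    for (const auto& [type, name] : type_names) {
        if (type == t) {
            return name;
        }
    }
    return {};
}

// Reverse lookup: name to type, empty if the name is unknown.
std::optional<compaction_type_sketch> type_of(std::string_view wanted) {
    for (const auto& [type, name] : type_names) {
        if (name == wanted) {
            return type;
        }
    }
    return std::nullopt;
}
```

Because both functions read the same table, the two directions cannot disagree with each other.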

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-11 09:29:53 -03:00
Raphael S. Carvalho
503a0ea928 compaction: move compaction_name() out of line
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-11 09:29:46 -03:00
Botond Dénes
361ba473c7 sstables: get rid of mp_row_consumer.{hh,cc}
Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which
will serve as the collection point of utility stuff needed by all reader
implementations.
2021-03-11 12:17:13 +02:00
Botond Dénes
3ba782bddd sstables: get rid of row.hh
Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which
will serve as the collection point of utility stuff needed by all reader
implementations.
2021-03-11 12:17:13 +02:00
Botond Dénes
f5b0657fa5 sstables/mp_row_consumer.hh: remove unused struct new_mutation 2021-03-11 12:17:13 +02:00
Botond Dénes
cecc7f8064 sstables: move mx specific context and consumer to mx/reader.cc
Move all the mx format specific context and consumer code to
mx/reader.cc and add a factory function `mx::make_reader()` which takes
over the job of instantiating the `sstable_mutation_reader` with the mx
specific context and consumer.
2021-03-11 12:17:13 +02:00
Botond Dénes
4e3ae9d913 sstables: move kl specific context and consumer to kl/reader.cc
Move all the kl format specific context and consumer code to
kl/reader* and add a factory function `kl::make_reader()` which takes
over the job of instantiating the `sstable_mutation_reader` with the kl
specific context and consumer. Code which is used by test is moved to
kl/reader_impl.hh, while code that can be hidden is moved to
kl/reader.cc. Users who just want to create a reader only have to
include kl/reader.hh.
2021-03-11 12:17:13 +02:00
Botond Dénes
0ec040921d sstables: mv partition.cc sstable_mutation_reader.hh
The sstable reader currently knows the definition of all the different
consumers and contexts. But it doesn't really need to, as it is a
template. Exploit this and prepare for an organization scheme where the
consumers and contexts live hidden in a cc file which includes and
instantiates the sstable reader template. As a first step expose
`sstable_mutation_reader` in a header.
2021-03-11 12:17:13 +02:00
Avi Kivity
a49c4ab754 Update tools/java submodule
* tools/java c5d9e8513e...2c6110500c (1):
  > cassandra.in.sh: Add path to rack/dc properties file to classpath

Fixes #7930.
2021-03-11 12:03:01 +02:00
Asias He
d5e6ba1ff1 repair: Shortcut when no followers to repair with
- 3 nodes in the cluster with rf = 3
- run repair on node1 with ignore_nodes to ignore node2 and node3
- node1 has no followers to repair with

However, currently node1 will walk through the repair procedure to read
data from disk and calculate hashes which are unnecessary.

This patch fixes this issue, so that in case there are no followers, we
skip the range and avoid the unnecessary work.
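
The shortcut can be sketched as an early return. Everything below is a hypothetical simplification: the stats struct, the function names, and the way followers are derived are assumptions, not the repair code.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical subset of the counters shown in the repair stats.
struct repair_stats_sketch {
    std::size_t sub_ranges_nr = 0;
    std::size_t rows_from_disk = 0;
};

// Followers are the replicas of a range other than ourselves that were
// not excluded through ignore_nodes.
std::vector<std::string> followers_of(const std::vector<std::string>& replicas,
                                      const std::string& self,
                                      const std::set<std::string>& ignore_nodes) {
    std::vector<std::string> followers;
    for (const auto& node : replicas) {
        if (node != self && ignore_nodes.count(node) == 0) {
            followers.push_back(node);
        }
    }
    return followers;
}

// With no followers there is nobody to compare hashes with, so the range
// is skipped before any rows are read from disk.
void repair_range_sketch(const std::vector<std::string>& followers,
                         std::size_t rows_in_range,
                         repair_stats_sketch& stats) {
    if (followers.empty()) {
        return;
    }
    ++stats.sub_ranges_nr;
    stats.rows_from_disk += rows_in_range;
}
```

In the 3-node scenario above, ignoring both peers leaves the follower list empty, so the counters stay at zero, matching the "After" output below.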

Before:
   $ curl -X POST http://127.0.0.1:10000/storage_service/repair_async/myks3?ignore_nodes="127.0.0.2,127.0.0.3"

   repair - repair id [id=1, uuid=ff39151b-2ce9-4885-b7e9-89158b14b5c2] on shard 0 stats:
   repair_reason=repair, keyspace=myks3, tables={standard1},
   ranges_nr=769, sub_ranges_nr=769, round_nr=1456,
   round_nr_fast_path_already_synced=1456,
   round_nr_fast_path_same_combined_hashes=0,
   round_nr_slow_path=0, rpc_call_nr=0, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.19 seconds,
   tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
   row_from_disk_bytes={{127.0.0.1, 2822972}},
   row_from_disk_nr={{127.0.0.1, 6218}},
   row_from_disk_bytes_per_sec={{127.0.0.1, 14.1695}} MiB/s,
   row_from_disk_rows_per_sec={{127.0.0.1, 32726.3}} Rows/s,
   tx_row_nr_peer={}, rx_row_nr_peer={}

Data was read from disk.

After:
   $ curl -X POST http://127.0.0.1:10000/storage_service/repair_async/myks3?ignore_nodes="127.0.0.2,127.0.0.3"

   repair - repair id [id=1, uuid=c6df8b23-bd3b-4ebc-8d4c-a11d1ebcca39] on shard 0 stats:
   repair_reason=repair, keyspace=myks3, tables={standard1}, ranges_nr=769,
   sub_ranges_nr=0, round_nr=0, round_nr_fast_path_already_synced=0,
   round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
   rpc_call_nr=0, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.0 seconds,
   tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
   row_from_disk_bytes={},
   row_from_disk_nr={},
   row_from_disk_bytes_per_sec={} MiB/s,
   row_from_disk_rows_per_sec={} Rows/s,
   tx_row_nr_peer={}, rx_row_nr_peer={}

No data was read from disk.

Fixes #8256

Closes #8257
2021-03-11 11:53:22 +02:00
Avi Kivity
c8f692e526 Merge 'cql3: Rewrite get_clustering_bounds() using expressions' from Dejan Mircevski
Instead of using the `restrictions` class hierarchy, calculate the clustering slice using the `expr::expression` representation of the WHERE clause.  This will allow us to eventually drop the `restrictions` hierarchy altogether.

Tests: unit (dev, debug)

Closes #8227

* github.com:scylladb/scylla:
  cql3: Make get_clustering_bounds() use expressions
  cql3/expr: Add is_multi_column()
  cql3/expr: Add more operators to needs_filtering
  cql3: Replace CK-bound mode with comparison_order
  cql3/expr: Make to_range globally visible
  cql3: Gather slice-defining WHERE expressions
  cql3: Add statement_restrictions::_where
  test: Add unit tests for get_clustering_bounds
2021-03-11 11:46:52 +02:00
Gleb Natapov
a849246cfc raft: reset the leader on entering candidate state
Not resetting the leader causes vote requests to be ignored instead of
rejected, which makes a voting round take longer to fail and may
slow down the election of a new leader.
2021-03-11 10:36:43 +02:00
Gleb Natapov
20d6bb36cd raft: use modern unordered_set::contains instead of find in become_candidate 2021-03-11 10:36:43 +02:00
Dejan Mircevski
990de02d28 cql3: Make get_clustering_bounds() use expressions
Use expressions instead of _clustering_columns_restrictions.  This is
a step towards replacing the entire restrictions class hierarchy with
expressions.

Update some expected results in unit tests to reflect the new code.
These new results are equivalent to the old ones in how
storage_proxy::query() will process them (details:
bound_view::from_range() returns the same result for an empty-prefix
singular as for (-inf,+inf)).

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
8dac132581 cql3/expr: Add is_multi_column()
It will come in handy when we start using expressions to calculate the
clustering slice.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
1f591bd16e cql3/expr: Add more operators to needs_filtering
Omitting these operators didn't cause bugs, because needs_filtering()
is never invoked on them.  But that will likely change in the future,
so add them now to prevent problems down the road.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
c0c93982d0 cql3: Replace CK-bound mode with comparison_order
Instead of defining this enum in multi_column_restriction::slice, put
it in the expr namespace and add it to binary_operator.  We will need
it when we switch bounds calculation from multi_column_restriction to
expr classes.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
7dfe471b5a cql3/expr: Make to_range globally visible
It will be used in statement_restrictions for calculating clustering
bounds.  And it will come in handy elsewhere in the future, I'm sure.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
28b5a372f8 cql3: Gather slice-defining WHERE expressions
Add statement_restrictions::_clustering_prefix_restrictions and fill
it with relevant expressions.  Explain how to find all such
expressions in the WHERE clause.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
da096bfdce cql3: Add statement_restrictions::_where
... and collect all restrictions' expressions into it.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
2525759027 test: Add unit tests for get_clustering_bounds
... as guardrails for the upcoming rewrite.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:17:26 -05:00
Calle Wilund
f44420f2c9 snapshot: Add filter to check for existing snapshot
Fixes #8212

Some snapshotting operations call in on a single table at a time.
When checking for existing snapshots in this case, we should not
bother with snapshots in other tables. Add an optional "filter"
to check routine, which if non-empty includes tables to check.

Use case is "scrub" which calls with a limited set of tables
to snapshot.

Closes #8240
2021-03-10 20:21:38 +02:00
Benny Halevy
ff5b42a0fa bytes_ostream: max_chunk_size: account for chunk header
Currently, if the data_size is greater than
max_chunk_size - sizeof(chunk), we end up
allocating up to max_chunk_size + sizeof(chunk) bytes,
exceeding buf.max_chunk_size().

This may lead to allocation failures, as seen in
https://github.com/scylladb/scylla/issues/7950,
where we couldn't allocate 131088 (= 128K + 16) bytes.

This change adjusts the exposed max_chunk_size()
to be max_alloc_size (128KB) - sizeof(chunk)
so that chunks are normally allocated
in 128KB units in the write() path.
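
The arithmetic behind the change, as a sketch. The 16-byte header size is an assumption inferred from the 131088 (= 128K + 16) figure quoted above, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t max_alloc_size = 128 * 1024;  // 131072 bytes
constexpr std::size_t chunk_header_size = 16;       // assumed sizeof(chunk)

// Before: the advertised maximum ignored the chunk header.
constexpr std::size_t max_chunk_size_before() { return max_alloc_size; }

// After: the header is subtracted up front.
constexpr std::size_t max_chunk_size_after() { return max_alloc_size - chunk_header_size; }

// Bytes actually requested from the allocator for a chunk carrying
// `data_size` bytes of payload.
constexpr std::size_t alloc_size(std::size_t data_size) {
    return data_size + chunk_header_size;
}
```

A full-size chunk used to request 131088 bytes, just over the 128KB limit; with the adjusted maximum it requests exactly 131072.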

Added a unit test - test_large_placeholder that
stresses the chunk allocation path from the
write_place_holder(size) entry point to make
sure it handles large chunk allocations correctly.

Refs #7950
Refs #8081

Test: unit(release), bytes_ostream_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210303143413.902968-1-bhalevy@scylladb.com>
2021-03-10 19:54:12 +02:00
Asias He
268fa9d9fe main: Lower shares for main scheduling group
The main scheduling group has shares of 1000, which is as high as
the statement group.

From time to time, we see unexpected scheduling group leaking to the main
group, which causes a drop in query performance.

This patch reduces the main scheduling shares to 200, which is the same as
the maintenance scheduling group. It is a safer default in case code leaks
to the main scheduling group.

Refs: #7720
Closes #8243
2021-03-10 19:34:45 +02:00
Takuya ASADA
af8eae317b scylla_coredump_setup: avoid coredump failure when hard limit of coredump is set to zero
In an environment where the hard limit of coredump is set to zero, the
coredump test script will fail since the system does not generate a coredump.
To avoid this issue, set ulimit -c 0 before generating SEGV in the script.

Note that scylla-server.service can generate a coredump even with ulimit -c 0
because we set LimitCORE=infinity in its systemd unit file.

Fixes #8238

Closes #8245
2021-03-10 19:28:10 +02:00
Avi Kivity
5342d79461 Merge "Preparatory work in sstable_set for the upcoming compound_sstable_set_impl" from Raphael
* 'preparatory_work_for_compound_set' of github.com:raphaelsc/scylla:
  sstable_set: move all() implementation into sstable_set_impl
  sstable_set: preparatory work to change sstable_set::all() api
  sstables: remove bag_sstable_set
2021-03-10 19:19:26 +02:00
Botond Dénes
cf28552357 mutation_test: test_mutation_diff_with_random_generator: compact input mutations
This test checks that `mutation_partition::difference()` works correctly.
One of the checks it does is: m1 + m2 == m1 + (m2 - m1).
If the two mutations are identical but have compactable data, e.g. a
shadowable tombstone shadowed by a row marker, the apply will collapse
these, causing the above equality check to fail (as m2 - m1 is null).
To prevent this, compact the two input mutations.

Fixes: #8221
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210310141118.212538-1-bdenes@scylladb.com>
2021-03-10 16:28:14 +01:00
Raphael S. Carvalho
c3b8757fa1 sstable_set: move all() implementation into sstable_set_impl
The main motivation behind this is that by moving all() impl into
sstable_set_impl, sstable_set no longer needs to maintain a list
with all sstables, which in turn may disagree with the respective
sstable_set_impl.

This will be very important for compound_sstable_set_impl which
will be built from existing sets, and will implement all() by
combining the all() of its managed sets.
Without this patch, we'd have to insert the same sstable at
both compound set and also the set managed by it, to guarantee
all() of compound set would return the correct data, which would
be expensive and error prone.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-10 12:02:13 -03:00
Raphael S. Carvalho
05b07c7161 sstable_set: preparatory work to change sstable_set::all() api
users of sstable_set::all() rely on the set itself keeping a reference
to the returned list, so user can iterate through the list assuming
that it is alive all the way through.

this will change in the future though, because there will be a
compound set impl which will have to merge the all() of multiple
managed sets, and the result is a temporary value.

so even range-based loops on all() have to keep a ref to the returned
list, to avoid the list from being prematurely destroyed.

so the following code
	for (auto& sst : *sstable_set.all()) { ...}
becomes
	for (auto sstables = sstable_set.all(); auto& sst : *sstables) { ... }
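
A self-contained sketch of the second form, assuming a C++20 compiler (the range-for init-statement is a C++20 feature). `all_sketch()` is a hypothetical stand-in that returns a freshly built list each time, as a compound set would.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for all(): the returned list is a temporary that
// nothing else keeps alive.
std::shared_ptr<std::vector<int>> all_sketch() {
    return std::make_shared<std::vector<int>>(std::vector<int>{1, 2, 3});
}

int sum_all() {
    int sum = 0;
    // The init-statement names the returned pointer, so the list stays
    // alive for the whole loop rather than only while the range
    // expression is being evaluated.
    for (auto sstables = all_sketch(); auto& sst : *sstables) {
        sum += sst;
    }
    return sum;
}
```

Writing `for (auto& sst : *all_sketch())` instead would, under the rules in force before C++23, destroy the temporary pointer before the loop body runs and leave the loop iterating over a freed list.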

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-10 12:02:12 -03:00
Avi Kivity
746798fd56 Merge "sstables: get rid of data_consume_context" from Botond
"
This class is basically a wrapper around a unique pointer and a few
short convenience methods, but is otherwise a distraction in trying to
untangle the maze that is the sstable reader class hierarchy.
So this patchset folds it into its only real user: the sstable reader.
"

* 'data_consume_context_bye' of https://github.com/denesb/scylla:
  sstable: move data_consume_* factory methods to row.hh
  sstables: fold data_consume_context: into its users
  sstables: partition.cc: remove data_consume_* forward declarations
2021-03-10 16:45:32 +02:00
Nadav Har'El
a1725217e1 Merge 'alternator: coroutinize handle_api_request' from Piotr Sarna
The indentation level is significantly reduced, and so is the number
of allocations. The function signature is changed from taking an rvalue
ref to taking the unique_ptr by value, because otherwise the coroutine
captures the request as a reference, which results in use-after-free.

Tests: unit(dev)

Closes #8249

* github.com:scylladb/scylla:
  alternator: drop read_content_and_verify_signature
  alternator: coroutinize handle_api_request
2021-03-10 16:08:08 +02:00
Piotr Sarna
ba264e7199 alternator: drop read_content_and_verify_signature
The only use of this helper function was inlined in a bigger
coroutine, so it's no longer needed.
2021-03-10 14:42:53 +01:00
Piotr Sarna
35da51879f alternator: coroutinize handle_api_request
The indentation level is significantly reduced, and so is the number
of allocations.
The function signature is changed from taking an rvalue ref to taking
the unique_ptr by value, because otherwise the coroutine captures
the request as a reference, which results in use-after-free.
2021-03-10 14:42:52 +01:00
Botond Dénes
1aa2424dcf sstable: move data_consume_* factory methods to row.hh 2021-03-10 15:40:50 +02:00
Botond Dénes
a06465a8f3 sstables: fold data_consume_context: into its users
`data_consume_context` is a thin wrapper over the real context object
and it does little more than forward method calls to it. The few
methods doing more than mere forwarding can be folded into its single
real user: `sstable_reader`.
2021-03-10 15:38:58 +02:00
Botond Dénes
37eb547224 sstables: partition.cc: remove data_consume_* forward declarations
They don't seem to serve any purpose, everything builds fine without
them.
2021-03-10 15:23:54 +02:00
Raphael S. Carvalho
f7cc431477 compaction_manager: Fix use-after-free in rewrite_sstables()
Use-after-free introduced by 2cf0c4bbf1.
That's because compacting is moved into then_wrapped() lambda, so it's
potentially freed on the next iteration of repeat().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210309232940.433490-1-raphaelsc@scylladb.com>
2021-03-10 13:18:38 +02:00
Nadav Har'El
f41dac2a3a alternator: avoid large contiguous allocation for request body
Alternator request sizes can be up to 16 MB, but the current implementation
had the Seastar HTTP server read the entire request as a contiguous string,
and then processed it. We can't avoid reading the entire request up-front -
we want to verify its integrity before doing any additional processing on it.
But there is no reason why the entire request needs to be stored in one big
*contiguous* allocation. This is always a bad idea. We should use a non-
contiguous buffer, and that's the goal of this patch.

We use a new Seastar HTTPD feature where we can ask for an input stream,
instead of a string, for the request's body. We then begin the request
handling by reading the content of this stream into a
vector<temporary_buffer<char>> (which we alias "chunked_content"). We then
use this non-contiguous buffer to verify the request's signature and
if successful - parse the request JSON and finally execute it.
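
Checking a request over a non-contiguous buffer can be sketched as follows. This is a minimal Python illustration, not Alternator's C++ code: `read_chunks` and `hash_chunks` are hypothetical names, and a plain SHA-256 digest stands in for the real request-signature check.

```python
import hashlib
import io

def read_chunks(stream, chunk_size):
    """Read a stream into a list of separate buffers (never one big string)."""
    chunks = []
    while True:
        buf = stream.read(chunk_size)
        if not buf:
            return chunks
        chunks.append(buf)

def hash_chunks(chunks):
    """Feed each chunk to the hash in turn; no contiguous copy is needed."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

# Hypothetical request body, large enough to span several chunks.
body = b'{"TableName": "t"}' * 1000
chunks = read_chunks(io.BytesIO(body), chunk_size=4096)
digest = hash_chunks(chunks)
```

The digest computed chunk by chunk equals the digest of the whole body, which is why the check does not require a single contiguous allocation.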

Beyond avoiding contiguous allocations, another benefit of this patch is
that while parsing a long request composed of chunks, we free each chunk
as soon as its parsing is completed. This reduces the peak amount of memory
used by the query - we no longer need to store both unparsed and parsed
versions of the request at the same time.

Although we already had tests with requests of different lengths, most
of them were short enough to only have one chunk, and only a few had
2 or 3 chunks. So we also add a test which makes a much longer request
(a BatchWriteItem with large items), which in my experiment had 17 chunks.
The goal of this test is to verify that the new signature and JSON parsing
code which needs to cross chunk boundaries work as expected.

Fixes #7213.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210309222525.1628234-1-nyh@scylladb.com>
2021-03-10 09:22:34 +01:00
Juliusz Stasiewicz
382545a614 docs: explain SSL/non-SSL and shard-aware CQL ports
I added a short description of shard-aware ports + updated the rules
for disabling ports and enabling SSL introduced by #7992.

Fixes #8146

Closes #8152
2021-03-09 22:48:30 +02:00
Tomasz Grabiec
c9c2beabc0 Merge "raft: replication tests as individual boost tests" from Alejo
* alejo/raft-tests-replication-boost-5:
  raft: replication test: use Seastar random generator
  raft: replication test: rename drop_replication
  raft: replication test: change to Boost test
  raft: replication test: id helper functions
  raft: replication test: improve handling connectivity
  raft: replication test: parametrize snapshots
  raft: replication test: parametrize drop_replication
  raft: replication test: remove unused configuration
  raft: replication test: add license
2021-03-09 17:58:59 +01:00
Pavel Emelyanov
096e452db9 test: Fix exit condition of row_cache_test::test_eviction_from_invalidated
The test populates the cache, then invalidates it, then tries to push
huge (10x the segment size) chunks into seastar memory hoping that
the invalid entries will be evicted. The exit condition on the last
stage is -- total memory of the region (sum of both -- used and free)
becomes less than the size of one chunk.

However, the condition is wrong, because cache usually contains a dummy
entry that's not necessarily on lru and on some test iteration it may
happen that

  evictable size < chunk size < evictable size + dummy size

In this case test fails with bad_alloc being unable to evict the memory
from under the dummy.

fixes: #7959
tests: unit(row_cache_test), unit(the failing case with the triggering
       seed from the issue + 200 times more with random seeds)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210309134138.28099-1-xemul@scylladb.com>
2021-03-09 17:57:52 +01:00
Alejo Sanchez
f67b85e2b3 raft: replication test: use Seastar random generator
Use the random generator provided by Seastar test suite.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:52:07 -04:00
Alejo Sanchez
1bf10a87c6 raft: replication test: rename drop_replication
Rename drop_replication to packet_drops for readability.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:52:07 -04:00
Alejo Sanchez
6e193ee3bf raft: replication test: change to Boost test
Change test/raft directory to Boost test type.

Run replication_test cases with their own test.

RAFT_TEST_CASE macro creates 2 test cases, one with random 20% packet
loss named name_drops.

The directory test/raft is changed to host Boost tests instead of unit.

While there improve the documentation.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:52:07 -04:00
Alejo Sanchez
8d9c797954 raft: replication test: id helper functions
In raft the UUID 0 is a special case so server ids start at 1.
Add two helper functions. Convert local 0-based id to raft 1-based
UUID. And from UUID to raft_id.
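
A minimal Python sketch of the two conversions described above. The function names `to_raft_uuid` and `to_local_id` are hypothetical; the actual helpers live in the C++ test.

```python
import uuid

def to_raft_uuid(local_id):
    """Map a 0-based test-local id to a 1-based UUID (UUID 0 is reserved)."""
    return uuid.UUID(int=local_id + 1)

def to_local_id(raft_uuid):
    """Inverse mapping: 1-based UUID back to the 0-based local id."""
    return raft_uuid.int - 1
```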

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:50:12 -04:00
Alejo Sanchez
0ffa450222 raft: replication test: improve handling connectivity
Change global map of disconnected servers to a more intuitive class
connected. The class is callable for the most common case
connected(id).

Methods connect(), disconnect(), and all() are provided for readability
instead of directly calling map methods (insert, erase, clear). They
also support both numerical (0 based) and server_id (UUID, 1 based) ids.

The actual shared map is kept in a lw_shared_ptr.

The class is passed around to be copy-constructed which is practically
just creating a new lw_shared_ptr.
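
The shape of such a class can be sketched in Python as below. This is an illustration under assumptions, not the test's C++ code: `Connected` and its `copy()` method are hypothetical stand-ins, and a shared Python set plays the role of the map held in the lw_shared_ptr.

```python
class Connected:
    """Tracks disconnected servers internally, exposes a 'connected' view."""

    def __init__(self, shared=None):
        # Copies share the same underlying set, like copies of a shared pointer.
        self._disconnected = shared if shared is not None else set()

    def copy(self):
        return Connected(self._disconnected)

    def __call__(self, server_id):
        # Most common case: connected(id)
        return server_id not in self._disconnected

    def connect(self, server_id):
        self._disconnected.discard(server_id)

    def disconnect(self, server_id):
        self._disconnected.add(server_id)

    def all(self):
        # Reconnect everyone by clearing the set of disconnected servers.
        self._disconnected.clear()
```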

Internally it tracks disconnected servers but externally it's more
intuitive to use connect instead of disconnect. So it reads
"connected id" and "not disconnected id", without double negatives.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 12:39:29 -04:00
Alejo Sanchez
7a644f37d3 raft: replication test: parametrize snapshots
Snapshots and persisted snapshots created per test instead of globals.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 11:58:20 -04:00
Alejo Sanchez
f72e89fcfe raft: replication test: parametrize drop_replication
Pass drop_replication down instead of keeping it global.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 11:58:20 -04:00
Alejo Sanchez
5a03670f91 raft: replication test: remove unused configuration
Remove test case configuration as it's not implemented yet.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 11:58:20 -04:00
Alejo Sanchez
efc6681cd6 raft: replication test: add license
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-09 11:58:20 -04:00
Piotr Sarna
d473bc9b06 Merge 'Fix inconsistencies in MV and SI (reworked)' from Eliran Sinvani
This is a reworked submission of #7686, which has been reverted. This series
fixes some race conditions in MV/SI schema creation and load. We spotted some
places where a schema without a base table reference can sneak into the
registry. This can cause an unrecoverable error, since write commands with
those schemas can't be issued from other nodes. Most of those cases can occur in
two main, uncommon scenarios: in a mixed cluster (during an upgrade) and in a
small window after a view or base table is altered.

Fixes #7709

Closes #8091

* github.com:scylladb/scylla:
  database: Fix view schemas in place when loading
  global_schema_ptr: add support for view's base table
  materialized views: create view schemas with proper base table reference.
  materialized views: Extract fix legacy schema into its own logic
2021-03-09 16:27:34 +01:00
Asias He
61ac8d03b9 repair: Add ignore_nodes option
In some cases, a user may want to repair the cluster while ignoring a node
that is down, for example to run repair before running a removenode operation
to remove a dead node.

Currently, repair will ignore the dead node and keep running without it, but
it reports the repair as partial and as failed. It is hard to tell whether the
repair failed only because the dead node is not present or because of some
other error.

In order to exclude the dead node, one can use the hosts option. But it
is hard to understand and use, because one needs to list all the "good"
hosts including the node itself. It will be much simpler, if one can
just specify the node to exclude explicitly.

In addition, we support the ignore nodes option in other node operations
like removenode. This change makes the interface for explicitly ignoring
a node more consistent.

Refs: #7806

Closes #8233
2021-03-09 16:03:13 +01:00
Gleb Natapov
2a41ad0b57 raft: add testing for non-voting members
Add tests to check if quorum (for leader election and commit index
purposes) is calculated correctly in the presence of non-voting members.
Message-Id: <20210304101158.1237480-3-gleb@scylladb.com>
2021-03-09 13:51:09 +01:00
Gleb Natapov
dd6ba3d507 raft: add non-voting member support
This patch adds support for non-voting members. A non-voting member is a
member whose vote is not counted for leader election and commit index
calculation purposes, and it cannot become a leader. Otherwise
it is a normal raft node. The state is needed to let new nodes catch
up their log without disturbing the cluster.

All kinds of transitions are allowed. A node may be added as a voting member
directly or it may be added as non-voting and then changed to be voting
one through additional configuration change. A node can be demoted from
voting to non-voting member through a configuration change as well.
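
How non-voting members affect quorum can be sketched in Python as follows. This is a simplified illustration, not the raft implementation: `quorum`, `election_won` and the `cluster` layout are hypothetical, and joint configurations are ignored.

```python
def quorum(members):
    """Majority size, counting voting members only.

    `members` maps a server id to True (voting) or False (non-voting).
    """
    voters = sum(1 for can_vote in members.values() if can_vote)
    return voters // 2 + 1

def election_won(members, votes_granted):
    """Votes granted by non-voting members are ignored."""
    counted = sum(1 for sid in votes_granted if members.get(sid, False))
    return counted >= quorum(members)

# Three voters and two non-voting members that are still catching up.
cluster = {1: True, 2: True, 3: True, 4: False, 5: False}
```

With this layout a candidate needs two votes from servers 1 to 3; votes from servers 4 and 5 do not help it.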
Message-Id: <20210304101158.1237480-2-gleb@scylladb.com>
2021-03-09 13:47:48 +01:00
Raphael S. Carvalho
863b95aa34 sstables: remove bag_sstable_set
bag_sstable_set can be replaced with partitioned_sstable_set, which
will provide the same functionality, given that L0 sstables go to
a "bag" rather than interval map. STCS, for example, will only
have L0 sstables, so it will get exactly the same behavior with
partitioned_sstable_set.

It also gives us the benefit of keeping the leveled sstables in
the interval map if the user has switched from LCS to STCS, until
they're all compacted into size-tiered ssts.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-09 08:39:48 -03:00
Avi Kivity
9038a81317 treewide: drop SEASTAR_CONCEPT
Since Scylla requires C++20, there is no need to protect
concept definitions or usages with SEASTAR_CONCEPT; it just
clutters the code. This patch therefore removes all uses.

Closes #8236
2021-03-08 16:04:20 +01:00
Asias He
dc40184faa gossip: Handle timeout error in gossiper::do_shadow_round
Currently, the rpc timeout error for the GOSSIP_GET_ENDPOINT_STATES verb
is not handled in gossiper::do_shadow_round. If the
GOSSIP_GET_ENDPOINT_STATES rpc call to any of the remote nodes times
out, gossiper::do_shadow_round will throw an exception and fail the
whole boot up process.

It is fine if some of the remote nodes time out in the shadow round. It is
not a must to talk to all nodes.

This patch fixes an issue we saw recently in our sct tests:

```
INFO    | scylla[1579]: [shard 0] init - Shutting down gossiping
INFO    | scylla[1579]: [shard 0] gossip - gossip is already stopped
INFO    | scylla[1579]: [shard 0] init - Shutting down gossiping was successful
...

ERR     | scylla[1579]: [shard 0] init - Startup failed: seastar::rpc::timeout_error (rpc call timed out)
```

Fixes #8187

Closes #8213
2021-03-08 13:03:41 +01:00
Nadav Har'El
28804a50f7 alternator-test: test that index can't be a name reference (#xyz)
We already have a test which verifies that DynamoDB and Alternator
do not allow an index in an attribute path - like a[0].b - to be
a value reference - a[:xyz].b. We forgot to verify that the index
also can't be a name reference - a[#xyz].b is a syntax error. So here
we add a test which confirms that this is indeed the case - DynamoDB
doesn't allow it, and neither does Alternator.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210219123310.1240271-1-nyh@scylladb.com>
2021-03-08 10:17:19 +01:00
Avi Kivity
938761f49f types.cc: drop unused #include "compaction_garbage_collector.hh"
Garbage-collect unused #includes.

Closes #8232
2021-03-08 06:44:03 +01:00
Takuya ASADA
2d9feaacea scylla_raid_setup: don't abort using raiddev when array_state is 'clear'
On Ubuntu 20.04 AMI, scylla_raid_setup --raiddev /dev/md0 causes
'/dev/md0 is already using' (issue #7627).
So we merged the patch to find free mdX (587b909).

However, looking into /proc/mdstat of the AMI, it actually says no active md device is available:

ubuntu@ip-10-0-0-43:~$ cat /proc/mdstat
Personalities :
unused devices: <none>

We currently decide mdX is used when os.path.exists('/sys/block/mdX/md/array_state') == True,
but according to the kernel doc, the file may be available even if the array is STOPPED:

    clear

        No devices, no size, no level
        Writing is equivalent to STOP_ARRAY ioctl
https://www.kernel.org/doc/html/v4.15/admin-guide/md.html

So we should also check array_state != 'clear', not just array_state
existence.
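
The check described above can be sketched as follows. This is an illustration, not the actual scylla_raid_setup code: the function name `md_in_use` and the `sys_block` parameter (added so the path can be overridden) are assumptions.

```python
import os

def md_in_use(name, sys_block='/sys/block'):
    """Return True if md device `name` (e.g. 'md0') holds an active array."""
    state_file = os.path.join(sys_block, name, 'md', 'array_state')
    if not os.path.exists(state_file):
        return False
    with open(state_file) as f:
        # 'clear' means no devices, no size, no level: the slot is free.
        return f.read().strip() != 'clear'
```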

Fixes #8219

Closes #8220
2021-03-07 18:30:11 +02:00
Avi Kivity
1287a5e1d0 test: index_reader_assertions: fix misuse of trichotomic comparator in has_monotonic_positions
has_monotonic_positions() wants to check for a greater-than-or-equal-to
relation, but actually tests for not-equal, since it treats a
trichotomic comparator as a less-than comparator. This is clearly seen
in the BOOST_FAIL message just below.

Fix by aligning the test with the intended invariant. Luckily, the tests
still pass.
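
The class of bug can be illustrated with a small Python sketch. The names `tri_compare`, `misused_as_less` and `less` are hypothetical; the real code is C++.

```python
def tri_compare(a, b):
    """Three-way comparator: negative, zero, or positive."""
    return (a > b) - (a < b)

def misused_as_less(a, b):
    # Bug: any non-zero result is truthy, so this is really "a != b".
    return bool(tri_compare(a, b))

def less(a, b):
    # Correct: compare the three-way result against zero.
    return tri_compare(a, b) < 0
```

For the pair (2, 1) the misused form answers true although 2 is not less than 1, so a check built on it only detects inequality.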

Ref #1449.

Closes #8222
2021-03-07 13:44:37 +02:00
Eliran Sinvani
0220786710 database: Fix view schemas in place when loading
On restart the view schemas are loaded and might contain old
views with an unmarked computed column. We already have code to
update the schema, but before we do it we load the view as is. This
is not desired since once registered, this view version can be used
for writes which is forbidden since we will spot a non-computed
column which is in the view's primary key but not in the base table
at all. To solve this, in addition to altering the persistent schema,
we fix the view's loaded schema in place. This is safe since computed
column is just involved in generating a value for this column when
creating a view update so the effect of this manipulation stays
internal.
The second stage of the in place fixing is to persist the
changes made in the in place fixing so the view is ready for
the next node restart in particular the `computed_columns` table.
2021-03-07 12:57:16 +02:00
Eliran Sinvani
04de770566 global_schema_ptr: add support for view's base table
Up until now, the global_schema_ptr object was a crack
through which a view schema with an uninitialized base
reference could sneak. Even if the schema itself contained a
base reference, the base schema didn't carry over to shards
different than the shard on which the global_schema_ptr was
created.
Since once the schema is in the registry it might be used for
everything (reads and writes), we also need to make sure that
global schemas for incomplete view schemas will not be created.
2021-03-07 12:50:42 +02:00
Eliran Sinvani
9162748b18 materialized views: create view schemas with proper base table
reference.

Newly created view schemas don't always have their base info.
This is bad since such schemas support neither reads nor writes.
This leaves us vulnerable to a race condition where there is
an attempt to use this schema for read or write. Here we initialize
the base reference and also reconfigure the view to conform to the
new computed column type, which makes it usable for write and not only
reads. We do it for views created in the migration manager following
announcements and also for copied schemas.
2021-03-07 12:50:42 +02:00
Eliran Sinvani
39cd9dae4e materialized views: Extract fix legacy schema into its own logic
We extract the logic for fixing the view schema into its own
logic as we will need to use it in more places in the code.
This makes 'maybe_update_legacy_secondary_index_mv_schema' redundant since
it becomes a two-line wrapper for this logic. We also
remove it here and replace the call to it with the equivalent code.
2021-03-07 12:50:42 +02:00
Takuya ASADA
53c7600da8 dist: increase fs.aio-max-nr value for other apps
The current fs.aio-max-nr value, cpu_count() * 11026, is exactly the number
of slots Scylla uses. If other apps in the environment also try to use aio,
the aio slots will run out.
So increase the value by 65536 for other apps.
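
The resulting value can be worked out as below. The constants come from this commit message; the function name `aio_max_nr` and the constant names are hypothetical, not taken from the dist scripts.

```python
import os

AIO_SLOTS_PER_CPU = 11026   # per-CPU figure quoted in the commit message
OTHER_APPS_RESERVE = 65536  # extra headroom for non-Scylla applications

def aio_max_nr(cpus):
    """fs.aio-max-nr for a machine with `cpus` CPUs."""
    return cpus * AIO_SLOTS_PER_CPU + OTHER_APPS_RESERVE

# Value for the machine this runs on.
value = aio_max_nr(os.cpu_count() or 1)
```

For example, an 8-CPU machine gets 8 * 11026 + 65536 = 153744.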

Related #8133

Closes #8228
2021-03-07 12:11:36 +02:00
Piotr Sarna
7106ca27e6 service: reduce continuation length for paxos pruning
A pair of (finally, handle_exception) is reduced to a single
use of then_wrapped(), which saves an allocation.

Message-Id: <01949e286db93397209435a85fcc46a8beef6d24.1614937462.git.sarna@scylladb.com>
2021-03-07 11:59:10 +02:00
Nadav Har'El
ad563c6279 Update tools/java submodule
Fixes an sstableloader bug where we quoted twice column names that
had to be quoted, and therefore failed on such tables - and in particular
Alternator tables which always have a column called ":attrs".

Fixes #8229

* tools/java 142f517a23...c5d9e8513e (1):
  > sstableloader: Only escape column names once
2021-03-07 10:33:49 +02:00
Botond Dénes
debaae41f9 mutation_partition: operator<<(mutation_partition::printer)
Include row tombstones in the row printout.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210305094106.210249-1-bdenes@scylladb.com>
2021-03-05 14:39:39 +02:00
Botond Dénes
45471419d0 multishard_mutation_query: re-enable reverse queries
034cb81323 and 0f0c3be disallowed reverse partition-range scans based on
the observation that the CQL frontend disallows them, assuming that
other client APIs also disallow them. As it turns out this is not true
and there is at least one client API (Thrift) which does allow reverse
range scans. So re-enable them.

Fixes: #8211

Tests: unit(release), dtest(thrift_tests.py)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210304142249.164247-1-bdenes@scylladb.com>
2021-03-04 17:06:16 +02:00
Nadav Har'El
acfa180766 cql-pytest: recognize when Scylla crashes
Before this patch, if Scylla crashes during some test in cql-pytest, all
tests after it will fail because they can't connect to Scylla - and we can
get a report on hundreds of failures without a clear sign of where the real
problem was.

This patch introduces an autouse fixture (i.e., a fixture automatically
used by every test) which tries to run a do-nothing CQL command after each
test.  If this CQL command fails, we conclude that Scylla crashed and
report the test in which this happened - and exit pytest instead of failing
a hundred more tests.

Fixes #8080

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210304132804.1527977-1-nyh@scylladb.com>
2021-03-04 16:00:00 +02:00
Raphael S. Carvalho
1226fc755f compaction_manager: Increase cleanup compaction resilience when low on disk space
In a scenario where a node is running out of disk space, which is a common
cause of cluster expansion, it's very important to clean up the smallest
files first to increase the chances of success when the biggest files are
reached down the road. That's possible given that cleanup operates on a
single file at a time, and that the smaller the file the smaller the
space requirement.
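
The ordering policy amounts to a sort by file size, sketched here in Python. `cleanup_order` and the sample file list are hypothetical; the real change is in the C++ compaction manager.

```python
def cleanup_order(sstables):
    """Order (name, size_in_bytes) pairs so the smallest file is cleaned first."""
    return sorted(sstables, key=lambda entry: entry[1])

files = [("big", 900), ("small", 10), ("medium", 200)]
order = [name for name, _ in cleanup_order(files)]
```

Each small file cleaned early may free some space before the larger files, which need more temporary space, are reached.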

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210303165520.55563-1-raphaelsc@scylladb.com>
2021-03-04 15:11:06 +02:00
Botond Dénes
25367deb01 mutation_partition: make row::consume_with() exception safe
This function currently eagerly decrements `_size`, before `func()` is
invoked. If `func()` throws the consumption fails but the size remains
decremented. If this happens right at the last element in the row, the
`row::empty()` will incorrectly return `true`, even though there is
still one cell left in it. Move the decrement after the `func()`
invocation to avoid this by only decrementing if the consumption
was successful.
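
The ordering issue can be sketched in Python. This `Row` class is a hypothetical stand-in for the C++ `row`, kept only detailed enough to show the decrement happening after the callback succeeds.

```python
class Row:
    def __init__(self, cells):
        self._cells = list(cells)
        self._size = len(self._cells)

    def empty(self):
        return self._size == 0

    def consume_with(self, func):
        while self._cells:
            func(self._cells[0])   # may raise; nothing has changed yet
            self._cells.pop(0)
            self._size -= 1        # decrement only after success
```

If `func` raises on the last cell, `empty()` still reports that a cell remains, which is the behavior the fix restores.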

Fixes: #8154

Tests: unit(mutation_test:release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210304125318.143323-1-bdenes@scylladb.com>
2021-03-04 15:07:15 +02:00
Piotr Sarna
added53b7d Merge 'hints: use a soft disk space limit in hints commitlog' from Piotr Dulikowski
A recent change to the commitlog (4082f57) caused its configurable size limit to
be strictly enforced - after reaching the limit, new segments wouldn't be
allocated until some of the previous segments are freed. This flow can work for
the regular commitlog, however the hints commitlog does not delete the segments
itself - instead, hints manager recreates its commitlog every 10 seconds, picks
up segments left by the previous instance and deletes each segment manually only
after all hints are sent out from a segment.

Because of the non-standard flow, it is possible that the hints commitlog fills
up and stops accepting more hints. Hints manager uses a relatively low limit for
each commitlog instance (128MB divided by shard count), so it's not hard to fill
it up. What's worse, hints manager tries to acquire file_update_mutex in
exclusive mode before re-creating the commitlog, while hints waiting to be
written acquire this lock in shared mode - which causes hints flushing to
deadlock completely, so no more hints are admitted to the commitlog. The queue of
hints waiting to be admitted grows very quickly and soon all writes which could
result in a hint being generated are rejected with OverloadedException.

To solve this problem, it is now possible to bring back the soft disk space
limit by setting a flag in commitlog's configuration.

Tests:
- unit(dev)
- wrote hints for 15 minutes in order to see if it gets stuck again

Fixes #8137

Closes #8206

* github.com:scylladb/scylla:
  hints_manager: don't use commitlog hard space limit
  commitlog: add an option to allow going over size limit
2021-03-04 12:24:05 +01:00
Tomasz Grabiec
d6a94a7db1 Merge 'Make dht::token tri_compare safer' from Avi Kivity
tri_compare() returns an int, which is dangerous as a tri_compare can
be misused where a less_compare is expected. To prevent such misuse,
convert the interval<> template to accept comparators that return
std::strong_ordering, and then convert dht::token's comparator to do
the same.

Ref #1449.

Closes #8181

* github.com:scylladb/scylla:
  dht: convert token tri_compare to std::strong_ordering
  interval: support C++20 three-way comparisons
2021-03-04 11:55:08 +01:00
Nadav Har'El
3e66a5cd43 Merge 'More Redis cleanups' from Pekka Enberg
This pull request removes seastar namespace imports from the header
files. There are some additional cleanups to make that easier and to
remove some commented out code.

Closes #8202

* github.com:scylladb/scylla:
  redis: Remove seastar namespace import from query_processor.hh
  redis: Switch to seastar::sharded<> in query_processor.hh
  redis: Remove seastar namespace import from query_utils.hh
  redis: Remove seastar namespace import from reply.hh
  redis: Remove commented out code from options.hh
  redis: Remove seastar namespace import from options.hh
  redis: Remove seastar namespace import from service.hh
  redis: Switch to seastar::sharded<> in service.{hh,cc}
  redis: Remove unneeded include from keyspace_utils.hh
  redis: Remove seastar namespace import from keyspace_utils.hh
  redis: Remove seastar namespace import from command_factory.hh
  redis: Fix include path in command_factory.hh
  redis: Remove unneeded includes from command_factory.hh
2021-03-04 11:08:24 +02:00
Pekka Enberg
6066db7c90 Update tools/jmx submodule
* tools/jmx bac7d0b...15c1d4f (2):
  > StorageService: Add a method to return the uptime
  > Bump Jackson version in scylla-apiclient
2021-03-04 10:56:37 +02:00
Nadav Har'El
e12e57c915 Merge 'Fix alternator streams management regression' from Calle Wilund
Refs: #8012
Fixes: #8210

With the update to CDC generation management, the way we retrieve and process these changed.
One very bad bug slipped through though; the code for getting versioned streams did not take into
account the late-in-pr change to make clustering of CDC gen timestamps reversed. So our alternator
shard info became quite rump-stumped, leading to more or less no data depending on when generations
changed w.r.t. the data.

Also, the way we track the above timestamps changed, so we should utilize this for our end-of-iterator check.

Closes #8209

* github.com:scylladb/scylla:
  alternator::streams: Use better method for generation timestamp
  system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp
  system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range
2021-03-04 09:43:56 +02:00
Pekka Enberg
1d8a94f941 Update tools/jmx submodule
* tools/jmx c2fc96b...bac7d0b (1):
  > Merge 'Fix locking in APIBuilder.remove()' from Pekka Enberg
2021-03-03 18:30:48 +02:00
Calle Wilund
8bbc976ff1 alternator::streams: Use better method for generation timestamp
Get timestamp via system_distributed, instead of local gen.
2021-03-03 15:46:38 +00:00
Calle Wilund
5da0129775 system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp
Since we have a table of cdc version timestamps, conveniently sorted
reversed, we can just query this and get the latest known gen ts.
2021-03-03 15:44:54 +00:00
Calle Wilund
5a69250d7e system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range
With the new scheme for cdc generation management, one of the last
changes was to make the time ordering of the stream timestamps reversed.

However, cdc_get_versioned_streams forgot to take this into account
when sifting out timestamp ranges for stream retrieval (based on
low mark).

Fixed by doing reverse iteration.
2021-03-03 15:41:42 +00:00
Tomasz Grabiec
3cb01f218f Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja
Test log consistency after apply_snapshot() is called.
Ensure log::last_term(), log::last_conf_index() and log::size()
work as expected.

Misc cleanups.

* scylla-dev.git/raft-confchange-test-v4:
  raft: fix spelling
  raft: add a unit test for voting
  raft: do not account for the same vote twice
  raft: remove fsm::set_configuration()
  raft: consistently use configuration from the log
  raft: add ostream serialization for enum vote_result
  raft: advance commit index right after leaving joint configuration
  raft: add tracker test
  raft: tidy up follower_progress API
  raft: update raft::log::apply_snapshot() assert
  raft: add a unit test for raft::log
  raft: rename log::non_snapshoted_length() to log::in_memory_size()
  raft: inline raft::log::truncate_tail()
  raft: ignore AppendEntries RPC with a very old term
  raft: remove log::start_idx()
  raft: return a correct last term on an empty log
  raft: do not use raft::log::start_idx() outside raft::log()
  raft: rename progress.hh to tracker.hh
  raft: extend single_node_is_quiet test
2021-03-03 16:29:40 +01:00
Tomasz Grabiec
0dc57db248 Revert "Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja"
This reverts commit f94f70cda8, reversing
changes made to 5206a97915.

The version of the series that was merged was not the latest one. Revert
prior to merging the latest one.
2021-03-03 16:29:02 +01:00
Avi Kivity
facc7c370e Update tools/jmx submodule
* tools/jmx 8073af6...c2fc96b (1):
  > APIBuilder: Remove RW-lock in JMX server repository wrapper

Fixes #7991.
2021-03-03 15:41:09 +02:00
Avi Kivity
aae43e1a20 Merge 'Untyped_result_set: make non-copying and fragment retaining' from Calle Wilund
Refs #7961
Fixes #8014

The "untyped_result_set" object was created for small, internal access to cql-stored metadata.
It is nowadays used for rather more than that (cdc).
This has the potential of mixing badly with the fact that the type does deep copying of data
and linearizes all (not to mention handles multiple rows rather inefficiently).

Instead of doing a deep copy of input, we assume ownership and build
rows of the views therein, potentially retaining fragmented data as-is
avoiding premature linearization.

Note that this is not all sugar and flowers though. Any data access will
by nature be more expensive, and the view collections we create are
potentially just as expensive as copying for small cells.

Otoh, it allows writing code using this that avoids data copying,
depending on destination.

v2:
* Fixed wrong collection reserved in visitor
* Changed row index from shared ptr to ref
* Moved typedef
* Removed non-existing constructors
* Added const ref to index build
* Fixed raft usage after rebase

v3:
* Changed shared_ptr to unique

Closes #8015

* github.com:scylladb/scylla:
  untyped_result_set: Do not copy data from input store (retain fragmented views)
  result_generator: make visitor callback args explicit optionals
  listlike_partial_deserializing_iterator: expose templated collection routines
2021-03-03 13:13:18 +02:00
Nadav Har'El
4e3db5297a cql-pytest: rework tests for filtering leaving out most rows
Previously, we had two tests demonstrating issue #7966. But since then,
our understanding of this issue has improved which resulted in issue #8203,
so this patch improves those tests and makes them reproduce the new issue.

Importantly, we now know that this problem is not specific to a full-table
scan, and also happens in a single-partition scan, so we fix the test to
demonstrate this (instead of the old test, which missed the problem so
the test passed).

Both tests pass on Cassandra, and fail on Scylla.

Refs #8203.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210302224020.1498868-1-nyh@scylladb.com>
2021-03-03 11:22:08 +01:00
Calle Wilund
e4d6c8904f untyped_result_set: Do not copy data from input store (retain fragmented views)
Refs #7961
Fixes #8014

Instead of doing a deep copy of input, we assume ownership and build
rows of the views therein, potentially retaining fragmented data as-is
avoiding premature linearization.

Note that this is not all sugar and flowers though. Any data access will
by nature be more expensive, and the view collections we create are
potentially just as expensive as copying for small cells.

Otoh, it allows writing code using this that avoids data copying,
depending on destination.

v2:
* Fixed wrong collection reserved in visitor
* Changed row index from shared ptr to ref
* Moved typedef
* Removed non-existing constructors
* Added const ref to index build
* Fixed raft usage after rebase

v3:
* Changed shared_ptr to unique
2021-03-03 10:19:46 +00:00
Calle Wilund
353730d4bb result_generator: make visitor callback args explicit optionals
This allows a visitor to separate temporaries (non-optional views)
from store backed views (optionals) when traversing.
2021-03-03 10:19:46 +00:00
Calle Wilund
bba43ce31a listlike_partial_deserializing_iterator: expose templated collection routines
To allow using fragmented types as input.
2021-03-03 10:19:46 +00:00
Nadav Har'El
0fea089b37 Merge 'Fix reading whole requests during shedding' from Piotr Sarna
When shedding requests (e.g. due to their size or number exceeding the
limits), errors were returned right after parsing their headers, which
resulted in their bodies lingering in the socket. The server always
expects a correct request header when reading from the socket after the
processing of a single request is finished, so shedding the requests
should also take care of draining their bodies from the socket.
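
Why the body has to be drained can be shown with a small Python sketch. The framing here (a 4-byte big-endian length followed by the body) and the names `frame`, `serve` and `MAX_BODY` are hypothetical simplifications, not the CQL wire format or the transport code.

```python
import io
import struct

MAX_BODY = 16  # hypothetical size limit for this sketch

def frame(body):
    """Prefix a body with its length, as a stand-in for a request header."""
    return struct.pack(">I", len(body)) + body

def serve(stream):
    """Return one result per request: the body, or 'shed' if it was too large."""
    results = []
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return results
        (length,) = struct.unpack(">I", header)
        if length > MAX_BODY:
            # Drain the body so the next read starts at a request header.
            stream.read(length)
            results.append("shed")
        else:
            results.append(stream.read(length))

wire = io.BytesIO(frame(b"x" * 100) + frame(b"ok"))
results = serve(wire)
```

Without the draining read, the next loop iteration would interpret the first four bytes of the oversized body as a header and lose sync with the stream.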

Fixes #8193

Closes #8194

* github.com:scylladb/scylla:
  cql-pytest: add a shedding test
  transport: return error on correct stream during size shedding
  transport: return error on correct stream during shedding
  transport: skip the whole request if it is too large
  transport: skip the whole request during shedding
2021-03-03 08:52:48 +02:00
Piotr Sarna
4499f89916 cql-pytest: add a shedding test
This scylla-only test case tries to push a too-large request
to Scylla, and then retries with a smaller request, expecting
a success this time.

Refs #8193
2021-03-03 07:08:55 +01:00
Pekka Enberg
310b5c9592 redis: Fix license text in server.hh
The search and replace pattern went a bit overboard. Let's fix up the
license text.
Message-Id: <20210302171150.3346-1-penberg@scylladb.com>
2021-03-03 07:06:45 +01:00
Dejan Mircevski
05497fe14d cql3/maps: Drop redundant if condition
Accidentally introduced in 9eed26ca3d, it can never be true due to
code above it.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8201
2021-03-03 07:06:45 +01:00
Nadav Har'El
d6335b7fda test/alternator: better tests of oversized requests
Like DynamoDB, Alternator rejects requests larger than some fixed maximum
size (16MB). We had a test for this feature - test_too_large_request,
but it was too blunt, and missed two issues:

Refs #8195
Refs #8196

So this patch adds two better tests that reproduce these two issues:

First, test_too_large_request_chunked verifies that an oversized request
is detected even if the body is sent with chunked encoding.

Second, both tests - test_too_large_request_chunked and
test_too_large_request_content_length - verify that the rather limited
(and arguably buggy) Python HTTP client is able to read the 413 status
code - and doesn't report some generic I/O error.

Both tests pass on DynamoDB, but fail on Alternator because of these two
open issues.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210302154555.1488812-1-nyh@scylladb.com>
2021-03-03 07:06:45 +01:00
Nadav Har'El
c6ca1ec643 cql-pytest: add reproducers for two filtering-related issues
The main goal of this patch is to add a reproducer for issue #7966, where
partition-range scan with filtering that begins with a long string of
non-matches aborts the query prematurely - but the same thing is fine with
a single-partition scan. The test, test_filtering_with_few_matches, is
marked as "xfail" because it still fails on Scylla. It passes on Cassandra.

I put a lot of effort into making this reproducer *fast* - the dev-build
test takes 0.4 seconds on my laptop. Earlier reproducers for the same
problem took as much as 30 seconds, but 0.4 seconds turns this test into
a viable regression test.

We also add a test, test_filter_on_unset, which reproduces issue #6295 (or
the duplicate #8122), which was already solved so this test passes.

Refs #6295
Refs #7966
Refs #8122

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210301170451.1470824-1-nyh@scylladb.com>
2021-03-03 07:06:45 +01:00
Calle Wilund
58489dc003 cql3::restrictions: Add SCYLLA_CLUSTERING_BOUND keyword for sstableloader
Refs #8093
Refs /scylladb/scylla-tools-java#218

Adds keyword that can preface value tuples in (a, b, c) > (1, 2, 3)
expressions, forcing the restriction to bypass column sort order
treatment, and instead just create the raw ck bounds accordingly.

This is a very limited, and simple version, but since we only need
to cover this above exact syntax, this should be sufficient.

v2:
* Add small cql test
v3:
* Added comment in multi_column_restriction::slice, on what "mode" means and is for
* Added small document of our internal CQL extension keywords, including this.
v4:
* Added a few more cases to tests to verify multi-column restrictions
* Reworded docs a bit
v5:
* Fixed copy-paste error in comment
v6:
* Added negative (error) test cases
v7:
* Added check + reject of trying to combine SCYLLA_CLUST... slice and
  normal one

Closes #8094
2021-03-03 07:06:45 +01:00
Pekka Enberg
9d54a3e743 redis: Remove seastar namespace import from query_processor.hh 2021-03-02 18:39:30 +02:00
Pekka Enberg
27c5041c86 redis: Switch to seastar::sharded<> in query_processor.hh 2021-03-02 18:38:41 +02:00
Pekka Enberg
ee8fe53b3c redis: Remove seastar namespace import from query_utils.hh 2021-03-02 18:37:31 +02:00
Pekka Enberg
c90c1ccd44 redis: Remove seastar namespace import from reply.hh 2021-03-02 18:36:30 +02:00
Pekka Enberg
1075d72780 redis: Remove commented out code from options.hh 2021-03-02 18:34:46 +02:00
Pekka Enberg
1c222fda65 redis: Remove seastar namespace import from options.hh 2021-03-02 18:34:30 +02:00
Pekka Enberg
d452c8f42e redis: Remove seastar namespace import from service.hh 2021-03-02 18:33:31 +02:00
Pekka Enberg
d0594a86aa redis: Switch to seastar::sharded<> in service.{hh,cc} 2021-03-02 18:30:39 +02:00
Avi Kivity
ee9db75210 Merge 'Clean up Redis transport layer' from Pekka Enberg
The Redis transport layer seems to have originated as a copy-paste of
the CQL transport layer. This pull request removes a bunch of unused
and commented out bits of code, and also does some minor cleanups
like organizing includes, to make the code more readable.

Closes #8198

* github.com:scylladb/scylla:
  redis: Remove unused to_bytes_view() function from server.cc
  redis: Remove unused tracing_request_type enum
  redis: Remove unneeded connection friend declaration
  redis: Remove unused process_request_executor friend declaration
  redis: Remove unused _request_cpu class member
  redis: Remove commented out code from server.hh
  redis: Remove duplicate request.hh include
  redis: Remove unused db::config forward declaration
  redis: Remove unused fmt_visitor forward declaration
  redis: Organize includes in server.{cc,hh}
  redis: Switch to seastar::sharded<>
  redis: Remove redundant access modifiers from server.hh
2021-03-02 18:27:38 +02:00
Pekka Enberg
097aaa6dc2 redis: Remove unneeded include from keyspace_utils.hh 2021-03-02 18:16:29 +02:00
Pekka Enberg
7f4de3f915 redis: Remove seastar namespace import from keyspace_utils.hh 2021-03-02 18:15:37 +02:00
Pekka Enberg
bf47b58b8a redis: Remove seastar namespace import from command_factory.hh 2021-03-02 18:13:49 +02:00
Pekka Enberg
92e257d5bd redis: Fix include path in command_factory.hh 2021-03-02 18:13:08 +02:00
Pekka Enberg
ac4b8e4534 redis: Remove unneeded includes from command_factory.hh 2021-03-02 18:12:30 +02:00
Piotr Dulikowski
376da49cf4 hints_manager: don't use commitlog hard space limit
This commit disables the hard space limit applied by commitlogs created
to store hints. The hard limit causes problems for hints because they
use small-sized commitlogs to store hints (128MB, currently). Instead of
letting the commitlog delete the segments itself, it recreates the
commitlog every 10 seconds and manually deletes old segments after all
hints are sent out from them.

If the 128MB limit is reached, the hints manager will get stuck. A
future which puts a hint into the commitlog holds a shared lock, and commitlog
recreation needs to get an exclusive lock, which results in a deadlock.
No more hints will be admitted, and eventually we will start rejecting
writes with OverloadedException due to too many hints waiting to be
admitted to the commitlog.

By disabling the hard limit for hints commitlog, the old behavior is
brought back - commitlog becomes more conservative with the space used
after going over its size limit, but does not block until some of its
segments are deleted.
2021-03-02 16:53:50 +01:00
Piotr Sarna
8635094144 transport: return error on correct stream during size shedding
When a request is shed due to being too large, its response
was sent with stream id 0 instead of the stream id that matches
the communication lane. That in turn confused the client,
which is no longer the case.
2021-03-02 15:10:46 +01:00
Piotr Sarna
d6ea6937ee transport: return error on correct stream during shedding
When a request is shed due to exceeding the max number of concurrent
requests, its response was sent with stream id 0 instead of
the stream id that matches the communication lane.
That in turn confused the client, which is no longer the case.
2021-03-02 15:10:46 +01:00
Pekka Enberg
01a785f561 redis: Remove unused to_bytes_view() function from server.cc 2021-03-02 14:29:52 +02:00
Pekka Enberg
fb6eecfae2 redis: Remove unused tracing_request_type enum 2021-03-02 14:29:52 +02:00
Pekka Enberg
8d79deb973 redis: Remove unneeded connection friend declaration 2021-03-02 14:29:51 +02:00
Pekka Enberg
ff81f7bc23 redis: Remove unused process_request_executor friend declaration 2021-03-02 14:29:51 +02:00
Pekka Enberg
87c5968602 redis: Remove unused _request_cpu class member 2021-03-02 14:29:51 +02:00
Pekka Enberg
11fa32e8c9 redis: Remove commented out code from server.hh 2021-03-02 14:29:51 +02:00
Pekka Enberg
ddab15c47f redis: Remove duplicate request.hh include 2021-03-02 14:29:51 +02:00
Pekka Enberg
07bd125a59 redis: Remove unused db::config forward declaration 2021-03-02 14:29:51 +02:00
Pekka Enberg
5a7e6b6c09 redis: Remove unused fmt_visitor forward declaration 2021-03-02 14:29:51 +02:00
Pekka Enberg
298bf19981 redis: Organize includes in server.{cc,hh} 2021-03-02 14:29:51 +02:00
Pekka Enberg
23c2f47054 redis: Switch to seastar::sharded<> 2021-03-02 14:29:51 +02:00
Pekka Enberg
7bd4ff9d75 redis: Remove redundant access modifiers from server.hh 2021-03-02 14:13:45 +02:00
Avi Kivity
5f4bf18387 Revert "Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros"
This reverts commit 31909515b3, reversing
changes made to ef97adc72a. It shows many
serious regressions in dtest.

Fixes #8197.
2021-03-02 13:21:22 +02:00
Takuya ASADA
870c3a28c1 scylla_setup: strip spaces of comma separated list
At the RAID prompt, we can type a disk list like this:
 /dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1

However, if the list contains spaces, it doesn't work:
 /dev/sda1, /dev/sdb1, /dev/sdc1, /dev/sdd1

This is because the script mistakenly treats the space as part of a device path.
So we need to strip() each item of the input.

Fixes #8174

Closes #8190
2021-03-02 12:48:18 +02:00
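The strip() fix described in the scylla_setup commit above can be illustrated with a minimal Python sketch. The function name `parse_disk_list` is hypothetical and is not taken from scylla_setup; dropping empty items is an extra assumption made here for robustness.

```python
def parse_disk_list(raw):
    """Split a comma-separated device list, ignoring whitespace around items.

    Hypothetical helper for illustration only; not the scylla_setup code.
    """
    # strip() each item so that ' /dev/sdb1' is not treated as a device path
    # that begins with a space; empty items (e.g. a trailing comma) are dropped.
    return [item.strip() for item in raw.split(",") if item.strip()]


print(parse_disk_list("/dev/sda1, /dev/sdb1, /dev/sdc1, /dev/sdd1"))
# ['/dev/sda1', '/dev/sdb1', '/dev/sdc1', '/dev/sdd1']
```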
Piotr Sarna
4a24d7dca0 transport: skip the whole request if it is too large
When a request is shed due to being too large, only the header
was actually read, and the body was still stuck in the socket
- and would be read in the next iteration, which would expect
to actually read a new request header.
Instead, the whole message is now skipped, so that a new request
can be correctly read and parsed.

Fixes #8193
2021-03-02 10:10:19 +01:00
Piotr Sarna
3eb7e768cb transport: skip the whole request during shedding
When a request is shed due to exceeding the number of max concurrent
requests, only its header was actually read, and the body was still
stuck in the socket - and would be read in the next iteration,
which would expect to actually read a new request header.
Instead, the whole message is now skipped, so that a new request
can be correctly read and parsed.

Refs #8193
2021-03-02 10:10:19 +01:00
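The behaviour described in the transport shedding commits above (drain the body of a shed request, and answer on the request's own stream id) can be sketched roughly as follows. This is not the Scylla transport code: the frame layout (2-byte stream id plus 4-byte body length), the `MAX_BODY` limit and the function names are all assumptions chosen to keep the example small.

```python
import io
import struct

# Hypothetical frame header for illustration: stream id (2 bytes) + body length (4 bytes).
HEADER = struct.Struct(">HI")
MAX_BODY = 16  # hypothetical size limit, in bytes


def frame(stream_id, body):
    """Build one frame: header followed by the body."""
    return HEADER.pack(stream_id, len(body)) + body


def serve(sock):
    """Process every frame in the stream and return (stream_id, outcome) pairs."""
    responses = []
    while True:
        raw = sock.read(HEADER.size)
        if len(raw) < HEADER.size:
            break  # end of stream
        stream_id, length = HEADER.unpack(raw)
        if length > MAX_BODY:
            # Shed the request, but still consume its body so that the next
            # read starts at a header rather than in the middle of a body.
            sock.read(length)
            # Reply on the stream id of the shed request, not on stream 0.
            responses.append((stream_id, "error: too large"))
            continue
        body = sock.read(length)
        responses.append((stream_id, "ok: " + body.decode()))
    return responses


wire = io.BytesIO(frame(7, b"x" * 100) + frame(8, b"small"))
print(serve(wire))
# [(7, 'error: too large'), (8, 'ok: small')]
```

Without the `sock.read(length)` in the shedding branch, the second iteration would try to parse the bytes of the oversized body as a header, which is the failure mode the commits describe. A real server would likely skip the body in bounded chunks instead of reading it into memory at once.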
Avi Kivity
10364fca6e Merge "Build query::result directly in range scan queries" from Botond
"
Currently range scans build their results on the replica in the
`reconcilable_result` format, that -- as its name suggests -- is
normally used for reconciliation (read repair). As such this result
format is quite inefficient for normal queries: it contains all columns
and all tombstones in the requested range. These are all unnecessary for
normal queries which only want live data and only those columns that are
requested by the user.
Furthermore, as the coordinator works in terms of `query::result` for
normal queries anyway, this intermediate result has to be converted to
the final `query::result` format adding an unnecessary intermediate
conversion step.
This series gets rid of this problem by introducing
`query_data_on_all_shards()`, a variant of
`query_mutations_on_all_shards()` that builds `query::result` directly.
Reverse queries still use the old intermediate method behind the scenes.

Fixes #8061
Refs #7434

Tests: unit(release, debug)
"

* 'range-scan-data-variant/v5-rebased' of https://github.com/denesb/scylla:
  cql_query_test: add unit test for the more efficient range scan result format
  test/cql_test_env: do_with_cql_test_env(): add thread_attributes parameter
  cql_query_test: test_query_limit: clean up scheduling groups
  storage_proxy: use query_data_on_all_shards() for data range scan queries
  query: partition_slice: add range_scan_data_variant option
  gms: add RANGE_SCAN_DATA_VARIANT cluster feature
  multishard_mutation_query: query_mutations_on_all_shards(): refuse reverse queries
  multishard_mutation_query: add query_data_on_all_shards()
  mutation_partition.cc: fix indentation
  query_result_builder: make it a public type
  multishard_mutation_query: generalize query code w.r.t. the result builder used
  multishard_mutation_query: query_mutations_on_all_shards(): extract logic into new method
  multishard_mutation_query: query_mutations_on_all_shards(): convert to coroutine
  multishard_mutation_query: do_query_mutations(): convert to coroutine
  multishard_mutation_query: read_page(): convert to coroutine
  multishard_mutation_query: extract page reading logic into separate method
2021-03-02 08:54:41 +02:00
Botond Dénes
257c295cff cql_query_test: add unit test for the more efficient range scan result format
The most user-visible aspect of this change is range scans which select
a small subset of the columns. These queries work as the user expects
them to work: unselected columns are not included in determining the
size of the result (or that of the page). This is the aspect this test
is checking for. While at it, also test single partition queries too.
2021-03-02 08:01:53 +02:00
Botond Dénes
af0a23e75c test/cql_test_env: do_with_cql_test_env(): add thread_attributes parameter
To allow conveniently setting the scheduling group `func` is to be run
in.
2021-03-02 07:53:53 +02:00
Botond Dénes
fe280271a6 cql_query_test: test_query_limit: clean up scheduling groups
Destroy scheduling groups created for this test, so other tests can
create scheduling groups with the same name, without conflicts.
2021-03-02 07:53:53 +02:00
Botond Dénes
f8ce168c8e storage_proxy: use query_data_on_all_shards() for data range scan queries
Currently range scans build their result using the `reconcilable_result`
format and then convert it to `query::result`. This is inefficient for
multiple reasons:
1) it introduces an additional intermediate result format and a
   subsequent conversion to the final one;
2) the reconcilable result format was designed for reconciliation so it
   contains all data, including columns unselected by the query, dead
   rows and tombstones, which takes much more memory to build;

There is no reason to go through all this trouble; if there ever was one
in the past, it no longer stands. So switch to the newly introduced
`query_data_on_all_shards()` when doing normal data range scans, but
only if all the nodes in the cluster support it, to avoid artificial
differences in page sizes due to how reconcilable result and
query::result calculate result size and the consequent false-positive
read repair.
The transition to this new more efficient method is coordinated by a
cluster feature and whether to use it is decided by the coordinator
(instead of each replica individually). This is to avoid needless
reconciliation due to the different page sizes the two formats will
produce.
2021-03-02 07:53:53 +02:00
Botond Dénes
f15551d23a query: partition_slice: add range_scan_data_variant option
Switching to the data variant of range scans has to be coordinated by
the coordinator to avoid replicas noticing the availability of the
respective feature at different times, resulting in some using the
mutation variant, some using the data variant.
So the plan is that it will be the coordinator's job to check the
cluster feature and set the option in the partition slice which will
tell the replicas to use the data variant for the query.
2021-03-02 07:53:53 +02:00
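The coordination scheme described in the partition_slice commit above can be sketched as follows: the coordinator alone checks the cluster feature and records its decision in the slice, and replicas only look at the slice. This is a simplified model, not Scylla code; `PartitionSlice`, `coordinator_build_slice` and `replica_result_format` are hypothetical names, and only the feature name and the two result format names come from the commit messages.

```python
from dataclasses import dataclass

FEATURE = "RANGE_SCAN_DATA_VARIANT"


@dataclass
class PartitionSlice:
    # Set by the coordinator; tells replicas which variant to use.
    range_scan_data_variant: bool = False


def coordinator_build_slice(node_features):
    """node_features maps node name -> set of feature names that node supports."""
    # The feature counts as enabled only when every node in the cluster supports it.
    enabled = all(FEATURE in feats for feats in node_features.values())
    return PartitionSlice(range_scan_data_variant=enabled)


def replica_result_format(partition_slice):
    # Replicas never consult the cluster feature themselves; they obey the slice,
    # so all replicas serving one query use the same variant.
    if partition_slice.range_scan_data_variant:
        return "query::result"
    return "reconcilable_result"


mixed = {"n1": {FEATURE}, "n2": set()}
upgraded = {"n1": {FEATURE}, "n2": {FEATURE}}
print(replica_result_format(coordinator_build_slice(mixed)))     # reconcilable_result
print(replica_result_format(coordinator_build_slice(upgraded)))  # query::result
```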
Botond Dénes
5c84aa52db gms: add RANGE_SCAN_DATA_VARIANT cluster feature
To control the transition to the data variant of range scans. As there
is a difference in how the data and mutation variants calculate page
sizes, the transition to the former has to happen in a controlled
manner, when all nodes in the cluster support it, to avoid artificial
differences in page content and subsequently triggering false-positive
read repair.
2021-03-02 07:53:53 +02:00
Botond Dénes
0f0c3be63e multishard_mutation_query: query_mutations_on_all_shards(): refuse reverse queries
Refuse reverse queries just like in the new
`query_data_on_all_shards()`. The reason is the same, reverse range
scans are not supported on the client API level and hence they are
underspecified and more importantly: not tested.
2021-03-02 07:53:53 +02:00
Botond Dénes
034cb81323 multishard_mutation_query: add query_data_on_all_shards()
A data query variant of the existing `query_mutations_on_all_shards()`.
This variant builds a `query::result`, instead of `reconcilable_result`.
This is actually the result format coordinators want when executing
range scans, the reason for using the reconcilable result for these
queries is historic, and it just introduces an unnecessary intermediate
format.
This new method allows the storage proxy to skip this intermediate
format and the associated conversion to `query::result`, just like we do
for single partition queries.

Reverse queries are refused because they are not supported on the client
API (CQL) level anyway and hence it is unspecified how they should work
and more importantly: they are not tested.
2021-03-02 07:53:53 +02:00
Botond Dénes
df0f501ba2 mutation_partition.cc: fix indentation
Left broken from the previous patch.
2021-03-02 07:53:53 +02:00
Botond Dénes
950150c6df query_result_builder: make it a public type
We will want to use it in multishard_mutation_query.cc.
2021-03-02 07:53:53 +02:00
Botond Dénes
f19ab5cff1 multishard_mutation_query: generalize query code w.r.t. the result builder used
We want to add support to building `query::result` directly and reuse
the code path we use to build reconcilable result currently for it.
So templatize said code path on the result builder used. Since the
different result builders don't have a source level compatible interface
an adaptor class is used.
2021-03-02 07:53:53 +02:00
Botond Dénes
bddb0d35d6 multishard_mutation_query: query_mutations_on_all_shards(): extract logic into new method
In the next patches we are going to generalize the query logic w.r.t.
the result builder used, so query_mutations_on_all_shards() will be just
a facade parametrizing the actual query code with the right result
builder.
2021-03-02 07:53:53 +02:00
Botond Dénes
b0b620b501 multishard_mutation_query: query_mutations_on_all_shards(): convert to coroutine
In preparation to generalizing it w.r.t. the result builder used.
This change will be much simpler with the coroutine code.
2021-03-02 07:53:53 +02:00
Botond Dénes
5d85615698 multishard_mutation_query: do_query_mutations(): convert to coroutine
In preparation to generalizing it w.r.t. the result builder used.
This change will be much simpler with the coroutine code.
2021-03-02 07:53:53 +02:00
Botond Dénes
8138bdb434 multishard_mutation_query: read_page(): convert to coroutine
In preparation to generalizing it w.r.t. the result builder used. This
change will be much simpler with the coroutine code.
2021-03-02 07:53:53 +02:00
Botond Dénes
29195f67f1 multishard_mutation_query: extract page reading logic into separate method
The block of code moved also coincides with the scope in which the
reader has to be alive, making the code more clear.
2021-03-02 07:53:53 +02:00
Benny Halevy
baf5d05631 storage_service: use atomic_vector for lifecycle_subscribers
So it can be modified while walked to dispatch
subscribed event notifications.

In #8143, there is a race between scylla shutdown and
notify_down(), causing use-after-free of cql_server.

Using an atomic vector instead and futurizing
unregister_subscriber allows deleting from _lifecycle_subscribers
while walked using atomic_vector::for_each.

Fixes #8143

Test: unit(release)
DTest:
  update_cluster_layout_tests:TestUpdateClusterLayout.add_node_with_large_partition4_test(release)
  materialized_views_test.py:TestMaterializedViews.double_node_failure_during_mv_insert_4_nodes_test(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-2-bhalevy@scylladb.com>
2021-03-01 20:34:42 +02:00
Benny Halevy
1ed04affab cql_server: event_notifier: unregister_subscriber in stop
Move unregister_subscriber from the destructor to stop
as preparation for moving storage_service lifecycle_subscribers
to atomic_vector and futurizing unregister_subscriber.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-1-bhalevy@scylladb.com>
2021-03-01 20:34:42 +02:00
Avi Kivity
fe8f9039a2 Update seastar submodule
* seastar 803e790598...ea5e529f30 (3):
  > Merge "Teach io_tester to generate YAML output" from Pavel E
  > bitset: set_range: mark constructor constexpr
  > Update dpdk submodule
2021-03-01 20:34:35 +02:00
Avi Kivity
8747c684e0 Merge 'Move timeouts to client state' from Piotr Sarna
This series is extracted from #7913 as it may prove useful to other series as well, and #7913 might take a while until it's merged, given that it also depends on other unmerged pull requests.

The idea of this series is to move timeouts to the client state, which will allow changing them independently for each session - e.g. by setting per-service-level timeouts and initializing the values from attached service levels (see #7867).

Closes #8140

* github.com:scylladb/scylla:
  treewide: remove timeout config from query options
  cql3: use timeout config from client state instead of query options
  cql3: use timeout config from client state instead of query options
  cql3: use timeout config from client state instead of query options
  service: add timeout config to client state
2021-03-01 20:34:35 +02:00
Raphael S. Carvalho
2cf0c4bbf1 compaction: Prevent cleanup and regular from compacting the same sstable
Due to a regression introduced by 463d0ab, regular compaction can compact, in
parallel, an sstable being compacted by cleanup, scrub or upgrade.

This redundancy wastes resources and increases both write amplification
and operation time.

It is also a potential source of data resurrection because the no-longer-owned
data from an sstable being compacted by both cleanup and regular will still exist
in the node afterwards, so resurrection can happen if the node regains ownership.

Fixes #8155.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com>
2021-03-01 20:34:35 +02:00
Tomasz Grabiec
cb0b8d1903 row_cache: Zap dummy entries when populating or reading a range
This will prevent accumulation of unnecessary dummy entries.

A single-partition populating scan with clustering key restrictions
will insert dummy entries positioned at the boundaries of the
clustering query range to mark the newly populated range as
continuous.

Those dummy entries may accumulate with time, increasing the cost of
the scan, which needs to walk over them.

In some workloads we could prevent this. If a populating query
overlaps with dummy entries, we could erase the old dummy entry since
it will not be needed, it will fall inside a broader continuous
range. This will be the case for time series workloads which scan with
a decreasing (newest) lower bound.

Refs #8153.

_last_row is now updated atomically with _next_row. Before, _last_row
was moved first. If an exception was thrown and the section was retried,
this could cause the wrong entry to be removed (new next instead of
old last) by the new algorithm. I don't think this was causing
problems before this patch.

The problem is not solved for all the cases. After this patch, we
remove dummies only when there is a single MVCC version. We could
patch apply_monotonically() to also do it, so that dummies which are
inside continuous ranges are eventually removed, but this is left for
later.

perf_row_cache_reads output after that patch shows that the second
scan touches no dummies:

$ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 265320
Scanning
read: 142.621613 [ms], preemption: {count: 639, 99%: 0.545791 [ms], max: 0.526929 [ms]}, cache: 0/0 [MB]
read: 0.023197 [ms], preemption: {count: 1, 99%: 0.035425 [ms], max: 0.032736 [ms]}, cache: 0/0 [MB]

Message-Id: <20210226172801.800264-1-tgrabiec@scylladb.com>
2021-03-01 20:34:35 +02:00
Tomasz Grabiec
761f89e55e api: Introduce system/drop_sstable_caches RESTful API
Evicts objects from caches which reflect sstable content, like the row
cache. In the future, it will also drop the page cache
and sstable index caches.

Unlike lsa/compact, doesn't cause reactor stalls.

The old lsa/compact call invokes memory reclamation, which is
non-preemptible. It also compacts LSA segments, so does more
work. Some use cases don't need to compact LSA segments, just want the
row cache to be wiped.

Message-Id: <20210301120211.36195-1-tgrabiec@scylladb.com>
2021-03-01 16:13:04 +02:00
Piotr Dulikowski
aa2df75321 commitlog: add an option to allow going over size limit
This commit adds an option which, when turned on, allows the commitlog
to go over configured size limit. After reaching the limit, commitlog
will be more conservative with its usage of the disk space - for
example, it won't increase the segment reserve size or reuse recycled
segments. Most importantly, it won't block writes until the space used
by the commitlog goes down.

This change is necessary for hinted handoff to keep its current
behavior. Hinted handoff does not let the commitlog free segments
itself - instead, it re-creates it every 10 seconds and manually deletes
segments after all hints are sent from a segment.
2021-03-01 14:16:05 +01:00
Takuya ASADA
d0297c599a dist: tune fs.aio-max-nr based on the number of cpus
Current aio-max-nr is set up statically to 1048576 in
/etc/sysctl.d/99-scylla-aio.conf.
This is sufficient for most use cases, but falls short on larger machines
such as i3en.24xlarge on AWS that has 96 vCPUs.

We need to tune the parameter based on the number of cpus, instead of
static setting.

Fixes #8133

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes #8188
2021-03-01 14:18:24 +02:00
Avi Kivity
31909515b3 Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that keeps all its
versions that are referenced somewhere and provides a way of getting
a reference to an immutable version of the set.
Each sstable in the set is associated with the versions it is alive in,
and is removed when all such versions don't have references anymore.
To avoid copying, the object holding all sstables in the set version is
changed to a new structure, sstable_list, which was previously an alias
for std::unordered_set<shared_sstable>, and which implements most of the
methods of an unordered_set, but its iterator uses the actual set with
all sstables from all referenced versions and iterates over those
sstables that belong to the captured version.
The methods that modify the set's contents give a strong exception guarantee
by trying to insert new sstables into its containers, and erasing them in
the case of a caught exception.
To release shared_sstables as soon as possible (i.e. when all references
to versions that contain them die), each time a version is removed, all
sstables that were referenced exclusively by this version are erased. We
are able to find these sstables efficiently by storing, for each version,
all sstables that were added and erased in it, and, when a version is
removed, merging it with the next one. When a version that adds an sstable
gets merged with a version that removes it, this sstable is erased.

Fixes #2622

Signed-off-by: Wojciech Mitros wojciech.mitros@scylladb.com

Closes #8111

* github.com:scylladb/scylla:
  sstables: add test for checking the latency of updating the sstable_set in a table
  sstables: move column_family_test class from test/boost to test/lib
  sstables: use fast copying of the sstable_set instead of rebuilding it
  sstables: replace the sstable_set with a versioned structure
  sstables: remove potential ub
  sstables: make sstable_set constructor less error-prone
2021-03-01 14:16:36 +02:00
Avi Kivity
ef97adc72a Merge "Validate token monotonicity on the sstable write path" from Botond
"
We have recently seen out-of-order partitions getting into sstables
causing major disruption later on. Given the damage caused, it was again
raised that we should enable partition key monotonicity validation
unconditionally in the sstable write path. This was also raised in the
past but dismissed as key validation was suspected (but not measured) to
add considerable per-fragment overhead. One of the problems was that the
key monotonicity validation was all or nothing. It either validated all
(clustering and partition) key monotonicity or none of it.
This series takes a second look at this and solves the all-or-nothing
problem by making the configuration of the key monotonicity check more
fine grained, allowing for enabling just token monotonicity validation
separately, then enables it unconditionally.

Refs: #7623

Tests: unit(release)
"

* 'sstable-writer-validate-partition-keys-unconditionally/v3' of https://github.com/denesb/scylla:
  sstables: enable token monotonicity validation by default
  mutation_fragment_stream_validator: add token validation level
  mutation_fragment_stream_validating_filter: make validation levels more fine-grained
2021-03-01 11:23:51 +02:00
Amnon Heiman
0595596172 api/compaction_manager: add the compaction id in get_compaction
This patch adds the compaction id to the get_compaction structure.
While it was supported, it was not used and up until now wasn't needed.

After this patch a call to curl -X GET 'http://localhost:10000/compaction_manager/compactions'
will include the compaction id.

Relates to #7927

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #8186
2021-03-01 10:51:31 +02:00
Piotr Sarna
7936652322 db,view: improve verbosity of errors coming from view updates
The error now contains information about the view table that failed,
as well as base and view tokens.
Example:
view - Error applying view update to 127.0.0.1 (view: ks.testme_v_idx_index,
        base token: -4069959284402364209, view token: -3248873570005575792): std::runtime_error (manually injected error)

Fixes #8177

Closes #8178
2021-03-01 10:46:14 +02:00
Avi Kivity
86d8977c96 Update tools/python3 submodule
* tools/python3 199ac90...6f3bcbe (2):
  > Add support pip modules
  > create-relocatable-package.py: add support python libraries in /usr/local
2021-03-01 10:10:13 +02:00
Avi Kivity
8ac0d6d15d Update tools/jmx submodule
* tools/jmx bf8bb16...8073af6 (1):
  > CompactionManager: add the compaction id when available

Fixes #7927.
2021-03-01 10:09:16 +02:00
Takuya ASADA
4cf9b6988e scylla_coredump_setup: don't run apt-get when systemd-coredump is already installed
Check systemd-coredump existence before running apt-get install
systemd-coredump.

Closes #8185
2021-03-01 09:38:51 +02:00
Botond Dénes
f0b284dab8 sstables: enable token monotonicity validation by default
Partition key order violations in data written to sstables can be very
disruptive. All our components in the storage layers assume that
partitions are in order, which means that reading out-of-order
partitions triggers undefined behaviour. Computer scientists often joke
that undefined behaviour can erase your hard drive and in this case the
damage done by undefined behaviour caused by out-of-order partitions is
very close to that. The corruption is known to mutate, causing crashes,
corrupting more data and even losing data. For this reason it is
imperative that out-of-order partitions cannot get into sstables. This
patch enables token monotonicity validation unconditionally in
the sstable writer. As partition key monotonicity checks involve a key
copy per partition, which might have an impact on the performance, we do
the next best thing instead and enable only token monotonicity
validation.
2021-03-01 07:49:23 +02:00
Botond Dénes
727bc0f5d4 mutation_fragment_stream_validator: add token validation level
In some cases the full-blown partition key validation and especially the
associated key copy per partition might be deemed too costly. As a next
best thing this patch adds a token only validation, which should cover
99% (number pulled out of my sleeve) of the cases. Let's hope no one
gets unlucky.
2021-03-01 07:49:23 +02:00
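The token-only validation described in the two commits above can be illustrated with a small sketch. It is not the mutation_fragment_stream_validator code; `first_out_of_order` is a hypothetical helper, and tokens are modelled as plain integers. Equal consecutive tokens are accepted here, on the assumption that distinct partition keys may hash to the same token, which is also why a token-only check cannot catch every ordering violation.

```python
def first_out_of_order(tokens):
    """Return the index of the first token smaller than its predecessor, or None.

    Hypothetical helper for illustration; tokens are modelled as integers.
    """
    prev = None
    for i, tok in enumerate(tokens):
        # Only a strictly decreasing step is a violation; equal tokens pass.
        if prev is not None and tok < prev:
            return i
        prev = tok
    return None


print(first_out_of_order([-5, 0, 0, 12]))  # None
print(first_out_of_order([-5, 7, 3, 12]))  # 2
```

Only the previous token has to be remembered, so no per-partition key copy is needed; that is the cost saving the commit messages refer to.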
Botond Dénes
694f8a4ec6 mutation_fragment_stream_validating_filter: make validation levels more fine-grained
Currently key order validation for the mutation fragment stream
validating filter is all or nothing. Either no keys (partition or
clustering) are validated or all of them. As we suspect that clustering
key order validation would add a significant overhead, this discourages
turning key validation on, which means we miss out on partition key
monotonicity validation which has a much more moderate cost.
This patch makes this configurable in a more fine-grained fashion,
providing separate levels for partition and clustering key monotonicity
validation.

As the choice for the default validation level is not as clear-cut as
before, the default value for the validation level is removed in the
validating filter's constructor.
2021-03-01 07:49:23 +02:00
Avi Kivity
3cd2f00438 dht: convert token tri_compare to std::strong_ordering
Change token's tri_compare functions to return std::strong_ordering,
which is not convertible to bool and therefore not susceptible to
being misused where a less-compare is expected.

Two of the users (ring_position and decorated_key) have to undo
the conversion, since they still return int. A follow up will
convert them too.

Ref #1449.
2021-02-28 21:03:59 +02:00
Avi Kivity
d3d7698502 interval: support C++20 three-way comparisons
Allow the tri-comparator input to range functions to return
std::strong_ordering, e.g. the result of operator<=>. An int
input is still allowed, and coerced to std::strong_ordering by
tri-comparing it against zero. Once all users are converted, this
will be disallowed.

The clever code that performs boundary comparisons unfortunately
has to be dumbed down to conditionals. A helper
require_ordering_and_on_equal_return() is introduced that accepts
a comparison result between bound values, an expected comparison
result, and what to return if the bound value matches (this depends
on whether individual bounds are exclusive or inclusive, on
whether the bounds are start bounds or end bounds, and on the
sense of the comparison).

Unfortunately, the code is somewhat pessimized, and there is no
way to optimize it as the enum underlying std::strong_ordering
is hidden.
2021-02-28 21:03:25 +02:00
Avi Kivity
d980f550d1 Merge 'row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows' from Tomasz Grabiec
fill_buffer() will keep scanning until _lower_bound_changed is true,
even if preemption is signaled, so that the reader makes forward
progress.

Before the patch, we did not update _lower_bound on touching a dummy
entry. The read will not respect preemption until we hit a non-dummy
row. If there are a lot of dummy rows, that can cause reactor stalls.

Fix that by updating _lower_bound on dummy entries as well.

Refs #8153.

Tested with perf_row_cache_reads:

```
$ build/release/test/perf/perf_row_cache_reads -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB]
read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB]
```

Notice that max preemption latency is low in the second "read:" line.

Closes #8167

* github.com:scylladb/scylla:
  row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows
  tests: perf: Introduce perf_row_cache_reads
  row_cache: Add metric for dummy row hits
2021-02-28 21:00:20 +02:00
Botond Dénes
1d9b5911fe time_series_sstable_set::create_single_key_sstable_reader(): fix use-after-free
The optimal path of said method mistakenly captures `pos` (a local
variable) in its reader factory method and passes a temporary range
implicitly constructed from said `pos` as the range parameter to the
sstable reader. This will lead to the sstable reader using a dangling
range and will result in returning no result for queries. This patch
fixes this bug and adds a unit test to cover this code path.

Fixes #8138.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226143111.104591-2-bdenes@scylladb.com>
2021-02-26 23:57:25 +02:00
Botond Dénes
dd5a601aaa result_memory_accounter: abort unpaged queries hitting the global limit
The `result_memory_accounter` terminates a query if it reaches either
the global or shard-local limit. This used to be so only for paged
queries, unpaged ones could grow indefinitely (until the node OOM'd).
This was changed in fea5067 which enforces the local limit on unpaged
queries as well, by aborting them. However a loophole remained in the
code: `result_memory_accounter::check_and_update()` has another stop
condition, besides `check_local_limit()`, it also checks the global
limit. This stop condition was not updated to enforce itself on unpaged
queries by aborting them, instead it silently terminated them, causing
them to return less data than requested. This was masked by most queries
reaching the local limit first.
This patch fixes this by aborting unpaged mutation queries when they hit
the global limit.
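
The intended behaviour can be sketched as follows. This is a simplified model, not the code of `result_memory_accounter`: the function signature, the exception type and the plain integer counters are all assumptions made for illustration.

```python
class QueryAborted(Exception):
    """Hypothetical error raised when an unpaged query exceeds a memory limit."""


def check_and_update(local_used, local_limit, global_used, global_limit, paged):
    """Return True to keep reading, False to end the current page early.

    An unpaged query has no next page in which to return the rest of the
    data, so reaching either limit aborts it instead of truncating it.
    """
    over_limit = local_used >= local_limit or global_used >= global_limit
    if not over_limit:
        return True
    if paged:
        return False  # the client fetches the remainder with the next page
    raise QueryAborted("unpaged query reached a result memory limit")
```

Before the fix, the global-limit branch behaved like the paged case for unpaged queries too, which is how they ended up returning partial results.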

Fixes: #8162

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226102202.51275-1-bdenes@scylladb.com>
2021-02-26 23:43:16 +02:00
Botond Dénes
bc1fcd3db2 multishard_combining_reader: only read from needed shards
The multishard combining reader currently assumes that all shards have
data for the read range. This however is not always true and in extreme
cases (like reading a single token) it can lead to huge read
amplification. Avoid this by not pushing shards to
`_shard_selection_min_heap` if the first token they are expected to
produce falls outside of the read range. Also change the read ahead
algorithm to select the shards from `_shard_selection_min_heap`, instead
of walking them in shard order. This was wrong in two ways:
* Shards may be ordered differently with respect to the first partition
  they will produce; reading ahead on the next shard in shard order
  might not bring in data on the next shard the read will continue on.
  Shard order is only correct when starting a new range and shards are
  iterated over in the order they own tokens according to the sharding
  algorithm.
* Shards that may not have data relevant to the read range are also
  considered for read ahead.

After this patch, the multishard reader will only read from shards that
have data relevant to the read range, both in the case of normal reads
and also for read-ahead.
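
A rough sketch of the selection logic, assuming integer tokens and a hypothetical map from each shard to the first token it is expected to produce at or after the start of the range. The real reader works on `_shard_selection_min_heap`; everything else here is made up for illustration.

```python
import heapq


def shards_to_read(first_token_by_shard, range_end):
    """Return the shards the read will visit, in visiting order.

    Shards whose first expected token lies past the end of the read range
    are never pushed onto the heap, so they are neither read nor
    considered for read-ahead.
    """
    heap = [(token, shard)
            for shard, token in first_token_by_shard.items()
            if token <= range_end]
    heapq.heapify(heap)
    order = []
    while heap:
        _, shard = heapq.heappop(heap)
        order.append(shard)
    return order


order = shards_to_read({0: 900, 1: 120, 2: 50, 3: 4000}, range_end=1000)
print(order)
```

In this example shard 3 is skipped entirely, and read-ahead would follow the heap order (2, 1, 0) instead of plain shard order (0, 1, 2, 3).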

Fixes: #8161

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226132536.85438-1-bdenes@scylladb.com>
2021-02-26 23:29:20 +02:00
Piotr Sarna
0e0282cdf1 Merge ' cdc: move (most of) CDC generation management to a new service' from Kamil Braun
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.

This PR introduces a new service crafted to handle CDC generation
management: listening and reacting to generation changes in the cluster.

We plug the service in, initializing it in main and test code,
passing a reference to storage_service and having storage_service call
the service (using the `after_join` method): the service only starts
doing its job after the node joins the token ring (either on bootstrap
or restart).

Some parts of generation management still remain in storage_service:
the bootstrap procedure, which happens inside storage_service,
must also do some initialization regarding CDC generations,
for example: on restart it must retrieve the latest known generation
timestamp from disk; on bootstrap it must create a new generation
and announce it to other nodes. The order of these operations w.r.t
the rest of the startup procedure is important, hence the startup
procedure is the only right place for them. We may try decoupling
these services even more in follow-up PRs, but that requires a bit
of careful reasoning. What this PR does is a low-hanging fruit.

Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these handling functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.

This PR is a prerequisite to fixing #7985. The fact that all the CDC generation
management code was in storage_service is technical debt. It will be easier
to modify the management algorithms when they sit in their own module.

Tests: unit (dev) and cdc_tests.py dtest (dev), and local replication test using scylla-cdc-java

Closes #8172

* github.com:scylladb/scylla:
  cdc: move (most of) CDC generation management code to the new service
  cdc: coroutinize make_new_cdc_generation
  cdc: coroutinize update_streams_description
  cdc: introduce cdc::generation_service
  main: move cdc_service initialization just prior to storage_service initialization
2021-02-26 12:42:27 +01:00
Kamil Braun
e2f03e4aba cdc: move (most of) CDC generation management code to the new service
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.

Previous commits have introduced a new service for managing CDC
generations. This code moves most of the relevant code to this new
service.

However, some part still remains in storage_service: the bootstrap
procedure, which happens inside storage_service, must also do some
initialization regarding CDC generations, for example: on restart it
must retrieve the latest known generation timestamp from disk; on
bootstrap it must create a new generation and announce it to other
nodes. The order of these operations w.r.t the rest of the startup
procedure is important, hence the startup procedure is the only right
place for them.

Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.
2021-02-26 12:06:12 +01:00
Tomasz Grabiec
b9c3b6c10f row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows
fill_buffer() will keep scanning until _lower_bound_changed is true,
even if preemption is signalled, so that the reader makes forward
progress.

Before the patch, we did not update _lower_bound on touching a dummy
entry. The read will not respect preemption until we hit a non-dummy
row. If there are a lot of dummy rows, that can cause reactor stalls.

Fix that by updating _lower_bound on dummy entries as well.

Refs #8153.

Tested with perf_row_cache_reads:

$ build/release/test/perf/perf_row_cache_reads -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB]
read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB]

Notice that max preemption latency is low in the second "read:" line.
2021-02-26 01:20:38 +01:00
Tomasz Grabiec
52e411df36 tests: perf: Introduce perf_row_cache_reads
Tests performance of various read patterns from the row cache.

Example:

$ build/release/test/perf/perf_row_cache_reads_g  -c1 -m200M
Filling memtable
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 156.288986 [ms], preemption: {count: 702, 99%: 0.545791 [ms], max: 0.537537 [ms]}, cache: 99/100 [MB]
read: 106.480766 [ms], preemption: {count: 6, 99%: 0.006866 [ms], max: 106.496168 [ms]}, cache: 99/100 [MB]
2021-02-26 01:20:38 +01:00
Tomasz Grabiec
f0a3272a5f row_cache: Add metric for dummy row hits
This will help to diagnose performance problems related to the read
having to walk through a lot of dummy rows to fill the buffer.

Refs #8153
2021-02-25 18:26:01 +01:00
Piotr Sarna
c5214eb096 treewide: remove timeout config from query options
Timeout config is now stored in each connection, so there's no point
in tracking it inside each query as well. This patch removes
timeout_config from query_options and follows by removing now
unnecessary parameters of many functions and constructors.
2021-02-25 17:20:27 +01:00
Piotr Sarna
f973e09454 cql3: use timeout config from client state instead of query options
... in batch statement, in order to be able to remove the timeout
from query options later.
2021-02-25 17:20:27 +01:00
Piotr Sarna
639d90d2d6 cql3: use timeout config from client state instead of query options
... in modification statement, in order to be able to remove the timeout
from query options later.
2021-02-25 17:20:27 +01:00
Piotr Sarna
b71665efe8 cql3: use timeout config from client state instead of query options
... in select statement, in order to be able to remove the timeout
from query options later.
2021-02-25 17:20:27 +01:00
Piotr Sarna
7ceafda70a service: add timeout config to client state
Future patches will use this per-connection timeout config
to allow setting different timeouts for each session,
based on roles.
2021-02-25 17:20:26 +01:00
Takuya ASADA
aabc67e386 dist/debian: don't run dh_installinit for scylla-node-exporter when service name == package name
dh_installinit --name <service> forces installation of debian/*.service
and debian/*.default files whose names do not match the package name.
If we have subpackages, the packager is responsible for renaming
debian/*.service to debian/<subpackage>.*service.

However, we are currently mistakenly running
dh_installinit --name scylla-node-exporter for
debian/scylla-node-exporter.service.
The packaging system tries to find the destination package for the .service
file, does not find a subpackage name in it, and so picks the first
subpackage ordered by name, scylla-conf.

To solve the issue, we just need to run dh_installinit without --name
when $product == 'scylla'.

Fixes #8163

Closes #8164
2021-02-25 17:05:20 +02:00
Avi Kivity
032fdfe855 Update seastar submodule
* seastar e53a1059f9...803e790598 (9):
  > io_queue: Count total time spent in the queue
  > io_queue: Fix "delay" metrics
Fixes #8166.
  > file: expose disk offset alignment for overwrites
Ref #7663.
  > RPC: (client) retain local address and use on stream creation
  > rpc: sink_impl: align _{last,next}_seq_num to cache-line size
  > reactor: Fix outdated comment
  > fair_queue: Remove now dead ticket strictly_less method
  > io_queue: Double max request size
  > bitsets: set_iterator: correctly implement pre- and post-increment operators
2021-02-25 16:58:06 +02:00
Takuya ASADA
f3a82f4685 scylla_setup: allow running scylla_setup with strict umask setting
We currently deny running scylla_setup when umask != 0022.
To remove this limitation, run os.chmod(0o644) on every file creation
to allow reading from scylla user.
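
A minimal sketch of the approach on a POSIX system: the file is created under a strict umask and then explicitly set to 0644 with `os.chmod(path, mode)`. The helper name and the file name are hypothetical; scylla_setup's own helpers are not shown here.

```python
import os
import tempfile


def write_world_readable(path, content):
    """Write a file, then force mode 0644 regardless of the current umask."""
    with open(path, "w") as f:
        f.write(content)
    os.chmod(path, 0o644)


old_umask = os.umask(0o077)  # simulate a strict umask
try:
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "example.conf")
        write_world_readable(path, "key: value\n")
        mode = os.stat(path).st_mode & 0o777
finally:
    os.umask(old_umask)

print(oct(mode))
```

Without the explicit chmod, the file would be created as 0600 under this umask and a non-root user could not read it.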

Note that perftune.yaml does not really need to be set to 0644, since
perftune.py runs as root, but we set it anyway to align its permissions
with the other files.

Fixes #8049

Closes #8119
2021-02-25 16:42:45 +02:00
Nadav Har'El
750d7903be cql-pytest: fix some comments in util.py
Fix some incorrect comments, pasted from other files or mentioning
wrong names. No other changes except comments

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210225133237.1403891-1-nyh@scylladb.com>
2021-02-25 16:00:20 +02:00
Raphael S. Carvalho
7bf0744d36 reshape/TWCS: Fix off-by-one in threshold check
A given time bucket should also be reshaped if its # of sstables
has reached the threshold.
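
In other words, the comparison has to be inclusive. A one-line sketch, with a hypothetical function name and assuming the earlier check used a strict comparison:

```python
def bucket_needs_reshape(sstable_count, threshold):
    """A time bucket qualifies once its sstable count reaches the threshold."""
    return sstable_count >= threshold  # a strict '>' would miss count == threshold
```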

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210223182634.570648-1-raphaelsc@scylladb.com>
2021-02-24 15:12:40 +02:00
Raphael S. Carvalho
21608bd677 sstables: Fix TWCS reshape for windows with at least min_threshold sstables
TWCS reshape was silently ignoring windows which contain at least
min_threshold sstables (can happen with data segregation).
When resizing candidates, size of multi_window was incorrectly used and
it was always empty in this path, which means candidates was always
cleared.

Fixes #8147.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>
2021-02-24 15:11:19 +02:00
Tomasz Grabiec
ecb6c56a2a Merge 'lsa: background reclaim' from Avi Kivity
This series adds background reclaim to lsa, with the goal
that most large allocations can be satisfied from available
free memory, and reclaim work can be done from a preemptible
context.

If the workload has free cpu, then background reclaim will
utilize that free cpu, reducing latency for the main workload.
Otherwise, background reclaim will compete with the main
workload, but since that work needs to happen anyway,
throughput will not be reduced.

A unit test is added to verify it works.

Fixes #1634.

Closes #8044

* github.com:scylladb/scylla:
  test: logalloc_test: test background reclaim
  logalloc: reduce gap between std min_free and logalloc min_free
  logalloc: background reclaim
  logalloc: preemptible reclaim
2021-02-24 13:23:30 +01:00
Piotr Sarna
25f47561cb transport: fix an outdated comment
The comment mentions calling a lambda in-place, but the lambda
is no longer there since 2019!

Message-Id: <3903c84d5c151415409f28935e328b552dd548f8.1614155567.git.sarna@scylladb.com>
2021-02-24 11:14:01 +02:00
Avi Kivity
15d3797e97 test: logalloc_test: test background reclaim
Test that the background reclaimer is able to compete with a
fake load and reclaim 10 MB/s. The test is quite stressful as the "LRU"
is fully randomized.

If the background reclaimer is disabled, the test fails as soon as the
20MB "gap" is exhausted. With the reclaimer enabled, it is able to
free memory ahead of the allocations.
2021-02-23 19:42:42 +02:00
Nadav Har'El
d905e71a90 Alternator: add support for CORS protocol
This patch adds to Alternator support for the CORS (Cross-Origin Resource
Sharing) protocol - a simple extension over the HTTP protocol which
browsers use when Javascript code contacts HTTP-based servers.

Although we usually think of Alternator as being used in a three-tier
application, in some setups there is no middle layer and the user's
browser, running Javascript code, wants to communicate directly with the
database. However, for security reasons, by default Javascript loaded
from domain X is not allowed to communicate with different domains Y.
The CORS protocol is meant to allow this, and Alternator needs to
participate in this protocol if it is to be used directly from Javascript
in browsers.

To implement CORS, Alternator needs to respond to the OPTIONS method
which it didn't allow before - with certain headers based on the
input headers. It also needs to do some of these things for the
regular methods (mostly, POST). The patch includes a comprehensive
test that runs against both Alternator and DynamoDB and shows that
Alternator handles these headers and methods the same as DynamoDB.
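
For readers unfamiliar with CORS, the following is a generic sketch of how a permissive server might answer an OPTIONS preflight request. The header names are the standard CORS ones; the policy (allow any origin, echo the requested method and headers, cache for a day) is an assumption for illustration and is not claimed to be what Alternator or DynamoDB return.

```python
def preflight_response_headers(request_headers):
    """Build CORS response headers for an OPTIONS preflight request."""
    if "Origin" not in request_headers:
        return {}  # not a CORS request, nothing to add

    response = {
        "Access-Control-Allow-Origin": "*",  # assumed policy: any origin
        "Access-Control-Max-Age": "86400",   # assumed cache lifetime (seconds)
    }
    requested_method = request_headers.get("Access-Control-Request-Method")
    if requested_method is not None:
        response["Access-Control-Allow-Methods"] = requested_method
    requested_headers = request_headers.get("Access-Control-Request-Headers")
    if requested_headers is not None:
        response["Access-Control-Allow-Headers"] = requested_headers
    return response
```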

Additionally, I tested manually a Javascript DynamoDB client - which
didn't work prior to this patch (the browser reported CORS errors),
and works after this patch.

Fixes #8025.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210217222027.1219319-1-nyh@scylladb.com>
2021-02-23 13:15:03 +01:00
Asias He
7018377bd7 messaging_service: Move gossip ack message verb to gossip group
Fix a scheduling group leak:

INFO [shard 0] gossip - gossiper::run sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack_msg sg=statement
INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip

After the fix:

INFO [shard 0] gossip - gossiper::run sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_syn_msg sg=gossip
INFO [shard 0] gossip - gossiper::handle_ack2_msg sg=gossip

Fixes #7986

Closes #8129
2021-02-23 10:10:00 +02:00
Tomasz Grabiec
fb1d3fe2cf table: Fix schema mismatch between memtable reader and sstable writer
The schema used to create the sstable writer has to be the same as the
schema used by the reader, as the former is used to interpret mutation
fragments produced by the reader.

Commit 9124a70 introduced a deferring point between reader creation
and writer creation which can result in schema mismatch if there was a
concurrent alter.

This could cause the sstable writer to crash or to generate a corrupted
sstable.

Fixes #7994

Message-Id: <20210222153149.289308-1-tgrabiec@scylladb.com>
2021-02-22 17:51:00 +02:00
Raphael S. Carvalho
81d773e5d8 compaction_manager: Redefine weight for better control of parallel compactions
Compaction manager allows compaction of different weights to proceed in
parallel. For example, a small-sized compaction job can happen in parallel to a
large-sized one, but similar-sized jobs are serialized.

The problem is the current definition of weight, which is the log (base 4) of
total size (size of all sstables) of a job.

This is what we get with the current weight definition:
    weight=5	for sizes=[1K, 3K]
    weight=6	for sizes=[4K, 15K]
    weight=7	for sizes=[16K, 63K]
    weight=8	for sizes=[64K, 255K]
    weight=9	for sizes=[258K, 1019K]
    weight=10	for sizes=[1M, 3M]
    weight=11	for sizes=[4M, 15M]
    weight=12	for sizes=[16M, 63M]
    weight=13	for sizes=[64M, 254M]
    weight=14	for sizes=[256M, 1022M]
    weight=15	for sizes=[1033M, 4078M]
    weight=16	for sizes=[4119M, 10188M]
    total weights: 12

Note that for jobs smaller than 1MB, we have 5 different weights, meaning 5
jobs smaller than 1MB could proceed in parallel. High number of parallel
compactions can be observed after repair, which potentially produces tons of
small sstables of varying sizes. That causes compaction to use a significant
amount of resources.

To fix this problem, let's add a fixed tax to the size before taking the log,
so that jobs smaller than 1M will all have the same weight.

Look at what we get with the new weight definition:
    weight=10	for sizes=[1K, 2M]
    weight=11	for sizes=[3M, 14M]
    weight=12	for sizes=[15M, 62M]
    weight=13	for sizes=[63M, 254M]
    weight=14	for sizes=[256M, 1022M]
    weight=15	for sizes=[1033M, 4078M]
    weight=16	for sizes=[4119M, 10188M]
    total weights: 7
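
The two tables can be reproduced with a short sketch. The value of the tax (1 MiB) is an assumption inferred from the tables above, not quoted from the patch, and the function names are hypothetical. Integer arithmetic is used for the logarithm to avoid floating-point rounding at exact powers of 4.

```python
FIXED_SIZE_TAX = 1024 * 1024  # assumed: 1 MiB, inferred from the tables


def floor_log4(n):
    """Exact floor(log base 4 of n) for n >= 1."""
    return (n.bit_length() - 1) // 2


def old_weight(total_size):
    return floor_log4(total_size)


def new_weight(total_size):
    # Adding the tax before taking the log collapses all small jobs
    # into a single weight.
    return floor_log4(total_size + FIXED_SIZE_TAX)


KB, MB = 1024, 1024 * 1024
for size in (1 * KB, 4 * KB, 63 * KB, 1 * MB, 2 * MB, 3 * MB, 62 * MB, 63 * MB):
    print(size, old_weight(size), new_weight(size))
```

With these definitions a 1K job and a 2M job both get weight 10, so they are serialized instead of running in parallel.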

Fixes #8124.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210217123022.241724-1-raphaelsc@scylladb.com>
2021-02-22 15:50:29 +02:00
Asias He
554ab035dd main: Run init_server and join_cluster inside maintenance scheduling group
Currently, init_server and join_cluster which initiate the bootstrap and
replace operations on the new node run inside the main scheduling group.
We should run them inside the maintenance scheduling group to reduce the
impact on the user workload.

This patch fixes a scheduling group leak for bootstrap and replace operation.

Before:
[shard 0] storage_service - storage_service::bootstrap sg=main
[shard 0] repair - bootstrap_with_repair sg=main

After:
[shard 0] storage_service - storage_service::bootstrap sg=streaming
[shard 0] repair - bootstrap_with_repair sg=streaming

Fixes #8130

Closes #8131
2021-02-22 14:55:02 +02:00
Michał Chojnowski
a24f83852e atomic_cell: fix operator<< for atomic_cell_or_collection
operator<< used the wrong criterion for deciding whether the data is
stored as atomic_cell or collection_mutation, resulting in
catastrophic failure if it was used with frozen collections or UDTs.
Since frozen collections and UDTs are stored as atomic_cell, not
collection_mutation, the correct criterion is not is_collection(),
but is_multi_cell().
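
The difference between the two predicates can be illustrated with a toy model. The `ColumnType` class below is hypothetical and heavily simplified; in particular, treating a non-frozen UDT as multi-cell is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnType:
    kind: str                # "scalar", "collection" (list/set/map) or "udt"
    is_frozen: bool = False

    def is_collection(self):
        return self.kind == "collection"

    def is_multi_cell(self):
        # Only non-frozen collections/UDTs are split into multiple cells.
        return self.kind in ("collection", "udt") and not self.is_frozen


def stored_as(column_type):
    """Pick the representation based on is_multi_cell(), not is_collection()."""
    return "collection_mutation" if column_type.is_multi_cell() else "atomic_cell"


frozen_list = ColumnType("collection", is_frozen=True)
print(frozen_list.is_collection(), stored_as(frozen_list))
```

A frozen list is a collection, yet it is stored as an atomic_cell, which is the case the old criterion got wrong.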

Closes #8134
2021-02-22 14:45:34 +02:00
Kamil Braun
022d7773f4 cdc: coroutinize make_new_cdc_generation 2021-02-22 12:47:44 +01:00
Kamil Braun
26ca9d6c33 cdc: coroutinize update_streams_description 2021-02-22 12:46:53 +01:00
Kamil Braun
d4937daaea cdc: introduce cdc::generation_service
This commit introduces a new service crafted to handle CDC generation
management: listening and reacting to generation changes in the cluster.

The implementation is a stub for now, the service reacts to generation
changes by simply logging the event.

The commit plugs the service in, initializing it in main and test code,
passing a reference to storage_service and having storage_service start
the service (using the `after_join` method): the service only starts
doing its job after the node joins the token ring (either on bootstrap
or restart).
2021-02-22 12:45:43 +01:00
Kamil Braun
8e72c33d7c main: move cdc_service initialization just prior to storage_service
initialization

As a preparation for introducing CDC generation management service.

cdc_service will depend on the generation service.
But the generation service needs some other services to work
properly. In particular, it uses the local database, so it should be
initialized after the local database.

The only service that will need the cdc generation service is
storage_service, so we can place the generation service initialization
code right before storage_service initialization code. So the order will
be cdc_generation_service -> cdc_service -> storage_service.
2021-02-22 12:43:10 +01:00
Liu Lan
d2378129a3 docs: fix invalid path in README.mds
Signed-off-by: Liu Lan <liulan_yewu@cmss.chinamobile.com>

Closes #8126
2021-02-21 13:49:12 +02:00
Konstantin Osipov
95ee8e1b90 raft: fix spelling
Fix spelling of a few comments.
2021-02-19 22:56:26 +03:00
Pekka Enberg
d483922671 Update tools/java submodule
* tools/java 0187829d5e...142f517a23 (2):
  > nodetool: Enable resetlocalschema
  > sstableloader: Make progress printout less eager.
2021-02-19 12:37:04 +02:00
Avi Kivity
78d1afeabd Merge "Use radix tree to store cells on a row" from Pavel E
"
Current storage of cells in a row is a union of vector and set. The
vector holds 5 cell_and_hash's inline, up to 32 ones in the external
storage and then it's switched to std::set. Once switched, the whole
union becomes a waste of space, as its size is

   sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes

and only 3 pointers from it are used (std::set header). Also the
overhead to keep cell_and_hash as a set entry is more than the size
of the structure itself.

Column ids are 32-bit integers that most likely come sequentially.
For this kind of a search key a radix tree (with some care for
non-sequential cases) can be beneficial.

This set introduces a compact radix tree, that uses 7-bit sub values
from the search key to index on each node and compacts the nodes
themselves for better memory usage. Then the row::_storage is replaced
with the new tree.

The most notable result is the memory footprint decrease, for wide
rows by up to a factor of 2. The performance of micro-benchmarks is a bit
lower for small rows and (!) higher for longer ones (8+ cells). The numbers
are in patch #12 (spoiler: they are better than for v2)

v3:
- trimmed size of radix down to 7 bits
- simplified the nodes layouts, now there are 2 of them (was 4)
- enhanced perf_mutation to test N-cells schema
- added AVX intra-nodes search for medium-sized nodes
- added .clone_from() method that helped to improve perf_mutation
- minor
  - changed functions not to return values via refs-arguments
  - fixed nested classes to properly use language constructors
  - renamed index_to to key_t to distinguish from node_index_t
  - improved recurring variadic templates not to use sentinel argument
  - use standard concepts

v2:
- fixed potential mis-compilation due to strict-aliasing violation
- added oracle test (radix tree is compared with std::map)
- added radix to perf_collection
- cosmetic changes (concepts, comments, names)

A note on item 1 from v2 changelog. The nodes are no longer packed
perfectly, each has grown 3 bytes. But it turned out that when used
as cells container most of this growth drowned in lsa alignments.

next todo:
- aarch64 version of 16-keys node search

tests: unit(dev), unit(debug for radix*), perf(dev)
"

* 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla:
  test/memory_footpring: Print radix tree node sizes
  row: Remove old storages
  row: Prepare row::equal for switch
  row: Prepare row::difference for switch
  row: Introduce radix tree storage type
  row-equal: Re-declare the cells_equal lambda
  test: Add tests for radix tree
  utils: Compact radix tree
  array-search: Add helpers to search for a byte in array
  test/perf_collection: Add callback to check the speed of clone
  test/perf_mutation: Add option to run with more than 1 columns
  test/perf_mutation: Prepare to have several regular columns
  test/perf_mutation: Use builder to build schema
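
The cover letter above describes indexing each tree node with a 7-bit slice of the 32-bit column id. A minimal sketch of that slicing, assuming the most significant slice is consumed first and ignoring the node compaction the series implements (both the ordering and the names are assumptions):

```python
BITS_PER_LEVEL = 7
KEY_BITS = 32
LEVELS = -(-KEY_BITS // BITS_PER_LEVEL)  # ceil(32 / 7) = 5


def key_chunks(column_id):
    """Split a 32-bit key into 7-bit chunks, most significant first."""
    mask = (1 << BITS_PER_LEVEL) - 1
    return [(column_id >> (BITS_PER_LEVEL * level)) & mask
            for level in reversed(range(LEVELS))]


print(key_chunks(0), key_chunks(127), key_chunks(130))
```

Sequential ids 0..127 differ only in the last chunk, so they share every inner node and end up next to each other in one leaf, which is why this layout suits column ids.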
2021-02-18 21:19:14 +02:00
Nadav Har'El
02dde2aca1 cql-pytest: port Cassandra's unit test validation/entities/json_test
In this patch, we port validation/entities/json_test.java, containing
21 tests for various JSON-related operations - SELECT JSON, INSERT JSON,
and the fromJson() and toJson() functions.

In porting these tests, I uncovered 19 (!!) previously unknown bugs in
Scylla:

Refs #7911: Failed fromJson() should result in FunctionFailure error, not
            an internal error.
Refs #7912: fromJson() should allow null parameter.
Refs #7914: fromJson() integer overflow should cause an error, not silent
            wrap-around.
Refs #7915: fromJson() should accept "true" and "false" also as strings.
Refs #7944: fromJson() should not accept the empty string "" as a number.
Refs #7949: fromJson() fails to set a map<ascii, int>.
Refs #7954: fromJson() fails to set null tuple elements.
Refs #7972: toJson() truncates some doubles to integers.
Refs #7988: toJson() produces invalid JSON for columns with "time" type.
Refs #7997: toJson() is missing a timezone on timestamp.
Refs #8001: Documented unit "µs" not supported for assigning a "duration"
            type.
Refs #8002: toJson() of decimal type doesn't use exponents so can produce
            huge output.
Refs #8077: SELECT JSON output for function invocations should be
            compatible with Cassandra.
Refs #8078: SELECT JSON ignores the "AS" specification.
Refs #8085: INSERT JSON with bad arguments should yield InvalidRequest
            error, not internal error.
Refs #8086: INSERT JSON cannot handle user-defined types with case-
            sensitive component names.
Refs #8087: SELECT JSON incorrectly quotes strings inside map keys.
Refs #8092: SELECT JSON missing null component after adding field to
            UDT definition.
Refs #8100: SELECT JSON with IN and ORDER BY does not obey the ORDER BY.

Due to these bugs, 8 out of the 21 tests here currently xfail and one
has to be skipped (issue #8100 causes the sanitizer to detect a use
after free, and crash Scylla).

As usual in this sort of test, all 21 tests pass when running against
Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210217130732.1202811-1-nyh@scylladb.com>
2021-02-18 20:44:04 +02:00
Takuya ASADA
32d4ec6b8a scylla_util.py: resolve /dev/root to get actual device on aws
When psutil.disk_partitions() reports that / is /dev/root, aws_instance mistakenly
reports that the root partition is part of the ephemeral disks, and RAID
construction fails.
This patch prevents the error and reports the correct free disks.

Fixes #8055

Closes #8040
2021-02-18 20:25:45 +02:00
Avi Kivity
90a7f76fb6 Merge 'cdc: log: fix a use-after-free in process_bytes_visitor' from Michał Chojnowski
Due to small value optimization used in `bytes`, views to `bytes` stored
in `vector` can be invalidated when the vector resizes, resulting in
use-after-free and data corruption. Fix that.

Closes #8105

* github.com:scylladb/scylla:
  cdc: log: avoid an unnecessary copy
  cdc: log: fix use-after-free in process_bytes_visitor
2021-02-18 20:23:41 +02:00
Michał Chojnowski
96c22cf3f8 cdc: log: avoid an unnecessary copy
There is no need to copy `bytes_view` into `bytes` here.
2021-02-18 14:08:18 +01:00
Michał Chojnowski
8cc4f39472 cdc: log: fix use-after-free in process_bytes_visitor
Due to small value optimization used in `bytes`, views to `bytes` stored
in `vector` can be invalidated when the vector resizes, resulting in
use-after-free and data corruption. Fix that.

Fixes #8117
2021-02-18 14:08:17 +01:00
Konstantin Osipov
32952a744a raft: add a unit test for voting
Test duplicate votes, votes from non-members and voting
in joint configuration.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
e49d5f89a5 raft: do not account for the same vote twice
While a duplicate vote from the same server is not possible by a
conforming Raft implementation, Raft assumptions on network permit
duplicates.
So, in theory, it is possible that a vote message is delivered
multiple times.

The current voting implementation does reject votes from non-members,
but doesn't check for duplicate votes.

Keep track of who has already voted, and reject duplicate votes.

A unit test follows.
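
The rule described above can be sketched as follows. This is a minimal illustration and not the project's code: the `Votes` class, `register_vote()` and `tally()` are hypothetical names, and only a single (non-joint) configuration is modelled.

```python
from enum import Enum


class VoteResult(Enum):
    UNKNOWN = "unknown"
    WON = "won"
    LOST = "lost"


class Votes:
    """Hypothetical vote tracker for a single (non-joint) configuration."""

    def __init__(self, voters):
        self._voters = frozenset(voters)
        self._responded = set()  # servers whose vote was already counted
        self._granted = 0

    def register_vote(self, server, granted):
        # Ignore votes from servers outside the configuration.
        if server not in self._voters:
            return False
        # Ignore a vote that was already counted (e.g. a redelivered message).
        if server in self._responded:
            return False
        self._responded.add(server)
        if granted:
            self._granted += 1
        return True

    def tally(self):
        quorum = len(self._voters) // 2 + 1
        if self._granted >= quorum:
            return VoteResult.WON
        # Lost once the servers that have not responded cannot reach quorum.
        unknown = len(self._voters) - len(self._responded)
        if self._granted + unknown < quorum:
            return VoteResult.LOST
        return VoteResult.UNKNOWN
```

Without the `_responded` check, a vote from one server delivered twice would be counted as two grants and could win an election without a real majority.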
2021-02-18 16:04:44 +03:00
Konstantin Osipov
7ea064ac04 raft: remove fsm::set_configuration()
Set either tracker or votes configuration explicitly.
This saves a few lines and simplifies unit tests.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
4083026b65 raft: consistently use configuration from the log 2021-02-18 16:04:44 +03:00
Konstantin Osipov
c4552ffb9a raft: add ostream serialization for enum vote_result 2021-02-18 16:04:44 +03:00
Konstantin Osipov
2ae04d8a47 raft: advance commit index right after leaving joint configuration
Imagine the cluster is in joint configuration {{A, B}, {A, B, C, D, E}}.
The leader's view of stable indexes is:

Server  Match Index
A       5
B       5
C       6
D       7
E       8

The commit index would be 5 if we use joint configuration, and 6
if we assume we left it. Let it happen without an extra FSM
step.
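
The arithmetic in the example above can be checked with a short sketch. The function names are hypothetical; the only assumption is the standard Raft rule that, in a joint configuration, an entry is committed once a majority of each member set has it.

```python
def quorum_index(match_index, members):
    """Highest index replicated on a majority of `members`."""
    indexes = sorted((match_index[m] for m in members), reverse=True)
    quorum = len(indexes) // 2 + 1
    return indexes[quorum - 1]


def commit_index(match_index, configs):
    """In a joint configuration an entry needs a majority in every member set."""
    return min(quorum_index(match_index, members) for members in configs)


match_index = {"A": 5, "B": 5, "C": 6, "D": 7, "E": 8}
joint = [{"A", "B"}, {"A", "B", "C", "D", "E"}]
final = [{"A", "B", "C", "D", "E"}]

print(commit_index(match_index, joint))  # 5
print(commit_index(match_index, final))  # 6
```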
2021-02-18 16:04:44 +03:00
Konstantin Osipov
132db931da raft: add tracker test 2021-02-18 16:04:44 +03:00
Konstantin Osipov
6e3932bbc7 raft: tidy up follower_progress API
Make the API more explicit so it's available for testing.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
ed65a8635e raft: update raft::log::apply_snapshot() assert
apply_snapshot() doesn't support applying the same snapshot
twice. The caller must check the current snapshot before
applying.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
e58a3e42ca raft: add a unit test for raft::log 2021-02-18 16:04:44 +03:00
Konstantin Osipov
51c968bcb4 raft: rename log::non_snapshoted_length() to log::in_memory_size()
The old name was incorrect, in case apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.

Fix spelling in related comments.

Rename fsm::wait() to fsm::wait_max_log_size(), it's a more
specific name. Rename max_log_length to max_log_size to use
'size' rather than 'length' consistently for log size.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
cfe407b402 raft: inline raft::log::truncate_tail()
It's the core of apply_snapshot() work and is only used in it.

Now that truncate_tail is inline, rename truncate_head()
to truncate_uncommitted().
2021-02-18 16:04:44 +03:00
Konstantin Osipov
e0011c6e4d raft: ignore AppendEntries RPC with a very old term
Do not assert on an outdated message.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
805d52eb16 raft: remove log::start_idx()
Replace it with a private _first_idx, which is maintained
along with the rest of class log state.
_first_idx is a name consistent with counterpart last_idx().

Do not use a function since going forward we may want
to remove Raft index from struct log_entry, so should rely
less on it.

This fixes a bug when _last_conf_idx was not reset
after apply_snapshot() because start_idx() was pointing
to a non-existent entry.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
af8770da63 raft: return a correct last term on an empty log
If the log is empty, we must use snapshot's term,
since the log could be right after taking a snapshot
when no trailing entries were kept.

This fixes a rare possible bug when a log matching
rule could be violated during elections by a follower
with a log which was just truncated after a snapshot.

A separate unit test for the issue will follow.
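
A minimal sketch of the idea, assuming a simplified log that stores `(idx, term)` pairs; the `Log` and `Snapshot` classes here are hypothetical stand-ins, not the real `raft::log`:

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    idx: int = 0
    term: int = 0


@dataclass
class Log:
    snapshot: Snapshot = field(default_factory=Snapshot)
    entries: list = field(default_factory=list)  # list of (idx, term) pairs

    def last_term(self):
        # With no in-memory entries, the snapshot holds the last known term.
        if not self.entries:
            return self.snapshot.term
        return self.entries[-1][1]

    def apply_snapshot(self, snapshot, trailing=0):
        # Keep at most `trailing` entries that the snapshot already covers.
        keep_from = snapshot.idx - trailing
        self.entries = [e for e in self.entries if e[0] > keep_from]
        self.snapshot = snapshot
```

A follower whose log was just truncated this way would otherwise report a stale or zero last term when voting.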
2021-02-18 16:04:43 +03:00
Konstantin Osipov
cb035a7c8d raft: do not use raft::log::start_idx() outside raft::log()
raft::log::start_idx() is currently not meaningful
in case the log is empty.

Avoid using it in fsm::replicate_to() and avoid manual search for
previous log term, instead encapsulate the search in log::term_for().

As a side effect we currently return a correct term (0)
when log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.

This change happens to reduce the overall line count.

While at it, improve the comments in raft::replicate_to().
2021-02-18 16:04:43 +03:00
Konstantin Osipov
04b4d97d6a raft: rename progress.hh to tracker.hh
class tracker is the main class of this module.
2021-02-18 16:04:43 +03:00
Konstantin Osipov
97a16c0f77 raft: extend single_node_is_quiet test 2021-02-18 16:04:43 +03:00
Avi Kivity
f0950e023d Merge 'Split CDC streams table partitions into clustered rows ' from Kamil Braun
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments

This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.

The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.

Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.

---

Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.

However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. We add an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).

Closes #8116

* github.com:scylladb/scylla:
  tests: add a simple CDC cql pytest
  cdc: add config option to disable streams rewriting
  cdc: rewrite streams to the new description table
  cql3: query_processor: improve internal paged query API
  cdc: introduce no_generation_data_exception exception type
  docs: cdc: mention system.cdc_local table
  cdc: coroutinize do_update_streams_description
  sys_dist_ks: split CDC streams table partitions into clustered rows
  cdc: use chunked_vector for streams in streams_version
  cdc: remove `streams_version::expired` field
  system_distributed_keyspace: use mutation API to insert CDC streams
  storage_service: don't use `sys_dist_ks` before it is started
2021-02-18 12:49:43 +02:00
Kamil Braun
4bf28aad7a tests: add a simple CDC cql pytest 2021-02-18 11:44:59 +01:00
Kamil Braun
841f07e9b7 cdc: add config option to disable streams rewriting
Rewriting stream descriptions is a long, expensive, and prone-to-failure
operation. Due to #8061 it may consume a lot of memory. In general, it
may keep failing (and being retried) endlessly, straining the cluster.
As a backdoor we add this flag for potential future needs of admins or
field engineers.

I don't expect it will ever be used, but it won't hurt and may save us
some work in the worst case scenario.
2021-02-18 11:44:59 +01:00
Kamil Braun
9bdd000e97 cdc: rewrite streams to the new description table
Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.

However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. This commit adds an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).
2021-02-18 11:44:59 +01:00
Kamil Braun
4ef736a0a3 cql3: query_processor: improve internal paged query API
The `query_processor::query` method allowed internal paged queries.
However, it was quite limited, hardcoding a number of parameters:
consistency level, timeout config, page size.

This commit does the following improvements:
1. Rename `query` to `query_internal` to make it obvious that this API
   is supposed to be used for internal queries only
2. Extend the method to take consistency level, timeout config, and page
   size as parameters
3. Remove unused overloads of `query_internal`
4. Fix a bunch of typos / grammar issues in the docstring
2021-02-18 11:44:59 +01:00
Kamil Braun
7c91894ddf cdc: introduce no_generation_data_exception exception type 2021-02-18 11:44:59 +01:00
Kamil Braun
99cc9b8051 docs: cdc: mention system.cdc_local table 2021-02-18 11:44:59 +01:00
Kamil Braun
44aab61aea cdc: coroutinize do_update_streams_description 2021-02-18 11:44:59 +01:00
Kamil Braun
67d4e5576d sys_dist_ks: split CDC streams table partitions into clustered rows
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments

This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.

The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.

Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.
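
The two-phase insertion can be sketched with an in-memory model. Everything here is hypothetical (the `Replica` class, `publish_generation()`, the tiny `batch_size`); it only illustrates why a visible timestamp implies that a quorum already holds the streams.

```python
class Replica:
    """Hypothetical in-memory replica holding both tables."""

    def __init__(self, up=True):
        self.up = up
        self.streams = {}  # (generation_ts, range_end) -> list of stream ids
        self.timestamps = set()


def write(replicas, apply, required):
    """Apply a write to every live replica; fail unless `required` of them ack."""
    acks = 0
    for r in replicas:
        if r.up:
            apply(r)
            acks += 1
    if acks < required:
        raise RuntimeError("not enough replicas acknowledged the write")


def publish_generation(replicas, generation_ts, streams_by_range, batch_size=2):
    quorum = len(replicas) // 2 + 1
    rows = sorted(streams_by_range.items())
    # Phase 1: one clustering row per token range, written in batches.
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]

        def apply(r, batch=batch):
            for range_end, ids in batch:
                r.streams[(generation_ts, range_end)] = list(ids)

        write(replicas, apply, quorum)
    # Phase 2: publish the timestamp only after phase 1 reached a quorum.
    write(replicas, lambda r: r.timestamps.add(generation_ts), quorum)
```

If phase 1 fails, the timestamp is never written, so a client reading the timestamps table cannot learn about a generation whose streams are not yet durable.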
2021-02-18 11:44:59 +01:00
Kamil Braun
ba920361b3 cdc: use chunked_vector for streams in streams_version
The vector may get quite long (say... 1.6M stream IDs). We prevent a
large allocation by using utils::chunked_vector.
2021-02-18 11:44:59 +01:00
Kamil Braun
9ae4467970 cdc: remove streams_version::expired field
This field was not used anywhere.
2021-02-18 11:44:59 +01:00
Kamil Braun
3d7b990300 system_distributed_keyspace: use mutation API to insert CDC streams
The `storage_proxy::mutate` low-level API is much more powerful than
the CQL API. This power is not needed for this commit but for the next.
2021-02-18 11:44:59 +01:00
Kamil Braun
0df15ca8cc storage_service: don't use sys_dist_ks before it is started
It could happen that system_distributed_keyspace was used by
storage_service before it was fully started (inside
`handle_cdc_generation`), i.e. before sys_dist_ks' `start()` returned
(on shard 0). It only checked whether `local_is_initialized()` returns
true, so it only ensured that the service is constructed.

Currently, sys_dist_ks' `start` only announces migrations, so this was
mostly harmless. More concretely: it could result in the node trying to
send CQL requests using a table that it didn't yet recognize by calling
sys_dist_ks' methods before the `announce_migration` call inside `start`
has returned. This would result in an exception; however, the exception
would be caught by the caller and the procedure would be retried,
succeeding eventually. See `handle_cdc_generation` for details.

Still, the initial intention of the code was to wait for the sys_dist_ks
service to be fully started before it was used. This commit fixes that.
2021-02-18 11:44:59 +01:00
Tomasz Grabiec
f94f70cda8 Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja
Test log consistency after apply_snapshot() is called.
Ensure log::last_term() log::last_conf_index() and log::size()
work as expected.

Misc cleanups.

* scylla-dev/raft-confchange-test:
  raft: add a unit test for voting
  raft: do not account for the same vote twice
  raft: remove fsm::set_configuration()
  raft: consistently use configuration from the log
  raft: add ostream serialization for enum vote_result
  raft: advance commit index right after leaving joint configuration
  raft: add tracker test
  raft: tidy up follower_progress API
  raft: update raft::log::apply_snapshot() assert
  raft: add a unit test for raft::log
  raft: rename log::non_snapshoted_length() to log::length()
  raft: inline raft::log::truncate_tail()
  raft: ignore AppendEntries RPC with a very old term
  raft: remove log::start_idx()
  raft: return a correct last term on an empty log
  raft: do not use raft::log::start_idx() outside raft::log()
  raft: rename progress.hh to tracker.hh
  raft: extend single_node_is_quiet test
2021-02-18 10:55:59 +01:00
Raphael S. Carvalho
5206a97915 compaction: Fix leak of expired sstable in the backlog tracker
expired sstables are skipped in the compaction setup phase, because they don't
need to be actually compacted, but rather only deleted at the end.
that is causing such sstables to not be removed from the backlog tracker,
meaning that backlog caused by expired sstables will not be removed even after
their deletion, which means shares will be higher than needed, making compaction
potentially more aggressive than it has to be.

to fix this bug, let's manually register these sstables into the monitor,
such that they'll be removed from the tracker once compaction completes.

Fixes #6054.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210216203700.189362-1-raphaelsc@scylladb.com>
2021-02-18 11:12:00 +02:00
Takuya ASADA
d7f202f900 dist/debian: fix renaming debian/scylla-* files rule
The current renaming rule for debian/scylla-* files is buggy: it fails to
install some .service files when a custom product name is specified.

Introduce regex-based rewriting instead of ad hoc renaming, and fix the
wrong renaming rule.

Fixes #8113

Closes #8114
2021-02-18 10:35:19 +02:00
Pekka Enberg
843bf57c3c Update tools/jmx submodule
* tools/jmx 949cefc...bf8bb16 (1):
  > Merge 'dist/debian: fix renaming debian/scylla-* files rule' from Takuya ASADA
2021-02-18 10:35:00 +02:00
Botond Dénes
c3b4c3f451 evictable_reader: reset _range_override after fast-forwarding
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader its read range
is reduced to account for partitions it already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.
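
A simplified model of the bug and the fix, assuming integer keys and inclusive `(start, end)` ranges; the class and its methods are hypothetical and track only which range the reader considers current:

```python
class EvictableReader:
    """Hypothetical model tracking only which range the reader is on."""

    def __init__(self, pr):
        self._pr = pr  # (start, end) range requested by the user
        self._range_override = None  # reduced range used after recreation

    def current_range(self):
        # When engaged, the override is the definitive range.
        if self._range_override is not None:
            return self._range_override
        return self._pr

    def recreate_after(self, last_key):
        # Skip partitions that were already read before eviction.
        _, end = self.current_range()
        self._range_override = (last_key + 1, end)

    def fast_forward_to(self, pr):
        self._pr = pr
        # The fix: a stale override must not shadow the new range.
        self._range_override = None

    def validate(self, key):
        start, end = self.current_range()
        return start <= key <= end
```

Without the reset in `fast_forward_to()`, `validate()` would compare the first partition of the new range against the stale override and report a failure.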

Fixes: #8059

Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
2021-02-17 19:11:00 +02:00
Benny Halevy
4b46793c19 row_cache: scanning_and_populating_reader: add _read_next_partition flag
Instead of resetting _reader in scanning_and_populating_reader::fill_buffer
in the `reader_finished` case, use a gentler, _read_next_partition flag
on which `read_next_partition` will be called in the next iteration.

Then, read_next_partition can close _reader only before overwriting it
with a new reader.  Otherwise, if _reader is always closed in the
`reader_finished` case, we end up hitting premature end_of_stream.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-30-bhalevy@scylladb.com>
2021-02-17 19:06:21 +02:00
Benny Halevy
57540dae42 mutation_query: mark reconcilable_result_builder constructor noexcept
With result_memory_accounter being nothrow move constructible,
reconcilable_result_builder does not throw.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-67-bhalevy@scylladb.com>
2021-02-17 18:56:12 +02:00
Benny Halevy
92e0e84ee5 database: futurize remove
In preparation for futurizing the querier_cache api.

Coroutinize drop_column_family while at it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-61-bhalevy@scylladb.com>
2021-02-17 18:52:53 +02:00
Benny Halevy
5263ab0e9d row_cache: read_context: use query-request is_single_partition helper
Rather than hand-coding the same logic.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-32-bhalevy@scylladb.com>
2021-02-17 18:29:39 +02:00
Benny Halevy
35256d1b92 treewide: explicitly use flat_mutation_reader_opt
Unlike flat_mutation_reader_opt that is defined using
optimized_optional<flat_mutation_reader>, std::optional<T> does not evaluate
to `false` after being moved, only after it is explicitly reset.

Use flat_mutation_reader_opt rather than std::optional<flat_mutation_reader>
to make it easier to check if it was closed before it's destroyed
or being assigned-over.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-6-bhalevy@scylladb.com>
2021-02-17 17:57:34 +02:00
Avi Kivity
c63e26e26f Merge 'cdc: Limit size of topology description' from Piotr Jastrzębski
Currently, the whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.

This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.

This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.

This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.

For more details see #7961, #7993 and #7985.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8048

* github.com:scylladb/scylla:
  cdc: Limit size of topology description
  cdc: Extract create_stream_ids from topology_description_generator
2021-02-17 15:43:53 +02:00
Piotr Jastrzebski
649f254863 cdc: Limit size of topology description
Currently, the whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.

This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.

This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.

This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.

For more details see #7961, #7993 and #7985.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-02-17 13:24:40 +01:00
Avi Kivity
001652815c Merge 'imr: switch back to open-coded description of structures' from Michał Chojnowski
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), will speed up migrating other components to this
IMR format, as the IMR infrastructure reduces code bloat, makes the code
more readable via declarative type descriptions as well as safer.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.

This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because related code evolved a lot since the merge. Also, going
back to the previous code would mean we regress as we'd revert the move
to fragmented buffers. So this revert is only conceptual, it changes the
underlying infrastructure back to the previous open-coded one, but keeps
the fragmented buffers, as well as the interface of the related
components (to the extent possible).

Fixes: #5578

Closes #8106

* github.com:scylladb/scylla:
  imr: switch back to open-coded description of structures
  utils: managed_bytes: add a few trivial helper methods
  utils: fragment_range: move FragmentedView helpers to fragment_range.hh
  utils: fragment_range: add single_fragmented_mutable_view
  utils: fragment_range: implement FragmentRange for fragment_range
  utils: mutable_view: add front()
  types: remove an unused helper function
  test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions
  test: mutation_test: remove an obsolete assertion
  test: mutation_test: initialize an uninitialized variable
  test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test
2021-02-17 13:40:16 +02:00
Botond Dénes
ba7a9d2ac3 imr: switch back to open-coded description of structures
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), will speed up migrating other components to this
IMR format, as the IMR infrastructure reduces code bloat, makes the code
more readable via declarative type descriptions as well as safer.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.

This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because related code evolved a lot since the merge. Also, going
back to the previous code would mean we regress as we'd revert the move
to fragmented buffers. So this revert is only conceptual, it changes the
underlying infrastructure back to the previous open-coded one, but keeps
the fragmented buffers, as well as the interface of the related
components (to the extent possible).

Fixes: #5578
2021-02-16 23:43:07 +01:00
Michał Chojnowski
25a9569cc4 utils: managed_bytes: add a few trivial helper methods
We will use them in the upcoming IMR removal patch.
2021-02-16 23:43:07 +01:00
Michał Chojnowski
3f248ca7cc utils: fragment_range: move FragmentedView helpers to fragment_range.hh
In the upcoming IMR removal patch we will need read_simple() and similar helpers
for FragmentedView outside of types.hh. For now, let's move them to
fragment_range.hh, where FragmentedView is defined. Since it's a widely included
header, we should consider moving them to a more specialized header later.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
8a06a576aa utils: fragment_range: add single_fragmented_mutable_view
We will use it later in the upcoming IMR removal patch.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
7b662b9315 utils: fragment_range: implement FragmentRange for fragment_range
This will allow us to pass FragmentedView instances to places where
FragmentRange is expected.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
f972f90193 utils: mutable_view: add front()
We will use it in the upcoming patches.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
9e591c6634 types: remove an unused helper function 2021-02-16 21:35:14 +01:00
Michał Chojnowski
6b8a69e01f test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions
The off-by-one error would cause
test_multishard_combining_reader_non_strictly_monotonic_positions to fail if
the added range_tombstones filled the buffer exactly to the end.
In such situation, with the old loop condition,
make_fragments_with_non_monotonic_positions would add one range_tombstone too
many to the deque, violating the test assumptions.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
5b79d6ca4c test: mutation_test: remove an obsolete assertion
Due to small value optimizations, the removed assertions are not true in
general. Until now, atomic_cell did not use small value optimizations, but
it will after upcoming changes.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
aa60f28a09 test: mutation_test: initialize an uninitialized variable
It was assumed to be zero-initialized, but C++ does not guarantee that.
It has to be initialized explicitly.
2021-02-16 21:35:14 +01:00
Michał Chojnowski
52bd190bb3 test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test
sstable_run_based_compaction_test assumed that sstables are freed immediately
after they are fully processed.
However, since commit b524f96a74,
mutation_reader_merger releases sstables in batches of 4, which breaks the
assumption. This fix adjusts the test accordingly.

Until now, the test only kept working by chance: by coincidence, the number of
test sstables processed by merging_reader in a single fill_buffer() call was
divisible by 4. Since the test checks happen between those calls,
the test never witnessed a situation when an sstable was fully processed,
but not released yet.

The error was noticed during the work on an upcoming patch which changes the
size of mutation_fragment, and reduces the number of test sstables processed
in a single fill_buffer() call, which breaks the test.
2021-02-16 21:35:14 +01:00
Konstantin Osipov
d293966366 raft: add a unit test for voting
Test duplicate votes, votes from non-members and voting
in joint configuration.
2021-02-16 23:15:16 +03:00
Konstantin Osipov
3478389d60 raft: do not account for the same vote twice
While duplicate votes are not allowed by Raft rules, it is possible
that a vote message is delivered multiple times.

The current voting implementation does reject votes from non-members,
but doesn't check for duplicate votes.

Keep track of who has already voted, and reject duplicate votes.

A unit test follows.
2021-02-16 23:15:16 +03:00
Konstantin Osipov
ffd38de5fe raft: remove fsm::set_configuration()
Set either tracker or votes configuration explicitly.
This saves a few lines and simplifies unit tests.
2021-02-16 23:15:16 +03:00
Konstantin Osipov
b941ca9bae raft: consistently use configuration from the log 2021-02-16 23:15:16 +03:00
Konstantin Osipov
75eddaf493 raft: add ostream serialization for enum vote_result 2021-02-16 23:15:16 +03:00
Konstantin Osipov
e099003c7c raft: advance commit index right after leaving joint configuration
Imagine the cluster is in joint configuration {{A, B}, {A, B, C, D, E}}.
Server stable indexes are:

Server  Stable Index
A       5
B       5
C       6
D       7
E       8

The commit index would be 5 if we use joint configuration, and 6
if we assume we left it. Let it happen without an extra FSM
step.
2021-02-16 23:15:16 +03:00
Konstantin Osipov
1bdb3fc8a9 raft: add tracker test 2021-02-16 23:15:16 +03:00
Konstantin Osipov
63965f46f4 raft: tidy up follower_progress API
Make the API more explicit so it's available for testing.
2021-02-16 23:15:16 +03:00
Konstantin Osipov
74879fab09 raft: update raft::log::apply_snapshot() assert
apply_snapshot() doesn't support applying the same snapshot
twice. The caller must check the current snapshot before
applying.
2021-02-16 23:15:12 +03:00
Konstantin Osipov
6ee3aedcc2 raft: add a unit test for raft::log 2021-02-16 23:12:01 +03:00
Konstantin Osipov
c35f029be1 raft: rename log::non_snapshoted_length() to log::length()
The old name was incorrect, in case apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.

Fix spelling in related comments.

Rename fsm::wait() to fsm::wait_max_log_length(), it's a more
specific name.
2021-02-16 23:12:01 +03:00
Konstantin Osipov
9e1a652805 raft: inline raft::log::truncate_tail()
It's the core of apply_snapshot() work and is only used in it.

Now that truncate_tail is inline, truncate_head() can be
called simply truncate().
2021-02-16 23:10:58 +03:00
Konstantin Osipov
f7fb788edf raft: ignore AppendEntries RPC with a very old term
Do not assert on an outdated message.
2021-02-16 23:07:58 +03:00
Konstantin Osipov
7236f081c1 raft: remove log::start_idx()
Replace it with a private _first_idx, which is maintained
along with the rest of class log state.
_first_idx is a name consistent with counterpart last_idx().

Do not use a function since going forward we may want
to remove Raft index from struct log_entry, so should rely
less on it.

This fixes a bug when _last_conf_idx was not reset
after apply_snapshot() because start_idx() was pointing
to a non-existent entry.
2021-02-16 23:06:23 +03:00
Konstantin Osipov
59ea383c7d raft: return a correct last term on an empty log
If the log is empty, we must use snapshot's term,
since the log could be right after taking a snapshot
when no trailing entries were kept.

This fixes a rare possible bug when a log matching
rule could be violated during elections by a follower
with a log which was just truncated after a snapshot.

A separate unit test for the issue will follow.
2021-02-16 21:07:05 +03:00
Konstantin Osipov
6c14775b20 raft: do not use raft::log::start_idx() outside raft::log()
raft::log::start_idx() is currently not meaningful
in case the log is empty.

Avoid using it in fsm::replicate_to() and avoid manual search for
previous log term, instead encapsulate the search in log::term_for().

As a side effect we currently return a correct term (0)
when log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.

This change happens to reduce the overall line count.

While at it, improve the comments in raft::replicate_to().
2021-02-16 21:05:44 +03:00
Nadav Har'El
946e63ee6e cql-pytest: remove "xfail" tag from two passing tests
Issue #7595 was already fixed last week, in commit
b6fb5ee912, so the two tests which failed
because of this issue no longer fail and their "xfail" tag can be removed.

Refs #7595.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210216160606.1172855-1-nyh@scylladb.com>
2021-02-16 19:17:22 +02:00
Nadav Har'El
737c1c6cc7 cql-pytest: Additional JSON tests
This patch adds several additional tests to test/cql-pytest/test_json.py
to reproduce additional bugs or clarify some non-bugs.

First, it adds a reproducer for issue #8087, where SELECT JSON may create
invalid JSON - because it doesn't quote a string which is part of a map's
key. As usual for these reproducers, the test passes on Cassandra, and fails
on Scylla (so marked xfail).

We have a bigger test translated from Cassandra's unit tests,
cassandra_tests/validation/entities/json_test.py::testInsertJsonSyntaxWithNonNativeMapKeys
which demonstrates the same problem, but the test added in this patch is much
shorter and focuses on demonstrating exactly where the problem is.

Second, this patch adds a test that verifies that SELECT JSON works correctly
for UDTs or tuples where one of their components was never set - in such a
case the SELECT JSON should also output this component, with a "null" value.
And this test works (i.e., produces the same result in Cassandra and Scylla).
This test is interesting because it shows that issue #8092 is specific to the
case of an altered UDT, and doesn't happen for every case of null
component in a UDT.

Refs #8087
Refs #8092

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210216150329.1167335-1-nyh@scylladb.com>
2021-02-16 16:05:31 +01:00
Avi Kivity
2f3b265dac Update seastar submodule
* seastar 76cff58964...e53a1059f9 (18):
  > rpc: streaming sink: order outgoing messages
Fixes #7552.
  > http: fix compilation issues when using clang++
  > http/file_handler: normalize file-type for mime detection
  > http/mime_types: add support for svg+xml
  > reactor: simplify get_sched_stats()
  > Merge "output_stream: make api noexcept" from Benny
  > Merge " input_stream: make api noexcept" from Benny
  > rpc: mark 'protocol' class as final
  > tls: reloadable_certificate inotify flag is wrong
Fixes #8082.
  > cli: Ignore the --num-io-queues option
  > io_queue: Do not carry start time in lambda capture
  > fstream: Cancel all IO-s on file_data_source_impl close
  > http: add "Transfer-Encoding: chunked" handling
  > http: add ragel parsers for chunks used in messages with Transfer-Encoding: chunked
  > http: add request content streaming
  > http: add reading/skipping all bytes in an input_stream
  > Merge "Reduce per-io-queue container for prio classes" from Pavel Emelyanov
  > seastar-addr2line: split multiple addresses on the same line
2021-02-16 16:19:26 +02:00
Avi Kivity
789233228b messaging: don't inherit from seastar::rpc::protocol
messaging_service's rpc_protocol_server_wrapper inherits from
seastar::rpc::protocol::server as a way to avoid a
is unfortunate, as protocol.hh wasn't designed for inheritance, and
is not marked final.

Avoid this inheritance by hiding the class as a member. This causes
a lot of boilerplate code, which is unfortunate, but this random
inheritance is bad practice and should be avoided.

Closes #8084
2021-02-16 16:04:44 +02:00
Gleb Natapov
c9392095ce cql3: store cf_prop_defs as optional instead of shared_ptr
Its being a shared_ptr is a remnant of the translation from Java.
Message-Id: <20210216123931.80280-3-gleb@scylladb.com>
2021-02-16 15:58:38 +02:00
Gleb Natapov
805da054e7 cql3: store cf_name as optional in cf_statement instead of shared_ptr
Its being a shared_ptr is a remnant of the translation from Java.
Message-Id: <20210216123931.80280-2-gleb@scylladb.com>
2021-02-16 15:58:37 +02:00
Gleb Natapov
6335af625e cql3: assert that unengaged optional is not accessed in keyspace_element_name::get_keyspace()
Message-Id: <20210216085545.54753-2-gleb@scylladb.com>
2021-02-16 15:36:00 +02:00
Gleb Natapov
200ca974c3 Do not access potentially unengaged optional in keyspace_element_name
Currently there are places that call
keyspace_element_name::get_keyspace() without checking that _ks_name is
engaged. Fix those places.
Message-Id: <20210216085545.54753-1-gleb@scylladb.com>
2021-02-16 15:35:59 +02:00
Botond Dénes
4d309fc34a repair: row_level: invoke on_internal_error() on out-of-order partitions
repair_writer::do_write() already has a partition compare for each
mutation fragment written, to determine whether the fragment belongs to
another partition or not. This equal compare can be converted to a
tri_compare at no extra cost allowing for detecting out-of-order
partitions, in which case `on_internal_error()` is called.
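
A rough sketch of the idea, under stated assumptions: `partition_pos`, `tri_compare()` and `classify()` are hypothetical names, and a thrown `std::logic_error` stands in for the `on_internal_error()` call mentioned above.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <tuple>

// Hypothetical stand-in for a partition position; not the real key type.
struct partition_pos {
    long token;
    std::string key;
};

// Three-way comparison: negative, zero or positive, like a tri_compare.
inline int tri_compare(const partition_pos& a, const partition_pos& b) {
    const auto ta = std::tie(a.token, a.key);
    const auto tb = std::tie(b.token, b.key);
    if (ta < tb) return -1;
    if (tb < ta) return 1;
    return 0;
}

enum class partition_change { same, next };

// An equality check can only say "same" or "different". The three-way result
// additionally tells whether the incoming partition sorts before the current
// one, which is the out-of-order case.
inline partition_change classify(const partition_pos& current,
                                 const partition_pos& incoming) {
    const int cmp = tri_compare(incoming, current);
    if (cmp == 0) return partition_change::same;
    if (cmp > 0) return partition_change::next;
    throw std::logic_error("out-of-order partition");  // stand-in for on_internal_error()
}
```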

Refs: #7623
Refs: #7552

Test: dtest(RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test:debug)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210216074523.318217-1-bdenes@scylladb.com>
2021-02-16 15:31:40 +02:00
Benny Halevy
50ca693a02 main: disable stall detector during startup
We see long reactor stalls from `logalloc::prime_segment_pool`
in debug mode yet the stall detector's purpose is to detect
reactor stalls during normal operation where they can increase
the latency of other queries running in parallel.

Since this change doesn't actually fix the stalls but rather
hides them, the following annotations will just reference
the respective github issues rather than auto-close them.

Refs #7150
Refs #5192
Refs #5960

Restore blocked_reactor_notify_ms right before
starting storage_proxy.  Once storage_proxy is up, this node
affects cluster latency, and so stalls should be reported so
they can be fixed.

Test: secondary_index_test --blocked-reactor-notify-ms 1 (release)
DTest: CASSANDRA_DIR=../scylla/build/release SCYLLA_EXT_OPTS="--blocked-reactor-notify-ms 2" ./scripts/run_test.sh materialized_views_test:TestMaterializedViews.interrupt_build_process_with_resharding_half_to_max_test

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210216112052.27672-1-bhalevy@scylladb.com>
2021-02-16 13:28:31 +02:00
Tomasz Grabiec
446ea07ac6 Merge "raft: server instance init and raft RPC handlers" from Pavel Solodovnikov
This series provides a `raft_services` class to create and store
raft server instances for schema changes, and also wires up the RPC handlers
for Raft RPC verbs.

* manmanson/raft-api-server-handlers-v10:
  raft: share `raft_gossip_failure_detector` instance across multiple raft rpc instances
  raft: move server address handling from `raft_rpc` to `raft_services` class
  raft: wire up schema Raft RPC handlers
  raft: raft_rpc: provide `update_address_mapping` and dispatcher functions
  raft: pass `group_id` as an argument to raft rpc messages
  raft: use a named constant for pre-defined schema raft group
2021-02-16 11:14:50 +01:00
Pavel Solodovnikov
1ada0abf81 raft: share raft_gossip_failure_detector instance across multiple raft rpc instances
Store an instance inside `raft_services` and reuse it for
all raft groups created and managed by `raft_services` instance.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-16 13:09:12 +03:00
Pavel Solodovnikov
8c2a904dc8 raft: move server address handling from raft_rpc to raft_services class
This decouples `raft_gossip_failure_detector` from any
particular rpc instance and thus makes it possible
to share the same failure detector instance among all raft servers
since they are managed in a centralized way by a `raft_services`
instance.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-16 13:09:06 +03:00
Pavel Solodovnikov
63cdf4694d raft: wire up schema Raft RPC handlers
This patch adds registration and de-registration of the
corresponding Raft RPC verbs handlers.

There is a new `raft_services` class that
is responsible for initializing the raft RPC verbs and
managing raft server instances.

The service inherits `seastar::peering_sharded_service<T>`,
because we need to route the request to the appropriate shard
which is handled by the `shard_for_group` function (currently
only handling schema raft group to land on shard 0, otherwise
throws an exception).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-16 13:08:59 +03:00
Nadav Har'El
1e1cbaf589 docs/alternator: clean up description of DynamoDB compatibility
We had Alternator's current compatibility with DynamoDB described in
two places - alternator.md and compatibility.md. This duplication was
not only unnecessary, in some places it led to inconsistent claims.

In general, the better description was in compatibility.md, so in
this patch we remove the compatibility section from alternator.md
and instead link to compatibility.md. There was a bit of information
that was missing in compatibility.md, so this patch adds it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210215203057.1132162-1-nyh@scylladb.com>
2021-02-16 08:48:28 +01:00
Pavel Emelyanov
9baf1226dc test/memory_footprint: Print radix tree node sizes
After switching cells storage onto compact radix tree it
becomes useful to know the tree nodes' sizes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:41:09 +03:00
Pavel Emelyanov
1bdfa355ea row: Remove old storages
Now when the 3rd storage type (radix tree) is all in, old
storage can be safely removed.  The result is:

1. memory footprint

sizeof(class row):  112 => 16 bytes
sizeof(rows_entry): 126 => 120 bytes

the "in cache" value depends on the number of cells:

num of cells     master       patch
         1       752         656
         2       808         712
         3       864         768
         4       920         824
         5       968         936
         6      1136         992
         ...
         16     1840        1672
         17     1904        1992  (+88)
         18     1976        2048  (+72)
         19     2048        2104  (+56)
         20     2120        2160  (+40)
         21     2184        2208  (+24)
         22     2256        2264  ( +8)
         23     2328        2320
         ...
         32     2960        2808

After 32 cells the storage switches into rbtree with
24-bytes per-cell overhead and the radix tree improvement
rocketlaunches

           64     7872        6056
           128   15040        9512
           256   29376       18568

2. perf_mutation test is enhanced by this series and the
   results differ depending on the number of columns used

                    tps value
--column-count    master   patch
          1       59.9k    57.6k  (-3.8%)
          2       59.9k    57.5k
          4       59.8k    57.6k
          8       57.6k    57.7k  <- eq
         16       56.3k    57.6k
         32       53.2k    57.4k  (+7.9%)

A note on this. Last time 1-column test was ~5% worse which
was explained by inline storage of 5 cells that's present on
current implementation and was absent in radix tree.

An attempt to make inline storage for small radix trees
resulted in complete loss of memory footprint gain, but gave
fraction of percent to perf_mutation performance. So this
version doesn't have inline nodes.

The 1.2% improvement from v2 surprisingly came from the
tree::clone_from() which in v2 was work-around-ed by slow
walk+emplace sequence while this version has the optimized
API call for cloning.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:35:06 +03:00
Pavel Emelyanov
2053b1c202 row: Prepare row::equal for switch
Same as the previous patch, re-implement the row::equal to use
the radix_tree iterator for comparison of two index:cell sequences.

The std::equal() doesn't work here, since the predicate-fn needs
to look at both iterators to call it.key() on (radix tree API
feature), while std::equal provides only the T&s in it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:31:52 +03:00
Pavel Emelyanov
b5527b3635 row: Prepare row::difference for switch
The method effectively walks two pairs of <column_id, cell> and
applies the difference to separate row instance. The code added
is the copy of the same code below this hunk with the mechanical
substitution:

 c.first -> c.key()
 c.second -> c->cell
 it->first -> it.key()
 it->second -> it.cell

because first-s are column_id-s reported by radix tree iterator
.key() method and second-s are cells, that were referenced by
current code in get_..._vector() from boost::irange and are now
directly pointed to by radix tree iterator.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:27:00 +03:00
Pavel Emelyanov
f006acc853 row: Introduce radix tree storage type
Currently class row uses a union of a vector and a set to keep
the cells and switches between them. Add the 3rd type with the
radix tree, but never switch to it, just to show how the operations
would look like. Later on vector and set will be removed and the
whole row will be immediately switched to the radix tree storage.

NB: All the added places have indentation deliberately broken, so
that next patch will just remove the surrounding (old) code away
and (most of) the new one will happen in its place instantly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:27:00 +03:00
Pavel Emelyanov
5f276b279e row-equal: Re-declare the cells_equal lambda
For further patching it's handy to have this helper to accept
column_id and atomic_cell_or_collection arguments, instead of
an std::pair of these two.

This is to facilitate next patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:27:00 +03:00
Pavel Emelyanov
aa85bc790b test: Add tests for radix tree
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:27:00 +03:00
Pavel Emelyanov
a5bd68ae5d utils: Compact radix tree
The tree uses integral type as a search key. On each level the local index
is next 7 bits from the key, respectively for 32-bit key we have 5 levels.
The tree uses 2 memory packing techniques -- prefix compaction and growing
node layouts.
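
The level arithmetic above can be illustrated with a small sketch. The helper names are hypothetical, and the choice that level 0 covers the lowest 7 bits is an assumption made for the example; the actual tree may consume the key bits in the other direction.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned radix_bits = 7;                       // bits consumed per level
constexpr unsigned radix_mask = (1u << radix_bits) - 1;  // 0x7f

// Number of levels needed to cover a key of the given width: ceil(bits / 7).
constexpr unsigned levels_for(unsigned key_bits) {
    return (key_bits + radix_bits - 1) / radix_bits;
}

// Local index of a 32-bit key at the given level (assumed: level 0 is the
// lowest 7 bits). The level must be below levels_for(32).
constexpr unsigned index_at(std::uint32_t key, unsigned level) {
    return (key >> (level * radix_bits)) & radix_mask;
}

// 32 bits split into 7-bit groups gives four full groups plus one
// partial group of 4 bits, hence 5 levels.
static_assert(levels_for(32) == 5, "a 32-bit key needs 5 levels");
```

For example, key 133 (1 * 128 + 5) has local index 5 at level 0 and local index 1 at level 1.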

The prefix compaction is used when a node has only one child. In this case
such a node is replaced in its parent with this only child and the child in
question keeps "prefix:length" pair on board, that's used to check if the
short-cut lookup took the correct path.

The growing node layouts makes the nodes occupy as much memory as needed
to keep the _present_ keys and there are 2 kinds of layouts.

Direct layout is array, intra-node search is plain indexing. The layout
storage grows in vector-like manner, but there's a special case for the
maximum-sized layout that helps avoiding some boundary checks.

Indirect layout keeps two arrays on board -- with values and with indices.
The intra-node search is thus a lookup in the latter array first. This
layout is used to save memory for sparse keys. Lookup is optimized with
SIMD instructions.

Inner nodes use direct layouts, as they occupy ~1% of memory and thus
need not be memory efficient. At the same time lookup of a key in the
tree potentially walks several inner nodes, so speeding up search for
them is beneficial.

Leaf nodes are indirect, since they are 99% of memory and thus need to
be packed well. The single indirect lookup when searching in the tree
doesn't slow things down notably even on insertion stress test.

That said
 * inner nodes are: header + 4 / 8 / 16 / 32 / 64 / 128 pointers
 * leaf nodes are : header + 4 / 8 / 16 / 32 bytes + <same nr> objects
                 or header + 16 bytes bitmap + 128 objects

The header is
 - backreference (8 bytes)
 - prefix (4 bytes)
 - size, layout, capacity (1 byte each)

The iterator is one-direction (for simplicity) but it is enough for its main
target -- the sparse array of cells on a row. Also the iterator has an
.index() method that reports back the index of the entry at which it points.
This greatly simplifies the tree scans by the class row further.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 19:25:10 +03:00
Piotr Sarna
495b7b5596 alternator: use unique_ptr for storing attribute paths
Previous commit eliminated the only copying of the attribute paths,
so it's now safe to make the object noncopyable.
Message-Id: <5468e8c17d3d42a03c1dd33706bbaac0c58959ce.1613398751.git.sarna@scylladb.com>
2021-02-15 18:22:59 +02:00
Piotr Sarna
7e1641224c alternator: batch: pass attrs_to_get by a shared pointer
The attrs_to_get object was previously copied, but it's quite
a heavyweight operation, since this object may contain an
instance of std::map or std::unordered_map.
To avoid copying whole maps, the object is wrapped in a shared
const pointer.
Message-Id: <75ad810de16c630b65ae8d319cb4b37e1de8085f.1613398751.git.sarna@scylladb.com>
2021-02-15 18:22:56 +02:00
Tomasz Grabiec
f86108aef1 Merge "raft: move ticking to external code" from Alejo
As Gleb suggested in a previous review, remove ticker from raft and
leave calling tick() to external code.

While there, tick faster to speed up tests.

* https://github.com/alecco/scylla/tree/tests-17-remove-ticker:
  raft: replication test: reduce ticker from 100ms to 1ms
  raft: drop ticker from raft
2021-02-15 18:14:03 +02:00
Botond Dénes
c24f350846 scylla-gdb.py: nonwrapping_interval_printer: fix compatibility with 4.2+
Use the `_interval` member instead of the old `_range` field, but stay
compatible with pre 4.2 releases, falling back to `_range` when
`_interval` doesn't exist.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210215104008.166746-1-bdenes@scylladb.com>
2021-02-15 18:14:03 +02:00
Pavel Emelyanov
d43ad8738c array-search: Add helpers to search for a byte in array
The radix tree code will need the code to find 8-bit value
in an array of some fixed size, so here are the helpers.

Those that allow for SIMD implementations have them for x86_64

TODO: Add aarch64

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:47:59 +03:00
Pavel Emelyanov
0ad361b380 test/perf_collection: Add callback to check the speed of clone
In some places scylla clones collections of objects, so it's
sometimes needed to measure the speed of this operation.

This patch adds a placeholder for it, but no implementations
for any supported collections. It will be added soon for radix
tree.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:46:37 +03:00
Pavel Emelyanov
767253fe24 test/perf_mutation: Add option to run with more than 1 columns
The --column-count option makes the test generate a schema with
the given number of columns and makes the mutation maker
fill a random column with the value on each iteration.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:45:42 +03:00
Pavel Emelyanov
fc84ab3418 test/perf_mutation: Prepare to have several regular columns
Teach the schema builder and test itself to work on more
than one regular column, but for now only use 1, as before.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:44:34 +03:00
Pavel Emelyanov
21adff2a41 test/perf_mutation: Use builder to build schema
The test will be taught to use more than one regular
column, so switch to builder in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:44:06 +03:00
Piotr Sarna
cbbb7f08a0 Merge 'Alternator: support nested attribute paths...
in all expressions' from Nadav Har'El.

This series fixes #5024 - which is about adding support for nested attribute
paths (e.g., a.b.c[2]) to Alternator.  The series adds complete support for this
feature in ProjectionExpression, ConditionExpression, FilterExpression and
UpdateExpression - and also its combination with ReturnValues. Many relevant
tests - and also some new tests added in this series - now pass.

The first patch in the series fixes #8043, a bug in some error cases in
conditions, which was discovered while working on this series, and is
conceptually separate from the rest of the series.

Closes #8066

* github.com:scylladb/scylla:
  alternator: correct implementation of UpdateItem with nested attributes and ReturnValues
  alternator: fix bug in ReturnValues=UPDATED_NEW
  alternator: implemented nested attribute paths in UpdateExpression
  alternator: limit the depth of nested paths
  alternator: prepare for UpdateItem nested attribute paths
  alternator: overhaul ProjectionExpression hierarchy implementation
  alternator: make parsed::path object printable
  alternator-test: a few more ProjectionExpression conflict test cases
  alternator-test: improve tests for nested attributes in UpdateExpression
  alternator: support attribute paths in ConditionExpression, FilterExpression
  alternator-test: improve tests for nested attributes in ConditionExpression
  alternator: support attribute paths in ProjectionExpression
  alternator: overhaul attrs_to_get handling
  alternator-test: additional tests for attribute paths in ProjectionExpression
  alternator-test: harden attribute-path tests for ProjectionExpression
  alternator: fix ValidationException in FilterExpression - and more
2021-02-15 15:45:49 +02:00
Tomasz Grabiec
508f928220 tests: sstables: Test sstable write fails on missing partition_end mid-stream
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210115163055.74398-1-tgrabiec@scylladb.com>
2021-02-15 15:45:49 +02:00
Benny Halevy
e532585126 test: sstables::test_env: do_with: futurize_invoke func
Otherwise, if `func` throws, test_env isn't stopped, as it should.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210214190157.211858-1-bhalevy@scylladb.com>
2021-02-15 15:45:49 +02:00
Wojciech Mitros
1819be5ebc canonical_mutation: make the data type non-contiguous
The canonical_mutation type can contain a large mutation, particularly
when the mutation is a result of converting a big schema. Its data
was stored in a field of type 'bytes', which is contiguous and
may cause a large allocation.
This is fixed by simply changing the type to 'bytes_ostream', which is
fragmented. The change is compatible because the idl type 'bytes' is compatible
with 'bytes_ostream' as a result of dcf794b, and all canonical_mutations's
methods use the field as an input stream (ser::as_input_stream), which can
be used on 'bytes_ostream' too.

Fixes #8074

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #8075
2021-02-15 10:24:47 +01:00
Nadav Har'El
f884104eed cql-pytest: add more JSON tests
This patch adds several more tests reproducing bugs in toJson() and
SELECT JSON.

First add two xfailing tests reproducing two toJson() issues - #7988
and #8002. The first is that toJson() incorrectly formats values of the
"time" type - it should be a string but Scylla forgets the quotes.
The second is that toJson() formats "decimal" values as JSON numbers
without using an exponent, resulting in memory allocation failure
for numbers with high exponents, like 1e1000000000.

The actual test for 1e1000000000 has to be skipped because in
debug build mode we get a crash trying this huge allocation.
So instead, we check 1e1000 - this generates a string of 1000
characters, which is much too much (should just be "1e1000")
but doesn't crash.

Then we add a reproducing test for issue #8077: When using SELECT JSON
on a function, such as count(*), ttl(v) or intAsBlob(v), Cassandra has
a specific way how it formats the result in JSON, and Scylla should do
it the same way unless we have a good reason not to.

As usual, the new tests pass on Cassandra and fail on Scylla, so they are
marked xfail.

Refs #7988
Refs #8002
Refs #8077.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210214210727.1098388-1-nyh@scylladb.com>
2021-02-15 10:55:43 +02:00
Nadav Har'El
9e029f09e5 docs: improve CONTRIBUTING.md
Start improving CONTRIBUTING.md, as suggested in issue #8037:

1. Incorporate the few lines we had in coding-style.md into CONTRIBUTING.md.
   This was mostly a pointer to Seastar's coding style anyway, so it's not
   helpful to have a separate file which hopeful developers will not find
   anyway.

2. Mention the Scylla developers mailing list, not just the Scylla users
   mailing list. The Scylla developers mailing list is where all the action
   happens, and it's very odd not to mention it.

3. The decision that github pull requests are forbidden was retracted
   a long time ago, so change the explanation on pull requests.

4. Some smaller phrasing changes.

Refs #8037.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210214152752.1071313-1-nyh@scylladb.com>
2021-02-14 22:09:24 +02:00
Nadav Har'El
8c9935359c build: Stir -O levels
Merged patch series from Pavel Emelyanov:

The default -O<> levels are considered to produce slow and tedious to
test code, so it's tempting to increase the level. On the other hand,
there were some complaints about recompile-mostly work that would suffer
from slower builds.

This set tries to find a reasonable compromise -- raise the default opt
levels and provide the ability to configure one if needed.

* 'br-cxx-o-levels-2' of github.com:xemul/scylla:
	configure: Switch debug build from -O0 to -Og
	configure: Switch dev build from -O1 to -O2
	configure: Make -O flag configurable
2021-02-14 22:09:24 +02:00
Avi Kivity
cb4e1bb0b9 logalloc: reduce gap between std min_free and logalloc min_free
With the larger gap, logalloc reserved more memory for std than
the background reclaim threshold for running, so it was triggered
rarely.

With the gap reduced, background reclaim is constantly running in
an allocating workload (e.g. cache misses).
2021-02-14 19:09:29 +02:00
Avi Kivity
ca0c006b37 logalloc: background reclaim
Set up a coroutine in a new scheduling group to ensure there is
a "cushion" of free memory. It reclaims in preemptible mode in
order to reduce reactor stalls (contrast with synchronous reclaim
that cannot preempt until it achieved its goal).

The free memory target is arbitrarily set at 60MB. The reclaimer's
shares are proportional to the distance from the free memory target;
so a workload that allocates memory rapidly will have the background
reclaimer working harder.

I rolled my own condition variable here, mostly as an experiment.
seastar::condition_variable requires several allocations, while
the one here requires none. We should formalize it after we gain
more experience with it.
2021-02-14 19:09:29 +02:00
Avi Kivity
35076dd2d3 logalloc: preemptible reclaim
Add an option (currently unused by all callers) to preempt
reclaim. If reclaim is preempted, it just stops what it is
doing, even if it reclaimed nothing. This is useful for background
reclaim.

Currently, preemption checks are on segment granularity. This is
probably too coarse, and should be refined later, but is already
better than the current granularity which does not allow preemption
until the entire requested memory size was reclaimed.
2021-02-14 19:09:29 +02:00
Alejo Sanchez
5e49650146 raft: replication test: reduce ticker from 100ms to 1ms
To speed up replication test reduce the tick time from 100ms to 1ms

Speed up: debug 3.7 to 2.5, dev 2.9 to 2.1 seconds

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-14 09:59:06 -04:00
Alejo Sanchez
b41a6822e8 raft: drop ticker from raft
Remove ticker callbacks from raft::server.
External code should periodically call raft::server::tick().

Update replication_test accordingly.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-14 09:41:42 -04:00
Nadav Har'El
ea338db581 cql-pytest: reproduce bug in setting time column with integer
This test reproduces issue #7987, where Scylla cannot set a time column
with an integer - whereas the documentation says this should be possible
and it also works in Cassandra.

The test file includes tests for both ways of setting a time column
(using an integer and a string), with both prepared and unprepared
statements, and demonstrates that only one combination fails in Scylla -
an unprepared statement with an integer. This test xfails on Scylla
and passes on Cassandra, and the rest pass on both.

Refs #7987.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210128215103.370723-1-nyh@scylladb.com>
2021-02-14 15:09:38 +02:00
Nadav Har'El
49cd9b3fd5 alternator: correct implementation of UpdateItem with nested attributes and ReturnValues
This patch fixes the last missing part of nested attribute support in
UpdateItem - returning the correct attributes when ReturnValues is requested.
When the expression says "a.b = :val" and ReturnValues is set to UPDATED_OLD
or UPDATED_NEW, only the actual updated attribute a.b should be returned, not
the entire top-level attribute a as we did before this patch.

This patch was made very simple because our existing hierarchy_filter()
function already does exactly the right thing, and can trivially be made to
accept any attribute_path_map<T> (in our case attribute_path_map<action>),
not just attrs_to_get as it did until now.

This patch also adds several more checks to the test in test_returnvalues.py
to improve the test's coverage even more. Interestingly, I discovered two
esoteric cases where DynamoDB does something which makes little sense, but
apparently simplified their implementation - but the beautiful thing is that
it also simplifies our implementation! See long comments about these two
cases in the test code.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
964500e47a alternator: fix bug in ReturnValues=UPDATED_NEW
Commit 0c460927bf broke UpdateItem's
ReturnValues=UPDATED_NEW by moving previous_item while it is still
needed. None of the existing tests broke because none of them needed
previous_item after it was moved - but it started to break when we
add support for nested attribute paths, which need this previous_item.

So this patch returns the move to a copy, as it was before the
aforementioned patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
33685a683e alternator: implemented nested attribute paths in UpdateExpression
This patch adds full support for nested attribute paths (e.g., a.b[3].c)
in UpdateExpression. Since previous patches already added such
support for ProjectionExpression, ConditionExpression and FilterExpression,
this means the nested attribute paths feature is now complete, so we
remove the warning from the documents. However, there is one last loose
end to tie and we will do it in the next patch: After this patch, the
combination of UpdateExpression with nested attributes and ReturnValues
is still wrong, and the test for it in test_returnvalues.py still xfails.

Note that previous patches already implemented support for attribute paths
in expression evaluations - i.e., the right-hand side of UpdateExpression
actions, and in this patch we just needed to implement the left hand side:
When an update action is on an attribute a.b we need to read the entire
content of the top-level a (an RWM operation), modify just the b part of
its json with the result of the action, and finally write back the entire
content of a. Of course everything gets complicated by the fact that we
can have multiple actions on multiple pieces of the same JSON, and we also
need to detect overlapping and conflicting actions (we already have this
detection in the attribute_path_map<> class we introduced in a previous
patch).

I decided to leave one small esoteric difference, reproduced by the xfailing
test_update_expression.py::test_nested_attribute_remove_from_missing_item:
As expected, "SET x.y = :val" fails for an item if its attribute x doesn't
exist or the item itself does not exist. For the update expression
"REMOVE x.y", DynamoDB fails if the attribute x doesn't exist, but oddly
silently passes if the entire item doesn't exist. Alternator does not
currently reproduce this oddity - it will fail this write as well.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
7789606545 alternator: limit the depth of nested paths
DynamoDB limits the depth of a nested path in expressions (e.g. "a.b.c.d")
to 32 levels. This patch adds the same limit also to Alternator.

The exact value of this limit is less important (although it did make
sense to choose the same limit as DynamoDB does), but it's important
to have *some* limit: It's often convenient to handle paths with a
recursive algorithm, and if we allow unlimited path depth, it can
result in unlimited recursion depth, and a crash. Let's avoid this
possibility.

We detect the over-long path while building the parsed::path object
in the parser, and generate a parse error.
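
A minimal sketch of rejecting an over-long path while it is being built,
in Python rather than the real parser. `parse_path` and `PathTooDeepError`
are hypothetical names, only dotted paths are handled (no `[index]` steps),
and the sketch assumes the limit counts every component including the
top-level attribute, which the message above does not spell out.

```python
MAX_PATH_DEPTH = 32  # the limit named in the commit message

class PathTooDeepError(ValueError):
    """Stands in for the parse error mentioned above (hypothetical name)."""

def parse_path(text):
    """Split a dotted path such as "a.b.c" into its components.

    The depth check runs while the path is being built, so an over-long
    path is rejected before any recursive code ever sees it.
    """
    steps = []
    for name in text.split("."):
        if len(steps) == MAX_PATH_DEPTH:
            raise PathTooDeepError(
                f"path is deeper than {MAX_PATH_DEPTH} levels")
        steps.append(name)
    return steps

shallow = parse_path("a.b.c.d")
```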

This patch also includes a test that verifies that both Alternator
and DynamoDB have the same 32-level nesting limit on paths.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
4c7e27c688 alternator: prepare for UpdateItem nested attribute paths
This patch prepares UpdateItem for updating of nested attribute paths
(e.g., "SET a.b = :val"), but does not yet support them.

Instead of _update_expression holding an unsorted list of "actions",
we change it to hold an attribute_path_map of actions. This will allow
us to process all the actions on a top-level attribute together, and
moreover gets us "for free" the correct checking for overlapping and
conflicting updates - exactly the same checking we already had in
attribute_path_map for ProjectionExpression. Other than this change,
most of this patch is just code movement, not functional changes.

After this patch, the tests for update path overlap and conflict pass:
test_update_expression_multi_overlap_nested and
test_update_expression_multi_conflict_nested.

We can also mark test_update_expression_nested_attribute_rhs as passing -
this test involves an attribute path in the right-hand-side of an update,
but the left-hand-side is still a top-level attribute, so it works (it
actually worked before this patch - it started working when we implemented
attribute paths in expressions, for ConditionExpression and
FilterExpression).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
7c5db2da83 alternator: overhaul ProjectionExpression hierarchy implementation
For ProjectionExpression we implemented a hierarchical filter object which
can be used to hold a tree of attribute paths grouped by the top-level
attributes, and also detect overlapping and conflicting entries.

For UpdateExpression, we need almost exactly the same object: We need to
group update actions (e.g., SET a.b=3) by the top-level attribute, and
also detect and fail overlapping or conflicting paths.

So in this patch we rewrite the data structure we had for ProjectionExpression
in a more generic manner, using the template attribute_path_map<T> - which
holds data of type T for each attribute path. We also implement a template
function attribute_path_map_add() to add a path/value pair to this map,
which includes all the overlap and conflict detecting logic.

There shouldn't be functional changes in this patch. The ProjectionExpression
code uses the new generic code instead of the specific code, but should work
the same. In the next patch we can use the new generic code to implement
UpdateExpression as well.

The only somewhat functional change is better error messages for
conflicting or overlapping paths - which now include one of the
conflicting paths.
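
A rough Python sketch of a path map that groups values by top-level
attribute and rejects bad combinations. It is not the C++
attribute_path_map<T>; `Node`, `path_map_add`, `format_path` and `PathError`
are hypothetical names. The sketch assumes that "overlap" means one path is
a prefix of (or equal to) another, and that "conflict" means the same node
is addressed both as a map (a.b) and as a list (a[1]).

```python
class PathError(ValueError):
    pass

class Node:
    def __init__(self):
        self.is_leaf = False   # True once a value is stored here
        self.value = None
        self.members = None    # dict name -> Node, if used as a map
        self.indexes = None    # dict int -> Node, if used as a list

def format_path(path):
    """Render ["a", "b", 3] as "a.b[3]" for error messages."""
    out = str(path[0])
    for step in path[1:]:
        out += f"[{step}]" if isinstance(step, int) else f".{step}"
    return out

def path_map_add(path_map, path, value):
    """Add value at path, or raise PathError naming the offending path."""
    shown = format_path(path)
    top, *rest = path
    node = path_map.setdefault(top, Node())
    for step in rest:
        if node.is_leaf:
            raise PathError(f"paths overlap: {shown}")
        if isinstance(step, int):
            if node.members is not None:
                raise PathError(f"paths conflict: {shown}")
            if node.indexes is None:
                node.indexes = {}
            node = node.indexes.setdefault(step, Node())
        else:
            if node.indexes is not None:
                raise PathError(f"paths conflict: {shown}")
            if node.members is None:
                node.members = {}
            node = node.members.setdefault(step, Node())
    if node.is_leaf or node.members is not None or node.indexes is not None:
        raise PathError(f"paths overlap: {shown}")
    node.is_leaf = True
    node.value = value

actions = {}
path_map_add(actions, ["a", "b"], "SET a.b = :v1")
path_map_add(actions, ["a", "c", 0], "SET a.c[0] = :v2")
```

With this layout, everything stored under `actions["a"]` can be processed
together, which is the grouping the message describes.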

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
f78d33dd73 alternator: make parsed::path object printable
Make the parsed::path object printable - which is useful for error messages.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
c2f18e56ea alternator-test: a few more ProjectionExpression conflict test cases
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:34 +02:00
Nadav Har'El
de62a8c2d3 alternator-test: improve tests for nested attributes in UpdateExpression
We already had many tests for nested attributes in UpdateExpression, but
this patch adds even more:

 * Test nested attribute in right-hand-side in assignment: z = a.c.x.
 * Test for making multiple changes to the same and different top-level
   attributes in the same update.
 * Additional cases of overlap between multiple changes.
 * Tests for conflict between multiple changes.
 * Tests for writing to a nested path on a non-existent attribute or item.
 * A stronger test that array append sorts the added items.

As this feature was not yet implemented, these tests fail on Alternator,
and pass on DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-14 12:21:24 +02:00
Pavel Solodovnikov
2ed445bfdd raft: raft_rpc: provide update_address_mapping and dispatcher functions
Provide several utility functions which will be used in rpc message
handlers:

1. `update_address_mapping` -- add a new (server_id -> inet_address)
   mapping for a `raft_rpc` instance.
   This is used to update rpc module with a caller address
   upon receiving an rpc message from a yet unknown server.
2. A set of dispatcher functions for every rpc call that forward calls
   to an appropriate `raft::rpc_server` instance (for which `raft::rpc`
   has a back-pointer).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-12 17:55:48 +03:00
Pavel Emelyanov
ffc9cc9aec range-streamer: Remove global storage service reference
The reference is used by range streamer and (!) storage
service itself to find out if the consistent_rangemovement
option is ON/OFF.

Both places already have the database with config at hand
and can be simplified.

v2: spellchecking

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210212095403.22662-1-xemul@scylladb.com>
2021-02-12 15:50:30 +01:00
Tomasz Grabiec
26aa000493 Merge "raft: replication test fixes" from Alejo
Fix rare debug mode hang and a minor fix.

* alejo/tests-16-fix-debug-hang-disruptive-ticks-master-v3:
  raft: replication test: fix debug mode hangs
  raft: replication test: remove unnecessary param
2021-02-11 20:35:35 +01:00
Nadav Har'El
a03a8a89a9 cql-pytest: fix flaky timeuuid_test.py
The test timeuuid_test.py::testTimeuuid sporadically failed, and it turns out
the reason was a bug in the test - which this patch fixes.

The buggy test created a timeuuid and then compared the time stored in it
to the result of the dateOf() CQL function. The problem is that dateOf()
returns a CQL "timestamp", which has millisecond resolution, while the
timeuuid *may* have finer than millisecond resolution. The reason why this
test rarely failed is that in our implementation, the timeuuid almost
always gets a millisecond-resolution timestamp. Only if now() gets called
more than once in one millisecond, does it pick a higher time incremented
by less than a millisecond.

What this patch does is to truncate the time read from the timeuuid to
millisecond resolution, and only then compare it to the result of dateOf().
We cannot hope for more.
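
A small Python sketch of the truncation described above, using only the
standard `uuid` module. The helper name `timeuuid_unix_ms` is hypothetical
and this is not the code of the test itself; it only illustrates dropping
the sub-millisecond part of a version-1 UUID timestamp before comparing it
with a millisecond-resolution value.

```python
import uuid

# Number of 100 ns intervals between the UUID epoch (1582-10-15) and the
# Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def timeuuid_unix_ms(u):
    """Return the timestamp of a version-1 UUID in whole Unix milliseconds."""
    ticks_since_unix_epoch = u.time - UUID_EPOCH_OFFSET  # 100 ns units
    # Integer division truncates anything finer than one millisecond
    # (10,000 ticks of 100 ns each).
    return ticks_since_unix_epoch // 10_000

now_ms = timeuuid_unix_ms(uuid.uuid1())
```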

Fixes #8060

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210211165046.878371-1-nyh@scylladb.com>
2021-02-11 19:03:58 +02:00
Alejo Sanchez
97338ab53f raft: replication test: fix debug mode hangs
In certain situations where barely enough nodes to elect a new leader
are connected, a disruptive candidate can occasionally block the
election.

For example, take servers A B C D E where only A B C are active in a
partition. If the test wants to elect A, it has to first make all 3
servers reach election timeout threshold (to make B and C receptive).
Then A is ticked till it becomes a candidate and has to send vote
requests to the other servers.

But all servers have a timer (_ticker) calling their periodic tick()
functions. If one of the other servers, say B, gets its timer tick
before A sends vote requests, B becomes a (disruptive) candidate and
will refuse to vote for A. In our case of only having 3 out of 5 servers
connected a single missing vote can hang the election.
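
The vote arithmetic behind this can be sketched in a few lines of Python.
The helper names are hypothetical and this is not the test's code; it only
shows why, with 3 of 5 servers connected, one refused vote is enough to
stall the election.

```python
def majority(cluster_size):
    """Smallest number of votes that is more than half the cluster."""
    return cluster_size // 2 + 1

def wins_election(cluster_size, votes):
    return votes >= majority(cluster_size)

cluster_size = 5               # A B C D E
votes_if_b_and_c_grant = 1 + 2  # A's own vote plus B and C
votes_if_b_refuses = 1 + 1      # A's own vote plus C only

a_wins_normally = wins_election(cluster_size, votes_if_b_and_c_grant)
a_wins_when_b_disrupts = wins_election(cluster_size, votes_if_b_refuses)
```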

This patch disables timer ticks for all servers when running custom
elections and partitioning.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-11 11:42:31 -04:00
Pavel Solodovnikov
d8dfdfba1e raft: pass group_id as an argument to raft rpc messages
This will be used later to filter the requests which belong
to the schema raft group and route them to shard 0.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-11 16:25:33 +03:00
Pavel Solodovnikov
3b50cdf1ed raft: use a named constant for pre-defined schema raft group
Introduce a static `schema_raft_state_machine::group_id` constant,
which denotes the raft group id for the schema changes server.

Also fix the comment on the state machine class declaration
to emphasize that the instance will be managed by shard 0.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-02-11 16:24:39 +03:00
Tomasz Grabiec
234f9dbe85 Merge 'Fix mixed cluster schema sync' from Eliran Sinvani
When a table is altered in a mixed cluster by a node with a more
recent version, the request can fail if there is a difference in
schema_features between the two versions.  This miniset handles the
two problems that prevent the sync.

Closes #8011

* github.com:scylladb/scylla:
  schema: recalculate digest when computed_columns feature is enabled
  schema tables: Remove mutations to unknown tables when adapting schema mutations
  schema tables: Register 'scylla_tables' versions that were sent to other nodes
2021-02-11 13:03:38 +01:00
Eliran Sinvani
63b794d104 schema: recalculate digest when computed_columns feature is enabled
The schema digest is affected by the computed_columns feature, this
means that we have to recalculate our schema digest when this feature is
enabled.
2021-02-11 13:48:58 +02:00
Eliran Sinvani
178ced9014 schema tables: Remove mutations to unknown tables when adapting schema
mutations

Whenever an alter table occurs, the mutations for the just altered table
are sent over to all of the replicas from the coordinator.
In a mixed cluster the mutations should be adapted to a specific version
of the schema. However, the adaptation that happens today doesn't omit
mutations to newly added schema tables; more specifically, mutations
to the `computed_columns` table, which doesn't exist, for example, in
version 2019.1.
This makes altering a table during a rolling upgrade from 2019.1 to
2020.1 dangerous.
2021-02-11 13:48:55 +02:00
Eliran Sinvani
ff1ba9bc2b schema tables: Register 'scylla_tables' versions that were sent to other
nodes

In a mixed cluster there can be a situation where `scylla_tables` needs
to be sent over to another node because of a schema sync or because the
node pulls it because it is referenced by a frozen_mutation. The former
is not a problem since the sending node chooses the version to send.
However, the latter is problematic since `scylla_tables` versions are not
registered anywhere.
This registers every `scylla_tables` schema version which is used to adapt
mutations, since after this happens a schema pull for this version might
follow.
2021-02-11 13:47:16 +02:00
Takuya ASADA
856fe12e13 dist/debian: install scylla-node-exporter.service correctly
node-exporter systemd unit name is "scylla-node-exporter.service", not
"node-exporter.service".

Fixes #8054

Closes #8053
2021-02-11 12:19:38 +02:00
Benny Halevy
d01e7e7b58 stream_session: prepare: fix missing string format argument
As seen in
mv_populating_from_existing_data_during_node_decommission_test dtest:
```
ERROR 2021-02-11 06:01:32,804 [shard 0] stream_session - failed to log message: fmt::v7::format_error (argument not found)
```

Fixes #8067

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210211100158.543952-1-bhalevy@scylladb.com>
2021-02-11 12:05:32 +02:00
Wojciech Mitros
17634d141b sstables: add test for checking the latency of updating the sstable_set in a table
Added a test which measures the time it takes to replace sstables in a table's
sstable_set, using the leveled compaction strategy.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
693b4e0fcd sstables: move column_family_test class from test/boost to test/lib
Column_family_test allows calling private methods on column_family's
sstable_set. It may be useful not only in the boost tests, so it's moved
from test/boost/sstable_test.hh to test/lib/sstable_test_env.hh.
sstable_test.hh includes sstable_test_env.hh, so no includes need to be
changed.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
0feff8712e sstables: use fast copying of the sstable_set instead of rebuilding it
The sstable_set enables copying without iterating over all its elements,
so it's faster to copy a set and modify it than copy all its elements
while filtering the ones that were erased.

The modifications are done on a temporary version of the set, so that
if an operation fails the base version remains unchanged

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
aa0cd940d6 sstables: replace the sstable_set with a versioned structure
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that allows copying
without actually copying all the sstables in the set, while providing
the same methods (and some extra) without majorly decreasing their speed.
This is achieved by associating all copies with sstable_set versions
which hold the changes that were performed in them, and references to
the versions that were copied, a.k.a. their parents. The set represented
by a version is the result of combining all changes of its ancestors.

This causes most methods of the version to have a time complexity
dependent on the number of its ancestors. To limit this number, versions
that represent copies that have already been deleted are merged with their
descendants.

The strategy used for deciding when and with which of its children
a version should be merged heavily depends on the use case of sstable_sets:
there is a main copy of the set in a table class which undergoes many
insertions and deletions, and there are copies of it in compaction or
mutation readers which are further copied or edited few or zero times.
It's worth mentioning that when a copy is made, the copied set should not
be modified anymore, because it would also modify the results given by the
copy. In order to still allow modifying the copied set, if a change is
to be performed on it, the version associated with this set is replaced
with a new version depending on the previous one.
As we can see, in our use case there is a main chain of versions (with
changes from the table), and smaller branches of versions that start
from a version from this chain, but are deleted soon after.
In such a case we can merge a version when it has exactly one descendant,
as this limits the number of concurrent ancestors of a version to the
number of copies of its ancestors that are concurrently used. During each
such merge, the parent version is removed and the child version is
modified so that all operations on it give the same results.

In order to preserve the same interface, the sstable_set still contains a
lw_shared_ptr<sstable_list>, but sstable_list (previously an alias for
unordered_set<shared_sstable>) is now a new structure. Each sstable_set
contains a sstable_list but not every sstable_list has to be contained
by a sstable_set, and we also want to allow fast copying of sstable_lists,
so the reference to the sstable_set_version is kept by the sstable_lists
and the sstable_set can access the sstable_set_version it's associated
with through its sstable_list.

Operations that access sstables that are elements of a certain sstable_set
copy (select, select_sstable_runs and sstable_list's iterator) get results
from containers that hold all sstables from all versions (which are stored
in a single, shared "versioned_sstable_set_data" structure), and then
filter out those sstables that aren't present in the version in question.
This version of the sstable_set allows adding and erasing the same sstable
repeatedly. Inserting and erasing from the set modifies the containers in
a version only when it has an actual effect: if an sstable has been added
in the parent version, and hasn't been erased in the child version, adding
it again will have no effect. This ensures that when merging versions, the
versions have disjoint sets of added, and erased sstables (an sstable can
still be added in one and erased in the second). It's worth noting that if
an sstable has been added in one of the merged sets and erased in the
second, the version that remains after merging doesn't need to have any
info about the sstable's inclusion in the set - it can be inferred from
the changes in previous versions (and it doesn't matter if the sstable has
been erased before or after being added).

To release pointers to sstables as soon as possible (i.e. when all references
to versions that contain them die), if an sstable is added/erased in all
child versions that are based on a version which has no external references,
this change gets removed from these versions and added to the parent version.
If an sstable's insertion gets overwritten as a result, we might be able
to remove the sstable completely from the set. We know how many times this
needs to happen by counting, for each sstable, in how many different versions
it has been added. When a change that adds an sstable gets merged with a change
that removes it, or when such a change simply gets deleted alongside its
associated version, this count is reduced, and when an sstable gets added to a
version that doesn't already contain it, this count is increased.

The methods that modify the set's contents give a strong exception guarantee
by trying to insert new sstables to its containers, and erasing them in
the case of a caught exception.

Fixes #2622

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
48153a1e2c sstables: remove potential ub
If the range expression in a range based for loop returns a temporary,
its lifetime is extended until the end of the loop. The same can't be said
about temporaries created within the range expression. In our case,
*t->get_sstables_including_compacted_undeleted() returns a reference to a
const sstable_list, but the t->get_sstables_including_compacted_undeleted()
is a temporary lw_shared_ptr, so its lifetime may not be prolonged until the
end of the loop, and it may be the sole owner of the referenced sstable_list,
so the referenced sstable_list may be already deleted inside the loop too.
Fix by creating a local copy of the lw_shared_ptr, and get reference from it
in the loop.

Fixes #7605

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Wojciech Mitros
e1b494633b sstables: make sstable_set constructor less error-prone
Adding a non-empty set of sstables as the set of all sstables in
an sstable_set could cause inconsistencies with the values returned
by select_sstable_runs because the _all_runs map would still be
initialized empty. For similar reasons, the provided sstable_set_impl
should also be empty.

Dispel doubts by removing the unordered_set from the constructor, and
adding a check of emptiness of the sstable_set_impl.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-02-11 11:02:55 +01:00
Shlomi Livne
718976e794 scylla_io_setup did not configure pre tuned gce instances correctly
scylla_io_setup condition for nr_disks was using the bitwise operator
(&) instead of logical and operator (and) causing the io_properties
files to have incorrect values
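
A short Python illustration of this class of bug. The actual condition in
scylla_io_setup is not quoted above, so the bounds below are hypothetical;
the point is only that `&` binds more tightly than comparison operators,
so the expression is parsed very differently from the one with `and`.

```python
def in_range_buggy(nr_disks):
    # Parsed as the chained comparison: nr_disks >= (1 & nr_disks) < 4
    return nr_disks >= 1 & nr_disks < 4

def in_range_fixed(nr_disks):
    # Parsed as intended: (nr_disks >= 1) and (nr_disks < 4)
    return nr_disks >= 1 and nr_disks < 4

# For 8 disks, 1 & 8 is 0, so the buggy form evaluates 8 >= 0 < 4.
buggy_for_8 = in_range_buggy(8)
fixed_for_8 = in_range_fixed(8)
```

The two forms disagree for some inputs and agree for others, which is why
such a bug can go unnoticed on most configurations.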

Fixes #7341

Reviewed-by: Lubos Kosco <lubos@scylladb.com>
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>

Closes #8019
2021-02-11 11:06:00 +02:00
Avi Kivity
9cbbf40710 Merge "register_inactive_read: error handling" from Benny
"
Currently, register_inactive_read accepts an eviction_notify_handler
to be called when the inactive_read is evicted.

However, in case there was an error in register_inactive_read
the notification function isn't called leaving behind
state that needs to be cleaned up.

This series separates the register_inactive_reader interface
into 2 parts:

1. register_inactive_reader(flat_mutation_reader) - which just registers
the reader and returns an inactive_read_handle, *if permitted*.
Otherwise, the notification handler is not called (it is not known yet)
and the caller is not expected to do anything fancy at this point
that will require cleanup.

This optimizes the server when overloaded since we do less work
that we'd need to undo in case the reader_concurrency_semaphore
runs out of resources.

2. After register_inactive_reader succeeds in returning a valid
inactive_read_handle, the caller sets up its local state
and may call `set_notify_handler` to set the optional
notify_handler and ttl on the i_r_h.

After this state, the notify_handler will be called when
the inactive_reader is evicted, for any reason.
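
The two-step protocol above can be sketched in Python as follows. This is
only an illustration of the calling pattern, not the C++ interface: the
classes, the `capacity` parameter, `evict_one` and the dict-based index are
hypothetical, ttl is left out, and the handle is not invalidated on
eviction as it would be in the real code.

```python
class InactiveReadHandle:
    def __init__(self, entry=None):
        self._entry = entry

    def __bool__(self):
        # An empty handle means the reader was not registered.
        return self._entry is not None

class Semaphore:
    def __init__(self, capacity):
        self._capacity = capacity
        self._inactive = []

    def register_inactive_read(self, reader):
        # Step 1: register only. No handler is known yet, so on refusal
        # there is nothing to notify and nothing for the caller to undo.
        if len(self._inactive) >= self._capacity:
            return InactiveReadHandle()
        entry = {"reader": reader, "notify": None}
        self._inactive.append(entry)
        return InactiveReadHandle(entry)

    def set_notify_handler(self, handle, notify):
        # Step 2: attach the handler after registration succeeded.
        handle._entry["notify"] = notify

    def evict_one(self):
        if not self._inactive:
            return False
        entry = self._inactive.pop(0)
        if entry["notify"] is not None:
            entry["notify"]()
        return True

def insert_querier(sem, index, key, reader):
    handle = sem.register_inactive_read(reader)
    if not handle:
        return False          # refused: the index was never touched
    index[key] = handle       # local state is set up only on success
    sem.set_notify_handler(handle, lambda: index.pop(key, None))
    return True

sem = Semaphore(capacity=1)
index = {}
first = insert_querier(sem, index, "q1", "reader-1")
second = insert_querier(sem, index, "q2", "reader-2")
```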

querier_cache::insert_querier was modified to use the
above procedure and to handle (and log/ignore) any error
in the process.

inactive_read_handle and inactive_read keeping track of each other
was simplified by keeping an iterator in the handle and a backpointer
in the inactive_read object.  The former is used to evict the reader
and to set the notify_handler and/or ttl without having to lookup the i_r.
The latter is used to invalidate the i_r_h when the i_r is destroyed.

Test: unit(release), querier_cache_test(debug)
"

* tag 'register_inactive_read-error-handling-v6' of github.com:bhalevy/scylla:
  querier_cache: insert_querier: ignore errors to register inactive reader
  querier_cache: insert_querier: handle errors
  querier_utils: mark functions noexcept
  reader_concurrency_semaphore: register_inactive_read: make noexcept
  reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader
  reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional
  reader_concurrency_semaphore: inactive_read: use intrusive list
  reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read
  reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason
  reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error
  reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged
  reader_concurrency_semaphore: inactive_read_handle: swap definition order
  reader_lifecycle_policy: retire low level try_resume method
  reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader
2021-02-10 19:09:21 +02:00
Alejo Sanchez
941eceb9c8 raft: replication test: remove unnecessary param
Remove unnecessary param from wait_log()

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-10 09:11:17 -04:00
Piotr Sarna
2aa4631148 test: fix a flaky timeout test depending on TTL
One of the USING TIMEOUT tests relied on a specific TTL value,
but that's fragile if the test runs on the boundary of 2 seconds.
Instead, the test case simply checks if the TTL value is present
and is greater than 0, which makes the test robust unless its execution
lasts for more than 1 million seconds, which is highly unlikely.

Fixes #8062

Closes #8063
2021-02-10 14:20:02 +02:00
Piotr Sarna
8f98c0585f failure_detector: add a missing const qualifier
The mean() method is effectively const, so it should be marked as such.
Message-Id: <14dd39e8419136909fcf10508c34de3752faa7fe.1612953601.git.sarna@scylladb.com>
2021-02-10 13:04:37 +02:00
Piotr Sarna
aa39130a20 bounded_stats_queue: add missing const qualifiers
Most of the methods of this utility are effectively const.
Message-Id: <ed376ab74b6323cf770cc0a1314edbae0b16111e.1612953601.git.sarna@scylladb.com>
2021-02-10 13:04:35 +02:00
Piotr Jastrzebski
390cef6a96 cdc: Extract create_stream_ids from topology_description_generator
This new function will be used in the following patches in additional
places.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-02-10 10:24:06 +01:00
Gleb Natapov
d06d21bfae database: remove add_keyspace() function
It is no longer used.
Message-Id: <20210209175931.1796263-2-gleb@scylladb.com>
2021-02-10 00:36:02 +01:00
Nadav Har'El
54785604b4 Merge 'Add max concurrent requests to alternator' from Piotr Sarna
Previous version, merged and dequeued due to a dependency bug: https://github.com/scylladb/scylla/pull/7297

Note: this pull request is temporarily created against /next, because it depends on https://github.com/scylladb/scylla/pull/7279.

This series adds support for `max_concurrent_requests_per_shard` config variable to alternator. Excessive requests are shed and RequestLimitExceeded is sent back to the client.
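
A minimal Python sketch of this kind of load shedding. It is not
Alternator's implementation: `RequestLimiter` and its fields are
hypothetical names, and whether the real check uses `>` or `>=` is not
stated here. Only the error name and the `requests_shed` counter idea come
from the series description.

```python
class RequestLimitExceeded(Exception):
    pass

class RequestLimiter:
    """Counts in-flight requests and sheds those over the limit."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_flight = 0
        self.requests_shed = 0   # would back a metric

    def __enter__(self):
        if self.in_flight >= self.max_concurrent:
            self.requests_shed += 1
            raise RequestLimitExceeded(
                f"too many in-flight requests: {self.in_flight}")
        self.in_flight += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs only for requests that were admitted by __enter__.
        self.in_flight -= 1
        return False

limiter = RequestLimiter(max_concurrent=2)
with limiter:
    handled = "request served"
```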

Tested manually by reloading Scylla multiple times and editing the config, while bombarding alternator with many concurrent requests. Observed expected failures are:
`botocore.errorfactory.RequestLimitExceeded: An error occurred (RequestLimitExceeded) when calling the CreateTable operation: too many in-flight requests: 17
`

Fixes #7294

Closes #8039

* github.com:scylladb/scylla:
  alternator: server: return api_error instead of throwing
  alternator: add requests_shed metrics
  alternator: add handling max_concurrent_requests_per_shard
  alternator: add RequestLimitExceeded error
2021-02-09 19:55:31 +02:00
Gleb Natapov
d8345c67d9 Consolidate system and non system keyspace creation
The code that creates the system keyspace open-codes a lot of things from
database::create_keyspace(). The patch makes create_keyspace() suitable
for both system and non system keyspaces and uses it to create system
keyspaces as well.
Message-Id: <20210209160506.1711177-1-gleb@scylladb.com>
2021-02-09 17:18:04 +01:00
Gleb Natapov
51037e94ec lwt: handle an error during prune operation
The error is benign but if it is not handled an "unhandled exception" error
will be printed in the logs.

Message-Id: <20210209150313.GA1708015@scylladb.com>
2021-02-09 16:26:00 +01:00
Tomasz Grabiec
3dd9c5596a Merge 'Minor tweaks to the failure detector interface' from Piotr Sarna
The interface of the failure detector service is cleaned up a little:
 - an unimplemented method is removed (is_alive)
 - a return type of another method is fixed (arrival_samples)
 - a getter for the most recent successful update is added (last_update)

This code was tested manually during various overload protection
experiments, which check if the failure detector can be used to reject
requests which have a very small chance of succeeding within their
timeout.

Closes #8052

* github.com:scylladb/scylla:
  failure_detector: add getting last update time point
  failure_detector: return arrival samples by const reference
  failure_detector: remove unimplemented is_alive method
2021-02-09 15:23:09 +01:00
Konstantin Osipov
86dec79c1b raft: rename progress.hh to tracker.hh
class tracker is the main class of this module.
2021-02-09 17:07:25 +03:00
Konstantin Osipov
41387225c3 raft: extend single_node_is_quiet test
2021-02-09 17:04:13 +03:00
Piotr Sarna
4acc6fecf0 Merge 'locator: Check DC names in NetworkTopologyStrategy' from Juliusz Stasiewicz
The same trick is used as in C*:
79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)

The edited CQL test relied on quietly accepting non-existing DCs, so it had to
be removed. Also, one boost-test referred to nonexistent `datacenter2` and had
to be removed.

Fixes #7595

Closes #8056

* github.com:scylladb/scylla:
  tests: Adjusted tests for DC checking in NTS
  locator: Check DC names in NTS
2021-02-09 14:45:20 +02:00
Botond Dénes
3d001b5587 query: use local limit for non-limited queries in mixed cluster
Since fea5067df we enforce a limit on the memory consumption of
otherwise non-limited queries like reverse and non-paged queries. This
limit is sent down to the replicas by the coordinator, ensuring that
each replica is working with the same limit. This however doesn't work
in a mixed cluster, when upgrading from a version which doesn't have
this series. This has been worked around by falling back to the old
max_result_size constant of 1MB in mixed clusters. This however resulted
in a regression when upgrading from a pre-fea5067df version to a post-fea5067df
one. Pre-fea5067df versions already had a limit for reverse queries, which was
generalized to also cover non-paged ones by fea5067df.
The regression manifested in previously working reverse queries being
aborted. This happened because even though the user has set a generous
limit for them before the upgrade, in the mixed cluster replicas fall back
to the much stricter 1MB limit, temporarily ignoring the configured limit
if the coordinator is an old node. This patch solves this problem by
using the locally configured limit instead of the max_result_size
constant. This means that the user has to take extra care to configure
the same limit on all replicas, but at least they will have working
reverse queries during the upgrade.

Fixes: #8022

Tests: unit(release), manual test by user who reported the issue
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210209075947.1004164-1-bdenes@scylladb.com>
2021-02-09 14:45:20 +02:00
Avi Kivity
2f50bf2029 Update seastar submodule
* seastar 4c7c5c7c4...76cff5896 (6):
  > rpc: Make is possible for rpc server instance to refuse connection
  > reactor: expose cumulative tasks processed statistic
  > fair_queue: add missing #include <optional>
  > reactor: optimize need_preempt() thread-local-storage access
  > Merge " Use reference for backend->reactor link" from Pavel E
  > test: coroutines: failed coroutine does not throw
2021-02-09 14:45:20 +02:00
Avi Kivity
37b41d7764 test: add missing #include <fstream>
std::ofstream is used, but there is no direct include for it. This
fails the build with libstdc++ 11.

Closes #8050
2021-02-09 14:45:20 +02:00
Juliusz Stasiewicz
97bb15b2f2 tests: Adjusted tests for DC checking in NTS
CQL test relied on quietly accepting non-existing DCs, so it had
to be removed. Also, one boost-test referred to nonexisting
`datacenter2` and had to be removed.
2021-02-09 08:29:35 +01:00
Juliusz Stasiewicz
b6fb5ee912 locator: Check DC names in NTS
The same trick is used as in C*:
79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)

Fixes #7595
2021-02-09 07:04:17 +01:00
Benny Halevy
d2b8b3041d querier_cache: insert_querier: ignore errors to register inactive reader
Since the reader may normally be dropped upon
registration, hitting an error is equivalent to having
it evicted at any time, so just log the exception
and ignore it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
9bdb8190ce querier_cache: insert_querier: handle errors
Make insert_querier exception safe.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
b8f935457a querier_utils: mark functions noexcept
They all are trivially noexcept.
Mark them so to simplify error handling assumptions in the
next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
6e92f07630 reader_concurrency_semaphore: register_inactive_read: make noexcept
Catch errors when allocating an inactive_read and just log them.
Return an empty inactive_read_handle in
this case, as if the inactive reader was evicted due to
lack of resources.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
46c2229b78 reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader
Register the inactive reader first with no
evict_notify_handler and ttl.

Those can be set later, only if registration succeeded.
Otherwise, as in the querier example, there is no need
to place the querier in the index and erase it
on eviction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
d752ea7e91 reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional
By default it will be unarmed and with no callback
so there's no need to wrap it in a std::optional.

This saves an allocation and another potential
error case.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
a12c9638b6 reader_concurrency_semaphore: inactive_read: use intrusive list
To simplify insertion and eviction into the inactive_reads container,
use an intrusive list that requires a single allocation for the
inactive_read object itself.

This allows passing a reference to the inactive_read
to evict it.

Note that the reader will be unlinked automatically from
the inactive_readers list if the inactive_read_handle is destroyed.
This is okay since there is no need to track the inactive_read
if the caller loses the i_r_h (e.g. if an error is thrown).

It is also safe to evict the inactive_reader while the
i_r_h is alive.  In this case the i_r will be unlinked
after the flat_mutation_reader it holds is moved out of it.

bi::auto_unlink will detect that it's already unlinked
when destroyed and do nothing.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 22:31:01 +02:00
Benny Halevy
f751e42bf9 reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:52:16 +02:00
Benny Halevy
81cd3d0c51 reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason
So try_evict_one_inactive_read could be used also in do_wait_admission
in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
e072199b8d reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error
Calling unregister_inactive_read on the wrong semaphore is a blatant
bug so better call on_internal_error so it'd be easier to catch and fix.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
9c9b4c85ae reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged
There is no need to look up the inactive_read if the i_r_h
is disengaged, it should not be registered so just return
quickly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
769dff6c54 reader_concurrency_semaphore: inactive_read_handle: swap definition order
For using boost::intrusive::list for _inactive_reads.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
d565e3fb57 reader_lifecycle_policy: retire low level try_resume method
The caller can now just call sem.unregister_inactive_read(irh) directly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Benny Halevy
4e8f29ef14 reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader
There's no need to hold a unique_ptr<flat_mutation_reader> as
flat_mutation_reader itself holds a unique_ptr<flat_mutation_reader::impl>
and functions as a unique ptr via flat_mutation_reader_opt.

With that, unregister_inactive_read was modified to return a
flat_mutation_reader_opt rather than a std::unique_ptr<flat_mutation_reader>,
keeping exactly the same semantics.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-08 20:32:40 +02:00
Nadav Har'El
e52785be08 alternator: support attribute paths in ConditionExpression, FilterExpression
This patch fully implements support for attribute paths (e.g. a.b.c, a.d[3])
for the ConditionExpression in conditional updates, and FilterExpression in
queries and scans. After this patch, all previously-xfailing tests in
test_projection_expression.py and test_filter_expression.py now pass.

The fix is simple: Both ConditionExpression and FilterExpression use the
function calculate_value() to calculate the value of the expression. When
this function calculates the value of a path, it mustn't just take the
top-level attribute - it needs to walk into the specific sub-object as
specified by the attribute path.

This is not the end of attribute path support, UpdateExpression and
ReturnValues are not yet fully supported. This will come in following
patches.
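
The path walk described above can be illustrated with a minimal Python
sketch. It is not the Alternator code (which is C++): the function name
walk_path() and the pre-parsed list-of-steps representation are assumptions
made for this example only. A string step selects a dictionary member and
an integer step selects a list element:

```python
def walk_path(item, steps):
    """Follow an attribute path into a nested item.

    `steps` is a hypothetical pre-parsed path: "a.d[3]" becomes
    ["a", "d", 3]. Returns None when the path does not fit the item.
    """
    value = item
    for step in steps:
        if isinstance(step, int):
            # Index step: only valid on a list with enough elements.
            if not isinstance(value, list) or not 0 <= step < len(value):
                return None
        else:
            # Member step: only valid on a dictionary holding that key.
            if not isinstance(value, dict) or step not in value:
                return None
        value = value[step]
    return value


item = {"a": {"b": {"c": 7}, "d": [10, 20, 30, 40]}}
print(walk_path(item, ["a", "b", "c"]))
print(walk_path(item, ["a", "d", 3]))
```

Taking only the top-level attribute would correspond to stopping after the
first step, which is the behavior this patch replaces.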

Refs #5024

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 19:19:09 +02:00
Nadav Har'El
579c7b8dae alternator-test: improve tests for nested attributes in ConditionExpression
Strengthen the tests in test_condition_expression.py for nested attribute
paths (e.g., b.y[1]):

1. The test test_update_condition_nested_attributes only tested successful
   conditions involving nested attributes. Let's also add an *unsuccessful*
   condition, to verify we don't accidentally pass every condition involving
   a nested attribute.

2. Test a case where a non-existent nested attribute is involved in the
   condition.

3. In the test for an attribute path with references - "#name1.#name2",
   make sure the test doesn't pass if #name2 is silently ignored.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 19:19:09 +02:00
Piotr Sarna
faca59efa6 failure_detector: add getting last update time point
It can be useful to know how long ago an endpoint last
responded to a heartbeat.
2021-02-08 16:45:58 +01:00
Tomasz Grabiec
c16e4a0423 migration_manager: Propagate schema changes with reads like we do on writes
This fixes the problem where the coordinator already knows about the
new schema and issues a read which uses new objects, but the replica
doesn't know those objects yet. The read will fail in this case. We
can avoid this if we propagate schema changes with reads, like we
already do for writes.

Message-Id: <20210205163422.414275-1-tgrabiec@scylladb.com>
2021-02-08 16:49:55 +02:00
Avi Kivity
4082f57edc Merge 'Make commitlog disk limit a hard limit.' from Calle Wilund
Refs #6148

Commitlog disk limit was previously a "soft" limit, in that we allowed allocating new segments, even if we were over
disk usage max. This would also cause us sometimes to create new segments and delete old ones, if badly timed in
needing and releasing segments, in turn causing useless disk IO for pre-allocation/zeroing.

This patch set does:
* Make limit a hard limit. If we have disk usage > max, we wait for delete or recycle.
* Make flush threshold configurable. Default is ask for flush when over 50% usage. (We do not wait for results)
* Make flush "partial". We flush X% of the used space (used - thres/2), and make the rp limit accordingly. This means we will try to clear the N oldest segments, not all. I.e. "lighter" flush. Of course, if the CL is wholly dominated by a single CF, this will not really help much. But when > 1 cf is used, it means we can skip those not having unflushed data < req rp.
* Force more eager flush/recycle if we're out of segments

Note: flush threshold is not exposed in scylla config (yet). Because I am unsure of wording, and even if it should.
Note: testing is sparse, esp. in regard to latency/timeouts added in high usage scenarios. While I can fairly easily provoke "stalls" (i.e. forced waiting for segments to free up) with simple C-S, it is hard to say exactly where in a more sane config (I set my limits looow) latencies will start accumulating.

Closes #7879

* github.com:scylladb/scylla:
  commitlog: Force earlier cycle/flush iff segment reserve is empty
  commitlog: Make segment allocation wait iff disk usage > max
  commitlog: Do partial (memtable) flushing based on threshold
  commitlog: Make flush threshold configurable
  table: Add a flush RP mark to table, and shortcut if not above
2021-02-08 16:44:05 +02:00
Avi Kivity
af2d1fa0de Update abseil submodule
Compiles with newer compilers.

Added new library wyhash.a to configure.py.

* abseil 1e3d25b...9c6a50f (51):
  > Export of internal Abseil changes
  >  Do not set mvsc linker flags for clang-cl (fixes #874) (#891)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add support for Elbrus 2000 (e2k) (#889)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Add missing word 'library' in the 'status' description (#868)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Include the status library into the main README. (#863)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > fix build dll (#797)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Fix stacktrace on aarch64 architecture. Fixes #805 (#827)
  > moved deleted functions to public for better compiler errors. (#828)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
2021-02-08 15:41:46 +02:00
Gleb Natapov
b9a5aff7a6 distributed_loader: drop execute_futures function
execute_futures() is just a local reimplementation of
when_all_succeed(). Use the latter directly.

Message-Id: <20210208114816.GA1658725@scylladb.com>
2021-02-08 13:24:19 +01:00
Nadav Har'El
104ef5242b alternator: support attribute paths in ProjectionExpression
This patch fully implements support for attribute paths (e.g. a.b.c, a.d[3])
for the ProjectionExpression in the various operations where this parameter
is supported - GetItem, BatchGetItem, Query and Scan. After this patch, all
xfailing tests in test_projection_expression.py now pass.

In the previous patch we remembered in the "attrs_to_get" object not only
the top-level attributes to read from the table, but also how to filter
from it only the desired pieces of the nested document. In this patch we
add a filter() function to do this filtering, and call it in the right
places to post-process the JSON objects we read from the table.
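
As a rough illustration of this post-processing step, here is a Python
sketch (the real filter() is C++ and may behave differently in corner
cases). The function name project() and the tree representation are
assumptions for this example: None means "keep the whole value", and a
dict keyed by strings (members) or integers (list indexes) means "keep
only these parts". Returning a list that holds just the selected elements
is also an assumption of the sketch:

```python
def project(value, wanted):
    """Keep only the parts of `value` described by the `wanted` tree."""
    if wanted is None:
        return value                      # whole value requested
    if isinstance(next(iter(wanted)), int):
        # Index filter: applies only to lists.
        if not isinstance(value, list):
            return None
        kept = []
        for index in sorted(wanted):
            if index < len(value):
                part = project(value[index], wanted[index])
                if part is not None:
                    kept.append(part)
        return kept or None
    # Member filter: applies only to dictionaries.
    if not isinstance(value, dict):
        return None
    kept = {}
    for key, sub_wanted in wanted.items():
        if key in value:
            part = project(value[key], sub_wanted)
            if part is not None:
                kept[key] = part
    return kept or None


item = {"a": {"b": {"c": 7, "x": 1}, "d": [10, 20, 30, 40]}, "z": 5}
# Hypothetical tree for the paths a.b.c and a.d[3]:
wanted = {"a": {"b": {"c": None}, "d": {3: None}}}
print(project(item, wanted))
```

A path that does not fit the item (for example an index into something
that is not a list) simply contributes nothing to the result.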

We also had to fix reference resolution in paths to resolve all the
components of the path (e.g., #name1.#name2) and not just the top-level
attribute.

This is not the end of attribute path support, there are still other
expressions (ConditionExpression, UpdateExpression, FilterExpression,
ReturnValues) where they are not yet supported. This will come in following
patches.

Refs #5024

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:40 +02:00
Nadav Har'El
6340619e69 alternator: overhaul attrs_to_get handling
In the existing code, the variable "attrs_to_get" is a list of top-level
attributes to fetch for an item. It is used to implement features like
ProjectionExpression or AttributesToGet in GetItem and other places.

However, to support attribute paths (e.g., a.b.c[2]) in ProjectionExpression,
i.e., issue #5024, we need more than that. We still need to know the top-
level attribute "a", because this is the granularity we have in the Scylla
table (all the content inside "a" is serialized as a single JSON); But we
also need to remember exactly which parts *inside* "a" we will need to
extract and return.

So in this patch we add a new type, "attrs_to_get", which is more than
just a list of top-level attributes. Instead, it is a *map*, whose keys
are the top-level attributes, and the value for each of them is a
"hierarchy_filter", an object which describes which part of the attribute
is needed.

This patch includes the code which converts the AttributesToGet and
ProjectionExpression into the new attrs_to_get structure. During this
conversion, we recognize two kinds of errors which DynamoDB complains
about: We recognize "overlapping" attributes (e.g., requesting both
a.b and a.b.c) and "conflicting" attributes (e.g, requesting both
a.b and a[1]). After this, two xfailing tests we had for detecting
these overlap and conflicts finally pass and their "xfail" label is
removed.
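
The overlap and conflict checks can be sketched in Python as follows.
This is only an illustration under stated assumptions, not the C++
attrs_to_get/hierarchy_filter code: the names add_path() and PathError
are hypothetical, paths are given pre-parsed as lists of steps, and
requesting the same path twice is treated as an overlap:

```python
class PathError(ValueError):
    """Raised for overlapping or conflicting attribute paths."""


def add_path(attrs, steps):
    """Record one path, e.g. ["a", "b"] for a.b or ["a", 1] for a[1].

    `attrs` maps each top-level attribute to a subtree. A subtree is
    None (whole value wanted) or a dict of child steps.
    """
    node = attrs
    for i, step in enumerate(steps):
        last = i == len(steps) - 1
        # A node may hold member steps (str) or index steps (int), not both.
        if node and type(next(iter(node))) is not type(step):
            raise PathError("conflict at step %r" % (step,))
        if step in node:
            child = node[step]
            # One path is a prefix of (or equal to) the other.
            if child is None or last:
                raise PathError("overlap at step %r" % (step,))
            node = child
        else:
            node[step] = None if last else {}
            node = node[step]


attrs = {}
add_path(attrs, ["a", "b"])
add_path(attrs, ["a", "c", 2])
print(attrs)
```

In this sketch an error can only be raised before the first new node is
created, so a rejected path leaves the map unchanged.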

After this patch, we have the attrs_to_get object which can allow us
to filter only the requested pieces of the top-level attributes, but
we don't use it yet - so this patch is not enough for complete support
of attribute paths in ProjectionExpression. We will complete this
support in the next patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:40 +02:00
Nadav Har'El
b2dbd56a3a alternator-test: additional tests for attribute paths in ProjectionExpression
This patch adds more tests for attribute paths in ProjectionExpression,
that deal with document paths which do not fit the content of the item -
e.g., trying to ask for "a.b[3]" when a.b is not a list but rather an
integer or a dictionary.

Moreover, we note that if you try to ask for "a.b, a[2]", DynamoDB
fails this request as a "conflict". The reasoning is that no single
item can ever have both a.b and a[2] (the first is only valid for
dictionaries, the second for lists). It's not clear to me why we
still can't return whichever of the two actually is relevant, but
the fact is that DynamoDB does not allow it.

The new tests fail on Alternator (marked xfailed) and pass on DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:40 +02:00
Nadav Har'El
2a2c5563ba alternator-test: harden attribute-path tests for ProjectionExpression
We have 7 xfailing tests for usage of nested attribute paths (e.g.,
"a.b.c[7]") in a ProjectionExpression. But some of these tests were too
"easy" to pass - a trivial and *wrong* implementation that just ignores
the path and uses the top level attribute (in the above example, "a"),
would cause some of them to start passing.

So this patch strengthens these tests. They still pass on AWS DynamoDB,
and now continue to fail with the aforementioned broken implementation.

Refs #5024.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:40 +02:00
Nadav Har'El
653610f4bc alternator: fix ValidationException in FilterExpression - and more
The first condition expressions we implemented in Alternator were the old
"Expected" syntax of conditional updates. That implementation had some
specific assumptions on how it handles errors: For example, in the "LT"
operator in "Expected", the second operand is always part of the query, so
an error in it (e.g., an unsupported type) resulted in a ValidationException
error.

When we implemented ConditionExpression and FilterExpression, we wrongly
used the same functions check_compare(), check_BETWEEN(), etc., to implement
them. This results in some inaccurate error handling. The worst example is
what happens when you use a FilterExpression with an expression such as
"x < y" - this filter is supposed to silently skip items whose "x" and "y"
attributes have unsupported or different types, but in our implementation
a bad type (e.g., a list) for y resulted in a ValidationException which
aborted the entire scan! Interestingly, in one case (that of BEGINS_WITH)
we actually noticed the slightly different behavior needed and implemented
the same operator twice - with ugly code duplication. But in other operators
we missed this problem completely.

This patch first adds extensive tests of how the different expressions
(Expected, QueryFilter, FilterExpression, ConditionExpression) and the
different operators handle various input errors - unsupported types,
missing items, incompatible types, etc. Importantly, the tests demonstrate
that there is often different behavior depending on whether the bad
input comes from the query, or from the item. Some of the new tests
fail before this patch, but others pass and were useful to verify that
the patch doesn't break anything that already worked correctly previously.
As usual, all the tests pass on DynamoDB.

Finally, this patch *fixes* all these problems. The comparison functions
like check_compare() and check_BETWEEN() now not only take the operands,
they also take booleans saying if each of the operands came from the
query or from an item. The old-syntax caller (Expected or QueryFilter)
always say that the first operand is from the item and the second is
from the query - but in the new-syntax caller (ConditionExpression or
FilterExpression) any or all of the operands can come from the query
and need verification.
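
The idea of passing "came from the query" booleans can be sketched in
Python like this. It is a simplified illustration, not the C++
check_compare(): the names check_lt() and _kind() are hypothetical, only
numbers, strings and bytes are treated as comparable, and a type mismatch
between two valid operands simply yields False here:

```python
class ValidationException(Exception):
    """Stands in for the error returned to the client."""


def _kind(value):
    """Return a type tag for comparable values, or None if unsupported."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return "N"
    if isinstance(value, str):
        return "S"
    if isinstance(value, bytes):
        return "B"
    return None


def check_lt(v1, v2, v1_from_query, v2_from_query):
    k1, k2 = _kind(v1), _kind(v2)
    # A bad operand written in the request itself is the user's error.
    if (k1 is None and v1_from_query) or (k2 is None and v2_from_query):
        raise ValidationException("unsupported operand type in query")
    # A bad or mismatched operand read from an item just fails to match.
    if k1 is None or k1 != k2:
        return False
    return v1 < v2


# Old syntax: first operand from the item, second from the query.
print(check_lt(1, 2, False, True))
# New syntax: both operands may come from the item; a list never matches.
print(check_lt(1, [3], False, False))
```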

The old duplicated code for check_BEGINS_WITH() - with a TODO to remove
it - is finally removed. Instead we use the same idea of passing booleans
saying if each of its operands came from an item or from the query.

Fixes #8043

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-02-08 14:16:30 +02:00
Pavel Emelyanov
a05adb8538 database: Remove global storage proxy reference
The db::update_keyspace() needs sharded<storage_proxy>
reference, but the only caller of it already has it and
can pass one as argument.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210205175611.13464-3-xemul@scylladb.com>
2021-02-08 12:59:46 +01:00
Pavel Emelyanov
8490c9ff6a transport: Remove global storage service reference
On start the transport controller keeps the storage service
on server config's lambda just to let the server grab a
database config option.

The same can be achieved by passing the sharded database
reference to sharded<server>::start, so that each server
instance gets a local database with config.

As a nice side effect transport::server's config looks
more like a config with simple values and without methods
and/or lambdas on board.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210205175611.13464-1-xemul@scylladb.com>
2021-02-08 12:58:49 +01:00
Piotr Sarna
d23584c8f7 failure_detector: return arrival samples by const reference
There's no point in always returning the whole map by value - callers
can decide to copy the map on their own if need be.
2021-02-08 11:50:32 +01:00
Piotr Sarna
445e6e44f4 failure_detector: remove unimplemented is_alive method
The method was never implemented, so it makes no sense to keep
it in the header.
2021-02-08 11:49:50 +01:00
Amnon Heiman
4498bb0a48 API: Fix aggregation in column_family
A few methods in the column_family API were doing the aggregation wrong,
specifically, bloom filter disk size.

The issue is not always visible, it happens when there are multiple
filter files per shard.

Fixes #4513

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #8007
2021-02-08 12:11:30 +02:00
Raphael S. Carvalho
e1261d10f1 table: Avoid useless allocations when updating cache on memtable flush completion
we're unconditionally using make_combined_mutation_source(), which causes extra
allocations, even if memtable was flushed into a single sstable, which is the
most common case. memtable will only be flushed into more than one sstable if
TWCS is used and memtable had old data written into it due to out-of-order
writes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210205182028.439948-1-raphaelsc@scylladb.com>
2021-02-06 20:03:33 +02:00
Pavel Emelyanov
7e68ed6a5d configure: Switch debug build from -O0 to -Og
Previous patch changed the -O flag for dev builds. This
had no effect on unit tests compile+run time, and was
aimed at improving the individual tests, dtest, stress-
and other tests runtimes.

This change is mainly focused on improving the debug-mode
full unit tests running, while keeping the debuggability:
the compile+run time gets ~10 minutes shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-05 19:46:28 +03:00
Pavel Emelyanov
4fd5ef92ae configure: Switch dev build from -O1 to -O2
Based on the original patch from Nadav.

The -O1-generated code is too slow. Raising the opt level
slows compilation down ~9%, but greatly improves the
testing time. E.g. running the alternator test alone is
2.5 times faster with -O2 (118 vs 48 seconds).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-05 19:46:28 +03:00
Pavel Emelyanov
7ced07d22c configure: Make -O flag configurable
It was noticed that current optimization levels do not
generate fast enough code for dev builds. On the other
hand just increasing the default optimization level will
make re-compile-mostly work much more frustrating.

The new configure.py option allows selecting the desired
-O option value by hand. Current hard-coded values are
used as defaults.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-05 19:46:28 +03:00
Botond Dénes
7910e745bc scylla-gdb.py: std_list: restore python2 compatibility
std_list has an iterator object which provides the python3 `__next__()`
method only. Python2 wants a method called `next()`. As it is trivial to
provide both, do that to allow debugging on centos7.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210205073549.734362-1-bdenes@scylladb.com>
2021-02-05 12:47:53 +01:00
Gleb Natapov
8dbe222331 raft: compile raft by default 2021-02-05 12:40:20 +01:00
Konstantin Osipov
adc87aa278 raft: re-lookup progress object after a configuration change
Fix raft_fsm_test failure in debug mode. ASAN complained
that follower_progress is used in append_entries_reply()
after it was destroyed. This could happen if in maybe_commit()
we switched to a new configuration and destroyed old progress
objects.

The fix is to look up the object one more time after maybe_commit().
2021-02-05 12:40:19 +01:00
Piotr Sarna
d7848750d8 alternator: server: return api_error instead of throwing
Throwing a C++ exception creates unnecessary overhead, so when
an unsupported operation is encountered, the api error is directly
returned instead of being thrown.
2021-02-04 17:23:41 +01:00
Piotr Sarna
868e04e8e2 alternator: add requests_shed metrics
The counter shows the total number of requests shed due to overload.
2021-02-04 17:23:41 +01:00
Piotr Sarna
1b8c946ad7 alternator: add handling max_concurrent_requests_per_shard
The config value is already used to set an upper limit of concurrent
CQL requests, and now it's also abided by alternator.
Excessive requests result in returning RequestLimitExceeded error
to the client.
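
The shedding logic can be pictured with a small Python sketch. It only
illustrates the idea of a per-shard in-flight limit with a shed counter;
the class name RequestLimiter and its fields are assumptions for this
example, not the Alternator server code:

```python
class RequestLimitExceeded(Exception):
    """Stands in for the error returned to the client."""


class RequestLimiter:
    """Counts in-flight requests and sheds those above the limit."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_flight = 0
        self.requests_shed = 0   # total number of shed requests

    def __enter__(self):
        if self.in_flight >= self.max_concurrent:
            self.requests_shed += 1
            raise RequestLimitExceeded("too many in-flight requests")
        self.in_flight += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs only for admitted requests, so the count stays balanced.
        self.in_flight -= 1
        return False


limiter = RequestLimiter(max_concurrent=1)
with limiter:
    try:
        with limiter:          # second concurrent request is shed
            pass
    except RequestLimitExceeded as e:
        print("shed:", e)
print(limiter.requests_shed, limiter.in_flight)
```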

Tests: manual
Running multiple concurrent requests via the test suite results in:
botocore.errorfactory.RequestLimitExceeded: An error occurred (RequestLimitExceeded) when calling the CreateTable operation: too many in-flight requests: 17
2021-02-04 17:23:41 +01:00
Piotr Sarna
32dc692b8b alternator: add RequestLimitExceeded error
The error code is used when requests are shed due to crossing
the user-defined threshold of the rate of incoming requests.
2021-02-04 17:14:21 +01:00
Avi Kivity
7f3083739f Merge "sstables: Share partition index pages between readers" from Tomasz
"
Before this patch, each index reader had its own cache of partition
index pages. Now there is a shared cache, owned by the sstable object.
This allows concurrent reads to share partition index pages and thus
reduce the amount of I/O.

It used to be like that a few years ago, but we moved to per-reader
cache to implement incremental promoted index parsing, to avoid OOMs
with large partitions. At that time, the solution involved caching
input streams inside partition index entries, which couldn't be reused
between readers. This could have been solved differently. Instead of
caching input streams, we can cache information needed to create them
(temporary_buffer<>). This solution takes this approach.

This series is also needed before we can implement promoted index
caching. That's because before the promoted index can be shared by
readers, the partition index entries, which hold the promoted index,
must also be shareable.

The pages live as long as there is at least one index reader
referencing them. So it only helps when there is concurrent access. In
the future we will keep them for longer and evict on memory pressure.

Promoted index cursor is no longer created when the partition index
entry is parsed, but it's created on-demand when the top-level cursor
enters the partition. The promoted index cursor is owned by the
top-level cursor, not by the partition index entry.

Below are the results of an experiment performed on my laptop which
demonstrates the improvement in performance.

Load driver command line:

  ./scylla-bench                   \
       -workload uniform           \
       -mode read                  \
       --partition-count=10        \
       -clustering-row-count=1     \
       -concurrency 100

Scylla command line:

  scylla --developer-mode=1 -c1 -m1G --enable-cache=0

The workload is IO-bound.
Before, we needed 2 I/O per read, now we need 1 (amortized).
The throughput is ~70% higher.

Before:

 time   ops/s  rows/s errors max    99.9th 99th   95th   90th   median mean
   1s    4706    4706      0 35ms   30ms   27ms   25ms   24ms   21ms   21ms
   2s    4646    4646      0 42ms   31ms   31ms   27ms   25ms   21ms   22ms
 3.1s    4670    4670      0 40ms   27ms   26ms   25ms   25ms   21ms   21ms
 4.1s    4581    4581      0 39ms   33ms   33ms   27ms   26ms   21ms   22ms
 5.1s    4345    4345      0 40ms   37ms   35ms   32ms   31ms   21ms   23ms
 6.1s    4328    4328      0 49ms   40ms   34ms   32ms   31ms   22ms   23ms
 7.1s    4198    4198      0 45ms   36ms   35ms   31ms   30ms   22ms   24ms
 8.2s    3913    3913      0 51ms   50ms   50ms   39ms   35ms   24ms   26ms
 9.2s    4524    4524      0 34ms   31ms   30ms   28ms   27ms   21ms   22ms

After:

 time   ops/s  rows/s errors max    99.9th 99th   95th   90th   median mean
   1s    7913    7913      0 25ms   25ms   20ms   15ms   14ms   12ms   13ms
   2s    7913    7913      0 18ms   18ms   18ms   16ms   14ms   12ms   13ms
   3s    8125    8125      0 20ms   20ms   17ms   15ms   14ms   12ms   12ms
   4s    5609    5609      0 41ms   35ms   29ms   28ms   27ms   13ms   18ms
 5.1s    8020    8020      0 18ms   17ms   17ms   15ms   14ms   12ms   13ms
 6.1s    7102    7102      0 27ms   27ms   24ms   19ms   18ms   13ms   14ms
 7.1s    5780    5780      0 26ms   26ms   26ms   23ms   22ms   17ms   18ms
 8.1s    6530    6530      0 37ms   34ms   26ms   22ms   20ms   15ms   15ms
 9.1s    7937    7937      0 19ms   19ms   17ms   17ms   16ms   12ms   13ms

Tests:

  - unit [release]
  - scylla-bench
"

* tag 'share-partition-index-v1' of github.com:tgrabiec/scylla:
  sstables: Share partition index pages between readers
  sstables: index_reader: Drop now unnecessary index_entry::close_pi_stream()
  sstables: index_reader: Do not store cluster index cursor inside partition indexes
2021-02-04 17:27:49 +02:00
Nadav Har'El
1953b1b006 alternator-test: increase timeout in tracing test
Our test for tracing Alternator requests can't be sure when tracing a request
finished, because tracing is asynchronous and has no official ending signal.
So before we can conclude that tracing failed, we need to wait until a
timeout, which in the current code was roughly 6.4 seconds (the timeout
logic is unnecessarily convoluted, but to make a long story short it has
exponential sleeps starting with 0.1 second and ending with 3.2 seconds,
totaling 6.4 seconds).

It turns out that sporadically, in test runs on overcommitted test machines
with the very slow debug build, we fail this test with this timeout.
So this patch increases the timeout to 51.2 seconds. It should be more
than enough for everyone. Famous last words :-)
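
For reference, the arithmetic of such a doubling wait can be sketched as
below. This is not the test's actual helper: wait_until() and its
parameters are hypothetical, and the 25.6-second last sleep used for the
longer timeout is an assumption. With sleeps doubling from 0.1 seconds,
the total is just under twice the last sleep:

```python
import time


def wait_until(check, first_sleep=0.1, last_sleep=3.2, sleep=time.sleep):
    """Poll check() with doubling sleeps.

    Returns (succeeded, total_time_slept). Gives up after a sleep of
    `last_sleep` seconds has been tried without success.
    """
    total = 0.0
    delay = first_sleep
    while True:
        if check():
            return True, total
        if delay > last_sleep:
            return False, total
        sleep(delay)
        total += delay
        delay *= 2


# Use a fake sleep so the example finishes instantly.
slept = []
print(wait_until(lambda: False, sleep=slept.append))
print(slept)
print(wait_until(lambda: False, last_sleep=25.6, sleep=lambda s: None))
```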

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210204151554.582260-1-nyh@scylladb.com>
2021-02-04 17:17:07 +02:00
Tomasz Grabiec
63188abb87 sstables: Share partition index pages between readers
Before this patch, each index reader had its own cache of partition
index pages. Now there is a shared cache, owned by the sstable object.
This allows concurrent reads to share partition index pages and thus
reduce the amount of I/O.

This change is also needed before we can implement promoted index caching.
That's because before the promoted index can be shared by readers, the
partition index entries, which hold the promoted index, must also be
shareable.

The pages live as long as there is at least one index reader
referencing them. So it only helps when there is concurrent access. In
the future we will keep them for longer and evict on memory pressure.
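
The lifetime rule above (a page stays cached only while some reader holds
it) can be mimicked in Python with weak references. This is an analogy
only: the real cache is C++ code owned by the sstable object, the names
Page and PageCache are hypothetical, and prompt removal of unreferenced
pages relies on CPython reference counting:

```python
import weakref


class Page:
    def __init__(self, number, data):
        self.number = number
        self.data = data


class PageCache:
    """Shares pages between readers while at least one reader holds them."""

    def __init__(self, load):
        self._load = load                        # reads a page from disk
        self._pages = weakref.WeakValueDictionary()
        self.loads = 0                           # number of I/O operations

    def get(self, number):
        page = self._pages.get(number)
        if page is None:
            self.loads += 1
            page = Page(number, self._load(number))
            self._pages[number] = page
        return page


cache = PageCache(load=lambda n: b"page-%d" % n)
reader1 = cache.get(0)
reader2 = cache.get(0)       # concurrent reader shares the same page
print(reader1 is reader2, cache.loads)
```

Once the last reader drops its page, the next access has to load it
again, which is why the sharing only helps with concurrent access.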

Promoted index cursor is no longer created when the partition index entry
is parsed, but it's created on-demand when the top-level cursor enters
the partition. The promoted index cursor is owned by the top-level cursor,
not by the partition index entry.
2021-02-04 15:24:07 +01:00
Tomasz Grabiec
c232d71fc8 sstables: index_reader: Drop now unnecessary index_entry::close_pi_stream() 2021-02-04 15:24:07 +01:00
Tomasz Grabiec
5ed559c8c6 sstables: index_reader: Do not store cluster index cursor inside partition indexes
Currently, the partition index page parser will create and store
promoted index cursors for each entry. The assumption is that
partition index pages are not shared by readers so each promoted index
cursor will be used by a single index_reader (the top-level cursor).

In order to be able to share partition index entries we must make the
entries immutable and thus move the cursor outside. The promoted index
cursor is now created and owned by each index_reader. There is at most
one such active cursor per index_reader bound (lower/upper).
2021-02-04 15:23:55 +01:00
Avi Kivity
713a159600 tools: toolchain: add simplified procedure for creating dbuild images
The current procedure for building images is complicated, as it
requires access to x86_64, aarch64, and s390x machines. Add an alternative
procedure that is fully automated, as it relies on emulation on a single
machine.

It is slow, but requires less attention.

Closes #8024
2021-02-04 15:37:36 +02:00
Avi Kivity
bd7fbcc0cf tools: toolchain: dbuild: keep original user's groups
The supplementary groups are removed by default, so add them back.
Supplementary groups are useful for group-shared directories like
ccache.

I added them to the podman-only branch since I don't know if this
works for docker. If a docker user verifies it works there too,
we can move it to the generic code.

Closes #8020
2021-02-04 15:36:55 +02:00
Gleb Natapov
e9043565b3 raft: add counters to raft server
The patch adds set of counters for various events inside raft
implementation to facilitate monitoring and debugging.

Message-Id: <20210204125313.GA1513786@scylladb.com>
2021-02-04 14:19:54 +01:00
Benny Halevy
f5fe8283cc test: reader_permit: do not include reader_concurrency_semaphore.hh in header file
We can do with a forward declaration instead to reduce
the dependency, and include reader_concurrency_semaphore.hh
in test/lib/reader_permit.cc instead.

We need to include "../../reader_permit.hh" to get the
definition of class reader_permit. We need the include
path to prevent recursive include (or rename test/lib/reader_permit.hh
but this creates a lot of code churn).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204122002.1041808-1-bhalevy@scylladb.com>
2021-02-04 15:02:16 +02:00
Benny Halevy
338c190842 reader_concurrency_semaphore: inactive_read_handle: mark methods noexcept
All are trivially noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204113327.1027792-1-bhalevy@scylladb.com>
2021-02-04 13:57:42 +02:00
Benny Halevy
ba4b8dd6e5 sstables: row.hh: no need to include reader_concurrency_semaphore.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204113413.1027893-1-bhalevy@scylladb.com>
2021-02-04 13:42:06 +02:00
Tomasz Grabiec
2e3f6a9622 tests: perf_fast_forward: Print output directory
Message-Id: <20210203180053.230627-1-tgrabiec@scylladb.com>
2021-02-04 10:39:41 +02:00
Tomasz Grabiec
e0ceb454c0 tests: perf_fast_forward: Print error hints to stdout
They point to lines printed to stdout, so should be aligned with them.
Message-Id: <20210203180016.230547-1-tgrabiec@scylladb.com>
2021-02-04 10:39:41 +02:00
Avi Kivity
fcd48adcc4 Update seastar submodule
* seastar b5b2ee53d...4c7c5c7c4 (1):
  > Merge "add support for printing backtraces on one line" from Benny

Fixes #5464.
2021-02-03 14:01:45 +02:00
Benny Halevy
ca6f5cb0bc test: commitlog_test: test_allocation_failure: fill memory using smaller allocations
commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output).
So if there are discontiguous small memory blocks, they can be used to satisfy
an allocation even if no contiguous memory blocks are available.

To prevent that, as Avi suggested, this change allocates in 128K blocks
and frees the last one to succeed (so that we won't fail on allocating continuations).

Fixes #8028

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>
2021-02-03 12:21:20 +02:00
Pavel Solodovnikov
856b0b3a58 raft: introduce raft_gossip_failure_detector class
This is an implementation of `raft::failure_detector` for Scylla
that uses gms::gossiper to query `is_alive` state for a given
raft server id.

Server ids are translated to `gms::inet_address` to be consumed
by `gms::gossiper` with the help of `raft_rpc` class,
which manages the mapping.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210129223109.2142072-1-pa.solodovnikov@scylladb.com>
2021-02-03 10:45:18 +01:00
Tomasz Grabiec
f8ae46f294 Merge "raft: RPC module implementation" from Pavel Solodovnikov
This series provides additional RPC verbs and corresponding methods in
`messaging_service` class, as well as a scylla-specific Raft RPC module
implementation that uses `netw::messaging_service` under the hood to
dispatch RPC messages.

* https://github.com/ManManson/scylla/commits/raft-api-rpc-impl-v6:
  raft: introduce `raft_rpc` class
  raft: add Raft RPC verbs to `messaging_service` and wire up the RPC calls
  configure.py: compile serializer.cc
2021-02-03 10:43:58 +01:00
Benny Halevy
55e3df8a72 dist: scylla_util: prevent IndexError when no ephemeral_disks were found
Currently we call firstNvmeSize before checking that we have enough
(at least 1) ephemeral disks.  When none are found, we hit the following
error (see #7971):
```
File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in
if idata.is_recommended_instance():
File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance
diskSize = self.firstNvmeSize
File "/opt/scylladb/scripts/scylla_util.py", line 291, in firstNvmeSize
firstDisk = ephemeral_disks[0]
IndexError: list index out of range
```

This change reverses the order and first checks that we found
enough disks before getting the first disk size.
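
The reordering can be sketched as follows. This is a minimal illustration, not the actual scylla_util.py code: the class name, the `min_disks` and `min_size` parameters and the size comparison are assumptions made up for the example.

```python
class InstanceData:
    """Hypothetical stand-in for the instance helper in scylla_util.py."""

    def __init__(self, ephemeral_disk_sizes):
        # Sizes of the ephemeral (NVMe) disks that were discovered; may be empty.
        self.ephemeral_disk_sizes = list(ephemeral_disk_sizes)

    def first_nvme_size(self):
        # Raises IndexError on an empty list, as in the traceback above.
        return self.ephemeral_disk_sizes[0]

    def is_recommended_instance(self, min_disks=1, min_size=1):
        # Check the disk count first, so first_nvme_size() is only reached
        # when at least one disk exists.
        if len(self.ephemeral_disk_sizes) < min_disks:
            return False
        return self.first_nvme_size() >= min_size
```

With no ephemeral disks the check returns False instead of raising IndexError.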

Fixes #7971

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8027
2021-02-03 11:30:18 +02:00
Avi Kivity
10606aadb5 Update tools/java submodule
* tools/java 78c8ef4f54...0187829d5e (1):
  > nodetool: alternate way to specify table name which includes a dot

Fixes #6521.
2021-02-03 11:27:33 +02:00
Botond Dénes
46b795b5fd mutation: consume(): add reverse mode
`mutation::consume()` is used by range scans to convert the intermediate
`reconcilable_result` to the final `query::result` format. When the
range scan is in reverse, `mutation::consume()` has to feed the
clustering fragments to the consumer in reverse order, but currently
`mutation::consume()` always uses the natural order, breaking reverse
range scans.
This patch fixes this by adding a `consume_in_reverse` parameter to
`mutation::consume()`, and consequently support for consuming clustering
fragments in reverse order.

Fixes: #8000

Tests: unit(release, debug),
dtest(thrift_tests.py:TestMutations.test_get_range_slice)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210203081659.622424-1-bdenes@scylladb.com>
2021-02-03 11:00:47 +02:00
Piotr Sarna
c03363b520 README: fix a dead link for building instructions
The link was outdated, since its destination was moved to
a subdirectory.

Message-Id: <b0e0eedaea4f26acf050a91ab9eed1ca37a838bb.1612338584.git.sarna@scylladb.com>
2021-02-03 10:59:50 +02:00
Avi Kivity
913d970c64 Merge "Unify inactive readers" from Botond
"
Currently inactive readers are stored in two different places:
* reader concurrency semaphore
* querier cache
With the latter registering its inactive readers with the former. This
is an unnecessarily complex (and possibly surprising) setup that we want
to move away from. This series solves this by moving the responsibility
of storing inactive reads solely to the reader concurrency semaphore,
including all supported eviction policies. The querier cache is now only
responsible for indexing queriers and maintaining relevant stats.
This makes the ownership of the inactive readers much more clear,
hopefully making Benny's work on introducing close() and abort() a
little bit easier.

Tests: unit(release, debug:v1)
"

* 'unify-inactive-readers/v2' of https://github.com/denesb/scylla:
  reader_concurrency_semaphore: store inactive readers directly
  querier_cache: store readers in the reader concurrency semaphore directly
  querier_cache: retire memory based cache eviction
  querier_cache: delegate expiry to the reader_concurrency_semaphore
  reader_concurrency_semaphore: introduce ttl for inactive reads
  querier_cache: use new eviction notify mechanism to maintain stats
  reader_concurrency_semaphore: add eviction notification facility
  reader_concurrency_semaphore: extract evict code into method evict()
2021-02-03 10:59:04 +02:00
Piotr Sarna
d395305ddd api: fix retrieving replied RPC messages
The API call referred to a nonexistent callback,
which is now renamed to better match the API path
and actually implemented.

Message-Id: <3d0dbb42f67e1584999a58da9aa9cc722487fda1.1612279443.git.sarna@scylladb.com>
2021-02-03 09:42:17 +02:00
Pekka Enberg
5670276163 Update seastar submodule
* seastar cb3aaf07...b5b2ee53 (1):
  > perftune.py: fix assignment after extend and add asserts
Fixes #8008
2021-02-02 15:27:13 +02:00
Tomasz Grabiec
873e732042 Merge "Switch partition rows onto B-tree" from Pavel Emelyanov
This is the continuation of the row-cache performance
improvements, this time -- the rework of clustering keys part.

The goal is to solve the same set of problems:
- logN eviction complexity
- deep and sparse tree

Unlike partitions, this cache has one big feature that makes it
impossible to just use existing B+ tree:

  There's no copyable key at hand. The clustering key is the
  managed_bytes() that is not nothrow-copy-constructible, nor is
  it hashable for lookup due to prefix lookup.

Thus the choice is the B-tree, which is also N-ary one, but
doesn't copy keys around.

B-trees are like B+, but can have key:data pairs in inner nodes,
thus those nodes may be significantly bigger than B+ ones, which
have data only in leaf nodes. To avoid making the memory footprint
worse, the tree assumes that keys and data live on the same object
(the rows_entry one), and the tree itself manages only the key
pointers.

To avoid invalidating iterators on insert/remove, the tree nodes keep
pointers to keys, not the keys themselves.

The tree uses tri-compare instead of less-compare. This makes the
.find and .lower_bound methods do ~10% fewer comparisons on a random
insert/lookup test.
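
A rough sketch of where the saving comes from, using a plain sorted list and binary search rather than the B-tree itself. All names are hypothetical. A less-only comparator needs one extra call after the search to tell "equal" from "greater", while a tri-comparator can stop as soon as it returns 0.

```python
def tri_compare(a, b):
    # Negative, zero or positive, like a three-way comparator.
    return (a > b) - (a < b)

def find_with_less(keys, key, counter):
    """Lower-bound search using only '<', then an extra '<' to confirm a match."""
    lo, hi = 0, len(keys)
    while lo < hi:
        mid = (lo + hi) // 2
        counter["calls"] += 1
        if keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(keys):
        counter["calls"] += 1
        if not (key < keys[lo]):
            return lo
    return None

def find_with_tri(keys, key, counter):
    """Same search with a tri-comparator; an exact hit ends the search at once."""
    lo, hi = 0, len(keys)
    while lo < hi:
        mid = (lo + hi) // 2
        counter["calls"] += 1
        c = tri_compare(keys[mid], key)
        if c == 0:
            return mid
        if c < 0:
            lo = mid + 1
        else:
            hi = mid
    return None
```

On sorted unique keys both functions return the same index, and the tri-compare version never makes more comparator calls than the less-compare one. The ~10% figure above comes from the series' own measurements, not from this sketch.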

Numbers:

- memory_footprint: B-tree       master
  rows_entry size:  216          232

  1 row
   in-cache:        968          960     (because of dummy entry)
   in-memtable:     1006         1022

  100 rows
   in-cache:        50774        50856
   in-memtable:     50620        50918

- mutation_test:    B-tree       master
   tps.average:     891177       833896

- simple_query:     B-tree       master
   tps.median:      71807        71656
   tps.maximum:     71847        71708

* xemul/clustering-cache-over-btree-4:
  mutation_partition: Save one keys comparison
  partition_snapshot_row_cursor: Remove rows pointer
  mutation_partition: Use B-tree insertion sugar
  perf-test : Print B-tree sizes
  mutation_partition: Switch cache of rows onto B-tree
  partition_snapshot_reader: Rename cmp to less for explicitness
  mutation_partition: Make insertion bullet-proof
  mutation_partition: Use tri-compare in non-set places
  flat_mutation_reader: Use clear() in destroy_current_mutation()
  rows_entry: Generalize compare
  utils: Intrusive B-tree (with tests)
  tests: Generalize bptree compaction test
  tests: Generalize bptree stress test
2021-02-02 12:26:02 +01:00
Tomasz Grabiec
75eb97b12c Merge 'Commitlog multi-entry write' from Calle Wilund
Fixes #7615

Makes the CL writer interface N-valued (though still 1 for the "old" paths). Adds a new write path to input N mutations -> N rp_handles.
Guarantees that all entries are written or none are, and that they will be flushed to disk together.

Small test included.

Closes #7616

* github.com:scylladb/scylla:
  commitlog_test: Add multi-entry write test
  commitlog: Add "add_entries" call to allow inputting N mutations
  commitlog: Make commitlog entries optionally multi-entry
  commitlog: Move entry_writer definition to cc file
2021-02-02 12:23:19 +01:00
Tomasz Grabiec
7b17969a6e Merge 'sstable: reader: preempt after every fragment' from Avi Kivity
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().
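
The control flow described above can be sketched as follows. This is an illustration only: `Proceed`, `push_fragment`, `fill_buffer` and the `need_preempt` callable are hypothetical Python stand-ins for the C++ state machine, and the sketch simply returns where the real reader would yield to the reactor and resume later.

```python
from enum import Enum

class Proceed(Enum):
    YES = 1
    NO = 2

def push_fragment(buffer, fragment, max_buffer_size, need_preempt):
    """Push one fragment, then pause if the buffer is full or preemption is due."""
    buffer.append(fragment)
    if len(buffer) >= max_buffer_size or need_preempt():
        return Proceed.NO
    return Proceed.YES

def fill_buffer(fragments, max_buffer_size, need_preempt):
    """Outer loop: consume fragments until the state machine asks to pause."""
    buffer = []
    for fragment in fragments:
        if push_fragment(buffer, fragment, max_buffer_size, need_preempt) is Proceed.NO:
            break
    return buffer
```

With `need_preempt` always returning true, the loop stops after a single fragment, which is the extra preemption opportunity this patch adds.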

Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
if the partition has no rows.

Unlike the previous attempt, push_ready_fragments() is not touched.

The extra preemption opportunities triggered a preexisting bug in
clustering_ranges_walker; it is fixed in the first patch of the series.

I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.

Test: unit (dev, debug, release)

Fixes #7883.

Closes #7928

* github.com:scylladb/scylla:
  sstable: reader: preempt after every fragment
  clustering_range_walker: fix false discontiguity detected after a static row
2021-02-02 12:21:58 +01:00
Benny Halevy
0fecc78d88 user_function: throw on_internal_error if executed outside a seastar thread
Rather than asserting, as seen in #7977.
This shouldn't crash the server in production.

Add unit test that reproduces this scenario
and verifies the internal error exception.

Fixes #7977

Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210201163051.1775536-1-bhalevy@scylladb.com>
2021-02-02 13:03:39 +02:00
Calle Wilund
720a47fe8a commitlog_test: Add multi-entry write test 2021-02-02 10:41:08 +00:00
Calle Wilund
c5f6125039 commitlog: Add "add_entries" call to allow inputting N mutations
Fixes #7615

Allows N mutations to be written "atomically" (i.e. in the same
call). Either all are added to the segment, or none.

Returns rp_handle vector corresponding to the call vector.
2021-02-02 10:41:08 +00:00
Calle Wilund
5fcc2066ed commitlog: Make commitlog entries optionally multi-entry
Allows writing more than one blob of data using a single
"add" call into segment. The old call sites will still
just provide a single entry.

To ensure we can determine the health of all the entries
as a unit, we need to wrap them in a "parent" entry.
For this, we bump the commitlog segment format and
introduce a magic marker, which if present, means
we have entries in entry, totalling "size" bytes.
We checksum the entry header, and also checksum
the individual checksums of each sub-entry (faster).
This is added as a post-word.

When parsing/replaying, if v2+ and marker, we have to
read all entries + checksums into memory, verify, and
_then_ we can actually send the info to caller.
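
To illustrate the idea of a checksummed "parent" entry, here is a small sketch. The byte layout, the `MAGIC` value and the use of CRC32 from zlib are assumptions made up for the example; they are not the actual commitlog segment format.

```python
import struct
import zlib

MAGIC = 0xFFFFFFFF  # hypothetical marker value, not the real one

def _crc_of_crcs(crcs):
    # Checksum the per-entry checksums instead of re-reading all the data.
    return zlib.crc32(struct.pack("<%dI" % len(crcs), *crcs))

def pack_entries(blobs):
    """Wrap several blobs in one parent entry: header, sub-entries, post-word."""
    crcs = [zlib.crc32(b) for b in blobs]
    body = b"".join(struct.pack("<II", len(b), c) + b for b, c in zip(blobs, crcs))
    header = struct.pack("<II", MAGIC, len(body))
    return (header + struct.pack("<I", zlib.crc32(header)) + body
            + struct.pack("<I", _crc_of_crcs(crcs)))

def unpack_entries(buf):
    """Read and verify everything first; only then hand the blobs to the caller."""
    if len(buf) < 16:
        raise ValueError("truncated entry")
    magic, size = struct.unpack_from("<II", buf, 0)
    (header_crc,) = struct.unpack_from("<I", buf, 8)
    if magic != MAGIC or zlib.crc32(buf[:8]) != header_crc:
        raise ValueError("bad parent header")
    pos, end = 12, 12 + size
    if len(buf) < end + 4:
        raise ValueError("truncated entry")
    blobs, crcs = [], []
    while pos < end:
        if pos + 8 > end:
            raise ValueError("bad sub-entry header")
        length, crc = struct.unpack_from("<II", buf, pos)
        pos += 8
        if pos + length > end:
            raise ValueError("bad sub-entry length")
        blob = bytes(buf[pos:pos + length])
        if zlib.crc32(blob) != crc:
            raise ValueError("sub-entry checksum mismatch")
        blobs.append(blob)
        crcs.append(crc)
        pos += length
    (post,) = struct.unpack_from("<I", buf, end)
    if post != _crc_of_crcs(crcs):
        raise ValueError("post-word checksum mismatch")
    return blobs
```

A corrupted sub-entry makes the whole parent entry fail verification, so no partial set of entries is ever replayed.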
2021-02-02 10:41:08 +00:00
Calle Wilund
6bef3f9cc3 commitlog: Move entry_writer definition to cc file
Should not be public/visible
2021-02-02 10:32:44 +00:00
Juliusz Stasiewicz
29e4737a9b transport: Fix abort on certain configurations of native_transport_port(_ssl)
The reason was accessing the `configs` table out of index. Also,
native_transport_port-s can no longer be disabled by setting to 0,
as per the table below.

Rules for port/encryption (the same apply to shard_aware counterpart):

np  := native_transport_port.is_set()
nps := native_transport_port_ssl.is_set()
ceo := ceo.at("enabled") == "true"
eq  := native_transport_port_ssl() == native_transport_port()

+-----+-----+-----+-----+
|  np | nps | ceo |  eq |
+-----+-----+-----+-----+
|  0  |  0  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  0  |  0  |  1  |  *  |   =>   listen on native_transport_port, encrypted
|  0  |  1  |  0  |  *  |   =>   nonsense, don't listen
|  0  |  1  |  1  |  *  |   =>   listen on native_transport_port_ssl, encrypted
|  1  |  0  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  1  |  0  |  1  |  *  |   =>   listen on native_transport_port, encrypted
|  1  |  1  |  0  |  *  |   =>   listen on native_transport_port, unencrypted
|  1  |  1  |  1  |  0  |   =>   listen on native_transport_port, unencrypted + native_transport_port_ssl, encrypted
|  1  |  1  |  1  |  1  |   =>   native_transport_port(_ssl), encrypted
+-----+-----+-----+-----+
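
The table can be read as a small decision function. The sketch below transcribes it row by row; the function name and the returned (option, encrypted) tuples are illustrative only and do not mirror the transport code.

```python
def listeners(np, nps, ceo, eq):
    """Return a list of (port option name, encrypted) pairs per the table above.

    np, nps, ceo and eq are the four booleans defined before the table.
    """
    if not nps:
        # Only native_transport_port matters; encryption follows ceo.
        return [("native_transport_port", bool(ceo))]
    if not ceo:
        # An SSL port without encryption enabled: fall back to the plain
        # port if it was set, otherwise do not listen at all.
        return [("native_transport_port", False)] if np else []
    if not np:
        return [("native_transport_port_ssl", True)]
    if eq:
        # Both options name the same port: one encrypted listener.
        return [("native_transport_port", True)]
    return [("native_transport_port", False),
            ("native_transport_port_ssl", True)]
```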

Fixes #7783
Fixes #7866

Closes #7992
2021-02-02 11:32:31 +02:00
Avi Kivity
285303b131 Update tools/jmx submodule
* tools/jmx 2c95650...949cefc (2):
  > dist/redhat: stop using systemd macros, call systemctl directly
  > Remove obsolete FIXME

See scylladb/scylla-jmx#94.
2021-02-02 11:29:36 +02:00
Takuya ASADA
7b310c591e dist/redhat: stop using systemd macros, call systemctl directly
Fedora version of systemd macros does not work correctly on CentOS7,
since CentOS7 does not support "file trigger" feature.
To fix the issue we need to stop using systemd macros, call systemctl
directly.

See scylladb/scylla-jmx#94

Closes #8005
2021-02-02 11:28:07 +02:00
Avi Kivity
da4fa0629a Merge "sstables: add sstable_origin to scylla_metadata" from Benny
"
This series extends the scylla_metadata sstable component
to hold an optional textual description of the sstable origin.
It describes where the sstables originated from
(e.g. memtable, repair, streaming, compaction, etc.)

The origin string is provided by the sstable writer via
sstable_writer_config, written to the scylla_metadata component,
and loaded on sstable::load().

A get_origin() method was added to class sstable to retrieve
its origin.  It returns an empty string by default if the origin
is missing.

Compaction now logs the sstable origin for each sstable it
compacts, and it generates the sstable origin for all sstables
it generates.  Regular compaction origin is simply set to "compaction"
while other compaction types are mentioned by name, as
"cleanup", "resharding", "reshaping", etc.

A unit test was added to test the sstable_origin by writing either
an empty origin or a random string, and then comparing
the origin retrieved by sstable::load to the one written.

Test: unit(release)

Fixes #7880
"

* tag 'sstable-origin-v2' of github.com:bhalevy/scylla:
  compaction: log sstable origin
  sstables: scylla_metadata: add support for sstable_origin
  sstables: sstable_writer_config: add origin member
2021-02-02 10:35:11 +02:00
Pavel Emelyanov
54ddb5a70a mutation_partition: Save one keys comparison
The apply_monotonically checks if the cursor is behind the source
position to decide whether or not to push it forward (with the
lower_bound call). The 2nd comparison is done to check if either
the cursor was ahead or if lower_bound result actually hit the key.

This 2nd comparison can be avoided:

- the 1st case needs a B-tree lower_bound API extension that reports
  whether the bound is a match or not.
- the 2nd one is covered with reusing tri-compare result from the
  1st comparison

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
4ccce97396 partition_snapshot_row_cursor: Remove rows pointer
The pointer is needed to erase an element by its iterator from the
rows container. The B-tree has this method on iterator and it does
NOT need to walk up the tree to find its root, so the complexity
is still amortized constant.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
8e7c1e049b mutation_partition: Use B-tree insertion sugar
The B-tree .insert methods accept unique pointers and release them

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
a92eb2f7a9 perf-test : Print B-tree sizes
After the switch from BST to B-tree the memory footprint includes inner/leaf nodes
from the B-tree, so it's useful to know their sizes too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
5c0f9a8180 mutation_partition: Switch cache of rows onto B-tree
The switch is pretty straightforward, and consists of

- change less-compare into tri-compare

- rename insert/insert_check into insert_before_hint

- use tree::key_grabber in mutation_partition::apply_monotonically to
  exception-safely transfer a row from one tree to another

- explicitly erase the row from tree in rows_entry::on_evicted, there's
  a O(1) tree::iterator method for this

- rewrite rows_entry -> cache_entry transformation in the on_evicted to
  fit the B-tree API

- include the B-tree's external memory usage into stats

That's it. The number of keys per node is set to 12 with linear search
and linear extension of 20 because

- experimenting with tree shows that numbers 8 through 10 keys with linear
  search show the best performance on stress tests for insert/find-s of
  keys that are memcmp-able arrays of bytes (which is an approximation of
  current clustering key compare). More keys work slower, but still better
  than any bigger value with any type of search up to 64 keys per node

- having 12 keys per node is the threshold at which the memory footprint
  for B-tree becomes smaller than for boost::intrusive::set for partitions
  with 32+ keys

- 20 keys for linear root eats the first-split peak and still performs
  well in linear search

As a result the footprint for B-tree is bigger than the one for BST only for
trees filled with 21...32 keys by 0.1...0.7 bytes per key.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
165255e2bd partition_snapshot_reader: Rename cmp to less for explicitness
This is a less-comparator; cmp is used to denote tri-compare in this set.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
ee9e104541 mutation_partition: Make insertion bullet-proof
The bi::intrusive::set::insert-s are non-throwing, so it's safe to add
new entry like this

  auto* ne = new entry;
  set.insert(ne);

and not worry about memory leak. B-tree's insert will be throwing, so we
need some way to free the new entries in case of exception. There's already
a way for this:

   std::unique_ptr<entry> ne = std::make_unique<entry>();
   set.insert(*ne);
   ne.release();

so make every insertion into the set work this way in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
926f748a3d mutation_partition: Use tri-compare in non-set places
The mutation_partition::_rows will be switched to a B-tree with a
tri-comparator, so to clearly identify the places not affected by it,
switch them to tri-compare in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
bfcd6a4bb7 flat_mutation_reader: Use clear() in destroy_current_mutation()
Currently the code uses a loop of unlink_leftmost_without_rebalance
calls. B-tree does have it, but plain clearing of the tree is a bit
faster with clear().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
306c40939b rows_entry: Generalize compare
Turn the rows_entry less-comparator's calls into a template as
they are nothing but wrappers on top of rows_entry tri-comparator.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
2f7c03d84c utils: Intrusive B-tree (with tests)
The design of the tree goes from the row-cache needs, which are

1. Insert/Remove do not invalidate iterators
2. Elements are LSA-manageable
3. Low key overhead
4. External tri-comparator
5. As little actions on insert/remove as possible

With the above the design is

Two types of nodes -- inner and leaf. Both types keep pointer on parent nodes
and N pointers on keys (not keys themselves). Two differences: inner nodes have
array of pointers on kids, leaf nodes keep pointer on the tree (to update left-
and rightmost tree pointers on node move).

Nodes do not keep pointers/references on trees, thus we have O(1) move of any
object, but O(logN) to get the tree size. Fortunately, with big keys-per-node
value this won't result in too many steps.

In turn, the tree has 3 pointers -- root, left- and rightmost leaves. The latter
is for constant-time begin() and end().

Keys are managed by user with the help of embeddable member_hook instance,
which is 1 pointer in size.

The code was copied from the B+ tree one, then heavily reworked, the internal
algorithms turned out to differ quite significantly.

For the sake of mutation_partition::apply_monotonically(), which needs to move
an element from one tree into another, there's a key_grabber helping wrapper
that allows doing this move respecting the exception-safety requirement.

As measured by the perf_collections test the B-tree with 8 keys is faster, than
the std::set, but slower than the B+tree:

            vs set        vs b+tree
   fill:     +13%           -6%
   find:     +23%          -35%

Another neat thing is that 1-key insertion-removal is ~40% faster than
for BST (the same number of allocations, but the key object is smaller,
less pointers to set-up and less instructions to execute when linking
node with root).

v4:
- equip insertion methods with on_alloc_point() calls to catch
  potential exception guarantees violations earlier

- add unlink_leftmost_without_rebalance. The method is borrowed from
  boost intrusive set, and is added to kill two birds -- provide it,
  as it turns out to be popular, and use a bit faster step-by-step
  tree destruction than plain begin+erase loop

v3:
- introduce "inline" root node that is embedded into tree object and in
  which the 1st key is inserted. This greatly improves the 1-key-tree
  performance, which is pretty common case for rows cache

v2:
- introduce "linear" root leaf that grows on demand

  This improves the memory consumption for small trees. This linear node may
  and should over-grow the NodeSize parameter. This comes from the fact that
  there are two big per-key memory spikes on small trees -- 1-key root leaf
  and the first split, when the tree becomes 1-key root with two half-filled
  leaves. If the linear extension goes above NodeSize it can flatten even the
  2nd peak

- mitigate the keys indirection a bit

  Prefetching the keys while doing the intra-node linear scan and the nodes
  while descending the tree gives ~+5% of fill and find

- generalize stress tests for B and B+ trees

- cosmetic changes

TODO:

- fix a few inefficiencies in the core code (walks the sub-tree twice sometimes)
- try to optimize the leaf nodes that are not left-/rightmost so they do not carry
  an unused tree pointer on board

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:29 +03:00
Pavel Emelyanov
6d63bdbefe tests: Generalize bptree compaction test
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:28:59 +03:00
Pavel Emelyanov
8bdad0bb28 tests: Generalize bptree stress test
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:28:57 +03:00
Avi Kivity
db4b9215dd sstable: reader: preempt after every fragment
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().

Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
if the partition has no rows.

Unlike the previous attempt, push_ready_fragments() is not touched.

I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.

Test: unit (dev)

Fixes #7883.
2021-02-01 19:32:07 +02:00
Avi Kivity
7634a90dd2 clustering_range_walker: fix false discontiguity detected after a static row
clustering_range_walker detects when we jump from one row range to another. When
a static row is included in the query, the constructor sets up the first before/after
bounds to be exactly that static row. That creates an artificial range crossing if
the first clustering range is contiguous with the static row.

This can cause the index to be consulted needlessly if we happen to fall back
to sstable_mutation_reader after reading the static row.

A unit test is added.

Ref #7883.
2021-02-01 19:32:07 +02:00
Pavel Solodovnikov
9d17a654a6 raft: use null_sharder for raft tables
Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210201105300.110210-1-pa.solodovnikov@scylladb.com>
2021-02-01 18:52:04 +02:00
Gleb Natapov
382ee066bf database: drop duplicated function
The database class has two duplicated functions, keyspaces() and
get_keyspaces(). Drop the former since it is used in one place only.

Message-Id: <20210201135333.GA1403508@scylladb.com>
2021-02-01 18:52:04 +02:00
Tomasz Grabiec
eac9c1d80a Merge "raft: configuration changes with joint consensus" from Kostja
Support configuration changes based on joint consensus.
When a user adds a configuration entry, commit an interim "joint
consensus" configuration to the log first, and transition to the
final configuration once both C_old and C_new configurations
accept the joint entry.
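
As a rough illustration of the joint-consensus rule, an entry counts as accepted only when a majority of C_old and a majority of C_new have both acknowledged it. The function names and the use of plain sets of server ids are assumptions for the sketch, not the raft module's API.

```python
def has_majority(voters, acks):
    """True if strictly more than half of 'voters' appear in 'acks'."""
    return 2 * len(voters & acks) > len(voters)

def joint_accepted(c_old, c_new, acks):
    # Under a joint configuration both majorities are required, so neither
    # the old nor the new server set can decide on its own.
    return has_majority(c_old, acks) and has_majority(c_new, acks)
```

For example, with C_old = {1, 2, 3} and C_new = {3, 4, 5}, acknowledgements from {1, 2, 3} satisfy only the old configuration; adding server 4 satisfies both.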

Misc cleanups.

* scylla-dev/raft-config-changes-v2:
  raft: update README.md
  raft: add a simple test for configuration changes
  raft: joint consensus, wire up configuration changes in the API
  raft: joint consensus, count votes using joint config
  raft: joint consensus, wire up configuration changes in FSM
  raft: joint consensus, update progress tracker with joint configuration
  raft: joint consensus, don't store configuration in FSM
  raft: joint consensus, keep track of the last confchange index in the log
  raft: joint consensus, implement helpers in class configuration
  raft: joint consensus, use unordered_set for server_address list
  raft: joint consensus, switch configuration to joint
  raft: rename check_committed() to maybe_commit()
  raft: fix spelling and add comments
2021-02-01 18:52:04 +02:00
Benny Halevy
4b309e0829 compaction: log sstable origin
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Benny Halevy
77328a936a sstables: scylla_metadata: add support for sstable_origin
Add new scylla_metadata_type::SSTableOrigin.
Store and retrieve a sstring in the scylla metadata component.
Pass sstable_writer_config::origin from the mx sstable writer
and ignore it in the k_l writer.

Add unit test to verify the sstable_origin extension
using both empty and a random string.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Benny Halevy
22f6023ac3 sstables: sstable_writer_config: add origin member
Add a string describing where the sstables originated
from (e.g. memtable, repair, streaming, compaction, etc.)

If configure_writer is called with a nullptr, the origin
will be equal to an empty string.

Introduce test_env_sstables_manager that provides an overload
of configure_writer with no parameters that calls the base-class'
configure_writer with "test" origin.  This was to reduce the
code churn in this patch and to keep the tests simple.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-02-01 16:45:52 +02:00
Nadav Har'El
75a4281bff cql-pytest: test the units supposed to be usable for "duration" type
This patch adds a test for the different units which are supposed to
be usable for assigning a "duration" type in CQL. It turns out that
all documented units are supported correctly except µs (with a unicode
mu), so the test reproduces issue #8001.

The test xfails on Scylla (because µs is not supported) and passes
on Cassandra.

Refs: #8001.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210131192220.407481-1-nyh@scylladb.com>
2021-02-01 11:05:10 +01:00
Avi Kivity
bb202db1ff Merge 'dist/offline_installer/redhat: fix umask error' from Takuya ASADA
Since makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need to use the latest version of makeself, and specify the --keep-umask
option.

Fixes #6243

Closes #6244

* github.com:scylladb/scylla:
  dist/offline_redhat: fix umask error
  dist/offline_installer/redhat: support cross build
2021-01-31 18:47:27 +02:00
Takuya ASADA
49e4f318a0 dist/offline_redhat: fix umask error
Since makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need to use the latest version of makeself, and specify the --keep-umask
option.

Fixes #6243
2021-01-31 21:37:49 +09:00
Takuya ASADA
74d7e31576 dist/offline_installer/redhat: support cross build
Support cross builds by running CentOS7 in docker, so the installer can now be
built on Fedora.
Switching the container image is also supported; tested on Oracle Linux 7 and
CentOS 7/8.
2021-01-31 21:37:49 +09:00
Avi Kivity
9271e4bf6e Update seastar submodule
* seastar 52d41277a...cb3aaf07e (2):
  > tls: reloadable_credentials_base: add_dir_watch: fix root dir detection
  > scripts/perftune.py: convert nic option in old perftune.yaml to list for compatibility
2021-01-31 13:28:45 +02:00
Raphael S. Carvalho
298d54ceb0 utils/fragment_temporary_buffer: don't push empty fragment if data size is fragment-aligned
last fragment is unconditionally pushed to set of fragments, so if data
size is fragment-aligned, an empty fragment will be needlessly pushed to
the back of the fragment set.

note: i haven't tested if empty fragment at back of set will cause issues,
i think it won't, but this should be avoided anyway.
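
The intended behaviour can be sketched like this. The function name is hypothetical and the 128 KiB default is an assumption for illustration; the sketch only computes the fragment sizes and does not allocate anything.

```python
def fragment_sizes(data_size, fragment_size=128 * 1024):
    """Sizes of the fragments needed to hold data_size bytes."""
    full, remainder = divmod(data_size, fragment_size)
    sizes = [fragment_size] * full
    # Push the trailing fragment only when there is leftover data, so a
    # fragment-aligned size does not end with an empty fragment.
    if remainder:
        sizes.append(remainder)
    return sizes
```

For a fragment-aligned size such as `fragment_sizes(8, 4)` the result is two full fragments and nothing else.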

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-3-raphaelsc@scylladb.com>
2021-01-30 20:54:20 +02:00
Raphael S. Carvalho
e745f1e697 utils/fragmented_temporary_buffer: avoid reallocations by reserving upfront
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-2-raphaelsc@scylladb.com>
2021-01-30 20:54:20 +02:00
Raphael S. Carvalho
08e838d4b5 utils/fragmented_temporary_buffer: simplify allocate_to_fit()
1) reuse default_fragment_size for knowledge of max fragment size
2) fragments_count is not a good name as it doesn't include last non-full
fragment (if present), so rename it.
3) simplify calculation of last fragment size

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-1-raphaelsc@scylladb.com>
2021-01-30 20:54:20 +02:00
Pavel Solodovnikov
b9a280161d raft: introduce raft_rpc class
The patch contains a skeleton implementation for the Scylla-specific
Raft RPC module.

It uses `netw::messaging_service` as underlying mechanism to send
RPC messages.

The instance is supposed to be bound to a single raft group.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-30 01:12:35 +03:00
Pavel Solodovnikov
1a979dbba2 raft: add Raft RPC verbs to messaging_service and wire up the RPC calls
All RPC module APIs except for `send_snapshot` should resolve as
soon as the message is sent, so these messages are passed via
`send_message_oneway_timeout`.

`send_snapshot` message is sent via `send_message_timeout` and
returns a `future<>`, which resolves when snapshot transfer
finishes or fails with an exception.

All necessary functions to wire the new Raft RPC verbs are also
provided (such as `register` and `unregister` handlers).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-30 01:11:17 +03:00
Pavel Solodovnikov
e30a55ba2f configure.py: compile serializer.cc
This file was not added to the configure.py,
which `raft_sys_table_storage` series was supposed to do.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-30 01:09:32 +03:00
Konstantin Osipov
a8f2fa7fa0 raft: update README.md 2021-01-29 22:07:08 +03:00
Konstantin Osipov
b7692af8bc raft: add a simple test for configuration changes
Test adding, removing, and replacing a node.

With fix-ups by Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-29 22:07:08 +03:00
Konstantin Osipov
c7b5a60320 raft: joint consensus, wire up configuration changes in the API
Now that we've implemented joint consensus based configuration changes,
replace add_server()/remove_server() with a more general set_configuration().
2021-01-29 22:07:08 +03:00
Konstantin Osipov
afadc7c0a1 raft: joint consensus, count votes using joint config
Send RequestVote to a joint config.

We need to exclude self from the list of peers
if we're not part of the current configuration.
Avoid disrupting the cluster in this case.

Maintain separate status for previous and current config when counting
votes.
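
A rough sketch of counting votes separately per configuration, written in Python for brevity. The names `has_majority` and `election_won` are illustrative only and are not taken from the patch; configurations are modelled as plain sets of server ids.

```python
def has_majority(config, granted):
    """True if more than half of the voters in config granted their vote."""
    return len(config & granted) * 2 > len(config)


def election_won(previous, current, granted):
    """In a joint configuration a candidate needs a majority in both sets.

    previous is None when no configuration change is in progress.
    """
    if previous is None:
        return has_majority(current, granted)
    return has_majority(previous, granted) and has_majority(current, granted)
```

Keeping the two tallies apart matters because a majority of the union of both sets is not the same as a majority in each set.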
2021-01-29 22:07:08 +03:00
Konstantin Osipov
8b86d91754 raft: joint consensus, wire up configuration changes in FSM
When add_entry() with a new configuration is submitted,
create a joint configuration and switch to it immediately.
Refuse to enter joint configuration if a configuration
change is already in progress.
When the leader has committed an entry with joint configuration,
append a new entry with final configuration and switch to it.

Resign leadership if the current leader is not part of a new
configuration.

When we change from A, B, C to B, C, D and the leader is A,
then, when C_new starts to be used, the leader is not part of
the current configuration, so it doesn't have to be in the tracker.
Do not try to find & advance leader progress unconditionally then.
2021-01-29 22:07:08 +03:00
Konstantin Osipov
18a684ba11 raft: joint consensus, update progress tracker with joint configuration
The leader doesn't have to be part of the current
configuration, so add a way to access follower_progress for the leader
only if it is present.

Upon configuration changes, preserve progress information
for intact nodes, remove for removed, and create a new progress
object for added nodes.

When tracking commit progress in joint configuration mode,
calculate two commit indexes for two configurations, and
choose the smallest one.
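
The two rules above can be sketched as follows. This is a Python illustration under simplifying assumptions, not the tracker from the patch: progress is a dict keyed by server id, `match_idx` maps every voter (including the leader, if it is a member) to its highest replicated index, and all function names are hypothetical.

```python
def update_progress(progress, new_members, next_idx):
    """Keep progress for surviving nodes, drop removed ones, add new ones."""
    return {
        node: progress.get(node, {"match_idx": 0, "next_idx": next_idx})
        for node in new_members
    }


def committed_index(config, match_idx):
    """Highest index replicated on a majority of config."""
    matches = sorted((match_idx.get(n, 0) for n in config), reverse=True)
    return matches[len(config) // 2]


def joint_committed_index(previous, current, match_idx):
    """An entry is committed only if both configurations have committed it."""
    return min(committed_index(previous, match_idx),
               committed_index(current, match_idx))
```

Taking the minimum is what keeps an entry from being considered committed while only one of the two configurations has a majority for it.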
2021-01-29 22:07:08 +03:00
Konstantin Osipov
20df1955b2 raft: joint consensus, don't store configuration in FSM
In follower state, FSM doesn't know the current cluster
configuration.  Instead of trying to watch the follower log for
configuration changes to keep FSM copy up to date, remove it from
FSM altogether since the follower doesn't need it anyway.

When entering candidate or leader state, fetch the most recent
configuration from the log and initialize the state specific
state with it.
2021-01-29 22:07:07 +03:00
Konstantin Osipov
b29181875c raft: joint consensus, keep track of the last confchange index in the log
When initializing the log, find the most recent configuration
change index, if present.
Maintain the most recent configuration change index when
the log is truncated or entries are appended to it.
The last configuration change index will be used by FSM when it enters
candidate or leader state to fetch the current configuration.

We never truncate beyond a single in-progress configuration
change, so storing the previous value of last_conf_idx
helps avoid log backward scan on truncation in 100% of cases.
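
As an illustration only, here is a toy Python log that maintains the index this way. The class layout and the `prev_conf_idx` field name are assumptions made for the sketch; only `last_conf_idx` is named in the message above.

```python
class Log:
    """Toy log: a list of (is_config, payload) entries, indexed from 1."""

    def __init__(self, entries=()):
        self.entries = []
        self.last_conf_idx = 0  # 0 means "no configuration entry in the log"
        self.prev_conf_idx = 0  # value last_conf_idx had before the latest change
        for is_config, payload in entries:
            self.append(is_config, payload)

    def append(self, is_config, payload):
        self.entries.append((is_config, payload))
        if is_config:
            self.prev_conf_idx = self.last_conf_idx
            self.last_conf_idx = len(self.entries)

    def truncate(self, idx):
        """Drop entry idx and everything after it."""
        del self.entries[idx - 1:]
        if self.last_conf_idx >= idx:
            # At most one configuration change is in flight, so falling
            # back one step is enough and no backward scan is needed.
            self.last_conf_idx = self.prev_conf_idx
```

The fallback relies on the stated invariant; a log that could lose two configuration entries in one truncation would need a scan instead.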

Remove all unused log constructors.
2021-01-29 22:07:07 +03:00
Konstantin Osipov
6e128aa357 raft: joint consensus, implement helpers in class configuration 2021-01-29 22:07:07 +03:00
Konstantin Osipov
1ca738d9a2 raft: joint consensus, use unordered_set for server_address list 2021-01-29 22:07:07 +03:00
Konstantin Osipov
df944f953c raft: joint consensus, switch configuration to joint
In order to work correctly in transitional configuration,
participants must enter it after crashes, restarts and
state changes.

This means it must be stored in Raft log and snapshot
on the leader and followers.

This is most easily done if transitional configuration
is just a flavour of standard configuration.

In FSM, rename _current_config to _configuration,
it now contains both current and future configuration
at all times.
2021-01-29 22:07:07 +03:00
Konstantin Osipov
076e46af9e raft: rename check_committed() to maybe_commit()
This is what the function does, and it's the name
used in other implementations.
2021-01-29 22:07:07 +03:00
Gleb Natapov
aad0209b1c raft: fix spelling and add comments
Fix spelling errors in a few comments,
improve comments.

With fix-ups by Gleb Natapov <gleb@scylladb.com>
2021-01-29 22:07:07 +03:00
Pavel Emelyanov
575c992a35 test: Bring test_apply_monotonically_is_monotonic back to work
The idea of the monotonicity checking test is: try to apply
one random partition to another random one while sequentially
failing allocations. Each time allocation fails (with the
bad_alloc exception) -- check the exception guarantee is
respected, then apply (!) the very same two partitions to
each other. At the end of the test we make sure, that an
exception may pop up at any point of application and it
will be safe.

This idea is flawed currently. When verifying the guarantee
the test moves the 2nd partition and leaves it empty for the
next loop iteration. So right on the 2nd attempt to apply
partitions it becomes a no-op, doesn't fail and no more
exceptions arise.

Fix by restoring both partitions at the end of each check.
Broken since 74db08165d.
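
The intended shape of the loop can be sketched in Python. Everything here is a stand-in: dicts play the role of partitions, `AllocationFailure` replaces `bad_alloc`, and a step counter replaces the real allocation failure injector.

```python
import copy


class AllocationFailure(Exception):
    """Stands in for std::bad_alloc in this sketch."""


def apply_monotonically(target, source, fail_after):
    """Move entries from source into target, failing at the given step.

    Entries moved before the failure stay in target, so a retry can resume.
    """
    for step, key in enumerate(sorted(source)):
        if step == fail_after:
            raise AllocationFailure()
        target[key] = max(target.get(key, 0), source.pop(key))


def check_every_failure_point(first, second):
    """Inject a failure at step 0, 1, 2, ... until apply succeeds."""
    expected = {k: max(first.get(k, 0), second.get(k, 0))
                for k in first.keys() | second.keys()}
    failures = 0
    fail_after = 0
    while True:
        # Start each iteration from pristine copies of both inputs.
        target, source = copy.deepcopy(first), copy.deepcopy(second)
        try:
            apply_monotonically(target, source, fail_after)
            assert target == expected
            return failures
        except AllocationFailure:
            failures += 1
            # Exception guarantee: finishing the interrupted apply
            # must still produce the full result.
            apply_monotonically(target, source, fail_after=-1)
            assert target == expected
        fail_after += 1
```

If `source` were reused across iterations instead of being copied afresh, it would be empty after the first retry and later iterations would never fail, which is the flaw described above.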

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210129153641.5449-1-xemul@scylladb.com>
2021-01-29 18:47:15 +01:00
Tomasz Grabiec
16eb4c6ce2 Merge "raft: system table backed persistency module" from Pavel Solodovnikov
This series contains an initial implementation of raft persistency module
that uses `raft` system table as the underlying storage model.

"system.raft" table will be used as a backend storage for implementing
raft persistence module in Scylla. It combines both raft log,
persisted vote and term, and snapshot info.

The table is partitioned by group id, thus allowing multi-raft
operation. The rest of the table structure mirrors the fields of
corresponding core raft structures defined in `raft.hh`, such as
`raft::log_entry`.

The raft table stores only the latest snapshot id while
the actual snapshot will be available in a separate table
called `system.raft_snapshots`. The schema of `raft_snapshots`
mirrors the fields of `raft::snapshot` structure.

IDL definitions are also added for every raft struct so that we
automatically provide serialization and deserialization facilities
needed both for persistency module and for future RPC implementation.

The first patch is a side-change needed to provide complete
serialization/deserialization for `bytes_ostream`, which we
need when persisting the raft log in the table (since `data`
is a variant containing `raft::command` (aka `bytes_ostream`)
among others).
`bytes_ostream` was lacking `deserialize` function, which is
added in the patch.

The second patch provides serializer for `lw_shared_ptr<T>`
which will be used for `raft::append_entries`, which has
a field with `std::vector<const lw_shared_ptr<raft::log_entry>>`
type.

There is also a patch to extend `fragmented_temporary_buffer`
with a static function `allocate_to_fit` that allocates an
instance of the fragmented buffer that has a specified size.
Individual fragment size is limited to 128kb.

The patch-set also contains the test suite covering basic
functionality of the persistency module.

* manmanson/raft-api-impl-v11:
  raft/sys_table_storage: add basic tests for raft_sys_table_storage
  raft: introduce `raft_sys_table_storage` class
  utils: add `fragmented_temporary_buffer::allocate_to_fit`
  raft: add IDL definitions for raft types
  raft: create `system.raft` and `system.raft_snapshots` tables
  serializer: add `serializer<lw_shared_ptr<T>>` specialization
  serializer: add `deserialize` function overload for `bytes_ostream`
2021-01-29 11:40:39 +02:00
Pavel Solodovnikov
e309502c42 raft/sys_table_storage: add basic tests for raft_sys_table_storage
The test suite covers the most basic use cases for the system table
backed raft persistency module:
 * store/load vote and term
 * store/load snapshot
 * store snapshot with log tail truncation
 * store/load log entries
 * log truncation

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 02:00:27 +03:00
Pavel Solodovnikov
aebb1987b5 raft: introduce raft_sys_table_storage class
This is the implementation of raft persistency module that
uses `raft` system table as the underlying storage model.

The instance is supposed to be bound to a single raft group.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 02:00:12 +03:00
Pavel Solodovnikov
d14dc030ac utils: add fragmented_temporary_buffer::allocate_to_fit
Introduce `fragmented_temporary_buffer::allocate_to_fit` static
function returning an instance of the buffer of a specified size.

The allocated buffer fragments have a size of at most 128kb.
`bytes_ostream` has the same hard-coded limit, so just use the
same here.

This patch will be later needed for `raft::log_entry` raw data
serialization when writing to the underlying persistent storage.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 01:59:16 +03:00
Pavel Solodovnikov
e1504bbf0e raft: add IDL definitions for raft types
Changes to the `configuration` and `tagged_uint64` classes are needed
to overcome limitations of the IDL compiler tool, i.e. we need to
supply a constructor to the struct initializing all the
members (raft::configuration) and also need to make an accessor
function for private members (in case of raft::tagged_uint64).

All other structs mirror raft definitions in exactly the same way
they are declared in `raft.hh`.

`tagged_id` and `tagged_uint64` are used directly instead of their
typedef-ed companions defined in `raft.hh` since we don't want
to introduce indirect dependencies. In such case it can be guaranteed
that no accidental changes made outside of the idl file will affect idl
definitions.

This patch also fixes a minor typo in `snapshot_id_tag` struct used
in `snapshot_id` typedef.
2021-01-29 01:59:10 +03:00
Pavel Solodovnikov
cf5b8c4b79 raft: create system.raft and system.raft_snapshots tables
System raft table will be used as a backend storage for implementing
raft persistence module in Scylla. It combines both raft log,
persisted vote and term, and snapshot info.

The table is partitioned by group id, thus allowing multi-raft
operation. The rest of the table structure mirrors the fields of
corresponding core raft structures defined in `raft.hh`, such as
`raft::log_entry`.

The raft table stores only the latest snapshot id while
the actual snapshot will be available in a separate table
called `system.raft_snapshots`. The schema of `raft_snapshots`
mirrors the fields of `raft::snapshot` structure.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 01:59:04 +03:00
Pavel Solodovnikov
83c26e542d serializer: add serializer<lw_shared_ptr<T>> specialization
This one works similar to `serializer<optional<T>>` and will be
later needed for serializing `raft::append_request`, which has
a field containing `lw_shared_ptr`.

Users should be warned, though: this code assumes that the pointer
is never null. This is done to mirror the serialize implementation
for `lw_shared_ptr:s` in the messaging_service.cc, which is
subject to being deleted in favor of the impl in the
`serializer_impl.hh`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 01:58:46 +03:00
Avi Kivity
b32ece6975 Update tools/java submodule
* tools/java 4a55b81941...78c8ef4f54 (1):
  > nodetool: do no treat table name with dot as a secondary index

Fixes #6521.
2021-01-28 16:16:47 +02:00
Kamil Braun
bf115e7d69 schema_tables: put schema tables on shard 0
We use a custom sharder for all schema tables: every table under
the `system_schema` keyspace, plus `system.scylla_table_schema_history`.
This sharder puts all data on shard 0.

To achieve this, we hardcode the sharder in initial schema object
definitions. Furthermore - since the sharder is not stored inside schema
mutations yet - whenever we deserialize schema objects from mutations,
we modify the sharder based on the schema's keyspace and table names.
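
A small Python sketch of the name-based decision. The function names are hypothetical, and the modulo fallback is only a placeholder for the regular sharder, not Scylla's actual algorithm.

```python
SCHEMA_KEYSPACE = "system_schema"
EXTRA_SCHEMA_TABLES = {("system", "scylla_table_schema_history")}


def is_schema_table(keyspace, table):
    """True for every table that should use the shard-0 sharder."""
    return keyspace == SCHEMA_KEYSPACE or (keyspace, table) in EXTRA_SCHEMA_TABLES


def shard_of(keyspace, table, token, shard_count):
    """Pick the owning shard for a token of the given table."""
    if is_schema_table(keyspace, table):
        return 0  # schema tables keep all of their data on shard 0
    # Placeholder for the regular token-based sharder.
    return token % shard_count
```

Because the choice depends only on keyspace and table names, it can be reapplied whenever a schema object is rebuilt from mutations.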

A regression test is added to ensure no one forgets to set the special
sharder for newly added schema tables. This test assumes that all newly
added schema tables will end up in the `system_schema` keyspace (other
tables may go unnoticed, unfortunately).

Closes #7947
2021-01-28 13:28:22 +02:00
Avi Kivity
32cdcc0c8b Merge "sstables: consolidate reader factory methods" from Botond
"
Currently there are three different methods for creating an sstable
reader:
* one for single key reads
* one for ranged reads
* and one nobody uses

This patch-set consolidates all these into a single `make_reader()`
method, which behind the scenes uses the same logic to dispatch to the
right sstable reader constructor that `sstables::as_mutation_source()`
uses.

This patch-set is part of an effort to clean up the jungle that is the
various reader creation methods. The next step is to clean up the
sstable_set, which has even more methods.

One very sad discovery I made while working on this patch-set is that we
still default `mutation_reader::forwarding` to `yes` in the sstable
range reader creator method and in the `mutation_source::make_reader()`.
I couldn't assume that all callers are passing what they mean as the
value for that parameter. I found many sites in tests that create
forwardable single partition readers. This is also something we should
address soon.

Tests: unit(release, debug:v3)
"

* 'sstables-consolidate-reader-factory-methods-v4' of https://github.com/denesb/scylla:
  cql_query_test: add unit test covering the non-optimal TWCS sstable read path
  sstable_mutation_reader: consolidate constructors
  tests: don't pass temporary ranges to readers
  sstables: sstable_mutation_reader: remove now unused whole sstable constructor
  sstables: stats: remove now unused sstable_partition_reads counter
  sstable: remove read_.*row.*_flat() methods
  tree-wide: use sstables::make_reader() instead of the read_.*row.*_flat() methods
  sstables: pass partition_range to create_single_key_sstable_reader()
  sstables: sstable: add make_reader()
2021-01-28 12:05:06 +02:00
Botond Dénes
1e9ce62ee6 cql_query_test: add unit test covering the non-optimal TWCS sstable read path
The sstable read path for TWCS tables takes a different path when the
optimized read path cannot be used. This path was found to be not
covered at all by unit tests which allowed a trivial use-after-free to
slip in. Add a unit test to cover this path as well, so ASAN can catch
such bugs in the future.
2021-01-28 11:34:03 +02:00
Avi Kivity
55609f2033 Update seastar submodule
* seastar a287bb1a3...52d41277a (8):
  > fair_queue: Preempted requests got re-queued too far
  > scripts/perftune.py: remove repeated items after merging options from file
  > file.hh: Remove fair_queue.hh
  > Merge "Reloadable TLS certificate tolerance" from Calle
  > Merge "Cancellable IO" from Pavel E
  > abort-source: Improve the subscriptions management
  > fair_queue: Improve requests preemption while in pending state
  > http: add support for Default handler (/*)
2021-01-28 08:45:33 +01:00
Konstantin Osipov
b4f875f08e uuid: reduce code dependency on UUID_gen.hh
Do not include UUID_gen.hh in trace_state.hh and lists.hh
to reduce header level dependency on it.

Message-Id: <20210127173114.725761-2-kostja@scylladb.com>
2021-01-27 20:08:29 +02:00
Botond Dénes
6024ef5dad sstable_mutation_reader: consolidate constructors
The two remaining sstable constructors are very similar apart from the
content of the initialize lambda. Speaking of which, the two remaining
initializer lambdas can be easily merged into one too. So this patch
does just that: it consolidates the two constructors into one, and merges
the initializer lambdas and extracts the result into a member
method. This means we have to store the previously captured variables as
members, but this is actually a good thing: when debugging we can see
the range and slice the reader is reading, and we are not actually
paying for it either -- they were already stored, just out of sight.
2021-01-27 17:38:17 +02:00
Botond Dénes
dd26a96e63 tests: don't pass temporary ranges to readers
The sstable_mutation_reader, like all other mutation readers expects
that the partition-range passed to it is kept alive by its creator
for the duration of its lifetime. However, the single-key constructor
of the sstable reader was more tolerant, as it only extracted the key
from the range, essentially requiring only the key to be kept alive (but
not the containing range). Naturally, in time some code came to rely on
it and ended up passing temporary ranges to the reader. This behaviour
will no longer be acceptable as we are about to consolidate the various
sstable reader constructors, uniformly requiring that the range is kept
alive. So this patch fixes up the tests so they work with this stricter
requirement. Only two occurrences were found.
2021-01-27 17:38:17 +02:00
Botond Dénes
43ad64db78 sstables: sstable_mutation_reader: remove now unused whole sstable constructor 2021-01-27 17:38:17 +02:00
Botond Dénes
ec6c540c30 sstables: stats: remove now unused sstable_partition_reads counter 2021-01-27 17:38:17 +02:00
Botond Dénes
5f18e9eb37 sstable: remove read_.*row.*_flat() methods 2021-01-27 17:38:17 +02:00
Botond Dénes
c3b4e990a2 tree-wide: use sstables::make_reader() instead of the read_.*row.*_flat() methods 2021-01-27 17:38:17 +02:00
Botond Dénes
080bc2ffec sstables: pass partition_range to create_single_key_sstable_reader()
We want to unify the various sstable reader creation methods and this
method taking a ring position instead of a partition range like
everybody else stands in the way of that.

This in effect reverts 68663d0de.
2021-01-27 17:38:14 +02:00
Wojciech Mitros
a1f93e4297 api: use a list instead of a vector to remove a large allocation in api handler
Follow-up to #7917

The size of a cf::column_family_info is 224 bytes, so an std::vector that
contains one for each column family may be very large, causing allocations
of over 1MB.
Considering the vector is used only for iteration, it can be changed to
a non-contiguous list instead.
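
The arithmetic behind the concern, as a short Python check. The 224-byte entry size comes from this message; reading "1MB" as 1 MiB and ignoring any extra vector capacity are assumptions.

```python
ENTRY_SIZE = 224            # bytes per cf::column_family_info, per the message
LARGE_ALLOCATION = 1 << 20  # 1 MiB, assumed reading of "1MB"


def contiguous_bytes(table_count, entry_size=ENTRY_SIZE):
    """Size of the single buffer a vector needs for table_count entries."""
    return table_count * entry_size


def tables_until_large(entry_size=ENTRY_SIZE, limit=LARGE_ALLOCATION):
    """Smallest table count whose vector buffer exceeds limit bytes."""
    return limit // entry_size + 1
```

Under these assumptions the contiguous buffer crosses 1 MiB at 4682 column families, whereas a list allocates each node separately.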

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #7973
2021-01-27 16:02:07 +02:00
Avi Kivity
aec231ba2e Merge "Unify query paths" from Botond
"
Currently we have two parallel query paths:
* database::query() -> table::query() -> data_query()
* mutation::query()

The former is used by single partition queries, the latter by range
scans, as mutation::query() is used to convert reconcilable_result to
query::result (which means it is also used in single partition queries
if it triggers read repair). This is a rather unfortunate situation as
we have two parallel implementations of the query code, which means they
are prone to diverge, and in fact they already have -- more on that
later.

This patchset aims to remedy this situation by retiring
`mutation::query()` and migrating users to an implementation based on
the "standard" query path, in other words one using the same building
blocks as the `database::query()` path. This means using
`compact_mutation` for compacting and `query_result_builder` for result
building. These components were created to work with
`flat_mutation_reader`, but introducing a reader into this pipeline
would mean that we'd have to make all the related APIs asynchronous,
which would cause an insane amount of churn. To avoid this, this
patchset adds an API compatible `consume()` method to `mutation`, which
can accept a `compact_mutation` instance as-is. This allows an elegant
and succinct reimplementation. So far so good.

As mentioned above, the two implementations have diverged in time, or
have been different from the start. The difference manifests when
calculating digests, more precisely in which tombstones are included in
the digest. The retired `mutation::query()` path incorporates only
non-purgeable tombstones in the digest. The standard query path however
incorporates all tombstones, even those that can be purged. After some
scrutiny however this difference proved to be completely theoretical, as
the code path where this would matter -- converting reconcilable result
to query result -- passes min timestamp as the query time to the
compaction, so nothing is compacted and hence the difference has no
chance to manifest.

This patch-set was motivated by the desire to provide a single solution
to #7434, instead of two, one for each path.

Tests: unit(release:v2, debug:v2, dev:v3)
"

* 'unified-query-path/v3' of https://github.com/denesb/scylla:
  mutation: remove now unused query() and query_compacted()
  treewide: use query_mutations() instead of mutation::query()
  mutation_test: test_query_digest: ensure digest is produced consistently
  mutation_query: introduce query_mutation()
  mutation_query: to_data_query_result(): migrate to standard query code
  mutation_query: move to_data_query_result() to mutation_partition.cc
  mutation: add consume()
  flat_mutation_reader: move mutation consumer concepts to separate header
  mutation compactor: query compaction: ignore purgeable tombstones
2021-01-27 15:58:47 +02:00
Botond Dénes
a5a8037f6e sstables: sstable: add make_reader()
This will be the only method to create sstable readers with. For now we
leave the other variants, they as well as their users will be removed in
a following patch.
2021-01-27 15:20:06 +02:00
Nadav Har'El
2113849a2b cql-pytest: reproducer for toJson() bug with doubles
This patch adds a cql-pytest, test_json.py::test_tojson_double(),
which reproduces issue #7972, where toJson() prints some doubles
incorrectly (truncated to integers) while printing others fine (I
still don't know why, this will need to be debugged).

The test is marked xfail: It fails on Scylla, and passes on Cassandra.

Refs #7972.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210127124338.297544-1-nyh@scylladb.com>
2021-01-27 14:00:25 +01:00
Pavel Solodovnikov
10b117aada raft: create dummy impl for schema changes state machine
This patch introduces `schema_raft_state_machine` class
which is currently just a dummy implementation throwing a
"not implemented" exception for every call.

Will be needed later to construct an instance of `raft::server`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210126193413.1520948-1-pa.solodovnikov@scylladb.com>
2021-01-27 12:33:27 +01:00
Pavel Solodovnikov
223c823963 serializer: add deserialize function overload for bytes_ostream
For some reason we had a distinct specialization of `serialize`
function to handle `bytes_ostream` but not `deserialize`.

This will be used in the following patches.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-26 23:21:15 +03:00
Asias He
c82250e0cf gossip: Allow deferring advertise of local node to be up
Currently the replacing node sets the status as STATUS_UNKNOWN when it
starts gossip service for the first time before it sets the status to
HIBERNATE to start the replacing operation. This introduces the
following race:

1) Replacing node using the same IP address of the node to be replaced
starts gossip service without setting the gossip STATUS (will be seen as
STATUS_UNKNOWN by other nodes)

2) Replacing node waits for gossip to settle and learns status and
tokens of existing nodes

3) Replacing node announces the HIBERNATE STATUS.

After Step 1 and before Step 3, existing nodes will mark the replacing
node as UP, but haven't marked the replacing node as doing replacing
yet. As a result, the replacing node will not be excluded from the read
replicas and will be considered a target node to serve CQL reads.

To fix, we make the replacing node avoid responding to echo messages when it
is not ready.

Fixes #7312

Closes #7714
2021-01-26 19:02:11 +01:00
Pekka Enberg
9fc83ac627 Update tools/java submodule
* tools/java 8080009794...4a55b81941 (1):
  > cassandra.in.sh: remove debug message
2021-01-26 15:56:58 +02:00
Avi Kivity
90a6c3bd7a build: reduce release mode inline tuning on aarch64
I see a miscompile on aarch64 where a call to format("{}", uuid)
translates a function pointer to -1. When called, this crashes.

Reduce the inline threshold from 2500 to 600. This doesn't guarantee
no miscompiles but all the tests pass with this parameter.

Closes #7953
2021-01-26 11:14:42 +02:00
Tomasz Grabiec
90f6bb754e Merge "raft: replication tests: fixes for debug mode" from Alejo
The following patches fix issues seen occasionally in debug mode.

Notes:
    - In debug mode there's still the UB nullptr arithmetic warning.

* https://github.com/alecco/scylla/tree/raft-ale-tests-07h-wait-propagation:
  raft: replication test: wait for log propagation
  raft: replication test: move wait for log to a function
  raft: replication test: remove unused member
  raft: replication test: use later()
  raft: testing: remove election wait time and just yield
2021-01-26 11:14:42 +02:00
Avi Kivity
f58151d191 test: mutation_test: fix initialization order bug with thread local storage
test_cell_external_memory_usage uses with_allocator() to observe how some
types allocate memory. However, compiler reordering (observed with clang 11
on aarch64) can move the various thread-local CQL type object initialization
into the with_allocator() scope; so any managed object allocated as part of
this initialization also gets measured, and the test fails. The code movement
is legal, as far as I can tell.

Fix this by initializing the type object early; use an atomic_thread_fence
as an optimization barrier so the compiler doesn't eliminate or move
the early initialization.

Closes #7951
2021-01-26 11:14:42 +02:00
Nadav Har'El
356250f720 cql-pytest: tests for fromJson() failing to set tuple elements to null
This patch adds a test for trying to set a tuple element to null with
fromJson(), which works on Cassandra but fails on Scylla. So the test
xfails on Scylla. Reproduces issue #7954.

Refs #7954.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210124082311.126300-1-nyh@scylladb.com>
2021-01-26 11:14:42 +02:00
Avi Kivity
05c435dddc Merge "mutation readers: remove next_partition() workarounds" from Botond
"
`next_partition()` used to return void, so readers that had to call
future returning code had to work around this. Now that
`next_partition()` returns a future, we can get rid of these
workarounds.

Tests: unit(release, debug)
"

* 'next-partition-cross-shard-readers/v1' of https://github.com/denesb/scylla:
  mutation_reader: reader_lifecycle_policy::stopped_reader: drop pending_next_partition flag
  mutation_reader: evictable_reader: remove next_partition() workaround
  mutation_reader: shard_reader: remove next_partition() workaround
  mutation_reader: foreign_reader: remove next_partition() workaround
2021-01-26 11:14:42 +02:00
Nadav Har'El
067330c08f Merge 'redis: support large redis message' from Takuya ASADA
If the message is larger than the current buffer size, we need to consume
more data until we reach the tail of the message.
To do so, we need to return nullptr when we are not at the tail.

Fixes #7273

Closes #7903

* github.com:scylladb/scylla:
  redis: rename _args_size/_size_left
  redis: fix large message handling
2021-01-25 10:11:17 +02:00
Takuya ASADA
229940aaff redis: rename _args_size/_size_left
There are two types of numerical parameter in redis protocol:
 - *[0-9]+ defined array size
 - $[0-9]+ defined string size

Currently, array size is stored to args_count, and string size is
stored to _arg_size / _size_left.
It's a bit hard to understand since both use the same word "arg(s)", so let's
rename the string size variables to _bytes_count / _bytes_left.
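
To illustrate the two kinds of numeric parameter, here is a small stand-alone Python parser for one command. It is not the parser touched by this patch; `parse_command` and `read_number` are hypothetical names, and returning `None` for an incomplete buffer is a convention chosen for this sketch.

```python
def parse_command(buf):
    """Parse one RESP command from buf (bytes).

    Returns (args, bytes_consumed), or None when buf does not yet hold
    the whole command and more data has to be read first.
    """
    pos = 0

    def read_number(prefix):
        nonlocal pos
        end = buf.find(b"\r\n", pos)
        if end == -1:
            return None
        if buf[pos:pos + 1] != prefix:
            raise ValueError("unexpected prefix")
        number = int(buf[pos + 1:end])
        pos = end + 2
        return number

    args_count = read_number(b"*")       # *<n>: number of array elements
    if args_count is None:
        return None
    args = []
    for _ in range(args_count):
        bytes_count = read_number(b"$")  # $<n>: length of the next string
        if bytes_count is None:
            return None
        if len(buf) < pos + bytes_count + 2:
            return None                  # string body not fully received
        args.append(buf[pos:pos + bytes_count])
        pos += bytes_count + 2           # skip the body and its trailing CRLF
    return args, pos
```

Naming the two counters `args_count` and `bytes_count` keeps the array length and the string length visibly distinct, which is the point of the rename.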
2021-01-25 10:26:37 +09:00
Takuya ASADA
7a6ee9858f redis: fix large message handling
If the message is larger than the current buffer size, we need to consume
more data until we reach the tail of the message.
To do so, we need to return nullptr when we are not at the tail.

Fixes #7273
2021-01-25 10:26:37 +09:00
Alejo Sanchez
0d694990cf raft: replication test: wait for log propagation
Wait until entries propagate after adding and before changing leader
using the same code as done for partitioning.

This fixes occasional hangs in debug mode when a test switches to a
different leader without leaving enough time for full propagation.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:33:54 -04:00
Alejo Sanchez
4d1ec88f90 raft: replication test: move wait for log to a function
Move wait for log propagation to its own function for reuse.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:25:48 -04:00
Alejo Sanchez
72f9b108e3 raft: replication test: remove unused member
Initial state doesn't need to specify total entries anymore.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:25:48 -04:00
Alejo Sanchez
db95d6e7f1 raft: replication test: use later()
Instead of sleep 1us use later()

Also use later to yield after sending append entries in rpc test impl.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:25:48 -04:00
Alejo Sanchez
f875ff72c9 raft: testing: remove election wait time and just yield
Replace sleep time for elect_me_leader with yield to speed things up.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-24 20:25:48 -04:00
Pekka Enberg
8258556832 Update tools/python3 submodule
* tools/python3 c579207...199ac90 (1):
  > dist: debian: adjust .orig tarball name for .rc releases
2021-01-24 21:30:59 +02:00
Gleb Natapov
020da49c89 storage_proxy: remove no longer needed range_slice_read_executor
After support for the mixed cluster compatibility feature
DIGEST_MULTIPARTITION_READ was dropped in 854a44ff9b,
range_slice_read_executor and never_speculating_read_executor became
identical, so remove the former for good.

Message-Id: <20210124122731.GA1122499@scylladb.com>
2021-01-24 14:45:22 +02:00
Benny Halevy
088f92e574 paxos_state: learn: fix injected error description
It was copy-pasted from another injection point.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201220091439.3604201-1-bhalevy@scylladb.com>
2021-01-24 11:51:23 +02:00
Takuya ASADA
5d527bd17e scylla_ntp_setup: use chrony on all distributions
To simplify scylla_ntp_setup, use chrony on all distributions.

Closes #7922
2021-01-24 11:45:58 +02:00
Takuya ASADA
984dc44ebf dist: drop /etc/security/limits.d/scylla.conf
Drop limits.d conf file, since we don't use it.
We set these parameters via systemd unit file instead.

Fixes #7925

Closes #7941
2021-01-24 11:43:39 +02:00
Benny Halevy
1847d49971 test: test_env: pick the highest sstable version by default
If possible, test the highest sstable format version,
as it's the most widely used.

If there are pre-written sstables we need to load from the
test directory from an older version, either specify their
version explicitly, or use the new test_env::reusable_sst
method that looks up the latest sstable version in the
given directory and generation.

Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201210161822.2833510-1-bhalevy@scylladb.com>
2021-01-24 10:38:55 +02:00
Botond Dénes
226088d12e mutation_reader: reader_lifecycle_policy::stopped_reader: drop pending_next_partition flag
It's not used anymore.
2021-01-22 16:18:59 +02:00
Botond Dénes
4eb65b12a0 mutation_reader: evictable_reader: remove next_partition() workaround
`next_partition()` now returns a future<>, so we can forward it to the
remote shard in the scope of the next partition call, remove the
now obsolete workaround for the synchronous next partition.
2021-01-22 16:18:30 +02:00
Botond Dénes
febd2feb4c mutation_reader: shard_reader: remove next_partition() workaround
`next_partition()` now returns a future<>, so we can forward it to the
remote shard in the scope of the next partition call, remove the
now obsolete workaround for the synchronous next partition.
2021-01-22 15:53:05 +02:00
Botond Dénes
9c96d74b72 mutation: remove now unused query() and query_compacted() 2021-01-22 15:36:37 +02:00
Botond Dénes
1a3ee71b39 treewide: use query_mutations() instead of mutation::query()
We want to retire the latter.
2021-01-22 15:36:37 +02:00
Botond Dénes
81da6b756f mutation_reader: foreign_reader: remove next_partition() workaround
`next_partition()` now returns a future<>, so we can forward it to the
remote shard in the scope of the next partition call, remove the
now obsolete workaround for the synchronous next partition.
2021-01-22 15:30:36 +02:00
Nadav Har'El
cb9e2ee00a cql-pytest: tests for fromJson() setting a map<ascii, int>
The fromJson() function can take a map JSON and use it to set a map column.
However, the specific example of a map<ascii, int> doesn't work in Scylla
(it does work in Cassandra). The xfailing tests in this patch demonstrate
this. Although the tests use perfectly legal ASCII, scylla fails the
fromJson() function, with a misleading error.

Refs #7949.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210121233855.100640-1-nyh@scylladb.com>
2021-01-22 14:29:25 +01:00
Botond Dénes
a9d726c7ba mutation_test: test_query_digest: ensure digest is produced consistently
Before we retire the mutation::query() code, expand the digest test to
check that the new code replacing it produces identical digest on all
possible equivalent mutations.
2021-01-22 15:27:48 +02:00
Botond Dénes
821ed96e0e mutation_query: introduce query_mutation()
This is a replacement of `mutation::query()`, but with an implementation
based on the standard query result building code.
This will allow us to migrate the remaining `mutation::query()` users
off of said method, which in turn will allow us to retire it finally.
2021-01-22 15:27:48 +02:00
Botond Dénes
c4f12221b8 mutation_query: to_data_query_result(): migrate to standard query code
Reimplement in terms of the standard query result building code. We want
to retire the alternative query result code in `mutation::query()` and
`to_data_query_result()` is one of the main users.
2021-01-22 15:27:48 +02:00
Botond Dénes
164582f33b mutation_query: move to_data_query_result() to mutation_partition.cc
We want to rewrite the above mentioned method's implementation in terms
of the standard query result building code (that of the `data_query()`
path), in order to retire the alternative query code in the mutation
class.
The `data_query()` code uses classes private to `mutation_partition.cc`
and instead of making these public, just move `to_data_query_result()`
to `mutation_partition.cc`.
2021-01-22 15:27:48 +02:00
Botond Dénes
d0c5f550a9 mutation: add consume()
This consume method accepts a `FlattenedConsumer`, the same one that the
name-sake `flat_mutation_reader::consume()` does. Indeed the main
purpose of this method is to allow using the standard query result
building stack with a mutation, the same way said stack is used with
mutation readers currently. This will allow us to replace the parallel
query result building code that currently exists in the
`mutation::query()` and friends, with the standard one.
2021-01-22 15:27:48 +02:00
Botond Dénes
9153f63135 flat_mutation_reader: move mutation consumer concepts to separate header
In the next patch we will want to use these concepts in `mutation.hh`. To
avoid pulling in the entire `flat_mutation_reader.hh` just for these,
and create a circular dependency in doing so, move them to a dedicated
header instead.
2021-01-22 15:27:48 +02:00
Botond Dénes
73808c12eb mutation compactor: query compaction: ignore purgeable tombstones
Keeping purgeable tombstones during query compaction makes query result
building sensitive to whether the data was recently compacted or not; in
particular, different digests will be produced depending on whether
purgeable tombstones happened to be compacted (and thus purged) or not.
This means that two replicas can produce different digests for the same
data if one has compacted some purgeable tombstones and the other has not.

To avoid this, drop purgeable tombstones during query compaction as
well.
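
A toy illustration of the digest mismatch described above, assuming a made-up
cell model of (key, kind, purgeable) tuples and SHA-256 as the digest. None of
these names come from the patch; the real code works on mutation fragments.

```python
import hashlib


def digest(cells, drop_purgeable):
    """Digest a replica's cells, optionally skipping purgeable tombstones."""
    h = hashlib.sha256()
    for key, kind, purgeable in sorted(cells):
        if drop_purgeable and kind == "tombstone" and purgeable:
            continue  # behave as if compaction had already purged it
        h.update(f"{key}:{kind};".encode())
    return h.hexdigest()


# Replica A has not compacted yet, replica B already purged the tombstone.
replica_a = [("k1", "live", False), ("k2", "tombstone", True)]
replica_b = [("k1", "live", False)]

mismatch_before = digest(replica_a, False) != digest(replica_b, False)
match_after = digest(replica_a, True) == digest(replica_b, True)
```

With purgeable tombstones dropped at query time, both replicas hash the same
cells regardless of when each of them last compacted.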
2021-01-22 15:27:48 +02:00
Pavel Emelyanov
90d445464b compaction: Remove compaction_manager::enabled()
This method was marked with 'FIXME -- should not be public'
when it was introduced. Since then it has stopped being used
and can even be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210122083146.5886-1-xemul@scylladb.com>
2021-01-22 14:07:38 +02:00
Kamil Braun
570d15c7bc multishard_combining_reader: do not use smp::count
`multishard_combining_reader` currently only works under the assumption
that every table uses the same sharder configured using the node's number
of shards. But we could potentially specify a different sharder for a chosen table,
e.g. one that puts everything on shard 0.
Then this assumption will be broken and the reader causes a segfault.

Fixes #7945.
2021-01-21 18:28:18 +02:00
Nadav Har'El
328be1ca7c cql-pytest: tests for fromJson() not accepting empty string as integer
When writing to an integer column, Cassandra's fromJson() function allows
not just JSON number constants, it also allows a string containing a
number. Strings which do not hold a number fail with a FunctionFailure.
In particular, the empty string "" is an invalid number, and should fail.

The tests in this patch check this for two integer types: int and
varint.

Curiously, Cassandra and Scylla have opposite bugs here: Scylla fails
to recognize the error for varint, while Cassandra fails to recognize
the error for int. The tests in this patch reproduce these bugs.
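
The expected semantics can be sketched as follows. This is only a model of the
rule described above, not Scylla's or Cassandra's code: `FunctionFailure` and
`json_to_int` are hypothetical names, and rejecting non-integer JSON numbers is
an assumption of the sketch.

```python
import json
import re


class FunctionFailure(Exception):
    """Stand-in for the FunctionFailure error reported to the client."""


def json_to_int(text):
    """Convert a JSON document to an integer the way the tests expect."""
    value = json.loads(text)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    # A string is accepted only if it holds a number; "" does not.
    if isinstance(value, str) and re.fullmatch(r"[+-]?[0-9]+", value):
        return int(value)
    raise FunctionFailure(f"cannot convert {text!r} to an integer")
```

Under this model both `42` and `"42"` are accepted, while the empty string
`""` fails with a FunctionFailure.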

The tests demonstrating Scylla's bug are marked xfail, and the test
demonstrating Cassandra's bug is marked "cassandra_bug" (which means
it is marked xfail only when running against Cassandra, but expected
to succeed on Scylla).

Refs #7944.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210121133833.66075-1-nyh@scylladb.com>
2021-01-21 15:24:48 +01:00
Nadav Har'El
702b1b97bf cql: fix error return from execution of fromJson() and other functions
As reproduced in cql-pytest/test_json.py and reported in issue #7911,
failing fromJson() calls should return a FUNCTION_FAILURE error, but
currently produce a generic SERVER_ERROR, which can lead the client
to think the server experienced some unknown internal error and the
query can be retried on another server.

This patch adds a new cassandra_exception subclass that we were missing -
function_execution_exception - properly formats this error message (as
described in the CQL protocol documentation), and uses this exception
in two cases:

1. Parse errors in fromJson()'s parameters are converted into a
   function_execution_exception.

2. Any exception thrown during the execute() of a native_scalar_function_for
   function is converted into a function_execution_exception.
   In particular, fromJson() uses a native_scalar_function_for.

   Note, however, that when a function already takes care to produce
   a specific Cassandra error, this error is passed through and not
   converted to a function_execution_exception. An example is
   blobAsText(), which can return an invalid_request error, so
   it is left as such and not converted. This also happens in Cassandra.

All relevant tests in cql-pytest/test_json.py now pass, and are
no longer marked xfail. This patch also includes a few more improvements
to test_json.py.

Fixes #7911

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210118140114.4149997-1-nyh@scylladb.com>
2021-01-21 15:21:13 +01:00
Nadav Har'El
49440d67ad Merge: Fix multiple issues with timeuuid type
Merged patch series by Konstantin Osipov:

"These series improve uniqueness of generated timeuuids and change
list append/prepend logic to use client/LWT timestamp in timeuuids
generated for list keys. Timeuuid compare functions are
optimized.

The test coverage is extended for all of the above."

  uuid: add a comment warning against UUID::operator<
  uuid: replace slow versions of timeuiid compare with optimized/tested versions.
  test: add tests for legacy uuid compare & msb monotonicity
  test: add a test case for append/prepend limit
  test: add a test case for monotonicity of timeuuid least significant bits
  uuid: implement optimized timeuuid compare
  test: add a test case for list prepend/append with custom timestamp
  lists: rewrite list prepend to use append machinery
  lists: use query timestamp for list cell values during append
  uuid: fill in UUID node identifier part of UUID
  test: add a CQL test for list append/prepend operations
2021-01-21 13:20:07 +02:00
Konstantin Osipov
e18e2cb9f2 uuid: add a comment warning against UUID::operator< 2021-01-21 13:03:59 +03:00
Konstantin Osipov
845f6c667b uuid: replace slow versions of timeuiid compare with optimized/tested versions. 2021-01-21 13:03:59 +03:00
Konstantin Osipov
56d8d166cb test: add tests for legacy uuid compare & msb monotonicity 2021-01-21 13:03:59 +03:00
Konstantin Osipov
257c5b0879 test: add a test case for append/prepend limit 2021-01-21 13:03:59 +03:00
Konstantin Osipov
d6e65a3735 test: add a test case for monotonicity of timeuuid least significant bits
Ensure that timeuuid least significant bits are compared correctly.
2021-01-21 13:03:59 +03:00
Konstantin Osipov
0af3758aff uuid: implement optimized timeuuid compare
Introduce uint64_t based comparator for serialized timeuuids.

Respect Cassandra legacy for timeuuid compare order.

Scylla uses two versions of timeuuid compare:
- one for timeuuid values stored in uuid columns
- a different one for timeuuid values stored in timeuuid columns.

This commit re-implements these comparators and deprecates the
respective implementations in types.cc. They
will be removed in a following patch.

A micro-benchmark at https://github.com/alecco/timeuuid-bench/
shows 2-4x speed up of the new comparators.
2021-01-21 13:03:59 +03:00
Konstantin Osipov
b4500a55c7 test: add a test case for list prepend/append with custom timestamp
Scylla now takes a custom timestamp into account when
executing list append/prepend operations. Test the new
semantics.
2021-01-21 13:03:59 +03:00
Konstantin Osipov
232ce6f611 lists: rewrite list prepend to use append machinery
Rewrite list prepend to use the same machinery
as append, and thus produce correct results when used in LWT.

After this patch, list prepend begins to honor user supplied timestamps.

If a user supplied timestamp for prepend is less than 2010-01-01 00:00:00
an exception is thrown.

Fixes #7611
2021-01-21 13:03:59 +03:00
Konstantin Osipov
2b8ce83eea lists: use query timestamp for list cell values during append
Scylla list cells are represented internally as a map of
timeuuid => value. To append a new value to a list
the coordinator generates a timeuuid reflecting the current time as key
and adds a value to the map using this key.

Before this patch, Scylla always generated a timeuuid for a new
value, even if the query had a user supplied or LWT timestamp.
This could break LWT linearizability. User supplied timestamps were
ignored.

This is reported as https://github.com/scylladb/scylla/issues/7611

A statement which appended multiple values to a list or a BATCH
generated its own microsecond-resolution timeuuid for each value:

BEGIN BATCH
  UPDATE ... SET a = a + [3]
  UPDATE ... SET a = a + [4]
APPLY BATCH

UPDATE ... SET a = a + [3, 4]

To fix the bug, it's necessary to preserve monotonicity of
timeuuids within a batch or multi-value append, but make sure
they all use the microsecond time, as is set by LWT or user.

To explain the fix, it's first necessary to recall the structure
of time-based UUIDs:

60 bits: time since start of GMT epoch, year 1582, represented
         in 100-nanosecond units
4 bits:  version
14 bits: clock sequence, a random number to avoid duplicates
         in case system clock is adjusted
2 bits:  type
48 bits: MAC address (or other hardware address)
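
As a rough illustration of this layout (not the code from this patch), the
fields can be packed with Python's standard uuid module. `make_timeuuid` is a
hypothetical helper; the 2-bit "type" above is what RFC 4122 calls the variant.

```python
import uuid

# Offset between the UUID epoch (1582-10-15) and the Unix epoch,
# expressed in 100-nanosecond units.
UUID_EPOCH_OFFSET_100NS = 0x01B21DD213814000


def make_timeuuid(unix_micros, clock_seq, node):
    """Pack a microsecond timestamp, clockseq and node id into a v1 UUID."""
    ts = unix_micros * 10 + UUID_EPOCH_OFFSET_100NS  # 60-bit timestamp
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | (1 << 12)  # 4-bit version = 1
    clock_seq &= 0x3FFF                                  # 14-bit clockseq
    clock_seq_hi_variant = (clock_seq >> 8) | 0x80       # 2-bit variant = 0b10
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low,
                             node & 0xFFFFFFFFFFFF))     # 48-bit node id


# Two values of one statement: same microsecond time, different clockseq.
first = make_timeuuid(1611223439000000, 0, 0x0123456789AB)
second = make_timeuuid(1611223439000000, 1, 0x0123456789AB)
```

Here `first.time == second.time` while `first.clock_seq` and
`second.clock_seq` differ, which is the shape of the fix described below.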

The purpose of clockseq bits, as defined in
https://tools.ietf.org/html/rfc4122#section-4.1.5,
is to reduce the probability of UUID collision in case the clock
goes back in time or the node id changes. The implementation should reset it
whenever one of these events may occur.

Since LWT microsecond time is guaranteed to be
unique by Paxos, the RFC provisioning for clockseq and MAC
slots becomes excessive.

The fix thus changes timeuuid slot content in the following way:
- time component now contains the same microsecond time for all
  values of a statement or a batch. The time is unique and monotonic in
  case of LWT. Otherwise it's almost always monotonic, but may not be
  unique if two timestamps are created on different coordinators.
- clockseq component is used to store a sequence number which is
  unique and monotonic for all values within the statement/batch.
- to protect against time back-adjustments and duplicates
  if time is auto-generated, MAC component contains a random (spoof)
  MAC address, re-created on each restart. The address is different
  at each shard.

The change is made for all sources of time: user, generated, LWT.
Conditioning the list key generation algorithm on the source of
time would unnecessarily complicate the code without increasing the
quality (uniqueness) of created list keys.

Since 14 bits of clockseq provide us with only 16383 distinct slots
per statement or batch, 3 extra bits in nanosecond part of the time
are used to extend the range to 131071 values per statement/batch.
If this range is exceeded, an exception is produced.

A twist on the use of clockseq to extend timeuuid uniqueness is
that Scylla, like Cassandra, uses int8 compare to compare lower
bits of timeuuid for ordering. The patch takes this into account
and sign-complements the clockseq value to make it monotonic
according to the legacy compare function.
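
The effect of a signed-byte compare can be shown with a small sketch. It
assumes a 2-byte big-endian sequence number and flips the sign bit of every
byte; the exact encoding used by the patch may differ (for instance because of
the type bits sharing the first clockseq byte), so treat `encode_seq` as
hypothetical.

```python
def to_int8(b):
    """Reinterpret an unsigned byte value (0..255) as a signed int8."""
    return b - 256 if b >= 128 else b


def signed_bytes_key(data):
    """Sort key that mimics a byte-by-byte int8 comparison."""
    return [to_int8(b) for b in data]


def encode_naive(seq):
    return seq.to_bytes(2, "big")


def encode_seq(seq):
    """Flip the sign bit of each byte so int8 order matches numeric order."""
    return bytes(b ^ 0x80 for b in seq.to_bytes(2, "big"))


# 128 has low byte 0x80, which is negative as an int8: it sorts before 127.
naive_is_broken = signed_bytes_key(encode_naive(128)) < signed_bytes_key(encode_naive(127))

seqs = list(range(0, 1 << 14, 37))
ordered = sorted(seqs, key=lambda s: signed_bytes_key(encode_seq(s)))
```

With the flipped encoding, `ordered == seqs`, i.e. sequence numbers stay
monotonic under the signed compare.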

Fixes #7611

test: unit (dev)
2021-01-21 13:03:59 +03:00
Konstantin Osipov
6d1781be36 uuid: fill in UUID node identifier part of UUID
Before this patch, UUID generation code was not creating
sufficiently unique IDs: the 6 byte node identifier was mostly
empty, i.e. only containing shard id. This could lead to
collisions between queries executed concurrently at different
coordinators, and, since timeuuid is used as key in list append
and prepend operations, lead to lost updates.

To generate a unique node id, the patch uses a combination of
hardware MAC address (or a random number if no hardware address is
available) and the current shard id.

The shard id is mixed into higher bits of MAC, to reduce the
chances of NIC collision within the same network.

With sufficiently unique timeuuids as list cell keys, such updates
are no longer lost, but multi-value update can still be "merged"
with another multi-value update.

E.g. if node A executes SET l = l + [4, 5] and node B executes SET
l  = l + [6, 7], the list value could be any of [4, 5, 6, 7], [4,
6, 5, 7], [6, 4, 5, 7] and so on.

At least we are now less likely to get any value lost.
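
To make the "and so on" concrete, the sketch below enumerates the possible
merged lists, assuming each update's own values keep their relative order
(which is what monotonic keys within one statement would give). `merged_orders`
is a hypothetical helper, not part of the patch.

```python
from itertools import combinations


def merged_orders(a, b):
    """All interleavings of a and b that keep each list's internal order."""
    n = len(a) + len(b)
    results = []
    for positions in combinations(range(n), len(a)):
        slots = set(positions)  # indexes taken by values of `a`
        it_a, it_b = iter(a), iter(b)
        results.append([next(it_a) if i in slots else next(it_b)
                        for i in range(n)])
    return results


orders = merged_orders([4, 5], [6, 7])
```

For two 2-value appends this yields 6 possible outcomes, including the three
listed above, and never loses or reorders a value within one update.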

Fixes #6208.

@todo: initialize UUID subsystem explicitly in main()
and switch to using seastar::engine().net().network_interfaces()

test: unit (dev)
2021-01-21 13:03:53 +03:00
Avi Kivity
4cfaab208e allocation_strategy: set preferred max contiguous allocation to 128k for standard allocations
Now that managed_bytes and its users do not assume that a managed_bytes
instance allocated using standard_allocation_strategy is non-fragmented,
we can set the preferred max contiguous allocation to 128k. This causes
managed_bytes to fragment instances that are larger than this size.

Note that managed_bytes is the only user.

Closes #7943
2021-01-21 11:15:13 +02:00
Tomasz Grabiec
f08a3e3fd8 Merge "raft: test fixes, etcd tests, simplification" from Alejo
This patch set adds etcd unit tests for raft.

It also includes a fix for replication test in debug mode and a
simplification for append_request.

Tests: unit ({dev}), unit ({debug}), unit ({release})

*  https://github.com/alecco/scylla/tree/raft-ale-tests-09b:
  raft: etcd unit tests: test log replication
  raft: boost test etcd: test fsm can vote from any state
  raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
  raft: replication test: add etcd test for cycling leaders
  raft: testing: provide primitives to wait for log propagation
  raft: etcd unit tests: initial boost tests
  raft: combine append_request _receive and _send
2021-01-21 10:41:33 +02:00
Pekka Enberg
7d98e05923 Update tools/python3 submodule
* tools/python3 1763a1a...c579207 (1):
  > dist/debian: handle rc version correctly
2021-01-21 10:41:33 +02:00
Avi Kivity
daa0e964fc dbuild: avoid --pids-limit with podman and cgroupsv1
Podman doesn't correctly support --pids-limit with cgroupsv1. Some
versions ignore it, and some versions reject the option.

To avoid the error, don't supply --pids-limit if cgroupsv2 is not
available (detected by its presence in /proc/filesystems). The user
is required to configure the pids limit in
/etc/containers/containers.conf.
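
The detection logic can be sketched like this. dbuild itself is a shell
script, so this is only a model: the function names and the `limit` parameter
are hypothetical, and the sketch assumes the container tool is podman.

```python
def has_cgroup_v2(filesystems_text):
    """Return True if 'cgroup2' is listed in /proc/filesystems content."""
    for line in filesystems_text.splitlines():
        fields = line.split()
        # Lines look like "nodev<TAB>cgroup2" or "<TAB>ext4".
        if fields and fields[-1] == "cgroup2":
            return True
    return False


def pids_limit_args(filesystems_text, limit):
    """Extra podman arguments; empty when only cgroupsv1 is available."""
    if has_cgroup_v2(filesystems_text):
        return [f"--pids-limit={limit}"]
    return []


sample_v2 = "nodev\tsysfs\nnodev\tcgroup\nnodev\tcgroup2\n\text4\n"
sample_v1 = "nodev\tsysfs\nnodev\tcgroup\n\text4\n"
```

On a real system the text would come from reading /proc/filesystems.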

Fixes #7938.

Closes #7939
2021-01-21 10:41:33 +02:00
Botond Dénes
4d581f1bb3 docs/README.md: guides: also mention running and debugging
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210120083304.36447-1-bdenes@scylladb.com>
2021-01-20 16:07:29 +02:00
Avi Kivity
f11a0700a8 Merge "mutation_writer: explicitly close writers" from Benny
"
_consumer_fut is expected to return an exception
on the abort path.  Wait for it and drop any exception
so it won't be abandoned as seen in #7904.

A future<> close() method was added to return
_consumer_fut.  It is called both after abort()
in the error path, and after consume_end_of_stream,
on the success path.

With that, consume_end_of_stream was made void
as it doesn't return a future<> anymore.

Fixes #7904
Test: unit(release)
"

* tag 'close-bucket-writer-v5' of github.com:bhalevy/scylla:
  mutation_writer: bucket_writer: add close
  mutation_writer/feed_writers: refactor bucket/shard writers
  mutation_writer: update bucket/shard writers consume_end_of_stream
2021-01-20 16:07:29 +02:00
Pekka Enberg
6cc981d089 scylla: Add "--build-mode" command line option
This adds a "--build-mode" command line option to "scylla" executable:

$ ./build/dev/scylla --build-mode
dev

This allows you to discover the build mode of a "scylla" executable
without resorting to "readelf", for example, to verify that you are
looking at the correct executable while debugging packaging issues.

Closes #7865
2021-01-20 16:07:29 +02:00
Botond Dénes
7eb8c71342 tools/scylla-types: add link to cql3-type-mapping.md
Just like scylla-sstable-index, scylla-types accepts types in (short)
cassandra class name notation. The mapping from the cql3 type names to
the class names is not straight-forward in all cases, so provide a link
to a table which lists the cassandra class name of all supported types
(and more).

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210120083816.37774-2-bdenes@scylladb.com>
2021-01-20 10:50:33 +02:00
Botond Dénes
882ade7c6a types/scylla-sstable-index: update URL to cql3-type-mapping.md
Said document was recently moved but the URL was not updated.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210120083816.37774-1-bdenes@scylladb.com>
2021-01-20 10:50:33 +02:00
Avi Kivity
114da51d73 Revert "commitlog: fix size of a write used to zero a segment"
This reverts commit df2f67626b. The fix
is correct, but has an unfortunate side effect with O_DSYNC: each
128k write also needs to flush the XFS log. This translates to
32MB/128k = 256 flushes, compared to one flush with the original code.

A better fix would be to prezero without O_DSYNC, then reopen the file
with O_DSYNC, but we can do that later.
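
A minimal sketch of that "better fix", using plain POSIX-style calls rather
than the seastar file API the commitlog actually uses. The helper name, the
file name and the 1 MiB demo size are assumptions made for illustration.

```python
import os
import tempfile

CHUNK = 128 * 1024  # write size mentioned above


def prezero_then_open_dsync(path, size, chunk=CHUNK):
    """Zero a file with ordinary writes, flush once, reopen with O_DSYNC."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        zeros = bytes(chunk)
        written = 0
        while written < size:
            n = min(chunk, size - written)
            written += os.write(fd, zeros[:n])
        os.fsync(fd)  # a single flush instead of one per 128k write
    finally:
        os.close(fd)
    # O_DSYNC is not available on every platform; fall back to 0 there.
    return os.open(path, os.O_WRONLY | getattr(os, "O_DSYNC", 0))


with tempfile.TemporaryDirectory() as tmp:
    segment = os.path.join(tmp, "segment.log")  # hypothetical file name
    fd = prezero_then_open_dsync(segment, 1024 * 1024)
    os.close(fd)
    size_on_disk = os.path.getsize(segment)
```

For a 32MB segment this replaces the 256 synchronous writes with 256 buffered
writes followed by one fsync.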

Reopens #5857.
2021-01-20 10:23:43 +02:00
Avi Kivity
586f16bf79 Merge "Cut snitch -> storage service dependency" from Pavel E
"
Currently storage service and snitch implicitly depend on each
other. Storage service gossips snitch data on start, snitch
kicks the storage service when its configuration changes.

This interdependency is relaxed:

- snitch gossips all its state itself without using the
  storage service as a mediator
- storage service listens for snitch updates with the help
  of self-breaking subscription

Both changes make snitch independent from storage service,
remove yet another call for global storage service from the
codebase and make the storage service -> snitch reference
robust against dangling pointers/references.

tests: unit(dev), dtest.rebuild.TestRebuild.simple_rebuild(dev)
"

* 'br-snitch-gossip-2' of https://github.com/xemul/scylla:
  storage-service: Subscribe to snitch to update topology
  snitch: Introduce reconfiguration signal
  snitch: Always gossip snitch info itself
  snitch: Do gossip DC and RACK itself
  snitch: Add generic gossiping helper
2021-01-20 10:23:43 +02:00
Pavel Solodovnikov
041072b59f raft: rename storage to persistence
The new naming scheme more clearly communicates to the client of
the raft library that the `persistence` interface implements
persistency layer of the fsm that is powering the raft
protocol itself rather than the client-side workflow and
user-provided `state_machine`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>
2021-01-20 10:23:43 +02:00
Gleb Natapov
248449816b raft: fix snapshot transfer with existing log prefix
Current code that checks when snapshot has to be transferred does not
take into account the case where there can be log entries preceding the
snapshot. Fix the code to correctly test for snapshot transfer
condition.

Message-Id: <20210117095801.GB733394@scylladb.com>
2021-01-20 10:23:43 +02:00
Gleb Natapov
1ab262e86b raft: test: change replication_test to submit one entry at a time
replication_test's state machine is not commutative, so if commands are
applied in different order the states will be different as well. Since
the preemption check was added into co_await in seastar even waiting for
a ready future can preempt which will cause reordering of simultaneously
submitted entries in debug mode. For a long time we tried to keep entries
submission parallel in the test, but with the above seastar change it
is no longer possible to maintain it without changing the state machine
to be commutative. The patch changes the test to submit entries one by
one.
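
Why ordering matters can be seen with a toy state machine. The update rule
below is made up for illustration and is not the one used by
replication_test; only the commutative/non-commutative contrast is the point.

```python
def apply_non_commutative(state, value):
    """Order-sensitive update, similar in spirit to a rolling hash."""
    return (state * 31 + value) % (2 ** 32)


def apply_commutative(state, value):
    """Order-insensitive update: a plain sum."""
    return state + value


def run(apply, entries, initial=0):
    """Apply all entries to the initial state in the given order."""
    state = initial
    for entry in entries:
        state = apply(state, entry)
    return state


in_order = run(apply_non_commutative, [1, 2, 3])
reordered = run(apply_non_commutative, [3, 2, 1])
```

With the non-commutative rule the two runs end in different states, so a
test that compares final states has to submit entries one at a time (or use a
commutative rule such as `apply_commutative`).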

Message-Id: <20210117095147.GA733394@scylladb.com>
2021-01-20 10:23:43 +02:00
Benny Halevy
f29732573a mutation_writer: bucket_writer: add close
bucket_writer::close waits for the _consumer_fut.
It is called both after consume_end_of_stream()
and after abort().

_consumer_fut is expected to return an exception
on the abort path.  Wait for it and drop any exception
so it won't be abandoned as seen in #7904.

With that moved to close() time, consume_end_of_stream
doesn't need to return a future and is made void
all the way in the stack.  This is ok since
queue_reader_handle::push_end_of_stream is synchronous too.

Added a unit test that aborts the reader consumer
during `segregate_by_timestamp`, reproducing the
Exceptional future ignored issue without the fix.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-19 19:03:58 +02:00
Benny Halevy
fc3f9a57ff mutation_writer/feed_writers: refactor bucket/shard writers
Consolidate shard_based_splitting_writer::shard_writer
and timestamp_based_splitting_writer::bucket_writer
common code into mutation_writer::bucket_writer.

This provides a common place to handle consume_end_of_stream()
and abort(), and in particular the handling of the underlying
_consumer_fut.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-19 18:48:01 +02:00
Benny Halevy
a9d91a2d09 mutation_writer: update bucket/shard writers consume_end_of_stream
After 61520a33d6
feed_writers doesn't call consume_end_of_stream
after abort() so no need to test
            if (!_handle.is_terminated()) {

and consume_end_of_stream is now called in then_wrapped
rather than `finally` so it's ok if it throws.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-19 18:44:40 +02:00
Kamil Braun
1a8630e6a7 transport: silence "broken pipe" and "connection reset by peer" errors
The code would already silence broken pipe exceptions since it's
expected when the other side closes the connection or when we shut down the
socket during Scylla shutdown, but the code wouldn't handle the following:
1. "Connection reset by peer" errors: these can also happen in the
   aforementioned two scenarios; the conditions that determine which of
   the two types of errors occur are unclear.
2. The scenarios would sometimes result in a `seastar::nested_exception`,
   mainly during shutdown. The errors could happen once when trying to send
   a response to a request (`_write_buf.write(...)/flush(...)`) and then
   again when trying to close the connection in a `finally` block. These
   nested exceptions were not silenced.

The commit handles each of these cases.
Closes #7907.

Closes #7931
2021-01-19 10:30:17 +02:00
Tomasz Grabiec
94749b01eb Merge "futurize flat_mutation_reader::next_partition" from Benny
The main motivation for this patchset is to prepare
for adding an async close() method to flat_mutation_reader.

In order to close the reader before destroying it
in all paths we need to make next_partition asynchronous
so it can asynchronously close a current reader before
destroying it, e.g. by reassignment of flat_mutation_reader_opt,
as done in scanning_reader::next_partition.

Test: unit(release, debug)

* git@github.com:bhalevy/scylla.git futurize-next-partition-v1:
  flat_mutation_reader: return future from next_partition
  multishard_mutation_query: read_context: save_reader: destroy reader_meta from the calling shard
  mutation_reader: filtering_reader: fill_buffer: futurize inner loop
  flat_mutation_reader::impl: consumer_adapter: futurize handle_result
  flat_mutation_reader: consume_pausable/in_thread: futurize_invoke consumer
  flat_mutation_reader: FlatMutationReaderConsumer: support also async consumer
  flat_mutation_reader:impl: get rid of _consume_done member
2021-01-19 10:19:03 +02:00
Alejo Sanchez
8a61e7defc raft: etcd unit tests: test log replication
etcd TestLogReplication

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
417b18aaad raft: boost test etcd: test fsm can vote from any state
etcd TestVoteFromAnyState

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
5a75c0e06a raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
Log truncation of follower when node re-gains leadership.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
f14c44c686 raft: replication test: add etcd test for cycling leaders
This test cycles 3 nodes as leaders without adding entries.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
f627972186 raft: testing: provide primitives to wait for log propagation
For tests to be able to transition in a consistent state, in some cases
it's necessary to let the followers catch up with the leader.

This prevents occasional hangs in debug mode for incoming tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
948ae813e4 raft: etcd unit tests: initial boost tests
First batch of ported etcd raft unit tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:12 -04:00
Gleb Natapov
6d47a535b9 raft: combine append_request _receive and _send
Combine structs for append request send and receive into a single
struct.

Author:    Gleb Natapov <gleb@scylladb.com>
Date:      Mon Nov 23 14:33:14 2020 +0200
2021-01-18 12:24:13 -04:00
Konstantin Osipov
bf1a031bd6 test: add a CQL test for list append/prepend operations
Test single- and multi- value list append, prepend,
append and prepend in a batch, conditional statements.

This covers the parts of Cassandra which are working as documented
and which we intend to preserve compatibility with.
2021-01-18 17:32:00 +03:00
Jenkins
faf71c6f75 release: prepare for 4.5.dev 2021-01-18 16:05:25 +02:00
Avi Kivity
df3ef800c2 Merge 'Introduce load and stream feature' from Asias He
storage_service: Introduce load_and_stream

=== Introduction ===

This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.

For example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.

This can make restores and migrations much easier.

=== Performance ===

I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB

Assuming 1TB of sstables per node, 40MB/s per shard, and 32 shards per
node, each node can finish loading and streaming its 1TB of data in
about 13 mins.

1TB / (40MB/s per shard * 32 shards) / 60 s per min = 13 mins
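
The same estimate as a small calculation, assuming decimal units (1TB = 10^12
bytes, 1MB = 10^6 bytes) and perfectly linear scaling across shards. The
function name is made up for this sketch.

```python
def load_and_stream_minutes(total_bytes, per_shard_bytes_per_s, shards):
    """Estimated wall-clock minutes for one node to stream its data."""
    node_bytes_per_s = per_shard_bytes_per_s * shards
    return total_bytes / node_bytes_per_s / 60


# 1TB per node, 40MB/s per shard, 32 shards per node.
minutes = load_and_stream_minutes(10**12, 40 * 10**6, 32)
```

This gives roughly 13 minutes, matching the figure above.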

=== Tests ===

backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then uses
load_and_stream to restore to a 2-node cluster.

=== Usage ===

curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"
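
For scripting, the same URL can be assembled programmatically. The helper
below only builds the URL from the template above and sends nothing;
`load_and_stream_url` is a hypothetical name.

```python
from urllib.parse import quote, urlencode


def load_and_stream_url(ip, keyspace, table, port=10000):
    """Build the REST URL that triggers load and stream for one table."""
    query = urlencode({"cf": table, "load_and_stream": "true"})
    path = f"/storage_service/sstables/{quote(keyspace, safe='')}"
    return f"http://{ip}:{port}{path}?{query}"


url = load_and_stream_url("127.0.0.1", "ks", "t")
```

The resulting URL has to be requested with the POST method, as in the curl
command.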

=== Notes ===

Btw, with the old nodetool refresh, the node will not pick up the data
that does not belong to this node but it will not delete it either. One
has to run nodetool cleanup to remove that data manually, which is a
surprise to me and probably to users as well. With load and stream, the
process will delete the sstables once it finishes streaming, so no nodetool
cleanup is needed.

The name of this feature, load and stream, follows load and store in the CPU world.

Fixes #7831

Closes #7846

* github.com:scylladb/scylla:
  storage_service: Introduce load_and_stream
  distributed_loader: Add get_sstables_from_upload_dir
  table: Add make_streaming_reader for given sstables set
2021-01-18 15:08:19 +02:00
Avi Kivity
60f5ec3644 Merge 'managed_bytes: switch to explicit linearization' from Michał Chojnowski
This is a revival of #7490.

Quoting #7490:

The managed_bytes class now uses implicit linearization: outside LSA, data is never fragmented, and within LSA, data is linearized on-demand, as long as the code is running within with_linearized_managed_bytes() scope.

We would like to stop linearizing managed_bytes and keep it fragmented at all times, since linearization can require large contiguous chunks. Large contiguous allocations are hard to satisfy and cause latency spikes.

As a first step towards that, we remove all implicitly linearizing accessors and replace them with an explicit linearization accessor, with_linearized().

Some of the linearization happens long before use, by creating a bytes_view of the managed_bytes object and passing it onwards, perhaps storing it for later use. This does not work with with_linearized(), which creates a temporary linearized view, and does not work towards the longer term goal of never linearizing. As a substitute a managed_bytes_view class is introduced that acts as a view for managed_bytes (for interoperability it can also be a view for bytes and is compatible with bytes_view).

By the end of the series, all linearizations are temporary, within the scope of a with_linearized() call and can be converted to fragmented consumption of the data at leisure.

This has limited practical value directly, as current uses of managed_bytes are limited to keys (which are limited to 64k). However, it enables converting the atomic_cell layer back to managed_bytes (so we can remove IMR) and the CQL layer to managed_bytes/managed_bytes_view, removing contiguous allocations from the coordinator.

Closes #7820

* github.com:scylladb/scylla:
  test: add hashers_test
  memtable: fix accounting of managed_bytes in partition_snapshot_accounter
  test: add managed_bytes_test
  utils: fragment_range: add a fragment iterator for FragmentedView
  keys: update comments after changes and remove an unused method
  mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator
  row_cache: more indentation fixes
  utils: remove unused linearization facilities in `managed_bytes` class
  misc: fix indentation
  treewide: remove remaining `with_linearized_managed_bytes` uses
  memtable, row_cache: remove `with_linearized_managed_bytes` uses
  utils: managed_bytes: remove linearizing accessors
  keys, compound: switch from bytes_view to managed_bytes_view
  sstables: writer: add write_* helpers for managed_bytes_view
  compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view
  types: change equal() to accept managed_bytes_view
  types: add parallel interfaces for managed_bytes_view
  types: add to_managed_bytes(const sstring&)
  serializer_impl: handle managed_bytes without linearizing
  utils: managed_bytes: add managed_bytes_view::operator[]
  utils: managed_bytes: introduce managed_bytes_view
  utils: fragment_range: add serialization helpers for FragmentedMutableView
  bytes: implement std::hash using appending_hash
  utils: mutable_view: add substr()
  utils: fragment_range: add compare_unsigned
  utils: managed_bytes: make the constructors from bytes and bytes_view explicit
  utils: managed_bytes: introduce with_linearized()
  utils: managed_bytes: constrain with_linearized_managed_bytes()
  utils: managed_bytes: avoid internal uses of managed_bytes::data()
  utils: managed_bytes: extract do_linearize_pure()
  thrift: do not depend on implicit conversion of keys to bytes_view
  clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view
  cql3: expression: linearize get_value_from_mutation() eariler
  bytes: add to_bytes(bytes)
  cql3: expression: mark do_get_value() as static
2021-01-18 11:01:28 +02:00
Asias He
4d32d03172 storage_service: Introduce load_and_stream
=== Introduction ===

This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.

For example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.

This can make restores and migrations much easier.

=== Performance ===

I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB

Assuming 1TB of sstables per node, 40MB/s per shard, and 32 shards per
node, load and stream can process the 1TB of data in about 13 mins on each
node.

1TB / (40MB/s per shard * 32 shards) / 60 s per min = 13 mins

=== Tests ===

backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test
which creates a cluster with 4 nodes and inserts data, then uses
load_and_stream to restore to a 2-node cluster.

=== Usage ===

curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"

=== Notes ===

Btw, with the old nodetool refresh, the node will not pick up data
that does not belong to it, but it will not delete that data either. One
has to run nodetool cleanup to remove it manually, which is a
surprise to me and probably to users as well. With load and stream, the
process deletes the sstables once it finishes streaming, so no nodetool
cleanup is needed.

The name of this feature, load and stream, follows load and store in the CPU world.

Fixes #7831
2021-01-18 16:32:33 +08:00
Avi Kivity
ab44464911 Revert "docker: remove sshd from the image"
This reverts commit 32fd38f349. Some
tests (in scylla-cluster-tests) depend on it.
2021-01-17 14:34:40 +02:00
Raphael S. Carvalho
00c29e1e24 table: Move notify_bootstrap_or_replace_*() out of line
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210117045747.69891-9-raphaelsc@scylladb.com>
2021-01-17 10:36:13 +02:00
Asias He
28007f13f8 distributed_loader: Add get_sstables_from_upload_dir
This function scans sstables under the upload directory and return a list of
sstables for each shard.

Refs #7831
2021-01-16 20:03:17 +08:00
Michał Chojnowski
5b72fb65ae test: add hashers_test
This test is a sanity check. It verifies that our wrappers over well known
hashes (xxhash, md5, sha256) actually calculate exactly those hashes.
It also checks that the `update()` methods of used hashers are linear with
respect to concatenation: that is, `update(a + b)` must be equivalent to
`update(a); update(b)`. This wasn't relied on before, but now we need to
confirm that hashing fragmented keys without linearizing them won't break
backward compatibility.
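
The concatenation property described above can be illustrated with Python's hashlib. This is a sketch of the property itself, not of the C++ test; the helper names are made up for the example.

```python
import hashlib


def hash_whole(name: str, data: bytes) -> bytes:
    """Hash the data in one update() call, as if it were linearized."""
    h = hashlib.new(name)
    h.update(data)
    return h.digest()


def hash_fragmented(name: str, fragments: list) -> bytes:
    """Hash the same data fragment by fragment, without concatenating it first."""
    h = hashlib.new(name)
    for fragment in fragments:
        h.update(fragment)
    return h.digest()


fragments = [b"partition", b"-", b"key"]
for name in ("md5", "sha256"):
    same = hash_whole(name, b"".join(fragments)) == hash_fragmented(name, fragments)
    print(name, "update(a + b) == update(a); update(b):", same)
```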
2021-01-15 18:28:24 +01:00
Michał Chojnowski
85048b349b memtable: fix accounting of managed_bytes in partition_snapshot_accounter
managed_bytes has a small overhead per each fragment. Due to that, managed_bytes
containing the same data can have different total memory usage in different
allocators. The smaller the preferred max allocation size setting is, the more
fragments are needed and the greater total per-fragment overhead is.
In particular, managed_bytes allocated in the LSA could grow in
memory usage when copied to the standard allocator, if the standard allocator
had a preferred max allocation setting smaller than the LSA.

partition_snapshot_accounter calculates the amount of memory used by
mutation fragments in the memtable (where they are allocated with LSA) based
on the memory usage after they are copied to the standard allocator.
This could result in an overestimation, as explained above.
But partition_snapshot_accounter must not overestimate the amount of freed
memory, as doing otherwise might result in OOM situations.

This patch prevents the overaccounting by adding minimal_external_memory_usage():
a new version of external_memory_usage(), which ignores allocator-dependent
overhead. In particular, it includes the per-fragment overhead in managed_bytes
only once, no matter how many fragments there are.
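
A toy model of the accounting described above, as a rough sketch. The 16-byte per-fragment overhead and the two fragment size limits are invented for illustration; only the idea of charging the per-fragment overhead once comes from the commit message.

```python
import math

FRAGMENT_OVERHEAD = 16  # hypothetical per-fragment bookkeeping cost, in bytes


def external_memory_usage(size: int, max_fragment_size: int) -> int:
    """Allocator-dependent usage: payload plus overhead for every fragment."""
    fragments = max(1, math.ceil(size / max_fragment_size))
    return size + fragments * FRAGMENT_OVERHEAD


def minimal_external_memory_usage(size: int) -> int:
    """Allocator-independent lower bound: the overhead is counted only once."""
    return size + FRAGMENT_OVERHEAD


size = 100_000
in_lsa = external_memory_usage(size, 128 * 1024)      # one large fragment
in_standard = external_memory_usage(size, 8 * 1024)   # many small fragments
print(in_lsa, in_standard, minimal_external_memory_usage(size))
```

Because the minimal figure never exceeds either allocator-specific figure, using it for freed-memory accounting cannot overestimate.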
2021-01-15 18:21:13 +01:00
Michał Chojnowski
d31771c0b2 test: add managed_bytes_test 2021-01-15 18:21:13 +01:00
Michał Chojnowski
72ecbd6936 utils: fragment_range: add a fragment iterator for FragmentedView
A stylistic change. Iterators are the idiomatic way to iterate in C++.
2021-01-15 14:05:44 +01:00
Michał Chojnowski
2e38647a95 keys: update comments after changes and remove an unused method
The comments were outdated after the latest changes (bytes_view vs
managed_bytes_view).
compound_view_wrapper::get_component() is unused, so we remove it.
2021-01-15 14:05:44 +01:00
Piotr Sarna
6ae94d31c1 treewide: remove shared pointer usage from the pager
The pager interface doesn't really need to be virtual,
so the next step could be to remove the need for pointers
entirely, but migrating from shared_ptr to unique_ptr
is a low-hanging fruit.

Message-Id: <a5bdecb17ae58e914da020fb58a41f4574565c66.1610709560.git.sarna@scylladb.com>
2021-01-15 15:03:14 +02:00
Avi Kivity
f20736d93d Merge 'Support unofficial distributions' from Takuya ASADA
Since we introduced the relocatable package and the offline installer, the scylla binary itself can run on almost any distribution.
However, the setup scripts are not designed to run on unsupported distributions, so they fail with errors in such environments.
This PR adds minimal support for running the offline installation on unsupported distributions; it was tested on SLES, Arch Linux and Gentoo.

Closes #7858

* github.com:scylladb/scylla:
  dist: use sysconfig_parser to parse gentoo config file
  dist: add package name translation
  dist: support SLES/OpenSUSE
  install.sh: add systemd existance check
  install.sh: ignore error missing sysctl entries
  dist: show warning on unsupported distributions
  dist: drop Ubuntu 14.04 code
  dist: move back is_amzn2() to scylla_util.py
  dist: rename is_gentoo_variant() to is_gentoo()
  dist: support Arch Linux
  dist: make sysconfig directory detectable
2021-01-14 16:59:49 +02:00
Raphael S. Carvalho
97e076365e Fix stalls on Memtable flush by preempting across fragment generation if needed
Flush is facing stalls because partition_snapshot_flat_reader::fill_buffer()
generates mutation fragments until the buffer is full[1] without yielding.

this is the code path:
    flush_reader::fill_buffer()      <---------|
        flat_mutation_reader::consume_pausable()      <--------|
                partition_snapshot_flat_reader::fill_buffer() -|

[1]: https://github.com/scylladb/scylla/blob/6cfc949e/partition_snapshot_reader.hh#L261

This is fixed by breaking the loop in do_fill_buffer() if preemption is needed,
allowing do_until() to yield in sequence, and when it resumes, continue from
where it left off, until buffer is full.
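
The shape of the fix can be sketched in Python. This is an illustration of the pattern only, not the Seastar code: fill_buffer, fill_until_done and the EveryN preemption check are hypothetical stand-ins for do_fill_buffer(), do_until() and the real preemption check.

```python
from typing import Callable, Iterator, List, Tuple


def fill_buffer(source: Iterator[int], buffer: List[int], capacity: int,
                need_preempt: Callable[[], bool]) -> bool:
    """Return True when the buffer is full or the source is exhausted,
    False when the loop was broken early so the caller can yield."""
    while len(buffer) < capacity:
        try:
            buffer.append(next(source))
        except StopIteration:
            return True
        if need_preempt():
            return len(buffer) >= capacity
    return True


def fill_until_done(source: Iterator[int], capacity: int,
                    need_preempt: Callable[[], bool]) -> Tuple[List[int], int]:
    """Keep calling fill_buffer, counting how often it paused to yield."""
    buffer: List[int] = []
    yields = 0
    while not fill_buffer(source, buffer, capacity, need_preempt):
        yields += 1  # a real scheduler would run other tasks here
    return buffer, yields


class EveryN:
    """Toy preemption check that fires on every n-th call."""
    def __init__(self, n: int) -> None:
        self.n, self.calls = n, 0

    def __call__(self) -> bool:
        self.calls += 1
        return self.calls % self.n == 0


buffer, yields = fill_until_done(iter(range(10)), 6, EveryN(2))
print(buffer, yields)
```

Each resumed call continues from where the previous one stopped, because the source iterator and the buffer keep their state across calls.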

Fixes #7885.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210114141417.285175-1-raphaelsc@scylladb.com>
2021-01-14 16:30:55 +02:00
Ivan Prisyazhnyy
32fd38f349 docker: remove sshd from the image
implicit revert of 6322293263

sshd previously was used by the scylla manager 1.0.
new version does not need it. there is no point of
having it currently. it also confuses everyone.

Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>

Closes #7921
2021-01-14 12:52:24 +02:00
Pavel Emelyanov
2b31be0daa client-state,cdc: Remove call for storage_service from permissions check
The client_state::check_access() calls for global storage service
to get the features from it and check if the CDC feature is on.
The latter is needed to perform CDC-specific checks.

However it was noticed, that the check for the feature is excessive
as all the guarded if-s will resolve to false in case CDC is off
and the check_access will effectively work as it would with the
feature check.

With that observation, it's possible to ditch one more global storage
service reference.

tests: unit(dev), dtest(dev, auth)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210105063651.7081-1-xemul@scylladb.com>
2021-01-14 12:52:24 +02:00
Benny Halevy
29002e3b48 flat_mutation_reader: return future from next_partition
To allow it to asynchronously close underlying readers
on next_partition().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
ff931c2ecc multishard_mutation_query: read_context: save_reader: destroy reader_meta from the calling shard
The reader_meta in _readers[shard] is created on shard 0 and must
be destroyed on it as well.

A following patch changes next_partition() to return a future<>
thus it introduces a continuation that requires access to `rm`.

We cannot move it down to the continuation safely, since it will be
wrongly destroyed in the invoked shard, so use do_with to hold it
in the scope of the calling shard until the invoked function
completes.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
75c0c05f71 mutation_reader: filtering_reader: fill_buffer: futurize inner loop
Prepare for futurizing next_partition().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
cd4d082e51 flat_mutation_reader::impl: consumer_adapter: futurize handle_result
Prepare for futurizing next_partition.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
d8ae6d7591 flat_mutation_reader: consume_pausable/in_thread: futurize_invoke consumer
To support both sync and async consumers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
fdb3c59e35 flat_mutation_reader: FlatMutationReaderConsumer: support also async consumer
So that consumer_adapter and other consumers in the future
may return a future from consumer().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Benny Halevy
515bed90bb flat_mutation_reader:impl: get rid of _consume_done member
It is only used in consume_pausable, that can easily do
without it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Pavel Emelyanov
d3ee8774ad storage-service: Subscribe to snitch to update topology
Currently snitch explicitly calls storage service (if
it's initialized) to update topology on snitch data
change.

Instead of it -- make storage service subscribe on the
snitch reconfigure signal upon creation.

This finally makes snitch fully independent from storage
service.

In tests the snitch instance is not created, so check
for it before subscribing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
d1a2d0f894 snitch: Introduce reconfiguration signal
Add a notifier to snitch_base that gets triggered when the
snitch configuration changes to which others may subscribe.

For now only the gossiping-file-snitch triggers it when it
re-reads its config file. Other existing snitches are kinda
static in this sense.

The subscribe-trigger engine is based on scoped connection
from boost::signals2.
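
For readers unfamiliar with the pattern, here is a minimal Python sketch of a signal with a scoped connection. It only mirrors the idea behind boost::signals2's scoped connection (the subscription ends when the connection object goes out of scope); the class and variable names are invented for the example.

```python
from typing import Callable, Dict


class Signal:
    """A tiny notifier: subscribers are called whenever the signal is triggered."""

    def __init__(self) -> None:
        self._slots: Dict[int, Callable[[], None]] = {}
        self._next_id = 0

    def connect(self, slot: Callable[[], None]) -> "ScopedConnection":
        slot_id = self._next_id
        self._next_id += 1
        self._slots[slot_id] = slot
        return ScopedConnection(self, slot_id)

    def disconnect(self, slot_id: int) -> None:
        self._slots.pop(slot_id, None)

    def __call__(self) -> None:
        for slot in list(self._slots.values()):
            slot()


class ScopedConnection:
    """Disconnects the subscriber automatically when the scope ends."""

    def __init__(self, signal: Signal, slot_id: int) -> None:
        self._signal, self._slot_id = signal, slot_id

    def __enter__(self) -> "ScopedConnection":
        return self

    def __exit__(self, *exc) -> None:
        self._signal.disconnect(self._slot_id)


snitch_reconfigured = Signal()
topology_updates = []

with snitch_reconfigured.connect(lambda: topology_updates.append("update topology")):
    snitch_reconfigured()  # the subscriber runs
snitch_reconfigured()      # the subscriber is already disconnected
print(topology_updates)
```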

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
ca336409d7 snitch: Always gossip snitch info itself
The gossiping_property_file_snitch updates the gossip RACK and DC
values upon config change. Right now this is done with the help
of storage service, but the needed code to gossip rack and dc is
already available in the snitch itself.

That said, gossip snitch info with the snitch helper and remove the
storage_service's one. This is the 2nd step in decoupling snitch
and storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
99e71bd1f6 snitch: Do gossip DC and RACK itself
This is the 2nd step in generalizing the snitch data gossiping
and at the same time the 1st step in decoupling storage service and
snitch.

During start storage service starts gossiper, which notifies the
snitch with .gossiper_starting() call, then the storage service
calls gossip_snitch_info.

This patch makes snitch itself do the last step.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
bc1a3a358d snitch: Add generic gossiping helper
Nowadays some snitch implementations gossip the INTERNAL_IP
value and storage_service gossip RACK and DC for all of them.

This functionality is going to be generalized and the first
step is in making a common method for a snitch to gossip its
data.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Takuya ASADA
7a74f8cd2e dist: use sysconfig_parser to parse gentoo config file
Use sysconfig_parser instead of regex, to improve code readability.
2021-01-13 21:34:23 +09:00
Takuya ASADA
2a4d293841 dist: add package name translation
Translate package name from CentOS package to different distribution
package name, to use single package name for pkg_install().
2021-01-13 21:27:14 +09:00
Takuya ASADA
0a9843842d dist: support SLES/OpenSUSE
Add support for SLES/OpenSUSE to the setup scripts.
2021-01-13 19:32:46 +09:00
Takuya ASADA
a34edf8169 install.sh: add systemd existance check
The offline installer can be run on non-systemd distributions, but it won't
work there since we only have systemd units.
So check for systemd's existence and print an error message.
2021-01-13 19:32:45 +09:00
Takuya ASADA
b8c35772b3 install.sh: ignore error missing sysctl entries
Some kernels may not have the specified sysctl parameter, so we should
ignore the error.
2021-01-13 19:32:45 +09:00
Takuya ASADA
e8f74e800c dist: show warning on unsupported distributions
Add warning message on unsupported distributions, for
scylla_cpuscaling_setup and scylla_ntp_setup.
2021-01-13 19:32:45 +09:00
Takuya ASADA
2f344cf50d dist: drop Ubuntu 14.04 code
We don't support Ubuntu 14.04 anymore, so drop the code for it.
2021-01-13 19:32:45 +09:00
Takuya ASADA
8e59f70080 dist: move back is_amzn2() to scylla_util.py
Distribution detection functions should be kept in the same place,
so move it back to scylla_util.py.
2021-01-13 19:32:45 +09:00
Takuya ASADA
921b1676c0 dist: rename is_gentoo_variant() to is_gentoo()
is_redhat_variant() is the function to detect RHEL/CentOS/Fedora/OEL,
and is_debian_variant() is the function to detect Debian/Ubuntu.
Unlike these functions, is_gentoo_variant() does not detect "Gentoo variants",
we should rename it to is_gentoo().
2021-01-13 19:32:45 +09:00
Takuya ASADA
fffa8f5ded dist: support Arch Linux
Add support for Arch Linux to the setup scripts.
2021-01-13 19:32:45 +09:00
Takuya ASADA
0d11f9463d dist: make sysconfig directory detectable
Currently, install.sh provides a way to customize the sysconfig directory,
but the sysconfig directory is hardcoded in the scripts.
Also, /etc/sysconfig seems to be the correct default value, but the current
code specifies /etc/default for non-Red Hat distributions.

Instead of hardcoding, generate a Python script in install.sh
that saves the specified sysconfig directory path in Python code.
2021-01-13 19:32:45 +09:00
Wojciech Mitros
93613e20a3 api: remove potential large allocation in /column_family/ GET request handler
The reply to a /column_family/ GET request contains info about all
column families. Currently, all this info is stored in a single
string when replying, and this string may require a big allocation
when there are many column families.
To avoid that allocation, instead of a single string, use a
body_writer function, which writes chunks of the message content
to the output stream.
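
The idea of writing the reply in chunks rather than building one large string can be sketched in Python. This illustrates the approach only; it is not the Seastar body_writer API, and the function name and sample data are made up.

```python
import io
import json
from typing import Iterable, TextIO


def write_json_array(items: Iterable[dict], out: TextIO) -> None:
    """Write a JSON array to out one element at a time,
    so no single string ever holds the whole reply."""
    out.write("[")
    for index, item in enumerate(items):
        if index:
            out.write(",")
        out.write(json.dumps(item))
    out.write("]")


# Hypothetical column family entries; StringIO stands in for the output stream.
column_families = [{"ks": "ks1", "cf": "t1"}, {"ks": "ks1", "cf": "t2"}]
stream = io.StringIO()
write_json_array(column_families, stream)
print(stream.getvalue())
```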

Fixes #7916

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>

Closes #7917
2021-01-13 12:04:18 +02:00
Avi Kivity
ed53b3347e Merge 'idl: remove the large allocation in mutation_partition_view::rows()' from Wojciech Mitros
After these changes the generated code deserializes the stream into a chunked vector, instead of a contiguous one, so even if there are many fields in it, there won't be any big allocations.

I haven't run the scylla cluster test with it yet but it passes the unit tests.

Closes #7919

* github.com:scylladb/scylla:
  idl: change the type of mutation_partition_view::rows() to a chunked_vector
  idl-compiler: allow fields of type utils::chunked_vector
2021-01-13 11:07:29 +02:00
Nadav Har'El
711b311d47 cql-pytest: tests for fromJson() integer overflow
Numbers in JSON are not limited in range, so when the fromJson() function
converts a number to a limited-range integer column in Scylla, this
conversion can overflow. The following tests check that this conversion
should result in an error (FunctionFailure), not silent truncation.

Scylla today does silently wrap around the number, so these tests
xfail. They pass on Cassandra.
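
The difference between the two behaviours can be shown with a small Python sketch for a 32-bit int column. The helper names and the local FunctionFailure class are illustrative stand-ins, not the server or driver code.

```python
import json

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


class FunctionFailure(Exception):
    """Stand-in for the error the tests expect."""


def wrap_to_int32(value: int) -> int:
    """Silent wrap-around, the behaviour the tests reject."""
    return (value + 2**31) % 2**32 - 2**31


def checked_int32(value: int) -> int:
    """Range-checked conversion, the behaviour the tests expect."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise FunctionFailure(f"{value} is out of range for a 32-bit int")
    return value


number = json.loads("2147483648")  # one past INT32_MAX; JSON itself has no range limit
print(wrap_to_int32(number))
try:
    checked_int32(number)
except FunctionFailure as error:
    print("FunctionFailure:", error)
```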

Refs #7914.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112151041.3940361-1-nyh@scylladb.com>
2021-01-13 11:07:29 +02:00
Nadav Har'El
617e1be1b6 cql-pytest: expand tests for fromJson() failures
This patch adds more (failing) tests for issue #7911, where fromJson()
failures should be reported as a clean FunctionFailure error, not an
internal server error.

The previous tests we had were about JSON parse failures, but a
different type of error we should support is valid JSON which returned
the wrong type - e.g., the JSON returning a string when an integer
was expected, or the JSON returning a string with non-ASCII characters
when ASCII was expected. So this patch adds more such tests. All of
them xfail on Scylla, and pass on Cassandra.

Refs #7911.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112122211.3932201-1-nyh@scylladb.com>
2021-01-13 11:07:29 +02:00
Nadav Har'El
2ebe8055ee cql-pytest: add test for fromJson() null parameter.
This patch adds a reproducer test for issue #7912, which is about passing
a null parameter to the fromJson() function. This is supposed to be legal (and
return a null value), and is legal in Cassandra, but isn't allowed in
Scylla.

There are two tests - for a prepared and unprepared statement - which
fail in different ways. The issue is still open so the tests xfail on
Scylla - and pass on Cassandra.

Refs #7912.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112114254.3927671-1-nyh@scylladb.com>
2021-01-13 11:07:29 +02:00
dgarcia360
78e9f45214 docs: update url
Related issue scylladb/sphinx-scylladb-theme#88

Once this commit is merged, the docs will be published under the new domain name https://scylla.docs.scylladb.com

Frequently asked questions:

    Should we change the links in the README/docs folder?

GitHub automatically handles the redirections. For example, https://scylladb.github.io/sphinx-scylladb-theme/stable/examples/index.html redirects to https://sphinx-theme.scylladb.com/stable/examples/index.html
Nevertheless, it would be great to change URLs progressively to avoid the 301 redirections.

    Do I need to add this new domain in the custom dns domain section on GitHub settings?

It is not necessary. We have already edited the DNS for this domain and the theme creates programmatically the required CNAME file. If everything goes well, GitHub should detect the new URL after this PR is merged.

    The DNS doesn't seem to have the right SSL certificates

GitHub handles the certificate provisioning but is not aware of the subdomain for this repo yet. make multi-version will create a new file "CNAME". This is published in gh-pages branch, therefore GitHub should create the missing cert.

Closes #7877
2021-01-13 11:07:29 +02:00
Avi Kivity
d508a63d4b row_cache: linearize key in cache_entry::do_read()
do_read() does not linearize cache_entry::_key; this can cause a crash
with keys larger than 13k.

Fixes #7897.

Closes #7898
2021-01-13 11:07:29 +02:00
dgarcia360
36f8d35812 docs: added multiversion_regex_builder
Fixed makefile

Added path

Closes #7876
2021-01-13 11:07:29 +02:00
Benny Halevy
5e41228fe8 test: everywhere: use seastar::testing::local_random_engine
Use the thread_local seastar::testing::local_random_engine
in all seastar tests so they can be reproduced using
the --random-seed option.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210112103713.578301-2-bhalevy@scylladb.com>
2021-01-13 11:07:29 +02:00
Benny Halevy
43ab094c88 configure: add utf8_test to pure_boost_tests
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210112103713.578301-1-bhalevy@scylladb.com>
2021-01-13 11:07:29 +02:00
Dejan Mircevski
d79c2cab63 cql3: Use correct comparator in timeuuid min/max
The min/max aggregators use aggregate_type_for comparators, and the
aggregate_type_for<timeuuid> is regular uuid.  But that yields wrong
results; timeuuids should be compared as timestamps.

Fix it by changing aggregate_type_for<timeuuid> from uuid to timeuuid,
so aggregators can distinguish between the two.  Then specialize the
aggregation utilities for timeuuid.
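
Why the plain uuid comparison gives wrong results can be seen with Python's uuid module. This is a simplified sketch: it compares only the embedded timestamps and ignores any tie-breaking the real comparator does; make_v1_uuid is a helper invented for the example.

```python
import uuid


def make_v1_uuid(timestamp: int) -> uuid.UUID:
    """Build a version-1 UUID from a 60-bit timestamp (clock sequence and node set to 0)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0, 0))


older = make_v1_uuid(0xFFFFFFFF)
newer = make_v1_uuid(0x100000000)

# The low 32 bits of the timestamp come first in the UUID layout,
# so comparing raw values does not follow timestamp order.
naive_max = max(older, newer)
timestamp_max = max(older, newer, key=lambda u: u.time)

print("naive max:    ", naive_max)
print("timestamp max:", timestamp_max)
```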

Add a cql-pytest and change some unit tests, which relied on naive
uuid comparators.

Fixes #7729.

Tests: unit (dev, debug)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #7910
2021-01-13 11:07:29 +02:00
Avi Kivity
96d64b7a1f Merge "Wire interposer consumer for memtable flush" from Raphael
"
Without interposer consumer on flush, it could happen that a new sstable,
produced by memtable flush, will not conform to the strategy invariant.
For example, with TWCS, this new sstable could span multiple time windows,
making it hard for the strategy to purge expired data. If interposer is
enabled, the data will be correctly segregated into different sstables,
each one spanning a single window.
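
The segregation step can be pictured with a short Python sketch. It assumes fixed-size windows keyed by their start time and uses made-up sample rows; the real interposer works on mutation streams, not lists.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (write timestamp in seconds, payload)


def segregate_by_window(rows: List[Row], window_seconds: int) -> Dict[int, List[Row]]:
    """Group rows by time window, so each output 'sstable' spans one window."""
    buckets: Dict[int, List[Row]] = defaultdict(list)
    for timestamp, payload in rows:
        window_start = timestamp // window_seconds * window_seconds
        buckets[window_start].append((timestamp, payload))
    return dict(buckets)


# A memtable spanning three one-hour windows ends up as three outputs.
memtable = [(10, "a"), (3599, "b"), (3600, "c"), (7300, "d")]
outputs = segregate_by_window(memtable, 3600)
print(outputs)
```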

Fixes #4617.

tests:
    - mode(dev).
    - manually tested it by forcing a flush of memtable spanning many windows
"

* 'segregation_on_flush_v2' of github.com:raphaelsc/scylla:
  test: Add test for TWCS interposer on memtable flush
  table: Wire interposer consumer for memtable flush
  table: Add write_memtable_to_sstable variant which accepts flat_mutation_reader
  table: Allow sstable write permit to be shared across monitors
  memtable: Track min timestamp
  table: Extend cache update to operate a memtable split into multiple sstables
2021-01-13 11:07:29 +02:00
Nadav Har'El
8164c52871 cql-pytest: add test for fromJson() parse error
This patch adds a reproducer test for issue #7911, which is about a parse
error in JSON string passed to the fromJson() function causing an
internal error instead of the expected FunctionFailure error.

The issue is still open so the test xfails on Scylla (and passes on
Cassandra).

Refs #7911.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112094629.3920472-1-nyh@scylladb.com>
2021-01-13 11:07:29 +02:00
Pavel Solodovnikov
10e3da692f lwt: validate paxos_grace_seconds table option
The option can only take integer values >= 0, since negative
TTL is meaningless and is expected to fail the query when used
with `USING TTL` clause.

It's better to fail early on `CREATE TABLE` and `ALTER TABLE`
statement with a descriptive message rather than catch the
error during the first lwt `INSERT` or `UPDATE` while trying
to insert to system.paxos table with the desired TTL.
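
The fail-early check could look roughly like this in Python. It is a sketch of the validation rule only (integer, >= 0); the function name, the dict-based options, and the error messages are assumptions, not the actual C++ code.

```python
def validate_paxos_grace_seconds(options: dict) -> None:
    """Reject a bad paxos_grace_seconds at CREATE/ALTER TABLE time."""
    raw = options.get("paxos_grace_seconds")
    if raw is None:
        return  # option not set; nothing to validate
    try:
        value = int(raw)
    except (TypeError, ValueError):
        raise ValueError(f"paxos_grace_seconds must be an integer, got {raw!r}")
    if value < 0:
        raise ValueError(f"paxos_grace_seconds must be >= 0, got {value}")


validate_paxos_grace_seconds({"paxos_grace_seconds": "864000"})  # accepted
try:
    validate_paxos_grace_seconds({"paxos_grace_seconds": "-5"})
except ValueError as error:
    print(error)
```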

Tests: unit(dev)
Fixes: #7906

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210111202942.69778-1-pa.solodovnikov@scylladb.com>
2021-01-13 11:07:29 +02:00
Gleb Natapov
51bf5f5846 raft: test: do not check snapshot during backpressure test
Unfortunately snapshot checking still does not work in the presence of
log entry reordering. It is impossible to know exactly when the
snapshot will be taken, and if it is taken before all entries with an
index smaller than the snapshot idx are applied, the check will fail,
since it assumes they have been applied.

This patch disables snapshot checking for the SUM state machine that is used
in the backpressure test.

Message-Id: <20201126122349.GE1655743@scylladb.com>
2021-01-13 11:07:29 +02:00
Wojciech Mitros
59769efd3b idl: change the type of mutation_partition_view::rows() to a chunked_vector
The value of mutation_partition_view::rows() may be very large, but is
used almost exclusively for iteration, so in order to avoid a big allocation
for an std::vector, we change its type to a utils::chunked_vector.

Fixes #7918

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-01-13 04:25:53 +01:00
Wojciech Mitros
88e750f379 idl-compiler: allow fields of type utils::chunked_vector
The utils::chunked_vector has practically the same methods
as a std::vector, so the same code can be generated for it.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-01-13 04:09:18 +01:00
Avi Kivity
ccd09f1398 Update seastar submodule
* seastar 6b36e84c3...a287bb1a3 (1):
  > merge: file: correct dma alignment for odd filesystems

Ref #7794.
2021-01-11 20:38:59 +02:00
Tomasz Grabiec
6cfc949e62 Merge "sstables: validate the writer's input with the mutation fragment stream validator" from Botond
We have recently seen a suspected corrupt mutation fragment stream get
into an sstable undetected, causing permanent corruption. One of the
suspected ways this could happen is the compaction sstable write path not
being covered with a validator. To prevent events like this in the future
make sure all sstable write paths are validated by embedding the validator
right into the sstable writer itself.

Refs: #7623
Refs: #7640

Tests: unit(release)

* https://github.com/denesb/scylla.git sstable-writer-fragment-stream-validation/v2:
  sstable_writer: add validation
  test/boost/sstable_datafile_test: sstable_scrub_test: disable key validation
  mutation_fragment_stream_validator: make it easier to validate concrete fragment types
  flat_mutation_reader: extract fragment stream validator into its own header
2021-01-11 14:57:48 +01:00
Calle Wilund
4be718ebfa commitlog: Force earlier cycle/flush iff segment reserve is empty
Attempt to hurry flushing/segment delete/recycle if we are trying
to get a segment for allocation and the reserve is empty while above the
disk threshold. This is to minimize time spent waiting on the allocation semaphore.
2021-01-11 12:45:36 +00:00
Calle Wilund
be8c359a62 commitlog: Make segment allocation wait iff disk usage > max
Instead of allowing new segments to be added, explicitly wait
for either disk delete or recycle to happen iff current disk
usage is larger than limit.
2021-01-11 12:45:36 +00:00
Calle Wilund
ab55a1b4e6 commitlog: Do partial (memtable) flushing based on threshold
Instead of asking to flush data for all segments, just request
up to an RP where we get comfortably below disk usage threshold.
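
One way to picture the partial flush is the toy model below. It is an assumption-laden sketch: segments are reduced to a list of sizes ordered oldest first, the replay position (RP) is modelled as a count of oldest segments, and the "comfortable" margin is left out.

```python
from typing import List


def segments_to_flush(segment_sizes: List[int], threshold: int) -> int:
    """Return how many of the oldest segments must be flushed
    so that the remaining disk usage is at or below the threshold."""
    usage = sum(segment_sizes)
    count = 0
    for size in segment_sizes:
        if usage <= threshold:
            break
        usage -= size
        count += 1
    return count


# Four 32MB segments with a 70MB threshold: flushing the two oldest is enough.
print(segments_to_flush([32, 32, 32, 32], 70))
```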
2021-01-11 12:45:10 +00:00
Pekka Enberg
42806c6f40 Update seastar submodule
* seastar ed345cdb...6b36e84c (3):
  > perftune.py: Don't print nic driver name to avoid
Fixes #7905
  > io_tester: Make file sizes configurable
  > io_queue: Limit tickets for oversized requests
2021-01-11 14:12:06 +02:00
Pavel Solodovnikov
0981b786a8 db/query_options: specify serial consistency for DEFAULT specific_options
Cassandra constructs `QueryOptions.SpecificOptions` in the same
way that we do (by not providing `serial_consistency`), but they
do have a user-defined constructor which does the following thing:

	this.serialConsistency = serialConsistency == null ? ConsistencyLevel.SERIAL : serialConsistency;

This effectively means that DEFAULT `SpecificOptions` always
have `SerialConsistency` set to `SERIAL`, while we leave this
`std::nullopt`, since we don't have a constructor for
`specific_options` which does this.

Supply `db::consistency_level::SERIAL` explicitly to the
`specific_options::DEFAULT` value.

Tests: unit(dev)
Fixes: #7850

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201231104018.362270-1-pa.solodovnikov@scylladb.com>
2021-01-11 12:12:29 +02:00
Nadav Har'El
a3f9bd9c3f cql-pytest: add xfailing reproducer for issue #7888
This adds a simple reproducer for a bug involving a CONTAINS relation on
frozen collection clustering columns when the query is restricted to a
single partition - resulting in a strange "marshalling error".

This bug still exists, so the test is marked xfail.

Refs #7888.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210107191417.3775319-1-nyh@scylladb.com>
2021-01-11 08:49:16 +01:00
Nadav Har'El
678da50a10 cql-pytest: add reproducers for reversed frozen collection bugs
We add a reproducer for issues #7868 and #7875 which are about bugs when
a table has a frozen collection as its clustering key, and it is sorted
in *reverse order*: If we tried to insert an item to such a table using an
unprepared statement, it failed with a wrong error ("invalid set literal"),
but if we try to set up a prepared statement, the result is even worse -
an assertion failure and a crash.

Interestingly, neither of these problems happen without reversed sort order
(WITH CLUSTERING ORDER BY (b DESC)), and we also add a test which
demonstrates that with default (increasing) order, everything works fine.

All tests pass successfully when run against Cassandra.

The fix for both issues was already committed, so I verified these tests
reproduced the bug before that commit, and pass now.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210110232312.3844408-1-nyh@scylladb.com>
2021-01-11 08:48:30 +01:00
Nadav Har'El
f32c34d8ad cql-pytest: port Cassandra's unit test validation/entities/frozen_collections_test
In this patch, we port validation/entities/frozen_collections_test.java,
containing 33 tests for frozen collections of all types, including
nesting collections.

In porting these tests, I uncovered four previously unknown bugs in Scylla:

Refs #7852: Inserting a row with a null key column should be forbidden.
Refs #7868: Assertion failure (crash) when clustering key is a frozen
            collection and reverse order.
Refs #7888: Certain combination of filtering, index, and frozen collection,
            causes "marshalling error" failure.
Refs #7902: Failed SELECT with tuple of reversed-ordered frozen collections.

These tests also provide two more reproducers for an already known bug:

Refs #7745: Length of map keys and set items are incorrectly limited to
            64K in unprepared CQL.

Due to these bugs, 7 out of the 33 tests here currently xfail. We actually
had more failing tests, but we fixed issue #7868 before this patch went in,
so its tests are passing at the time of this submission.

As usual in this sort of test, all 33 pass when running against Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210110231350.3843686-1-nyh@scylladb.com>
2021-01-11 08:48:08 +01:00
Nadav Har'El
0516cd1609 alternator test: de-duplicate some duplicate code
In test_streams.py we had some code to get a list of shards and iterators
duplicated three times. Put it in a function, shards_and_latest_iterators(),
to reduce this duplication.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201006112421.426096-1-nyh@scylladb.com>
2021-01-11 08:47:25 +01:00
Botond Dénes
cb4d92aae4 sstable_writer: add validation
Add a mutation_fragment_stream_validating_filter to
sstables::writer_impl and use it in sstable_writer to validate the
fragment stream passed down to the writer implementation. This ensures
that all fragment streams written to disk are validated, and we don't
have to worry about validating each source separately.

The current validator from sstable::write_components() is removed. This
covers only part of the write paths. Ad-hoc validations in the reader
implementations are removed as well as they are now redundant.
2021-01-11 09:12:56 +02:00
Botond Dénes
4b254a26ab test/boost/sstable_datafile_test: sstable_scrub_test: disable key validation
The test violates clustering key order on purpose to produce a corrupt
sstable (to test scrub). Disable key validation so when we move the
validator into the writer itself in the next patch it doesn't abort the
test.
2021-01-11 09:12:56 +02:00
Botond Dénes
8dae6152bf mutation_fragment_stream_validator: make it easier to validate concrete fragment types
The current API is tailored to the `mutation_fragment` type. In
the next patch we will want to use the validator from a context where
the mutation fragments are already decomposed into their respective
concrete types, e.g. static_row, clustering_row, etc. To avoid having to
reconstruct a mutation fragment type just to use the validator, add an
API which allows validating these concrete types conveniently too.
2021-01-11 08:07:42 +02:00
Botond Dénes
495f9d54ba flat_mutation_reader: extract fragment stream validator into its own header
To allow using it without pulling in the huge `flat_mutation_reader.hh`.
2021-01-11 08:07:42 +02:00
Dejan Mircevski
3aa80f47fe abstract_type: Rework unreversal methods
Replace two methods for unreversal (`as` and `self_or_reversed`) with
a new one (`without_reversed`).  More flexible and better named.
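
A rough Python model of the idea, assuming a reversed type simply wraps an underlying type. The class names below are illustrative and do not reproduce the C++ `abstract_type` hierarchy.

```python
class AbstractType:
    def without_reversed(self):
        # Ordinary types are already "unreversed".
        return self


class SetType(AbstractType):
    pass


class ReversedType(AbstractType):
    def __init__(self, underlying):
        self.underlying = underlying

    def without_reversed(self):
        # Strip the reversal wrapper and expose the real type.
        return self.underlying


def is_set(t):
    # A single check that works for both plain and reversed columns.
    return isinstance(t.without_reversed(), SetType)
```

With one unwrapping method, callers no longer need to decide between a cast and a "self or reversed" lookup before testing the type.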

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #7889
2021-01-10 19:30:12 +02:00
Tomasz Grabiec
15b5b286d9 Merge "frozen_mutation: better diagnostics for out-of-order and duplicate rows" from Botond
Currently, frozen mutations that contain partitions with out-of-order
or duplicate rows trigger (if at all) an assert in
`row::append_cell()`. This results in poor diagnostics, as the context
doesn't contain enough information on what exactly went wrong: a
cryptic error message and an investigation that can only start after
looking at a coredump.

This series remedies this problem by explicitly checking for
out-of-order and duplicate rows, as early as possible, when the
supposedly empty row is created. If the row already existed (is a
duplicate) or it is not the last row in the partition (out-of-order row)
an exception is thrown and the deserialization is aborted. To further
improve diagnostics, the partition context is also added to the
exception.
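
The check can be pictured with a small Python sketch. `PartitionBuilder` and `append_row` here are hypothetical names for illustration; the sketch assumes rows arrive in ascending clustering-key order and that keys are directly comparable.

```python
class PartitionBuilder:
    """Hypothetical append-only row container for one partition."""

    def __init__(self, partition_key):
        self.partition_key = partition_key
        self.rows = []  # (clustering_key, row) pairs in ascending key order

    def append_row(self, key, row):
        if self.rows:
            last = self.rows[-1][0]
            if key == last:
                # The row already exists: report it with partition context.
                raise ValueError(
                    f"duplicate row {key!r} in partition {self.partition_key!r}")
            if key < last:
                # The new row would not be the last one in the partition.
                raise ValueError(
                    f"out-of-order row {key!r} after {last!r} "
                    f"in partition {self.partition_key!r}")
        self.rows.append((key, row))
```

Failing at the point where the row is created means the error message can name both the offending row and its partition, instead of surfacing later as an assert.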

Tests: unit(release)

* botond/frozen-mutation-bad-row-diagnostics/v3:
  frozen_mutation: add partition context to errors coming from deserializing
  partition_builder: accept_row(): use append_clustering_row()
  mutation_partition: add append_clustered_row()
2021-01-10 19:30:12 +02:00
Pekka Enberg
e5fe0acd15 Update seastar submodule
* seastar 56cfe179...ed345cdb (1):
  > perftune.py: Fix the dump options after adding multiple nics option
Refs #6266
2021-01-08 18:13:26 +01:00
Benny Halevy
60bde99e8e flat_mutation_reader: consume_in_thread: always filter.on_end_of_stream on return
Since we're calling _consumer.consume_end_of_stream()
unconditionally when consume_pausable_in_thread returns.

Refs #7623
Refs #7640

Test: unit(dev)
Dtest: materialized_views_test.py:TestMaterializedViews.interrupt_build_process_with_resharding_low_to_half_test

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210106103024.3494569-1-bhalevy@scylladb.com>
2021-01-08 18:13:26 +01:00
Michał Chojnowski
f317b3c39f mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator
measuring_allocator is a wrapper around standard_allocator, but it exposed
the default preferred_max_contiguous_allocation, not the one from
standard_allocator. Thus managed_bytes allocated in those two allocators
had fragments of different size, and their total memory usage differed,
causing test_external_memory_usage to fail if
standard_allocator::preferred_max_contiguous_allocation was changed from the
default. Fix that.
2021-01-08 14:16:08 +01:00
Pavel Solodovnikov
907b73a652 row_cache: more indentation fixes
Fixup indentation issues introduced in recent patches.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-08 14:16:08 +01:00
Pavel Solodovnikov
eb523d4ac8 utils: remove unused linearization facilities in managed_bytes class
Remove the following bits of `managed_bytes` since they are unused:
* `with_linearized_managed_bytes` function template
* `linearization_context_guard` RAII wrapper class for managing
  `linearization_context` instances.
* `do_linearize` function
* `linearization_context` class

Since there are no more public or private methods in `managed_bytes`
to linearize the value except for explicit `with_linearized()`,
which doesn't use any of the aforementioned parts, we can safely remove
these.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-08 14:16:08 +01:00
Pavel Solodovnikov
8709844566 misc: fix indentation
The patch fixes indentation issues introduced in previous patches
related to removing `with_linearized_managed_bytes` uses from the
code tree.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-08 14:16:08 +01:00
Pavel Solodovnikov
e04eb68a9c treewide: remove remaining with_linearized_managed_bytes uses
There is no point in calling the wrapper since linearization code
is private in `managed_bytes` class and there is no one to call
`managed_bytes::data` because it was deleted recently.

This patch is a prerequisite for removing the
`with_linearized_managed_bytes` function completely, along with
the corresponding parts of the implementation in `managed_bytes`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-08 14:16:08 +01:00
Pavel Solodovnikov
bf8b138b42 memtable, row_cache: remove with_linearized_managed_bytes uses
Since `managed_bytes::data()` is deleted as well as other public
APIs of `managed_bytes` which would linearize stored values except
for explicit `with_linearized`, there is no point in
invoking the `with_linearized_managed_bytes` hack, which would trigger
automatic linearization under the hood of managed_bytes.

Remove useless `with_linearized_managed_bytes` wrapper from
memtable and row_cache code.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-08 14:16:08 +01:00
Avi Kivity
3bf6b78668 utils: managed_bytes: remove linearizing accessors
Accessors that require linearization, such as data(), begin(),
and casting to bytes_view, are no longer used and are now removed.
2021-01-08 14:16:08 +01:00
Michał Chojnowski
dbcf987231 keys, compound: switch from bytes_view to managed_bytes_view
The keys classes (partition_key et al) already use managed_bytes,
but they assume the data is not fragmented and make liberal use
of that by casting to bytes_view. The view classes use bytes_view.

Change that to managed_bytes_view, and adjust return values
to managed_bytes/managed_bytes_view.

The callers are adjusted. In some places linearization (to_bytes())
is needed, but this isn't too bad as keys are always <= 64k and thus
will not be fragmented when out of LSA. We can remove this
linearization later.

The serialize_value() template is called from a long chain, and
can be reached with either bytes_view or managed_bytes_view.
Rather than trace and adjust all the callers, we patch it now
with constexpr if.

operator bytes_view (in keys) is converted to operator
managed_bytes_view, allowing callers to defer or avoid
linearization.
2021-01-08 14:16:08 +01:00
Michał Chojnowski
a1a0839164 sstables: writer: add write_* helpers for managed_bytes_view
We will use them in the upcoming patch where we transition keys from bytes_view
to managed_bytes_view.
2021-01-08 14:16:08 +01:00
Michał Chojnowski
45c1b90eb5 compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view
The underlying view will change from bytes_view to managed_bytes_view in the
next commits, so we prepare for that.
2021-01-08 14:16:08 +01:00
Avi Kivity
d9fcc4f4ef types: change equal() to accept managed_bytes_view
bytes_view can convert to managed_bytes_view, so the change
is compatible with the existing representation and the next
patches, which change compound types to use managed_bytes_view.
2021-01-08 14:16:08 +01:00
Michał Chojnowski
1de0b9a425 types: add parallel interfaces for managed_bytes_view
We will need those to transition keys and compound from bytes_view to
managed_bytes_view.
2021-01-08 14:16:08 +01:00
Avi Kivity
d1f354f5fb types: add to_managed_bytes(const sstring&)
This is a helper for tests (similar to to_bytes(const sstring&)).
2021-01-08 14:16:08 +01:00
Michał Chojnowski
c6eb485675 serializer_impl: handle managed_bytes without linearizing
With managed_bytes_view implemented, it's easy to de/serialize managed_bytes
without linearization.
2021-01-08 14:16:08 +01:00
Michał Chojnowski
bf0ec63e34 utils: managed_bytes: add managed_bytes_view::operator[]
This operator has a single purpose: an easier port of legacy_compound_view
from bytes_view to managed_bytes_view.
It is inefficient and should be removed as soon as legacy_compound_view stops
using operator[].
2021-01-08 14:16:08 +01:00
Michał Chojnowski
778269151a utils: managed_bytes: introduce managed_bytes_view
managed_bytes_view is a non-owning view into managed_bytes.
It can also be implicitly constructed from bytes_view.

It conforms to the FragmentedView concept and is mainly used through that
interface.

It will be used as a replacement for bytes_view occurrences currently
obtained by linearizing managed_bytes.
2021-01-08 14:16:08 +01:00
Michał Chojnowski
cf7d25b98d utils: fragment_range: add serialization helpers for FragmentedMutableView
We will use them to write to managed_bytes_view in an upcoming patch,
to avoid linearization in compound_type::serialize_value.
2021-01-08 14:16:07 +01:00
Michał Chojnowski
75898ee44e bytes: implement std::hash using appending_hash
This is a preparation for the upcoming introduction of managed_bytes_view,
intended as a fragmented replacement for bytes_view.
To ease the transition, we want both types to give equal hashes for equal
contents.
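
The property being asked for (equal contents give equal hashes regardless of how the bytes are fragmented) can be illustrated in Python by feeding fragments to an incremental hasher. SHA-256 is used purely as a stand-in here; it is not the hash that `appending_hash` uses.

```python
import hashlib


def hash_fragments(fragments):
    # Feed each fragment in order; the digest depends only on the
    # concatenated byte sequence, not on where the fragment breaks fall.
    h = hashlib.sha256()
    for frag in fragments:
        h.update(frag)
    return h.hexdigest()
```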
2021-01-08 13:17:46 +01:00
Michał Chojnowski
4822730752 utils: mutable_view: add substr()
Analogous to bytes_view::substr.
This bit of functionality will be used to implement managed_bytes_mutable_view.
2021-01-08 13:17:46 +01:00
Dejan Mircevski
9eed26ca3d cql3: Fix maps::setter_by_key for unset values
Unset values for key and value were not handled.  Handle them in a
manner matching Cassandra.

This fixes all cases in testMapWithUnsetValues, so re-enable it (and
fix a comment typo in it).

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-01-07 13:22:20 +02:00
Dejan Mircevski
4515a49d4d cql3: Fix IN ? for unset values
When the right-hand side of IN is an unset value, we must report an
error, like Cassandra does.

This fixes testListWithUnsetValues, so re-enable it.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-01-07 13:22:20 +02:00
Dejan Mircevski
5bee97fa51 cql3: Fix handling of scalar unset value
Make the bind() operation of the scalar marker handle the unset-value
case (which it previously didn't).

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-01-07 13:22:20 +02:00
Dejan Mircevski
8b2f459622 cql3: Fix crash when removing unset_value from set
Avoid crash described in #7740 by ignoring the update when the
element-to-remove is UNSET_VALUE.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-01-07 13:22:20 +02:00
Pekka Enberg
e81f4caf67 Update seastar submodule
* seastar a2fc9d72...56cfe179 (1):
  > perftune.py: Fix nic_is_bond_iface() and other function signatures
Refs #6266
2021-01-07 13:22:20 +02:00
Takuya ASADA
10184ba64f redis: implement parse error, reply error message correctly
Since we haven't implemented parse errors in the redis protocol parser,
the reply message is broken when a parse error occurs.
Implement parse errors and reply with a correct error message.

Fixes #7861
Fixes #7114

Closes #7862
2021-01-07 13:22:20 +02:00
Dejan Mircevski
176ff0238a cql3: Fix handling of reverse-order maps
When the clustering order is reversed on a map column, the column type
is reversed_type_impl, not map_type_impl.  Therefore, we have to check
for both reversed type and map type in some places.

This patch handles reverse types in enough places to make
test_clustering_key_reverse_frozen_map pass.  However, it leaves
other places (invocations of is_map() and *_cast<map_type_impl>())
as they currently are; some are protected by callers from being
invoked on reverse types, but some are quite possibly bugs untriggered
by existing tests.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-01-07 13:22:20 +02:00
Dejan Mircevski
6bb10fcf36 cql3: Fix handling of reverse-order lists
When the clustering order is reversed on a list column, the column type
is reversed_type_impl, not list_type_impl.  Therefore, we have to check
for both reversed type and list type in some places.

This patch handles reverse types in enough places to make
test_clustering_key_reverse_frozen_list pass.  However, it leaves
other places (invocations of is_list() and *_cast<list_type_impl>())
as they currently are; some are protected by callers from being
invoked on reverse types, but some are quite possibly bugs untriggered
by existing tests.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-01-07 13:22:20 +02:00
Dejan Mircevski
14fa39cfa6 cql3: Fix handling of reverse-order sets
When the clustering order is reversed on a set column, the column type
is reversed_type_impl, not set_type_impl.  Therefore, we have to check
for both reversed type and set type in some places.

To make such checks easier, add convenience methods self_or_reversed()
and as() to abstract_type.  Invoke those methods (instead of is_set()
and casts) enough to make test_clustering_key_reverse_frozen_set pass.
Leave other invocations of is_set() and *_cast<set_type_impl>() as
they are; some are protected by callers from being invoked on reverse
types, but some are quite possibly bugs untriggered by existing tests.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-01-07 13:22:20 +02:00
Calle Wilund
7c84b16cd8 commitlog: Make flush threshold configurable 2021-01-05 18:16:09 +00:00
Calle Wilund
c3d95811da table: Add a flush RP mark to table, and shortcut if not above
Adds a second RP to table, marking where we flushed last.
If a new flush request comes in that is below this mark, we
can skip a second flush.

This is to (in future) support incremental CL flush.
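
A minimal sketch of the shortcut, with replay positions modeled as plain integers. The `Table` class, its field names, and the choice to treat the mark as inclusive are assumptions made for illustration, not the actual implementation.

```python
class Table:
    def __init__(self):
        self.flushed_up_to = -1  # hypothetical: highest RP already flushed
        self.flush_count = 0     # number of real flushes performed

    def flush(self, requested_rp, current_rp):
        """Make sure data up to requested_rp is flushed; return True if work was done."""
        if requested_rp <= self.flushed_up_to:
            # Everything the caller asked for was covered by an earlier flush.
            return False
        self.flush_count += 1
        # A real flush covers everything written so far, so remember that.
        self.flushed_up_to = current_rp
        return True
```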
2021-01-05 18:16:09 +00:00
Raphael Carvalho
28a2aca627 Fix doc for building pkgs for a specific build mode
Closes #7878
2021-01-05 18:56:21 +02:00
Tomasz Grabiec
1d717f37e2 vint-serialization: Reference the correct spec
We are not using the protocol buffers format for vint.

Message-Id: <1609865471-22292-1-git-send-email-tgrabiec@scylladb.com>
2021-01-05 18:54:09 +02:00
Vojtech Havel
d858c57357 cql3: allow SELECTs restricted by "IN" to retrieve collections
This patch enables CQL SELECT statements that select collection
columns in queries where a clustering column is restricted by the
"IN" CQL operator. Such queries are accepted by Cassandra since v4.0.

The internals actually provide correct support for this feature already;
this patch simply removes the relevant CQL query check.

Tests: cql-pytest (testInRestrictionWithCollection)

Fixes #7743
Fixes #4251

Signed-off-by: Vojtech Havel <vojtahavel@gmail.com>
Message-Id: <20210104223422.81519-1-vojtahavel@gmail.com>
2021-01-05 14:39:18 +02:00
Pekka Enberg
e54cc078a1 Update seastar submodule
* seastar d1b5d41b...a2fc9d72 (6):
  > perftune.py: support passing multiple --nic options to tune multiple interfaces at once
  > perftune.py recognize and sort IRQs for Mellanox NICs
  > perftune.py: refactor getting of driver name into __get_driver_name()
Fixes #6266
  > install-dependencies: support Manjaro
  > append_challenged_posix_file_impl: optimize_queue: use max of sloppy_size_hint and speculative_size
  > future: do_until: handle exception in stop condition
2021-01-05 13:32:21 +02:00
Avi Kivity
43a2636229 Merge "Remove proxy from size-estimates reader" from Pavel E
"
The size_estimates_mutation_reader calls the global proxy
to get the database. The database is used to find keyspaces
to work with. However, it's safe to keep the local database
reference on the reader itself.

tests: unit(debug)
"

* 'br-no-proxy-in-size-estimate-reader' of https://github.com/xemul/scylla:
  size_estimate_reader: Use local db reference not global
  size_estimate_reader: Keep database reference on mutation reader
  size_estimate_reader: Keep database reference on virtual_reader
2021-01-05 11:28:09 +02:00
Pavel Emelyanov
9632af5d6b schema_tables: Drop unused merge_schema overload
After d3aa1759, one of them became unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210105051724.5249-1-xemul@scylladb.com>
2021-01-05 11:25:22 +02:00
Michał Chojnowski
6c97027f85 utils: fragment_range: add compare_unsigned
We will use it to compare fragmented buffers (mainly managed_bytes_view in
types, compound, and tests) without linearization.
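
As a rough illustration of comparing fragmented buffers without first joining them, here is a Python sketch that walks two lists of `bytes` fragments in lockstep. The function mirrors the name `compare_unsigned` but is not the C++ implementation; note that Python's `bytes` comparison is already unsigned and lexicographic.

```python
def compare_unsigned(a, b):
    """Three-way compare two fragment lists as if each were one byte string."""
    ia, ib = iter(a), iter(b)
    fa, fb = b"", b""
    while True:
        # Refill whichever side ran out, skipping empty fragments.
        while fa is not None and len(fa) == 0:
            fa = next(ia, None)
        while fb is not None and len(fb) == 0:
            fb = next(ib, None)
        if fa is None or fb is None:
            # One input is exhausted: the shorter one sorts first.
            return (fa is not None) - (fb is not None)
        # Compare only the overlapping part of the current fragments.
        n = min(len(fa), len(fb))
        if fa[:n] != fb[:n]:
            return -1 if fa[:n] < fb[:n] else 1
        fa, fb = fa[n:], fb[n:]
```

Only slices bounded by the current fragment sizes are ever materialized, so the full value is never copied into one contiguous buffer.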
2021-01-04 22:50:45 +01:00
Michał Chojnowski
2d28471a59 utils: managed_bytes: make the constructors from bytes and bytes_view explicit
Conversions from views to owners have no business being implicit.
Besides, they would also cause various ambiguity problems when adding
managed_bytes_view.
2021-01-04 22:22:12 +01:00
Raphael S. Carvalho
d265bb9bdb test: Add test for TWCS interposer on memtable flush
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 16:55:06 -03:00
Raphael S. Carvalho
9124a708f1 table: Wire interposer consumer for memtable flush
From now on, memtable flush will use the strategy's interposer consumer
iff split_during_flush is enabled (disabled by default).
It has an effect only for TWCS users, as TWCS is the only strategy that
goes on to implement this interposer consumer, which consists of
segregating data according to the window configuration.
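
Conceptually, segregating by window means bucketing writes by which time window their timestamp falls into, so that each bucket can go to its own sstable. A minimal Python sketch, assuming integer timestamps and a fixed window size in the same unit (both are simplifications of the real TWCS configuration):

```python
from collections import defaultdict


def segregate_by_window(rows, window_size):
    """Group (timestamp, payload) pairs by the time window they belong to."""
    buckets = defaultdict(list)
    for timestamp, payload in rows:
        # Integer division maps every timestamp in a window to the same id.
        buckets[timestamp // window_size].append((timestamp, payload))
    return dict(buckets)
```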

Fixes #4617.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 16:26:07 -03:00
Raphael S. Carvalho
c926a948e5 table: Add write_memtable_to_sstable variant which accepts flat_mutation_reader
This new variant will be needed for interposer consumer.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 16:23:00 -03:00
Raphael S. Carvalho
32acb44fec table: Allow sstable write permit to be shared across monitors
As a preparation for interposer on flush, let's allow database write monitor
to store a shared sstable write permit, which will be released as soon as
any of the sstable writers reach the sealing stage.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 14:46:43 -03:00
Nadav Har'El
ed31dd1742 cql-pytest: port Cassandra's unit test validation/entities/counters_test
In this patch, we port validation/entities/counters_test.java, containing
7 tests for CQL counters. Happily, these tests did not uncover any bugs in
Scylla and all pass on both Cassandra and Scylla.

There is one small difference that I decided to ignore instead of reporting
a bug. If you try a CREATE TABLE with both counter and non-counter columns,
Scylla gives a ConfigurationException error, while Cassandra gives a more
reasonable InvalidRequest. The ported test currently allows both.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201223181325.3148928-1-nyh@scylladb.com>
2021-01-04 18:25:48 +01:00
Nadav Har'El
05d6eff850 cql-pytest: add tests for non-support of unicode equivalence
In issue #7843, questions were raised about how much Scylla
supports the notion of Unicode Equivalence, a.k.a. Unicode normalization.

Consider the Spanish letter ñ - it can be represented by a single Unicode
character 00F1, but can also be represented as a 006E (lowercase "n")
followed by a 0303 ("combining tilde"). Unicode specifies that these
two representations should be considered "equivalent" for purposes of
sorting or searching. But the following tests demonstrate that this
is not, in fact, supported in Scylla or Cassandra:

1. If you use one representation as the key, then looking up the other one
   will not find the row. Scylla (and Cassandra) do *not* consider
   the two strings equivalent.

2. The LIKE operator (a Scylla-only extension) doesn't know that
   the single-character ñ begins with an n, or that the two-character
   ñ is just a single character.
   This is despite the thinking on #7843 that by using ICU in the
   implementation of LIKE, we somehow got support for this. We didn't.
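
The two representations are easy to compare outside the database. The following Python snippet uses only the standard `unicodedata` module; the helper name `equivalent` is made up for this illustration.

```python
import unicodedata

composed = "\u00f1"      # single code point U+00F1
decomposed = "n\u0303"   # U+006E followed by U+0303 (combining tilde)


def equivalent(a, b):
    # Compare after normalizing both strings to the same form (NFC).
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain comparison, which is what a key lookup amounts to, sees two
# different strings with different UTF-8 encodings.
plain_equal = composed == decomposed
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")
```

Here `plain_equal` and `same_bytes` are both false, while `equivalent(composed, decomposed)` is true, which is the gap between byte-wise comparison and Unicode equivalence that the tests exercise.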

Refs #7843

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201229125330.3401954-1-nyh@scylladb.com>
2021-01-04 18:25:28 +01:00
Nadav Har'El
feb028c97e cql-pytest: add reproducer for issue 7856
This patch adds a reproducer for issue #7856, which is about frozen sets
and how we can in Scylla (but not in Cassandra), insert one in the "wrong"
order, but only in very specific circumstances which this reproducer
demonstrates: The bug can only be reproduced in a nested frozen collection,
and using prepared statements.

Refs #7856

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201231085500.3514263-1-nyh@scylladb.com>
2021-01-04 18:25:12 +01:00
Raphael S. Carvalho
738049cba2 memtable: Track min timestamp
Tracking both min and max timestamp will be required for memtable flush
to short-circuit interposer consumer if needed.
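
One way such a short-circuit could work, sketched in Python under the assumption that the decision is "do the minimum and maximum timestamps fall into the same window?". The `Memtable` class, `apply` method, and the exact criterion are hypothetical and only meant to show why both bounds are needed.

```python
class Memtable:
    def __init__(self):
        self.min_timestamp = None
        self.max_timestamp = None

    def apply(self, timestamp):
        # Track both bounds as writes arrive.
        if self.min_timestamp is None or timestamp < self.min_timestamp:
            self.min_timestamp = timestamp
        if self.max_timestamp is None or timestamp > self.max_timestamp:
            self.max_timestamp = timestamp


def spans_multiple_windows(memtable, window_size):
    """Return True if splitting by window could produce more than one output."""
    if memtable.min_timestamp is None:
        return False  # empty memtable: nothing to split
    return (memtable.min_timestamp // window_size
            != memtable.max_timestamp // window_size)
```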

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 13:24:43 -03:00
Raphael S. Carvalho
5519fdba72 table: Extend cache update to operate a memtable split into multiple sstables
This extension is needed for future work where a memtable will be segregated
during flush into one sstable or more. So now multiple sstables can be added
to the set after a memtable flush, and compaction is only triggered at the
end.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 13:24:10 -03:00
Piotr Sarna
d5da455d95 schema_tables: describe calculate_schema_digest better
- the mystical `accept_predicate` is renamed to `accept_keyspace`
   to be more self-descriptive
 - a short comment is added to the original calculate_schema_digest
   function header, mentioning that it computes schema digest
   for non-system keyspaces

Refs #7854

Message-Id: <04f1435952940c64afd223bd10a315c3681b1bef.1609763443.git.sarna@scylladb.com>
2021-01-04 14:46:17 +02:00
Amos Kong
8b231a3bd9 install.sh: switch to use realpath for EnvironmentFile
In scylla-jmx, we fixed a hardcoded sysconfdir in the EnvironmentFile path
by using realpath to convert the path. This patch switches the scylla repo
to realpath as well, to make it consistent with scylla-jmx.

Suggested-by: Pekka Enberg <penberg@scylladb.com>
Signed-off-by: Amos Kong <amos@scylladb.com>

Closes #7860
2021-01-04 12:45:17 +02:00
Avi Kivity
33ee07a9d8 Merge 'Skip internal distributed tables in schema_change_test' from Piotr Sarna
The original idea for `schema_change_test` was to ensure that **if** schema hasn't changed, the digest also remained unchanged. However, a cumbersome side effect of adding an internal distributed table (or altering one) is that all digests in `schema_change_test` are immediately invalid, because the schema changed.
Until now, each time a distributed system table was added/amended, a new test case for `schema_change_test` was generated, but this effort is not worth the effect - when a distributed system table is added, it will always propagate on its own, so generating a new test case does not bring any tangible new test coverage - it's just a pain.

To avoid this pain, `schema_change_test` now explicitly skips all internal keyspaces - which includes internal distributed tables - when calculating schema digest. That way, patches which change the way of computing the digest itself will still require adding a new test case, which is good, but, at the same time, changes to distributed tables will not force the developers to introduce needless schema features just for the sake of this test.

Tests:
 * unit(dev)
 * manual(rebasing on top of a change which adds two distributed system tables - all tests still passed)

Refs #7617

Closes #7854

* github.com:scylladb/scylla:
  schema_change_test: skip distributed system tables in digest
  schema_tables: allow custom predicates in schema digest calc
  alternator: drop unneeded sstring creation
  system_keyspace: migrate helper functions to string_view
  database: migrate find_keyspace to string views
2021-01-04 12:44:03 +02:00
Piotr Sarna
e26aa836a9 schema_change_test: skip distributed system tables in digest
With previous design of the schema change test, a regeneration
was necessary each time a new distributed system table was added.
It was not the original purpose of the test to keep track of new
distributed tables which simply propagate on their own,
so the test case is now modified: internal distributed tables
are not part of the schema digest anymore, which means that
changes inside them will not cause mismatches.

This change involves a one-shot regeneration of all digests,
which due to historical reasons included internal distributed
tables in the digest, but no further regenerations should ever
be necessary when a new internal distributed table is added.
2021-01-04 10:24:40 +01:00
Piotr Sarna
13a60b02ea schema_tables: allow custom predicates in schema digest calc
For testing purposes it would be useful to be able to skip computing
schema for certain tables (namely, internal distributed tables).
In order to allow that, a function which accepts a custom predicate
is added.
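
The shape of such a function can be sketched in Python. Everything here is illustrative: the schema is modeled as a nested dict, SHA-256 stands in for the real digest algorithm, and the keyspace names in `INTERNAL` are examples rather than the actual list.

```python
import hashlib

INTERNAL = {"system", "system_schema", "system_distributed"}  # example names


def calculate_schema_digest(schema, accept=lambda keyspace: True):
    """Digest a {keyspace: {table: definition}} mapping, filtered by a predicate."""
    h = hashlib.sha256()
    for keyspace in sorted(schema):
        if not accept(keyspace):
            continue  # skipped keyspaces never influence the digest
        for table in sorted(schema[keyspace]):
            h.update(f"{keyspace}.{table}:{schema[keyspace][table]}".encode())
    return h.hexdigest()


def skip_internal(keyspace):
    return keyspace not in INTERNAL
```

With `skip_internal` as the predicate, adding a table under an internal keyspace leaves the digest unchanged, while the default predicate still sees the change.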
2021-01-04 10:11:41 +01:00
Piotr Sarna
12b5184933 alternator: drop unneeded sstring creation
It's now possible to use string views to check if a particular
table is a system table, so it's no longer needed to explicitly
create an sstring instance.
2021-01-04 09:47:01 +01:00
Piotr Sarna
f293c59a46 system_keyspace: migrate helper functions to string_view
Functions for checking if the keyspace is system/internal were based
on sstring references, which is impractical compared to string views
and may lead to unnecessary creation of sstring instances.
2021-01-04 09:47:01 +01:00
Piotr Sarna
aba9772eff database: migrate find_keyspace to string views
... in order to avoid creating unnecessary sstring instances
just to compare strings.
2021-01-04 09:47:01 +01:00
Gleb Natapov
d3aa17591c migration_manager: drop announce_locally flag
It looks like the history of the flag begins in Cassandra's
https://issues.apache.org/jira/browse/CASSANDRA-7327 where it is
introduced to speedup tests by not needing to start the gossiper.
The thing is we always start the gossiper in our cql tests, so the flag only
introduces noise. And, of course, since we want to move schema to use raft,
it goes against the nature of raft to be able to apply a modification only
locally, so we better get rid of the capability ASAP.

Tests: units(dev, debug)
Message-Id: <20201230111101.4037543-2-gleb@scylladb.com>
2021-01-03 13:58:09 +02:00
Gleb Natapov
491f10bb70 schema-tables: make schema update global when fixing legacy SI tables
When a node notices that it uses legacy SI tables, it converts them to
the new format, but it updates only the local schema. This only causes
schema discrepancy between nodes; the schema change should propagate
globally.

Fixes #7857.

Message-Id: <20201230111101.4037543-1-gleb@scylladb.com>
2021-01-03 13:57:46 +02:00
Raphael S. Carvalho
d55d65d77c compaction: Enable filtering reader only on behalf of cleanup compaction
After 13fa2bec4c, every compaction will be performed through a filtering
reader because consumers cannot do the filtering if interposer consumer is
enabled.

It turns out that filtering_reader is adding significant overhead when regular
compactions are running. As no other compaction type needs to actually do
any filtering, let's limit filtering_reader to cleanup compaction.
Alternatively, we could disable interposer consumer on behalf of cleanup,
or add support for the consumers to do the filtering themselves but that
would add lots of complexity.

Fixes #7748.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201230194516.848347-2-raphaelsc@scylladb.com>
2021-01-03 12:02:43 +02:00
Raphael S. Carvalho
e42d277805 compaction: Drop needless partition filter for regular compaction
This filter is used to discard data that doesn't belong to current
shard, but scylla will only make a sstable available to regular
compaction after it was resharded on either boot or refresh.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201230194516.848347-1-raphaelsc@scylladb.com>
2021-01-03 12:02:42 +02:00
Pekka Enberg
5872b754e0 Revert "dist/docker: Remove 'epel-release' from Docker image"
This reverts commit ceb67e7728. The
"epel-release" package is needed to install the "supervisord"
package, which I somehow missed in testing...

Fixes #7851
2021-01-02 12:49:12 +02:00
Nadav Har'El
93a2c52338 cql-pytest: add tests for inserting rows with missing key columns
This patch adds two simple tests for what happens when a user tries to
insert a row with one of the key columns missing. The first test confirms
that if the column is completely missing, we correctly print an error
(this was issue #3665, that was already marked fixed).

However, the second test demonstrates that we still have a bug when
the key column appears on the command, but with a null value.
In this case, instead of failing the insert (as Cassandra does),
we silently ignore it. This is the proper behavior for UNSET_VALUE,
but not for null. So the second test is marked xfail, and I opened
issue #7852 about it.

Refs #3665
Refs #7852

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201230132350.3463906-1-nyh@scylladb.com>
2020-12-30 18:20:01 +01:00
Nadav Har'El
10fbef5bff cql-pytest: clean up test_using_timeout.py
In a previous version of test_using_timeout.py, we had tables pre-filled
with some content labeled "everything". The current version of the tests
doesn't use it, so drop it completely.

One test, test_per_query_timeout_large_enough, still had code that did
	res = list(cql.execute(f"SELECT * FROM {table} USING TIMEOUT 24h"))
	assert res == everything
this was a bug - it only works as expected if this test is run before
any other test is run, and will fail if we ever reorder or parallelize
these tests. So drop these two lines.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201229145435.3421185-1-nyh@scylladb.com>
2020-12-30 09:16:25 +01:00
Asias He
84f482bde4 table: Add make_streaming_reader for given sstables set
Add a streaming reader that streams from a given sstables set.

Refs #7831
2020-12-30 08:32:42 +08:00
Nadav Har'El
5f24ff9187 Merge 'Coroutinize alternator tagging requests' from Piotr Sarna
This miniseries rewrites two alternator request handlers from seastar threads to coroutines - since these handlers are not on a hot path and using seastar threads is way too heavy for such a simple routine.

NOTE: this pull request obviously has to wait until coroutines are fully supported in Seastar/Scylla.

Closes #7453

* github.com:scylladb/scylla:
  alternator: coroutinize untagging a resource
  alternator: coroutinize tagging a resource
2020-12-29 23:36:25 +02:00
Avi Kivity
700ddd1914 Merge 'scylla_setup: enable node_exporter for offline installation' from Amos Kong
node_exporter had been added to scylla-server package by commit
95197a09c9.

So we can enable it by default for offline installation.

Closes #7832

* github.com:scylladb/scylla:
  scylla_setup: cleanup if judgments
  scylla_setup: enable node_exporter for offline installation
2020-12-28 22:07:36 +02:00
Avi Kivity
1716359455 Update tools/jmx submodule
* tools/jmx 20469bf...2c95650 (1):
  > install.sh: set a valid WorkingDirectory for nonroot offline install
2020-12-28 21:19:04 +02:00
Avi Kivity
f7b731bc46 Merge 'Fix potential reactor stall on LCS compaction completion' from Raphael Carvalho
On every compaction completion, sstable set is rebuilt from scratch.
With LCS and ~160G of data per shard, it means we'll have to create
a new sstable set with ~1000 entries whenever compaction completes,
which will likely result in reactor stalling for a significant
amount of time.

Fixes #7758.

Closes #7842

* github.com:scylladb/scylla:
  table: Fix potential reactor stall on LCS compaction completion
  table: decouple preparation from execution when updating sstable set
  table: change rebuild_sstable_list to return new sstable set
  row_cache: allow external updater to decouple preparation from execution
2020-12-28 21:16:17 +02:00
Pavel Emelyanov
7ac435f67c test: Enhance test for range_tombstone_list de-overlapping
The range_tombstone_list always (unless misused?) contains de-overlapped
entries. There's a test_add_random that checks this, but it suffers from
several problems:

- generated "random" ranges are sequential and may only overlap on
  their borders
- test uses the keys of the same prefix length

Enhance the generator part to produce a purely random sequence of ranges
with bound keys of arbitrary length. Just pay attention to generate the
"valid" individual ranges, whose start is not ahead of the end.

Also -- rename the test to reflect what it's doing and increase the
number of iterations.
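
A minimal Python sketch of the generator idea described above, not the
actual C++ test code. Keys are modelled as integer tuples of arbitrary
length and compared lexicographically; the real clustering-key ordering
is more involved. The names random_key and random_range and the size
limits are assumptions made for illustration.

```python
import random

def random_key(rng, max_len=4, max_component=10):
    """Return a key as a tuple of 1..max_len small integers (arbitrary prefix length)."""
    length = rng.randint(1, max_len)
    return tuple(rng.randint(0, max_component) for _ in range(length))

def random_range(rng):
    """Return a (start, end) pair whose start is not ahead of its end."""
    a, b = random_key(rng), random_key(rng)
    # Swap the bounds if needed so that every generated range is "valid".
    return (a, b) if a <= b else (b, a)

rng = random.Random(42)  # fixed seed, chosen here only for reproducibility
ranges = [random_range(rng) for _ in range(1000)]
```

Unlike sequential generation, ranges produced this way may overlap
anywhere, not only on their borders.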

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201228115525.20327-1-xemul@scylladb.com>
2020-12-28 18:26:48 +02:00
Raphael S. Carvalho
8dd7280107 table: Fix potential reactor stall on LCS compaction completion
On every compaction completion, sstable set is rebuilt from scratch.
With LCS and ~160G of data per shard, it means we'll have to create
a new sstable set with ~1000 entries whenever compaction completes,
which will likely result in reactor stalling for a significant
amount of time.

This is fixed by futurizing build_new_sstable_list(), so it will
yield whenever needed.

Fixes #7758.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-12-28 13:17:50 -03:00
Raphael S. Carvalho
6082da4703 table: decouple preparation from execution when updating sstable set
row cache now allows updater to first prepare the work, and then execute
the update atomically as the last step. let's do that when rebuilding
the set, so now new set is created in the preparation phase, and the
new set replaces the old one in the execution phase, satisfying the
atomicity requirement of row cache.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-12-28 13:17:48 -03:00
Raphael S. Carvalho
43f0200b8f table: change rebuild_sstable_list to return new sstable set
procedure is changed to return the new set, so caller will be responsible
for replacing the old set with the new one. this will allow our future
work where building new set and enabling it will be decoupled.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-12-28 13:17:47 -03:00
Raphael S. Carvalho
198b87503f row_cache: allow external updater to decouple preparation from execution
External updater may do some preparatory work like constructing a new sstable list,
and at the end atomically replace the old list by the new one.

Decoupling the preparation from execution will give us the following benefits:
- the preparation step can now yield if needed to avoid reactor stalls, as it's
been futurized.
- the execution step will now be able to provide strong exception guarantees, as
it's now decoupled from the preparation step which can be non-exception-safe.
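
The two phases can be sketched in Python with asyncio standing in for
Seastar futures. This is an illustration of the pattern only, not the
row_cache API: SSTableSetHolder, prepare, execute and yield_every are
hypothetical names.

```python
import asyncio

class SSTableSetHolder:
    """Hypothetical stand-in for a table that owns an sstable set."""

    def __init__(self, sstables):
        self.sstables = frozenset(sstables)

    async def prepare(self, to_add, to_remove, yield_every=128):
        """Build the new set off to the side; this step may yield and may raise."""
        removed = set(to_remove)
        new_set = set()
        for i, sst in enumerate(list(self.sstables) + list(to_add)):
            if sst not in removed:
                new_set.add(sst)
            if i % yield_every == 0:
                await asyncio.sleep(0)  # let other tasks run
        return frozenset(new_set)

    def execute(self, new_set):
        """Publish the prepared set in a single assignment; it cannot fail halfway."""
        self.sstables = new_set

async def main():
    holder = SSTableSetHolder(range(1000))
    new_set = await holder.prepare(to_add=[1000], to_remove=[0, 1])
    holder.execute(new_set)
    return holder

holder = asyncio.run(main())
```

If prepare raises, the old set is still in place, which is what lets
the execution step offer a strong exception guarantee.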

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-12-28 13:17:45 -03:00
Avi Kivity
3325960486 Update seastar submodule
* seastar 1f5e3d3419...d1b5d41b6d (1):
  > append_challenged_posix_file_impl: adjust sloppy_size only in optimize_queue

Fixes #7439 (the coredump part).
2020-12-28 13:00:04 +02:00
Nadav Har'El
7eda6b1e90 cql-pytest: increase default request timeout
The CQL tests in test/cql-pytest use the Python CQL driver's default
timeout for execute(), which is 10 seconds. This is usually more than
enough. However, in extreme cases like noted in issue #7838, 10
seconds may not be enough. In that issue, we run a very slow debug
build on a very slow test machine, and encounter a very slow request
(a DROP KEYSPACE that needs to drop multiple tables).

So this patch increases the default timeout to an even larger
120 seconds. We don't care that this timeout is ridiculously
large - under normal operations it will never be reached, there
is no code which loops for this amount of time for example.

Tested that this patch fixes #7838 by choosing a much lower timeout
(1 second) and reproducing test failures caused by timeouts.

Fixes #7838.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201228090847.3234862-1-nyh@scylladb.com>
2020-12-28 11:19:37 +02:00
Amos Kong
8723b0ce86 leveled_compaction_strategy: fix boundary of maximum sstable level
The MAX_LEVELS is the levels count, but sstable level (index) starts
from 0. So the maximum and valid level is MAX_LEVELS - 1.
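
The boundary can be illustrated with a short Python sketch. The value
9 for MAX_LEVELS and the helper is_valid_level are assumptions made
for illustration, not taken from the patch.

```python
MAX_LEVELS = 9  # assumed value, for illustration only

def is_valid_level(level: int) -> bool:
    """Levels are indexes 0 .. MAX_LEVELS - 1, so MAX_LEVELS itself is out of range."""
    return 0 <= level <= MAX_LEVELS - 1

levels = [[] for _ in range(MAX_LEVELS)]  # one bucket per level
```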

Signed-off-by: Amos Kong <amos@scylladb.com>

Closes #7833
2020-12-27 18:59:54 +02:00
Benny Halevy
8a745a0ee0 compaction: compaction_writer: destroy shared_sstable after the sstable_writer
sstable_writer may depend on the sstable throughout its whole lifecycle.
If the sstable is freed before the sstable_writer we might hit use-after-free
as in the following case:
```
std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator+=(long) at /usr/include/c++/10/bits/stl_deque.h:240
 (inlined by) std::operator+(std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*> const&, long) at /usr/include/c++/10/bits/stl_deque.h:378
 (inlined by) std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator[](long) const at /usr/include/c++/10/bits/stl_deque.h:252
 (inlined by) std::deque<sstables::compression::segmented_offsets::bucket, std::allocator<sstables::compression::segmented_offsets::bucket> >::operator[](unsigned long) at /usr/include/c++/10/bits/stl_deque.h:1327
 (inlined by) sstables::compression::segmented_offsets::push_back(unsigned long, sstables::compression::segmented_offsets::state&) at ./sstables/compress.cc:214
sstables::compression::segmented_offsets::writer::push_back(unsigned long) at ./sstables/compress.hh:123
 (inlined by) compressed_file_data_sink_impl<crc32_utils, (compressed_checksum_mode)1>::put(seastar::temporary_buffer<char>) at ./sstables/compress.cc:519
seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at table.cc:?
 (inlined by) seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at ././seastar/include/seastar/core/iostream-impl.hh:432
seastar::output_stream<char>::flush() at table.cc:?
seastar::output_stream<char>::close() at table.cc:?
sstables::file_writer::close() at sstables.cc:?
sstables::mc::writer::~writer() at writer.cc:?
 (inlined by) sstables::mc::writer::~writer() at ./sstables/mx/writer.cc:790
sstables::mc::writer::~writer() at writer.cc:?
flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at compaction.cc:?
 (inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_destroy() at /usr/include/c++/10/optional:260
 (inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_reset() at /usr/include/c++/10/optional:280
 (inlined by) std::_Optional_payload<sstables::compaction_writer, false, false, false>::~_Optional_payload() at /usr/include/c++/10/optional:401
 (inlined by) std::_Optional_base<sstables::compaction_writer, false, false>::~_Optional_base() at /usr/include/c++/10/optional:474
 (inlined by) std::optional<sstables::compaction_writer>::~optional() at /usr/include/c++/10/optional:659
 (inlined by) sstables::compacting_sstable_writer::~compacting_sstable_writer() at ./sstables/compaction.cc:229
 (inlined by) compact_mutation<(emit_only_live_rows)0, (compact_for_sstables)1, sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_mutation() at ././mutation_compactor.hh:468
 (inlined by) compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_for_compaction() at ././mutation_compactor.hh:538
 (inlined by) std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::operator()(compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>*) const at /usr/include/c++/10/bits/unique_ptr.h:85
 (inlined by) std::unique_ptr<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>, std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~unique_ptr() at /usr/include/c++/10/bits/unique_ptr.h:361
 (inlined by) stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::~stable_flattened_mutations_consumer() at ././mutation_reader.hh:342
 (inlined by) flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at ././flat_mutation_reader.hh:201
auto flat_mutation_reader::impl::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:272
 (inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:383
 (inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:389
 (inlined by) seastar::future<void> sstables::compaction::setup<noop_compacted_fragments_consumer>(noop_compacted_fragments_consumer)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader)::{lambda()#1}::operator()() at ./sstables/compaction.cc:612
```

What happens here is that:

    compressed_file_data_sink_impl(output_stream<char> out, sstables::compression* cm, sstables::local_compression lc)
            : _out(std::move(out))
            , _compression_metadata(cm)
            , _offsets(_compression_metadata->offsets.get_writer())
            , _compression(lc)
            , _full_checksum(ChecksumType::init_checksum())

_compression_metadata points to a buffer held by the sstable object.
and _compression_metadata->offsets.get_writer returns a writer that keeps
a reference to the segmented_offsets in the sstables::compression
that is used in the ~writer -> close path.

Fixes #7821

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201227145726.33319-1-bhalevy@scylladb.com>
2020-12-27 17:02:13 +02:00
Pavel Emelyanov
387889315e mutation-partition: Relax putting a dummy entry into a continuous range
When applying a mutation partition to another, if a dummy entry
from the source falls into a destination continuous range, it
can be just dropped. However, the current implementation still
inserts it and then instantly removes it.

Relax this code-flow by dropping the unwanted entry without
tossing it.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224130438.11389-1-xemul@scylladb.com>
2020-12-27 14:47:32 +02:00
Amos Kong
9adc6f68ee scylla_setup: cleanup if judgments
This patch merged two nested if judgments.

Signed-off-by: Amos Kong <amos@scylladb.com>
2020-12-26 04:45:25 +08:00
Amos Kong
632b01ce4e scylla_setup: enable node_exporter for offline installation
node_exporter had been added to scylla-server package by commit
95197a09c9.

So we can enable it by default for offline installation.

Signed-off-by: Amos Kong <amos@scylladb.com>
2020-12-25 10:54:31 +08:00
Pavel Emelyanov
72c2482f73 mutation-partition: Construct rows_entry directly from clustering_row
When a rows_entry is added to row_cache it's constructed from
clustering_row  by unpacking all its internals and putting
them into the rows_entry's deletable_row. There's a shorter
way -- the clustering_row already has the deletable_row onboard
from which rows_entry can copy-construct its own.

This keeps the rows_entry and deletable_row set of
constructors a bit shorter.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224161112.20394-1-xemul@scylladb.com>
2020-12-24 18:13:44 +02:00
Avi Kivity
8f06a687b4 Merge "idl: minor improvements to idl compiler" from Pavel S
"
This series does a lot of cleanups, dead code removal, and most
importantly fixes the following things in IDL compiler tool:
 * The grammar now rejects invalid identifiers; previously,
   in some cases, it allowed writing things like `std:vector`.
 * Error reporting is improved significantly and failures are
   now pointing to the place of failure much more accurately.
   This is done by restricting rule backtracking on those rules
   which don't need it.
"

* 'idl-compiler-minor-fixes-v4' of https://github.com/ManManson/scylla:
  idl: move enum and class serializer code writers to the corresponding AST classes
  idl: extract writer functions for `write`, `read` and `skip` impls for classes and enums
  idl: minor fixes and code simplification
  idl: change argument name from `hout` to `cout` in all dependencies of `add_visitors` fn
  idl: fix parsing of basic types and discard unneeded terminals
  idl: remove unused functions
  idl: improve error tracing in the grammar and tighten-up some grammar rules
  idl: remove redundant `set_namespace` function
  idl: remove unused `declare_class` function
  idl: slightly change `str` and `repr` for AST types
  idl: place directly executed init code into if __name__=="__main__"
2020-12-24 15:14:09 +02:00
Takuya ASADA
95197a09c9 dist: add node_exporter to scylla-server package
To support connection-less environments, we need to add the node_exporter
binary to the scylla-server package instead of downloading it from the internet.

Related #7765
Fixes #2190

Closes #7796
2020-12-24 11:44:13 +02:00
Pavel Solodovnikov
219ac2bab5 large_data_handler: fix segmentation fault when constructing data_value from a nullptr
It turns out that `cql_table_large_data_handler::record_large_rows`
and `cql_table_large_data_handler::record_large_cells` were broken
for reporting static cells and static rows from the very beginning:

In case a large static cell or a large static row is encountered,
it tries to execute `db::try_record` with `nullptr` additional values,
denoting that there is no clustering key to be recorded.

These values are next passed to `qctx.execute_cql()`, which
creates `data_value` instances for each statement parameter,
hence invoking `data_value(nullptr)`.

This uses the `const char*` overload, which delegates to the
`std::string_view` ctor overload. It is UB to pass a `nullptr`
pointer to the `std::string_view` ctor, hence the
segmentation faults in the aforementioned large data reporting
code.

What we want here is to make a null `data_value` instead, so
just add an overload specifically for `std::nullptr_t`, which
will create a null `data_value` with `text` type.

A regression test is provided for the issue (written in
`cql-pytest` framework).

Tests: test/cql-pytest/test_large_cells_rows.py

Fixes: #6780

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201223204552.61081-1-pa.solodovnikov@scylladb.com>
2020-12-24 11:37:43 +02:00
Nadav Har'El
79faaa34c7 alternator test: confirm that list index can't be a reference
In Alternator's expression parser in alternator/expressions.g, a list can be
indexed by a '[' INTEGER ']'. I had doubts whether maybe a value-reference
for the index, e.g., "something[:xyz]", should also work. So this patch adds
a test that checks whether "something[:xyz]" works, and confirms that both
DynamoDB and Alternator don't accept it and consider it a syntax error.

So Alternator's parser is correct to insist that the index be a literal
integer.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201214100302.2807647-1-nyh@scylladb.com>
2020-12-24 11:37:29 +02:00
Piotr Sarna
b62457d5b0 test: add verification to using timeout prepared statements
Previously the test cases only verified that the queries
did not time out with sufficiently large timeout, but now
they also check that appropriate data is inserted and can be read.
Message-Id: <8bc979434fce977c30d8516dc82789d4fe317696.1608734455.git.sarna@scylladb.com>
2020-12-24 11:37:29 +02:00
Piotr Sarna
1577e6f632 test: add cases for using timeout with batches
The test suite for USING TIMEOUT already included SELECT,
INSERT and UPDATE statements, but missed batches. The suite
is now updated to include batch tests.

Tests: unit(dev)
Message-Id: <a6738d2ed3d62681615523d01109362766c90325.1608734455.git.sarna@scylladb.com>
2020-12-24 11:37:29 +02:00
Piotr Sarna
4eb41b7d56 test: use random keys in tests for USING TIMEOUT
Since the tables are written to and it's possible to run
multiple test cases concurrently, the cases now use pseudorandom
keys instead of hardcoded values.
Message-Id: <d864dbb096360c17cdc2ebd8e79bfd983c19910e.1608734455.git.sarna@scylladb.com>
2020-12-24 11:37:29 +02:00
Avi Kivity
0bbd78037f Update seastar submodule
* seastar 2bd8c8d088...1f5e3d3419 (5):
  > Merge "Avoid fair-queue rovers overflow if not configured" from Pavel E
  > doc: add a coroutines section to the tutorial
  > Merge "tests/perf: add random-seed config option" from Benny
  > iotune: Print parameters affecting the measurement results
  > cook: Add patch cmd for ragel build (signed char confusion on aarch64)
2020-12-24 11:37:29 +02:00
Piotr Sarna
3b26fc01c2 alternator: coroutinize untagging a resource
Historically, a seastar thread was used for this request
because it's not on a critical path, but a coroutine makes
the code simpler.
2020-12-23 15:53:57 +01:00
Piotr Sarna
1ca39cc8c1 alternator: coroutinize tagging a resource
Historically, a seastar thread was used for this request
because it's not on a critical path, but a coroutine makes
the code simpler.
2020-12-23 15:53:57 +01:00
Pavel Solodovnikov
3a91f1127d idl: move enum and class serializer code writers to the corresponding AST classes
Expand the role of AST classes to also supply methods for actually
generating the code. More changes will follow eventually until
all generation code is handled by these classes.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-22 23:23:12 +03:00
dgarcia360
fd5f0c3034 docs: add organization
Closes #7818
2020-12-22 15:33:31 +02:00
Pekka Enberg
ceb67e7728 dist/docker: Remove 'epel-release' from Docker image
We no longer need the 'epel-release' package for anything as our
scylla-server package bundles all the necessary dependencies.

Closes #7823
2020-12-22 14:55:17 +02:00
Avi Kivity
e2dfa24540 Merge "token_metadata: add clear_gently" from Benny
"
We've encountered a number of reactor stalls
related to token_metadata that were fixed
in 052a8d036d.

This is a follow-up series that adds a clear_gently
method to token_metadata that uses continuations
to prevent reactor stalls when destroying token_metadata
objects.

Test: unit(dev), {network_topology_strategy,storage_proxy}_test(debug)
"

* tag 'token_metadata_clear_gently-v3' of github.com:bhalevy/scylla:
  token_metadata: add clear_gently
  token_metadata: shared_token_metadata: add mutate_token_metadata
  token_metdata: futurize update_normal_tokens
  abstract_replication_strategy: get_pending_address_ranges: invoke clone_only_token_map if can_yield
  repair: replace_with_repair: convert to coroutine
2020-12-22 13:23:31 +02:00
Nadav Har'El
f2978e1873 cql-pytest: port Cassandra's collection_test.py
A previous patch added test/cql-pytest/cassandra_tests - a framework for
porting Cassandra's unit tests to Python - but only ported two tiny test
files with just 3 tests.  In this patch, we finally port a much larger
test file validation/entities/collection_test.java. This file includes
50 separate tests, which cover a lot of aspects of collection support,
as well as how other stuff interacts with collections.

As of now, 23 (!) of these 50 tests fail, and exposed six new issues
in Scylla which I carefully documented:

Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Refs #7740: CQL prepared statements incomplete support for "unset" values
Refs #7743: Restrictions missing support for "IN" on tables with
            collections, added in Cassandra 4.0
Refs #7745: Length of map keys and set items are incorrectly limited to 64K
            in unprepared CQL
Refs #7747: Handling of multiple list updates in a single request differs
            from recent Cassandra
Refs #7751: Allow selecting map values and set elements, like in
            Cassandra 4.0

These issues vary in severity - some are simply new Cassandra 4.0 features
that Scylla never implemented, but one (#7740) is an old Cassandra 2.2
feature which it seems we did not implement correctly in some cases that
involve collections.

Note that there are some things that the ported tests do not include.
In a handful of places there are things which the Python driver checks,
before sending a request - not giving us an opportunity to check how
the server handles such errors. Another notable change in this port is
that the original tests repeated a lot of tests with and without a
"nodetool flush". In this port I chose to stub the flush() function -
it does NOT flush. I think the point of these tests is to check the
correctness of the CQL features - *not* to verify that memtable flush
works correctly. Doing a real memtable flush is not only slow, it also
doesn't really check much (Scylla may still serve data from cache,
not sstables). So I decided it is pointless.

An important goal of this patch is that all 50 tests (except three
skipped tests because Python has client-side checking), pass when
run on Cassandra (with test/cql-pytest/run-cassandra). This is very
important: It was very easy to make mistakes while porting the tests,
and I did make many such mistakes; but running them against Cassandra
allowed me to fix those mistakes - because the correct tests should
pass on Cassandra. And now they do.

Unfortunately, the new tests are significantly slower than what we've
been accustomed to in Alternator/CQL tests. The 50 tests create more than a
hundred tables, udfs, udts, and similar slow operations - they do not
reuse anything via fixtures. The total time for these 50 tests (in dev
build mode) is around 18 seconds. Just one test - testMapWithLargePartition
is responsible for almost half (!) of that time - we should consider in
the future whether it's worth it or can be made smaller.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201215155802.2867386-1-nyh@scylladb.com>
2020-12-22 13:22:09 +02:00
Avi Kivity
5a33ce58a7 Update seastar submodule
* seastar 3b8903d406...2bd8c8d088 (8):
  > core: remove unused chrono.h reference
  > cmake: force cxx standard if dialect is specified
  > queue: add front()
  > coroutine: deprecate coroutine forwarding
  > memory: Use 2^n sizes when searching for preferred span size
  > shared_ptr: define debug_shared_ptr_counter_type constructor as noexcept
  > install-dependencies: add pkg-config to Debian/Ubuntu packages
  > log: do_log: prevent garbling due to context switch
2020-12-22 13:22:09 +02:00
Benny Halevy
322aa2f8b5 token_metadata: add clear_gently
clear_gently gently clears the token_metadata members.
It uses continuations to allow yielding if needed
to prevent reactor stalls.
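
A rough Python analogy of the idea, with asyncio standing in for
Seastar continuations. This clear_gently and its batch parameter are
illustrative assumptions, not the token_metadata method.

```python
import asyncio

async def clear_gently(items, batch=256):
    """Drop entries in small batches, yielding to the event loop between batches."""
    while items:
        for _ in range(min(batch, len(items))):
            items.pop()
        await asyncio.sleep(0)  # give other tasks a chance to run

tokens = list(range(10_000))
asyncio.run(clear_gently(tokens))
```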

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 11:22:21 +02:00
Benny Halevy
56aa49ca81 token_metadata: shared_token_metadata: add mutate_token_metadata
mutate_token_metadata acquires the shared_token_metadata lock,
clones the token_metadata (using clone_async)
and calls an asynchronous functor on
the cloned copy of the token_metadata to mutate it.

If the functor is successful, the mutated clone
is set back to the shared_token_metadata,
otherwise, the clone is destroyed.
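
The lock, clone, mutate and publish steps can be sketched in Python
as follows. This is only an analogy built on asyncio: SharedState,
mutate and the two example functors are hypothetical names, and
copy.deepcopy stands in for clone_async.

```python
import asyncio
import copy

class SharedState:
    """Hypothetical holder: a lock plus the currently published value."""

    def __init__(self, value):
        self._lock = asyncio.Lock()
        self._value = value

    def get(self):
        return self._value

    async def mutate(self, func):
        """Run async func on a private clone; publish it only if func succeeds."""
        async with self._lock:
            clone = copy.deepcopy(self._value)
            await func(clone)    # an exception here leaves _value untouched
            self._value = clone  # reached only on success

async def add_token(tokens):
    tokens["t1"] = "node-a"

async def failing(tokens):
    tokens["t2"] = "node-b"
    raise RuntimeError("mutation failed")

async def main():
    shared = SharedState({})
    await shared.mutate(add_token)
    try:
        await shared.mutate(failing)
    except RuntimeError:
        pass  # the failed clone is simply dropped
    return shared.get()

result = asyncio.run(main())
```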

With that, get rid of shared_token_metadata::clone

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 11:22:19 +02:00
Benny Halevy
e089c22ec1 token_metdata: futurize update_normal_tokens
The function complexity is O(#tokens) in the worst case,
as for each endpoint token it traverses _token_to_endpoint_map
linearly to erase the endpoint mapping if it exists.

This change renames the current implementation of
update_normal_tokens to update_normal_tokens_sync
and clones the code as a coroutine that returns a future
and may yield if needed.

Eventually we should futurize the whole token_metadata
and abstract_replication_strategy interface and get rid
of the synchronous functions.  Until then the sync
version is still required from call sites that
are neither returning a future nor run in a seastar thread.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 10:35:15 +02:00
Benny Halevy
e7f4cd89a9 abstract_replication_strategy: get_pending_address_ranges: invoke clone_only_token_map if can_yield
Optimize the can_yield case by invoking the futurized version
of clone_only_token_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 09:49:08 +02:00
Benny Halevy
55316df6bf repair: replace_with_repair: convert to coroutine
Prepare for futurizing update_normal_tokens.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 09:49:08 +02:00
Piotr Sarna
da7e87dc56 test: add cases for using timeout with bind markers
The test suite for USING TIMEOUT already included binding
the timeout value, but only for wildcard (?). The test case
is now extended with named bind markers.

Tests: unit(dev)
Message-Id: <b5344f40d26d90b36e90a04c2474127728535eaa.1608573624.git.sarna@scylladb.com>
2020-12-22 09:03:56 +02:00
Pekka Enberg
961b9e8390 install.sh: Add seastar-cpu-map.sh to $PATH
Add the seastar-cpu-map.sh to the SBINFILES variable, which is used to
create symbolic links to scripts so that they appear in $PATH.

Please note that there are additional Python scripts (like perftune.py),
which are not in $PATH. That's because Python scripts are handled
separately in "install.sh" and no Python script has a "sbin" symlink. We
might want to change this in the future, though.

Fixes #6731

Closes #7809
2020-12-21 14:12:27 +02:00
Avi Kivity
0f7b6dd180 utils: managed_bytes: introduce with_linearized()
This is a temporary scaffold for weaning ourselves off
linearization. It differs from with_linearized_managed_bytes in
that it does not rely on the environment (linearization_context)
and so is easier to remove.
2020-12-20 15:14:44 +01:00
Avi Kivity
c37e495958 utils: managed_bytes: constrain with_linearized_managed_bytes()
The passed function must be called with no parameters; document
and enforce it.
2020-12-20 15:14:44 +01:00
Avi Kivity
a1df1b3c34 utils: managed_bytes: avoid internal uses of managed_bytes::data()
We use managed_bytes::data() in a few places when we know the
data is non-fragmented (such as when the small buffer optimization
is in use). We'd like to remove managed_bytes::data() as linearization
is bad, so in preparation for that, replace internal uses of data()
with the equivalent direct access.
2020-12-20 15:14:44 +01:00
Avi Kivity
72a2554a86 utils: managed_bytes: extract do_linearize_pure()
do_linearize() is an impure function as it changes state
in linearization_context. Extract the pure parts into a new
do_linearize_pure(). This will be used to linearize managed_bytes
without a linearization_context, during the transition period where
fragmented and non-fragmented values coexist.
2020-12-20 15:14:44 +01:00
Avi Kivity
4b3f0fd7c0 thrift: do not depend on implicit conversion of keys to bytes_view
This implicit conversion will soon be gone, as it is dangerous.
Ask for the representation explicitly.
2020-12-20 15:14:44 +01:00
Avi Kivity
8521248955 clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view
This implicit conversion will soon be gone, as it is dangerous.
Ask for the representation explicitly.
2020-12-20 15:14:44 +01:00
Avi Kivity
1dd6d7029a cql3: expression: linearize get_value_from_mutation() earlier
do_get_value() is careful to return a fragmented view, but
its only caller get_value_from_mutation() linearizes it immediately
afterwards. Linearize it sooner; this prevents mixing in
fragmented values from cells (now via IMR) and fragmented values
from partition/clustering keys. It only works now because
keys are not fragmented outside LSA, and value_view has a special
case for single-fragment values.

This helps when keys become fragmented.
2020-12-20 15:14:44 +01:00
Avi Kivity
b59a21967c bytes: add to_bytes(bytes)
Converting from bytes to bytes is nonsensical, but it helps
when transitioning to other types (managed_bytes/managed_bytes_view),
and these types will have to_bytes() conversions.
2020-12-20 15:14:44 +01:00
Avi Kivity
28126257c2 cql3: expression: mark do_get_value() as static
It is used only later in this file.
2020-12-20 15:14:44 +01:00
Avi Kivity
b3e39d81aa Merge 'Avoid scanning sstables in parallel for TWCS single-partition queries' from Kamil Braun
We introduce a new single-key sstable reader for sstables created by `TimeWindowCompactionStrategy`.

The reader uses the fact that sstables created by TWCS are mostly disjoint with respect to the contained `position_in_partition`s in order to avoid having multiple sstable readers opened at the same time unnecessarily. In case there are overlapping ranges (for example, in the current time-window), it performs the necessary merging (it uses `clustering_order_reader_merger`, introduced recently).

The reader uses min/max clustering key metadata present in `md` sstables in order to decide when to open or close a sstable reader.

The following experiment was performed:
1. create a TWCS table with 1 minute windows
2. fill the table with 8 equal windows of data
   (each window flushed to a separate sstable)
3. perform `select * from ks.t where pk = 0 limit 1` query
   with and without the change

The expectation is that with the commit, only one sstable will be opened
to fetch that one row; without the commit all 8 sstables would be opened at once.
The difference in the value of `scylla_reactor_aio_bytes_read` was measured
(value after the query minus value before the query), both with and without the commit.

With the commit, the difference was 67584.
Without the commit, the difference was 528384.
528384 / 67584 ~= 7.8.

Fixes #6418.

Closes #7437

* github.com:scylladb/scylla:
  sstables: gather clustering key filtering statistics in TWCS single key reader
  sstables: use time_series_sstable_set in time_window_compaction_strategy
  sstable_set: new reader for TWCS single partition queries
  mutation_reader_test: test clustering_order_reader_merger with time_series_sstable_set
  sstable_set: introduce min_position_reader_queue
  sstable_set: introduce time_series_sstable_set
  sstables: add min_position and max_position accessors
  sstable_set: make create_single_key_sstable_reader a virtual method
  clustering_order_reader_merger: fix the 0 readers case
2020-12-19 23:53:18 +02:00
Kamil Braun
53414558a1 sstables: gather clustering key filtering statistics in TWCS single key reader 2020-12-18 16:33:27 +01:00
Kamil Braun
4f2d45001c sstables: use time_series_sstable_set in time_window_compaction_strategy
The following experiment was performed:
1. create a TWCS table with 1 minute windows
2. fill the table with 8 windows of data
   (each window flushed to a separate sstable)
3. perform `select * from ks.t where pk = 0 limit 1` query
   with and without the change

The expectation is that with the commit, only one sstable will be opened
to fetch that one row; without the commit all 8 sstables would be opened at once.
The difference in the value of `scylla_reactor_aio_bytes_read` was measured
(value after the query minus value before the query), both with and without the commit.

With the commit, the difference was 67584.
Without the commit, the difference was 528384.
528384 / 67584 ~= 7.8.

Fixes https://github.com/scylladb/scylla/issues/6418.
2020-12-18 16:33:27 +01:00
Kamil Braun
f0842ba34e sstable_set: new reader for TWCS single partition queries
This commit introduces a new implementation of `create_single_key_sstable_reader`
in `time_series_sstable_set` dedicated for TWCS-created sstables.

It uses the fact that such sstables are mostly disjoint with respect to
contained `position_in_partition`s in order to decrease the number of
sstable readers that are opened at the same time.

The implementation uses `clustering_order_reader_merger` under the hood.

The reader assumes that the schema does not have static columns and none
of the queried sstables contain partition tombstones; also, it assumes
that the sstables have the min/max clustering key metadata in order for
the implementation to be efficient. Thus, if we detect that some of
these assumptions aren't true, we fall back to the old implementation.
2020-12-18 16:33:27 +01:00
Kamil Braun
b41139a07f mutation_reader_test: test clustering_order_reader_merger with time_series_sstable_set 2020-12-18 16:33:27 +01:00
Kamil Braun
d0548aa77f sstable_set: introduce min_position_reader_queue
This is a queue of readers of sstables in a time_series_sstable_set,
returning the readers in order of the smallest position_in_partition
that the sstables have. It uses the min/max clustering key sstable
metadata.

The readers are opened lazily, at the moment of being returned.
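
A minimal Python sketch of the idea, not the actual C++ implementation: the
class name mirrors the commit title, but the `pop(bound)` interface, the
integer positions and the `make_opener` helper are assumptions made for
illustration only. It shows readers being handed out in order of the smallest
position and being opened only when they are returned.

```python
import heapq
from typing import Callable, List, Tuple


class MinPositionReaderQueue:
    """Illustrative stand-in, not the real class: hands out readers ordered
    by the smallest position their sstable contains."""

    def __init__(self, sstables: List[Tuple[int, Callable[[], str]]]):
        # Each entry is (min_position, open_reader). The index breaks ties so
        # that the heap never has to compare two callables.
        self._heap = [(pos, i, opener) for i, (pos, opener) in enumerate(sstables)]
        heapq.heapify(self._heap)

    def pop(self, bound: int) -> List[str]:
        """Open and return every reader whose min position is <= bound."""
        readers = []
        while self._heap and self._heap[0][0] <= bound:
            _, _, opener = heapq.heappop(self._heap)
            readers.append(opener())  # the reader is created only here
        return readers


opened = []  # records the order in which readers were actually opened


def make_opener(name: str) -> Callable[[], str]:
    def opener() -> str:
        opened.append(name)
        return name
    return opener


queue = MinPositionReaderQueue(
    [(30, make_opener("c")), (10, make_opener("a")), (20, make_opener("b"))]
)
first = queue.pop(10)   # only the sstable starting at position 10 is opened
rest = queue.pop(30)    # the remaining two are opened in position order
```

A heap keyed on the minimum position keeps each `pop` cheap; sstables whose
minimum position lies beyond the bound stay unopened.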
2020-12-18 16:33:27 +01:00
Kamil Braun
52697022b0 sstable_set: introduce time_series_sstable_set
At this moment it is a slightly less efficient version of
bag_sstable_set, but in following commits we will use the new data
structures to gain advantage in single partition queries
for sstables created by TimeWindowCompactionStrategy.
2020-12-18 16:33:27 +01:00
Kamil Braun
2a160dd909 sstables: add min_position and max_position accessors
The methods return a lower-bound and an upper-bound for the
position-in-partitions appearing in a given sstable.
2020-12-18 16:33:27 +01:00
Kamil Braun
fe26da82ba sstable_set: make create_single_key_sstable_reader a virtual method
... of sstable_set_impl.

Soon we shall provide a specialized implementation in one of the
`sstable_set_impl` derived classes.

The existing implementation is used as the default one.
2020-12-18 12:31:16 +01:00
Kamil Braun
5e846b33b8 clustering_order_reader_merger: fix the 0 readers case
With 0 readers the merger would produce a `partition_end` fragment
when it should immediately return `end_of_stream` instead.
2020-12-18 12:30:40 +01:00
Gleb Natapov
85cffd1aeb lwt: rewrite storage_proxy::cas using coroutines
Makes code much simpler to understand.

Message-Id: <20201201160213.GW1655743@scylladb.com>
2020-12-17 18:15:35 +01:00
Avi Kivity
a60c81b615 Merge 'cql3: Fix handling of impossible restrictions on a primary-key column' from Dejan Mircevski
There were two problems with handling conflicting equalities on the same PK column (eg, c=1 AND c=0):
1. When the column is indexed, Scylla crashed (#7772)
2. Computing ranges and slices was throwing an exception

This series fixes them both; it also happens to resolve some old TODOs from restriction_test.

Tests: unit (dev, debug)

Closes #7804

* github.com:scylladb/scylla:
  cql3: Fix value_for when restriction is impossible
  cql3: Fix range computation for p=1 AND p=1
2020-12-17 12:01:36 +02:00
Dejan Mircevski
46b4b59945 cql3: Fix value_for when restriction is impossible
Previously, single_column_restrictions::value_for() assumed that a
column's restriction specifies exactly one value for the column.  But
since 37ebe521e3, multiple equalities on the same column are allowed,
so the restriction could be a conjunction of conflicting
equalities (eg, c=1 AND c=0).  That violates an assert and crashes
Scylla.

This patch fixes value_for() by gracefully handling the
impossible-restriction case.
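
To illustrate what an impossible restriction looks like, here is a small
Python sketch. It is not the C++ code of this patch: the function below is a
hypothetical model that treats the restriction as a list of equality values
and returns `None` when they conflict, which is only one plausible way of
handling the case gracefully.

```python
from typing import Iterable, Optional


def value_for(equalities: Iterable[int]) -> Optional[int]:
    """Return the single value allowed by a conjunction of equalities on one
    column, or None if no single value satisfies it (hypothetical model)."""
    values = set(equalities)
    if len(values) == 1:
        return next(iter(values))
    # Conflicting equalities (e.g. c=1 AND c=0) or no equality at all:
    # report "no value" instead of asserting.
    return None


same = value_for([1, 1])      # c=1 AND c=1 still pins c to 1
conflict = value_for([1, 0])  # c=1 AND c=0 can never match
```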

Fixes #7772

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-12-16 15:00:29 -05:00
Dejan Mircevski
4bb1107652 cql3: Fix range computation for p=1 AND p=1
Previously compute_bounds was assuming that primary-key columns are
restricted by exactly one equality, resulting in the following error:

query 'select p from t where p=1 and p=1' failed:
 std::bad_variant_access (std::get: wrong index for variant)

This patch removes that assumption and deals correctly with the
multiple-equalities case.  As a byproduct, it also stops raising
"invalid null value" exceptions for null RHS values.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-12-16 14:46:48 -05:00
Pavel Solodovnikov
edf9ccee48 idl: extract writer functions for write, read and skip impls for classes and enums
Split `write`, `read` and `skip` serializer function writers to
separate functions in `handle_class` and `handle_enum` functions,
which slightly improves readability.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 20:33:55 +03:00
Pavel Solodovnikov
8049cb0f91 idl: minor fixes and code simplification
* Introduce `ns_qualified_name` and `template_params_str` functions
  to simplify code a little bit in `handle_enum` and `handle_class`
  functions.
* Previously each serializer had a separate namespace open-close
  statements, unify them into a single namespace scope.
* Fix a few more `hout` -> `cout` argument names.
* Rename `template` pattern to `template_decl` to improve clarity.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:32:08 +03:00
Pavel Solodovnikov
0de96426db idl: change argument name from hout to cout in all dependencies of add_visitors fn
Prior to the patch all functions that are called from `add_visitors`
and this function itself declared the argument denoting the output
file as `hout`. This was quite misleading, though, since `hout`
is meant to be the header file with declarations, while `cout` is the
implementation file.

These functions write to the implementation file, hence `hout` should
be changed to `cout` to avoid confusion.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:32:03 +03:00
Pavel Solodovnikov
0defb52855 idl: fix parsing of basic types and discard unneeded terminals
Prior to the patch `btype` production was using `with_colon`
rule, which accidentally supported parsing both numbers and
identifiers (along with other invalid inputs, such as "123asd").

It was changed to use `ns_qualified_ident`, and those places
which can accept numeric constants now explicitly list
it as an alternative, e.g. the template parameter list.

Unfortunately, I had to make TemplateType explicitly construct
`BasicType` instances from numeric constants in the template argument
list. This is exactly the way it was handled before, though.

But nonetheless, this should be addressed sometime later.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:31:57 +03:00
Pavel Solodovnikov
0cc87ead3d idl: remove unused functions
Remove the following functions since they are not used:
* `open_namespaces`
* `close_namespaces`
* `flat_template`

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:31:51 +03:00
Pavel Solodovnikov
bea965a0a7 idl: improve error tracing in the grammar and tighten-up some grammar rules
This patch replaces use of some handwritten rules to use their
alternatives already defined in `pyparsing.pyparsing_common`
class, i.e.: `number`, `identifier` productions.

Changed ignore patterns for comments to use pre-defined
`pp.cppStyleComment` instead of hand-written combination of
'//'-style and C-style comment rules.

Operator '-' is now used whenever possible to improve debugging
experience: it disables default backtracking for productions
so that compiler fails earlier and can now point more precisely
to a place in the input string where it failed instead of
backtracking to the top-level rule and reporting error there.

Template names and class names now use `ns_qualified_ident`
rule instead of `with_colon` which prevents grammar from
matching invalid identifiers, such as `std:vector`.

Many places are using the updated `identifier` production, which
is working correctly unlike its predecessor: now inputs
such as `1ident` are considered invalid.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:31:46 +03:00
Pavel Solodovnikov
3a037bc5b6 idl: remove redundant set_namespace function
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:31:40 +03:00
Pavel Solodovnikov
e76e8aec0e idl: remove unused declare_class function
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:31:35 +03:00
Pavel Solodovnikov
745f4ac23b idl: slightly change str and repr for AST types
Surround string representation with angle brackets. This improves
readability when printing debug output.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:31:20 +03:00
Pavel Solodovnikov
4a61270701 idl: place directly executed init code into if __name__=="__main__"
Since the idl compiler is not intended to be used as a module by other
python build scripts, move initialization code under an if checking
that current module name is "__main__".
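
For reference, the general shape of this idiom is sketched below. The
`main` function and its return value are placeholders for illustration, not
the actual idl-compiler code.

```python
import sys
from typing import List


def main(argv: List[str]) -> str:
    # Placeholder for the real work (argument parsing, code generation, ...).
    return f"compiling {len(argv)} file(s)"


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    print(main(sys.argv[1:]))
```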

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 19:30:33 +03:00
Gleb Natapov
37368726c9 migration_manager: remove unused announce() variant
Message-Id: <20201216153150.GG3244976@scylladb.com>
2020-12-16 18:14:07 +02:00
Konstantin Osipov
2c46938c2a commitlog: avoid a syscall in a most common case of segment recycle
When recycling a segment in O_DSYNC mode if the size of the segment
is neither shrunk nor grown, avoid calling file::truncate() or
file::allocate().

Message-Id: <20201215182332.1017339-2-kostja@scylladb.com>
2020-12-16 14:57:36 +02:00
Avi Kivity
fdb47c954d Merge "idl: allow IDL compiler to parse const specifiers for template arguments" from Pavel S
"
This patch series consists of the following patches:

1. The first one turned out to be a massive rewrite of almost
everything in `idl-compiler.py`. It aims to decouple parser
structures from the internal representation which is used
in the code-generation itself.

Prior to the patch everything was working with raw token lists and
the code was extremely fragile and hard to understand and modify.

Moreover, every change in the parser code caused a cascade effect
of breaking things at many different places, since they were relying
on the exact format of output produced by parsing rules.

Now there is a bunch of supplementary AST structures which provide
hierarchical and strongly typed structure as the output of parsing
routine.
It is much easier to verify (by the means of `isinstance`, for example)
and extend since the internal structures used in code-generation are
decoupled from the structure of parsing rules, which are now controlled
by custom parse actions providing high-level abstractions.

It is tested manually by checking that the old code produces exactly
the same autogenerated sources for all Scylla IDLs as the new one.

2 and 3. Cosmetic changes only: fixed a few typos and moved from
old-fashioned `string.Template` to python f-strings.

This improves readability of the idl-compiler code by a lot.

Only one non-functional whitespace change introduced.

4. This patch adds a very basic support for the parser to
understand `const` specifier in case it's used with a template
parameter for a data member in a class, e.g.

    struct my_struct {
        std::vector<const raft::log_entry> entries;
    };

It actually does two things:
* Adjusts `static_asserts` in corresponding serializer methods
  to match const-ness of fields.
* Defines a second serializer specialization for const type in
  `.dist.hh` right next to non-const one.

This seems to be sufficient for raft-related uses for now.
Please note there is no support for the following cases, though:

    const std::vector<raft::log_entry> entries;
    const raft::term_t term;

None of the existing IDLs are affected by the change, so that
we can gradually improve on the feature and write the idl
unit-tests to increase test coverage with time.

5. A basic unit-test that writes a test struct with an
`std::vector<S<const T>>` field and reads it back to verify
that serialization works correctly.

6. Basic documentation for AST classes.
TODO: should also update the docs in `docs/IDL.md`. But it is already
quite outdated, and some changes would even be out of scope for this
patch set.
"

* 'idl-compiler-refactor-v5' of https://github.com/ManManson/scylla:
  idl: add docstrings for AST classes
  idl: add unit-test for `const` specifiers feature
  idl: allow to parse `const` specifiers for template arguments
  idl: fix a few typos in idl-compiler
  idl: switch from `string.Template` to python f-strings and format string in idl-compiler
  idl: Decouple idl-compiler data structures from grammar structure
2020-12-16 14:05:33 +02:00
Gleb Natapov
61520a33d6 mutation_writer: pass exceptions through feed_writer
feed_writer() eats an exception and transforms it into an end of stream
instead. Downstream validators hate when this happens.

Fixes #7482
Message-Id: <20201216090038.GB3244976@scylladb.com>
2020-12-16 13:18:19 +02:00
Pavel Solodovnikov
8b8dce15c3 idl: add docstrings for AST classes
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-16 09:03:39 +03:00
Botond Dénes
978ec7a4bb tools: introduce scylla-sstable-index
A tool which lists all partitions contained in an sstable index. As all
partitions in an sstable are indexed, this tool can be used to find out
what partitions are contained in a given sstable.

The printout has the following format:
$pos: $human_readable_value (pk{$raw_hex_value})

Where:
* $pos: the position of the partition in the (decompressed) data file
* $human_readable_value: the human readable partition key
* $raw_hex_value: the raw hexadecimal value of the binary representation
  of the partition key

For now the tool requires the types making up the partition key to be
specified on the command line, using the `--type|-t` command line
argument, using the Cassandra type class name notation for types.
As these are not assumed to be widely known, this patch includes a
document mapping all cql3 types to their Cassandra type class name
equivalent (but not just).

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201208092323.101349-1-bdenes@scylladb.com>
2020-12-15 18:46:47 +02:00
Calle Wilund
71c5dc82df database: Verify iff we actually are writing memtables to disk in truncate
Fixes #7732

When truncating with auto_snapshot on, we try to verify the low rp mark
from the CF against the sstables discarded by the truncation timestamp.
However, in a scenario like:

Fill memtables
Flush
Truncate with snapshot A
Fill memtables some more
Truncate
Move snapshot A to upload + refresh (load old tables)
Truncate

The last op will assert, because while we have sstables loaded, which
will be discarded now, we did not in fact generate any _new_ ones
(since memtables are empty), and the RP we get back from discard is
one from an earlier generation set.

(Any permutation of events that create the situation "empty memtable" +
"non-empty sstables with only old tables" will generate the same error).

Added a check, performed before flushing, for whether we actually have
any data; if not, the RP relation assert is not enforced.

Closes #7799
2020-12-15 16:24:36 +02:00
Avi Kivity
7636799b18 Merge 'Add waiting for flushes on table drops' from Piotr Sarna
This series makes sure that before the table is dropped, all pending memtable flushes related to its memtables would finish.
Normally, flushes are not problematic in Scylla, because all tables are by default `auto_snapshot=true`, which also implies that a table is flushed before being dropped. However, with `auto_snapshot=false` the flush is not attempted at all. It leads to the following race:
1. Run a node with `auto_snapshot=false`
2. Schedule a memtable flush  (e.g. via nodetool)
3. Get preempted in the middle of the flush
4. Drop the table
5. The flush that already started wakes up and starts operating on freed memory, which causes a segfault

Tests: manual(artificially preempting for a long time in bullet point 2. to ensure that the race occurs; segfaults were 100% reproducible before the series and do not happen anymore after the series is applied)

Fixes #7792

Closes #7798

* github.com:scylladb/scylla:
  database: add flushes to waiting for pending operations
  table: unify waiting for pending operations
  database: add a phaser for flush operations
  database: add waiting for pending streams on table drop
2020-12-15 16:02:47 +02:00
Pavel Solodovnikov
1e6df841a5 idl: add unit-test for const specifiers feature
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-15 16:03:18 +03:00
Pavel Solodovnikov
facf27dbe4 idl: allow to parse const specifiers for template arguments
This patch introduces very limited support for declaring `const`
template parameters in data members.

It's not covering all the cases, e.g.
`const type member_variable` and `const template_def<T1, T2, ...>`
syntax is not supported at the moment.

Though the changes are enough for raft-related use: this makes it
possible to declare `std::vector<raft::log_entries_ptr>` (aka
`std::vector<lw_shared_ptr<const raft::log_entry>>`) in the IDL.

Existing IDL files are not affected in any way.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-15 16:03:11 +03:00
Pavel Solodovnikov
f02703fcd7 idl: fix a few typos in idl-compiler
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-15 16:02:55 +03:00
Pavel Solodovnikov
28b602833f idl: switch from string.Template to python f-strings and format string in idl-compiler
Move to a modern and lightweight syntax of f-strings
introduced in python 3.6. It improves readability and provides
greater flexibility.

A few places are now using format strings instead, though.

When a multiline substitution variable is used, the template
string should first be re-indented and only then should the
formatting be applied; otherwise we can end up with broken
indentation in the generated sources.

This change introduces one invisible whitespace change
in `query.dist.impl.hh`, otherwise all generated code is exactly
the same.
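
The ordering issue with multiline substitutions can be reproduced with a small
stand-alone sketch. It uses `textwrap.dedent` and `str.format` as stand-ins;
the template, the `body` value and the variable names are made up for
illustration and are not taken from idl-compiler.

```python
import textwrap

# A multiline value that carries no indentation of its own.
body = "int a;\nint b;"

template = """
    struct {name} {{
    {body}
    }};
"""

# Re-indent the template first, then substitute: every line ends up aligned.
right = textwrap.dedent(template).format(name="s", body=body)

# Substitute first, then re-indent: the second line of `body` has no leading
# whitespace, so dedent() finds no common prefix and leaves the text as is.
wrong = textwrap.dedent(template.format(name="s", body=body))
```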

Tests: build(dev) and diff generated IDL sources by hand

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-15 16:01:17 +03:00
Pavel Solodovnikov
4ab1f7f55d idl: Decouple idl-compiler data structures from grammar structure
Instead of operating on the raw lists of tokens, transform them into
typed structures, which makes the code orders of magnitude simpler
to read, understand and extend.

This includes sweeping changes throughout the whole source code of the
tool, because almost every function was tightly coupled to the way
data was passed down from the parser right to the code generation
routines.

Tested manually by checking that old generated sources are precisely
the same as the new generated sources.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-12-15 15:59:17 +03:00
Piotr Sarna
b1208d0fcc database: add flushes to waiting for pending operations
In order to prevent races with table drops, the helper function
which waits for all pending operations to finish now also
waits for pending flushes.
2020-12-15 13:11:33 +01:00
Piotr Sarna
cd1e351dc1 table: unify waiting for pending operations
In order to reduce code duplication which already caused a bug,
waiting for pending operations is now unified with a single helper
function.
2020-12-15 13:11:25 +01:00
Piotr Sarna
df3204426d database: add a phaser for flush operations
Pending flushes can participate in races when a table
with auto_snapshot==false is dropped. The race is as follows:
1. A flush of table T is initiated
2. The flush operation is preempted
3. Table T is dropped without flushing, because it has auto_snapshot off
4. The flush operation from (2.) wakes up and continues
   working on table T, which is already dropped
5. Segfault/memory corruption

To prevent such races, a phaser for pending flushes is introduced.
2020-12-15 12:59:36 +01:00
Piotr Sarna
57d63ca036 database: add waiting for pending streams on table drop
We already wait for pending reads and writes, so for completeness
we should also wait for all pending stream operations to finish
before dropping the table to avoid inconsistencies.
2020-12-15 12:55:45 +01:00
Takuya ASADA
ebc4076fa5 tools: toolchain: add node_exporter
Download node_exporter in the frozen image to prepare for adding
node_exporter to the relocatable package.

Related #2190

Closes #7765

[avi: updated toolchain, x86_64/aarch64/s390x]
2020-12-14 20:34:17 +02:00
Piotr Sarna
13317f7698 alternator: ensure correct isolation level in tracing tests
Taking advantage of the fact that isolation level can be defined
for a table with a tag, the tracing test that relies on CAS
can now be sure to have a correct isolation level.
Message-Id: <43f005ab9d566c7d3d55ce93c553127b1df9e87f.1607954739.git.sarna@scylladb.com>
2020-12-14 17:37:55 +02:00
Piotr Sarna
7081e361cc test: add isolation level requirement message to tracing tests
Alternator tracing tests require the cluster to have the 'always'
isolation level configured to work properly. If that's not the case,
the tests will fail due to not having CAS-related traces present
in the logs. In order to help the users fix their configuration,
a helper message is printed before the test case is performed.
Automatic tests do not need this, because they are all run with
matching isolation level, but this message could greatly improve
the user experience for manual tests.
Message-Id: <62bcbf60e674f57a55c9573852b6a28f99cbf408.1607949754.git.sarna@scylladb.com>
2020-12-14 14:53:58 +02:00
Piotr Sarna
4b0303d8ae tests: make alternator tracing tests idempotent
The outcome of alternator tracing tests was that tracing probability
was always set to 0 after the test was finished. That makes sense
for most test runs, but manual tests can work on existing clusters
with tracing probability set to some other value. To preserve the
previous trace probability, the value is now extracted and stored,
so that it can be restored after the test is done.
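
The save-and-restore pattern can be sketched as a context manager. This is an
illustration only: `preserved`, the getter/setter callables and the `state`
dictionary are hypothetical and do not reflect how the actual tests talk to
the server.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def preserved(get_value: Callable[[], float],
              set_value: Callable[[float], None]) -> Iterator[float]:
    """Remember the current value and put it back when the block exits."""
    saved = get_value()
    try:
        yield saved
    finally:
        # Restored even if the test body raises.
        set_value(saved)


# Stand-in for a cluster whose trace probability was already non-zero.
state = {"trace_probability": 0.25}

with preserved(lambda: state["trace_probability"],
               lambda v: state.update(trace_probability=v)):
    state["trace_probability"] = 1.0  # what a tracing test would do
    during = state["trace_probability"]

after = state["trace_probability"]
```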
Message-Id: <94f829b63f92847b4abb3b16f228bf9870f90c2e.1607949754.git.sarna@scylladb.com>
2020-12-14 14:53:23 +02:00
Avi Kivity
19ff528ef3 Update seastar submodule
* seastar 2de43eb6bf...3b8903d406 (3):
  > coroutines: check preemption flag in co_await
  > memory: consider span freelist objects in small pool diagnostics
  > util: noncopyable_function: avoid gcc uninitialized error in move constructor
2020-12-14 12:50:32 +02:00
Pekka Enberg
8d00c16feb transport/server: Code cleanups
Fix up some coding style issues spotted while reading the code:

- Fix indentation to be 4 spaces

- Remove superfluous semicolons

Closes #7793
2020-12-14 12:48:05 +02:00
Konstantin Osipov
b6c6cc275f commitlog: align input of dma_write() during segment recycle
Normally the file size should be aligned to the block size, since
we never write an unaligned amount of data to it. However, we're not
protected against partial writes.

Just to be safe, align up the amount of bytes to zerofill
when recycling a segment.
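
Aligning up is the usual round-to-next-multiple computation. A minimal sketch,
assuming a power-of-two alignment; the 4096-byte block size in the examples is
an assumption for illustration, not a value taken from the patch.

```python
def align_up(n: int, alignment: int) -> int:
    """Round n up to the nearest multiple of alignment.

    The bit trick below requires alignment to be a positive power of two.
    """
    assert alignment > 0 and alignment & (alignment - 1) == 0
    return (n + alignment - 1) & ~(alignment - 1)


aligned_partial = align_up(4097, 4096)  # a partially written extra byte
aligned_exact = align_up(4096, 4096)    # already aligned sizes are unchanged
```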

Message-Id: <20201211142628.608269-4-kostja@scylladb.com>
2020-12-14 12:16:18 +02:00
Konstantin Osipov
ad6817bcde commitlog: fix typo in a comment
Message-Id: <20201211142628.608269-2-kostja@scylladb.com>
2020-12-14 12:16:14 +02:00
Benny Halevy
0e79e0f215 test: mutation_diff: extend section markers
When the different mutations are printed via
BOOST_REQUIRE_EQUAL, we don't get the "expect {} but got {}"
section markers.  Instead, the parts we're interested in
are bracketed like "critical check X == Y has failed [{} != {}]"

Test: with both formats:
- https://github.com/scylladb/scylla/files/3890627/test_concurrent_reads_and_eviction.log
- https://github.com/scylladb/scylla/files/4303117/flat_mutation_reader_test.118.log
- https://github.com/scylladb/scylla/files/5687372/flat_mutation_reader_test.172.log.gz

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201214100521.3814909-1-bhalevy@scylladb.com>
2020-12-14 12:11:34 +02:00
Nadav Har'El
72cb3e9255 alternator test: add missing wait for update_table to finish
Three tests in test_streams.py run update_table() on a table without
waiting for it to complete, and then call update_table() on the same
table or delete it. This always works in Scylla, and usually works in
AWS, but if we reach the second call, it may fail because the previous
update_table() did not take effect yet. We sometimes see these failures
when running the Alternator test suite against AWS.

So in this patch, after each update_table() we wait for the table
to return from UPDATING to ACTIVE status.
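
Such a wait is essentially a polling loop. The sketch below is generic and is
not the helper used in test_streams.py: `wait_for_active` and its `get_status`
callable are hypothetical, and in a real test the callable would have to query
the table status from the server.

```python
import time
from typing import Callable


def wait_for_active(get_status: Callable[[], str],
                    timeout: float = 60.0,
                    interval: float = 1.0) -> int:
    """Poll get_status() until it returns "ACTIVE"; return the poll count."""
    deadline = time.monotonic() + timeout
    polls = 0
    while True:
        polls += 1
        if get_status() == "ACTIVE":
            return polls
        if time.monotonic() >= deadline:
            raise TimeoutError("table did not become ACTIVE in time")
        time.sleep(interval)


# Simulated sequence of statuses as a table update takes effect.
statuses = iter(["UPDATING", "UPDATING", "ACTIVE"])
polls_needed = wait_for_active(lambda: next(statuses), timeout=5.0, interval=0.0)
```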

The entire Alternator test suite now passes (or is skipped) on AWS,
so: Fixes #7778.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213164931.2767236-1-nyh@scylladb.com>
2020-12-14 09:18:38 +01:00
Nadav Har'El
43ce0aef3d alternator test: fix test wrongly failing on AWS
The test test_query_filter.py::test_query_filter_paging fails on AWS
and shouldn't fail, so this patch fixes the test. Note that this is
only a test problem - no fix is needed for Alternator itself.

The test reads 20 results with 1-result pages, and assumed that
21 pages are returned. The 21st page may happen because when the
server returns the 20th, it might not yet know there will be no
additional results, so another page is needed - and will be empty.
Still a different implementation might notice that the last page
completed the iteration, and not return an extra empty page. This is
perfectly fine, and this is what AWS DynamoDB does today - and should
not be considered an error.

Refs #7778

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213143612.2761943-1-nyh@scylladb.com>
2020-12-14 09:18:31 +01:00
Nadav Har'El
4ab98a4c68 alternator: use a more specific error when Authorization header is missing
When request signature checking is enabled in Alternator, each request
should come with the appropriate Authorization header. Most errors in
preparing this header will result in an InvalidSignatureException
response, but DynamoDB returns a more specific error when this header is
completely missing: MissingAuthenticationTokenException. We should do the
same, but before this patch we return InvalidSignatureException also for
a missing header.

The test test_authorization.py::test_no_authorization_header used to
enshrine our wrong error message, and failed when run against AWS.
After this patch, we fix the error message and the test - which now
passes against both Alternator and AWS.

Refs #7778.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213133825.2759357-1-nyh@scylladb.com>
2020-12-14 09:18:24 +01:00
Avi Kivity
39afe14ad4 Merge 'Add per query timeout' from Piotr Sarna
This series allows setting per-query timeout via CQL. It's possible via the existing `USING` clause, which is extended to be available for `SELECT` statement as well. This parameter accepts a duration and can also be provided as a marker.
The parameter acts as a regular part of the `USING` clause, which means that it can be used along with `USING TIMESTAMP` and `USING TTL` without issues.
The series comes with a pytest test suite.

Examples:
```cql
        SELECT * FROM t USING TIMEOUT 200ms;
```
```cql
        INSERT INTO t(a,b,c) VALUES (1,2,3) USING TIMESTAMP 42 AND TIMEOUT 50ms;
```

Working with prepared statements works as usual - the timeout parameter can be
explicitly defined or provided as a marker:

```cql
        SELECT * FROM t USING TIMEOUT ?;
```
```cql
        INSERT INTO t(a,b,c) VALUES (?,?,?) USING TIMESTAMP 42 AND TIMEOUT 50ms;
```

Tests: unit(dev)
Fixes #7777

Closes #7781

* github.com:scylladb/scylla:
  test: add prepared statement tests to USING TIMEOUT suite
  docs: add an entry about USING TIMEOUT
  test: add a test suite for USING TIMEOUT
  storage_proxy: start propagating local timeouts as timeouts
  cql3: allow USING clause for SELECT statement
  cql3: add TIMEOUT attribute to the parser
  cql3: add per-query timeout to select statement
  cql3: add per-query timeout to batch statement
  cql3: add per-query timeout to modification statement
  cql3: add timeout to cql attributes
2020-12-14 09:46:46 +02:00
Piotr Sarna
d6e7e36280 test: add prepared statement tests to USING TIMEOUT suite 2020-12-14 07:50:40 +01:00
Piotr Sarna
da77ab832b docs: add an entry about USING TIMEOUT
The paragraph describes how USING TIMEOUT clause can be used
along with some simple examples.
2020-12-14 07:50:40 +01:00
Piotr Sarna
0148b41a02 test: add a test suite for USING TIMEOUT
The test suite is based on cql-pytest and checks if USING TIMEOUT
works as expected.
2020-12-14 07:50:40 +01:00
Piotr Sarna
27fba35832 storage_proxy: start propagating local timeouts as timeouts
A local timeout was previously propagated to the client as WriteFailure,
while there exists a more concrete error type for that: WriteTimeout.
2020-12-14 07:50:40 +01:00
Piotr Sarna
ddd9cb1b2a cql3: allow USING clause for SELECT statement
In order to be able to specify a timeout for SELECT statements,
it's now possible to use the USING clause with it.
2020-12-14 07:50:40 +01:00
Piotr Sarna
d3896a209b cql3: add TIMEOUT attribute to the parser
It's now possible to specify TIMEOUT as part of the USING clause.
2020-12-14 07:50:40 +01:00
Piotr Sarna
157be33b89 cql3: add per-query timeout to select statement
First of all, select statement is extended with an 'attrs' field,
which keeps the per-query attributes. Currently, only TIMEOUT
parameter is legal to use, since TIMESTAMP and TTL bear no meaning
for reads.

Secondly, if TIMEOUT attribute is set, it will be used as the effective
timeout for a particular query.
2020-12-14 07:50:40 +01:00
Piotr Sarna
20dedd0df7 cql3: add per-query timeout to batch statement
If TIMEOUT attribute is set, it will be used as the effective
timeout for a particular query.
2020-12-14 07:50:40 +01:00
Piotr Sarna
3c49b6bd88 cql3: add per-query timeout to modification statement
If TIMEOUT attribute is set, it will be used as the effective
timeout for a particular query.
2020-12-14 07:50:40 +01:00
Piotr Sarna
5bbd0b049b cql3: add timeout to cql attributes
This attribute will be used later to specify per-query timeout.
2020-12-14 07:50:40 +01:00
Benny Halevy
c60da2e90d cdc: remove _token_metadata from db_context
1. It's unused since cbe510d1b8
2. It's unsafe to keep a reference to token_metadata&
potentially across yield points.

The higher-level motivation is to make
storage_service::get_token_metadata() private so we
can control better how it's used.

For cdc, if the token_metadata is going to be needed
in the future, it'd be better to get it from
db_context::_proxy.get_token_metadata_ptr().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201213162351.52224-2-bhalevy@scylladb.com>
2020-12-13 18:32:17 +02:00
Avi Kivity
0f967f911d Merge "storage_service: get_token_metadata_ptr to hold on to token_metadata" from Benny
"
This series fixes use-after-free via token_metadata&

We may currently get a token_metadata& via get_token_metadata() and
use it across yield points in a couple of sites:
- do_decommission_removenode_with_repair
- get_new_source_ranges

To fix that, get_token_metadata_ptr and hold on to it
across yielding.

Fixes #7790

Dtest: update_cluster_layout_tests:TestUpdateClusterLayout.simple_removenode_2_test(debug)
Test: unit(dev)
"

* tag 'storage_service-token_metadata_ptr-v2' of github.com:bhalevy/scylla:
  storage_service: get_new_source_ranges: don't hold token_metadata& across yield point
  storage_service: get_changed_ranges_for_leaving: no need to maybe_yield for each token_range
  storage_service: get_changed_ranges_for_leaving: release token_metadata_ptr sooner
  storage_service: get_changed_ranges_for_leaving: don't hold token_metadata& across yield
2020-12-13 17:37:24 +02:00
Aleksandr Bykov
e74dc311e7 dist: scylla_util: fix aws_instance.ebs_disks method
aws_instance.ebs_disks() method should return ebs disk
instead of ephemeral

Signed-off-by: Aleksandr Bykov <alex.bykov@scylladb.com>

Closes #7780
2020-12-13 17:33:37 +02:00
Benny Halevy
1fbc831dae storage_service: get_new_source_ranges: don't hold token_metadata& across yield point
Provide the token_metadata& to get_new_source_ranges by the caller,
who keeps it valid throughout the call.

Note that there is no need to clone_only_token_map
since the token_metadata_ptr is immutable and can be
used just as well for calling strat.get_range_addresses.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-13 16:42:00 +02:00
Benny Halevy
f13913d251 storage_service: get_changed_ranges_for_leaving: no need to maybe_yield for each token_range
Now that we pass can_yield::yes to calculate_natural_endpoints
for each token_range.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-13 16:42:00 +02:00
Benny Halevy
89ed0705e8 storage_service: get_changed_ranges_for_leaving: release token_metadata_ptr sooner
No need to hold on to the shared token_metadata_ptr
after we got clone_after_all_left().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-13 16:42:00 +02:00
Benny Halevy
684c4143df storage_service: get_changed_ranges_for_leaving: don't hold token_metadata& across yield
When yielding in clone_only_token_map or clone_after_all_left
the token_metadata got with get_token_metadata() may go away.

Use get_token_metadata_ptr() instead to hold on to it.

And with that, we don't need to clone_only_token_map.
`metadata` is not modified by calculate_natural_endpoints, so we
can just refer to the immutable copy retrieved with
get_token_metadata_ptr.
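
The pattern can be illustrated with a small asyncio sketch. Python is garbage
collected, so it cannot reproduce the use-after-free itself; the sketch only
shows the idea of taking one immutable snapshot before a yield point and using
it afterwards. SharedTokenMetadata and its methods are hypothetical names, not
the Scylla classes.

```python
import asyncio


class SharedTokenMetadata:
    """Hypothetical holder that publishes immutable snapshots."""

    def __init__(self, tokens):
        self._current = tuple(tokens)

    def get_ptr(self):
        # Returns the current snapshot; callers that keep it are unaffected
        # by later replace() calls.
        return self._current

    def replace(self, tokens):
        self._current = tuple(tokens)


async def count_tokens(shared):
    snapshot = shared.get_ptr()   # taken once, before yielding
    await asyncio.sleep(0)        # yield point: a new version may be published
    return len(snapshot)          # still the snapshot taken above


async def demo():
    shared = SharedTokenMetadata([1, 2, 3])
    task = asyncio.create_task(count_tokens(shared))
    await asyncio.sleep(0)        # let the task run up to its yield point
    shared.replace([1])           # publish a new version meanwhile
    return await task
```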

Fixes #7790

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-13 16:41:58 +02:00
Avi Kivity
65a0244614 Update tools/jmx submodule
* tools/jmx 6174a47...20469bf (1):
  > column_family: Return proper cardinality for toppartitions requests
2020-12-13 13:51:38 +02:00
Avi Kivity
9265b87610 Merge "Remove get_local_storage_proxy from validation" from Pavel E
"
The validate_column_family() helper uses the global proxy
reference to get database from. Fortunately, all the callers
of it can provide one via argument.

tests: unit(dev)
"

* 'br-no-proxy-in-validate' of https://github.com/xemul/scylla:
  validation: Remove get_local_storage_proxy call
  client_state: Call validate_column_family() with database arg
  client_state: Add database& arg to has_column_family_access
  storage_proxy: Add .local_db() getters
  validate: Mark database argument const
2020-12-13 13:12:57 +02:00
Avi Kivity
19aaf8eb83 Merge "Remove global storage service from index manager" from Pavel E
"
The initial intent was to remove call for global storage service from
secondary index manager's create_view_for_index(), but while fixing it
one of intermediate schema table's helper managed to benefit from it
by re-using the database reference flying by.

The cleanup is done by simply pushing the database reference along the
stack from the code that already has it down the create_view_for_index().

tests: unit(dev)
"

* 'br-no-storages-in-index-and-schema' of https://github.com/xemul/scylla:
  schema-tables: Use db from make_update_table_mutations in make_update_indices_mutations
  schema-tables: Add database argument to make_update_table_mutations
  schema-tables: Factor out calls getting database instance
  index-manager: Move feature evaluation one level up
2020-12-13 12:41:51 +02:00
Benny Halevy
aae3991246 repair: do_decommission_removenode_with_repair: don't deref ops when null
`ops` might be passed as a disengaged shared_ptr when called
from `decommission_with_repair`.

In this case we need to propagate to sync_data_using_repair a
disengaged std::optional<utils::UUID>.

Fixes #7788

DTest: update_cluster_layout_tests:TestUpdateClusterLayout.verify_latest_copy_decommission_node_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201213073743.331253-1-bhalevy@scylladb.com>
2020-12-13 12:37:18 +02:00
Avi Kivity
18be57a4e5 Update seastar submodule
* seastar 8b400c7b45...2de43eb6bf (3):
  > core: show span free sizes correctly in diagnostics
  > Merge "IO queues to share capacities" from Pavel E
  > file: make_file_impl: determine blockdev using st_mode
2020-12-12 21:57:01 +02:00
Pekka Enberg
c990f2bd34 Merge 'Reinstate [[nodiscard]] support' from Avi Kivity
The switch to clang disabled the clang-specific -Wunused-value
since it generated some harmless warnings. Unfortunately, that also
prevented [[nodiscard]] violations from warning.

Fix by clearing all instances of the warning (including [[nodiscard]]
violations that crept in while it was disabled) and reinstating the warning.

Closes #7767

* github.com:scylladb/scylla:
  build: reinstate -Wunused-value warning for [[nodiscard]]
  test: lib: don't ignore future in compare_readers()
  test: mutation_test: check both ranges when comparing summaries
  serialializer: silence unused value warning in variant deserializer
2020-12-12 09:54:05 +02:00
Avi Kivity
615b8e8184 dist: rpm: uninstall tuned when installing scylla-kernel-conf
tuned 2.11.0-9 and later writes to kernel.sched_wakeup_granularity_ns
and other sysctl tunables that we so laboriously tuned, dropping
performance by a factor of 5 (due to increased latency). Fix by
obsoleting tuned during install (in effect, we are a better tuned,
at least for us).

Not needed for .deb, since Debian/Ubuntu do not install tuned by
default.

Fixes #7696

Closes #7776
2020-12-12 09:54:05 +02:00
Pavel Emelyanov
3a025cfa52 schema-tables: Use db from make_update_table_mutations in make_update_indices_mutations
Two halves of the tunnel finally connect -- the
latter helper needs the local database instance and
is only called by the former one which already has it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 21:23:53 +03:00
Pavel Emelyanov
89fd524c5a schema-tables: Add database argument to make_update_table_mutations
There are 3 callers of this helper (cdc, migration manager and tests)
and all of them already have the database object at hand.

The argument will be used by next patch to remove call for global
storage proxy instance from make_update_indices_mutations.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 21:21:22 +03:00
Pavel Emelyanov
1bcef04c7a schema-tables: Factor out calls getting database instance
The make_update_indices_mutations gets database instance
for two things -- to find the cf to work with and to get
the value of a feature for index view creation.

To suit both and to remove calls for global storage proxy
and service instances get the database once in the
function entrance. Next patch will clean this further.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 21:17:11 +03:00
Pavel Emelyanov
6dd10e771d index-manager: Move feature evaluation one level up
The create_view_for_index needs to know the state of the
correct-idx-token-in-secondary-index feature. To get one
it takes quite a long route through global storage service
instance.

Since there's only one caller of the method in question,
and the method is called in a loop, it's a bit faster to
get the feature value in caller and pass it in argument.

This will also help to get rid of the call for global
storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 21:14:12 +03:00
Pavel Emelyanov
3a3ee45488 size_estimate_reader: Use local db reference not global
The get_next_partition uses global proxy instance to get
the local database reference. Now it's available in the
reader object itself, so it's possible to remove this
call for global storage proxy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 20:38:21 +03:00
Pavel Emelyanov
107dcbfbd6 size_estimate_reader: Keep database reference on mutation reader
This reader uses local database instance in its get_next_partition
method to find keyspaces to work with

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 20:34:54 +03:00
Pavel Emelyanov
48e494fb62 size_estimate_reader: Keep database reference on virtual_reader
The database will be then used to create the mutation reader

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 20:31:35 +03:00
Pavel Emelyanov
83073f4e8b validation: Remove get_local_storage_proxy call
It is used in validate_column_family. The last caller of it was removed by
previous patch, so we may kill the helper itself

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:52:42 +03:00
Pavel Emelyanov
12cc539835 client_state: Call validate_column_family() with database arg
The previous patch brought the database reference arg. And since
the currently called validate_column_family() overload _just_
gets the database from global proxy, it's better to shortcut.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:50:49 +03:00
Pavel Emelyanov
b0c4a9087d client_state: Add database& arg to has_column_family_access
It is called from cql3/statements' check_access methods and from thrift
handlers. The former have proxy argument from which they can get the
database. The latter already have the database itself on board.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:49:16 +03:00
Pavel Emelyanov
4c7bc8a3d1 storage_proxy: Add .local_db() getters
To facilitate the next patching

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:48:02 +03:00
Avi Kivity
a11ecfe231 Merge 'types: don't linearize in validate()' from Michał Chojnowski
A sequel to #7692.

This series gets rid of linearization when validating collections and tuple types. (Other types were already validated without linearizing).
The necessary helpers for reading from fragmented buffers were introduced in #7692. All this series does is put them to use in `validate()`.

Refs: #6138

Closes #7770

* github.com:scylladb/scylla:
  types: add single-fragment optimization in validate()
  utils: fragment_range: add with_simplified()
  cql3: statements: select_statement: remove unnecessary use of with_linearized
  cql3: maps: remove unnecessary use of with_linearized
  cql3: lists: remove unnecessary use of with_linearized
  cql3: tuples: remove unnecessary use of with_linearized
  cql3: sets: remove unnecessary use of with_linearized
  cql3: tuples: remove unnecessary use of with_linearized
  cql3: attributes: remove unnecessary uses of with_linearized
  types: validate lists without linearizing
  types: validate tuples without linearizing
  types: validate sets without linearizing
  types: validate maps without linearizing
  types: template abstract_type::validate on FragmentedView
  types: validate_visitor: transition from FragmentRange to FragmentedView
  utils: fragmented_temporary_buffer: add empty() to FragmentedView
  utils: fragmented_temporary_buffer: don't add to null pointer
2020-12-11 17:33:59 +02:00
Pavel Emelyanov
563b466227 validate: Mark database argument const
They are indeed used like that

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-11 18:27:45 +03:00
Michał Chojnowski
150473f074 types: add single-fragment optimization in validate()
Manipulating fragmented views is costlier than manipulating contiguous views,
so let's detect the common situation when the fragmented view is actually
contiguous underneath, and make use of that.

Note: this optimization is only useful for big types. For trivial types,
validation usually only checks the size of the view.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
e2d17879fc utils: fragment_range: add with_simplified()
Reading from contiguous memory (bytes_view) is significantly simpler
runtime-wise than reading from a fragmented view, due to less state and less
branching, so we often want to convert a fragmented view to a simple view before
processing it, if the fragmented view contains at most one fragment, which is
common. with_simplified() does just that.
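
A rough Python model of the idea, under the assumption that a fragmented view
is represented as a list of bytes objects (the real helper is a C++ template
and looks different):

```python
def with_simplified(fragments, fn):
    """Call fn with a contiguous bytes object when the view has at most one
    fragment; otherwise call fn with the fragment list unchanged."""
    if len(fragments) == 0:
        return fn(b"")
    if len(fragments) == 1:
        return fn(fragments[0])   # fast path: plain contiguous view
    return fn(fragments)          # general path: fragmented view


def total_size(view):
    """Example consumer that handles both the simple and the fragmented form."""
    if isinstance(view, (bytes, bytearray, memoryview)):
        return len(view)
    return sum(len(fragment) for fragment in view)
```

For instance, with_simplified([b"abc"], total_size) takes the contiguous path,
while with_simplified([b"ab", b"cd"], total_size) takes the fragmented one.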
2020-12-11 09:53:07 +01:00
Michał Chojnowski
51ca5fa4c5 cql3: statements: select_statement: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
72186bee69 cql3: maps: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
3f3a10c588 cql3: lists: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
efa036329d cql3: tuples: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
4f359a7a99 cql3: sets: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
281417917b cql3: tuples: remove unnecessary use of with_linearized
We can validate directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
d1d1a00311 cql3: attributes: remove unnecessary uses of with_linearized
We can validate and deserialize directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
0581b3ff31 types: validate lists without linearizing
We can validate collections directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
4fe41b69fd types: validate tuples without linearizing
We can validate tuples directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
a7dd736d03 types: validate sets without linearizing
We can validate collections directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
1459608375 types: validate maps without linearizing
We can validate collections directly from fragmented buffers now.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
82befbe8c0 types: template abstract_type::validate on FragmentedView
This is primarily a stylistic change. It makes the interface more consistent
with deserialize(). It will also allow us to call `validate()` for collection
elements in `validate_aux()`.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
15dbe00e8a types: validate_visitor: transition from FragmentRange to FragmentedView
This will allow us to easily get rid of linearizations when validating
collections and tuples, because the helpers used in validate_aux() already
have FragmentedView overloads.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
3647c0ba47 utils: fragmented_temporary_buffer: add empty() to FragmentedView
It's redundant with size_bytes(), but sometimes empty() is more readable and
reduces churn when replacing other types with FragmentedView.
2020-12-11 09:53:07 +01:00
Michał Chojnowski
b4dd5d3bdb utils: fragmented_temporary_buffer: don't add to null pointer
When fragmented_temporary_buffer::view is created from a bytes_view,
_current is null. In that case, in remove_current(), null pointer offset
happens, and ubsan complains. Fix that.
2020-12-11 09:53:07 +01:00
Raphael S. Carvalho
e4b55f40f3 sstables: Fix sstable reshaping for STCS
The heuristic of STCS reshape is correct, and it built the compaction
descriptor correctly, but forgot to return it to the caller, so no
reshape was ever done on behalf of STCS even when the strategy
needed it.

Fixes #7774.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201209175044.1609102-1-raphaelsc@scylladb.com>
2020-12-10 12:45:25 +02:00
Asias He
829b4c1438 repair: Make removenode safe by default
Currently removenode works like below:

- The coordinator node advertises the node to be removed in
  REMOVING_TOKEN status in gossip

- Existing nodes learn the node in REMOVING_TOKEN status

- Existing nodes sync data for the ranges they own

- Existing nodes send notification to the coordinator

- The coordinator node waits for notification and announces the node in
  REMOVED_TOKEN

Current problems:

- Existing nodes do not tell the coordinator if the data sync is ok or failed.

- The coordinator can not abort the removenode operation in case of error

- Failed removenode operation will make the node to be removed in
  REMOVING_TOKEN forever.

- The removenode runs in best effort mode which may cause data
  consistency issues.

  It means if a node that owns the range after the removenode
  operation is down during the operation, the removenode operation
  will continue to succeed without requiring that node to perform data
  syncing. This can cause data consistency issues.

  For example, with five nodes in the cluster and RF = 3, suppose that for a
  range n1, n2, n3 are the old replicas and n2 is being removed; after the removenode
  operation, the new replicas are n1, n5, n3. If n3 is down during the
  removenode operation, only n1 will be used to sync data with the new
  owner n5. This will break QUORUM read consistency if n1 happens to
  miss some writes.

Improvements in this patch:

- This patch makes the removenode safe by default.

We require all nodes in the cluster to participate in the removenode operation and
sync data if needed. We fail the removenode operation if any of them is down or
fails.

If the user wants the removenode operation to succeed even if some of the nodes
are not available, the user has to explicitly pass a list of nodes that can be
skipped for the operation.

$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id>

Example restful api:

$ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5"

- The coordinator can abort data sync on existing nodes

For example, if one of the nodes fails to sync data, it makes no sense for
other nodes to continue to sync data because the whole operation will
fail anyway.

- The coordinator can decide which nodes to ignore and pass the decision
  to other nodes

Previously, there was no way for the coordinator to tell existing nodes
to run in strict mode or best effort mode. Users had to modify
the config file or run a restful api cmd on all the nodes to select strict
or best effort mode. With this patch, the cluster wide configuration is
eliminated.
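
The coordinator-side rules described above can be sketched roughly as below.
This is a simplified model, not the actual implementation: the function names
and the sync callable are hypothetical, and nodes are plain strings.

```python
def plan_removenode(all_nodes, live_nodes, node_to_remove, ignore_nodes=()):
    """Return the nodes that must sync data, or raise if a required node is down.

    Safe by default: every node except the one being removed and the explicitly
    ignored ones has to be alive.
    """
    ignored = set(ignore_nodes)
    required = [n for n in all_nodes
                if n != node_to_remove and n not in ignored]
    down = [n for n in required if n not in live_nodes]
    if down:
        raise RuntimeError(f"removenode aborted, nodes down and not ignored: {down}")
    return required


def run_removenode(required, sync):
    """Run data sync on each required node; stop at the first failure.

    sync is a hypothetical callable (node -> bool). Returns (ok, failed_node).
    On failure the coordinator would also abort sync on the other nodes.
    """
    for node in required:
        if not sync(node):
            return False, node
    return True, None
```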

Fixes #7359

Closes #7626
2020-12-10 10:14:39 +02:00
Piotr Sarna
20bdeb315a Merge ' types: add constraint on lexicographical_tri_compare()' from Avi Kivity
Verify that the input types are iterators and their value types are compatible
with the compare function.

Because some of the inputs were not actually valid iterators, they are adjusted
too.

Closes #7631

* github.com:scylladb/scylla:
  types: add constraint on lexicographical_tri_compare()
  composite: make composite::iterator a real input_iterator
  compound: make compount_type::iterator a real input_iterator
2020-12-09 18:48:01 +01:00
Nadav Har'El
a8fdbf31cd alternator: fix UpdateItem ADD for non-existent attribute
UpdateItem's "ADD" operation usually adds elements to an existing set
or adds a number to an existing counter. But it can *also* be used
to create a new set or counter (as if adding to an empty set or zero).
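
As a rough model of these semantics (not Alternator's code), assuming an item
is a Python dict, numbers are ints and sets are Python sets, and ignoring
type-mismatch validation:

```python
def apply_add(item, attr, value):
    """Apply an ADD action: a missing attribute behaves like 0 or an empty set."""
    if isinstance(value, (set, frozenset)):
        # Set case: union with the existing set, or with an empty one.
        item[attr] = set(item.get(attr, set())) | set(value)
    else:
        # Number case: add to the existing counter, or to zero.
        item[attr] = item.get(attr, 0) + value
    return item
```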

We unfortunately did not have a test for this case (creating a new set
or counter), and when I wrote such a test now, I discovered the
implementation was missing. So this patch adds both the test and the
implementation. The new test used to fail before this patch, and passes
with it - and passes on DynamoDB.

Note that we only had this bug for the newer UpdateItem syntax.
For the old AttributeUpdates syntax, we already support ADD actions
on missing attributes, and already tested it in test_update_item_add().
I just forgot to test the same thing for the newer syntax, so I missed
this bug :-(

Fixes #7763.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207085135.2551845-1-nyh@scylladb.com>
2020-12-09 18:44:30 +01:00
Juliusz Stasiewicz
b150906d39 gossip: Added SNITCH_NAME to application_state
Snitch name needs to be exchanged within cluster once, on shadow
round, so joining nodes cannot use wrong snitch. The snitch names
are compared on bootstrap and on normal node start.

If the cluster already used mixed snitches, the upgrade to this
version will fail. In this case customer needs to add a node with
correct snitch for every node with the wrong snitch, then put
down the nodes with the wrong snitch and only then do the upgrade.

Fixes #6832

Closes #7739
2020-12-09 15:45:25 +02:00
Nadav Har'El
781f9d9aca alternator: make default timeout configurable
Whereas in CQL the client can pass a timeout parameter to the server, in
the DynamoDB API there is no such feature; the server needs to choose
reasonable timeouts for its own internal operations - e.g., writes to disk,
querying other replicas, etc.

Until now, Alternator had a fixed timeout of 10 seconds for its
requests. This choice was reasonable - it is much higher than we expect
during normal operations, and still lower than the client-side timeouts
that some DynamoDB libraries have (boto3 has a one-minute timeout).
However, there's nothing holy about this number of 10 seconds, and some
installations might want to change this default.

So this patch adds a configuration option, "--alternator-timeout-in-ms",
to choose this timeout. As before, it defaults to 10 seconds (10,000ms).

In particular, some test runs are unusually slow - consider for example
testing a debug build (which is already very slow) in an extremely
over-committed test host. In some cases (see issue #7706) we noticed
the 10 second timeout was not enough. So in this patch we increase the
default timeout chosen in the "test/alternator/run" script to 30 seconds.

Please note that as the code is structured today, this timeout only
applies to some operations, such as GetItem, UpdateItem or Scan, but
does not apply to CreateTable, for example. This is a pre-existing
issue that this patch does not change.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207122758.2570332-1-nyh@scylladb.com>
2020-12-09 14:30:43 +01:00
Avi Kivity
f802356572 Revert "Revert "Merge "raft: fix replication if existing log on leader" from Gleb""
This reverts commit dc77d128e9. It was reverted
due to a strange and unexplained diff, which is now explained. The
HEAD on the working directory being pulled from was set back, so git
thought it was merging the intended commits, plus all the work that was
committed from HEAD to master. So it is safe to restore it.
2020-12-08 19:19:55 +02:00
Avi Kivity
1badd315ef Merge "Speed up devel tests 10 times" from Pavel E
"
The multishard_mutation_query test is toooo slow when built
with clang in dev mode. By reducing the number of scans it's
possible to shrink the full suite run time from half an hour
down to ~3 minutes.

tests: unit(dev)
"

* 'br-devel-mode-tests' of https://github.com/xemul/scylla:
  test: Make multishard_mutation_query test do less scans
  configure: Add -DDEVEL to dev build flags
2020-12-08 15:42:12 +02:00
Pavel Emelyanov
b837cf25b1 test: Make multishard_mutation_query test do less scans
When built by clang this dev-mode test takes ~30 minutes to
complete. Let's reduce this time by reducing the scale of
the test if DEVEL is set.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-08 15:55:04 +03:00
Pavel Emelyanov
703451311f configure: Add -DDEVEL to dev build flags
To let source code tell debug, dev and release builds
from each other.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-08 15:54:30 +03:00
Avi Kivity
461c9826de Merge 'scylla_setup: fix wrong command suggestion' from Takuya ASADA
The scylla_setup command suggestion does not show the argument of --io-setup,
because we mistakenly store a bool value in it (recognized as 'store_true').
We always need to print '--io-setup X' on the suggestion instead.

Also, --nic is currently ignored by command suggestion, need to print it just like other options.

Related #7395

Closes #7724

* github.com:scylladb/scylla:
  scylla_setup: print --swap-directory and --swap-size on command suggestion
  scylla_setup: print --nic on command suggestion
  scylla_setup: fix wrong command suggestion on --io-setup scylla_setup command suggestion does not shows an argument of --io-setup, because we mistakely stores bool value on it (recognized as 'store_true'). We always need to print '--io-setup X' on the suggestion instead.
2020-12-08 13:58:55 +02:00
Avi Kivity
98271a5c57 Merge 'types: don't linearize in serialize_for_cql()' from Michał Chojnowski
A sequel to #7692.

This series gets rid of linearization in `serialize_for_cql`, which serializes collections and user types from `collection_mutation_view` to CQL. We switch from `bytes` to `bytes_ostream` as the intermediate buffer type.

The only user of `serialize_for_cql` immediately copies the result to another `bytes_ostream`. We could avoid some copies and allocations by writing to the final `bytes_ostream` directly, but it's currently hidden behind a template.

Before this series, `serialize_for_cql_aux()` delegated the actual writing to `collection_type_impl::pack` and `tuple_type_impl::build_value`, by passing them an intermediate `vector`. After this patch, the writing is done directly in `serialize_for_cql_aux()`. Pros: we avoid the overhead of creating an intermediate vector, without bloating the source code (because creating that intermediate vector requires just as much code as serializing the values right away). Cons: we duplicate the CQL collection format knowledge contained in `collection_type_impl::pack` and `tuple_type_impl::build_value`.

Refs: #6138

Closes #7771

* github.com:scylladb/scylla:
  types: switch serialize_for_cql from bytes to bytes_ostream
  types: switch serialize_for_cql_aux from bytes to bytes_ostream
  types: serialize user types to bytes_ostream
  types: serialize lists to bytes_ostream
  types: serialize sets to bytes_ostream
  types: serialize maps to bytes_ostream
  utils: fragment_range: use range-based for loop instead of boost::for_each
  types: add write_collection_value() overload for bytes_ostream and value_view
2020-12-08 12:38:36 +02:00
Lubos Kosco
a0b1474bba scylla_util.py: Increase disk to ram ratio for GCP
Increase accepted disk-to-RAM ratio to 105 to accommodate even 7.5GB of
RAM for one NVMe. Log various reasons for not recommending the instance
type.

Fixes #7587

Closes #7600
2020-12-08 11:20:30 +02:00
Piotr Wojtczak
c09ab3b869 api: Add cardinality to toppartitions results
This change enhances the toppartitions api to also return
the cardinality of the read and write sample sets. It now uses
the size() method of space_saving_top_k class, counting the unique
operations in the sampled set for up to the given capacity.
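
For illustration, a simplified Python model of a space-saving counter with a
size() method. It is only a sketch of the general technique; the class below
is hypothetical and does not mirror the actual space_saving_top_k code.

```python
class SpaceSavingTopK:
    """Tracks at most `capacity` distinct items with approximate counts."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def append(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Full: evict the item with the smallest count and let the new
            # item inherit that count plus one.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def size(self):
        """Number of distinct items tracked, never more than capacity."""
        return len(self.counts)

    def top(self, k):
        """The k items with the highest (approximate) counts."""
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]
```

Because at most `capacity` items are tracked, size() reports the exact number
of unique items only while that number stays within the capacity.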

Fixes #4089
Closes #7766
2020-12-08 09:38:59 +01:00
Nadav Har'El
86779664f4 alternator: fix broken Scan/Query paging with bytes keys
When an Alternator table has partition keys or sort keys of type "bytes"
(blobs), a Scan or Query which required paging used to fail. We used
the wrong function to output LastEvaluatedKey (which tells the user
where to continue at the next page): it produced correct output for
strings and numbers, but NOT for bytes (which need to be encoded
as base-64).
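
As an illustration of the encoding rule (a Python sketch, not the Alternator C++ code), the DynamoDB JSON format represents a key attribute as a one-entry object whose tag is "S", "N" or "B", and only "B" values are base-64 encoded. The helper names below are hypothetical.

```python
import base64


def encode_key_attribute(value):
    """Encode one key value in DynamoDB JSON form."""
    if isinstance(value, (bytes, bytearray)):
        # Bytes cannot be emitted as-is in JSON; they must be base-64 text.
        return {"B": base64.b64encode(bytes(value)).decode("ascii")}
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        # Numbers travel as strings in DynamoDB JSON.
        return {"N": str(value)}
    raise TypeError(f"unsupported key type: {type(value).__name__}")


def build_last_evaluated_key(key):
    """Encode every key column of the last row returned on a page."""
    return {name: encode_key_attribute(value) for name, value in key.items()}
```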

This patch also includes two tests - for bytes partition key and
for bytes sort key - that failed before this patch and now pass.
The test test_fetch_from_system_tables also used to fail after a
Limit was added to it, because one of the tables it scans had a bytes
key. That test is also fixed by this patch.

Fixes #7768

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207175957.2585456-1-nyh@scylladb.com>
2020-12-08 09:38:23 +01:00
Eliran Sinvani
70770ff7fa debian pkg: Make deb packages explicitly depend on versioned components
Up until now, the dependency versions of Scylla's Debian packages were
unspecified. This was due to the technical difficulty of determining
the version of the depended-upon packages (such as scylla-python3
or scylla-jmx). Now that those packages are also built as part of
this repo, with a version identical to that of the server package
itself, all of our packages can depend on explicit versions.
The motivation for this change is that if a user tries to install
a specific Scylla version by installing a specific meta package,
it will silently drag in the latest components instead of the ones
of the requested version.
The expected change in behavior is that, after this change, an attempt
to install a metapackage with a version which is not the latest will fail
with an explicit error hinting to the user which other packages of the same
version should be explicitly included in the command line.

Fixes #5514

Closes #7727
2020-12-07 18:58:15 +02:00
Michał Chojnowski
d43fd456cd types: switch serialize_for_cql from bytes to bytes_ostream
Now we can serialize collections from collection_mutation_view_description
without linearizations.
2020-12-07 17:55:36 +01:00
Michał Chojnowski
81a55b032d types: switch serialize_for_cql_aux from bytes to bytes_ostream
We will switch serialize_for_cql itself to bytes_ostream soon.
2020-12-07 17:55:35 +01:00
Michał Chojnowski
71183cf0bd types: serialize user types to bytes_ostream
Avoids linearization by serializing to a fragmented type.
It's still linearized at the very end, this will be changed in the near future.
2020-12-07 17:52:06 +01:00
Michał Chojnowski
41b889d0c8 types: serialize lists to bytes_ostream
Avoids linearization by serializing to a fragmented type.
It's still linearized at the very end, this will be changed in the near future.
2020-12-07 17:49:21 +01:00
Michał Chojnowski
2b3d2c193d types: serialize sets to bytes_ostream
Avoids linearization by serializing to a fragmented type.
It's still linearized at the very end, this will be changed in the near future.
2020-12-07 17:47:49 +01:00
Michał Chojnowski
35823d12db types: serialize maps to bytes_ostream
Avoids linearization by serializing to a fragmented type.
It's still linearized at the very end, this will be changed in the near future.
2020-12-07 17:47:12 +01:00
Botond Dénes
ba7cf2f5fd tools/scylla-types: update name in description to use - instead of _
The executable was renamed from using _ to using - at one point but
apparently the description wasn't updated.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201207161626.79013-1-bdenes@scylladb.com>
2020-12-07 18:34:52 +02:00
Avi Kivity
7580a93ec8 build: reinstate -Wunused-value warning for [[nodiscard]]
The switch to clang disabled the clang-specific -Wunused-value
since it generated some harmless warnings. Unfortunately, that also
prevented [[nodiscard]] violations from warning.

Fix by reinstating the warning, now that all instances of the warning
have been fixed.
2020-12-07 16:51:19 +02:00
Avi Kivity
8fc0bbd487 test: lib: don't ignore future in compare_readers()
A fast_forward_to() call is not waited on in compare_readers(). Since
this is called in a thread, add a future::get() call to wait for it.
2020-12-07 16:50:20 +02:00
Avi Kivity
732d83dc0e test: mutation_test: check both ranges when comparing summaries
A copy/paste error means we ignore the termination of one of the
ranges. Change the comma expression to a disjunction to avoid
the unused value warning from clang.

The code is not perfect, since if the two ranges are not the same
size we'll invoke undefined behavior, but it is no worse than before
(where we ignored the comparison completely).
2020-12-07 16:47:52 +02:00
Avi Kivity
fc0a45af5f serializer: silence unused value warning in variant deserializer
The variant deserializer uses a fold expression to implement
an if-tree with a short-circuit, producing an intermediate boolean
value to terminate evaluation. This intermediate value is unneeded,
but evokes a warning from clang when -Wunused-value is enabled.

Since we want to enable the warning, add a cast to void to ignore
the intermediate value.
2020-12-07 16:45:20 +02:00
Michał Chojnowski
60a3cecfea utils: fragment_range: use range-based for loop instead of boost::for_each
We want to pass bytes_ostream to this loop in later commits.
bytes_ostream does not conform to some boost concepts required by
boost::for_each, so let's just use C++'s native loop.
2020-12-07 12:50:36 +01:00
Piotr Sarna
1cc4ed50c1 db: fix getting local ranges for size estimates table
When getting local ranges, an assumption is made that
if a range does not contain an end or when its end is a maximum token,
then it must contain a start. This assumption proved untrue
during manual tests, so it's now fortified with an additional check.

Here's a gdb output for a set of local ranges which causes an assertion
failure when calling `get_local_ranges` on it:

(gdb) p ranges
$1 = std::vector of length 2, capacity 2 = {{_interval = {_start = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {_kind = dht::token_kind::before_all_keys,
            _data = 0}, _inclusive = false}}, _end = std::optional<interval_bound<dht::token>> [no contained value], _singular = false}}, {_interval = {
      _start = std::optional<interval_bound<dht::token>> [no contained value], _end = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {
            _kind = dht::token_kind::before_all_keys, _data = 0}, _inclusive = true}}, _singular = false}}}

Closes #7764
2020-12-07 12:08:31 +02:00
Takuya ASADA
c3abba1913 scylla_setup: print --swap-directory and --swap-size on command suggestion
We need to print --swap-directory and --swap-size on command suggestion just like other options.

Related #7395
2020-12-07 18:40:59 +09:00
Takuya ASADA
582a3ffb2f scylla_setup: print --nic on command suggestion
We need to print --nic on command suggestion just like other options.

Related #7395
2020-12-07 18:40:59 +09:00
Nadav Har'El
220d6dde17 alternator, test: make test_fetch_from_system_tables faster
The test test_fetch_from_system_tables tests Alternator's system-table
feature by reading from all system tables. The intention was to confirm
we don't crash reading any of them - as they have different schemas and
can run into different problems (we had such problems in the initial
implementation). The intention was not to read *a lot* from each table -
we only make a single "Scan" call on each, to read one page of data.
However, the Scan call did not set a Limit, so the single page can get
pretty big.

This is not normally a problem, but in extremely slow runs - such as when
running the debug build on an extremely overcommitted test machine (e.g.,
issue #7706) reading this large page may take longer than our default
timeout. I'll send a separate patch for the timeout issue, but for now,
there is really no reason why we need to read a big page. It is good
enough to just read 50 rows (with Limit=50). This will still read all
the different types and make the test faster.

As an example, in the debug run on my laptop, this test spent 2.4
seconds to read the "compaction_history" table before this patch,
and only 0.1 seconds after this patch. 2.4 seconds is close to our
default timeout (10 seconds), 0.1 is very far.

Fixes #7706

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207075112.2548178-1-nyh@scylladb.com>
2020-12-07 08:52:31 +01:00
Michał Chojnowski
1fe7490970 types: add write_collection_value() overload for bytes_ostream and value_view
We will use it to serialize collections to bytes_ostream in serialize_for_cql().
2020-12-07 08:48:31 +01:00
Nadav Har'El
0cd05dd0fd cql-pytest: add tests for ALLOW FILTERING
The original goal of this patch was to replace the two single-node dtests
allow_filtering_test and allow_filtering_secondary_indexes_test, which
recently caused us problems when we wanted to change the ALLOW FILTERING
behavior but the tests were outside the tree. I'm hoping that after this
patch, those two tests could be removed from dtest.

But this patch actually tests more cases than those original dtests, and
moreover tests not just whether ALLOW FILTERING is required or not, but
also that the results of the filtering are correct.

Currently, four of the included tests are expected to fail ("xfail") on
Scylla, reproducing two issues:

1. Refs #5545:
   "WHERE x IN ..." on indexed column x wrongly requires ALLOW FILTERING
2. Refs #7608:
   "WHERE c=1" on clustering key c should require ALLOW FILTERING, but
   doesn't.

All tests, except the one for issue #5545, pass on Cassandra. That one
fails on Cassandra because it doesn't support IN on an indexed column at all
(regardless of whether ALLOW FILTERING is used or not).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201115124631.1224888-1-nyh@scylladb.com>
2020-12-06 19:51:25 +02:00
Pavel Solodovnikov
56c0fcfcb2 cql_query_test: handle bounce_to_shard msg in test_null_value_tuple_floating_types_and_uuids
Use `prepared_on_shard` helper function to handle `bounce_to_shard`
messages that can happen when using LWT statements.

Fixes: #7757
Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201204172944.601730-1-pa.solodovnikov@scylladb.com>
2020-12-06 19:34:13 +02:00
Amos Kong
6b1659ee80 schema.cc/describe: fix invalid compaction options in schema
The schema.cql file written with a snapshot has a typo: the comma after the
compaction strategy class is missing, so restoring the schema from the file fails.

    AND compaction = {'class': 'SizeTieredCompactionStrategy''max_compaction_threshold': '32'}

The map_as_cql_param() function has a `first` parameter that controls whether a
leading comma is added; compaction_strategy_options is never the first entry.
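
The role of the `first` flag can be sketched in Python as follows. This is a hypothetical re-creation for illustration, not the C++ code in schema.cc; only the name map_as_cql_param and its `first` parameter come from the description above.

```python
def map_as_cql_param(options, first=True):
    """Render map entries as CQL; prefix a comma unless the entry is first."""
    parts = []
    for key, value in options.items():
        separator = "" if first else ", "
        parts.append(f"{separator}'{key}': '{value}'")
        first = False
    return "".join(parts)


def describe_compaction(strategy, strategy_options):
    # The 'class' entry is always emitted first, so the strategy options
    # must be rendered with first=False to get the separating comma.
    body = f"'class': '{strategy}'" + map_as_cql_param(strategy_options, first=False)
    return "compaction = {" + body + "}"
```

Passing first=True for the strategy options reproduces the broken output quoted above, with the two entries glued together.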

Fixes #7741

Signed-off-by: Amos Kong <amos@scylladb.com>

Closes #7734
2020-12-06 17:40:05 +02:00
Avi Kivity
ca950e6f08 Merge "Remove get_local_storage_service() from counters" from Pavel E
"
The storage service is called there to get the cached value
of db::system_keyspace::get_local_host_id(). Keeping the value
on database decouples it from storage service and kills one
more global storage service reference.

tests: unit(dev)
"

* 'br-remove-storage-service-from-counters-2' of https://github.com/xemul/scylla:
  counters: Drop call to get_local_storage_service and related
  counters: Use local id arg in transform_counter_update_to_shards
  database: Have local id arg in transform_counter_updates_to_shards()
  storage_service: Keep local host id to database
2020-12-06 16:15:21 +02:00
Avi Kivity
6e460e121a Merge 'docs: Add Sphinx and ScyllaDB theme' from David Garcia
This PR adds the Sphinx documentation generator and the custom theme ``sphinx-scylladb-theme``. Once merged, the GitHub Actions workflow should automatically publish the developer notes stored under ``docs`` directory on http://scylladb.github.io/scylla

1. Run the command ``make preview`` from the ``docs`` directory.
2. Check the terminal where you have executed the previous command. It should not raise warnings.
3. Open in a new browser tab http://127.0.0.1:5500/ to see the generated documentation pages.

The table of contents displays the files sorted as they appear on GitHub. In a subsequent iteration, @lauranovich and I will submit an additional PR proposing a new folder organization structure.

Closes #7752

* github.com:scylladb/scylla:
  docs: fixed warnings
  docs: added theme
2020-12-06 15:26:57 +02:00
Benny Halevy
64a4ffc579 large_data_handler: do not delete records in the absence of large_data_stats
The previous way of deleting records based on the whole
sstable data_size causes overzealous deletions (#7668)
and inefficiency in the rows cache due to the large number
of range tombstones created.

Therefore we'd be better off just letting the
records expire using the 30-day TTL.

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201206083725.1386249-1-bhalevy@scylladb.com>
2020-12-06 11:34:37 +02:00
Avi Kivity
dc77d128e9 Revert "Merge "raft: fix replication if existing log on leader" from Gleb"
This reverts commit 0aa1f7c70a, reversing
changes made to 72c59e8000. The diff is
strange, including unrelated commits. There is no understanding of the
cause, so to be safe, revert and try again.
2020-12-06 11:34:19 +02:00
Pavel Emelyanov
df0e26035f counters: Drop call to get_local_storage_service and related
The local host id is now passed by argument, so we don't
need the counter_id::local() and some other methods that
call or are called by it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-04 16:31:12 +03:00
Pavel Emelyanov
914613b3c3 counters: Use local id arg in transform_counter_update_to_shards
Only a few places in it need the uuid. And since it's only 16 bytes,
it's possible to safely capture it by value in the called lambdas.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-04 16:30:31 +03:00
Pavel Emelyanov
62214e2258 database: Have local id arg in transform_counter_updates_to_shards()
There are two places that call it -- database code itself and
tests. The former already has the local host id, so just pass
one.

The latter are a bit trickier. Currently they use the value from
storage_service created by storage_service_for_tests, but since
this version of service doesn't pass through prepare_to_join()
the local_host_id value there is default-initialized, so just
default-initialize the needed argument in place.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-04 15:09:30 +03:00
Pavel Emelyanov
5a286ee8d4 storage_service: Keep local host id to database
The value in question is cached from db::system_keyspace
for places that want to have it without waiting for
futures. So far the only place is database counters code,
so keep the value on database itself. Next patches will
make use of it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-12-04 15:09:29 +03:00
Piotr Sarna
2015988373 Merge 'types: get rid of linearization in deserialize()' from Michał Chojnowski
Citing #6138:
> In the past few years we have converted most of our codebase to work in terms
> of fragmented buffers, instead of linearised ones, to help avoid large
> allocations that put large pressure on the memory allocator.
>
> One prominent component that still works exclusively in terms of linearised
> buffers is the types hierarchy, more specifically the de/serialization code
> to/from CQL format. Note that for most types, this is the same as our internal
> format, notable exceptions are non-frozen collections and user types.
>
> Most types are expected to contain reasonably small values, but texts, blobs
> and especially collections can get very large. Since the entire hierarchy
> shares a common interface we can either transition all or none to work with
> fragmented buffers.

This series gets rid of intermediate linearizations in deserialization. The next
steps are removing linearizations from serialization, validation and comparison
code.

Series summary:
- Fix a bug in `fragmented_temporary_buffer::view::remove_prefix`. (Discovered
  while testing. Since it wasn't discovered earlier, I guess it doesn't occur in
  any code path in master.)
- Add a `FragmentedView` concept to allow uniform handling of various types of
  fragmented buffers (`bytes_view`, `fragmented_temporary_buffer::view`,
  `ser::buffer_view` and likely `managed_bytes_view` in the future).
- Implement `FragmentedView` for relevant fragmented buffer types.
- Add helper functions for reading from `FragmentedView`.
- Switch `deserialize()` and all its helpers from `bytes_view` to
  `FragmentedView`.
- Remove `with_linearized()` calls which just became unnecessary.
- Add an optimization for single-fragment cases.

The addition of `FragmentedView` might be controversial, because another concept
meant for the same purpose - `FragmentRange` - is already used. Unfortunately,
it lacks the functionality we need. The main (only?) thing we want to do with a
fragmented buffer is to extract a prefix from it and `FragmentRange` gives us no
way to do that, because it's immutable by design. We can work around that by
wrapping it into a mutable view which will track the offset into the immutable
`FragmentRange`, and that's exactly what `linearizing_input_stream` is. But it's
wasteful. `linearizing_input_stream` is a heavy type, unsuitable for passing
around as a view - it stores a pair of fragment iterators, a fragment view and a
size (11 words) to conform to the iterator-based design of `FragmentRange`, when
one fragment iterator (4 words) already contains all needed state, just hidden.
I suggest we replace `FragmentRange` with `FragmentedView` (or something
similar) altogether.
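
To make the "extract a prefix" operation concrete, here is a small Python sketch of a mutable view over a list of fragments. It is only an analogy for the C++ concept: the method names are assumptions for illustration and the real concept's requirements may differ.

```python
import struct


class FragmentedView:
    """Hypothetical mutable view over a sequence of byte fragments."""

    def __init__(self, fragments):
        self._fragments = [memoryview(f) for f in fragments if len(f)]
        self._index = 0  # first fragment that still holds unread bytes

    def size_bytes(self):
        return sum(len(f) for f in self._fragments[self._index:])

    def current_fragment(self):
        if self._index < len(self._fragments):
            return self._fragments[self._index]
        return memoryview(b"")

    def remove_prefix(self, n):
        """Drop n bytes from the front, crossing fragment boundaries."""
        if n > self.size_bytes():
            raise ValueError("prefix longer than view")
        while n > 0:
            frag = self._fragments[self._index]
            if n < len(frag):
                self._fragments[self._index] = frag[n:]
                return
            n -= len(frag)
            self._index += 1


def read_bytes(view, n):
    """Copy n bytes off the front of the view without linearizing the rest."""
    if n > view.size_bytes():
        raise ValueError("not enough data")
    out = bytearray()
    while len(out) < n:
        frag = view.current_fragment()
        take = min(n - len(out), len(frag))
        out += frag[:take]
        view.remove_prefix(take)
    return bytes(out)


def read_int32(view):
    return struct.unpack(">i", read_bytes(view, 4))[0]
```

The view itself carries the read position, so helpers can consume a value and leave the view pointing at the next one, which is what an immutable range cannot offer on its own.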

Refs: #6138

Closes #7692

* github.com:scylladb/scylla:
  types: collection: add an optimization for single-fragment buffers in deserialize
  types: add an optimization for single-fragment buffers in deserialize
  cql3: tuples: don't linearize in in_value::from_serialized
  cql3: expr: expression: replace with_linearize with linearized
  cql3: constants: remove unneeded uses of with_linearized
  cql3: update_parameters: don't linearize in prefetch_data_builder::add_cell
  cql3: lists: remove unneeded use of with_linearized
  query-result-set: don't linearize in result_set_builder::deserialize
  types: remove unneeded collection deserialization overloads
  types: switch collection_type_impl::deserialize from bytes_view to FragmentedView
  cql3: sets: don't linearize in value::from_serialized
  cql3: lists: don't linearize in value::from_serialized
  cql3: maps: don't linearize in value::from_serialized
  types: remove unused deserialize_aux
  types: deserialize: don't linearize tuple elements
  types: deserialize: don't linearize collection elements
  types: switch deserialize from bytes_view to FragmentedView
  types: deserialize tuple types from FragmentedView
  types: deserialize set type from FragmentedView
  types: deserialize map type from FragmentedView
  types: deserialize list type from FragmentedView
  types: add FragmentedView versions of read_collection_size and read_collection_value
  types: deserialize varint type from FragmentedView
  types: deserialize floating point types from FragmentedView
  types: deserialize decimal type from FragmentedView
  types: deserialize duration type from FragmentedView
  types: deserialize IP address types from FragmentedView
  types: deserialize uuid types from FragmentedView
  types: deserialize timestamp type from FragmentedView
  types: deserialize simple date type from FragmentedView
  types: deserialize time type from FragmentedView
  types: deserialize boolean type from FragmentedView
  types: deserialize integer types from FragmentedView
  types: deserialize string types from FragmentedView
  types: remove unused read_simple_opt
  types: implement read_simple* versions for FragmentedView
  utils: fragmented_temporary_buffer: implement FragmentedView for view
  utils: fragment_range: add single_fragmented_view
  serializer: implement FragmentedView for buffer_view
  utils: fragment_range: add linearized and with_linearized for FragmentedView
  utils: fragment_range: add FragmentedView
  utils: fragmented_temporary_buffer: fix view::remove_prefix
2020-12-04 09:46:20 +01:00
Michał Chojnowski
a1f7fabb3d types: collection: add an optimization for single-fragment buffers in deserialize
Helpers parametrized with single_fragmented_view should compile to better code,
so let's use them when possible.
2020-12-04 09:21:05 +01:00
Michał Chojnowski
08c394726e types: add an optimization for single-fragment buffers in deserialize
Values usually come in a single fragment, but we pay the cost of fragmented
deserialization nevertheless: bigger view objects (4 words instead of 2 words),
more state to keep updated (i.e. total view size in addition to current fragment
size) and more branches.

This patch adds a special case for single-fragment buffers to
abstract_type::deserialize. They are converted to a single_fragmented_view
before doing anything else. Templates instantiated with single_fragmented_view
should compile to better code than their multi-fragmented counterparts. If
abstract_type::deserialize is inlined, this patch should completely prevent any
performance penalties for switching from with_linearized to fragmented
deserialization.
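
The shape of this special case can be sketched in Python (an analogy only; the function name is hypothetical and the performance argument applies to the templated C++ code, not to this sketch):

```python
import struct


def deserialize_int32(fragments):
    """Decode a big-endian int32 stored in one or more fragments."""
    if len(fragments) == 1:
        # Fast path: one contiguous buffer, no per-fragment bookkeeping.
        return struct.unpack_from(">i", fragments[0])[0]
    # General path: gather the 4 bytes across fragment boundaries.
    collected = bytearray()
    for frag in fragments:
        need = 4 - len(collected)
        if need == 0:
            break
        collected += frag[:need]
    if len(collected) < 4:
        raise ValueError("not enough data for int32")
    return struct.unpack(">i", collected)[0]
```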
2020-12-04 09:19:39 +01:00
Michał Chojnowski
f75db1fcf5 cql3: tuples: don't linearize in in_value::from_serialized
We can deserialize directly from fragmented buffers now.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
68177a6721 cql3: expr: expression: replace with_linearize with linearized
with_linearized creates an additional internal `bytes` when the input is
fragmented. linearized copies the data directly to the output `bytes`, so it's
more efficient.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
5ffe40d5a2 cql3: constants: remove unneeded uses of with_linearized
We can deserialize directly from fragmented buffers now.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
3c98806df9 cql3: update_parameters: don't linearize in prefetch_data_builder::add_cell
We can deserialize directly from fragmented buffers now.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
c43ef3951b cql3: lists: remove unneeded use of with_linearized
We can deserialize directly from fragmented buffers now.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
0d5c5b8645 query-result-set: don't linearize in result_set_builder::deserialize
We can deserialize directly from fragmented buffers now.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
04786dee30 types: remove unneeded collection deserialization overloads
Inherit the method from base class rather than reimplementing it in every child.
2020-12-04 09:19:39 +01:00
Michał Chojnowski
c08419e28d types: switch collection_type_impl::deserialize from bytes_view to FragmentedView
Devirtualizes collection_type_impl::deserialize (so it can be templated) and
adds a FragmentedView overload. This will allow us to deserialize collections
with explicit cql_serialization_format directly from fragmented buffers.
2020-12-04 09:19:37 +01:00
dgarcia360
1304f6a0bb docs: fixed warnings
docs: fixed warnings
2020-12-03 17:40:34 +01:00
dgarcia360
a340b46a79 docs: added theme 2020-12-03 17:37:18 +01:00
Michał Chojnowski
d731b34d95 cql3: sets: don't linearize in value::from_serialized
We can deserialize directly from fragmented buffers now.
2020-12-03 10:57:07 +01:00
Michał Chojnowski
64e64fd2b3 cql3: lists: don't linearize in value::from_serialized
We can deserialize directly from fragmented buffers now.
2020-12-03 10:57:07 +01:00
Michał Chojnowski
536a2f8c8d cql3: maps: don't linearize in value::from_serialized
We can deserialize directly from fragmented buffers now.
2020-12-03 10:57:07 +01:00
Michał Chojnowski
58d9f52363 types: remove unused deserialize_aux
Dead code.
2020-12-03 10:57:07 +01:00
Michał Chojnowski
8440279130 types: deserialize: don't linearize tuple elements
We can deserialize directly from fragmented buffers now.
2020-12-03 10:57:07 +01:00
Michał Chojnowski
a216b0545f types: deserialize: don't linearize collection elements
We can deserialize directly from fragmented buffers now.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
1ccdfc7a90 types: switch deserialize from bytes_view to FragmentedView
The final part of the transition of deserialize from bytes_view to
FragmentedView.
Adds a FragmentedView overload to abstract_type::deserialize and
switches deserialize_visitor from bytes_view to FragmentedView, allowing
deserialization of all types with no intermediate linearization.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
898cea4cde types: deserialize tuple types from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
507883f808 types: deserialize set type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
9b211a7285 types: deserialize map type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
5f1939554c types: deserialize list type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
ad7ab73cd0 types: add FragmentedView versions of read_collection_size and read_collection_value
We will need those to deserialize collections from FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
495bf5c431 types: deserialize varint type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
0f8ad89740 types: deserialize floating point types from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
0bb0291e50 types: deserialize decimal type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
760bc5fd60 types: deserialize duration type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
75a56f439b types: deserialize IP address types from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
9f668929db types: deserialize uuid types from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
3e1a24ca0d types: deserialize timestamp type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
a4bc43ab19 types: deserialize simple date type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
24bd986aea types: deserialize time type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
c03ad52513 types: deserialize boolean type from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
2f351928e2 types: deserialize integer types from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
28b727082f types: deserialize string types from FragmentedView
A part of the transition of deserialize from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
426308f526 types: remove unused read_simple_opt
Dead code.
2020-12-03 10:57:06 +01:00
Michał Chojnowski
e1145fe410 types: implement read_simple* versions for FragmentedView
We will need those to switch deserialize() from bytes_view to FragmentedView.
2020-12-03 10:57:06 +01:00
Botond Dénes
71722d8b41 frozen_mutation: add partition context to errors coming from deserializing 2020-12-02 15:08:49 +02:00
Botond Dénes
8d944ff755 partition_builder: accept_row(): use append_clustering_row()
The partition builder doesn't expect the looked-up row to exist. In fact,
the row already existing is a sign of a bug. Currently bugs resulting in
duplicate rows will manifest by tripping an assert in
`row::append_cell()`. This however results in poor diagnostics, so we
want to catch these errors sooner to be able to provide higher level
diagnostics. To this end, switch to the freshly introduced
`append_clustering_row()` so that duplicate rows are found early and in
a context where their identity is known.
2020-12-02 15:08:49 +02:00
Botond Dénes
63ea36e277 mutation_partition: add append_clustered_row()
A variant of `clustered_row()` which throws if the row already exists, or
if any greater row already exists.
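
The append-only contract can be sketched in Python as follows. This is a hypothetical model with plain comparable keys, not the C++ mutation_partition code.

```python
class Partition:
    """Hypothetical model: rows kept in ascending clustering-key order."""

    def __init__(self):
        self._rows = []  # list of (key, row) pairs, sorted by key

    def append_clustered_row(self, key):
        """Add a row at the end; reject duplicates and out-of-order keys."""
        if self._rows and self._rows[-1][0] >= key:
            raise ValueError(
                f"row {key!r} is not greater than last row {self._rows[-1][0]!r}")
        row = {}
        self._rows.append((key, row))
        return row

    def keys(self):
        return [key for key, _ in self._rows]
```

Because rows only ever go at the end, checking against the last key is enough to catch both an existing equal row and any greater row.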
2020-12-02 15:08:32 +02:00
Benny Halevy
c7311d1080 docs: sstable-scylla-format: document large_data_type in more details
This adds details about large_data_type on top of
ca5184052d
and introduces structured indentation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201202110539.634880-1-bhalevy@scylladb.com>
2020-12-02 13:25:49 +02:00
Avi Kivity
a95c2a946c Merge 'mutation_reader: introduce clustering_order_reader_merger' from Kamil Braun
This abstraction is used to merge the output of multiple readers, each
opened for a single partition query, into a non-decreasing stream
of mutation_fragments.

It is similar to `mutation_reader_merger`,
but an important difference is that the new merger may select new readers
in the middle of a partition after it already returned some fragments
from that partition.  It uses the new `position_reader_queue` abstraction
to select new readers. It doesn't support multi-partition (ring range) queries.

The new merger will be later used when reading from sstable sets created
by TimeWindowCompactionStrategy. This strategy creates many sstables
that are mostly disjoint w.r.t the contained clustering keys, so we can
delay opening sstable readers when querying a partition until after we have
processed all mutation fragments with positions before the keys
contained by these sstables.
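
A simplified Python sketch of the lazy-opening idea follows. It assumes each reader advertises a lower bound on the positions it contains and that the queue is sorted by that bound; the names are hypothetical, and unlike the real merger it emits equal positions from different readers separately instead of merging them.

```python
import heapq


def merge_in_clustering_order(reader_queue):
    """Merge readers into one non-decreasing stream, opening them lazily.

    reader_queue: list of (lower_bound, factory) sorted by lower_bound,
    where factory() returns an iterable of ascending positions.
    """
    pending = list(reversed(reader_queue))  # pop() yields the lowest bound
    heap = []  # entries are (position, reader_number, iterator)
    opened = 0
    while heap or pending:
        # Open a reader only once the stream has reached its lower bound.
        while pending and (not heap or pending[-1][0] <= heap[0][0]):
            _, factory = pending.pop()
            iterator = iter(factory())
            first = next(iterator, None)
            if first is not None:
                heapq.heappush(heap, (first, opened, iterator))
            opened += 1
        if not heap:
            continue
        position, number, iterator = heapq.heappop(heap)
        yield position
        following = next(iterator, None)
        if following is not None:
            heapq.heappush(heap, (following, number, iterator))
```

With mostly disjoint readers, only one or two are open at any time, so the heap stays small and later readers are not touched until their keys are actually needed.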

A microbenchmark was added that compares the existing combining reader
(which uses `mutation_reader_merger` underneath) with a new combining reader
built using the new `clustering_order_reader_merger` and a simple queue of readers
that returns readers from some supplied set. The used set of readers is built from the following
ranges of keys (each range corresponds to a single reader):
`[0, 31]`, `[30, 61]`, `[60, 91]`, `[90, 121]`, `[120, 151]`.
The microbenchmark runs the reader and divides the result by the number of mutation fragments.
The results on my laptop were:
```
$ build/release/test/perf/perf_mutation_readers -t clustering_combined.* -r 10
single run iterations:    0
single run duration:      1.000s
number of runs:           10

test                                      iterations      median         mad         min         max
clustering_combined.ranges_generic           2911678   117.598ns     0.685ns   116.175ns   119.482ns
clustering_combined.ranges_specialized       3005618   111.015ns     0.349ns   110.063ns   111.840ns
```
`ranges_generic` denotes the existing combining reader, `ranges_specialized` denotes the new reader.

Split from https://github.com/scylladb/scylla/pull/7437.

Closes #7688

* github.com:scylladb/scylla:
  tests: mutation_source_test for clustering_order_reader_merger
  perf: microbenchmark for clustering_order_reader_merger
  mutation_reader_test: test clustering_order_reader_merger in memory
  test: generalize `random_subset` and move to header
  mutation_reader: introduce clustering_order_reader_merger
2020-12-02 12:15:35 +02:00
Kamil Braun
502ed2e9f7 tests: mutation_source_test for clustering_order_reader_merger 2020-12-02 11:13:58 +01:00
Nadav Har'El
fae2ba60e9 cql-pytest: start to port Cassandra's CQL unit tests
In issue #7722, it was suggested that we should port Cassandra's CQL unit
tests into our own repository, by translating the Java tests into Python
using the new cql-pytest framework. Cassandra's CQL unit test framework is
orders of magnitude faster than dtest, and in-tree, so Cassandra have been
moving many CQL correctness tests there, and we can also benefit from their
test cases.

In this patch, we take the first step in a long journey:

1. I created a subdirectory, test/cql-pytest/cassandra_tests, where all the
   translated Cassandra tests will reside. The structure of this directory
   will mirror that of the test/unit/org/apache/cassandra/cql3 directory in
   the Cassandra repository.
   pytest conveniently looks for test files recursively, so when all the
   cql-pytest are run, the cassandra_tests files will be run as well.
   As usual, one can also run only a subset of all the tests, e.g.,
   "test/cql-pytest/run -vs cassandra_tests" runs only the tests in the
   cassandra_tests subdirectory (and its subdirectories).

2. I translated into Python two of the smallest test files -
   validation/entities/{TimeuuidTest,DataTypeTest}.java - containing just
   three test functions.
   The plan is to translate entire Java test files one by one, and to mirror
   their original location in our own repository, so it will be easier
   to remember what we already translated and what remains to be done.

3. I created a small library, porting.py, of functions which resemble the
   common functions of the Java tests (CQLTester.java). These functions aim
   to make porting the tests easier. Despite the resemblance, the ported code
   is not 100% identical (of course) and some effort is still required in
   this porting. As we continue this porting effort, we'll probably need
   more of these functions, and can also continue to improve them to reduce
   the porting effort.

Refs #7722.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201201192142.2285582-1-nyh@scylladb.com>
2020-12-02 09:29:22 +01:00
Avi Kivity
77466177ab Merge 'Use large_data_counters in scylla_metadata to decide when to delete large_data records' from Benny Halevy
This series introduces a `large_data_counters` element to `scylla_metadata` component to explicitly count the number of `large_{partitions,rows,cells}` and `too_many_rows` in the sstable.  These are accounted for in the sstable writer whenever the respective large data entry is encountered.

It is taken into account in `large_data_handler::maybe_delete_large_data_entries`, when engaged.
Otherwise, if deleting a legacy sstable that has no such entry in `scylla_metadata`, just revert to using the current method of comparing the sstable's `data_size` to the various thresholds.

Fixes #7668

Test: unit(dev)
Dtest: wide_rows_test.py (in progress)

Closes #7669

* github.com:scylladb/scylla:
  docs: sstable-scylla-format: add large_data_stats subcomponent
  large_data_handler: maybe_delete_large_data_entries: use sstable large data stats
  large_data_handler: maybe_delete_large_data_entries: accept shared_sstable
  large_data_handler: maybe_delete_large_data_entries: move out of line
  sstables: load large_data_stats from scylla_metadata
  sstables: store large_data_stats in scylla_metadata
  sstables: writer: keep track of large data stats
  large_data_handler: expose methods to get threshold
  sstables: kl/writer: never record too many rows
  large_data_handler: indicate recording of large data entries
  large_data_handler: move constructor out of line
2020-12-02 10:08:18 +02:00
Nadav Har'El
5c08489569 cql-pytest: don't run tests if Scylla boot timed out
In test/cql-pytest/run.py we have a 200 second timeout to boot Scylla.
I never expected to reach this timeout - it normally takes (in dev
build mode) around 2 seconds, but in one run on Jenkins we did reach it.
It turns out that the code did not recognize this timeout correctly: it
assumed that Scylla had booted correctly, and then every subtest failed
when it could not connect to Scylla.

This patch fixes the timeout logic. After the timeout, if Scylla's
CQL port is still not responsive, the test run is failed - without
trying to run many individual tests.
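
A minimal sketch of the intended behaviour, not the actual run.py code: poll
the CQL port until a deadline, and refuse to run any test if the port never
answers. The names `wait_for_port` and `run_tests_if_booted` are hypothetical.

```python
import socket
import time

def wait_for_port(host, port, timeout, interval=0.1):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass  # not listening yet (refused or timed out); retry below
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

def run_tests_if_booted(host, port, timeout, run_tests):
    """Fail the whole run up front instead of failing every individual test."""
    if not wait_for_port(host, port, timeout):
        raise RuntimeError(f"server at {host}:{port} did not boot within {timeout}s")
    return run_tests()
```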

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201201150927.2272077-1-nyh@scylladb.com>
2020-12-02 08:48:44 +02:00
Kamil Braun
2da723b9c8 cdc: produce postimage when inserting with no regular columns
When a row was inserted into a table with no regular columns, and no
such row existed in the first place, postimage would not be produced.
Fix this.

Fixes #7716.

Closes #7723
2020-12-01 18:01:23 +02:00
Benny Halevy
ca5184052d docs: sstable-scylla-format: add large_data_stats subcomponent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:19:42 +02:00
Benny Halevy
4406a2514e large_data_handler: maybe_delete_large_data_entries: use sstable large data stats
If the sstable has scylla_metadata::large_data_stats use them
to determine whether to delete the corresponding large data records.

Otherwise, defer to the current method of comparing the sstable
data_size to the respective thresholds.
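
The decision can be sketched as follows. This is an illustration in Python,
not the C++ implementation; the function name, the dictionary layout and the
exact comparison used in the fallback are assumptions.

```python
def should_delete_large_data_entries(kind, stats, data_size, thresholds):
    """Decide whether records of the given kind may exist for an sstable.

    stats: per-kind counters loaded from scylla_metadata, or None for a
           legacy sstable that has no such counters.
    data_size, thresholds: used only by the legacy fallback.
    """
    if stats is not None:
        # Explicit counters: delete only if the writer recorded something.
        return stats.get(kind, 0) > 0
    # Legacy fallback (assumed comparison): the sstable is large enough that
    # it could have contained an entry above the threshold.
    return data_size > thresholds[kind]
```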

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:19:42 +02:00
Benny Halevy
8cebe7776f large_data_handler: maybe_delete_large_data_entries: accept shared_sstable
Since the actual deletion of the large data entries
is done in the background, and we don't capture the shared_sstable,
we can safely pass it to maybe_delete_large_data_entries when
deleting the sstable in sstable::unlink and it will be released
as soon as maybe_delete_large_data_entries returns.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:19:42 +02:00
Benny Halevy
f7d0ae3d10 large_data_handler: maybe_delete_large_data_entries: move out of line
It is called on the cold path, when the sstable is deleted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:19:42 +02:00
Benny Halevy
be4a58c34c sstables: load large_data_stats from scylla_metadata
Load the large data stats from the scylla_metadata component
if they are present. Otherwise, if we're opening a legacy sstable
that has no scylla_metadata_type::LargeDataStats, leave
sstable::_large_data_stats disengaged.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:19:42 +02:00
Benny Halevy
92443ed71c sstables: store large_data_stats in scylla_metadata
Store the large data statistics in the scylla_metadata component.

These will be retrieved when loading the sstable and be
used for determining whether to delete the corresponding
large data entries upon sstable deletion.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:19:42 +02:00
Benny Halevy
79c19a166c sstables: writer: keep track of large data stats
In the next patch, this will be written to the
sstable's scylla_metadata component.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:19:41 +02:00
Benny Halevy
8ab053bd44 large_data_handler: expose methods to get threshold
To be used for keeping large_data statistics in sstable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:18:14 +02:00
Benny Halevy
f1257dfdc0 sstables: kl/writer: never record too many rows
rows_count is not tracked prior to the mc format.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:18:14 +02:00
Benny Halevy
dd7422a713 large_data_handler: indicate recording of large data entries
Return true from the maybe_{record,log}_* methods if
a large data record or log entry was emitted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:18:14 +02:00
Benny Halevy
873107821b large_data_handler: move constructor out of line
No need for it to be inlined.

Also, add debug logging to the large data handler options.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-01 15:18:14 +02:00
Dejan Mircevski
e45af3b9b8 index: Ensure restriction is supported in find_idx
Previously, statement_restrictions::find_idx() would happily return an
index for a non-EQ restriction (because it checked only the column
name, not the operator).  This is incorrect: when the selected index
is for a non-EQ restriction, it is impossible to query that index
table.
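
A rough Python model of the difference, assuming for illustration that an
index can serve only EQ restrictions. The `Restriction` tuple and both
function names are hypothetical, not the statement_restrictions API.

```python
from collections import namedtuple

Restriction = namedtuple("Restriction", ["column", "operator"])

def find_idx_by_column_only(restrictions, indexes):
    """Old behaviour: any restriction on an indexed column selects the index."""
    for r in restrictions:
        if r.column in indexes:
            return indexes[r.column]
    return None

def find_idx(restrictions, indexes, supported_operators=("=",)):
    """New behaviour: the restriction's operator must also be supported."""
    for r in restrictions:
        if r.column in indexes and r.operator in supported_operators:
            return indexes[r.column]
    return None
```

With `indexes = {"v": "v_idx"}`, a restriction `v > 5` selects `v_idx` in the
first function but no index in the second.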

Fixes #7659.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #7665
2020-12-01 15:16:48 +02:00
Avi Kivity
df572b41ae Update seastar submodule
* seastar 010fb0df1e...8b400c7b45 (6):
  > append_challenged_posix_file_impl::read_dma: allow iovec to cross _logical_size
  > Merge "Extend per task-queue timing statistics" from Pavel E
  > tls_test: Create test certs at build time
  > cook: upgrade hwloc version
  > memory: rate-limit diagnostics messages
  > util/log: add rate-limited version of writer version of log()
2020-12-01 15:12:25 +02:00
Tomasz Grabiec
0c5d23d274 thrift: Validate cell names when constructing clustering keys
Currently, if the user provides a cell name with too many components,
we will accept it and construct an invalid clustering key. This may
result in undefined behavior down the stream.

It was caught by ASAN in a debug build when executing dtest
cql_tests.py:MiscellaneousCQLTester.cql3_insert_thrift_test with
nodetool flush manually added after the write. Triggered during
sstable writing to an MC-format sstable:

   seastar::shared_ptr<abstract_type const>::operator*() const at ././seastar/include/seastar/core/shared_ptr.hh:577
   sstables::mc::clustering_blocks_input_range::next() const at ./sstables/mx/writer.cc:180

To prevent corrupting the state in this way, we should fail
early. This patch adds validation which will fail thrift requests
which attempt to create invalid clustering keys.

Fixes #7568.

Example error:

  Internal server error: Cell name of ks.test has too many components, expected 1 got 2 in 0x0004000000040000017600
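
A sketch of this kind of early check, in Python for brevity. The function
name and signature are hypothetical; only the wording of the message follows
the example error above.

```python
def validate_cell_name(keyspace, table, components, clustering_key_size):
    """Reject a cell name with more components than the clustering key has."""
    if len(components) > clustering_key_size:
        raise ValueError(
            f"Cell name of {keyspace}.{table} has too many components, "
            f"expected {clustering_key_size} got {len(components)}")
    return tuple(components)
```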

Message-Id: <1605550477-24810-1-git-send-email-tgrabiec@scylladb.com>
2020-12-01 15:12:08 +02:00
Avi Kivity
2fd895a367 Merge 'dist/common/scripts/scylla_setup: Optionally config rsyslog destination' from Amnon Heiman
This patch adds an option to scylla_setup to configure an rsyslog destination.

The monitoring stack has an option to get information from rsyslog; it
requires that rsyslog on the Scylla machines send the trace lines to
it.

The configuration will be in a Scylla configuration file, so it is safe to run it multiple times.

Fixes #7589

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #7634

* github.com:scylladb/scylla:
  dist/common/scripts/scylla_setup: Optionally config rsyslog destination
  Adding dist/common/scripts/scylla_rsyslog_setup utility
2020-12-01 13:12:32 +02:00
Amnon Heiman
4036cecdea dist/common/scripts/scylla_setup: Optionally config rsyslog destination
This patch adds an option to scylla_setup to configure an rsyslog
destination.

The monitoring stack has an option to get information from rsyslog; it
requires that rsyslog on the Scylla machines send the trace lines to
it.

If the /etc/rsyslog.d/ directory exists (that means the current system
runs rsyslog), it asks whether to add an rsyslog configuration and, if so,
runs scylla_rsyslog_setup.

Fixes #7589

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-12-01 12:33:37 +02:00
Takuya ASADA
572d6b2a4e scylla_setup: fix wrong command suggestion on --io-setup
The scylla_setup command suggestion does not show the argument of --io-setup,
because we mistakenly store a bool value for it (it is recognized as 'store_true').
We always need to print '--io-setup X' in the suggestion instead.

Related #7395
2020-12-01 07:23:55 +09:00
Tomasz Grabiec
f8f81ec322 Merge "raft: various snapshot fixes" from Gleb
* scylla-dev/snapshot_fixes_v1:
  raft: ignore append_reply from a peer in SNAPSHOT state
  raft: Ignore outdated snapshots
  raft: set next_idx to correct value after snapshot transfer
2020-11-30 21:34:31 +01:00
Alejo Sanchez
72a64b05ea raft: replication test: fix total entries for initial snapshot
Since now total expected entries are updated by load snapshot, do not
trim the total entries expected values with the initial snapshot on
test state machine initialization.

reported by @gleb

Branch URL: https://github.com/alecco/scylla/tree/raft-ale-tests-06-snapshot-total-entries

Tests: unit ({dev}), unit ({debug}), unit ({release})

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20201125171232.321992-1-alejo.sanchez@scylladb.com>
2020-11-30 21:34:31 +01:00
Kamil Braun
af49a95627 perf: microbenchmark for clustering_order_reader_merger 2020-11-30 11:55:44 +01:00
Kamil Braun
4f7e2bf920 mutation_reader_test: test clustering_order_reader_merger in memory 2020-11-30 11:55:44 +01:00
Kamil Braun
b22aa6dbde test: generalize random_subset and move to header 2020-11-30 11:55:44 +01:00
Kamil Braun
0b36c5e116 mutation_reader: introduce clustering_order_reader_merger
This abstraction is used to merge the output of multiple readers, each
opened for a single partition query, into a non-decreasing stream
of mutation_fragments.

It is similar to `mutation_reader_merger`;
an important difference is that the new merger may select new readers
in the middle of a partition after it already returned some fragments
from that partition.  It uses the new `position_reader_queue` abstraction
to select new readers. It doesn't support multi-partition (ring range) queries.

The new merger will be later used when reading from sstable sets created
by TimeWindowCompactionStrategy. This strategy creates many sstables
that are mostly disjoint w.r.t the contained clustering keys, so we can
delay opening sstable readers when querying a partition until after we have
processed all mutation fragments with positions before the keys
contained by these sstables.
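
The idea of delaying reader opening can be sketched in Python as a k-way
merge over integer positions. This is only an illustration under simplifying
assumptions (positions are plain integers, each reader knows its lower
bound up front); `merge_lazily` and its arguments are hypothetical names and
do not mirror the C++ interfaces.

```python
import heapq

def merge_lazily(reader_specs):
    """Merge sorted streams, opening each only when its lower bound is reached.

    reader_specs: iterable of (min_position, open_fn) pairs, where open_fn()
    returns an iterable of positions in non-decreasing order, all of them
    greater than or equal to min_position.
    """
    pending = sorted(reader_specs, key=lambda spec: spec[0])  # the reader queue
    next_pending = 0
    heap = []  # entries: (position, reader_id, iterator)
    while heap or next_pending < len(pending):
        # Open every unopened reader that may hold a position no greater
        # than the smallest position currently buffered.
        while next_pending < len(pending) and (
                not heap or pending[next_pending][0] <= heap[0][0]):
            it = iter(pending[next_pending][1]())
            first = next(it, None)
            if first is not None:
                heapq.heappush(heap, (first, next_pending, it))
            next_pending += 1
        if not heap:
            break  # all remaining readers were empty
        position, reader_id, it = heapq.heappop(heap)
        yield position
        following = next(it, None)
        if following is not None:
            heapq.heappush(heap, (following, reader_id, it))
```

With readers covering `[0, 31]`, `[30, 61]` and `[60, 91]`, the second reader
is opened only once position 30 is about to be emitted, and the third only
at position 60.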
2020-11-30 11:55:44 +01:00
Avi Kivity
ea9c058be3 Merge 'Don't use secondary indices for multi-column restrictions' from Dejan Mircevski
Fix #7680 by never using secondary index for multi-column restrictions.

Modify expr::is_supported_by() to handle multi-column correctly.

Tests: unit (dev)

Closes #7699

* github.com:scylladb/scylla:
  cql3/expr: Clarify multi-column doesn't use indexing
  cql3: Don't use index for multi-column restrictions
  test: Add eventually_require_rows
2020-11-30 12:38:26 +02:00
Avi Kivity
12c20c4101 Merge 'test/cql-pytest: tests for string validation (UTF-8 and ASCII)' from Nadav Har'El
The first two patches in this series are small improvements to cql-pytest to prepare for the third and main patch. This third patch adds cql-pytest tests which check that we fail CQL queries that try to inject non-ASCII and non-UTF-8 strings for ascii and text columns, respectively.

The tests do not discover any unknown bug in Scylla, however, they do show that Scylla is more strict in its definition of "valid UTF-8" compared to Cassandra.

Closes #7719

* github.com:scylladb/scylla:
  test/cql-pytest: add tests for validation of inserted strings
  test/cql-pytest: add "scylla_only" fixture
  test/cpy-pytest: enable experimental features
2020-11-30 12:26:25 +02:00
Piotr Wojtczak
3560acd311 cql_metrics: Add metrics for CQL errors
This change adds tracking of all the CQL errors that can be
raised in response to a CQL message from a client, as described
in the CQL v4 protocol and with Scylla's CDC_WRITE_FAILUREs
included.

Fixes #5859

Closes #7604
2020-11-30 12:18:37 +02:00
Takuya ASADA
6238d105d9 dist/redhat: drop Conflicts with older kernel
We have "Conflicts: kernel < 3.10.0-514" on rpm package to make sure
the environment is running newer kernel.
However, the user may use a non-standard kernel which has a different package name,
like kernel-ml or kernel-uek.
In such an environment the Conflicts tag does not work correctly:
even if the system is running a newer kernel, rpm only checks the "kernel" package
version number.

To avoid this issue, we need to drop the Conflicts tag.

Fixes #7675
2020-11-30 11:38:42 +02:00
Nadav Har'El
48c78ade33 test/cql-pytest: add tests for validation of inserted strings
This patch adds comprehensive cql-pytest tests for checking the validation
of strings - ASCII or UTF-8 - in CQL. Strings can be represented in CQL
using several methods - a string can be a string literal as
part of the statement, can be encoded as a blob (0x...), or
can be a binding parameter for a prepared statement, or returned
by user-defined functions - and these tests check all of them.

We already have low-level unit tests for UTF-8 parsing in
test/boost/utf8_test.cc, but the new tests here confirm that we really
call these low-level functions in the correct way. Moreover, since these
are CQL tests, they can also be run against Cassandra, and doing that
demonstrated that Scylla's UTF-8 parsing is *stricter* than Cassandra's -
Scylla's UTF-8 parser rejects the following sequences which Cassandra's
accepts:

 1. \xC0\x80 as another non-minimal representation of null. Note that other
    non-minimal encodings are rejected by Cassandra, as expected.
 2. Characters beyond the official Unicode range (or what Scylla considers
    the end of the range).
 3. UTF-16 surrogates - these are not considered valid UTF-8, but Cassandra
    accepts them, and Scylla does not.
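
For reference, the three cases can be reproduced with any strict UTF-8
decoder. The sketch below uses Python's built-in strict decoder as a
stand-in; it is not Scylla's parser, and the sample byte strings are chosen
here for illustration.

```python
def is_strictly_valid_utf8(data: bytes) -> bool:
    """True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # default error handler is 'strict'
        return True
    except UnicodeDecodeError:
        return False

SAMPLES = {
    "non-minimal encoding of null": b"\xc0\x80",
    "code point beyond U+10FFFF": b"\xf4\x90\x80\x80",
    "UTF-16 surrogate U+D800": b"\xed\xa0\x80",
    "plain ASCII": b"hello",
    "two-byte character": "\u00e9".encode("utf-8"),
}

for description, raw in SAMPLES.items():
    print(f"{description}: {'valid' if is_strictly_valid_utf8(raw) else 'invalid'}")
```

The first three samples are rejected and the last two are accepted.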

In the future, we should consider whether Scylla is more correct than
Cassandra here (so we're fine), or whether compatibility is more important
than correctness (so this exposed a bug).

The ASCII tests reproduce issue #5421 - that trying to insert a
non-ASCII string into an "ascii" column should produce an error on
insert - not later when fetching the string. This test now passes,
because issue 5421 was already fixed.

These tests did not expose any bug in Scylla (other than the differences
with Cassandra mentioned above), so all of them pass on Scylla. Two
of the tests fail on Cassandra, because Cassandra does not recognize
some invalid UTF-8 (according to Scylla's definition) as invalid.

Refs #5421.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-11-29 17:43:20 +02:00
Dejan Mircevski
5bc7e31284 restrictions: Forbid mixing ck=0 and (ck)=(0)
Reject the previously accepted case where the multi-column restriction
applied to just a single column, as it causes a crash downstream.  The
user can drop the parentheses to avoid the rejection.

Fixes #7710

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #7712
2020-11-29 17:06:41 +02:00
Avi Kivity
0584db1eb3 Merge "Unstall cleanup_compaction::get_ranges_for_invalidation" from Benny
"
This series adds maybe_yield called from
cleanup_compaction::get_ranges_for_invalidation
to avoid reactor stalls.

To achieve that, we first extract bool_class can_yield
to utils/maybe_yield.hh, and add a convenience helper:
utils::maybe_yield(can_yield) that conditionally calls
seastar::thread::maybe_yield if it can (when called in a
seastar thread).

With that, we add a can_yield parameter to dht::to_partition_ranges
and dht::partition_range::deoverlap (defaults to false), and
use it from cleanup_compaction::get_ranges_for_invalidation,
as the latter is always called from `consume_in_thread`.

Fixes #7674

Test: unit(dev)
"

* tag 'unstall-get_ranges_for_invalidation-v2' of github.com:bhalevy/scylla:
  compaction: cleanup_compaction: get_ranges_for_invalidation: add yield points
  dht/i_partitioner: to_partition_ranges: support yielding
  locator: extract can_yield to utils/maybe_yield.hh
2020-11-29 14:10:39 +02:00
Asias He
0a3a2a82e1 api: Add force_remove_endpoint for gossip
It is used to force remove a node from gossip membership if something
goes wrong.

Note: run the force_remove_endpoint api at the same time on _all_ the
nodes in the cluster in order to prevent the removed nodes from coming back,
because nodes that have not run the force_remove_endpoint api cmd can
gossip the removed node's information around to other nodes within 2 *
ring_delay (2 * 30 seconds by default).

For instance, in a 3-node cluster where node 3 is decommissioned, to remove
node 3 from gossip membership prior to the auto removal (3 days by
default), run the api cmd on both node 1 and node 2 at the same time.

$ curl -X POST --header "Accept: application/json"
"http://127.0.0.1:10000/gossiper/force_remove_endpoint/127.0.0.3"
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.2:10000/gossiper/force_remove_endpoint/127.0.0.3"

Then run 'nodetool gossipinfo' on all the nodes to check the removed nodes
are not present.

Fixes #2134

Closes #5436
2020-11-29 13:58:46 +02:00
Nadav Har'El
0864933d4d test/cql-pytest: add "scylla_only" fixture
This patch adds a fixture "scylla_only" which can be used to mark tests
for Scylla-specific features. These tests are skipped when running against
other CQL servers - like Apache Cassandra.

We recognize Scylla by checking whether any system table has
"scylla" in its name - Scylla has several of those, and Cassandra
has none.
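
The detection rule itself is a one-liner. The helper below is a hypothetical
sketch of it, separated from pytest so that it can be read on its own; a
fixture would call it with the table names of the system keyspace and skip
the test when it returns False. The table names in the usage example are
only illustrative.

```python
def looks_like_scylla(system_table_names):
    """Guess whether the server is Scylla from its system table names."""
    return any("scylla" in name for name in system_table_names)

# Illustrative inputs, not an authoritative list of system tables.
print(looks_like_scylla(["local", "peers", "scylla_local"]))
print(looks_like_scylla(["local", "peers"]))
```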

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-11-29 10:18:58 +02:00
Nadav Har'El
91ccb2afb5 test/cpy-pytest: enable experimental features
Enable experimental features, and in particular UDF, so we can test those
features in our tests.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-11-29 10:18:58 +02:00
Michał Chojnowski
fcb258cb01 utils: fragmented_temporary_buffer: implement FragmentedView for view
fragmented_temporary_buffer::view is one of the types we want to directly
deserialize from.
2020-11-27 15:26:13 +01:00
Michał Chojnowski
f6cc2b6a48 utils: fragment_range: add single_fragmented_view
bytes_view is one of the types we want to deserialize from (at least for now),
so we want to be able to pass it to deserialize() after it's transitioned to
FragmentedView.

single_fragmented_view is a wrapper implementing FragmentedView for bytes_view.
It's constructed from bytes_view explicitly, because it's typically used in
context where we want to phase linearization (and by extension, bytes_view) out.
2020-11-27 15:26:13 +01:00
Michał Chojnowski
0b20c7ef65 serializer: implement FragmentedView for buffer_view
buffer_view is one of the types we want to directly deserialize from.
2020-11-27 15:26:13 +01:00
Michał Chojnowski
2008c0f62f utils: fragment_range: add linearized and with_linearized for FragmentedView
We would like those helpers to disappear one day but for now we still need them
until everything can handle fragmented buffers.
2020-11-27 15:26:13 +01:00
Michał Chojnowski
fc90bd5190 utils: fragment_range: add FragmentedView
This patch introduces FragmentedView - a concept intended as a general-purpose
interface for fragmented buffers.
Another concept made for this purpose, FragmentRange, already exists in the
codebase. However, it's unwieldy. The iterator-based design of FragmentRange is
harder to implement and requires more code, but more importantly it makes
FragmentRange immutable.
Usually we want to read the beginning of the buffer and pass the rest of it
elsewhere. This is impossible with FragmentRange.
FragmentedView can do everything FragmentRange can do and more, except for
playing nicely with iterator-based collection methods, but those are useless for
fragmented buffers anyway.
2020-11-27 15:26:13 +01:00
Lubos Kosco
4d0587ed11 scylla_util.py: fix metadata gcp call for disks to get details
Disk parsing expects the output of a recursive listing from the GCP
metadata REST call. The method used to list recursively by default,
but now it requires a boolean flag to run in recursive mode.

Fixes #7684

Closes #7685
2020-11-27 15:20:56 +02:00
Pekka Enberg
c84754a634 Update tools/java submodule
* tools/java ad48b44a26...8080009794 (1):
  > sstableloader: Fix command line parsing of "ignore-missing-columns"
2020-11-27 15:19:48 +02:00
Avi Kivity
390e07d591 dist: sysctl: configure more inotify instances
Since f3bcd4d205 ("Merge 'Support SSL Certificate Hot
Reloading' from Calle"), we reload certificates as they are
modified on disk. This uses inotify, which is limited by a
sysctl fs.inotify.max_user_instances, with a default of 128.

This is enough for 64 shards only, if both rpc and cql are
encrypted; above that startup fails.

Increase to 1200, which is enough for 6 instances * 200 shards.
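
The arithmetic behind these numbers, as a small sketch. The figure of two
inotify instances per shard for the default case is inferred from the
64-shard limit above (one each for rpc and cql) and is an assumption.

```python
def shards_supported(max_user_instances, instances_per_shard):
    """How many shards fit under a given fs.inotify.max_user_instances."""
    return max_user_instances // instances_per_shard

def required_instances(shards, instances_per_shard):
    """How many inotify instances a given number of shards needs."""
    return shards * instances_per_shard

print(shards_supported(128, 2))    # default limit, rpc + cql both encrypted
print(required_instances(200, 6))  # the new limit chosen by this patch
```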

Fixes #7700.

Closes #7701
2020-11-26 23:44:48 +02:00
Takuya ASADA
5f81f97773 install.sh: apply sysctl.d files on non-packaging installation
We don't apply sysctl.d files on non-packaging installation; apply them
there, just as the rpm/deb packages do.

Fixes #7702

Closes #7705
2020-11-26 09:52:14 +02:00
Takuya ASADA
ba4d54efa3 dist/redhat: packaging dependencies.conf as normal file, not ghost
When we introduced dependencies.conf, we mistakenly added it to the rpm as %ghost,
but it should be a normal file, installed normally on package installation.

Fixes #7703

Closes #7704
2020-11-26 09:50:05 +02:00
Dejan Mircevski
7f8ed811c1 cql3/expr: Clarify multi-column doesn't use indexing
Although not currently used, the old code was wrong and confusing to
readers.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-11-25 10:59:13 -05:00
Avi Kivity
956f031a68 Merge 'Add missing shaded<>::stop in exceptional startup code for CQL/redis' from Calle Wilund
Fixes #7211

If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
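
The pattern, modelled in Python for brevity. `ShardedService` and
`start_server` are hypothetical stand-ins; the real code deals with
seastar::sharded<> and futures.

```python
class ShardedService:
    """Stand-in for an object that must be stopped before it is destroyed."""

    def __init__(self):
        self.started = False

    def start(self):
        self.started = True

    def stop(self):
        self.started = False

def start_server(service, configure):
    """Start the service; if later setup fails, stop it and re-raise."""
    service.start()
    try:
        configure(service)  # the potentially exceptional part
    except Exception:
        service.stop()      # leave nothing started behind
        raise               # let the original exception reach the logger
```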

Closes #7697

* github.com:scylladb/scylla:
  redis::service: Shut down sharded<> subobject on startup exception
  transport::controller: Shut down distributed object on startup exception
2020-11-25 17:57:53 +02:00
Calle Wilund
55acf09662 redis::service: Shut down sharded<> subobject on startup exception
Refs #7211

If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
2020-11-25 15:52:47 +00:00
Calle Wilund
ae4d5a60ca transport::controller: Shut down distributed object on startup exception
Fixes #7211

If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
2020-11-25 15:52:47 +00:00
Dejan Mircevski
db63b40347 cql3: Don't use index for multi-column restrictions
The downstream code expects a single-column restriction when using an
index.  We could fix it, but we'd still have to filter the rows
fetched from the index table, unlike the code that queries the base
table directly.  For instance, WHERE (c1,c2,c3) = (1,2,3) with an
index on c3 can fetch just the right rows from the base table but all
the c3=3 rows from the index table.
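
A toy model of that example, with rows as `(c1, c2, c3)` tuples. The data and
function names are made up for illustration.

```python
ROWS = [(1, 2, 3), (1, 2, 4), (5, 6, 3), (7, 8, 3)]

def query_base_table(rows, key):
    """The base table can locate the exact clustering key."""
    return [row for row in rows if row == key]

def query_index_on_c3(rows, c3):
    """An index on c3 can only narrow the result down to rows with that c3."""
    return [row for row in rows if row[2] == c3]

exact = query_base_table(ROWS, (1, 2, 3))
candidates = query_index_on_c3(ROWS, 3)
# The index path would still need this extra filtering step:
filtered = [row for row in candidates if row == (1, 2, 3)]
print(len(exact), len(candidates), len(filtered))
```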

Fixes #7680

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-11-25 10:39:04 -05:00
Dejan Mircevski
ab7aa57b24 test: Add eventually_require_rows
Makes it easier to combine eventually{assert_that} with useful error
messages.

Refs #7573.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-11-25 10:34:44 -05:00
Benny Halevy
e1fe1f18c7 compaction: cleanup_compaction: get_ranges_for_invalidation: add yield points
Avoid reactor stalls by allowing yielding in long-running loops
as seen in #7674.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-25 13:46:32 +02:00
Gleb Natapov
be6119b350 raft: ignore append_reply from a peer in SNAPSHOT state
If append_reply is received from a node that is currently having a snapshot
transferred to it, ignore it; it is a stray reply.
2020-11-25 12:36:41 +02:00
Gleb Natapov
851e3000c4 raft: Ignore outdated snapshots
Do not try to install snapshots that are older than current one.
2020-11-25 12:36:41 +02:00
Gleb Natapov
2ce9473037 raft: set next_idx to correct value after snapshot transfer
After snapshot is transferred progress::next_idx is set to its index,
but the code uses current snapshot to set it instead of the snapshot
that was transferred. Those can be different snapshots.
2020-11-25 11:34:49 +02:00
Tomasz Grabiec
0aa1f7c70a Merge "raft: fix replication if existing log on leader" from Gleb
* scylla-dev/add_dummy_v2:
  raft: test: replication works on leader change without adding an entry
  raft: commit a dummy entry after leader change
  raft: test: fix snapshot correctness check
  sstables: add `may_have_partition_tombstones` method
2020-11-24 11:35:18 +01:00
Gleb Natapov
51d1d20687 raft: test: replication works on leader change without adding an entry
Check that a newly elected leader commits all the entries in its log
without waiting for more entries to be submitted.
2020-11-24 11:35:18 +01:00
Gleb Natapov
6130fb8b39 raft: commit a dummy entry after leader change
After a node becomes leader it needs to do two things: send an append
message to establish its leadership and commit one entry to make sure
all previous entries with smaller terms are committed as well.
2020-11-24 11:35:18 +01:00
Gleb Natapov
e3a886738b raft: test: fix snapshot correctness check
Snapshot index cannot be used to check snapshot correctness since some
entries may not be commands and thus do not affect the snapshot value. Let's
use the applied entries count instead.
2020-11-24 11:35:18 +01:00
Benny Halevy
37e971ad87 dht/i_partitioner: to_partition_ranges: support yielding
Allow yielding to prevent reactor stalls
when called with a long vector of ranges.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-24 12:23:56 +02:00
Benny Halevy
157a964a63 locator: extract can_yield to utils/maybe_yield.hh
Move the definition of bool_class can_yield to a standalone
header file and define there a maybe_yield(can_yield) helper.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-24 12:23:56 +02:00
Asias He
1b2155eb1d repair: Use same description for the same metric
In commit 9b28162f88 (repair: Use label
for node ops metrics), we switched to use label for different node
operations. We should use the same description for the same metric name.

Fixes #7681

Closes #7682
2020-11-24 09:35:39 +02:00
Avi Kivity
e8ff77c05f Merge 'sstables: a bunch of refactors' from Kamil Braun
1. sstables: move `sstable_set` implementations to a separate module

    All the implementations were kept in sstables/compaction_strategy.cc
    which is quite large even without them. `sstable_set` already had its
    own header file, now it gets its own implementation file.

    The declarations of implementation classes and interfaces (`sstable_set_impl`,
    `bag_sstable_set`, and so on) were also exposed in a header file,
    sstable_set_impl.hh, for the purposes of potential unit testing.

2. mutation_reader: move `mutation_reader::forwarding` to flat_mutation_reader.hh

    Files which need this definition won't have to include
    mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are
    in total smaller; mutation_reader.hh includes flat_mutation_reader.hh).

3. sstables: move sstable reader creation functions to `sstable_set`

    Lower level functions such as `create_single_key_sstable_reader`
    were made methods of `sstable_set`.

    The motivation is that each concrete sstable_set
    may decide to use a better sstable reading algorithm specific to the
    data structures used by this sstable_set. For this it needs to access
    the set's internals.

    A nice side effect is that we moved some code out of table.cc
    and database.hh which are huge files.

4. sstables: pass `ring_position` to `create_single_key_sstable_reader`

    instead of `partition_range`.

    It would be best to pass `partition_key` or `decorated_key` here.
    However, the implementation of this function needs a `partition_range`
    to pass into `sstable_set::select`, and `partition_range` must be
    constructed from `ring_position`s. We could create the `ring_position`
    internally from the key but that would involve a copy which we want to
    avoid.

5. sstable_set: refactor `filter_sstable_for_reader_by_pk`

    Introduce a `make_pk_filter` function, which given a ring position,
    returns a boolean function (a filter) that given a sstable, tells
    whether the sstable may contain rows with the given position.

    The logic has been extracted from `filter_sstable_for_reader_by_pk`.

Split from #7437.

Closes #7655

* github.com:scylladb/scylla:
  sstable_set: refactor filter_sstable_for_reader_by_pk
  sstables: pass ring_position to create_single_key_sstable_reader
  sstables: move sstable reader creation functions to `sstable_set`
  mutation_reader: move mutation_reader::forwarding to flat_mutation_reader.hh
  sstables: move sstable_set implementations to a separate module
2020-11-24 09:23:57 +02:00
Michał Chojnowski
9bceaac44c utils: fragmented_temporary_buffer: fix view::remove_prefix
This piece of logic was wrong for two unrelated reasons:
1. When fragmented_temporary_buffer::view is constructed from bytes_view,
_current is null. When remove_prefix was used on such view, null pointer
dereference happened.
2. It only worked for the first remove_prefix call. A second call would put a
wrong value in _current_position.
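
The intended behaviour can be illustrated with a small Python model. This is only a sketch of the semantics, not the C++ code: the `FragmentedView` class and its fields are hypothetical, and the point is merely that `remove_prefix` must work on a single-fragment view and compose correctly across repeated calls.

```python
class FragmentedView:
    """Read-only view over a sequence of byte fragments (illustrative only)."""

    def __init__(self, fragments):
        # Drop empty fragments so the first fragment always holds data.
        self._fragments = [bytes(f) for f in fragments if f]
        self._offset = 0  # bytes already consumed from the first fragment

    def __len__(self):
        return sum(len(f) for f in self._fragments) - self._offset

    def remove_prefix(self, n):
        if n > len(self):
            raise ValueError("prefix longer than view")
        # Work with the absolute offset so repeated calls compose correctly.
        n += self._offset
        while self._fragments and n >= len(self._fragments[0]):
            n -= len(self._fragments[0])
            self._fragments.pop(0)
        self._offset = n

    def linearize(self):
        return b"".join(self._fragments)[self._offset:]


view = FragmentedView([b"abc", b"de", b"fgh"])
view.remove_prefix(2)
after_first = view.linearize()
view.remove_prefix(2)  # the second call must also land on the right byte
after_second = view.linearize()

single = FragmentedView([b"hello"])  # analogue of a view built from one buffer
single.remove_prefix(5)
```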
2020-11-24 03:05:13 +01:00
Kamil Braun
d158921966 sstables: add may_have_partition_tombstones method
For sstable versions greater than or equal to md, the `min_max_column_names`
sstable metadata gives a range of position-in-partitions such that all
clustering rows stored in this sstable have positions in this range.

Partition tombstones in this context are understood as covering the
entire range of clustering keys; thus, if the sstable contains at least
one partition tombstone, the sstable position range is set to be the
range of all clustered rows.

Therefore, by checking that the position range is *not* the range of all
clustered rows we know that the sstable cannot have any partition tombstones.
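
A minimal sketch of the check, assuming an sstable of version md or newer. The encoding of a position range as a `(low, high)` pair and the `FULL_RANGE` constant are hypothetical stand-ins for the real metadata types.

```python
import math

# Hypothetical encoding: a position range is a (low, high) pair, and the
# range of all clustered rows is (-inf, +inf).
FULL_RANGE = (-math.inf, math.inf)


def may_have_partition_tombstones(position_range):
    """Return False only when the recorded range rules partition tombstones out."""
    return position_range == FULL_RANGE


narrow = may_have_partition_tombstones((1, 5))
full = may_have_partition_tombstones(FULL_RANGE)
```

The answer is deliberately one-sided: `True` means "maybe", while `False` is a guarantee.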

Closes #7678
2020-11-23 23:30:19 +02:00
Kamil Braun
72c59e8000 flat_mutation_reader: document assumption about fast_forward_to
It is not legal to fast forward a reader before it enters a partition.
One must ensure that there even is a partition in the first place. For
this one must fetch a `partition_start` fragment.

Closes #7679
2020-11-23 17:39:46 +01:00
Pavel Emelyanov
fea4a5492f system-keyspace: Remove dead code
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201123151453.27341-1-xemul@scylladb.com>
2020-11-23 17:16:15 +02:00
Tomasz Grabiec
36f9da6420 Merge "raft: testing: snapshots and partitioning elections" from Alejo
Fixes, features needed for testing, snapshot testing.
Free election after partitioning (replication test).

* https://github.com/alecco/scylla/tree/raft-ale-tests-05e:
  raft: replication test: partitioning with leader
  raft: replication test: run free election after partitioning
  raft: expose fsm tick() to server for testing
  raft: expose is_leader() for testing
  raft: replication test: test take and load snapshot
  raft: fix a bug in leader election
  raft: fix default randomized timeout
  raft: replication test: fix custom next leader
  raft: replication test: custom next leader noop for same
  raft: replication test: fix failure detector for disconnected
2020-11-23 14:36:39 +01:00
Kamil Braun
6c8b0af505 sstable_set: refactor filter_sstable_for_reader_by_pk
Introduce a `make_pk_filter` function, which given a ring position,
returns a boolean function (a filter) that given a sstable, tells
whether the sstable may contain rows with the given position.

The logic has been extracted from `filter_sstable_for_reader_by_pk`.
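
The filter-factory shape can be sketched as follows. This is an illustrative Python model, not the Scylla code: the `SSTable` record with `first`/`last` positions is hypothetical and stands in for whatever per-sstable metadata the real filter consults.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SSTable:
    # Hypothetical metadata: smallest and largest ring position stored.
    name: str
    first: int
    last: int


def make_pk_filter(pos: int) -> Callable[[SSTable], bool]:
    """Return a predicate telling whether an sstable may contain `pos`."""
    def may_contain(sst: SSTable) -> bool:
        return sst.first <= pos <= sst.last
    return may_contain


sstables = [SSTable("a", 0, 10), SSTable("b", 20, 30), SSTable("c", 5, 25)]
selected = [s.name for s in filter(make_pk_filter(7), sstables)]
```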
2020-11-23 12:35:10 +01:00
Kamil Braun
68663d0de0 sstables: pass ring_position to create_single_key_sstable_reader
instead of partition_range.

It would be best to pass `partition_key` or `decorated_key` here.
However, the implementation of this function needs a `partition_range`
to pass into `sstable_set::select`, and `partition_range` must be
constructed from `ring_position`s. We could create the `ring_position`
internally from the key but that would involve a copy which we want to
avoid.
2020-11-23 12:33:24 +01:00
Takuya ASADA
b90ddc12c9 scylla_prepare: add --tune system when SET_CLOCKSOURCE=yes
perftune.py only runs clocksource setup when --tune system is specified,
so we need to add it to the parameters when SET_CLOCKSOURCE=yes.

Fixes #7672
2020-11-23 10:51:16 +02:00
Avi Kivity
f8e0517bc7 cql: do not advance timeouts on internal pages
Currently, each internal page fetched during aggregating
gets a timeout based on the time the page fetch was started,
rather than the query start time. This means the query can
continue processing long after the client has abandoned it
due to its own timeout, which is based on the query start time.

Fix by establishing the timeout once when the query starts, and
not advancing it.
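
The difference between the two behaviours can be shown with plain arithmetic. This is a sketch under stated assumptions: `page_deadlines` and its parameters are hypothetical names, and times are simple numbers in seconds.

```python
def page_deadlines(query_start, timeout, page_starts, advance_per_page):
    """Return the deadline applied to each internal page fetch."""
    if advance_per_page:
        # Old behaviour: every page restarts the clock.
        return [start + timeout for start in page_starts]
    # Fixed behaviour: one deadline, set when the query starts.
    deadline = query_start + timeout
    return [deadline for _ in page_starts]


starts = [0.0, 8.0, 16.0]
old = page_deadlines(0.0, 10.0, starts, advance_per_page=True)
new = page_deadlines(0.0, 10.0, starts, advance_per_page=False)
```

With the old behaviour the third page may run until t=26 even though the client gave up at t=10; with the fix every page shares the t=10 deadline.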

Test: manual (SELECT count(*) FROM a large table).

Fixes #1175.

Closes #7662
2020-11-23 08:14:18 +01:00
Alejo Sanchez
1f8ca4e06d raft: replication test: partitioning with leader
For test simplicity support

    partition{leader{A},B,C,D}

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-22 22:39:00 -04:00
Avi Kivity
3eac976e24 build: remove non-C/C++ jobs from submodule_pools
The C and C++ sub-builds were placed in submodule_pool to
reduce concurrency, as they are memory intensive (well, at least
the C++ jobs are), and we choose build concurrency based on memory.
But the other submodules are not memory intensive, and certainly
the packaging jobs are not (and they are single-threaded too).

To allow these simple jobs to utilize multicores more efficiently,
remove them from submodule_pool so they can run in parallel.

Closes #7671
2020-11-23 00:32:41 +02:00
Avi Kivity
bcced9f56b build: compress unified package faster
The unified package is quite large (1GB compressed), and it
is the last step in the build so its build time cannot be
parallelized with other tasks. Compress it with pigz to take
advantage of multiple cores and speed up the build a little.

Closes #7670
2020-11-23 00:31:04 +02:00
Takuya ASADA
3fefa520bd dist/common/scripts: drop run() and out(), swtich to subprocess.run()
We initially implemented the run() and out() functions because we couldn't use
subprocess.run() while we were on Python 3.4.
Since we moved to relocatable python3, we no longer need to implement them ourselves.
We kept using these functions because we needed to set an environment variable to set PATH.
Since we recently moved that code to the python thunk, we are finally able to
drop run() and out() and switch to subprocess.run().
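
For illustration, the standard-library calls that typically replace such helpers look like this. The signatures of the removed run() and out() helpers are not shown here, so the mapping in the comments is an assumption; the example invokes the Python interpreter itself to stay portable.

```python
import subprocess
import sys

# check=True raises CalledProcessError on a non-zero exit status,
# which is what a hand-written run() helper would typically do.
subprocess.run([sys.executable, "-c", "pass"], check=True)

# capture_output=True with text=True returns stdout as a str,
# which is what a hand-written out() helper would typically do.
result = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    capture_output=True, text=True, check=True,
)
output = result.stdout.strip()
```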
2020-11-22 17:59:27 +02:00
Alejo Sanchez
f12fed0809 raft: replication test: run free election after partitioning
When partitioning without keeping the existing leader, run an election
without forcing a particular leader.

To force a leader after partitioning, a test can just set it with new_leader{X}.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-22 10:32:34 -04:00
Alejo Sanchez
d610d5a7b8 raft: expose fsm tick() to server for testing
For tests to advance servers they need to invoke tick().

This is needed to advance free elections.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-22 10:32:34 -04:00
Alejo Sanchez
9e7e14fc50 raft: expose is_leader() for testing
Expose fsm leader check to allow tests to find out the leader after an
election.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-22 10:32:34 -04:00
Alejo Sanchez
f4d0131f02 raft: replication test: test take and load snapshot
Trigger automatic snapshotting through configuration.

For now, handle expected log index within the test's state machine and
pass it with snapshot_value (within the test file).

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-22 10:32:34 -04:00
Konstantin Osipov
bce8cb11a7 raft: fix a bug in leader election
If a server responds favourably to RequestVote RPC, it should
reset its election timer, otherwise it has very high chances of becoming
a candidate with an even newer term, despite successful elections.
A candidate with a term larger than the leader rejects AppendEntries
RPCs and can not become a leader itself (because of protection
against disruptive leaders), so it is stuck in this state.
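
A toy model of the fix, not the actual fsm code: the `Follower` class is hypothetical, time is counted in ticks, and the log up-to-date check of a real RequestVote handler is omitted.

```python
class Follower:
    """Toy follower tracking only its term, vote and election timer."""

    def __init__(self, election_timeout):
        self.term = 0
        self.voted_for = None
        self.election_timeout = election_timeout
        self.elapsed = 0  # ticks since the timer was last reset

    def tick(self):
        self.elapsed += 1
        return self.elapsed >= self.election_timeout  # True: become candidate

    def request_vote(self, candidate, term):
        if term > self.term:
            self.term, self.voted_for = term, None
        granted = term == self.term and self.voted_for in (None, candidate)
        if granted:
            self.voted_for = candidate
            self.elapsed = 0  # the fix: granting a vote resets the timer
        return granted


f = Follower(election_timeout=3)
f.tick()
f.tick()
granted = f.request_vote("A", term=1)
became_candidate = f.tick()  # would be True without the reset above
```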
2020-11-22 10:32:34 -04:00
Alejo Sanchez
08f8c418df raft: fix default randomized timeout
Range after election timeout should start at +1.
This matches existing update_current_term() code adding dist(1, 2*n).

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-22 10:32:34 -04:00
Alejo Sanchez
ab3a8b7bcd raft: replication test: fix custom next leader
Adjustments after changes due to free election in partitioning and changes in
the code.

Elapse previous leader after isolating it.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-22 10:32:22 -04:00
Alejo Sanchez
3bff7d1d21 raft: replication test: custom next leader noop for same
If the custom-specified leader is the same as the current one, do nothing.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-22 10:15:20 -04:00
Amnon Heiman
9e116d136e Adding dist/common/scripts/scylla_rsyslog_setup utility
scylla_rsyslog_setup adds a configuration file to rsyslog to forward the
traces to a remote server.

It will overwrite any existing file, so it is safe to run it multiple
times.

It takes an IP, or an IP and port, from the user for that configuration. If
no port is provided, the default port of the Scylla-Monitoring promtail is
used.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-11-22 15:48:48 +02:00
Avi Kivity
1e170ebfc1 Merge 'Changing hints configuration followup' from Piotr Dulikowski
Follow-up to https://github.com/scylladb/scylla/pull/6916.

- Fixes wrong usage of `resource_manager::prepare_per_device_limits`,
- Improves locking in `resource_manager` so that it is safer to call its methods concurrently,
- Adds comments around `resource_manager::register_manager` so that it's more clear what this method does and why.

Closes #7660

* github.com:scylladb/scylla:
  hints/resource_manager: add comments to register_manager
  hints/resource_manager: fix indentation
  hints/resource_manager: improve mutual exclusion
  hints/resource_manager: correct prepare_per_device_limits usage
2020-11-22 15:06:35 +02:00
Alejo Sanchez
1436e4a323 raft: replication test: fix failure detector for disconnected
For a disconnected server, is_alive() is false for all other servers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-22 09:04:58 -04:00
Pekka Enberg
2c8dcbe5c5 reloc: Remove "build_reloc.sh" script as obsolete
The "ninja dist-server-tar" command is a full replacement for
"build_reloc.sh" script. The release engineering infrastructure has been
switched to ninja, so let's remove "build_reloc.sh" as obsolete.
2020-11-20 22:41:26 +02:00
Piotr Sarna
5a9dc6a3cc Merge 'Cleanup CDC tests after CDC became GA' from Piotr Jastrzębski
Now that CDC is GA, it should be enabled in all the tests by default.
To achieve that the PR adds a special db::config::add_cdc_extension()
helper which is used in cql_test_env to make sure CDC is usable in
all the tests that use cql_test_env. As a result, cdc_tests can be
simplified.
Finally, some trailing whitespaces are removed from cdc_tests.

Tests: unit(dev)

Closes #7657

* github.com:scylladb/scylla:
  cdc: Remove trailing whitespaces from cdc_tests
  cdc: Remove mk_cdc_test_config from tests
  config: Add add_cdc_extension function for testing
  cdc: Add missing includes to cdc_extension.hh
2020-11-20 13:56:29 +01:00
Konstantin Osipov
269c049a16 test.py: enable back CQL based tests
The patch which introduces build-dependent testing
has a regression: it quietly filters out all tests
which are not part of ninja output. Since ninja
doesn't build any CQL tests (including CQL-pytest),
all such tests were quietly disabled.

Fix the regression by only doing the filtering
in unit and boost test suites.

test: dev (unit), dev + --build-raft
Message-Id: <20201119224008.185250-1-kostja@scylladb.com>
2020-11-20 11:45:15 +02:00
Pekka Enberg
6a04ae69a2 Update seastar submodule
* seastar c861dbfb...010fb0df (3):
  > build: clean up after failed -fconcepts detection
  > logger: issue std::endl to output stream
  > util/log: improve discoverability of log rate-limiting
2020-11-20 11:43:11 +02:00
Avi Kivity
82b508250e tools: toolchain: dbuild: don't confine with seccomp
Some systems (at least, Centos 7, aarch64) block the membarrier()
syscall via seccomp. This causes Scylla or unit tests to burn cpu
instead of sleeping when there is nothing to do.

Fix by instructing podman/docker not to block any syscalls. I
tested this with podman, and it appears [1] to be supported on
docker.

[1] https://docs.docker.com/engine/security/seccomp/#run-without-the-default-seccomp-profile

Closes #7661
2020-11-20 09:11:52 +02:00
Kamil Braun
40d8bfa394 sstables: move sstable reader creation functions to sstable_set
Lower level functions such as `create_single_key_sstable_reader`
were made methods of `sstable_set`.

The motivation is that each concrete sstable_set
may decide to use a better sstable reading algorithm specific to the
data structures used by this sstable_set. For this it needs to access
the set's internals.

A nice side effect is that we moved some code out of table.cc
and database.hh which are huge files.
2020-11-19 17:52:39 +01:00
Kamil Braun
708093884c mutation_reader: move mutation_reader::forwarding to flat_mutation_reader.hh
Files which need this definition won't have to include
mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are
in total smaller; mutation_reader.hh includes flat_mutation_reader.hh).
2020-11-19 17:52:39 +01:00
Kamil Braun
b02b441c2e sstables: move sstable_set implementations to a separate module
All the implementations were kept in sstables/compaction_strategy.cc
which is quite large even without them. `sstable_set` already had its
own header file, now it gets its own implementation file.

The declarations of implementation classes and interfaces (`sstable_set_impl`,
`bag_sstable_set`, and so on) were also exposed in a header file,
sstable_set_impl.hh, for the purposes of potential unit testing.
2020-11-19 17:52:37 +01:00
Avi Kivity
70689088fd Merge "Remove reference on database from global qctx" from Pavel E
"
The qctx is a global object that references the query processor and
database to let the rest of the code query the system keyspace.

As the first step of de-globalizing it -- remove the database
reference from it. After this set, the qctx remains a simple
wrapper over the query processor (which is already de-globalized)
and the query processor in turn is mostly needed only to parse
the query string into a prepared statement. This, in turn,
makes it possible to remove the qctx later by parsing the
query strings on boot and carrying _them_ around, not the qctx
itself.

tests: unit(dev), dtest(simple_cluster_driver_test:dev), manual start/stop
"

* 'br-remove-database-from-qctx' of https://github.com/xemul/scylla:
  query-context: Remove database from qctx
  schema-tables: Use query processor referece in save_system(_keyspace)?_schema
  system-keyspace: Rewrite force_blocking_flush
  system-keyspace: Use cluster_name string in check_health
  system-keyspace: Use db::config in setup_version
  query-context: Kill global helpers
  test: Use cql_test_env::evecute_cql instead of qctx version
  code: Use qctx::evecute_cql methods, not global ones
  system-keyspace: Do not call minimal_setup for the 2nd time
  system-keyspace: Fix indentation after previous patch
  system-keyspace: Do not do invoke_on_all by hands
  system-keyspace: Remove dead code
2020-11-19 18:31:51 +02:00
Pavel Emelyanov
689fd029a1 query-context: Remove database from qctx
No users of qctx::db are left.  One global database reference less.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-19 18:39:05 +03:00
Pavel Emelyanov
464c8990d4 schema-tables: Use query processor referece in save_system(_keyspace)?_schema
The save_system_schema and save_system_keyspace_schema are both
called on start and can get the needed query processor reference
from arguments.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-19 18:39:05 +03:00
Pavel Emelyanov
66dcc47571 system-keyspace: Rewrite force_blocking_flush
The method is called after query_processor::execute_internal
to flush the cf. Encapsulating this flush inside database and
getting the database from query_processor allows removing the
database reference from the global qctx object.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-19 18:39:05 +03:00
Pavel Emelyanov
6cad18ad33 system-keyspace: Use cluster_name string in check_health
The check_health needs the global qctx to get db.config.cluster_name,
which is already available at the caller side.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-19 18:39:05 +03:00
Pavel Emelyanov
36a3ee6ad4 system-keyspace: Use db::config in setup_version
This is the beginning of de-globalizing global qctx thing.

The setup_version() needs global qctx to get config from.
It's possible to get the config from the caller instead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-19 18:39:05 +03:00
Pavel Emelyanov
43039a0812 query-context: Kill global helpers
Now the db::execute_cql* callers are patched, the global
helpers can be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-19 18:39:05 +03:00
Pavel Emelyanov
64eef0a4f7 test: Use cql_test_env::evecute_cql instead of qctx version
Similar to the previous patch, but for tests. Since cql_test_env
doesn't have qctx on board, the patch makes one step forward
and calls what is called by qctx::execute_cql.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-19 18:39:05 +03:00
Pavel Emelyanov
303ebe4a36 code: Use qctx::evecute_cql methods, not global ones
There are global db::execute_cql() helpers that just forward
the args into qctx::execute_cql(). The former are going away,
so patch all callers to use qctx themselves.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-19 18:39:05 +03:00
Pavel Emelyanov
8bf6b1298c system-keyspace: Do not call minimal_setup for the 2nd time
The system_keyspace::minimal_setup is already called by main.cc by hand,
some steps before the regular ::setup().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-19 18:39:05 +03:00
Pavel Emelyanov
7b82ec2f9e system-keyspace: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-19 18:39:05 +03:00
Pavel Emelyanov
1773dadc72 system-keyspace: Do not do invoke_on_all by hands
The cache_truncation_record needs to run cf.cache_truncation_record
on each shard's DB, so the invoke_on_all can be used.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-19 18:39:05 +03:00
Pavel Emelyanov
fb20d9cd1e system-keyspace: Remove dead code
Not called anywhere.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-19 18:39:05 +03:00
Piotr Dulikowski
60ac68b7a2 hints/resource_manager: add comments to register_manager
Adds more comments to resource_manager::register_manager in order to
better explain what this function is doing.
2020-11-19 16:34:37 +01:00
Piotr Dulikowski
c0c10b918c hints/resource_manager: fix indentation
Fixes indentation in prepare_per_device_limits.
2020-11-19 16:34:37 +01:00
Piotr Dulikowski
ead6a3f036 hints/resource_manager: improve mutual exclusion
This commit causes start, stop and register_manager methods of the
resource_manager to be serialized with respect to each other using the
_operation_lock.

Those functions modify internal state, so it's best if they are
protected with a semaphore. Additionally, those functions are not going
to be used frequently, therefore it's perfectly fine to protect them in
such a coarse manner.

Now, space_watchdog has a dedicated lock for serializing its on_timer
logic with resource_manager::register_manager. The reason for separate
lock is that resource_manager::stop cannot use the same lock as the
space_watchdog - otherwise a situation could occur in which
space_watchdog waits for semaphore units held by
resource_manager::stop(), and resource_manager::stop() waits until the
space_watchdog stops its asynchronous event loop.
2020-11-19 16:34:37 +01:00
Piotr Dulikowski
362aebee7b hints/resource_manager: correct prepare_per_device_limits usage
The resource_manager::prepare_per_device_limits function calculates disk
quota for registered hints managers, and creates an association map:
from a storage device id to those hints managers which store hints on
that device (_per_device_limits_map).

This function was used with an assumption that it is idempotent - which
is a wrong assumption. In resource_manager::register_manager, if the
resource_manager is already started, prepare_per_device_limits would be
called, and those hints managers which were previously added to the
_per_device_limits_map would be added again. This would cause the space
used by those managers to be calculated twice, which would artificially
lower the limit which we impose on the space hints are allowed to occupy
on disk.

This patch fixes this problem by changing the prepare_per_device_limits
function to operate on a hints manager passed by argument. Now, we make
sure that this function is called on each hints manager only once.
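
The double-counting problem can be reproduced with a toy model. This is only a sketch: `register_buggy` and `register_fixed` are hypothetical names, and the map holds manager names instead of real hints manager objects.

```python
from collections import defaultdict


def register_buggy(per_device, managers, new_manager, device_id):
    managers.append((new_manager, device_id))
    for m, d in managers:  # re-adds managers that are already in the map
        per_device[d].append(m)


def register_fixed(per_device, managers, new_manager, device_id):
    managers.append((new_manager, device_id))
    per_device[device_id].append(new_manager)  # only the new manager


buggy_map, fixed_map = defaultdict(list), defaultdict(list)
buggy_managers, fixed_managers = [], []
for name in ("a", "b"):
    register_buggy(buggy_map, buggy_managers, name, device_id=1)
    register_fixed(fixed_map, fixed_managers, name, device_id=1)
```

In the buggy variant manager "a" appears twice for device 1, so its disk usage would be counted twice.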
2020-11-19 16:34:37 +01:00
Piotr Jastrzebski
debd10cc55 cdc: Remove trailing whitespaces from cdc_tests
The change was performed automatically using vim and
:%s/\s\+$//e
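
For reference, a rough Python equivalent of that substitution, restricted to spaces and tabs. The `strip_trailing_whitespace` helper is a hypothetical name, not something used by the patch.

```python
import re


def strip_trailing_whitespace(text):
    # [ \t]+$ with re.MULTILINE matches spaces/tabs before each line end.
    return re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)


cleaned = strip_trailing_whitespace("int a;  \n\tint b;\t\n")
```

Leading indentation is left untouched; only whitespace at the end of a line is removed.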

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-19 16:25:22 +01:00
Piotr Jastrzebski
6bdbfbafb7 cdc: Remove mk_cdc_test_config from tests
Now that CDC is GA and enabled by default, there's no longer a need
for a specific config in CDC tests.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-19 16:21:32 +01:00
Avi Kivity
2deb8e6430 Merge 'mutation_reader: generalize combined_mutation_reader' from Kamil Braun
It is now called `merging_reader`, and is used to change a `FragmentProducer`
that produces a non-decreasing stream of mutation fragment batches into
a `flat_mutation_reader` producing a non-decreasing stream of fragments.

The resulting stream of fragments is increasing except for places where
we encounter range tombstones (multiple range tombstones may be produced
with the same position_in_partition).

`merging_reader` is a simple adapter over `mutation_fragment_merger`.

The old `combined_mutation_reader` is simply a specialization of `merging_reader`
where the used `FragmentProducer` is `mutation_reader_merger`, an abstraction that
merges the output of multiple readers into one non-decreasing stream of fragment
batches.

There is no separate class for `combined_mutation_reader` now. Instead,
`make_combined_reader` works directly with `merging_reader`.

The PR also improves some comments.

Split from https://github.com/scylladb/scylla/pull/7437.

Closes #7656

* github.com:scylladb/scylla:
  mutation_reader: `generalize combined_mutation_reader`
  mutation_reader: fix description of mutation_fragment_merger
2020-11-19 17:19:01 +02:00
Piotr Jastrzebski
9ede193f0a config: Add add_cdc_extension function for testing
and use it in cql_test_env to enable cdc extension
for all tests that use it.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-19 16:16:07 +01:00
Piotr Jastrzebski
89f4298670 cdc: Add missing includes to cdc_extension.hh
Without those additional includes, a .cc file
that includes cdc_extension.hh won't compile.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-19 16:11:33 +01:00
Nadav Har'El
5f37c1ef33 Merge 'Don't add delay to the timestamp of the first CDC generation' from Piotr Jastrzębski
After the concept of the seed nodes was removed we can distinguish
whether the node is the first node in the cluster or not.

Thanks to this we can avoid adding delay to the timestamp of the first
CDC generation.

The delay is added to the timestamp to make sure that all the nodes
in the cluster manage to learn about it before the timestamp is in the past.
It is safe to not add the delay for the first node because we know it's the only node
in the cluster and no one else has to learn about the timestamp.
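
The rule reduces to a one-line decision. This is a sketch with hypothetical names (`new_generation_timestamp`, `delay_ms`); the real delay value and clock source are not shown here.

```python
def new_generation_timestamp(now_ms, delay_ms, is_first_node):
    """Timestamp for a new CDC generation; only non-first nodes add the delay."""
    add_delay = not is_first_node
    return now_ms + (delay_ms if add_delay else 0)


first_node_ts = new_generation_timestamp(1000, 60, is_first_node=True)
joining_node_ts = new_generation_timestamp(1000, 60, is_first_node=False)
```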

Fixes #7645

Tests: unit(dev)

Closes #7654

* github.com:scylladb/scylla:
  cdc: Don't add delay to the timestamp of the first generation
  cdc: Change for_testing to add_delay in make_new_cdc_generation
2020-11-19 16:47:16 +02:00
Kamil Braun
857911d353 mutation_reader: generalize combined_mutation_reader
It is now called `merging_reader`, and is used to change a `FragmentProducer`
that produces a non-decreasing stream of mutation fragment batches into
a `flat_mutation_reader` producing a non-decreasing stream of fragments.

The resulting stream of fragments is increasing except for places where
we encounter range tombstones (multiple range tombstones may be produced
with the same position_in_partition).

`merging_reader` is a simple adapter over `mutation_fragment_merger`.

The old `combined_mutation_reader` is simply a specialization of `merging_reader`
where the used `FragmentProducer` is `mutation_reader_merger`, an abstraction that
merges the output of multiple readers into one non-decreasing stream of fragment
batches.

There is no separate class for `combined_mutation_reader` now. Instead,
`make_combined_reader` works directly with `merging_reader`.
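
The ordering guarantee can be illustrated with the standard library. This sketch only shows how several non-decreasing inputs become one non-decreasing output; it does not model batching or the merging of fragments that share a position, and the readers are plain lists of hypothetical `(position, fragment)` pairs.

```python
import heapq

# Each reader yields (position, fragment) pairs in non-decreasing position order.
reader_a = [(1, "row a1"), (4, "row a4")]
reader_b = [(2, "row b2"), (4, "range tombstone b4")]

# heapq.merge lazily merges sorted inputs; key= compares positions only.
merged = list(heapq.merge(reader_a, reader_b, key=lambda item: item[0]))
positions = [pos for pos, _ in merged]
```

Position 4 appears twice in the output, which is why the stream is non-decreasing rather than strictly increasing.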
2020-11-19 14:35:11 +01:00
Kamil Braun
60adee6900 mutation_reader: fix description of mutation_fragment_merger
The resulting sequence is not necessarily strictly increasing
(e.g. if there are range tombstones).
2020-11-19 14:29:04 +01:00
Avi Kivity
a1be71b388 Merge "Harden network_topology_strategy_test.calculate_natural_endpoints" from Benny
"
We've recently seen failures in this unit test as follows:
```
test/boost/network_topology_strategy_test.cc(0): Entering test case "testCalculateEndpoints"
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
./seastar/src/testing/seastar_test.cc(43): last checkpoint
test/boost/network_topology_strategy_test.cc(0): Leaving test case "testCalculateEndpoints"; testing time: 15192us
test/boost/network_topology_strategy_test.cc(0): Entering test case "test_invalid_dcs"
network_topology_strategy_test: ./seastar/include/seastar/core/future.hh:634: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.
Aborting on shard 0.
```

This series fixes 2 issues in this test:
1. The core issue where std::out_of_range exception
   is not handled in calculate_natural_endpoints().
2. A secondary issue where the static `snitch_inst` isn't
   stopped when the first exception is hit, failing
   the next time the snitch is started, as it wasn't
   stopped properly.

Test: network_topology_strategy_test(release)
"

* tag 'nts_test-harden-calculate_natural_endpoints-v1' of github.com:bhalevy/scylla:
  test: network_topology_strategy_test: has_sufficient_replicas: handle empty dc endpoints case
  test: network_topology_strategy_test: fixup indentation
  test: network_topology_strategy_test: always stop_snitch after create_snitch
2020-11-19 14:11:42 +02:00
Piotr Jastrzebski
93a7f7943c cdc: Don't add delay to the timestamp of the first generation
After the concept of the seed nodes was removed we can distinguish
whether the node is the first node in the cluster or not.

Thanks to this we can avoid adding delay to the timestamp of the first
CDC generation.

Fixes #7645

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-19 13:03:18 +01:00
Tomasz Grabiec
d3a5814f4f api: Connect nodetool resetlocalschema to schema version recalculation
It doesn't really do what the nodetool command is documented to do,
which is to truncate local schema tables, but it is still an
improvement.

Message-Id: <1605740190-30332-1-git-send-email-tgrabiec@scylladb.com>
2020-11-19 13:55:09 +02:00
Piotr Jastrzebski
3024795507 cdc: Change for_testing to add_delay in make_new_cdc_generation
The meaning of the parameter changes from defining whether the function
is called in testing environment to deciding whether a delay should be
added to a timestamp of a newly created CDC generation.

This is a preparation for an improvement in the following patch, which
adds the delay only on nodes other than the first, rather than on every node.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-19 12:19:42 +01:00
Pekka Enberg
ba39bfa1be dist-check: Fix script name to work on Windows filesystem
Asias He reports that git on Windows filesystem is unhappy about the
colon character (":") present in dist-check files:

$ git reset --hard origin/master
error: invalid path 'tools/testing/dist-check/docker.io/centos:7.sh'
fatal: Could not reset index file to revision 'origin/master'.

Rename the script to use a dash instead.

Closes #7648
2020-11-19 13:16:30 +02:00
Gleb Natapov
43dc5e7dc2 test: add support for different state machines
Current tests use a hash state machine that checks for a specific order of
entry application. The order is not always guaranteed though.
Backpressure may delay some entries from being submitted, and when they are
released together they may be reordered in debug mode due to
SEASTAR_SHUFFLE_TASK_QUEUE. Introduce the ability for a test to choose the
state machine type and implement a commutative state machine that does
not care about ordering.
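
The difference between the two kinds of state machine can be sketched as follows. Both `apply_hash` and `apply_sum` are hypothetical stand-ins; the test's actual state machines are not shown here.

```python
import hashlib


def apply_hash(state, entry):
    # Order-sensitive: the new state depends on the previous state.
    return hashlib.sha256(state + str(entry).encode()).digest()


def apply_sum(state, entry):
    # Commutative: addition gives the same result in any order.
    return state + entry


def run(apply, initial, entries):
    state = initial
    for e in entries:
        state = apply(state, e)
    return state


in_order, reordered = [1, 2, 3], [2, 1, 3]
sum_matches = run(apply_sum, 0, in_order) == run(apply_sum, 0, reordered)
hash_matches = run(apply_hash, b"", in_order) == run(apply_hash, b"", reordered)
```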
2020-11-18 19:14:37 +01:00
Gleb Natapov
8d9b6f588e raft: stop accepting requests on a leader after the log reaches the limit
To prevent the log from taking too much memory, introduce a mechanism that
limits the log to a certain size. If the size is reached, no new log
entries can be submitted until previous entries are committed and
snapshotted.
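
A toy model of the limit, assuming the size is counted in entries. The `BoundedLog` class is hypothetical, and a rejected submit here simply returns `False` where real code would presumably make the caller wait.

```python
class BoundedLog:
    """Toy leader log that refuses new entries once max_size is reached."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = []

    def submit(self, entry):
        if len(self.entries) >= self.max_size:
            return False  # caller must wait and retry
        self.entries.append(entry)
        return True

    def snapshot(self, committed_count):
        # Entries covered by the snapshot no longer occupy the log.
        del self.entries[:committed_count]


log = BoundedLog(max_size=2)
results = [log.submit(1), log.submit(2), log.submit(3)]
log.snapshot(committed_count=1)
accepted_after_snapshot = log.submit(3)
```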
2020-11-18 19:14:37 +01:00
Evgeniy Naydanov
587b909c5c scylla_raid_setup: try /dev/md[0-9] if no --raiddev provided
If the scylla_raid_setup script is called without the --raiddev argument,
then try to use any of the /dev/md[0-9] devices instead of only
/dev/md0.  Do it this way because on Ubuntu 20.04
/dev/md0 is already used by the OS.
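
The selection logic can be sketched like this. How the real script detects which devices are taken is not shown here, so `in_use` is a hypothetical input and `pick_raid_device` a hypothetical helper name.

```python
def pick_raid_device(in_use):
    """Return the first /dev/mdN (N in 0..9) not present in `in_use`."""
    for n in range(10):
        candidate = f"/dev/md{n}"
        if candidate not in in_use:
            return candidate
    raise RuntimeError("no free /dev/md[0-9] device")


on_ubuntu = pick_raid_device({"/dev/md0"})
on_clean_host = pick_raid_device(set())
```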

Closes #7628
2020-11-18 18:42:31 +02:00
Pavel Emelyanov
dbb2722e46 auth: Fix class name vs field name compilation by gcc
gcc fails to compile current master like this

In file included from ./service/client_state.hh:44,
                 from ./cql3/cql_statement.hh:44,
                 from ./cql3/statements/prepared_statement.hh:47,
                 from ./cql3/statements/raw/select_statement.hh:45,
                 from build/dev/gen/cql3/CqlParser.hpp:64,
                 from build/dev/gen/cql3/CqlParser.cpp:44:
./auth/service.hh:188:21: error: declaration of ‘const auth::resource& auth::command_desc::resource’ changes meaning of ‘resource’ [-fpermissive]
  188 |     const resource& resource; ///< Resource impacted by this command.
      |                     ^~~~~~~~
In file included from ./auth/authenticator.hh:57,
                 from ./auth/service.hh:33,
                 from ./service/client_state.hh:44,
                 from ./cql3/cql_statement.hh:44,
                 from ./cql3/statements/prepared_statement.hh:47,
                 from ./cql3/statements/raw/select_statement.hh:45,
                 from build/dev/gen/cql3/CqlParser.hpp:64,
                 from build/dev/gen/cql3/CqlParser.cpp:44:
./auth/resource.hh:98:7: note: ‘resource’ declared here as ‘class auth::resource’
   98 | class resource final {
      |       ^~~~~~~~

clang doesn't fail

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201118155905.14447-1-xemul@scylladb.com>
2020-11-18 18:40:55 +02:00
Asias He
f7c954dc1e repair: Use decorated_key::tri_compare to compare keys
It is faster than the legacy_equal because it compares the token first.
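
Why comparing the token first is cheaper can be seen in a small model. This is a sketch, not the decorated_key implementation: keys are modelled as hypothetical `(token, key_bytes)` pairs.

```python
def tri_compare(a, b):
    """Three-way compare of (token, key) pairs: token first, key only on a tie."""
    (token_a, key_a), (token_b, key_b) = a, b
    if token_a != token_b:
        # Cheap integer comparison settles most cases without touching the key.
        return -1 if token_a < token_b else 1
    return (key_a > key_b) - (key_a < key_b)


different_tokens = tri_compare((1, b"zzz"), (2, b"aaa"))
same_key = tri_compare((5, b"k"), (5, b"k"))
same_token = tri_compare((5, b"b"), (5, b"a"))
```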

Fixes #7643

Closes #7644
2020-11-18 14:12:59 +02:00
Piotr Sarna
c0d72b4491 db,view: remove duplicate entries from the list of target endpoints
If a list of target endpoints for sending view updates contains
duplicates, it results in benign (but annoying) broken promise
errors happening due to duplicated write response handlers being
instantiated for a single endpoint.
In order to avoid such errors, target remote endpoints are deduplicated
from the list of pending endpoints.
A similar issue (#5459) solved the case for duplicated local endpoints,
but that didn't solve the general case.
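
The deduplication can be sketched roughly as follows. This is an assumption-laden illustration: endpoints are plain strings and `endpoints_to_write()` is a hypothetical helper, not a function from the patch:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns the endpoints a view update is sent to: the natural target first,
// followed by pending endpoints with any copy of the target removed, so that
// only one write response handler is created per endpoint.
inline std::vector<std::string> endpoints_to_write(const std::string& target,
                                                   std::vector<std::string> pending) {
    pending.erase(std::remove(pending.begin(), pending.end(), target), pending.end());
    std::vector<std::string> result{target};
    result.insert(result.end(), pending.begin(), pending.end());
    return result;
}
```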

Fixes #7572

Closes #7641
2020-11-18 13:43:49 +02:00
Avi Kivity
d612ca78f3 Merge 'Allow changing hinted handoff configuration in runtime' from Piotr Dulikowski
This PR allows changing the hinted_handoff_enabled option in runtime, either by modifying and reloading YAML configuration, or through HTTP API.

This PR also introduces an important change in semantics of hinted_handoff_enabled:
- Previously, hinted_handoff_enabled controlled whether _both writing and sending_ hints is allowed at all, or to particular DCs,
- Now, hinted_handoff_enabled only controls whether _writing hints_ is enabled. Sending hints from disk is now always enabled.

Fixes: #5634
Tests:
- unit(dev) for each commit of the PR
- unit(debug) for the last commit of the PR

Closes #6916

* github.com:scylladb/scylla:
  api: allow changing hinted handoff configuration
  storage_proxy: fix wrong return type in swagger
  hints_manager: implement change_host_filter
  storage_proxy: always create hints manager
  config: plug in hints::host_filter object into configuration
  db/hints: introduce host_filter
  hints/resource_manager: allow registering managers after start
  hints: introduce db::hints::directory_initializer
  directories.cc: prepare for use outside main.cc
2020-11-18 13:41:02 +02:00
Calle Wilund
9f48dc7dac locator::ec2_multi_region_snitch: Handle ipv6 broadcast/public ip
Fixes #7064

Iff broadcast address is set to ipv6 from main (meaning prefer
ipv6), determine the "public" ipv6 address (which should be
the same, but might not be), via aws metadata query.

Closes #7633
2020-11-18 12:48:25 +02:00
Asias He
9b28162f88 repair: Use label for node ops metrics
Make it easier to be consumed by the scylla-monitor.

Fixes #7270

Closes #7638
2020-11-18 10:12:39 +02:00
Avi Kivity
f55b522c1b database: detect misconfigured unit tests that don't set available_memory
available_memory is used to seed many caches and controllers. Usually
it's detected from the environment, but unit tests configure it
on their own with fake values. If they forget, then the undefined
behavior sanitizer will kick in in random places (see 8aa842614a
("test: gossip_test: configure database memory allocation correctly")
for an example).

Prevent this early by asserting that available_memory is nonzero.

Closes #7612
2020-11-18 08:49:32 +02:00
Avi Kivity
13c6c90d8c Merge 'Remove std::iterator usage' from Piotr Jastrzębski
std::iterator is deprecated since C++17 so define all the required iterator_traits directly and stop using std::iterator at all.

More context: https://www.fluentcpp.com/2018/05/08/std-iterator-deprecated
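
For readers unfamiliar with the pattern, a generic sketch of the transformation follows. `counting_iterator` is an invented example type, not one of the iterators touched by this series:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <type_traits>

// Before (deprecated since C++17):
//   class counting_iterator
//       : public std::iterator<std::input_iterator_tag, int> { ... };
//
// After: the five member types that std::iterator_traits looks for are
// declared directly in the class.
class counting_iterator {
    int _value = 0;
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = const int&;

    counting_iterator() = default;
    explicit counting_iterator(int v) : _value(v) {}

    reference operator*() const { return _value; }
    counting_iterator& operator++() { ++_value; return *this; }
    counting_iterator operator++(int) { auto copy = *this; ++_value; return copy; }
    bool operator==(const counting_iterator& o) const { return _value == o._value; }
    bool operator!=(const counting_iterator& o) const { return _value != o._value; }
};
```

Standard algorithms such as std::distance keep working because they only consult std::iterator_traits.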

Tests: unit(dev)

Closes #7635

* github.com:scylladb/scylla:
  log_heap: Remove std::iterator from hist_iterator
  types: Remove std::iterator from tuple_deserializing_iterator
  types: Remove std::iterator from listlike_partial_deserializing_iterator
  sstables: remove std::iterator from const_iterator
  token_metadata: Remove std::iterator from tokens_iterator
  size_estimates_virtual_reader: Remove std::iterator
  token_metadata: Remove std::iterator from tokens_iterator_impl
  counters: Remove std::iterator from iterators
  compound_compat: Remove std::iterator from iterators
  compound: Remove std::iterator from iterator
  clustering_interval_set: Remove std::iterator from position_range_iterator
  cdc: Remove std::iterator from collection_iterator
  cartesian_product: Remove std::iterator from iterator
  bytes_ostream: Remove std::iterator from fragment_iterator
2020-11-17 19:22:17 +02:00
Benny Halevy
5171590d83 test: network_topology_strategy_test: has_sufficient_replicas: handle empty dc endpoints case
We saw this intermittent failure in testCalculateEndpoints:
```
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
```

It turns out that there are no endpoints associated with the dc passed
to has_sufficient_replicas in the `all_endpoints` map.

Handle this case by returning true.

The dc is still required to appear in `dc_replicas`,
so if it's not found there, fail the test gracefully.
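
A simplified sketch of the intended behaviour, assuming string endpoints and a plain replication factor argument (the real test helper has a different signature, and throwing std::invalid_argument stands in for failing the test):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <unordered_set>

using endpoints_by_dc = std::unordered_map<std::string, std::unordered_set<std::string>>;

inline bool has_sufficient_replicas(const std::string& dc,
                                    const endpoints_by_dc& dc_replicas,
                                    const endpoints_by_dc& all_endpoints,
                                    std::size_t replication_factor) {
    // The dc must be known to dc_replicas; report a clear error instead of
    // letting map::at() throw std::out_of_range.
    auto replicas_it = dc_replicas.find(dc);
    if (replicas_it == dc_replicas.end()) {
        throw std::invalid_argument("dc missing from dc_replicas: " + dc);
    }
    // No endpoints in this dc: nothing more can be placed there.
    auto endpoints_it = all_endpoints.find(dc);
    if (endpoints_it == all_endpoints.end()) {
        return true;
    }
    return replicas_it->second.size() >=
           std::min(endpoints_it->second.size(), replication_factor);
}
```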

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-17 18:57:19 +02:00
Piotr Jastrzebski
2fe9d879df log_heap: Remove std::iterator from hist_iterator
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Piotr Jastrzebski
957d4c3532 types: Remove std::iterator from tuple_deserializing_iterator
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Piotr Jastrzebski
5f64e57b10 types: Remove std::iterator from listlike_partial_deserializing_iterator
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Piotr Jastrzebski
bacda100ec sstables: remove std::iterator from const_iterator
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Piotr Jastrzebski
661b52c7df token_metadata: Remove std::iterator from tokens_iterator
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Piotr Jastrzebski
c0bc6b5795 size_estimates_virtual_reader: Remove std::iterator
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Piotr Jastrzebski
87bf577450 token_metadata: Remove std::iterator from tokens_iterator_impl
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Piotr Jastrzebski
651849e0c1 counters: Remove std::iterator from iterators
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Piotr Jastrzebski
742b5b7fc5 compound_compat: Remove std::iterator from iterators
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Piotr Jastrzebski
493c2bfc96 compound: Remove std::iterator from iterator
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Piotr Jastrzebski
c5d6ee0e45 clustering_interval_set: Remove std::iterator from position_range_iterator
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Piotr Jastrzebski
6b1167ea0d cdc: Remove std::iterator from collection_iterator
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Piotr Jastrzebski
a2fa10a0bc cartesian_product: Remove std::iterator from iterator
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Piotr Jastrzebski
0605d9e8ed bytes_ostream: Remove std::iterator from fragment_iterator
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Benny Halevy
a38709b6bb test: network_topology_strategy_test: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-17 16:10:35 +02:00
Benny Halevy
5c73d4f65b test: network_topology_strategy_test: always stop_snitch after create_snitch
Currently stop_snitch is not called if the test fails on exception.
This causes a failure in create_snitch where snitch_inst fails to start
since it wasn't stopped earlier.

For example:
```
test/boost/network_topology_strategy_test.cc(0): Entering test case "testCalculateEndpoints"
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
./seastar/src/testing/seastar_test.cc(43): last checkpoint
test/boost/network_topology_strategy_test.cc(0): Leaving test case "testCalculateEndpoints"; testing time: 15192us
test/boost/network_topology_strategy_test.cc(0): Entering test case "test_invalid_dcs"
network_topology_strategy_test: ./seastar/include/seastar/core/future.hh:634: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.
Aborting on shard 0.
Backtrace:
  0x0000000002825e94
  0x000000000282ffa9
  0x00007fd065f971df
  /lib64/libc.so.6+0x000000000003dbc4
  /lib64/libc.so.6+0x00000000000268a3
  /lib64/libc.so.6+0x0000000000026788
  /lib64/libc.so.6+0x0000000000035fc5
  0x0000000000b484cf
  0x0000000002a7c69f
  0x0000000002a7c62f
  0x0000000000b47b9e
  0x0000000002595da2
  0x0000000002595913
  0x0000000002a83a31

```
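
The idea of always pairing create with stop can be sketched with a scope guard. This is a synchronous illustration only: the test itself is future-based, and `fake_snitch`, `scope_guard` and `with_snitch()` are invented names:

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Runs the stored callback when the enclosing scope exits, including when
// it exits because of an exception.
class scope_guard {
    std::function<void()> _on_exit;
public:
    explicit scope_guard(std::function<void()> f) : _on_exit(std::move(f)) {}
    ~scope_guard() { _on_exit(); }
    scope_guard(const scope_guard&) = delete;
    scope_guard& operator=(const scope_guard&) = delete;
};

// Stand-in for the snitch: creating it twice without a stop is an error.
struct fake_snitch {
    bool running = false;
    void create() {
        if (running) {
            throw std::logic_error("snitch already running");
        }
        running = true;
    }
    void stop() { running = false; }
};

template <typename Test>
void with_snitch(fake_snitch& snitch, Test test) {
    snitch.create();
    scope_guard stop([&snitch] { snitch.stop(); });
    test(); // may throw; the guard still stops the snitch
}
```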

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-17 16:09:43 +02:00
Piotr Jastrzebski
f2b98b0aad Replace disable_failure_guard with scoped_critical_alloc_section
scoped_critical_alloc_section was recently introduced to replace
disable_failure_guard and made the old class deprecated.

This patch replaces all occurrences of disable_failure_guard with
scoped_critical_alloc_section.

Without this patch the build prints many warnings like:
warning: 'disable_failure_guard' is deprecated: Use scoped_critical_section instead [-Wdeprecated-declarations]

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <ca2a91aaf48b0f6ed762a6aa687e6ac5e936355d.1605621284.git.piotr@scylladb.com>
2020-11-17 16:01:25 +02:00
Avi Kivity
006e0e4fe0 Merge "Add scylla specific information to the OOM diagnostics report" from Botond
"
Use the recently introduced seastar mechanism which allows the
application running on top of seastar to add its own part to the
diagnostics report to add scylla specific information to said report.
The report now closely resembles that produced by `scylla memory` from
`scylla-gdb.py`, with the exception of coordinator-specific information.
This should greatly speed up the debugging of OOM, as the diagnostics
report will be available from the logs, without having to obtain a
coredump and set up a debugging environment in which it can be opened.

Example report:

INFO  2020-11-10 12:02:44,182 [shard 0] testlog - Dumping seastar memory diagnostics
Used memory:  2029M
Free memory:  19M
Total memory: 2G

LSA
  allocated: 1770M
  used:      1766M
  free:      3M

Cache:
  total: 1770M
  used:  1716M
  free:  54M

Memtables:
 total: 0B
 Regular:
  real dirty: 0B
  virt dirty: 0B
 System:
  real dirty: 0B
  virt dirty: 0B

Replica:
  Read Concurrency Semaphores:
    user: 100/100, 33M/41M, queued: 477
    streaming: 0/10, 0B/41M, queued: 0
    system: 0/100, 0B/41M, queued: 0
    compaction: 0/∞, 0B/∞
  Execution Stages:
    data query stage:
      statement	987
         Total: 987
    mutation query stage:
         Total: 0
    apply stage:
         Total: 0
  Tables - Ongoing Operations:
    Pending writes (top 10):
      0 Total (all)
    Pending reads (top 10):
      1564 ks.test
      1564 Total (all)
    Pending streams (top 10):
      0 Total (all)

Small pools:
objsz	spansz	usedobj	memory	unused	wst%
8	4K	11k	88K	6K	6
10	4K	10	8K	8K	98
12	4K	2	8K	8K	99
14	4K	4	8K	8K	99
16	4K	15k	244K	5K	2
32	4K	2k	52K	3K	5
32	4K	20k	628K	2K	0
32	4K	528	20K	4K	17
32	4K	5k	144K	480B	0
48	4K	17k	780K	3K	0
48	4K	3k	140K	3K	2
64	4K	50k	3M	6K	0
64	4K	66k	4M	7K	0
80	4K	131k	10M	1K	0
96	4K	37k	3M	192B	0
112	4K	65k	7M	10K	0
128	4K	21k	3M	2K	0
160	4K	38k	6M	3K	0
192	4K	15k	3M	12K	0
224	4K	3k	720K	10K	1
256	4K	148	56K	19K	33
320	8K	13k	4M	14K	0
384	8K	3k	1M	20K	1
448	4K	11k	5M	5K	0
512	4K	2k	1M	39K	3
640	12K	163	144K	42K	29
768	12K	1k	832K	59K	7
896	8K	131	144K	29K	20
1024	4K	643	732K	89K	12
1280	20K	11k	13M	26K	0
1536	12K	12	128K	110K	85
1792	16K	12	144K	123K	85
2048	8K	601	1M	14K	1
2560	20K	70	224K	48K	21
3072	12K	13	240K	201K	83
3584	28K	6	288K	266K	92
4096	16K	10k	39M	88K	0
5120	20K	7	416K	380K	91
6144	24K	24	480K	336K	70
7168	28K	27	608K	413K	67
8192	32K	256	3M	736K	26
10240	40K	11k	105M	550K	0
12288	48K	21	960K	708K	73
14336	56K	59	1M	378K	31
16384	64K	8	1M	1M	89
Page spans:
index	size	free	used	spans
0	4K	48M	48M	12k
1	8K	6M	6M	822
2	16K	41M	41M	3k
3	32K	18M	18M	579
4	64K	108M	108M	2k
5	128K	1774M	2G	14k
6	256K	512K	0B	2
7	512K	2M	2M	4
8	1M	0B	0B	0
9	2M	2M	0B	1
10	4M	0B	0B	0
11	8M	0B	0B	0
12	16M	16M	0B	1
13	32M	32M	32M	1
14	64M	0B	0B	0
15	128M	0B	0B	0
16	256M	0B	0B	0
17	512M	0B	0B	0
18	1G	0B	0B	0
19	2G	0B	0B	0
20	4G	0B	0B	0
21	8G	0B	0B	0
22	16G	0B	0B	0
23	32G	0B	0B	0
24	64G	0B	0B	0
25	128G	0B	0B	0
26	256G	0B	0B	0
27	512G	0B	0B	0
28	1T	0B	0B	0
29	2T	0B	0B	0
30	4T	0B	0B	0
31	8T	0B	0B	0

Fixes: #6365
"

* 'dump-memory-diagnostics-oom/v1' of https://github.com/denesb/scylla:
  database: hook-in to the seastar OOM diagnostics report generation
  database: table: add accessors to the operation counts of the phasers
  utils: logalloc: add lsa_global_occupancy_stats()
  utils: phased_barrier: add operations_in_progress()
  mutation_query: mutation_query_stage: add get_stats()
  reader_concurrency_semaphore: add is_unlimited()
2020-11-17 15:50:21 +02:00
Avi Kivity
1cf02cb9d8 types: add constraint on lexicographical_tri_compare()
Verify that the input types are iterators and their value
types are compatible with the compare function.
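
As an illustration of what such a check buys, here is a C++17 approximation that uses static_assert where the patch uses a constraint. The function body is a generic three-way lexicographical compare written for this sketch, not the code from types:

```cpp
#include <cassert>
#include <iterator>
#include <type_traits>

template <typename It1, typename It2, typename Compare>
int lexicographical_tri_compare(It1 first1, It1 last1, It2 first2, It2 last2, Compare cmp) {
    // iterator_traits only provides value_type for iterator-like types.
    using value1 = typename std::iterator_traits<It1>::value_type;
    using value2 = typename std::iterator_traits<It2>::value_type;
    // The comparator must accept both value types and return something
    // convertible to int.
    static_assert(std::is_invocable_r<int, Compare, const value1&, const value2&>::value,
                  "cmp must be callable with the iterators' value types");
    while (first1 != last1 && first2 != last2) {
        int r = cmp(*first1, *first2);
        if (r != 0) {
            return r;
        }
        ++first1;
        ++first2;
    }
    if (first1 == last1 && first2 == last2) {
        return 0;
    }
    // The shorter range orders first.
    return first1 == last1 ? -1 : 1;
}
```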
2020-11-17 15:19:46 +02:00
Avi Kivity
71e93d63c5 composite: make composite::iterator a real input_iterator
Iterators require a default constructor, so add one. This helps
a later patch use std::input_iterator to constrain template
parameters.
2020-11-17 15:19:46 +02:00
Avi Kivity
867b41b124 compound: make compound_type::iterator a real input_iterator
Iterators require a default constructor, so add one. This helps
a later patch use std::input_iterator to constrain template
parameters.
2020-11-17 15:19:38 +02:00
Botond Dénes
34c213f9bb database: hook-in to the seastar OOM diagnostics report generation
Use the mechanism provided by seastar to add scylla specific information
to the memory diagnostics report. The information added is mostly the
same contained in the output of `scylla memory` from `scylla-gdb.py`,
with the exception of the coordinator-specific metrics. The report is
generated in the database layer, where the storage-proxy is not
available and it is not worth pulling it in just for this purpose.

An example report:

INFO  2020-11-10 12:02:44,182 [shard 0] testlog - Dumping seastar memory diagnostics
Used memory:  2029M
Free memory:  19M
Total memory: 2G

LSA
  allocated: 1770M
  used:      1766M
  free:      3M

Cache:
  total: 1770M
  used:  1716M
  free:  54M

Memtables:
 total: 0B
 Regular:
  real dirty: 0B
  virt dirty: 0B
 System:
  real dirty: 0B
  virt dirty: 0B

Replica:
  Read Concurrency Semaphores:
    user: 100/100, 33M/41M, queued: 477
    streaming: 0/10, 0B/41M, queued: 0
    system: 0/100, 0B/41M, queued: 0
    compaction: 0/∞, 0B/∞
  Execution Stages:
    data query stage:
      statement	987
         Total: 987
    mutation query stage:
         Total: 0
    apply stage:
         Total: 0
  Tables - Ongoing Operations:
    Pending writes (top 10):
      0 Total (all)
    Pending reads (top 10):
      1564 ks.test
      1564 Total (all)
    Pending streams (top 10):
      0 Total (all)

Small pools:
objsz	spansz	usedobj	memory	unused	wst%
8	4K	11k	88K	6K	6
10	4K	10	8K	8K	98
12	4K	2	8K	8K	99
14	4K	4	8K	8K	99
16	4K	15k	244K	5K	2
32	4K	2k	52K	3K	5
32	4K	20k	628K	2K	0
32	4K	528	20K	4K	17
32	4K	5k	144K	480B	0
48	4K	17k	780K	3K	0
48	4K	3k	140K	3K	2
64	4K	50k	3M	6K	0
64	4K	66k	4M	7K	0
80	4K	131k	10M	1K	0
96	4K	37k	3M	192B	0
112	4K	65k	7M	10K	0
128	4K	21k	3M	2K	0
160	4K	38k	6M	3K	0
192	4K	15k	3M	12K	0
224	4K	3k	720K	10K	1
256	4K	148	56K	19K	33
320	8K	13k	4M	14K	0
384	8K	3k	1M	20K	1
448	4K	11k	5M	5K	0
512	4K	2k	1M	39K	3
640	12K	163	144K	42K	29
768	12K	1k	832K	59K	7
896	8K	131	144K	29K	20
1024	4K	643	732K	89K	12
1280	20K	11k	13M	26K	0
1536	12K	12	128K	110K	85
1792	16K	12	144K	123K	85
2048	8K	601	1M	14K	1
2560	20K	70	224K	48K	21
3072	12K	13	240K	201K	83
3584	28K	6	288K	266K	92
4096	16K	10k	39M	88K	0
5120	20K	7	416K	380K	91
6144	24K	24	480K	336K	70
7168	28K	27	608K	413K	67
8192	32K	256	3M	736K	26
10240	40K	11k	105M	550K	0
12288	48K	21	960K	708K	73
14336	56K	59	1M	378K	31
16384	64K	8	1M	1M	89
Page spans:
index	size	free	used	spans
0	4K	48M	48M	12k
1	8K	6M	6M	822
2	16K	41M	41M	3k
3	32K	18M	18M	579
4	64K	108M	108M	2k
5	128K	1774M	2G	14k
6	256K	512K	0B	2
7	512K	2M	2M	4
8	1M	0B	0B	0
9	2M	2M	0B	1
10	4M	0B	0B	0
11	8M	0B	0B	0
12	16M	16M	0B	1
13	32M	32M	32M	1
14	64M	0B	0B	0
15	128M	0B	0B	0
16	256M	0B	0B	0
17	512M	0B	0B	0
18	1G	0B	0B	0
19	2G	0B	0B	0
20	4G	0B	0B	0
21	8G	0B	0B	0
22	16G	0B	0B	0
23	32G	0B	0B	0
24	64G	0B	0B	0
25	128G	0B	0B	0
26	256G	0B	0B	0
27	512G	0B	0B	0
28	1T	0B	0B	0
29	2T	0B	0B	0
30	4T	0B	0B	0
31	8T	0B	0B	0
2020-11-17 15:13:21 +02:00
Botond Dénes
4d7f2f45c2 database: table: add accessors to the operation counts of the phasers 2020-11-17 15:13:21 +02:00
Botond Dénes
7b56ed6057 utils: logalloc: add lsa_global_occupancy_stats()
Allows querying the occupancy stats of all the lsa memory.
2020-11-17 15:13:21 +02:00
Botond Dénes
f69942424d utils: phased_barrier: add operations_in_progress()
Allows querying the number of operations in-flight in the current phase.
2020-11-17 15:13:21 +02:00
Botond Dénes
f097bf3005 mutation_query: mutation_query_stage: add get_stats() 2020-11-17 15:13:21 +02:00
Botond Dénes
8c083c17fc reader_concurrency_semaphore: add is_unlimited()
Allows determining whether the semaphore was created without limits.
2020-11-17 15:13:21 +02:00
Avi Kivity
100ad4db38 Merge 'Allow ALTERing the properties of system_auth tables' from Dejan Mircevski
As requested in #7057, allow certain alterations of system_auth tables. Potentially destructive alterations are still rejected.

Tests: unit (dev)

Closes #7606

* github.com:scylladb/scylla:
  auth: Permit ALTER options on system_auth tables
  auth: Add command_desc
  auth: Add tests for resource protections
2020-11-17 12:15:20 +02:00
Botond Dénes
318b0ef259 reader_concurrency_semaphore: rate-limit diagnostics messages
And since now there is no danger of them filling the logs, the log-level
is promoted to info, so users can see the diagnostics messages by
default.

The rate-limit chosen is 1/30s.
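
A rate limit of this kind can be sketched as below. `log_rate_limiter` is a hypothetical class written for illustration; the patch may well rely on an existing logger facility instead:

```cpp
#include <cassert>
#include <chrono>

// Allows at most one message per interval.
class log_rate_limiter {
    using clock = std::chrono::steady_clock;
    clock::duration _interval;
    clock::time_point _last{};
    bool _ever_logged = false;
public:
    explicit log_rate_limiter(clock::duration interval) : _interval(interval) {}

    // Returns true if a message may be logged at `now` and records that it was.
    bool allow(clock::time_point now) {
        if (_ever_logged && now - _last < _interval) {
            return false;
        }
        _last = now;
        _ever_logged = true;
        return true;
    }
};
```

Taking `now` as a parameter keeps the limiter easy to test without sleeping.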

Refs: #7398

Tests: manual

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201117091253.238739-1-bdenes@scylladb.com>
2020-11-17 11:57:51 +02:00
Piotr Dulikowski
0fd36e2579 api: allow changing hinted handoff configuration
This commit makes it possible to change hints manager's configuration at
runtime through HTTP API.

To preserve backwards compatibility, we keep the old behavior of not
creating and checking hints directories if they are not enabled at
startup. Instead, hint directories are lazily initialized when hints are
enabled for the first time through HTTP API.
2020-11-17 10:24:43 +01:00
Piotr Dulikowski
6465dd160b storage_proxy: fix wrong return type in swagger
The GET `hinted_handoff_enabled_by_dc` endpoint had an incorrect return
type specified. Although it does not have an implementation yet, it was
supposed to return a list of strings with DC names for which generating
hints is enabled - not a list of string pairs. Such return type is
expected by the JMX.
2020-11-17 10:24:43 +01:00
Piotr Dulikowski
220a2ca800 hints_manager: implement change_host_filter
Implements a function which is responsible for changing hints manager
configuration while it is running.

It first starts new endpoint managers for endpoints which weren't
allowed by previous filter but are now, and then stops endpoint managers
which are rejected by the new filter.

The function is blocking and waits until all relevant ep managers are
started or stopped.
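
The start/stop decision can be sketched as a diff between the two filters. Everything here is illustrative: filters are plain predicates over endpoint names, and `plan_filter_change()` is an invented helper that only computes the two sets without starting or stopping anything:

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>

using host_filter_fn = std::function<bool(const std::string&)>;

struct filter_change {
    std::set<std::string> to_start; // newly allowed endpoints
    std::set<std::string> to_stop;  // endpoints rejected by the new filter
};

inline filter_change plan_filter_change(const std::set<std::string>& known_endpoints,
                                        const host_filter_fn& old_filter,
                                        const host_filter_fn& new_filter) {
    filter_change plan;
    for (const auto& ep : known_endpoints) {
        const bool was_allowed = old_filter(ep);
        const bool is_allowed = new_filter(ep);
        if (!was_allowed && is_allowed) {
            plan.to_start.insert(ep);
        } else if (was_allowed && !is_allowed) {
            plan.to_stop.insert(ep);
        }
    }
    return plan;
}
```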
2020-11-17 10:24:43 +01:00
Piotr Dulikowski
1302f1b5bf storage_proxy: always create hints manager
Now, the hints manager object for regular hints is always created, even
if hints are disabled in configuration. Please note that the behavior of
hints will be unchanged - no hints will be sent when they are disabled.
The intent of this change is to make enabling and disabling hints in
runtime easier to implement.
2020-11-17 10:24:43 +01:00
Piotr Dulikowski
cefe5214ff config: plug in hints::host_filter object into configuration
Uses db::hints::host_filter as the type of hinted_handoff_enabled
configuration option.

Previously, hinted_handoff_enabled used to be a string option, and it
was parsed later in a separate function during startup. The function
returned a std::optional<std::unordered_set<sstring>>, whose meaning in
the context of hints is rather enigmatic for an observer not familiar
with hints.

Now, hinted_handoff_enabled has the type db::hints::host_filter, and it
is plugged into the config parsing framework, so there is no need for
later post-processing.
2020-11-17 10:24:42 +01:00
Piotr Dulikowski
5c3c7c946b db/hints: introduce host_filter
Adds a db::hints::host_filter structure, which determines if generating
hints towards a given target is currently allowed. It supports
serialization to and deserialization from the hinted_handoff_enabled
configuration/cli option.

This patch only introduces this structure, but does not make other code
use it. It will be plugged into the configuration architecture in the
following commits.
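
A rough sketch of what such a structure might look like. The accepted spellings ("true", "false", or a comma-separated list of DC names) and all member names are assumptions made for this example, not taken from the patch:

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

class host_filter_sketch {
public:
    enum class kind { enabled_for_all, disabled_for_all, enabled_selectively };

    // Deserialize from the option string (assumed format, see above).
    static host_filter_sketch parse(const std::string& opt) {
        host_filter_sketch f;
        if (opt == "true") {
            f._kind = kind::enabled_for_all;
        } else if (opt == "false") {
            f._kind = kind::disabled_for_all;
        } else {
            f._kind = kind::enabled_selectively;
            std::istringstream in(opt);
            std::string dc;
            while (std::getline(in, dc, ',')) {
                if (!dc.empty()) {
                    f._dcs.insert(dc);
                }
            }
        }
        return f;
    }

    // Whether generating hints towards a node in `dc` is currently allowed.
    bool can_hint_for(const std::string& dc) const {
        switch (_kind) {
        case kind::enabled_for_all: return true;
        case kind::disabled_for_all: return false;
        case kind::enabled_selectively: return _dcs.count(dc) != 0;
        }
        return false;
    }

    // Serialize back to the option string; DC names come out sorted.
    std::string to_string() const {
        if (_kind == kind::enabled_for_all) return "true";
        if (_kind == kind::disabled_for_all) return "false";
        std::string out;
        for (const auto& dc : _dcs) {
            if (!out.empty()) out += ',';
            out += dc;
        }
        return out;
    }

private:
    kind _kind = kind::disabled_for_all;
    std::set<std::string> _dcs;
};
```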
2020-11-17 10:15:47 +01:00
Piotr Dulikowski
a4f03d72b3 hints/resource_manager: allow registering managers after start
This change modifies db::hints::resource_manager so that it is now
possible to add hints::managers after it was started.

This change will make it possible to register the regular hints manager
later in runtime, if it wasn't enabled at boot time.
2020-11-17 10:15:47 +01:00
Piotr Dulikowski
40710677d0 hints: introduce db::hints::directory_initializer
Introduces a db::hints::directory_initializer object, which encapsulates
the logic of initializing directories for hints (creating/validating
directories, segment rebalancing). It will be useful for lazy
initialization of hints manager.
2020-11-17 10:15:47 +01:00
Piotr Dulikowski
81a568c57a directories.cc: prepare for use outside main.cc
Currently, the `directories` class is used exclusively during
initialization, in the main() function. This commit refactors this class
so that it is possible to use it to initialize directories much later
after startup.

The intent of this change is to make it possible for hints manager to
create directories for hints lazily. Currently, when Scylla is booted
with hinted handoff disabled, the `hints_directory` config parameter is
ignored and directories for hints are neither created nor verified.
Because we would like to preserve this behavior and introduce the
possibility of switching hinted handoff on at runtime, the hints
directories will have to be created lazily the first time hinted handoff
is enabled.
2020-11-17 10:15:47 +01:00
Piotr Sarna
5c66291ab9 Update seastar submodule
* seastar 043ecec7...c861dbfb (3):
  > Merge "memory: allow configuring when to dump memory diagnostics on allocation failures" from Botond
  > perftune.py: support kvm-clock on tune-clock
  > execution_stage: inheriting_concrete_execution_stage: add get_stats()
2020-11-17 08:37:39 +01:00
Dejan Mircevski
1beb57ad9d auth: Permit ALTER options on system_auth tables
These alterations cannot break the database irreparably, so allow
them.

Expand command_desc as required.

Add a type (rather than command_desc) parameter to
has_column_family_access() to minimize code changes.

Fixes #7057

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-11-16 22:32:32 -05:00
Dejan Mircevski
9a6c1b4d50 auth: Add command_desc
Instead of passing various bits of the command around, pass one
command_desc object.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-11-16 20:23:52 -05:00
Kamil Braun
d74f303406 cdc: ensure that CDC generation write is flushed to commitlog before ack
When a node bootstraps or upgrades from a pre-CDC version, it creates a
new CDC generation, writes it to a distributed table
(system_distributed.cdc_generation_descriptions), and starts gossiping
its timestamp. When other nodes see the timestamp being gossiped, they
retrieve the generation from the table.

The bootstrapping/upgrading node therefore assumes that the generation
is made durable and other nodes will be able to retrieve it from the
table. This assumption could be invalidated if periodic commitlog mode
was used: replicas would acknowledge the write and then immediately
crash, losing the write if they were unlucky (i.e. commitlog wasn't
synced to disk before the write was acknowledged).

This commit enforces all writes to the generations table to be
synced to commitlog immediately. It does not matter for performance as
these writes are very rare.

Fixes https://github.com/scylladb/scylla/issues/7610.

Closes #7619
2020-11-17 00:01:13 +02:00
Gleb Natapov
df197e36fb raft: store an entry as a shared ptr in an outgoing message
An entry can be snapshotted before the outgoing message is sent, so the
message has to hold on to it to avoid use after free.
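
The lifetime issue can be illustrated with a small sketch in which the log and the outgoing message share ownership of entries. `raft_log_sketch` and `append_request` are simplified stand-ins, not the raft types from this tree:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct log_entry {
    std::size_t idx;
    std::string data;
};
using log_entry_ptr = std::shared_ptr<const log_entry>;

// Outgoing message: co-owns the entries it refers to.
struct append_request {
    std::vector<log_entry_ptr> entries;
};

class raft_log_sketch {
    std::deque<log_entry_ptr> _entries;
    std::size_t _next_idx = 1;
public:
    void append(std::string data) {
        _entries.push_back(std::make_shared<log_entry>(log_entry{_next_idx++, std::move(data)}));
    }
    append_request make_append_request() const {
        return append_request{std::vector<log_entry_ptr>(_entries.begin(), _entries.end())};
    }
    // Taking a snapshot drops the entries from the log; a message built
    // earlier keeps them alive through its shared pointers.
    void apply_snapshot() { _entries.clear(); }
    std::size_t size() const { return _entries.size(); }
};
```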

Message-Id: <20201116113323.GA1024423@scylladb.com>
2020-11-16 17:54:21 +01:00
Piotr Sarna
fc8ffe08b9 storage_proxy: unify retiring view response handlers
Materialized view updates participate in a retirement program,
which makes sure that they are immediately taken down once their
target node is down, without having to wait for timeout (since
views are a background operation and it's wasteful to wait in the
background for minutes). However, this mechanism has very delicate
lifetime issues, and it already caused problems more than once,
most recently in #5459.
In order to make another bug in this area less likely, the two
implementations of the mechanism, in on_down() and drain_on_shutdown(),
are unified.

Possibly refs #7572

Closes #7624
2020-11-16 18:50:49 +02:00
Avi Kivity
5d45662804 database, streaming: remove remnants of memtable-based streaming
Commit e5be3352cf ("database, streaming, messaging: drop
streaming memtables") removed streaming memtables; this removes
the mechanisms to synchronize them: _streaming_flush_gate and
_streaming_flush_phaser. The memory manager for streaming is removed,
and its 10% reserve is evenly distributed between memtables and
general use (e.g. cache).

Note that _streaming_flush_phaser and _streaming_flush_gate are
no longer used to synchronize anything - the gate is only used
to protect the phaser, and the phaser isn't used for anything.

Closes #7454
2020-11-16 14:32:19 +01:00
Takuya ASADA
2ce8ca0f75 dist/common/scripts/scylla_util.py: move DEBIAN_FRONTEND environment variable to apt_install()/apt_uninstall()
The DEBIAN_FRONTEND environment variable was added just to prevent a dialog
from opening when running 'apt-get install mdadm'; no other program depends on it.
So we can move it inside apt_install()/apt_uninstall() and drop scylla_env,
since we don't have any other environment variables.
To pass the variable, an env argument was added to run()/out().
2020-11-16 14:21:36 +02:00
Avi Kivity
fcec68b102 Merge "storage_service: add mutate_token_metadata helper" from Benny
"
This is a follow-up on 052a8d036d
"Avoid stalls in token_metadata and replication strategy"

The added mutate_token_metadata helper combines:
- with_token_metadata_lock
- get_mutable_token_metadata_ptr
- replicate_to_all_cores

Test: unit(dev)
"

* tag 'mutate_token_metadata-v1' of github.com:bhalevy/scylla:
  storage_service: fixup indentation
  storage_service: mutate_token_metadata: do replicate_to_all_cores
  storage_service: add mutate_token_metadata helper
2020-11-15 20:00:19 +02:00
Benny Halevy
51e4d6490b storage_service: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-15 15:18:48 +02:00
Benny Halevy
e861c352f8 storage_service: mutate_token_metadata: do replicate_to_all_cores
Replicate the mutated token_metadata to all cores on success.

This moves replication out of update_pending_ranges(mutable_token_metadata_ptr, sstring),
so add explicit call to replicate_to_all_cores where it is called outside
of mutate_token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-15 14:34:20 +02:00
Benny Halevy
25b5db0b72 storage_service: add mutate_token_metadata helper
Replace a repeating pattern of:
    with_token_metadata_lock([] {
        return get_mutable_token_metadata_ptr([] (mutable_token_metadata_ptr tmptr) {
            // mutate token_metadata via tmptr
        });
    });

With a call to mutate_token_metadata that does both
and calls the function with the mutable_token_metadata_ptr.

A following patch will also move the replication to all
cores to mutate_token_metadata.
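
The shape of such a helper can be sketched synchronously as follows. The real code is future-based and shard-aware; here a std::mutex stands in for the token metadata lock, a copy stands in for get_mutable_token_metadata_ptr, and publishing the new pointer stands in for replication. All `_sketch` names are invented:

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

struct token_metadata_sketch {
    std::set<std::string> endpoints;
};

class storage_service_sketch {
    std::mutex _lock;
    std::shared_ptr<const token_metadata_sketch> _published =
            std::make_shared<token_metadata_sketch>();
public:
    // Takes the lock, hands a mutable copy to func, and publishes the copy
    // only if func returns normally.
    template <typename Func>
    void mutate_token_metadata(Func func) {
        std::lock_guard<std::mutex> guard(_lock);
        auto tmptr = std::make_shared<token_metadata_sketch>(*_published);
        func(*tmptr); // if this throws, the published metadata is untouched
        _published = std::move(tmptr);
    }

    std::shared_ptr<const token_metadata_sketch> get() const { return _published; }
};
```

Callers then only write the mutation itself, and a failed mutation leaves the previously published metadata in place.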

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-15 14:31:39 +02:00
Pekka Enberg
31389d1724 configure.py: Fix unified-package version and release to unbreak "dist" target
The "dist" target fails as follows:

  $ ./tools/toolchain/dbuild ninja dist
  ninja: error: 'build/dev/scylla-unified-package-..tar.gz', needed by 'dist-unified-tar', missing and no known rule to make it

Fix two issues:

- Fix Python variable references to "scylla_version" and
  "scylla_release", broken by commit bec0c15ee9 ("configure.py: Add
  version to unified tarball filename"). The breakage went unnoticed
  because ninja default target does not call into dist...

- Remove dependencies to build/<mode>/scylla-unified-package.tar.gz. The
  file is now in build/<mode>/dist/tar/ directory and contains version
  and release in the filename.

Message-Id: <20201113110706.150533-1-penberg@scylladb.com>
2020-11-15 11:10:26 +02:00
Dejan Mircevski
d554610f32 auth: Add tests for resource protections
Try to mess up system_auth tables and verify that Scylla rejects that.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-11-13 21:18:38 -05:00
Tomasz Grabiec
0a2adf4555 Merge "raft: replication test: simple partitioning" from Alejo
To test handling of connectivity issues and recovery add support for
disconnecting servers.

This is not full partitioning yet as it doesn't allow connectivity
across the disconnected servers (having multiple active partitions.

* https://github.com/alecco/scylla/pull/new/raft-ale-partition-simple-v3:
  raft: replication test: connectivity partitioning support
  raft: replication test: block rpc calls to disconnected servers
  raft: replication test: add is_disconnected helper
  raft: replication test: rename global variable
  raft: replication test: relocate global connection state map
2020-11-13 13:49:33 +01:00
Pekka Enberg
f57b894d42 configure.py: Remove duplicate scylla-package.tar.gz artifact
We currently keep a copy of scylla-package.tar.gz in "build/<mode>" for
compatibility. However, we've long since switched our CI system over to
the new location, so let's remove the duplicate and use the one from
"build/<mode>/dist/tar" instead.
Message-Id: <20201113075146.67265-1-penberg@scylladb.com>
2020-11-13 11:27:39 +01:00
Nadav Har'El
62551b3bd3 docs/alternator: mention that Alternator Streams is experimental
Add to the DynamoDB compatibility document, docs/alternator/compatibility.md,
a mention that Alternator Streams is still an experimental feature, and
how to turn it on (at this point CDC is no longer an experimental feature,
but Alternator Streams is).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112184436.940497-1-nyh@scylladb.com>
2020-11-12 21:20:04 +02:00
Nadav Har'El
450de2d89d docs/alternator: Alternator is no longer "experimental"
Drop the adjective "experimental" used to describe Alternator in
docs/alternator/getting-started.md.

In Scylla, the word "experimental" carries a specific meaning (no support
for upgrades, not enough QA, not ready for general use) and Alternator is
no longer experimental in that sense.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112185249.941484-1-nyh@scylladb.com>
2020-11-12 21:20:03 +02:00
Nadav Har'El
e40fa4b7fd test/cql-pytest: remove xfail mark from passing secondary-index test
Issue #7443 (the wrong sort order of partitions in a secondary index)
was already fixed in commit 7ff72b0ba5.
So the test for it is now passing, and we can remove its "xfail" mark.

Refs #7443

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112183441.939604-1-nyh@scylladb.com>
2020-11-12 20:43:59 +02:00
Pekka Enberg
274717c97d cql-pytest/test_keyspace.py: Add ALTER KEYSPACE test cases
This adds some test cases for ALTER KEYSPACE:

 - ALTER KEYSPACE happy path

 - ALTER KEYSPACE with invalid options

 - ALTER KEYSPACE for non-existing keyspace

 - CREATE and ALTER KEYSPACE using NetworkTopologyStrategy with
   non-existing data center in configuration, which triggers a bug in
   Scylla:

   https://github.com/scylladb/scylla/issues/7595
Message-Id: <20201112073110.39475-1-penberg@scylladb.com>
2020-11-12 20:07:12 +02:00
Alejo Sanchez
5d8752602b raft: replication test: connectivity partitioning support
Introduce partition update command consisting of nodes still seeing
each other. Nodes not included are disconnected from everything else.

If the previous leader is not part of the new partition, the first node
specified in the partition will become leader.

For other nodes to accept a new leader, it has to have a committed log.
For example, if the desired leader is being re-connected and it missed
entries that other nodes saw, it will not win the election. Example with
nodes A, B and C:

    partition{A,C},entries{2},partition{B,C}

In this case node C won't accept B as a new leader as it's missing 2
entries.
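
The rule in the example above can be sketched in a few lines. This is only
an illustration, not the test's code: grants_vote and log_len are
hypothetical names, and the check is simplified to log length (real Raft
compares the term and index of the last log entry).

```python
# Hypothetical model: each node is reduced to the number of log entries it holds.
def grants_vote(voter_log_len: int, candidate_log_len: int) -> bool:
    """A voter backs a candidate only if the candidate's log is not behind its own."""
    return candidate_log_len >= voter_log_len


# partition{A,C}, entries{2}: A and C replicate 2 entries while B is cut off.
log_len = {"A": 2, "B": 0, "C": 2}

# partition{B,C}: B asks C for its vote and is refused; C asking B would succeed.
b_gets_vote_from_c = grants_vote(log_len["C"], log_len["B"])
c_gets_vote_from_b = grants_vote(log_len["B"], log_len["C"])
```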

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-12 10:01:17 -04:00
Alejo Sanchez
2fc5b3a620 raft: replication test: block rpc calls to disconnected servers
Use global connection state with rpc, too.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-12 10:01:05 -04:00
Alejo Sanchez
c9e593a6d7 raft: replication test: add is_disconnected helper
Simplify disconnection logic with helper is_disconnected() function

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-12 10:00:58 -04:00
Alejo Sanchez
e1b0aad149 raft: replication test: rename global variable
Lowercase for global disconnection map.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-12 09:59:06 -04:00
Alejo Sanchez
7a2c6d08a1 raft: replication test: relocate global connection state map
Needed for use by the rpc class.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-12 09:58:48 -04:00
Piotr Dulikowski
5b12375842 main.cc: wait for hints manager to start
In main.cc, we spawn a future which starts the hints manager, but we
don't wait for it to complete. This can have the following consequences:

- The hints manager does some asynchronous operations during startup,
  so it can take some time to start. If it is started after we start
  handling requests, and we admit some requests which would result in
  hints being generated, those hints will be dropped instead because we
  check if hints manager is started before writing them.
- Initialization of hints manager may fail, and Scylla won't be stopped
  because of it (e.g. we don't have permissions to create hints
  directories). The consequence of this is that hints manager won't be
  started, and hints will be dropped instead of being written. This may
  affect both regular hints manager, and the view hints manager.

This commit causes us to wait until the hints manager starts and to check
whether there were any errors during initialization.
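
The difference between spawning a startup future and waiting for it can be
sketched with asyncio. This is an analogy only, not the main.cc code:
HintsManager, boot and try_boot are hypothetical names.

```python
import asyncio


class HintsManager:
    """Hypothetical stand-in for a service with asynchronous startup."""

    def __init__(self, fail: bool = False):
        self.started = False
        self._fail = fail

    async def start(self) -> None:
        await asyncio.sleep(0)  # simulate asynchronous initialization work
        if self._fail:
            raise PermissionError("cannot create hints directory")
        self.started = True


async def boot(manager: HintsManager) -> str:
    # Awaiting start() means requests are admitted only once the manager is up,
    # and an initialization error aborts startup instead of being lost.
    await manager.start()
    return "serving"


def try_boot(manager: HintsManager) -> str:
    """Run the boot sequence and report a startup failure as a string."""
    try:
        return asyncio.run(boot(manager))
    except PermissionError as e:
        return f"startup failed: {e}"
```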

Fixes #7598

Closes #7599
2020-11-12 14:17:10 +02:00
Nadav Har'El
78649c2322 Merge 'Mark CDC as GA' from Piotr Jastrzębski
CDC is ready to be a non-experimental feature so remove the experimental flag for it.
Also, guard Alternator Streams with their own experimental flag. Previously, they were using CDC experimental flag as they depend on CDC.

Tests: unit(dev)

Closes #7539

* github.com:scylladb/scylla:
  alternator: guard streams with an experimental flag
  Mark CDC as GA
  cdc: Make it possible for CDC generation creation to fail
2020-11-12 13:49:27 +02:00
Piotr Jastrzebski
d2897d8f8b alternator: guard streams with an experimental flag
Add new alternator-streams experimental flag for
alternator streams control.

CDC becomes GA and won't be guarded by an experimental flag any more.
Alternator Streams stay experimental so now they need to be controlled
by their own experimental flag.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-12 12:36:16 +01:00
Piotr Jastrzebski
e9072542c1 Mark CDC as GA
Enable CDC by default.
Rename CDC experimental feature to UNUSED_CDC to keep accepting cdc
flag.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-12 12:36:13 +01:00
Piotr Jastrzebski
2091408478 cdc: Make it possible for CDC generation creation to fail
The following patch enables CDC by default, which means CDC now has to
work with all clusters.

There is a problematic case when an existing cluster with no CDC support
is stopped and all the binaries are updated to a newer version with
CDC enabled by default. In such a case, nodes know that they are already
members of the cluster but they can't find any CDC generation, so they
will try to create one. This creation may fail due to lack of QUORUM
for the write.

Before this patch, such a situation would lead to the node failing to start.
After the change, the node will start but the CDC generation will be
missing. This means CDC won't be able to work on such a cluster until
nodetool checkAndRepairCdcStreams is run to fix the CDC generation.

We still fail to bootstrap if the creation of CDC generation fails.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-12 12:29:31 +01:00
Lubos Kosco
5c488b6e9a scylla_util.py: properly parse GCP instances without size
fixes #7577

Closes #7592
2020-11-12 13:01:40 +02:00
Piotr Sarna
d43ac783c6 db,view: degrade helper message from error to warn
When a missing base column happens to be named `idx_token`,
an additional helper message is printed in logs.
This additional message does not need to have `error` severity,
since the previous, generic message is already marked as `error`.
This patch simply makes it easier to write tests, because in case
this error is expected, only one message needs to be explicitly
ignored instead of two.

Closes #7597
2020-11-12 12:28:26 +02:00
Avi Kivity
6091dc9b79 Merge 'Add more overload-related metrics' from Piotr Sarna
This miniseries adds metrics which can help the users detect potential overloads:
 * due to having too many in-flight hints
 * due to exceeding the capacity of the read admission queue, on replica side

Closes #7584

* github.com:scylladb/scylla:
  reader_concurrency_semaphore: add metrics for shed reads
  storage_proxy: add metrics for too many in-flight hints failures
2020-11-12 12:27:31 +02:00
Raphael S. Carvalho
13fa2bec4c compaction: Make sure a partition is filtered out only by producer
If interposer consumer is enabled, partition filtering will be done by the
consumer instead, but that's not possible because only the producer is able
to skip to the next partition if the current one is filtered out, so scylla
crashes when that happens with a bad function call in queue_reader.
This is a regression which started here: 55a8b6e3c9

To fix this problem, let's make sure that partition filtering will only
happen on the producer side.

Fixes #7590.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201111221513.312283-1-raphaelsc@scylladb.com>
2020-11-12 12:22:10 +02:00
Avi Kivity
052a8d036d Merge "Avoid stalls in token_metadata and replication strategy" from Benny
"
This series is a rebased version of 3 patchsets that were sent
separately before:

1. [PATCH v4 00/17] Cleanup storage_service::update_pending_ranges et al.
    This patchset cleans up service/storage_service use of
    update_pending_ranges and replicate_to_all_cores.

    It also moves some functionality from gossiping_property_file_snitch::reload_configuration
    into a new method - storage_service::update_topology.

    This prepares storage_service for using a shared ptr to token_metadata,
    updating a copy out of line under a semaphore that serializes writers,
    and eventually replicating the updated copy to all shards and releasing
    the lock.  This is a follow up to #7044.

2. [PATCH v8 00/20] token_metadata versioned shared ptr
    Rather than keeping references on token_metadata use a shared_token_metadata
    containing a lw_shared_ptr<token_metadata> (a.k.a token_metadata_ptr)
    to keep track of the token_metadata.

    Get token_metadata_ptr for a read-only snapshot of the token_metadata
    or clone one for a mutable snapshot that is later used to safely update
    the base versioned_shared_object.

    token_metadata_ptr is used to modify token_metadata out of line, possibly with
    multiple calls, that could be preempted in-between so that readers can keep a consistent
    snapshot of it while writers prepare an updated version.

    Introduce a token_metadata_lock used to serialize mutators of token_metadata_ptr.
    It's taken by the storage_service before cloning token_metadata_ptr and held
    until the updated copy is replicated on all shards.

    In addition, this series introduces the token_metadata::clone_async() method
    to copy the token_metadata class using an asynchronous function with
    continuations to avoid reactor stalls as seen in #7220.

    Fixes #7044

3. [PATCH v3 00/17] Avoid stalls in token_metadata and replication strategy

    This series uses the shared_token_metadata infrastructure.

    The first patches in the series deal with cloning token_metadata
    using continuations to allow preemption while cloning (See #7220).

    Then, the rest of the series makes sure to always run
    `update_pending_ranges` and `calculate_pending_ranges_for_*` in a thread,
    it then adds a `can_yield` parameter to the token_metadata and abstract_replication_strategy
    `get_pending_ranges` and friends, and finally it adds `maybe_yield` calls
    in potentially long loops.

    Fixes #7313
    Fixes #7220

Test: unit (dev)
Dtest: gating(dev)
"

* tag 'replication_strategy_can_yield-v4' of github.com:bhalevy/scylla: (54 commits)
  token_metadata_impl: set_pending_ranges: add can_yield_param
  abstract_replication_strategy: get rid of get_ranges_in_thread
  repair: call get_ranges_in_thread where possible
  abstract_replication_strategy: add can_yield param to get_pending_ranges and friends
  abstract_replication_strategy: define can_yield bool_class
  token_metadata_impl: calculate_pending_ranges_for_* reindent
  token_metadata_impl: calculate_pending_ranges_for_* pass new_pending_ranges by ref
  token_metadata_impl: calculate_pending_ranges_for_* call in thread
  token_metadata: update_pending_ranges: create seastar thread
  abstract_replication_strategy: add get_address_ranges method for specific endpoint
  token_metadata_impl: clone_after_all_left: sort tokens only once
  token_metadata: futurize clone_after_all_left
  token_metadata: futurize clone_only_token_map
  token_metadata: use mutable_token_metadata_ptr in calculate_pending_ranges_for_*
  repair: replace_with_repair: use token_metadata::clone_async
  storage_service: reindent token_metadata blocks
  token_metadata: add clone_async
  abstract_replication_strategy: accept a token_metadata_ptr in get_pending_address_ranges methods
  abstract_replication_strategy: accept a token_metadata_ptr in get_ranges methods
  boot_strapper: get_*_tokens: use token_metadata_ptr
  ...
2020-11-12 11:56:05 +02:00
Nadav Har'El
b01bdcf910 alternator streams: add test for StartingSequenceNumber
Add a test that better clarifies what StartingSequenceNumber returned by
DescribeStream really guarantees (this question was raised in a review
of a different patch). The main thing we can guarantee is that reading a
shard from that position returns all the information in that shard -
similar to TRIM_HORIZON. This test verifies this, and it passes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112081250.862119-1-nyh@scylladb.com>
2020-11-12 10:40:41 +01:00
Piotr Sarna
3ce7848bdf reader_concurrency_semaphore: add metrics for shed reads
When the admission queue capacity reaches its limits, excessive
reads are shed in order to avoid overload. Each such operation
now bumps the metrics, which can help the user judge if a replica
is overloaded.
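
A minimal sketch of the idea, assuming a bounded wait queue. AdmissionQueue,
max_queue_length and shed_reads are hypothetical names, not the actual
reader_concurrency_semaphore members or metric names.

```python
from collections import deque


class AdmissionQueue:
    """Hypothetical bounded queue that counts the reads it refuses."""

    def __init__(self, max_queue_length: int):
        self.max_queue_length = max_queue_length
        self.waiters = deque()
        self.shed_reads = 0  # counter that would be exported as a metric

    def enqueue(self, read) -> bool:
        """Queue the read, or shed it and bump the counter if the queue is full."""
        if len(self.waiters) >= self.max_queue_length:
            self.shed_reads += 1
            return False
        self.waiters.append(read)
        return True
```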
2020-11-11 19:01:38 +01:00
Piotr Wojtczak
d9810ec8eb cql_metrics: Add counters for CQL request messages
This change adds metrics for counting request message types
listed in the CQL v.4 spec under section 4.1
(https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v4.spec).
To organize things properly, we introduce a new cql_server::transport_stats
object type for aggregating the message and server statistics.

Fixes #4888

Closes #7574
2020-11-11 20:00:17 +02:00
Avi Kivity
d5a6aa4533 Merge 'cql3: Rewrite the need_filtering logic' from Dejan Mircevski
Rewrite in a more readable way that will later allow us to split the WHERE expression in two: a storage-reading part and a post-read filtering part.

Tests: unit (dev,debug)

Closes #7591

* github.com:scylladb/scylla:
  cql3: Rewrite need_filtering() from scratch
  cql3: Store index info in statement_restrictions
2020-11-11 20:00:17 +02:00
Nadav Har'El
940ac80798 cql-pytest: rename test_object_name() function
The name of the utility function test_object_name() is confusing - by
starting with the word "test", pytest can think (if it's imported to the
top-level namespace) that it is a test... So this patch gives it a better
name - unique_name().
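
For illustration, a helper of this kind could look as follows. This is a
sketch only: the body of unique_name() and the default prefix are
assumptions, and looks_like_test() merely mimics pytest's default rule of
collecting functions whose names start with "test".

```python
import itertools
import time

# Hypothetical implementation; the real unique_name() in test/cql-pytest may differ.
_counter = itertools.count(int(time.time() * 1000))


def unique_name(prefix: str = "cql_test_") -> str:
    """Return a name that is not reused within this process."""
    return prefix + str(next(_counter))


def looks_like_test(name: str) -> bool:
    """Mimic pytest's default function collection rule (prefix 'test')."""
    return name.startswith("test")
```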

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201111140638.809189-1-nyh@scylladb.com>
2020-11-11 20:00:17 +02:00
Nadav Har'El
90eba0ce04 alternator, docs: add a new compatibility.md document
This patch adds a new document, docs/alternator/compatibility.md,
which focuses on what users switching from DynamoDB to Alternator
need to know about where Alternator differs from DynamoDB and which
features are missing.

The compatibility information in the old alternator.md is not deleted
yet. It probably should.

Fixes #7556

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110180242.716295-1-nyh@scylladb.com>
2020-11-11 20:00:17 +02:00
Avi Kivity
06c949b452 Update seastar submodule
* seastar a62a80ba1d...043ecec732 (8):
  > semaphore: make_expiry_handler: explicitly use this lambda capture
  > configure: add --{enable,disable}-debug-shared-ptr option
  > cmake: add SEASTAR_DEBUG_SHARED_PTR also in dev mode
  > tls_test: Update the certificates to use sha256
  > logger: allow applying a rate-limit to log messages
  > Merge "Handle CPUs not attached to any NUMA nodes" from Pavel E
  > memory: fix malloc_usable_size() during early initialization
  > Merge "make semaphore related functions noexcept" from Benny
2020-11-11 20:00:17 +02:00
Dejan Mircevski
9150a967c6 cql3: Rewrite need_filtering() from scratch
Makes it easier to understand, in preparation for separating the WHERE
expression into filtering and storage-reading parts.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-11-11 08:25:36 -05:00
Dejan Mircevski
e754026010 cql3: Store index info in statement_restrictions
To rewrite need_filtering() in a more readable way, we need to store
info on found indexes in statement_restrictions data members.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-11-11 08:25:36 -05:00
Benny Halevy
275fe30628 token_metadata_impl: set_pending_ranges: add can_yield_param
To prevent a > 10 ms stall when inserting to boost::icl::interval_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
1e2138e8ef abstract_replication_strategy: get rid of get_ranges_in_thread
Use the can_yield param to get_ranges instead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
e4e0e71b50 repair: call get_ranges_in_thread where possible
To prevent reactor stalls during repair-based operations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
ba31350239 abstract_replication_strategy: add can_yield param to get_pending_ranges and friends
To prevent reactor stalls as seen in #7313.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
6c2a089a6f abstract_replication_strategy: define can_yield bool_class
To be used by convention by several other methods.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
7fb489d338 token_metadata_impl: calculate_pending_ranges_for_* reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
6ce2436a4c token_metadata_impl: calculate_pending_ranges_for_* pass new_pending_ranges by ref
We can use the seastar thread to keep the vector rather than creating
a lw_shared_ptr for it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
0ca423dcfc token_metadata_impl: calculate_pending_ranges_for_* call in thread
The functions can be simplified as they are all now being called
from a seastar thread.

Make them sequential, returning void, and yielding if necessary.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
84d086dc77 token_metadata: update_pending_ranges: create seastar thread
So we can yield in this path to prevent reactor stalls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
1e6c181678 abstract_replication_strategy: add get_address_ranges method for specific endpoint
Some of the callers of get_address_ranges are interested in the ranges
of a specific endpoint.

Rather than building a map for all endpoints and then traversing
it looking for this specific endpoint, build a multimap of token ranges
relating only to the specified endpoint.
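
The saving can be sketched as follows, assuming the input is a list of
(token_range, endpoints) pairs. Both function names and the data layout are
hypothetical; they are not the abstract_replication_strategy API.

```python
from collections import defaultdict


def all_address_ranges(range_to_endpoints):
    """Old approach: build the endpoint -> ranges map for every endpoint."""
    result = defaultdict(list)
    for token_range, endpoints in range_to_endpoints:
        for endpoint in endpoints:
            result[endpoint].append(token_range)
    return result


def address_ranges_for(range_to_endpoints, endpoint):
    """New approach: collect only the ranges owned by the requested endpoint."""
    return [rng for rng, endpoints in range_to_endpoints if endpoint in endpoints]
```

Both return the same ranges for a given endpoint, but the second never
materializes entries for the endpoints the caller does not care about.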

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
2ce6773dae token_metadata_impl: clone_after_all_left: sort tokens only once
Currently the sorted tokens are copied needlessly on this path
by `clone_only_token_map` and then recalculated after calling
remove_endpoint for each leaving endpoint.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
0abd8e62cd token_metadata: futurize clone_after_all_left
Call the futurized clone_only_token_map and
remove the _leaving_endpoints from the cloned token_metadata_impl.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
4a622c14e1 token_metadata: futurize clone_only_token_map
Does part of clone_async() using continuations to prevent stalls.

Rename synchronous variant to clone_only_token_map_sync
that is going to be deprecated once all its users will be futurized.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
d1a73ec7b3 token_metadata: use mutable_token_metadata_ptr in calculate_pending_ranges_for_*
Replacing old code using lw_shared_ptr<token_metadata> with the "modern"
mutable_token_metadata_ptr alias.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
6af7b689f3 repair: replace_with_repair: use token_metadata::clone_async
Clone the input token_metadata asynchronously using
clone_async() before modifying it using update_normal_tokens.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
d4d9f3e8a9 storage_service: reindent token_metadata blocks
Many code blocks using with_token_metadata_lock
and get_mutable_token_metadata_ptr now need re-indenting.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
4fc5997949 token_metadata: add clone_async
Clone token_metadata object using async continuation to
prevent reactor stalls.
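
The general technique, copying in chunks and yielding between them, can be
sketched with asyncio. This is an analogy to the Seastar code, not a port of
it; the dict-based token map and the chunk parameter are assumptions.

```python
import asyncio


async def clone_async(tokens: dict, chunk: int = 1000) -> dict:
    """Copy a token map, yielding to the event loop every `chunk` entries."""
    cloned = {}
    for i, (token, endpoint) in enumerate(tokens.items(), start=1):
        cloned[token] = endpoint
        if i % chunk == 0:
            await asyncio.sleep(0)  # let other tasks run before continuing
    return cloned
```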

Refs https://github.com/scylladb/scylla/issues/7220

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
5ab7b0b2ea abstract_replication_strategy: accept a token_metadata_ptr in get_pending_address_ranges methods
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
349aa966ba abstract_replication_strategy: accept a token_metadata_ptr in get_ranges methods
In preparation to returning future<dht::token_range_vector>
from async variants.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
1cbe54a9cf boot_strapper: get_*_tokens: use token_metadata_ptr
To facilitate preempting of long running loops if needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
63137b35ea range_streamer: convert to token_metadata_ptr
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
6cba82a792 repair: accept a token_metadata_ptr in repair based node ops
Only replace_with_repair needs to clone the token_metadata
and update the local copy, so we can safely pass a read-only
snapshot of the token_metadata rather than copying it in all cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
7697c0f129 cdc: generation: use token_metadata_ptr
So it could be safely held across continuations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
ecda21224e storage_service: replicate_to_all_cores: make exception safe
Perform replication in 2 phases.
First phase just clones the mutable_token_metadata_ptr on all shards.
Second phase applies the cloned copies onto each local_ss._shared_token_metadata.
That phase should never fail.
To add suspenders over the belt, in the impossible case we do get an
exception, it is logged and we abort.
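
A minimal sketch of the two-phase scheme, with shards modeled as plain
dicts. The function name and data layout are hypothetical and do not
reflect the storage_service code.

```python
import copy


def replicate_to_all(shards: list, updated: dict) -> None:
    """Two-phase update: clone for every shard first, then swap the copies in."""
    # Phase 1: may fail (e.g. on allocation); no shard has been modified yet.
    clones = [copy.deepcopy(updated) for _ in shards]
    # Phase 2: plain reference assignments, which are not expected to fail.
    for shard, clone in zip(shards, clones):
        shard["token_metadata"] = clone
```

If phase 1 raises, every shard still holds its previous version, so the
shards never end up with a mix of old and new copies.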

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
41c7efd0c0 storage_service: convert to token_metadata_ptr
clone _token_metadata for updating into _updated_token_metadata
and use it to update the local token_metadata on all shards via
do_update_pending_ranges().

Adjust get_token_metadata to get either the updated_token_metadata,
if available, or the base token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
fa880439c9 storage_service: use token_metadata_lock to serialize updates to token_metadata
Rather than using `serialized_action`, grab a lock before mutating
_token_metadata and hold it until it's replicated to all shards.

A following patch will use a mutable token_metadata_ptr
that is updated out of line under the lock.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
476b4daa48 storage_service: convert to shared_token_metadata
In preparation to using token_metadata_ptr and token_metadata_lock.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
88a4c6de13 storage_service: init_server: replicate_to_all_cores after updating token_metadata
Currently the replication to other shards happens later in `prepare_to_join`
that is called in `init_server`.
We should isolate the changes made by init_server and update them first
to all shards so that we can serialize them easily using a lock
and a mutable_token_metadata_ptr, otherwise the lock and the mutable_token_metadata_ptr
will have to be handed over (from this call path) to `prepare_to_join`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
b13156de7d storage_service: use get_token_metadata and get_mutable_token_metadata methods
In preparation to converting to using shared_token_metadata internally.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
572638671c storage_proxy: query_ranges_to_vnodes_generator ranges_to_vnodes: use token_metadata_ptr
Fixes use-after-free seen with putget_with_reloaded_certificates_test:
```
==215==ERROR: AddressSanitizer: heap-use-after-free on address 0x603000a8b180 at pc 0x000012eb5a83 bp 0x7ffd2c16d4c0 sp 0x7ffd2c16d4b0
READ of size 8 at 0x603000a8b180 thread T0
    #0 0x12eb5a82 in std::__uniq_ptr_impl<locator::token_metadata_impl, std::default_delete<locator::token_metadata_impl> >::_M_ptr() const /usr/include/c++/10/bits/unique_ptr.h:173
    #1 0x12ea230d in std::unique_ptr<locator::token_metadata_impl, std::default_delete<locator::token_metadata_impl> >::get() const /usr/include/c++/10/bits/unique_ptr.h:422
    #2 0x12e8d3e8 in std::unique_ptr<locator::token_metadata_impl, std::default_delete<locator::token_metadata_impl> >::operator->() const /usr/include/c++/10/bits/unique_ptr.h:416
    #3 0x12e5d0a2 in locator::token_metadata::ring_range(std::optional<interval_bound<dht::ring_position> > const&, bool) const locator/token_metadata.cc:1712
    #4 0x112d0126 in service::query_ranges_to_vnodes_generator::process_one_range(unsigned long, std::vector<nonwrapping_interval<dht::ring_position>, std::allocator<nonwrapping_interval<dht::ring_position> > >&) service/storage_proxy.cc:4658
    #5 0x112cf3c5 in service::query_ranges_to_vnodes_generator::operator()(unsigned long) service/storage_proxy.cc:4616
    #6 0x112b2261 in service::storage_proxy::query_partition_key_range_concurrent(std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >, std::vector<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >, std::allocator<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > > >&&, seastar::lw_shared_ptr<query::read_command>, db::consistency_level, service::query_ranges_to_vnodes_generator&&, int, tracing::trace_state_ptr, unsigned long, unsigned int, std::unordered_map<nonwrapping_interval<dht::token>, std::vector<utils::UUID, std::allocator<utils::UUID> >, std::hash<nonwrapping_interval<dht::token> >, std::equal_to<nonwrapping_interval<dht::token> >, std::allocator<std::pair<nonwrapping_interval<dht::token> const, std::vector<utils::UUID, std::allocator<utils::UUID> > > > >, service_permit) service/storage_proxy.cc:4023
    #7 0x112b094e in operator() service/storage_proxy.cc:4160
    #8 0x1139c8bb in invoke<service::storage_proxy::query_partition_key_range_concurrent(seastar::lowres_clock::time_point, std::vector<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >&&, seastar::lw_shared_ptr<query::read_command>, db::consistency_level, service::query_ranges_to_vnodes_generator&&, int, tracing::trace_state_ptr, uint64_t, uint32_t, service::replicas_per_token_range, service_permit)::<lambda(seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:2088
    #9 0x1136625b in futurize_invoke<service::storage_proxy::query_partition_key_range_concurrent(seastar::lowres_clock::time_point, std::vector<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >&&, seastar::lw_shared_ptr<query::read_command>, db::consistency_level, service::query_ranges_to_vnodes_generator&&, int, tracing::trace_state_ptr, uint64_t, uint32_t, service::replicas_per_token_range, service_permit)::<lambda(seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:2119
    #10 0x11366372 in operator()<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1480
    #11 0x1139cc3b in call /local/home/bhalevy/dev/scylla/seastar/include/seastar/util/noncopyable_function.hh:145
    #12 0x116f4944 in seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>::operator()(seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&) const /local/home/bhalevy/dev/scylla/seastar/include/seastar/util/noncopyable_function.hh:201
    #13 0x116b3397 in seastar::future<service::query_partition_key_range_concurrent_result> std::__invoke_impl<seastar::future<service::query_partition_key_range_concurrent_result>, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >(std::__invoke_other, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&) /usr/include/c++/10/bits/invoke.h:60
    #14 0x1165c3a6 in std::__invoke_result<seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >::type std::__invoke<seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >(seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&) /usr/include/c++/10/bits/invoke.h:96
    #15 0x115e6542 in decltype(auto) std::__apply_impl<seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, std::tuple<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >, 0ul>(seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, std::tuple<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >&&, std::integer_sequence<unsigned long, 0ul>) /usr/include/c++/10/tuple:1724
    #16 0x115e6663 in decltype(auto) std::apply<seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, std::tuple<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > > >(seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, std::tuple<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >&&) /usr/include/c++/10/tuple:1736
    #17 0x115e63f9 in seastar::future<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>, seastar::future<service::query_partition_key_range_concurrent_result> >(seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&&)::{lambda(seastar::internal::promise_base_with_type<service::query_partition_key_range_concurrent_result>&&, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::future_state<std::tuple<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > > >&&)#1}::operator()(seastar::internal::promise_base_with_type<service::query_partition_key_range_concurrent_result>&&, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::future_state<std::tuple<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > > >&&) const::{lambda()#1}::operator()() const /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1530
    #18 0x1165c4b9 in void seastar::futurize<seastar::future<service::query_partition_key_range_concurrent_result> >::satisfy_with_result_of<seastar::future<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>, seastar::future<service::query_partition_key_range_concurrent_result> >(seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&&)::{lambda(seastar::internal::promise_base_with_type<service::query_partition_key_range_concurrent_result>&&, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::future_state<std::tuple<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > > >&&)#1}::operator()(seastar::internal::promise_base_with_type<service::query_partition_key_range_concurrent_result>&&, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::future_state<std::tuple<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > > >&&) const::{lambda()#1}>(seastar::internal::promise_base_with_type<service::query_partition_key_range_concurrent_result>&&, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&&) /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:2073
    #19 0x115e61f5 in seastar::future<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>, seastar::future<service::query_partition_key_range_concurrent_result> >(seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&&)::{lambda(seastar::internal::promise_base_with_type<service::query_partition_key_range_concurrent_result>&&, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::future_state<std::tuple<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > > >&&)#1}::operator()(seastar::internal::promise_base_with_type<service::query_partition_key_range_concurrent_result>&&, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::future_state<std::tuple<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > > >&&) const /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1528
    #20 0x1176e9cc in seastar::continuation<seastar::internal::promise_base_with_type<service::query_partition_key_range_concurrent_result>, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>, seastar::future<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>, seastar::future<service::query_partition_key_range_concurrent_result> >(seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&&)::{lambda(seastar::internal::promise_base_with_type<service::query_partition_key_range_concurrent_result>&&, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::future_state<std::tuple<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > > >&&)#1}, seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >::run_and_dispose() /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:746
    #21 0x16a9a455 in seastar::reactor::run_tasks(seastar::reactor::task_queue&) /local/home/bhalevy/dev/scylla/seastar/src/core/reactor.cc:2196
    #22 0x16a9e691 in seastar::reactor::run_some_tasks() /local/home/bhalevy/dev/scylla/seastar/src/core/reactor.cc:2575
    #23 0x16aa390e in seastar::reactor::run() /local/home/bhalevy/dev/scylla/seastar/src/core/reactor.cc:2730
    #24 0x168ae4f7 in seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) /local/home/bhalevy/dev/scylla/seastar/src/core/app-template.cc:207
    #25 0x168ac541 in seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) /local/home/bhalevy/dev/scylla/seastar/src/core/app-template.cc:115
    #26 0xd6cd3c4 in main /local/home/bhalevy/dev/scylla/main.cc:504
    #27 0x7f8d905d8041 in __libc_start_main (/local/home/bhalevy/dev/scylla/build/debug/dynamic_libs/libc.so.6+0x27041)
    #28 0xd67c9ed in _start (/local/home/bhalevy/.dtest/dtest-o0qoqmkr/test/node3/bin/scylla+0xd67c9ed)

0x603000a8b180 is located 16 bytes inside of 24-byte region [0x603000a8b170,0x603000a8b188)
freed by thread T0 here:
    #0 0x7f8d92a190cf in operator delete(void*, unsigned long) (/local/home/bhalevy/dev/scylla/build/debug/dynamic_libs/libasan.so.6+0xb30cf)
    #1 0xd7ebe54 in seastar::internal::lw_shared_ptr_accessors_no_esft<locator::token_metadata>::dispose(seastar::lw_shared_ptr_counter_base*) /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/shared_ptr.hh:213
    #2 0x112b155d in seastar::lw_shared_ptr<locator::token_metadata const>::~lw_shared_ptr() /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/shared_ptr.hh:300
    #3 0x112b155d in ~<lambda> service/storage_proxy.cc:4137
    #4 0x1132e92d in ~<lambda> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1479
    #5 0x1139cc91 in destroy /local/home/bhalevy/dev/scylla/seastar/include/seastar/util/noncopyable_function.hh:148
    #6 0x11565673 in seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>::~noncopyable_function() /local/home/bhalevy/dev/scylla/seastar/include/seastar/util/noncopyable_function.hh:181
    #7 0x1176e783 in seastar::continuation<seastar::internal::promise_base_with_type<service::query_partition_key_range_concurrent_result>, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>, seastar::future<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>, seastar::future<service::query_partition_key_range_concurrent_result> >(seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&&)::{lambda(seastar::internal::promise_base_with_type<service::query_partition_key_range_concurrent_result>&&, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::future_state<std::tuple<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > > >&&)#1}, seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >::~continuation() /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:729
    #8 0x1176ea06 in seastar::continuation<seastar::internal::promise_base_with_type<service::query_partition_key_range_concurrent_result>, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>, seastar::future<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>, seastar::future<service::query_partition_key_range_concurrent_result> >(seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&&)::{lambda(seastar::internal::promise_base_with_type<service::query_partition_key_range_concurrent_result>&&, seastar::noncopyable_function<seastar::future<service::query_partition_key_range_concurrent_result> (seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&&)>&, seastar::future_state<std::tuple<seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > > >&&)#1}, seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> > >::run_and_dispose() /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:750
    #9 0x16a9a455 in seastar::reactor::run_tasks(seastar::reactor::task_queue&) /local/home/bhalevy/dev/scylla/seastar/src/core/reactor.cc:2196
    #10 0x16a9e691 in seastar::reactor::run_some_tasks() /local/home/bhalevy/dev/scylla/seastar/src/core/reactor.cc:2575
    #11 0x16aa390e in seastar::reactor::run() /local/home/bhalevy/dev/scylla/seastar/src/core/reactor.cc:2730
    #12 0x168ae4f7 in seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) /local/home/bhalevy/dev/scylla/seastar/src/core/app-template.cc:207
    #13 0x168ac541 in seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) /local/home/bhalevy/dev/scylla/seastar/src/core/app-template.cc:115
    #14 0xd6cd3c4 in main /local/home/bhalevy/dev/scylla/main.cc:504
    #15 0x7f8d905d8041 in __libc_start_main (/local/home/bhalevy/dev/scylla/build/debug/dynamic_libs/libc.so.6+0x27041)

previously allocated by thread T0 here:
    #0 0x7f8d92a18067 in operator new(unsigned long) (/local/home/bhalevy/dev/scylla/build/debug/dynamic_libs/libasan.so.6+0xb2067)
    #1 0x13cf7132 in seastar::lw_shared_ptr<locator::token_metadata> seastar::lw_shared_ptr<locator::token_metadata>::make<locator::token_metadata>(locator::token_metadata&&) /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/shared_ptr.hh:266
    #2 0x13cc3bfa in seastar::lw_shared_ptr<locator::token_metadata> seastar::make_lw_shared<locator::token_metadata>(locator::token_metadata&&) /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/shared_ptr.hh:422
    #3 0x13ca3007 in seastar::lw_shared_ptr<locator::token_metadata> locator::make_token_metadata_ptr<locator::token_metadata>(locator::token_metadata) locator/token_metadata.hh:338
    #4 0x13c9bdd4 in locator::shared_token_metadata::clone() const locator/token_metadata.hh:358
    #5 0x13c9c18a in service::storage_service::get_mutable_token_metadata_ptr() service/storage_service.hh:184
    #6 0x13a5a445 in service::storage_service::handle_state_normal(gms::inet_address) service/storage_service.cc:1129
    #7 0x13a6371c in service::storage_service::on_change(gms::inet_address, gms::application_state, gms::versioned_value const&) service/storage_service.cc:1421
    #8 0x12a86269 in operator() gms/gossiper.cc:1639
    #9 0x12ad3eea in call /local/home/bhalevy/dev/scylla/seastar/include/seastar/util/noncopyable_function.hh:145
    #10 0x12be2aff in seastar::noncopyable_function<void (seastar::shared_ptr<gms::i_endpoint_state_change_subscriber>)>::operator()(seastar::shared_ptr<gms::i_endpoint_state_change_subscriber>) const /local/home/bhalevy/dev/scylla/seastar/include/seastar/util/noncopyable_function.hh:201
    #11 0x12bb8e98 in atomic_vector<seastar::shared_ptr<gms::i_endpoint_state_change_subscriber> >::for_each(seastar::noncopyable_function<void (seastar::shared_ptr<gms::i_endpoint_state_change_subscriber>)>) utils/atomic_vector.hh:62
    #12 0x12a8662b in gms::gossiper::do_on_change_notifications(gms::inet_address, gms::application_state const&, gms::versioned_value const&) gms/gossiper.cc:1638
    #13 0x12a9387c in operator() gms/gossiper.cc:1978
    #14 0x12b49b20 in __invoke_impl<void, gms::gossiper::add_local_application_state(std::__cxx11::list<std::pair<gms::application_state, gms::versioned_value> >)::<lambda(gms::gossiper&)> mutable::<lambda()> > /usr/include/c++/10/bits/invoke.h:60
    #15 0x12b21fd6 in __invoke<gms::gossiper::add_local_application_state(std::__cxx11::list<std::pair<gms::application_state, gms::versioned_value> >)::<lambda(gms::gossiper&)> mutable::<lambda()> > /usr/include/c++/10/bits/invoke.h:95
    #16 0x12b02865 in __apply_impl<gms::gossiper::add_local_application_state(std::__cxx11::list<std::pair<gms::application_state, gms::versioned_value> >)::<lambda(gms::gossiper&)> mutable::<lambda()>, std::tuple<> > /usr/include/c++/10/tuple:1723
    #17 0x12b028d8 in apply<gms::gossiper::add_local_application_state(std::__cxx11::list<std::pair<gms::application_state, gms::versioned_value> >)::<lambda(gms::gossiper&)> mutable::<lambda()>, std::tuple<> > /usr/include/c++/10/tuple:1734
    #18 0x12b02967 in apply<gms::gossiper::add_local_application_state(std::__cxx11::list<std::pair<gms::application_state, gms::versioned_value> >)::<lambda(gms::gossiper&)> mutable::<lambda()> > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:2052
    #19 0x12ad866a in operator() /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/thread.hh:258
    #20 0x12b609c2 in call /local/home/bhalevy/dev/scylla/seastar/include/seastar/util/noncopyable_function.hh:116
    #21 0xdfabb5f in seastar::noncopyable_function<void ()>::operator()() const /local/home/bhalevy/dev/scylla/seastar/include/seastar/util/noncopyable_function.hh:201
    #22 0x16e21bb4 in seastar::thread_context::main() /local/home/bhalevy/dev/scylla/seastar/src/core/thread.cc:297
    #23 0x16e2190f in seastar::thread_context::s_main(int, int) /local/home/bhalevy/dev/scylla/seastar/src/core/thread.cc:275
    #24 0x7f8d9060322f  (/local/home/bhalevy/dev/scylla/build/debug/dynamic_libs/libc.so.6+0x5222f)
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
3fab0f8694 storage_proxy: convert to shared_token_metadata
get() the latest token_metadata_ptr from the
shared_token_metadata before each use.

expose get_token_metadata_ptr() rather than get_token_metadata()
so that caller can keep it across continuations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
a0436ea324 gossiper: convert to shared_token_metadata
get() the latest token_metadata& from the
shared_token_metadata before each use.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
6d06853e6c abstract_replication_strategy: convert to shared_token_metadata
To facilitate that, keep a const shared_token_metadata& in class database
rather than a const token_metadata&

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
f5f28e9b36 test: network_topology_strategy_test: constify calculate_natural_endpoints
In preparation for changing network_topology_strategy to
accept a const shared_token_metadata& rather than token_metadata&.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
45fb57a2ec abstract_replication_strategy: pass token_metadata& to get_cached_endpoints
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
ade8c77a7c abstract_replication_strategy: pass token_metadata& to do_get_natural_endpoints
Rather than accessing abstract_replication_strategy::_token_metadata directly.
In preparation to changing it to a shared_token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
29ed59f8c4 main: start a shared_token_metadata
And use it to get a token_metadata& compatible
with current usage, until the services are converted to
use token_metadata_ptr.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
9d2cffe7ab storage_service: make class a peering_storage_service
No need to call the global service::get_storage_service()
from within the class non-static methods.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
b41a1cf472 storage_service: report all errors from update_pending_ranges and replicate_to_all_cores
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
4188a0b384 storage_service: do_replicate_to_all_cores: call on_internal_error if failed
Now that `replicate_tm_only` doesn't throw, we handle all errors
in `replicate_tm_only().handle_exception`.

We can't just proceed with business as usual if we failed to replicate
token_metadata on all shards and continue working with inconsistent
copies.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
585a447168 storage_service: make replicate_tm_only noexcept
And with that mark also do_replicate_to_all_cores as noexcept.

The motivation to do so is to catch all errors in replicate_tm_only
and call on_internal_error in the `handle_exception` continuation
in do_replicate_to_all_cores.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
f287346186 storage_service: update_topology: use replicate_to_all_cores
Rather than calling invalidate_cached_rings and update_topology
on all shards do that only on shard 0 and then replicate
to all other shards using replicate_to_all_cores as we do
in all other places that modify token_metadata.

Do this in preparation to using a token_metadata_ptr
with which updating of token_metadata is done on a cloned
copy (serialized under a lock) that becomes visible only when
applied with replicate_to_all_cores.
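
The clone-then-publish flow described above can be pictured with a small
Python sketch. It is only an illustration under stated assumptions:
`SharedTokenMetadata`, `update_and_replicate`, and the dict-based snapshot
are hypothetical stand-ins, not Scylla's C++ classes.

```python
import copy
import threading


class SharedTokenMetadata:
    """Hypothetical per-shard holder of the currently published snapshot."""

    def __init__(self, tokens):
        self._current = dict(tokens)

    def get(self):
        # Readers keep the returned snapshot for as long as they need it.
        return self._current

    def set(self, snapshot):
        self._current = snapshot


def update_and_replicate(shards, lock, mutate):
    """Mutate a private clone of shard 0's snapshot, then publish it everywhere."""
    with lock:  # updates are serialized under a lock
        pending = copy.deepcopy(shards[0].get())
        mutate(pending)  # not visible to any reader yet
        for shard in shards:
            # The change becomes visible only when it is published here.
            shard.set(copy.deepcopy(pending))


shards = [SharedTokenMetadata({"10": "node1"}) for _ in range(3)]
lock = threading.Lock()

before = shards[1].get()  # a reader holding the old snapshot
update_and_replicate(shards, lock, lambda tm: tm.update({"20": "node2"}))
after = shards[1].get()
```

A reader that fetched `before` keeps seeing a consistent old snapshot, while
new readers on every shard see `after`.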

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
9217d5661a storage_service: make get_mutable_token_metadata private
Now that update_topology was moved to class storage_service.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
0e739aa801 storage_service: add update_topology method
Move the functionality from gossiping_property_file_snitch::reload_configuration
to the storage_service class.

With that we can make get_mutable_token_metadata private.

TODO: update token_metadata on shard 0 and then
replicate_to_all_cores rather than updating on all shards
in parallel.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
d629aa22f5 storage_service: keyspace_changed invoke update_pending_ranges on shard 0
keyspace_changed just calls update_pending_ranges (and ignores any
errors returned from it), so invoke it on shard 0, and with
that update_pending_ranges() is always called on shard 0
and it doesn't need to use `invoke_on` shard 0 by itself.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
ffee694a43 storage_service: make keyspace_changed and update_pending_ranges private
Both are called only internally in the class.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
6eb20c529c storage_service: init_server must be called on shard 0
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
a7df2c215f storage_service: simplify shard 0 sanity checks
We need to assert in only 2 places:
do_update_pending_ranges, that updates token metadata,
and replicate_tm_only, that copies the token metadata
to all other shards.

Currently we throw errors if this is violated
but it should never happen and it's not really recoverable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
1c16bee81d storage_service: do_replicate_to_all_cores in do_update_pending_ranges
Currently update_pending_ranges involves 2 serialized actions:
do_update_pending_ranges, and then replicate_to_all_cores.

These can be combined by calling do_replicate_to_all_cores
directly from do_update_pending_ranges.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
d6805348ff storage_service: get rid of update_pending_ranges_nowait
It was introduced in 74b4035611
As part of the fix for #3203.

However, the reactor stalls have nothing to do with gossip
waiting for update_pending_ranges - they are related to it being
synchronous and quadratic in the number of tokens
(e.g. get_address_ranges calling calculate_natural_endpoints
for every token then simple_strategy::calculate_natural_endpoints
calling get_endpoint for every token)

There is nothing special in handle_state_leaving that requires
moving update_pending_ranges to the background, we call
update_pending_ranges in many other places and wait for it
so if gossip loop waiting on it was a real problem, then it'd
be evident in many other places.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
b6c1dffe88 storage_service: handle_state_normal: update_pending_ranges earlier
Currently _update_pending_ranges_action is called only on shard 0
and only later update_pending_ranges() updates shard 0 again and replicates
the result to all shards.

There is no need to wait between the two, and call _update_pending_ranges_action
again, so just call update_pending_ranges() in the first place.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
aa8bdc2c0f storage_service: handle_state_bootstrap: update_pending_ranges only after updating host_id
so that the updated host_id (on shard 0) will get replicated to all shards
via update_pending_ranges.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
c2c7baef3b storage_service: on_change: no need to call replicate_to_all_cores
It's already done by each handle_state_* function
either by directly calling replicate_to_all_cores or indirectly, via
update_pending_ranges.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
ebfc4c6f4b storage_service: join_token_ring: replicate_to_all_cores early
Currently the updates to token_metadata are immediately visible
on shard 0, but not to other shards until replicate_to_all_cores
syncs them.

To prepare for converting to using shared token_metadata.
In the new world the updated token_metadata is not visible
until committed to the shared_token_metadata, so
commit it here and replicate to all other shards.

It is not clear this isn't needed presently too.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Botond Dénes
f5323b29d9 mutation_reader: queue_reader: don't set EOS flag on abort
If the consumer happens to check the EOS flag before it hits the
exception injected by the abort (by calling fill_buffer()), they can
think the stream ended normally and expect it to be valid. However this
is not guaranteed when the reader is aborted. To avoid consumers falsely
thinking the stream ended normally, don't set the EOS flag on abort at
all.

Additionally make sure the producer is aborted too on abort. In theory
this is not needed as they are the one initiating the abort, but better
to be safe than sorry.
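
A minimal Python sketch of the behaviour described above, assuming a
simplified reader: the class and method names here are hypothetical and only
mirror the idea that abort records an error without setting the EOS flag.

```python
class ReaderAborted(Exception):
    """Hypothetical error injected into the consumer on abort."""


class QueueReader:
    def __init__(self):
        self._buffer = []
        self._end_of_stream = False
        self._abort_error = None

    def push(self, fragment):
        self._buffer.append(fragment)

    def push_end_of_stream(self):
        # Only a normal end of the stream sets the EOS flag.
        self._end_of_stream = True

    def abort(self, error):
        # Record the error but leave the EOS flag untouched, so a consumer
        # that checks the flag first cannot mistake an abort for a clean end.
        self._abort_error = error

    def is_end_of_stream(self):
        return self._end_of_stream and not self._buffer

    def fill_buffer(self):
        if self._abort_error is not None:
            raise self._abort_error
        drained, self._buffer = self._buffer, []
        return drained


reader = QueueReader()
reader.push("partition-start")
reader.abort(ReaderAborted("producer gave up"))

saw_eos = reader.is_end_of_stream()  # checked before fill_buffer()
try:
    reader.fill_buffer()
    raised = False
except ReaderAborted:
    raised = True
```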

Fixes: #7411
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201102100732.35132-1-bdenes@scylladb.com>
2020-11-11 13:44:25 +02:00
Pekka Enberg
ba6a2b68d1 cql-pytest/test_keyspace.py: Add test case for double WITH issue
Let's add a test case for CASSANDRA-9565, similar to the unit test in
Apache Cassandra:

https://github.com/apache/cassandra/blob/trunk/test/unit/org/apache/cassandra/cql3/validation/operations/CreateTest.java#L546
Message-Id: <20201111104251.19932-1-penberg@scylladb.com>
2020-11-11 13:39:57 +02:00
Avi Kivity
5b312a1238 Merge "sstables: make move_to_new_dir idempotent" from Benny
"
Today, if scylla crashes mid-way in sstable::idempotent-move-sstable
or sstable::create_links we may end up in an inconsistent state
where it refuses to restart due to the presence of the moved-
sstable component files in both the staging directory and
main directory.

This series hardens scylla against this scenario by:
1. Improving sstable::create_links to identify the replay condition
   and support it.
2. Modifying the algorithm for moving sstables between directories
   to never be in a state where we have two valid sstables with the
   same generation, in both the source and destination directories.
   Instead, it uses the temporary TOC file as a marker for rolling
   backwards or forwards, and renames it atomically from the
   destination directory back to the source directory as a commit
   point.  Before which it is preparing the sstable in the destination
   dir, and after which it starts the process of deleting the sstable
   in the source dir.

Fixes #7429
Refs #5714
"

* tag 'idempotent-move-sstable-v3' of github.com:bhalevy/scylla:
  sstable: create_links: support for move
  sstable_directory: support sstables with both TemporaryTOC and TOC
  sstable: create_links: move automatic sstring variables
  sstable: create_links: use captured comps
  sstable: create_links: capture dir by reference
  sstable: create_links: fix indentation
  sstable: create_links: no need to roll-back on failure anymore
  sstable: create_links: support idempotent replay
  sstable: create_links: cleanup style
  sstable: create_links: add debug/trace logging
  sstable: move_to_new_dir: rm TOC last
  sstable: move_to_new_dir: io check remove calls
  test: add sstable_move_test
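
The marker-based move in item 2 above can be sketched in Python roughly as
follows. This is an illustration only: the marker name, the component names,
and `move_components` are hypothetical, and the real series also covers
restart-time recovery in sstable_directory, which is not modelled here.

```python
import os
import tempfile

MARKER = "TemporaryTOC"  # hypothetical marker file name


def move_components(src, dst, components):
    """Move files from src to dst so that a replay after a crash is harmless."""
    marker_dst = os.path.join(dst, MARKER)
    marker_src = os.path.join(src, MARKER)

    if not os.path.exists(marker_src):
        # Prepare: the marker in dst says "destination is not valid yet".
        open(marker_dst, "a").close()
        for name in components:
            target = os.path.join(dst, name)
            if not os.path.exists(target):  # tolerate links made by an earlier attempt
                os.link(os.path.join(src, name), target)
        # Commit point: one atomic rename moves the marker back to src,
        # making the destination valid and the source the copy to delete.
        os.rename(marker_dst, marker_src)

    # Roll forward: delete the source copy, then the marker itself.
    for name in components:
        path = os.path.join(src, name)
        if os.path.exists(path):
            os.remove(path)
    os.remove(marker_src)


root = tempfile.mkdtemp()
src = os.path.join(root, "staging")
dst = os.path.join(root, "main")
os.mkdir(src)
os.mkdir(dst)
names = ["Data.db", "TOC.txt"]
for name in names:
    with open(os.path.join(src, name), "w") as f:
        f.write(name)

move_components(src, dst, names)
move_components(src, dst, names)  # replaying the whole move changes nothing
```

At no point are both directories free of the marker while both hold the full
set of files, which is the property the series aims for.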
2020-11-11 12:57:39 +02:00
Avi Kivity
017174670b Update frozen toolchain for python3-urwid-2.1.2
urwid 2.1.0 struggles with some locale settings. 2.1.2
fixes the problem.

Fixes #7487.
2020-11-11 11:54:05 +02:00
Nadav Har'El
44e0cb177e cql-pytest: convert also run-cassandra to Python
Previously, test/cql-pytest/run was a Python script, while
test/cql-pytest/run-cassandra (to run the tests against Cassandra)
was still a shell script - modeled after test/alternator/run.

This patch rewrites run-cassandra in Python.

A lot of the same code is needed for both run and run-cassandra
tools. test/cql-pytest/run was already written in a way that this
common code was separate functions. For example, functions to start a
server in a temporary directory, to check when it finishes booting,
and to clean up at the end. This patch moves this common code to
a new file, "run.py" - and the tools "run" and "run-cassandra" are
very short programs which mostly use functions from run.py (run-cassandra
also has some unique code to run Cassandra, that no other test runner
will need).
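
The kind of shared helper described above (start a server in a temporary
directory, wait until it has booted, clean up at the end) might look roughly
like this. It is a sketch, not the contents of run.py: `run_server`, its
parameters, and the fake server used in the demo are all hypothetical.

```python
import contextlib
import os
import shutil
import subprocess
import sys
import tempfile
import time


@contextlib.contextmanager
def run_server(make_command, is_ready, timeout=30.0):
    """Run a server in a fresh temporary directory and clean up afterwards."""
    tmpdir = tempfile.mkdtemp(prefix="test-server-")
    proc = subprocess.Popen(make_command(tmpdir), cwd=tmpdir)
    try:
        deadline = time.monotonic() + timeout
        while not is_ready(tmpdir):  # poll until the server finished booting
            if proc.poll() is not None:
                raise RuntimeError("server exited during boot")
            if time.monotonic() > deadline:
                raise TimeoutError("server did not finish booting")
            time.sleep(0.1)
        yield tmpdir
    finally:
        if proc.poll() is None:
            proc.terminate()
        proc.wait()
        shutil.rmtree(tmpdir, ignore_errors=True)


# Demo with a stand-in "server" that creates a file once it is ready.
FAKE_SERVER = "import pathlib, time; pathlib.Path('ready').touch(); time.sleep(60)"

with run_server(
    lambda tmpdir: [sys.executable, "-c", FAKE_SERVER],
    lambda tmpdir: os.path.exists(os.path.join(tmpdir, "ready")),
) as server_dir:
    booted = os.path.exists(os.path.join(server_dir, "ready"))

cleaned_up = not os.path.exists(server_dir)
```

With a helper of this shape, each runner only has to supply its own command
line and readiness check.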

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110215210.741753-1-nyh@scylladb.com>
2020-11-11 10:57:21 +02:00
Takuya ASADA
5867af4edd install.sh: set PATH for relocatable CLI tools in python thunk
We currently set PATH for relocatable CLI tools in scylla_util.run() and
scylla_util.out(), but it doesn't work for perftune.py, since it's not part of
Scylla and does not use the scylla_util module.
We can set PATH in the python thunk instead, which sets PATH for all python scripts.

Fixes #7350
2020-11-11 10:27:08 +02:00
Tomasz Grabiec
5fb3650c67 storage_service: Unify token_metadata update paths when replacing a node
After full cluster shutdown, the node which is being replaced will not have its
STATUS set to NORMAL (bug #6088), so listeners will not update _token_metadata.

The bootstrap procedure of replacing node has a workaround for this
and calls update_normal_tokens() on token metadata on behalf of the
replaced node based on just its TOKENS state obtained in the shadow
round.

It does this only for the replacing_a_node_with_same_ip case, but not
for replacing_a_node_with_diff_ip. As a result, replacing the node
with the same ip after full cluster shutdown fails.

We can always call update_normal_tokens(). If the cluster didn't
crash, token_metadata would get the tokens.

Fixes #4325

Message-Id: <1604675972-9398-1-git-send-email-tgrabiec@scylladb.com>
2020-11-11 10:25:56 +02:00
Nadav Har'El
475d8721a5 test: new "cql-pytest" test suite
This patch introduces a new way to do functional testing on Scylla,
similar to Alternator's test/alternator but for the CQL API:

The new tests, in test/cql-pytest, are written in Python (using the pytest
framework), and use the standard Python CQL driver to connect to any CQL
implementation - be it Scylla, Cassandra, Amazon Keyspaces, or whatever.
The use of standard CQL allows the test developer to easily run the same
test against both Scylla and Cassandra, to confirm that the behaviour that
our test expects from Scylla is really the "correct" (meaning Cassandra-
compatible) behavior.

A developer can run Scylla or Cassandra manually, and run "pytest"
to connect to them (see README.md for more instructions). But even more
usefully, this patch also provides two scripts: test/cql-pytest/run and
test/cql-pytest/run-cassandra. These scripts automate the task of running
Scylla or Cassandra (respectively) on a random IP address and temporary
directory, and running the tests against it.

The script test/cql-pytest/run is inspired by the existing test run
scripts of Alternator and Redis, but rewritten in Python in a way that
will make it easy to rewrite - in a future patch - all these other run
scripts to use the same common code to safely run a test server in a
temporary directory.

"run" is extremely quick, taking around two seconds to boot Scylla.
"run-cassandra" is slower, taking 13 seconds to boot Cassandra (maybe
this can be improved in the future, I still don't know how).
The tests themselves take milliseconds.

Although the 'run' script runs a single Scylla node, the developer
can also bring up any size of Scylla or Cassandra cluster manually
and run the tests (with "pytest") against this cluster.

This new test framework differs from the existing alternatives in the
following ways:

 dtest: dtest focuses on testing correctness of *distributed* behavior,
        involving clusters of multiple nodes and often cluster changes
	during the test. In contrast, cql-pytest focuses on testing the
	*functionality* of a large number of small CQL features - which
	can usually be tested on a single-node cluster.
	Additionally, dtest is out-of-tree, while cql-pytest is in-tree,
	making it much easier to add or change tests together with code
	patches.
	Finally, dtest tests are notoriously slow. Hundreds of tests in
	the new framework can finish faster than a single dtest.
	Slow and out-of-tree tests are difficult to write, and I believe
	this explains why no developer loves writing dtests and maintainers
	do not insist on having them. I hope cql-pytest can change that.

 test/cql: The defining difference between the existing test/cql suite
	and the new test/cql-pytest is that the new framework is programmatic,
	Python code, not a text file with desired output. Tests written with
	code allow things like looping, repeating the same test with different
	parameters. Also, when a test fails, it makes it easier to understand
	why it failed beyond just the fact that the output changed.
	Moreover, in some cases, the output changes benignly and cql-pytest
	may check just the desired features of the output.
	Beyond this, the current version of test/cql cannot run against
	Cassandra. test/cql-pytest can.

The primary motivation for this new framework was
https://github.com/scylladb/scylla/issues/7443 - where we had an
esoteric feature (sort order of *partitions* when an index is added),
which can be shown in Cqlsh to have what we think is incorrect behavior,
and yet: 1. We didn't catch this bug because we never wrote a test for it,
possibly because it was too difficult to contribute tests, and 2. We *thought*
that we knew what Cassandra does in this case, but nobody actually tested
it. Yes, we can test it manually with cqlsh, but wouldn't everything be
better if we could just run the same test that we wrote for Scylla against
Cassandra?

So one of the tests we add in this patch confirms issue #7443 in Scylla,
and that our hunch was correct and Cassandra indeed does not have this
problem. I also add a few trivial tests for keyspace create and drop,
as additional simple examples.

Refs #7443.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110110301.672148-1-nyh@scylladb.com>
2020-11-10 19:48:23 +02:00
Benny Halevy
bc64ee5410 reloc: add ubsan-suppressions.supp to relocatable package
So we can use it to suppress false-positive ubsan error
when running scylla in debug mode.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201110165214.1467027-1-bhalevy@scylladb.com>
2020-11-10 19:14:27 +02:00
Benny Halevy
f36e5edd50 install.sh: add support for ubsan-suppressions
Install ubsan-suppressions.supp into libexec and use it in
UBSAN_OPTIONS when running scylla to suppress unwanted ubsan errors.

Test: With scylla-ccm fix https://github.com/scylladb/scylla-ccm/pull/278
    $ ccm create scylla-reloc-1 -n 1 --scylla --version unstable/master:latest --scylla-core-package-uri=../scylla/build/{debug,dev}/dist/tar/scylla-package.tar.gz
    $ ccm start

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201110165214.1467027-2-bhalevy@scylladb.com>
2020-11-10 19:14:26 +02:00
Piotr Sarna
e5f2fb2a4d codeowners: add a couple of Botonds
since he's our resident readers specialist.

Closes #7585
2020-11-10 18:22:52 +02:00
Avi Kivity
756b14f309 Merge 'cql3: Drop unneeded filtering when continuous clustering-key is selected' from Dejan Mircevski
I noticed that we require filtering for continuous clustering key, which is not necessary.  I dropped the requirement and made sure the correct data is read from the storage proxy.

The corresponding dtest PR: https://github.com/scylladb/scylla-dtest/pull/1727

Tests: unit (dev,debug), dtest (next-gating, cql*py)

Closes #7460

* github.com:scylladb/scylla:
  cql3: Delete some newlines
  cql3: Drop superfluous ALLOW FILTERING
  cql3: Drop unneeded filtering for continuous CK
2020-11-10 17:41:00 +02:00
Piotr Sarna
2e544a0c89 storage_proxy: add metrics for too many in-flight hints failures
When there are too many in-flight hints, writes start returning
overloaded exceptions. We're missing metrics for that, and these could
be useful when judging if the system is in overloaded state.
2020-11-10 16:26:18 +01:00
Botond Dénes
7f07b95dd3 utils/chunked_vector: reserve_partial(): better explain how to properly use
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201110130953.435123-1-bdenes@scylladb.com>
2020-11-10 15:45:01 +02:00
Eliran Sinvani
8380ac93c5 build: Make artifacts product aware
This commit changes the build file generation and the package
creation scripts to be product aware. This will change the
relocatable package archives to be named after the product,
this commit deals with two main things:
1. Creating the actual Scylla server relocatable with a product
prefixed name - which is independent of any other change
2. Expect all other packages to create product prefixed archive -
which is dependent upon the actual submodules creating
product prefixed archives.

If the support is not introduced in the submodules first this
will break the package build.

Tests: Scylla full build with the original product and a
different product name.

Closes #7581
2020-11-10 14:38:10 +02:00
Takuya ASADA
f8c7d899b4 dist/debian: fix typo for scylla-server.service filename
Currently debian_files_gen.py mistakenly renames scylla-server.service to
"scylla-server." in environments with a non-standard product name, such as
scylla-enterprise. Fix it to use the correct filename.

Fixes #7423
2020-11-10 10:38:41 +02:00
Pavel Solodovnikov
2997f6bd2e cmake: redesign scylla's CMakeLists.txt to finally allow full-fledged building
This patch introduces many changes to the Scylla `CMakeLists.txt`
to enable building Scylla without resorting to pre-building
with a previous configure.py build, i.e. cmake script can now
be used as a standalone solution to build and execute scylla.

Submodules, such as Seastar and Abseil, are also dealt with
by importing their CMake scripts directly via `add_subdirectory`
calls. Other submodules, such as `libdeflate` now have a
custom command to build the library at runtime.

There are still a lot of things that are incomplete, though:
* Missing auxiliary packaging targets
* Unit-tests are not built (First priority to address in the
  following patches)
* Compile and link flags are mostly hardcoded to the values
  appropriate for the most recent Fedora 33 installation.
  System libraries should be found via built-in `Find*` scripts,
  compiler and linker flags should be observed and tested by
  executing feature tests.
* The current build is aimed to be built by GCC, need to support
  Clang since we are moving to it.
* Utility cmake functions should be moved to a separate "cmake"
  directory.

The script is updated to use the most recent CMake version available
in Fedora 33, which is 3.18.

Right now this is more of a PoC rather than a full-fledged solution
but as long as it's not used widely, we are free to evolve it in
a relaxed manner, improving it step by step to achieve feature
parity with `configure.py` solution.

The value in this patch is that now we are able to use any
C++ IDE capable of dealing with CMake solutions and take
advantage of their built-in capabilities, such as:
* Building a code model to efficiently navigate code.
* Find references to symbols.
* Use pretty-printers, beautifiers and other tools conveniently.
* Run scylla and debug it right from the IDE.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201103221619.612294-1-pa.solodovnikov@scylladb.com>
2020-11-10 10:34:27 +02:00
Nadav Har'El
78c598e08e alternator: add missing TableId field to DescribeTable response
DescribeTable should return a UUID "TableId" in its response.
We already had it for CreateTable, and now this patch adds it to
DescribeTable.

The test for this feature is no longer xfail. Moreover, I improved
the test to not only check that the TableId field is present - it
should also match the documented regular expression (the standard
representation of a UUID).

Refs #5026

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201104114234.363046-1-nyh@scylladb.com>
2020-11-09 20:21:47 +01:00
Benny Halevy
0af54f3324 sstable: create_links: support for move
When moving a sstable between directories, we would like to
be able to crash at any point during the algorithm with a
clear way to either roll the operation forwards or backwards.

To achieve that, define sstable::create_links_common that accepts
a `mark_for_removal` flag, implementing the following algorithm:

1. link src.toc to dst.temp_toc.
   until removed, the destination sstable is marked for removal.
2. link all src components to dst.
   crashing here will leave dst with both temp_toc and toc.
3.
   a. if mark_for_removal is unset then just remove dst.temp_toc.
      this commits the destination sstable and completes create_links.
   b. if mark_for_removal is set then move dst.temp_toc to src.temp_toc.
      this will atomically toggle recovery after crash from roll-back
      to roll-forward.
      here too, crashing at this point will leave src with both
      temp_toc and toc.
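
The three steps above can be sketched in Python with plain hard links.
This is only an illustration under stated assumptions, not the C++
implementation: the function mirrors the name create_links_common, but the
component file names (TOC.txt, TOC.txt.tmp, Data.db) and the directory
layout are hypothetical.

```python
import os
import tempfile

TOC, TEMP_TOC = "TOC.txt", "TOC.txt.tmp"  # hypothetical component file names


def create_links_common(src, dst, components, mark_for_removal):
    # Step 1: dst is marked for removal as long as its temporary TOC exists.
    os.link(os.path.join(src, TOC), os.path.join(dst, TEMP_TOC))
    # Step 2: link all components, TOC included, into dst.
    for name in components:
        os.link(os.path.join(src, name), os.path.join(dst, name))
    if mark_for_removal:
        # Step 3b: one atomic rename flips recovery from roll-back to roll-forward.
        os.rename(os.path.join(dst, TEMP_TOC), os.path.join(src, TEMP_TOC))
    else:
        # Step 3a: commit the destination sstable.
        os.remove(os.path.join(dst, TEMP_TOC))


with tempfile.TemporaryDirectory() as root:
    src, dst = os.path.join(root, "src"), os.path.join(root, "dst")
    os.mkdir(src)
    os.mkdir(dst)
    components = [TOC, "Data.db"]
    for name in components:
        with open(os.path.join(src, name), "w") as f:
            f.write(name)
    create_links_common(src, dst, components, mark_for_removal=True)
    src_after, dst_after = sorted(os.listdir(src)), sorted(os.listdir(dst))
```

After the call with mark_for_removal set, the source directory holds both
the TOC and the temporary TOC, so a crash at that point would roll the move
forward by removing the source sstable.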

Adjust the unit test for the revised algorithm.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-09 19:57:40 +02:00
Benny Halevy
d893cbd918 sstable_directory: support sstables with both TemporaryTOC and TOC
Keep descriptors in a map so they can be searched easily by generation,
and possibly delete the descriptor, if found, in the presence of
a temporary toc component.

A following patch will add support to create_links for moving
sstables between directories.  It is based on keeping a TemporaryTOC
file in the destination directory while linking all source components.
If scylla crashes here, the destination sstable will have both
its TemporaryTOC and TOC components and it needs to be removed
to roll the move backwards.

Then, create_links will atomically move the TemporaryTOC from
the destination back to the source directory, to toggle rolling
back to rolling forward by marking the source sstable for removal.
If scylla crashes here, the source sstable will have both
its TemporaryTOC and TOC components and it needs to be removed
to roll the move forward.

Add unit test for this case.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-09 19:57:40 +02:00
Benny Halevy
7c74222037 sstable: create_links: move automatic sstring variables
Rather than copy them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-09 19:57:40 +02:00
Benny Halevy
9a906d4d69 sstable: create_links: use captured comps
Now that all_components() is held by `do_with`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-09 19:57:25 +02:00
Benny Halevy
a59911a84c sstable: create_links: capture dir by reference
Now that it's held with `do_with`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-09 19:55:43 +02:00
Benny Halevy
07f80e0521 sstable: create_links: fix indentation
Previous patch was optimized for reviewability.
Now cleanup indentation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-09 19:55:32 +02:00
Benny Halevy
6bee63158c sstable: create_links: no need to roll-back on failure anymore
Now that we use `idempotent_link_file` it'll no longer
fail with EEXIST in a replay scenario.

It may fail on ENOENT, and return an exceptional future.
This will be propagated up the stack.  Since it may indicate
parallel invocation of move_to_new_dir, that deletes the source
sstable while this thread links it to the same destination,
rolling back by removing the destination links would
be dangerous.

For any other error, the node is going to be isolated
and stop operation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-09 19:44:55 +02:00
Benny Halevy
65a3b0e51c sstable: create_links: support idempotent replay
Handle the case where create_link is replayed after crashing in the middle.
In particular, if we restart when moving sstables from staging to the base dir,
right after create_links completes, and right before deleting the source links,
we end up with seemingly 2 valid sstables, one still in staging and the other
already in the base table directory, both are hard linked to the same inodes.

Make create_links idempotent so it can replay the operation safely if crashed and
restarted at any point of its operation.

Add unit tests for replay after partial create_links that is expected to succeed,
and a test for replay when an sstable exists in the destination that is not
hard-linked to the source sstable; create_links is expected to fail in this case.
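
A minimal Python sketch of the idempotent-link idea, assuming the check is
"the destination already exists and refers to the same inode as the source".
The helper name idempotent_link and the file names are hypothetical; this is
not the code of the patch.

```python
import os
import tempfile


def idempotent_link(src, dst):
    try:
        os.link(src, dst)
    except FileExistsError:
        # Replay case: acceptable only if dst already is a hard link to src.
        if not os.path.samefile(src, dst):
            raise


with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "src-Data.db")
    dst = os.path.join(root, "dst-Data.db")
    other = os.path.join(root, "other-Data.db")
    for path in (src, other):
        with open(path, "w") as f:
            f.write(path)
    idempotent_link(src, dst)
    idempotent_link(src, dst)  # replaying the same link succeeds
    replay_ok = os.path.samefile(src, dst)
    try:
        idempotent_link(src, other)  # unrelated file already at destination
        conflict_detected = False
    except FileExistsError:
        conflict_detected = True
```

The second case corresponds to the test described above, where the
destination holds an sstable that is not hard-linked to the source and the
operation is expected to fail.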

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-09 19:44:42 +02:00
Benny Halevy
f0a57deed7 sstable: create_links: cleanup style
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-09 19:44:27 +02:00
Benny Halevy
55f781689a sstable: create_links: add debug/trace logging
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-09 19:44:11 +02:00
Benny Halevy
884fc07e20 sstable: move_to_new_dir: rm TOC last
To facilitate cleanup on crash, first rename the TOC file to TOC.tmp,
and keep until all other files are removed, finally remove TOC.tmp.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-09 19:44:04 +02:00
Benny Halevy
ca76ebb898 sstable: move_to_new_dir: io check remove calls
We need to check these to detect critical errors
while removing the source sstable files.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-09 19:43:38 +02:00
Benny Halevy
818af720d7 test: add sstable_move_test
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-09 19:43:28 +02:00
Benny Halevy
8bcdf39a18 hints/manager: scan_for_hints_dirs: fix use-after-move
This use-after-move was apparently exposed after switching to clang
in commit eb861e68e9.

The directory_entry is required for std::stoi(de.name.c_str())
and later in the catch{} clause.

This shows in the node logs as a "Ignore invalid directory" debug
log message with an empty name, and caused the hintedhandoff_rebalance_test
to fail when hints files aren't rebalanced.

Test: unit(dev)
DTest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test (dev, debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201106172017.823577-1-bhalevy@scylladb.com>
2020-11-09 16:32:54 +01:00
Takuya ASADA
4410934829 install.sh: show warning nonroot mode when systemd does not support user mode
Older distributions such as CentOS7 do not support systemd user mode.
On such distributions nonroot mode does not work, so show a warning message and
skip running systemctl --user.

Fixes #7071
2020-11-09 12:16:35 +02:00
Piotr Wojtczak
72c7f25a29 db: add TransitionalAuthorizer and TransitionalAuthenticator...
... to config descriptions

We allow setting the transitional auth as one of the options
in scylla.yaml, but don't mention it at all in the field's
description. Let's change that.

Closes #7565
2020-11-09 10:51:54 +01:00
Gleb Natapov
a01dd636ea suppress ubsan error in boost::deque::clear()
The function is used by raft and fails with ubsan and clang.
The ub is harmless. Let's wait for it to be fixed in boost.

Message-Id: <20201109090353.GZ3722852@scylladb.com>
2020-11-09 11:25:19 +02:00
Bentsi Magidovich
956b97b2a8 scylla_util.py: fix exception handling in curl
Retry mechanism didn't work when URLError happened. For example:

  urllib.error.URLError: <urlopen error [Errno 101] Network is unreachable>

Let's catch URLError instead of HTTPError since URLError is a base exception
for all exceptions in the urllib module.
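
A small sketch of why catching URLError is enough, assuming a retry loop
around some fetch callable. The names fetch_with_retry and flaky_fetch are
hypothetical and not taken from scylla_util.py; a real retry loop would
probably also sleep between attempts, which is omitted here.

```python
import urllib.error


def fetch_with_retry(fetch, attempts=3):
    last_error = None
    for _ in range(attempts):
        try:
            return fetch()
        except urllib.error.URLError as e:  # HTTPError is a subclass, so it is caught too
            last_error = e
    raise last_error


calls = []


def flaky_fetch():
    # Simulates a network that is unreachable for the first two attempts.
    calls.append(1)
    if len(calls) < 3:
        raise urllib.error.URLError("Network is unreachable")
    return "ok"


result = fetch_with_retry(flaky_fetch)
```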

Fixes: #7569

Closes #7567
2020-11-09 10:20:35 +02:00
Benny Halevy
02f5659f21 sstables mx/writer: clustering_blocks_input_range::next: warn on potentially bad key
If _offset falls beyond compound_type->types().size()
ignore the extra components instead of accessing out of the types
vector range.

FIXME: we should validate the thrift key against the schema
and reject it in the thrift handler layer.

Refs #7568

Test: unit(dev)
DTest: cql_tests.py:MiscellaneousCQLTester.cql3_insert_thrift_test (dev, debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201108175738.1006817-1-bhalevy@scylladb.com>
2020-11-08 20:53:14 +02:00
Avi Kivity
6b4a7fa515 Revert "Revert "config: Do not enable repair based node operations by default""
This reverts commit 71d0d58f8c. Repair based
node operations are still not ready and will be re-enabled after more
testing and fixes.
2020-11-08 14:09:50 +02:00
Michał Chojnowski
1eb19976b9 database: make changes to durable_writes effective immediately
Users can change `durable_writes` anytime with ALTER KEYSPACE.
Cassandra reads the value of `durable_writes` every time when applying
a mutation, so changes to that setting take effect immediately. That is,
mutations are added to the commitlog only when `durable_writes` is `true`
at the moment of their application.
Scylla reads the value of `durable_writes` only at `keyspace` construction time,
so changes to that setting take effect only after Scylla is restarted.
This patch fixes the inconsistency.
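
The intended behaviour can be illustrated with a toy model in Python. The
Keyspace and Database classes below are hypothetical stand-ins, not Scylla's
classes; the only point is that the flag is read at the moment each mutation
is applied.

```python
class Keyspace:
    def __init__(self, durable_writes=True):
        self.durable_writes = durable_writes


class Database:
    def __init__(self, keyspace):
        self.keyspace = keyspace
        self.commitlog = []
        self.memtable = []

    def apply(self, mutation):
        # Read the setting on every apply, so ALTER KEYSPACE takes effect at once.
        if self.keyspace.durable_writes:
            self.commitlog.append(mutation)
        self.memtable.append(mutation)


ks = Keyspace(durable_writes=True)
db = Database(ks)
db.apply("m1")
ks.durable_writes = False  # stands in for ALTER KEYSPACE ... durable_writes = false
db.apply("m2")
```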

Fixes #3034

Closes #7533
2020-11-06 17:53:22 +01:00
Tomasz Grabiec
894abfa6fc Merge "raft: miscellaneous fixes" from Kostja
This series provides assorted fixes which are a
pre-requisite for the joint consensus implementation
series which follows.

* scylla-dev/raft-misc:
  raft: fix raft_fsm_test flakiness
  raft: drop a waiter of snapshoted entry
  raft: use correct type for node info in add_server()
  raft: overload operator<< for debugging
2020-11-06 15:34:16 +01:00
Konstantin Osipov
c4bbbac975 raft: fix raft_fsm_test flakiness
When election_threshold expires, the current node
can become a candidate, in which case it won't
switch back to follower state upon vote_request.
2020-11-06 17:06:07 +03:00
Gleb Natapov
552745d3d3 raft: drop a waiter of snapshoted entry
An index that is waited can be included in an installed snapshot in
which case there is no way to know if the entry was committed or not.
Abort such waiters with an appropriate error.
2020-11-06 17:06:07 +03:00
Gleb Natapov
8bab38c6fa raft: use correct type for node info in add_server() 2020-11-06 17:06:07 +03:00
Alejo Sanchez
2e4977b24c raft: overload operator<< for debugging
Overload operator<< for ostream and print relevant state for server, fsm, log,
and typed_uint64 types.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-06 17:06:07 +03:00
Tomasz Grabiec
3591e7dffd Merge "Remove unused args from range_tombstone methods" from Pavel Emelyanov
* https://github.com/xemul/scylla/tree/br-range-tombstone-unused-args-2:
  range_tombstone: Remove unused trim-front arg from .apply()
  range_tombstone: Undefault argument in .apply
  range_tombstone: Remove unused schema arg from .set_start
2020-11-06 15:04:15 +01:00
Tomasz Grabiec
6d0d55aa72 Merge "Unglobal query processor instance" from Pavel Emelyanov
The query processor is present in the global namespace and is
widely accessed with global get(_local)?_query_processor().
There's a long-term task to get rid of this globality and make
services and components reference each-other and, for and
due-to this, start and stop in specific order. This set makes
this for the query processor.

The remaining users of it are -- alternator, controllers for
client services, schema_tables and sys_dist_ks. All of them
except for the schema_tables are fixed just by passing the
reference on query processor with small patches. The schema
tables accessing qp sit deep inside the paxos code, but can
be "fixed" with the qctx thing until the qctx itself is
de-globalized.

* https://github.com/xemul/scylla/tree/br-rip-global-query-processor:
  code: RIP global query processor instance
  cql test env: Keep query processor reference on board
  system distributed keyspace: Start sharded service erarlier
  schema_tables: Use qctx to make internal requests
  transport: Keep sharded query processor reference on controller
  thrift: Keep sharded query processor reference on controller
  alternator: Use local query processor reference to get keys
  alternator: Keep local query processor reference in server
2020-11-06 14:24:41 +01:00
Pavel Emelyanov
bbd7463960 range_tombstone: Remove unused trim-front arg from .apply()
The only caller of this method always passes true to it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-06 15:13:05 +03:00
Pavel Emelyanov
787a496caf range_tombstone: Undefault argument in .apply
The only purpose of this change is to compile (git-bisect
safety) and thus prove that the next patch is correct.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-06 15:13:05 +03:00
Pavel Emelyanov
3da3d448c8 range_tombstone: Remove unused schema arg from .set_start
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-11-06 15:13:05 +03:00
Piotr Sarna
b61d4bc8d0 db: degrade view building progress loading error to warning
When the view builder cannot read view building progress from an
internal CQL table it produces an error message, but that only confuses
the user and the test suite -- this situation is entirely recoverable,
because the builder simply assumes that there is no progress and the
view building should start from scratch.

Fixes #7527

Closes #7558
2020-11-06 10:19:11 +02:00
Avi Kivity
512daa75a6 Merge 'repair: Use single writer for all followers' from Asias He
repair: Use single writer for all followers

Currently, repair master creates one writer for each follower to write
rows from follower to sstables. That is RF - 1 writers in total. Each
writer creates 1 sstable for the range repaired, usually a vnode range.
Those sstables for a given vnode range are disjoint.

To reduce the compaction work, we can create one writer for all the
followers. This reduces the number of sstables generated by repair
significantly to one per vnode range from RF - 1 per vnode range.

Fixes #7525

Closes #7528

* github.com:scylladb/scylla:
  repair: No more vector for _writer_done and friends
  repair: Use single writer for all followers
2020-11-05 18:45:07 +01:00
Gleb Natapov
e1442282d1 raft: test: do not store data in initializer_list
Lifetime rules for initializer_list are weird. Use vector instead.

Message-Id: <20201105111309.GT3722852@scylladb.com>
2020-11-05 18:44:50 +01:00
Michał Chojnowski
f6c33f5775 dbuild: export $HOME seen by dbuild, not by $tool
The default of DBUILD_TOOL=docker requires passwordless access to docker
by the user of dbuild. This is insecure, as any user with unconstrained
access to docker is root equivalent. Therefore, users might prefer to
run docker as root (e.g. by setting DBUILD_TOOL="sudo docker").

However, `$tool -e HOME` exports HOME as seen by $tool.
This breaks dbuild when `$tool` runs docker as another user.
`$tool -e HOME="$HOME"` exports HOME as seen by dbuild, which is
the intended behaviour.

Closes #7555
2020-11-05 18:44:50 +01:00
Michał Chojnowski
8f74c7e162 dbuild: Replace stray use of docker with $tool
Instead of invoking `$tool`, as is done everywhere else in dbuild,
kill_it() invoked `docker` explicitly. This was slightly breaking the
script for DBUILD_TOOL other than `docker`.

Closes #7554
2020-11-05 18:44:49 +01:00
Tomasz Grabiec
fb9b5cae05 sstables: ka/la: Fix abort when next_partition() is called with certain reader state
Cleanup compaction is using consume_pausable_in_thread() to skip over
disowned partitions, which uses flat_mutation_reader::next_partition().

The implementation of next_partition() for the sstable reader has a
bug which may cause the following assertion failure:

  scylla: sstables/mp_row_consumer.hh:422: row_consumer::proceed sstables::mp_row_consumer_k_l::flush(): Assertion `!_ready' failed.

This happens when the sstable reader's buffer gets full when we reach
the partition end. The last fragment of the partition won't be pushed
into the buffer but will stay in the _ready variable. When
next_partition() is called in this state, _ready will not be cleared
and the fragment will be carried over to the next partition. This will
cause assertion failure when the reader attempts to emit the first
fragment of the next partition.

The fix is to clear _ready when entering a partition, just like we
clear _range_tombstones there.

Fixes #7553.
Message-Id: <1604534702-12777-1-git-send-email-tgrabiec@scylladb.com>
2020-11-05 18:44:49 +01:00
Nadav Har'El
7ff72b0ba5 Merge 'secondary_index: fix returned rows token ordering' from Piotr Grabowski
Fixes returned rows ordering to proper signed token ordering. Before this change, rows were sorted by token, but using unsigned comparison, meaning that negative tokens appeared after positive tokens.

Rename `token_column_computation` to `legacy_token_column_computation` and add some comments describing this computation.

Added (new) `token_column_computation` which returns token as `long_type`, which is sorted using signed comparison - the correct ordering of tokens.

Add new `correct_idx_token_in_secondary_index` feature, which flags that the whole cluster is able to use new `token_column_computation`.

Switch token computation in secondary indexes to (new) `token_column_computation`, which fixes the ordering. This column computation type is only set if cluster supports `correct_idx_token_in_secondary_index` feature to make sure that all nodes will be able to compute new `token_column_computation`. Also old indexes will need to be rebuilt to take advantage of this fix, as new token column computation type is only set for new indexes.

Fix tests according to new token ordering and add one new test to validate this aspect explicitly.

Fixes #7443

Tested manually a scenario when someone created an index on old version of Scylla and then migrated to new Scylla. Old index continued to work properly (but returning in wrong order). Upon dropping and re-creating the index, it still returned the same data, but now in correct order.

Closes #7534

* github.com:scylladb/scylla:
  tests: add token ordering test of indexed selects
  tests: fix tests according to new token ordering
  secondary_index: use new token_column_computation
  feature: add correct_idx_token_in_secondary_index
  column_computation: add token_column_computation
  token_column_computation: rename as legacy
2020-11-05 18:44:49 +01:00
Benny Halevy
f93fb55726 repair: repair_writer: do not capture lw_shared_ptr cross-shard
The shared_from_this lw_shared_ptr must not be accessed
across shards.  Capturing it in the lambda passed to
mutation_writer::distribute_reader_and_consume_on_shards
causes exactly that since the captured lw_shared_ptr
is copied on other shards, and ends up in memory corruption
as seen in #7535 (probably due to lw_shared_ptr._count
going out-of-sync when incremented/decremented in parallel
on other shards with no synchronization).

This was introduced in 289a08072a.

The writer is not needed in the body of this lambda anyways
so it doesn't need to capture it.  It is already held
by the continuations until the end of the chain.

Fixes #7535

Test: repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test (dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201104142216.125249-1-bhalevy@scylladb.com>
2020-11-05 18:44:49 +01:00
Tomasz Grabiec
dccd47eec6 Merge "make raft clang compatible" from Gleb
"
    Since we are switching to clang due to raft make it actually compile
    with clang.
    "

tgrabiec: Dropped the patch "raft: compile raft by default" because
the replication_test still fails in debug mode:

   /usr/include/boost/container/deque.hpp:1802:63: runtime error: applying non-zero offset 8 to null pointer

* 'raft-clang-v2' of github.com:scylladb/scylla-dev:
  raft: Use different type to create type dependent statement for static assertion
  raft: drop use of <ranges> for clang
  raft: make test compile with clang
  raft: drop -fcoroutines support from configure.py
2020-11-05 18:42:31 +01:00
Asias He
db28efb28a repair: No more vector for _writer_done and friends
Now that both repair followers and repair master use a single writer, we
can get rid of the vector associated with _writer_done and friends.

Fixes #7525
2020-11-05 13:28:40 +08:00
Asias He
998b153f86 repair: Use single writer for all followers
Currently, repair master creates one writer for each follower to write
rows from follower to sstables. That is RF - 1 writers in total. Each
writer creates 1 sstable for the range repaired, usually a vnode range.
Those sstables for a given vnode range are disjoint.

To reduce the compaction work, we can create one writer for all the
followers. This reduces the number of sstables generated by repair
significantly to one per vnode range from RF - 1 per vnode range.

Fixes #7525
2020-11-05 13:28:40 +08:00
Pekka Enberg
edf04cd348 Update tools/python3 submodule
* tools/python3 cfa27b3...1763a1a (1):
  > Relocatable Package: create product prefixed relocatable archive
2020-11-04 14:24:20 +02:00
Pekka Enberg
5519ce2f0e Update tools/jmx submodule
* tools/jmx c51906e...6174a47 (2):
  > Relocatable Package: create product prefixed relocatable archive
  > build(deps-dev): bump junit from 4.8.2 to 4.13.1
2020-11-04 14:24:15 +02:00
Avi Kivity
193d1942f2 build: silence gcc ABI interoperability warning on arm
A gcc bug [1] caused objects built by different versions of gcc
not to interoperate. Gcc helpfully warns when it encounters code that
could be affected.

Since we build everything with one version, and as that version is far
newer than the last version generating incorrect code, we can silence
that warning without issue.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728

Closes #7495
2020-11-04 13:29:51 +02:00
Tomasz Grabiec
a7837a9a3b Merge "Enable raft tests" from Kostja
Do not run tests which are not built.
For that, pass the test list from configure.py to test.py
via ninja unit_test_list target.
Minor cleanups.

* scylla-dev.git/test.py-list:
  test: enable raft tests
  test.py: do not run tests which are not built
  configure.py: add a ninja command to print unit test list
  test.py: handle ninja mode_list failure
  configure.py: don't pass modes_list unless it's used
2020-11-04 12:25:04 +01:00
Piotr Grabowski
491987016c tests: add token ordering test of indexed selects
Add new test validating that rows returned from both non-indexed selects
and indexed selects return rows sorted in token order (making sure
that both positive and negative tokens are present to test if signed
comparison order is maintained).
2020-11-04 12:02:42 +01:00
Piotr Grabowski
2bd23fbfa9 tests: fix tests according to new token ordering
Fix tests to adhere to new (correct) token ordering of rows when
querying tables with secondary indexes.
2020-11-04 12:02:42 +01:00
Piotr Grabowski
2342b386f4 secondary_index: use new token_column_computation
Switches token column computation to (new) token_column_computation,
which fixes #7443, because new token column will be compared using
signed comparisons, not the previous unsigned comparison of CQL bytes
type.

This column computation type is only set if cluster supports
correct_idx_token_in_secondary_index feature to make sure that all nodes
will be able to compute (new) token_column_computation. Also old
indexes will need to be rebuilt to take advantage of this fix, as new
token column computation type is only set for new indexes.
2020-11-04 12:02:42 +01:00
Piotr Grabowski
6624d933c9 feature: add correct_idx_token_in_secondary_index
Add new correct_idx_token_in_secondary_index feature, which will be used
to determine if all nodes in the cluster support new
token_column_computation. This column computation will replace
legacy_token_column_computation in secondary indexes, which was
incorrect as this column computation produced values that when compared
with unsigned comparison (CQL type bytes comparison) resulted in
different ordering than token signed comparison. See issue:

https://github.com/scylladb/scylla/issues/7443
2020-11-04 12:02:42 +01:00
Piotr Grabowski
9fc2dc59b8 column_computation: add token_column_computation
Introduce new token_column_computation class which is intended to
replace legacy_token_column_computation. The new column computation
returns token as long_type, which means that it will be ordered
according to signed comparison (not unsigned comparison of bytes), which
is the correct ordering of tokens.
2020-11-04 12:02:42 +01:00
Piotr Grabowski
b1350af951 token_column_computation: rename as legacy
Rename token_column_computation to legacy_token_column_computation, as
it will be replaced with new column_computation. The reason is that this
computation returns bytes, but all tokens in Scylla can now be
represented by int64_t. Moreover, returning bytes causes invalid token
ordering as bytes comparison is done in unsigned way (not signed as
int64_t). See issue:

https://github.com/scylladb/scylla/issues/7443
2020-11-04 12:00:18 +01:00
Eliran Sinvani
4c434f3fa4 moving avarage rate: Keep computed rates in zero until they are
meaningful

When computing moving average rates too early after startup, the
rate can be infinite. This is simply because the sample interval
since the system started is too small to generate meaningful results.
Here we check for this situation and keep the rate at 0 if it happens,
to signal that there are still no meaningful results.
This situation is unlikely, since it can happen only during a
very small time window after restart, so we add a hint to the compiler
marking it as unlikely in order to have minimal impact on the normal
use case.

Fixes #4469
2020-11-04 11:13:59 +02:00
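A minimal sketch of the idea in the commit above, assuming a simple count/elapsed rate. The threshold `MIN_INTERVAL_SECONDS` and the function name are hypothetical and not taken from the Scylla source.

```python
# Hypothetical threshold below which a rate is not considered meaningful.
MIN_INTERVAL_SECONDS = 1.0


def mean_rate(count: int, elapsed_seconds: float) -> float:
    """Return events per second, or 0.0 while the sample interval is too short."""
    if elapsed_seconds < MIN_INTERVAL_SECONDS:
        # Too early after startup: dividing by a tiny (or zero) interval
        # would give an infinite or wildly inflated rate.
        return 0.0
    return count / elapsed_seconds


print(mean_rate(10, 0.0))  # 0.0
print(mean_rate(10, 2.0))  # 5.0
```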
Avi Kivity
8aa842614a test: gossip_test: configure database memory allocation correctly
The memory configuration for the database object was left at zero.
This can cause the following chain of failures:
 - the test is a little slow due to the machine being overloaded,
   and debug mode
 - this causes the memtable flush_controller timer to fire before
   the test completes
 - the backlog computation callback is called
 - this calculates the backlog as dirty_memory / total_memory; this
   is 0.0/0.0, which resolves to NaN
 - eventually this gets converted to an integer
 - UBSAN doesn't like the conversion from NaN to integer, and complains

Fix by initializing dbcfg.available_memory.

Test: gossip_test(debug), 1000 repetitions with concurrency 6

Closes #7544
2020-11-04 09:26:08 +02:00
Calle Wilund
1db9da2353 alternator::streams: Workaround fix for apparent code gen bug in seq_number
Fixes #7325

When building with clang on fedora32, calling the string_view constructor
of bignum generates broken ID:s (i.e. parsing borks). Creating a temp
std::string fixes it.

Closes #7542
2020-11-04 09:26:08 +02:00
Benny Halevy
1d199c31f8 storage_service: check_for_endpoint_collision: copy gossip state across preemption point
Since 11a8912093, get_gossip_status
returns a std::string_view rather than a sstring.

As seen in dtest we may print garbage to the log
if we print the string_view after preemption (calling
_gossiper.reset_endpoint_state_map().get())

Test: update_cluster_layout_tests:TestUpdateClusterLayout.simple_add_two_nodes_in_parallel_test (dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201103132720.559168-1-bhalevy@scylladb.com>
2020-11-04 09:26:08 +02:00
Konstantin Osipov
507ca98748 test: enable raft tests
It's safe to do this since now the tests are only run if
they are configured.
2020-11-03 21:30:11 +03:00
Konstantin Osipov
5f90582362 test.py: do not run tests which are not built
Use ninja unit_test_list to find out the list of configured tests.
If a test is not configured by configure.py, do not try to run it.
2020-11-03 21:30:08 +03:00
Konstantin Osipov
9198e38311 configure.py: add a ninja command to print unit test list
test.py needs this list to avoid running tests which
are not configured, and hence not built.
2020-11-03 21:27:45 +03:00
Konstantin Osipov
ef9c63a6d9 test.py: handle ninja mode_list failure
Print an error message if the subcommand fails.
Use a regular expression to match output.
2020-11-03 21:06:17 +03:00
Konstantin Osipov
7fa08496b0 configure.py: don't pass modes_list unless it's used
Don't redefine modes_list if it's not used by the ninja
file formatter.
2020-11-03 21:02:55 +03:00
Benny Halevy
9d91d38502 SCYLLA-VERSION-GEN: change master version to 4.4.dev
Now that scylla-ccm and scylla-dtest conform to PEP-440
version comparison (See https://www.python.org/dev/peps/pep-0440/)
we can safely change scylla version on master to be the development
branch for the next release.

The version order logic is:
  4.3.dev is followed by
  4.3.rc[i] followed by
  4.3.[n]

Note that also according to
https://blog.jasonantman.com/2014/07/how-yum-and-rpm-compare-versions/
4.3.dev < 4.3.rc[i] < 4.3.[n]
as "dev" < "rc" by alphabetical order
and both "dev" and "rc*" < any number, based on the general
rule that alphabetical strings compare as less than numbers.

Refs scylladb/scylla-machine-image#79

Test: unit
Dtest: gating
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201015151153.726637-1-bhalevy@scylladb.com>
2020-11-03 13:42:54 +02:00
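The version order described in the commit above (4.3.dev, then 4.3.rc[i], then 4.3.[n]) can be sketched as a sort key. This is a simplified illustration that handles only these three shapes of version string; it is neither a full PEP 440 implementation nor the rpm comparison algorithm, and `version_key` is a hypothetical name.

```python
import re

# Pre-release stages sort before any numbered release of the same X.Y series.
_STAGE_RANK = {"dev": 0, "rc": 1}
_FINAL_RANK = 2


def version_key(version: str):
    """Sort key for 'X.Y.dev', 'X.Y.rcN' and 'X.Y.N' strings (simplified)."""
    m = re.fullmatch(r"(\d+)\.(\d+)\.(?:(dev|rc)(\d*)|(\d+))", version)
    if m is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, stage, stage_no, patch = m.groups()
    if stage:
        return (int(major), int(minor), _STAGE_RANK[stage], int(stage_no or 0))
    return (int(major), int(minor), _FINAL_RANK, int(patch))


versions = ["4.3.1", "4.3.rc2", "4.4.dev", "4.3.dev", "4.3.0", "4.3.rc1"]
print(sorted(versions, key=version_key))
# ['4.3.dev', '4.3.rc1', '4.3.rc2', '4.3.0', '4.3.1', '4.4.dev']
```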
Avi Kivity
25e6a9e493 Merge "utils/large_bitset: reserve memory for _storage gently" from Botond
"
Introduce a gentle (yielding) implementation of reserve for chunked
vector and use it when reserving the backing storage vector for large
bitset. Large bitset is used by bloom filters, which can be quite large
and have been observed to cause stalls when allocating memory for the
storage.

Fixes: #6974

Tests: unit(dev)
"

* 'gentle-reserve/v1' of https://github.com/denesb/scylla:
  utils/large_bitset: use reserve_partial() to reserve _storage
  utils/chunked_vector: add reserve_partial()
2020-11-03 13:42:54 +02:00
Tomasz Grabiec
5abddc8568 Merge "Testing performance of different collections" from Pavel Emelyanov
There's a perf_bptree test that compares the B+ tree collection with
std::set and std::map. More collections will follow, and the "patterns"
to compare are not just "fill with keys" and "drain to empty", so
here's the perf_collection test, which measures timings of

- fill with keys
- drain key by key
- empty with .clear() call
- full scan with iterator
- insert-and-remove of a single element

for currently used collections

- std::set
- std::map
- intrusive_set_external_comparator
- bplus::tree

* https://github.com/xemul/scylla/tree/br-perf-collection-test:
  test: Generalize perf_bptree into perf_collection
  perf_collection: Clear collection between iterations
  perf_collection: Add intrusive_set_external_comparator
  perf_collection: Add test for single element insertion
  perf_collection: Add test for destruction with .clear()
  perf_collection: Add test for full scan time
2020-11-03 13:42:54 +02:00
Gleb Natapov
88a1274583 raft: Use different type to create type dependent statement for static assertion
For some reason the one that works for gcc does not work for clang.
2020-11-03 08:49:54 +02:00
Gleb Natapov
b6b51bf17e raft: drop use of <ranges> for clang 2020-11-03 08:49:54 +02:00
Gleb Natapov
847400ee96 raft: make test compile with clang
clang does not allow to return a future<> with co_return and it is more
strict with type conversion.
2020-11-03 08:49:54 +02:00
Gleb Natapov
ff18072de8 raft: drop -fcoroutines support from configure.py
We switched to clang and it does not have this flag.
2020-11-03 08:49:54 +02:00
Botond Dénes
a08b640fa7 utils/large_bitset: use reserve_partial() to reserve _storage
To avoid stalls when reserving memory for a large bloom filter. The
filter creation already has a yielding loop for initialization, this
patch extends it to reservation of memory too.
2020-11-02 18:03:19 +02:00
Botond Dénes
bb908b1750 utils/chunked_vector: add reserve_partial()
A variant of reserve() which allows gentle reserving of memory. This
variant will allocate just one chunk at a time. To drive it to
completion, one should call it repeatedly with the return value of the
previous call, until it returns 0.
This variant will be used in the next patch by the large bitset creation
code, to avoid stalls when allocating large bloom filters (which are
backed by large bitset).
2020-11-02 18:02:01 +02:00
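The calling convention described above (call repeatedly with the previous return value until it returns 0) can be modelled as follows. This is a toy Python model under stated assumptions: the `ChunkedVector` class, its chunk size and the `reserve_gently` driver are hypothetical stand-ins, not the actual utils/chunked_vector code.

```python
class ChunkedVector:
    """Toy model of a vector whose storage is a list of fixed-size chunks."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size
        self.chunks = []

    def capacity(self) -> int:
        return len(self.chunks) * self.chunk_size

    def reserve_partial(self, n: int) -> int:
        """Allocate at most one chunk; return n if more work remains, else 0."""
        if self.capacity() < n:
            self.chunks.append([None] * self.chunk_size)
        return n if self.capacity() < n else 0


def reserve_gently(vec: ChunkedVector, n: int, yield_point=lambda: None) -> int:
    """Drive reserve_partial() to completion, yielding between allocations."""
    calls = 0
    remaining = n
    while remaining:
        remaining = vec.reserve_partial(remaining)
        calls += 1
        yield_point()  # placeholder for giving other tasks a chance to run
    return calls


vec = ChunkedVector(chunk_size=4)
print(reserve_gently(vec, 10))  # 3
print(vec.capacity())           # 12
```

Doing one allocation per call keeps each step short, which is what allows a caller to interleave other work between steps.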
Piotr Wojtczak
caa3c471c0 Validate ascii values when creating from CQL
Although the code for it existed already, the validation function
hasn't been invoked properly. This change fixes that, adding
a validating check when converting from text to specific value
type and throwing a marshal exception if some characters
are not ASCII.

Fixes #5421

Closes #7532
2020-11-02 16:47:32 +02:00
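As a rough illustration of the check described above, the sketch below rejects non-ASCII input when converting text to an ascii value. `MarshalError` and `ascii_from_text` are hypothetical names chosen for this example, not the Scylla API.

```python
class MarshalError(ValueError):
    """Stand-in for a marshalling error raised on invalid input."""


def ascii_from_text(text: str) -> bytes:
    """Convert text to an ascii value, rejecting any non-ASCII character."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as e:
        raise MarshalError(f"value is not valid ASCII: {text!r}") from e


print(ascii_from_text("hello"))  # b'hello'
```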
Pavel Emelyanov
364ddab148 test: Do not dump test log onto terminal
When unit tests fail, test.py dumps their output on the screen. It is impossible
to read this output from the terminal, all the more so as the logs are saved in
the testlog/ directory anyway. At the same time the names of the failed tests are all left
_before_ these logs, and if the terminal history is not large enough, it becomes
quite annoying to find the names out.

The proposal is not to spoil the terminal with raw logs -- just names and summaries.
Logs themselves are at testlog/$mode/$name_of_the_test.log

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201031154518.22257-1-xemul@scylladb.com>
2020-11-02 15:42:34 +02:00
Tomasz Grabiec
ba42e7fcc5 multishard_mutation_query: Propagate mutation_reader::forwarding flag
Otherwise all readers will be created with the default forwarding::yes.
This inhibits some optimizations (e.g. results in more sstable read-ahead).

It will also be problematic when we introduce mutation sources which don't support
forwarding::yes in the future.

Message-Id: <1604065206-3034-1-git-send-email-tgrabiec@scylladb.com>
2020-11-02 15:24:36 +02:00
Avi Kivity
eb861e68e9 build: switch to clang as the default compiler
Clang brings us working support for coroutines, which are
needed for Raft and for code simplification.

perf_simple_query as well as full system tests show no
significant performance regression.

Test: unit(dev, release, debug)

Closes #7531
2020-11-02 14:18:13 +02:00
Nadav Har'El
ffbd487c86 Merge 'alternator::streams: Use end-of-record info in get_records' from Calle Wilund
Fixes #7496

Since cdc log now has an end-of-batch/record marker that tells
us explicitly that we've read the last row of a change, we
can use this instead of timestamp checks + limit extra to
ensure we have complete records.

Note that this does not try to fulfill the user query limit
exactly. To do this we would need to add a loop and potentially
re-query if the queried rows are not enough. But that is a
separate exercise, and superbly suited for coroutines!

Closes #7498

* github.com:scylladb/scylla:
  alternator::streams: Reduce the query limit depending on cdc opts
  alternator::streams: Use end-of-record info in get_records
2020-11-02 13:34:00 +02:00
Tomasz Grabiec
2dfc5f1ee5 Merge "Cleanup gossiper endpoint interface" from Benny
This series cleans up the gossiper endpoint_state interface
marking methods const and const noexcept where possible.

To achieve that, endpoint_state::get_status was changed to
return a string_view rather than a sstring so it won't
need to allocate memory.

Also, the get_cluster_name and get_partitioner_name were
changed to return a const sstring& rather than sstring
so they won't need to allocate memory.

The motivation for the series stems from #7339
where an exception in get_host_id within a storage_service
notification handler, called from seastar::defer crashed
the server.

With this series, get_host_id may still throw exceptions on
logical error, but not from calling get_application_state_ptr.

Refs #7339

Test: unit(dev)

* tag 'gossiper-endpoint-noexcept-v2':
  gossiper: mark trivial methods noexcept
  gossiper: get_cluster_name, get_partitioner_name: make noexcept
  gossiper: get_gossip_status: return string_view and make noexcept
  gms/endpoint_state: mark methods using get_status noexcept
  gms/endpoint_state: get_status: return string_view and make noexcept
  gms/endpoint_state: mark get_application_state_ptr and is_cql_ready noexcept
  gms/endpoint_state: mark trivial methods noexcept
  gms/heart_beat_state: mark methods noexcept
  gms/versioned_value: mark trivial methods noexcept
  gms/version_generator: mark get_next_version noexcept
  fb_utilities.hh: mark methods noexcept
  messaging: msg_addr: mark methods noexcept
  gms/inet_address: mark methods noexcept
2020-11-02 12:30:30 +01:00
Avi Kivity
7a3376907e Merge 'improvements for GCE image' from Bentsi
when logging in to the GCE instance that is created from the GCE image it takes 10 seconds to understand that we are not running on AWS. Also, some unnecessary debug logging messages are printed:
```
bentsi@bentsi-G3-3590:~/devel/scylladb$ ssh -i ~/.ssh/scylla-qa-ec2 bentsi@35.196.8.86
Warning: Permanently added '35.196.8.86' (ECDSA) to the list of known hosts.
Last login: Sun Nov  1 22:14:57 2020 from 108.128.125.4

   _____            _ _       _____  ____
  / ____|          | | |     |  __ \|  _ \
 | (___   ___ _   _| | | __ _| |  | | |_) |
  \___ \ / __| | | | | |/ _` | |  | |  _ <
  ____) | (__| |_| | | | (_| | |__| | |_) |
 |_____/ \___|\__, |_|_|\__,_|_____/|____/
               __/ |
              |___/

Version:
       666.development-0.20201101.6be9f4938
Nodetool:
	nodetool help
CQL Shell:
	cqlsh
More documentation available at:
	http://www.scylladb.com/doc/
By default, Scylla sends certain information about this node to a data collection server. For information, see http://www.scylladb.com/privacy/

WARNING:root:Failed to grab http://169.254.169.254/latest/...
WARNING:root:Failed to grab http://169.254.169.254/latest/...
    Initial image configuration failed!

To see status, run
 'systemctl status scylla-image-setup'

[bentsi@artifacts-gce-image-jenkins-db-node-aa57409d-0-1 ~]$

```
this PR fixes this

Closes #7523

* github.com:scylladb/scylla:
  scylla_util.py: remove unnecessary logging
  scylla_util.py: make is_aws_instance faster
  scylla_util.py: added ability to control sleep time between retries in curl()
2020-11-02 12:32:25 +02:00
Piotr Sarna
b66c285f94 schema_tables: fix fixing old secondary index schemas
Old secondary index schemas did not have their idx_token column
marked as computed, and there already exists code which updates
them. Unfortunately, the fix itself contains an error and doesn't
fire if computed columns are not yet supported by the whole cluster,
which is a very common situation during upgrades.

Fixes #7515

Closes #7516
2020-11-02 12:30:20 +02:00
Takuya ASADA
100127bc02 install.sh: allow --packaging with nonroot mode
Since scylla-ccm wants to skip systemctl, we need to support --packaging
in nonroot mode too.

Related: #7187
2020-11-02 12:07:14 +02:00
Calle Wilund
7c8f457bab alternator::streams: Reduce the query limit depending on cdc opts
Avoid querying much more than needed.
Since we have exact row markers now, this is more safe to do.
2020-11-02 08:37:27 +00:00
Calle Wilund
c79108edbb alternator::streams: Use end-of-record info in get_records
Fixes #7496

Since cdc log now has an end-of-batch/record marker that tells
us explicitly that we've read the last row of a change, we
can use this instead of timestamp checks + limit extra to
ensure we have complete records.

Note that this does not try to fulfill the user query limit
exactly. To do this we would need to add a loop and potentially
re-query if the queried rows are not enough. But that is a
separate exercise, and superbly suited for coroutines!
2020-11-02 08:35:36 +00:00
Avi Kivity
b6f8bb6b77 tools/toolchain: update maintainer instructions
The instructions are updated for multiarch images (images that
can be used on x86 and ARM machines).

Additionally,
 - docker is replaced with podman, since that is now used by
   developers. Docker is still supported for developers, but
   the image creation instructions are only tested with podman.
 - added instructions about updating submodules
 - `--format docker` is removed. It is not necessary with
   more recent versions of docker.

Closes #7521
2020-11-02 10:29:54 +02:00
Avi Kivity
3993498fb4 connection_notifier: prevent link errors due to variables defined in header
connection_notifier.hh defines a number of template-specialized
variables in a header. This is illegal since you're allowed to
define something multiple times if it's a template, but not if it's
fully specialized. gcc doesn't care but clang notices and complains.

Fix by defining the variables as inline variables, which are
allowed to have definitions in multiple translation units.

Closes #7519
2020-11-02 10:28:55 +02:00
Avi Kivity
83b3d3d1d1 test: increase timeout to 12000 seconds to account for slow ARM cores
Some ARM cores are slow, and trip our current timeout of 3000
seconds in debug mode. Quadrupling the timeout is enough to make
debug-mode tests pass on those machines.

Since the timeout's role is to catch rare infinite loops in unsupervised
testing, increasing the timeout has no ill effect (other than to
delay the report of the failure).

Closes #7518
2020-11-02 10:28:14 +02:00
Piotr Sarna
ed047d54bf Merge 'alternator: fix combination of filter and projection' from Nadav
The main goal of this series is to fix issue #6951 - a Query (or Scan) with
a combination of filtering and projection parameters produced wrong results if
the filter needs some attributes which weren't projected.

This series also adds new tests for various corner cases of this issue. These
new tests also pass after this fix, or still fail because of some other missing
feature (namely, nested attributes). These additional tests will be important if
we ever want to refactor or optimize this code, because they exercise some rare
corner code paths at the intersection of filtering and projection.

This series also fixes some additional problems related to this issue, like
combining old and new filtering/projection syntaxes (should be forbidden), and
even one fix to a wrong comment.

Closes #7328

* github.com:scylladb/scylla:
  alternator test: tests for nested attributes in FilterExpression
  alternator test: fix comment
  alternator tests: additional tests for filter+projection combination
  alternator: forbid combining old and new-style parameters
  alternator: fix query with both projection and filtering
2020-11-02 07:28:41 +01:00
Bentsi Magidovich
2866f2d65d scylla_util.py: remove unnecessary logging
When calling curl and an exception is raised, we can see unnecessary log messages that we can't control.
For example, when used in scylla_login we can see the following messages:
WARNING:root:Failed to grab http://169.254.169.254/latest/...
WARNING:root:Failed to grab http://169.254.169.254/latest/...
    Initial image configuration failed!

To see status, run
 'systemctl status scylla-image-setup'
2020-11-02 01:13:44 +03:00
Bentsi Magidovich
a62237f1c6 scylla_util.py: make is_aws_instance faster
When used, for example, in scylla_login, we need to understand that we
are not running on AWS in less than 10 seconds
2020-11-02 00:11:21 +03:00
Bentsi Magidovich
83a8550a5f scylla_util.py: added ability to control sleep time between retries in curl() 2020-11-01 22:39:19 +03:00
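A generic sketch of a retry loop with a caller-controlled sleep between attempts, in the spirit of the change above. The function name, parameters and defaults here are hypothetical and do not reproduce the actual `curl()` signature in scylla_util.py.

```python
import time


def fetch_with_retries(fetch, max_retries=5, retry_interval=5.0, sleep=time.sleep):
    """Call fetch() until it succeeds, sleeping retry_interval between attempts."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for attempt in range(max_retries):
        try:
            return fetch()
        except OSError as e:
            last_error = e
            # Do not sleep after the final attempt; just report the failure.
            if attempt + 1 < max_retries:
                sleep(retry_interval)
    raise last_error
```

With a short `retry_interval` and a small `max_retries`, a caller probing an unreachable metadata endpoint gives up quickly instead of waiting out long default delays.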
Avi Kivity
b45c933036 tools: toolchain: update for gcc-10.2.1-6.fc33.x86_64 2020-11-01 19:18:00 +02:00
Avi Kivity
d626563fe3 Update seastar submodule
* seastar 57b758c2f9...a62a80ba1d (1):
  > thread: increase stack size in debug mode
2020-11-01 19:16:59 +02:00
Benny Halevy
e4614d4836 gossiper: mark trivial methods noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:47 +02:00
Benny Halevy
1ba4c84ae2 gossiper: get_cluster_name, get_partitioner_name: make noexcept
These methods can return a const sstring& rather than
allocating a sstring. And with that they can be marked noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:29 +02:00
Benny Halevy
11a8912093 gossiper: get_gossip_status: return string_view and make noexcept
Change get_gossip_status to return string_view,
and with that it can be noexcept now that it doesn't
allocate memory via sstring.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:18 +02:00
Benny Halevy
126e486fde gms/endpoint_state: mark methods using get_status noexcept
Now that get_status returns string_view, just compare it with a const char*
rather than making a sstring out of it, and consequently, can be marked noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:18 +02:00
Benny Halevy
6b9191b6c2 gms/endpoint_state: get_status: return string_view and make noexcept
get_status doesn't need to allocate a sstring, it can just
return a std::string_view to the status string, if found.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:18 +02:00
Benny Halevy
232c665bab gms/endpoint_state: mark get_application_state_ptr and is_cql_ready noexcept
Although std::map::find is not guaranteed to be noexcept,
it depends on the comparator used, and in this case comparing application_state
is noexcept.  Therefore, we can safely mark get_application_state_ptr noexcept.

is_cql_ready depends on get_application_state_ptr and otherwise
handles any exceptions from boost::lexical_cast, so it can be marked
noexcept as well.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:18 +02:00
Benny Halevy
5d8e2c038b gms/endpoint_state: mark trivial methods noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:18 +02:00
Benny Halevy
d4c364507e gms/heart_beat_state: mark methods noexcept
Now that get_next_version() is noexcept,
update_heart_beat can be noexcept too.

All others are trivially noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:18 +02:00
Benny Halevy
68a2920201 gms/versioned_value: mark trivial methods noexcept
Also, versioned_value::compare_to() can be marked const.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:18 +02:00
Benny Halevy
c295f521b9 gms/version_generator: mark get_next_version noexcept
It is trivially so.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:18 +02:00
Benny Halevy
87c3fd9cd8 fb_utilities.hh: mark methods noexcept
Now that gms::inet_address assignment is marked as noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:18 +02:00
Benny Halevy
e28d80ec0c messaging: msg_addr: mark methods noexcept
Based on gms::inet_address.

With that, gossiper::get_msg_addr can be marked noexcept (and const while at it).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:18 +02:00
Benny Halevy
232fc19525 gms/inet_address: mark methods noexcept
Based on the corresponding net::inet_address calls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-01 16:46:18 +02:00
Avi Kivity
6be9f49380 cql3: expression: switch from range_bound to interval_bound to avoid clang class template argument deduction woes
Clang does not implement P1814R0 (class template argument deduction
for alias templates), so it can't deduce the template arguments
for range_bound, but it can for interval_bound, so switch to that.
Using the modern name rather than the compatibility alias is preferred
anyway.

Closes #7422
2020-11-01 13:19:44 +02:00
Nadav Har'El
deaa141aea docs/isolation.md: fix list of IO priority classes
In commit de38091827 the two IO priority classes streaming_read
and streaming_write were merged into just one. The document docs/isolation.md
leaves a lot to be desired (hint, hint, to anyone reading this and
can write content!) but let's at least not have incorrect information
there.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201101102220.2943159-1-nyh@scylladb.com>
2020-11-01 12:27:06 +02:00
Avi Kivity
46612fe92b Merge 'Add debug context to views out of sync' from Piotr Sarna
This series adds more context to debugging information in case a view gets out of sync with its base table.
A test was conducted manually, by:
1. creating a table with a secondary index
2. manually deleting computed column information from system_schema.computed_columns
3. restarting the target node
4. trying to write to the index

Here's what's logged right after the index metadata is loaded from disk:
```
ERROR 2020-10-30 12:30:42,806 [shard 0] view - Column idx_token in view ks.t_c_idx_index was not found in the base table ks.t
ERROR 2020-10-30 12:30:42,806 [shard 0] view - Missing idx_token column is caused by an incorrect upgrade of a secondary index. Please recreate index ks.t_c_idx_index to avoid future issues.
```

And here's what's logged during the actual failure - when Scylla notices that there exists
a column which is not computed, but it's also not found in the base table:
```
ERROR 2020-10-30 12:31:25,709 [shard 0] storage_proxy - exception during mutation write to 127.0.0.1: seastar::internal::backtraced<std::runtime_error> (base_schema(): operation unsupported when initialized only for view reads. Missing column in the base table: idx_token Backtrace:    0x1d14513
   0x1d1468b
   0x1d1492b
   0x109bbad
   0x109bc97
   0x109bcf4
   0x1bc4370
   0x1381cd3
   0x1389c38
   0xaf89bf
   0xaf9b20
   0xaf1654
   0xaf1afe
   0xb10525
   0xb10ad8
   0xb10c3a
   0xaaefac
   0xabf525
   0xabf262
   0xac107f
   0x1ba8ede
   0x1bdf749
   0x1be338c
   0x1bfe984
   0x1ba73fa
   0x1ba77a4
   0x9ea2c8
   /lib64/libc.so.6+0x27041
   0x9d11cd
   --------
   seastar::lambda_task<seastar::execution_stage::flush()::{lambda()#1}>

```

Hopefully, this information will make it much easier to solve future problems with out-of-sync views.

Tests: unit(dev)
Fixes #7512

Closes #7513

* github.com:scylladb/scylla:
  view: add printing missing base column on errors
  view: simplify creating base-dependent info for reads only
  view: fix typo: s/dependant/dependent
  view: add error logs if a view is out of sync with its base
2020-11-01 11:09:58 +02:00
Piotr Wojtczak
2150c0f7a2 cql: Check for timestamp correctness in USING TIMESTAMP statements
In certain CQL statements it's possible to provide a custom timestamp via the USING TIMESTAMP clause. Those values are accepted in microseconds, however, there's no limit on the timestamp (apart from type size constraint) and providing a timestamp in a different unit like nanoseconds can lead to creating an entry with a timestamp way ahead in the future, thus compromising the table.

To avoid this, this change introduces a sanity check for modification and batch statements that raises an error when a timestamp of more than 3 days into the future is provided.

Fixes #5619

Closes #7475
2020-11-01 11:01:24 +02:00
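The sanity check described above can be sketched as follows, assuming timestamps in microseconds since the epoch. The function name, the constant and the error type are illustrative and not taken from the Scylla source.

```python
import time

# Three days expressed in microseconds.
MAX_FUTURE_MICROS = 3 * 24 * 60 * 60 * 1_000_000


def check_timestamp(ts_micros: int, now_micros=None) -> int:
    """Reject timestamps more than 3 days ahead of the current time."""
    if now_micros is None:
        now_micros = time.time_ns() // 1000
    if ts_micros - now_micros > MAX_FUTURE_MICROS:
        raise ValueError("timestamp is too far in the future; is it in microseconds?")
    return ts_micros
```

A timestamp accidentally given in nanoseconds is about 1000 times larger than the current time in microseconds, so it lands far beyond the 3-day window and is rejected.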
Pavel Emelyanov
d045df773f code: RIP global query processor instance
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-31 18:51:52 +03:00
Pavel Emelyanov
a340caa328 cql test env: Keep query processor reference on board
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-31 18:51:52 +03:00
Pavel Emelyanov
8989021dc3 system distributed keyspace: Start sharded service earlier
The constructors just set up the references, real start happens in .start()
so it is safe to do this early. This helps not carrying migration manager
and query processor down the storage service cluster joining code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-31 18:51:52 +03:00
Pavel Emelyanov
021b905773 schema_tables: Use qctx to make internal requests
The query processor global instance is going away. The schema_tables usage
of it requires a huge rework to push the qp reference to the needed places.
However, those places talk to system keyspace and are thus the users of the
"qctx" thing -- the query context for local internal requests.

To make cql tests not crash on null qctx pointer, its initialization should
come earlier (conforming to the main start sequence).

The qctx itself is a global pointer, which waits for its fix too, of course.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-31 18:50:01 +03:00
Pavel Emelyanov
699074bd48 transport: Keep sharded query processor reference on controller
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-31 15:44:21 +03:00
Pavel Emelyanov
c887d0df4c thrift: Keep sharded query processor reference on controller
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-31 15:44:21 +03:00
Pavel Emelyanov
cf172cf656 alternator: Use local query processor reference to get keys
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-31 15:44:21 +03:00
Pavel Emelyanov
94a9f22002 alternator: Keep local query processor reference in server
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-31 15:44:21 +03:00
Piotr Sarna
35887bf88b view: add printing missing base column on errors
When an out-of-sync view is attempted to be used in a write operation,
the whole operation needs to be aborted with an error. After this patch,
the error contains more context - namely, the missing column.
2020-10-31 12:22:07 +01:00
Piotr Sarna
ef3470fa34 view: simplify creating base-dependent info for reads only
The code which created base-dependent info for materialized views
can be expressed with fewer branches. Also, the constructor
which takes a single parameter is made explicit.
2020-10-31 12:22:07 +01:00
Piotr Sarna
71b28d69b3 view: fix typo: s/dependant/dependent 2020-10-31 12:22:07 +01:00
Piotr Sarna
669e2ada92 view: add error logs if a view is out of sync with its base
When Scylla finds out that a materialized view contains columns
which are not present in the base table (and they are not computed),
it now presents comprehensible errors in the log.
2020-10-31 12:22:07 +01:00
Avi Kivity
1734205315 Update seastar submodule
* seastar 6973080cd1...57b758c2f9 (11):
  > http: handle 'match all' rule correctly
  > http: add missing HTTP methods
  > memory: remove unused lambda capture in on_allocation_failure()
  > Support seastar allocator when seastar::alien is used
  > Merge "make timer related functions noexcept" from Benny
  > script: update dependecy packages for centos7/8
  > tutorial: add linebreak between sections
  > doc: add nav for the second last chap
  > doc: add nav bar at the bottom also
  > doc: rename add_prologue() to add_nav_to_body()
  > Wrong name used in an example in mini tutorial.
2020-10-30 09:49:47 +02:00
Avi Kivity
27125a45b2 test: switch lsa-related tests (imr_test and double_decker_test) to seastar framework
An upcoming change in Seastar only initializes the Seastar allocator in
reactor threads. This causes imr_test and double_decker_test to fail:

 1. Those tests rely on LSA working
 2. LSA requires the Seastar allocator
 3. Seastar is not initialized, so the Seastar allocator is not initialized.

Fix by switching to the Seastar test framework, which initializes Seastar.

Closes #7486
2020-10-30 08:06:04 +02:00
Avi Kivity
8a8589038c test: increase quota for tests to 6GB
test.py estimates the amount of memory needed per test
in order not to overload the machine, but it underestimates
badly and so machines with many cores but not a lot of memory
fail the tests (in debug mode principally) due to running out
of memory.

Increase the estimate from 2GB per test to 6GB.

Closes #7499
2020-10-30 08:04:40 +02:00
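To show why the per-test estimate matters, here is a hypothetical calculation of how many tests can run in parallel on a machine. The formula is an assumption made for illustration and is not the actual logic in test.py.

```python
GIB = 2**30


def max_parallel_tests(cpu_count: int, total_memory_bytes: int,
                       per_test_bytes: int = 6 * GIB) -> int:
    """Limit concurrency by both core count and estimated memory per test."""
    by_memory = total_memory_bytes // per_test_bytes
    return max(1, min(cpu_count, by_memory))


# A 64-core machine with 32 GiB of RAM:
print(max_parallel_tests(64, 32 * GIB, per_test_bytes=2 * GIB))  # 16
print(max_parallel_tests(64, 32 * GIB))                          # 5
```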
Avi Kivity
24097eee11 test: sstable_3_x_test: reduce stack usage in thread-local storage initialization
gcc collects all the initialization code for thread-local storage
and puts it in one giant function. In combination with debug mode,
this creates a very large stack frame that overflows the stack
on aarch64.

Work around the problem by placing each initializer expression in
its own function, thus reusing the stack.

Closes #7509
2020-10-30 08:03:44 +02:00
Piotr Grabowski
e96ef0d629 tests: Cleanup select_statement_utils
Add additional comments to select_statement_utils, fix formatting, add
missing #pragma once and introduce set_internal_paging_size_guard to
set internal_paging in RAII fashion.

Closes #7507
2020-10-29 15:25:02 +01:00
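The RAII pattern mentioned above (set a value for the duration of a scope, then restore it) has a close Python analogue in context managers. The sketch below is only an analogy: the `settings` dictionary and its default are hypothetical, and the real guard is a C++ class.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the internal paging setting.
settings = {"internal_paging_size": 10000}


@contextmanager
def set_internal_paging_size_guard(size: int):
    """Temporarily override the paging size, restoring it on scope exit."""
    previous = settings["internal_paging_size"]
    settings["internal_paging_size"] = size
    try:
        yield
    finally:
        # Runs even if the body raises, like a C++ destructor.
        settings["internal_paging_size"] = previous


with set_internal_paging_size_guard(1):
    print(settings["internal_paging_size"])  # 1
print(settings["internal_paging_size"])      # 10000
```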
Asias He
d47033837a gossiper: Use dedicated gossip scheduling group
Gossip currently runs inside the default (main) scheduling group. Running
there is not a problem in itself, but from time to time we see
many tasks in the main scheduling group and we suspect gossip. It is best
to move gossip to a dedicated scheduling group, so that we can catch
bugs that leak tasks to the main group more easily.

After this patch, we can check:

scylla_scheduler_time_spent_on_task_quota_violations_ms{group="gossip",shard="0"}

Fixes: #7154
Tests: unit(dev)
2020-10-29 12:53:37 +02:00
Avi Kivity
bd73898a5c dist: redhat: don't pull in kernel package
We require a kernel that is at least 3.10.0-514, because older
kernels have an XFS-related bug that causes data corruption. However
this Requires: clause pulls in a kernel even in Docker installation,
where it (and especially the associated firmware) occupies a lot of
space.

Change to a Conflicts: instead. This prevents installation when
the really old kernel is present, but doesn't pull it in for the
Docker image.

Closes #7502
2020-10-29 12:44:22 +02:00
Piotr Sarna
8c645f74ce Merge 'select_statement: Fix aggregate results on indexed selects (timeouts fixed) ' from Piotr Grabowski
Overview
Fixes #7355.

Before this change, there were a few invalid results of aggregates/GROUP BY on tables with secondary indexes (see below).

Unfortunately, it still does NOT fix the problem in issue #7043. Although this PR moves forward fixing of that issue, there is still a bug with `TOKEN(...)` in `WHERE` clauses of indexed selects that is not addressed in this PR. It will be fixed in my next PR.

It does NOT fix the problems in issues #7432, #7431 as those are out-of-scope of this PR and do not affect the correctness of results (only return a too large page).

GROUP BY (first commit)
Before the change, `GROUP BY` `SELECT`s with some `WHERE` restrictions on an indexed column would return invalid results (same grouped column values appearing multiple times):
```
CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ks_t on ks.t(v);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;
 pk
----
  1
  1
```
This is fixed by correctly passing `_group_by_cell_indices` to `result_set_builder`. Fixes the third failing example from issue #7355.

Paging (second commit)
Fixes two issues related to improper paging on indexed `SELECT`s. As those two issues are closely related (fixing one without fixing the other causes invalid results of queries), they are in a single commit (second commit).

The first issue is that when using `slice.set_range`, the existing `_row_ranges` (which specify clustering key prefixes) are not taken into account. This caused the wrong rows to be included in the result, as the clustering key bound was set to a half-open range:
```
CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
CREATE INDEX kst_index ON ks.t(c);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
SELECT COUNT(*) FROM ks.t WHERE c = 3;
 count
-------
     2
```
The second commit fixes this issue by properly trimming `row_ranges`.

The second fixed problem is related to setting the `paging_state` to `internal_options`. It was improperly set to the value just after reading from index, making the base query start from invalid `paging_state`.

The second commit fixes this issue by setting the `paging_state` after both index and base table queries are done. Moreover, the `paging_state` is now set based on `paging_state` of index query and the results of base table query (as base query can return more rows than index query).

The second commit fixes the first two failing examples from issue #7355.

Tests (fourth commit)
Extensively tests queries on tables with secondary indices with  aggregates and `GROUP BY`s.

Tests three cases that are implemented in `indexed_table_select_statement::do_execute` - `partition_slices`,
`whole_partitions` and (non-`partition_slices` and non-`whole_partitions`). As some of the issues found were related to paging, the tests check scenarios where the inserted data is smaller than a page, larger than a page and larger than two pages (and some in-between page boundaries scenarios).

I found all those parameters (case of `do_execute`, number of inserted rows) to have an impact on those fixed bugs, therefore the tests validate a large number of those scenarios.

Configurable internal_paging_size (third commit)
Before this change, internal `page_size` when doing aggregate, `GROUP BY` or nonpaged filtering queries was hard-coded to `DEFAULT_COUNT_PAGE_SIZE` (10,000). This change adds a new `internal_paging_size` variable, which is configurable by the `set_internal_paging_size` and `reset_internal_paging_size` free functions. This functionality is only meant for testing purposes.

Closes #7497

* github.com:scylladb/scylla:
  tests: Add secondary index aggregates tests
  select_statement: Introduce internal_paging_size
  select_statement: Fix paging on indexed selects
  select_statement: Fix GROUP BY on indexed select
2020-10-29 08:30:16 +01:00
Takuya ASADA
fc1c4f2261 scylla_raid_setup: use sysfs to detect existing RAID volume
We may not be able to detect an existing RAID volume by device file existence;
we should use sysfs instead to make sure it's running.

Fixes #7383

Closes #7399
2020-10-29 09:13:55 +02:00
Avi Kivity
17226f2f6c tools: toolchain: update to Fedora 33 with clang 11
Update the toolchain to Fedora 33 with clang 11 (note the
build still uses gcc).

The image now creates a /root/.m2/repository directory; without
this the tools/jmx build fails on aarch64.

Add java-1.8.0-openjdk-devel since that is where javac lives now.
Add a JAVA8_HOME environment variable; without this ant is not
able to find javac.

The toolchain is enabled for x86_64 and aarch64.
2020-10-28 20:21:44 +02:00
Piotr Grabowski
006d4f40d9 tests: Add secondary index aggregates tests
Extensively tests queries on tables with secondary indices with
aggregates and GROUP BYs. Tests three cases that are implemented
in indexed_table_select_statement::do_execute - partition_slices,
whole_partitions and (non-partition_slices and non-whole_partitions).
As some of the issues found were related to paging, the tests check
scenarios where the inserted data is smaller than a page, larger than
a page and larger than two pages (and some boundary scenarios).
2020-10-28 17:01:25 +01:00
Piotr Grabowski
4975d55cdc select_statement: Introduce internal_paging_size
Before this change, internal page_size when doing aggregate, GROUP BY
or nonpaged filtering queries was hard-coded to DEFAULT_COUNT_PAGE_SIZE.
This made testing hard (timeouts in debug build), because the tests had
to be large to test cases when there are multiple internal pages.

This change adds new internal_paging_size variable, which is 
configurable by set_internal_paging_size and reset_internal_paging_size
free functions. This functionality is only meant for testing purposes.
2020-10-28 17:01:25 +01:00
Piotr Grabowski
b7b5066581 select_statement: Fix paging on indexed selects
Fixes two issues related to improper paging on indexed SELECTs. As those
two issues are closely related (fixing one without fixing the other
causes invalid results of queries), they are in a single commit.

The first issue is that when using slice.set_range, the existing
_row_ranges (which specify clustering key prefixes) are not taken into
account. This caused the wrong rows to be included in the result, as the
clustering key bound was set to a half-open range:

CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
CREATE INDEX kst_index ON ks.t(c);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
SELECT COUNT(*) FROM ks.t WHERE c = 3;
 count
-------
     2

This change fixes this issue by properly trimming row_ranges.

The second fixed problem is related to setting the paging_state
to internal_options. It was improperly set just after reading from
index, making the base query start from invalid paging_state.

This change fixes this issue by setting the paging_state after both
index and base table queries are done. Moreover, the paging_state is
now set based on paging_state of index query and the results of base
table query (as base query can return more rows than index query).

Fixes the first two failing examples from issue #7355.
2020-10-28 17:01:25 +01:00
Piotr Grabowski
fb10386017 select_statement: Fix GROUP BY on indexed select
Before the change, GROUP BY SELECTs with some WHERE restrictions on an 
indexed column would return invalid results (same grouped column values
appearing multiple times):

CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ks_t on ks.t(v);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;
 pk
----
  1
  1

This is fixed by correctly passing _group_by_cell_indices to 
result_set_builder. Fixes the third failing example from issue #7355.
2020-10-28 17:01:25 +01:00
Avi Kivity
5ff5d43c7a Update tools/java submodule
* tools/java e97c106047...ad48b44a26 (1):
  > build: Add generated Thrift sources to multi-Java build
2020-10-28 16:52:25 +02:00
Pavel Emelyanov
b2ce3b197e allocation_strategy: Fix standard_migrator initialization
This is the continuation of 30722b8c8e, so let me re-cite Rafael:

    The constructors of these global variables can allocate memory. Since
    the variables are thread_local, they are initialized at first use.

    There is nothing we can do if these allocations fail, so use
    disable_failure_guard.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201028140553.21709-1-xemul@scylladb.com>
2020-10-28 16:22:23 +02:00
Asias He
289a08072a repair: Make repair_writer a shared pointer
The future of the fiber that writes data into sstables inside
the repair_writer is stored in _writer_done like below:

class repair_writer {
   _writer_done[node_idx] =
      mutation_writer::distribute_reader_and_consume_on_shards().then([this] {
         ...
      }).handle_exception([this] {
         ...
      });
}

The fiber accesses the repair_writer object in the error handling path. We
wait for the _writer_done to finish before we destroy repair_meta
object which contains the repair_writer object to avoid the fiber
accessing already freed repair_writer object.

To be safer, we can make repair_writer a shared pointer and take a
reference in the distribute_reader_and_consume_on_shards code path.
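
The idea can be sketched in standard C++ as follows. This is an illustration only, not the repair code: `writer` is a hypothetical stand-in for repair_writer, a `std::function` stands in for the Seastar continuation, and `std::shared_ptr` stands in for whatever shared pointer type the real patch uses.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in for repair_writer.
class writer : public std::enable_shared_from_this<writer> {
    std::string _state = "ok";
public:
    // Capturing `this` would leave the callback with a dangling pointer once
    // the owner is destroyed. Capturing a shared pointer makes the callback
    // a co-owner, so the object lives until the callback itself is destroyed.
    std::function<std::string()> make_error_handler() {
        return [self = shared_from_this()] { return self->_state; };
    }
};

// Runs the handler after the original owner has dropped its reference.
inline std::string run_handler_after_owner_is_gone() {
    std::function<std::string()> handler;
    {
        auto w = std::make_shared<writer>();
        handler = w->make_error_handler();
    }  // the owner's reference is dropped here
    return handler();  // still valid: the lambda holds its own reference
}
```

With this shape, waiting for the fiber before destroying the owner is still good practice, but a missed wait no longer turns into a use-after-free.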

Fixes #7406

Closes #7430
2020-10-28 16:22:23 +02:00
Avi Kivity
4b9206a180 install: abort if LD_PRELOAD is set when executing a relocatable binary
LD_PRELOAD libraries usually have dependencies in the host system,
which they will not have access to in a relocatable environment
since we use a different libc. Detect that LD_PRELOAD is in use and if
so, abort with an error.

Fixes #7493.

Closes #7494
2020-10-28 16:22:23 +02:00
Avi Kivity
2a42fc5cde build: supply linker flags only to the linker, not the compiler
Clang complains if it sees linker-only flags when called for compilation,
so move the compile-time flags from cxx_ld_flags to cxxflags, and remove
cxx_ld_flags from the compiler command line.

The linker flags are also passed to Seastar so that the build-id and
interpreter hacks still apply to iotune.

Closes #7466
2020-10-28 16:22:23 +02:00
Avi Kivity
fc15d0a4be build: relocatable package: exclude tools/python3
python3 has its own relocatable package, no need to include it
in scylla-package.tar.gz.

Python has its own relocatable package, so packaging it in scylla-package.ta

Closes #7467
2020-10-28 16:22:23 +02:00
Avi Kivity
6eb3ba74e4 Update tools/java submodule
* tools/java f2e8666d7e...e97c106047 (1):
  > Relocatable Package: create product prefixed relocatable archive
2020-10-28 08:47:49 +02:00
Juliusz Stasiewicz
e0176bccab create_table_statement: Disallow default TTL on counter tables
On such an attempt, `invalid_request_exception` is thrown.
Also, a simple CQL test is added.

Fixes #6879
2020-10-27 22:44:02 +02:00
Nadav Har'El
92b741b4ff alternator test: more tests for disabled streams and closed shards
We already have a test for the behavior of a closed shard and how
iterators previously created for it are still valid. In this patch
we add to this also checking that the shard id itself, not just the
iterator, is still valid.

Additionally, although the aforementioned test used a disabled stream
to create a closed shard, it was not a complete test for the behavior
of a disabled stream, and this patch adds such a test. We check that
although the stream is disabled, it is still fully usable (for 24 hours) -
its original ARN is still listed on ListStreams, the ARN is still usable,
its shards can be listed, all are marked as closed but still fully readable.

Both tests pass on DynamoDB, and xfail on Alternator because of
issue #7239 - CDC drops the CDC log table as soon as CDC is disabled,
so the stream data is lost immediately instead of being retained for
24 hours.

Refs #7239

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201006183915.434055-1-nyh@scylladb.com>
2020-10-27 22:44:02 +02:00
Nadav Har'El
a57d4c0092 docs: clean up format of docs/alternator/getting-started.md
In https://github.com/scylladb/scylla-docs/pull/3105 it was noted that
the Sphinx document parser doesn't like a horizontal line ("---") in
the beginning of a section. Since there is no real reason why we must
have this horizontal line, let's just remove it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201001151312.261825-1-nyh@scylladb.com>
2020-10-27 22:44:02 +02:00
Avi Kivity
e2a02f15c2 Merge 'transport/system_ks: Add more info to system.clients' from Juliusz Stasiewicz
This patch fills the following columns in `system.clients` table:
* `connection_stage`
* `driver_name`
* `driver_version`
* `protocol_version`

It also improves:
* `client_type` - distinguishes cql from thrift just in case
* `username` - now it displays correct username iff `PasswordAuthenticator` is configured.

What is still missing:
* SSL params (I'll happily get some advice here)
* `hostname` - I didn't find it in tested drivers

Refs #6946

Closes #7349

* github.com:scylladb/scylla:
  transport: Update `connection_stage` in `system.clients`
  transport: Retrieve driver's name and version from STARTUP message
  transport: Notify `system.clients` about "protocol_version"
  transport: On successful authentication add `username` to system.clients
2020-10-27 22:44:02 +02:00
Amnon Heiman
52db99f25f scyllatop/livedata.py: Safe iteration over metrics
This patch changes the code that iterates over the metrics to use a copy
of the metrics names to make it safe to remove the metrics from the
metrics object.

Fixes #7488

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-10-27 22:44:02 +02:00
Calle Wilund
1bc96a5785 alternator::streams: Make describe_stream use actual log ttl as window
Allows QA to bypass the normal hardcoded 24h ttl of data and still
get "proper" behaviour w.r.t. available stream set/generations.
I.e. can manually change cdc ttl option for alternator table after
streams enabled. Should not be exposed, but perhaps useful for
testing.

Closes #7483
2020-10-26 12:16:36 +02:00
Calle Wilund
4b65d67a1a partition_version: Change range_tombstones() to return chunked_vector
Refs #7364

The number of tombstones can be large. As a stopgap measure to
just returning a source range (with keepalive), we can at least
alleviate the problem by using a chunked vector.

Closes #7433
2020-10-26 11:54:42 +02:00
Benny Halevy
82aabab054 table: get rid of reshuffle_sstables
It is unused since 7351db7cab

Refs #6950

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201026074914.34721-1-bhalevy@scylladb.com>
2020-10-26 09:50:21 +02:00
Calle Wilund
46ea8c9b8b cdc: Add an "end-of-record" column to
Fixes #7435

Adds an "eor" (end-of-record) column to cdc log. This is non-null only on
last-in-timestamp group rows, i.e. end of a singular source "event".

A client can use this as a shortcut to knowing whether or not he has a
full cdc "record" for a given source mutation (single row change).

Closes #7436
2020-10-26 09:39:27 +02:00
Takuya ASADA
fe2d6765f9 node_exporter_install: upgrade to latest release
We currently use an outdated version of node_exporter; let's upgrade to the latest
version.

Fixes #7427
2020-10-25 13:59:14 +02:00
Etienne Adam
c518c1de1c redis: remove useless std::move()
As remarked during the last review, this commit
removes the useless std::move().

Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20201024180447.16799-1-etienne.adam@gmail.com>
2020-10-25 13:17:40 +02:00
Avi Kivity
d5150a94d2 Update abseil submodule from upstream
The dynamic_annotations library is now header-only, so
it is no longer built.

* abseil 2069dc7...1e3d25b (73):
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > fix compile fails with asan and -Wredundant-decls (#801)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > btree: fix sign-compare warnings (#800)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Added missing asserts for seq.index() < capacity_ and unified their usage based on has_element(). (#781)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > fix build on P9 (#739)
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Disable pthread for standalone wasm build support (#721)
  > Merge branch 'master' of https://github.com/abseil/abseil-cpp into master
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Export of internal Abseil changes
  > Exclude empty directories (#697)
2020-10-25 12:51:40 +02:00
Dejan Mircevski
40adf38915 cql3/expr: Use Boost concept assert
In bd6855e, we reverted to Boost ranges and commented out the concept
check.  But Boost has its own concept check, which this patch enables.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #7471
2020-10-22 17:24:49 +03:00
Benny Halevy
fcca64b4f6 test: imr_test should run automatically
Unclear why it was placed in test/manual in
commit 1c8736f998

Test: boost/imr_test

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201022093826.12009-1-bhalevy@scylladb.com>
2020-10-22 12:40:30 +03:00
Nadav Har'El
6740907f3d Merge 'utf8: don't linearize cells for validation' from Avi Kivity
Currently, we linearize large UTF8 cells in order to validate them.
This can cause large latency spikes if the cell is large.

This series changes UTF8 validation to work on fragmented buffers.
This is somewhat tricky since the validation routines are optimized
for single-instruction-multiple-data (SIMD) architectures.

The unit tests are expanded to cover the new functionality.

Fixes #7448.

Closes #7449

* github.com:scylladb/scylla:
  types: don't linearize utf8 for validation
  test: utf8: add fragmented buffer validation tests
  utils: utf8: add function to validate fragmented buffers
  utils: utf8: expose validate_partial() in a header
  utils: utf8: introduce validate_partial()
  utils: utf8: extract a function to evaluate a single codepoint
2020-10-21 20:51:15 +03:00
Tomasz Grabiec
158ae99c89 Merge 'view info: preserve integrity by allowing base info for reads only and by initializing base info' from Eliran Sinvani
The purpose of this PR is to handle schema integrity issues that can arise in races involving materialized views.
The possibility of such integrity issues was found in #7420, where a view schema was used for reading without
its _base_info member initialized, resulting in a segfault.
We handle this by doing 3 things:
1. First, guard against using an uninitialized base info - this will be considered an internal error, as it indicates that
there is a path in our code that creates a view schema to be used for reads or writes but does not initialize the base info.
2. We allow the base info to be initialized also from a partially matching base (most likely a newer one than the one used to create the view).
3. We fix the suspected path that creates such a view schema so that it initializes it (in the migration manager).

It is worth mentioning that this PR is a workaround to a probable design flaw in our materialized views which requires the base
table's information to be retrieved in the first place instead of just being self contained.

Refs #7420

Closes #7469

* github.com:scylladb/scylla:
  materialized views: add a base table reference if missing
  view info: support partial match between base and view for only reading from view.
  view info: guard against null dereference of the base info
2020-10-21 16:21:00 +02:00
Eliran Sinvani
4749c58068 materialized views: add a base table reference if missing
Schema pointers can be obtained from two distinct entities:
one is the database, where schemas are obtained from the table
objects, and the other is the schema registry.
When a schema or a new schema is attached to a table object that
represents a base table for views, all of the corresponding attached
view schemas are guaranteed to have their base info in sync.
However, if an older schema is inserted into the registry by the
migration manager, i.e. loaded from another node, it will be
missing this info.
This becomes a problem when this schema is published through the
schema registry as it can be obtained for an obsolete read command
for example and then eventually cause a segmentation fault by null
dereferencing the _base_info ptr.

Refs #7420
2020-10-21 16:52:28 +03:00
Eliran Sinvani
70e04c1123 view info: support partial match between base and view for
only reading from view.

The current implementation of materialized views does
not keep the version to which a specific version of a materialized
view schema corresponds. This complicates things, especially for
old view versions that the schema doesn't support anymore. However,
the views, being also independent tables, should allow reading from
them as long as they exist, even if the base table changed since then.
For the reading purpose, we don't need to know the exact composition
of view primary key columns that are not part of the base primary
key, we only need to know that there are any, and this is a much
looser constraint on the schema.
We can rely on table invariants such as the fact that pk columns are
not going to disappear in newer versions of the table.
This means that if we don't find a view column in the base table, it is
not a part of the base table primary key.
This information is enough for us to perform reads on the view.
This commit adds support for being able to rely on such partial
information along with a validation that it is not going to be used for
writes. If it is, we simply abort since this means that our schema
integrity is compromised.
2020-10-21 15:20:43 +03:00
Eliran Sinvani
372051c97d view info: guard against null dereference of the base info
The change's purpose is to guard against segfault that is the
result of dereferencing the _base_info member when it is
uninitialized. We already know this can happen (#7420).
The only purpose of this change is to treat this condition as
an internal error, the reason is that it indicates a schema integrity
problem.
Besides this change, other measures should be taken to ensure that
the _base_table member is initialized before calling methods that
rely on it.
We call the internal_error as a last resort.
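
A minimal sketch of such a guard, under stated assumptions: the `base_info` and `view_info` types below are simplified stand-ins with hypothetical members, and `std::logic_error` stands in for the project's internal-error reporting, which is not reproduced here.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical, simplified stand-in for the base table information.
struct base_info {
    int base_version = 0;
};

class view_info {
    // May be unset, e.g. for a view schema that was loaded from another node.
    std::shared_ptr<const base_info> _base_info;
public:
    void set_base_info(std::shared_ptr<const base_info> bi) { _base_info = std::move(bi); }

    // Every accessor goes through this check instead of dereferencing
    // _base_info directly, so a missing base info is reported as an
    // error rather than crashing with a segfault.
    const base_info& base() const {
        if (!_base_info) {
            throw std::logic_error("view_info: base info is not initialized");
        }
        return *_base_info;
    }
};
```

The point of the check is diagnostic: it turns a null dereference into an error that names the broken invariant, while the other commits in this series try to make sure the check never fires.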
2020-10-21 12:12:51 +03:00
Benny Halevy
70219b423f table: add_sstable: provide strong exception guarantees
Do not leave side-effects on exception.

Fixes #6658

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201020145429.19426-1-bhalevy@scylladb.com>
2020-10-21 11:40:03 +03:00
Avi Kivity
c0ca54395a types: don't linearize utf8 for validation
Use the new non-linearizing validator, avoiding linearization.
Linearization can cause large contiguous memory allocations, which
in turn causes latency spikes.

Fixes #7448.
2020-10-21 11:14:44 +03:00
Avi Kivity
89f111b03f test: utf8: add fragmented buffer validation tests
Since there are a huge number of variations, we use random
testing. Each test case is composed of a random number of valid
code points, with a possible invalid code point somewhere. The
test case is broken up into a random number of fragments. We
test both validation success and error position indicator.
2020-10-21 11:14:44 +03:00
Avi Kivity
91490827c1 utils: utf8: add function to validate fragmented buffers
Add a function to validate fragmented buffers. We validate
each buffer with SIMD-optimized validate_partial(), then
collect the codepoint that spans buffer boundaries (if any)
in a temporary buffer, validate that too, and continue.
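
The scheme can be sketched as below. This is a simplified illustration, not the code from the commit: the partial validator here is a plain byte-by-byte loop rather than a SIMD routine, fragments are modelled as a `std::vector<std::string>`, and the names `sequence_length`, `valid_sequence`, `partial_result` and `validate_fragmented` are chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Length of the sequence started by lead byte b; 0 if b cannot start one.
inline std::size_t sequence_length(unsigned char b) {
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

// Checks one complete sequence of n == sequence_length(p[0]) bytes.
inline bool valid_sequence(const unsigned char* p, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        if ((p[i] & 0xC0) != 0x80) return false;                // not a continuation byte
    }
    if (n == 3 && p[0] == 0xE0 && p[1] < 0xA0) return false;    // overlong encoding
    if (n == 3 && p[0] == 0xED && p[1] >= 0xA0) return false;   // UTF-16 surrogate
    if (n == 4 && p[0] == 0xF0 && p[1] < 0x90) return false;    // overlong encoding
    if (n == 4 && p[0] == 0xF4 && p[1] >= 0x90) return false;   // above U+10FFFF
    return true;
}

struct partial_result {
    bool error;
    std::size_t tail_size;  // bytes of an incomplete trailing sequence
};

// Validates all complete sequences; an incomplete last one becomes the tail.
inline partial_result validate_partial(const unsigned char* p, std::size_t len) {
    std::size_t i = 0;
    while (i < len) {
        const std::size_t n = sequence_length(p[i]);
        if (n == 0) return {true, 0};
        if (len - i < n) return {false, len - i};
        if (!valid_sequence(p + i, n)) return {true, 0};
        i += n;
    }
    return {false, 0};
}

inline bool validate_fragmented(const std::vector<std::string>& fragments) {
    unsigned char carry[4];   // code point that straddles fragment boundaries
    std::size_t carried = 0;
    for (const std::string& f : fragments) {
        const unsigned char* p = reinterpret_cast<const unsigned char*>(f.data());
        std::size_t len = f.size();
        if (carried > 0) {
            const std::size_t n = sequence_length(carry[0]);
            assert(n != 0);   // the lead byte was checked by validate_partial()
            while (carried < n && len > 0) { carry[carried++] = *p++; --len; }
            if (carried < n) continue;   // fragment exhausted, keep collecting
            if (!valid_sequence(carry, n)) return false;
            carried = 0;
        }
        const partial_result r = validate_partial(p, len);
        if (r.error) return false;
        for (std::size_t i = 0; i < r.tail_size; ++i) carry[i] = p[len - r.tail_size + i];
        carried = r.tail_size;
    }
    return carried == 0;   // a dangling partial code point is an error
}
```

Only the few bytes of a straddling code point are ever copied, so no fragment is linearized and the bulk of each buffer is still handed to the fast path in one piece.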
2020-10-21 11:14:44 +03:00
Avi Kivity
3d1be9286f utils: utf8: expose validate_partial() in a header
Since fragmented buffers are templates, we'll need access
to validate_partial() in a header. Move it there.
2020-10-21 11:14:44 +03:00
Avi Kivity
22a0c457e2 utils: utf8: introduce validate_partial()
The current validators expect the buffer to contain a full
UTF-8 string. This won't be the case for fragmented buffers,
since a codepoint can straddle two (or more) buffers.

To prepare for that, convert the existing validators to
validate_partial(), which returns either an error, or
success with an indication of the size of the tail that
was not validated and how many bytes it is missing.

This is natural since the SIMD validators already
cannot process a tail in SIMD mode if it's smaller than
the vector size, so only minor rearrangements are needed.
In addition, we now have validate_partial() for non-SIMD
architectures, since we'll need it for fragmented buffer
validation.
2020-10-21 11:14:44 +03:00
Avi Kivity
900699f1b5 utils: utf8: extract a function to evaluate a single codepoint
Our SIMD optimized validators cannot process a codepoint that
spans multiple buffers, and adapting them to be able to will slow
them down. So our strategy is to special-case any codepoint that
spans two buffers.

To do that, extract an evaluate_codepoint() function from the
current validate_naive() function. It returns three values:
 - if a codepoint was successfully decoded from the buffer,
   how many bytes were consumed
 - if not enough bytes were in the buffer, how many more
   are needed
 - otherwise, an error happened, so return an indication

The new function uses a table to calculate a codepoint's
size from its first byte, similar to the SIMD variants.

validate_naive() is now implemented in terms of
evaluate_codepoint().
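
A sketch of what such a function could look like, assuming a three-field result struct and a 32-entry length table indexed by the top five bits of the first byte. Only the names `evaluate_codepoint` and `validate_naive` come from the commit message; the struct, the table layout and the signatures are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Exactly one of the three outcomes applies for each call.
struct codepoint_status {
    bool error;
    std::size_t bytes_consumed;  // > 0 when a full code point was decoded
    std::size_t bytes_needed;    // > 0 when the buffer ended too early
};

// Sequence length indexed by (first byte >> 3); 0 marks an invalid first byte.
static constexpr std::uint8_t length_table[32] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x00..0x7F: ASCII
    0, 0, 0, 0, 0, 0, 0, 0,                          // 0x80..0xBF: continuation
    2, 2, 2, 2,                                      // 0xC0..0xDF
    3, 3,                                            // 0xE0..0xEF
    4,                                               // 0xF0..0xF7
    0,                                               // 0xF8..0xFF
};

inline codepoint_status evaluate_codepoint(const std::uint8_t* p, std::size_t len) {
    assert(len >= 1);
    const std::size_t n = length_table[p[0] >> 3];
    if (n == 0) return {true, 0, 0};
    // Check the continuation bytes that are present, even in a short buffer.
    const std::size_t avail = len < n ? len : n;
    for (std::size_t i = 1; i < avail; ++i) {
        if ((p[i] & 0xC0) != 0x80) return {true, 0, 0};
    }
    if (len < n) return {false, 0, n - len};
    // Decode the value, then reject overlong forms, surrogates and
    // values above U+10FFFF.
    static constexpr std::uint32_t lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    static constexpr std::uint32_t min_value[5] = {0, 0, 0x80, 0x800, 0x10000};
    std::uint32_t cp = p[0] & lead_mask[n];
    for (std::size_t i = 1; i < n; ++i) {
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min_value[n] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return {true, 0, 0};
    }
    return {false, n, 0};
}

// For a complete string, a truncated trailing code point is an error too.
inline bool validate_naive(const std::uint8_t* p, std::size_t len) {
    while (len > 0) {
        const codepoint_status st = evaluate_codepoint(p, len);
        if (st.error || st.bytes_needed != 0) return false;
        p += st.bytes_consumed;
        len -= st.bytes_consumed;
    }
    return true;
}
```

Separating "need more bytes" from "error" is what lets a caller treat the end of one buffer as a pause rather than a failure.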
2020-10-21 11:14:43 +03:00
Raphael S. Carvalho
6f805bd123 sstable_directory: Fix 50% space requirement for resharding
This is a regression caused by aebd965f0.

After the sstable_directory changes, resharding now waits for all sstables
to be exhausted before releasing reference to them, which prevents their
resources like disk space and fd from being released. Let's restore the
old behavior of incrementally releasing resources, reducing the space
requirement significantly.

Fixes #7463.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201020140939.118787-1-raphaelsc@scylladb.com>
2020-10-21 09:51:26 +02:00
Raphael S. Carvalho
74d35a2286 compaction: fix debug log for fully expired ssts
the log is incorrectly printing actually compacted ssts, instead of
fully expired ssts that weren't actually compacted

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201020125053.109615-1-raphaelsc@scylladb.com>
2020-10-20 16:01:28 +03:00
Dmitry Kropachev
62709bab98 dist/docker: Pass extra arguments to Scylla
Currently there is no way to pass scylla arguments from
docker-entrypoint to scylla, but from time to time it is needed.

Example: https://github.com/scylladb/scylla-operator/issues/177

Closes #7458
2020-10-20 09:49:10 +03:00
Piotr Sarna
e5edf30869 Merge 'treewide: adjust for missing aggregate template ...
... type deduction and parenthesized aggregate construction' from Avi Kivity

Clang does not implement P0960R3 and P1816R0, so constructions
of aggregates (structs with no constructors) have to use braced initialization
and cannot use class template argument deduction. This series makes the
adjustments.

Closes #7456

* github.com:scylladb/scylla:
  reader_concurrency_semaphore: adjust permit_summary construction for clang
  schema_tables: adjust altered_schema construction for clang
  types: adjust validation_visitor construction for clang
2020-10-20 08:52:29 +03:00
Dejan Mircevski
b037b0c10b cql3: Delete some newlines
Makes files shorter while still keeping the lines under 120 columns.
Separate from other commits to make review easier.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-10-19 15:40:55 -04:00
Dejan Mircevski
62ea6dcd28 cql3: Drop superfluous ALLOW FILTERING
Required no longer, after the last commit.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-10-19 15:38:11 -04:00
Dejan Mircevski
6773563d3d cql3: Drop unneeded filtering for continuous CK
Don't require filtering when a continuous slice of the clustering key
is requested, even if partition is unrestricted.  The read command we
generate will fetch just the selected data; filtering is unnecessary.

Some tests needed to update the expected results now that we're not
fetching the extra data needed for filtering.  (Because tests don't do
the final trim to match selectors and assert instead on all the data
read.)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-10-19 14:46:43 -04:00
Avi Kivity
cfada6e04d reader_concurrency_semaphore: adjust permit_summary construction for clang
Clang does not implement P0960R3, parenthesized initialization of
aggregates, so we have to use brace initialization in
permit_summary.

As the parenthesized constructor call is done by emplace_back(),
we have to do the braced call ourselves.
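
As an illustration of the pattern (with a hypothetical aggregate `summary`, not the real permit_summary, whose members are not shown in this log):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical aggregate: no user-declared constructors.
struct summary {
    std::string name;
    std::size_t count;
};

inline std::vector<summary> build_summaries() {
    std::vector<summary> v;
    // v.emplace_back("reads", 3);
    //   emplace_back() constructs the element as summary(args...), which for
    //   an aggregate only compiles with P0960R3 support.
    v.emplace_back(summary{"reads", 3});  // braced construction done by the caller
    v.push_back({"writes", 5});           // equivalent, via list-initialization
    return v;
}
```

Both spellings construct a temporary and move it into the vector, which is a negligible cost for small aggregates like this one.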
2020-10-19 14:57:51 +03:00
Avi Kivity
8e386a5f48 schema_tables: adjust altered_schema construction for clang
Clang does not implement P0960R3, parenthesized initialization of
aggregates, so we have to use brace initialization in
altered_schema.

As the parenthesized constructor call is done by emplace_back(),
we have to do the braced call ourselves.
2020-10-19 14:57:21 +03:00
Avi Kivity
ed6775c585 types: adjust validation_visitor construction for clang
Clang does not implement P0960R3, parenthesized initialization of
aggregates, so we have to use brace initialization in
validation_visitor. It also does not implement class template
argument deduction for aggregates (P1816r0), so we have to
specify the template parameters explicitly.
2020-10-19 14:53:00 +03:00
Piotr Sarna
ef8815d39e Merge 'treewide: drop some uses of <ranges> for clang' from Avi Kivity
Clang has trouble compiling libstdc++'s `<ranges>`. It is not known whether
the problem is in clang or in libstdc++; I filed bugs for both [1] [2].

Meanwhile, we wish to use clang to gain working coroutine support,
so drop the failing uses of `<ranges>`. Luckily the changes are simple.

[1] https://bugs.llvm.org/show_bug.cgi?id=47509
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97120

Closes #7450

* github.com:scylladb/scylla:
  test: view_build_test: drop <ranges>
  test: mutation_reader_test: drop <ranges>
  cql3: expression: drop <ranges>
  sstables: leveled_compaction_strategy: drop use of <ranges>
  utils: to_range(): relax constraint
2020-10-19 09:58:53 +02:00
Tomasz Grabiec
d48f04f25e migration_manager: Drop the schema version check from the schema pull handler
The check was added to support migration from schema tables format v2
to v3. It was needed to handle the rolling upgrade from 2.x to 3.x
scylla version. Old nodes wouldn't recognize new schema mutations, so
the pull handler in v3 was changed to ignore requests from v2 nodes based
on their advertised SCHEMA_TABLES_VERSION gossip state.

This started to cause problems after 3b1ff90 (get rid of the seed
concept). The bootstrapping node sometimes would hang during boot
unable to reach schema agreement.

It's relevant that gossip exchanges about new nodes are unidirectional
(refs #2862).

It's also relevant that pulls are edge-triggered only (refs #7426).

If the bootstrapping node (A) is listed as a seed in one of the
existing node's (B) configuration then node A can be contacted before
it contacts node B. Node A may then send schema pull request to node B
before it learns about node A, and node B will assume it's an old node
and give an empty response. As a result, node A will end up with an
old schema.

The fix is to drop the check so that pull handler always responds with
the schema. We don't support upgrades from nodes using v2 schema
tables format anymore so this should be safe.

Fixes #7396

Tests:

  - manual (ccm)
  - unit (dev)

Message-Id: <1602612578-21258-1-git-send-email-tgrabiec@scylladb.com>
2020-10-19 10:45:23 +03:00
Avi Kivity
3249516f2e test: view_build_test: drop <ranges>
Clang has trouble with some parts of <ranges>. Replace with
boost range adaptors for now.
2020-10-19 10:23:31 +03:00
Avi Kivity
1041521eb8 test: mutation_reader_test: drop <ranges>
Clang has trouble with some parts of <ranges>. Replace with
boost range adaptors for now.
2020-10-19 10:23:31 +03:00
Avi Kivity
bd6855ed62 cql3: expression: drop <ranges>
Clang has trouble with some parts of <ranges>. Replace with
boost range adaptors for now.
2020-10-19 10:23:30 +03:00
Avi Kivity
951b4d1541 sstables: leveled_compaction_strategy: drop use of <ranges>
Clang has trouble with some parts of <ranges>. Replace with iterators
for now.
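
A sketch of this kind of rewrite, assuming a made-up `doubled()` helper (the actual code in leveled_compaction_strategy is not shown here): a `std::views::transform` pipeline is replaced with an iterator-based `std::transform`.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical helper; it only illustrates swapping a <ranges> pipeline for iterators.
std::vector<int> doubled(const std::vector<int>& in) {
    std::vector<int> out;
    out.reserve(in.size());
    // <ranges> form that is being avoided:
    //   for (int x : in | std::views::transform([](int v) { return 2 * v; })) out.push_back(x);
    std::transform(in.begin(), in.end(), std::back_inserter(out),
                   [](int v) { return 2 * v; });
    assert(out.size() == in.size());
    return out;
}
```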
2020-10-18 18:16:37 +03:00
Avi Kivity
f9129fc1f9 utils: to_range(): relax constraint
The input range to utils::to_range() should be indeed a range,
but clang has trouble compiling <ranges> which causes it to fail.

Relax the constraint until this is fixed.
2020-10-18 18:16:30 +03:00
Avi Kivity
dfe4161e65 Revert "SCYLLA-VERSION-GEN: change master version to 4.3.dev"
This reverts commit 951fb638a3.
QA was not prepared for it and it breaks their scripts.
2020-10-18 14:21:25 +03:00
Nadav Har'El
4159054baf Merge 'treewide: don't capture structured bindings in lambdas' from Avi Kivity
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.

Hopefully, most of these lambda captures will be replaced with
coroutines.
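
A minimal sketch of the workaround, assuming a hypothetical `make_printers()` helper that is not part of the tree: variables that a lambda captures are declared as ordinary variables rather than as structured bindings.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: builds one callable per map entry.
std::vector<std::function<std::string()>> make_printers(const std::map<std::string, int>& m) {
    std::vector<std::function<std::string()>> out;
    for (const auto& entry : m) {
        // for (const auto& [name, value] : m) { out.push_back([name, value] { ... }); }
        // ^ capturing structured bindings needs P1091R3
        const std::string& name = entry.first;  // ordinary variables can be captured
        const int value = entry.second;
        out.push_back([name, value] { return name + "=" + std::to_string(value); });
    }
    return out;
}
```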

Closes #7445

* github.com:scylladb/scylla:
  test: mutation_reader_test: don't capture structured bindings in lambdas
  api: column_family: don't capture structured bindings in lambdas
  thrift: don't capture structured bindings in lambdas
  test: partition_data_test: don't capture structured bindings in lambdas
  test: querier_cache_test: don't capture structured bindings in lambdas
  test: mutation_test: don't capture structured bindings in lambdas
  storage_proxy: don't capture structured bindings in lambdas
  db: hints/manager: don't capture structured bindings in lambdas
  db: commitlog_replayer: don't capture structured bindings in lambdas
  cql3: select_statement: don't capture structured bindings in lambdas
  cql3: statement_restrictions: don't capture structured bindings in lambdas
  cdc: log: don't capture structured bindings in lambdas
2020-10-18 13:12:11 +03:00
Avi Kivity
6f5ef5a5f5 dht: document incremental partition_range and token_range sharders
Closes #6210
2020-10-18 12:24:49 +03:00
Pavel Solodovnikov
aa4c359cff column_mapping_entry: extract == and != operators
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201016123638.99534-1-pa.solodovnikov@scylladb.com>
2020-10-16 14:59:50 +02:00
Avi Kivity
e6d55e2778 test: mutation_reader_test: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
2020-10-16 15:25:15 +03:00
Avi Kivity
82f79c0077 api: column_family: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
2020-10-16 15:25:05 +03:00
Avi Kivity
99ee5f6aac thrift: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
2020-10-16 15:24:57 +03:00
Avi Kivity
d5e94ab224 test: partition_data_test: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
2020-10-16 15:24:45 +03:00
Avi Kivity
77d54410d0 test: querier_cache_test: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
2020-10-16 15:24:37 +03:00
Avi Kivity
b406af2556 test: mutation_test: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
2020-10-16 15:24:28 +03:00
Avi Kivity
d50f508fa6 storage_proxy: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
2020-10-16 15:24:19 +03:00
Avi Kivity
cb9a9584ac db: hints/manager: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
2020-10-16 15:24:09 +03:00
Avi Kivity
1986a74cc4 db: commitlog_replayer: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
2020-10-16 15:24:01 +03:00
Avi Kivity
05a24408df cql3: select_statement: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
2020-10-16 15:23:53 +03:00
Avi Kivity
c2c3f8343e cql3: statement_restrictions: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
2020-10-16 15:23:33 +03:00
Avi Kivity
d3c0b4c555 cdc: log: don't capture structured bindings in lambdas
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accommodate it, don't
use structured bindings for variables that are later
captured.
2020-10-16 15:23:16 +03:00
Avi Kivity
f87d4cca68 Merge "docs/debugging.md: add thematic debugging guides" from Botond
"
Focusing on different aspects of debugging Scylla. Also expand some of
the existing segments and fix some small issues around the document.
"

* 'debugging.md-advanced-guides/v1' of https://github.com/denesb/scylla:
  docs/debugging.md: add thematic debugging guides
  docs/debugging.md: tips and tricks: add section about optimized-out variables
  docs/debugging.md: TLS variables: add missing $ to terminal command
  docs/debugging.md: TUI: describe how to switch between windows
  docs/debugging.md: troubleshooting: expand on crash on backtrace
2020-10-16 14:07:45 +03:00
Pekka Enberg
618e5cb1db Merge 'token_restriction: invalid_request_exception on SELECTs with both normal and token restrictions' from Piotr Grabowski
Before this change, invalid query exception on selects with both normal
and token restrictions was only thrown when token restriction was after
normal restriction.

This change adds proper validation when token restriction is before normal restriction.

**Before the change - does not return error in last query; returns wrong results:**
```
cqlsh> CREATE TABLE ks.t(pk int, PRIMARY KEY(pk));
cqlsh> INSERT INTO ks.t(pk) VALUES (1);
cqlsh> INSERT INTO ks.t(pk) VALUES (2);
cqlsh> INSERT INTO ks.t(pk) VALUES (3);
cqlsh> INSERT INTO ks.t(pk) VALUES (4);
cqlsh> SELECT pk, token(pk) FROM ks.t WHERE pk = 2 AND token(pk) > 0;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Columns "ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}" cannot be restricted by both a normal relation and a token relation"
cqlsh> SELECT pk, token(pk) FROM ks.t WHERE token(pk) > 0 AND pk = 2;

 pk | system.token(pk)
----+---------------------
 3 | 9010454139840013625

(1 rows)
```

Closes #7441

* github.com:scylladb/scylla:
  tests: Add token and non-token conjunction tests
  token_restriction: Add non-token merge exception
2020-10-16 13:09:29 +03:00
Tomasz Grabiec
f893516e55 Merge "lwt: store column_mapping's for each table schema version upon a DDL change" from Pavel Solodovnikov
This patch introduces a new system table: `system.scylla_table_schema_history`,
which is used to keep track of column mappings for obsolete table
schema versions (i.e. schema becomes obsolete when it's being changed
by means of `CREATE TABLE` or `ALTER TABLE` DDL operations).

It is populated automatically when a new schema version is being
pulled from a remote in get_schema_definition() at migration_manager.cc
and also when schema change is being propagated to system schema tables
in do_merge_schema() at schema_tables.cc.

The data referring to the most recent table schema version is always
present. Other entries are garbage-collected when the corresponding
table schema version is obsoleted (they will be updated with a TTL equal
to `DEFAULT_GC_GRACE_SECONDS` on `ALTER TABLE`).

In case we failed to persist column mapping after a schema change,
missing entries will be recreated on node boot.

Later, the information from this table is used in `paxos_state::learn`
callback in case we have a mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation`
for the accepted proposal.

Such a situation may arise under the following circumstances:
 1. The previous LWT operation crashed on the "accept" stage,
    leaving behind a stale accepted proposal, which waits to be
    repaired.
 2. The table affected by LWT operation is being altered, so that
    schema version is now different. Stored proposal now references
    obsolete schema.
 3. LWT query is retried, so that Scylla tries to repair the
    unfinished Paxos round and apply the mutation in the learn stage.

When such a mismatch happens, prior to this patch the stored
`frozen_mutation` could be applied only if we were lucky enough
that the column_mapping in the mutation was "compatible" with the new
table schema.

It wouldn't work if, for example, the columns are reordered, or
some columns, which are referenced by an LWT query, are dropped.

With this patch we try to look up the column mapping for
the obsolete schema version, then upgrade the stored mutation
using obtained column mapping and apply an upgraded mutation instead.

* git@github.com:ManManson/scylla.git feature/table_schema_history_v7:
  lwt: add column_mapping history persistence tests
  schema: add equality operator for `column_mapping` class
  lwt: store column_mapping's for each table schema version upon a DDL change
  schema_tables: extract `fill_column_info` helper
  frozen_mutation: introduce `unfreeze_upgrading` method
2020-10-15 20:48:29 +02:00
Pavel Solodovnikov
b59ac032c9 lwt: add column_mapping history persistence tests
There are two basic tests, which:
 * Test that column mappings are serialized and deserialized
   properly on both CREATE TABLE and ALTER TABLE
 * Column mappings for obsoleted schema versions are updated
   with a TTL value on schema change

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-10-15 19:25:24 +03:00
Pavel Solodovnikov
81cf11f8a0 schema: add equality operator for column_mapping class
Add a comparator for column mappings that will be used later
in unit-tests to check whether two column mappings match or not.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-10-15 19:24:44 +03:00
Pavel Solodovnikov
055fd3d8ad lwt: store column_mapping's for each table schema version upon a DDL change
This patch introduces a new system table: `system.scylla_table_schema_history`,
which is used to keep track of column mappings for obsolete table
schema versions (i.e. schema becomes obsolete when it's being changed
by means of `CREATE TABLE` or `ALTER TABLE` DDL operations).

It is populated automatically when a new schema version is being
pulled from a remote in get_schema_definition() at migration_manager.cc
and also when schema change is being propagated to system schema tables
in do_merge_schema() at schema_tables.cc.

The data referring to the most recent table schema version is always
present. Other entries are garbage-collected when the corresponding
table schema version is obsoleted (they will be updated with a TTL equal
to `DEFAULT_GC_GRACE_SECONDS` on `ALTER TABLE`).

In case we failed to persist column mapping after a schema change,
missing entries will be recreated on node boot.

Later, the information from this table is used in `paxos_state::learn`
callback in case we have a mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation`
for the accepted proposal.

Such a situation may arise under the following circumstances:
 1. The previous LWT operation crashed on the "accept" stage,
    leaving behind a stale accepted proposal, which waits to be
    repaired.
 2. The table affected by LWT operation is being altered, so that
    schema version is now different. Stored proposal now references
    obsolete schema.
 3. LWT query is retried, so that Scylla tries to repair the
    unfinished Paxos round and apply the mutation in the learn stage.

When such a mismatch happens, prior to this patch the stored
`frozen_mutation` could be applied only if we were lucky enough
that the column_mapping in the mutation was "compatible" with the new
table schema.

It wouldn't work if, for example, the columns are reordered, or
some columns, which are referenced by an LWT query, are dropped.

With this patch we try to look up the column mapping for
the obsolete schema version, then upgrade the stored mutation
using obtained column mapping and apply an upgraded mutation instead.

In case we don't find a column_mapping we just return an error
from the learn stage.

Tests: unit(dev, debug), dtests(paxos_tests.py:TestPaxos.schema_mismatch_*_test)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-10-15 19:24:30 +03:00
Benny Halevy
951fb638a3 SCYLLA-VERSION-GEN: change master version to 4.3.dev
Now that scylla-ccm and scylla-dtest conform to PEP-440
version comparison (See https://www.python.org/dev/peps/pep-0440/)
we can safely change scylla version on master to be the development
branch for the next release.

The version order logic is:
  4.3.dev is followed by
  4.3.rc[i] followed by
  4.3.[n]

Note that also according to
https://blog.jasonantman.com/2014/07/how-yum-and-rpm-compare-versions/
4.3.dev < 4.3.rc[i] < 4.3.[n]
as "dev" < "rc" by alphabetical order
and both "dev" and "rc*" < any number, based on the general
rule that alphabetical strings compare as less than numbers.
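
The ordering rule described above can be sketched as follows. This is a simplified comparison written for illustration, not the actual rpm algorithm (which also handles epochs, tildes and other cases); `split_runs` and `compare_versions` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

static bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
static bool is_alpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }

// Split a version into maximal runs of digits or letters; separators such as '.' are dropped.
std::vector<std::string> split_runs(const std::string& v) {
    std::vector<std::string> runs;
    std::size_t i = 0;
    while (i < v.size()) {
        if (!is_digit(v[i]) && !is_alpha(v[i])) { ++i; continue; }
        const bool digits = is_digit(v[i]);
        std::size_t j = i;
        while (j < v.size() && (digits ? is_digit(v[j]) : is_alpha(v[j]))) ++j;
        runs.push_back(v.substr(i, j - i));
        i = j;
    }
    return runs;
}

// Returns -1, 0 or 1. Alphabetic runs sort before numeric runs; numeric runs compare by value.
int compare_versions(const std::string& a, const std::string& b) {
    const auto ra = split_runs(a);
    const auto rb = split_runs(b);
    const std::size_t n = std::min(ra.size(), rb.size());
    for (std::size_t i = 0; i < n; ++i) {
        const bool da = is_digit(ra[i][0]);
        const bool db = is_digit(rb[i][0]);
        if (da != db) return da ? 1 : -1;  // number beats alphabetic string
        if (da) {
            const unsigned long long x = std::stoull(ra[i]);
            const unsigned long long y = std::stoull(rb[i]);
            if (x != y) return x < y ? -1 : 1;
        } else if (ra[i] != rb[i]) {
            return ra[i] < rb[i] ? -1 : 1;  // "dev" < "rc" alphabetically
        }
    }
    if (ra.size() == rb.size()) return 0;
    return ra.size() < rb.size() ? -1 : 1;
}
```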

Test: unit
Dtest: gating
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201015151153.726637-1-bhalevy@scylladb.com>
2020-10-15 18:32:51 +03:00
Gleb Natapov
30ff874e48 raft: make fsm::become_leader() private
Message-Id: <20201015143634.2807731-4-gleb@scylladb.com>
2020-10-15 16:45:55 +02:00
Gleb Natapov
d2e8181852 raft: remove outdated comments in server_impl::add_entry_internal
Message-Id: <20201015143634.2807731-3-gleb@scylladb.com>
2020-10-15 16:45:54 +02:00
Gleb Natapov
2f38c05b93 raft: fix apply fiber logging to be more consistent
Message-Id: <20201015143634.2807731-2-gleb@scylladb.com>
2020-10-15 16:45:54 +02:00
Botond Dénes
8eb9da397f docs/debugging.md: add thematic debugging guides
Add debugging guides focusing on different aspects of debugging
Scylla.
2020-10-15 16:17:39 +03:00
Botond Dénes
0ded715251 docs/debugging.md: tips and tricks: add section about optimized-out variables 2020-10-15 16:17:39 +03:00
Botond Dénes
a2d47738b1 docs/debugging.md: TLS variables: add missing $ to terminal command 2020-10-15 16:17:35 +03:00
Botond Dénes
d99d031c86 docs/debugging.md: TUI: describe how to switch between windows 2020-10-15 16:17:35 +03:00
Botond Dénes
b867142096 docs/debugging.md: troubleshooting: expand on crash on backtrace
Describe why this happens and add a link to the GDB bug tracker as well
as a workaround on avoiding the crash.
2020-10-15 16:17:29 +03:00
Avi Kivity
8068272b46 build: adjust inlining thresholds for clang too
Commit bc65659a46 adjusted the inlining parameters for gcc. Here
we do the same for clang. With this adjustment, clang lags gcc
by 3% in throughput (perf_simple_query --smp 1) compared to 20%
without it.

The value 2500 was derived by binary search. At 5000 compilation
of storage_proxy never completes, at 1250 throughput is down by
10%.

Closes #7418
2020-10-15 14:09:09 +03:00
Tomasz Grabiec
62d2979888 Merge "raft: snapshot support" from Gleb
Support snapshotting for raft. The patch series only concerns itself
with raft logic, not how a specific state machine implements
take_snapshot() callback.

* scylla-dev/raft-snapshots-v2:
  raft: test: add tests for snapshot functionality
  raft: preserve trailing raft log entries during snapshotting
  raft: implement periodic snapshotting of a state machine
  raft: add snapshot transfer logic
2020-10-15 12:45:30 +02:00
Piotr Grabowski
c8fdb02a13 tests: Add token and non-token conjunction tests
Checks for invalid_request_exception in case of trying to run a query
with both normal and token relations. Tests both orderings of those
relations (normal or token relation first).
2020-10-15 12:32:18 +02:00
Piotr Grabowski
9d1cd2c57b token_restriction: Add non-token merge exception
Add exception that is thrown when merging of token and non-token 
restrictions is attempted. Before this change only merging non-token
and token restriction was validated (WHERE pk = 0 AND token(pk) > 0)
and not the other way (WHERE token(pk) > 0 AND pk = 0).
2020-10-15 12:32:18 +02:00
Gleb Natapov
36c67aef8b raft: test: add tests for snapshot functionality
The patch adds two tests; one for snapshot transfer and another for
snapshot generation.
2020-10-15 11:50:27 +03:00
Gleb Natapov
7fdfa32dbd raft: preserve trailing raft log entries during snapshotting
This patch allows leaving snapshot_trailing entries in the log
when a state machine is snapshotted and raft log entries are dropped.
Those entries can be used to catch up nodes that are slow without
requiring snapshot transfer. The value is part of the configuration
and can be changed.
2020-10-15 11:50:27 +03:00
Gleb Natapov
7c1187b7f5 raft: implement periodic snapshotting of a state machine
The patch implements periodic taking of a snapshot and trimming of
the raft log.

In raft the only way the log of already committed entries can be shortened
is by taking a snapshot of the state machine and dropping log entries
included in the snapshot from the raft log. To keep the log from growing too
large the patch takes the snapshot periodically after applying N
entries, where N can be configured by setting the snapshot_threshold
value in raft's configuration.
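
A toy model of this policy, under stated assumptions: `toy_log` is a hypothetical type, entries are plain integers, and taking a snapshot is reduced to bumping a counter. It also keeps `snapshot_trailing` entries after trimming, as described in the related trailing-entries patch; the real raft code is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical toy log; it only mimics the threshold/trailing bookkeeping.
struct toy_log {
    std::size_t snapshot_threshold;  // snapshot after this many applied entries
    std::size_t snapshot_trailing;   // entries left in the log after a snapshot
    std::deque<int> entries;         // applied entries that have not been trimmed yet
    std::size_t applied_since_snapshot = 0;
    std::size_t snapshots_taken = 0;

    void apply(int entry) {
        assert(snapshot_threshold > 0);
        entries.push_back(entry);
        if (++applied_since_snapshot >= snapshot_threshold) {
            ++snapshots_taken;  // stands in for the state machine's take_snapshot()
            while (entries.size() > snapshot_trailing) {
                entries.pop_front();  // drop entries covered by the snapshot
            }
            applied_since_snapshot = 0;
        }
    }
};
```

With a threshold of 3 and 1 trailing entry, applying entries 1 through 7 takes two snapshots and leaves entries 6 and 7 in the log.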
2020-10-15 11:48:44 +03:00
Gleb Natapov
6ca03585f4 raft: add snapshot transfer logic
This patch adds the logic that detects that a follower misses data from
a snapshot and initiates snapshot transfer in that case. Upon receiving
the snapshot the follower stores it locally and applies it to its state
machine. The code assumes that the snapshot already exists on the
leader.
2020-10-15 11:44:06 +03:00
Avi Kivity
71398f3fb4 Merge "Cleanup sstable writer" from Benny
"
This series cleans up the legacy and common sstable writer code.

metadata_collector::_ancestors were moved to class sstable
so that the former can be moved out of sstable into file_writer_impl.

Moved setting of replay position and sstable level via
sstable_writer_config so that compaction won't need to access
the metadata_collector via the sstable.

With that, metadata_collector could be moved from class sstable
to sstable_writer::writer_impl along with the column_stats.

That allowed moving "generic" file_writer methods that were actually
k/l format specific into sstable_writer_k_l.

Eventually `file_writer` code is moved into sstables/writer.cc
and sstable_writer_k_l into sstables/kl/writer.{hh,cc}

A bonus cleanup is the ability to get rid of
sstable::_correctly_serialize_non_compound_range_tombstones as
it's now available to the writers via the writer configuration
and not required to be stored in the sstable object.

Fixes #3012

Test: unit(dev)
"

* tag 'cleanup-sstable-writer-v2' of github.com:bhalevy/scylla:
  sstables: move writer code away to writer.cc
  sstables: move sstable_writer_k_l away to kl/writer
  sstables: get rid of sstable::_correctly_serialize_non_compound_range_tombstones
  sstables: move writer methods to sstable_writer_k_l
  sstables: move compaction ancestors to sstable
  sstables: sstable_writer: optionally set sstable level via config
  sstables: sstable_writer: optionally set replay position via config
  sstables: compaction: make_sstable_writer_config
  sstables: open code update_stats_on_end_of_stream in sstable_writer::consume_end_of_stream
  sstables: fold components_writer into sstable_writer_k_l
  sstables: move sstable_writer_k_l definition upwards
  sstables: components_writer: turn _index into unique_ptr
2020-10-15 10:40:28 +03:00
Benny Halevy
279865e56c sstables: move writer code away to writer.cc
Move `file_writer` code into sstables/writer.cc

Fixes #3012

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 23:41:47 +03:00
Benny Halevy
20adb96f62 sstables: move sstable_writer_k_l away to kl/writer
Move the sstable_writer_k_l code into sstables/kl/writer.{hh,cc}

Refs #3012

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 23:40:56 +03:00
Benny Halevy
8cd4d53643 sstables: mx/writer: fix copy-paste error in reader_semaphore name
It was copied from sstables.cc in
6ca0464af5.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201014171651.541232-1-bhalevy@scylladb.com>
2020-10-14 22:17:49 +02:00
Benny Halevy
96cd6adc71 sstables: get rid of sstable::_correctly_serialize_non_compound_range_tombstones
Now it's available to the writers via the writer configuration
and not required to be stored in the sstable object.

Refs #3012

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 19:53:23 +03:00
Benny Halevy
97a446f9fa sstables: move writer methods to sstable_writer_k_l
They are called solely from the sstable_writer_k_l path.
With that, move the metadata collector and column stats
to writer_impl.  They are now only used by the sstable
writers.

Refs #3012

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 19:52:17 +03:00
Benny Halevy
e1692bec17 sstables: move compaction ancestors to sstable
Compaction needs access to the sstable's ancestors so we need to
keep the ancestors for the sstable separately from the metadata collector
as the latter is about to be moved to the sstable writer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 19:51:26 +03:00
Benny Halevy
a49a5f36c1 sstables: sstable_writer: optionally set sstable level via config
And use compaction::make_sstable_writer_config to pass
the compaction's `_sstable_level` to the writer
via sstable_writer_config, instead of via the sstable
metadata_collector, which is going to move from the sstable
to the writer_impl.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 19:49:36 +03:00
Benny Halevy
ac3c33ffca sstables: sstable_writer: optionally set replay position via config
And use compaction::make_sstable_writer_config to pass
the compaction's replay_position (`_rp`) to the writer
via sstable_writer_config, instead of via the sstable
metadata_collector, which is going to move from the sstable
to the writer_impl.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 19:39:46 +03:00
Nadav Har'El
de8ff2f089 docs: some minor cleanups in protocols.md
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201014121626.643743-1-nyh@scylladb.com>
2020-10-14 18:14:00 +03:00
Nadav Har'El
509a41db04 alternator: change name of Alternator's SSL options
When Alternator is enabled over HTTPS - by setting the
"alternator_https_port" option - it needs to know some SSL-related options,
most importantly where to pick up the certificate and key.

Before this patch, we used the "server_encryption_options" option for that.
However, this was a mistake: Although it sounds like these are the "server's
options", in fact prior to Alternator this option was only used when
communicating with other servers - i.e., connections between Scylla nodes.
For CQL connections with the client, we used a different option -
"client_encryption_options".

This patch introduces a third option "alternator_encryption_options", which
controls only Alternator's HTTPS server. Making it separate from the
existing CQL "client_encryption_options" allows both Alternator and CQL to
be active at the same time but with different certificates (if the user
so wishes).

For backward compatibility, we temporarily continue to allow
server_encryption_options to control the Alternator HTTPS server if
alternator_encryption_options is not specified. However, this generates
a warning in the log, urging the user to switch. This temporary workaround
should be removed in a future version.
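
The fallback described above might be sketched like this. The `tls_options` struct and `pick_alternator_tls()` function are hypothetical names used for illustration and do not reflect the actual configuration code.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a set of encryption options.
struct tls_options {
    std::string certificate;
    std::string keyfile;
};

// Returns the options to use for Alternator's HTTPS server and a flag telling
// the caller whether to log a warning about the deprecated fallback.
std::pair<tls_options, bool> pick_alternator_tls(
        const std::optional<tls_options>& alternator_encryption_options,
        const tls_options& server_encryption_options) {
    if (alternator_encryption_options) {
        return {*alternator_encryption_options, false};
    }
    // Backward compatibility: fall back to server_encryption_options, with a warning.
    return {server_encryption_options, true};
}
```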

This patch also:
1. fixes the test run code (which has an "--https" option to test over
   https) to use the new name of the option.
2. Adds documentation of the new option in alternator.md and protocols.md -
   previously the information on how to control the location of the
   certificate was missing from these documents.

Fixes #7204.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200930123027.213587-1-nyh@scylladb.com>
2020-10-14 18:13:57 +03:00
Nadav Har'El
4d7c63c50b docs: in protocols.md, clarify CQL+SSL options and defaults
The wording on how CQL with SSL is configured was ambiguous. Clarify the
text to explain that by default, it is *disabled*. We recommend enabling
it on port 9142 - but it's not a "default".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201014144938.653311-1-nyh@scylladb.com>
2020-10-14 18:06:59 +03:00
Benny Halevy
e314eb3f78 sstables: compaction: make_sstable_writer_config
Consolidate the code to make the sstable_writer_config
for sstable writers into a helper method.

Following patches will add the ability to set the
replay position and sstable level via that config
structure.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 18:01:46 +03:00
Benny Halevy
55d73ec2bc sstables: open code update_stats_on_end_of_stream in sstable_writer::consume_end_of_stream
In preparation for moving sstable methods to sstable_writer_k_l
as part of #3012.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 17:46:26 +03:00
Benny Halevy
27e3c03ce2 sstables: fold components_writer into sstable_writer_k_l
It serves no purpose being a different class but being called
by sstable_writer_k_l.

Refs #3012.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 17:40:47 +03:00
Benny Halevy
56a6a4ff17 sstables: move sstable_writer_k_l definition upwards
To facilitate consolidation of components_writer and
some sstable methods into sstable_writer_k_l.

Refs #3012.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 17:13:44 +03:00
Benny Halevy
8f239f8f4c sstables: components_writer: turn _index into unique_ptr
In preparation for folding components_writer into sstable_writer_k_l
in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 17:10:31 +03:00
Botond Dénes
23df38d867 scylla-gdb.py: add pretty-printer for nonwrapping_interval<dht::ring_position>
The patch adds a generic pretty-printer for `nonwrapping_interval<>`
(and `nonwrapping_range<>`) and a specific one for `dht::ring_position`.
Adding support to clustering and partition ranges is just one more step.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201014054013.606141-1-bdenes@scylladb.com>
2020-10-14 16:45:21 +03:00
Calle Wilund
83339f4bac Alternator::streams: Make SequenceNumber monotonically growing
Fixes #7424

AWS sdk (kinesis) assumes SequenceNumbers are monotonically
growing bigints. Since we sort on and use timeuuids, a
"raw" bit representation of these will _not_ fulfill the
requirement. However, we can "unwrap" the timestamp of uuid
msb and give the value as timestamp<<64|lsb, which will
ensure sort order == bigint order.
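
A sketch of the idea, assuming the standard version-1 UUID field layout (time_low in the top 32 bits of the msb, then time_mid, then version and time_hi). The `sequence_number` struct and helper names are hypothetical; the 128-bit value timestamp<<64|lsb is modeled as a (hi, lo) pair compared lexicographically.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Reassemble the 60-bit timestamp from the most significant 64 bits of a version-1 UUID.
uint64_t v1_timestamp(uint64_t msb) {
    const uint64_t time_low = msb >> 32;             // least significant timestamp bits
    const uint64_t time_mid = (msb >> 16) & 0xFFFF;
    const uint64_t time_hi  = msb & 0x0FFF;          // top 4 bits of this field hold the version
    return (time_hi << 48) | (time_mid << 32) | time_low;
}

// Hypothetical 128-bit value: hi holds the timestamp, lo holds the UUID's lsb.
struct sequence_number {
    uint64_t hi;
    uint64_t lo;
};

bool operator<(const sequence_number& a, const sequence_number& b) {
    return std::tie(a.hi, a.lo) < std::tie(b.hi, b.lo);
}

sequence_number to_sequence_number(uint64_t msb, uint64_t lsb) {
    return sequence_number{v1_timestamp(msb), lsb};
}
```

The raw msb is not monotonic in time because time_low occupies its highest bits: 0xFFFFFFFF00001000 is an earlier timestamp than 0x0000000000011000, yet it is the larger integer.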
2020-10-14 16:45:21 +03:00
Calle Wilund
3f800d68c6 alternator::streams: Ensure shards are reported in string lexical order
Fixes #7409

AWS kinesis Java sdk requires/expects shards to be reported in
lexical order, and even worse, ignores lastevalshard. Thus not
upholding said order will break their stream introspection badly.

Added asserts to unit tests.

v2:
* Added more comments
* use unsigned_cmp
* unconditional check in streams_test
2020-10-14 16:45:21 +03:00
Avi Kivity
f10debc48c Update seastar submodule
* seastar 35c255dcd...6973080cd (2):
  > Merge "memory: improve memory diagnostics dumped on allocation failures" from Botond
  > map_reduce: use get0 rather than get
2020-10-14 16:45:21 +03:00
Benny Halevy
b3f46e9cbf test: serialized_action_test: add test_serialized_action_exception
Tests that the exceptional future returned by the serialized action
is propagated to trigger, reproducing #7352.

The test fails without the previous patch:
"serialized_action: trigger: include also semaphore status to promise"

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 16:45:21 +03:00
Benny Halevy
f3fc81751f serialized_action: trigger: propagate action error
Currently, the serialized_action error is set to a shared_promise,
but is not returned to the caller, unless there is an
already outstanding action.

Note that setting the exception to the promise when no one
collected it via the shared_future caused an 'Exceptional future ignored'
warning to be issued, as seen in #7352.

Fixes #7352

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 16:45:21 +03:00
Benny Halevy
81d2f60df9 serialized_action: trigger: include also semaphore status to promise
Currently, if `with_semaphore` returns exceptional future, it is not
propagated to the promise, and other waiters that got a shared
future will not see that.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-14 16:45:21 +03:00
Avi Kivity
86bbf1763d Merge "reader concurrency semaphore: dump permit diagnostics on timeout or queue overflow" from Botond
"
The reader concurrency semaphore timing out or its queue being overflown
are fairly common events both in production and in testing. At the same
time it is a hard to diagnose problem that often has a benign cause
(especially during testing), but it is equally possible that it points
to something serious. So when this error starts to appear in logs,
usually we want to investigate and the investigation is lengthy...
either involves looking at metrics or coredumps or both.

This patch intends to jumpstart this process by dumping a diagnostics on
semaphore timeout or queue overflow. The diagnostics is printed to the
log with debug level to avoid excessive spamming. It contains a
histogram of all the permits associated with the problematic semaphore
organized by table, operation and state.

Example:

DEBUG 2020-10-08 17:05:26,115 [shard 0] reader_concurrency_semaphore -
Semaphore _read_concurrency_sem: timed out, dumping permit diagnostics:
Permits with state admitted, sorted by memory
memory  count   name
3499M   27      ks.test:data-query

3499M   27      total

Permits with state waiting, sorted by count
count   memory  name
1       0B      ks.test:drain
7650    0B      ks.test:data-query

7651    0B      total

Permits with state registered, sorted by count
count   memory  name

0       0B      total

Total: permits: 7678, memory: 3499M

This allows determining several things at a glance:
* What are the tables involved
* What are the operations involved
* Where is the memory

This can speed up a follow-up investigation greatly, or it can even be
enough on its own to determine that the issue is benign.

Tests: unit(dev, debug)
"

* 'dump-diagnostics-on-semaphore-timeout/v2' of https://github.com/denesb/scylla:
  reader_concurrency_semaphore: dump permit diagnostics on timeout or queue overflow
  utils: add to_hr_size()
  reader_concurrency_semaphore: link permits into an intrusive list
  reader_concurrency_semaphore: move expiry_handler::operator()() out-of-line
  reader_concurrency_semaphore: move constructors out-of-line
  reader_concurrency_semaphore: add state to permits
  reader_concurrency_semaphore: name permits
  querier_cache_test: test_immediate_evict_on_insert: use two permits
  multishard_combining_reader: reader_lifecycle_policy: add permit param to create_reader()
  multishard_combining_reader: add permit parameter
  multishard_combining_reader: shard_reader: use multishard reader's permit
2020-10-13 12:44:23 +03:00
Botond Dénes
18454e4a80 reader_concurrency_semaphore: dump permit diagnostics on timeout or queue overflow
The reader concurrency semaphore timing out or its queue being overflown
are fairly common events both in production and in testing. At the same
time it is a hard to diagnose problem that often has a benign cause
(especially during testing), but it is equally possible that it points
to something serious. So when this error starts to appear in logs,
usually we want to investigate and the investigation is lengthy...
either involves looking at metrics or coredumps or both.

This patch intends to jumpstart this process by dumping a diagnostics on
semaphore timeout or queue overflow. The diagnostics is printed to the
log with debug level to avoid excessive spamming. It contains a
histogram of all the permits associated with the problematic semaphore
organized by table, operation and state.

Example:

DEBUG 2020-10-08 17:05:26,115 [shard 0] reader_concurrency_semaphore -
Semaphore _read_concurrency_sem: timed out, dumping permit diagnostics:
Permits with state admitted, sorted by memory
memory  count   name
3499M   27      ks.test:data-query

3499M   27      total

Permits with state waiting, sorted by count
count   memory  name
1       0B      ks.test:drain
7650    0B      ks.test:data-query

7651    0B      total

Permits with state registered, sorted by count
count   memory  name

0       0B      total

Total: permits: 7678, memory: 3499M

This allows determining several things at a glance:
* What are the tables involved
* What are the operations involved
* Where is the memory

This can speed up a follow-up investigation greatly, or it can even be
enough on its own to determine that the issue is benign.
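
The histogram described above can be sketched roughly as follows. This is an illustration only: `permit_info`, `permit_totals` and `summarize_permits` are hypothetical names, and the real semaphore keeps its permits in an intrusive list rather than a vector.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// The three permit states named in this series.
enum class permit_state { registered, waiting, admitted };

// Hypothetical flattened view of a single permit.
struct permit_info {
    std::string name;       // "<keyspace>.<table>:<operation>"
    permit_state state;
    std::uint64_t memory;   // bytes currently accounted to the permit
};

struct permit_totals {
    std::uint64_t count = 0;
    std::uint64_t memory = 0;
};

// Group permits by (state, name), summing count and memory per group.
inline std::map<std::pair<permit_state, std::string>, permit_totals>
summarize_permits(const std::vector<permit_info>& permits) {
    std::map<std::pair<permit_state, std::string>, permit_totals> hist;
    for (const auto& p : permits) {
        auto& totals = hist[{p.state, p.name}];
        ++totals.count;
        totals.memory += p.memory;
    }
    return hist;
}
```

Printing one table per state, sorted by memory or by count, is then a matter of iterating the map.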
2020-10-13 12:32:14 +03:00
Botond Dénes
0994e8b5e2 utils: add to_hr_size()
This utility function converts a potentially large number to a compact
representation, composed of at most 4 digits and a letter appropriate to
the power of two the number has to be multiplied by to arrive at the
original number (with some loss of precision).

The different powers of two are the conventional 2 ** (N * 10) variants:
* N=0: (B)ytes
* N=1: (K)bytes
* N=2: (M)bytes
* N=3: (G)bytes
* N=4: (T)bytes

Examples:
* 87665 will be converted to 87K
* 1024 will be converted to 1K
2020-10-13 12:32:14 +03:00
Botond Dénes
27bbf5566d reader_concurrency_semaphore: link permits into an intrusive list 2020-10-13 12:32:14 +03:00
Botond Dénes
fdb93ae0fd reader_concurrency_semaphore: move expiry_handler::operator()() out-of-line
Soon we will want to add more logic to this now simple handler, move it
out-of-line in preparation.
2020-10-13 12:32:14 +03:00
Botond Dénes
85bfd28f4e reader_concurrency_semaphore: move constructors out-of-line
Soon, the semaphore will have a field that will not have a publicly
available definition. Move the constructor out-of-line in preparation.
2020-10-13 12:32:13 +03:00
Botond Dénes
70fa543c31 reader_concurrency_semaphore: add state to permits
Instead of a simple boolean, designating whether the permit was already
admitted or not, add a proper state field with a value for all the
different states the permit can be in. Currently there are three such
states:
* registered - the permit was created and started accounting resource
  consumption.
* waiting - the permit was queued to wait for admission.
* admitted - the permit was successfully admitted.

The state will be used for debugging purposes, both during coredump
debugging as well as for dumping diagnostics data about permits.
2020-10-13 12:32:13 +03:00
Botond Dénes
ff623e70b3 reader_concurrency_semaphore: name permits
Require a schema and an operation name to be given to each permit when
created. The schema is that of the table the read is executed against, and
the operation name identifies the operation the permit is part of.
Ideally this should be different for each site the permit is created at,
to be able to discern not only different kinds of reads, but also the
different code paths the read took.

As not all reads can be associated with one schema, the schema is allowed
to be null.

The name will be used for debugging purposes, both for coredump
debugging and runtime logging of permit-related diagnostics.
2020-10-13 12:32:13 +03:00
Takuya ASADA
ff129ee030 install.sh: set LC_ALL=en_US.UTF-8 on python3 thunk
scylla-python3 segfaults when a non-default locale is specified.
As a workaround for this, we need to set LC_ALL=en_US.UTF-8 on python3 thunk.

Fixes #7408

Closes #7414
2020-10-13 09:38:25 +03:00
Vlad Zolotarov
aec70d9953 cql3/statements/batch_statement.cc: improve batch size warning message
Make the warning message clearer:
 * Include the number of partitions affected by the batch.
 * Be clear that the warning is about the batch size in bytes.

Fixes #7367

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>

Closes #7417
2020-10-13 09:02:51 +03:00
Avi Kivity
3451579d81 sstables: move component_type formatter to namespace sstables
Without this, clang complains that we violate argument dependent
lookup rules:

  note: 'operator<<' should be declared prior to the call site or in namespace 'sstables'
  std::ostream& operator<<(std::ostream&, const sstables::component_type&);

we can't enforce the #include order, but we can easily move it
to namespace sstables (where it belongs anyway), so let's do that.

gcc is happy either way.
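
A small illustration of the lookup rule involved, using a made-up `demo` namespace and enum rather than the real sstables types. Because the formatter lives in the same namespace as the enum, argument-dependent lookup finds it from generic code regardless of #include order:

```cpp
#include <ostream>
#include <sstream>
#include <string>

namespace demo {

enum class component_type { Data, Index };

// Declared in the same namespace as the enum, so argument-dependent
// lookup finds it wherever a component_type is streamed.
inline std::ostream& operator<<(std::ostream& os, component_type c) {
    switch (c) {
    case component_type::Data:  return os << "Data";
    case component_type::Index: return os << "Index";
    }
    return os << "Unknown";
}

} // namespace demo

// Generic code that only writes `os << value`; the overload is resolved
// at instantiation through the namespace of the argument's type.
template <typename T>
std::string format_value(const T& value) {
    std::ostringstream os;
    os << value;
    return os.str();
}
```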

Closes #7413
2020-10-12 21:49:25 +02:00
Tomasz Grabiec
29cf7fde03 Merge 'sstables: prepare bound_kind_m formatter for clang' from Avi Kivity
bound_kind_m's formatter violates argument dependent lookup rules
according to clang, so fix that. Along the way improve the formatter a
little.

Closes #7412

* git://github.com/avikivity/scylla.git avikivity-bound_kind_m-formatter:
  sstables: move bound_kind_m formatter to namespace sstables
  sstables: move bound_kind_m formatter to its natural place
  sstables: deinline bound_kind_m formatter
2020-10-12 21:47:53 +02:00
Avi Kivity
5065ae835f sstables: move bound_kind_m formatter to namespace sstables
Without this, clang complains that we violate argument dependent
lookup rules:

  note: 'operator<<' should be declared prior to the call site or in namespace 'sstables'
  std::ostream& operator<<(std::ostream&, const sstables::bound_kind_m&);

we can't enforce the #include order, but we can easily move it
to namespace sstables (where it belongs anyway), so let's do that.

gcc is happy either way.
2020-10-12 20:38:11 +03:00
Avi Kivity
a00fca1a69 sstables: move bound_kind_m formatter to its natural place
Move bound_kind_m's formatter to the same header file where
it is defined. This prevents cases where the compiler decays
the type (an enum) to the underlying integral type because it
does not see the formatter declaration, resulting in the wrong
output.
2020-10-12 20:36:10 +03:00
Avi Kivity
69c3533d97 sstables: deinline bound_kind_m formatter
The formatter is by no means hot code and should not be inlined.
2020-10-12 20:35:08 +03:00
Juliusz Stasiewicz
0251cb9b31 transport: Update connection_stage in system.clients 2020-10-12 18:44:00 +02:00
Juliusz Stasiewicz
6abe1352ba transport: Retrieve driver's name and version from STARTUP message 2020-10-12 18:37:19 +02:00
Juliusz Stasiewicz
d2d162ece3 transport: Notify system.clients about "protocol_version" 2020-10-12 18:32:00 +02:00
Piotr Dulikowski
77a0f1a153 hints: don't read hint files when it's not allowed to send
When there are hint files to be sent and the target endpoint is DOWN,
end_point_hints_manager works in the following loop:

- It reads the first hint file in the queue,
- For each hint in the file it decides that it won't be sent because the
  target endpoint is DOWN,
- After realizing that there are some unsent hints, it decides to retry
  this operation after sleeping 1 second.

This causes the first segment to be wholly read over and over again,
with 1 second pauses, until the target endpoint becomes UP or leaves the
cluster. This causes unnecessary I/O load in the streaming scheduling
group.

This patch adds a check which prevents end_point_hints_manager from
reading the first hint file at all when it is not allowed to send hints.

First observed in #6964

Tests:
- unit(dev)
- hinted handoff dtests

Closes #7407
2020-10-12 19:09:57 +03:00
Botond Dénes
40c5474022 querier_cache_test: test_immediate_evict_on_insert: use two permits
The test currently uses a single permit shared between two simulated
reads (to wait admission twice). This is not a supported way of using a
permit and will stop working soon as we make the states the permit is in
more pronounced.
2020-10-12 15:56:56 +03:00
Botond Dénes
307cdf1e0d multishard_combining_reader: reader_lifecycle_policy: add permit param to create_reader()
Allow the evictable reader managing the underlying reader to pass its
own permit to it when creating it, making sure they share the same
permit. Note that the two parts can still end up using different
permits, when the underlying reader is kept alive between two pages of a
paged read and thus keeps using the permit received on the previous
page.

Also adjust the `reader_context` in multishard_mutation_query.cc to use
the passed-in permit instead of creating a new one when creating a new
reader.
2020-10-12 15:56:56 +03:00
Botond Dénes
e09ab09fff multishard_combining_reader: add permit parameter
Don't create its own permit, take one as a parameter, like all other
readers do, so the permit can be provided by the higher layer, making
sure all parts of the logical read use the same permit.
2020-10-12 15:56:56 +03:00
Botond Dénes
600f1c7853 multishard_combining_reader: shard_reader: use multishard reader's permit
Don't create a new permit per shard reader, pass down the multishard
reader's one to be used by each shard reader. They all belong to the
same read, they should use the same permit. Note that despite its name
the shard readers are the local representation of a reader living on a
remote shard and as such they live on the same shard the multishard
combining reader lives on.
2020-10-12 15:56:56 +03:00
Avi Kivity
73718414e3 data/cell: fix value_writer use before definition
Clang parses templates more eagerly than gcc, so it fails on
some forward-declared templates. In this case, value_writer
was forward-declared and then used in data::cell. As it also
uses some definitions local to data::cell, it cannot be
defined before it as well as after it.

To solve the problem, we define it as a nested class so
it can use other local definitions, yet be defined before it
is used. No code changes.

Closes #7401
2020-10-12 13:41:09 +03:00
Avi Kivity
da3e51d7b8 build: use c++20 for all C++ files, not just those that use the seastar flags
A few source files (like those generated by antlr) don't build with seastar,
and so don't inherit all of its flags. They then use the compiler default dialect,
not C++20. With gcc that's just fine, since gcc supports concepts in earlier dialects,
but clang requires C++20.

Fix by forcing --std=gnu++20 for all files (same as what Seastar chooses).

Closes #7392
2020-10-12 13:16:27 +03:00
Avi Kivity
affa234151 types: don't linearize ascii during validation
ascii has no inter-byte dependencies and so can
be validated fragment by fragment, reducing large
contiguous allocations.
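
A minimal sketch of the idea, assuming the value is available as a list of `std::string_view` fragments (the real code uses Scylla's own fragmented buffer types; the function names here are hypothetical):

```cpp
#include <string_view>
#include <vector>

// True if every byte in one fragment is 7-bit ASCII.
inline bool is_ascii_fragment(std::string_view frag) {
    for (unsigned char c : frag) {
        if (c > 0x7f) {
            return false;
        }
    }
    return true;
}

// Validate each fragment on its own. No byte depends on its neighbours,
// so the fragments never need to be copied into one contiguous buffer.
inline bool validate_ascii_fragmented(const std::vector<std::string_view>& fragments) {
    for (std::string_view frag : fragments) {
        if (!is_ascii_fragment(frag)) {
            return false;
        }
    }
    return true;
}
```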

Fixes #7393.

Closes #7394
2020-10-12 13:15:24 +03:00
Gleb Natapov
9d7c81c1b8 raft: fix boost/raft_fsm_test compilation
Message-Id: <20201011063802.GA2628121@scylladb.com>
2020-10-12 12:09:21 +02:00
Takuya ASADA
d5ff82dc61 scylla_setup: skip iotune when developer_mode is enabled
When developer mode is automatically enabled in nonroot mode, we should skip
iotune since the parameter won't be used.

Closes #7327
2020-10-12 11:08:10 +03:00
Botond Dénes
d35b0c06da configure.py: add space before appending -ffile-prefix-map to user cflags
Otherwise, it concatenates it to the last user provided cflag, creating
a gibberish flag that gcc will choke on.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201012073523.305271-1-bdenes@scylladb.com>
2020-10-12 10:40:02 +03:00
Nadav Har'El
977da3567f Merge 'Alternator streams: Fix shard lengths, parenting, expiration, filter useless ones and improve paging' from Calle Wilund
The remains of the defunct #7246.

Fixes #7344
Fixes #7345
Fixes #7346
Fixes #7347

Shard ID length is now within limits.
Shard end sequence number should be set when appropriate.
Shard parent is selected a bit more carefully (sorting)
Shards are filtered by time to exclude cdc generations we cannot get data from (too old)
Shard paging improved

Closes #7348

* github.com:scylladb/scylla:
  test_streams: Add some more sanity asserts
  alternator::streams: Set dynamodb data TTL explicitly in cdc options
  alternator::streams: Improve paging and fix parent-child calculation
  alternator::streams: Remove table from shard_id
  alternator::streams: Filter our cdc streams older than data/table
  alternator::error: Add a few dynamo exception types
2020-10-12 09:43:12 +03:00
Avi Kivity
4d6739c2e6 Merge "Use max_concurrent_for_each" from Benny
"
max_concurrent_for_each was added to seastar for replacing
sstable_directory::parallel_for_each_restricted by using
more efficient concurrency control that doesn't create
unlimited number of continuations.

The series replaces the use of sstable_directory::parallel_for_each_restricted
with max_concurrent_for_each and exposes the sstable_directory::do_for_each_sstable
via a static method.

This method is used here by table::snapshot to limit concurrency
of snapshot operations that suffer from the same unbounded
concurrency problem sstable_directory solved.

In addition sstable_directory::_load_semaphore that was used
across calls to do_for_each_sstable was replaced by a static per-shard
semaphore that caps concurrency across all calls to `do_for_each_sstable`
on that shard.  This makes sense since the disk is a shared resource.

In the future, we may want to have a load semaphore per device rather than
a single global one.  We should experiment with that.

Test: unit(dev)
"

* tag 'max_concurrent_for_each-v5' of github.com:bhalevy/scylla:
  table: snapshot: use max_concurrent_for_each
  sstable_directory: use a external load_semaphore
  test: sstable_directory_test: extract sstable_directory creation into with_sstable_directory
  distributed_loader: process_upload_dir: use initial_sstable_loading_concurrency
  sstables: sstable_directory: use max_concurrent_for_each
2020-10-12 09:43:12 +03:00
Avi Kivity
54386efe9e build: add libicui18n library for clang
The build with clang fails with

  ld.lld: error: undefined symbol: icu_65::Collator::createInstance(icu_65::Locale const&, UErrorCode&)
  >>> referenced by like_matcher.cc
  >>>               build/dev/utils/like_matcher.o:(boost::re_detail_106900::icu_regex_traits_implementation::icu_regex_traits_implementation(icu_65::Locale const&))
  >>> referenced by like_matcher.cc
  >>>               build/dev/utils/like_matcher.o:(boost::re_detail_106900::icu_regex_traits_implementation::icu_regex_traits_implementation(icu_65::Locale const&))

That symbol lives in libicui18n. It's not clear why clang fails to resolve it and gcc succeeds (after all,
both use lld as the linker) but it is easier to add the library than to attempt to figure out the
discrepancy.

Closes #7391
2020-10-11 22:14:00 +03:00
Avi Kivity
8d3fcdc600 serializer.hh: remove unneeded semicolon after function definition
Closes #7390
2020-10-11 22:12:04 +03:00
Avi Kivity
dfffa4dc71 utils: big_decimal: work around clang difficulty with boost::cpp_int(string_view) constructor
Clang has some difficulty with the boost::cpp_int constructor from string_view.
In fact it is a mess of enable_if<>s so a human would have trouble too.

Work around it by converting to std::string. This is bad for performance, but
this constructor is not going to be fast in any case.

Hopefully a fix will arrive in clang or boost.

Closes #7389
2020-10-11 22:09:19 +03:00
Bentsi Magidovich
7be252e929 dist: fix incorrect AWS user-data url
We used http://169.254.169.254/latest/meta-data/user-data
but the correct one is http://169.254.169.254/latest/user-data
Fixes: https://github.com/scylladb/scylla-machine-image/issues/63

Closes #7388
2020-10-11 18:20:54 +03:00
Avi Kivity
00864b26c3 query-result-writer: fix idl definition order related failures with clang
Following ad48d8b43c, fix a similar problem which popped up
with higher inlining thresholds in query-result-writer.hh. Since
idl/query depends on idl/keys, it must follow in definition order.

Closes #7384
2020-10-11 17:57:12 +03:00
Avi Kivity
1145462a05 cql3: select_statement: fix undefined pointer arithmetic
We add std::distance(...) + 1 to a vector iterator, but
the vector can be empty, so we're adding a non-zero value
to nullptr, which is undefined behavior.

Rearrange to perform the limit (std::min()) before adding
to the pointer.
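
The pattern can be sketched generically as below. `take_prefix` is a hypothetical helper, not the select_statement code; it only shows clamping the count before doing iterator arithmetic:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the first `wanted` elements, or all of them if fewer exist.
inline std::vector<int> take_prefix(const std::vector<int>& v, std::size_t wanted) {
    // Clamp the count first, so the iterator arithmetic below never
    // steps past end(), including when `v` is empty.
    const std::size_t n = std::min(wanted, v.size());
    return std::vector<int>(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(n));
}
```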

Found by clang's ubsan.

Closes #7377
2020-10-11 17:54:08 +03:00
Avi Kivity
610fa83f28 test: database_test: fix threading confusion
database_test contains several instances of calling do_with_cql_test_env()
with a function that expects to be called in a thread. This mostly works
because there is an internal thread in do_with_cql_test_env(), but is not
guaranteed to.

Fix by switching to the more appropriate do_with_cql_test_env_thread().

Closes #7333
2020-10-11 17:44:30 +03:00
Avi Kivity
b172e4c2ce sstables: make index_bound a non-nested struct
Due to a longstanding bug in clang[1], the compiler doesn't think
that such a class is default-constructible. This causes
std::optional<index_bound>::optional() not to compile. Because it
depends on open_tt_marker, extract that too.

[1] https://stackoverflow.com/questions/47974898/clang-5-stdoptional-instantiation-screws-stdis-constructible-trait-of-the-p

Closes #7387
2020-10-11 17:40:01 +03:00
Avi Kivity
58e02c216a test: sstable_datafile_test: sstable_run_based_compaction_test: prevent use of uninitialized variable observer
The variable 'observer' (an std::optional) may be left uninitialized
if 'incremental_enabled' is false. However, it is used afterwards
with a call to disconnect, accessing garbage.

Fix by accessing it via the optional wrapper. A call to optional::reset()
destroys the observable, which in turn calls disconnect().

Closes #7380
2020-10-11 17:36:08 +03:00
Avi Kivity
af8fd8c8d8 utils: build_id: fix ubsan false positive on pointer arithmetic
get_nt_build_id() constructs a pointer by adding a base and an
offset, but if the base happens to be zero, that is undefined
under C++ rules (although legal ELF).

Fix by performing the addition on integers, and only then
casting to a pointer.
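
In isolation the fix looks roughly like this (`address_at` is a hypothetical name, not the function in build_id):

```cpp
#include <cstdint>

// Compute base + offset in integer space and cast to a pointer only once,
// so no pointer arithmetic is ever performed on a null base.
inline const void* address_at(std::uintptr_t base, std::uintptr_t offset) {
    return reinterpret_cast<const void*>(base + offset);
}
```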

Closes #7379
2020-10-11 17:23:40 +03:00
Avi Kivity
a36eb586ea cql3: selection: don't use gcc extension "typeof"
typeof is not recognized by clang. Use the modern equivalent "decltype"
instead.

Closes #7386
2020-10-11 17:21:15 +03:00
Avi Kivity
15ab6a3feb test: cql_repl: use boost::regex instead of std::regex to avoid stack overflow
libstdc++'s std::regex uses recursion[1], with a depth controlled by the
input. Together with clang's debug mode, this overflows the stack.

Use boost::regex instead, which is immune to the problem.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86164

Closes #7378
2020-10-11 17:12:21 +03:00
Avi Kivity
4fd0ba24ea Update seastar submodule
* seastar ebcb3aeec...35c255dcd (1):
  > append_challenged_posix_file_impl: allow destructing file with no queued work
Fixes #7285.
2020-10-11 16:49:03 +03:00
Avi Kivity
7d025b5cf4 utils: log_heap: relax check for clang's sanitizer
b1e78313fe added a check for ubsan to squelch a false positive,
but that check doesn't work with clang. Relax it to check for debug
mode, so clang doesn't hit the same false positive as gcc did.

Define a SANITIZE macro so we have a reliable way to detect if
we're running with a sanitizer.

Closes #7372
2020-10-11 16:07:16 +03:00
Avi Kivity
882ed2017a test: network_topology_strategy_test: fix overflow in d2t()
d2t() scales a fraction in the range [0, 1] to the range of
a biased token (same as unsigned long). But x86 doesn't support
conversion to unsigned, only signed, so this is a truncating
conversion. Clang's ubsan correctly warns about it.

Fix by reducing the range before converting, and expanding it
afterwards.
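
A sketch of the "reduce, convert, expand" approach, assuming a 64-bit token range. `fraction_to_u64` is a hypothetical name, and the saturation at 1.0 is an addition for this example rather than something the commit describes:

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Scale a fraction in [0, 1] to the full range of a 64-bit unsigned value.
inline std::uint64_t fraction_to_u64(double d) {
    const double half_range = std::ldexp(1.0, 63);   // 2^63, exactly representable
    const double scaled = d * half_range;            // reduced range: [0, 2^63]
    if (scaled >= half_range) {
        // Only reached for d == 1.0; saturate instead of overflowing.
        return std::numeric_limits<std::uint64_t>::max();
    }
    // scaled < 2^63, so it fits in a signed 64-bit integer; convert, then
    // expand back to the full unsigned range by doubling.
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(scaled)) << 1;
}
```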

Closes #7376
2020-10-11 16:05:02 +03:00
Avi Kivity
8932c4e919 compaction: allow _max_sstable_size = 0
Some tests (run_based_compaction_test at least) use _max_sstable_size = 0
in order to force one partition per sstable. That triggers an overflow
when calculating the expected bloom filter size. The overflow doesn't
matter for normal operation, because the result later appears on a
divisor, but does trigger a ubsan error.

Squelch the error by not dividing by zero here.

I tried using _max_sstable_size = 1, but the test failed for other
reasons.

Closes #7375
2020-10-11 15:43:51 +03:00
Avi Kivity
fc1fcaa11e lua: expect overflow when selecting lua types
When converting a value to its Lua representation, we choose
an integer type if it fits. If it doesn't, we fall back to a
more expensive type. So we explicitly try to trigger an overflow.

However, clang's ubsan doesn't like the overflow, and kills the
test. Tell it that the overflow is expected here.

Closes #7374
2020-10-11 15:38:07 +03:00
Avi Kivity
6bc6db8037 utils/array-search: document restrictions
Our AVX2 implementation cannot load a partial vector,
or mask unused elements (that can be done with AVX-512/SVE2),
so it has some restrictions. Document them.

Closes #7385
2020-10-11 15:19:54 +03:00
Avi Kivity
3e2707c2bf utils: fragmented_temporary_buffer: don't add to potentially null pointers
Offsetting a null pointer is undefined, and clang's ubsan complains.

Rearrange the arithmetic so we never offset a null pointer. A function
is introduced for the remaining contiguous bytes so it can cast the result
to size_t, avoiding a compare-of-different-signedness warning from gcc.

Closes #7373
2020-10-11 15:05:15 +03:00
Benny Halevy
d55985bb7d build: Upgrade to seastar API level 6
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201011105422.818623-2-bhalevy@scylladb.com>
2020-10-11 14:40:32 +03:00
Benny Halevy
064aae8ffa flush_queue: call_helper: support no variadic futures
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201011105422.818623-1-bhalevy@scylladb.com>
2020-10-11 14:40:32 +03:00
Avi Kivity
4c63723ead types: tighten digit count requirement on time nanoseconds components
When the number of nanosecond digits is greater than 9, the std::pow()
expression that corrects the nanosecond value becomes infinite. This
is because sstring::length() is unsigned, and so negative values
underflow and become large.

Following Cassandra, fix by forbidding more than 9 digits of
nanosecond precision.
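
A minimal sketch of the tightened check, using integer scaling instead of std::pow(). `parse_nanos` is a hypothetical helper that takes only the digits after the decimal point; it is not the parser in types:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Parse the fractional-seconds digits of a time value into nanoseconds.
// Returns std::nullopt for empty input, non-digits, or more than 9 digits.
inline std::optional<std::uint32_t> parse_nanos(std::string_view digits) {
    if (digits.empty() || digits.size() > 9) {
        return std::nullopt;   // reject before computing 9 - size(), which would wrap
    }
    std::uint32_t value = 0;
    for (char c : digits) {
        if (c < '0' || c > '9') {
            return std::nullopt;
        }
        value = value * 10 + static_cast<std::uint32_t>(c - '0');
    }
    // Pad to 9 digits: "5" means 500000000 ns, "123" means 123000000 ns.
    for (std::size_t i = digits.size(); i < 9; ++i) {
        value *= 10;
    }
    return value;
}
```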

Found by clang's ubsan.

Closes #7371
2020-10-11 14:13:46 +03:00
Rafael Ávila de Espíndola
a3bd546197 types: Work around a clang thread-local code generation bug (user_type)
Following 5d249a8e27, apply the same
fix for user_type_impl.

This works around https://bugs.llvm.org/show_bug.cgi?id=47747

Depending on this might be unstable, as the bug can show up at any
corner, but this is sufficient right now to get
test_user_function_disabled to pass.

Closes #7370
2020-10-11 12:36:38 +03:00
Avi Kivity
6fbfff7b31 Update seastar submodule
* seastar c62c4a3df...ebcb3aeec (1):
  > Merge "map_reduce: futurize_invoke reducer" from Benny
2020-10-11 12:17:06 +03:00
Benny Halevy
a0b5529441 flush_queue: use futurator::invoke
Attend to the following warning with Seastar_API_LEVEL 5+:
```
./utils/flush_queue.hh:68:36: warning: ‘static seastar::futurize<T>::type seastar::futurize<T>::apply(Func&&, FuncArgs&& ...) [with Func = test_queue_ordering_random_ops::run_test_case()::<lambda(int)>::<lambda(int)>; FuncArgs = {int}; T = void; seastar::futurize<T>::type = seastar::future<>]’ is deprecated: Use invoke for varargs [-Wdeprecated-declarations]
   68 |             return futurator::apply(std::forward<Func>(func), f.get());
```

Test: flush_queue(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201007112130.474269-1-bhalevy@scylladb.com>
2020-10-11 12:14:17 +03:00
Nadav Har'El
87cfdb69c6 Merge 'cql3: use larger stack for do_with_cql_parser() in debug mode' from Avi Kivity
Our cql parser uses large amounts of stack, and can overflow it
in debug mode with clang. To prevent this stack overflow,
temporarily use a larger (1MB) stack.

Closes #7369

* github.com:scylladb/scylla:
  cql3: use larger stack for do_with_cql_parser() in debug mode
  cql3: deinline do_with_cql_parser()
2020-10-11 11:29:06 +03:00
Avi Kivity
c41905e986 utils: array-search: deinline, working around clang bug
Clang has a bug processing inline ifuncs with intrinsics[1].
Since ifuncs can't be inlined anyway (they are always dispatched
via a function pointer that is determined based on the CPU
features present), nothing is gained by inlining them. Deinlining
therefore reduces compile time and works around the clang bug.

[1] https://bugs.llvm.org/show_bug.cgi?id=47691

Closes #7358
2020-10-11 10:29:24 +03:00
Avi Kivity
cb6231d1e2 cql3: use larger stack for do_with_cql_parser() in debug mode
Our cql parser uses large amounts of stack, and can overflow it
in debug mode with clang. To prevent this stack overflow,
temporarily use a larger (1MB) stack.

We can't use seastar::thread(), since do_with_cql_parser() does
not yield. We can't use std::thread(), since lw_shared_ptr()'s
debug mode will scream murder at an lw_shared_ptr used across
threads (even though it's perfectly safe in this case). We
can't use boost::context2 since that requires the library to
be compiled with address sanitizer support, which it isn't on
Fedora. So we use a fiber switch using the getcontext() function
family. This requires extra annotations for debug mode, which are
added.
2020-10-10 00:31:50 +03:00
Avi Kivity
31886bc562 cql3: deinline do_with_cql_parser()
The cql parser causes trouble with the sanitizers and clang,
since it consumes a large amount of stack space (it does so
with gcc too, but does not overflow our 128k stacks). In
preparation for working around the problem, deinline it
so the hacks need not spread to the entire code base
via #include.

There is no performance impact from the virtual function,
as cql parsing will dominate the call.
2020-10-09 23:49:42 +03:00
Tomasz Grabiec
d2dd2b1ef9 Merge "raft: declarative raft testing" from Alejo
Raft tests with declarative structure instead of procedural.

* https://github.com/alecco/scylla/tree/raft-ale-tests-03d:
  raft: log failed test case name
  raft: test add hasher
  raft: declarative tests
  raft: test make app return proper exit int value
  raft: test add support for disconnected server
  raft: tests use custom server ids for easier debugging
  raft: make election_elapsed public for testing
  raft: test remove unnecessary header
  raft: fix typo snaphot snapshot
2020-10-09 16:01:52 +02:00
Alejo Sanchez
5d408082b6 raft: log failed test case name
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-10-09 15:50:47 +02:00
Alejo Sanchez
664b3eddb1 raft: test add hasher
So far, values seen by nodes were summed, but this does not guarantee
that the order of these values was respected.

Use a digest to check output, implicitly checking order.

On the other hand, sum or a simple positional checksum like Fletcher's
is easier to debug as rolling sum is evident.
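
To illustrate the trade-off, here is a small sketch comparing a plain sum with a Fletcher-style positional checksum. Both functions are illustrative only and are not the hasher used by the test:

```cpp
#include <cstdint>
#include <vector>

// Order-insensitive: any permutation of the same values gives the same result.
inline std::uint64_t plain_sum(const std::vector<int>& values) {
    std::uint64_t sum = 0;
    for (int v : values) {
        sum += static_cast<std::uint64_t>(v);
    }
    return sum;
}

// Fletcher-style checksum: `a` is a rolling sum and `b` is a sum of the
// rolling sums, so the position of each value affects the result.
inline std::uint32_t fletcher_like(const std::vector<int>& values) {
    std::uint32_t a = 0;
    std::uint32_t b = 0;
    for (int v : values) {
        a = (a + static_cast<std::uint32_t>(v)) % 65535u;
        b = (b + a) % 65535u;
    }
    return (b << 16) | a;
}
```

For the inputs {1, 2, 3} and {3, 2, 1} the plain sums are equal, while the positional checksums differ because `b` accumulates 1+3+6 in one case and 3+5+6 in the other.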

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-10-09 15:50:42 +02:00
Alejo Sanchez
670824c6fa raft: declarative tests
For convenience making Raft tests, use declarative structures.

Servers are set up and initialized and then updates are processed.
For now, updates are just adding entries to leader and change of leader.

Updates and leader changes can be specified to run after initial test setup.

An example test for 3 nodes, node 0 starting as leader having two entries
0 and 1 for term 1, and with current term 2, then adding 12 entries,
changing leader to node 1, and adding 12 more entries. The test will
automatically add more entries to the last leader until the test limit
of total_values (default 100).

    {.name = "test_name", .nodes = 3, .initial_term = 2,
    .initial_states = {{.le = {{1,0},{1,1}}}},
    .updates = {entries{12},new_leader{1},entries{12}},},

Leader is isolated before change via is_leader returning false.
Initial leader (default server 0) will be set with this method, too.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-10-09 15:50:31 +02:00
Alejo Sanchez
7d4b33d834 raft: test make app return proper exit int value
Seastar app returns int result exit value.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-10-09 15:50:24 +02:00
Alejo Sanchez
093bc8fbb3 raft: test add support for disconnected server
Failure detector support of disconnected servers with a global set of
addresses.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-10-09 15:50:02 +02:00
Alejo Sanchez
21d7686766 raft: tests use custom server ids for easier debugging
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-10-09 15:49:57 +02:00
Alejo Sanchez
9f401c517e raft: make election_elapsed public for testing
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-10-09 15:49:52 +02:00
Alejo Sanchez
56683ae689 raft: test remove unnecessary header
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-10-09 15:49:45 +02:00
Alejo Sanchez
1bff357816 raft: fix typo snaphot snapshot
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-10-09 15:49:39 +02:00
Pekka Enberg
266d2b6f71 Update tools/jmx submodule
* tools/jmx c55f3f2...c51906e (1):
  > StorageService.java: Use the endpoint for getRangeToEndpointMap
2020-10-08 12:09:24 +03:00
Amnon Heiman
48c3c94aa6 api/storage_service.cc: Add the get_range_to_endpoint_map
The get_range_to_endpoint_map method takes a keyspace and returns a map
between the token ranges and the endpoints.

It is used by some external tools for repair.

Token ranges are coded as size-2 arrays; if the start or end is empty, it is
added as an empty string.

The implementation uses get_range_to_address_map and repacks its result
accordingly.

stream_range_as_array is used to reduce the risk of large
allocations and stalls.

Relates to scylladb/scylla-jmx#36

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes #7329
2020-10-08 12:09:09 +03:00
Benny Halevy
1ba9e253c4 table: snapshot: use max_concurrent_for_each
Tables may have thousands of sstables and a number of component files
for each sstable.  Using parallel_for_each on all sstables (and
parallel_for_each in sstables::create_links for each file)
needlessly overloads the system with an unbounded number of continuations.

Use max_concurrent_for_each and acquire the db sst_dir_semaphore to
limit parallelism.

Note that although snapshot is called after scylla has already
loaded the sstables, we use the configured initial_sstable_loading_concurrency().

As a future follow-up we may want to define yet another config
variable for on-going operations on sstable directories
if we see that it warrants a different setting than the initial
loading concurrency.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-08 11:57:06 +03:00
Benny Halevy
57cc5f6ae1 sstable_directory: use a external load_semaphore
Although each sstable_directory limits concurrency using
max_concurrent_for_each, there could be a large number
of calls to do_for_each_sstable running in parallel
(e.g. per keyspace X per table in the distributed_loader).

To cap parallelism across sstable_directory instances and
concurrent calls to do_for_each_sstable, start a sharded<semaphore>
and pass a shared semaphore& to the sstable_directory:s.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-08 11:57:06 +03:00
Benny Halevy
dc46aaa3fd test: sstable_directory_test: extract sstable_directory creation into with_sstable_directory
Use common code to create, start, and stop the sharded<sstable_directory>
for each test.

This will be used in the next patch for creating a sharded semaphore
and passing it to the sstable_directory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-08 11:57:06 +03:00
Takuya ASADA
ec68f67d7e dist/debian/debian_files_gen.py: don't ignore permission error on shutil.rmtree()
shutil.rmtree(ignore_errors=True) was meant to ignore the error when the directory does not exist,
but it also ignores permission errors, so we shouldn't use it.
Run os.path.exists() before shutil.rmtree() instead.

Fixes #7337

Closes #7338
2020-10-08 11:49:10 +03:00
Pekka Enberg
db6bb1ba91 Update tools/java submodule
* tools/java 4313155ab6...f2e8666d7e (1):
  > dist/debian/debian_files_gen.py: don't ignore permission error on shutil.rmtree()
2020-10-08 11:49:01 +03:00
Pekka Enberg
02bf30e9f5 Update tools/jmx submodule
* tools/jmx e3a381d...c55f3f2 (1):
  > dist/debian/debian_files_gen.py: don't ignore permission error on shutil.rmtree()
2020-10-08 11:48:57 +03:00
Pekka Enberg
6c133e36d8 Merge 'build: prepare for clang' from Avi Kivity
This series prepares the build system for clang support. It deals with
the different sets of warnings accepted by clang and gcc, and with
detecting clang 10 as a supported compiler.

It's still not possible to build with clang after this, but we're
another step closer.

Closes #7269

* github.com:scylladb/scylla:
  build: detect and allow clang 10 as a compiler
  build: detect availablity of -Wstack-usage=
  build: disable many clang-specific warnings
2020-10-08 10:16:12 +03:00
Avi Kivity
767e30927c test: suppress ubsan true-positive on rapidjson
rapidjson has a harmless (but true) ubsan violation. It was fixed
in 16872af889.

Since rapidjson hasn't released since 2016, we're unlikely to see
the fix, so suppress it to prevent the tests failing. In any case
the violation is harmless.

gcc's ubsan doesn't object to the addition.

Closes #7357
2020-10-07 19:27:49 +03:00
Gleb Natapov
0bff15a976 raft: Send multiple entries in one append_entry rpc
Send more than one entry in a single append_entry message, but
limit the size of one packet according to the append_request_threshold parameter.
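
A minimal sketch of the batching rule, not the actual raft code. The
helper name entries_in_next_batch and the use of plain byte sizes are
hypothetical, and the rule that at least one entry is always sent (even
if it alone exceeds the threshold) is an assumption made here so that
replication can always make progress.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns how many entries, starting at index `first`,
// go into the next append request without exceeding `threshold` bytes.
// Assumption: one entry is always sent, even if it is larger than the
// threshold on its own.
std::size_t entries_in_next_batch(const std::vector<std::size_t>& entry_sizes,
                                  std::size_t first,
                                  std::size_t threshold) {
    std::size_t count = 0;
    std::size_t bytes = 0;
    for (std::size_t i = first; i < entry_sizes.size(); ++i) {
        // Stop once adding this entry would push the batch over the limit.
        if (count > 0 && bytes + entry_sizes[i] > threshold) {
            break;
        }
        bytes += entry_sizes[i];
        ++count;
    }
    return count;
}
```

The caller would send that many entries in one message and call the
helper again from the next unsent index.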

Message-Id: <20201007142602.GA2496906@scylladb.com>
2020-10-07 16:43:33 +02:00
Nadav Har'El
bff6fccc9f Update seastar submodule
Updated for the ability to add group names to SMP service groups
(https://github.com/scylladb/seastar/pull/809).

* seastar 8c8fd3ed...c62c4a3d (3):
  > smp service group: add optional group name
  > dpdk: mark link_ready() function override
  > Merge "sharded: make start, stop, and invoke_on methods noexcept" from Benny
2020-10-07 15:59:48 +03:00
Nadav Har'El
f30e86395a Merge 'table: fix race and exception handling in on_compaction_completion()' from Avi Kivity
Fix a race condition in on_compaction_completion() that can prevent shutdown,
as well as an exception handling error. See individual patches for details.

Fixes #7331.

Closes #7334

* github.com:scylladb/scylla:
  table: fix mishandled _sstable_deleted_gate exception in on_compaction_completion
  table: fix on_compaction_completion corrupting _sstables_compacted_but_not_deleted during self-race
2020-10-07 15:27:59 +03:00
Benny Halevy
f4269e3a04 distributed_loader: process_upload_dir: use initial_sstable_loading_concurrency
Although process_upload_dir is not called when initially loading
the tables, but rather from storage_service::load_new_sstables,
it can use the same sstable_loading_concurrency, rather than the constant `4`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-07 14:45:20 +03:00
Benny Halevy
c26c784882 sstables: sstable_directory: use max_concurrent_for_each
Use max_concurrent_for_each instead of parallel_for_each in
sstable_directory::parallel_for_each_restricted to avoid
creating potentially thousands of continuations,
one for each sstable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-10-07 14:45:20 +03:00
Calle Wilund
349c5ee21a test_streams: Add some more sanity asserts
Checking validity of returned shard sets etc.
2020-10-07 08:43:39 +00:00
Calle Wilund
1ed864ce4c alternator::streams: Set dynamodb data TTL explicitly in cdc options
They should be the same by default, but setting it explicitly protects
us from any changing defaults.
2020-10-07 08:43:39 +00:00
Calle Wilund
04deacd7e7 alternator::streams: Improve paging and fix parent-child calculation
Fixes #7345
Fixes #7346

Do a more efficient collection skip when doing paging, instead of
iterating the full sets.

Ensure some semblance of sanity in the parent-child relationship
between shards by ensuring token order sorting and finding the
apparent previous ID covering the approximate range of new gen.

Fix endsequencenumber generation by looking at whether we are
last gen or not, instead of the (not filled in) 'expired' column.
2020-10-07 08:43:39 +00:00
Calle Wilund
3cdd7fe191 alternator::streams: Remove table from shard_id
Fixes #7344

This data is not really needed, as shard_id:s are not required
to be unique across streams, and also because of the length limit
on the shard_id text representation.

As a side effect, shard iter instead carries the stream arn.
2020-10-07 08:43:39 +00:00
Pekka Enberg
16ed6fee40 Update tools/jmx submodule
* tools/jmx 25bcd76...e3a381d (1):
  > install.sh: show warning nonroot mode when systemd does not support user mode
2020-10-07 11:39:03 +03:00
Botond Dénes
db56ae695c types: validate(): linearize values lazily
Instead of eagerly linearizing all values as they are passed to
validate(), defer linearization to those validators that actually need
linearized values. Linearizing large values puts pressure on the memory
allocator with large contiguous allocation requests. This is something
we are actively trying to avoid, especially if it is not really needed.
It turns out that the types whose validators really want linearized values
are a minority, as most validators just look at the size of the value, and
some, like bytes, don't need validation at all while usually having large
values.

This is achieved by templating the validator struct on the view and
using the FragmentedRange concept to treat all passed in views
(`bytes_view` and `fragmented_temporary_buffer_view`) uniformly.
This patch makes no attempt at converting existing validators to work
with fragmented buffers, only trivial cases are converted. The major
offenders still left are ascii/utf8 and collections.
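
A rough sketch of the idea, using only the standard library. The names
fragmented_view, linearize and the two validators are stand-ins invented
for this illustration and are not the types used in the patch; the 4-byte
rule for the integer validator is likewise just an assumed example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Stand-in for a fragmented buffer view: one value stored as several chunks.
using fragmented_view = std::vector<std::string_view>;

// Counts linearizations so that their cost is visible in this sketch.
int linearize_calls = 0;

std::size_t total_size(const fragmented_view& v) {
    std::size_t n = 0;
    for (std::string_view frag : v) {
        n += frag.size();
    }
    return n;
}

// Copies all fragments into one contiguous string (the expensive step).
std::string linearize(const fragmented_view& v) {
    ++linearize_calls;
    std::string flat;
    flat.reserve(total_size(v));
    for (std::string_view frag : v) {
        flat.append(frag.data(), frag.size());
    }
    return flat;
}

// Size-only validator (assumed rule: a 32-bit integer is exactly 4 bytes).
// It never needs the bytes to be contiguous.
bool validate_int32(const fragmented_view& v) {
    return total_size(v) == 4;
}

// Validator that, in this sketch, still asks for contiguous bytes and
// therefore linearizes on demand.
bool validate_ascii(const fragmented_view& v) {
    const std::string flat = linearize(v);
    for (unsigned char c : flat) {
        if (c > 0x7f) {
            return false;
        }
    }
    return true;
}
```

Only validate_ascii() pays for the contiguous copy; validate_int32()
works on the fragments directly.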

Fixes: #7318

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201007054524.909420-1-bdenes@scylladb.com>
2020-10-07 11:00:18 +03:00
Piotr Grabowski
369895b80f transport: Delay NEW_NODE until CQL listen started
After adding a new node to the cluster, Scylla sends a NEW_NODE event
to CQL clients. Some clients immediately try to connect to the new node,
but the connection fails because the node has not yet started listening
for CQL requests.

In contrast, Apache Cassandra waits for the new node to start its CQL
server before sending NEW_NODE event. In practice this means that
NEW_NODE and UP events will be sent "jointly" after new node is UP.

This change is implemented in the same manner as in Apache Cassandra
code.

Fixes #7301.

Closes #7306
2020-10-07 09:57:27 +03:00
Rafael Ávila de Espíndola
5d249a8e27 types: Work around a clang thread-local code generation bug
This works around https://bugs.llvm.org/show_bug.cgi?id=47747

Depending on this might be unstable, as the bug can show up at any
corner, but this is sufficient right now to get
test_user_function_disabled to pass.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20201007000713.1503302-1-espindola@scylladb.com>
2020-10-07 09:49:53 +03:00
Calle Wilund
f1ad66218a alternator::streams: Filter our cdc streams older than data/table
Fixes #7347

If cdc stream id:s are older than either table creation or now - 24h
we can skip them in describe_stream, to minimize the amount of
shards being returned.
2020-10-07 06:13:28 +00:00
Juliusz Stasiewicz
acf0341e9b transport: On successful authentication add username to system.clients
The username becomes known in the course of resolving challenges
from `PasswordAuthenticator`. That's why username is being set on
successful authentication; until then all users are "anonymous".
Meanwhile, `AllowAllAuthenticator` (the default) does not request
username, so users logged with it will remain as "anonymous" in
`system.clients`.

Shuffling of code was necessary to unify existing infrastructure
for INSERTing entries into `system.clients` with later UPDATEs.
2020-10-06 18:52:46 +02:00
Avi Kivity
4bbcc81cfe Merge "Use local reference on query_processor in tracing" from Pavel E
"
There are a few places left that call for the global query processor instance;
tracing is one of them.

The query processor is used mainly in table_helper, so this set mostly
shuffles its methods' arguments to deliver the needed reference. At the
end the main.cc code is patched to provide the query processor, which
is still global and not stopped, and is thus safe to be used anywhere.

tests: unit(dev), dtest(cql_tracing:dev)
"

* 'br-tracing-vs-query-processor' of https://github.com/xemul/scylla:
  tracing: Keep qp anchor on backend
  tracing: Push query processor through init methods
  main: Start tracing in main
  table_helper: Require local query processor in calls
  table_helper: Use local qp as setup_table argument
  table_helper: Use local db variable
2020-10-06 18:04:24 +03:00
Avi Kivity
c6a3fa5a49 Merge "querier_cache: use the querier's permit for memory accounting" from Botond
"
The querier cache has a memory based eviction mechanism, which starts
evicting freshly inserted queriers once their collective memory
consumption goes above the configured limit. For determining the memory
consumption of individual queriers, the querier cache uses
`flat_mutation_reader::buffer_size()`. But we now have a much more
comprehensive accounting of the memory used by queriers: the reader
permit, which also happens to be available in each querier. So use this
to determine the querier's memory consumption instead.

Tests: unit(dev)
"

* 'querier-cache-use-permit-for-memory-accounting/v1' of https://github.com/denesb/scylla:
  flat_mutation_reader: de-virtualize buffer_size()
  querier_cache: use the reader permit for memory accounting
  querier_cache_test: use local semaphore not the test global one
  reader_permit: add consumed_resources() accessor
2020-10-06 16:52:44 +03:00
Calle Wilund
5081d354be alternator::error: Add a few dynamo exception types
2020-10-06 12:52:58 +00:00
Pavel Emelyanov
e7f74449a6 tracing: Keep qp anchor on backend
The query processor is required in the table_helpers used by tracing. Now
everything is ready to push the query processor reference from main down
to the table helpers.

Because of the current initialization sequence it's only possible to have
the started query processor at the .start_tracing() time. Earlier, when
the sharded<tracing> is started the query processor is not yet started,
so tracing keeps a pointer on local query processor.

When tracing is stopped, the pointer is null-ed. This is safe (but an
assert is put when dereferencing it), because on stop trace writes' gate
is closed and the query processor is only used in them.

Also there's still a chance that tracing remains started in case of start
abort, but this is on-par with the current code -- sharded query processor
is not stopped, so the memory is not freed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-06 15:45:19 +03:00
Pavel Emelyanov
87f1223965 tracing: Push query processor through init methods
The goal is to make tracing keyspace helper reference query processor, so this
patch adds the needed arguments through the initialization stack.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-06 15:45:12 +03:00
Pavel Emelyanov
b5f136c651 main: Start tracing in main
Move the tracing::start_tracing() out of the storage_service::join_cluster. It
anyway happens at the end of the join, so the logic is not changed, but it
becomes possible to patch tracing further.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-06 15:44:59 +03:00
Pavel Emelyanov
b18522a7ab table_helper: Require local query processor in calls
Keeping the query processor reference on the table_helper in RAII manner
seems wasteful: the only user of it -- the trace_keyspace_helper -- has
a bunch of helpers on board, and each would then keep its own copy for no
gain.

At the same time the trace_keyspace_helper already gets the query processor
for its needs, so it can share one with table_helper-s.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-06 15:44:20 +03:00
Pavel Emelyanov
f5d39b9638 table_helper: Use local qp as setup_table argument
The goal is to make table_helper API require the query_processor
reference and use it where needed. The .setup_table() is private
method, and still grabs the query processor reference itself. Since
its futures do not reshard, it's safe to carry the query processor
reference through.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-06 15:44:00 +03:00
Pavel Emelyanov
2f69e90fc9 table_helper: Use local db variable
The .setup_keyspace() method already has the db variable in this
continuation lambda.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-06 15:43:54 +03:00
Tomasz Grabiec
46b7ba8809 Merge "Bring memory footprint test back to work" from Pavel Emelyanov
The test was broken by recent sstables manager rework. In the middle
the sstables::test_env is destroyed without being closed which leads
to broken _closing assertion inside ~sstables_manager().

Fix is to use the test_env::do_with helper.

tests: perf.memory_footprint

* https://github.com/xemul/scylla/tree/br-memory-footprint-test-fix:
  test/perf/memory_footprint: Fix indentation after previous patch
  test/perf/memory_footprint: Don't forget to close sstables::test_env after usage
2020-10-06 11:49:03 +02:00
Pavel Emelyanov
8bceb916ea test/perf/memory_footprint: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-06 11:08:09 +03:00
Pavel Emelyanov
3e4de0f748 test/perf/memory_footprint: Don't forget to close sstables::test_env after usage
After recent sstables manager rework the sstables::test_env must be .close()d
after usage, otherwise the ~sstables_manager() hits the _closing assertion.

Do it with the help of .do_with(). The execution context is already seastar::async
in this place, so .get() it explicitly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-06 11:06:35 +03:00
Pavel Emelyanov
8558339c63 perf_collection: Add test for full scan time
Scan here means walking the collection forward using iterator.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-06 09:57:37 +03:00
Pavel Emelyanov
7284469b24 perf_collection: Add test for destruction with .clear()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-06 09:57:37 +03:00
Pavel Emelyanov
72ccc43380 perf_collection: Add test for single element insertion
In some cases a collection is used to keep several elements,
so it's good to know this timing.

For example, a mutation_partition keeps a set of rows: if used
in cache it can grow large, if used in a mutation to apply, it's
typically small. Plain replacement of bst with b-tree caused
performance degradation of mutation application because b-tree
is only better at big sizes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-06 09:57:37 +03:00
Pavel Emelyanov
207e1aa48f perf_collection: Add intrusive_set_external_comparator
This collection is widely used, any replacement should be
compared against it to better understand pros-n-cons.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-06 09:57:37 +03:00
Pavel Emelyanov
2d09864627 perf_collection: Clear collection between itartions
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-06 09:57:37 +03:00
Pavel Emelyanov
c891f274dc test: Generalize perf_bptree into perf_collection
Rename into perf_collection and localize the B+ code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-10-06 09:57:37 +03:00
Avi Kivity
0ef85a102f table: fix mishandled _sstable_deleted_gate exception in on_compaction_completion
on_compaction_completion tries to handle a gate_closed_exception, but
with_gate() throws rather than creating an exceptional future, so
the extra handling is lost. This is relatively benign since it will
just fail the compaction, requiring that work to be redone later.

Fix by using the safer try_with_gate().
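
A toy model of the difference, with no Seastar involved. toy_future,
enter_throwing and enter_returning are hypothetical stand-ins written for
this note; they only mimic the two failure styles described above (a
synchronous throw versus a failure carried by the returned future).

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>

struct gate_closed_error : std::runtime_error {
    gate_closed_error() : std::runtime_error("gate closed") {}
};

// Toy stand-in for a future<>: either ready or holding an exception.
struct toy_future {
    std::exception_ptr error;

    // Runs `handler` only if this future holds an exception.
    toy_future handle_exception(const std::function<void(std::exception_ptr)>& handler) {
        if (error) {
            handler(error);
        }
        return toy_future{};
    }
};

// Failure style the message attributes to with_gate(): a closed gate is
// reported by throwing, before any future exists to attach a handler to.
toy_future enter_throwing(bool gate_closed) {
    if (gate_closed) {
        throw gate_closed_error();
    }
    return toy_future{};
}

// Failure style of try_with_gate(): a closed gate is reported through the
// returned future, so a chained handler gets to see it.
toy_future enter_returning(bool gate_closed) {
    if (gate_closed) {
        return toy_future{std::make_exception_ptr(gate_closed_error())};
    }
    return toy_future{};
}
```

With enter_throwing() the exception propagates past the chained handler;
with enter_returning() the handler runs.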
2020-10-06 08:31:28 +03:00
Avi Kivity
a43d5079f3 table: fix on_compaction_completion corrupting _sstables_compacted_but_not_deleted during self-race
on_compaction_completion() updates _sstables_compacted_but_not_deleted
through a temporary to avoid an exception causing a partial update:

  1. copy _sstables_compacted_but_not_deleted to a temporary
  2. update temporary
  3. do dangerous stuff
  4. move temporary to _sstables_compacted_but_not_deleted

This is racy when we have parallel compactions, since step 3 yields.
We can have two invocations running in parallel, taking snapshots
of the same _sstables_compacted_but_not_deleted in step 1, each
modifying it in different ways, and only one of them winning the
race and assigning in step 4. With the right timing we can end
with extra sstables in _sstables_compacted_but_not_deleted.

Before a5369881b3, this was a benign race (only resulting in
deleted file space not being reclaimed until the service is shut
down), but afterwards, extra sstable references result in the service
refusing to shut down. This was observed in database_test in debug
mode, where the race more or less reliably happens for system.truncated.

Fix by using a different method to protect
_sstables_compacted_but_not_deleted. We unconditionally update it,
and also unconditionally fix it up (on success or failure) using
seastar::defer(). The fixup includes a call to rebuild_statistics()
which must happen every time we touch the sstable list.
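
One way to picture the new scheme, as a sketch only: deferred_action is a
minimal scope guard in the spirit of seastar::defer(), and the list holds
plain integers instead of sstables. complete_compaction and its arguments
are hypothetical; the assumption is that each invocation adds and later
removes only its own entries, so parallel invocations no longer overwrite
each other's view of the list.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Minimal scope guard: runs `fn` when the guard goes out of scope, whether
// the scope is left normally or by an exception.
class deferred_action {
    std::function<void()> _fn;
public:
    explicit deferred_action(std::function<void()> fn) : _fn(std::move(fn)) {}
    deferred_action(const deferred_action&) = delete;
    deferred_action& operator=(const deferred_action&) = delete;
    ~deferred_action() { _fn(); }
};

// Stand-in for _sstables_compacted_but_not_deleted.
std::vector<int> compacted_but_not_deleted;

// Registers `ids` unconditionally, does the risky work, and always removes
// the same ids again on the way out (success or failure).
void complete_compaction(const std::vector<int>& ids, bool fail) {
    compacted_but_not_deleted.insert(compacted_but_not_deleted.end(),
                                     ids.begin(), ids.end());
    deferred_action fixup([&ids] {
        for (int id : ids) {
            auto it = std::find(compacted_but_not_deleted.begin(),
                                compacted_but_not_deleted.end(), id);
            if (it != compacted_but_not_deleted.end()) {
                compacted_but_not_deleted.erase(it);
            }
        }
    });
    if (fail) {
        throw std::runtime_error("simulated failure in the dangerous step");
    }
}
```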

Fixes #7331.
2020-10-06 08:29:34 +03:00
Botond Dénes
dd372c8457 flat_mutation_reader: de-virtualize buffer_size()
The main user of this method, the one which required this method to
return the collective buffer size of the entire reader tree, is now
gone. The remaining two users just use it to check the size of the
reader instance they are working with.
So de-virtualize this method and reduce its responsibility to just
returning the buffer size of the current reader instance.
2020-10-06 08:22:56 +03:00
Botond Dénes
cd8d10873f querier_cache: use the reader permit for memory accounting
The querier cache has a memory limit it enforces on cached queriers. For
determining how much memory each querier uses, it currently uses
`flat_mutation_reader::buffer_size()`. However, we now have a much more
complete accounting of the memory each read consumes, in the form of the
reader permit, which also happens to be handy in the queriers. So use it
instead of the not very well maintained `buffer_size()`.
2020-10-06 08:22:56 +03:00
Botond Dénes
f7eea06f61 querier_cache_test: use local semaphore not the test global one
In the mutation source, which creates the reader for this test, the
global test semaphore's permit was passed to the created reader
(`tests::make_permit()`). This caused reader resources to be accounted
on the global test semaphore, instead of the local one the test creates.
Just forward the permit passed to the mutation sources to the reader to
fix this.
2020-10-06 08:22:56 +03:00
Botond Dénes
73a6b97c75 reader_permit: add consumed_resources() accessor
That allows querying the amount of resources accounted through this
permit, and by extension by this logical read.
2020-10-06 08:18:42 +03:00
Nadav Har'El
421f0c729d merge: counters: Avoid signed integer overflow
Merged patch series by Tomasz Grabiec:

UBSAN complains in debug mode when the counter value overflows:

  counters.hh:184:16: runtime error: signed integer overflow: 1 + 9223372036854775807 cannot be represented in type 'long int'
  Aborting on shard 0.

Overflow is supposed to be supported. Let's silence it by using casts.

Fixes #7330.

Tests:

  - build/debug/test/tools/cql_repl --input test/cql/counters_test.cql

Tomasz Grabiec (2):
  counters: Avoid signed integer overflow
  test: cql: counters: Add tests reproducing signed integer overflow in
    debug mode

 counters.hh                   |  2 +-
 test/cql/counters_test.cql    |  9 ++++++++
 test/cql/counters_test.result | 48 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 58 insertions(+), 1 deletion(-)
2020-10-05 21:43:19 +03:00
Tomasz Grabiec
f01ffe063a test: cql: counters: Add tests reproducing signed integer overflow in debug mode
Reproduces #7330
2020-10-05 20:06:34 +02:00
Tomasz Grabiec
d9b9952d7c counters: Avoid signed integer overflow
UBSAN complains in debug mode when the counter value overflows:

counters.hh:184:16: runtime error: signed integer overflow: 1 + 9223372036854775807 cannot be represented in type 'long int'
Aborting on shard 0.

Overflow is supposed to be supported.

Let's silence it by using casts.
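
For illustration, a cast-based wrapping addition could look like the
sketch below. The function name wrapping_add is hypothetical and this is
not the exact change made in counters.hh; it only shows why casting
through an unsigned type avoids the UBSAN report.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Adds two counter values with wrap-around semantics. Unsigned arithmetic
// is defined to wrap modulo 2^64, so the addition itself is not undefined
// behaviour. Converting the result back to int64_t is modular in C++20 and
// implementation-defined (two's complement on mainstream compilers) before.
inline int64_t wrapping_add(int64_t a, int64_t b) {
    return static_cast<int64_t>(static_cast<uint64_t>(a) + static_cast<uint64_t>(b));
}
```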

Fixes #7330.
2020-10-05 20:04:09 +02:00
Alejo Sanchez
6b38ecc6e0 raft: Forbid server address 0 as it has special meaning
Server address UUID 0 is not a valid server id since there is code that
assumes if server_id is 0 the value is not set (e.g _voted_for).

Prevent users from manually setting this invalid value.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-10-05 15:04:46 +02:00
Konstantin Osipov
532343f09e raft: Fix the bug with not setting the current leader.
When AppendEntries/InstallSnapshot term is the same as the current
server's, and the current server's leader is not set, we should
assign it to avoid starting an election if the current leader becomes idle.

Restructure the code accordingly - change candidate state to Follower
upon InstallSnapshot.
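
The leader assignment described above can be sketched as follows. The
types and the function on_leader_message are hypothetical simplifications
(terms and ids as plain integers, no log handling), not the fsm code.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

using term_t = uint64_t;
using server_id = uint64_t;

// Hypothetical, stripped-down per-server state.
struct server_state {
    term_t current_term = 0;
    std::optional<server_id> current_leader;
};

// Handles the term of an AppendEntries/InstallSnapshot message from `from`.
// Returns false if the message comes from a stale term and is rejected.
bool on_leader_message(server_state& st, term_t msg_term, server_id from) {
    if (msg_term < st.current_term) {
        return false;
    }
    if (msg_term > st.current_term) {
        // Newer term: adopt it together with its leader.
        st.current_term = msg_term;
        st.current_leader = from;
    } else if (!st.current_leader) {
        // Same term, leader not known yet: this is the case the fix covers.
        st.current_leader = from;
    }
    return true;
}
```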
2020-10-05 15:04:45 +02:00
Gleb Natapov
a9674a197b raft: Get back probe_sent logic in progress::PROBE state.
It was erroneously replaced by logic based on time, which caused us
to send one probe per tick, which was not the intention at all. There can
be one outstanding probe message, but the moment it gets a reply the next
one should be sent without waiting for a tick.
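
A small sketch of the intended behaviour, under the assumption that a
single boolean flag is enough to model it. probe_state and its member
names are hypothetical and do not mirror the tracker code.

```cpp
#include <cassert>

// Hypothetical per-follower state while in the PROBE state.
struct probe_state {
    bool probe_sent = false;

    // Sends a probe only if none is outstanding. Returns true if one was sent.
    bool try_send_probe() {
        if (probe_sent) {
            return false;
        }
        probe_sent = true;
        return true;
    }

    // A reply clears the flag, so the next probe can go out immediately
    // instead of waiting for the next tick.
    void on_reply() {
        probe_sent = false;
    }
};
```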
2020-10-05 15:04:44 +02:00
Avi Kivity
4f30c479f3 Merge "token_metadata cleanup" from Benny
"
Misc. cleanups and minor optimizations of token_metadata methods
in preparation to futurizing parts of the api around
update_pending_ranges and abstract_replication_strategy::calculate_natural_endpoints,
to prevent reactor stalls on these paths

Test: unit(dev)
"

* 'token_metadata_cleanup' of github.com:bhalevy/scylla:
  token_metadata: get rid of unused calculate_pending_ranges_for_* methods
  token_metadata: get rid of clone_after_all_settled
  token_metadata_impl: remove_endpoint: do not sort tokens
  token_metadata_impl: always sort_tokens in place
2020-10-05 13:31:59 +03:00
Takuya ASADA
0f786f05fe install.sh: logging to scylla-server.log when journalctl --user does not work
On some environments such as CentOS 8, journalctl --user -xe does not work
since journald is running in volatile mode.
The issue cannot be fixed in non-root mode, so as a workaround we log to
a file instead of the journal.

Also added scylla_logrotate to ExecStartPre, which renames the previous log file,
since StandardOutput=file:/path/to/file will erase the existing file when the
service is restarted.

Fixes #7131

Closes #7326
2020-10-05 13:17:27 +03:00
Avi Kivity
d72465531e build: use consistent version-release strings across submodules
Instead of relying on SCYLLA-VERSION-GEN to be consistently
updated in each submodule, propagate the top-level product-version-release
to all submodules. This reduces the churn required for each release,
and makes the release strings consistent (previously, the git hash in each
was different).

Closes #7268
2020-10-05 12:32:49 +03:00
Nadav Har'El
8e2e2eab7c alternator test: tests for nested attributes in FilterExpression
Alternator does not yet support direct access to nested attributes in
expressions (this is issue #5024). But it's still good to have tests
covering this feature, to make it easier to check the implementation
of this feature when it comes.

Until now we did not have tests for using nested attributes in
*FilterExpression*. This patch adds a test for the straightforward case,
and also adds tests for the more elaborate combination of FilterExpression
and ProjectionExpression. This combination - see issue #6951 - means that
some attributes need to be retrieved despite not being projected (because
they are needed in a filter). When we support nested attributes there will
be special cases when the projected and filtered attributes are parts of
the same top-level attribute, so the code will need to handle those cases
correctly. As I was working on issue #6951 now, it is a good time to write
a test for these special cases, even if nested attributes aren't yet
supported - so we don't forget to handle these special cases later.

Both new tests pass on DynamoDB, and xfail on Alternator.

Refs #5024 (nested attributes)
Refs #6951 (FilterExpression with ProjectionExpression)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-10-05 02:19:22 +03:00
Nadav Har'El
a403356ade alternator test: fix comment
A comment in test/alternator/test_lsi.py wrongly described the schema
of one of the test tables. Fix that comment.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-10-05 02:19:22 +03:00
Nadav Har'El
85cc535792 alternator tests: additional tests for filter+projection combination
This patch provides two more tests for issue #6951. As this issue was
already fixed, the two new tests pass.

The two new tests check two special cases which were handled correctly
but not yet tested - when the projected attribute is a key attribute of
the table or of one of its LSIs. Having these two additional tests will
ensure that any future refactoring or optimizations in this area of
the code (filtering, projection, and their combination) will not break these
special cases.

Refs #6951.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-10-05 02:19:22 +03:00
Nadav Har'El
2fc3a30b45 alternator: forbid combining old and new-style parameters
The DynamoDB API has for the Query and Scan requests two filtering
syntaxes - the old (QueryFilter or ScanFilter) and the new (FilterExpression).
Also for projection, it has an old syntax (AttributesToGet) and a new
one (ProjectionExpression). Combining an old-style and new-style parameter
is forbidden by DynamoDB, and should also be forbidden by Alternator.
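
As an illustration of the rule, and not the Alternator implementation,
the check can be pictured like this. request_params and
check_parameter_styles are hypothetical names, and the error text is made
up for the sketch.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical summary of which parameters a Query or Scan request carries.
struct request_params {
    bool legacy_filter = false;         // QueryFilter or ScanFilter
    bool attributes_to_get = false;     // AttributesToGet
    bool filter_expression = false;     // FilterExpression
    bool projection_expression = false; // ProjectionExpression
};

// Throws if an old-style parameter is combined with a new-style one.
void check_parameter_styles(const request_params& p) {
    const bool old_style = p.legacy_filter || p.attributes_to_get;
    const bool new_style = p.filter_expression || p.projection_expression;
    if (old_style && new_style) {
        throw std::invalid_argument(
            "cannot combine old-style and expression parameters in one request");
    }
}
```

Two old-style parameters together, or two new-style ones together, pass
the check; only a mix is rejected.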

This patch fixes two tests and removes their "xfail" tag:
  test_query_filter.py::test_query_filter_and_projection_expression
  test_filter_expression.py::test_filter_expression_and_attributes_to_get

Refs #6951

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-10-05 02:19:22 +03:00
Nadav Har'El
282742a469 alternator: fix query with both projection and filtering
We had a bug when a Query/Scan had both projection (ProjectionExpression
or AttributesToGet) and filtering (FilterExpression or Query/ScanFilter).
The problem was that projection left only the requested attributes, and
the filter might have needed - and not got - additional attributes.

The solution in this patch is to add to the generated JSON item also
the extra attributes needed by filtering (if any), run the filter on
that, and only at the end remove the extra filtering attributes from
the item to be returned.
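
The three steps can be sketched as below. Items are modeled as simple
string maps rather than JSON, and select_attributes and
project_and_filter are hypothetical helpers written for this note, not
functions from the Alternator code.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <optional>
#include <set>
#include <string>

using item = std::map<std::string, std::string>;

// Hypothetical helper: copies only the named attributes from `row`.
item select_attributes(const item& row, const std::set<std::string>& names) {
    item out;
    for (const auto& name : names) {
        auto it = row.find(name);
        if (it != row.end()) {
            out.insert(*it);
        }
    }
    return out;
}

// Returns the projected item if it passes the filter, std::nullopt otherwise.
std::optional<item> project_and_filter(
        const item& row,
        const std::set<std::string>& projected,
        const std::set<std::string>& needed_by_filter,
        const std::function<bool(const item&)>& filter) {
    // 1. Build the item from the projected attributes plus the filter's extras.
    std::set<std::string> wanted = projected;
    wanted.insert(needed_by_filter.begin(), needed_by_filter.end());
    const item candidate = select_attributes(row, wanted);
    // 2. Run the filter while the extra attributes are still present.
    if (!filter(candidate)) {
        return std::nullopt;
    }
    // 3. Drop the extras so that only the projection is returned.
    return select_attributes(candidate, projected);
}
```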

The two tests

 test_query_filter.py::test_query_filter_and_attributes_to_get
 test_filter_expression.py::test_filter_expression_and_projection_expression

which failed before this patch, now pass, so we drop their "xfail" tag.

Fixes #6951.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-10-05 02:19:22 +03:00
Avi Kivity
715d50bc85 Update seastar submodule
* seastar 292ba734bc...8c8fd3ed28 (15):
  > semaphore_units: add return_units and return_all
  > semaphore_units: release: mark as noexcept
  > circular_buffer: support non-default-constructible allocators correctly
  > core/shared_ptr: Expose use count through {lw_}enable_shared_from_this
  > memory: align allocations to std::max_align_t
  > util/log: logger::do_log(): go easier on allocations
  > doc: add link to multipage version of tutorial
  > doc: fix the output directories of split and tutorial.html
  > build: do not copy htmlsplit.py to build dir
  > doc: add "--input" and "--output-dir" options to htmlsplit.py
  > doc: update split script to use xml.etree.ElementTree
  > Merge "shared_future: make functions noexcept" from Benny
  > tutorial: add linebreak between sections
  > tutorial: format "future<int>" as inline code block
  > docs: specify HTML language code for tutorial.html
2020-10-04 21:30:27 +03:00
Etienne Adam
46f0354cdb redis: pass request as a reference
This patch changes the way the request object is passed,
using a reference instead of temporaries.

'exists' test is passing in debug mode, whereas it was
always failing before.

Fixes #7261 by ensuring request object is alive for all commands
during the whole request duration.

Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200924202034.30399-1-etienne.adam@gmail.com>
2020-10-04 14:58:00 +03:00
Avi Kivity
5b5b8b3264 lua: be compatibile with Lua 5.4's lua_resume()
Lua 5.4 added an extra parameter to lua_resume()[1]. The parameter
denotes the number of arguments yielded, but our coroutines
don't yield any arguments, so we can just ignore it.

Define a macro to allow adding extra stuff with Lua 5.4,
and use it to supply the extra parameter.

[1] https://www.lua.org/manual/5.4/manual.html#8.3

Closes #7324
2020-10-04 14:07:51 +03:00
Nadav Har'El
ad48d8b43c Merge 'idl: fix definition order related build failures with clang' from Avi Kivity
Clang eagerly instantiates templates, apparently with the following
algorithm:

 - if both the declaration and definition are seen at the time of
   instantiation, instantiate the template
 - if only the declaration is seen at the time of instantiation, just emit
   a reference to the template; even if the definition is later seen,
   it is not instantiated

The "reference" in the second case is a relocation entry in the object file
that is satisfied at link time by the linker, but if no other object file
instantiated the needed template, a link error results.

These problems are hard to diagnose but easy to fix. This series fixes all
known such issues in the code base. It was tested on gcc as well.

Closes #7322

* github.com:scylladb/scylla:
  query-result-reader: order idl implementations correctly
  frozen_schema: order idl implementations correctly
  idl-compiler: generate views after serializers
2020-10-04 11:16:19 +03:00
Takuya ASADA
d611d74905 dist/common/scripts/scylla_setup: force developer mode on nonroot when NOFILE is too low
On Ubuntu 16/18 and Debian 9, LimitNOFILE is set to 4096 and cannot be
overridden from the user unit.
To run scylla-server in such an environment, we need to turn on developer mode
and show warnings.

Fixes #7133

Closes #7323
2020-10-04 10:16:30 +03:00
Avi Kivity
4b40bc5065 query-result-reader: order idl implementations correctly
Clang eagerly instantiates templates, so if it needs a template
function for which it has a declaration but not a definition, it
will not instantiate the definition when it sees it. This causes
link errors.

Fix by ordering the idl implementation files so that definitions
come before uses.
2020-10-03 19:56:29 +03:00
Avi Kivity
94fcec99d1 frozen_schema: order idl implementations correctly
Clang eagerly instantiates templates, so if it needs a template
function for which it has a declaration but not a definition, it
will not instantiate the definition when it sees it. This causes
link errors.

Fix by ordering the idl implementation files so that definitions
come before uses.
2020-10-03 19:56:28 +03:00
Avi Kivity
a99aba9e48 idl-compiler: generate views after serializers
Clang eagerly instantiates templates, so if it needs a template
function for which it has a declaration but not a definition, it
will not instantiate the definition when it sees it. This causes
link errors.

In this case, the views use the serializer implementations, but are
generated before them.

Fix by generating the view implementations after the serializer
implementations that they use.
2020-10-03 19:56:25 +03:00
Tomasz Grabiec
40b42393d2 Merge "Raft: disable boost tests, add disable to test.py" from Alejo
Add disable option for test configuration.
Tests in this list will be disabled for all modes.

* alejo/next-disable-raft-tests-01:
  Raft: disable boost tests for now
  Tests: add disable to configuration
  Raft: Remove tests for now
2020-10-02 15:51:13 +02:00
Yaron Kaikov
bec0c15ee9 configure.py: Add version to unified tarball filename
Let's add the version and release to unified tarball filename to avoid
having to do that in release engineering pipelines, for example.

Closes #7317
2020-10-02 15:48:11 +03:00
Alejo Sanchez
bb67d15e2f Raft: disable boost tests for now
Disable raft fsm boost tests until raft is part of build.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-10-02 14:03:01 +02:00
Alejo Sanchez
eff7b63c08 Tests: add disable to configuration
For suite.yaml, add an extra configuration option, disable.

Tests in this list will be disabled for all modes.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-10-02 14:01:50 +02:00
Alejo Sanchez
ef170a5088 Raft: Remove tests for now
Remove raft C++ tests until raft is included in build process.

[tgrabiec]: Fixes test.py failure. Tests are not compiled unless --build-raft is
passed to configure.py and we cannot enable it by default yet.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20201002102847.1140775-1-alejo.sanchez@scylladb.com>
2020-10-02 12:42:21 +02:00
Alejo Sanchez
4e26dad3a0 Raft: Remove tests for now
Remove raft C++ tests until raft is included in build process.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-10-02 12:26:05 +02:00
Tomasz Grabiec
864b2c5736 CMakeLists.txt: Add raft directory to source code directories
Needed for IDE integration. Not used for building currently.
Message-Id: <1601570008-19666-1-git-send-email-tgrabiec@scylladb.com>
2020-10-01 19:38:39 +03:00
Gleb Natapov
3e8dbb3c09 lwt: do not return unavailable exception from the 'learn' stage
An unavailable exception means that the operation was not started and can be
retried safely. If lwt fails in the learn stage, though, it most
certainly means that its effect is already observable. The patch
returns a timeout exception instead, which signals uncertainty.

Fixes #7258

Message-Id: <20201001130724.GA2283830@scylladb.com>
2020-10-01 17:16:52 +02:00
Tomasz Grabiec
ca7f0c61f0 Merge "raft: initial implementation" from Gleb
This is the beginning of raft protocol implementation. It only supports
log replication and voter state machine. The main difference between
this one and the RFC (besides having voter state machine) is that the
approach taken here is to implement raft as a deterministic state
machine and move all the IO processing away from the main logic.
To do that some changes to the RPC interface were required: all verbs are now
one way, meaning that sending a request does not wait for a reply, and
the reply arrives as a separate message (or not at all, it is safe to
drop packets).

* scylla-dev/raft-v4:
  raft: add a short readme file
  raft: compile raft tests
  raft: add raft tests
  raft: Implement log replication and leader election
  raft: Introduce raft interface header
2020-10-01 17:09:52 +02:00
Konstantin Osipov
9a5f2b87dc raft: add a short readme file
The file has a brief description of the code status, usage and some
implementation assumptions.
2020-10-01 14:30:59 +03:00
Gleb Natapov
16cb009ea2 raft: compile raft tests
Compilation is not enabled by default as it requires coroutine support
and may require a special compiler (until the distribution's compiler fixes
all the bugs related to coroutines). To enable compilation of the raft tests,
a new configure.py option is added (--build-raft).
2020-10-01 14:30:59 +03:00
Gleb Natapov
4959609589 raft: add raft tests
Add tests for the currently implemented raft features. replication_test
tests replication functionality with various initial log configurations.
raft_fsm_test tests the voting state machine functionality.
2020-10-01 14:30:59 +03:00
Gleb Natapov
e1ac1a61c9 raft: Implement log replication and leader election
This patch introduces a partial Raft implementation. It has only log
replication and leader election support. Snapshotting and configuration
changes, along with other, smaller features, are not yet implemented.

The approach taken by this implementation is to have a deterministic
state machine coded in raft::fsm. What makes the FSM deterministic is
that it does not do any IO by itself. It only takes an input (which may
be a networking message, time tick or new append message), changes its
state and produces an output. The output contains the state that has
to be persisted, messages that need to be sent and entries that may
be applied (in that order). The input and output of the FSM are handled
by the raft::server class. It uses the raft::rpc interface to send and receive
messages and raft::storage interface to implement persistence.
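
The split described above can be sketched roughly as follows. This is a toy
illustration in Python, not the raft::fsm or raft::server API: the class names,
the event shapes and the single-node "ack" rule are all assumptions made for
the example, and terms, quorums and elections are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Output:
    """What one FSM step asks the server to do, in this order."""
    persist: list = field(default_factory=list)   # state to persist
    messages: list = field(default_factory=list)  # messages to send
    apply: list = field(default_factory=list)     # entries that may be applied


class ToyFsm:
    """Hypothetical deterministic state machine: no IO, only state changes."""

    def __init__(self):
        self.log = []
        self.committed = 0

    def step(self, event):
        kind, payload = event
        out = Output()
        if kind == "append":
            # A new entry: remember it, ask for it to be persisted and sent.
            self.log.append(payload)
            out.persist.append(payload)
            out.messages.append(("append_entries", payload))
        elif kind == "ack":
            # payload is the highest acknowledged log index (toy commit rule).
            new_commit = min(payload, len(self.log))
            out.apply.extend(self.log[self.committed:new_commit])
            self.committed = max(self.committed, new_commit)
        return out


class ToyServer:
    """Hypothetical driver: performs the IO requested by the FSM."""

    def __init__(self, fsm, storage, network, state_machine):
        self.fsm = fsm
        self.storage = storage
        self.network = network
        self.state_machine = state_machine

    def handle(self, event):
        out = self.fsm.step(event)
        self.storage.extend(out.persist)       # 1. persist
        self.network.extend(out.messages)      # 2. send
        self.state_machine.extend(out.apply)   # 3. apply
```

Because `ToyFsm.step()` touches nothing but its own fields, two instances fed
the same sequence of events return equal outputs, which is what makes this
style easy to test.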
2020-10-01 14:30:59 +03:00
Gleb Natapov
c073997431 raft: Introduce raft interface header
This commit introduces the public raft interfaces. raft::server represents
a single raft server instance. raft::state_machine represents a user
defined state machine. raft::rpc, raft::rpc_client and raft::storage are
used to allow implementing custom networking and storage layers.

A shared failure detector interface defines keep-alive semantics,
required for efficient implementation of thousands of raft groups.
2020-10-01 14:30:59 +03:00
Piotr Dulikowski
bfbf02a657 transport/config: fix cross-shard use of updateable_value
Recently, the cql_server_config::max_concurrent_requests field was
changed to be an updateable_value, so that it is updated when the
corresponding option in Scylla's configuration is live-reloaded.
Unfortunately, due to how cql_server is constructed, this caused
cql_server instances on all shards to store an updateable_value which
pointed to an updateable_value_source on shard 0. Unsynchronized
cross-shard memory operations ensue.

The fix changes the cql_server_config so that it holds a function which
creates an updateable_value appropriate for the given shard. This
pattern is similar to another, already existing option in the config:
get_service_memory_limiter_semaphore.

This fix can be reverted if updateable_value becomes safe to use across
shards.

Tests: unit(dev)

Fixes: #7310
2020-10-01 14:10:56 +03:00
Etienne Adam
98dc0dc03a redis: only create required keyspaces/tables
The 'redis_database_count' was already existing, but
was not used when initializing the keyspaces. This
patch merely uses it. I think it's better that way, it
seems cleaner not to create 15 x 5 tables when we
use only one redis database.

Also change a test to use a higher maximum number
of databases.

Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200930210256.4439-1-etienne.adam@gmail.com>
2020-10-01 10:27:03 +03:00
Wojciech Mitros
e79ad38425 tracing: add username to the session table
In order to improve observability, add a username field to the
system_traces.sessions table. The system table should be changed
while upgrading by running the fix_system_distributed_tables.py
script. Until the table is updated, the old behaviour is preserved.

Fixes #6737.
2020-10-01 04:46:40 +02:00
Nadav Har'El
d73cf589e7 docs: fix typos in docs/alternator/alternator.md
Discovered by running a spell-checker.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200930101046.76710-1-nyh@scylladb.com>
2020-10-01 04:46:40 +02:00
Nadav Har'El
8db01aeeb4 docs: fix typo in alternator/getting-started.md
Fix a typo reported by a user. Ran spell-checker to verify there are no
other obvious spelling mistakes in that file.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200930084304.74776-1-nyh@scylladb.com>
2020-10-01 04:46:40 +02:00
Avi Kivity
701d24a832 Merge 'Enhance max concurrent requests code' from Piotr Sarna
This miniseries enhances the code from #7279 by:
 * adding metrics for shed requests, which make it possible to pinpoint the problem if the max concurrent requests threshold is too low
 * making the error message more comprehensive by pointing at the variable used to set max concurrent requests threshold

Example of an enhanced error message:
```
ConnectionException('Failed to initialize new connection to 127.0.0.1: Error from server: code=1001 [Coordinator node overloaded] message="too many in-flight requests (configured via max_concurrent_requests_per_shard): 18"',)})
```

Closes #7299

* github.com:scylladb/scylla:
  transport: make _requests_serving param uint32_t
  transport: make overloaded error message more descriptive
  transport: add requests_shed metrics
2020-10-01 04:46:40 +02:00
Benny Halevy
5a250f529f token_metadata: get rid of unused calculate_pending_ranges_for_* methods
They are only called internally by token_metadata_impl.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-09-30 23:16:23 +03:00
Benny Halevy
41e5a3a245 token_metadata: get rid of clone_after_all_settled
It's unused.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-09-30 23:15:11 +03:00
Benny Halevy
105a2f5244 token_metadata_impl: remove_endpoint: do not sort tokens
Call sort_tokens at the caller as all call sites from
within token_metadata_impl call remove_endpoint for multiple
endpoints so the tokens can be re-sorted only once, when done
removing all tokens.
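
A rough sketch of the idea, in Python rather than the project's C++. The class
and its fields are hypothetical stand-ins for token_metadata_impl, chosen only
to show the "remove many, sort once" shape:

```python
class ToyTokenMetadata:
    """Hypothetical stand-in: a token -> endpoint map plus a sorted token list."""

    def __init__(self, token_to_endpoint):
        self._token_to_endpoint = dict(token_to_endpoint)
        self._sorted_tokens = []
        self.sort_tokens()

    def sort_tokens(self):
        # Sorts in place: the result is always stored in _sorted_tokens.
        self._sorted_tokens = sorted(self._token_to_endpoint)

    def remove_endpoint(self, endpoint):
        # Drops the endpoint's tokens but does NOT re-sort; the caller does.
        self._token_to_endpoint = {
            token: ep
            for token, ep in self._token_to_endpoint.items()
            if ep != endpoint
        }

    def remove_endpoints(self, endpoints):
        # Caller-side pattern: remove everything first, then sort only once.
        for endpoint in endpoints:
            self.remove_endpoint(endpoint)
        self.sort_tokens()
```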

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-09-30 23:12:32 +03:00
Benny Halevy
86303f4fdd token_metadata_impl: always sort_tokens in place
No need to return the sorted tokens vector as it's
always assigned to _sorted_tokens.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-09-30 23:08:56 +03:00
Piotr Sarna
876e9fe51a transport: make _requests_serving param uint32_t
It's not realistic for a shard to have over 4 billion concurrent
requests, so this value can be safely represented in 32 bits.
Also, since the current concurrency limit is represented in uint32_t,
it makes sense for these two to have matching types.
2020-09-30 08:20:52 +02:00
Piotr Sarna
d18f68f1c1 transport: make overloaded error message more descriptive
The message now mentions the config variable used to set the limit
of max allowed concurrent requests.
2020-09-30 08:20:51 +02:00
Piotr Sarna
792ff3757a transport: add requests_shed metrics
The counter shows a total number of requests shed due to overload.
2020-09-30 08:20:50 +02:00
Avi Kivity
fd1dd0eac7 Merge "Track the memory consumption of reader buffers" from Botond
"
The last major untracked area of the reader pipeline is the reader
buffers. These scale with the number of readers as well as with the size
and shape of data, so their memory consumption is unpredictable and varies
wildly. For example many small rows will trigger larger buffers
allocated within the `circular_buffer<mutation_fragment>`, while few
larger rows will consume a lot of external memory.

This series covers this area by tracking the memory consumption of both
the buffer and its content. This is achieved by passing a tracking
allocator to `circular_buffer<mutation_fragment>` so that each
allocation it makes is tracked. Additionally, we now track the memory
consumption of each and every mutation fragment through its whole
lifetime. Initially I contemplated just tracking the `_buffer_size` of
`flat_mutation_reader::impl`, but concluded that as our reader trees are
typically quite deep, this would result in a lot of unnecessary
`signal()`/`consume()` calls, that scales with the number of mutation
fragments and hence adds to the already considerable per mutation
fragment overhead. The solution chosen in this series is to instead
track the memory consumption of the individual mutation fragments, with
the observation that these are typically always moved and very rarely
copied, so the number of `signal()`/`consume()` calls will be minimal.

This additional tracking introduces an interesting dilemma however:
readers will now have significant memory on their account even before
being admitted. So it may happen that they can prevent their own
admission via this memory consumption. To prevent this, memory
consumption is only forwarded to the semaphore upon admission. This
might be solved when the semaphore is moved to the front -- before the
cache.
Another consequence of this additional, more complete tracking is that
evictable readers now consume memory even when the underlying reader is
evicted. So it may happen that even though no reader is currently
admitted, all memory is consumed from the semaphore. To prevent any such
deadlocks, the semaphore now admits a reader unconditionally if no
reader is admitted -- that is if all count resources are available.
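
The admission rule from the last paragraph can be sketched like this. It is a
toy model in Python, not the reader concurrency semaphore's real interface;
the class, its method names and the numbers are assumptions for illustration:

```python
class ToySemaphore:
    """Hypothetical model of a semaphore with count and memory resources."""

    def __init__(self, count, memory):
        self.initial_count = count
        self.count = count
        self.memory = memory

    def consume_memory(self, amount):
        # Tracked buffers may consume memory even with no admitted reader,
        # so the available memory is allowed to go negative.
        self.memory -= amount

    def can_admit(self, memory_needed):
        # If all count resources are available, no reader is admitted at
        # all; admit unconditionally to avoid a deadlock.
        if self.count == self.initial_count:
            return True
        return self.count >= 1 and self.memory >= memory_needed

    def admit(self, memory_needed):
        if not self.can_admit(memory_needed):
            return False
        self.count -= 1
        self.memory -= memory_needed
        return True
```

In this model the first reader always gets in, even if evicted readers'
buffers have already used up all the memory; later readers wait as usual.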

Refs: #4176

Tests: unit(dev, debug, release)
"

* 'track-reader-buffers/v2' of https://github.com/denesb/scylla: (37 commits)
  test/manual/sstable_scan_footprint_test: run test body in statement sched group
  test/manual/sstable_scan_footprint_test: move test main code into separate function
  test/manual/sstable_scan_footprint_test: sprinkle some thread::maybe_yield():s
  test/manual/sstable_scan_footprint_test: make clustering row size configurable
  test/manual/sstable_scan_footprint_test: document sstable related command line arguments
  mutation_fragment_test: add exception safety test for mutation_fragment::mutate_as_*()
  test: simple_schema: add make_static_row()
  reader_permit: reader_resources: add operator==
  mutation_fragment: memory_usage(): remove unused schema parameter
  mutation_fragment: track memory usage through the reader_permit
  reader_permit: resource_units: add permit() and resources() accessors
  mutation_fragment: add schema and permit
  partition_snapshot_row_cursor: row(): return clustering_row instead of mutation_fragment
  mutation_fragment: remove as_mutable_end_of_partition()
  mutation_fragment: s/as_mutable_partition_start/mutate_as_partition_start/
  mutation_fragment: s/as_mutable_range_tombstone/mutate_as_range_tombstone/
  mutation_fragment: s/as_mutable_clustering_row/mutate_as_clustering_row/
  mutation_fragment: s/as_mutable_static_row/mutation_as_static_row/
  flat_mutation_reader: make _buffer a tracked buffer
  mutation_reader: extract the two fill_buffer_result into a single one
  ...
2020-09-29 16:08:16 +03:00
Pekka Enberg
8f17ca2d1a scripts/refresh-submodules.sh: Add python3 submodule
Message-Id: <20200928075422.377888-1-penberg@scylladb.com>
2020-09-29 16:06:32 +03:00
Yaron Kaikov
d48df44f26 configure.py: build python3, jmx, tools and unified-tar only in relevant dist-{mode}
Today, whenever we build scylla in a single mode, we still
build jmx, tools and python3 for all of dev, release and debug.
Let's make sure we build only in the relevant build mode.

Also adding unified-tar to ninja build

Closes #7260
2020-09-29 15:41:52 +03:00
Juliusz Stasiewicz
0afa738a8f tracing: Fix error on slow batches
`trace_keyspace_helper::make_slow_query_mutation_data` expected a
"query" key in its parameters, which does not appear in case of
e.g. batches of prepared statements. This is an example of a failing
`record.parameters`:
```
...{"query[0]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"},
{"query[1]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"}...
```

In such case Scylla recorded no trace and said:
```
ERROR 2020-09-28 10:09:36,696 [shard 3] trace_keyspace_helper - No
"query" parameter set for a session requesting a slow_query_log record
```

The fix here is to leave the query empty if not found. The users can still
retrieve the query contents from existing info.
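
In Python terms, the change in behaviour looks roughly like the sketch below.
The function name is hypothetical; the real logic lives in the C++
`trace_keyspace_helper::make_slow_query_mutation_data`:

```python
def slow_query_text(parameters):
    """Return the text for the slow query record (hypothetical helper).

    Batches carry "query[0]", "query[1]", ... instead of a plain "query"
    key, so a missing key yields an empty string instead of an error.
    """
    return parameters.get("query", "")


single = {"query": "SELECT * FROM ks.tbl;"}
batch = {
    "query[0]": "INSERT INTO ks.tbl (pk, i) values (?, ?);",
    "query[1]": "INSERT INTO ks.tbl (pk, i) values (?, ?);",
}
print(repr(slow_query_text(single)))
print(repr(slow_query_text(batch)))
```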

Fixes #5843

Closes #7293
2020-09-29 13:24:39 +02:00
Asias He
eedcee7f31 gossip: Reduce unnecessary VIEW_BACKLOG updates
The backlog (current and max) in VIEW_BACKLOG is not updated, but the
nodes are updating VIEW_BACKLOG all the time. For example:

```
INFO  2020-03-06 17:13:46,761 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486026590,718)
INFO  2020-03-06 17:13:46,821 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486026531,742)
INFO  2020-03-06 17:13:47,765 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486027590,721)
INFO  2020-03-06 17:13:47,825 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486027531,745)
INFO  2020-03-06 17:13:48,772 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486028590,726)
INFO  2020-03-06 17:13:48,833 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486028531,750)
INFO  2020-03-06 17:13:49,772 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486029590,729)
INFO  2020-03-06 17:13:49,832 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486029531,753)
```

The downside of such updates:

 - Introduces more gossip exchange traffic
 - Updates system.peers all the time

The extra unnecessary gossip traffic is fine for a cluster in good
shape, but when some of the nodes or shards are loaded, such messages and
the handling of such messages can make the system even busier.

With this patch, VIEW_BACKLOG is updated only when the backlog has actually
changed.

Btw, we can even make the update only when the change of the backlog is
greater than a threshold, e.g., 5%, which can reduce the traffic even
further.
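
A small sketch of both ideas (publish only on change, and optionally only
above a relative threshold). It is written in Python for brevity; the class,
its `publish` callback and the way the threshold is compared are assumptions,
not the gossiper's real code:

```python
class BacklogPublisher:
    """Hypothetical helper that suppresses redundant VIEW_BACKLOG updates."""

    def __init__(self, publish, threshold=0.0):
        self._publish = publish      # callable that gossips the new value
        self._threshold = threshold  # relative change needed, e.g. 0.05
        self._last = None            # last (current, max) actually published

    def _changed(self, new):
        if new == self._last:
            return False
        old_current, old_max = self._last
        if new[1] != old_max or old_current == 0:
            return True
        return abs(new[0] - old_current) / old_current > self._threshold

    def update(self, current, maximum):
        new = (current, maximum)
        if self._last is not None and not self._changed(new):
            return False             # nothing worth gossiping
        self._last = new
        self._publish(new)
        return True
```

With the default threshold of 0.0 any change is published, which corresponds
to the behaviour described in this patch; a non-zero threshold corresponds to
the possible follow-up mentioned above.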

Fixes #5970
2020-09-29 13:37:37 +03:00
Avi Kivity
6fdc8f28a9 Update tools/jmx submodule
* tools/jmx 45e4f28...25bcd76 (1):
  > install.sh: stop using symlinks for systemd units on nonroot mode

Fixes #7288.
2020-09-29 13:32:45 +03:00
Takuya ASADA
8504332e17 scylla_setup: skip offline warnings on nonroot mode
Since most of the scripts require root privileges, we don't show offline
warnings in nonroot mode.

Fixes #7286

Closes #7287
2020-09-29 13:30:13 +03:00
Eliran Sinvani
925cdc9ae1 consistency level: fix wrong quorum calculation when RF = 0
We used to calculate the number of endpoints for quorum and local_quorum
unconditionally as ((rf / 2) + 1). This formula doesn't take into
account the corner case where RF = 0; in this situation quorum should
also be 0.
This commit adds the missing corner case.
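
A minimal sketch of the corrected calculation, in Python for brevity rather
than the project's C++; the function name `quorum_for` is hypothetical:

```python
def quorum_for(rf):
    """Number of replicas needed for a quorum, given a replication factor.

    Mirrors the (rf / 2) + 1 formula quoted above, with the RF = 0 corner
    case handled explicitly.
    """
    if rf < 0:
        raise ValueError("replication factor must be non-negative")
    # With no replicas there is nothing to wait for, so the quorum is 0.
    return rf // 2 + 1 if rf > 0 else 0


for rf in (0, 1, 2, 3, 5):
    print(rf, quorum_for(rf))
```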

Tests: Unit Tests (dev)
Fixes #6905

Closes #7296
2020-09-29 13:25:41 +03:00
Avi Kivity
6634cbb190 build: detect and allow clang 10 as a compiler
While we don't yet fully support clang as a compiler, this at least
allows working on it.
2020-09-29 12:48:46 +03:00
Avi Kivity
8bead32be3 build: detect availability of -Wstack-usage=
Detect if the compiler supports -Wstack-usage= and only enable it
if supported. Because each mode uses a different threshold, have each
mode store the threshold value and merge it into the flags only after
we decided we support it.
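
The approach can be sketched as below. This is not the actual configure.py
code: the function name, the injected `flag_supported` probe and the example
thresholds are assumptions, and the `-Wno-error=stack-usage=` spelling of the
"only a warning" switch is likewise assumed here:

```python
def stack_usage_flags(threshold, flag_supported):
    """Return the stack-usage flags for one mode, or [] if unsupported.

    `flag_supported` is a callable that probes the compiler (for example
    by compiling an empty file with the flag) and returns True or False.
    """
    warning = f"-Wstack-usage={threshold}"
    if not flag_supported(warning):
        return []
    # Only downgrade to a warning when the warning itself exists.
    return [warning, "-Wno-error=stack-usage="]


# Each mode stores only its threshold; flags are merged after the probe.
thresholds = {"debug": 40 * 1024, "release": 13 * 1024}  # example values
flags = {
    mode: stack_usage_flags(limit, lambda flag: True)
    for mode, limit in thresholds.items()
}
print(flags)
```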

The switch to make it only a warning is made conditional on compiler
support.
2020-09-29 12:48:45 +03:00
Takuya ASADA
ba29074c42 install.sh: stop using symlinks for systemd units on nonroot mode
In some environments, systemctl enable <service> fails when we use a symlink.
So just copy the systemd units directly to ~/.config/systemd/user, instead of
creating symlinks.

Fixes #7288

Closes #7290
2020-09-29 12:20:41 +03:00
Piotr Sarna
9e5ce5a93c counters: remove unused 1.7.4 counter order code
After cleaning up old cluster features (253a7640e3)
the code for special handling of 1.7.4 counter order was effectively
only used in its own tests, so it can be safely removed.

Closes #7289
2020-09-29 12:16:58 +03:00
Avi Kivity
57f377e1fe Merge 'Add max concurrent requests configuration option to coordinator' from Piotr Sarna
This series approaches issue #7072 and provides a very simple mechanism for limiting the number of concurrent CQL requests being served on a shard. Once the limit is hit, new requests will be instantly refused and OverloadedException will be returned to the client.
This mechanism has many improvement opportunities:
 * shedding requests gradually instead of having one hard limit,
 * having more than one limit per different types of queries (reads, writes, schema changes, ...),
 * not using a preconfigured value at all, and instead figuring out the limit dynamically,
 * etc.

... and none of these are taken into account in this series, which only adds a very basic configuration variable. The variable can be updated live without a restart - it can be done by updating the .yaml file and triggering a configuration re-read via sending the SIGHUP signal to Scylla.

The default value for this parameter is a very large number, which translates to effectively not shedding any requests at all.

Refs #7072

Closes #7279

* github.com:scylladb/scylla:
  transport: make max_concurrent_requests_per_shard reloadable
  transport: return exceptional future instead of throwing
  transport,config: add a param for max request concurrency
  exceptions: make a single-param constructor explicit
  exceptions: add a constructor based on custom message
2020-09-29 12:14:03 +03:00
Pekka Enberg
1adf2cc848 Revert "scylla_ntp_setup: use chrony on all distributions"
This reverts commit 8366d2231d because it
causes the following "scylla_setup" failure on Ubuntu 16.04:

  Command: 'sudo /usr/lib/scylla/scylla_setup --nic ens5 --disks /dev/nvme0n1  --swap-directory / '
  Exit code: 1
  Stdout:
  Setting up libtomcrypt0:amd64 (1.17-7ubuntu0.1) ...
  Setting up chrony (2.1.1-1ubuntu0.1) ...
  Creating '_chrony' system user/group for the chronyd daemon…
  Creating config file /etc/chrony/chrony.conf with new version
  Processing triggers for libc-bin (2.23-0ubuntu11.2) ...
  Processing triggers for ureadahead (0.100.0-19.1) ...
  Processing triggers for systemd (229-4ubuntu21.29) ...
  501 Not authorised
  NTP setup failed.
  Stderr:
  chrony.service is not a native service, redirecting to systemd-sysv-install
  Executing /lib/systemd/systemd-sysv-install enable chrony
  Traceback (most recent call last):
  File "/opt/scylladb/scripts/libexec/scylla_ntp_setup", line 63, in <module>
  run('chronyc makestep')
  File "/opt/scylladb/scripts/scylla_util.py", line 504, in run
  return subprocess.run(cmd, stdout=stdout, stderr=stderr, shell=shell, check=exception, env=scylla_env).returncode
  File "/opt/scylladb/python3/lib64/python3.8/subprocess.py", line 512, in run
  raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['chronyc', 'makestep']' returned non-zero exit status 1.
2020-09-29 11:23:23 +03:00
Piotr Sarna
4b856cf62d transport: make max_concurrent_requests_per_shard reloadable
This configuration entry is expected to be used as a quick fix
for an overloaded node, so it should be possible to reload this value
without having to restart the server.
2020-09-29 10:11:36 +02:00
Piotr Sarna
4da8957461 transport: return exceptional future instead of throwing
Throwing bears an additional cost, so it's better to simply
construct the error in place and return it.
2020-09-29 10:00:30 +02:00
Piotr Sarna
b4db6d2598 transport,config: add a param for max request concurrency
The newly introduced parameter - max_concurrent_requests_per_shard
- can be used to limit the number of in-flight requests a single
coordinator shard can handle. Each surplus request will be
immediately refused by returning OverloadedException error to the client.
The default value for this parameter is large enough to never
actually shed any requests.
Currently, the limit is only applied to CQL requests - other frontends
like alternator and redis are not throttled yet.
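
A toy model of the mechanism, written in Python rather than the transport
layer's C++. The class and method names are hypothetical, and whether the
real check uses `>=` or `>` is an assumption:

```python
class OverloadedError(Exception):
    """Stand-in for the OverloadedException returned to the client."""


class ToyShard:
    """Hypothetical per-shard request counter with a hard limit."""

    def __init__(self, max_concurrent_requests):
        self.max_concurrent_requests = max_concurrent_requests
        self.requests_serving = 0

    def start_request(self):
        # Refuse immediately instead of queueing once the limit is reached.
        if self.requests_serving >= self.max_concurrent_requests:
            raise OverloadedError(
                f"too many in-flight requests: {self.requests_serving}"
            )
        self.requests_serving += 1

    def finish_request(self):
        self.requests_serving -= 1
```

A caller pairs every successful `start_request()` with a `finish_request()`,
so the counter reflects only the requests currently in flight.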
2020-09-29 09:59:30 +02:00
Botond Dénes
2ee026f26f test/manual/sstable_scan_footprint_test: run test body in statement sched group
So that queries are processed in said scheduling group and thus they use
the user read concurrency semaphore.
2020-09-28 11:27:49 +03:00
Botond Dénes
272a54b81c test/manual/sstable_scan_footprint_test: move test main code into separate function 2020-09-28 11:27:49 +03:00
Botond Dénes
29861b068e test/manual/sstable_scan_footprint_test: sprinkle some thread::maybe_yield():s
To avoid stalls.
2020-09-28 11:27:49 +03:00
Botond Dénes
daa9fa72f1 test/manual/sstable_scan_footprint_test: make clustering row size configurable
So that large-row workloads can be simulated too.
2020-09-28 11:27:49 +03:00
Botond Dénes
2ff326a41a test/manual/sstable_scan_footprint_test: document sstable related command line arguments 2020-09-28 11:27:49 +03:00
Botond Dénes
ceb308411c mutation_fragment_test: add exception safety test for mutation_fragment::mutate_as_*() 2020-09-28 11:27:49 +03:00
Botond Dénes
ceb0b02ee8 test: simple_schema: add make_static_row() 2020-09-28 11:27:49 +03:00
Botond Dénes
63578bf0a7 reader_permit: reader_resources: add operator== 2020-09-28 11:27:49 +03:00
Botond Dénes
256140a033 mutation_fragment: memory_usage(): remove unused schema parameter
The memory usage is now maintained and updated on each change to the
mutation fragment, so it needs not be recalculated on a call to
`memory_usage()`, hence the schema parameter is unused and can be
removed.
2020-09-28 11:27:47 +03:00
Botond Dénes
041d71bd6f mutation_fragment: track memory usage through the reader_permit
The memory usage of mutation fragments is now tracked through its
lifetime through a reader permit. This was the last major (to my current
knowledge) untracked piece of the reader pipeline.
2020-09-28 11:27:29 +03:00
Botond Dénes
52662f17ea reader_permit: resource_units: add permit() and resources() accessors 2020-09-28 11:27:29 +03:00
Botond Dénes
6ca0464af5 mutation_fragment: add schema and permit
We want to start tracking the memory consumption of mutation fragments.
For this we need schema and permit during construction, and on each
modification, so the memory consumption can be recalculated and passed to
the permit.

In this patch we just add the new parameters and go through the insane
churn of updating all call sites. They will be used in the next patch.
2020-09-28 11:27:23 +03:00
Botond Dénes
54357221f0 partition_snapshot_row_cursor: row(): return clustering_row instead of mutation_fragment
It is what its callers want anyway.
2020-09-28 10:53:56 +03:00
Botond Dénes
1e6285d776 mutation_fragment: remove as_mutable_end_of_partition()
There is nothing to mutate on a partition_end fragment.
2020-09-28 10:53:56 +03:00
Botond Dénes
5079b9ccf1 mutation_fragment: s/as_mutable_partition_start/mutate_as_partition_start/
We will soon want to update the memory consumption of mutation fragment
after each modification done to it, to do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.
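
The pattern can be illustrated with a toy class. It is written in Python and
is not the mutation_fragment API: the class, the size estimate and the
`account` callback (standing in for the permit) are assumptions:

```python
class ToyFragment:
    """Hypothetical fragment that re-measures itself after every mutation."""

    def __init__(self, row, account):
        self._row = dict(row)
        self._account = account      # called with the change in bytes
        self._size = self._measure()
        self._account(self._size)

    def _measure(self):
        # Crude size estimate, good enough for the illustration.
        return sum(len(k) + len(v) for k, v in self._row.items())

    def mutate(self, fn):
        # Callers never get a mutable reference; they pass a function.
        try:
            fn(self._row)
        finally:
            # Re-measure even if fn raised, so the accounting stays correct.
            new_size = self._measure()
            self._account(new_size - self._size)
            self._size = new_size

    def memory_usage(self):
        return self._size
```

Because every change goes through `mutate()`, the tracked size cannot drift
from the real contents, which direct mutable access would not guarantee.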

Uses where this method was just used to move the fragment away are
converted to use `as_partition_start() &&`.
2020-09-28 10:53:56 +03:00
Botond Dénes
72a88e0257 mutation_fragment: s/as_mutable_range_tombstone/mutate_as_range_tombstone/
We will soon want to update the memory consumption of mutation fragment
after each modification done to it, to do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.

Uses where this method was just used to move the fragment away are
converted to use `as_range_tombstone() &&`.
2020-09-28 10:53:56 +03:00
Botond Dénes
4f5ccf82cb mutation_fragment: s/as_mutable_clustering_row/mutate_as_clustering_row/
We will soon want to update the memory consumption of mutation fragment
after each modification done to it, to do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.

Uses where this method was just used to move the fragment away are
converted to use `as_clustering_row() &&`.
2020-09-28 10:53:56 +03:00
Botond Dénes
f2b9cad4c6 mutation_fragment: s/as_mutable_static_row/mutation_as_static_row/
We will soon want to update the memory consumption of mutation fragment
after each modification done to it, to do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.

Uses where this method was just used to move the fragment away are
converted to use `as_static_row() &&`.
2020-09-28 10:53:56 +03:00
Botond Dénes
0518571e56 flat_mutation_reader: make _buffer a tracked buffer
Via a tracked_allocator. Although the memory allocations made by the
_buffer shouldn't dominate the memory consumption of the read itself,
they can still be a significant portion that scales with the number of
readers in the read.
2020-09-28 10:53:56 +03:00
Botond Dénes
77ea44cb73 mutation_reader: extract the two fill_buffer_result into a single one
Currently we have two, nearly identical definitions of said struct.
Extract it to a common definition and rename it to
`remote_fill_buffer_result`.
2020-09-28 10:53:56 +03:00
Botond Dénes
3fab83b3a1 flat_mutation_reader: impl: add reader_permit parameter
Not used yet, this patch does all the churn of propagating a permit
to each impl.

In the next patch we will use it to track the memory
consumption of `_buffer`.
2020-09-28 10:53:48 +03:00
Pekka Enberg
068b1e3470 Update tools/python3 submodule
* tools/python3 b4e52ee...cfa27b3 (1):
  > build: support passing product-version-release as a parameter
2020-09-28 10:53:11 +03:00
Piotr Sarna
58ae0c5208 exceptions: make a single-param constructor explicit
... since it's good practice.
2020-09-28 09:16:31 +02:00
Piotr Sarna
b0737542f2 exceptions: add a constructor based on custom message
OverloadedException was historically only used when the number
of in-flight hints got too high. The other constructor will be useful
for using OverloadedException in other scenarios.
2020-09-28 09:16:31 +02:00
Botond Dénes
c1215592da reader_permit: introduce tracking_allocator
This can be used with standard containers and other containers that use
the std::allocator interface to track the allocations made by them via a
reader_permit.
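
A minimal sketch of the idea, assuming a byte-counting `permit` type with
`consume()`/`signal()` methods. Both the `permit` type and this allocator
body are illustrative assumptions, not the actual reader_permit code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a reader permit: it only counts bytes.
struct permit {
    std::size_t consumed = 0;
    void consume(std::size_t bytes) { consumed += bytes; }
    void signal(std::size_t bytes) { consumed -= bytes; }
};

// Minimal allocator that reports every allocation and deallocation to a permit.
template <typename T>
class tracking_allocator {
    permit* _permit;
    template <typename U> friend class tracking_allocator;
public:
    using value_type = T;

    explicit tracking_allocator(permit& p) noexcept : _permit(&p) {}

    // Rebinding constructor required by the standard allocator interface.
    template <typename U>
    tracking_allocator(const tracking_allocator<U>& other) noexcept : _permit(other._permit) {}

    T* allocate(std::size_t n) {
        T* p = std::allocator<T>().allocate(n);
        _permit->consume(n * sizeof(T));
        return p;
    }

    void deallocate(T* p, std::size_t n) noexcept {
        std::allocator<T>().deallocate(p, n);
        _permit->signal(n * sizeof(T));
    }

    // Allocators sharing a permit can free each other's memory.
    friend bool operator==(const tracking_allocator& a, const tracking_allocator& b) noexcept {
        return a._permit == b._permit;
    }
    friend bool operator!=(const tracking_allocator& a, const tracking_allocator& b) noexcept {
        return a._permit != b._permit;
    }
};
```

A container such as `std::vector<int, tracking_allocator<int>>` constructed
with this allocator then charges its storage to the permit and releases it
on destruction.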
2020-09-28 08:46:22 +03:00
Botond Dénes
f10abf6e35 reader_permit: reader_resources: add with_memory() factory function
To make creating reader resources with just memory more convenient and
more readable at the same time.
2020-09-28 08:46:22 +03:00
Botond Dénes
4c8ab10563 reader_permit: only forward resource consumption to semaphore after admission
In the next patches we plan to start tracking the memory consumption of
the actual allocations made by the circular_buffer<mutation_fragment>,
as well as the memory consumed by the mutation fragments.
This means that readers will start consuming memory off the permit right
after being constructed. Ironically this can prevent the reader from
being admitted, due to its own pre-admission memory consumption. To
prevent this, hold off on forwarding the memory consumption to the semaphore
until the permit is actually admitted.
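
The deferred forwarding can be sketched as follows. This is an illustration
under assumptions: the `semaphore` and `permit` types and the
`on_admission()` hook are hypothetical, and only memory is modelled.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical semaphore holding only a memory budget.
struct semaphore {
    std::ptrdiff_t available_memory;
};

// Hypothetical permit: it always records its own consumption, but only
// charges the semaphore once it has been admitted.
class permit {
    semaphore& _sem;
    std::ptrdiff_t _consumed = 0;
    bool _admitted = false;
public:
    explicit permit(semaphore& sem) : _sem(sem) {}

    void consume(std::ptrdiff_t bytes) {
        _consumed += bytes;
        if (_admitted) {
            _sem.available_memory -= bytes;
        }
    }

    void signal(std::ptrdiff_t bytes) {
        _consumed -= bytes;
        if (_admitted) {
            _sem.available_memory += bytes;
        }
    }

    // On admission, charge everything consumed so far in one step.
    void on_admission() {
        _admitted = true;
        _sem.available_memory -= _consumed;
    }

    std::ptrdiff_t consumed() const { return _consumed; }
};
```

Pre-admission consumption is visible on the permit but cannot reduce the
budget the permit itself is waiting on.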
2020-09-28 08:46:22 +03:00
Botond Dénes
e1eee0dc34 reader_permit: track resource consumed through permit
Track all resources consumed through the permit inside the permit. This
allows querying how much memory each read is consuming (as there should
be one read per permit). Although this might be interesting, especially
when debugging OOM cores, the real reason we are doing this is to be
able to forward resource consumption to the semaphore only post-admission.
More on this in the patch introducing this.

Another advantage of tracking resources consumed through the permit is
that now we can detect resource leaks in the permit destructor and
report them. Even if it is just a case of the holder of the resources
wanting to release the resources later, with the permit destroyed it
will cause use-after-free.
2020-09-28 08:46:22 +03:00
Botond Dénes
cd953a36fd reader_permit: move internals to impl
In the next patches the reader permit will gain members that are shared
across all instances of the same permit. To facilitate this move all
internals into an impl class, of which the permit stores a shared
pointer. We use a shared_ptr to avoid defining `impl` in the header.

This is how the reader permit started in the beginning. We've come
full circle. :)
2020-09-28 08:46:22 +03:00
Botond Dénes
12372731cb reader_permit: add consume()/signal()
And do all consuming and signalling through these methods. These
operations will soon be more involved than the simple forwarding they do
today, so we want to centralize them to a single method pair.
2020-09-28 08:46:22 +03:00
Botond Dénes
375815e650 reader_permit::resource_units: store permit instead of semaphore
In the next patches we want to introduce per-permit resource tracking --
that is, have each permit track the amount of resource consumed through
it. For this, we need all consumption to happen through a permit, and
not directly with the semaphore.
2020-09-28 08:46:22 +03:00
Botond Dénes
04d83f6678 reader_permit: move resource_units declaration outside the reader_permit class
In the next patch we want to store a `reader_permit` instance inside
`resource_units` so a full definition of the former must be available.
2020-09-28 08:46:22 +03:00
Botond Dénes
0fe75571d9 reader_concurrency_semaphore: admit one read if no reader is active
To ensure progress at all times. This is due to evictable readers, which
still hold on to a buffer even when their underlying reader is evicted.
As we are introducing buffer and mutation fragment tracking in the next
patches, these readers will hold on to memory even in this state, so it
may theoretically happen that even though no readers are admitted (all
count resources are available) no reader can be admitted due to lack of
memory. To prevent such deadlocks we now always admit one reader if all
count resources are available.
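
A minimal sketch of the admission rule, assuming a simplified semaphore with
one count and one memory resource. The `resources` and `admission_control`
names and the synchronous `try_admit()` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical resource pair. Memory may go negative when evicted readers
// keep holding on to their buffers.
struct resources {
    int count;
    std::ptrdiff_t memory;
};

class admission_control {
    resources _initial;
    resources _available;
public:
    explicit admission_control(resources initial) : _initial(initial), _available(initial) {}

    // Memory consumed outside of admission, e.g. buffers of evicted readers.
    void consume_memory(std::ptrdiff_t bytes) { _available.memory -= bytes; }

    bool may_proceed(const resources& r) const {
        // Normal path: enough of both resources.
        if (_available.count >= r.count && _available.memory >= r.memory) {
            return true;
        }
        // Progress guarantee: if no reader is currently admitted (all count
        // units are available), admit one regardless of memory.
        return _available.count == _initial.count;
    }

    bool try_admit(const resources& r) {
        if (!may_proceed(r)) {
            return false;
        }
        _available.count -= r.count;
        _available.memory -= r.memory;
        return true;
    }
};
```

Only the first reader benefits from the exception. Once it is admitted the
count is no longer at its initial value, so later readers wait for memory as
usual.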
2020-09-28 08:46:22 +03:00
Botond Dénes
ef0b279c80 reader_concurrency_semaphore: move may_proceed() out-of-line
They are only used in the .cc anyway.
2020-09-28 08:46:22 +03:00
Botond Dénes
d692993bdc mutation_reader_test: test_multishard_combining_reader_non_strictly_monotonic_positions: reset size between buffer fills
Current code uses a single counter to produce multiple buffers' worth of
data. This relies on carry-over from one buffer to the next, which happens to
work with the current memory accounting but is very fragile. Account
each buffer separately, resetting the counter between them.
2020-09-28 08:46:22 +03:00
Botond Dénes
7e909671f4 view_build_test: test_view_update_generator_deadlock: release semaphore resources
The test consumes all resources off the semaphore, leaving just enough
to admit a single reader. However this amount is calculated based on the
base cost of readers, but as we are going to track reader buffers as
well, the amount of memory consumed will be much less predictable.
So to make sure background readers can finish during shutdown, release
all the consumed resources before leaving scope.
2020-09-28 08:46:22 +03:00
Botond Dénes
122ab1aabd view_build_test: test_view_update_generator_buffering: fail the test early on exceptions
No point in continuing to process the entire buffer once a failure was
found, especially since an early failure might introduce conditions that
are not handled in the normal flow-path. We could handle these but there
is no point in this added complexity; at this point the test has failed
anyway.
2020-09-28 08:46:22 +03:00
Botond Dénes
99388590da querier_cache_test: test_resources_based_cache_eviction: use semaphore::consume() to drain semaphore
This is much simpler and more reliable than playing with
`reader_permit::wait_for_admission()`.
2020-09-28 08:46:22 +03:00
Botond Dénes
3c73cc2a4e tests: prepare for permit forwarding consumption post admission
Some tests rely on `consume*()` calls on the permit to take effect
immediately. Soon this will only be true once the permit has been
admitted, so make sure the permit is admitted in these tests.
2020-09-28 08:46:22 +03:00
Botond Dénes
5e5c94b064 test/lib/reader_lifecycle_policy: don't destroy reader context eagerly
Currently per-shard reader contexts are cleaned up as soon as the reader
itself is destroyed. This causes two problems:
* Continuations attached to the reader destroy future might rely on
  stuff in the context being kept alive -- like the semaphore.
* Shard 0's semaphore is special as it will be used to account buffers
  allocated by the multishard reader itself, so it has to be alive until
  after all readers are destroyed.

This patch changes this so that contexts are destroyed only when the
lifecycle policy itself is destroyed.
2020-09-28 08:46:22 +03:00
Takuya ASADA
8366d2231d scylla_ntp_setup: use chrony on all distributions
To simplify scylla_ntp_setup, use chrony on all distributions.
2020-09-27 12:30:02 +03:00
Rafael Ávila de Espíndola
2093efceab build: Upgrade to seastar API level 5
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200923202424.216444-1-espindola@scylladb.com>
2020-09-26 11:07:49 +03:00
Avi Kivity
36d93f586a Update seastar submodule
* seastar e215023c7...292ba734b (4):
  > future: Fix move of futures of reference type
  > doc: fix hyper link to tutorial.html
  > tutorial: fix formatting of code block
  > README.md: fix the formatting of table
2020-09-25 21:54:44 +03:00
Tomasz Grabiec
97c99ea9f3 Merge "evictable_reader: validate buffer on reader recreation" from Botond
The reader recreation mechanism is a very delicate and error-prone one,
as proven by the countless bugs it had. Most of these bugs were related
to the recreated reader not continuing the read from the expected
position, inserting out-of-order fragments into the stream.
This patch adds a defense mechanism against such bugs by validating the
start position of the recreated reader.
The intent is to prevent corrupt data from getting into the system as
well as to help catch these bugs as close to the source as possible.

Fixes: #7208

Tests: unit(dev), mutation_reader_test:debug (v4)

* botond/evictable-reader-validate-buffer/v5:
  mutation_reader_test: add unit test for evictable reader self-validation
  evictable_reader: validate buffer after recreation the underlying
  evictable_reader: update_next_position(): only use peek'd position on partition boundary
  mutation_reader_test: add unit test for evictable reader range tombstone trimming
  evictable_reader: trim range tombstones to the read clustering range
  position_in_partition_view: add position_in_partition_view before_key() overload
  flat_mutation_reader: add buffer() accessor
2020-09-25 17:02:51 +02:00
Takuya ASADA
eae2aa58fa dist/common/scripts: move back get_set_nic_and_disks_config_value to scylla_util.py
The function was mistakenly moved to scylla_sysconfig_setup, but it is also
referenced from scylla_prepare. Move it back to scylla_util.py.

Fixes #7276

Closes #7280
2020-09-25 13:05:43 +03:00
Botond Dénes
076c27318b mutation_reader_test: add unit test for evictable reader self-validation
Add both positive (where the validation should succeed) and negative
(where the validation should fail) tests, covering all validation cases.
2020-09-25 12:09:01 +03:00
Botond Dénes
0b0ae18a14 evictable_reader: validate buffer after recreation the underlying
The reader recreation mechanism is a very delicate and error-prone one,
as proven by the countless bugs it had. Most of these bugs were related
to the recreated reader not continuing the read from the expected
position, inserting out-of-order fragments into the stream.
This patch adds a defense mechanism against such bugs by validating the
start position of the recreated reader. Several things are checked:
* The partition is the expected one -- the one we were in the middle of
  or the next if we stopped at partition boundaries.
* The partition is in the read range.
* The first fragment in the partition is the expected one -- has
  an equal or larger position than the next expected fragment.
* The fragment is in the clustering range as defined by the slice.

As these validations are only done on the slow-path of recreating an
evicted reader, no performance impact is expected.
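
The four checks can be sketched roughly as below. This is a simplified
model under assumptions, not the actual evictable_reader code: partitions
and clustering positions are plain integers, all ranges are inclusive, and
every type and function name is hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical inclusive integer range.
struct int_range {
    int start;
    int end;
    bool contains(int x) const { return start <= x && x <= end; }
};

// What the reader remembered before its underlying reader was evicted.
struct resume_state {
    int last_partition;         // partition of the last emitted fragment
    bool mid_partition;         // false if we stopped at a partition boundary
    int next_position;          // next expected position (mid-partition only)
    int_range partition_range;  // the read range
    int_range clustering_range; // clustering range defined by the slice
};

// First fragment produced by the recreated reader.
struct first_fragment {
    int partition;
    int position;
};

// Returns a description of the first violated check, or nullopt if all pass.
std::optional<std::string> validate(const resume_state& s, const first_fragment& f) {
    // Same partition when resuming mid-partition, a later one otherwise.
    if (s.mid_partition ? f.partition != s.last_partition : f.partition <= s.last_partition) {
        return "unexpected partition";
    }
    if (!s.partition_range.contains(f.partition)) {
        return "partition outside the read range";
    }
    if (s.mid_partition && f.position < s.next_position) {
        return "fragment before the next expected position";
    }
    if (!s.clustering_range.contains(f.position)) {
        return "fragment outside the clustering range";
    }
    return std::nullopt;
}
```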
2020-09-25 12:09:00 +03:00
Botond Dénes
91020eef73 evictable_reader: update_next_position(): only use peek'd position on partition boundary
`evictable_reader::update_next_position()` is used to record the position the
reader will continue from, in the next buffer fill. This position is used to
create the partition slice when the underlying reader is evicted and has
to be recreated. There is an optimization in this method -- if the
underlying's buffer is not empty we peek at the first fragment in it and
use it as the next position. This is however problematic for buffer
validation on reader recreation (introduced in the next patch), because
using the next row's position as the next pos will allow for range
tombstones to be emitted with before_key(next_pos.key()), which will
trigger the validation. Instead of working around this, just drop this
optimization for mid-partition positions, it is inconsequential anyway.
We keep it for where it is important, when we detect that we are at a
partition boundary. In this case we can avoid reading the current
partition altogether when recreating the reader.
2020-09-25 12:09:00 +03:00
Botond Dénes
d1b0573e1c mutation_reader_test: add unit test for evictable reader range tombstone trimming 2020-09-25 12:09:00 +03:00
Botond Dénes
4f2e7a18e2 evictable_reader: trim range tombstones to the read clustering range
Currently mutation sources are allowed to emit range tombstones that are
outside the clustering read range if they are relevant to it. For example,
a read of a clustering range [ck100, +inf) might start with:

    range_tombstone{start={ck1, -1}, end={ck200, 1}},
    clustering_row{ck100}

The range tombstone is relevant to the range and the first row of the
range so it is emitted first, but its position (start) is outside the
read range. This is normally fine, but it poses a problem for evictable
reader. When the underlying reader is evicted and has to be recreated
from a certain clustering position, this results in out-of-order
mutation fragments being inserted into the middle of the stream. This is
not fine anymore as the monotonicity guarantee of the stream is
violated. The real solution would be to require all mutation sources to
trim range tombstones to their read range, but this is a lot of work.
Until that is done, as a workaround we do this trimming in the evictable
reader itself.
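
A minimal sketch of such trimming, assuming integer clustering positions
and inclusive bounds. The types and `trim()` are hypothetical, and clamping
both ends is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Hypothetical simplified types: positions are integers, bounds are inclusive.
struct range_tombstone {
    int start;
    int end;
};

struct clustering_range {
    int start;
    int end;
};

// Clamp the tombstone to the read range. Returns nullopt if they do not overlap.
std::optional<range_tombstone> trim(range_tombstone rt, const clustering_range& range) {
    rt.start = std::max(rt.start, range.start);
    rt.end = std::min(rt.end, range.end);
    if (rt.start > rt.end) {
        return std::nullopt;
    }
    return rt;
}
```

With the example above, a tombstone covering [1, 200] read through a range
starting at 100 would be emitted as [100, 200], so its position no longer
precedes the start of the read range.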
2020-09-25 12:09:00 +03:00
Botond Dénes
d7d93aef49 position_in_partition_view: add position_in_partition_view before_key() overload 2020-09-25 12:09:00 +03:00
Avi Kivity
f1fcf4f139 Update seastar submodule
* seastar 9ae33e67e1...e215023c78 (4):
  > future: Make futures non variadic
  > on_internal_error: add noexcept variant
  > Convert another std::result_of to std::invoke_result
  > reactor: remove unused declaration abort_on_error()
2020-09-24 20:04:03 +03:00
Tomasz Grabiec
14fdd2f501 Merge "Gossip echo message improvement" from Asias
This series improves gossip echo message handling in a loaded cluster.

Refs: #7197

* git://github.com/asias/scylla.git gossip_echo_improve_7197:
  gossiper: Handle echo message on any shard
  gossiper: Increase echo message timeout
  gossiper: Remove unused _last_processed_message_at
2020-09-24 15:13:55 +02:00
Pekka Enberg
84a0aca666 configure.py: Rename "mode" to "checkheaders_mode"
The "mode" variable name is used everywhere, usually in a loop.
Therefore, rename the global "mode" to "checkheaders_mode" so that if
your code block happens to be outside of a loop, you don't accidentally
use the globally visible "mode" and spend hours debugging why it's
always "dev".

Spotted by Yaron Kaikov.

Message-Id: <20200924112237.315817-1-penberg@scylladb.com>
2020-09-24 15:00:49 +03:00
Nadav Har'El
e1c42f2bb3 scripts/pull_github_pr.sh: show titles of more than 20 patches
The script pull_github_pr.sh uses git merge's "--log" option to put in
the merge commit the list of titles of the individual patches being
merged in. This list is useful when later searching the log for the merge
which introduced a specific feature.

Unfortunately, "--log" defaults to cutting off the list of commit titles
at 20 lines. For most merges involving fewer than 20 commits, this makes
no difference. But some merges include more than 20 commits, and get
a truncated list, for no good reason. If someone worked hard to create a
patch set with 40 patches, the last thing we should be worried about is
that the merge commit message will be 20 lines longer.

Unfortunately, there appears to be no way to tell "--log" to not limit
the length at all. So I chose an arbitrary limit of 1000. I don't think
we ever had a patch set in Scylla which exceeded that limit. Yet :-)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200924114403.817893-1-nyh@scylladb.com>
2020-09-24 14:51:58 +03:00
Piotr Dulikowski
39771967bb hinted handoff: fix race - decommission vs. endpoint mgr init
This patch fixes a race between two methods in hints manager: drain_for
and store_hint.

The first method is called when a node leaves the cluster, and it
'drains' end point hints manager for that node (sends out all hints for
that node). If this method is called when the local node is being
decommissioned or removed, it instead drains hints managers for all
endpoints.

In the case of decommission/remove, drain_for first calls
parallel_for_each on all current ep managers and tells them to drain
their hints. Then, after all of them complete, _ep_managers.clear() is
called.

End point hints managers are created lazily and inserted into
_ep_managers map the first time a hint is stored for that node. If
this happens between parallel_for_each and _ep_managers.clear()
described above, the clear operation will destroy the new ep manager
without draining it first. This is a bug and will trigger an assert in
ep manager's destructor.

To solve this, a new flag for the hints manager is added which is set
when it drains all ep managers on removenode/decommission, and prevents
further hints from being written.
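
The effect of the flag can be sketched as follows. This is a synchronous
illustration under assumptions: the real code is asynchronous, which is
where the race window comes from, and the class and member names here are
hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, synchronous model of a hints manager.
class hints_manager {
    // Endpoint -> pending hints; entries are created lazily on first store.
    std::map<std::string, std::vector<std::string>> _ep_managers;
    std::vector<std::string> _sent;
    bool _draining_all = false;
public:
    // Returns false if the hint was rejected because a full drain has started.
    bool store_hint(const std::string& endpoint, std::string hint) {
        if (_draining_all) {
            return false;
        }
        _ep_managers[endpoint].push_back(std::move(hint));
        return true;
    }

    // Drain every endpoint manager, then drop them all.
    void drain_all() {
        // Set first, so no new endpoint manager can appear between the
        // drain loop and clear().
        _draining_all = true;
        for (auto& [endpoint, hints] : _ep_managers) {
            for (auto& h : hints) {
                _sent.push_back(std::move(h));
            }
            hints.clear();
        }
        _ep_managers.clear();
    }

    std::size_t sent() const { return _sent.size(); }
    std::size_t endpoint_managers() const { return _ep_managers.size(); }
};
```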

Fixes #7257

Closes #7278
2020-09-24 14:51:24 +03:00
Nadav Har'El
a5369881b3 Merge 'sstables: make sstable_manager control the lifetime of the sstables it manages' from Avi Kivity
Currently, sstable_manager is used to create sstables, but it loses track
of them immediately afterwards. This series makes an sstable's life fully
contained within its sstable_manager.

The first practical impact (implemented in this series) is that file removal
stops being a background job; instead it is tracked by the sstable_manager,
so when the sstable_manager is stopped, you know that all of its sstable
activity is complete.

Later, we can make use of this to track the data size on disk, but this is not
implemented here.

Closes #7253

* github.com:scylladb/scylla:
  sstables: remove background_jobs(), await_background_jobs()
  sstables: make sstables_manager take charge of closing sstables
  test: test_env: hold sstables_manager with a unique_ptr
  test: drop test_sstable_manager
  test: sstables::test_env: take ownership of manager
  test: broken_sstable_test: prepare for asynchronously closed sstables_manager
  test: sstable_utils: close test_env after use
  test: sstable_test: dont leak shared_sstable outside its test_env's lifetime
  test: sstables::test_env: close self in do_with helpers
  test: perf/perf_sstable.hh: prepare for asynchronously closed sstables_manager
  test: view_build_test: prepare for asynchronously closed sstables_manager
  test: sstable_resharding_test: prepare for asynchronously closed sstables_manager
  test: sstable_mutation_test: prepare for asynchronously closed sstables_manager
  test: sstable_directory_test: prepare for asynchronously closed sstables_manager
  test: sstable_datafile_test: prepare for asynchronously closed sstables_manager
  test: sstable_conforms_to_mutation_source_test: remove references to test_sstables_manager
  test: sstable_3_x_test: remove test_sstables_manager references
  test: schema_changes_test: drop use of test_sstables_manager
  mutation_test: adjust for column_family_test_config accepting an sstables_manager
  test: lib: sstable_utils: stop using test_sstables_manager
  test: sstables test_env: introduce manager() accessor
  test: sstables test_env: introduce do_with_async_sharded()
  test: sstables test_env: introduce do_with_async_returning()
  test: lib: sstable test_env: prepare for life as a sharded<> service
  test: schema_changes_test: properly close sstables::test_env
  test: sstable_mutation_test: avoid constructing temporary sstables::test_env
  test: mutation_reader_test: avoid constructing temporary sstables::test_env
  test: sstable_3_x_test: avoid constructing temporary sstables::test_env
  test: lib: test_services: pass sstables_manager to column_family_test_config
  test: lib: sstables test_env: implement tests_env::manager()
  test: sstable_test: detemplate write_and_validate_sst()
  test: sstable_test_env: detemplate do_with_async()
  test: sstable_datafile_test: drop bad 'return'
  table: clear sstable set when stopping
  table: prevent table::stop() race with table::query()
  database: close sstable_manager:s
  sstables_manager: introduce a stub close()
  sstable_directory_test: fix threading confusion in make_sstable_directory_for*() functions
  test: sstable_datafile_test: reorder table stop in compaction_manager_test
  test: view_build_test: test_view_update_generator_register_semaphore_unit_leak: do not discard future in timer
  test: view_build_test: fix threading in test_view_update_generator_register_semaphore_unit_leak
  view: view_update_generator: drop references to sstables when stopping
2020-09-24 13:54:38 +03:00
Botond Dénes
3bb25eefb6 reader_permit: remove unused release() method
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200924090040.240906-1-bdenes@scylladb.com>
2020-09-24 12:28:00 +03:00
Avi Kivity
d45df0c705 Update tools/java submodule
* tools/java 583fc658df...4313155ab6 (1):
  > enable gettraceprobability
2020-09-24 12:19:29 +03:00
Asias He
88b7587755 gossiper: Handle echo message on any shard
Echo message does not need to access gossip internal states, so we can run
it on any shard and avoid forwarding to shard zero.

This makes gossip marking node up more robust when shard zero is loaded.

There is an argument that we should make echo message return only when
all shards have responded so that all shards are live and responding.

However, in a heavily loaded cluster, one shard might be overloaded on
multiple nodes in the cluster at the same time. If we require echo
response on all shards, there is a chance the local node will mark all peer
nodes as down. As a result, the whole cluster is down. This is much
worse than not excluding a node with a slow shard from a cluster.

Refs: #7197
2020-09-24 10:10:54 +08:00
Asias He
c7cb638e95 gossiper: Increase echo message timeout
Gossip echo message is used to confirm a node is up. In a heavily loaded
slow cluster, a node might take a long time to receive a heart beat
update, then the node uses the echo message to confirm the peer node is
really up.

If the echo message times out too early, the peer node will not be marked
as up. This is bad because a live node is marked as down and this could
happen on multiple nodes in the cluster, which causes a cluster-wide
unavailability issue. In order to prevent multiple nodes from being marked as
down, it is better to be conservative and less restrictive on echo
message timeout.

Note, echo message is not used to detect a node down. Increasing the
echo timeout does not have any impact on marking a node down in a timely
manner.

Refs: #7197
2020-09-24 09:50:09 +08:00
Asias He
173d115a64 gossiper: Remove unused _last_processed_message_at
It is not used any more. We can get rid of it.

Refs: #7197
2020-09-24 09:48:54 +08:00
Avi Kivity
2bd264ec6a sstables: remove background_jobs(), await_background_jobs()
There are no more users for registering background jobs, so remove
the mechanism and the remaining calls.
2020-09-23 20:55:17 +03:00
Avi Kivity
5db96170a5 sstables: make sstables_manager take charge of closing sstables
Currently, closing sstables happens from the sstable destructor.
This is problematic since a destructor cannot wait for I/O, so
we launch the file close process in the background. We therefore
lose track of when the closing actually takes place.

This patch makes sstables_manager take charge of the close process.
Every sstable is linked into one of two intrusive lists in its
manager: _active or _undergoing_close. When the reference count
of the sstable drops to zero, we move it from _active to
_undergoing_close and begin closing the files. sstables_manager
remembers all closes and when sstables_manager::close() is called,
it waits for all of them to complete. Therefore,
sstables_manager::close() allows us to know that all files it
manages are closed (and deleted if necessary).

The sstables_manager also gains a destructor, which disables
move construction.
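
A rough sketch of the two-list bookkeeping, under simplifying assumptions:
it uses `std::list` instead of intrusive lists, closing is synchronous
instead of future-based, and `manager`, `make_sstable()` and
`on_unreferenced()` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <utility>

// Hypothetical manager that tracks each sstable (here just a name) in one
// of two lists for its whole lifetime.
class manager {
    std::list<std::string> _active;
    std::list<std::string> _undergoing_close;
    std::size_t _closed = 0;
public:
    using handle = std::list<std::string>::iterator;

    handle make_sstable(std::string name) {
        _active.push_back(std::move(name));
        return std::prev(_active.end());
    }

    // Called when the last reference to an sstable is dropped: the entry
    // moves lists, so the manager never loses track of it.
    void on_unreferenced(handle h) {
        _undergoing_close.splice(_undergoing_close.end(), _active, h);
    }

    // Completes all pending closes. Afterwards every file that was being
    // closed is known to be closed.
    void close() {
        while (!_undergoing_close.empty()) {
            _undergoing_close.pop_front();
            ++_closed;
        }
    }

    std::size_t active() const { return _active.size(); }
    std::size_t undergoing_close() const { return _undergoing_close.size(); }
    std::size_t closed() const { return _closed; }
};
```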
2020-09-23 20:55:17 +03:00
Avi Kivity
ad8620c289 test: test_env: hold sstables_manager with a unique_ptr
sstables_manager should not be movable (since sstables hold a reference
to it). A following patch will enforce it.

Prepare by using unique_ptr to hold test_env::_manager. Right now, we'll
invoke sstables_manager move construction when creating a test_env with
do_with().

We could have chosen to update sstables when their sstables_manager is
moved, but we get nothing for the complexity.
2020-09-23 20:55:16 +03:00
Avi Kivity
fd61ebb095 test: drop test_sstable_manager
With no users left (apart from some variants of column_family_test_config,
which are removed in this patch), remove it.

test_sstable_manager obstructs sstables_manager from taking charge
of sstables ownership, since it is a thread-local object. We can't close it,
since it will be used in the next test to run.
2020-09-23 20:55:16 +03:00
Avi Kivity
d4c1b62f81 test: sstables::test_env: take ownership of manager
Instead of using test_sstables_manager, which we plan to drop,
carry our own sstables_manager in test_env, and close it when
test_env::stop() is called.
2020-09-23 20:55:16 +03:00
Avi Kivity
4b24e58858 test: broken_sstable_test: prepare for asynchronously closed sstables_manager
Instead of using an asynchronously-constructed and destroyed
test_env, obtain one using test_env::do_with(), which is
prepared for async close.
2020-09-23 20:55:15 +03:00
Avi Kivity
8c3ae648d9 test: sstable_utils: close test_env after use
test_env will soon manage its sstable_manager's lifetime, which
requires closing, so close the test_env.
2020-09-23 20:55:15 +03:00
Avi Kivity
67a887110d test: sstable_test: dont leak shared_sstable outside its test_env's lifetime
do_write_sst() creates a test_env, creates a shared_sstable using that test_env,
and destroys the test_env, and returns the sstable. This works now but will
stop working once sstable_manager becomes responsible for sstable lifetime.

Fortunately, do_write_sst() has one caller that isn't interested in the
return value at all, so fold it into that caller.
2020-09-23 20:55:15 +03:00
Avi Kivity
1dd3079d67 test: sstables::test_env: close self in do_with helpers
The test_env::do_with() helpers are convenient for creating a scope
containing a test_env. Prepare them for asynchronously closed
sstables_manager by closing the test_env after use (which will,
in the future, close the embedded sstables_manager).
2020-09-23 20:55:14 +03:00
Avi Kivity
5e292897df test: perf/perf_sstable.hh: prepare for asynchronously closed sstables_manager
Obtain a test_env using do_with_async_returning(); and pass it to
column_family_test_config so it can stop using test_sstables_manager.
2020-09-23 20:55:14 +03:00
Avi Kivity
16e6abfa27 test: view_build_test: prepare for asynchronously closed sstables_manager
Stop using test_sstables_manager, which is going away. Instead obtain
a managed sstables_manager via cql_test_env.
2020-09-23 20:55:14 +03:00
Avi Kivity
97b36c38e6 test: sstable_resharding_test: prepare for asynchronously closed sstables_manager
Close an explicitly-created test_env, and stop using test_sstables_manager
which will disappear.
2020-09-23 20:55:13 +03:00
Avi Kivity
a2c4f65c63 test: sstable_mutation_test: prepare for asynchronously closed sstables_manager
sstables_manager will soon be closed asynchronously, with a future-returning
close() function. To prepare for that, make the following changes:
 - replace test_sstables_manager with an sstables_manager obtained from test_env
 - drop unneeded calls to await_background_jobs()

These changes allow lifetime management of the sstables_manager used
in the tests to be centralized in test_env.
2020-09-23 20:55:13 +03:00
Avi Kivity
f671aa60f3 test: sstable_directory_test: prepare for asynchronously closed sstables_manager
sstables_manager will soon be closed asynchronously, with a future-returning
close() function. To prepare for that, make the following changes:
 - acquire a test_env with test_env::do_with() (or the sharded variant)
 - change the sstable_from_existing_file function to be a functor that
   works with either cql_test_env or test_env (as this is what individual
   tests want); drop use of test_sstables_manager
 - change new_sstable() to accept a test_env instead of using test_sstables_manager
 - replace test_sstables_manager with an sstables_manager obtained from test_env

These changes allow lifetime management of the sstables_manager used
in the tests to be centralized in test_env.
2020-09-23 20:55:12 +03:00
Avi Kivity
3976066156 test: sstable_datafile_test: prepare for asynchronously closed sstables_manager
sstables_manager will soon be closed asynchronously, with a future-returning
close() function. To prepare for that, make the following changes:
 - replace on-stack test_env with test_env::do_with()
 - use the variant of column_family_for_tests that accepts an sstables_manager
 - replace test_sstables_manager with an sstables_manager obtained from test_env

These changes allow lifetime management of the sstables_manager used
in the tests to be centralized in test_env.

Since test_env now calls await_background_jobs on termination, those
calls are dropped.
2020-09-23 20:55:12 +03:00
Avi Kivity
d8c82312e0 test: sstable_conforms_to_mutation_source_test: remove references to test_sstables_manager
Use the sstables_manager from test_env. Use do_with_async() to create the test_env,
to allow for proper closing.

Since do_with_async() also takes care of await_background_jobs(), remove that too.
2020-09-23 20:55:12 +03:00
Avi Kivity
28078928ee test: sstable_3_x_test: remove test_sstables_manager references
test_sstables_manager is going away, so replace it by test_env::manager().

column_family_test_config() has an implicit reference to test_sstables_manager,
so pass test_env::manager() as a parameter.

Calls to await_background_jobs() are removed, since test_env::stop() performs
the same task.

The large rows tests are special, since they use a custom sstables_manager,
so instead of using a test_env, they just close their local sstables_manager.
2020-09-23 20:55:11 +03:00
Avi Kivity
b19b72455b test: schema_changes_test: drop use of test_sstables_manager
It is going away, so get the manager from the test_env object (more
accurate anyway).
2020-09-23 20:55:11 +03:00
Avi Kivity
0134e2f436 mutation_test: adjust for column_family_test_config accepting an sstables_manager
Acquire a test_env and extract an sstables_manager from that, passing it
to column_family_test_config, in preparation for losing the default
constructor of column_family_test_config.
2020-09-23 20:55:11 +03:00
Avi Kivity
85087478fc test: lib: sstable_utils: stop using test_sstables_manager
It will be retired soon. Extract the sstable_manager from the sstable itself.
2020-09-23 20:55:10 +03:00
Avi Kivity
f9aa50dcbf test: sstables test_env: introduce manager() accessor
This returns the sstables_manager carried by the test_env. We
will soon retire the global test_sstables_manager, so we need
to provide access to one.
2020-09-23 20:55:10 +03:00
Avi Kivity
9399f06e86 test: sstables test_env: introduce do_with_async_sharded()
Some tests need a test_env across multiple shards. Introduce a variant
of do_with_async() that supplies it.
2020-09-23 20:55:10 +03:00
Avi Kivity
a8e7c04fc9 test: sstables test_env: introduce do_with_async_returning()
Similar to do_with_async(), but returning a non-void return type.
Will be used in test/perf.
2020-09-23 20:55:09 +03:00
Avi Kivity
784d29a75b test: lib: sstable test_env: prepare for life as a sharded<> service
Some tests need a sharded sstables_manager, prepare for that by
adding a stop() method and helpers for creating a sharded service.

Since test_env doesn't yet contain its own sstable_manager, this
can't be used in real life yet.
2020-09-23 20:55:09 +03:00
Avi Kivity
d6bf27be9e test: schema_changes_test: properly close sstables::test_env
sstables::test_env needs to be properly closed (and will soon need it
even more). Use test_env::do_with_async() to do that. Removed
await_background_jobs(), which is now done by test_env::close().
2020-09-23 20:55:08 +03:00
Avi Kivity
e98e5e0a52 test: sstable_mutation_test: avoid constructing temporary sstables::test_env
A test_env contains an sstables_manager, which will soon have a close() method.
As such, it can no longer be a temporary. Switch to using test_env::do_with_async().

As a bonus, test_env::do_with_async() performs await_background_jobs() for us, so
we can drop it from the call sites.
2020-09-23 20:55:08 +03:00
Avi Kivity
6fd4601cf8 test: mutation_reader_test: avoid constructing temporary sstables::test_env
A test_env contains an sstables_manager, which will soon have a close() method.
As such, it can no longer be a temporary. Switch to using test_env::do_with_async().
2020-09-23 20:55:08 +03:00
Avi Kivity
15963e1144 test: sstable_3_x_test: avoid constructing temporary sstables::test_env
A test_env contains an sstables_manager, which will soon have a close() method.
As such, it can no longer be a temporary. Switch to using test_env::do_with_async().

As a bonus, test_env::do_with_async() performs await_background_jobs() for us, so
we can drop it from the call sites.
2020-09-23 20:55:07 +03:00
Avi Kivity
0fbdb009d5 test: lib: test_services: pass sstables_manager to column_family_test_config
Since we're dropping test_sstables_manager, we'll require callers to pass it
to column_family_test_config, so provide overloads that accept it.

The original overloads (that don't accept an sstables_manager) remain for
the transition period.
2020-09-23 20:55:07 +03:00
Avi Kivity
72c13199d8 test: lib: sstables test_env: implement test_env::manager()
Some tests are now referencing the global test_sstables_manager,
which we plan to remove. Add test_env::manager() as a way to
reference the sstables_manager that the test_env contains.
2020-09-23 20:55:07 +03:00
Avi Kivity
437e131aef test: sstable_test: detemplate write_and_validate_sst()
Reduce code bloat and improve error messages by replacing a template with
noncopyable_function<>.
2020-09-23 20:55:06 +03:00
Avi Kivity
956cd9ee8d test: sstable_test_env: detemplate do_with_async()
Reduce code bloat and improve error messages by using noncopyable_function<>
instead of a template.
2020-09-23 20:55:06 +03:00
Avi Kivity
1c1a737eda test: sstable_datafile_test: drop bad 'return'
The pattern

    return function_returning_a_future().get();

is legal, but confusing. It returns an unexpected std::tuple<>. Here,
it doesn't do any harm, but if we try to coerce the surrounding code
into a signature (void ()), then that will fail.

Remove the unneeded and unexpected return.
2020-09-23 20:55:06 +03:00
Avi Kivity
88ea02bfeb table: clear sstable set when stopping
Drop references to a table's sstables when stopping it, so that
the sstable_manager can start deleting it. This includes staging
sstables.

Although the table is no longer in use at this time, maintain
cache synchronity by calling row_cache::invalidate() (this also
has the benefit of avoiding a stall in row_cache's destructor).
We also refresh the cache's view of the sstable set to drop
the cache's references.
2020-09-23 20:55:05 +03:00
Avi Kivity
9932e6a899 table: prevent table::stop() race with table::query()
Take the gate in table::query() so that stop() waits for queries. The gate
is already waited for in table::stop().

This allows us to know we are no longer using the table's sstables in table::stop().
2020-09-23 20:55:05 +03:00
Avi Kivity
9f886f303c database: close sstable_manager:s
The database class owns two sstable_manager:s - one for user sstables and
one for system sstables. Now that they have a close() method, call it.
2020-09-23 20:55:05 +03:00
Avi Kivity
a90a511d36 sstables_manager: introduce a stub close()
sstables_manager is going to take charge of its sstables lifetimes,
so it will need a close() to wait until sstables are deleted.

This patch adds sstables_manager::close() so that the surrounding
infrastructure can be wired to call it. Once that's done, we can
make it do the waiting.
2020-09-23 20:55:04 +03:00
Avi Kivity
0de2c55f95 sstable_directory_test: fix threading confusion in make_sstable_directory_for*() functions
The make_sstable_directory_for*() functions run in a thread, and
call functions that run in a thread, but return a future. This
more or less works but is a dangerous construct that can fail.

Fix by returning a regular value.
2020-09-23 20:55:04 +03:00
Avi Kivity
c27c2a06bb test: sstable_datafile_test: reorder table stop in compaction_manager_test
Stopping a table will soon close its sstables; so the next check will fail
as the number of sstables for the table will be zero.

Reorder the stop() call to make it safe.

We don't need the stop() for the check, since the previous loop made sure
compactions completed.
2020-09-23 20:55:03 +03:00
Avi Kivity
fd1c201ed4 test: view_build_test: test_view_update_generator_register_semaphore_unit_leak: do not discard future in timer
test_view_update_generator_register_semaphore_unit_leak creates a continuation chain
inside a timer, but does not wait for it. This can result in part of the chain
being executed after its captures have been destroyed.

This is unlikely to happen since the timer fires only if the test fails, and
tests never fail (at least in the way that one expects).

Fix by waiting for that future to complete before exiting the thread.
2020-09-23 20:55:03 +03:00
Avi Kivity
33c9563dc9 test: view_build_test: fix threading in test_view_update_generator_register_semaphore_unit_leak
test_view_update_generator_register_semaphore_unit_leak uses a thread function
in do_with_cql_env(), even though the latter doesn't promise a thread and
accepts a regular function-returning-a-future. It happens to work because the
function happens to be called in a thread, but this isn't guaranteed.

Switch to do_with_cql_env_thread(), which guarantees a thread context.
2020-09-23 20:55:03 +03:00
Avi Kivity
844b675520 view: view_update_generator: drop references to sstables when stopping
sstable_manager will soon wait for all sstables under its
control to be deleted (if so marked), but that can't happen
if someone is holding on to references to those sstables.

To allow sstables_manager::stop() to work, drop remaining
queued work when terminating.
2020-09-23 20:55:02 +03:00
Nadav Har'El
a2cc599a2a scripts/pull_github_pr.sh: some nicer messages
The script scripts/pull_github_pr.sh begins by fetching some information
from github, which can cause a noticeable wait that the user doesn't
understand - so in this patch we add a couple of messages on what is
happening at the beginning of the script.

Moreover, if an invalid pull-request number is given, the script used
to give mysterious errors when incorrect commands ran using the name
"null" - in this patch we recognize this case and print a clear "Not Found"
error message.

Finally, the PR_REPO variable was never used, so this patch removes it.

Message-Id: <20200923151905.674565-1-nyh@scylladb.com>
2020-09-23 20:53:23 +03:00
Avi Kivity
a63a00b0ea scripts/pull_pr: don't pollute local branch namespace
Currently, scripts/pull_pr pollutes the local branch namespace
by creating a branch and never deleting it. This can be avoided
by using FETCH_HEAD, a temporary name automatically assigned by
git to fetches with no destination.
2020-09-23 15:47:51 +03:00
Avi Kivity
d3588d72c7 Merge "Per semaphore read metrics" from Botond
"
Currently all logical read operations are counted in a single pair of
metrics (successful/failed) located in the `database::db_stats`. This
prevents observing the number of reads executed against the user/system
read semaphores. This distinction is especially interesting since
0c6bbc84c which selects the semaphore for each read based on the
scheduling group it is running under.

This mini series moves these counters into the semaphore and updates the
exported metrics accordingly, the `total_reads` and `total_reads_failed`
now have a user/system label, just like the other semaphore dependent
metrics.

Tests: manual(checked that new metric works)
"

* 'per-semaphore-read-metrics/v2' of https://github.com/denesb/scylla:
  database: move total_reads* metrics to the concurrency semaphore
  database: setup_metrics(): split the registering database metrics in two
  reader_concurrency_semaphore: add non-const stats accessor
  reader_concurrency_semaphore: s/inactive_read_stats/stats/
2020-09-23 15:25:18 +03:00
Botond Dénes
d7e794e565 database: move total_reads* metrics to the concurrency semaphore 2020-09-23 14:10:24 +03:00
Botond Dénes
32ff524454 database: setup_metrics(): split the registering database metrics in two
Currently all "database" metrics are registered in a single call to
`metric_groups::add_group()`. As all the metrics to-be-registered are
passed in a single initializer list, this blows up the stack size, to
the point that adding a single new metric causes it to exceed the
currently configured max-stack-size of 13696 bytes. To reduce stack
usage, split the single call in two, roughly in the middle. While we
could try to come up with some logical grouping of metrics and do much
arranging and code-movement I think we might as well just split into two
arbitrary groups, containing roughly the same amount of metrics.
2020-09-23 14:06:20 +03:00
Pekka Enberg
9a19c028e4 scylla_kernel_check: Switch to os.makedirs() function
Commit 8e1f7d4fc7 ("dist/common/scripts: drop makedirs(), use
os.makedirs()") dropped the "makedirs()" function, but forgot to convert
the caller in scylla_kernel_check to os.makedirs().

Message-Id: <20200923104510.230244-1-penberg@scylladb.com>
2020-09-23 13:54:43 +03:00
Botond Dénes
593232be0a reader_concurrency_semaphore: add non-const stats accessor
In the next patch we will add externally updated stats, which need a
non-const reference to the stats member.
2020-09-23 13:11:55 +03:00
Botond Dénes
c18756ce9a reader_concurrency_semaphore: s/inactive_read_stats/stats/
In preparation for non-inactive read stats being added to the semaphore,
rename its existing stats struct and member to a more generic name.
Fields whose names only made sense in the context of the old name are
adjusted accordingly.
2020-09-23 13:11:55 +03:00
Pekka Enberg
bc13e596fe Update tools/java submodule
* tools/java d0cfef38d2...583fc658df (1):
  > build: support passing product-version-release as a parameter
2020-09-23 12:58:34 +03:00
Pekka Enberg
b0447f3245 Update tools/jmx submodule
* tools/jmx 6795a22...45e4f28 (1):
  > build: support passing product-version-release as a parameter
2020-09-23 12:58:30 +03:00
Nadav Har'El
4c2e026e04 alternator streams: fix NextShardIterator for closed shard
As the test test_streams_closed_read confirmed, when a stream shard is
closed, GetRecords should not return a NextShardIterator at all.
Before this patch we wrongly returned an empty string for it.

Before this patch, several Alternator Stream tests (in test_streams.py)
failed when running against a multi-node Scylla cluster. The reason is as
follows: As a multi-node cluster boots and more and more nodes enter the
cluster, the cluster changes its mind about the token ownership, and
therefore the list of stream shards changes. By the time we have the full
cluster, a bunch of shards were created and closed without any data yet.
All the tests will see these closed shards, and need to understand them.
The fetch_more() utility function correctly assumed that a closed shard
does not return a NextShardIterator, and got confused by the empty string
we used to return.

Now that closed shards can return responses without NextShardIterator,
we also needed to fix in this patch a couple of tests which wrongly assumed
this can't happen. These tests did not fail on DynamoDB because unlike in
Scylla, DynamoDB does not have any closed shards in normal tests which
do not specifically cause them (only test_streams_closed_read).

We also need to fix test_streams_closed_read to get rid of an unnecessary
assumption: It currently assumes that when the very last item in
a closed shard is read, the end-of-shard is immediately signaled (i.e.,
NextShardIterator is not returned). Although DynamoDB does in fact do this,
it is also perfectly legal for Alternator's implementation to return the
last item with a new NextShardIterator - and only when the client reads
from that iterator, we finally signal the end of the shard.

Fixes #7237.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200922082529.511199-1-nyh@scylladb.com>
2020-09-23 09:25:10 +02:00
Avi Kivity
0852f33988 build: disable many clang-specific warnings
While many of these warnings are important, they can be addressed
at a lower priority. In any case it will be easier to enforce
the warnings if/when we switch to clang.
2020-09-22 23:09:48 +03:00
Tomasz Grabiec
a22645b7dd Merge "Unfriend rows_entry, cache_tracker and mutation_partition" from Pavel Emelyanov
The classes touch private data of each other for no real
reason. Putting the interaction behind API makes it easier
to track the usage.

* xemul/br-unfriends-in-row-cache-2:
  row cache: Unfriend classes from each other
  rows_entry: Move container/hooks types declarations
  rows_entry: Simplify LRU unlink
  mutation_partition: Define .replace_with method for rows_entry
  mutation_partition: Use rows_entry::apply_monotonically
2020-09-22 21:18:14 +02:00
Nadav Har'El
73cb9e3f61 merge: Fix some issues found by clang
Merged pull request https://github.com/scylladb/scylla/pull/7264 by
Avi Kivity (29 commits):

This series fixes issues found by clang. Most are real issues that gcc just
doesn't find, a few are due to clang lagging behind on some C++ updates.
See individual explanations in patches.

The series is not sufficient to build with clang; it just addresses the
simple problems. Two larger problems remain: clang isn't able to compile
std::ranges (not clear yet whether this is a libstdc++ problem or a clang
problem) and clang can't capture structured binding variables (due to
lagging behind on the standard).

The motivation for building with clang is gaining access to a working
implementation of coroutines and modules.

This series compiles with gcc and the unit tests pass.
2020-09-22 21:42:28 +03:00
Botond Dénes
a0107ba1c6 reader_permit: reader_resources: make true RAII class
Currently in all cases we first deduct the to-be-consumed resources,
then construct the `reader_resources` class to protect it (release it on
destruction). This is error prone as it relies on no exception being
thrown while constructing the `reader_resources`. Albeit the
`reader_resources` constructor is `noexcept` right now this might change
in the future and as the call sites relying on this are disconnected
from the declaration, the one modifying them might not notice.
To make this safe going forward, make the `reader_resources` a true RAII
class, consuming the units in its constructor and releasing them in its
destructor.
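
The pattern described above can be sketched roughly as follows. This is
only an illustration under stated assumptions: `resource_pool` and
`resource_units` are hypothetical names, not the actual
`reader_permit`/`reader_resources` code. The point is that consumption
happens inside the constructor, so there is no window between deducting
the units and protecting them.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for a semaphore-like pool of resources.
class resource_pool {
    std::size_t _available;
public:
    explicit resource_pool(std::size_t n) : _available(n) {}
    void consume(std::size_t n) {
        if (n > _available) {
            throw std::runtime_error("not enough resources");
        }
        _available -= n;
    }
    void signal(std::size_t n) noexcept { _available += n; }
    std::size_t available() const noexcept { return _available; }
};

// RAII guard (hypothetical): consumes the units in its constructor and
// releases them in its destructor.
class resource_units {
    resource_pool& _pool;
    std::size_t _n;
public:
    resource_units(resource_pool& pool, std::size_t n) : _pool(pool), _n(n) {
        // If consume() throws, the object was never constructed, so the
        // destructor does not run and nothing is released by mistake.
        _pool.consume(_n);
    }
    resource_units(const resource_units&) = delete;
    resource_units& operator=(const resource_units&) = delete;
    ~resource_units() { _pool.signal(_n); }
};
```

A caller simply constructs a `resource_units` in the scope that needs
the resources; leaving the scope, normally or by exception, returns them.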

Fixes: #7256

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200922150625.1253798-1-bdenes@scylladb.com>
2020-09-22 18:13:35 +03:00
Avi Kivity
31a5378a82 utils: utf8: avoid harmless integer overflow
240 doesn't fit in char without overflow, so cast it explicitly
to avoid a clang warning.
2020-09-22 17:24:33 +03:00
Avi Kivity
e12c72ad55 utils: multiprecision_int: disambiguate operator templates by adding overloads
We have templates for multiprecision_int for both sides of the operator,
for example:

    template <typename T>
    bool operator==(const T& x) const

and

    template <typename T>
    friend bool operator==(const T& x, const multiprecision_int& y)

Clang considers them equally satisfying when both operands are
multiprecision_int, so provide a disambiguating overload.
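
A minimal sketch of the situation, using a hypothetical `big_int`
wrapper rather than the real multiprecision_int. Without the
non-template overload, comparing two `big_int` objects matches both
templates equally well; the non-template overload is preferred by
overload resolution and removes the ambiguity.

```cpp
#include <cassert>

// Hypothetical wrapper standing in for multiprecision_int.
class big_int {
    long _v;
public:
    explicit big_int(long v) : _v(v) {}

    // Disambiguating overload: a non-template function is preferred over
    // template specializations when both operands are big_int.
    bool operator==(const big_int& x) const { return _v == x._v; }

    // big_int on the left, arithmetic type on the right.
    template <typename T>
    bool operator==(const T& x) const { return _v == x; }

    // Arithmetic type on the left, big_int on the right.
    template <typename T>
    friend bool operator==(const T& x, const big_int& y) { return x == y._v; }
};
```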
2020-09-22 17:24:33 +03:00
Avi Kivity
d1c049b202 utils: error_injection: remove forward-declared function returning auto
Clang dislikes forward-declared functions returning auto, so declare the
type up front. Functions returning auto are a readability problem
anyway.

To solve a circular dependency problem (get_local_injector() ->
error_injection<> -> get_local_injector()), which is further compounded
by problems in using template specializations before they are defined
(which is forbidden), the storage for get_local_injector() was moved
to error_injection<>, and get_local_injector() is just an accessor.
After this, error_injection<> does not depend on get_local_injector().
2020-09-22 17:24:33 +03:00
Avi Kivity
765e632626 utils: bptree: remove redundant and possibly wrong friend declaration
Clang complains about befriending a constructor. It's possibly correct.
In any case it's redundant, so remove it.
2020-09-22 17:24:33 +03:00
Avi Kivity
c7105019b2 utils: bptree: add missing typename for clang
Clang does not implement p0634r3, so we must add more typenames.
2020-09-22 17:24:33 +03:00
Avi Kivity
0d25ea5a67 utils: bloom_calculations: avoid gratuitous conversion to double
The conversion to double evokes a complaint about precision loss
from clang, and is unneeded anyway, so use integral types throughout.
2020-09-22 17:24:33 +03:00
Avi Kivity
4c93ec8351 utils: updateable_value: fix nullptr_t name
nullptr_t's full name is std::nullptr_t. gcc somehow allows plain nullptr_t,
but that's not correct. Clang rejects it.

Use std::nullptr_t.
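
For illustration, a small sketch of the portable spelling. The
`maybe_ptr` type is hypothetical and only exists to show an overload on
`std::nullptr_t`, which is declared in `<cstddef>`.

```cpp
#include <cassert>
#include <cstddef>   // declares std::nullptr_t

// Hypothetical type with a constructor overloaded on std::nullptr_t.
struct maybe_ptr {
    const int* p;
    maybe_ptr(std::nullptr_t) : p(nullptr) {}   // qualified name, accepted by gcc and clang
    maybe_ptr(const int* q) : p(q) {}
    bool empty() const { return p == nullptr; }
};
```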
2020-09-22 17:24:33 +03:00
Avi Kivity
3570533e8f tracing: fix nullptr_t name
nullptr_t's full name is std::nullptr_t. gcc somehow allows plain nullptr_t,
but that's not correct. Clang rejects it.

Use std::nullptr_t.
2020-09-22 17:24:33 +03:00
Avi Kivity
dba07440c9 test: sstable_directory_test: make new_sstable() not a template
new_sstable is defined as a template, and later used in a context
that requires an object. Somehow gcc uses an instantiation with
an empty template parameter list, but I don't think it's right,
and clang refuses.

Since the template is gratuitous anyway, just make it a regular
function.
2020-09-22 17:24:33 +03:00
Avi Kivity
70ea785cc7 test: cql_query_test: don't use std::pow() in constexpr context
std::pow() is not constexpr, and clang correctly refuses to assign
its result in constexpr context. Add a constexpr replacement.
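
A constexpr replacement could look roughly like the sketch below. The
name `ipow` and the integer-only signature are assumptions made for
illustration; the helper actually added to the test may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constexpr integer power (exponentiation by squaring).
constexpr std::int64_t ipow(std::int64_t base, unsigned exp) {
    std::int64_t result = 1;
    while (exp > 0) {
        if (exp & 1u) {
            result *= base;
        }
        exp >>= 1;
        if (exp > 0) {
            base *= base;   // skip the final squaring to avoid needless overflow
        }
    }
    return result;
}

// Usable where a constant expression is required, unlike std::pow().
static_assert(ipow(10, 3) == 1000, "ipow is evaluated at compile time");
```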
2020-09-22 17:24:25 +03:00
Nadav Har'El
c1e8d077a4 alternator test: add test for behavior of closed stream shards
This patch adds a test, test_streams_closed_read, which reproduces
two issues in Alternator Streams, regarding the behavior of *closed*
stream shards:

Refs #7239: After streaming is disabled, the stream should still be readable,
it's just that all its shards are now "closed".

Refs #7237: When reaching the end of a closed shard, NextShardIterator should
be missing. Not set to an empty string as we do today.

The test passes on DynamoDB, and xfails on Alternator, and should continue to
do so until both issues are fixed.

This patch changes the implementation of the disable_stream() function.
This function was never actually used by the existing code, and now that
I wanted to use it, I discovered it didn't work as expected and had to fix it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200915134643.236273-1-nyh@scylladb.com>
2020-09-22 10:18:01 +02:00
Pavel Emelyanov
a75b048616 gossiper: Unregister verbs if shadow round aborts start
The gossiper verbs are registered in two places -- start_gossiping
and do_shadow_round(). And unregistered in one -- stop_gossiping
iff the start took place. Respectively, there's a chance that after
a shadow round scylla exits without starting gossiping thus leaving
verbs armed.

Fix by unregistering verbs on stop if they are still registered.

fixes: #7262
tests: manual(start, abort start after shadow round), unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200921140357.24495-1-xemul@scylladb.com>
2020-09-22 10:18:01 +02:00
Pavel Emelyanov
550fc734d9 query_pager: Fix continuation handling for noop visitor
Before updating the _last_[cp]key (for subsequent .fetch_page())
the pager checks 'if the pager is not exhausted OR the result
has data'.

The check seems broken: if the pager is not exhausted, but the
result is empty the call for keys will unconditionally try to
reference the last element from empty vector. The not exhausted
condition for empty result can happen if the short_read is set,
which, in turn, unconditionally happens upon meeting partition
end when visiting the partition with result builder.

The correct check should be 'if the pager is not exhausted AND
the result has data': the _last_[pc]key-s should be taken for
continuation (not exhausted), but can only be taken if the result is
not empty (has data).
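
The corrected condition can be sketched as below. `pager_sketch` is a
hypothetical, heavily simplified stand-in for the pager, with integer
rows instead of partition and clustering keys.

```cpp
#include <cassert>
#include <vector>

// Hypothetical simplified pager state; not the real query_pager.
struct pager_sketch {
    bool exhausted = false;
    bool has_last_key = false;
    int last_key = 0;

    void handle_page(const std::vector<int>& rows, bool page_exhausted) {
        exhausted = page_exhausted;
        // AND, not OR: the key is only needed when there is a next page,
        // and can only be read when the page actually holds rows.
        if (!exhausted && !rows.empty()) {
            last_key = rows.back();
            has_last_key = true;
        }
    }
};
```

With the OR form, an empty page from a pager that is not yet exhausted
would reach `rows.back()` on an empty vector.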

fixes: #7263
tests: unit(dev), but tests don't trigger this corner case

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200921124329.21209-1-xemul@scylladb.com>
2020-09-22 10:18:01 +02:00
Pavel Emelyanov
13281c2d79 results-view: Abort early if messing with empty vector
The .get_last_partition_and_clustering_key() method gets
the last partition from the on-board vector of partitions.
The vector in question is assumed not to be empty, but if
this assumption breaks, the result will look like memory
corruption (docs say that calling back() on an empty vector
results in undefined behavior).
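
A guard of this kind might look like the following sketch. The helper
name `checked_back` is hypothetical, and it throws where the real code
may choose to abort instead.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical helper: fail loudly instead of invoking undefined
// behavior through back() on an empty vector.
template <typename T>
const T& checked_back(const std::vector<T>& v) {
    if (v.empty()) {
        throw std::logic_error("checked_back(): vector is empty");
    }
    return v.back();
}
```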

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200921122948.20585-1-xemul@scylladb.com>
2020-09-22 10:18:01 +02:00
Pekka Enberg
ea8e545e4e Update tools/java submodule
* tools/java 2e2c056c07...d0cfef38d2 (1):
  > sstableloader: Support range boundary tombstones
2020-09-22 10:18:01 +02:00
Ivan Prisyazhnyy
f4412029f4 docs/docker-hub.md: add quickstart section with --smp 1
Also provide formula to calculate proper value for aio-max-nr.

Closes #7252
2020-09-22 10:18:01 +02:00
Ivan Prisyazhnyy
59463d6e5f docs/docker-hub.md: add jmx doc
Describe flags that allow overriding JMX service startup for the docker
container.

Closes #7250
2020-09-22 10:18:01 +02:00
Takuya ASADA
f243ccfa89 scylla_cpuscaling_setup: add Install section for scylla-cpupower.service
An Install section is required to enable/disable services.

Fixes #7230
2020-09-22 10:18:01 +02:00
Avi Kivity
bf8c8d292a test: cql_query_test: disambiguate single-element initializer_list for clang
Clang has a hard time dealing with single-element initializer lists. In this
case, adding an explicit conversion allows it to match the
initializer_list<data_value> parameter.
2020-09-21 20:16:11 +03:00
Avi Kivity
11835f7aa6 test: avoid using literal suffix 'd'
There is no literal suffix 'd', yet we use it for double-precision floats.
Clang rightly complains, so remove it.
2020-09-21 16:32:53 +03:00
Avi Kivity
d19c6c0d98 sstables: size_tiered_backlog_tracker: avoid assignment of non-constexpr expression to constexpr object
std::log() is not constexpr, so it cannot be assigned to a constexpr object.

Make it non-constexpr and automatic. The optimizer still figures out that it's
constant and optimizes it.

Found by clang. Apparently gcc only checks the expression is constant, not
constexpr.
2020-09-21 16:32:53 +03:00
Avi Kivity
a155b2bced sstables: leveled_manifest: prevent benign precision loss warning
Casting from the maximum int64_t to double loses precision, because
int64_t has 64 bits of precision while double has only 53. Clang
warns about it. Since it's not a real problem here, add an explicit
cast to silence the warning.
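
To illustrate the precision loss and the explicit cast, here is a small
sketch; the function name is an assumption made for the example.
Assuming IEEE 754 doubles and the default round-to-nearest mode,
2^63 - 1 is not representable and converts to exactly 2^63.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: the explicit cast states that the rounding is
// intended, which is what silences the implicit-conversion warning.
constexpr double max_int64_as_double() {
    return static_cast<double>(std::numeric_limits<std::int64_t>::max());
}
```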
2020-09-21 16:32:53 +03:00
Avi Kivity
aa7426bde6 sstables: index_reader: make 'index_bound' public
index_reader::index_bound must be constructible by non-friend classes
since it's used in std::optional (which isn't anyone's friend). This
now works in gcc because gcc's inter-template access checking is broken,
but clang correctly rejects it.
2020-09-21 16:32:53 +03:00
Avi Kivity
bd42bdd6b5 sstables: index_reader: disambiguate promoted_index_blocks_reader "state" type and data member
promoted_index_blocks_reader has a data member called "state", and a type member
called "state". Somehow gcc manages to disambiguate the two when used, but
clang doesn't. I believe clang is correct here, one member should subsume the other.

Change the type member to have a different name to disambiguate the two.
2020-09-21 16:32:53 +03:00
Avi Kivity
422a7e07a3 timestamp_based_splitting_writer: supply a parameter to std::out_of_range constructor
std::out_of_range does not have a default constructor, yet gcc somehow
accepts a no-argument construction. Clang (correctly) doesn't, so add
a parameter.
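
The standard library behavior can be checked directly; the helper
below is a hypothetical example of a call site that passes a message.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <type_traits>

// std::out_of_range has no default constructor; a message is required.
static_assert(!std::is_default_constructible<std::out_of_range>::value,
              "std::out_of_range needs a what() argument");

// Hypothetical helper that throws with an explicit message.
inline void throw_out_of_range(const std::string& what) {
    throw std::out_of_range(what);
}
```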
2020-09-21 16:32:53 +03:00
Avi Kivity
a0ffcabd66 view: use nonwrapping_interval instead of nonwrapping_range to avoid clang deduction failure
We use class template argument deduction (CTAD) in a few places, but it appears
not to work for alias templates in clang. While it looks like a clang bug, using
the class name is an improvement, so let's do that.
2020-09-21 16:32:53 +03:00
Avi Kivity
933bc7bd99 cql3: select_statement: fix incorrect implicit conversion of bool_class to bool
bool_class only has explicit conversion to bool, so an assignment such as

   bool x = bool_class<foo>(true);

ought to fail. Somehow gcc allows it, but I believe clang is correct in
disallowing it.

Fix by using 'auto' to avoid the conversion.
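
A sketch of why the implicit conversion is ill-formed and how `auto`
sidesteps it. `strong_bool` and `is_reversed` are hypothetical names
standing in for bool_class and one of its instantiations.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical strongly typed boolean with explicit-only conversions.
template <typename Tag>
class strong_bool {
    bool _v;
public:
    constexpr explicit strong_bool(bool v) : _v(v) {}
    constexpr explicit operator bool() const { return _v; }
};

struct is_reversed_tag {};
using is_reversed = strong_bool<is_reversed_tag>;

// "bool x = is_reversed(true);" requires an implicit conversion, which
// an explicit conversion operator does not provide.
static_assert(!std::is_convertible<is_reversed, bool>::value,
              "no implicit conversion to bool");
// Direct initialization and contextual conversion (if, !, &&) still work.
static_assert(std::is_constructible<bool, is_reversed>::value,
              "explicit conversion is available");
```

Declaring the variable with `auto` keeps the strong type, so no
conversion to `bool` is attempted at the point of assignment.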
2020-09-21 16:32:53 +03:00
Avi Kivity
ef20afea7c counters: unconfuse clang in counter_cell_builder::inserter_iterator
Clang gets confused in this operator=() implementation. Frankly, I
can't see why. But adding this-> helps it.
2020-09-21 16:32:53 +03:00
Avi Kivity
186c6cef57 cdc: sprinkle parentheses in EntryContainer concept
Due to a bug, clang does not decay a type to a reference, failing
the concept evaluation on correct input. Add parentheses to force
it to decay the type.
2020-09-21 16:32:53 +03:00
Avi Kivity
30f2b3ba2f bytes: define constructor for fmt_hex
clang 10 does not implement p0960r3, so we must define a
constructor for fmt_hex.
2020-09-21 16:32:53 +03:00
Avi Kivity
388dcf126c atomic_cell.hh: forward-declare atomic_cell_or_collection
atomic_cell_or_collection is also declared as a friend class further
down, and gcc appears to inject this friend declaration into the
global namespace. Clang appears not to, and so complains when
atomic_cell_or_collection is mentioned in the declaration of
merge_column().

Add a forward declaration in the global namespace to satisfy clang.
2020-09-21 16:32:53 +03:00
Avi Kivity
cc3c9ba03a alternator/streams: don't use non-existent std::ostringstream::view()
We call ostringstream::view(), but that member doesn't exist. It
works because it is guarded by an #ifdef and the guard isn't satisfied,
but if it is (as with clang) it doesn't compile. Remove it.
2020-09-21 16:32:10 +03:00
Avi Kivity
2d33a3f73c alternator/base64: fix harmless integer overflow
We assign 255 to an int8_t, but 255 isn't representable as
an int8_t. Change to the bitwise equivalent but representable -1.
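
The equivalence can be verified at compile time. The constant name
below is an assumption for illustration, not the identifier used in
alternator/base64.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical marker value: -1 is representable in int8_t and has the
// same bit pattern as 255 in an 8-bit unsigned value.
constexpr std::int8_t invalid_marker = -1;

static_assert(static_cast<std::uint8_t>(invalid_marker) == 255,
              "-1 and 255 share the same 8-bit pattern");
```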

Found by clang.
2020-09-21 16:32:10 +03:00
Avi Kivity
cf3e779180 alternator/base64: fix misuse of strlen() in constexpr context
base64_chars() calls strlen() from a static_assert, but strlen() isn't
(and can't be) constexpr. gcc somehow allows it, but clang rightfully
complains.

Fix by using a character array and sizeof, instead of a pointer
and strlen().
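
The array-and-sizeof approach can be sketched as follows. The variable
names are assumptions; the alphabet is the standard base64 one.

```cpp
#include <cassert>
#include <cstddef>

// A character array (not a pointer), so sizeof yields the string length
// plus the terminating NUL, and is usable in a constant expression.
constexpr char base64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

constexpr std::size_t base64_alphabet_size = sizeof(base64_alphabet) - 1;

static_assert(base64_alphabet_size == 64, "base64 alphabet must have 64 symbols");
```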
2020-09-21 16:32:10 +03:00
Avi Kivity
ee980ee32f hashers: convert illegal constraint to static_assert
The constraint on cryptopp_hasher<>::impl is illegal, since it's not
on the base template. Convert it to a static_assert.

We could have moved it to the base template, but that would have undone
the work to push all the implementation details in .cc and reduce #include
load.

Found by clang.
2020-09-21 16:32:10 +03:00
Avi Kivity
c5312618d0 hashers: relax noexcept requirement from CryptoPP Update() functions
While we need CryptoPP Update() functions not to throw, they aren't
marked noexcept.

Since there is no reason for them to throw, we'll just hope they
don't, and relax the requirement.

Found by clang. Apparently gcc didn't bother to check the constraint
here.
2020-09-21 16:32:10 +03:00
Avi Kivity
22781ab7e3 hashers: add missing typename in Hashers concept
Found by clang. Likely due to clang not implementing p0634r3, not a gcc
bug.
2020-09-21 16:31:40 +03:00
Botond Dénes
ab59e7c725 flat_mutation_reader: add buffer() accessor
To allow outsiders to inspect the contents of the reader's buffer.
2020-09-21 13:33:42 +03:00
Etienne Adam
208a721253 redis: add hexists command
Add the HEXISTS command, which returns 1 if the key/field
of a hash exists, and 0 otherwise.

Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200917200259.338-1-etienne.adam@gmail.com>
2020-09-21 12:32:33 +03:00
Avi Kivity
75e72d18d2 Merge 'Simplify scylla_util.py' from Takuya ASADA
scylla_util.py has become large; some code is unused, and some code is only used by a specific script.
Drop unnecessary things, move non-common functions to the caller script, and make scylla_util.py simple.

Closes #7102

* syuu1228-refactor_scylla_util:
  scylla_util.py: de-duplicate code on parse_scylla_dirs_with_default() and get_scylla_dirs()
  scylla_util.py: remove rmtree() and redhat_version() since these are unused
  dist/common/scripts: drop makedirs(), use os.makedirs()
  dist/common/scripts: drop hex2list.py since it's no longer used
  dist/common/scripts: drop is_systemd() since we no longer support non-systemd environment
  dist/common/scripts: drop dist_name() and dist_ver()
  dist/common/scripts: move functions that are only called from single file
2020-09-19 21:01:24 +03:00
Takuya ASADA
48223022f7 scylla_util.py: de-duplicate code on parse_scylla_dirs_with_default() and get_scylla_dirs()
Seems like parse_scylla_dirs_with_default() and get_scylla_dirs() share most of
the code, de-duplicate it.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2020-09-20 00:50:37 +09:00
Takuya ASADA
f8321bc66a scylla_util.py: remove rmtree() and redhat_version() since these are unused 2020-09-20 00:50:05 +09:00
Takuya ASADA
8e1f7d4fc7 dist/common/scripts: drop makedirs(), use os.makedirs()
Since os.makedirs() has exist_ok option, no need to create wrapper function.
2020-09-20 00:48:06 +09:00
Takuya ASADA
85f76e80b4 dist/common/scripts: drop hex2list.py since it's no longer used
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2020-09-20 00:45:25 +09:00
Takuya ASADA
0f5c83f73d dist/common/scripts: drop is_systemd() since we no longer support non-systemd environment 2020-09-20 00:45:02 +09:00
Takuya ASADA
82701dc5ed dist/common/scripts: drop dist_name() and dist_ver()
It can be replaced with distro.name() and distro.version().

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2020-09-20 00:42:27 +09:00
Takuya ASADA
79d8192dc7 dist/common/scripts: move functions that are only called from single file
scylla_util.py is a library of common functions shared across setup scripts;
it should not include functions private to a single file.
So move all those functions to the caller file.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2020-09-20 00:42:19 +09:00
Avi Kivity
4e7ad448f4 Update seastar submodule
* seastar dc06cd1f0f...9ae33e67e1 (9):
  > expiring_fifo: mark methods noexcept
  > chunked_fifo: mark methods noexcept
  > circular_buffer: support stateful allocators
  > net/posix-stack: fix sockaddr forward reference
  > rpc: fix LZ4_DECODER_RING_BUFFER_SIZE not defined
  > futures_test: fix max_concurrent_for_each concepts error
  > core: limit memory size for each process to 64GB
  > core/reactor_backend: kill unused nr_retry
  > Merge "IO tracing" from Pavel E
2020-09-19 16:40:48 +03:00
Avi Kivity
311b6b827c build: unified tarball: reduce log spam
Don't print out the names of all files archived, there are too
many of them.

Closes #7191
2020-09-18 15:00:59 +03:00
Pekka Enberg
db272ba799 Update tools/java submodule
* tools/java 6c1c484140...2e2c056c07 (1):
  > sstableloader: Add verbose message if sstable/file cannot be opened
2020-09-17 17:19:41 +03:00
Pekka Enberg
9650b8d4b5 Update tools/java submodule
* tools/java b0114f64bc...6c1c484140 (1):
  > sstableloader: fix generating CQL statements
2020-09-17 12:25:08 +03:00
Avi Kivity
8722cb97ae Merge "storage_service: set_tables_autocompaction: run_with_api_lock" from Benny
"
Based on https://github.com/scylladb/scylla/issues/7199,
it looks like storage_service::set_tables_autocompaction
may be called on shards other than 0.

Use run_with_api_lock to both serialize the action
and to check _initialized on shard 0.

Fixes #7199

Test: unit(dev), compaction_test:TestCompaction_with_SizeTieredCompactionStrategy.disable_autocompaction_nodetool_test
"

* tag 'set_tables_autocompaction-v1' of github.com:bhalevy/scylla:
  storage_service: set_tables_autocompaction: fixup indentation
  storage_service: set_tables_autocompaction: run_with_api_lock
  storage_service: set_tables_autocompaction: use do_with to hold on to args
  storage_service: set_tables_autocompaction: log message in info level
2020-09-17 12:21:06 +03:00
Avi Kivity
12ca7f7ace Merge 'dist/common/scripts: skip internet access on offline installation' from Takuya ASADA
We need to skip internet access on offline installation.
To do this we need the following changes:
 - prevent running yum/apt in each script
 - default to "NO" for scripts that require package installation
 - default to "NO" for scripts that require internet access, such as NTP

See #7153

Closes #7224

* syuu1228-offline_setup:
  dist/common/scripts: skip internet access on offline installation
  scylla_ntp_setup: use shutil.which() to look up command
2020-09-17 12:14:20 +03:00
Avi Kivity
809a13d0f4 Merge 'Gce image support' from Bentsi
- Add necessary changes to `scylla_util.py` in order to support Scylla Machine Image in GCE.
- Fixes and improvements for `curl` function.

Closes #7080

* bentsi-gce-image-support:
  scylla_util.py: added GCE instance/image support
  scylla_util.py: make max_retries as a curl function argument
  scylla_util.py: adding timeout to curl function
  scylla_util.py: styling fixes to curl function
  scylla_util.py: change default value for headers argument in curl function
2020-09-17 12:09:19 +03:00
Pekka Enberg
d6bf424127 configure.py: Build scylla-unified-package.tar.gz to build/<mode>/dist/tar
Let's build scylla-unified-package.tar.gz in build/<mode>/dist/tar for
symmetry. The old location is still kept for backward compatibility for
now. Also document the new official artifact location.

Message-Id: <20200917071131.126098-1-penberg@scylladb.com>
2020-09-17 11:01:02 +03:00
Avi Kivity
e43d6d1460 Merge "Unregister all RPC verbs" from Pavel E
"
... and make sure nothing is left. With the help of a fresh seastar
this can be done quickly.

Before doing this check -- unregister remaining verbs in repair and
storage_service and fix tests not to register verbs, because they
are all local.

tests: unit(dev), manual
"

* 'br-messaging-service-stop-all' of https://github.com/xemul/scylla:
  messaging_service: Report still registered services as errors
  repair: Move CHECKSUM_RANGE verb into repair/
  repair: Toss messaging init/uninit calls
  storage_service: Uninit RPC verbs
  test: Do not init messaging verbs
2020-09-17 10:59:10 +03:00
Nadav Har'El
b81e3d9a4e alternator, doc: link to the separate repository on load balancing
Add in docs/alternator/alternator.md a link to the external repository
devoted to Alternator load balancing instructions, example, and code -
https://github.com/scylladb/alternator-load-balancing/.

Fixes #5030.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200917061328.371557-1-nyh@scylladb.com>
2020-09-17 09:56:29 +03:00
Pavel Emelyanov
2fde6bbfe7 messaging_service: Report still registered services as errors
On stop -- unregister the CLIENT_ID verb, which is registered
in the constructor, then check for any remaining ones.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-17 09:52:57 +03:00
Pavel Emelyanov
9a15ebfe6a repair: Move CHECKSUM_RANGE verb into repair/
The verb is sent by repair code, so it should be registered
in the same place, not in main. Also -- the verb should be
unregistered on stop.

The global messaging service instance is made similarly to the
row-level one, as there's no ready to use repair service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-17 09:52:48 +03:00
Pavel Emelyanov
d5769346d7 repair: Toss messaging init/uninit calls
The goal is to make it possible to reg/unreg more than just row-level
verbs. While at it -- equip the init call with a sharded<database>&
argument, it will be needed by the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-17 09:52:48 +03:00
Pavel Emelyanov
949a258809 storage_service: Uninit RPC verbs
The service does this on stop, which is never called, so
do it separately.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-17 09:52:45 +03:00
Pavel Emelyanov
2d45d71413 test: Do not init messaging verbs
The CQL tests do not use networking, so there is no need to register any verbs.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-17 09:51:33 +03:00
Avi Kivity
a81b731a1a scripts: pull_pr.sh: resolve pull request user name
Convert the github handle to a real name.

Closes #7247
2020-09-17 08:27:00 +02:00
Pekka Enberg
f6f9f832ee test.py: Add "--list" option to show a list of tests
This patch adds a "--list" option to test.py that shows a list of tests
instead of executing them. This is useful for people and scripts, which
need to discover the tests that will be run. For example, when Jenkins
needs to store failing tests, it can use "test.py --list" to figure out
what to archive.
Message-Id: <20200916135714.89350-1-penberg@scylladb.com>
2020-09-16 16:02:48 +02:00
Benny Halevy
f207cff73d token_metadata: set_pending_ranges: prep new interval_map out of line
And move-assign to _pending_ranges_interval_map[keyspace_name]
only when done.

This is more efficient since there's no need to look up
_pending_ranges_interval_map[keyspace_name] for every insert to the
interval_map.

And it is exception safe in case we run out of memory mid-way.

Refs #7220

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200916115059.788606-1-bhalevy@scylladb.com>
2020-09-16 15:28:42 +03:00
Avi Kivity
81844fb476 Update tools/java submodule
* tools/java 2d49ded77b...b0114f64bc (1):
  > Merge "dist: do not install build dependencies on build script" from Takuya

Fixes #7219.
2020-09-16 12:52:07 +03:00
Bentsi Magidovich
12e018ad04 scylla_util.py: added GCE instance/image support 2020-09-16 11:32:57 +03:00
Nadav Har'El
5e8bdf6877 alternator: fix corruption of PutItem operation in case of contention
This patch fixes a bug noted in issue #7218 - where PutItem operations
sometimes lose part of the item's data - some attributes were lost,
and the names of other attributes were replaced by empty strings. The problem
happened when the write-isolation policy was LWT and there was contention
of writes to the same partition (not necessarily the same item).

To use CAS (a.k.a. LWT), Alternator builds an alternator::rmw_operation
object with an apply() function which takes the old contents of the item
(if needed) and a timestamp, and builds a mutation that the CAS should
apply. In the case of the PutItem operation, we wrongly assumed that apply()
will be called only once - so as an optimization the strings saved in the
put_item_operation were moved into the returned mutation. But this
optimization is wrong - when there is contention, apply() may be called
again when the change proposed by the previous call was not accepted by
the Paxos protocol.

The fix is to change the one place where put_item_operation *moved* strings
out of the saved operations into the mutations, to be a copy. But to prevent
this sort of bug from recurring in future code, this patch enlists the
compiler to help us verify that it can't happen: The apply() function is
marked "const" - it can use the information in the operation to build the
mutation, but it can never modify this information or move things out of it,
so it will be fine to call this function twice.

The single output field that apply() does write (_return_attributes) is
marked "mutable" to allow the const apply() to write to it anyway. Because
apply() might be called twice, it is important that if some apply()
implementation sometimes sets _return_attributes, then it must always
set it (even if to the default, empty, value) on every call to apply().

The const apply() means that the compiler verifies for us that I didn't
forget to fix additional wrong std::move()s. Additionally, a test I wrote
to easily reproduce issue #7218 (which I will submit as a dtest later)
passes after this fix.

Fixes #7218.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200916064906.333420-1-nyh@scylladb.com>
2020-09-16 10:30:19 +02:00
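The actual fix is in C++ and relies on `const` and `std::move`, which Python does not have. As a loose analogy only, the sketch below shows why an `apply()` that hands its saved state away breaks when it is called a second time after contention, and why copying avoids that. The class names and the dictionary-based "mutation" are hypothetical and do not mirror Alternator's real types.

```python
class MovingPutItem:
    """Mimics the bug: apply() gives away its saved state and leaves it empty."""

    def __init__(self, attrs):
        self._attrs = attrs

    def apply(self, timestamp):
        mutation = {"ts": timestamp, "attrs": self._attrs}
        self._attrs = {}  # analogous to a moved-from object being left empty
        return mutation


class CopyingPutItem:
    """Mimics the fix: apply() only reads its saved state, so it can be repeated."""

    def __init__(self, attrs):
        self._attrs = attrs

    def apply(self, timestamp):
        return {"ts": timestamp, "attrs": dict(self._attrs)}


item = {"name": "a", "value": "b"}

buggy = MovingPutItem(dict(item))
buggy.apply(1)                 # first proposal, assumed rejected
retry_buggy = buggy.apply(2)   # retry now builds a mutation without attributes

fixed = CopyingPutItem(dict(item))
fixed.apply(1)
retry_fixed = fixed.apply(2)   # retry still carries the full item
```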
Bentsi Magidovich
5aef44fc82 scylla_util.py: make max_retries as a curl function argument 2020-09-16 11:25:01 +03:00
Bentsi Magidovich
1f31e976cc scylla_util.py: adding timeout to curl function 2020-09-16 11:25:01 +03:00
Bentsi Magidovich
f5d97afaa2 scylla_util.py: styling fixes to curl function
- rename deprecated logging.warn to logging.warning
- remove redundant round brackets in the if statement
2020-09-16 11:25:01 +03:00
Bentsi Magidovich
a24ec2686f scylla_util.py: change default value for headers argument in curl function
- It was set to {}, which is incorrect and can lead to unexpected behavior
https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments
- Order of the arguments changed to be more convenient
2020-09-16 11:25:01 +03:00
Avi Kivity
4456645f97 scripts: pull_pr.sh: auto-close pull request after merge
Add a "Closes #$PR_NUM" annotation at the end of the commit
message to tell github to close the pull request, preventing
manual work and/or dangling pull requests.

Closes #7245
2020-09-16 10:23:34 +02:00
Avi Kivity
253a7640e3 Merge 'Clean up old cluster features' from Piotr Sarna
"
This series follows the suggestion from https://github.com/scylladb/scylla/pull/7203#issuecomment-689499773 discussion and deprecates a number of cluster features. The deprecation does not remove any features from the strings sent via gossip to other nodes, but it removes all checks for these features from code, assuming that the checks are always true. This assumption is quite safe for features introduced over 2 years ago, because the official upgrade path only allows upgrading from a previous official release, and these feature bits were introduced many release cycles ago.
All deprecated features were picked from a `git blame` output which indicated that they come from 2018 or earlier:
```git
e46537b7d3 2016-05-31 11:44:17 +0200   RANGE_TOMBSTONES_FEATURE = "RANGE_TOMBSTONES";
85c092c56c 2016-07-11 10:59:40 +0100   LARGE_PARTITIONS_FEATURE = "LARGE_PARTITIONS";
02bc0d2ab3 2016-12-09 22:09:30 +0100   MATERIALIZED_VIEWS_FEATURE = "MATERIALIZED_VIEWS";
67ca6959bd 2017-01-30 19:50:13 +0000   COUNTERS_FEATURE = "COUNTERS";
815c91a1b8 2017-04-12 10:14:38 +0300   INDEXES_FEATURE = "INDEXES";
d2a2a6d471 2017-08-03 10:53:22 +0300   DIGEST_MULTIPARTITION_READ_FEATURE = "DIGEST_MULTIPARTITION_READ";
ecd2bf128b 2017-09-01 09:55:02 +0100   CORRECT_COUNTER_ORDER_FEATURE = "CORRECT_COUNTER_ORDER";
713d75fd51 2017-09-14 19:15:41 +0200   SCHEMA_TABLES_V3 = "SCHEMA_TABLES_V3";
2f513514cc 2017-11-29 11:57:09 +0000   CORRECT_NON_COMPOUND_RANGE_TOMBSTONES = "CORRECT_NON_COMPOUND_RANGE_TOMBSTONES";
0be3bd383b 2017-12-04 13:55:36 +0200   WRITE_FAILURE_REPLY_FEATURE = "WRITE_FAILURE_REPLY";
0bab3e59c2 2017-11-30 00:16:34 +0000   XXHASH_FEATURE = "XXHASH";
fbc97626c4 2018-01-14 21:28:58 -0500   ROLES_FEATURE = "ROLES";
802be72ca6 2018-03-18 06:25:52 +0100   LA_SSTABLE_FEATURE = "LA_SSTABLE_FORMAT";
71e22fe981 2018-05-25 10:37:54 +0800   STREAM_WITH_RPC_STREAM = "STREAM_WITH_RPC_STREAM";
```
Tests: unit(dev)
           manual(verifying with cqlsh that the feature strings are indeed still set)
"

Closes #7234.

* psarna-clean_up_features:
  gms: add comments for deprecated features
  gms: remove unused feature bits
  streaming: drop checks for RPC stream support
  roles: drop checks for roles schema support
  service: drop checks for xxhash support
  service: drop checks for write failure reply support
  sstables: drop checks for non-compound range tombstones support
  service: drop checks for v3 schema support
  repair: drop checks for large partitions support
  service: drop checks for digest multipartition read support
  sstables: drop checks for correct counter order support
  cql3: drop checks for materialized views support
  cql3: drop checks for counters support
  cql3: drop checks for indexing support
2020-09-16 10:53:25 +03:00
Avi Kivity
888fde59f8 Update tools/jmx submodule
* tools/jmx d3096f3...6795a22 (1):
  > Merge "dist: do not install build dependencies on build script" from Takuya

Ref #7219.
2020-09-16 10:30:33 +03:00
Takuya ASADA
db9e6f50f3 dist/common/scripts: skip internet access on offline installation
We need to skip internet access on offline installation.
To do this we need the following changes:
 - prevent running yum/apt in each script
 - default to "NO" for scripts that require package installation
 - default to "NO" for scripts that require internet access, such as NTP

See #7153
Fixes #7182
2020-09-16 10:05:20 +09:00
Takuya ASADA
ca8f0ff588 scylla_ntp_setup: use shutil.which() to look up command
The directory where a command is installed may differ between distributions;
we can abstract the difference using shutil.which().
The script also becomes simpler than passing a full path to os.path.exists().
2020-09-16 10:04:23 +09:00
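A minimal sketch of the lookup described in this commit: `shutil.which()` searches `PATH` and returns the full path or `None`, so no per-distribution path needs to be hard-coded. The helper name `find_command` and the candidate command names are assumptions for illustration, not taken from scylla_ntp_setup.

```python
import shutil


def find_command(candidates):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


# Hypothetical candidates; which one exists depends on the distribution.
ntp_client = find_command(["chronyd", "ntpd"])
```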
Avi Kivity
9421cfded4 reconcilable_result_builder: don't aggravate out-of-memory condition during recovery
Consider an unpaged query that consumes all of available memory, despite
fea5067dfa which limits them (perhaps the
user raised the limit, or this is a system query). Eventually we will see a
bad_alloc which will abort the query and destroy this reconcilable_result_builder.

During destruction, we first destroy _memory_accounter, and then _result.
Destroying _memory_accounter resumes some continuations which can then
allocate memory synchronously when increasing the task queue to accommodate
them. We will then crash. Had we not crashed, we would immediately afterwards
release _result, freeing all the memory that we would ever need.

Fix by making _result the last member, so it is freed first.

Fixes #7240.
2020-09-15 19:53:05 +02:00
Pavel Solodovnikov
6e10f2b530 schema_registry: make grace period configurable
Introduce new database config option `schema_registry_grace_period`
describing the amount of time in seconds after which unused schema
versions will be cleaned up from the schema registry cache.

Default value is 1 second, the same value as was hardcoded before.

Tests: unit(debug)
Refs: #7225

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200915131957.446455-1-pa.solodovnikov@scylladb.com>
2020-09-15 17:53:27 +02:00
Tomasz Grabiec
c9e1694c58 Merge "Some optimizations on cache entry lookup" from Pavel Emelyanov
The set contains 3 small optimizations:
- avoid copying of partition key on lookup path
- reduce number of args carried around when creating a new entry
- save one partition key comparison on reader creation

Plus related satellite cleanups.

* https://github.com/xemul/scylla/tree/br-row-cache-less-copies:
  row_cache: Revive do_find_or_create_entry concepts
  populating reader: Do not copy decorated key too early
  populating reader: Less allocator switching on population
  populating reader: Fix indentation after previous patch
  row_cache: Move missing entry creation into helper
  test: Lookup an existing entry with its own helper
  row_cache: Do not copy partition tombstone when creating cache entry
  row_cache: Kill incomplete_tag
  row_cache: Save one key compare on direct hit
2020-09-15 17:49:47 +02:00
Avi Kivity
7cf4c450cd Update seastar submodule
* seastar 8933f76d33...dc06cd1f0f (3):
  > lz4_fragmented_compressor: Fix buffer requirements
Fixes #6925
  > net: tls: Added feature to register callback for TLS verification
  > alien: be compatible use API Level 5
2020-09-15 17:33:24 +03:00
Avi Kivity
64ebb9c052 Merge 'Remove _pending_ranges and _pending_ranges_map in token_metadata' from Asias
"
This PR removes _pending_ranges and _pending_ranges_map in token_metadata.
This removal makes copying of token_metadata faster and reduces the chance to cause reactor stall.

Refs: #7220
"

* asias-token_metadata_replication_config_less_maps:
  token_metadata: Remove _pending_ranges
  token_metadata: Get rid of unused _pending_ranges_map
2020-09-15 17:16:35 +03:00
Piotr Sarna
7c8728dd73 Merge 'Add progress metrics for replace decommission removenode'
from Asias.

This series follows "repair: Add progress metrics for node ops #6842"
and adds the metrics for the remaining node operations,
i.e., replace, decommission and removenode.

Fixes #1244, #6733

* asias-repair_progress_metrics_replace_decomm_removenode:
  repair: Add progress metrics for removenode ops
  repair: Add progress metrics for decommission ops
  repair: Add progress metrics for replace ops
2020-09-15 12:19:11 +02:00
Benny Halevy
0dc45529c8 abstract_replication_strategy: get_ranges_in_thread: copy _token_metadata if func may yield
Change 94995acedb added yielding to abstract_replication_strategy::do_get_ranges.
And 07e253542d used get_ranges_in_thread in compaction_manager.

However, there is nothing to prevent token_metadata, and in particular its
`_sorted_tokens` from changing while iterating over them in do_get_ranges if the latter yields.

Therefore copy the replication strategy `_token_metadata` in `get_ranges_in_thread(inet_address ep)`.

If the caller provides `token_metadata` to get_ranges_in_thread, then the caller
must make sure that we can safely yield while accessing token_metadata (like
in `do_rebuild_replace_with_repair`).

Fixes #7044

Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200915074555.431088-1-bhalevy@scylladb.com>
2020-09-15 11:33:55 +03:00
Asias He
c38ec98c6e token_metadata: Remove _pending_ranges
- Remove get_pending_ranges and introduce has_pending_ranges, since the
  caller only needs to know if there is a pending range for the keyspace
  and the node.

- Remove print_pending_ranges which is only used in logging. If we
  really want to log the new pending token ranges, we can log when we
  set the new pending token ranges.

This removal of _pending_ranges makes copying of token_metadata faster
and reduces the chance to cause reactor stall.

Refs: #7220
2020-09-15 16:27:50 +08:00
Avi Kivity
19ffc9455d Merge "Don't expose exact collection from range_tombstone_list" from Pavel E
"
The range_tombstone_list provides an abstraction for working with a
sorted list of range tombstones, with methods to add/retrieve
them. However, there's a tombstones() method that just returns a
modifiable reference to the underlying collection (boost::intrusive_set),
which makes it hard to track exactly how it is used.

This set encapsulates the collection of range tombstones inside
the mentioned ..._list class.

tests: unit(dev)
"

* 'br-range-tombstone-encapsulate-collection' of https://github.com/xemul/scylla:
  range_tombstone_list: Do not expose internal collection
  range_tombstone_list: Introduce and use pop-and-lock helper
  range_tombstone_list: Introduce and use pop_as<>()
  flat_mutation_reader: Use range_tombstone_list begin/end API
  repair: Mark some partition_hasher methods noexcept
  hashers: Mark hash updates noexcept
2020-09-15 10:09:15 +02:00
Botond Dénes
3c3b63c2b7 scylla-gdb.py: histogram: don't use shared default argument
The histogram constructor has a `counts` parameter defaulted to
`defaultdict(int)`. Due to how default argument values work in
python -- the same value is passed to all invocations -- this results in
all histogram instances sharing the same underlying counts dict. Solve
it the way this is usually solved -- default the parameter to `None` and
when it is `None` create a new instance of `defaultdict(int)` local to
the histogram instance under construction.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200908142355.1263568-1-bdenes@scylladb.com>
2020-09-15 10:09:15 +02:00
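The shared-default pitfall described in this commit can be reproduced in a few lines. This is a minimal sketch, not the code from scylla-gdb.py: both class names and the `add()` method are hypothetical stand-ins for the real histogram.

```python
from collections import defaultdict


class SharedHistogram:
    # Bug: the default defaultdict is created once, at function definition time,
    # and is then reused by every instance that relies on the default.
    def __init__(self, counts=defaultdict(int)):
        self.counts = counts

    def add(self, key):
        self.counts[key] += 1


class Histogram:
    # Fix: default to None and build a fresh dict per instance.
    def __init__(self, counts=None):
        self.counts = defaultdict(int) if counts is None else counts

    def add(self, key):
        self.counts[key] += 1


a, b = SharedHistogram(), SharedHistogram()
a.add("x")
leaked = b.counts["x"]    # b observes a's update through the shared dict

c, d = Histogram(), Histogram()
c.add("x")
isolated = d.counts["x"]  # d has its own, still empty, counts
```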
Botond Dénes
c1bb648f90 scylla-gdb.py: managed_bytes_printer: print blobs in hex format
Currently blobs are converted to python bytes objects and printed by
simply converting them to string. This results in hard to read blobs as
the bytes' __str__() attempts to interpret the data as a printable
string. This patch changes this to use bytes.hex() which prints blobs in
hex format. This is much more readable and it is also the format that
scylla uses when printing blobs.

Also the conversion to bytes is made more efficient by using gdb's
gdb.inferior.read_memory() function to read the data.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200911085439.1461882-1-bdenes@scylladb.com>
2020-09-15 10:09:15 +02:00
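To illustrate the readability difference this commit describes, the sketch below compares `str()` and `bytes.hex()` on an arbitrary sample blob. The sample bytes are made up, and the gdb `read_memory()` part is left out because it needs a live gdb session.

```python
# Arbitrary sample data: a NUL byte, 0xff, the letter 'A' and a newline.
blob = bytes([0x00, 0xFF, 0x41, 0x0A])

as_str = str(blob)   # mixes escapes and printable characters
as_hex = blob.hex()  # uniform two hex digits per byte

print(as_str)
print(as_hex)
```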
Tomasz Grabiec
1f6c4f945e mutation_partition: Fix typo
drien -> driven

Message-Id: <1600103287-4948-1-git-send-email-tgrabiec@scylladb.com>
2020-09-15 10:09:15 +02:00
Asias He
d38506fbf0 token_metadata: Get rid of unused _pending_ranges_map
It is not used anymore. The size of _pending_ranges_map is O(number
of keyspaces). It can be very big when we have lots of keyspaces.

Refs: #7220
2020-09-15 14:47:00 +08:00
Pavel Solodovnikov
e02301890b schema_tables: extract fill_column_info helper
The patch extracts a little helper function that populates
a schema_mutation with various column information.

Will be used in a subsequent patch to serialize column mappings.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-09-15 05:38:21 +03:00
Pavel Solodovnikov
778230f8b8 frozen_mutation: introduce unfreeze_upgrading method
This helper function is similar to the ordinary `unfreeze` of
`frozen_mutation` but in addition to the schema_ptr supplies a
custom column_mapping which is being used when upgrading the
mutation.

Needed for a subsequent patch regarding column mappings history.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-09-15 05:26:44 +03:00
Takuya ASADA
5f541fbdc5 scylla_setup: drop hugepages package installation
The hugepages and libhugetlbfs-bin packages are only required for DPDK mode,
and installing them unconditionally causes an error in offline mode, so drop the installation.

Fixes #7182
2020-09-14 17:05:09 +03:00
Botond Dénes
192bcc5811 data/cell: don't overshoot target allocation sizes
data::cell targets 8KB as its maximum allocation size to avoid
pressuring the allocator. This 8KB target is used for internal storage
-- values small enough to be stored inside the cell itself -- as well
as for external storage. Externally stored values use 8KB fragment sizes.
The problem is that only the size of the data itself was considered when
making the allocations. For example, when allocating the fragments
(chunks) for external storage, each fragment stored 8KB of data. But
fragments have overhead: they have next and back pointers. This resulted
in an 8KB + 2 * sizeof(void*) allocation. IMR uses the allocation
strategy mechanism, which works with aligned allocations. As the seastar
allocator only guarantees aligned allocations for power-of-two sizes,
it ends up allocating a 16KB slot. This results in the mutation fragment
using almost twice as much memory as would be required. This is a huge
waste.

This patch fixes the problem by considering the overhead of both
internal and external storage ensuring allocations are 8KB or less.

Fixes: #6043

Tests: unit(debug, dev, release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200910171359.1438029-1-bdenes@scylladb.com>
2020-09-14 14:21:46 +03:00
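The arithmetic behind the 16KB slot can be checked with a short sketch. It assumes a 64-bit platform (`sizeof(void*) == 8`) and models the allocator simply as rounding each request up to the next power of two; both are assumptions for illustration, and `slot_size` is a hypothetical helper, not a seastar function.

```python
POINTER_SIZE = 8             # assumption: 64-bit platform, sizeof(void*) == 8
TARGET = 8 * 1024            # the 8KB target from the commit message
OVERHEAD = 2 * POINTER_SIZE  # next and back pointers of one fragment


def slot_size(request):
    """Round a request up to the next power of two (assumed allocator model)."""
    size = 1
    while size < request:
        size *= 2
    return size


# Before the fix: 8KB of data plus the pointers spills into a 16KB slot.
before = slot_size(TARGET + OVERHEAD)

# After the fix: the data is sized so that data plus overhead fits in 8KB.
after = slot_size((TARGET - OVERHEAD) + OVERHEAD)
```

Under these assumptions the request before the fix is 8208 bytes, just past the 8192-byte boundary, which is why nearly half of the 16KB slot went unused.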
Piotr Sarna
d85a32ce70 mutation_partition: use proper hasher in row hashing
Instead of using the default hasher, hashing specializations should
use the hasher type they were specialized for. It's not a correctness
issue now because the default hasher (xx_hasher) is compatible with
its predecessor (legacy_xx_hasher_without_null_digest), but it's better
to be future-proof and use the correct type in case we ever change the
default hasher in a backward-incompatible way.

Message-Id: <c84ce569d12d9b4f247fb2717efa10dc2dabd75b.1600074632.git.sarna@scylladb.com>
2020-09-14 14:17:36 +03:00
Piotr Sarna
dd085b146a gms: add comments for deprecated features
Features which are propagated to other nodes via gossip,
but are assumed to be supported in the code, are now marked
with comments.
2020-09-14 12:59:19 +02:00
Benny Halevy
0410e11213 storage_service: set_tables_autocompaction: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-09-14 13:55:28 +03:00
Benny Halevy
39cf84f291 storage_service: set_tables_autocompaction: run_with_api_lock
Based on https://github.com/scylladb/scylla/issues/7199,
it looks like storage_service::set_tables_autocompaction
may be called on shards other than 0.

Use run_with_api_lock to both serialize the action
and to check _initialized on shard 0.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-09-14 13:53:08 +03:00
Benny Halevy
1ca02c756f storage_service: set_tables_autocompaction: use do_with to hold on to args
In preparation to calling is_initialized() which may yield.

Plus, the way the tables vector is currently captured is inefficient.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-09-14 13:45:00 +03:00
Benny Halevy
e85a6c4853 storage_service: set_tables_autocompaction: log message in info level
This is rare enough, and important enough for the operator, to be logged at info level.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-09-14 13:38:59 +03:00
Piotr Sarna
defe6f49df gms: remove unused feature bits
Checks for features introduced over 2 years ago were removed
in previous commits, so all that is left is removing the feature
bits themselves. Note that the feature strings are still sent
to other nodes just to be double sure, but the new code assumes
that all these features are implicitly enabled.
2020-09-14 12:35:28 +02:00
Piotr Sarna
f7a7931377 streaming: drop checks for RPC stream support
Streaming with RPC stream has been supported for over 2 years and upgrades
are only allowed from versions which already have the support,
so the checks are hereby dropped.
2020-09-14 12:18:13 +02:00
Pavel Emelyanov
dff8aebe58 partition_snapshot_reader: Do not fill buffer in constructor
The reader fills up the buffer upon construction, which is not what other
readers do, and is considered to be a waste of cycles, as the reader can be
dropped early.

Refs #1671
test: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200910171134.11287-2-xemul@scylladb.com>
2020-09-14 12:18:03 +02:00
Piotr Sarna
d1480a5260 roles: drop checks for roles schema support
Roles have been supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
2020-09-14 12:17:26 +02:00
Piotr Sarna
e19e86a6e7 service: drop checks for xxhash support
The xxhash algorithm has been supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
2020-09-14 12:13:03 +02:00
Piotr Sarna
f05bf78716 service: drop checks for write failure reply support
Write failure reply has been supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
2020-09-14 12:11:42 +02:00
Piotr Sarna
16b4b86697 sstables: drop checks for non-compound range tombstones support
Correct non-compound range tombstones have been supported for over 2 years
and upgrades are only allowed from versions which already have the
support, so the checks are hereby dropped.
2020-09-14 12:09:51 +02:00
Piotr Sarna
cc57f7b154 service: drop checks for v3 schema support
Schema v3 has been supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
2020-09-14 12:07:51 +02:00
Piotr Sarna
9e6098a422 repair: drop checks for large partitions support
Large partitions have been supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
2020-09-14 12:07:20 +02:00
Piotr Sarna
854a44ff9b service: drop checks for digest multipartition read support
Digest multipartition read has been supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
2020-09-14 12:06:32 +02:00
Piotr Sarna
f8ed1b5b67 sstables: drop checks for correct counter order support
Correct counter order has been supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
2020-09-14 12:05:11 +02:00
Piotr Sarna
18bd710dca cql3: drop checks for materialized views support
Views have been supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
2020-09-14 12:03:52 +02:00
Piotr Sarna
720d17a9c7 cql3: drop checks for counters support
Counters have been supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
2020-09-14 12:03:41 +02:00
Piotr Sarna
7ba7d35aad cql3: drop checks for indexing support
Indexing has been supported for over 2 years and upgrades are only
allowed from versions which already have the support, so the checks
are hereby dropped.
2020-09-14 12:03:37 +02:00
Avi Kivity
dcaf4ea4dd Merge "Fix race in schema version recalculation leading to stale schema version in gossip" from Tomasz
"
Migration manager installs several cluster feature change listeners.
The listeners will call update_schema_version_and_announce() when cluster
features are enabled, which does this:

    return update_schema_version(proxy, features).then([] (utils::UUID uuid) {
        return announce_schema_version(uuid);
    });

It first updates the schema version and then publishes it via
gossip in announce_schema_version(). It is possible that the
announce_schema_version() part of the first schema change will be
deferred and will execute after the other four calls to
update_schema_version_and_announce(). It will install the old schema
version in gossip instead of the more recent one.

The fix is to serialize schema digest calculation and publishing.

Refs #7200

This problem also brought my attention to initialization code, which could be
prone to the same problem.

The storage service computes gossiper states before it starts the
gossiper. Among them is the node's schema version. There are two problems with that.

The first is that computing the schema version and publishing it is not
atomic, so it is not safe against concurrent schema changes or schema
version recalculations. It does not exclude concurrent
recalculate_schema_version() calls, and we could end up with the old
(and incorrect) schema version being advertised in gossip.

Second problem is that we should not allow the database layer to call
into the gossiper layer before it is fully initialized, as this may
produce undefined behavior.

Maybe we're not doing concurrent schema changes/recalculations now,
but it is easy to imagine that this could change for whatever reason
in the future.

The solution for both problems is to break the cyclic dependency
between the database layer and the storage_service layer by having the
database layer not use the gossiper at all. The database layer
publishes schema version inside the database class and allows
installing listeners on changes. The storage_service layer asks the
database layer for the current version when it initializes, and only
after that installs a listener which will update the gossiper.

Tests:

  - unit (dev)
  - manual (3 node ccm)
"

* tag 'fix-schema-digest-calculation-race-v1' of github.com:tgrabiec/scylla:
  db, schema: Hide update_schema_version_and_announce()
  db, storage_service: Do not call into gossiper from the database layer
  db: Make schema version observable
  utils: updateable_value_source: Introduce as_observable()
  schema: Fix race in schema version recalculation leading to stale schema version in gossip
2020-09-14 12:37:46 +03:00
Etienne Adam
f3ce5f0cbb redis: remove lambda in command_factory
This follows the patch removing the commands classes,
and removes unnecessary lambdas.

Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200914071651.28802-1-etienne.adam@gmail.com>
2020-09-14 11:30:20 +03:00
Avi Kivity
05229d1d31 Merge "Add unified tarball to build "dist" target" from Pekka
"
This pull request fixes unified relocatable package dependency issues in
other build modes than release, and then adds unified tarball to the
"dist" build target.

Fixes #6949
"

* 'penberg/build/unified-to-dist/v1' of github.com:penberg/scylla:
  configure.py: Build unified tarball as part of "dist" target
  unified/build_unified: Use build/<mode>/dist/tar for dependency tarballs
  configure.py: Use build/<mode>/dist/tar for unified tarball dependencies
2020-09-14 11:29:28 +03:00
Etienne Adam
bd82b4fc03 redis: remove commands classes
This patch is a proposal for the removal of the redis
classes describing the commands. 'prepare' and 'execute'
class functions have been merged into a function with
the name of the command.

Note: 'command_factory' still needs to be simplified.

Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200913183315.9437-1-etienne.adam@gmail.com>
2020-09-14 11:29:28 +03:00
Avi Kivity
764866ed02 Update tools/jmx submodule
* tools/jmx 8d92e54...d3096f3 (1):
  > dist: debian: fix detection of debuild
2020-09-13 16:26:53 +03:00
Raphael S. Carvalho
6a7409ef4c scylla-gdb: Fix scylla tasks
It fails as follows:
Python Exception <class 'TypeError'> unsupported operand type(s) for +: 'int' and 'str':

it's a regression caused by e4d06a3bbf.

_mask() should use the ref stored in the ctor to dereference _impl.

Fixes #7058.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200908154342.26264-1-raphaelsc@scylladb.com>
2020-09-13 16:25:27 +03:00
Avi Kivity
a8f40b3e89 Update seastar submodule
* seastar 52f0f38994...8933f76d33 (10):
  > future-util: add max_concurrent_for_each
  > rwlock: define rwlock_for_read, rwlock_for_write, as classes, not structs
  > loopback_socket: mark get_sockopt, set_sockopt as override
  > native-stack: mark get_sockopt, set_sockopt as override
  > treewide: remove unused lambda captures
  > future: prevent spurious unused-lambda-capture warning in future<>::then
  > future: make futurize<> a friend of future_state_base
  > future: fix uninitialized constexpr in is_tuple_effectively_trivially_move_constructible_and_destructible
  > net: expose hidden method from parent class
  > future: s/std::result_of_t/std::invoke_result_t/
2020-09-13 16:19:09 +03:00
Takuya ASADA
233d0fc0e5 unified: don't proceed offline install when openjdk is not available
Currently, we run the openjdk existence check after the scylla main program is
installed. We should do it before installing anything.
2020-09-13 12:39:05 +03:00
Pavel Emelyanov
bf4063d78e row cache: Unfriend classes from each other
Now cache_tracker, mutation_partition and rows_entry do not
need to be friends.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-11 16:35:51 +03:00
Pavel Emelyanov
7a1265a338 rows_entry: Move container/hooks types declarations
Define container types near the containing elements' hook
members, so that they could be private without the need
to friend classes with each other.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-11 16:35:51 +03:00
Pavel Emelyanov
7ed1e18a13 rows_entry: Simplify LRU unlink
The cache_tracker tries to access a private member of the
rows_entry to unlink it, but the lru_type is auto_unlink
and can unlink itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-11 16:35:51 +03:00
Pavel Emelyanov
7f2c6aed50 mutation_partition: Define .replace_with method for rows_entry
This is needed to hide the guts of rows_entry from mutation_partition.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-11 16:35:51 +03:00
Pavel Emelyanov
a946326daf mutation_partition: Use rows_entry::apply_monotonically
There is no need to touch the private member of rows_entry, as it
exposes a method for this.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-11 16:13:10 +03:00
Tomasz Grabiec
691009bc1e db, schema: Hide update_schema_version_and_announce() 2020-09-11 14:42:48 +02:00
Tomasz Grabiec
9f58dcc705 db, storage_service: Do not call into gossiper from the database layer
The storage service computes gossiper states before it starts the
gossiper. Among them, node's schema version. There are two problems with that.

First is that computing the schema version and publishing it is not
atomic, so is not safe against concurrent schema changes or schema
version recalculations. It is not mutually exclusive with
recalculate_schema_version() calls, and we could end up with the old
(and incorrect) schema version being advertised in gossip.

Second problem is that we should not allow the database layer to call
into the gossiper layer before it is fully initialized, as this may
produce undefined behavior.

The solution for both problems is to break the cyclic dependency
between the database layer and the storage_service layer by having the
database layer not use the gossiper at all. The database layer
publishes schema version inside the database class and allows
installing listeners on changes. The storage_service layer asks the
database layer for the current version when it initializes, and only
after that installs a listener which will update the gossiper.

This also allows us to drop unsafe functions like update_schema_version().
2020-09-11 14:42:41 +02:00
Tomasz Grabiec
ad0b674b13 db: Make schema version observable 2020-09-11 14:42:41 +02:00
Tomasz Grabiec
fed89ee23e utils: updateable_value_source: Introduce as_observable() 2020-09-11 14:42:41 +02:00
Tomasz Grabiec
1a57d641d1 schema: Fix race in schema version recalculation leading to stale schema version in gossip
Migration manager installs several feature change listeners:

    if (this_shard_id() == 0) {
        _feature_listeners.push_back(_feat.cluster_supports_view_virtual_columns().when_enabled(update_schema));
        _feature_listeners.push_back(_feat.cluster_supports_digest_insensitive_to_expiry().when_enabled(update_schema));
        _feature_listeners.push_back(_feat.cluster_supports_cdc().when_enabled(update_schema));
        _feature_listeners.push_back(_feat.cluster_supports_per_table_partitioners().when_enabled(update_schema));
    }

They will call update_schema_version_and_announce() when features are enabled, which does this:

    return update_schema_version(proxy, features).then([] (utils::UUID uuid) {
        return announce_schema_version(uuid);
    });

So it first updates the schema version and then publishes it via
gossip in announce_schema_version(). It is possible that the
announce_schema_version() part of the first schema change will be
deferred and will execute after the other four calls to
update_schema_version_and_announce(). It will install the old schema
version in gossip instead of the more recent one.

The fix is to serialize schema digest calculation and publishing.
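
As an illustration of the race and of the serialized shape, here is a
minimal single-threaded sketch in standard C++ (no seastar). All names
(node_state, compute_version, update_then_announce_later,
serialized_executor) are hypothetical and are not taken from the Scylla
source; deferral is modeled by handing the publish step back to the
caller as a callable.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a node: its local schema state and the
// version it last published to gossip.
struct node_state {
    int schema_changes = 0;       // bumped by every schema change
    std::string gossip_version;   // what other nodes would see
};

// Pretend digest: derives a version string from the local schema state.
inline std::string compute_version(const node_state& n) {
    return "v" + std::to_string(n.schema_changes);
}

// Racy shape: the version is computed now, but the publish step is
// returned so that it can run at some later, uncontrolled point.
inline std::function<void()> update_then_announce_later(node_state& n) {
    std::string computed = compute_version(n);
    return [&n, computed] { n.gossip_version = computed; };
}

// Serialized shape: tasks run one at a time, in submission order.
class serialized_executor {
    std::deque<std::function<void()>> _tasks;
public:
    void submit(std::function<void()> task) {
        assert(task);  // an empty std::function cannot be run
        _tasks.push_back(std::move(task));
    }
    void run_all() {
        while (!_tasks.empty()) {
            auto task = std::move(_tasks.front());
            _tasks.pop_front();
            task();
        }
    }
};

// Computing and publishing form a single task, so no publish of an
// older version can land after a newer one.
inline void update_and_announce_serialized(serialized_executor& ex, node_state& n) {
    ex.submit([&n] { n.gossip_version = compute_version(n); });
}
```

With the racy shape, running the publish step of the first update after
the second one leaves the older version in gossip_version. With the
serialized shape the last task to run always publishes the version
computed from the current state.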

Refs #7200
2020-09-11 14:40:28 +02:00
Botond Dénes
e4798d9551 scylla-gdb.py: add scylla schema command
To pretty print a schema. Example:

(gdb) scylla schema $s
(schema*) 0x604009352380 ks="scylla_bench" cf="test" id=a3eadd80-f2a7-11ea-853c-000000000004 version=47e0bf13-6cc8-3421-93c6-a9fe169b1689

partition key: byte_order_equal=true byte_order_comparable=false is_reversed=false
    "org.apache.cassandra.db.marshal.LongType"

clustering key: byte_order_equal=true byte_order_comparable=false is_reversed=true
    "org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.LongType)"

columns:
    column_kind::partition_key  id=0 ordinal_id=0 "pk" "org.apache.cassandra.db.marshal.LongType" is_atomic=true is_counter=false
    column_kind::clustering_key id=0 ordinal_id=1 "ck" "org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.LongType)" is_atomic=true is_counter=false
    column_kind::regular_column id=0 ordinal_id=2 "v" "org.apache.cassandra.db.marshal.BytesType" is_atomic=true is_counter=false

To preserve easy inspection of schema objects the printer is a command,
not a pretty-printer.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200911100039.1467905-1-bdenes@scylladb.com>
2020-09-11 13:38:20 +02:00
Botond Dénes
4d7e2bf117 scylla-gdb.py: add pretty-printer for bytes
Reusing the sstring pretty-printer.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200911094846.1466600-1-bdenes@scylladb.com>
2020-09-11 13:36:32 +02:00
Botond Dénes
d49c87ff47 scylla-gdb.py: don't use the string display hint for UUIDs
It causes gdb to print UUIDs like this:

    "a3eadd80-f2a7-11ea-853c-", '0' <repeats 11 times>, "4"

This is quite hard to read, let's drop the string display hint, so they
are displayed like this:

    a3eadd80-f2a7-11ea-853c-000000000004

Much better. Also technically UUID is a 128 bit integer anyway, not a
string.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200911090135.1463099-1-bdenes@scylladb.com>
2020-09-11 13:19:49 +02:00
Pekka Enberg
b50063e84a configure.py: Build unified tarball as part of "dist" target
Let's include the unified tarball as part of the "dist" target.

Fixes #6949
2020-09-11 12:38:47 +03:00
Pekka Enberg
138e723e56 unified/build_unified: Use build/<mode>/dist/tar for dependency tarballs
The build_unified.sh script has the same bug as configure.py had: it
looks for the python tarball in
build/<mode>/scylla-python3-package.tar.gz, but it's never generated
there. Fix up the problem by using build/<mode>/dist/tar location for
all dependency tarballs.
2020-09-11 12:37:44 +03:00
Pekka Enberg
1af17f56f9 configure.py: Use build/<mode>/dist/tar for unified tarball dependencies
The build target for scylla-unified-package.tar.gz incorrectly depends
on "build/<mode>/scylla-python3-package.tar.gz", which is never
generated. Instead, the package is either generated in
"build/release/scylla-python3-package.tar.gz" (for legacy reasons) or
"build/<mode>/dist/tar/scylla-python3-package.tar.gz". This issue
causes building the unified package in other modes to fail.

To solve the problem, let's switch to using the "build/<mode>/dist/tar"
locations for unified tarball dependencies, which is the correct place
to use anyway.
2020-09-11 11:57:58 +03:00
Nadav Har'El
3322328b21 alternator test: fix two tests that failed in HTTPS mode
When the test suite is run with Scylla serving in HTTPS mode, using
test/alternator/run --https, two Alternator Streams tests failed.
With this patch fixing a bug in the test, the tests pass.

The bug was in the is_local_java() function which was supposed to detect
DynamoDB Local (which behaves in some things differently from the real
DynamoDB). When that detection code makes an HTTPS request and does not
disable checking the server's certificate (which on Alternator is
self-signed), the request fails - but not in the way that the code expected.
So we need to fix the is_local_java() to allow the failure mode of the
self-signed certificate. Anyway, this case is *not* DynamoDB Local so
the detection function would return false.

Fixes #7214

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200910194738.125263-1-nyh@scylladb.com>
2020-09-11 08:10:13 +02:00
Nadav Har'El
02ee0483b2 alternator test: add reproducing tests for several issues
This patch adds regression tests for four recently-fixed issues which did not yet
have tests:

Refs #7157 (LatestStreamArn)
Refs #7158 (SequenceNumber should be numeric)
Refs #7162 (LatestStreamLabel)
Refs #7163 (StreamSpecification)

I verified that all the new tests failed before these issues were fixed, but
now pass.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200907155334.562844-1-nyh@scylladb.com>
2020-09-10 17:36:23 +02:00
Avi Kivity
0e03c979d2 Merge 'Fix ignoring cells after null in appending hash' from Piotr Sarna
"
This series fixes a bug in `appending_hash<row>` that caused it to ignore any cells after the first NULL. It also adds a cluster feature which starts using the new hashing only after the whole cluster is aware of it. The series comes with tests, which reproduce the issue.

Fixes #4567
Based on #4574
"

* psarna-fix_ignoring_cells_after_null_in_appending_hash:
  test: extend mutation_test for NULL values
  tests/mutation: add reproducer for #4567
  gms: add a cluster feature for fixed hashing
  digest: add null values to row digest
  mutation_partition: fix formatting
  appending_hash<row>: make publicly visible
2020-09-10 15:35:38 +03:00
Piotr Sarna
fe5cd846b5 test: extend mutation_test for NULL values
The test is extended for another possible corner case:
[1, NULL, 2] vs [1, 2, NULL] should have different digests.
Also, a check for legacy behavior is added.
2020-09-10 13:16:44 +02:00
Paweł Dziepak
287d0371fa tests/mutation: add reproducer for #4567 2020-09-10 13:16:44 +02:00
Piotr Sarna
21a77612b3 gms: add a cluster feature for fixed hashing
The new hashing routine which properly takes null cells
into account is now enabled if the whole cluster is aware of it.
2020-09-10 13:16:44 +02:00
Piotr Sarna
7b329f7102 digest: add null values to row digest
With the new hashing routine, null values are taken into account
when computing row digest. Previous behavior had a regression
which stopped computing the hash after the first null value
is encountered, but the original behavior was also prone
to errors - e.g. row [1, NULL, 2] was not distinguishable
from [1, 2, NULL], because their hashes were identical.
This hashing is not yet active - it will only be used after
the next commit introduces a proper cluster feature for it.
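
To make the [1, NULL, 2] vs [1, 2, NULL] case concrete, here is a small
sketch in standard C++. It is not the Scylla hashing code: the cell and
row types, the FNV-1a hasher and the one-byte presence marker are all
assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

using cell = std::optional<std::uint32_t>;  // std::nullopt models a NULL cell
using row = std::vector<cell>;

// FNV-1a (64-bit), used here only as a simple stand-in hash.
class fnv1a {
    std::uint64_t _h = 14695981039346656037ull;
public:
    void update(std::uint8_t b) { _h = (_h ^ b) * 1099511628211ull; }
    void update_u32(std::uint32_t v) {
        for (int i = 0; i < 4; ++i) {
            update(static_cast<std::uint8_t>(v >> (8 * i)));
        }
    }
    std::uint64_t finalize() const { return _h; }
};

// Skips NULL cells entirely, so their position leaves no trace.
inline std::uint64_t digest_skipping_nulls(const row& r) {
    fnv1a h;
    for (const cell& c : r) {
        if (c) {
            h.update_u32(*c);
        }
    }
    return h.finalize();
}

// Feeds a presence marker for every cell, then the value if there is one.
inline std::uint64_t digest_with_null_markers(const row& r) {
    fnv1a h;
    for (const cell& c : r) {
        h.update(c ? 1 : 0);
        if (c) {
            h.update_u32(*c);
        }
    }
    return h.finalize();
}

// Usage: only the marker-based digest can tell the two rows apart.
inline void example() {
    const row a = {1u, std::nullopt, 2u};
    const row b = {1u, 2u, std::nullopt};
    assert(digest_skipping_nulls(a) == digest_skipping_nulls(b));
    assert(digest_with_null_markers(a) != digest_with_null_markers(b));
}
```

The first digest sees the same byte stream for both rows; the second
sees different streams because the marker records where the NULL sits.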
2020-09-10 13:16:44 +02:00
Piotr Sarna
5ffd929eaa mutation_partition: fix formatting 2020-09-10 12:20:32 +02:00
Paweł Dziepak
6f46010235 appending_hash<row>: make publicly visible
appending_hash<row> specialisation is declared and defined in a *.cc file
which means it cannot have a dedicated unit test. This patch moves the
declaration to the corresponding *.hh file.
2020-09-10 12:20:32 +02:00
Avi Kivity
d55a8148ed tools: toolchain: update for gnutls-3.6.15
GNUTLS-SA-2020-09-04 / GNUTLS-SA-2020-09-04.

Fixes #7212.
2020-09-10 12:52:00 +03:00
Raphael S. Carvalho
86b9ea6fb2 storage_service: Fix use-after-free when calculating effective ownership
Use-after-free happens because we take a ref to keyspace_name, which
is stack allocated, and ceases to exist after the next deferring
action.

Fixes #7209.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200909210741.104397-1-raphaelsc@scylladb.com>
2020-09-10 11:33:50 +03:00
Dejan Mircevski
9d02f10c71 cql3: Fix NULL reference in get_column_defs_for_filtering
There was a typo in get_column_defs_for_filtering(): it checked the
wrong pointer before dereferencing.  Add a test exposing the NULL
dereference and fix the typo.

Tests: unit (dev)

Fixes #7198.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-09-10 08:45:07 +02:00
Asias He
3ba6e3d264 storage_service: Fix a TOKENS update race for replace operation
In commit 7d86a3b208 (storage_service:
Make replacing node take writes), application state of TOKENS of the
replacing node is added into gossip and propagated to the cluster after
the initial start of gossip service. This can cause the race below:

1. The replacing node replaces the old dead node with the same ip address
2. The replacing node starts gossip without application state of the TOKENS
3. Other nodes in the cluster replace the application states of old dead node's
   version with the new replacing node's version
4. The replacing node dies
5. The replace operation is performed again; the TOKENS application state is
   not present and the replace operation fails.

To fix, we can always add TOKENS application state when the
gossip service starts.

Fixes: #7166
Backports: 4.1 and 4.2
2020-09-09 15:24:21 +02:00
Tomasz Grabiec
de6aa668f5 Merge "Do not reimplement deletable_row in clustering_row" from Pavel Emelyanov
The clustering_row class looks like a decorated deletable_row, but
re-implements all its logic (and members). Embed the deletable_row
into clustering_row and keep the non-static row logic in one
class instead of two.
2020-09-09 14:27:18 +02:00
Takuya ASADA
59a6e08cb9 Add support passing python3 dependencies from main repo to scylla-python3 script
We don't want to update the scylla-python3 submodule for every python3
dependency update, so move the python3 package list to
python3-dependencies.txt and pass it at package build time.

See #6702
See scylladb/scylla-python3#6

[avi: add

* tools/python3 19a9cd3...b4e52ee (1):
  > Allow specify package dependency list by --packages

 to maintain bisectability]
2020-09-08 23:39:34 +03:00
Avi Kivity
291117ea9c Update seastar submodule
* seastar 4ff91c4c3a...52f0f38994 (4):
  > rpc: Return actual chosen compressor in server reponse - not all avail
Fixes #6925.
  > net/tls: fix compilation guards around sec_param().
  > future: improved printing of seastar::nested_exception
  > merge: http: fix issues with the request parser and testing
2020-09-08 23:39:34 +03:00
Takuya ASADA
e1b15ba09e dist/common/scripts: abort scylla_prepare with better error message
When configuration files for perftune contain an invalid parameter, scylla_prepare
may produce a traceback because error handling is not sufficient.

Throw all errors from create_perftune_conf(), catch them in scylla_prepare, and
print a user-understandable error.

Fixes #6847
2020-09-08 23:39:34 +03:00
Pavel Emelyanov
4e264b9e4f clustering_row: Do not re-implement deletable_row
The clustering_row is deletable_row + clustering_key, all
its internals work exactly as the relevant deletable_row's
ones.

A similar relation exists between static_row and row, where
the former wraps the latter, so the same trick is applied here
to the non-static row classes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-08 22:21:15 +03:00
Pavel Emelyanov
ca148acbf9 deletable_row: Do not mess with clustering_row
The deletable_row accepts clustering_row in constructor and
.apply() method. The next patch will make clustering_row
embed the deletable_row inside, so those two methods will
violate layering and should be fixed in advance.

The fix is to provide a clustering_row method to convert
itself into a deletable_row. There are two places that need
this: mutation_fragment_applier and partition_snapshot_row_cursor.
Both places pass a temporary clustering_row value, so the
method in question is also a move-converter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-08 22:18:15 +03:00
Kamil Braun
42fb4fe37c cdc: fix deadlock inside check_and_repair_cdc_streams
check_and_repair_cdc_streams, in case it decides to create a new CDC
generation, updates the STATUS application state so that other nodes
gossiped with pick up the generation change.

The node which runs check_and_repair_cdc_streams also learns about a
generation change: the STATUS update causes a notification change.
This happens during add_local_application_state call
which caused the STATUS update; it would lead to calling
handle_cdc_generation, which detects a generation change and calls
add_local_application_state with the new generation's timestamp.

Thus, we get a recursive add_local_application_state call. Unfortunately,
the function takes a lock before doing on_change notifications, so we
get a deadlock.

This commit prevents the deadlock.
We update the local variable which stores the generation timestamp
before updating STATUS, so handle_cdc_generation won't consider
the observed generation to be new, hence it won't perform the recursive
add_local_application_state call.
2020-09-08 16:33:47 +03:00
Avi Kivity
c075539fea Merge 'storage_proxy: add a separate smp_group for hints' from Eliran
Hints writes are handled by storage_proxy in the exact same way
regular writes are, which in turn means that the same smp service
group is used for both. The problem is that it can lead to a priority
inversion where writes of the lower-priority kind occupy many of
the semaphore's units, making the higher-priority writes wait for an
empty slot.
This series adds a separate smp group for hints as well as a field
to pass the correct smp group to mutate_locally functions, and
then uses this field to properly classify the writes.

Fixes #7177

* eliransin-hint_priority_inversion:
  Storage proxy: use hints smp group in mutate locally
  Storage proxy: add a dedicated smp group for hints
2020-09-08 16:12:37 +03:00
Wojciech Mitros
66e8214606 cql: Forbid adding new fields to UDTs used in partition key columns
Changing a user type may allow adding apparently duplicate rows to
tables where this type is used in a partitioning key. Fix by checking
all types of existing partitioning columns before allowing to add new
fields to the type.

Fixes #6941
2020-09-08 16:08:07 +03:00
Avi Kivity
7ac59dcc98 lsa: decay reserves
The log-structured allocator (LSA) reserves memory when performing
operations, since its operations are performed with reclaiming disabled
and if it runs out, it cannot evict cache to gain more. The amount of
memory to reserve is remembered across calls so that it does not have
to repeat the fail/increase-reserve/retry cycle for every operation.

However, we currently never decay the amount to reserve. This means
that if a single operation increased the reserve in the distant past,
all current operations also require this large reserve. Large reserves
are expensive since they can cause large amounts of cache to be evicted.

This patch adds reserve decay. The time-to-decay is inversely proportional
to reserve size: 10GB/reserve. This means that a 20MB reserve is halved
after 500 operations (10GB/20MB) while a 20kB reserve is halved after
500,000 operations (10GB/20kB). So large, expensive reserves are decayed
quickly while small, inexpensive reserves are decayed slowly to reduce
the risk of allocation failures and exceptions.
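
The arithmetic above can be checked with a short sketch. The constant
and the 10GB/reserve rule come from this message (decimal units are
assumed, which is what makes the quoted 500 and 500,000 come out
exactly); reserve_tracker and its per-operation counter are a
hypothetical way to apply the rule and not necessarily how the allocator
implements decay.

```cpp
#include <cassert>
#include <cstdint>

// Decay budget from the commit message: 10GB, taken as 10^10 bytes.
inline constexpr std::uint64_t decay_budget_bytes = 10'000'000'000ull;

// Number of operations after which a reserve of this size is halved,
// following the "10GB/reserve" rule.
inline std::uint64_t operations_until_halved(std::uint64_t reserve_bytes) {
    assert(reserve_bytes != 0);  // the rule is undefined for an empty reserve
    return decay_budget_bytes / reserve_bytes;
}

// Hypothetical tracker: counts operations and halves the reserve once
// the count reaches the threshold for the current reserve size.
struct reserve_tracker {
    std::uint64_t reserve_bytes;
    std::uint64_t ops_since_decay = 0;

    void on_operation() {
        if (reserve_bytes == 0) {
            return;  // nothing left to decay
        }
        if (++ops_since_decay >= operations_until_halved(reserve_bytes)) {
            reserve_bytes /= 2;
            ops_since_decay = 0;
        }
    }
};
```

For a 20MB reserve (20,000,000 bytes) the threshold is 500 operations,
and for a 20kB reserve (20,000 bytes) it is 500,000, matching the
figures in the text.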

A unit test is added.

Fixes #325.
2020-09-08 15:59:25 +03:00
Takuya ASADA
4deb245198 scylla_ntp_setup: don't install ntpd package when it's already exists
Don't install the ntpd package when it already exists.

Related with #7153
2020-09-08 13:59:04 +03:00
Etienne Adam
63a1a4cbb9 redis: add hgetall and hdel commands
This patch adds support for 2 hash commands HDEL and HGETALL.

Internally it introduces the hashes_result_builder class to
read hashes and store them in a std::map.

Other changes:
  - one exception return string was fixed
  - tests now use pytest.raises

Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200907202528.4985-1-etienne.adam@gmail.com>
2020-09-08 11:59:52 +03:00
Eliran Sinvani
933b44f676 Storage proxy: use hints smp group in mutate locally
We are using mutate_locally to handle hint mutations that arrived
through RPC. The current implementation makes no distinction whether
the mutation came through a hint verb or a mutation verb, resulting in
using the same smp group for both. This commit adds the ability to
reference a different smp group in mutate_locally private calls and
makes the handlers pass the correct smp group to mutate_locally.
2020-09-08 10:03:50 +03:00
Benny Halevy
f8d9e81bdb types: time_point_to_string: prevent overflow of nanoseconds
Due to #7175, microseconds are stored in a db_clock::time_point
as if they were milliseconds.

std::chrono::duration_cast<std::chrono::nanoseconds> may cause overflow
and end up with invalid/negative nanos.

This change specializes time_point_to_string to std::chrono::milliseconds
since it's currently only called to print db_clock::time_point
and uses boost::posix_time::milliseconds to print the count.

This would generate an exception for today's timestamps
and the output will look like:
    1599493018559873 milliseconds (Year is out of valid range: 1400..9999)
instead of:
    1799-07-16T19:57:52.175010

It is preferable to print the numeric value annotated as out of valid range
than to print a bogus date in the past.
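
The overflow itself can be demonstrated without boost. The sketch below
is not the time_point_to_string code; ms64, ns64 and the two helper
functions are hypothetical names. It only shows that a millisecond count
of the size quoted above cannot be represented as 64-bit nanoseconds,
while a genuine millisecond timestamp from the same moment can.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Explicit 64-bit representations, so the ranges below are well defined.
using ms64 = std::chrono::duration<std::int64_t, std::milli>;
using ns64 = std::chrono::duration<std::int64_t, std::nano>;

// True if converting ms to nanoseconds stays inside the 64-bit range.
inline bool fits_in_nanoseconds(ms64 ms) {
    const ms64 hi = std::chrono::duration_cast<ms64>(ns64::max());
    const ms64 lo = std::chrono::duration_cast<ms64>(ns64::min());
    return ms >= lo && ms <= hi;
}

// Converts to nanoseconds; the caller must have checked the range first.
inline ns64 to_nanoseconds(ms64 ms) {
    assert(fits_in_nanoseconds(ms));
    return std::chrono::duration_cast<ns64>(ms);
}
```

A caller would test fits_in_nanoseconds() first and fall back to
printing the raw count when it returns false, which is the spirit of the
change described here.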

Test: unit(dev), commitlog_test:TestCommitLog.test_mixed_mode_commitlog_same_partition_smp_1
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200907162845.147477-1-bhalevy@scylladb.com>
2020-09-08 10:02:02 +03:00
Pavel Emelyanov
b9a4a06381 range_tombstone_list: Do not expose internal collection
Now all work with the list is described as API calls, it's
finally possible to stop exposing the boost::set outside
the class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-07 23:17:41 +03:00
Pavel Emelyanov
f19b85b61d range_tombstone_list: Introduce and use pop-and-lock helper
There's an optimization in flat_mutation_reader_from_mutations
that folds the list from left-to-right in linear time. In case
of currently used boost::set the .unlink_leftmost_without_rebalance
helper is used, so wrap this exception with a method of the
range_tombstone_list. This is the last place where the caller needs
to mess with the exact internal collection.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-07 23:17:41 +03:00
Pavel Emelyanov
a89c7198c2 range_tombstone_list: Introduce and use pop_as<>()
The method extracts an element from the list, constructs
a desired object from it and frees. This is common usage
of range_tombstone_list. Having a helper helps encapsulating
the exact collection inside the class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-07 23:17:41 +03:00
Pavel Emelyanov
27912375b2 flat_mutation_reader: Use range_tombstone_list begin/end API
The goal is to stop revealing the exact collection from
the range_tombstone_list, so make use of existing begin/end
methods and extend with rbegin() where needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-07 23:17:41 +03:00
Pavel Emelyanov
f19ade31ee repair: Mark some partition_hasher methods noexcept
The next patch will change the way range tombstones are
fed into hasher. To make sure the codeflow doesn't
become exception-unsafe, mark the relevant methods as
non-throwing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-07 23:17:41 +03:00
Pavel Emelyanov
5adb8e555c hashers: Mark hash updates noexcept
All those methods end up with library calls, whose code
is not marked noexcept, but is such according to code
itself or docs.

The primary goal is to make some repair partition_hasher
methods noexcept (next patch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-07 23:17:41 +03:00
Nadav Har'El
f76c519c1f merge: Alternator streams - fix table description and sequence number
Merged pull request https://github.com/scylladb/scylla/pull/7160
By Calle Wilund:

Stream descriptions, as returned from create, update and describe stream, were missing the "latest" stream arn.

Shard description sequence numbers (for us, timeuuids) were formatted wrongly. The spec states they should be numeric only.

Both these defects break Kinesis operations.

  alternator: Set CDC delta to keys only for alternator streams
  alternator: Include stream spec in desc for create/update/describe
  alternator: Include LatestStreamLabel in resulting desc for create/update table
  alternator: Make "StreamLabel" an iso8601 timestamp
  alternator: Alloc BILLING_MODE in update_table
  cdc: Add setter for delta mode
  alternator: Fix sequence number range using wrong format
  alternator: Include stream arn in table description if enabled
2020-09-07 18:26:21 +03:00
Piotr Grabowski
ffd8c8c505 utf8: Print invalid UTF-8 character position
Add a new validate_with_error_position function
which returns -1 if the data is a valid UTF-8 string,
or otherwise the byte position of the first invalid
character. The position is added to the exception
messages of all UTF-8 parsing errors in Scylla.

validate_with_error_position is done in two
passes in order to preserve the same performance
in the common case when the string is valid.
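
The contract described above (-1 for valid input, otherwise the byte
offset of the first bad sequence) can be sketched in standard C++ as
follows. This is an illustration only: the function names are
hypothetical, and both passes here use the same scalar loop, whereas the
point of the two-pass design is that the first pass can be a faster
validator that does not track positions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Length (1..4) of the well-formed UTF-8 sequence starting at s[i],
// or 0 if the bytes there are not well-formed.
inline std::size_t utf8_sequence_length(std::string_view s, std::size_t i) {
    assert(i < s.size());
    auto byte = [&](std::size_t k) { return static_cast<std::uint8_t>(s[k]); };
    auto is_cont = [&](std::size_t k) {
        return k < s.size() && (byte(k) & 0xC0) == 0x80;
    };
    const std::uint8_t b0 = byte(i);
    if (b0 < 0x80) {
        return 1;
    }
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        return is_cont(i + 1) ? 2 : 0;
    }
    if (b0 >= 0xE0 && b0 <= 0xEF) {
        if (!is_cont(i + 1) || !is_cont(i + 2)) return 0;
        const std::uint8_t b1 = byte(i + 1);
        if (b0 == 0xE0 && b1 < 0xA0) return 0;  // overlong encoding
        if (b0 == 0xED && b1 > 0x9F) return 0;  // UTF-16 surrogate range
        return 3;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4) {
        if (!is_cont(i + 1) || !is_cont(i + 2) || !is_cont(i + 3)) return 0;
        const std::uint8_t b1 = byte(i + 1);
        if (b0 == 0xF0 && b1 < 0x90) return 0;  // overlong encoding
        if (b0 == 0xF4 && b1 > 0x8F) return 0;  // above U+10FFFF
        return 4;
    }
    return 0;  // 0x80..0xC1 and 0xF5..0xFF can never start a sequence
}

// First pass: yes/no answer only.
inline bool is_valid_utf8(std::string_view s) {
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = utf8_sequence_length(s, i);
        if (n == 0) return false;
        i += n;
    }
    return true;
}

// Second pass runs only on failure and locates the first bad sequence.
inline std::ptrdiff_t validate_with_error_position_sketch(std::string_view s) {
    if (is_valid_utf8(s)) {
        return -1;
    }
    std::size_t i = 0;
    while (i < s.size()) {
        const std::size_t n = utf8_sequence_length(s, i);
        if (n == 0) break;
        i += n;
    }
    return static_cast<std::ptrdiff_t>(i);
}
```

Valid strings pay only for the first pass; the position scan is an extra
cost that is incurred only when an error is about to be reported anyway.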
2020-09-07 18:11:21 +03:00
Piotr Grabowski
462d12f555 db: Propagate enable_cache to system keyspaces
Make enable_cache configuration option also
affect caching of system keyspaces. Fixes #2909.
2020-09-07 17:54:46 +03:00
Calle Wilund
7224ae6d38 alternator: Set CDC delta to keys only for alternator streams
Fixes #7190

Since we don't use any delta value when translating cdc -> streams
it is wasteful to write these to the log table, esp. since we already
write big fat pre- and post images.
2020-09-07 14:27:54 +00:00
Calle Wilund
f7bb0baba7 alternator: Include stream spec in desc for create/update/describe
Fixes #7163

If enabled, the resulting table description should include a
StreamDescription object with the appropriate members describing
current stream settings.
2020-09-07 14:26:21 +00:00
Calle Wilund
e6266d5652 alternator: Include LatestStreamLabel in resulting desc for create/update table
Fixes #7162

Same value as 'StreamLabel' in the currently active stream (cdc log) if
enabled.
2020-09-07 14:24:48 +00:00
Calle Wilund
fa68493d64 alternator: Make "StreamLabel" an iso8601 timestamp
Fixes #7164

See https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_TableDescription.html
StreamLabel: A timestamp, in ISO 8601 format, for this stream

Scylla tables do not have a timestamp as such, but the UUID for a given schema
is a timeuuid, so we can misuse this to fake a creation timestamp.
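
For reference, a sketch of how a creation time can be read out of a
version 1 (time-based) UUID, using the field layout from RFC 4122. The
function names are hypothetical and this is not the Alternator code; the
input is the most significant 64 bits of the UUID, and timestamps before
the Unix epoch are assumed not to occur.

```cpp
#include <cassert>
#include <cstdint>

// Number of 100ns intervals between the UUID epoch (1582-10-15) and the
// Unix epoch (1970-01-01).
inline constexpr std::uint64_t uuid_epoch_offset_100ns = 0x01B21DD213814000ull;

// Extracts the 60-bit timestamp from the most significant 64 bits of a
// version 1 UUID, laid out as:
//   time_low (32 bits) | time_mid (16 bits) | version (4 bits) | time_hi (12 bits)
inline std::uint64_t timeuuid_timestamp_100ns(std::uint64_t msb) {
    assert(((msb >> 12) & 0xF) == 1);  // only version 1 UUIDs carry a timestamp
    const std::uint64_t time_low = msb >> 32;
    const std::uint64_t time_mid = (msb >> 16) & 0xFFFF;
    const std::uint64_t time_hi = msb & 0x0FFF;
    return (time_hi << 48) | (time_mid << 32) | time_low;
}

// Seconds since the Unix epoch encoded in the UUID.
inline std::int64_t timeuuid_unix_seconds(std::uint64_t msb) {
    const std::uint64_t since_unix_epoch =
        timeuuid_timestamp_100ns(msb) - uuid_epoch_offset_100ns;
    return static_cast<std::int64_t>(since_unix_epoch / 10'000'000);
}
```

For a UUID such as a3eadd80-f2a7-11ea-853c-000000000004 the input would
be 0xa3eadd80f2a711ea; the resulting seconds value can then be formatted
as an ISO 8601 string by whatever date library is at hand.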
2020-09-07 14:24:00 +00:00
Calle Wilund
f16792aad0 alternator: Alloc BILLING_MODE in update_table
While it does not do anything, we want something to
update for testing (dynamo python libs refuse empty
update).
2020-09-07 14:15:17 +00:00
Calle Wilund
d29d676955 cdc: Add setter for delta mode 2020-09-07 14:14:04 +00:00
Botond Dénes
c01af1d9d2 tests/boost/multishard_mutation_query_test: remove last BOOST_REQUIRE* macros
Previous patches removed those `BOOST_REQUIRE*` macros that could be
invoked from shards other than 0. The reason is that said macros are not
thread-safe, so calling them from multiple shards produces mangled
output to stdout as well as the XML report file. It was assumed that
only these invocations -- from a non-0 shard -- are problematic, but it
turns out even these can race with seastar log messages emitted from
other shards. This patch removes all such macros, replacing them with
the thread safe `require*` functions from `test/lib/test_utils.hh`.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200907125309.1199104-1-bdenes@scylladb.com>
2020-09-07 17:07:26 +03:00
Eliran Sinvani
342fc07bd6 Storage proxy: add a dedicated smp group for hints
Hints and regular writes currently use the same cross-shard
operation semaphore, which can lead to priority inversion, making
cross-shard writes wait for cross-shard hints. This commit adds
an smp_service_group for hints and uses it in the mutate_hint
function.
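
The effect of sharing one semaphore can be sketched with a toy model in
standard C++. Nothing below is seastar API: unit_pool, write_kind,
shared_groups and separate_groups are hypothetical names, and
try_acquire() returning false stands in for "this write would have to
wait".

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical bounded pool of cross-shard operation units.
class unit_pool {
    std::size_t _capacity;
    std::size_t _available;
public:
    explicit unit_pool(std::size_t units) : _capacity(units), _available(units) {}

    // Takes one unit if any is free; false means the caller would wait.
    bool try_acquire() {
        if (_available == 0) {
            return false;
        }
        --_available;
        return true;
    }

    void release() {
        assert(_available < _capacity);  // cannot return more than was taken
        ++_available;
    }
};

enum class write_kind { regular, hint };

// Before: both kinds of write draw from the same pool.
struct shared_groups {
    unit_pool pool;
    unit_pool& pool_for(write_kind) { return pool; }
};

// After: hints have their own pool and cannot starve regular writes.
struct separate_groups {
    unit_pool regular_pool;
    unit_pool hint_pool;
    unit_pool& pool_for(write_kind k) {
        return k == write_kind::hint ? hint_pool : regular_pool;
    }
};
```

With shared_groups, a burst of hints that takes every unit leaves a
regular write with nothing to acquire; with separate_groups the same
burst exhausts only hint_pool.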
2020-09-07 15:46:12 +03:00
Calle Wilund
a7d021ee57 alternator: Fix sequence number range using wrong format
Fixes #7158

A stream's shard description has a sequence range describing the start/end
(if available) of the shard. This is specified as being "numeric only".

Alternator incorrectly used UUID here, which breaks kinesis.

v2:
* Fix uint128_t parsing from string. bmp::number constructor accepted
  sstring, but did not interpret it as std::string/chars. Weird results.
2020-09-07 12:01:22 +00:00
Pekka Enberg
1ed9a336a5 Update tools/jmx submodule
* tools/jmx 12ab6aa...8d92e54 (1):
  > Merge 'JMX footprint work' from Calle
Fixes scylladb/scylla-jmx#133
Fixes scylladb/scylla-jmx#134
2020-09-07 13:56:47 +03:00
Benny Halevy
0c474b1c01 types: time_point_to_string: handle errors from boost::posix_time::to_iso_extended_string
As seen in https://github.com/scylladb/scylla/issues/7175,
1e676cd845
that was merged in bc77939ada
exposed a preexisting problem in time_point_to_string
where it tried printing a timestamp that was in microseconds
(taken from an api::timestamp_type instead of db_clock::time_point)
and hit `boost::wrapexcept<boost::gregorian::bad_year> (Year is out of valid range: 1400..9999)`

If hit, this patch will print the offending timestamp in
nanoseconds and the error message.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200907083303.33229-1-bhalevy@scylladb.com>
2020-09-07 11:36:43 +03:00
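The `bad_year` error in 0c474b1c01 follows from simple arithmetic if a microsecond count is interpreted at a coarser resolution. The sketch below assumes the value was treated as milliseconds (the commit only says it was in microseconds rather than a `db_clock::time_point`); `approx_year` and the sample timestamp are hypothetical.

```python
SECONDS_PER_YEAR = 365.2425 * 24 * 3600


def approx_year(count: int, ticks_per_second: int) -> float:
    """Approximate calendar year for a tick count since the Unix epoch."""
    return 1970 + (count / ticks_per_second) / SECONDS_PER_YEAR


# A timestamp from September 2020, expressed in microseconds.
ts_us = 1_599_467_803_000_000

year_as_micros = approx_year(ts_us, 1_000_000)  # correct interpretation
year_as_millis = approx_year(ts_us, 1_000)      # wrong: treated as ms
```

Read as microseconds the value lands in 2020; read as milliseconds it lands tens of thousands of years in the future, far outside the 1400..9999 range that boost accepts.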
Calle Wilund
f5c79d15a8 alternator: Include stream arn in table description if enabled
Fixes #7157

When creating/altering/describing a table, if streams are enabled, the
"latest active" stream arn should be included as LatestStreamArn.

Not doing so breaks java kinesis.
2020-09-07 08:16:11 +00:00
Pekka Enberg
e4266ead98 Improve build documentation
This improves the build documentation beyond just packaging:

- Explain how the configure.py step works

- Explain how to build just executables and tests (for development)

- Explain how to build for specific build mode if you didn't specify a
  build mode in configure.py step

- Fix build artifact locations, missing .debs, and add executables and
  tests

Message-Id: <20200904084443.495137-1-penberg@iki.fi>
2020-09-07 10:51:31 +03:00
Benny Halevy
66ce3a4c25 types: time_point_to_string: do not assume tp is in milliseconds
T& tp may have a period other than milliseconds.
Cast the time_point duration to nanoseconds (or microseconds
if boost doesn't support it) so it is printed in the best possible
resolution.

Note that we presume that the time_point epoch is the
Unix epoch of 1970-01-01, but the C++ standard doesn't guarantee
that.  See https://github.com/scylladb/scylla/issues/5498

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200906171106.690872-1-bhalevy@scylladb.com>
2020-09-07 10:44:52 +03:00
Pekka Enberg
478e831d4f Update tools/jmx submodule
* tools/jmx d5d1efd...12ab6aa (1):
  > Merge "Fix JMX startup after offline installation" from Amos

Fixes: scylladb/scylla#7098
Fixes: scylladb/scylla-jmx#129
2020-09-07 09:42:26 +03:00
Piotr Jastrzebski
4499a37eae docs: Improve protocol-extensions documentation
Documentation states that `SCYLLA_LWT_OPTIMIZATION_META_BIT_MASK`
is a 32-bit integer that represents a bit mask. What it fails to mention
is that it's an unsigned value and in fact it takes the value 2147483648.
This is problematic for clients in languages that don't have unsigned
types (like Java).

This patch improves the documentation to make it clear that
`SCYLLA_LWT_OPTIMIZATION_META_BIT_MASK` is represented by an unsigned
value.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <7166b736461ae6f3d8ffdf5733e810a82aa02abc.1599382184.git.piotr@scylladb.com>
2020-09-06 13:35:12 +03:00
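The signedness problem noted in 4499a37eae can be shown with a short sketch. The value 2147483648 comes from the commit message; the helper `has_mask` and the way the flags field is read are assumptions for illustration, not the protocol's actual decoding code.

```python
import struct

MASK = 2147483648  # value quoted in the commit message, equal to 1 << 31

# Reinterpret the same 32 bits as a signed integer, as a language
# without unsigned types would.
as_signed = struct.unpack(">i", struct.pack(">I", MASK))[0]


def has_mask(raw_signed_32: int, mask: int = MASK) -> bool:
    """Test the mask bit on a value that was read as a signed 32-bit int."""
    unsigned = raw_signed_32 & 0xFFFFFFFF  # recover the unsigned view
    return (unsigned & mask) != 0
```

A client that stores the field in a signed 32-bit type sees a negative number, so it has to compare bits rather than magnitudes.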
Dejan Mircevski
a127f5615b cql3: Simplify pk test in statement_restrictions
statement_restrictions::process_partition_key_restrictions() was
checking has_unrestricted_components(), whereas just an empty() check
suffices there, because has_unrestricted_components() is implicitly
checked five lines down by needs_filtering().

The replacement check is cheaper and simpler to understand.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-09-06 12:39:36 +03:00
Dejan Mircevski
df3ea2443b cql3: Drop all uses_function methods
No one seems to call them except for other uses_function methods.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-09-04 17:27:30 +02:00
Piotr Grabowski
561753fe71 mutation: Improve log print of mutations
Changes the format of mutation and mutation_partition
log messages to be more human-readable. Fixes #826.
2020-09-04 16:33:25 +02:00
Tomasz Grabiec
bcdcf06ec7 Merge "lwt: for each statement in cas_request provide a row in CAS result set" from Pavel Solodovnikov
Previously batch statement result set included rows for only
those updates which have a prefetch data present (i.e. there
was an "old" (pre-existing) row for a key).

Also, these rows were sorted not in the order in which statements
appear in the batch, but in the order of updated clustering keys.

If we have a batch which updates a few non-existent keys, then
it's impossible to figure out which update inserted a new key
by looking at the query response. Not only because the responses
may not correspond to the order of statements in the batch, but
even some rows may not show up in the result set at all.

Please see #7113 on Github for detailed description
of the problem:
https://github.com/scylladb/scylla/issues/7113

The patch set proposes the following fix:

For conditional batch statements the result set now always
includes a row for each LWT statement, in the same order
in which individual statements appear in the batch.

This way we can always tell which update did actually insert
a new key or update the existing one.

Technically, the following changes were made:
 * `update_parameters::prefetch_data::row::is_in_cas_result_set`
   member removed as well as the supporting code in
   `cas_request::applies_to` which iterated through cas updates
   and marked individual `prefetch_data` rows as "need to be in
   cas result set".
 * `cas_request::applies_to` substantially simplified since it
   doesn't do anything more than checking `stmt.applies_to()`
   in short-circuiting manner.
 * `modification_statement::build_cas_result_set` method moved
   to `cas_request`. This allows to easily iterate through
   individual `cas_row_update` instances and preserve the order
   of the rows in the result set.
 * A little helper `cas_request::find_old_row`
   is introduced to find a row in `prefetch_data` based on the
   (pk, ck) combination obtained from the current `cas_request`
   and a given `cas_row_update`.
 * A few tests for the issue #7113 are written, other lwt-batch-related
   tests adjusted accordingly.
2020-09-04 16:09:45 +02:00
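The behaviour described in bcdcf06ec7 (one result row per LWT statement, in batch order, whether or not an old row exists) can be sketched as follows. This is a simplified model, not the C++ implementation: `build_cas_result_set`, the dictionary row layout, and the sample keys are all hypothetical.

```python
def build_cas_result_set(statement_keys, prefetched_rows, applied):
    """Return one result row per statement, preserving batch order.

    statement_keys: list of (pk, ck) pairs, one per statement in the batch.
    prefetched_rows: dict mapping (pk, ck) to the pre-existing row, if any.
    """
    rows = []
    for key in statement_keys:
        # A missing old row still yields a result row, with old_row=None.
        old_row = prefetched_rows.get(key)
        rows.append({"applied": applied, "key": key, "old_row": old_row})
    return rows


# Batch order deliberately differs from clustering-key order.
batch = [("pk1", 3), ("pk1", 1), ("pk1", 2)]
prefetched = {("pk1", 1): {"v": 10}}
result = build_cas_result_set(batch, prefetched, applied=True)
```

Because rows follow statement order and none are omitted, a client can match the i-th row of the response to the i-th statement of the batch.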
Pavel Solodovnikov
92fd515186 lwt: for each statement in cas_request provide a row in CAS result set
Previously batch statement result set included rows for only
those updates which have a prefetch data present (i.e. there
was an "old" (pre-existing) row for a key).

Also, these rows were sorted not in the order in which statements
appear in the batch, but in the order of updated clustering keys.

If we have a batch which updates a few non-existent keys, then
it's impossible to figure out which update inserted a new key
by looking at the query response. Not only because the responses
may not correspond to the order of statements in the batch, but
even some rows may not show up in the result set at all.

The patch proposes the following fix:

For conditional batch statements the result set now always
includes a row for each LWT statement, in the same order
in which individual statements appear in the batch.

This way we can always tell which update did actually insert
a new key or update the existing one.

`update_parameters::prefetch_data::row::is_in_cas_result_set`
member variable was removed as well as supporting code in
`cas_request::applies_to` which iterated through cas updates
and marked individual `prefetch_data` rows as "need to be in
cas result set".

Instead now `cas_request::applies_to` is significantly
simplified since it doesn't do anything more than checking
`stmt.applies_to()` in short-circuiting manner.

A few tests for the issue are written, other lwt-batch-related
tests were adjusted accordingly to include rows in result set
for each statement inside conditional batches.

Tests: unit(dev, debug)

Co-authored-by: Konstantin Osipov <kostja@scylladb.com>
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-09-04 13:13:26 +03:00
Pavel Solodovnikov
feaf2b6320 cas_request: move modification_statement::build_cas_result_set to cas_request
This is just a plain move of the code from `modification_statement`
to `cas_request` without changes in the logic, which will further
help to refactor `build_cas_result_set` behavior to include a row
for each LWT statement and order rows in the order of statements
in a batch.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-09-04 12:25:06 +03:00
Pavel Solodovnikov
0f0ff73a58 cas_request: extract find_old_row helper function
Factor out little helper function which finds a pre-existing
row for a given `cas_row_update` (matching the primary key).
Used in `cas_request::applies_to`.

Will be used in a subsequent patch to move
`modification_statement::build_cas_result_set` into `cas_request`.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-09-04 12:09:31 +03:00
Pavel Emelyanov
fabf849fcb row_cache: Save one key compare on direct hit
The partitions_type::lower_bound() method can return a hint that saves
info about the "lower-ness of the bound", in particular when the search
key is found, this can be guessed from the hint without comparison.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
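The idea in fabf849fcb, a lower-bound search that also reports whether it hit the key exactly, can be sketched generically. This is not the `partitions_type` implementation; `lower_bound_with_hint` and `three_way` are hypothetical names, and the sketch assumes a sorted list of unique keys.

```python
def three_way(a, b) -> int:
    """Return a negative, zero, or positive value like a tri-compare."""
    return (a > b) - (a < b)


def lower_bound_with_hint(keys, key):
    """Return (index, match) where index is the lower bound of key.

    match records whether an element equal to key was seen during the
    search, so the caller does not need one more comparison afterwards.
    """
    lo, hi = 0, len(keys)
    match = False
    while lo < hi:
        mid = (lo + hi) // 2
        c = three_way(keys[mid], key)
        if c < 0:
            lo = mid + 1
        else:
            hi = mid
            if c == 0:
                match = True
    return lo, match
```

Without the hint, a caller would have to compare `keys[index]` with `key` again after the search to distinguish a direct hit from an insertion point.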
Pavel Emelyanov
ada174c932 row_cache: Kill incomplete_tag
The incomplete entry is created in one place.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
240b966695 row_cache: Do not copy partition tombstone when creating cache entry
The row_cache::find_or_create is only used to put (or touch) an entry in cache
having the partition_start mutation at hand. Thus, there's no point in carrying
key reference and tombstone value through the calls, just the partition_start
reference is enough.

Since the new cache entry is created incomplete, rename the creation method
to reflect this.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
84a6d439ad test: Lookup an existing entry with its own helper
The only caller of find_or_create() in tests works on an already existing (.populate()-d) entry,
so patch this place for explicitness and for the sake of the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
3f33a71c0c row_cache: Move missing entry creation into helper
No functional changes, just move the code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
4662082748 populating reader: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
e680bdc59c populating reader: Less allocator switching on population
Now that the key for the new partition is copied inside do_find_or_create_entry, we may call
this function without the allocator set, as it sets the allocator inside.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
449f9e1218 populating reader: Do not copy decorated key too early
When the missing partition is created in cache, the decorated key is copied from
the ring position view too early -- to do the lookup. However, the read context
has already entered the partition and already has the decorated key on board,
so for lookup we can use the reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
5a29e17a5f row_cache: Revive do_find_or_create_entry concepts
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Botond Dénes
e88b8a9a07 docs/debugging.md: document how TLS variables work
We use a lot of TLS variables yet GDB is not of much help when working
with these. So in this patch I document where they are located in
memory, how to calculate the address of a known TLS variable and how to
find (identify) one given an address.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200903131802.1068288-1-bdenes@scylladb.com>
2020-09-03 17:41:15 +02:00
Avi Kivity
bc77939ada Update seastar submodule
* seastar 7f7cf0f232...4ff91c4c3a (47):
  > core/reactor: complete_timers(): restore previous scheduling group
Fixes #7117.
  > future: Drop spurious 'mutable' from a lambda
  > future: Don't put a std::tuple in future_state
  > future: Prepare for changing the future_state storage type
  > future: Add future_state::get0()
  > when_all: Replace another untuple with get0
  > when_all: Use get0 instead of untuple
  > rpc: Don't assume that a future stores a std::tuple
  > future: Add a futurize_base
  > testing: Stop boost from installing its own signal handler
  > future: Define futurize::value_type with future::value_type
  > future: Move futurize down the file
  > Merge "Put logging onto {fmt} rails" from Pavel E
  > Merge "Make future_state non variadic" from Rafael
  > sharded: propagate sharded instances stop error
  > log: Fix misprint in docs
  > future: Use destroy_at in destructor
  > future: Add static_assert rejecting variadic future_state
  > futures_test: Drop variadic test
  > when_all: Drop a superfluous use of futurize::from_tuples
  > everywhere: Use future::get0 when appropriate
  > net: upgrade sockopt comments to doxygen
  > iostream: document the read_exactly() function
  > net:tls: fix clang-10 compilation duration cast
  > future: Move continuation_base_from_future to future.hh
  > repeat: Drop unnecessary make_tuple
  > shared_future: Don't use future_state_type
  > future: Add a static_assert against variadic futures
  > future: Delete warn_variadic_future
  > rpc_demo: Don't use a variadic future
  > rpc_test: Don't use a variadic future
  > futures_test: Don't use a variadic future
  > future: Move disable_failure_guard to promise::schedule
  > net: add an interface for custom socket options
  > posix: add one more setsockopt overload
  > Merge "Simplify pollfn and its inheritants" from Pavel E
  > util:std-compat.hh: add forward declaration of std::pmr for clang-10
  > rpc: Add protocol::has_handlers() helper
  > Add a Seastar_DEBUG_ALLOCATIONS build option
  > futures_test: Add a test for futures of references
  > future: Simplify destruction of future_state
  > Use detect_stack_use_after_return=1
  > repeat: Fix indentation
  > repeat: Delete try/catch
  > repeat: Simplify loop
  > Avoid call to std::exception_ptr's destructor from a .hh
  > file: Add missing include
2020-09-03 15:56:12 +03:00
Avi Kivity
64c7c81bac Merge "Update log messages to {fmt} rules" from Pavel E
"
Before seastar is updated with the {fmt} engine under the
logging hood, some changes are to be made in scylla to
conform to {fmt} standards.

Compilation and tests checked against both -- old (current)
and new seastar-s.

tests: unit(dev), manual
"

* 'br-logging-update' of https://github.com/xemul/scylla:
  code: Force formatting of pointer in .debug and .trace
  code: Format { and } as {fmt} needs
  streaming: Do not reveal raw pointer in info message
  mp_row_consumer: Provide hex-formatting wrapper for bytes_view
  heat_load_balance: Include fmt/ranges.h
2020-09-03 15:10:09 +03:00
Nadav Har'El
1d06da18fc alternator test: test for the TRIM_HORIZON stream iterator
This patch adds a test for the TRIM_HORIZON option of GetShardIterator in
Alternator Streams. This option asks to fetch again *all* the available
history in this shard stream. We had an implementation for it, but not a
test - so this patch adds one. The test passes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200830131458.381350-1-nyh@scylladb.com>
2020-09-02 18:37:06 +02:00
Nadav Har'El
3d4183863a alternator test: add tests for sequence-number based iterators
Alternator Streams already support the AT_SEQUENCE_NUMBER and
AFTER_SEQUENCE_NUMBER options for iterators. These options allow to replay
a stream of changes from a known position or after that known position.
However, we never had a test verifying that these features actually work
as intended, beyond just checking syntax. Having such tests is important
because recently we changed the implementation of these iterators, but
didn't have a test verifying that they still work.

So in this patch we add such tests. The tests pass (as usual, on both
Alternator and DynamoDB).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200830115817.380075-1-nyh@scylladb.com>
2020-09-02 18:36:59 +02:00
Nadav Har'El
c879b23b82 alternator test: add another passing Alternator Streams test
We had a test, test_streams_last_result, that verifies that after reading the
last event from an Alternator Stream, reading again will find nothing.
But we didn't actually have a test which checks that if at that point a new event
*does* arrive, we can read it. This test checks this case, and it passes (we don't
have a bug there, but it's good as a regression test for NextShardIterator).

This test also verifies that after reading an event for a particular key on a
specific stream "shard", the next event for the same key will arrive on the
same shard.

This test passes on both Alternator and DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200830105744.378790-1-nyh@scylladb.com>
2020-09-02 18:36:50 +02:00
Raphael S. Carvalho
adf576f769 compaction_manager: export method that returns if table has ongoing compaction
A compaction strategy, that supports parallel compaction, may want to know
if the table has compaction running on its behalf before making a decision.
For example, a size-tiered-like strategy may not want to trigger a behavior,
like cross-tier compaction, when there's ongoing compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200901134306.23961-1-raphaelsc@scylladb.com>
2020-09-02 16:46:49 +03:00
Botond Dénes
90042746bf scylla-gdb.py: scylla_sstables::filename(): add md format support
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200902090929.879377-1-bdenes@scylladb.com>
2020-09-02 16:12:17 +03:00
Dejan Mircevski
0c73ac107d cql3: Drop get_partition_key_unrestricted_components
Not used anywhere.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-09-02 08:14:54 +03:00
Botond Dénes
b3f00685ec scylla-gdb.py: scylla memory: better summary of semaphore memory usage
If available, use the recently added
`reader_concurrency_semaphore::_initial_resources` to calculate the
amount of memory used out of the initially configured amount. If not
available, the summary falls back to the previous mode of just printing
the remaining amount of memory.
Example:
Replica:
  Read Concurrency Semaphores:
    user sstable reads:       11/100,     263621214/     42949672 B, queued: 847
    streaming sstable reads:   0/ 10,             0/     42949672 B, queued: 0
    system sstable reads:      1/ 10,        251584/     42949672 B, queued: 0

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200901091452.806419-1-bdenes@scylladb.com>
2020-09-01 16:26:57 +02:00
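The summary line described in b3f00685ec amounts to subtracting the remaining resources from the initially configured ones. A minimal sketch of that arithmetic, with a hypothetical `summarize` helper and plain `(count, memory)` tuples standing in for the semaphore's resource fields:

```python
def summarize(name, initial, available, queued):
    """Format used/initial resources for one semaphore.

    initial and available are (count, memory_bytes) pairs; used is the
    difference between them.
    """
    used_count = initial[0] - available[0]
    used_memory = initial[1] - available[1]
    return (f"{name}: {used_count}/{initial[0]}, "
            f"{used_memory}/{initial[1]} B, queued: {queued}")


# Figures taken from the "system sstable reads" line in the example.
line = summarize(
    "system sstable reads",
    initial=(10, 42949672),
    available=(9, 42949672 - 251584),
    queued=0,
)
```

When the initial resources are not available, only the remaining amount can be shown, which is the fallback the commit mentions.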
Nadav Har'El
52f92b886b alternator streams: fix bug returning the same change again
This patch fixes a bug which caused sporadic failures of the Alternator
test - test_streams.py::test_streams_last_result.

The GetRecords operation reads from an Alternator Streams shard and then
returns an "iterator" from where to continue reading next time. Because
we obviously don't want to read the same change again, we "incremented"
the current position, to start at the incremented position on the next read.

Unfortunately, the implementation of the increment() function wasn't quite
right. The position in the CDC log is a timeuuid, which has a really bizarre
comparison function (see compare_visitor in types.cc). In particular the
least-significant bytes of the UUID are compared as *signed* bytes. This
means that if the last byte of the UUID was 127, increment() increased
it to 128, which was wrong because the comparison function later treated
it as a signed byte, where 128 is lower than 127, not higher! The result
was that with 1/256 probability (whenever the last byte of the position was
127) we would return an item twice. This was reproduced (with 1/256
probability) by the test test_streams_last_result, as reported in issue #7004.

The fix in this patch is to drop the increment() and replace it by a flag
whether an iterator is inclusive of the threshold (>=) or exclusive (>).
The internal representation of the iterator has a boolean flag "inclusive",
and the string representation uses the prefixes "I" or "i" to indicate an
inclusive or exclusive range, respectively - whereas before this patch we
always used the prefix "I".

Although increment() could have been fixed to work correctly, the result would
have been ugly because of the weirdness of the timeuuid comparison function.
increment() would also require extensive new unit-tests: we were lucky that
the high-level functional tests caught a 1 in 256 error, but they would not
have caught rarer errors (e.g., 1 in 2^32). Furthermore, I am looking at
Alternator as the first "user" of CDC, and seeing how complicated and
error-prone increment() is, we should not recommend to users to use this
technique - they should use exclusive (>) range queries instead.

Fixes #7004.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200901102718.435227-1-nyh@scylladb.com>
2020-09-01 12:28:39 +02:00
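The failure mode in 52f92b886b comes down to how a byte value compares once it is read as signed. The sketch below shows only that effect and the inclusive/exclusive alternative; it does not reproduce the timeuuid comparison in types.cc, and `as_signed_byte` and `records_after` are hypothetical helpers operating on plain integers.

```python
def as_signed_byte(b: int) -> int:
    """Interpret an unsigned byte value (0..255) as a signed byte."""
    return b - 256 if b >= 128 else b


def records_after(positions, threshold, inclusive):
    """Select positions at or after the threshold, without incrementing it."""
    if inclusive:
        return [p for p in positions if p >= threshold]
    return [p for p in positions if p > threshold]


before = as_signed_byte(127)       # last byte before increment()
after = as_signed_byte(127 + 1)    # last byte after increment()
```

Incrementing 127 yields a byte that sorts lower when compared as signed, whereas an exclusive (>) query never has to compute a successor position at all.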
Avi Kivity
37352a73b8 Update tools/python3 submodule
* tools/python3 f89ade5...19a9cd3 (1):
  > dist: redhat: reduce log spam from unpacking sources when building rpm
2020-09-01 12:36:24 +03:00
Pavel Emelyanov
86897aa040 partition_version: Remove dead code
The rows_iterator is no longer in use since 70c72773

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200831191208.18418-1-xemul@scylladb.com>
2020-09-01 10:19:47 +03:00
Raphael S. Carvalho
7f7f366cb5 compaction: add debug msg to inform the amount of expired ssts skipped by compaction
This information is useful when debugging compaction issues that involve
fully expired ssts.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200828140401.96440-1-raphaelsc@scylladb.com>
2020-08-31 17:18:47 +03:00
Amos Kong
5785947e28 unified/install.sh: set default python3/sysconfdir smartly
Users can set python3 and sysconfdir from cmdline of install.sh
according to the install mode (root or nonroot) and distro type.

It's helpful to correct the default python3/sysconfdir, otherwise
setup scripts or scylla-jmx don't work.

Fixes #7130

Signed-off-by: Amos Kong <amos@scylladb.com>
2020-08-31 15:54:51 +03:00
Amos Kong
83d7454787 install.sh: clean tmp scylla.yaml after installation
If the install.sh is executed by sudo, then the tmp scylla.yaml is owned
by root. It's difficult to overwrite it by non-privileged users.

Signed-off-by: Amos Kong <amos@scylladb.com>
2020-08-31 15:54:51 +03:00
Kamil Braun
ff78a3c332 cdc: rename CDC description tables... again
Commit a6ad70d3da changed the format of
stream IDs: the lower 8 bytes were previously generated randomly, now
some of them have semantics. In particular, the least significant byte
contains a version (stream IDs might evolve with further releases).

This is a backward-incompatible change: the code won't properly handle
stream IDs with all lower 8 bytes generated randomly. To protect us from
subtle bugs, the code has an assertion that checks the stream ID's
version.

This means that if an experimental user used CDC before the change and
then upgraded, they might hit the assertion when a node attempts to
retrieve a CDC generation with old stream IDs from the CDC description
tables and then decode it.
In effect, the user won't even be able to start a node.

Similarly as with the case described in
d89b7a0548, the simplest fix is to rename
the tables. This fix must get merged in before CDC goes out of
experimental.

Now, if the user upgrades their cluster from a pre-rename version, the
node will simply complain that it can't obtain the CDC generation
instead of preventing the cluster from working. The user will be able to
use CDC after running checkAndRepairCDCStreams.

Since a new table is added to the system_distributed keyspace, the
cluster's schema has changed, so sstables and digests need to be
regenerated for schema_digest_test.
2020-08-31 11:33:14 +03:00
Nadav Har'El
f3dfb6e011 merge: cdc: Remove post-filterings for keys-only/off cdc delta generation
Merged pull request https://github.com/scylladb/scylla/pull/7121
By Calle Wilund:

Refs #7095
Fixes #7128

CDC delta!=full both relied on post-filtering
to remove generated log row and/or cells. This is inefficient.
Instead, simply check if the data should be created in the
visitors.

Also removed delta_mode=off mode.

  cdc: Remove post-filterings for keys-only/off cdc delta generation
  cdc: Remove cdc delta_mode::off
2020-08-31 11:22:09 +03:00
Calle Wilund
70a282ced2 cdc: Remove post-filterings for keys-only/off cdc delta generation
Refs #7095

CDC delta!=full both relied on post-filtering
to remove generated log row and/or cells. This is inefficient.
Instead, simply check if the data should be created in the
visitors.

v2:
* Fixed delta logs rows created (empty) even when delta == off
v3:
* Killed delta == off
v4:
* Move checks into (const) member var(s)
2020-08-31 07:59:43 +00:00
Calle Wilund
78236c015a cdc: Remove cdc delta_mode::off
Fixes #7128

CDC logs are not useful without at least delta_mode==keys, since
pre/post image data has no info on _what_ was actually done to
base table in source mutation.
2020-08-31 07:59:40 +00:00
Asias He
8b4530a643 repair: Add progress metrics for removenode ops
The following metric is added:

scylla_node_maintenance_operations_removenode_finished_percentage{shard="0",type="gauge"} 0.650000

It reports how much of the removenode operation has finished so
far.

Fixes #1244, #6733
2020-08-31 14:43:39 +08:00
Asias He
25e03233f1 repair: Add progress metrics for decommission ops
The following metric is added:

scylla_node_maintenance_operations_decommission_finished_percentage{shard="0",type="gauge"} 0.650000

It reports how much of the decommission operation has finished so
far.

Fixes #1244, #6733
2020-08-31 14:43:39 +08:00
Asias He
80cb157669 repair: Add progress metrics for replace ops
The following metric is added:

scylla_node_maintenance_operations_replace_finished_percentage{shard="0",type="gauge"} 0.650000

It reports how much of the replace operation has finished so far.

Fixes #1244, #6733
2020-08-31 14:03:05 +08:00
Etienne Adam
19683d04c6 redis: add hget and hset commands
The hget and hset commands use hashes internally, thus
they do not use the existing write_strings() function.

Limitations:
 - hset only supports 3 params, instead of multiple field/value
list that is available in official redis-server.
 - hset should return 0 when the key and field already exist,
but I am not sure it's possible to retrieve this information
without doing read-before-write, which would not be atomic.

I factorized a bit the query_* functions to reduce duplication, but
I am not 100% sure of the naming, it may still be a bit confusing
between the schema used (strings, hashes) and the returned format
(currently only string but array should come later with hgetall).

Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200830190128.18534-1-etienne.adam@gmail.com>
2020-08-30 22:05:41 +03:00
Takuya ASADA
f1255cb2d0 unified: add uninstall.sh
Provide an uninstaller for offline & nonroot installation.

Fixes #7076
2020-08-29 20:55:06 +03:00
Botond Dénes
f063dc22af scylla-gdb: add scylla compaction-tasks command
Summarize the compaction_manager::task instances. Useful for detecting
compaction related problems. Example:

(gdb) scylla compaction-tasks
     2116 type=sstables::compaction_type::Compaction, running=false, "cdc_test"."test_table_postimage_scylla_cdc_log"
      769 type=sstables::compaction_type::Compaction, running=false, "cdc_test"."test_table_scylla_cdc_log"
      750 type=sstables::compaction_type::Compaction, running=false, "cdc_test"."test_table_preimage_postimage_scylla_cdc_log"
      731 type=sstables::compaction_type::Compaction, running=false, "cdc_test"."test_table_preimage_scylla_cdc_log"
      293 type=sstables::compaction_type::Compaction, running=false, "cdc_test"."test_table"
      286 type=sstables::compaction_type::Compaction, running=false, "cdc_test"."test_table_preimage"
      230 type=sstables::compaction_type::Compaction, running=false, "cdc_test"."test_table_postimage"
       58 type=sstables::compaction_type::Compaction, running=false, "cdc_test"."test_table_preimage_postimage"
        4 type=sstables::compaction_type::Compaction, running=true , "cdc_test"."test_table_postimage_scylla_cdc_log"
        2 type=sstables::compaction_type::Compaction, running=true , "cdc_test"."test_table"
        2 type=sstables::compaction_type::Compaction, running=true , "cdc_test"."test_table_preimage_postimage_scylla_cdc_log"
        2 type=sstables::compaction_type::Compaction, running=true , "cdc_test"."test_table_preimage"
        1 type=sstables::compaction_type::Compaction, running=true , "cdc_test"."test_table_preimage_postimage"
        1 type=sstables::compaction_type::Compaction, running=true , "cdc_test"."test_table_scylla_cdc_log"
        1 type=sstables::compaction_type::Compaction, running=true , "cdc_test"."test_table_preimage_scylla_cdc_log"
Total: 5246 instances of compaction_manager::task

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200828135030.689188-1-bdenes@scylladb.com>
2020-08-28 16:00:14 +02:00
Botond Dénes
727e9be342 scylla-gdb.py: scylla sstables: add --histogram option
Allows printing a per-table summary of sstables. Example:
(gdb) scylla sstables --histogram
     2103 "cdc_test"."test_table_postimage_scylla_cdc_log"
      751 "cdc_test"."test_table_preimage_postimage_scylla_cdc_log"
      734 "cdc_test"."test_table_preimage_scylla_cdc_log"
      723 "cdc_test"."test_table_scylla_cdc_log"
      285 "cdc_test"."test_table"
      164 "cdc_test"."test_table_postimage"
      150 "cdc_test"."test_table_preimage"
       55 "cdc_test"."test_table_preimage_postimage"
        1 "system"."clients"
        1 "system"."compaction_history"
        1 "system_auth"."roles"
        1 "system"."peers"
total (shard-local): count=4969, data_file=171953448, in_memory=19195136

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200828091848.673398-1-bdenes@scylladb.com>
2020-08-28 11:36:37 +02:00
Pavel Solodovnikov
88ba184247 paxos: use schema_registry when applying accepted proposal if there is schema mismatch
Try to look up and use the schema from the local schema_registry
when there is a schema mismatch between the most recent schema
version and the one that is stored inside the `frozen_mutation` for
the accepted proposal.

When such a situation happens, the stored `frozen_mutation` can
be applied only if we are lucky enough and the column_mapping in
the mutation is "compatible" with the new table schema.

It wouldn't work if, for example, the columns are reordered, or
some columns, which are referenced by an LWT query, are dropped.

With the patch we are able to mitigate these cases as long as the
referenced schema is still present in the node cache (e.g.
the node didn't restart/crash and the cache entry is not old enough
to have been evicted).

Tests: unit(dev, debug), dtest(paxos_tests.schema_mismatch_*_test)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200827150844.624017-1-pa.solodovnikov@scylladb.com>
2020-08-27 19:04:09 +02:00
Amnon Heiman
68b3ed1c9a storage_service.cc: get_natural_endpoints should translate key
The get_natural_endpoints returns the list of nodes holding a key.

There is a variation of the method that gets the key as a string; the
current implementation just casts the string to bytes_view, which will
not work. Instead, this patch changes the implementation to use
from_nodetool_style_string to translate the key (in a nodetool-like
format) to a token.

Fixes #7134

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-08-27 18:25:15 +03:00
Rafael Ávila de Espíndola
d18af34205 everywhere: Use future::get0 when appropriate
This works with current seastar and clears most of the way for
updating to a version that doesn't use std::tuple in futures.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200826231947.1145890-1-espindola@scylladb.com>
2020-08-27 15:05:51 +03:00
Nadav Har'El
1da3af5420 alternator test: enable a passing test
After issue #7107 was fixed (regarding the correctness of OldImage and NewImage
in Alternator Streams) we forgot to remove the "xfail" tag from one of the tests
for this issue.

This test now passes, as expected, so in this patch we remove the xfail tag.

Refs #7107

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200827103054.186555-1-nyh@scylladb.com>
2020-08-27 14:15:32 +03:00
Nadav Har'El
0faf91f254 docs: fix typo in alternator/getting-started.md
alternator/getting-started.md had a missing grave accent (`) character,
resulting in messed up rendering of the involved paragraph. Add the missing
quote.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200827110920.187328-1-nyh@scylladb.com>
2020-08-27 14:11:00 +03:00
Piotr Sarna
ca9422ca73 Merge 'Fix view_builder lockup and crash on shutdown' from Pavel
The lockup:

When view_builder starts, all shards at some point get to a
barrier waiting for each other to pass. If any shard misses
this checkpoint, all others get stuck forever. As this barrier
lives inside the _started future, which in turn is waited
on in stop, the stop gets stuck as well.

Reasons to miss the barrier -- exception in the middle of the
fun^w start or explicit abort request while waiting for the
schema agreement.

Fix the "exception" case by unlocking the barrier promise with
exception and fix the "abort request" case by turning it into
an exception.

The bug can be reproduced by hand by making one shard never
see the schema agreement and continue looping until the abort
request.

The crash:

If the background start up fails, then the _started future is
resolved into exception. The view_builder::stop then turns this
future into a real exception caught-and-rethrown by main.cc.

It seems wrong that a failure in a background fiber aborts
the regular shutdown, which may otherwise proceed.

tests: unit(dev), manual start-stop
branch: https://github.com/xemul/scylla/tree/br-view-builder-shutdown-fix-3
fixes: #7077

Patch #5 leaves the seastar::async() in the first phase of
start(), although it can also be tuned not to produce a thread.
However, there's one more (painless) issue with the _sem usage,
so this change appears too large to be part of the bug-fix
and will come as a followup.

* 'br-view-builder-shutdown-fix-3' of git://github.com/xemul/scylla:
  view_builder: Add comment about builder instances life-times
  view_builder: Do sleep abortable
  view_builder: Wakeup barrier on exception
  view_builder: Always resolve started future to success
  view_builder: Re-futurize start
  view_builder: Split calculate_shard_build_step into two
  view_builder: Populate the view_builder_init_state
  view_builder: Fix indentation after previous patch
  view_builder: Introduce view_builder_init_state
2020-08-27 11:51:46 +02:00
Nadav Har'El
95afadfe21 merge: alternator_streams: Include keys in OldImage/NewImage
Merged pull request https://github.com/scylladb/scylla/pull/7063
By Calle Wilund:

Fixes #6935

DynamoDB streams for some reason duplicate the record keys
into both the "Keys" and "OldImage"/"NewImage" sub-objects
when doing GetRecords. But only if there is other data
to include.

This patch appends the pk/ck parts into old/new image
iff we had any record data.

Updated to handle keys-only updates, and distinguish creating vs. updating rows. Changes cdc to not generate preimage
for non-existent/deleted rows, and also fixes missing operations/ttls in keys-only delta mode.

  alternator_streams: Include keys in OldImage/NewImage
  cdc: Do not generate pre/post image for non-existent rows
2020-08-27 11:23:35 +03:00
Pekka Enberg
0f1b54fa6e Update tools/java submodule
* tools/java d6c0ad1e2e...2d49ded77b (1):
  > sstableloader: remove wrong check that breaks range tombstones
2020-08-27 09:05:34 +03:00
Calle Wilund
678ecc7469 alternator_streams: Include keys in OldImage/NewImage
Fixes #6935
Fixes #7107

DynamoDB streams for some reason duplicate the record keys
into both the "Keys" and "OldImage"/"NewImage" sub-objects
when doing GetRecords.

This patch appends the pk/ck parts into old/new image, and
also removes the previous restrictions on image generation
since cdc now generates more consistent pre/post image
data.
2020-08-26 18:14:09 +00:00
Calle Wilund
e50911e5b0 cdc: Do not generate pre/post image for non-existent rows
Fixes #7119
Fixes #7120

If the preimage select came up empty, i.e. the row did not exist, either
because it was never created or because it was deleted, we should not bother
creating a log preimage row for it, especially since it makes it harder to
interpret the cdc log.

If an operation in a cdc batch did a row delete (ranged, ck, etc), do
not generate postimage data, since the row no longer exists.
Note that we differentiate deleting all (non-pk/ck) columns from actual
row delete.
2020-08-26 18:14:09 +00:00
Pavel Emelyanov
812eed27fe code: Force formatting of pointer in .debug and .trace
... and tests. Printing a pointer in logs is considered to be a bad practice,
so the proposal is to keep this explicit (with fmt::ptr) and allow it for
.debug and .trace cases.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 20:44:11 +03:00
Pavel Emelyanov
366b4e8a8f code: Format { and } as {fmt} needs
There are two places that want to print "{<text>}" strings, but do not format
the curly braces the {fmt}-way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 20:44:11 +03:00
Pavel Emelyanov
78f2193956 streaming: Do not reveal raw pointer in info message
Showing raw pointer values in logs is not considered to be good
practice. However, for debugging/tracing this might be helpful.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 20:44:11 +03:00
Pavel Emelyanov
50e3a30dae mp_row_consumer: Provide hex-formatting wrapper for bytes_view
By default {fmt} doesn't know how to format this type (although it's a
basic_string_view instantiated), and even providing formatter/operator<<
does not help -- it anyway hits an earlier assertion in args mapper about
the disallowance of character types mixing.

The hex-wrapper with own operator<< solves the problem.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 20:44:11 +03:00
Pavel Emelyanov
fe33e3ed78 heat_load_balance: Include fmt/ranges.h
To provide vector<> formatter for {fmt}

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 20:44:08 +03:00
Avi Kivity
3daa49f098 Merge "materialized views: Fix undefined behavior on base table schema changes" from Tomasz
"
The view_info object, which is attached to the schema object of the
view, contains a data structure called
"base_non_pk_columns_in_view_pk". This data structure contains column
ids of the base table so is valid only for a particular version of the
base table schema. This data structure is used by materialized view
code to interpret mutations of the base table, those coming from base
table writes, or reads of the base table done as part of view updates
or view building.

The base table schema version of that data structure must match the
schema version of the mutation fragments, otherwise we hit undefined
behavior. This may include aborts, exceptions, segfaults, or data
corruption (e.g. writes landing in the wrong column in the view).

Before this patch, we could get schema version mismatch here after the
base table was altered. That's because the view schema did not change
when the base table was altered.

Another problem was that view building was using the current table's schema
to interpret the fragments and invoke view building. That's incorrect for two
reasons. First, fragments generated by a reader must be accessed only using
the reader's schema. Second, base_non_pk_columns_in_view_pk of the recorded
view ptrs may no longer match the current base table schema, which is used
to generate the view updates.

Part of the fix is to extract base_non_pk_columns_in_view_pk into a
third entity called base_dependent_view_info, which changes both on
base table schema changes and view schema changes.

It is managed by a shared pointer so that we can take immutable
snapshots of it, just like with schema_ptr. When starting the view
update, the base table schema_ptr and the corresponding
base_dependent_view_info have to match. So we must obtain them
atomically, and base_dependent_view_info cannot change during update.

Also, whenever the base table schema changes, we must update
base_dependent_view_infos of all attached views (atomically) so that
it matches the base table schema.

Fixes #7061.

Tests:

  - unit (dev)
  - [v1] manual (reproduced using scylla binary and cqlsh)
"

* tag 'mv-schema-mismatch-fix-v2' of github.com:tgrabiec/scylla:
  db: view: Refactor view_info::initialize_base_dependent_fields()
  tests: mv: Test dropping columns from base table
  db: view: Fix incorrect schema access during view building after base table schema changes
  schema: Call on_internal_error() when out of range id is passed to column_at()
  db: views: Fix undefined behavior on base table schema changes
  db: views: Introduce has_base_non_pk_columns_in_view_pk()
2020-08-26 17:37:52 +03:00
Pavel Emelyanov
cf1cb4d145 view_builder: Add comment about builder instances life-times
The barrier passing is tricky and deserves a description
about objects' life-times.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 15:56:38 +03:00
Pavel Emelyanov
643c431ce4 view_builder: Do sleep abortable
If one shard delays in seeing the schema agreement and returns on
abort request, other shards may get stuck waiting for it on the
status read barrier. Luckily with the previous patch the barrier
is exception-proof, so we may abort the waiting loop with exception
and handle the lock-up.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 15:56:38 +03:00
Pavel Emelyanov
c36bbc37c9 view_builder: Wakeup barrier on exception
If an exception pops up during the view_builder::start while some
shards wait for the status-read barrier, these shards are not woken
up, thus causing the shutdown to get stuck.

Fix this by setting exception on the barrier promise, resolving all
pending and on-going futures.
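
The idea of waking every waiter with the error can be sketched without seastar. This is a minimal single-threaded illustration, not the view_builder code: `start_barrier`, `wait`, `pass` and `abort` are hypothetical names, and waiters are modelled as plain callbacks rather than futures.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical one-shot barrier; waiters are callbacks, not seastar futures.
class start_barrier {
    using waiter = std::function<void(std::exception_ptr)>;
    std::vector<waiter> _waiters;

    // Wake every registered waiter exactly once; a null pointer means success.
    void release(std::exception_ptr e) {
        auto waiters = std::move(_waiters);
        _waiters.clear();
        for (auto& w : waiters) {
            w(e);
        }
    }
public:
    void wait(waiter w) { _waiters.push_back(std::move(w)); }
    // Normal path: everybody reached the barrier.
    void pass() { release(nullptr); }
    // Error path: do not leave waiters hanging, hand them the exception.
    void abort(std::exception_ptr e) { release(std::move(e)); }
};
```

With this shape, a failure before the barrier calls `abort(std::current_exception())` instead of silently returning, so no waiter is left pending.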

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 15:56:38 +03:00
Pavel Emelyanov
8f8ed625ab view_builder: Always resolve started future to success
If the view builder background start fails, the _started future resolves
to exceptional state. In turn, stopping the view builder keeps this state
through .finally() and aborts the shutdown very early, while it may and
should proceed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 15:56:38 +03:00
Pavel Emelyanov
60e21bb59a view_builder: Re-futurize start
Step two turning the view_builder::start() into a chain of lambdas --
rewrite (most of) the seastar::async()'s lambda into a more "classical"
form.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 15:56:38 +03:00
Pavel Emelyanov
77c7d94f85 view_builder: Split calculate_shard_build_step into two
The calculate_shard_build_step() has a cross-shard barrier in the middle and
passing the barrier is broken wrt exceptions that may happen before it. The
intention is to prepare this barrier passing for exception handling by turning
the view_builder::start() into a dedicated continuation lambda.

Step one in this campaign -- split the calculate_shard_build_step() into
steps called by view_builder::start():

 - before the barrier
 - barrier
 - after the barrier

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 15:56:38 +03:00
Pavel Emelyanov
fe0326b75b view_builder: Populate the view_builder_init_state
Keep the internal calculate_shard_build_step()'s stuff on the init helper
struct, as the method in question is about to be split into a chain of
continuation lambdas.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 15:56:35 +03:00
Pavel Emelyanov
2d2d04c6b7 view_builder: Fix indentation after previous patch
No functional changes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 15:46:36 +03:00
Pavel Emelyanov
d0393d92a2 view_builder: Introduce view_builder_init_state
This is the helper initialization struct that will carry the needed
objects across continuation lambdas.

The indentation in ::start() will be fixed in the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 15:45:15 +03:00
Avi Kivity
7416b3c34b Merge 'scylla-gdb.py: Add scylla repairs command' from Asias
"
This series adds scylla repairs command to help debug repair.

Fixes #7103
"

* asias-repair_help_debug_scylla_repairs_cmd:
  scylla-gdb.py: Add scylla repairs command
  repair: Add repair_state to track repair states
  scylla-gdb.py: Print the pointers of elements in boost_intrusive_list_printer
  scylla-gdb.py: Add printer for gms::inet_address
  scylla-gdb.py: Fix a typo in boost_intrusive_list
  repair: Fix the incorrect comments for _all_nodes
  repair: Add row_level_repair object pointer in repair_meta
  repair: Add counter for reads issued and finished for repair_reader
2020-08-26 13:57:31 +03:00
Avi Kivity
2b308a973f Merge 'Move temporaries to value view' from Piotr S
"
Issue https://github.com/scylladb/scylla/issues/7019 describes a problem of an ever-growing map of temporary values stored in query_options. In order to mitigate this kind of problems, the storage for temporary values is moved from an external data structure to the value views itself. This way, the temporary lives only as long as it's accessible and is automatically destroyed once a request finishes. The downside is that each temporary is now allocated separately, while previously they were bundled in a single byte stream.

Tests: unit(dev)
Fixes https://github.com/scylladb/scylla/issues/7019
"

* psarna-move_temporaries_to_value_view:
  cql3: remove query_options::linearize and _temporaries
  cql3: remove make_temporary helper function
  cql3: store temporaries in-place instead of in query_options
  cql3: add temporary_value to value view
  cql3: allow moving data out of raw_value
  cql3: split values.hh into a .cc file
2020-08-26 13:19:17 +03:00
Benny Halevy
f5ffd5fc5f sstables: Fix reactor stall in sstables::seal_summary()
With relatively big summaries, reactor can be stalled for a couple
of milliseconds.

This patch:
a. allocates positions upfront to avoid excessive reallocation.
b. returns a future from seal_summary() and uses `seastar::do_for_each`
to iterate over the summary entries so the loop can yield if necessary.

Fixes #7108.

Based on 2470aad5a389dfd32621737d2c17c7e319437692 by Raphael S. Carvalho <raphaelsc@scylladb.com>

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200826091337.28530-1-bhalevy@scylladb.com>
2020-08-26 12:18:05 +03:00
Botond Dénes
6ee36eeeb2 scylla-gdb.py: scylla memory: update w.r.t. moved per-sched group data
Per scheduling-group data was moved from the task queues to a separate
data member in the reactor itself. Update `scylla memory` to use the new
location to get the per scheduling-group data for the storage proxy
stats.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200825140256.309299-1-bdenes@scylladb.com>
2020-08-26 11:17:08 +02:00
Avi Kivity
6ff12b7f79 repair: apply_rows_on_follower(): remove copy of repair_rows list
We copy a list, which was reported to generate a 15ms stall.

This is easily fixed by moving it instead, which is safe since this is
the last use of the variable.

Fixes #7115.
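
A minimal sketch of the copy-versus-move difference, assuming a plain std::list. The names `row`, `sink`, `hand_over_by_copy` and `hand_over_by_move` are hypothetical and do not come from the repair code.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <utility>

// Hypothetical stand-in for a repair row; not the real repair_row type.
struct row {
    std::string data;
};

// Hypothetical consumer that takes ownership of the rows it is given.
struct sink {
    std::list<row> rows;
    void accept(std::list<row> incoming) { rows = std::move(incoming); }
};

// Copying duplicates every node of the list: O(n) allocations and copies.
inline void hand_over_by_copy(sink& s, std::list<row>& rows) {
    s.accept(rows);
}

// Moving hands the existing nodes over in O(1).
// Only safe when the caller does not read `rows` afterwards.
inline void hand_over_by_move(sink& s, std::list<row>& rows) {
    s.accept(std::move(rows));
}
```

The move variant is appropriate only at the last use of the variable, which is the condition the commit message states.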
2020-08-26 11:52:39 +03:00
Benny Halevy
78a44dda57 sstables: avoid double close in file_writer destructor
If file_writer::close() fails to close the output stream
closing will be retried in file_writer::~file_writer,
leading to:
```
include/seastar/core/future.hh:1892: seastar::future<T ...> seastar::promise<T>::get_future() [with T = {}]: Assertion `!this->_future && this->_state && !this->_task' failed.
```
as seen in https://github.com/scylladb/scylla/issues/7085

Fixes #7085
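
One way to avoid retrying a failed close from the destructor is to record that a close was attempted before attempting it. The sketch below illustrates that pattern only; `guarded_writer` is hypothetical and this is not claimed to be how file_writer was actually changed.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical writer, not the real sstables::file_writer.
class guarded_writer {
    int* _close_calls;     // counts attempts to close the underlying stream
    bool _fail_close;      // injected error, for illustration
    bool _closed = false;  // set once a close has been attempted
public:
    guarded_writer(int* close_calls, bool fail_close)
        : _close_calls(close_calls), _fail_close(fail_close) {}

    void close() {
        // Mark as closed before the attempt so that a failure is not retried.
        _closed = true;
        ++*_close_calls;
        if (_fail_close) {
            throw std::runtime_error("close failed");
        }
    }

    ~guarded_writer() {
        if (!_closed) {
            try {
                close();
            } catch (...) {
                // A destructor must not throw; the error is dropped here.
            }
        }
    }
};
```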

Test: unit(dev), database_test with injected error in posix_file_impl::close()
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200826062456.661708-1-bhalevy@scylladb.com>
2020-08-26 11:33:23 +03:00
Nadav Har'El
d4b452002a alternator test: tests for NewImage feature of Alternator Streams
This patch adds tests for the "NewImage" attribute in Alternator Streams
in NEW_IMAGE and NEW_AND_OLD_IMAGES mode.

It reproduces issue #7107, that items' key attributes are missing in the
NewImage. It also verifies the risky corner cases where the new item is
"empty" and NewImage should include just the key, vs. the case where the
item is deleted, so NewImage should be missing.

This test currently passes on AWS DynamoDB, and xfails on Alternator.

Refs #7107.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200825113857.106489-1-nyh@scylladb.com>
2020-08-26 11:33:23 +03:00
Nadav Har'El
868194cd17 redis: fix another use-after-free crash in "exists" command
Never trust Occam's Razor - it turns out that the use-after-free bug in the
"exists" command was caused by two separate bugs. We fixed one in commit
9636a33993, but there is a second one fixed in
this patch.

The problem fixed here was that a "service_permit" object, which is designed to
be copied around from place to place (it contains a shared pointer, so is cheap
to copy), was saved by reference, and the reference was to a function argument
and was destroyed prematurely.
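
The lifetime problem can be sketched with a small stand-in type. `permit` and `make_deferred` are hypothetical names, not the redis or service_permit code; the unsafe variant is left as a comment because running it would be undefined behavior.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for a cheap-to-copy handle wrapping a shared pointer.
struct permit {
    std::shared_ptr<int> units = std::make_shared<int>(1);
};

// BAD (intentionally not compiled): capturing the parameter by reference
// leaves the returned callable with a dangling reference once the function
// returns and `p` is destroyed.
//   return [&p] { return *p.units; };

// GOOD: capture by value, so the callable owns its own copy of the handle
// and keeps the shared state alive for as long as the callable exists.
inline std::function<int()> make_deferred(permit p) {
    return [p] { return *p.units; };
}
```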

This time I tested *many times* that test_strings.py passes on both dev and
debug builds.

Note that test/run/redis still fails in a debug build, but due to a different
problem.

Fixes #6469

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200825183313.120331-1-nyh@scylladb.com>
2020-08-26 11:33:23 +03:00
Nadav Har'El
8e06734893 redis test: add default host and port
test/redis/README.md suggests that when running "pytest" the default is to connect
to a local redis on localhost:6379. This default was recently lost when options
were added to use a different host and port. It's still good to have the default
suggested in README.md.

It also makes it easier to run the tests against the standard redis, which by
default runs on localhost:6379 - by just running "pytest".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200825195143.124429-1-nyh@scylladb.com>
2020-08-26 11:33:23 +03:00
Rafael Ávila de Espíndola
0f9ad5151c auth: Inline standard_role_manager_name into only use
This is just a leftover cleanup I found in my git repo while rebasing.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200819184911.60687-1-espindola@scylladb.com>
2020-08-26 11:33:23 +03:00
Rafael Ávila de Espíndola
5fcfbd76a9 sstables: Delete duplicated code
For some reason date_tiered_compaction_strategy had its own identical
copy of get_value.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200819211509.106594-1-espindola@scylladb.com>
2020-08-26 11:33:23 +03:00
Piotr Sarna
7055297649 cql3: remove query_options::linearize and _temporaries
query_options::linearize was the only user of the _temporaries helper
attribute, and it turns out that this function is never used,
and is therefore removed.
2020-08-26 09:45:49 +02:00
Piotr Sarna
c0a7eda2a8 cql3: remove make_temporary helper function
Since temporary values will no longer be stored inside query options,
the helper function is removed altogether.
2020-08-26 09:45:49 +02:00
Piotr Sarna
70b09dcdf1 cql3: store temporaries in-place instead of in query_options
As a first step towards removing _temporaries from query options
altogether, all usages of query_options::make_temporary are
removed.
2020-08-26 09:45:49 +02:00
Piotr Sarna
ddd36de0ff cql3: add temporary_value to value view
When a value_view needs to store a temporarily instantiated object,
it can use the new variant field. The temporary value will live
only as long as the view itself.
2020-08-26 09:45:48 +02:00
Piotr Sarna
94a258d06c cql3: allow moving data out of raw_value
in order to be able to elide copying when transferring
data from raw_value.
2020-08-26 09:35:53 +02:00
Piotr Sarna
a4b07955c5 cql3: split values.hh into a .cc file
Some bigger functions are moved out-of-line. The .cc file
is going to be needed in next patches, which allow creating
a temporary value for a view.
2020-08-26 09:29:07 +02:00
Asias He
f30895ad22 scylla-gdb.py: Add scylla repairs command
This command lists all the active repair_meta objects for both repair
master and repair follower.

For example:

(gdb) scylla repairs

 (repair_meta*) for masters: addr = 0x600005abf830, table = myks2.standard1, ip = 127.0.0.1, states = ['127.0.0.1->repair_state::get_sync_boundary_started', '127.0.0.3->repair_state::get_sync_boundary_finished'], repair_meta = {
   _db = @0x7fffe538c9f0,
   _messaging = @0x7fffe538ca90,
   _cf = @0x6000066f0000,

 ....

 (repair_meta*) for masters: addr = 0x60000521f830, table = myks2.standard1, ip = 127.0.0.1, states = ['127.0.0.1->repair_state::get_sync_boundary_started', '127.0.0.2->repair_state::get_sync_boundary_started'], repair_meta = {
   _db = @0x7fffe538c9f0,
   _messaging = @0x7fffe538ca90,
   _cf = @0x6000066f0000,

 ....

 (repair_meta*) for follower: addr = 0x60000432a808, table = myks2.standard1, ip = 127.0.0.1, states = ['127.0.0.1->repair_state::get_sync_boundary_started', '127.0.0.2->repair_state::unknown'], repair_meta = {
   _db = @0x7fffe538c9f0,
   _messaging = @0x7fffe538ca90,
   _cf = @0x6000066f0000,

Fixes #7103
2020-08-26 11:19:25 +08:00
Asias He
ab57cea783 repair: Add repair_state to track repair states
Use repair_state to track the major state of repair from the beginning
to the end of repair.

With this patch, we can easily know at which state both the repair
master and followers are. It is very helpful when debugging a repair
hang issue.

Refs #7103
2020-08-26 11:19:25 +08:00
Asias He
77c2e69e22 scylla-gdb.py: Print the pointers of elements in boost_intrusive_list_printer
Sometimes it is helpful to print the pointers of the object in the list.

For example:

(gdb) p debug::repair_meta_for_masters._repair_metas

$1 = boost::intrusive::list of size 3 = [0x6000051df830, 0x60000221f830, 0x60000473f830] = [@0x6000051df830={
  _db = @0x7fffe538c9f0,
  _messaging = @0x7fffe538ca90,
  _cf = @0x6000066f0000,
  _schema = {
    _p = 0x600006568700
  },
  _range = {

  ...

(gdb) p debug::repair_meta_for_followers._repair_metas

$2 = boost::intrusive::list of size 3 = [0x60000081a808, 0x60000432b008, 0x60000432a808] = [@0x60000081a808={
  _db = @0x7fffe538c9f0,
  _messaging = @0x7fffe538ca90,
  _cf = @0x6000066f0000,
  _schema = {
    _p = 0x600006568700
  },

  ...

Refs #7103
2020-08-26 11:14:17 +08:00
Asias He
2b65f80271 scylla-gdb.py: Add printer for gms::inet_address
We need this to print the address of the peer nodes in repair.

Refs #7103
2020-08-26 10:12:07 +08:00
Asias He
0433f1060f scylla-gdb.py: Fix a typo in boost_intrusive_list
It is boost_intrusive_list not b0ost_intrusive_list.

Refs #7103
2020-08-26 10:12:07 +08:00
Asias He
9ee86bb5a0 repair: Fix the incorrect comments for _all_nodes
The _all_nodes field contains both the peer nodes and the node itself.

Refs #7103
2020-08-26 10:12:07 +08:00
Asias He
656ff93d49 repair: Add row_level_repair object pointer in repair_meta
It is helpful to track back the row_level_repair object for repair
master when debugging.

Refs #7103
2020-08-26 10:12:07 +08:00
Asias He
283c3dae0a repair: Add counter for reads issued and finished for repair_reader
It is helpful to check whether the reader blocks forever when debugging a
repair hang.

Refs #7103
2020-08-26 10:12:07 +08:00
Pekka Enberg
f7c5c48df6 Update tools/jmx submodule
* tools/jmx be8f1ac...d5d1efd (1):
  > dist/debian: Remove conflict tag for Java 11
2020-08-25 15:46:51 +03:00
Rafael Ávila de Espíndola
8204801b7f build: Add a --enable-seastar-debug-allocations
This enables the corresponding seastar option.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200824153137.19683-1-espindola@scylladb.com>
2020-08-25 14:54:10 +03:00
Asias He
6cadf4e4fa gossip: Apply state for local node in shadow round
We saw errors in killed_wiped_node_cannot_join_test dtest:

  Aug 2020 10:30:43 [node4] Missing: ['A node with address 127.0.76.4 already exists, cancelling join']:

The test does:
  n1, n2, n3, n4
  wipe data on n4
  start n4 again with the same ip address

Without this patch, n4 will bootstrap into the cluster with new tokens. We
should prevent n4 from bootstrapping because there is an existing
node in the cluster.

In the shadow round, the local node should apply the application state of
the node with the same IP address. This is useful to detect a node
trying to bootstrap with the same IP address as an existing node.

Tests: bootstrap_test.py
Fixes: #7073
2020-08-25 12:53:59 +03:00
Calle Wilund
5ed3d6892d cdc: Remove stored (postimage) data when doing row delete
Fixes #6900

Clustered range deletes did not clear out the "row_states" data associated
with affected rows (might be many).

Adds a sweep through and erases relevant data. Since we do pre- and
postimage in "order", this should only affect postimage.
2020-08-25 12:27:18 +03:00
Pekka Enberg
3a78593481 configure.py: Fix test repeat and timeout options
Fix the default number of test repeats to 1, which it was before
(spotted by Nadav). Also, prefix the options so that they become
"--test-repeat" and "--test-timeout" (spotted by Avi).

Message-Id: <20200825081456.197210-1-penberg@scylladb.com>
2020-08-25 11:26:46 +03:00
Dejan Mircevski
cbf8186a12 cql3/expr: Drop make_column_op()
Instantiating binary_operator directly is more readable.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-08-25 11:10:36 +03:00
Asias He
e86881be99 repair: Print repair reason in repair stats log
It is useful to distinguish if the repair is a regular repair or used
for node operations.

In addition, log the keyspace and tables that are repaired.

Fixes #7086
2020-08-25 11:05:47 +03:00
Piotr Jastrzebski
f01ce1458f cdc: Preserve metadata columns when getting only keys for delta
Fixes #7095

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-08-25 10:41:54 +03:00
Yaron Kaikov
02abcade27 ninja test: add --repeat and --timeout parameters
Adding missing parameters to ninja test:
--repeat: number of times to repeat each test, default = 3
--timeout: set total timeout in sec for each test
2020-08-25 10:41:54 +03:00
Raphael S. Carvalho
1c29f0a43d cql3/statements: verify that counter column cannot be added into non-counter table
A check, to validate that counter column cannot be added into non-counter table,
is missing for alter table statement. Validation is performed when building new
schema, but it's limited to checking that a schema will not contain both counter
and non-counter columns.

Due to lack of validation, the added counter column could be incorrectly
persisted to the schema, but this results in a crash when setting the new
schema to its table. On restart, it can be confirmed that the schema change
was indeed persisted when describing the table.
This problem is fixed by doing proper validation for the alter table statement,
which consists of making sure a new counter column cannot be added to a
non-counter table.

The test cdc_disallow_cdc_for_counters_test is adjusted because one of its tests
was built on the assumption that counter column can be added into a non-counter
table.

Fixes #7065.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200824155709.34743-1-raphaelsc@scylladb.com>
2020-08-25 10:41:54 +03:00
Avi Kivity
5ec2eae247 build: add python3-redis to dependencies
Needed for redis tests.
2020-08-24 20:03:50 +03:00
Avi Kivity
316d7f74ab Merge 'Do verify package for offline' from Takuya
To verify an offline-installed image with do_verify_package() in scylla_setup, we introduce an is_offline() function, which works just like is_nonroot() but covers installs done without --nonroot.

* syuu1228-do_verify_package_for_offline:
  scylla_setup: verify package correctly on offline install
  scylla_util.py: implement is_offline() to detect offline installed image
2020-08-24 15:49:55 +03:00
Takuya ASADA
cb221ac393 scylla_setup: verify package correctly on offline install
do_verify_package was written only for .rpm/.deb and does not work correctly
for offline installs (including nonroot).
We should check for file existence in that environment, not for package
existence using rpm/dpkg.

Fixes #7075
2020-08-24 20:10:36 +09:00
Takuya ASADA
c71e5f244a scylla_util.py: implement is_offline() to detect offline installed image
Like is_nonroot(), detect offline installed image using install.sh.
2020-08-24 20:10:36 +09:00
Nadav Har'El
9636a33993 redis: fix use-after-free crash in "exists" command
A missing "&" caused the key stored in a long-living command to be copied
and the copy quickly freed - and then used after being freed.
This caused the test test_strings.py::test_exists_multiple_existent_key for
this feature to frequently crash.

Fixes #6469

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200823190141.88816-1-nyh@scylladb.com>
2020-08-24 11:41:43 +03:00
Asias He
fefa35987b storage_service: Avoid updating tokens in system.peers for nodes to be removed
Consider:

1) Start n1,n2,n3
2) Stop n3
3) Start n4 to replace n3 but list n4 as seed node
4) Node n4 finishes replacing operation
5) Restart n2
6) Run SELECT * from system.peers on node or node 1.
   cqlsh> SELECT * from system.peers ;
   peer| data_center | host_id| preferred_ip | rack  | release_version | rpc_address | schema_version| supported_features| tokens
   127.0.0.3 |null |null | null |  null | null |null |null |null |   {'-90410082611643223', '5874059110445936121'}

The replaced old node 127.0.0.3 shows in system.peers.

(Note, since commit 399d79fc6f (init: do not allow
replace-address for seeds), step 3 will be rejected. Assume we use a version without it)

The problem is that n2 sees n3 in gossip status SHUTDOWN after
restart. The storage_service::handle_state_normal callback is called for
127.0.0.3. Since n4 is using different tokens than n3 (a seed node does not
bootstrap, so it uses new tokens instead of the tokens of n3, which is being
replaced), owned_tokens will be set. We see logs like:

    [shard 0] storage_service - handle_state_normal: New node 127.0.0.3 at token 5874059110445936121

    [shard 0] storage_service - Host ID collision for cbec60e5-4060-428e-8d40-9db154572df7 between 127.0.0.4
    and 127.0.0.3; ignored 127.0.0.3

As a result, db::system_keyspace::update_tokens
will be called to write to system.peers for 127.0.0.3 wrongly.

    if (!owned_tokens.empty()) {
        db::system_keyspace::update_tokens(endpoint, owned_tokens);
    }

To fix, we should skip calling db::system_keyspace::update_tokens if the
node is present in endpoints_to_remove.
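
The extra condition can be sketched as a small predicate. This is an illustration under simplified assumptions: endpoints and tokens are plain strings here, and `should_update_tokens` is a hypothetical helper, not a function in storage_service.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper with simplified types (strings instead of
// gms::inet_address and dht::token). Returns true only when there are
// tokens to record and the endpoint is not scheduled for removal.
inline bool should_update_tokens(const std::string& endpoint,
                                 const std::set<std::string>& owned_tokens,
                                 const std::set<std::string>& endpoints_to_remove) {
    return !owned_tokens.empty() && endpoints_to_remove.count(endpoint) == 0;
}
```

In the scenario above, 127.0.0.3 is in the removal set, so the predicate is false for it and nothing is written to system.peers for that address.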

Refs: #4652
Refs: #6397
2020-08-24 10:06:37 +02:00
Takuya ASADA
fe8679a6ee test/redis: make redis tests runnable from test.py
Just like test/alternator, make redis-test runnable from test.py.
For this we move the redis tests into a subdirectory of test/,
and create a script to run them: test/redis/run.

These tests currently fail, so we did not yet modify test.py to actually
run them automatically.

Fixes #6331
2020-08-23 20:31:45 +03:00
Avi Kivity
907b775523 Merge "Free compaction from storage service" from Pavel E
"
There's one last call to the global storage service left in the compaction
code; it comes from cleanup_compaction to get local token ranges for filtering.

The call in question is a pure wrapper over database, so this set just
makes use of the database where it's already available (perform_cleanup)
and adds it where it's needed (perform_sstable_upgrade).

tests: unit(dev), nodetool upgradesstables
"

* 'br-remove-ss-from-compaction-3' of https://github.com/xemul/scylla:
  storage_service: Remove get_local_ranges helper
  compaction: Use database from options to get local ranges
  compaction: Keep database reference on upgrade options
  compaction: Keep database reference on cleanup options
  db: Factor out get_local_ranges helper
2020-08-23 17:58:32 +03:00
Piotr Dulikowski
b111fa98ca hinted handoff: use default timeout for sending orphaned hints
This patch causes orphaned hints (hints that were written towards a node
that is no longer their replica) to be sent with a default write
timeout. This is what is currently done for non-orphaned hints.

Previously, the timeout was hardcoded to one hour. This could cause a
long delay while shutting down, as the hints manager waits until all ongoing
hint sending operations finish before stopping itself.

Fixes: #7051
2020-08-23 11:50:27 +03:00
Botond Dénes
0a8cc4c2b5 db/size_estimates_virtual_reader: remove redundant _schema member
This reader was probably created in ancient times, when readers didn't
yet have a _schema member of their own. But now that they do, it is not
necessary to store the schema in the reader implementation, there is one
available in the parent class.

While at it also move the schema into the class when calling the
constructor.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200821070358.33937-1-bdenes@scylladb.com>
2020-08-22 20:47:49 +03:00
Botond Dénes
4944e050e3 mutation_reader: make_combined_reader(): return empty reader when combining 0 readers
Avoid creating all the combining machinery when we know there is no data
to be had.
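
As a rough illustration of the shortcut, here is a toy sketch. The reader interface below is hypothetical and much simpler than Scylla's mutation readers; only the idea of returning an empty reader for zero inputs is taken from the message above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Toy reader interface (hypothetical, not Scylla's reader API).
struct reader {
    virtual ~reader() = default;
    virtual std::optional<int> next() = 0;
};

// Produces nothing; needs no merging state at all.
struct empty_reader final : reader {
    std::optional<int> next() override { return std::nullopt; }
};

// Yields the elements of a sorted vector one by one.
struct vector_reader final : reader {
    std::vector<int> items;
    std::size_t pos = 0;
    explicit vector_reader(std::vector<int> v) : items(std::move(v)) {}
    std::optional<int> next() override {
        if (pos == items.size()) return std::nullopt;
        return items[pos++];
    }
};

// Merges several sorted readers by always emitting the smallest head.
struct merging_reader final : reader {
    std::vector<std::unique_ptr<reader>> inputs;
    std::vector<std::optional<int>> heads;
    explicit merging_reader(std::vector<std::unique_ptr<reader>> in)
        : inputs(std::move(in)) {
        for (auto& r : inputs) heads.push_back(r->next());
    }
    std::optional<int> next() override {
        std::size_t best = heads.size();  // sentinel: no candidate yet
        for (std::size_t i = 0; i < heads.size(); ++i) {
            if (heads[i] && (best == heads.size() || *heads[i] < *heads[best])) {
                best = i;
            }
        }
        if (best == heads.size()) return std::nullopt;
        std::optional<int> value = heads[best];
        heads[best] = inputs[best]->next();
        return value;
    }
};

// With zero inputs, skip the merging machinery entirely.
std::unique_ptr<reader> make_combined(std::vector<std::unique_ptr<reader>> in) {
    if (in.empty()) return std::make_unique<empty_reader>();
    return std::make_unique<merging_reader>(std::move(in));
}
```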

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200821045602.13096-1-bdenes@scylladb.com>
2020-08-22 20:47:49 +03:00
Avi Kivity
0dcb16c061 Merge "Constify access to token_metadata" from Benny
"
We keep references to locator::token_metadata in many places.
Most of them are for read-only access and only a few want
to modify the token_metadata.

Recently, in 94995acedb,
we added yielding loops that access token_metadata in order
to avoid cpu stalls.  To make that possible we need to make
sure the token_metadata object they are traversing won't change
mid-loop.

This series is a first step in serializing updates to the shared
token metadata with respect to reads of it.

Test: unit(dev)
Dtest: bootstrap_test:TestBootstrap.start_stop_test{,_node}, update_cluster_layout_tests.py -a next-gating(dev)
"

* tag 'constify-token-metadata-access-v2' of github.com:bhalevy/scylla:
  api/http_context: keep a const sharded<locator::token_metadata>&
  gossiper: keep a const token_metadata&
  storage_service: separate get_mutable_token_metadata
  range_streamer: keep a const token_metadata&
  storage_proxy: delete unused get_restricted_ranges declaration
  storage_proxy: keep a const token_metadata&
  storage_proxy: get rid of mutable get_token_metadata getter
  database: keep const token_metadata&
  database: keyspace_metadata: pass const locator::token_metadata& around
  everywhere_replication_strategy: move methods out of line
  replication_strategy: keep a const token_metadata&
  abstract_replication_strategy: get_ranges: accept const token_metadata&
  token_metadata: rename calculate_pending_ranges to update_pending_ranges
  token_metadata: mark const methods
  token_ranges: pending_endpoints_for: return empty vector if keyspace not found
  token_ranges: get_pending_ranges: return empty vector if keyspace not found
  token_ranges: get rid of unused get_pending_ranges variant
  replication_strategy: calculate_natural_endpoints: make token_metadata& param const
  token_metadata: add get_datacenter_racks() const variant
2020-08-22 20:47:45 +03:00
Pavel Emelyanov
b3274c83e1 storage_service: Remove get_local_ranges helper
It's no longer in real use.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-21 14:58:40 +03:00
Pavel Emelyanov
171822cff8 compaction: Use database from options to get local ranges
The cleanup compaction wants to keep local tokens on-board and gets
them from storage_service.get_local_ranges().

This method is a wrapper around database.get_keyspace_local_ranges()
created in the previous patch. The live database reference is already
available on the descriptor's options, so we can short-cut the call.

This allows removing the last explicit call for global storage_service
instance from compaction code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-21 14:58:40 +03:00
Pavel Emelyanov
8333fed8aa compaction: Keep database reference on upgrade options
The only place that creates them is the API upgrade_sstables call.

The created options object doesn't outlive the returned
future, so it's safe to keep this reference there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-21 14:58:40 +03:00
Pavel Emelyanov
a6e6856e1f compaction: Keep database reference on cleanup options
The database is available at both places that create the options --
tests and API perform_cleanup call.

The options object doesn't outlive the returned future, so it's
safe to keep the reference on it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-21 14:58:40 +03:00
Pavel Emelyanov
06f4828b93 db: Factor out get_local_ranges helper
Storage service and repair code have identical helpers to get local
ranges for keyspace. Move this helper's code onto database, later it
will be reused by one more place.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-21 14:58:40 +03:00
Benny Halevy
436babdb3d api/http_context: keep a const sharded<locator::token_metadata>&
It has no need of changing token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
573142d4c4 gossiper: keep a const token_metadata&
gossiper has no need to change token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
2f7c529c1c storage_service: separate get_mutable_token_metadata
Use a different getter for a token_metadata& that
may be changed so we can better synchronize readers
and writers of token_metadata and eventually allow
them to yield in asynchronous loops.
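
The getter split can be pictured with a small sketch. The class and member names below (token_metadata_sketch, storage_service_sketch) are hypothetical; only the idea of a const getter next to an explicitly named mutable one comes from the message above.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for locator::token_metadata.
struct token_metadata_sketch {
    int ring_version = 0;
    void invalidate() { ++ring_version; }       // mutating operation
    int version() const { return ring_version; } // read-only operation
};

// Hypothetical owner exposing read-only and mutable access separately.
class storage_service_sketch {
    token_metadata_sketch _token_metadata;
public:
    // Readers get a const reference and cannot modify the object.
    const token_metadata_sketch& get_token_metadata() const {
        return _token_metadata;
    }
    // Writers must ask for mutable access by name, which makes every
    // modification site easy to find and, later, to synchronize.
    token_metadata_sketch& get_mutable_token_metadata() {
        return _token_metadata;
    }
};
```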

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
569f2830c1 range_streamer: keep a const token_metadata&
range_streamer doesn't need to modify token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
2c61383215 storage_proxy: delete unused get_restricted_ranges declaration
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
c8390da5f9 storage_proxy: keep a const token_metadata&
storage_proxy doesn't need to change token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
dfa5f8ff1e storage_proxy: get rid of mutable get_token_metadata getter
We'd like to strictly control who can modify token metadata
and nobody currently needs a mutable reference to storage_proxy::_token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
dd6d771331 database: keep const token_metadata&
No need to modify token_metadata from database code.
Also, get rid of mutable get_token_metadata variant.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
8b5c32c7a8 database: keyspace_metadata: pass const locator::token_metadata& around
No need to modify token_metadata on this path.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
e4e4b269c7 everywhere_replication_strategy: move methods out of line
Move methods depending on token_metadata to the source file
so we can avoid including token_metadata.hh in header files
where possible.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
4dba81cb92 replication_strategy: keep a const token_metadata&
replication strategies don't need to change token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
2fd59e8bba abstract_replication_strategy: get_ranges: accept const token_metadata&
Now that calculate_natural_endpoints can be passed a const token_metadata&

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
8b63523fb7 token_metadata: rename calculate_pending_ranges to update_pending_ranges
Since it sets the token_metadata_impl's pending ranges.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:34 +03:00
Benny Halevy
22275e579e token_metadata: mark const methods
Many token_metadata methods do not modify the object
and can be marked as const.

The motivation is to better control who may modify
token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:20:21 +03:00
Botond Dénes
5e9a7d2608 row_cache: remove unnecessary includes of partition_snapshot_reader.hh
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200820124447.2561477-1-bdenes@scylladb.com>
2020-08-20 15:19:42 +02:00
Tomasz Grabiec
c44455d514 Merge "Miscellaneous schema code cleanups" from Rafael 2020-08-20 15:19:42 +02:00
Rafael Ávila de Espíndola
33669bd21d commitlog: Use try_with_gate
Now that we have try_with_gate we can use it instead of futurize_invoke
and with_gate.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200819191334.74108-1-espindola@scylladb.com>
2020-08-20 15:19:42 +02:00
Benny Halevy
65d89512d0 token_ranges: pending_endpoints_for: return empty vector if keyspace not found
Rather than creating a bogus empty entry.

With that, it can be marked as const.
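
A minimal sketch of why this makes the method const-friendly, assuming the ranges are kept in a standard map keyed by keyspace name (the class pending_ranges_sketch and its members are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical container of pending ranges per keyspace.
class pending_ranges_sketch {
    std::map<std::string, std::vector<int>> _by_keyspace;
public:
    void add(const std::string& keyspace, int range) {
        _by_keyspace[keyspace].push_back(range);
    }
    // operator[] would insert an empty entry for an unknown keyspace and
    // is not usable on a const map; find() has neither problem.
    std::vector<int> get(const std::string& keyspace) const {
        auto it = _by_keyspace.find(keyspace);
        if (it == _by_keyspace.end()) {
            return {};
        }
        return it->second;
    }
    std::size_t keyspace_count() const { return _by_keyspace.size(); }
};
```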

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:16:14 +03:00
Benny Halevy
ca61c2797a token_ranges: get_pending_ranges: return empty vector if keyspace not found
Rather than creating a bogus empty entry.

With that, it can be marked as const.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 16:14:44 +03:00
Tomasz Grabiec
cf12b5e537 db: view: Refactor view_info::initialize_base_dependent_fields()
It is no longer called once for a given view_info, so the name
"initialize" is not appropriate.

This patch splits the "initialize" method into the "make" part, which
makes a new base_info object, and the "set" part, which changes the
current base_info object attached to the view.
2020-08-20 14:53:07 +02:00
Tomasz Grabiec
617ccc5408 tests: mv: Test dropping columns from base table
Reproduces #7061.
2020-08-20 14:53:07 +02:00
Tomasz Grabiec
f8df214836 db: view: Fix incorrect schema access during view building after base table schema changes
The view building process was accessing mutation fragments using
the current table's schema. This is not correct; fragments must be
accessed using the schema of the generating reader.

This could lead to undefined behavior when the column set of the base
table changes. out_of_range exceptions could be observed, or data in
the view ending up in the wrong column.

Refs #7061.

The fix has two parts. First, we always use the reader's schema to
access fragments generated by the reader.

Second, when calling populate_views() we upgrade the fragment-wrapping
reader's schema to the base table schema so that it matches the base
table schema of view_and_base snapshots passed to populate_views().
2020-08-20 14:53:07 +02:00
Tomasz Grabiec
d64d60f576 schema: Call on_internal_error() when out of range id is passed to column_at()
Improves debuggability because a backtrace is attached.

Before, a plain std::out_of_range exception was thrown.
2020-08-20 14:53:07 +02:00
Tomasz Grabiec
3a6ec9933c db: views: Fix undefined behavior on base table schema changes
The view_info object, which is attached to the schema object of the
view, contains a data structure called
"base_non_pk_columns_in_view_pk". This data structure contains column
ids of the base table so is valid only for a particular version of the
base table schema. This data structure is used by materialized view
code to interpret mutations of the base table, those coming from base
table writes, or reads of the base table done as part of view updates
or view building.

The base table schema version of that data structure must match the
schema version of the mutation fragments, otherwise we hit undefined
behavior. This may include aborts, exceptions, segfaults, or data
corruption (e.g. writes landing in the wrong column in the view).

Before this patch, we could get schema version mismatch here after the
base table was altered. That's because the view schema does not change
when the base table is altered.

Part of the fix is to extract base_non_pk_columns_in_view_pk into a
third entity called base_dependent_view_info, which changes both on
base table schema changes and view schema changes.

It is managed by a shared pointer so that we can take immutable
snapshots of it, just like with schema_ptr. When starting the view
update, the base table schema_ptr and the corresponding
base_dependent_view_info have to match. So we must obtain them
atomically, and base_dependent_view_info cannot change during update.

Also, whenever the base table schema changes, we must update
base_dependent_view_infos of all attached views (atomically) so that
it matches the base table schema.
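
The snapshot idea can be sketched as below. This is an illustrative model only: all type names carry a _sketch suffix because they are hypothetical, and "atomically" is modelled simply as replacing both pointers within one call with no interleaving in between.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for the base table schema and for the data
// that depends on a particular base schema version.
struct base_schema_sketch { int version; };
struct base_dependent_info_sketch { int base_version; };

// A matching pair, taken together as an immutable snapshot.
struct view_and_base_sketch {
    std::shared_ptr<const base_schema_sketch> base;
    std::shared_ptr<const base_dependent_info_sketch> info;
};

class view_state_sketch {
    std::shared_ptr<const base_schema_sketch> _base;
    std::shared_ptr<const base_dependent_info_sketch> _info;
public:
    explicit view_state_sketch(int version) { on_base_schema_change(version); }

    // Both pointers are replaced in the same step, so they always match.
    void on_base_schema_change(int new_version) {
        _base = std::make_shared<base_schema_sketch>(base_schema_sketch{new_version});
        _info = std::make_shared<base_dependent_info_sketch>(
            base_dependent_info_sketch{new_version});
    }

    // An update in progress keeps its own consistent pair alive, even if
    // the base schema changes afterwards.
    view_and_base_sketch snapshot() const { return {_base, _info}; }
};
```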

Refs #7061.
2020-08-20 14:53:07 +02:00
Tomasz Grabiec
dc18117b82 db: views: Introduce has_base_non_pk_columns_in_view_pk()
In preparation for pushing _base_non_pk_columns_in_view_pk deeper.
2020-08-20 14:53:07 +02:00
Pekka Enberg
10b2c23e19 configure.py: Fix build, check, and test targets when build mode is defined
When the user defines a build mode with configure.py, the build, check, and
test targets fail as follows:

  ./configure.py --mode=dev && ninja build

  ninja: error: 'debug-build', needed by 'build', missing and no known rule to make it

Fix the issue by making the targets depend on build targets for
specified build modes, not all available modes.
Message-Id: <20200813105639.1641090-1-penberg@scylladb.com>
2020-08-20 15:08:06 +03:00
Benny Halevy
23a0625998 token_ranges: get rid of unused get_pending_ranges variant
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 14:46:53 +03:00
Benny Halevy
b4f76cbb8a replication_strategy: calculate_natural_endpoints: make token_metadata& param const
No replication strategy needs to change token_metadata
when calculating natural endpoints.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 14:38:45 +03:00
Benny Halevy
78f40cac8d token_metadata: add get_datacenter_racks() const variant
Needed for passing a const token_metadata& to
calculate_natural_endpoints methods.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-20 14:38:45 +03:00
Nadav Har'El
a1453303f8 alternator test: test for OLD_IMAGE of an empty item
We already have a test, test_streams.py::test_streams_updateitem_old_image,
for issue #6935: It tests that the OLD_IMAGE in Alternator Streams should
contain the item's key.

However, this test was missing one corner case, which the first solution
for this issue handled incorrectly. So in this patch we add a test for this
corner case, test_streams_updateitem_old_image_empty_item:

This corner case is about the item existing, but *empty*, i.e., having just
the key but no other attribute. In this case, OLD_IMAGE should return that
empty item - including its key. Not nothing.

As usual, this test passes on DynamoDB and xfails on Alternator, and the
"xfail" mark will be removed when issue #6935 is fixed.

Refs #6935.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200819155229.34475-1-nyh@scylladb.com>
2020-08-20 13:22:40 +02:00
Piotr Jastrzebski
49fd17a4ef cql3: Improve error messages for markers binding
When binding a prepared statement it is possible that the values being bound
are not correct. Unfortunately before this patch, the error message
was only saying what type got a wrong value. This was not very helpful
because there could be multiple columns with the same type in the table.
We also support collections so sometimes error was saying that there
is a wrong value for a type T but the affected column was actually of
type collection<T>.

This patch adds information about a column name that got the wrong
value so that it's easier to find and fix the problem.

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <90b70a7e5144d7cb3e53876271187e43fd8eee25.1597832048.git.piotr@scylladb.com>
2020-08-20 12:51:36 +03:00
Avi Kivity
392e24d199 Merge "Unglobal messaging service" from Pavel E
"
The messaging service is (like many other services) present in
the global namespace and is widely accessed from where needed
with global get(_local)?_messaging_service() calls. There's a
long-term task to get rid of this globality and make services
and components reference each other and, because of this,
start and stop in a specific order. This set does this for the
messaging service.

The service is very low level and doesn't depend on anything.
It's used by gossiper, streaming, repair, migration manager,
storage proxy, storage service and API. According to these
dependencies the set consists of several parts:

patches 1-9 are preparatory, they encapsulate messaging service
init/fini stuff in its own module and decouple it from the
db::config

patches 10-12 introduce local service reference in main and set
its init/fini calls at the early stage so that this reference
can later be passed to those depending on it

patches 13-42 replace global referencing of messaging service
from other subsystems with local references initialized from
main.

patch 43 finalizes tests.

patch 44 wraps things up with removing the global messaging service
instance along with get(_local)?_messaging_service calls.

The service's stopping part is deliberately left incomplete (as
it is now), the sharded service remains alive, only the instance's
stop() method is called (and is empty for a while). Since the
messaging service's users still do not stop cleanly, its instances
should better continue leaking on exit.

Once (if) the seastar gets the helper rpc::has_handlers() method
merged the messaging_service::stop() will be able to check if all
the verbs had been unregistered (spoiler: not yet, more fixes to
come).

For debugging purposes a pointer to the now-local messaging service
instance is kept in service::debug namespace.

tests: unit(dev)
       dtest(dev: simple_boot_shutdown, repair, update_cluster_layout)
       manual start-stop
"

* 'br-unglobal-messaging-service-2' of https://github.com/xemul/scylla: (44 commits)
  messaging_service: Unglobal messaging service instance
  tests: Use own instances of messaging_service
  storage_service: Use local messaging reference
  storage_service: Keep reference on sharded messaging service
  migration_manager: Add messaging service as argument to get_schema_definition
  migration_manager: Use local messaging reference in simple cases
  migration_manager: Keep reference on messaging
  migration_manager: Make push_schema_mutation private non-static method
  migration_manager: Move get_schema_version verb handling from proxy
  repair: Stop using global messaging_service references
  repair: Keep sharded messaging service reference on repair_meta
  repair: Keep sharded messaging service reference on repair_info
  repair: Keep reference on messaging in row-level code
  repair: Keep sharded messaging service in API
  repair: Unset API endpoints on stop
  repair: Setup API endpoints in separate helper
  repair: Push the sharded<messaging_service> reference down to sync_data_using_repair
  repair: Use existing sharded db reference
  repair: Mark repair.cc local functions as static
  streaming: Keep messaging service on send_info
  ...
2020-08-20 12:20:36 +03:00
Rafael Ávila de Espíndola
f0e4e5b85a schema: Make some functions static
This just makes it easier to see that they are file local helpers.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-19 14:05:31 -07:00
Rafael Ávila de Espíndola
6363716799 schema: Pass an rvalue to set_compaction_strategy_options
This produces less code and makes sure every caller moves the value.
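
A small sketch of the pattern, assuming the options are a string-to-string map (the class schema_sketch and the alias options_map are hypothetical; only the setter name comes from the commit title):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <type_traits>
#include <utility>

using options_map = std::map<std::string, std::string>;

// Hypothetical stand-in for the schema builder.
class schema_sketch {
    options_map _compaction_strategy_options;
public:
    // Taking an rvalue reference means a caller cannot copy by accident:
    // it has to pass a temporary or write std::move() explicitly.
    void set_compaction_strategy_options(options_map&& options) {
        _compaction_strategy_options = std::move(options);
    }
    const options_map& compaction_strategy_options() const {
        return _compaction_strategy_options;
    }
};

// Passing an lvalue does not compile; this checks that at build time.
static_assert(!std::is_invocable_v<
    decltype(&schema_sketch::set_compaction_strategy_options),
    schema_sketch&, options_map&>);
```

A caller holding a named map would write s.set_compaction_strategy_options(std::move(opts)), which makes the ownership transfer visible at the call site.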

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-19 14:02:35 -07:00
Rafael Ávila de Espíndola
527c1ab546 schema: Move set_compaction_strategy_options out of line
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-19 14:02:13 -07:00
Pavel Emelyanov
623f61e63e messaging_service: Unglobal messaging service instance
Remove the global messaging_service, keep it on the main stack.
But also store a pointer on it in debug namespace for debugging.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
ee41645a1a tests: Use own instances of messaging_service
The global one is going away, no core code uses it, so all tests
can be safely switched to use their own instances.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
a6f8f450ba storage_service: Use local messaging reference
All the places that are (or became such with previous patches) using
the global messaging service are storage service methods, so they
can access the local reference to the messaging service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
4ea3c2797c storage_service: Keep reference on sharded messaging service
It is a bit of a step backward in the storage-service decomposition campaign, but...

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
24eaf827c0 migration_manager: Add messaging service as argument to get_schema_definition
There are 4 places that call this helper:

- storage proxy. Callers are rpc verb handlers and already have the proxy
  at hands from which they can get the messaging service instance
- repair. There's local-global messaging instance at hands, and the caller
  is in verb handler too
- streaming. The caller is verb handler, which is unregistered on stop, so
  the messaging service instance can be captured
- migration manager itself. The caller already uses "this", so the messaging
  service instance can be obtained from it

The better approach would be to make get_schema_definition be the method of
migration_manager, but the manager is stopped for real on shutdown, thus
referencing it from the callers might not be safe and needs revisiting. At
the same time the messaging service is always alive, so using its reference
is safe.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
2a4c0fa280 migration_manager: Use local messaging reference in simple cases
Most of those places are non-static migration_manager methods.
Plus one place where the local service instance is already at hand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
6c49127d04 migration_manager: Keep reference on messaging
That's another user of messaging service, init it with private reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
abb1dd608f migration_manager: Make push_schema_mutation private non-static method
The local migration manager instance is already available at caller, so
we can call a method on it. This is to facilitate next patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
56aa514cd9 migration_manager: Move get_schema_version verb handling from proxy
The user of this verb is migration manager, so the handler must be it as well.

The handler code now explicitly gets global proxy. This call is safe, as proxy
is not stopped nowadays. In the future we'll need to revisit the relation
between migration - proxy - stats anyway.

The use of local migration manager is safe, as it happens in verb handler which
is unregistered and is waited to be completed on migration manager stop.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
704880d564 repair: Stop using global messaging_service references
Now all the users of messaging service have the needed reference.

Again, the messaging service is not really stopped at the end, so its usage
is safe regardless of whether repair stuff itself leaks on stop or not.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
d7e90dbfa9 repair: Keep sharded messaging service reference on repair_meta
The reference comes from repair_info and storage_service calls, both
had been already patched for that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
285648620b repair: Keep sharded messaging service reference on repair_info
This reference comes from the API that already has it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
74494bac87 repair: Keep reference on messaging in row-level code
The row-level repair keeps its statics for needed services, same as the
streaming does. Treat the messaging service the same way to stop using
the global one in the next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
8b4820b520 repair: Keep sharded messaging service in API
The reference will be needed in repair_start, so prepare one in advance

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
126dac8ad1 repair: Unset API endpoints on stop
This unset is the roll-back of the corresponding _set-s. The messaging
service will be (already is, but implicitly) used in repair API
callbacks, so make sure they are unset before the messaging service
is stopped.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:53 +03:00
Pavel Emelyanov
fe2c479c04 repair: Setup API endpoints in separate helper
There will be the unset part soon, this is the preparation. No functional
changes in api/storage_service.cc, just move the code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
45c31eadb3 repair: Push the sharded<messaging_service> reference down to sync_data_using_repair
This function needs the messaging service inside, but the closest place where it
can get one from is the storage_service API handlers. Temporarily move the call for
global messaging service into storage service, its turn for this cleanup will
come later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
6b0f4d5c8d repair: Use existing sharded db reference
The db.invoke_on_all's lambda tries to get the sharded db reference via
the global storage service. This can be done in a much nicer way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
3d2e3203f7 repair: Mark repair.cc local functions as static
Just a cleanup to facilitate code reading.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
d2c475f27c streaming: Keep messaging service on send_info
And use it in send_mutation_fragments.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
a6888e3ce3 streaming: Keep reference on messaging
Streaming uses messaging, init it with its own reference.

Nowadays the whole streaming subsystem uses global static references on the
needed services.  This is not nice, but still better than just using code-wide
globals, so treat the messaging service here the same way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
163d615dc3 streaming: Use local ms() on ::start
This is just a cleanup to avoid explicit global call.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
528e4455b9 storage_proxy: Use _proxy in paxos_response_handler methods
The proxy pointer is non-null (and is already used in these methods),
so it should be safe.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
d397d7e734 storage_proxy: Pass proxy into forward_fn lambda of handle_write
It is alive there, so it is safe to pass one to lambda.
Once in forward_fn, it can be used to get messaging from.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
e5c10ee3e0 storage_proxy: Use reference on messaging in simple cases
Most of the places that need messaging service in proxy already use
storage_proxy instance, so it is safe to get the local messaging
from it too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
24cb1b781f storage_proxy: Keep reference on messaging
The proxy is another user of messaging, so keep the reference on it. Its
real usage will come in next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
4ea63b2211 gossiper: Share the messaging service with snitch
And make snitch use gossiper's messaging, not the global one

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Pavel Emelyanov
65bd54604d gossiper: Use messaging service by reference
Gossiper needs messaging service, the messaging is started before the
gossiper, so we can push the former reference into it.

Gossiper is not stopped for real, and neither is the messaging service, so
the memory usage is still safe.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 20:50:52 +03:00
Botond Dénes
6ad80f0adb test/lib/cql_test_env: set debug::db pointer
To allow using scylla-gdb.py scripts for debugging tests. These scripts
expect a valid database pointer in `debug::db`.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200819145632.2423462-1-bdenes@scylladb.com>
2020-08-19 19:13:05 +03:00
Raphael S. Carvalho
a0e0195a77 sstables: Avoid excessive reallocations when creating sharding metadata
Let's reserve space for sharding metadata in advance, to avoid excessive
allocations in create_sharding_metadata().
With the default ignore_msb_bits=12, it was observed that the # of
reallocations is frequently 11-12. With ignore_msb_bits=16, the number
can easily go up to 50.
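
The effect of reserving up front can be demonstrated with a plain std::vector. This is a generic sketch, not the sstables code: count_reallocations is a hypothetical helper that counts how often the capacity changes while appending n elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts capacity changes (i.e. reallocations) while appending n elements,
// optionally reserving the full size in advance.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) {
        v.reserve(n);
    }
    std::size_t reallocations = 0;
    std::size_t capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != capacity) {
            ++reallocations;
            capacity = v.capacity();
        }
    }
    return reallocations;
}
```

With geometric growth the number of reallocations without reserve() grows with the logarithm of the final size, which is in line with the counts observed above rising as the number of entries increases.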

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200814210250.39361-1-raphaelsc@scylladb.com>
2020-08-19 17:58:29 +03:00
Nadav Har'El
2e1499ee93 merge: cdc: introduce a ,,change visitor'' and ,,inspect mutation'' abstractions
Merged pull request https://github.com/scylladb/scylla/pull/6978
by Kamil Braun:

These abstractions are used for walking over mutations created by a write
coordinator, deconstructing them into ,,atomic'' pieces (,,changes''),
and consuming these pieces.

Read the big comment in cdc/change_visitor.hh for more details.

4 big functions were rewritten to use the new abstractions.

tests:

    unit (dev build)
    all dtests from cdc_tests.py, except cdc_tests.py:TestCdc.cluster_reduction_with_cdc_test, have passed (on a dev build). The test that fails also fails on master.

Part of #5945.

  cdc: rewrite process_changes using inspect_mutation
  cdc: move some functions out of `cdc::transformer`
  cdc: rewrite extract_changes using inspect_mutation
  cdc: rewrite should_split using inspect_mutation
  cdc: rewrite find_timestamp using inspect_mutation
  cdc: introduce a ,,change visitor'' abstraction
2020-08-19 17:19:01 +03:00
Avi Kivity
6f986df458 Merge "Fix TWCS compaction aggressiveness due to data segregation" from Raphael
"
After the data segregation feature was introduced, anything that causes
out-of-order writes, like read repair, can result in small updates to past time windows.
This causes compaction to be very aggressive because whenever a past time
window is updated like that, that time window is recompacted into a
single SSTable.
Users expect that once a window is closed, it will no longer be written
to, but that has changed since the introduction of the data segregation
feature. We didn't anticipate the write amplification issues that the
feature would cause. To fix this problem, let's perform size-tiered
compaction on the windows that are no longer active and were updated
because data was segregated. The current behavior where the last active
window is merged into one file is kept. But thereafter, that same
window will only be compacted using STCS.

Fixes #6928.
"

* 'fix_twcs_agressiveness_after_data_segregation_v2' of github.com:raphaelsc/scylla:
  compaction/twcs: improve further debug messages
  compaction/twcs: Improve debug log which shows all windows
  test: Check that TWCS properly performs size-tiered compaction on past windows
  compaction/twcs: Make task estimation take into account the size-tiered behavior
  compaction/stcs: Export static function that estimates pending tasks
  compaction/stcs: Make get_buckets() static
  compact/twcs: Perform size-tiered compaction on past time windows
  compaction/twcs: Make strategy easier to extend by removing duplicated knowledge
  compaction/twcs: Make newest_bucket() non-static
  compaction/twcs: Move TWCS implementation into source file
2020-08-19 17:19:01 +03:00
Avi Kivity
f6b66456fd Update seastar submodule
Contains patch from Rafael to fix up includes.

* seastar c872c3408c...7f7cf0f232 (9):
  > future: Consider result_unavailable invalid in future_state_base::ignore()
  > future: Consider result_unavailable invalid in future_state_base::valid()
  > Merge "future-util: split header" from Benny
  > docs: corrected some text and code-examples in streaming-rpc docs
  > future: Reduce nesting in future::then
  > demos: coroutines: include std-compat.hh
  > sstring: mark str() and methods using it as noexcept
  > tls: Add an assert
  > future: fix coroutine compilation
2020-08-19 17:18:57 +03:00
Pavel Emelyanov
dc0918e255 tests: Keep local reference on global messaging
Some tests directly reference the global messaging service. For the sake
of simpler patching wrap this global reference with a local one. Once the
global messaging service goes away tests will get their own instances.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
b895c2971a api: Use local reference to messaging_service
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
d477bd562d api: Unregister messaging endpoints on stop
API is one of the subsystems that work with messaging service. To keep
the dependencies correct the related API stuff should be stopped before
the messaging service stops.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
78298ec776 init: Use local messaging reference in main
There are few places that initialize db and system_ks and need the
messaging service. Pass the reference to it from main instead of
using the global helpers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
878c50b9ad main: Keep reference on global messaging service
This is the preparation for moving the message service to main -- keep
a reference and eventually pass one to subsystems depending on messaging.
Once they are ready, the reference will be turned into an instance.

For now only push the reference into the messaging service init/exit
itself, other subsystems will be patched next.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
bdfb77492f init: The messaging_service::stop is back (not really)
Introduce back the .stop() method that will be used to really stop
the service. For now do not do sharded::stop, as its users are not
yet stopping, so this prevents use-after-free on messaging service.

For now the .stop() is empty, but will be in charge of checking if
all the other users had unregistered their handlers from rpc.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
15998e20ce init: Move messaging service init up the main()
The messaging service is a low-level one which doesn't need other
services, so it can be started first. Nowadays it's indeed started
before most of its users but one -- the gossiper.

In current code gossiper doesn't do anything with messaging service
until it starts, but very soon this dependency will be expressed in
terms of a reference from gossiper to messaging_service, thus by the
time the latter starts, the former should already exist.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
c28aeaee2e messaging_service: Move initialization to messaging/
Now the init_messaging_service() only deals with messaging service
and related internal stuff, so it can sit in its own module.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
41eee249d7 init: RIP init_scheduling_config
This struct is nowadays only used to transport arguments from db::config
to messaging_service::scheduling_config, things get simpler if dropping it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
ef6c75a732 init: Call init_messaging_service with its config only
This makes the messaging service configuration completely independent
from the db config. Next step would be to move the messaging service
init code into its module.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
5b169e8d16 messaging_service: Construct using config
This is the continuation of the previous patch -- change the primary
constructor to work with config. This, in turn, will decouple the
messaging service from database::config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
304a414e39 messaging_service: Introduce and use config
This service constructor uses and copies many simple values, it would be
much simpler to group them on config. It also helps the next patches to
simplify the messaging service initialization and to keep the defaults
(for testing) in one place.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
f7d99b4a06 init: Split messaging service and gossiper initialization
The init_ms_fd_gossiper function initializes two services, but
effectively consists of two independent parts, so declare them
as such.

The duplication of listen address resolution will go away soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
1c8ea817cd messaging_service: Rename stop() to shutdown()
On today's stop() the messaging service is not really stopped as other
services still (may) use it and have registered handlers in it. Inside
the .stop() only the rpc servers are brought down, so the better name
for this method would be shutdown().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
e6fb2b58fc messaging_service: Cleanup visibility of stopping methods
Just a cleanup. These internal stoppers must be private, also there
are too many public specifiers in the class description around them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Pavel Emelyanov
0601e9354d init: Remove unused lonely future from init_ms_fd_gossiper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-19 13:08:12 +03:00
Rafael Ávila de Espíndola
56724d084d sstables: Move date_tiered_compaction_strategy_options::date_tiered_compaction_strategy_options out of line
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200812232915.442564-6-espindola@scylladb.com>
2020-08-19 11:34:13 +03:00
Rafael Ávila de Espíndola
07b3ead752 sstables: Move size_tiered_compaction_strategy_options::size_tiered_compaction_strategy_options out of line
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200812232915.442564-5-espindola@scylladb.com>
2020-08-19 11:34:13 +03:00
Rafael Ávila de Espíndola
7b3946fa0e sstables: Move compaction_strategy_impl::compaction_strategy_impl out of line
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200812232915.442564-4-espindola@scylladb.com>
2020-08-19 11:34:13 +03:00
Rafael Ávila de Espíndola
9ba765fe6f sstables: Move compaction_strategy_impl::get_value out of line
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200812232915.442564-3-espindola@scylladb.com>
2020-08-19 11:34:13 +03:00
Rafael Ávila de Espíndola
06b15aa7e3 sstables: Move time_window_compaction_strategy_options' constructors to a .cc
These are not trivial and not hot.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200812232915.442564-2-espindola@scylladb.com>
2020-08-19 11:34:13 +03:00
Raphael S. Carvalho
d601f78b4b compaction/twcs: improve further debug messages
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-08-18 15:14:09 -03:00
Raphael S. Carvalho
086f277584 compaction/twcs: Improve debug log which shows all windows
The current log prints one log entry for each window, it doesn't print
the # of SSTs in the bucket, and the now information is copied across
all the window entries.

previously, it looked like this:

[shard 0] compaction - Key 1597331160000000, now 1597331160000000
[shard 0] compaction - Key 1597331100000000, now 1597331160000000
[shard 0] compaction - Key 1597331040000000, now 1597331160000000
[shard 0] compaction - Key 1597330980000000, now 1597331160000000

this made it harder to group all windows which reflect the state of
the strategy in a given time.

now, it looks as follows:

[shard 0] compaction - time_window_compaction_strategy::newest_bucket:
  now 1597331160000000
  buckets = {
    key=1597331160000000, size=1
    key=1597331100000000, size=2
    key=1597331040000000, size=1
    key=1597330980000000, size=1
  }

Also the level of this log is changed from debug to trace, given that
now it's compressed and only printed once.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-08-18 15:14:09 -03:00
Raphael S. Carvalho
3be1420083 test: Check that TWCS properly performs size-tiered compaction on past windows
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-08-18 15:14:09 -03:00
Raphael S. Carvalho
96436312be compaction/twcs: Make task estimation take into account the size-tiered behavior
The task estimation was not taking into account that TWCS does size-tiered
on the windows, and it only added 1 to the estimation when there
could be more tasks than that depending on the amount of SSTables in
all the existing size tiers.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-08-18 15:14:09 -03:00
Raphael S. Carvalho
d287b1c198 compaction/stcs: Export static function that estimates pending tasks
That will be useful for allowing other compaction strategies that use
STCS to properly estimate the pending tasks.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-08-18 15:14:09 -03:00
Raphael S. Carvalho
b62737fd05 compaction/stcs: Make get_buckets() static
STCS will export a static function to estimate pending tasks, and
it relies on get_buckets() being static too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-08-18 15:14:07 -03:00
Botond Dénes
48550eaae4 scylla-gdb.py: add find_vptrs_of_type() helper functions
One of the most common pieces of code I write when investigating coredumps is
finding all objects of a certain type and creating a human readable
report of certain properties of these objects. This usually involves
retrieving all objects with a vptr with `find_vptrs()` and matching
their type to some pattern. I found myself writing this boilerplate over
and over again, so in this patch I introduce a convenience method to
avoid repeating it in the future.
Message-Id: <20200818145247.2358116-1-bdenes@scylladb.com>
2020-08-18 17:20:06 +02:00
Botond Dénes
74ffafc8a7 scylla-gdb.py: scylla fiber: add actual return to early return
scylla_fiber._walk() has an early return condition on the passed-in
pointer actually being a task pointer. The problem is that the actual
return statement was missing and only an error was logged. This resulted
in execution continuing and further weird errors being printed due to
the code not knowing how to handle the bad pointer.
Message-Id: <20200818144902.2357289-1-bdenes@scylladb.com>
2020-08-18 17:17:25 +02:00
Botond Dénes
f3af6ff221 scylla-gdb.py: scylla fiber: add new FQ name of thread_wake_task
thread_wake_task was moved into an anonymous namespace, add this new
fully qualified name to the task name white-list. Leave the old name for
backward compatibility.

While at it, also add `seastar::thread_context` which is also a task
object, for better seastar thread support.
Message-Id: <20200818142206.2354921-1-bdenes@scylladb.com>
2020-08-18 16:48:01 +02:00
Botond Dénes
ece638fb3f scylla-gdb.py: collection_element(): add std::tuple support
Accessing the element of a tuple from the gdb command line is a
nightmare, add support to collection_element() retrieving one of its
elements to make this easier.
Message-Id: <20200818141123.2351892-1-bdenes@scylladb.com>
2020-08-18 16:48:01 +02:00
Botond Dénes
077dc7c021 scylla-gdb.py: boost_intrusive_list: add __len__() operator
Message-Id: <20200818141340.2352666-1-bdenes@scylladb.com>
2020-08-18 16:48:01 +02:00
Dejan Mircevski
fb6c011b52 everywhere: Insert space after switch
Quoth @avikivity: "switch is not a function, and we celebrate that by
putting a space after it like other control-flow keywords."

https://github.com/scylladb/scylla/pull/7052#discussion_r471932710

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-08-18 14:31:04 +03:00
Botond Dénes
78f94ba36a table: get_sstables_by_partition_key(): don't make a copy of selected sstables
Currently we assign the reference to the vector of selected sstables to
`auto sst`. This makes a copy and we pass this local variable to
`do_for_each()`, which will result in a use-after-free if the latter
defers.
Fix by not making a copy and instead just keep the reference.
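
A minimal sketch of the underlying language rule, assuming nothing about the
real table code: the `selection` struct and both helpers are hypothetical.
Plain `auto` deduces a value type and copies the vector, while `auto&` keeps
referring to the original.

```cpp
#include <cassert>
#include <vector>

// Hypothetical owner of a long-lived vector, standing in for the
// container that holds the selected sstables.
struct selection {
    std::vector<int> sstables{1, 2, 3};
    const std::vector<int>& get() const { return sstables; }
};

// `auto` drops the reference: `copy` is a separate vector that lives
// only until the end of this function.
bool auto_makes_a_copy(const selection& s) {
    auto copy = s.get();
    return &copy != &s.sstables;
}

// `auto&` keeps the reference: `ref` aliases the vector owned by `s`.
bool auto_ref_aliases_original(const selection& s) {
    auto& ref = s.get();
    return &ref == &s.sstables;
}
```

Handing the local copy to an asynchronous loop is what makes the first form
dangerous: the copy is destroyed when the enclosing function returns, which
can happen before the loop finishes.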

Fixes: #7060

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200818091241.2341332-1-bdenes@scylladb.com>
2020-08-18 14:20:31 +03:00
Avi Kivity
ecb2bdad54 Merge 'Replace operator_type with an enum' from Dejan
"
operator_type is awkward because it's not copyable or assignable. Replace it with a new enum class.

Tests: unit(dev)
"

* dekimir-operator-type:
  cql3: Drop operator_type entirely
  cql3: Drop operator_type from the parser
  cql3/expr: Replace operator_type with an enum
2020-08-18 13:45:20 +03:00
Dejan Mircevski
1aa326c93b cql3: Drop operator_type entirely
Since no live code uses it anymore, it can be safely removed.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-08-18 12:27:01 +02:00
Dejan Mircevski
d97605f4f8 cql3: Drop operator_type from the parser
Replace operator_type with the nicer-behaved oper_t in CQL parser and,
consequently, in the relation hierarchy and column_condition.

After this, no references to operator_type remain in live code.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-08-18 12:27:00 +02:00
Dejan Mircevski
71c921111d cql3/expr: Replace operator_type with an enum
operator_type is awkward because it's not copyable or assignable.
Replace it in expression representation with a new enum class, oper_t.
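
A minimal sketch of why an enum class is easier to work with. The
enumerators and both helper functions below are assumptions for
illustration; the real oper_t may differ.

```cpp
#include <cassert>
#include <string_view>
#include <vector>

// Illustrative enumerators only; not necessarily the real oper_t.
enum class oper_t { EQ, NEQ, LT, LTE, GT, GTE };

// Enum values are trivially copyable, so they can be passed, returned
// and stored by value.
constexpr std::string_view to_string(oper_t op) {
    switch (op) {
    case oper_t::EQ:  return "=";
    case oper_t::NEQ: return "!=";
    case oper_t::LT:  return "<";
    case oper_t::LTE: return "<=";
    case oper_t::GT:  return ">";
    case oper_t::GTE: return ">=";
    }
    return "?";
}

// Returns the operator to use when the operands are swapped,
// e.g. `a < b` becomes `b > a`.
constexpr oper_t reversed(oper_t op) {
    switch (op) {
    case oper_t::LT:  return oper_t::GT;
    case oper_t::LTE: return oper_t::GTE;
    case oper_t::GT:  return oper_t::LT;
    case oper_t::GTE: return oper_t::LTE;
    default:          return op; // EQ and NEQ are symmetric
    }
}
```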

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-08-18 12:27:00 +02:00
Avi Kivity
b0ae9d0c7d Update tools/python3 submodule
* tools/python3 196be5a...f89ade5 (1):
  > reloc: cleanup deb builddir
2020-08-18 13:10:01 +03:00
Pekka Enberg
385ad5755b configure.py: Move tarballs to build/<mode>/dist/tar
As suggested by Avi, let's move the tarballs from
"build/dist/<mode>/tar" to "build/<mode>/dist/tar" to retain the
symmetry of different build modes, and make the tarballs easier to
discover. While at it, let's document the new tarball locations.

Message-Id: <20200818100427.1876968-1-penberg@scylladb.com>
2020-08-18 13:07:52 +03:00
Botond Dénes
22a6493716 view_update_generator: fix race between registering and processing sstables
fea83f6 introduced a race between processing (and hence removing)
sstables from `_sstables_with_tables` and registering new ones. This
manifested in sstables that were added concurrently with processing a
batch for the same sstables being dropped and the semaphore units
associated with them not returned. This resulted in repairs being
blocked indefinitely as the units of the semaphore were effectively
leaked.

This patch fixes this by moving the contents of `_sstables_with_tables`
to a local variable before starting the processing. A unit test
reproducing the problem is also added.
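
A minimal, single-threaded sketch of the "move the pending set to a local
before processing" pattern. The `pending_registry` class and its member
names are hypothetical and only loosely modelled on `_sstables_with_tables`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the generator's registry of pending sstables,
// keyed by table name.
class pending_registry {
    std::unordered_map<std::string, std::vector<int>> _pending;
public:
    void register_sstable(const std::string& table, int sstable_id) {
        _pending[table].push_back(sstable_id);
    }

    // Detaches the current batch and leaves the member empty. Anything
    // registered while the returned batch is being processed lands in
    // the fresh map and is picked up by the next call.
    std::unordered_map<std::string, std::vector<int>> take_batch() {
        return std::exchange(_pending, {});
    }

    std::size_t pending_tables() const { return _pending.size(); }
};
```

Because processing only ever touches the detached batch, a registration that
arrives in the middle of processing cannot be erased along with it.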

Fixes: #6892

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200817160913.2296444-1-bdenes@scylladb.com>
2020-08-18 10:22:35 +03:00
Takuya ASADA
352a136ae2 scylla-python3: move scylla-python3 to separated repository
Except for scylla-python3, each scylla package has its own git repository, the same package script filename, and the same build directory structure.
To keep the python3 packaging in the scylla repo, we created 'python3' directories in multiple locations, added '-python3' suffixed files, and nested the build directory deeper to avoid conflicting with the scylla-server package build.
We should move all scylla-python3 related files to a new repository, scylla-python3.

To keep compatibility with current Jenkins script, provide packages on
build/ directory for now.

Fixes #6751
2020-08-18 09:34:08 +03:00
Raphael S. Carvalho
f9f0be9ac8 compact/twcs: Perform size-tiered compaction on past time windows
After the data segregation feature, anything that causes out-of-order writes,
like read repair, can result in small updates to past time windows.
This causes compaction to be very aggressive because whenever a past time
window is updated like that, that time window is recompacted into a
single SSTable.
Users expect that once a window is closed, it will no longer be written
to, but that has changed since the introduction of the data segregation
feature. We didn't anticipate the write amplification issues that the
feature would cause. To fix this problem, let's perform size-tiered
compaction on the windows that are no longer active and were updated
because data was segregated. The current behavior where the last active
window is merged into one file is kept. But thereafter, that same
window will only be compacted using STCS.

Fixes #6928.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-08-17 12:29:34 -03:00
Raphael S. Carvalho
820b47e9a3 compaction/twcs: Make strategy easier to extend by removing duplicated knowledge
TWCS is hard to extend because its knowledge on what to do with a window
bucket is duplicated in two functions. Let's remove this duplication by
placing the knowledge into a single function.

This is important for the coming change that will perform size-tiered
instead of major on windows that are no longer active.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-08-17 12:29:34 -03:00
Raphael S. Carvalho
f2b588cfc4 compaction/twcs: Make newest_bucket() non-static
To fix #6928, newest_bucket() will have to access the class fields.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-08-17 12:29:34 -03:00
Raphael S. Carvalho
b95359314d compaction/twcs: Move TWCS implementation into source file
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-08-17 12:29:34 -03:00
Pavel Solodovnikov
9aa4712270 lwt: introduce paxos_grace_seconds per-table option to set paxos ttl
Previously system.paxos TTL was set as max(3h, gc_grace_seconds).

Introduce new per-table option named `paxos_grace_seconds` to set
the amount of seconds which are used to TTL data in paxos tables
when using LWT queries against the base table.

Default value is equal to `DEFAULT_GC_GRACE_SECONDS`,
which is 10 days.
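
A minimal sketch of the two TTL rules described above. The function names
are hypothetical, and it is an assumption that the new option value is used
directly as the TTL with no lower bound.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using std::chrono::seconds;

// Previous rule from the commit message: max(3h, gc_grace_seconds).
constexpr seconds old_paxos_ttl(seconds gc_grace) {
    return std::max(seconds(std::chrono::hours(3)), gc_grace);
}

// Default stated in the commit message: 10 days, i.e. 864000 seconds.
constexpr seconds default_paxos_grace = seconds(std::chrono::hours(24 * 10));

// New rule as sketched here: the per-table option is the TTL
// (assumption: no minimum is applied).
constexpr seconds new_paxos_ttl(seconds paxos_grace = default_paxos_grace) {
    return paxos_grace;
}
```

Under this reading, a test can set a very small paxos_grace_seconds to make
paxos data expire quickly, which the old 3-hour floor prevented.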

This change makes it easy to test various issues related to paxos TTL.

Fixes #6284

Tests: unit (dev, debug)

Co-authored-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200816223935.919081-1-pa.solodovnikov@scylladb.com>
2020-08-17 16:44:14 +02:00
Kamil Braun
0d3779e3e6 cdc: rewrite process_changes using inspect_mutation 2020-08-17 15:51:33 +02:00
Kamil Braun
9067f1a4e2 cdc: move some functions out of cdc::transformer
Preparing them to be used outside of `transformer`.
2020-08-17 15:51:33 +02:00
Kamil Braun
4533f62f54 cdc: rewrite extract_changes using inspect_mutation 2020-08-17 15:51:33 +02:00
Kamil Braun
e9192a6108 cdc: rewrite should_split using inspect_mutation 2020-08-17 15:51:33 +02:00
Kamil Braun
ee87f4026e cdc: rewrite find_timestamp using inspect_mutation 2020-08-17 15:51:33 +02:00
Kamil Braun
694714796f cdc: introduce a ,,change visitor'' abstraction
This is an abstraction for walking over mutations created by a write
coordinator, deconstructing them into ,,atomic'' pieces (,,changes''),
and consuming these pieces.

Read the big comment in cdc/change_visitor.hh for more details.
2020-08-17 15:51:30 +02:00
Nadav Har'El
4c73d43153 Alternator: allow CreateTable with SSESpecification explicitly disabled
While Alternator doesn't yet support creating a table with a different
"server-side encryption" (a.k.a. encryption-at-rest) parameters, the
SSESpecification option with Enabled=false should still be allowed, as
it is just the default, and means exactly the same as would a missing
SSESpecification.

This patch also adds a test for this case, which failed on Alternator
before this patch.

Fixes #7031.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200812205853.173846-1-nyh@scylladb.com>
2020-08-17 13:48:52 +02:00
Nadav Har'El
159a966949 alternator test, streams: add test LSI key attributes in OldImage
This patch adds a test that attributes which serve as a key for a
secondary index still appear in the OldImage in an Alternator Stream.

This is a special case, because although usually Alternator attributes
are saved as map elements, not stand-alone Scylla columns, in the special
case of secondary-index keys they *are* saved as actual Scylla columns
in the base table. And it turns out we produce wrong results in this case:
CDC's "preimage" does not currently include these columns if they didn't
change, while DynamoDB requires that all columns, not just the changed ones,
appear in OldImage. So the test added in this patch xfails on Alternator
(and as usual, passes on DynamoDB).

Refs #7030.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200812144656.148315-1-nyh@scylladb.com>
2020-08-17 13:46:53 +02:00
Pekka Enberg
d6354cb507 dbuild: Use host $USER and $HOME in Podman container
The "user.home" system property in JVM does not use the "HOME"
environment variable. This breaks Ant and Maven builds with Podman,
which attempts to look up the local Maven repository in "/root/.m2" when
building tools, for example:

  build.xml:757: /root/.m2/repository does not exist.

To fix the issue, let's bind-mount an /etc/passwd file, which contains
host username for UID 0, which ensures that Podman container $USER and $HOME
are the same as on the host.

Message-Id: <20200817085720.1756807-1-penberg@scylladb.com>
2020-08-17 13:46:28 +03:00
Avi Kivity
f4dbe3e65e Update tools/jmx submodule
* tools/jmx c5ed831...be8f1ac (1):
  > dist/common/systemd: set WorkingDirectory to get heap dump correctly
2020-08-17 09:54:59 +03:00
Avi Kivity
3b1ff90a1a Merge "Get rid of seed concept in gossip" from Asias
"
gossip: Get rid of seed concept

The concept of seed and the different behaviour between seed nodes and
non seed nodes generate a lot of confusion, complication and error for
users. For example, how to add a seed node into a cluster, how to
promote a non seed node to a seed node, how to choose seeds node in
multiple DC setup, edit config files for seeds, why seed node does not
bootstrap.

If we remove the concept of seed, it will get much easier for users.
After this series, seed config option is only used once when a new node
joins a cluster.

Major changes:

Seed nodes are only used as the initial contact point nodes.

Seed nodes now perform bootstrap. The only exception is the first node
in the cluster.

The unsafe auto_bootstrap option is now ignored.

Gossip shadow round now talks to all nodes instead of just seed nodes.

Refs: #6845
Tests: update_cluster_layout_tests.py + manual test
"

* 'gossip_no_seed_v2' of github.com:asias/scylla:
  gossip: Get rid of seed concept
  gossip: Introduce GOSSIP_GET_ENDPOINT_STATES verb
  gossip: Add do_apply_state_locally helper
  gossip: Do not talk to seed node explicitly
  gossip: Talk to live endpoints in a shuffled fashion
2020-08-17 09:50:51 +03:00
Avi Kivity
5356e8319d Merge 'Support building packages on non-x86 platform' from Takuya
"
Allow users to build unofficial packages for non-x86 platform.
"

* syuu1228-aarch64_packaging_fix:
  dist/debian: allow building non-amd64 .deb
  configure.py: disable DPDK by default on non-x86_64 platform
2020-08-17 08:26:17 +03:00
Takuya ASADA
c73e945cf6 dist/debian: allow building non-amd64 .deb
Allow building .deb on any architecture, not only amd64.
2020-08-17 14:16:24 +09:00
Takuya ASADA
06079f0656 configure.py: disable DPDK by default on non-x86_64 platform
Since configure.py without options fails on some non-x86 architectures
such as ARM64, we should disable DPDK by default on such architectures.
2020-08-17 14:16:24 +09:00
Asias He
d0b3f3dfe8 gossip: Get rid of seed concept
The concept of seed and the different behaviour between seed nodes and
non seed nodes generate a lot of confusion, complication and error for
users. For example, how to add a seed node into a cluster, how to
promote a non seed node to a seed node, how to choose seeds node in
multiple DC setup, edit config files for seeds, why seed node does not
bootstrap.

If we remove the concept of seed, it will get much easier for users.
After this series, seed config option is only used once when a new node
joins a cluster.

Major changes:

- Seed nodes are only used as the initial contact point nodes.

- Seed nodes now perform bootstrap. The only exception is the first node
  in the cluster.

- The unsafe auto_bootstrap option is now ignored.

- Gossip shadow round now attempts to talk to all nodes instead of just seed nodes.

Manual test:

- bootstrap n1, n2, n3  (n1 and n2 are listed as seed, check only n1
  will skip bootstrap, n2 and n3 will bootstrap)
- shutdown n1, n2, n3
- start n2 (check non seed node can boot)
- start n1 (check n1 talks to both n2 and n3)
- start n3 (check n3 talks to both n1 and n2)

Upgrade/Downgrade test:

- Initialize cluster
  Start 3 node with n1, n2, n3 using old version
  n1 and n2 are listed as seed

- Test upgrade starting from seed nodes
  Rolling restart n1 using new version
  Rolling restart n2 using new version
  Rolling restart n3 using new version

- Test downgrade to old version
  Rolling restart n1 using old version
  Rolling restart n2 using old version
  Rolling restart n3 using old version

- Test upgrade starting from non seed nodes
  Rolling restart n3 using new version
  Rolling restart n2 using new version
  Rolling restart n1 using new version

Notes on upgrade procedure:

There is no special procedure needed to upgrade to Scylla without seed
concept. Rolling upgrade node one by one is good enough.

Fixes: #6845
Tests: ./test.py + update_cluster_layout_tests.py + manual test
2020-08-17 10:35:16 +08:00
Takuya ASADA
75c2362c95 dist/debian: disable debuginfo compression on .deb
Since older binutils on some distributions are not able to handle
compressed debuginfo generated on Fedora, we need to disable it.
However, the debian packager forces debuginfo compression as of debian/compat = 9,
so we have to uncompress the files after they are compressed automatically.

Fixes #6982
2020-08-16 18:13:29 +03:00
Avi Kivity
125795bda5 Merge " Build tarballs to build/dist/<mode>/tar directory" from Pekka
"
This patch series changes the build system to build all tarballs to
build/dist/<mode>/tar directory. For example, running:

  ./tools/toolchain/dbuild ./configure.py --mode=dev && ./tools/toolchain/dbuild ninja-build dist-tar

produces the following tarballs in build/dist/dev/tar:

  $ ls -1 build/dist/dev/tar/
  scylla-jmx-package.tar.gz
  scylla-package.tar.gz
  scylla-python3-package.tar.gz
  scylla-tools-package.tar.gz

This makes it easy to locate release tarballs for humans and scripts. To
preserve backward compatibility, the tarballs are also retained in their
original locations. Once release engineering infrastructure has been
adjusted to use the new locations, we can drop the duplicate copies.
"

* 'penberg/build-dist-tar/v1' of github.com:penberg/scylla:
  configure.py: Copy tarballs to build/dist/<mode>/tar directory
  configure.py: Add "dist-<component>-tar" targets
  reloc/python3: Add "--builddir" to build_deb.sh
  configure.py: Use copy-on-write copies when possible
2020-08-16 17:55:35 +03:00
Avi Kivity
061ec49a6c Merge "Improve error reporting on invalid internal schema access" from Tomasz
"
Contains several fixes which improve debuggability in situations where
too large column ids are passed to column definition lookup methods.
"

* 'schema-range-check-fix' of github.com:tgrabiec/scylla:
  schema: Add table name and schema version to error messages
  schema: Use on_internal_error() for range check errors
  schema: Fix off-by-one in column range check
  schema: Make range checks for regular and static columns the same as for clustering columns
2020-08-16 17:48:48 +03:00
Raphael S. Carvalho
81ec49c82f sstables/sstable_set: rename method to retrieve sstable runs
select() is too generic for the method that retrieves sstable runs,
and it has a completely different meaning than the former select
method used to select sstables based on token range.
let's give it a more descriptive name.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811193401.22749-1-raphaelsc@scylladb.com>
2020-08-16 17:41:16 +03:00
Raphael S. Carvalho
b07920dd1f sstables: Fix remove_by_toc_name() on temporary toc
regression caused by 55cf219c97.

remove_by_toc_name() must work both for a sealed sstable with toc,
and also a partial sstable with tmp toc.
so dirname() should be called conditionally on the condition of
the sstable.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200813160612.101117-1-raphaelsc@scylladb.com>
2020-08-16 17:35:55 +03:00
Raphael S. Carvalho
7d7f9e1c54 sstables/LCS: increase per-level overlapping tolerance in reshape
LCS can have its overlapping invariant broken after operations that can
proceed in parallel to regular compaction like cleanup. That's because
there could be two compactions in parallel placing data in overlapping
token ranges of a given level > 0.
After reshape, the whole table will be rewritten, on restart, if a
given level has more than (fan_out*2)=20 overlaps.
That may sound like enough, but that's not taking into account the
exponential growth in # of SSTables per level, so 20 overlaps may
sound like a lot for level 2 which can afford 100 sstables, but it's
only 2% of level 3, and 0.2% of level 4. So let's change the
overlapping tolerance from the constant of fan_out*2 to 10% of level
limit on # of SSTables, or fan_out, whichever is higher.
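
A worked sketch of the arithmetic above, assuming fan_out is 10 (implied by
"(fan_out*2)=20") and that level N may hold fan_out^N SSTables. The function
names are hypothetical and not taken from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumed value, implied by "(fan_out*2)=20" in the message above.
constexpr std::uint64_t fan_out = 10;

// Assumed per-level limit on # of SSTables: fan_out^level
// (level 2 -> 100, level 3 -> 1000, level 4 -> 10000).
constexpr std::uint64_t max_sstables_in_level(unsigned level) {
    std::uint64_t n = 1;
    for (unsigned i = 0; i < level; ++i) {
        n *= fan_out;
    }
    return n;
}

// New tolerance: 10% of the level limit, or fan_out, whichever is higher.
constexpr std::uint64_t overlap_tolerance(unsigned level) {
    return std::max(max_sstables_in_level(level) / 10, fan_out);
}
```

Under these assumptions the tolerance is 10 overlaps for levels 1 and 2, 100
for level 3 and 1000 for level 4, instead of a flat 20 for every level.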

Refs #6938.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200810154510.32794-1-raphaelsc@scylladb.com>
2020-08-16 17:33:48 +03:00
Raphael S. Carvalho
11df96718a compaction: Prevent non-regular compaction from picking compacting SSTables
After 8014c7124, cleanup can potentially pick a compacting SSTable.
Upgrade and scrub can also pick a compacting SSTable.
The problem is that table::candidates_for_compaction() was badly named.
It misleads the user into thinking that the SSTables returned are perfect
candidates for compaction, but the manager still needs to filter out the
compacting SSTables from the returned set. So it's being renamed.

When the same SSTable is compacted in parallel, the strategy invariant
can be broken, e.g. overlapping can be introduced in LCS, and
some deletions can fail as more than one compaction process would try
to delete the same files.

Let's fix scrub, cleanup and upgrade by calling the manager function
which gets the correct candidates for compaction.

Fixes #6938.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811200135.25421-1-raphaelsc@scylladb.com>
2020-08-16 17:31:03 +03:00
Nadav Har'El
7e01ae089e cdc: avoid including cdc/cdc_options.hh everywhere
Before this patch, modifying cdc/cdc_options.hh required recompiling 264
source files. This is because this header file was included by a couple
other header files - most notably schema.hh, where a forward declaration
would have been enough. Only the handful of source files which really
need to access the CDC options should include "cdc/cdc_options.hh" directly.

After this patch, modifying cdc/cdc_options.hh requires only 6 source files
to be recompiled.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200813070631.180192-1-nyh@scylladb.com>
2020-08-16 14:41:47 +03:00
Piotr Jastrzebski
01ea159fde codebase wide: use try_emplace when appropriate
C++17 introduced try_emplace for maps to replace a pattern:
if(element not in a map) {
    map.emplace(...)
}

try_emplace is more efficient and results in more concise code.

This commit introduces usage of try_emplace when it's appropriate.
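
As an illustration of the pattern being replaced, here is a minimal sketch using a plain std::map. The function names are made up for this example and do not come from the patch:

```cpp
#include <map>
#include <string>

// Pre-C++17 pattern: one lookup to test for the key, a second one to insert.
inline bool insert_if_absent_old(std::map<std::string, int>& m,
                                 const std::string& key, int value) {
    if (m.find(key) == m.end()) {
        m.emplace(key, value);
        return true;
    }
    return false;
}

// C++17: try_emplace does a single lookup and leaves the map untouched
// when the key is already present. The bool in the returned pair tells
// whether an insertion happened.
inline bool insert_if_absent_new(std::map<std::string, int>& m,
                                 const std::string& key, int value) {
    return m.try_emplace(key, value).second;
}
```

Both functions leave an existing value in place. The try_emplace form also guarantees that its arguments are not moved from when no insertion takes place.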

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4970091ed770e233884633bf6d46111369e7d2dd.1597327358.git.piotr@scylladb.com>
2020-08-16 14:41:09 +03:00
Pekka Enberg
39400f58fb build_unified.sh: Compress generated tarball
Fixes #7039
Message-Id: <20200813124921.1648028-1-penberg@scylladb.com>
2020-08-16 14:41:01 +03:00
Dejan Mircevski
edf91e9e06 test: Restore a case in user_types_test
This testcase was temporarily commented out in 37ebe52, because it
relied on buggy (#6369) behaviour fixed by that commit.  Specifically,
it expected a NULL comparison to match a NULL cell value.  We now
bring it back, with corrected result expectation.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-08-16 13:49:55 +03:00
Pavel Emelyanov
319c9dda92 scylla-gdb: Fix netw command
The _clients is std::vector, it doesn't have _M_elems.
Luckily there's std_vector() class for it.

The seastar::rpc::server::_conns is unordered_map, not
unordered_set.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200814070858.32383-1-xemul@scylladb.com>
2020-08-16 11:41:02 +03:00
Asias He
c76296e97e scylla-gdb.py: Add boost_intrusive_list_printer
It is needed to print the boost::intrusive::list which is used
by repair_meta_for_masters in repair.

Fixes #7037

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
2020-08-15 20:26:02 +03:00
Piotr Jastrzebski
c001374636 codebase wide: replace count with contains
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
`count` function was often used in various ways.

`contains` not only expresses the intent of the code better but also
does it in a more uniform way.

This commit replaces all the occurrences of the `count` with the
`contains`.

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
2020-08-15 20:26:02 +03:00
Konstantin Osipov
6d7393b0df build: Make it possible to opt-out from building the packages.
Message-Id: <20200812162845.852515-1-kostja@scylladb.com>
2020-08-15 20:26:02 +03:00
Tomasz Grabiec
db1c8c439a schema: Add table name and schema version to error messages 2020-08-14 14:35:09 +02:00
Tomasz Grabiec
817c2e0508 schema: Use on_internal_error() for range check errors 2020-08-14 14:35:09 +02:00
Tomasz Grabiec
43d503102b schema: Fix off-by-one in column range check
We'd fail in std::vector::at() instead.

Let's catch all invalid accesses, as intended.
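
The message does not show the check itself, so the following is only a sketch of one common form such an off-by-one takes. The `columns` struct and its method names are hypothetical stand-ins, not the schema code:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a schema's column list.
struct columns {
    std::vector<std::string> names;

    // Off-by-one form: id == names.size() passes the explicit check, so the
    // failure surfaces later from std::vector::at() as std::out_of_range.
    const std::string& get_off_by_one(std::size_t id) const {
        if (id > names.size()) {
            throw std::runtime_error("column id out of range");
        }
        return names.at(id);
    }

    // Intended form: every id outside [0, size) is rejected by the explicit
    // check, which can then report a more useful error.
    const std::string& get_checked(std::size_t id) const {
        if (id >= names.size()) {
            throw std::runtime_error("column id out of range");
        }
        return names.at(id);
    }
};
```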
2020-08-14 14:34:51 +02:00
Tomasz Grabiec
b41f2c719b schema: Make range checks for regular and static columns the same as for clustering columns 2020-08-14 14:34:51 +02:00
Pekka Enberg
3c1db2fb87 configure.py: Copy tarballs to build/dist/<mode>/tar directory 2020-08-14 13:06:13 +03:00
Pekka Enberg
5a2d271df8 configure.py: Add "dist-<component>-tar" targets 2020-08-14 13:06:13 +03:00
Pekka Enberg
e4685020ba reloc/python3: Add "--builddir" to build_deb.sh
Add a "--builddir" command line option to build_deb.sh script of Python
3 so that we can use it to control artifact build location.
2020-08-14 13:06:13 +03:00
Pekka Enberg
7adae6b04a configure.py: Use copy-on-write copies when possible
Pass the "--reflink=auto" command line option to "cp" to use
copy-on-write copies whenever the filesystem supports it to reduce disk
space usage.
2020-08-14 13:06:13 +03:00
Asias He
e6ceec1685 gossip: Fix race between shutdown message handler and apply_state_locally
1. The node1 is shut down
2. The node1 sends shutdown message to node2
3. The node2 receives gossip shutdown message but the handler yields
4. The node1 is restarted
5. The node1 sends new gossip endpoint_state to node2, node2 applies the state
   in apply_state_locally and calls gossiper::handle_major_state_change
   and then calls gossiper::mark_alive
6. The shutdown message handler in step 3 resumes and sets status of node1 to SHUTDOWN
7. The gossiper::mark_alive fiber in step 5 resumes and calls gossiper::real_mark_alive,
   node2 will skip marking node1 as alive because the status of node1 is
   SHUTDOWN. As a result, node1 is alive but it is not marked as UP by node2.

To fix, we serialize the two operations.

Fixes #7032
2020-08-13 11:06:04 +03:00
Nadav Har'El
ee7291aa88 merge: CDC: allow "full" preimage in logs
Merged pull request https://github.com/scylladb/scylla/pull/7028
By Calle Wilund:

Changes the "preimage" option from binary true/false to on/off/full (accepting true/false, and using old style notation for normal to string - for upgrade reasons), where "full" will force us to include all columns in pre image log rows.

Adds small test (just adding the case to preimage test).
Uses the feature in alternator

Fixes #7030

  alternator: Set "preimage" to "full" for streams
  cdc_test: Do small test of "full"
  cdc: Make pre image optionally "full" (include all columns)
2020-08-12 23:19:46 +03:00
Calle Wilund
730c5ea283 alternator: Set "preimage" to "full" for streams
Fixes #7030

Dynamo/alternator streams old image data is supposed to
contain the full old value blob (all keys/values).

Setting preimage=full ensures we get even those properties
that have separate columns if they are not part of an actual
modification.
2020-08-12 16:05:00 +00:00
Calle Wilund
8cc5076033 cdc_test: Do small test of "full"
Not a huge test change, but at least verifies it works.
2020-08-12 16:04:52 +00:00
Calle Wilund
2eb4522fef cdc: Make pre image optionally "full" (include all columns)
Makes the "preimage" option for cdc non-binary, i.e. it can now
be "true"/"on", "false"/"off" or "full". The two former behave as
before; the latter includes all columns in the pre image.
2020-08-12 16:03:06 +00:00
Avi Kivity
79851d6216 Update tools/java submodule
* tools/java f2c7cf8d8d...d6c0ad1e2e (3):
  > sstableloader: Preserve droppedColumns in column rename handling
  > Revert "reloc: Build relocatable package without Maven"
  > reloc: Build relocatable package without Maven
2020-08-12 16:58:45 +03:00
Takuya ASADA
7cccb018b8 aws: update enhanced networking supported instance list
Sync enhanced networking supported instance list to latest one.

Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html

Fixes #6991
2020-08-12 15:43:17 +03:00
Nadav Har'El
8135647906 merge: Add metrics to semaphores
Merged pull request https://github.com/scylladb/scylla/pull/7018
by Piotr Sarna:

This series addresses various issues with metrics and semaphores - it mainly adds missing metrics, which makes it possible to see the length of the queues attached to the semaphores. In case of view building and view update generation, metrics were not present in these services at all, so a first, basic implementation is added.

More precise semaphore metrics would ease the testing and development of load shedding and admission control.

	view_builder: add metrics
	db, view: add view update generator metrics
	hints: track resource_manager sending queue length
	hints: add drain queue length to metrics
	table: add metrics for sstable deletion semaphore
	database: remove unused semaphore
2020-08-12 12:39:59 +03:00
Botond Dénes
4cfab59eb1 scylla-gdb.py: find_db(): don't return current shard's database for shard=0
The `shard` parameter of `find_db()` is optional and is defaulted to
`None`. When missing, the current shard's database instance is returned.
The problem is that the if condition checking this uses `not shard`,
which also evaluates to `True` if `shard == 0`, resulting in returning
the current shard's database instance for shard 0. Change the condition
to `shard is None` to avoid this.

Fixes: #7016
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200812091546.1704016-1-bdenes@scylladb.com>
2020-08-12 12:22:46 +03:00
Avi Kivity
736863c385 Merge "repair: Add progress metrics for node ops" from Asias
"
This series adds progress metrics for the node operations. Metrics for bootstrap and rebuild progress are added as a starter. I will add more for the remaining operations after getting feedback.

With this the Scylla Monitor and Scylla Manager can know the progress of the bootstrap and other node operations. E.g.,

    scylla_node_ops_bootstrap_nr_ranges_finished{shard="0",type="derive"} 50
    scylla_node_ops_bootstrap_nr_ranges_total{shard="0",type="derive"} 1040

Fixes #1244, #6733
"

* 'repair_progress_metrics_v3' of github.com:asias/scylla:
  repair: Add progress metrics for repair ops
  repair: Add progress metrics for rebuild ops
  repair: Add progress metrics for bootstrap ops
2020-08-12 11:42:14 +03:00
Avi Kivity
8853eddaf6 Merge 'repair: Track repair_meta created on both repair follower and master' from Asias
"
It is pretty hard to find the repair_meta object when debugging a core.
This patch makes it easier by putting repair_meta object created by
both repair follower and master into a map.

Fixes #7009
"

* asias-repair_make_debug_eaiser_track_all_repair_metas:
  repair: Add repair_meta_tracker to track repair_meta for followers and masters
  repair: Move thread local object _repair_metas out of the function
2020-08-12 11:01:32 +03:00
Botond Dénes
1d48442ae7 test/lib/mutation_source_test: test-monotonic-positions: test the reader-under-test
Instead of always testing `flat_mutation_reader_from_mutations()`.

Tests: unit(dev, debug)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200812073406.1681250-1-bdenes@scylladb.com>
2020-08-12 10:52:26 +03:00
Avi Kivity
24aa03a13c Merge "Move some test code out of line" (sstable_run_based_compaction_strategy_for_test) from Rafael
* 'espindola/move-out-of-line' of https://github.com/espindola/scylla:
  test: Move code in sstable_run_based_compaction_strategy_for_tests.hh out of line
  test: Drop ifdef now that we always use c++20
  test: Move sstable_run_based_compaction_strategy_for_tests.hh to test/lib
2020-08-12 10:46:40 +03:00
Asias He
e9a520a22b repair: Add repair_meta_tracker to track repair_meta for followers and masters
It is pretty hard to find the repair_meta object when debugging a core.
This patch makes it easier by putting repair_meta object created by
both repair follower and master into boost intrusive list.

Fixes #7009
2020-08-12 15:44:22 +08:00
Asias He
58f4c730b0 repair: Move thread local object _repair_metas out of the function
It is a lot of pain to access _repair_metas when debugging.

Refs #7009
2020-08-12 11:23:18 +08:00
Rafael Ávila de Espíndola
aa2476d7ac test: Move code in sstable_run_based_compaction_strategy_for_tests.hh out of line
Most of this is virtual and it is all test code.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-11 11:49:49 -07:00
Rafael Ávila de Espíndola
ef6a52a407 test: Drop ifdef now that we always use c++20
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-11 11:49:20 -07:00
Rafael Ávila de Espíndola
bd2f9fc685 test: Move sstable_run_based_compaction_strategy_for_tests.hh to test/lib
This is in preparation to moving the code to a .cc file.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-11 11:48:41 -07:00
Avi Kivity
f158d056e8 Update seastar submodule
* seastar e615054c75...c872c3408c (5):
  > future-util: Pass a rvalue reference to repeat
  > tutorial: service_loop: do not return handle_connection future
  > future-util: Drop redundant make_tuple call
  > future-util: Pass an rvalue reference to the repeater constructor
  > allow move assign empty expiring_fifo
2020-08-11 19:53:36 +03:00
Benny Halevy
6deba1d0b4 test: cql_query_test: test_cache_bypass: use table stats
test is currently flaky since system reads can happen
in the background and disturb the global row cache stats.

Use the table's row_cache stats instead.

Fixes #6773

Test: cql_query_test.test_cache_bypass(dev, debug)

Credit-to: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200811140521.421813-1-bhalevy@scylladb.com>
2020-08-11 19:52:16 +03:00
Piotr Sarna
5086a5ca32 view_builder: add metrics
The view builder service lacked metrics, so a basic set of them
is added.
2020-08-11 17:43:53 +02:00
Piotr Sarna
e4d78b60ff db, view: add view update generator metrics
The view update generator completely lacked metrics, so a basic set
of them is now exposed.
2020-08-11 17:43:53 +02:00
Piotr Sarna
180a1505fd hints: track resource_manager sending queue length
The number of tasks waiting for a hint to be sent is now tracked.
2020-08-11 17:43:53 +02:00
Piotr Sarna
58a9fa7d2e hints: add drain queue length to metrics
The number of tasks waiting for a drain is now tracked.
2020-08-11 17:43:53 +02:00
Piotr Sarna
8b56b24737 table: add metrics for sstable deletion semaphore
It's now possible to read the number of tasks waiting on the
sstable deletion semaphore.
2020-08-11 17:43:53 +02:00
Benny Halevy
13f437157a compaction_manager: register_compacting_sstables: allocate before registering sstables
Make all required allocations before merging sstables
into _compacting_sstables, so that it should not throw
after registering some sstables, but not all.

Test: database_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200811132440.416945-1-bhalevy@scylladb.com>
2020-08-11 18:14:58 +03:00
Botond Dénes
4ab4619341 auth: common: separate distributed query timeout for debug builds
Currently when running against a debug build, our integration test suite
suffers from a ton of timeout related error logs, caused by auth queries
timing out. This causes spurious test failures due to the unexpected
error messages in the log.
This patch increases the timeout for internal distributed auth queries
in debug mode, to give the slow debug builds more headroom to meet the
timeout.

Refs: #6548
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200811145757.1593350-1-bdenes@scylladb.com>
2020-08-11 18:07:53 +03:00
Avi Kivity
58104d17e0 Merge 'transport: Allow user to disable unencrypted native transport' from Pekka
"
Let users disable the unencrypted native transport too by setting the port to
zero in the scylla.yaml configuration file.

Fixes #6997
"

* penberg-penberg/native-transport-disable:
  docs/protocol: Document CQL protocol port configuration options
  transport: Allow user to disable unencrypted native transport
2020-08-11 16:30:52 +03:00
Avi Kivity
d36601a838 Merge 'Make commitlog respect disk limit better' from Calle
"
Refs #6148

Separates disk usage into two cases: allocated and used.
This is needed since we use both reserve and recycled segments,
neither of which is actually filled with anything at the point
of waiting.

Also refuses to recycle segments or increase reserve size
if our current disk footprint exceeds threshold.

And finally uses some initial heuristics to determine when
we should suggest flushing, based on disk limit, segment
size, and current usage. Right now, this is when we have only
half a segment left before hitting used == max.

Some initial tests show an improved adherence to limit
though it will still be exceeded, because we do _not_
force waiting for segments to become cleared or similar
if we need to add data, thus slow flushing can still make
usage create extra segments. We will however attempt to
shrink disk usage when load is lighter.

Somewhat unclear how much this impacts performance
with tight limits, and how much this matters.
"

* elcallio-calle/commitlog_size:
  commitlog: Make commitlog respect disk limit better
  commitlog: Demote buffer write log messages to trace
2020-08-11 15:03:32 +03:00
Dejan Mircevski
013893b08d auth: Drop needless role-manager check
The service constructor included a check ensuring that only
standard_role_manager can be used with password_authenticator. But
after 00f7bc6, password_authenticator does not depend on any action of
standard_role_manager. All queries to meta::roles_table in
password_authenticator seem self-contained: the table is created at
the start if missing, and salted_hash is CRUDed independently of any
other columns bar the primary key role_col_name.

NOTE: a nonstandard role manager may not delete a role's row in
meta::roles_table when that role is dropped. This will result in
successful authentication for that non-existing role. But the clients
call check_user_can_login() after such authentication, which in turn
calls role_manager::exists(role). Any correctly implemented role
manager will then return false, and authentication_exception will be
thrown. Therefore, no dependencies exist on the role-manager
behaviour, other than it being self-consistent.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-08-11 14:56:18 +03:00
Avi Kivity
4547949420 Merge "Fix repair stalls in get_sync_boundary and apply_rows_on_master_in_thread" from Asias
"
This patch set fixes stalls in repair that are caused by std::list merge and clear operations during test_latency_read_with_nemesis test.

Fixes #6940
Fixes #6975
Fixes #6976
"

* 'fix_repair_list_stall_merge_clear_v2' of github.com:asias/scylla:
  repair: Fix stall in apply_rows_on_master_in_thread and apply_rows_on_follower
  repair: Use clear_gently in get_sync_boundary to avoid stall
  utils: Add clear_gently
  repair: Use merge_to_gently to merge two lists
  utils: Add merge_to_gently
2020-08-11 14:52:23 +03:00
Botond Dénes
db5926134a sstables: sstable_mutation_reader: read_partition(): include more information in exception
Resolve the FIXME to help investigating related issues and include the
position of the consumer in the error message.

Refs: #6529

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200811111101.1576222-1-bdenes@scylladb.com>
2020-08-11 14:52:04 +03:00
Asias He
c65ad02fcd repair: Fix stall in apply_rows_on_master_in_thread and apply_rows_on_follower
The row_diff list in apply_rows_on_master_in_thread and
apply_rows_on_follower can be large. Modify do_apply_rows to remove the
row from the list when the row is consumed to avoid stall when the list
is destroyed.

Fixes #6975
2020-08-11 19:37:47 +08:00
Asias He
9f4b3a5fa6 repair: Use clear_gently in get_sync_boundary to avoid stall
The _row_buf and _working_row_buf list can be large. Use
clear_gently helper to avoid stalls.

Fixes #6940
2020-08-11 19:37:47 +08:00
Asias He
3e8c4a6788 utils: Add clear_gently
A helper to clear a list without stall.
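
The idea can be sketched in plain C++ as below. This is not the helper added by the patch, which would work with Seastar futures. Here `maybe_yield` is a hypothetical hook that stands in for a scheduler preemption point:

```cpp
#include <cstddef>
#include <list>

// Sketch only: destroys elements one at a time and calls `maybe_yield`
// after each one, instead of destroying the whole list in a single long
// uninterruptible call. Returns the number of elements destroyed.
template <typename T, typename YieldFn>
std::size_t clear_gently_sketch(std::list<T>& items, YieldFn maybe_yield) {
    std::size_t destroyed = 0;
    while (!items.empty()) {
        items.pop_front();
        ++destroyed;
        maybe_yield();
    }
    return destroyed;
}
```

Because the work is split per element, the time between two possible yield points is bounded by the cost of destroying one element, not the whole list.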

Refs #6975
Refs #6940
2020-08-11 19:37:47 +08:00
Calle Wilund
ed86e870ee docs/cdc.md: Add short explanation of stream ID bit composition
Bit layout, sort order and field usage of CDC stream ids.
2020-08-11 14:09:45 +03:00
Avi Kivity
41a75f2b99 Merge "make do_io_check path noexcept" from Benny
"
Make do_io_check and the io_check functions that
call it noexcept.  Up to sstable_write_io_check
and sstable_touch_directory_io_check.

Tests: unit (dev)
"

* tag 'io-check-noexcept-v1' of github.com:bhalevy/scylla:
  ssstable: io_check functions: make noexcept
  utils: do_io_check: adjust indentation
  utils: io_check: make noexcept for future-returning functions
2020-08-11 13:41:20 +03:00
Calle Wilund
5d044ab74e commitlog: Make commitlog respect disk limit better
Refs #6148

Separates disk usage into two cases: allocated and used.
This is needed since we use both reserve and recycled segments,
neither of which is actually filled with anything at the point
of waiting.

Also refuses to recycle segments or increase reserve size
if our current disk footprint exceeds threshold.

And finally uses some initial heuristics to determine when
we should suggest flushing, based on disk limit, segment
size, and current usage. Right now, this is when we have only
half a segment left before hitting used == max.

Some initial tests show an improved adherence to limit
though it will still be exceeded, because we do _not_
force waiting for segments to become cleared or similar
if we need to add data, thus slow flushing can still make
usage create extra segments. We will however attempt to
shrink disk usage when load is lighter.

Somewhat unclear how much this impacts performance
with tight limits, and how much this matters.

v2:
* Add some comments/explanations
v3:
* Made disk footprint subtract happen post delete (non-optimistic)
2020-08-11 10:40:56 +00:00
Avi Kivity
3530e80ce1 Merge "Support md format" from Benny
"
This series adds support for the "md" sstable format.

Support is based on the following:

* do not use clustering based filtering in the presence
  of static row, tombstones.
* Disabling min/max column names in the metadata for
  formats older than "md".
* When updating the metadata, reset and disable min/max
  in the presence of range tombstones (like Cassandra does
  and until we process them accurately).
* Fix the way we maintain min/max column names by:
  keeping whole clustering key prefixes as min/max
  rather than calculating min/max independently for
  each component, like Cassandra does in the "md" format.
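
The difference between the two ways of maintaining min/max in the last item can be shown with a toy model. In this sketch a clustering key prefix is just a vector of ints compared lexicographically, all keys are assumed to be non-empty and of equal length, and the function names are illustrative only:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative model: a clustering key prefix as a list of component values.
using prefix = std::vector<int>;

// Per-component approach: min and max are computed independently for each
// component, so the result may be a key that no row actually has.
inline std::pair<prefix, prefix> min_max_per_component(const std::vector<prefix>& keys) {
    prefix lo = keys.front();
    prefix hi = keys.front();
    for (const auto& k : keys) {
        for (std::size_t i = 0; i < k.size(); ++i) {
            lo[i] = std::min(lo[i], k[i]);
            hi[i] = std::max(hi[i], k[i]);
        }
    }
    return {lo, hi};
}

// Whole-prefix approach: keep the smallest and largest prefix as a unit,
// using lexicographic order (std::vector's operator<).
inline std::pair<prefix, prefix> min_max_whole_prefix(const std::vector<prefix>& keys) {
    auto [lo, hi] = std::minmax_element(keys.begin(), keys.end());
    return {*lo, *hi};
}
```

For the keys (1, 9) and (2, 0), the per-component result is min (1, 0) and max (2, 9), while the whole-prefix result is min (1, 9) and max (2, 0), which are keys that really exist.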

Fixes #4442

Tests: unit(dev), cql_query_test -t test_clustering_filtering* (debug)
md migration_test dtest from git@github.com:bhalevy/scylla-dtest.git migration_test-md-v1
"

* tag 'md-format-v4' of github.com:bhalevy/scylla: (27 commits)
  config: enable_sstables_md_format by default
  test: cql_query_test: add test_clustering_filtering unit tests
  table: filter_sstable_for_reader: allow clustering filtering md-format sstables
  table: create_single_key_sstable_reader: emit partition_start/end for empty filtered results
  table: filter_sstable_for_reader: adjust to md-format
  table: filter_sstable_for_reader: include non-scylla sstables with tombstones
  table: filter_sstable_for_reader: do not filter if static column is requested
  table: filter_sstable_for_reader: refactor clustering filtering conditional expression
  features: add MD_SSTABLE_FORMAT cluster feature
  config: add enable_sstables_md_format
  database: add set_format_by_config
  test: sstable_3_x_test: test both mc and md versions
  test: Add support for the "md" format
  sstables: mx/writer: use version from sstable for write calls
  sstables: mx/writer: update_min_max_components for partition tombstone
  sstables: metadata_collector: support min_max_components for range tombstones
  sstable: validate_min_max_metadata: drop outdated logic
  sstables: rename mc folder to mx
  sstables: may_contain_rows: always true for old formats
  sstables: add may_contain_rows
  ...
2020-08-11 13:29:11 +03:00
Piotr Jastrzebski
80e3923b3c codebase wide: replace find(...) != end() with contains
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the code pattern looked like:

<collection>.find(<element>) != <collection>.end()

In C++20 the same can be expressed with:

<collection>.contains(<element>)

This is not only more concise but also expresses the intent of the code
more clearly.

This commit replaces all the occurrences of the old pattern with the new
approach.

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f001bbc356224f0c38f06ee2a90fb60a6e8e1980.1597132302.git.piotr@scylladb.com>
2020-08-11 13:28:50 +03:00
Avi Kivity
55cf219c97 Merge "sstable: close files on error" from Benny
"
Make sure to close sstable files also on error paths.

Refs #5509
Fixes #6448

Tests: unit (dev)
"

* tag 'sstable-close-files-on-error-v6' of github.com:bhalevy/scylla:
  sstable: file_writer: auto-close in destructor
  sstable: file_writer: add optional filename member
  sstable: add make_component_file_writer
  sstable: remove_by_toc_name: accept std::string_view
  sstable: remove_by_toc_name: always close file and input stream
  sstable: delete_sstables: delete outdated FIXME comment
  sstable: remove_by_toc_name: drop error_handler parameter
  sstable: remove_by_toc_name: make static
  sstable: read_toc: always close file
  sstable: mark read_toc and methods calling it noexcept
  sstable: read_toc: get rid of file_path
  sstable: open_data, create_data: set member only on success.
  sstable: open_file: mark as noexcept
  sstable: new_sstable_component_file: make noexcept
  sstable: new_sstable_component_file: close file on failure
  sstable: rename_new_sstable_component_file: do not pass file
  sstable: open_sstable_component_file_non_checked: mark as noexcept
  sstable: open_integrity_checked_file_dma: make noexcept
  sstable: open_integrity_checked_file_dma: close file on failure
2020-08-11 13:28:50 +03:00
Pekka Enberg
4a02e0c3c0 docs/protocol: Document CQL protocol port configuration options 2020-08-11 13:15:24 +03:00
Pekka Enberg
e401a26701 transport: Allow user to disable unencrypted native transport
Let users disable the unencrypted native transport too by setting the port to
zero in the scylla.yaml configuration file.

Fixes #6997
2020-08-11 13:15:17 +03:00
Asias He
97d47bffa5 repair: Add progress metrics for repair ops
The following metric is added:

scylla_node_maintenance_operations_repair_finished_percentage{shard="0",type="gauge"} 0.650000

It is the finished percentage across all ongoing repair operations.

When all ongoing repair operations finish, the percentage stays at 100%.

Fixes #1244, #6733
2020-08-11 18:15:10 +08:00
Botond Dénes
b11d181413 scylla-gdb.py: restore python2 compatibility
Although python2 should be a distant memory by now, the reality is that
we still need to debug scylla on platforms that still have no python3
available (centos7), so we need to keep scylla-gdb.py python2
compatible.

Refs: #7014
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200811093753.1567689-1-bdenes@scylladb.com>
2020-08-11 12:55:42 +03:00
Nadav Har'El
796ad24f37 docs: correct typo in maintainers.md
maintainers.md contains a very helpful explanation of how to backport
Seastar fixes to old branches of Scylla, but has a tiny typo, which
this patch corrects.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200811095350.77146-1-nyh@scylladb.com>
2020-08-11 12:54:41 +03:00
Takuya ASADA
6fbbe836c1 scylla_raid_setup: use mdadm.service on older Debian variants
Older Debian variants do not have mdmonitor.service, so we should use
mdadm.service instead.

Fixes #7000
2020-08-11 12:52:24 +03:00
Calle Wilund
a6ad70d3da cdc:stream_id: Encode format version + vnode grouping/index in id
Fixes #6948

Changes the stream_id format from
 <token:64>:<rand:64>
to
 <token:64>:<rand:38><index:22><version:4>

The code will attempt to assert version match when
presented with a stored id (i.e. construct from bytes).
This means that ID:s created by previous (experimental)
versions will break.
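
The second 64-bit word of the new format can be packed and unpacked as in the sketch below. The field widths come from the layout above. The bit order is an assumption (fields packed most-significant first, in the order written, so the version sits in the lowest 4 bits), and all names are hypothetical, not the ID class API:

```cpp
#include <cstdint>

// Field widths from the layout above: <rand:38><index:22><version:4>.
constexpr unsigned rand_bits = 38;
constexpr unsigned index_bits = 22;
constexpr unsigned version_bits = 4;
static_assert(rand_bits + index_bits + version_bits == 64, "fields must fill the word");

constexpr uint64_t mask(unsigned bits) { return (uint64_t(1) << bits) - 1; }

struct low_word_fields {
    uint64_t rand;
    uint32_t index;
    uint8_t version;
};

// ASSUMPTION: rand occupies the top 38 bits, index the next 22 and
// version the lowest 4. Values wider than their field are truncated.
constexpr uint64_t pack_low_word(uint64_t rand, uint32_t index, uint8_t version) {
    return ((rand & mask(rand_bits)) << (index_bits + version_bits))
         | ((uint64_t(index) & mask(index_bits)) << version_bits)
         | (uint64_t(version) & mask(version_bits));
}

constexpr low_word_fields unpack_low_word(uint64_t w) {
    return { w >> (index_bits + version_bits),
             uint32_t((w >> version_bits) & mask(index_bits)),
             uint8_t(w & mask(version_bits)) };
}
```

A reader of a stored id could then compare the unpacked version with the version it understands before trusting the other fields, which is the kind of check described above.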

Moves the ID encoding fully into the ID class, and makes
the code path private for the topology generation code
path.

Removes some superfluous accessors but adds accessors for
token, version and index. (For alternator etc).
2020-08-11 12:48:04 +03:00
Calle Wilund
9167d1ac76 commitlog: Demote buffer write log messages to trace
Because they become very plentiful and annoying when
one tries to analyze segment behaviour. More so in
batch mode.
2020-08-11 09:18:23 +00:00
Piotr Sarna
3b8fd11fa3 database: remove unused semaphore
A semaphore for limiting the number of loaded sstables is completely
unused, so it can be removed.
2020-08-11 09:48:12 +02:00
Asias He
53fee789f0 repair: Use merge_to_gently to merge two lists
During a performance test (test_latency_read_with_nemesis, during manager
repair), a stall of 73 ms was observed:

```
 (inlined by) std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >::operator=(repair_row const&) at /usr/include/c++/9/bits/stl_iterator.h:515
 (inlined by) std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > std::__copy_move<false, false, std::bidirectional_iterator_tag>::__copy_m<std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > >(std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >) at /usr/include/c++/9/bits/stl_algobase.h:312
 (inlined by) std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > std::__copy_move_a<false, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > >(std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >) at /usr/include/c++/9/bits/stl_algobase.h:404
 (inlined by) std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > std::__copy_move_a2<false, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > >(std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >) at /usr/include/c++/9/bits/stl_algobase.h:440
 (inlined by) std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > std::copy<std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > >(std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >) at /usr/include/c++/9/bits/stl_algobase.h:474
 (inlined by) std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > std::__merge<std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >, __gnu_cxx::__ops::_Iter_comp_iter<repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int)::{lambda(repair_row const&, repair_row const&)#1}> >(std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >, std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, __gnu_cxx::__ops::_Iter_comp_iter<repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int)::{lambda(repair_row const&, repair_row const&)#1}>, __gnu_cxx::__ops::_Iter_comp_iter<repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int)::{lambda(repair_row const&, repair_row const&)#1}>) at /usr/include/c++/9/bits/stl_algo.h:4923
 (inlined by) std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > > std::merge<std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >, repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int)::{lambda(repair_row const&, repair_row const&)#1}>(std::_List_iterator<repair_row>, std::back_insert_iterator<std::__cxx11::list<repair_row, std::allocator<repair_row> > >, std::_List_iterator<repair_row>, std::_List_iterator<repair_row>, repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int)::{lambda(repair_row const&, repair_row const&)#1}, repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int)::{lambda(repair_row const&, repair_row const&)#1}) at /usr/include/c++/9/bits/stl_algo.h:5018
 (inlined by) repair_meta::apply_rows_on_master_in_thread(std::__cxx11::list<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >, gms::inet_address, seastar::bool_class<update_working_row_buf_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, unsigned int) at ./repair/row_level.cc:1242
repair_meta::get_row_diff_source_op(seastar::bool_class<update_peer_row_hash_sets_tag>, gms::inet_address, unsigned int, seastar::rpc::sink<repair_hash_with_cmd>&, seastar::rpc::source<repair_row_on_wire_with_cmd>&) at ./repair/row_level.cc:1608
repair_meta::get_row_diff_with_rpc_stream(std::unordered_set<repair_hash, std::hash<repair_hash>, std::equal_to<repair_hash>, std::allocator<repair_hash> >, seastar::bool_class<needs_all_rows_tag>, seastar::bool_class<update_peer_row_hash_sets_tag>, gms::inet_address, unsigned int) at ./repair/row_level.cc:1674
row_level_repair::get_missing_rows_from_follower_nodes(repair_meta&) at ./repair/row_level.cc:2413
```

The problem was that when std::merge() ran out of one range, it copied the rest of the other range in a single std::copy call, without yielding.

To fix, use the new merge_to_gently helper.

Fixes #6976
2020-08-11 10:37:34 +08:00
Asias He
0bf0019eeb utils: Add merge_to_gently
This helper is similar to std::merge but it runs inside a thread and
does not stall.
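
As a rough illustration of the idea (not the actual helper), a merge that yields after every step could look like the sketch below. The name `merge_gently_sketch` and the `yield_fn` callback are hypothetical; the callback stands in for the yield point a Seastar thread would provide.

```cpp
#include <functional>
#include <list>

// Hypothetical yield hook. In a Seastar thread the equivalent point would let
// other tasks run; here it is a plain callback so the sketch is self-contained.
using yield_fn = std::function<void()>;

// Merge the sorted list `src` into the sorted list `dst` one node at a time.
// The yield hook is called after every step, including while the tail of
// `src` is drained, so no long uninterrupted copy happens.
template <typename T, typename Compare>
void merge_gently_sketch(std::list<T>& dst, std::list<T>& src, Compare cmp,
                         const yield_fn& maybe_yield) {
    auto it = dst.begin();
    while (!src.empty()) {
        // Skip the elements of dst that do not sort after the head of src.
        while (it != dst.end() && !cmp(src.front(), *it)) {
            ++it;
            maybe_yield();
        }
        // Move a single node in front of `it`; splice neither copies nor allocates.
        dst.splice(it, src, src.begin());
        maybe_yield();
    }
}
```

Elements that compare equal keep the `dst` element first, which is the same ordering guarantee std::merge gives for its first range.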

Refs #6976
2020-08-11 10:37:34 +08:00
Benny Halevy
e2340d0684 config: enable_sstables_md_format by default
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 19:19:32 +03:00
Benny Halevy
0d85ceaf37 test: cql_query_test: add test_clustering_filtering unit tests
Add unit tests reproducing https://github.com/scylladb/scylla/issues/3552
with clustering-key filtering enabled.

enable_sstables_md_format option is set to true as clustering-key
filtering is enabled only for md-format sstables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 19:19:32 +03:00
Benny Halevy
7cfca519cb table: filter_sstable_for_reader: allow clustering filtering md-format sstables
Now that it is safe to filter md format sstable by min/max column names
we can remove the `filtering_broken` variable that disabled filtering
in 19b76bf75b to fix #4442.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 19:19:32 +03:00
Benny Halevy
ab67629ea6 table: create_single_key_sstable_reader: emit partition_start/end for empty filtered results
To prevent https://github.com/scylladb/scylla/issues/3552
we want to ensure that in any case that the partition exists in any
sstable, we emit partition_start/end, even when returning no rows.

In the first filtering pass, filter_sstable_for_reader_by_pk filters
the input sstables based on the partition key, and num_sstables is set to the size
of the sstables list after the first filtering pass.

An empty sstables list at this stage means there are indeed no sstables
with the required partition so returning an empty result will leave the
cache in the desired state.

Otherwise, we filter again, using filter_sstable_for_reader_by_ck,
and examine the list of the remaining readers.

If num_readers != num_sstables, we know that
some sstables were filtered by clustering key, so
we append a flat_mutation_reader_from_mutations to
the list of readers and return a combined reader as before.
This will ensure that we will always have a partition_start/end
mutations for the queried partition, even if the filtered
readers emit no rows.
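
The two filtering passes can be sketched as follows. This is a simplified model, not the table code: `sstable_model`, `reader_plan` and `plan_single_key_read` are hypothetical names, clustering keys are reduced to integers, and each sstable carries one inclusive clustering range.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, heavily simplified sstable: the partition keys it holds and
// one inclusive clustering range taken from its metadata.
struct sstable_model {
    std::vector<std::string> partition_keys;
    int min_ck;
    int max_ck;
};

struct reader_plan {
    std::size_t sstable_readers = 0;          // sstables that survived both passes
    bool add_empty_partition_reader = false;  // extra reader emitting only partition_start/end
};

reader_plan plan_single_key_read(const std::vector<sstable_model>& sstables,
                                 const std::string& pk, int ck_lo, int ck_hi) {
    // Pass 1: keep only the sstables that contain the partition key.
    std::vector<const sstable_model*> by_pk;
    for (const auto& sst : sstables) {
        const auto& keys = sst.partition_keys;
        if (std::find(keys.begin(), keys.end(), pk) != keys.end()) {
            by_pk.push_back(&sst);
        }
    }
    const std::size_t num_sstables = by_pk.size();
    reader_plan plan;
    if (num_sstables == 0) {
        return plan;  // the partition exists nowhere: an empty result is correct
    }
    // Pass 2: keep only the sstables whose clustering range overlaps the query.
    for (const sstable_model* sst : by_pk) {
        if (sst->min_ck <= ck_hi && ck_lo <= sst->max_ck) {
            ++plan.sstable_readers;
        }
    }
    // Something was dropped by clustering key, so the partition does exist and
    // partition_start/end must still be emitted.
    plan.add_empty_partition_reader = (plan.sstable_readers != num_sstables);
    return plan;
}
```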

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 19:19:32 +03:00
Benny Halevy
a672747da3 table: filter_sstable_for_reader: adjust to md-format
With the md sstable format, min/max column names in the metadata now
track clustering rows (with or without row tombstones),
range tombstones, and partition tombstones (that are
reflected with empty min/max column names - indicating
the full range).

As such, min and max column names may be of different lengths
due to range tombstones and potentially short clustering key
prefixes with compact storage, so the current matching algorithm
must be changed to take this into account.

To determine if a slice range overlaps the min/max range
we are using position_range::overlaps.

sstable::clustering_components_ranges was renamed to position_range
as it now holds a single position_range rather than a vector of bytes_view ranges.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 19:19:30 +03:00
Benny Halevy
90d0fea7df table: filter_sstable_for_reader: include non-scylla sstables with tombstones
Move contains_rows from table code to sstable::may_contain_rows
since its implementation now has too specific knowledge of sstable
internals.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
2a57ec8c3d table: filter_sstable_for_reader: do not filter if static column is requested
Static rows aren't reflected in the sstable min/max clustering keys metadata.
Since we don't have any indication in the metadata that the sstable stores
static rows, we must read all sstables if a static column is requested.

Refs #3553

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
2fed3f472c table: filter_sstable_for_reader: refactor clustering filtering conditional expression
We're about to drop `filtering_broken` in a future patch
when clustering filtering can be supported for md-format sstables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
e8d7744040 features: add MD_SSTABLE_FORMAT cluster feature
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
65239a6e50 config: add enable_sstables_md_format
MD format is disabled by default at this point.

The option extends enable_sstables_mc_format
so that both need to be set to support
the md format.

The MD_FORMAT cluster feature will be added in
a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
8e0e2c8a48 database: add set_format_by_config
This is required for test applications that may select a sstable
format different than the default mc format, like perf_fast_forward.

These apps don't use the gossip-based sstables_format_selector
to set the format based on the cluster feature and so they
need to rely on the db config.

Call set_format_by_config in single_node_cql_env::do_with.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
d77ceba498 test: sstable_3_x_test: test both mc and md versions
Run the test cases that write sstables using both the
mc and md versions.  Note that we can still compare the
resulting Data, Index, Digest, and Filter components
with the prepared mc sstables we have since these
haven't changed in md.

We take special consideration around validating
min/max column names that are now calculated using
a revised algorithm in the md format.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Pekka Enberg
3168be3483 test: Add support for the "md" format
Test also the md format in all_sstable_versions.
Add pre-computed md-sstable files generated using Cassandra version 3.11.7

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
e44ec45ab9 sstables: mx/writer: use version from sstable for write calls
Rather than using a constant sstable_version_types::mc.
In preparation for supporting sstable_version_types::md.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
bd4383a842 sstables: mx/writer: update_min_max_components for partition tombstone
Partition tombstones represent an implicit clustering range
that is unbound on both sides, so reflect that in min/max
column names metadata using empty clustering key prefixes.

If we don't do that, when using the sstable for filtering, we have no
other way of distinguishing range tombstones from partition tombstones
given the sstable metadata and we would need to include any sstable
with tombstones, even if those are range tombstones, for which
we can do a better filtering job, using the sstable min/max
column names metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
68acae5873 sstables: metadata_collector: support min_max_components for range tombstones
We essentially treat min/max column names as range bounds
with min as incl_start and max as incl_end.

By generating a bound_view for min/max column names on the fly,
we can correctly track and compare also short clustering
key prefixes that may be used as bounds for range tombstones.

Extend the sstable_tombstone_metadata_check unit test
to cover these cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
34fb95dacf sstable: validate_min_max_metadata: drop outdated logic
The following checks were introduced in 0a5af61176
to deal with a bug in min max metadata generation of our own,
from a time when only ka / la were supported.

This is no longer relevant now that we'll consider min_max_column_names
only for sstable format > mc (in sstable::may_contain_rows)

We choose not to clear_incorrect_min_max_column_names
from older versions here as this disturbs sstable unit tests.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
12393c5ec2 sstables: rename mc folder to mx
Prepare for supporting the md format.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
7139fb92e6 sstables: may_contain_rows: always true for old formats
The min/max column names metadata can be trusted only
starting with the md format, so just always return `true`
for older sstable formats.
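
A minimal sketch of that rule, assuming the version enum is declared oldest-first so that `<` means "older than". The type and function names are hypothetical and clustering keys are reduced to integers:

```cpp
// Assumption: enumerators are declared in format age order (oldest first).
enum class sstable_version { ka, la, mc, md };

// Hypothetical stand-in for the sstable metadata relevant to filtering.
struct sstable_meta_model {
    sstable_version version;
    int min_ck;  // inclusive lower clustering bound from metadata
    int max_ck;  // inclusive upper clustering bound from metadata
};

// Returns false only when the metadata proves that no row in [lo, hi] exists.
bool may_contain_rows_sketch(const sstable_meta_model& sst, int lo, int hi) {
    if (sst.version < sstable_version::md) {
        return true;  // min/max metadata is not trusted before md: never filter
    }
    return sst.min_ck <= hi && lo <= sst.max_ck;
}
```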

Note that we could achieve that by clearing the min/max
metadata in set_clustering_components_ranges but we choose
not to do so since it disturbs sstable unit tests.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
200d8d41d9 sstables: add may_contain_rows
Move the logic from table to sstable as it will contain
intimate knowledge of the sstable min/max column names validity
for md format.

Also, get rid of the sstable::clustering_components_ranges() method
as the member is used only internally by the sstable code now.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Pekka Enberg
a37eaaa022 sstables: Add support for the "md" format enum value
Add the sstable_version_types::md enum value
and logically extend sstable_version_types comparisons to cover
also the > sstable_version_types::mc cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
7de004d42a sstables: version: delete unused is_latest_supported predicate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
025b74e20e sstables: metadata_collector: use empty key to represent full min/max range
Instead of keeping the `_has_min_max_clustering_keys` flag,
just store an empty key for `_{min,max}_clustering_key` to represent
the full range.  These will never be narrowed down and will be
encoded as empty min/max column names as if they weren't set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
9f114d821a sstables: keep whole clustering_key_prefix as min/max_column_names
Currently we compare each min/max component independently.
This may lead to suboptimal, inclusive clustering ranges
that do not indicate any actual key we encountered.

For example: ['a', 2], ['b', 1] will lead to min=['a', 1], max=['b', 2]
instead of the keys themselves.

This change keeps the min or max keys as a whole.

It considers shorter clustering prefixes (that are possible with compact
storage) as range tombstone bounds, so that a shorter key is considered
less than the minimum if the latter has a common prefix, and greater
than the maximum if the latter has a common prefix.

Extend the min_max_clustering_key_test to test for this case.
Previously {"a", "2"}, {"b", "1"} clustering keys would erroneously
end up with min={"a", "1"} max={"b", "2"} while we want them to be
min={"a", "2"} max={"b", "1"}.
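
The difference between the two approaches can be reproduced with plain vectors of strings. This is only an illustration: the function names are made up, all keys are assumed to have the same number of components, and the short-prefix handling described above is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using key = std::vector<std::string>;

// Old behaviour: track the minimum and maximum of every component separately.
// The result may be a key that was never written.
std::pair<key, key> component_wise_min_max(const std::vector<key>& keys) {
    assert(!keys.empty());
    key mn = keys.front();
    key mx = keys.front();
    for (const auto& k : keys) {
        for (std::size_t i = 0; i < k.size() && i < mn.size(); ++i) {
            mn[i] = std::min(mn[i], k[i]);
            mx[i] = std::max(mx[i], k[i]);
        }
    }
    return {mn, mx};
}

// New behaviour: compare whole keys lexicographically and keep the smallest
// and largest key that were actually seen.
std::pair<key, key> whole_key_min_max(const std::vector<key>& keys) {
    assert(!keys.empty());
    auto mm = std::minmax_element(keys.begin(), keys.end());
    return {*mm.first, *mm.second};
}
```

For the keys {"a", "2"} and {"b", "1"} the first function yields min={"a", "1"}, max={"b", "2"}, while the second returns the two keys themselves.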

Adjust sstable_3_x_test to ignore original mc sstables that were
previously computed with different min/max column names.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:03 +03:00
Benny Halevy
707b098f44 sstables: metadata_collector: construct with schema
Pass the sstable schema to the metadata_collector constructor.

Note that the long term plan is to move metadata_collector
to the sstable writer but this requires a bigger change to
get rid of the dependencies on it in the legacy writer code
in class sstable methods.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:52:43 +03:00
Benny Halevy
c9cade833c sstables: metadata_collector: make only for write path
make a metadata_collector only when writing the sstable,
no need to make one when reading.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:51:12 +03:00
Rafael Ávila de Espíndola
74db08165d tests: Convert to using memory::with_allocation_failures
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200805155143.122396-1-espindola@scylladb.com>
2020-08-10 18:37:42 +03:00
Piotr Jastrzebski
52ec0c683e codebase wide: replace erase + remove_if with erase_if
C++20 introduced std::erase_if which simplifies removal of elements
from the collection. Previously the code pattern looked like:

<collection>.erase(
        std::remove_if(<collection>.begin(), <collection>.end(), <predicate>),
        <collection>.end());

In C++20 the same can be expressed with:

std::erase_if(<collection>, <predicate>);

This commit replaces all the occurrences of the old pattern with the new
approach.
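
A small self-contained comparison of the two patterns on a std::vector<int>. The function names are illustrative only. The preprocessor guard is an addition so that the sketch also builds with a pre-C++20 compiler, in which case the second function falls back to the old idiom.

```cpp
#include <algorithm>
#include <vector>

// Old pattern: remove_if moves the kept elements to the front and returns the
// new logical end; erase then trims the leftover tail.
void drop_negatives_old(std::vector<int>& v) {
    v.erase(std::remove_if(v.begin(), v.end(), [](int x) { return x < 0; }),
            v.end());
}

// New pattern: a single call to std::erase_if when the library provides it.
void drop_negatives_new(std::vector<int>& v) {
#if defined(__cpp_lib_erase_if)
    std::erase_if(v, [](int x) { return x < 0; });
#else
    drop_negatives_old(v);  // fallback for standard libraries without erase_if
#endif
}
```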

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6ffcace5cce79793ca6bd65c61dc86e6297233fd.1597064990.git.piotr@scylladb.com>
2020-08-10 18:17:38 +03:00
Calle Wilund
9620755c7f database: Do not assert on replay positions if truncate does not flush
Fixes #6995

In c2c6c71 the assert on replay positions in flushed sstables discarded by
truncate was broken, by the fact that we no longer flush all sstables
unless auto snapshot is enabled.

This means the low_mark assertion does not hold, because we maybe/probably
never got around to creating the sstables that would hold said mark.

Note that the (old) change to not create sstables and then just delete
them is in itself good. But in that case we should not try to verify
the rp mark.
2020-08-10 18:17:38 +03:00
Avi Kivity
f9aea94c5c Merge 'add out of box configs for GCP VMs with nvmes' from Lubos
"
not recommended setups will still run iotune
fixes #6631
"

* tarzanek-gcp-iosetup:
  scylla_io_setup: Supported GCP VMs with NVMEs get out of box I/O configs
  scylla_util.py: add support for gcp instances
  scylla_util.py: support http headers in curl function
  scylla_io_setup: refactor iotune run to a function
2020-08-10 18:17:38 +03:00
Avi Kivity
188c832e3d Merge 'scylla_swap_setup improvement' from Takuya
"
As I described at https://github.com/scylladb/scylla/issues/6973#issuecomment-669705374,
we need to avoid filling the disk in scylla_swap_setup; we should also provide manual configuration of the swap size.

This PR provides the following:
 - use psutil to get memtotal and disk free, since it provides better APIs
 - calculate swap size in bytes to avoid errors in low-memory environments
 - prevent using more than 50% of disk space when the swap size is auto-configured, and abort setup when almost no disk space is available (less than 2GB)
 - add --swap-size option to specify swap size both on scylla_swap_setup and scylla_setup
 - add interactive swap size prompt on scylla_setup

Fixes #6947
Related #6973
Related scylladb/scylla-machine-image#48
"

* syuu1228-scylla_swap_setup_improves:
  scylla_setup: add swap size interactive prompt on swap setup
  scylla_swap_setup: add --swap-size option to specify swap size
  scylla_swap_setup: limit swapfile size to half of diskspace
  scylla_swap_setup: calculate in bytes instead of GB
  scylla_swap_setup: use psutil to get memtotal and disk free
2020-08-10 18:17:38 +03:00
Botond Dénes
1e7cf27f1f scylla-gdb.py: scylla find: add option to include freed objects
Sometimes (e.g. when investigating a suspected heap use-after-free) it
is useful to include dead objects in the search results. This patch adds
a new option to scylla find to enable just that. Also make sure we
save and print the offset of the address in the object for dead
objects, just like we do for live ones.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200810091202.1401231-1-bdenes@scylladb.com>
2020-08-10 18:17:38 +03:00
Takuya ASADA
48944adc72 scylla_setup: add swap size interactive prompt on swap setup
Fixes #6947
2020-08-10 13:53:20 +03:00
Takuya ASADA
e9f688b0e8 scylla_swap_setup: add --swap-size option to specify swap size
Add --swap-size option to allow user to customize swap size.
2020-08-10 13:53:20 +03:00
Takuya ASADA
1fa0886ac0 scylla_swap_setup: limit swapfile size to half of diskspace
We should not fill the entire disk with the swapfile; it's safer to limit
the swap size to 50% of diskspace.
Also, if 50% of diskspace <= 1GB, abort setup since there is not enough disk space.
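
The rule can be sketched as below. The real script is Python; this is only a model of the arithmetic, with hypothetical names. It assumes "1GB" means 2^30 bytes and takes the desired swap size as an input, since how that size is chosen is not described here.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Assumption: "1GB" in the commit message is read as 2^30 bytes.
constexpr std::uint64_t GiB = 1ull << 30;

// Returns the swap size to use, in bytes, or std::nullopt when setup should
// abort because half of the free disk space is 1 GiB or less.
std::optional<std::uint64_t> cap_swap_size(std::uint64_t wanted_bytes,
                                           std::uint64_t disk_free_bytes) {
    const std::uint64_t half_disk = disk_free_bytes / 2;
    if (half_disk <= GiB) {
        return std::nullopt;  // not enough disk space for a swapfile
    }
    // Never let the swapfile take more than half of the free disk space.
    return std::min(wanted_bytes, half_disk);
}
```

Working in bytes matters for the small cases: a machine with 1.5 GiB of memory rounds down to 1 when sizes are held as whole gigabytes.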
2020-08-10 13:53:20 +03:00
Takuya ASADA
7f5c8d6553 scylla_swap_setup: calculate in bytes instead of GB
Converting memory & disk sizes to an int value of N gigabytes was too rough;
it became problematic in low-memory / low-disk environments, such as
some types of EC2 instances.
We should calculate in bytes.
2020-08-10 13:53:19 +03:00
Takuya ASADA
b21bed701b scylla_swap_setup: use psutil to get memtotal and disk free
To get a better API for memory & disk statistics, switch to psutil.
With the library we don't need to parse /proc/meminfo.

[avi: regenerate tools/toolchain/image for new python3-psutils package]
2020-08-10 13:50:09 +03:00
Benny Halevy
25c1a16f8e sstables: move column_name_helper to metadata_collector.cc
It is used only for updating the metadata_collector {min,max}_column_names.
Implement metadata_collector::do_update_min_max_components in
sstables/metadata_collector.cc that will be used to host some other
metadata_collector methods in following patches that need not be
implemented in the header file.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 13:27:26 +03:00
Asias He
e3c2d08f4f repair: Add progress metrics for rebuild ops
The following metric is added:

scylla_node_maintenance_operations_rebuild_finished_percentage{shard="0",type="gauge"} 0.650000

It is the finished percentage of the rebuild operation so far.

Fixes #1244, #6733
2020-08-10 15:45:37 +08:00
Asias He
b23f65d1d9 repair: Add progress metrics for bootstrap ops
The following metric is added:

scylla_node_maintenance_operations_bootstrap_finished_percentage{shard="0",type="gauge"} 0.850000

It is the finished percentage of the bootstrap operation so far.

Fixes #1244, #6733
2020-08-10 15:45:37 +08:00
Benny Halevy
60873d2360 sstable: file_writer: auto-close in destructor
Otherwise we may trip the following
    assert(_closing_state == state::closed);
in ~append_challenged_posix_file_impl
when the output_stream is destructed.

Example stack trace:
    non-virtual thunk to seastar::append_challenged_posix_file_impl::~append_challenged_posix_file_impl() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/future.hh:944
    seastar::shared_ptr_count_for<checked_file_impl>::~shared_ptr_count_for() at crtstuff.c:?
    seastar::shared_ptr<seastar::file_impl>::~shared_ptr() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/future.hh:944
     (inlined by) seastar::file::~file() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/include/seastar/core/file.hh:155
     (inlined by) seastar::file_data_sink_impl::~file_data_sink_impl() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/src/core/fstream.cc:312
     (inlined by) seastar::file_data_sink_impl::~file_data_sink_impl() at /jenkins/slave/workspace/scylla-3.2/build/scylla/seastar/src/core/fstream.cc:312
    seastar::output_stream<char>::~output_stream() at crtstuff.c:?
    sstables::sstable::write_crc(sstables::checksum const&) [clone .cold] at sstables.cc:?
    sstables::mc::writer::close_data_writer() at crtstuff.c:?
    sstables::mc::writer::consume_end_of_stream() at crtstuff.c:?
    sstables::sstable::write_components(flat_mutation_reader, unsigned long, seastar::lw_shared_ptr<schema const>, sstables::sstable_writer_config const&, encoding_stats, seastar::io_priority_class const&)::{lambda()#1}::operator()() at sstables.cc:?
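
The auto-close idea, reduced to a plain RAII sketch. `file_writer_sketch` is a hypothetical class, not the sstable `file_writer`: closing is modelled by appending to a log, and no Seastar types are involved.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical writer that records its close() calls in a log supplied by the caller.
class file_writer_sketch {
    std::string _name;
    std::vector<std::string>* _log;
    bool _closed = false;
public:
    file_writer_sketch(std::string name, std::vector<std::string>& log)
        : _name(std::move(name)), _log(&log) {}
    file_writer_sketch(const file_writer_sketch&) = delete;
    file_writer_sketch& operator=(const file_writer_sketch&) = delete;

    // Idempotent: an explicit close followed by the destructor closes only once.
    void close() {
        if (_closed) {
            return;
        }
        _closed = true;
        _log->push_back("closed " + _name);
    }

    // Auto-close on destruction so an early exit cannot leave the stream open.
    // Errors are swallowed because a destructor must not throw.
    ~file_writer_sketch() {
        try {
            close();
        } catch (...) {
        }
    }
};
```

With this shape, a writer that goes out of scope during exception unwinding is still closed before the underlying file object is destroyed.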

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 13:58:13 +03:00
Benny Halevy
d277ec2ab9 sstable: file_writer: add optional filename member
To be used for reporting errors when failing to close the output stream.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 13:58:13 +03:00
Benny Halevy
c5feeb7723 sstable: add make_component_file_writer
Unify common code for file creation and file_writer construction
for sstable components.

It is defined as noexcept based on `new_sstable_component_file`
and makes sure the file is closed on error by using `file_writer::make`
that guarantees that.

Will be used for auto-closing the writer as a RAII object.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 13:58:13 +03:00
Benny Halevy
34bf9ae5ed sstable: remove_by_toc_name: accept std::string_view
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 13:57:48 +03:00
Benny Halevy
aacf69358a sstable: remove_by_toc_name: always close file and input stream
Get rid of seastar::async.
Use seastar::with_file to make sure the opened file is always closed
and move in.close() into a .finally continuation.

While at it, make remove_by_toc_name noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 13:54:15 +03:00
Avi Kivity
8e87c52747 Update seastar submodule
* seastar eb452a22a0...e615054c75 (13):
  > memory: fix small aligned free memory corruption
Fixes #6831
  > logger: specify log methods as noexcept
  > logger: mark trivial methods as noexcept
  > everywhere: Replace boost::optional with std::optional
  > httpd: fix Expect header case sensitivity
  > rpc: Avoid excessive casting in has_handler() helper
  > test: Increase capacity of fair-queue unit test case
  > file_io_test: Simplify a test with memory::with_allocation_failures
  > Merge 'Add HTTP/1.1 100 Continue' from Wojciech
Fixes #6844
  > future: Use "if constexpr"
  > future: Drop code that was avoiding a gcc 8 warning
  > file: specify alignment get methods noexcept
  > merge: Specify abort_source subscription handlers as noexcept
2020-08-09 13:17:22 +03:00
Piotr Sarna
29e2dc242a row_cache: add tracing
In order to improve tracing for the read path, cache is now
also actively adding basic trace information.
Example:
select * from t where token(p) >= 42 and token(p) < 112;
 activity                                                                                | timestamp                  | source    | source_elapsed | client
-----------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                                                      Execute CQL3 query | 2020-08-07 13:10:34.694000 | 127.0.0.1 |              0 | 127.0.0.1
                                                           Parsing a statement [shard 0] | 2020-08-07 13:10:34.694307 | 127.0.0.1 |             -- | 127.0.0.1
                                                        Processing a statement [shard 0] | 2020-08-07 13:10:34.694377 | 127.0.0.1 |             70 | 127.0.0.1
                                                   read_data: querying locally [shard 0] | 2020-08-07 13:10:34.694425 | 127.0.0.1 |            118 | 127.0.0.1
                        Start querying token range [{42, start}, {112, start}] [shard 0] | 2020-08-07 13:10:34.694432 | 127.0.0.1 |            125 | 127.0.0.1
                                             Creating shard reader on shard: 0 [shard 0] | 2020-08-07 13:10:34.694446 | 127.0.0.1 |            139 | 127.0.0.1
 Scanning cache for range [{42, start}, {112, start}] and slice {(-inf, +inf)} [shard 0] | 2020-08-07 13:10:34.694454 | 127.0.0.1 |            147 | 127.0.0.1
                                                              Querying is done [shard 0] | 2020-08-07 13:10:34.694494 | 127.0.0.1 |            187 | 127.0.0.1
                                          Done processing - preparing a result [shard 0] | 2020-08-07 13:10:34.694520 | 127.0.0.1 |            213 | 127.0.0.1
                                                                        Request complete | 2020-08-07 13:10:34.694221 | 127.0.0.1 |            221 | 127.0.0.1

Example with cache miss:
select * from t where p = 7;
 activity                                                                                                                                                                          | timestamp                  | source    | source_elapsed | client
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                                                                                                                                                Execute CQL3 query | 2020-08-07 13:25:04.363000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                                                                     Parsing a statement [shard 0] | 2020-08-07 13:25:04.363310 | 127.0.0.1 |             -- | 127.0.0.1
                                                                                                                                                  Processing a statement [shard 0] | 2020-08-07 13:25:04.363384 | 127.0.0.1 |             74 | 127.0.0.1
                                                   Creating read executor for token 1634052884888577606 with all: {127.0.0.1} targets: {127.0.0.1} repair decision: NONE [shard 0] | 2020-08-07 13:25:04.363450 | 127.0.0.1 |            139 | 127.0.0.1
                                                                                                                                             read_data: querying locally [shard 0] | 2020-08-07 13:25:04.363455 | 127.0.0.1 |            145 | 127.0.0.1
                                                                                                 Start querying singular range {{1634052884888577606, pk{000400000007}}} [shard 0] | 2020-08-07 13:25:04.363461 | 127.0.0.1 |            151 | 127.0.0.1
                                                                             Querying cache for range {{1634052884888577606, pk{000400000007}}} and slice {(-inf, +inf)} [shard 0] | 2020-08-07 13:25:04.363490 | 127.0.0.1 |            180 | 127.0.0.1
                                                                                                      Range {{1634052884888577606, pk{000400000007}}} not found in cache [shard 0] | 2020-08-07 13:25:04.363494 | 127.0.0.1 |            183 | 127.0.0.1
          Reading key {{1634052884888577606, pk{000400000007}}} from sstable /home/sarna/.ccm/scylla-1/node1/data/ks/t-f7b7a9b0d89f11eab650000000000000/mc-1-big-Data.db [shard 0] | 2020-08-07 13:25:04.363522 | 127.0.0.1 |            211 | 127.0.0.1
                           /home/sarna/.ccm/scylla-1/node1/data/ks/t-f7b7a9b0d89f11eab650000000000000/mc-1-big-Index.db: scheduling bulk DMA read of size 16 at offset 0 [shard 0] | 2020-08-07 13:25:04.363546 | 127.0.0.1 |            235 | 127.0.0.1
 /home/sarna/.ccm/scylla-1/node1/data/ks/t-f7b7a9b0d89f11eab650000000000000/mc-1-big-Index.db: finished bulk DMA read of size 16 at offset 0, successfully read 16 bytes [shard 0] | 2020-08-07 13:25:04.364406 | 127.0.0.1 |           1095 | 127.0.0.1
                            /home/sarna/.ccm/scylla-1/node1/data/ks/t-f7b7a9b0d89f11eab650000000000000/mc-1-big-Data.db: scheduling bulk DMA read of size 56 at offset 0 [shard 0] | 2020-08-07 13:25:04.364445 | 127.0.0.1 |           1134 | 127.0.0.1
  /home/sarna/.ccm/scylla-1/node1/data/ks/t-f7b7a9b0d89f11eab650000000000000/mc-1-big-Data.db: finished bulk DMA read of size 56 at offset 0, successfully read 56 bytes [shard 0] | 2020-08-07 13:25:04.364599 | 127.0.0.1 |           1288 | 127.0.0.1
                                                                                                                                                        Querying is done [shard 0] | 2020-08-07 13:25:04.364685 | 127.0.0.1 |           1375 | 127.0.0.1
                                                                                                                                    Done processing - preparing a result [shard 0] | 2020-08-07 13:25:04.364719 | 127.0.0.1 |           1408 | 127.0.0.1
                                                                                                                                                                  Request complete | 2020-08-07 13:25:04.364421 | 127.0.0.1 |           1421 | 127.0.0.1
Example without cache for verification:
select * from t where token(p) >= 42 and token(p) < 112 bypass cache;
 activity                                                         | timestamp                  | source    | source_elapsed | client
------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                               Execute CQL3 query | 2020-08-07 13:11:16.122000 | 127.0.0.1 |              0 | 127.0.0.1
                                    Parsing a statement [shard 0] | 2020-08-07 13:11:16.122657 | 127.0.0.1 |             -- | 127.0.0.1
                                 Processing a statement [shard 0] | 2020-08-07 13:11:16.122742 | 127.0.0.1 |             85 | 127.0.0.1
                            read_data: querying locally [shard 0] | 2020-08-07 13:11:16.122806 | 127.0.0.1 |            149 | 127.0.0.1
 Start querying token range [{42, start}, {112, start}] [shard 0] | 2020-08-07 13:11:16.122814 | 127.0.0.1 |            158 | 127.0.0.1
                      Creating shard reader on shard: 0 [shard 0] | 2020-08-07 13:11:16.122829 | 127.0.0.1 |            172 | 127.0.0.1
                                       Querying is done [shard 0] | 2020-08-07 13:11:16.122895 | 127.0.0.1 |            239 | 127.0.0.1
                   Done processing - preparing a result [shard 0] | 2020-08-07 13:11:16.122928 | 127.0.0.1 |            271 | 127.0.0.1
                                                 Request complete | 2020-08-07 13:11:16.122280 | 127.0.0.1 |            280 | 127.0.0.1
Message-Id: <3b31584c13f23f84af35660d0aa73ba56c30cf13.1596799589.git.sarna@scylladb.com>
2020-08-09 12:53:04 +03:00
Piotr Sarna
71bb277cbc multishard_mutation_query: fix a typo in variable name
s/allwoed/allowed
Message-Id: <eedb62b1f13ebf4ab1e6e92642a77fab32379d73.1596799589.git.sarna@scylladb.com>
2020-08-09 12:52:40 +03:00
Piotr Sarna
5e8247fd8c storage_proxy: make tracing more specific wrt. token ranges
Until now, only singular ranges were present in tracing, and, what's
more, their tracing message suggested that the range is not singular:
  Start querying the token range that starts with (...)
This commit makes the message more specific and also provides
a corresponding tracing message to non-singular ranges.
Example for a singular range:
 activity                                                                                                                         | timestamp                  | source    | source_elapsed | client
----------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                                                                                               Execute CQL3 query | 2020-08-07 13:11:55.479000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                    Parsing a statement [shard 0] | 2020-08-07 13:11:55.479616 | 127.0.0.1 |             -- | 127.0.0.1
                                                                                                 Processing a statement [shard 0] | 2020-08-07 13:11:55.479695 | 127.0.0.1 |             80 | 127.0.0.1
 Creating read executor for token -7160136740246525330 with all: {127.0.0.1} targets: {127.0.0.1} repair decision: NONE [shard 0] | 2020-08-07 13:11:55.479747 | 127.0.0.1 |            132 | 127.0.0.1
                                                                                            read_data: querying locally [shard 0] | 2020-08-07 13:11:55.479752 | 127.0.0.1 |            137 | 127.0.0.1
                                               Start querying singular range {{-7160136740246525330, pk{00040000002a}}} [shard 0] | 2020-08-07 13:11:55.479758 | 127.0.0.1 |            143 | 127.0.0.1
                                                                                                       Querying is done [shard 0] | 2020-08-07 13:11:55.479816 | 127.0.0.1 |            201 | 127.0.0.1
                                                                                   Done processing - preparing a result [shard 0] | 2020-08-07 13:11:55.479844 | 127.0.0.1 |            229 | 127.0.0.1
                                                                                                                 Request complete | 2020-08-07 13:11:55.479238 | 127.0.0.1 |            238 | 127.0.0.1

Example for nonsingular range:
 activity                                                   | timestamp                  | source    | source_elapsed | client
------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                         Execute CQL3 query | 2020-08-07 13:13:47.189000 | 127.0.0.1 |              0 | 127.0.0.1
                              Parsing a statement [shard 0] | 2020-08-07 13:13:47.189259 | 127.0.0.1 |             -- | 127.0.0.1
                           Processing a statement [shard 0] | 2020-08-07 13:13:47.189346 | 127.0.0.1 |             87 | 127.0.0.1
                      read_data: querying locally [shard 0] | 2020-08-07 13:13:47.189412 | 127.0.0.1 |            153 | 127.0.0.1
 Start querying token range [{7, end}, {42, end}] [shard 0] | 2020-08-07 13:13:47.189421 | 127.0.0.1 |            162 | 127.0.0.1
                Creating shard reader on shard: 0 [shard 0] | 2020-08-07 13:13:47.189436 | 127.0.0.1 |            177 | 127.0.0.1
                                 Querying is done [shard 0] | 2020-08-07 13:13:47.189495 | 127.0.0.1 |            236 | 127.0.0.1
             Done processing - preparing a result [shard 0] | 2020-08-07 13:13:47.189526 | 127.0.0.1 |            268 | 127.0.0.1
                                           Request complete | 2020-08-07 13:13:47.189276 | 127.0.0.1 |            276 | 127.0.0.1
Message-Id: <82f1a8680fc8383cd7e6c7b283de94e5b71a52ab.1596799589.git.sarna@scylladb.com>
2020-08-09 12:52:08 +03:00
Benny Halevy
c4d023d622 sstable: delete_sstables: delete outdated FIXME comment
delete_sstables is used for replaying pending_delete logs
since 043673b236.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 12:33:58 +03:00
Benny Halevy
d0bb180e53 sstable: remove_by_toc_name: drop error_handler parameter
It's now always called with the default one:
sstable_write_error_handler.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 12:04:36 +03:00
Benny Halevy
78595303f9 sstable: remove_by_toc_name: make static
It's not called outside of sstables code anymore.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 12:04:36 +03:00
Benny Halevy
55826deb05 sstable: read_toc: always close file
Use utils::with_file helper to always close the file
new_sstable_component_file opens.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 12:04:36 +03:00
Benny Halevy
69f7454d88 sstable: mark read_toc and methods calling it noexcept
read_toc can be marked as noexcept now that new_sstable_component_file is.
With that, other methods that call it can be marked noexcept too.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 12:04:36 +03:00
Benny Halevy
3444abcb8f sstable: read_toc: get rid of file_path
In preparation for closing the file in all paths,
get rid of the file_path sstring and just recompute
it as needed on error paths using the this->filename method.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 12:04:36 +03:00
Benny Halevy
c2c18bc708 sstable: open_data, create_data: set member only on success.
Now that sstable::open_file is noexcept,
if any of open or create of data/index doesn't succeed,
we don't set the respective sstable member and return the failure.

When destroyed, the sstable destructor will close any file (data or index),
that we managed to open.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 12:04:36 +03:00
Benny Halevy
803aadf89f sstable: open_file: mark as noexcept
Now that new_sstable_component_file is noexcept,
open_file can be specified as noexcept too.

This makes error handling in create/open sstable
data and index files easier using when_all_succeed().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 12:04:36 +03:00
Benny Halevy
155f06b0d5 sstable: new_sstable_component_file: make noexcept
Try/catch any exception in the function body and return
it as an exceptional future.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 12:04:36 +03:00
Benny Halevy
9544b98787 sstable: new_sstable_component_file: close file on failure
Use with_file_close_on_error to make sure any files we open and/or wrap
are closed on failure.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 12:04:36 +03:00
Benny Halevy
cad5c31141 sstable: rename_new_sstable_component_file: do not pass file
Currently the function is handed over a `file` that it just passes through
on success.  Let its single caller do that to simplify its error handling.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 12:04:36 +03:00
Benny Halevy
881e32d0fe sstable: open_sstable_component_file_non_checked: mark as noexcept
Now that open_integrity_checked_file_dma (and open_file_dma)
are noexcept, open_sstable_component_file_non_checked can be noexcept too.

Also, get a std::string_view name instead of a const sstring&
to match both open_integrity_checked_file_dma and open_file_dma
name arg.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 12:04:36 +03:00
Benny Halevy
472013de27 sstable: open_integrity_checked_file_dma: make noexcept
Convert to accepting std::string_view name.
Move the sstring allocation to make_integrity_checked_file
that may already throw.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 12:04:36 +03:00
Benny Halevy
3ad9503a3f sstable: open_integrity_checked_file_dma: close file on failure
Use seastar::with_file_close_on_failure to make sure the file
we open is closed on failure of the continuation,
as make_integrity_checked_file may throw from ::make_shared.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-09 12:04:36 +03:00
Dejan Mircevski
8cae61ee6b cql3: Move #include from .hh to .cc
restrictions.hh included fmt/ostream.h, which is expensive due to its
transitive #includes.  Replace it with fmt/core.h, which transitively
includes only standard C++ headers.

As requested by #5763 feedback:
https://github.com/scylladb/scylla/pull/5763#discussion_r443210634

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-08-08 21:37:08 +03:00
Dejan Mircevski
df20854963 cql3: Move expressions to their own namespace
Move the classes representing CQL expressions (and utility functions
on them) from the `restrictions` namespace to a new namespace `expr`.

Most of the restriction.hh content was moved verbatim to
expression.hh.  Similarly, all expression-related code was moved from
statement_restrictions.cc verbatim to expression.cc.

As suggested in #5763 feedback
https://github.com/scylladb/scylla/pull/5763#discussion_r443210498

Tests: dev (unit)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-08-08 21:03:26 +03:00
Lubos Kosco
c203b6bb1f scylla_io_setup: Supported GCP VMs with NVMEs get out of box I/O configs
Iotune isn't run for supported/recommended GCP instances anymore,
we set the I/O properties now based on GCP tables or our
measurements (where applicable).
Not recommended/supported setups will still run iotune.

Fixes #6631
2020-08-08 11:05:28 +02:00
Lubos Kosco
125a84d7c5 scylla_util.py: add support for gcp instances
GCP VMs with NVMEs that are recommended can be recognized now.
Detection is done using resolving of internal GCP metadata server.
We recommend 2 cpu instances at least. For more than 16 disks
we mandate at least 32 cpus. 50:1 disk to ram ratio also has to be
kept.
Instances that use NVMEs as root disks are also considered as unsupported.

Supported instances for NVMEs are n1, n2, n2d, c2 and m1-megamem-96.
All others are unsupported now.
2020-08-08 11:04:37 +02:00
Lubos Kosco
97e3ab739b scylla_util.py: support http headers in curl function 2020-08-08 10:58:59 +02:00
Lubos Kosco
0c5dbb4c4f scylla_io_setup: refactor iotune run to a function 2020-08-08 10:57:31 +02:00
Nadav Har'El
f8291500cf alternator test: test for ConditionExpression on key columns
While answering a stackoverflow question on how to create an item but only
if we don't already have an item with the same key, I realized that we never
actually tested that ConditionExpression works on key columns: all the
tests we had in test_condition_expression.py had conditions on non-key
attributes. So in this patch we add two tests with a condition on the key
attribute.

Most examples of conditions on the key attributes would be silly, but
in these two tests we demonstrate how a test on key attributes can be
useful to solve the above need of creating an item if no such item
exists yet. We demonstrate two ways to do this using a condition on
the key - using either the "<>" (not equal) operator, or the
"attribute_not_exists()" function.

These tests pass - we don't have a bug in this. But it's nice to have
a test that confirms that we don't (and don't regress in that area).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200806200322.1568103-1-nyh@scylladb.com>
2020-08-07 08:05:48 +02:00
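The two approaches named in the commit message above can be sketched as parameters for a DynamoDB-style PutItem request. This is only an illustration, not the test code from the patch: the key attribute name `p`, the string key type and both helper names are assumptions.

```python
def put_if_absent_not_exists(table_name, key_value):
    """PutItem parameters that only succeed when no item with this key exists,
    using attribute_not_exists() on the key attribute (hypothetical name 'p')."""
    return {
        "TableName": table_name,
        "Item": {"p": {"S": key_value}},
        "ConditionExpression": "attribute_not_exists(p)",
    }


def put_if_absent_not_equal(table_name, key_value):
    """Same goal, expressed with the "<>" (not equal) operator on the key."""
    return {
        "TableName": table_name,
        "Item": {"p": {"S": key_value}},
        "ConditionExpression": "p <> :p",
        "ExpressionAttributeValues": {":p": {"S": key_value}},
    }
```

Either dictionary could be passed as keyword arguments to a low-level client's `put_item` call; the request is then expected to fail its condition check when an item with the same key is already present.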
Benny Halevy
6d66d5099a ssstable: io_check functions: make noexcept
Now that do_io_check is noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-06 19:38:41 +03:00
Avi Kivity
1072acb215 Update tools/java and tools/jmx submodules
* tools/java aa7898d771...f2c7cf8d8d (2):
  > dist: debian: support non-x86
  > NodeProbe: get all histogram values in a single call

* tools/jmx 626fd75...c5ed831 (1):
  > dist: debian: support non-x86
2020-08-06 19:17:14 +03:00
Avi Kivity
ba4d7e8523 Merge "Teach B+ to use AVX for keys search" from Pavel E
"
The current implementation of B+ can benefit from using SIMD
instructions in intra-node key search. This set adds this
functionality.

The general idea behind the implementation is to "ask"
the less comparator whether it is the plain "<" and allows
key simplification for this natural comparison. If it
does, the search key is simplified to int64_t, the node's
array of keys is cast to an array of integers, and both are
fed into the AVX-optimized searcher.

The searcher should work on nodes that are not filled with
keys. For performance the "unused" keys are set to int64_t
minimum, the search loop compares them too (!) and adjusts
the result index by node size. This needs some care in the
maybe_key{} wrapper.

fixes: #186
tests: unit(dev)
"

* 'br-bptree-avx-b' of https://github.com/xemul/scylla:
  utils: AVX searcher
  bptree: Special intra-node key search when possible
  bptree: Add lesses to maybe_key template
  token: Restrict TokenCarrier concept with noexcept
2020-08-06 19:14:46 +03:00
Tomasz Grabiec
bfd129cffe thrift: Fix crash on unsorted column names in SlicePredicate
The column names in SlicePredicate can be passed in arbitrary order.
We converted them to clustering ranges in read_command preserving the
original order. As a result, the clustering ranges in read command may
appear out of order. This violates storage engine's assumptions and
leads to undefined behavior.

It was seen manifesting as a SIGSEGV or an abort in sstable reader
when executing a get_slice() thrift verb:

scylla: sstables/consumer.hh:476: seastar::future<> data_consumer::continuous_data_consumer<StateProcessor>::fast_forward_to(size_t, size_t) [with StateProcessor = sstables::data_consume_rows_context_m; size_t = long unsigned int]: Assertion `end >= _stream_position.position' failed.

Fixes #6486.

Tests:

   - added a new dtest to thrift_tests.py which reproduces the problem

Message-Id: <1596725657-15802-1-git-send-email-tgrabiec@scylladb.com>
2020-08-06 19:13:22 +03:00
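The commit message above does not say how the ordering is restored. A minimal sketch of one way to do it, assuming that sorting and de-duplicating the requested names before building ranges is enough, and that plain byte-wise ordering stands in for the real column comparator (both function names are hypothetical):

```python
def names_to_sorted_ranges(column_names):
    """Turn requested column names into singular (start, end) ranges,
    sorted and de-duplicated so that consumers see them in order."""
    return [(name, name) for name in sorted(set(column_names))]


def is_ordered(ranges):
    """True if every range ends strictly before the next one starts."""
    return all(prev[1] < nxt[0] for prev, nxt in zip(ranges, ranges[1:]))
```

With input such as `[b"c", b"a", b"b", b"a"]`, the ranges built in the original order fail `is_ordered`, while the sorted ones pass.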
Benny Halevy
e33fc10638 utils: do_io_check: adjust indentation
was broken by the previous patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-06 19:01:18 +03:00
Benny Halevy
fd5b2672c1 utils: io_check: make noexcept for future-returning functions
Use futurize_apply to handle any exception the passed function
may throw.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-06 19:01:17 +03:00
Rafael Ávila de Espíndola
b1315a2120 cql_test_env: Delay starting the compaction manager
In case of an initialization failure after

  db.get_compaction_manager().enable();

But before stop_database, we would never stop the compaction manager
and it would assert during destruction.

I am trying to add a test for this using the memory failure injector,
but that will require fixing other crashes first.

Found while debugging #6831.

Refs #6831.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200805181840.196064-1-espindola@scylladb.com>
2020-08-06 16:07:16 +03:00
Pavel Emelyanov
7c20e3ed05 utils: AVX searcher
With all the preparations made so far it's now possible to implement
the avx-powered search in an array.

The array to search in has both -- capacity and size, so searching in
it needs to take allocated, but unused tail into account. Two options
for that -- limit the number of comparisons "by hand" or keep minimal
and impossible value in this tail, scan "capacity" elements, then
correct the result with "size" value. The latter approach is up to 50%
faster than any (tried) attempt to do the former one.

The run-time selection of the array search code is done with the gnu
target attribute. It's available since gcc 4.8. For AVX-less platforms
the default linear scanner is used.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-06 15:41:31 +03:00
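The second option described above (pad the unused tail with an impossible minimal value, scan all "capacity" slots, then correct by "size") can be modelled in plain Python. This is a scalar sketch of the idea only: it uses no AVX, and the names `make_node`, `count_greater` and `search` are assumptions, not the functions from the patch.

```python
INT64_MIN = -(2**63)


def make_node(keys, capacity):
    """Sorted live keys followed by an unused tail padded with INT64_MIN."""
    assert len(keys) <= capacity
    return sorted(keys) + [INT64_MIN] * (capacity - len(keys))


def count_greater(slots, key):
    """Compare every slot, used or not, with no early exit.
    This loop is the part a SIMD implementation would vectorize."""
    return sum(1 for k in slots if k > key)


def search(slots, size, key):
    """Index of the first live key greater than `key`.
    Padding slots are never greater than any key, so only live keys are
    counted, and subtracting from `size` gives the position."""
    return size - count_greater(slots, key)
```

Because the loop bound is the fixed capacity rather than the live size, the scan has no data-dependent exit, which is what makes the padded variant friendly to vectorization.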
Pavel Emelyanov
35a22ac48a bptree: Special intra-node key search when possible
If the key type is int64_t and the less-comparator is "natural" (i.e. it's
literally 'a < b') we may use the SIMD instructions to search for the key
on a node. Before doing so, the maybe_key and the searcher should be prepared
for that, in particular:

1. maybe_key should set unused keys to the minimal value
2. the searcher for this case should call the gt() helper with
   primitive types -- int64_t search key and array of int64_t values

To tell the B+ code that the key/less pair qualifies, the less-er should define
the simplify_key() method, which converts search keys to int64_t-s.

This searcher is selected automatically; if any mismatch happens it silently
falls back to the default one. Thus also add a static assertion to the row-cache
to mitigate this.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-06 15:41:31 +03:00
Pavel Emelyanov
14f0cdb779 bptree: Add lesses to maybe_key template
The way maybe_key works will be in sync with the intra-node searching
code and will require knowing what the Less type is, so prepare for that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-06 15:41:31 +03:00
Pavel Emelyanov
61d4a73ed0 token: Restrict TokenCarrier concept with noexcept
The <search-key>.token() is noexcept currently and will have
to be explicitly such for future optimized key searcher, so
restrict the constraint and update the related classes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-06 15:41:31 +03:00
Dejan Mircevski
54749112f9 cql3: Rewrite bounds_ck_symmetrically for deep conjunctions
As suggested by #5763 feedback:
https://github.com/scylladb/scylla/pull/5763#discussion_r443214356

Pull found_bounds outside the visit call and apply the visitor
recursively to conjunction children.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-08-05 19:34:18 +03:00
Dejan Mircevski
5e382aaaab cql3: Rename bounded_ck to bounds_ck_symmetrically
As suggested by #6818 feedback:
https://github.com/scylladb/scylla/pull/6818#discussion_r460494026

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-08-05 19:34:18 +03:00
Nadav Har'El
8f12ef3628 alternator test: faster stream tests by reducing number of vnodes
The single-node test Scylla run by test/alternator/run uses, as the
default, 256 vnodes. When we have 256 vnodes and two shards, our CDC
implementation produces 512 separate "streams" (called "shards" in
DynamoDB lingo). This causes each of the tests in test_streams.py
which need to read data from the stream to need to do 1024 (!) API
requests (512 calls to GetShardIterator and 512 calls to GetRecords)
which takes significant time - about a second per test.

In this patch, we reduce the number of vnodes to 16. We still have
a non-negligible number of stream "shards" (32) so this part of the
CDC code is still exercised. Moreover, to ensure we still routinely
test the paging feature of DescribeStream (whose default page size
is 100), the patch changes the request to use a Limit of 10, so
paging will still be used to retrieve the list of 32 shards.

The time to run the 27 tests in test_streams.py, on my laptop:
Before this patch: 26 seconds
After this patch:   6 seconds.

Fixes #6979
Message-Id: <20200805093418.1490305-1-nyh@scylladb.com>
2020-08-05 19:34:18 +03:00
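The request counts quoted in the commit message above follow from simple arithmetic, sketched here with hypothetical helper names. The only inputs are the figures stated in the message: two CPU shards, one GetShardIterator plus one GetRecords call per stream shard, and a DescribeStream page size given by Limit.

```python
import math


def stream_shards(vnodes, cpu_shards):
    """Number of CDC streams (DynamoDB "shards") for a single node."""
    return vnodes * cpu_shards


def read_requests(shards):
    """One GetShardIterator call and one GetRecords call per stream shard."""
    return 2 * shards


def describe_stream_pages(shards, limit):
    """Pages needed to list all stream shards with the given Limit."""
    return math.ceil(shards / limit)
```

With 256 vnodes this gives 512 stream shards and 1024 requests; with 16 vnodes it gives 32 stream shards, which fit in one page at the default Limit of 100 but need 4 pages at a Limit of 10.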
Nadav Har'El
59dff3226b Alternator tests: more tests for Alternator Streams
This patch adds additional tests for Alternator Streams, which helped
uncover 9 new issues.

The stream tests are noticeably slower than most other Alternator tests -
test_streams.py now has 27 tests taking a total of 20 seconds. Much of this
slowness is attributed to Alternator Stream's 512 "shards" per stream in the
single-node test setup with 256 vnodes, meaning that we need over 1000 API
requests per test using GetRecords. These tests could be made significantly
faster (as little as 4 seconds) by setting a lower number of vnodes.
Issue #6979 is about doing this in the future.

The tests in this patch have comments explaining clearly (I hope) what they
test, and also pointing to issues I opened about the problems discovered
through these tests. In particular, the tests reproduce the following bugs:

Refs #6918
Refs #6926
Refs #6930
Refs #6933
Refs #6935
Refs #6939
Refs #6942

The tests also work around the following issues (and can be changed to
be more strict and reproduce these issues):

Refs #6918
Refs #6931

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200804154755.1461309-1-nyh@scylladb.com>
2020-08-05 19:34:18 +03:00
Avi Kivity
e994275f80 Merge "auth: Avoid more global variable initializations" from Rafael
"
This patch series converts a few more global variables from sstring to
constexpr std::string_view.

Doing that makes it impossible for them to be part of any
initialization order problems.
"

* 'espindola/more-constexpr-v2' of https://github.com/espindola/scylla:
  auth: Turn DEFAULT_USER_NAME into a std::string_view variable
  auth: Turn SALTED_HASH into a std::string_view variable
  auth: Turn meta::role_members_table::qualified_name into a std::string_view variable
  auth: Turn meta::roles_table::qualified_name into a std::string_view variable
  auth: Turn password_authenticator_name into a std::string_view variable
  auth: Inline default_authorizer_name into only use
  auth: Turn allow_all_authorizer_name into a std::string_view variable
  auth: Turn allow_all_authenticator_name into a std::string_view variable
2020-08-05 10:54:13 +03:00
Raphael S. Carvalho
f640d71b23 sstables/LCS: Dump # of overlapping SSTables too if reshape is triggered
Refs #6938.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200804142620.54340-1-raphaelsc@scylladb.com>
2020-08-05 10:53:03 +03:00
Rafael Ávila de Espíndola
f98ea77ae8 cql: Mark functions::init noexcept
If initialization of a TLS variable fails there is nothing better to
do than call std::unexpected.

This also adds a disable_failure_guard to avoid errors when using
allocation error injection.

With init() being noexcept, we can also mark clear_functions.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200804180550.96150-1-espindola@scylladb.com>
2020-08-05 10:13:06 +03:00
Rafael Ávila de Espíndola
a4916ce553 auth: Turn DEFAULT_USER_NAME into a std::string_view variable
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-04 16:40:00 -07:00
Rafael Ávila de Espíndola
61de1fe752 auth: Turn SALTED_HASH into a std::string_view variable
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-04 16:40:00 -07:00
Rafael Ávila de Espíndola
f6006dbba8 auth: Turn meta::role_members_table::qualified_name into a std::string_view variable
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-04 16:40:00 -07:00
Rafael Ávila de Espíndola
cb4c3e45d5 auth: Turn meta::roles_table::qualified_name into a std::string_view variable
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-04 16:40:00 -07:00
Rafael Ávila de Espíndola
27c2b3de30 auth: Turn password_authenticator_name into a std::string_view variable
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-04 16:40:00 -07:00
Rafael Ávila de Espíndola
e526ed369b auth: Inline default_authorizer_name into only use
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-04 16:39:57 -07:00
Rafael Ávila de Espíndola
1a11e64f52 auth: Turn allow_all_authorizer_name into a std::string_view variable
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-04 16:38:55 -07:00
Rafael Ávila de Espíndola
0da6807f7e auth: Turn allow_all_authenticator_name into a std::string_view variable
There is no constexpr operator+ for std::string_view, so we have to
concatenate the strings ourselves.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-04 16:38:27 -07:00
Nadav Har'El
db08ff4cbd Additional entries in CODEOWNERS
List a few more code areas, and add or correct paths for existing code areas.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200804175651.1473082-1-nyh@scylladb.com>
2020-08-04 21:03:23 +03:00
Nadav Har'El
936cf4cce0 merge: Increase row limits
Merged pull request https://github.com/scylladb/scylla/pull/6910
by Wojciech Mitros:

This patch enables selecting more than 2^32 rows from a table. The change
becomes active after upgrading whole cluster - until then old limits are
used.

Tested reading 4.5*10^9 rows from a virtual table, manually upgrading a
cluster with ccm and performing cql SELECT queries during the upgrade,
ran unit tests in dev mode and cql and paging dtests.

  tests: add large paging state tests
  increase the maximum size of query results to 2^64
2020-08-04 19:52:30 +03:00
Wojciech Mitros
4863e8a11f tests: add large paging state tests
Add a unit test checking if the top 32 bits of the number of remaining rows in paging state
is used correctly and a manual test checking if it's possible to select over 2^32 rows from a table
and a virtual reader for this table.
2020-08-04 18:44:29 +02:00
Kamil Braun
b5f3aef900 cdc: add an abstraction for building log mutations
This commit takes out some responsibilities of `cdc::transformer`
(which is currently a big ball of mud) into a separate class.

This class is a simple abstraction for creating entries in a CDC log
mutation.

Low-level calls to the mutation API (such as `set_cell`) inside
`cdc::transformer` were replaced by higher-level calls to the builder
abstraction, removing some duplication of logic.
2020-08-04 19:37:03 +03:00
Avi Kivity
c97924b8ad Update seastar submodule
util/loading_cache.hh includes adjusted.

* seastar 02ad74fa7d...eb452a22a0 (17):
  > core: add missing include for std::allocator_traits
  > exceptions: move timed_out_error and factory into its own header file
  > future: parallel_for_each: add disable_failure_guard for parallel_for_each_state
  > Merge "Improve file API noexcept correctness" from Rafael
  > util: Add a with_allocation_failures helper
  > future: Fix indentation
  > future: Refactor duplicated try/catch
  > future: Make set_to_current_exception public
  > future: Add noexcept to continuation related functions
  > core: mark timer cancellation functions as noexcept
  > future: Simplify future::schedule
  > test: add a case for overwriting exact routes
  > http: throw on duplicated routes to prevent memory leaks
  > metrics: Remove the type label
  > fstream: turn file_data_source_impl's memory corruption bugs into aborts
  > doc: update tutorial splitting script
  > reactor_backend: let the reactor know again if any work was done by aio backend
2020-08-04 17:54:45 +03:00
Nadav Har'El
1adcd7aca7 merge: Alternator streams get_records - fix threshold/record
Merged pull request https://github.com/scylladb/scylla/pull/6969 by
Calle Wilund:

Fixes #6942
Fixes #6926
Fixes #6933

We use clustering [lo:hi) range for iterator query.
To avoid encoding inclusive/exclusive range (depending on
init/last get_records call), instead just increment
the timeuuid threshold.

Also, dynamo result always contains a "records" entry. Include one for us as well.

Also, if old (or new) image for a change set is empty, dynamo will not include
this key at all. Alternator did return an empty object. This changes it to be excluded
on empty.

  alternator::streams: Don't include empty new/old image
  alternator::streams: Always include "Records" array in get_records reponse
  alternator::streams: Incr shard iterator threshold in get_records
2020-08-04 11:11:07 +03:00
Rafael Ávila de Espíndola
d5e8b64f01 Simplify a few calls to seastar::make_shared
There is no need to construct a value and then move it when using
make_shared. It can be constructed in place.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200804001144.59641-1-espindola@scylladb.com>
2020-08-04 11:03:18 +03:00
Avi Kivity
311ba4e427 Merge "sstables: Simplify the relationship of monitors and writers" from Rafael
"
With these patches a monitor is destroyed before the writer, which
simplifies the writer destructor.
"

* 'espindola/simplify-write-monitor-v2' of https://github.com/espindola/scylla:
  sstables: Delete write_failed
  sstables: Move monitor after writer in compaction_writer
2020-08-04 11:01:55 +03:00
Rafael Ávila de Espíndola
74ea522cd2 Use detect_stack_use_after_return=1
This works great with gcc 10.2, but unfortunately not any previous
gcc.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200731161205.22369-1-espindola@scylladb.com>
2020-08-04 11:00:09 +03:00
Calle Wilund
bf63b8f9f4 alternator::streams: Don't include empty new/old image
Fixes #6933

If old (or new) image for a change set is empty, dynamo will not
include this key at all. Alternator did return an empty object.
This changes it to be excluded on empty.
2020-08-04 07:39:09 +00:00
Calle Wilund
f80b465350 alternator::streams: Always include "Records" array in get_records reponse
Fixes #6926
Even if empty...
2020-08-04 07:39:09 +00:00
Calle Wilund
a763bb223f alternator::streams: Incr shard iterator threshold in get_records
Fixes #6942

We use clustering [lo:hi) range for iterator query.
To avoid encoding inclusive/exclusive range (depending on
init/last get_records call), instead just increment
the timeuuid threshold.
2020-08-04 07:39:02 +00:00
Rafael Ávila de Espíndola
ef0bed7253 Drop duplicated 'if' in comment
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200730170109.5789-1-espindola@scylladb.com>
2020-08-04 07:53:34 +03:00
Calle Wilund
a978e043c3 alternator::streams: Do not allow enabling streams when CDC is off
Fixes #6866

If we try to create/alter an Alternator table to include streams,
we must check that the cluster does in fact support CDC
(experimental still). If not, throw a hopefully somewhat descriptive
error.
(Normal CQL table create goes through a similar check in cql_prop_defs)

Note: no other operations are prohibited. The cluster could have had CDC
enabled before, so streams could exist to list and even read.
Any tables loaded from schema tables should be responsible for their
own validation.
2020-08-03 21:01:31 +03:00
Calle Wilund
05851578d4 alternator::streams: Report streams as not ready until CDC stream id:s are available
Refs #6864

When booting a clean scylla, CDC stream ID:s will not be available until
a n*ring delay time period has passed. Before this, writing to a CDC
enabled table will fail hard.
For alternator (and its tests), we can report the stream(s) for tables as not yet
available (ENABLING) until such time as id:s are
computed.

v2:
* Keep storage service ref in executor
2020-08-03 20:34:15 +03:00
Avi Kivity
1572b9e41c Merge 'transport: Added listener with port-based load balancing' from Juliusz
"
This is inspired by #6781. The idea is to make Scylla listen for CQL connections on port 9042 (where both old shard-aware and shard-unaware clients can still connect the traditional way). On top of that I added a new port, where everything works the same way, except that the port of the client's socket is used to determine the shard No. to connect to. The desired shard No. is the result of `clientside_port % num_shards`.

The new port is configurable from scylla.yaml and defaults to 19042 (unencrypted, unless user configures encryption options and omits `native_shard_aware_transport_port_ssl` in DB config).

Two "SUPPORTED" tags are added: "SCYLLA_SHARD_AWARE_PORT" and "SCYLLA_SHARD_AWARE_PORT_SSL". For compatibility, "SCYLLA_SHARDING_ALGORITHM" is still kept.

Fixes #5239
"

* jul-stas-shard-aware-listener:
  docs: Info about shard-aware listeners in protocol-extensions
  transport: Added listener with port-based load balancing
2020-08-03 19:23:28 +03:00
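The shard selection rule quoted in this merge, `clientside_port % num_shards`, can be illustrated from the client's side. This is only a sketch under stated assumptions: the helper names `shard_for_port` and `pick_source_port` and the ephemeral port range are hypothetical and do not come from the patch.

```python
def shard_for_port(client_port: int, num_shards: int) -> int:
    """Shard that the rule `clientside_port % num_shards` selects."""
    return client_port % num_shards


def pick_source_port(target_shard: int, num_shards: int,
                     low: int = 49152, high: int = 65535) -> int:
    """Return the lowest local port in [low, high] that maps to target_shard.

    The port range is an assumed ephemeral range, used only for illustration.
    """
    if not 0 <= target_shard < num_shards:
        raise ValueError("target_shard out of range")
    # Step forward from `low` to the first port congruent to target_shard.
    port = low + (target_shard - low) % num_shards
    if port > high:
        raise ValueError("no port in range maps to the requested shard")
    return port
```

A client wanting a connection to a particular shard would presumably bind its socket to such a local port before connecting to the shard-aware port; the binding itself is not shown here.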
Wojciech Mitros
45215746fe increase the maximum size of query results to 2^64
Currently, we cannot select more than 2^32 rows from a table because we are limited by types of
variables containing the numbers of rows. This patch changes these types and sets new limits.

The new limits take effect while selecting all rows from a table - custom limits of rows in a result
stay the same (2^32-1).

In classes which are being serialized and used in messaging, in order to be able to process queries
originating from older nodes, the top 32 bits of new integers are optional and stay at the end
of the class - if they're absent we assume they equal 0.

The backward compatibility was tested by querying an older node for a paged selection, using the
received paging_state with the same select statement on an upgraded node, and comparing the returned
rows with the result generated for the same query by the older node, additionally checking if the
paging_state returned by the upgraded node contained new fields with correct values. We also verified
that the older node simply ignores the top 32 bits of the remaining rows number when handling a query
with a paging_state originating from an upgraded node, by generating and sending such a query to
an older node and checking the paging_state in the reply (using the Python driver).

Fixes #5101.
2020-08-03 17:32:49 +02:00
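The compatibility scheme described in this commit (an optional top half that is assumed to be 0 when absent) can be sketched as follows. The function names are hypothetical, and plain integers stand in for the serialized fields.

```python
from typing import Optional, Tuple

MASK32 = 0xFFFFFFFF


def split_row_count(count: int) -> Tuple[int, int]:
    """Split a 64-bit count into (low 32 bits, high 32 bits)."""
    if not 0 <= count < 2 ** 64:
        raise ValueError("count does not fit in 64 bits")
    return count & MASK32, count >> 32


def join_row_count(low: int, high: Optional[int] = None) -> int:
    """Rebuild the count; a missing high half (older sender) counts as 0."""
    return ((high or 0) << 32) | (low & MASK32)
```

An older receiver that reads only the low half sees the count truncated to 32 bits, which matches the "simply ignores the top 32 bits" behaviour checked in the test description.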
Juliusz Stasiewicz
201268ea19 docs: Info about shard-aware listeners in protocol-extensions 2020-08-03 16:45:42 +02:00
Takuya ASADA
c0b2933106 scylla_setup: skip RAID prompt when var-lib-scylla.mount already exists
Since scylla_raid_setup always fails when var-lib-scylla.mount already
exists, it's better to skip the RAID prompt.

See #6965
2020-08-03 17:44:02 +03:00
Takuya ASADA
cff3e60f98 scylla_raid_setup: check var-lib-scylla.mount existence before formatting RAID
We should run the var-lib-scylla.mount existence check before formatting RAID.

Fixes #6965
2020-08-03 17:44:02 +03:00
Avi Kivity
4edfdfa78d Merge 'Build id cleanups' from Benny
"
Refs #5525

- main: add --build-id option
- build_id: mv sources to utils/
- build_id: throw on errors rather than assert
- build_id: simplify callback pointer type casting
"

* bhalevy-build-id-cleanups:
  build_id: simplify callback pointer type casting
  build_id: mv sources to utils/
  main: add --build-id option
2020-08-03 17:18:09 +03:00
Calle Wilund
30a700c5b0 system_keyspace: Remove support for legacy truncation records
Fixes #6341

Since scylla no longer supports upgrading from a version without the
"new" (dedicated) truncation record table, we can remove support for these
and the migration thereof.

Make sure the above holds wherever this is committed.

Note that this does not remove the "truncated_at" field in
system.local.
2020-08-03 17:16:26 +03:00
Botond Dénes
a9013030cf multishard_mutation_reader: add a trace message for each shard reader created
So we can see in the trace output, the shards that actually participated
in the reads. There is a single message for each shard reader.

Fixes: #6888
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200803132338.95013-1-bdenes@scylladb.com>
2020-08-03 16:24:46 +03:00
Benny Halevy
9256d2f504 build_id: simplify callback pointer type casting
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-03 15:55:18 +03:00
Benny Halevy
bf6e8f66d9 build_id: mv sources to utils/
The root directory is already overcrowded.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-03 15:55:16 +03:00
Benny Halevy
46f7d01536 main: add --build-id option
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-03 15:52:08 +03:00
Nadav Har'El
2dcb6294da merge: cdc: New delta modes: off, keys, full
Merged pull request https://github.com/scylladb/scylla/pull/6914
by Juliusz Stasiewicz:

The goal is to have finer control over CDC "delta" rows, i.e.:

    disable them totally (mode off);
    record only base PK+CK columns (mode keys);
    make them behave as usual (mode full, default).

The editing of log rows is performed at the stage of finishing CDC mutation.

Fixes #6838

  tests: Added CQL test for `delta mode`
  cdc: Implementations of `delta_mode::off/keys`
  cdc: Infrastructure for controlling `delta_mode`
2020-08-03 14:10:15 +03:00
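The three delta modes listed in this merge can be pictured with a small sketch. It is an illustration only: plain dictionaries stand in for CDC log rows, and `DeltaMode` and `edit_delta_row` are hypothetical names, not the identifiers used in the patch.

```python
from enum import Enum
from typing import Dict, Optional, Set


class DeltaMode(Enum):
    OFF = "off"
    KEYS = "keys"
    FULL = "full"


def edit_delta_row(row: Dict[str, object], key_columns: Set[str],
                   mode: DeltaMode) -> Optional[Dict[str, object]]:
    """Return the delta row to record, or None if no delta is recorded."""
    if mode is DeltaMode.OFF:
        return None  # delta rows are disabled entirely
    if mode is DeltaMode.KEYS:
        # keep only the base table's PK and CK columns
        return {name: value for name, value in row.items()
                if name in key_columns}
    return dict(row)  # FULL: the row is recorded unchanged
```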
Piotr Sarna
ed829fade0 sstables: make abort handlers noexcept
Abort handlers are used in a noexcept environment, so they should be
noexcept themselves.
Tested on a not-merged-yet Seastar patch with hardened noexcept
checks for abort_source.

Message-Id: <fbfd4950c0e8cc4f6005ad5b862d7bce01b90162.1596446857.git.sarna@scylladb.com>
2020-08-03 14:00:19 +03:00
Piotr Sarna
bd2d48e99c streaming: make stream_plan::abort noexcept
Aborting a stream plan is used in deinitialization code
run in a noexcept environment, so it should be noexcept itself.
Tested on a not-merged-yet Seastar patch with hardened noexcept
checks for abort_source.

Message-Id: <6eada033bb394d725b83a7e0f92381cb792ef6a1.1596446857.git.sarna@scylladb.com>
2020-08-03 14:00:19 +03:00
Piotr Sarna
5cc5b64d82 github: remove THE REST rule from CODEOWNERS file
The rule for THE REST results in each person listed in it
to receive notifications about every single pull request,
which can easily lead to inbox overload - the generic
rule is therefore dropped and authors of pull requests
are expected to manually add reviewers. GitHub offers
semi-random suggestions for reviewers anyway.

Message-Id: <3c0f7a2f13c098438a8abf998ec56b74db87c733.1596450426.git.sarna@scylladb.com>
2020-08-03 13:48:39 +03:00
Eliran Sinvani
779502ab11 Revert "schema: take into account features when converting a table creation to"
This reverts commit b97f466438.

It turns out that the schema mechanism has a lot of nuances.
After this change, for an unknown reason, it was empirically
shown that the amount of cross-shard operations on an upgraded node
increased significantly under steady stress traffic. The increase
was so significant that the node appeared unavailable to
the coordinators because all of the requests started to fail
on the smp_service_group semaphore.

This revert will bring back a caveat in Scylla: creating a table
in a mixed cluster **might**, under certain conditions, cause a
schema mismatch on the newly created table. This
makes the table essentially unusable until the whole cluster has
a uniform version (rolling upgrade or rollback completion).

Fixes #6893.
2020-08-03 12:51:16 +03:00
Botond Dénes
c81658c96e configure.py: remove unused variable do_sanitize
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200803082724.120916-1-bdenes@scylladb.com>
2020-08-03 12:51:16 +03:00
Botond Dénes
f4c8163d11 db/config_file.hh: named_value: remove unused members _name and _desc
They seem to be just copypasta.

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200803080604.45595-1-bdenes@scylladb.com>
2020-08-03 12:51:16 +03:00
Benny Halevy
3fa0f289de table: snapshot: do not capture name
This captured sstring is unused.

Test: database_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200803072258.44681-1-bhalevy@scylladb.com>
2020-08-03 12:51:16 +03:00
Botond Dénes
e4d06a3bbf scylla-gdb.py: collection_element: add circular_buffer support
Also add a __getitem__() to circular_buffer and mask indexes so they are
mapped to [`_impl.begin`, `_impl.end`).

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200803053646.14689-1-bdenes@scylladb.com>
2020-08-03 12:51:16 +03:00
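The index masking mentioned in this commit can be sketched in isolation. The sketch assumes that the buffer capacity is a power of two and that `begin` and `end` are free-running counters that get masked on access; the class name `CircularBufferView` is hypothetical and is not the helper in scylla-gdb.py.

```python
class CircularBufferView:
    """Read-only view that maps logical indexes onto a ring of storage."""

    def __init__(self, storage, begin, end):
        capacity = len(storage)
        # Assumption of this sketch: capacity is a non-zero power of two.
        if capacity == 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a non-zero power of two")
        if not 0 <= end - begin <= capacity:
            raise ValueError("begin/end do not describe a valid range")
        self.storage = storage
        self.begin = begin
        self.end = end
        self.mask = capacity - 1

    def __len__(self):
        return self.end - self.begin

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError("circular_buffer index out of range")
        # Mask the free-running position so it lands inside the storage.
        return self.storage[(self.begin + index) & self.mask]
```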
Benny Halevy
122136c617 tables: snapshot: do not create links from multiple shards
We need only one of the shards owning each sstable to call create_links.
This will allow us to simplify it and only handle crash/replay scenarios rather than rename/link/remove races.

Fixes #1622

Test: unit(dev), database_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200803065505.42100-3-bhalevy@scylladb.com>
2020-08-03 10:07:07 +03:00
Benny Halevy
ec6e136819 table: snapshot: reduce copies of snapshot dir sstring
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200803065505.42100-2-bhalevy@scylladb.com>
2020-08-03 10:07:06 +03:00
Benny Halevy
72365445c6 table: snapshot: create destination dir only once
No need to recursive_touch_directory for each sstable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200803065505.42100-1-bhalevy@scylladb.com>
2020-08-03 10:07:05 +03:00
Pekka Enberg
4f0f97773e configure.py: Use build directory variable
The "outdir" variable in configure.py and "$builddir" in build.ninja
file specifies the build directory. Let's use them to eliminate
hard-coded "build" paths from configure.py.
Message-Id: <20200731105113.388073-1-penberg@scylladb.com>
2020-08-03 09:51:51 +03:00
Nadav Har'El
ae25661d9c alternator test: set streams time window to zero
Alternator Streams have an "alternator_streams_time_window_s" parameter which
is used to allow for correct ordering in the stream in the face of clock
differences between Scylla nodes and possibly network delays. This parameter
currently defaults to 10 seconds, and there is a discussion on issue #6929
on whether it is perhaps too high. But in any case, for tests running on a
single node there is no reason not to set this parameter to zero.

Setting this parameter to zero greatly speeds up the Alternator Streams
tests which use ReadRecords to read from the stream. Previously each such
test took at least 10 seconds, because the data was only readable after a
10 second delay. With alternator_streams_time_window_s=0, these tests can
finish in less than a second. Unfortunately they are still relatively slow
because our Streams implementation has 512 shards, and thus we need over a
thousand (!) API calls to read from the stream.

Running "test/alternator/run test_streams.py" with 25 tests took 114 seconds
before this patch; after this patch, it is down to 18 seconds.

Refs #6929

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reviewed-by: Calle Wilund <calle@scylladb.com>
Message-Id: <20200728184612.1253178-1-nyh@scylladb.com>
2020-08-03 09:19:57 +03:00
Avi Kivity
257c17a87a Merge "Don't depend on seastar::make_(lw_)?shared idiosyncrasies" from Rafael
"
While working on another patch I was getting odd compiler errors
saying that a call to ::make_shared was ambiguous. The reason was that
seastar has both:

template <typename T, typename... A>
shared_ptr<T> make_shared(A&&... a);

template <typename T>
shared_ptr<T> make_shared(T&& a);

The second variant doesn't exist in std::make_shared.

This series drops the dependency in scylla, so that a future change
can make seastar::make_shared a bit more like std::make_shared.
"

* 'espindola/make_shared' of https://github.com/espindola/scylla:
  Everywhere: Explicitly instantiate make_lw_shared
  Everywhere: Add a make_shared_schema helper
  Everywhere: Explicitly instantiate make_shared
  cql3: Add a create_multi_column_relation helper
  main: Return a shared_ptr from defer_verbose_shutdown
2020-08-02 19:51:24 +03:00
Avi Kivity
bb9ad9c90b Merge 'Mount RAID volume correctly beyond reboot' from Takuya
"
To mount RAID volume correctly (#6876), we need to wait for MDRAID initialization.
To do so we need to add After=mdmonitor.service on var-lib-scylla.mount.
Also, `lsblk -n -oPARTTYPE {dev}` does not work on CentOS7, since the older lsblk does not support the PARTTYPE column (#6954).
We need to provide a relocatable lsblk and run it in the out() / run() functions instead of the distribution-provided version.
"

* syuu1228-scylla_raid_setup_mount_correctly_beyond_reboot:
  scylla_raid_setup: initialize MDRAID before mounting data volume
  create-relocatable-package.py: add lsblk for relocatable CLI tools
  scylla_util.py: always use relocatable CLI tools
2020-08-02 16:36:45 +03:00
Piotr Sarna
ccbffc3177 codeowners: add some @psarnas and @penbergs where applicable
I shamelessly added myself to some modules I usually take part
in reviewing. Also, I assume that the *THE REST* bucket should
show current maintainers, so the list is extended appropriately.

Message-Id: <0c172d0f20e367c3ce47fdf8d40755038ddee373.1596195689.git.sarna@scylladb.com>
2020-07-31 17:08:28 +03:00
Rafael Ávila de Espíndola
30722b8c8e logalloc: Add disable_failure_guard during a few tls variable initialization
The constructors of these global variables can allocate memory. Since
the variables are thread_local, they are initialized at first use.

There is nothing we can do if these allocations fail, so use
disable_failure_guard.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200729184901.205646-1-espindola@scylladb.com>
2020-07-31 15:49:21 +02:00
Pavel Emelyanov
14b279020b scylla-gdb.py: Support b+tree-based row_cache::_partitions
The row_cache::_partitions type is nowadays a double_decker which is B+tree of
intrusive_arrays of cache_entrys, so scylla cache command will raise an error
being unable to parse this new data type.

The respective iterator for double decker starts on the tree and walks the list
of leaf nodes, on each node it walks the plain array of data nodes, then on each
data node it walks the intrusive array of cache_entrys yielding them to the
caller.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200730145851.8819-1-xemul@scylladb.com>
2020-07-31 15:48:25 +02:00
Piotr Jastrzębski
b16b2c348f Add CDC code owners 2020-07-31 14:22:08 +03:00
Piotr Jastrzębski
7eff7a39a0 Add hinted handoff code owners 2020-07-31 14:21:59 +03:00
Piotr Jastrzębski
443affa525 Update counters code owners 2020-07-31 14:21:48 +03:00
Juliusz Stasiewicz
1c11d8f4c4 transport: Added listener with port-based load balancing
The new port is configurable from scylla.yaml and defaults to 19042
(unencrypted, unless client configures encryption options and omits
`native_shard_aware_transport_port_ssl`).

Two "SUPPORTED" tags are added: "SCYLLA_SHARD_AWARE_PORT" and
"SCYLLA_SHARD_AWARE_PORT_SSL". For compatibility,
"SCYLLA_SHARDING_ALGORITHM" is still kept.

Fixes #5239
2020-07-31 13:02:13 +02:00
Tomasz Grabiec
5263e0453a CMakeLists.txt: Add abseil to include directories
Fixes IDE integration.
Message-Id: <1596190352-15467-1-git-send-email-tgrabiec@scylladb.com>
2020-07-31 12:15:23 +02:00
Avi Kivity
66c2b4c8bf tools: toolchain: regenerate for gcc 10.2
Fixes #6813.

As a side effect, this also brings in xxhash 0.7.4.
2020-07-31 08:32:16 +03:00
Takuya ASADA
9e5d548f75 scylla_raid_setup: initialize MDRAID before mounting data volume
var-lib-scylla.mount should wait for MDRAID initialization, so we need to add
'After=mdmonitor.service'.
However, mdmonitor.service currently fails to start because no mail address
is specified, so we need to add the entry to mdadm.conf.

Fixes #6876
2020-07-31 06:33:52 +09:00
Takuya ASADA
6ba2a6c42e create-relocatable-package.py: add lsblk for relocatable CLI tools
We need the latest version of lsblk, which supports partition type UUID.

Fixes #6954
2020-07-31 04:23:03 +09:00
Takuya ASADA
a19a62e6f6 scylla_util.py: always use relocatable CLI tools
For some CLI tools, command options may differ between the latest version
and older versions.
To maximize compatibility of setup scripts, we should always use
relocatable CLI tools instead of distribution version of the tool.

Related #6954
2020-07-31 04:17:01 +09:00
Piotr Sarna
b3ad5042c4 .gitignore: add .vscode to the list
Since it looks like vscode is used as main IDE
by some developers, including me, let's ignore its helper files.

Message-Id: <63931cadc733c3d0345616be633a6479dc85ca19.1596115302.git.sarna@scylladb.com>
2020-07-30 16:35:06 +03:00
Piotr Sarna
8728c70628 .gitignore: allow symlinks when ignoring testlog
The .gitignore entry for testlog/ directory is generalized
from "testlog/*" to "testlog", in order to please everyone
who potentially wants test logs to use ramfs by symlinking
testlog to /tmp. Without the change, the symlink remains
visible in `git status`.

Message-Id: <e600f5954868aea7031beb02b1d8e12a2ff869e2.1596115302.git.sarna@scylladb.com>
2020-07-30 16:35:02 +03:00
Piotr Sarna
0788a77109 Merge 'Replace MAINTAINERS with CODEOWNERS' from Pekka
Replace the MAINTAINERS file with a CODEOWNERS file, which Github is
able to parse, and suggest reviewers for pull requests.

* penberg-penberg/codeowners:
  Replace MAINTAINERS with CODEOWNERS
  Update MAINTAINERS
2020-07-30 15:12:59 +02:00
Nadav Har'El
8b9da9c92a alternator test: tests for combination of query filter and projection
The tests in this patch, which pass on DynamoDB but fail on Alternator,
reproduce a bug described in issue #6951. This bug makes it impossible for
a Query (or Scan) to filter on an attribute if that attribute is not
requested to be included in the output.

This patch includes two xfailing tests of this type: One testing a
combination of FilterExpression and ProjectionExpression, and the second
testing a combination of QueryFilter and AttributesToGet; These two
pairs are, respectively, DynamoDB's newer and older syntaxes to achieve
the same thing.

Additionally, we add two xfailing tests that demonstrate that combining
old and new style syntax (e.g., FilterExpression with AttributesToGet)
should not have been allowed (DynamoDB doesn't allow such combinations),
but Alternator currently accepts these combinations.

Refs #6951

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200729210346.1308461-1-nyh@scylladb.com>
2020-07-30 09:34:23 +02:00
Rafael Ávila de Espíndola
a548e5f5d1 test: Mark tmpdir::remove noexcept
Also disable the allocation failure injection in it.

Refs #6831.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200729200019.250908-2-espindola@scylladb.com>
2020-07-30 09:55:52 +03:00
Rafael Ávila de Espíndola
d8ba9678b4 test: Move tmpdir code to a .cc file
This is not hot, so we can move it out of the header.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200729200019.250908-1-espindola@scylladb.com>
2020-07-30 09:55:52 +03:00
Tomasz Grabiec
3486eba1ce commitlog: Fix use-after-free on mutation object during replay
The mutation object may be freed prematurely during commitlog replay
in the schema upgrading path. We will hit the problem if the memtable
is full and apply_in_memory() needs to defer.

This will typically manifest as a segfault.

Fixes #6953

Introduced in 79935df

Tests:
  - manual using scylla binary. Reproduced the problem then verified the fix makes it go away

Message-Id: <1596044010-27296-1-git-send-email-tgrabiec@scylladb.com>
2020-07-29 20:58:15 +03:00
Juliusz Stasiewicz
7e42a42381 tests: Added CQL test for delta mode
Tested scenario is just a single insert in every `delta_mode`.
It is also checked that CDC cannot be enabled with all its
subfeatures disabled.
2020-07-29 16:42:26 +02:00
Nadav Har'El
665b78253a alternator test: reduce amount of Scylla logs saved
The test/alternator/run script follows the pytest log with a full log of
Scylla. This saved log can be useful in diagnosing problems, but most of
it is filled with non-useful "INFO"-level messages. The two biggest
offenders are compaction - which logs every single compaction happening,
and the migration manager, which is just a second (and very long) message
about schema change operations (e.g., table creations). Neither of these
are interesting for Alternator's tests, which shouldn't care exactly when
compaction of which sstable is happening. These two components alone
are responsible for 80% of the log lines, and 90% of the log bytes!

In this patch we increase the log level of just these two components -
compaction and migration_manager - to WARN, which reduces the log
by the same percentages (80% by lines, 90% by bytes).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200728191420.1254961-1-nyh@scylladb.com>
2020-07-29 14:17:12 +03:00
Takuya ASADA
3a25e7285b scylla_post_install.sh: generate memory.conf for CentOS7
On CentOS7, systemd does not support percentage-based parameter.
To apply memory parameter on CentOS7, we need to override the parameter
in bytes, instead of percentage.

Fixes #6783
2020-07-29 14:10:16 +03:00
Avi Kivity
fea5067dfa Merge "Limit non-paged query memory consumption" from Botond
"
Non-paged queries completely ignore the query result size limiter
mechanism. They consume all the memory they want. With sufficiently
large datasets this can easily lead to a handful or even a single
unpaged query producing an OOM.

This series continues the work started by 134d5a5f7, by introducing a
configurable pair of soft/hard limits (defaulting to 1MB/100MB) that is
applied to otherwise unlimited queries, like reverse and unpaged ones.
When an unlimited query reaches the soft limit a warning is logged. This
should give users some heads-up to adjust their application. When the
hard limit is reached the query is aborted. The idea is to not greet
users with failing queries after an upgrade while at the same time
protect the database from the really bad queries. The hard limit should
be decreased from time to time gradually approaching the desired goal of
1MB.

We don't want to limit internal queries, we trust ourselves to either
use another form of memory usage control, or read only small datasets.
So the limit is selected according to the query class. User reads use
the `max_memory_for_unlimited_query_{soft,hard}_limit` configuration
items, while internal reads are not limited. The limit is obtained by
the coordinator, who passes it down to replicas using the existing
`max_result_size` parameter (which is now a special type containing the
two limits), which is now passed on every verb, instead of once per
connection. This ensures that all replicas work with the same limits.
For normal paged queries `max_result_size` is set to the usual
`query::result_memory_limiter::maximum_result_size`. For queries that can
consume unlimited amount of memory -- unpaged and reverse queries --
this is set to the value of the aforementioned
`max_memory_for_unlimited_query_{soft,hard}_limit` configuration item,
but only for user reads, internal reads are not limited.

This has the side-effect that reverse reads now send entire
partitions in a single page, but this is not that bad. The data was
already read, and its size was below the limit, the replica might as well
send it all.

Fixes: #5870
"

* 'nonpaged-query-limit/v5' of https://github.com/denesb/scylla: (26 commits)
  test: database_test: add test for enforced max result limit
  mutation_partition: abort read when hard limit is exceeded for non-paged reads
  query-result.hh: move the definition of short_read to the top
  test: cql_test_env: set the max_memory_unlimited_query_{soft,hard}_limit
  test: set the allow_short_read slice option for paged queries
  partition_slice_builder: add with_option()
  result_memory_accounter: remove default constructor
  query_*(): use the coordinator specified memory limit for unlimited queries
  storage_proxy: use read_command::max_result_size to pass max result size around
  query: result_memory_limiter: use the new max_result_size type
  query: read_command: add max_result_size
  query: read_command: use tagged ints for limit ctor params
  query: read_command: add separate convenience constructor
  service: query_pager: set the allow_short_read flag
  result_memory_accounter: check(): use _maximum_result_size instead of hardcoded limit
  storage_proxy: add get_max_result_size()
  result_memory_limiter: add unlimited_result_size constant
  database: add get_statement_scheduling_group()
  database: query_mutations(): obtain the memory accounter inside
  query: query_class_config: use max_result_size for the max_memory_for_unlimited_query field
  ...
2020-07-29 13:41:53 +03:00
Avi Kivity
22fe38732d Update tools/jmx and tools/java submodules
* tools/java a9480f3a87...aa7898d771 (4):
  > dist: debian: do not require root during package build
  > cassandra-stress: Add serial consistency options
  > dist: debian: fix detection of debuild
  > bin tools: Use non-default `cassandra.config`

* tools/jmx c0d9d0f...626fd75 (1):
  > dist: debian: do not require root during package build

Fixes #6655.
2020-07-29 12:55:18 +03:00
Botond Dénes
3804dfcc0c test: database_test: add test for enforced max result limit
Two tests are added: one that works on the low-level database API, and
another one that works on the CQL API.
2020-07-29 08:32:34 +03:00
Botond Dénes
f7a4d19fb1 mutation_partition: abort read when hard limit is exceeded for non-paged reads
If the read is not paged (short read is not allowed) abort the query if
the hard memory limit is reached. On reaching the soft memory limit a
warning is logged. This should allow users to adjust their application
code while at the same time protecting the database from the really bad
queries.
The enforcement happens inside the memory accounter and doesn't require
cooperation from the result builders. This ensures memory limit set for
the query is respected for all kind of reads. Previously non-paged reads
simply ignored the memory accounter requesting the read to stop and
consumed all the memory they wanted.
2020-07-29 08:32:31 +03:00
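The behaviour described in this commit (stop the page for paged reads, warn at the soft limit and abort at the hard limit for non-paged reads) can be sketched roughly as below. This is not the Scylla accounter: `MemoryAccounter`, `MaxResultSize`, `HardLimitExceeded`, and the choice to end a page once the soft limit is passed are assumptions made for the illustration.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("query_limits_sketch")


class HardLimitExceeded(Exception):
    """Raised when a non-paged read exceeds its hard memory limit."""


@dataclass
class MaxResultSize:
    soft_limit: int
    hard_limit: int


class MemoryAccounter:
    def __init__(self, limits: MaxResultSize, allow_short_read: bool):
        self.limits = limits
        self.allow_short_read = allow_short_read
        self.used = 0
        self.warned = False

    def consume(self, nbytes: int) -> bool:
        """Account for nbytes; return True if the read should stop here."""
        self.used += nbytes
        if self.used <= self.limits.soft_limit:
            return False
        if self.allow_short_read:
            return True  # paged read: end the page (short read)
        if self.used > self.limits.hard_limit:
            raise HardLimitExceeded(f"result size {self.used} over hard limit")
        if not self.warned:
            logger.warning("result size %d over soft limit", self.used)
            self.warned = True
        return False  # non-paged read continues after the warning
```

Keeping the check inside the accounter mirrors the point made in the commit message: the result builders do not have to cooperate for the limit to be enforced.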
Rafael Ávila de Espíndola
c4cb3817cf build: Use -fdata-sections and -ffunction-sections
This is a 4.2% reduction in the scylla text size, from 38975956 to
37404404 bytes.

When benchmarking perf_simple_query without --shuffle-sections, there
is no performance difference.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200724032504.3004-1-espindola@scylladb.com>
2020-07-28 19:39:26 +03:00
Botond Dénes
02a7492d62 query-result.hh: move the definition of short_read to the top
It will be used by `result_memory_{limiter,accounter}` soon.
2020-07-28 18:00:29 +03:00
Botond Dénes
43c0da4b63 test: cql_test_env: set the max_memory_unlimited_query_{soft,hard}_limit
To an unlimited value, in order to avoid aborting any unpaged queries
executed by tests, that would exceed the default result limit of
1MB/100MB.
2020-07-28 18:00:29 +03:00
Botond Dénes
648ce473ab test: set the allow_short_read slice option for paged queries
Some tests use the lower level methods directly and meant to use paging
but didn't and nobody noticed. This was revealed by the enforcement of
max result size (introduced in a later patch), which caused these tests
to fail due to exceeding the max result size.
This patch fixes this by setting the `allow_short_reads` slice option.
2020-07-28 18:00:29 +03:00
Botond Dénes
d27f8321d7 partition_slice_builder: add with_option() 2020-07-28 18:00:29 +03:00
Botond Dénes
6660a5df51 result_memory_accounter: remove default constructor
If somebody wants to bypass proper memory accounting they should at
the very least be forced to consider if that is indeed wise and think a
second about the limit they want to apply.
2020-07-28 18:00:29 +03:00
Botond Dénes
9eab5bca27 query_*(): use the coordinator specified memory limit for unlimited queries
It is important that all replicas participating in a read use the same
memory limits to avoid artificial differences due to different amount of
results. The coordinator now passes down its own memory limit for reads,
in the form of max_result_size (or max_size). For unpaged or reverse
queries this has to be used now instead of the locally set
max_memory_unlimited_query configuration item.

To avoid the replicas accidentally using the local limit contained in
the `query_class_config` returned from
`database::make_query_class_config()`, we refactor the latter into
`database::get_reader_concurrency_semaphore()`. Most of its callers were
interested in the semaphore only anyway and those that were
interested in the limit as well should get it from the coordinator
instead, so this refactoring is a win-win.
2020-07-28 18:00:29 +03:00
Botond Dénes
159d37053d storage_proxy: use read_command::max_result_size to pass max result size around
Use the recently added `max_result_size` field of `query::read_command`
to pass the max result size around, including passing it to remote
nodes. This means that the max result size will be sent along each read,
instead of once per connection.
As we want to select the appropriate `max_result_size` based on the type
of the query as well as based on the query class (user or internal) the
previous method won't do anymore. If the remote doesn't fill this
field, the old per-connection value is used.
2020-07-28 18:00:29 +03:00
Botond Dénes
fbbbc3e05c query: result_memory_limiter: use the new max_result_size type 2020-07-28 18:00:29 +03:00
Botond Dénes
92a7b16cba query: read_command: add max_result_size
This field will replace max size which is currently passed once per
established rpc connection via the CLIENT_ID verb and stored as an
auxiliary value on the client_info. For now it is unused, but we update
all sites creating a read command to pass the correct value to it. In the
next patch we will phase out the old max size and use this field to pass
max size on each verb instead.
2020-07-28 18:00:29 +03:00
Botond Dénes
8992bcd1f8 query: read_command: use tagged ints for limit ctor params
The convenience constructor of read_command now has two integer
parameters next to each other. In the next patch we intend to add another
one. This is a recipe for disaster, so to avoid mistakes this patch
converts these parameters to tagged integers. This makes sure callers
pass what they meant to pass. As a matter of fact, while fixing up
call-sites, I already found several ones passing `query::max_partitions`
to the `row_limit` parameter. No harm done yet, as
`query::max_partitions` == `query::max_rows` but this shows just how
easy it is to mix up parameters with the same type.
2020-07-28 18:00:29 +03:00
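The idea of tagged integers can be shown with a rough Python analogue. The patch itself concerns C++ constructor parameters; here `RowLimit`, `PartitionLimit`, and `make_read_command` are hypothetical names, and the mix-up is caught at run time rather than by a compiler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowLimit:
    value: int


@dataclass(frozen=True)
class PartitionLimit:
    value: int


def make_read_command(row_limit: RowLimit,
                      partition_limit: PartitionLimit) -> dict:
    """Build a stand-in read command, rejecting swapped or bare limits."""
    if not isinstance(row_limit, RowLimit):
        raise TypeError("row_limit must be a RowLimit")
    if not isinstance(partition_limit, PartitionLimit):
        raise TypeError("partition_limit must be a PartitionLimit")
    return {"row_limit": row_limit.value,
            "partition_limit": partition_limit.value}
```

Passing a `PartitionLimit` where a `RowLimit` is expected now fails immediately, even when the two wrapped values happen to be equal.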
Botond Dénes
2ca118b2d5 query: read_command: add separate convenience constructor
query::read_command currently has a single constructor, which serves
both as an idl constructor (order of parameters is fixed) and a convenience one
(most parameters have default values). This makes it very error prone to
add new parameters, that everyone should fill. The new parameter has to
be added as last, with a default value, as the previous ones have a
default value as well. This means the compiler's help cannot be enlisted
to make sure all usages are updated.

This patch adds a separate convenience constructor to be used by normal
code. The idl constructor loses all default parameters. New parameters
can be added to any position in the convenience constructor (to force
users to fill in a meaningful value) while the removed default
parameters from the idl constructor means code cannot accidentally use
it without noticing.
2020-07-28 18:00:29 +03:00
Botond Dénes
1615fe4c5e service: query_pager: set the allow_short_read flag
All callers should set this already before passing the slice to the
pager, however not all actually do (e.g.
`cql3::indexed_table_select_statement::read_posting_list()`). Instead of
auditing each call site, just make sure this is set in the pager
itself. If someone is creating a pager we can be sure they mean to use
paging.
2020-07-28 18:00:29 +03:00
Botond Dénes
989142464c result_memory_accounter: check(): use _maximum_result_size instead of hardcoded limit
The use of the global `result_memory_limiter::maximum_result_size` is
probably a leftover from before the `_maximum_result_size` member was
introduced (aa083d3d85).
2020-07-28 18:00:29 +03:00
Botond Dénes
9eb6d704b2 storage_proxy: add get_max_result_size()
Meant to be used by the coordinator node to obtain the max result size
applicable to the query-class (determined based on the current
scheduling group). For normal paged queries the previously used
`query::result_memory_limiter::maximum_result_size` is used uniformly.
For reverse and unpaged queries, a query class dependent value is used.
For user reads, the value of the
`max_memory_for_unlimited_query_{soft,hard}_limit` configuration items
is used, for other classes no limit is used
(`query::result_memory_limiter::unlimited_result_size`).
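
As a rough illustration of the selection described above, here is a small
Python sketch. It is not the `storage_proxy` code: the `QueryClass` enum,
the function signature and the two placeholder constants are assumptions
made for the example.

```python
import sys
from dataclasses import dataclass
from enum import Enum, auto

# Placeholder values, assumed for illustration only.
MAXIMUM_RESULT_SIZE = 1 << 20        # stands in for result_memory_limiter::maximum_result_size
UNLIMITED_RESULT_SIZE = sys.maxsize  # stands in for result_memory_limiter::unlimited_result_size


class QueryClass(Enum):
    """Hypothetical query classes, derived from the current scheduling group."""
    USER = auto()
    SYSTEM = auto()
    STREAMING = auto()


@dataclass(frozen=True)
class MaxResultSize:
    soft_limit: int
    hard_limit: int


def get_max_result_size(query_class, paged, reversed_, cfg_soft, cfg_hard):
    """Pick the result size limit applicable to a query on the coordinator."""
    if paged and not reversed_:
        # Normal paged queries: the same limit for every query class.
        return MaxResultSize(MAXIMUM_RESULT_SIZE, MAXIMUM_RESULT_SIZE)
    if query_class is QueryClass.USER:
        # Reverse and unpaged user reads: the configured soft/hard limits.
        return MaxResultSize(cfg_soft, cfg_hard)
    # Other classes: no limit.
    return MaxResultSize(UNLIMITED_RESULT_SIZE, UNLIMITED_RESULT_SIZE)
```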
2020-07-28 18:00:29 +03:00
Botond Dénes
c364c7c6a2 result_memory_limiter: add unlimited_result_size constant
To be used as the max result size for internal queries.
2020-07-28 18:00:29 +03:00
Botond Dénes
a64d9b8883 database: add get_statement_scheduling_group() 2020-07-28 18:00:29 +03:00
Botond Dénes
d5cc932a0b database: query_mutations(): obtain the memory accounter inside
Instead of requesting callers to do it and pass it as a parameter. This
is in line with data_query().
2020-07-28 18:00:29 +03:00
Botond Dénes
92ce39f014 query: query_class_config: use max_result_size for the max_memory_for_unlimited_query field
We want to switch from using a single limit to a dual soft/hard limit.
As a first step we switch the limit field of `query_class_config` to use
the recently introduced type for this. As this field has a single user
at the moment -- reverse queries (and not a lot of propagation) -- we
update it in this same patch to use the soft/hard limit: warn on
reaching the soft limit and abort on the hard limit (the previous
behaviour).
2020-07-28 18:00:29 +03:00
Botond Dénes
8aee7662a9 query: introduce max_result_size
To be used to pass around the soft/hard limit configured via
`max_memory_for_unlimited_query_{soft,hard}_limit` in the codebase.
2020-07-28 18:00:29 +03:00
Botond Dénes
517a941feb query_class_config: move into the query namespace
It belongs there, its name even starts with "query".
2020-07-28 18:00:29 +03:00
Botond Dénes
46d5b651eb db/config: introduce max_memory_for_unlimited_query_soft_limit and max_memory_for_unlimited_query_hard_limit
This pair of limits replaces the old max_memory_for_unlimited_query one,
which remains as an alias to the hard limit. The soft limit inherits the
previous value of the limit (1MB). When this limit is reached, a warning
is logged, allowing users to adjust their client code without
downtime. The hard limit starts out with a more permissive default of
100MB. When this is reached, queries are aborted, the same behaviour as
with the previous single limit.

The idea is to allow clients a grace period for fixing their code, while
at the same time protecting the database from the really bad queries.
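
A minimal Python sketch of the warn-then-abort behaviour, for illustration
only. The function name, the exception type and the exact comparison
(strictly greater than) are assumptions; only the 1MB and 100MB defaults
come from the description above.

```python
import logging

logger = logging.getLogger("query")

SOFT_LIMIT_DEFAULT = 1 * 1024 * 1024    # 1MB, inherited from the old single limit
HARD_LIMIT_DEFAULT = 100 * 1024 * 1024  # 100MB, the more permissive new default


class ResultTooLarge(Exception):
    """Hypothetical error used here to model an aborted query."""


def check_result_size(used_bytes, soft_limit=SOFT_LIMIT_DEFAULT, hard_limit=HARD_LIMIT_DEFAULT):
    """Return True if the soft limit was crossed; raise if the hard limit was."""
    if used_bytes > hard_limit:
        raise ResultTooLarge(f"result of {used_bytes} bytes exceeds hard limit {hard_limit}")
    if used_bytes > soft_limit:
        # Only warn: the query keeps running, giving clients a grace period.
        logger.warning("result of %d bytes exceeds soft limit %d", used_bytes, soft_limit)
        return True
    return False
```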
2020-07-28 18:00:29 +03:00
Botond Dénes
9faaf46d4b utils: config_src::add_command_line_options(): drop name and desc args
Now that there are no ad-hoc aliases needing to overwrite the name and
description parameter of this method, we can drop these and have each
config item just use `name()` and `desc()` to access these.
2020-07-28 18:00:29 +03:00
Botond Dénes
dc23736d0c db/config: replace ad-hoc aliases with alias mechanism
We already use aliases for some configuration items, although these are
created with an ad-hoc mechanism that only registers them on the command
line. Replace this with the built-in alias mechanism introduced in the previous
patch, which has the benefit of conflict resolution and also works
with YAML.
2020-07-28 18:00:29 +03:00
Botond Dénes
003f5e9e54 utils: config: add alias support
Allow configuration items to also have an alias, besides the name.
This allows easy replacement of configuration items, with newer names,
while still supporting the old name for backward compatibility.

The alias mechanism takes care of registering both the name and the
alias as command line arguments, as well as parsing them from YAML.
The command line documentation of the alias will just refer to the name
for documentation.
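
To make the idea concrete, here is a small Python sketch of a registry
where a name and its alias resolve to the same item. The `ConfigRegistry`
class is hypothetical, and rejecting an item given under both keys is just
one possible conflict policy, assumed for the example.

```python
class ConfigRegistry:
    """Hypothetical registry mapping both names and aliases to config items."""

    def __init__(self):
        self._by_key = {}  # name or alias -> canonical name
        self._values = {}  # canonical name -> current value

    def add(self, name, default, alias=None):
        keys = [k for k in (name, alias) if k is not None]
        for key in keys:
            if key in self._by_key:
                raise ValueError(f"duplicate config key: {key}")
        for key in keys:
            self._by_key[key] = name
        self._values[name] = default

    def load(self, options):
        """Apply parsed options, e.g. from YAML or the command line."""
        seen = set()
        for key, value in options.items():
            name = self._by_key[key]
            if name in seen:
                # Assumed policy: the same item given under name and alias is an error.
                raise ValueError(f"{name} given under both its name and its alias")
            seen.add(name)
            self._values[name] = value

    def get(self, name):
        return self._values[name]
```

With this, an old name such as `max_memory_for_unlimited_query` keeps
working when registered as the alias of its replacement.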
2020-07-28 17:59:51 +03:00
Raphael S. Carvalho
99b75d1f63 compaction: Improve compaction efficiency by killing the procedure that trims jobs
This procedure consists of trimming SSTables off a compaction job until its weight[1]
is smaller than one already taken by a running compaction. Min threshold is respected
though, we only trim a job while its size is > min threshold.

[1]: this value is a logarithmic function of the total size of the SSTables in a
given job, and it's used to control the compaction parallelism.

It's intended to improve the compaction efficiency by allowing more jobs to run in
parallel, but it turns out that this can have an opposite effect because the write
amplification can be significantly increased.

Take STCS for example, the more similar-sized SSTables you compact together, the
higher the compaction efficiency will be. With the trimming procedure, we're aiming
at running smaller jobs, thinking that running more parallel compactions will provide
us with better performance, but that's not true. Most of the efficiency comes from
making informed decisions when selecting candidates for compaction.

Similarly, this will also hurt TWCS, which does STCS in current window, and a sort
of major compaction when the current window closes. If the TWCS jobs are trimmed,
we'll likely need another compaction to get to the desired state, recompacting
the same data again.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200728143648.31349-1-raphaelsc@scylladb.com>
2020-07-28 17:44:00 +03:00
Takuya ASADA
d7de9518fe scylla_setup: skip boot partition
On GCE, /dev/sda14 is reported as an unused disk, but it is the BIOS boot partition.
It should not be used as a Scylla data partition, and it cannot be used as one anyway
since it is too small.

It is better to exclude such partitions from the unused disk list.

Fixes #6636
2020-07-28 12:19:55 +03:00
Asias He
e6f640441a repair: Fix race between create_writer and wait_for_writer_done
We saw scylla hit use after free in repair with the following procedure during tests:

- n1 and n2 in the cluster

- n2 ran decommission

- n2 sent data to n1 using repair

- n2 was forcibly killed

- n1 tried to remove repair_meta for n1

- n1 hit use after free on repair_meta object

This was what happened on n1:

1) data was received -> do_apply_rows was called -> yield before create_writer() was called

2) repair_meta::stop() was called -> wait_for_writer_done() / do_wait_for_writer_done was called
   with _writer_done[node_idx] not engaged

3) step 1 resumed, create_writer() was called and _repair_writer object was referenced

4) repair_meta::stop() finished, repair_meta object and its member _repair_writer was destroyed

5) The fiber created by create_writer() at step 3 hit use after free on _repair_writer object

To fix, we should call wait_for_writer_done() after any pending
operations protected by repair_meta::_gate are done. This
prevents wait_for_writer_done() from finishing while the writer is
still in the process of being created.
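
The ordering can be sketched with Python's asyncio, purely as an
illustration. The `Gate` and `RepairMeta` classes below are simplified,
hypothetical stand-ins and not the repair code: closing the gate first
guarantees the pending operation has created its writer before stop()
goes on to wait for it.

```python
import asyncio


class Gate:
    """Hypothetical stand-in for a gate: close() waits for all entered operations."""

    def __init__(self):
        self._count = 0
        self._closed = False
        self._idle = asyncio.Event()
        self._idle.set()

    def enter(self):
        if self._closed:
            raise RuntimeError("gate closed")
        self._count += 1
        self._idle.clear()

    def leave(self):
        self._count -= 1
        if self._count == 0:
            self._idle.set()

    async def close(self):
        self._closed = True
        await self._idle.wait()


class RepairMeta:
    def __init__(self):
        self.gate = Gate()
        self.events = []

    async def apply_rows(self):
        self.gate.enter()
        try:
            await asyncio.sleep(0)  # yield before the writer is created (step 1)
            self.events.append("create_writer")
        finally:
            self.gate.leave()

    async def stop(self):
        await self.gate.close()  # first wait for pending operations
        self.events.append("wait_for_writer_done")  # only then wait for the writer


async def run_scenario():
    meta = RepairMeta()
    task = asyncio.create_task(meta.apply_rows())
    await asyncio.sleep(0)  # let apply_rows enter the gate and yield
    await meta.stop()
    await task
    return meta.events
```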

Fixes: #6853
Fixes: #6868
Backports: 4.0, 4.1, 4.2
2020-07-28 11:53:40 +03:00
Asias He
bdaf904864 storage_service: Improve log on removing pending replacing node
The log "removing pending replacing node" is printed whenever a node
jumps to normal status including a normal restart. For example, on
node1, we saw the following when node2 restarts.

[shard 0] storage_service - Node 127.0.0.2 state jump to normal
[shard 0] storage_service - Remove node 127.0.0.2 from pending replacing endpoint

This is confusing since no node is really being replaced.

To fix, log only if a node is really removed from the pending replacing
nodes.

In addition, since do_remove_node will call del_replacing_endpoint,
there is no need to call del_replacing_endpoint again in
storage_service::handle_state_normal after do_remove_node.

Fixes #6936
2020-07-28 11:51:22 +03:00
Piotr Sarna
ee35c4c3d6 db: handle errors when loading view build progress
Currently, encountering an error when loading view build progress
would result in view builder refusing to start - which also means
that future views would not be built until the server restarts.
A more user-friendly solution would be to log an error message,
but continue to boot the view builder as if no views are currently
in progress, which would at least allow future views to be built
correctly.
The test case is also amended, since now it expects the call
to return that "no view builds are in progress" instead of
an exception.
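
In Python terms, the new behaviour amounts to the following sketch. The
`load` callable and the function name are hypothetical; the real code
works on futures inside the view builder.

```python
import logging

logger = logging.getLogger("view")


def load_view_build_progress(load):
    """Call `load` (a hypothetical loader) and fall back to an empty result on error."""
    try:
        return list(load())
    except Exception as e:
        logger.error("Failed to load view build progress: %s", e)
        # Behave as if no view builds are in progress, so that the view
        # builder still starts and future views can be built.
        return []
```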

Fixes #6934
Tests: unit(dev)
Message-Id: <9f26de941d10e6654883a919fd43426066cee89c.1595922374.git.sarna@scylladb.com>
2020-07-28 11:32:09 +03:00
Piotr Sarna
0dbcaa1fd9 test: add a case for disengaged optional values in system tables
Following the patch which fixes incorrect access to disengaged
optionals, a test case which used to reproduce the problem is added.
Message-Id: <99174d47c1c55ed8730b4998d5e5e464990d36e3.1595834092.git.sarna@scylladb.com>
2020-07-28 10:06:42 +03:00
Piotr Sarna
43a3719fe4 cql3: fix potential segfault on disengaged optional
In untyped_result_set::get_view, there exists a silent assumption
that the underlying data, which is an optional, is always engaged.
In case the value happens to be disengaged, it may lead to creating
an incorrect bytes view from a disengaged optional.
In order to make the code safer (since values parsed by this code
often come from the network and can contain virtually anything)
a segfault is replaced with an exception, by calling optional's
value() function, which throws when called on disengaged optionals.

Fixes #6915
Tests: unit(dev)
Message-Id: <6e9e4ca67e0e17c17b718ab454c3130c867684e2.1595834092.git.sarna@scylladb.com>
2020-07-28 10:06:00 +03:00
Raphael S. Carvalho
0d70efa58e sstable: index_reader: Make sure streams are all properly closed on failure
Turns out the fix f591c9c710 wasn't enough to make sure all input streams
are properly closed on failure.
It only closes the main input stream that belongs to context, but it misses
all the input streams that can be opened in the consumer for promoted index
reading. Consumer stores a list of indexes, where each of them has its own
input stream. On failure, we need to make sure that every single one of
them is properly closed before destroying the indexes as that could cause
memory corruption due to read ahead.
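
The "close every stream, even if some fail" pattern can be sketched in
Python as below. This is only an analogy using synchronous file-like
objects; `close_all` is a hypothetical helper, not part of index_reader.

```python
import io


def close_all(streams):
    """Close every stream, even if some close() calls fail; re-raise the first error."""
    first_error = None
    for stream in streams:
        try:
            stream.close()
        except Exception as e:
            if first_error is None:
                first_error = e
    if first_error is not None:
        raise first_error


# Usage: every per-index stream is closed before the indexes are dropped.
index_streams = [io.BytesIO(b"index-0"), io.BytesIO(b"index-1")]
close_all(index_streams)
```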

Fixes #6924.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200727182214.377140-1-raphaelsc@scylladb.com>
2020-07-28 10:01:44 +03:00
Rafael Ávila de Espíndola
34d60efbf9 sstables: Delete write_failed
It is no longer used.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-27 11:23:48 -07:00
Rafael Ávila de Espíndola
030f96be1a sstables: Move monitor after writer in compaction_writer
With this the monitor is destroyed first. It makes intuitive sense to
me to destroy a monitor_X before X. This is also the order we had
before 55a8b6e3c9.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-27 11:23:47 -07:00
Juliusz Stasiewicz
9e4247090f cdc: Implementations of delta_mode::off/keys
At the stage of `finish`ing CDC mutation, deltas are removed (mode
`off`) or edited to keep only PK+CK of the base table (mode `keys`).

Fixes #6838
2020-07-27 19:05:47 +02:00
Juliusz Stasiewicz
c05128d217 cdc: Infrastructure for controlling delta_mode
The goal is to have finer control over CDC "delta" rows, i.e.:
- disable them totally (mode `off`);
- record only PK+CK (mode `keys`);
- make them behave as usual (mode `full`, default).

This commit adds the necessary infrastructure to `cdc_options`.
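
A rough Python sketch of what the three modes mean for delta rows. Rows
are modelled as plain dicts and the function name is hypothetical; CDC log
metadata columns are ignored here for simplicity.

```python
from enum import Enum


class DeltaMode(Enum):
    OFF = "off"
    KEYS = "keys"
    FULL = "full"  # default


def apply_delta_mode(delta_rows, key_columns, mode=DeltaMode.FULL):
    """Filter delta rows (dicts of column -> value) according to the mode."""
    if mode is DeltaMode.OFF:
        return []  # deltas are removed entirely
    if mode is DeltaMode.KEYS:
        # Keep only the PK+CK columns of the base table.
        return [{c: v for c, v in row.items() if c in key_columns} for row in delta_rows]
    return list(delta_rows)  # behave as usual
```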
2020-07-27 19:00:06 +02:00
Nadav Har'El
a7df8486b1 alternator test: add test for tracing
In commit 8d27e1b, we added tracing (see docs/tracing.md) support to
Alternator requests. However, we never had a functional test that
verifies this feature actually works as expected, and we recently
noticed that for the GetItem and BatchGetItem requests, the trace
doesn't really work (it returns an empty list of events).

So this patch adds a test, test/alternator/test_tracing.py, which verifies
that the tracing feature works for the PutItem, GetItem, DeleteItem,
UpdateItem, BatchGetItem, BatchWriteItem, Query and Scan operations.

This test is very peculiar. It needs to use out-of-band REST API
requests to enable and disable tracing (of course, the test is skipped
when running against AWS - this is a Scylla-only feature). It also needs
to read CQL-only system tables and does this using Alternator's
".scylla.alternator" interface for system tables - which came through
for us here beautifully and demonstrated its usefulness.

I paid a lot of attention for this test to remain reasonably fast -
this entire test now runs in a little less than one second. Achieving
this while testing eight different requests was a bit of a challenge,
because traces take time until they are visible in the trace table.
This is the main reason why in this patch the tests for all eight
request types are done in one test, instead of eight separate tests.

Fixes #6891

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200727115401.1199024-1-nyh@scylladb.com>
2020-07-27 14:31:45 +02:00
Takuya ASADA
97fa17b17b scylla_setup: remove square bracket from disk prompt selected list
The selected list on the disk prompt looks like a list of alternatives; it's better to use
single quotes.

Fixes #6760
2020-07-27 14:50:31 +03:00
Avi Kivity
3f84d41880 Merge "messaging: make verb handler registering independent of current scheduling group" from Botond
"
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall-back to per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.

This inadvertently broke `scheduling_group_for_verb()` which also used
this method to get the scheduling group to be used to isolate a verb at
handle register time. This method needs the default client idx for each
verb, but if verb registering is run under the system group it instead
got the non-default one, resulting in the per-handler isolation not
being set-up for the default statement connection, resulting in default
statement verb handlers running in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.

This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.

In particular this caused severe problems with range scans, which in
some cases ended up using different semaphores per page resulting in a
crash. This could happen because when the page was read locally the code
would run in the statement scheduling group, but when the request
arrived from a remote coordinator via rpc, it was read in a system
scheduling group. This caused a mismatch between the semaphore the saved
reader was created with and the one the new page was read with. The
result was that in some cases when looking up a paused reader from the
wrong semaphore, a reader belonging to another read was returned,
creating a disconnect between the lifecycle of readers and that of
the slice and range they were referencing.

This series fixes the underlying problem of the scheduling group
influencing the verb handler registration, as well as adding some
additional defenses if this semaphore mismatch ever happens in the
future. Inactive read handles are now unique across all semaphores,
meaning that it is not possible anymore that a handle succeeds in
looking up a reader when used with the wrong semaphore. The range scan
algorithm now also makes sure there is no semaphore mismatch between the
one used for the current page and that of the saved reader from the
previous page.
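
To illustrate the two defenses, here is a simplified Python model. The
classes and method names are only illustrative and do not mirror the real
reader_concurrency_semaphore interface: handles draw their ids from a
counter shared by all semaphores and remember which semaphore issued
them, so a lookup on the wrong semaphore cannot return some other read's
reader.

```python
import itertools


class InactiveReadHandle:
    def __init__(self, semaphore, handle_id):
        self.semaphore = semaphore
        self.id = handle_id


class ReaderConcurrencySemaphore:
    # Shared by all instances: ids are unique across all semaphores.
    _next_id = itertools.count(1)

    def __init__(self, name):
        self._name = name
        self._inactive = {}

    def name(self):
        return self._name

    def register_inactive_read(self, reader):
        handle = InactiveReadHandle(self, next(self._next_id))
        self._inactive[handle.id] = reader
        return handle

    def unregister_inactive_read(self, handle):
        """Return the saved reader, or None on a semaphore mismatch."""
        if handle.semaphore is not self:
            return None
        return self._inactive.pop(handle.id, None)
```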

I manually checked that each individual defense added is already
preventing the crash from happening.

Fixes: #6613
Fixes: #6907
Fixes: #6908

Tests: unit(dev), manual(run the crash reproducer, observe no crash)
"

* 'query-classification-regressions/v1' of https://github.com/denesb/scylla:
  multishard_mutation_query: use cached semaphore
  messaging: make verb handler registering independent of current scheduling group
  multishard_mutation_query: validate the semaphore of the looked-up reader
  reader_concurrency_semaphore: make inactive read handles unique across semaphores
  reader_concurrency_semaphore: add name() accessor
  reader_concurrency_semaphore: allow passing name to no-limit constructor
2020-07-27 13:56:52 +03:00
Nadav Har'El
9080709c56 docs: add paragraph to tracing.md
Issue #6919 was caused by an incorrect assumption: I *assumed* that if we see
the tracing session record, we can be sure that the event records for this
session had already been written. In this patch we add a paragraph to
the tracing documentation - docs/tracing.md, which explains that this
assumption is in fact incorrect:

1. On a multi-node setup, replicas may continue to write tracing events
   after the coordinator "finished" (moved to background) the request
   and wrote the session record.

2. Even on a single-node setup, the writes of the session record and the
   individual events are asynchronous, and can happen in an unexpected
   order (which is what happened in issue #6919).
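
For a test, one consequence is that event records have to be polled for
rather than read once right after the session record shows up. A minimal
sketch, assuming a hypothetical `fetch_events` callable that returns the
events currently visible for a session:

```python
import time


def wait_for_events(fetch_events, timeout=5.0, interval=0.1):
    """Poll `fetch_events` until it returns a non-empty list or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        events = fetch_events()
        if events:
            return events
        if time.monotonic() >= deadline:
            raise TimeoutError("trace events did not appear in time")
        time.sleep(interval)
```

Note that this only waits for the first events to appear; on a multi-node
setup more events may still arrive afterwards.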

Refs #6919.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200727102438.1194314-1-nyh@scylladb.com>
2020-07-27 13:38:57 +03:00
Takuya ASADA
0ffa0e8745 dist_util.py: use correct ID value to detect Amazon Linux 2
In 2d63acdd6a we replaced 'ol' and 'amzn'
with 'oracle' and 'amazon', but distro.id() actually returns 'amzn' for
Amazon Linux 2, so we need to revert the change.

Fixes #6882
2020-07-27 12:46:21 +03:00
Botond Dénes
eeeef0a0f1 multishard_mutation_query: use cached semaphore
Instead of requesting the query class config from the database every
time the semaphore is needed, use the cached one by calling
`semaphore()`.
2020-07-27 12:17:22 +03:00
Nadav Har'El
65f75e3862 alternator test: enable test_get_records
After issue #6864 was fixed, the test_streams.py::test_get_records test no
longer fails, so its "xfail" marker can be removed.

Refs #6864.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200722132518.1077882-1-nyh@scylladb.com>
2020-07-27 09:19:37 +02:00
Nadav Har'El
f488eaebaf merge: db/view: view_update_generator: make staging reader evictable
Merged patch set by Botond Dénes:

The view update generation process creates two readers. One is used to
read the staging sstables, the data which needs view updates to be
generated for, and another reader for each processed mutation, which
reads the current value (pre-image) of each row in said mutation. The
staging reader is created first and is kept alive until all staging data
is processed. The pre-image reader is created separately for each
processed mutation. The staging reader is not restricted, meaning it
does not wait for admission on the relevant reader concurrency
semaphore, but it does register its resource usage on it. The pre-image
reader however *is* restricted. This creates a situation, where the
staging reader possibly consumes all resources from the semaphore,
leaving none for the later created pre-image reader, which will not be
able to start reading. This will block the view building process meaning
that the staging reader will not be destroyed, causing a deadlock.

This patch solves this by making the staging reader restricted and
making it evictable. To prevent thrashing -- evicting the staging reader
after reading only a really small partition -- we only make the staging
reader evictable after we have read at least 1MB worth of data from it.
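
The anti-thrashing rule can be modelled as in the Python sketch below. The
class and its method names are hypothetical, and resetting the counter
when the reader is resumed after an eviction is an assumption of this
sketch.

```python
EVICTION_THRESHOLD = 1024 * 1024  # 1MB worth of data


class StagingReaderState:
    """Hypothetical bookkeeping deciding when the staging reader may be evicted."""

    def __init__(self, threshold=EVICTION_THRESHOLD):
        self._threshold = threshold
        self._read_since_resume = 0

    def on_data_read(self, nbytes):
        self._read_since_resume += nbytes

    def may_evict(self):
        # Evictable only once enough data was read, to avoid thrashing.
        return self._read_since_resume >= self._threshold

    def on_resume(self):
        # Assumed: the budget starts over after the reader is recreated.
        self._read_since_resume = 0
```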

  test/boost: view_build_test: add test_view_update_generator_buffering
  test/boost: view_build_test: add test test_view_update_generator_deadlock
  reader_permit: reader_resources: add operator- and operator+
  reader_concurrency_semaphore: add initial_resources()
  test: cql_test_env: allow overriding database_config
  mutation_reader: expose new_reader_base_cost
  db/view: view_updating_consumer: allow passing custom update pusher
  db/view: view_update_generator: make staging reader evictable
  db/view: view_updating_consumer: move implementation from table.cc to view.cc
  database: add make_restricted_range_sstable_reader()

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
---
 db/view/view_updating_consumer.hh | 51 ++++++++++++++++++++++++++++---
 db/view/view.cc                   | 39 +++++++++++++++++------
 db/view/view_update_generator.cc  | 19 +++++++++---
 3 files changed, 91 insertions(+), 18 deletions(-)
2020-07-27 09:19:37 +02:00
Botond Dénes
fe127a2155 sstables: clamp estimated_partitions to [1, +inf) in writers
In some cases the estimated number of partitions can be 0, which, albeit a
legit estimation result, breaks a lot of low-level sstable writer code, so
some of these have assertions to ensure estimated partitions is > 0.
To avoid hitting this assert all users of the sstable writers do the
clamping, to ensure estimated partitions is at least 1. However leaving
this to the callers is error prone, as #6913 has shown. As this
clamping is standard practice, it is better to do it in the writers
themselves, avoiding this problem altogether. This is exactly what this
patch does. It also adds two unit tests, one that reproduces the crash
in #6913, and another one that ensures all sstable writers are fine with
estimated partitions being 0 now. Call sites previously doing the
clamping are changed to not do it, it is unnecessary now as the writer
does it itself.
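
The clamping itself is tiny. In Python terms, with a hypothetical helper
name and a stub writer assumed for illustration:

```python
def clamp_estimated_partitions(estimated_partitions):
    """Clamp the estimate to [1, +inf) so that low-level code never sees 0."""
    return max(1, estimated_partitions)


class SSTableWriterStub:
    """Hypothetical writer that clamps on construction instead of asserting."""

    def __init__(self, estimated_partitions):
        self.estimated_partitions = clamp_estimated_partitions(estimated_partitions)
```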

Fixes #6913

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>
2020-07-27 09:19:37 +02:00
Avi Kivity
91619d77a1 Merge "Simplify the lifetime management of write monitors" from Raphael
"
This makes sure that monitors are always owned by the same struct that
owns the monitored writer, simplifying the lifetime management.

This hopefully fixes some of the crashes we have observed around this
area.
"

* 'espindola/use-compaction_writer-v6' of https://github.com/espindola/scylla:
  sstables: Rename _writer to _compaction_writer
  sstables: Move compaction_write_monitor to compaction_writer
  sstables: Add couple of writer() getters to garbage_collected_sstable_writer
  sstables: Move compaction_write_monitor earlier in the file
2020-07-27 09:19:37 +02:00
Dejan Mircevski
c11b2de84c cql3: Fix tombstone-range check for TRUE
A DELETE statement checks that the deletion range is symmetrically
bounded.  This check was broken for expression TRUE.

Test the fix by setting initial_key_restrictions::expression to TRUE,
since CQL doesn't currently allow WHERE TRUE.  That change has been
proposed anyway in feedback to #5763:

https://github.com/scylladb/scylla/pull/5763#discussion_r443213343

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-07-27 09:19:37 +02:00
Dejan Mircevski
ba74659f5a cql/restrictions: Constrain to_sorted_vector
As requested in #5763 feedback, enforce the function's assumptions
with concept asserts.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-07-27 09:19:37 +02:00
Botond Dénes
0df4c2fd3b messaging: make verb handler registering independent of current scheduling group
0c6bbc8 refactored `get_rpc_client_idx()` to select different clients
for statement verbs depending on the current scheduling group.
The goal was to allow statement verbs to be sent on different
connections depending on the current scheduling group. The new
connections use per-connection isolation. For backward compatibility the
already existing connections fall-back to per-handler isolation used
previously. The old statement connection, called the default statement
connection, also used this. `get_rpc_client_idx()` was changed to select
the default statement connection when the current scheduling group is
the statement group, and a non-default connection otherwise.

This inadvertently broke `scheduling_group_for_verb()` which also used
this method to get the scheduling group to be used to isolate a verb at
handle register time. This method needs the default client idx for each
verb, but if verb registering is run under the system group it instead
got the non-default one, resulting in the per-handler isolation not
being set-up for the default statement connection, resulting in default
statement verb handlers running in whatever scheduling group the process
loop of the rpc is running in, which is the system scheduling group.

This caused all sorts of problems, even beyond user queries running in
the system group. Also as of 0c6bbc8 queries on the replicas are
classified based on the scheduling group they are running on, so user
reads also ended up using the system concurrency semaphore.
2020-07-27 10:11:21 +03:00
Asias He
cd7d64f588 gossip: Introduce GOSSIP_GET_ENDPOINT_STATES verb
The new verb is used to replace the current gossip shadow round
implementation. The current shadow round implementation reuses the gossip
syn and ack async messages, which has plenty of drawbacks. It is hard to
tell if a specific peer node has responded to the syn messages. The
delayed responses from the shadow round can apply to the normal gossip
states even if the shadow round is done. The syn and ack message
handlers are full of special cases due to the shadow round. All gossip
application states, including the ones that are not relevant, are sent
back. The gossip application states are applied and the gossip
listeners are called as if in normal gossip operation. It is
completely unnecessary to call the gossip listeners in the shadow round.

This patch introduces a new verb to request the exact gossip application
states the shadow round needs with a synchronous verb and applies the
application states without calling the gossip listeners. This patch
makes the shadow round easier to reason about, more robust and
efficient.

Refs: #6845
Tests: update_cluster_layout_tests.py
2020-07-27 09:15:11 +08:00
Asias He
bebd683177 gossip: Add do_apply_state_locally helper
The code in do_apply_state_locally will be shared in the next patch.

Refs: #6845
Tests: update_cluster_layout_tests.py
2020-07-27 09:00:47 +08:00
Piotr Sarna
d08e22c4eb alternator: fix tracing BatchGetItem
The BatchGetItem request did not pass its trace state to lower layers
in a correct manner, which resulted in losing tracing information.

Refs #6891
Message-Id: <078f58a0f76b9f182f671a8d16e147ded489138c.1595515815.git.sarna@scylladb.com>
2020-07-23 20:05:10 +03:00
Piotr Sarna
7256572e41 alternator: fix tracing GetItem
The GetItem request did not pass the trace state properly,
which resulted in having almost empty traces.

Refs #6891

Tests: manual:

Before:
 session_id                           | event_id                             | activity                                                                                                               | scylla_parent_id | scylla_span_id  | source    | source_elapsed | thread
--------------------------------------+--------------------------------------+------------------------------------------------------------------------------------------------------------------------+------------------+-----------------+-----------+----------------+---------
 57995da0-cce4-11ea-97ea-000000000000 | 579971c4-cce4-11ea-97ea-000000000000 |                                                                                                                GetItem |                0 | 131309406144163 | 127.0.0.1 |              0 | shard 0

After:
 session_id                           | event_id                             | activity                                                                                                               | scylla_parent_id | scylla_span_id  | source    | source_elapsed | thread
--------------------------------------+--------------------------------------+------------------------------------------------------------------------------------------------------------------------+------------------+-----------------+-----------+----------------+---------
 57995da0-cce4-11ea-97ea-000000000000 | 579971c4-cce4-11ea-97ea-000000000000 |                                                                                                                GetItem |                0 | 131309406144163 | 127.0.0.1 |              0 | shard 0
 57995da0-cce4-11ea-97ea-000000000000 | 57997327-cce4-11ea-97ea-000000000000 | Creating read executor for token -7535857341981351089 with all: {127.0.0.1} targets: {127.0.0.1} repair decision: NONE |                0 | 131309406144163 | 127.0.0.1 |             35 | shard 0
 57995da0-cce4-11ea-97ea-000000000000 | 5799733d-cce4-11ea-97ea-000000000000 |                                                                                            read_data: querying locally |                0 | 131309406144163 | 127.0.0.1 |             38 | shard 0
 57995da0-cce4-11ea-97ea-000000000000 | 57997358-cce4-11ea-97ea-000000000000 |                                                   Start querying the token range that starts with -7535857341981351089 |                0 | 131309406144163 | 127.0.0.1 |             40 | shard 0
 57995da0-cce4-11ea-97ea-000000000000 | 57997579-cce4-11ea-97ea-000000000000 |                                                                                                       Querying is done |                0 | 131309406144163 | 127.0.0.1 |             95 | shard 0
Message-Id: <d585ff7aaaeebf2050890643d40cdafb2efb8d98.1595509338.git.sarna@scylladb.com>
2020-07-23 20:05:06 +03:00
Avi Kivity
39db54a758 Merge "Use seastar::with_file_close_on_failure in commitlog" from Benny
"
`close_on_failure` was committed to seastar so use
the library version.

This requires making the lambda function passed to
it nothrow move constructible, so this series also
makes db::commitlog::descriptor move constructor noexcept
and changes allocate_segment_ex and segment::segment
to get a descriptor by value rather than by reference.

Test: unit(dev), commitlog_test(debug)
"

* tag 'commit-log-use-with_file_close_on_failure-v1' of github.com:bhalevy/scylla:
  commitlog: use seastar::with_file_close_on_failure
  commitlog: descriptor: make nothrow move constructible
  commitlog: allocate_segment_ex, segment: pass descriptor by value
  commitlog: allocate_segment_ex: filename capture is unused
2020-07-23 19:23:23 +03:00
Rafael Ávila de Espíndola
bca4eb8b8c Build: Garbage collect dead sections
In another patch I noticed gcc producing dead functions. I am not sure
why gcc is doing that. Some of those functions are already placed in
independent sections, and so can be garbage collected by the linker.

This is a 1% text section reduction in scylla, from 39363380 to
38974324 bytes. There is no difference in the tps reported by
perf_simple_query.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200723152511.8214-1-espindola@scylladb.com>
2020-07-23 18:57:01 +03:00
Piotr Sarna
6cdc9f1a43 Merge 'alternator: refactor api_error class' from Nadav
In the patch "Add exception overloads for Dynamo types", Alternator's single
api_error exception type was replaced by a more complex hierarchy of types.
The implementation was not only longer and more complex to understand -
I believe it also negated an important observation:

The "api_error" exception type is special. It is not an exception created
by code for other code. It is not meant to be caught in Alternator code.
Instead, it is supposed to contain an error message created for the *user*,
containing one of the few supported exception "names" described
in the DynamoDB documentation, and a user-readable text message. Throwing
such an exception in Alternator code means the thrower wants the request
to abort immediately, and this message to reach the user. These exceptions
are not designed to be caught in Alternator code. Code should use other
exceptions - or alternatives to exceptions (e.g., std::optional) for
problems that should be handled before returning a different error to the
user. Moreover, "api_error" isn't just thrown as an exception - it can
also be returned by value (in an executor::request_return_type) - which is
another reason why it should not be subclassed.

For these reasons, I believe we should have a single api_error type, and
it's wrong to subclass it. So in this patch I am reverting the subclasses
and template added in the aforementioned patch.

Still, one correct observation made in that patch was that it is
inconvenient to type in DynamoDB exception names (no help from the editor
in completing those strings) and also error-prone. In this patch we
propose a different - simpler - solution to the same problem:

We add trivial factory functions, e.g., api_error::validation(std::string)
as a shortcut to api_error("ValidationException"). The new implementation
is easy to understand, and also more self explanatory to readers:
It is now clear that "api_error::validation()" is actually a user-visible
"api_error", something which was obscured by the name validation_exception()
used before this patch.

Finally, this patch also improves the comment in error.hh explaining the
purpose of api_error and the fact it can be returned or thrown. The fact
it should not be subclassed is legislated with a "final". There is also
no point in this class inheriting from std::exception or having virtual
functions, or an empty constructor - so all these are dropped as well.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

* 'api-error-refactor' of https://github.com/nyh/scylla:
  alternator: use api_error factory functions in auth.cc
  alternator: use api_error::validation()
  alternator: use api_error factory functions in executor.cc
  alternator: use api_error factory functions in server.cc
  alternator: refactor api_error class
2020-07-23 17:35:56 +02:00
Piotr Sarna
e7c18963e4 test: check sizes before dereferencing the vector
It's better to assert a certain vector size first and only then
dereference its elements - otherwise, if a bug causes the size
to be different, the test can crash with a segfault on an invalid
dereference instead of gracefully failing with a test assertion.
2020-07-23 16:49:35 +03:00
Piotr Sarna
6b04034566 cql3: fix multi column restriction bounds
Generating bounds from multi-column restrictions used to create
incorrect nonwrapping intervals, which only happened to work because
they're implemented as wrapping intervals underneath.
The following CQL restriction:
  WHERE (a, b) >= (1, 0)
should translate to
  (a, b) >= (1, 0), no upper bound,
while it incorrectly translates to
  (a, b) >= (1, 0) AND (a, b) < empty-prefix.
Since empty prefix is smaller than any other clustering key,
this range was in fact not correct, since the assumption
was that starting bound was never greater than the ending bound.
While the bug does not trigger any errors in tests right now,
it starts to do so after the code is modified in order to
correctly handle empty intervals (intervals with start > end).
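
A toy model of the two translations, assuming clustering prefixes can
be represented as vectors of ints compared lexicographically (so the
empty prefix sorts first). The names prefix, interval, correct_bounds()
and buggy_bounds() are made up for this sketch, and bound inclusiveness
is ignored:

```cpp
#include <cassert>
#include <optional>
#include <vector>

// A clustering prefix modelled as a vector of ints; std::vector compares
// lexicographically, so the empty prefix sorts before any non-empty one.
using prefix = std::vector<int>;

// Toy nonwrapping interval: a missing bound means "unbounded on that side".
struct interval {
    std::optional<prefix> start;
    std::optional<prefix> end;

    // Well-formed if unbounded on either side, or if start <= end.
    bool is_well_formed() const {
        return !start || !end || *start <= *end;
    }
};

// Intended translation of WHERE (a, b) >= (1, 0): no upper bound at all.
inline interval correct_bounds() {
    return interval{prefix{1, 0}, std::nullopt};
}

// The translation described as incorrect: the upper bound is the empty
// prefix, which sorts before the start bound.
inline interval buggy_bounds() {
    return interval{prefix{1, 0}, prefix{}};
}
```

In this model buggy_bounds() has a start bound that sorts after its end
bound, which a wrapping interval tolerates but a strict nonwrapping one
should not.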
2020-07-23 16:49:24 +03:00
Botond Dénes
b7cfa4ea97 multishard_mutation_query: validate the semaphore of the looked-up reader
To make sure it belongs to the same semaphore that the database thinks
is appropriate for the current query. Since a semaphore mismatch points
to a serious bug, we use `on_internal_error()` to allow generating
coredumps on-demand.
2020-07-23 16:43:37 +03:00
Botond Dénes
11105cbb78 reader_concurrency_semaphore: make inactive read handles unique across semaphores
Currently inactive read handles are only unique within the same
semaphore, allowing for an unregister against another semaphore to
potentially succeed. This can lead to disasters ranging from crashes to
data corruption. While a handle should never be used with another
semaphore in the first place, we have recently seen a bug (#6613)
causing exactly that, so in this patch we prevent such unregister
operations from ever succeeding by making handles unique across all
semaphores. This is achieved by adding a pointer to the semaphore to the
handle.
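
A minimal sketch of the idea, not the actual implementation:
toy_semaphore and the layout of inactive_read_handle below are
assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

class toy_semaphore;

// A handle remembers which semaphore issued it, not just a per-semaphore id.
struct inactive_read_handle {
    const toy_semaphore* owner = nullptr;
    std::uint64_t id = 0;
};

class toy_semaphore {
    std::uint64_t _next_id = 1;
    std::unordered_map<std::uint64_t, std::string> _inactive_reads;
public:
    inactive_read_handle register_inactive_read(std::string reader) {
        const auto id = _next_id++;
        _inactive_reads.emplace(id, std::move(reader));
        return inactive_read_handle{this, id};
    }

    // Succeeds only if the handle was issued by this semaphore and is
    // still registered.
    bool unregister_inactive_read(inactive_read_handle h) {
        if (h.owner != this) {
            // Ids can collide across semaphores; the owner pointer cannot.
            return false;
        }
        return _inactive_reads.erase(h.id) == 1;
    }
};
```

With only the id, the first handle of one semaphore would match the
first handle of any other; the owner check rejects that case.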
2020-07-23 16:43:33 +03:00
Botond Dénes
d12540bfbf reader_concurrency_semaphore: add name() accessor
Allows identifying the semaphore in question in semaphore related error
messages.
2020-07-23 16:42:54 +03:00
Botond Dénes
88129f500f reader_concurrency_semaphore: allow passing name to no-limit constructor
So tests can provide names for semaphores as well, making test output
more clear.
2020-07-23 16:42:36 +03:00
Nadav Har'El
b661c1eae2 alternator: use api_error factory functions in auth.cc
All the places in auth.cc where we constructed an api_error with inline
strings now use api_error factory functions.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-07-23 15:36:39 +03:00
Nadav Har'El
bca88521ba alternator: use api_error::validation()
All the places in conditions.cc, expressions.cc and serialization.cc where
we constructed an api_error, we always used the ValidationException type
string, which the code repeated dozens of times.
This patch converts all these places to use the factory function
api_error::validation().

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-07-23 15:36:39 +03:00
Nadav Har'El
06ba0c0232 alternator: use api_error factory functions in executor.cc
All the places in executor.cc where we constructed an api_error with inline
strings now use api_error factory functions. Most of them, but not all of
them, were api_error::validation(). We also needed to add a couple more of
these factory functions.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-07-23 15:36:39 +03:00
Nadav Har'El
81589be00a alternator: use api_error factory functions in server.cc
All the places in server.cc where we constructed an api_error with inline
strings now use api_error factory functions - we needed to add a few more.

Interestingly, we had a wrong type string for "Internal Server Error",
which we fix in this patch. We wrote the type string like that - with spaces -
because this is how it was listed in the DynamoDB documentation at
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html
But this was in fact wrong, and it should be without spaces:
"InternalServerError". The botocore library (for example) recognizes it
this way, and this string can also be seen in other online DynamoDB examples.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-07-23 15:36:39 +03:00
Nadav Har'El
5a35632cd3 alternator: refactor api_error class
In the patch "Add exception overloads for Dynamo types", Alternator's single
api_error exception type was replaced by a more complex hierarchy of types.
The implementation was not only longer and more complex to understand -
I believe it also negated an important observation:

The "api_error" exception type is special. It is not an exception created
by code for other code. It is not meant to be caught in Alternator code.
Instead, it is supposed to contain an error message created for the *user*,
containing one of the few supported exception "names" described
in the DynamoDB documentation, and a user-readable text message. Throwing
such an exception in Alternator code means the thrower wants the request
to abort immediately, and this message to reach the user. These exceptions
are not designed to be caught in Alternator code. Code should use other
exceptions - or alternatives to exceptions (e.g., std::optional) for
problems that should be handled before returning a different error to the
user. Moreover, "api_error" isn't just thrown as an exception - it can
also be returned by value (in an executor::request_return_type) - which is
another reason why it should not be subclassed.

For these reasons, I believe we should have a single api_error type, and
it's wrong to subclass it. So in this patch I am reverting the subclasses
and template added in the aforementioned patch.

Still, one correct observation made in that patch was that it is
inconvenient to type in DynamoDB exception names (no help from the editor
in completing those strings) and also error-prone. In this patch we
propose a different - simpler - solution to the same problem:

We add trivial factory functions, e.g., api_error::validation(std::string)
as a shortcut to api_error("ValidationException"). The new implementation
is easy to understand, and also more self explanatory to readers:
It is now clear that "api_error::validation()" is actually a user-visible
"api_error", something which was obscured by the name validation_exception()
used before this patch.

Finally, this patch also improves the comment in error.hh explaining the
purpose of api_error and the fact it can be returned or thrown. The fact
it should not be subclassed is legislated with a "final". There is also
no point in this class inheriting from std::exception or having virtual
functions, or an empty constructor - so all these are dropped as well.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
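
A simplified sketch of the shape described above. The member names and
the internal() factory are assumptions for illustration; only
api_error::validation() and the two type strings appear in these
commit messages.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// "final" forbids subclassing. The class deliberately does not derive
// from std::exception and has no virtual functions.
class api_error final {
public:
    std::string type; // DynamoDB error name, e.g. "ValidationException"
    std::string msg;  // user-readable text

    api_error(std::string type, std::string msg)
        : type(std::move(type)), msg(std::move(msg)) {}

    // Factory functions spare callers from retyping the error names.
    static api_error validation(std::string msg) {
        return api_error("ValidationException", std::move(msg));
    }
    // Hypothetical second factory, shown only to illustrate the pattern.
    static api_error internal(std::string msg) {
        return api_error("InternalServerError", std::move(msg));
    }
};

static_assert(std::is_final_v<api_error>);
static_assert(!std::is_polymorphic_v<api_error>);
```

Because the type is a plain value, the same object can be thrown or
returned by value without slicing concerns.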
2020-07-23 15:36:39 +03:00
Avi Kivity
01b838e291 Merge "Unregister RPC verbs on stop" from Pavel E
"
There are 5 services that register their RPC handlers in the messaging
service, but only a few of them unregister them on stop.

Unregistering is somewhat critical, not just because it makes the
code look clean, but also because unregistration does wait for the
message processing to complete, thus avoiding use-after-free's in
the handlers.

In particular, several handlers call service::get_schema_for_write()
which, in turn, may end up in service::maybe_sync() calling for
the local migration manager instance. All those handlers' processing
must be waited for before stopping the migration manager.

The set brings the RPC handlers unregistration in sync with the
registration part.

tests: unit (dev)
       dtest (dev: simple_boot_shutdown, repair)
       start-stop by hands (dev)
fixes: #6904
"

* 'br-rpc-unregister-verbs' of https://github.com/xemul/scylla:
  main: Add missing calls to unregister RPC handlers
  messaging: Add missing per-service unregistering methods
  messaging: Add missing handlers unregistration helpers
  streaming: Do not use db->invoke_on_all in vain
  storage_proxy: Detach rpc unregistration from stop
  main: Shorten call to storage_proxy::init_messaging_service
2020-07-23 12:03:49 +03:00
Pekka Enberg
f9092bc4fc Replace MAINTAINERS with CODEOWNERS
Replace the MAINTAINERS file with a CODEOWNERS file, which Github is
able to parse, and suggest reviewers for pull requests.
2020-07-23 09:25:40 +03:00
Asias He
55271f714e gossip: Do not talk to seed node explicitly
Currently, we talk to a seed node in each gossip round with some
probability, i.e., nr_of_seeds / (nr_of_live_nodes + nr_of_unreachable_nodes)
For example, with 5 seeds in a 50-node cluster, the probability is 0.1.

Now that we talk to all live nodes, including the seed nodes, within a
bounded time period, it is no longer necessary to talk to a seed node
in each gossip round.

In order to get rid of the seed concept, do not talk to seed node
explicitly in each gossip round.

This patch is a preparatory patch to remove the seed concept in gossip.

Refs: #6845
Tests: update_cluster_layout_tests.py
2020-07-23 14:24:06 +08:00
Asias He
8e219e10e7 gossip: Talk to live endpoints in a shuffled fashion
Currently, we select 10 percent of random live nodes to talk with in
each gossip round. There is no upper bound on how long it will take to
talk to all live nodes.

This patch changes the way we select live nodes to talk with as below:

 1) Shuffle all the live endpoints randomly
 2) Split the live endpoints into 10 groups
 3) Talk to one of the groups in each gossip round
 4) Go to step 1 to shuffle again after all groups have been talked with

We keep both randomness of selecting nodes as before and determinacy to complete
talking to all live nodes.
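
The steps above can be sketched as follows. This is a toy model, not
the gossiper code: the class name, the string endpoints and the seeded
std::mt19937 are assumptions, and the special handling of newly added
nodes is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <utility>
#include <vector>

class shuffled_gossip_targets {
    static constexpr std::size_t nr_groups = 10;
    std::vector<std::string> _live;
    std::vector<std::string> _shuffled;
    std::size_t _next_group = 0;
    std::mt19937 _rng;
public:
    explicit shuffled_gossip_targets(std::vector<std::string> live, unsigned seed = 0)
        : _live(std::move(live)), _rng(seed) {}

    // Returns the endpoints to talk to in this gossip round.
    std::vector<std::string> next_round() {
        if (_next_group == 0) {
            // Steps 1 and 4: (re)shuffle once every group has been used.
            _shuffled = _live;
            std::shuffle(_shuffled.begin(), _shuffled.end(), _rng);
        }
        // Step 2: group i covers [i * n / 10, (i + 1) * n / 10).
        const std::size_t n = _shuffled.size();
        const std::size_t first = _next_group * n / nr_groups;
        const std::size_t last = (_next_group + 1) * n / nr_groups;
        // Step 3: hand out one group per round.
        _next_group = (_next_group + 1) % nr_groups;
        return std::vector<std::string>(_shuffled.begin() + first,
                                        _shuffled.begin() + last);
    }
};
```

In this model every live endpoint is returned exactly once per 10
rounds, which is the bounded-time property the patch is after.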

In addition, the way newly added nodes are favored is simplified. When a
new live node is added, it is always added to the front of the group, so
it will be talked with in the next gossip round.

This patch is a preparatory patch to remove the seed concept in gossip.

Refs: #6845
Tests: update_cluster_layout_tests.py
2020-07-23 14:23:59 +08:00
Pekka Enberg
39885dbdc8 Update MAINTAINERS 2020-07-23 09:03:39 +03:00
Avi Kivity
b4b9deadf3 build: install jmx and tools-java submodule dependencies
Let each submodule be responsible for its own dependencies, and
call the submodule's dependency installation script.

Reviewed-by: Piotr Jastrzebski <piotr@scylladb.com>
Reviewed-by: Takuya ASADA <syuu@scylladb.com>
2020-07-22 20:13:50 +03:00
Avi Kivity
7fbe50a4e4 build: remove pystache from install-dependencies
As of d6165bc1c3 we do not
depend on pystache, so don't install it.

Reviewed-by: Takuya ASADA <syuu@scylladb.com>
2020-07-22 20:12:31 +03:00
Avi Kivity
19da4a5b8f build: don't package tools/java and tools/jmx in relocatable package
tools/java and tools/jmx have their own relocatable packages (and rpm/deb),
so they should not be part of the main relocatable package.

Enforce this by enabling the filter parameter in reloc_add, and passing
a filter that excludes tools/java and tools/jmx.
2020-07-22 20:03:18 +03:00
Avi Kivity
98a22e572a dist: redhat: reduce log spam from unpacking sources when building rpm
rpmbuild defaults to logging the name of every file it unpacks from
the archive.

Make it quiet with the %setup -q flag.
2020-07-22 20:02:04 +03:00
Rafael Ávila de Espíndola
87b261ab32 sstables: Rename _writer to _compaction_writer
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-22 08:15:55 -07:00
Rafael Ávila de Espíndola
97b7fee78e sstables: Move compaction_write_monitor to compaction_writer
There is one monitor per writer, so we now keep them together in the
compaction_writer struct.

This trivially guarantees that the monitor is always destroyed before
the writer.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-22 08:15:53 -07:00
Rafael Ávila de Espíndola
f8cc582e4a sstables: Add couple of writer() getters to garbage_collected_sstable_writer
This just reduces the noise of an upcoming patch.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-22 07:46:05 -07:00
Rafael Ávila de Espíndola
c740c66840 sstables: Move compaction_write_monitor earlier in the file
This will be used by follow-up patches.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-22 07:46:05 -07:00
Pavel Emelyanov
50d07696e4 main: Add missing calls to unregister RPC handlers
The gossiper's and migration_manager's unregistration is done on
the services' stop; for the rest we need to call the recently
introduced methods.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-22 16:35:07 +03:00
Pavel Emelyanov
5060063cd6 messaging: Add missing per-service unregistering methods
5 services register handlers in messaging, but not all of them
have clear unregistration methods.

Summary:

migration_manager: everything is in place, no changes
gossiper: ditto
proxy: some verbs unregistration is missing
repair: no unregistration at all
streaming: ditto

This patch adds the needed unregistration methods.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-22 16:34:00 +03:00
Pavel Emelyanov
7a7b1b3108 messaging: Add missing handlers unregistration helpers
Handlers for each verb have both -- register and unregister helpers, but unregistration ones
for some verbs are missing, so here they are.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-22 16:31:57 +03:00
Pavel Emelyanov
08e36ca77c streaming: Do not use db->invoke_on_all in vain
The db instance is not needed to initialize messages, so use plain smp::invoke_on_all

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-22 16:31:57 +03:00
Pavel Emelyanov
f845a78d9a storage_proxy: Detach rpc unregistration from stop
The proxy's stop method is not called (and is unlikely to be soon), but stopping
the message handlers is needed now, so prepare the existing method for this.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-22 16:31:57 +03:00
Pavel Emelyanov
cc070ceca0 main: Shorten call to storage_proxy::init_messaging_service
Just for brevity

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-22 16:31:57 +03:00
Kamil Braun
12e2891c60 cdc: if ring_delay == 0, don't add delay to newly created generation
If ring_delay == 0, something fishy is going on, e.g. single-node tests
are being performed. In this case we want the CDC generation to start
operating immediately. There is no need to wait until it propagates to
the cluster.

You should not use ring_delay == 0 in production.

Fixes https://github.com/scylladb/scylla/issues/6864.
2020-07-22 16:06:09 +03:00
Avi Kivity
5e1fa13d08 Merge 'docker: Make I/O configuration setup configurable' from Pekka
"
This adds a '--io-setup N' command line option, which users can pass to
specify whether they want to run the "scylla_io_setup" script or not.
This is useful if users want to specify I/O settings themselves in
environments such as Kubernetes, where running "iotune" is problematic.

While at it, add the same option to "scylla_setup" to keep the interface
between that script and Docker consistent.

Fixes #6587
"

* penberg-penberg/docker-no-io-setup:
  scylla_setup: Add '--io-setup ENABLE' command line option
  dist/docker: Add '--io-setup ENABLE' command line option
2020-07-22 14:17:53 +03:00
Rafael Ávila de Espíndola
e83e91e352 alternator: Fix use after return
Avoid a copy of timeout so that we don't end up with a reference to a
stack allocated variable.

Fixes #6897

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200721184939.111665-1-espindola@scylladb.com>
2020-07-21 22:06:13 +03:00
Rafael Ávila de Espíndola
e15c8ee667 Everywhere: Explicitly instantiate make_lw_shared
seastar::make_lw_shared has a constructor taking a T&&. There is no
such constructor in std::make_shared:

https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared

This means that we have to move from

    make_lw_shared(T(...))

to

    make_lw_shared<T>(...)

if we don't want to depend on the idiosyncrasies of
seastar::make_lw_shared.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-21 10:33:49 -07:00
Rafael Ávila de Espíndola
efeaded427 Everywhere: Add a make_shared_schema helper
This replaces a lot of make_lw_shared(schema(...)) with
make_shared_schema(...).

This makes it easier to drop a dependency on the differences between
seastar::make_shared and std::make_shared.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-21 10:33:49 -07:00
Rafael Ávila de Espíndola
ad6d65dbbd Everywhere: Explicitly instantiate make_shared
seastar::make_shared has a constructor taking a T&&. There is no such
constructor in std::make_shared:

https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared

This means that we have to move from

    make_shared(T(...))

to

    make_shared<T>(...)

if we don't want to depend on the idiosyncrasies of
seastar::make_shared.
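
For comparison, a small sketch of how the two spellings behave with
std::make_shared. The tracked type and the two helper functions are
made up for this example; nothing here uses seastar.

```cpp
#include <cassert>
#include <memory>

// Counts move constructions so the two call styles can be compared.
struct tracked {
    static inline int moves = 0;
    int a, b;
    tracked(int a, int b) : a(a), b(b) {}
    tracked(const tracked&) = default;
    tracked(tracked&& o) noexcept : a(o.a), b(o.b) { ++moves; }
};

// Portable form: the arguments are forwarded and the object is
// constructed in place, so no move happens.
inline int moves_with_explicit_type() {
    tracked::moves = 0;
    auto p = std::make_shared<tracked>(1, 2);
    assert(p->a == 1 && p->b == 2);
    return tracked::moves;
}

// Passing a temporary also works with std::make_shared, but only with
// the template argument spelled out (T cannot be deduced), and it
// costs one move construction.
inline int moves_with_temporary() {
    tracked::moves = 0;
    auto p = std::make_shared<tracked>(tracked(1, 2));
    assert(p->a == 1 && p->b == 2);
    return tracked::moves;
}
```

The explicit-template form compiles against both libraries, which is
what makes a later switch between them mechanical.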

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-21 10:33:49 -07:00
Rafael Ávila de Espíndola
abba521199 cql3: Add a create_multi_column_relation helper
This moves a few calls to make_shared to a single location.

This makes it easier to drop a dependency on the differences between
seastar::make_shared and std::make_shared.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-21 10:33:49 -07:00
Rafael Ávila de Espíndola
8858873d85 main: Return a shared_ptr from defer_verbose_shutdown
This moves a few calls to make_shared to a single location.

This makes it easier to drop a dependency on the differences between
seastar::make_shared and std::make_shared.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-21 10:33:44 -07:00
Avi Kivity
098d24fd6d Update seastar submodule
* seastar 4a99d56453...02ad74fa7d (5):
  > TLS: Use "known" (precalculated) DH parameters if available
  > tutorial: fix advanced service_loop examples
  > tutorial: further fix service_loop example text
  > linux-aio: make the RWF_NOWAIT support work again
  > locking_test: Fix a use after return
2020-07-21 19:08:36 +03:00
Avi Kivity
5ead33d486 Update tools/jmx and tools/java submodules
* tools/java 113c7d993b...a9480f3a87 (3):
  > reloc/build_deb.sh: Fix extra whitespace in "mv" command path
  > README.md: Document repository purpose for Scylla
  > reloc: Add "--builddir" option to build_{rpm,deb}.sh

* tools/jmx aa94fe5...c0d9d0f (2):
  > add build/ to gitignore
  > reloc: Add "--builddir" option to build_{rpm,deb}.sh
2020-07-21 15:33:54 +03:00
Pekka Enberg
0b8c9668e3 scylla_setup: Add '--io-setup ENABLE' command line option
To make the "scylla_setup" interface similar to Docker image, let's add
a "--io-setup ENABLE" command line option. The old "--no-io-setup"
option is retained for compatibility.
2020-07-21 14:48:01 +03:00
Pekka Enberg
fc1851cdc1 dist/docker: Add '--io-setup ENABLE' command line option
This adds a '--io-setup N' command line option, which users can pass to
specify whether they want to run the "scylla_io_setup" script or not.
This is useful if users want to specify I/O settings themselves in
environments such as Kubernetes, where running "iotune" is problematic.

Fixes #6587
2020-07-21 14:42:46 +03:00
Rafael Ávila de Espíndola
bc20b71e6a configure: Don't use pkg-config for xxhash
The pkg-config for xxhash points to the wrong directory. I reported

https://bugzilla.redhat.com/show_bug.cgi?id=1858407

But xxhash is such a simple library that it is trivial to avoid
pkg-config.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200717204344.601729-1-espindola@scylladb.com>
2020-07-20 21:51:23 +03:00
Botond Dénes
929cdd3a15 test/boost: view_build_test: add test_view_update_generator_buffering
To exercise the new buffering and pausing logic of the
view updating consumer.
2020-07-20 14:32:45 +03:00
Botond Dénes
e316796b3f test/boost: view_build_test: add test test_view_update_generator_deadlock
A test case which reproduces the view update generator hang, where the
staging reader consumes all resources and leaves none for the pre-image
reader which blocks on the semaphore indefinitely.
2020-07-20 14:32:13 +03:00
Pekka Enberg
9d183aed2d scripts: Fix submodule names in refresh-submodules.sh
The submodules were moved under tools/jmx and tools/java.
Message-Id: <20200720112447.754850-1-penberg@scylladb.com>
2020-07-20 14:28:39 +03:00
Asias He
28f8798464 repair: Do not use libfmt format specifiers if not needed
We recently saw a weird log message:

   WARN  2020-07-19 10:22:46,678 [shard 0] repair - repair id [id=4,
   uuid=0b1092a1-061f-4691-b0ac-547b281ef09d] failed: std::runtime_error
   ({shard 0: fmt::v6::format_error (invalid type specifier), shard 1:
   fmt::v6::format_error (invalid type specifier)})

It turned out we have:

   throw std::runtime_error(format("repair id {:d} on shard {:d} failed to
   repair {:d} sub ranges", id, shard, nr_failed_ranges));

in the code, but we changed the id from integer to repair_uniq_id class.

We do not really need to specify the format specifiers for numbers.

Fixes #6874
2020-07-20 12:52:36 +03:00
Botond Dénes
e5db1ce785 reader_permit: reader_resources: add operator- and operator+
In addition to the already available operator+= and operator-=.
2020-07-20 11:23:39 +03:00
Botond Dénes
aabbdc34ac reader_concurrency_semaphore: add initial_resources()
To allow tests to reliably calculate the amount of resources they need
to consume in order to effectively reduce the resources of the semaphore
to a desired amount. Using `available_resources()` is not reliable as it
doesn't factor in resources that are consumed at the moment but will be
returned later.
This will also benefit debugging coredumps, where we will now be able to
tell how many resources the semaphore was created with and thus
calculate the amount of memory and count currently used.
2020-07-20 11:23:39 +03:00
Botond Dénes
f264d2b00f test: cql_test_env: allow overriding database_config 2020-07-20 11:23:39 +03:00
Botond Dénes
5de0afdab7 mutation_reader: expose new_reader_base_cost
So that test code can use it.
2020-07-20 11:23:39 +03:00
Botond Dénes
566e31a5ac db/view: view_updating_consumer: allow passing custom update pusher
So that tests can test the `view_updating_consumer` in isolation, without
having to set up the whole database machinery. In addition to less
infrastructure setup, this allows more direct checking of mutations
pushed for view generation.
2020-07-20 11:23:39 +03:00
Botond Dénes
0166f97096 db/view: view_update_generator: make staging reader evictable
The view update generation process creates two readers. One is used to
read the staging sstables, the data which needs view updates to be
generated for, and another reader for each processed mutation, which
reads the current value (pre-image) of each row in said mutation. The
staging reader is created first and is kept alive until all staging data
is processed. The pre-image reader is created separately for each
processed mutation. The staging reader is not restricted, meaning it
does not wait for admission on the relevant reader concurrency
semaphore, but it does register its resource usage on it. The pre-image
reader however *is* restricted. This creates a situation, where the
staging reader possibly consumes all resources from the semaphore,
leaving none for the later created pre-image reader, which will not be
able to start reading. This will block the view building process meaning
that the staging reader will not be destroyed, causing a deadlock.

This patch solves this by making the staging reader restricted and
making it evictable. To prevent thrashing -- evicting the staging reader
after reading only a really small partition -- we only make the staging
reader evictable after we have read at least 1MB worth of data from it.
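
The resource starvation and its fix can be illustrated with a toy
model. Everything below is hypothetical: the semaphore tracks a single
unit count, eviction is modelled as simply returning the staging
reader's units, and the 1MB threshold is not modelled.

```cpp
#include <cassert>

// Toy reader concurrency semaphore that only tracks a unit count.
class toy_reader_semaphore {
    int _available;
public:
    explicit toy_reader_semaphore(int initial) : _available(initial) {}
    int available() const { return _available; }
    // Unrestricted readers register their usage without waiting.
    void consume(int units) { _available -= units; }
    void signal(int units) { _available += units; }
    // Restricted readers are admitted only if enough units are left.
    bool try_admit(int units) {
        if (_available < units) {
            return false;
        }
        _available -= units;
        return true;
    }
};

// Old behaviour: the unrestricted staging reader takes its units
// unconditionally, and the restricted pre-image reader may never get in.
inline bool preimage_admitted_without_eviction(int total, int staging_cost, int preimage_cost) {
    toy_reader_semaphore sem(total);
    sem.consume(staging_cost);
    return sem.try_admit(preimage_cost);
}

// New behaviour: the staging reader is restricted and evictable, so its
// units can be returned when the pre-image reader cannot be admitted.
inline bool preimage_admitted_with_eviction(int total, int staging_cost, int preimage_cost) {
    toy_reader_semaphore sem(total);
    if (!sem.try_admit(staging_cost)) {
        return false;
    }
    if (sem.try_admit(preimage_cost)) {
        return true;
    }
    sem.signal(staging_cost); // evict the paused staging reader
    return sem.try_admit(preimage_cost);
}
```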
2020-07-20 11:23:39 +03:00
Botond Dénes
84357f0722 db/view: view_updating_consumer: move implementation from table.cc to view.cc
table.cc is a very counter-intuitive place for view related stuff,
especially if the declarations reside in `db/view/`.
2020-07-20 11:23:39 +03:00
Botond Dénes
cd849ed40d database: add make_restricted_range_sstable_reader()
A variant of `make_range_sstable_reader()` that wraps the reader in a
restricting reader, hence making it wait for admission on the read
concurrency semaphore, before starting to actually read.
2020-07-20 11:23:39 +03:00
Raphael S. Carvalho
b67066cae2 table: Fix Staging SSTables being incorrectly added or removed from the backlog tracker
Staging SSTables can be incorrectly added or removed from the backlog tracker,
after an ALTER TABLE or TRUNCATE, because the add and removal don't take
into account if the SSTable requires view building, so a Staging SSTable can
be added to the tracker after an ALTER TABLE, or removed after a TRUNCATE,
even though not added previously, potentially causing the backlog to
become negative.

Fixes #6798.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200716180737.944269-1-raphaelsc@scylladb.com>
2020-07-20 10:57:38 +03:00
Nadav Har'El
e0693f19d0 alternator test: produce newer xunit format for test results
test.py passes the "--junit-xml" option to test/alternator/run, which passes
this option to pytest to get an xunit-format summary of the test results.
However, unfortunately until very recent versions (which aren't yet in Linux
distributions), pytest defaulted to a non-standard xunit format which tools
like Jenkins couldn't properly parse. The more standard format can be chosen
by passing the option "-o junit_family=xunit2", so this is what we do in
this patch.

Fixes #6767 (hopefully).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200719203414.985340-1-nyh@scylladb.com>
2020-07-20 09:24:50 +03:00
Avi Kivity
5371be71e9 Merge "Reduce fanout of some mutation-related headers" from Pavel E
"
The set's goal is to reduce the indirect fanout of 3 headers only,
but likely affects more. The measured improvement rates are

flat_mutation_reader.hh: -80%
mutation.hh            : -70%
mutation_partition.hh  : -20%

tests: dev-build, 'checkheaders' for changed headers (the tree-wide
       fails on master)
"

* 'br-debloat-mutation-headers' of https://github.com/xemul/scylla:
  headers: Remove flat_mutation_reader.hh from several other headers
  migration_manager: Remove db/schema_tables.hh inclusion into header
  storage_proxy: Remove frozen_mutation.hh inclusion
  storage_proxy: Move paxos/*.hh inclusions from .hh to .cc
  storage_proxy: Move hint_wrapper from .hh to .cc
  headers: Remove mutation.hh from trace_state.hh
2020-07-19 19:47:59 +03:00
Rafael Ávila de Espíndola
9fd2682bfd restrictions_test: Fix use after return
The query_options constructor captures a reference to the cql_config.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200718013221.640926-1-espindola@scylladb.com>
2020-07-19 15:44:38 +03:00
Takuya ASADA
c99f31f770 scylla_setup: abort RAID disk prompt when no free disks available
Fixes #6860
2020-07-19 14:48:59 +03:00
Eliran Sinvani
b97f466438 schema: take into account features when converting a table creation to
schema_mutations

When upgrading from a version that lacks some schema features, there is
a transition period during which the cluster is mixed. During that
period, schema digests are calculated without taking into account the
features supported by the mixed cluster: every node calculates the
digest as if the whole cluster supported its own features.
Scylla already has a mechanism of redaction to the lowest common
denominator, but it hasn't been used in this context.

This commit is using the redaction mechanism when calculating the digest on
the newly added table so it will match the supported features of the
whole cluster.

Tests: Manual upgrading - upgraded to a version with an additional
feature and additional schema column and validated that the digest
of the tables schema is identical on every node on the mixed cluster.
2020-07-19 10:30:51 +03:00
Avi Kivity
e4deaaced3 Update tools/java submodule
* tools/java 3eca0e3511...113c7d993b (1):
  > dist: redhat: reduce log spam from unpacking sources when building rpm
2020-07-18 12:07:57 +03:00
Pavel Emelyanov
92f58f62f2 headers: Remove flat_mutation_reader.hh from several other headers
They can all live with a forward declaration of the f._m._r. plus a
seastar header in commitlog code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-17 17:54:47 +03:00
Pavel Emelyanov
8618a02815 migration_manager: Remove db/schema_tables.hh inclusion into header
The schema_tables.hh -> migration_manager.hh couple seems to work as a
"single header for everything", creating big bloat for many seemingly
unrelated .hh's.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-17 17:54:43 +03:00
Pavel Emelyanov
a80403e8f3 storage_proxy: Remove frozen_mutation.hh inclusion
Nothing in it requires the needed classes any longer; forward
declarations are enough.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-17 17:47:30 +03:00
Pavel Emelyanov
6174252282 storage_proxy: Move paxos/*.hh inclusions from .hh to .cc
The storage_proxy.hh can live with forward declarations of paxos
classes it refers to.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-17 17:44:02 +03:00
Pavel Emelyanov
3df4f3078f storage_proxy: Move hint_wrapper from .hh to .cc
It's only used there, but requires mutation_query.hh, which can thus be
removed from storage_proxy.hh

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-17 17:40:25 +03:00
Pavel Emelyanov
757a7145b9 headers: Remove mutation.hh from trace_state.hh
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-17 17:40:23 +03:00
Nadav Har'El
3b5122fd04 alternator test: fix warning message in test_streams.py
In test_streams.py, we had the line:
  assert desc['StreamDescription'].get('StreamLabel')

In Alternator, the 'StreamLabel' attribute is missing, which the author of
this test probably thought would cause this test to fail (which is expected,
the test is marked with "xfail"). However, my version of pytest actually
doesn't like that assert is given a value instead of a comparison, and we
get the warning message:

  PytestAssertRewriteWarning: asserting the value None, please use "assert is None"

I think that the nicest replacement for this line is

  assert 'StreamLabel' in desc['StreamDescription']

This is very readable, "pythonic", and checks the right thing - it checks
that the JSON must include the 'StreamLabel' item, as the get() assertion
was supposed to have been doing.
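
The difference between the two assertions can be seen outside pytest.
This is only a minimal sketch: the `desc_*` dictionaries are made-up
stand-ins for a DescribeStream response, not real Alternator output, and
the helper names are hypothetical.

```python
# Hypothetical responses, for illustration only.
desc_missing = {'StreamDescription': {}}
desc_empty = {'StreamDescription': {'StreamLabel': ''}}
desc_present = {'StreamDescription': {'StreamLabel': 'some-label'}}


def has_label_via_get(desc):
    # Truthiness of .get(): None when the key is missing, but also
    # falsy when the key exists with an empty value.
    return bool(desc['StreamDescription'].get('StreamLabel'))


def has_label_via_in(desc):
    # Membership test: checks only that the key exists.
    return 'StreamLabel' in desc['StreamDescription']
```

Both helpers agree when the key is missing or holds a non-empty value;
they differ only for a present-but-empty value, where the membership
test still reports the key as present.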

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200716124621.906473-1-nyh@scylladb.com>
2020-07-17 14:36:23 +03:00
Rafael Ávila de Espíndola
9fe4dc91d7 sstables: Move noop_write_monitor to a .cc file
There is no need to expose a type that is only used via a virtual
interface.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200717021215.545525-1-espindola@scylladb.com>
2020-07-17 11:59:03 +03:00
Rafael Ávila de Espíndola
66d866427d sstable_datafile_test: Use BOOST_REQUIRE_EQUAL
This only works for types that can be printed, but produces a better
error message if the check fails.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200716232700.521414-1-espindola@scylladb.com>
2020-07-17 11:58:58 +03:00
Rafael Ávila de Espíndola
c5405a5268 managed_bytes: Delete dead 'if'
If external is true, _u.ptr is not null. An empty managed_bytes uses
the internal representation.

The current code looks scary, since it seems possible that backref
would still point to the old location, which would invite corruption
when the reclaimer runs.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200716233124.521796-1-espindola@scylladb.com>
2020-07-17 11:58:53 +03:00
Avi Kivity
0ae770da35 Update seastar submodule
* seastar 0fe32ec59...4a99d5645 (3):
  > httpd: Don't warn on ECONNABORTED
  > httpd: Avoid calling future::then twice on the same future
Fixes #6709.
  > futures: Add a test for a broken promise in repeat
2020-07-17 08:42:26 +03:00
Rafael Ávila de Espíndola
44cf4d74cd build: Put test.py invocations in the console pool
Ninja has a special pool called console that causes programs in that
pool to output directly to the console instead of being logged. By
putting test.py in it, it is now possible to run just

$ ninja dev-test

And see the test.py output while it is running.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200716204048.452082-1-espindola@scylladb.com>
2020-07-17 00:33:10 +03:00
Benny Halevy
3ab1d9fe1d commitlog: use seastar::with_file_close_on_failure
`close_on_failure` was committed to seastar so use the
library version.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-07-16 20:32:32 +03:00
Benny Halevy
742298fa2a commitlog: descriptor: make nothrow move constructible
Inherited from sstring's nothrow move constructor.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-07-16 20:32:22 +03:00
Benny Halevy
54c5583b8d commitlog: allocate_segment_ex, segment: pass descriptor by value
Besides being more robust than passing const descriptor&
to continuations, this helps simplify making allocate_segment_ex's
continuations nothrow_move_constructible, which is needed for using
seastar::with_file_close_on_failure().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-07-16 20:31:12 +03:00
Benny Halevy
22c384c2e9 commitlog: allocate_segment_ex: filename capture is unused
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-07-16 20:23:57 +03:00
Raphael S. Carvalho
09d3a35438 Update MAINTAINERS
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200716142642.918204-1-raphaelsc@scylladb.com>
2020-07-16 17:29:41 +03:00
Avi Kivity
7bf51b8c6c Merge 'Distinguish single-column expressions in AST' from Dejan
"
Fix #6825 by explicitly distinguishing single- from multi-column expressions in AST.

Tests: unit (dev), dtest secondary_indexes_test.py (dev)
"

* dekimir-single-multiple-ast:
  cql3/restrictions: Separate AST for single column
  cql3/restrictions: Single-column helper functions
2020-07-16 16:59:14 +03:00
Pavel Solodovnikov
5ff5df1afd storage_proxy: un-hardcode force sync flag for mutate_locally(mutation) overload
The corresponding overload of `storage_proxy::mutate_locally`
was hardcoded to pass `db::commitlog::force_sync::no` to
`database::apply`. Un-hardcode it and pass `force_sync::no`
at all existing call sites (as it was before).

`force_sync::yes` will be used later for paxos learn writes
when trying to apply mutations upgraded from an obsolete
schema version (similar to the current case when applying
locally a `frozen_mutation` stored in accepted proposal).

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200716124915.464789-1-pa.solodovnikov@scylladb.com>
2020-07-16 16:38:48 +03:00
Avi Kivity
0c7c255f94 Merge "compaction uuid for log and compaction_history" from Benny
"
We'd like to use the same uuid both for printing compaction log
messages and to update compaction_history.

Generate one when starting compaction and keep it in
compaction_info.  Then use it by convention in all
compaction log messages, along with compaction type,
and keyspace.table information.  Finally, use the
same uuid to update compaction_history.

Fixes #6840
"

* tag 'compaction-uuid-v1' of github.com:bhalevy/scylla:
  compaction: print uuid in log messages
  compaction: report_(start|finish): just return description
  compaction: move compaction uuid generation to compaction_info
2020-07-16 16:38:48 +03:00
Dejan Mircevski
cc86d915ed configure.py: $mode-test targets depend on scylla
The targets {dev|debug|release}-test run all unit tests, including
alternator/run.  But this test requires the Scylla executable, which
wasn't among the dependencies.  Fix it by adding build/$mode/scylla to
the dependency list.

Fixes #6855.

Tests: `ninja dev-test` after removing build/dev/scylla

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-07-16 16:38:48 +03:00
Piotr Dulikowski
e2462bce3b cdc: fix a corner case inside get_base_table
It is legal for a user to create a table with a name that has a
_scylla_cdc_log suffix. In such a case, the table won't be treated as a
cdc log table, and does not require a corresponding base table to exist.

During refactoring done as a part of the initial implementation of
Alternator streams (#6694), `is_log_for_some_table` started throwing
when trying to check a name like `X_scylla_cdc_log` when there was no
table with name `X`. Previously, it just returned false.

The exception originates inside `get_base_table`, which tries to return
the base table schema, not checking for its existence - which may throw.
It makes more sense for this function to return nullptr in such case (it
already does when provided log table name does not have the cdc log
suffix), so this patch adds an explicit check and returns nullptr when
necessary.

A similar oversight happened before (see #5987), so this patch also adds
a comment which explains why existence of `X_scylla_cdc_log` does not
imply existence of `X`.

Fixes: #6852
Refs: #5724, #5987
2020-07-16 16:38:48 +03:00
Benny Halevy
eb1d558d00 compaction: print uuid in log messages
By convention, print the following information
in all compaction log messages:

[{compaction.type} {keyspace}.{table} {compaction.uuid}]
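
As a rough illustration of the convention (not the actual Scylla
logging code), the prefix could be built like this. The function name
and the sample values are hypothetical.

```python
import uuid


def compaction_log_prefix(compaction_type, keyspace, table, compaction_uuid):
    # Mirrors the bracketed layout above: type, keyspace.table, uuid.
    return f"[{compaction_type} {keyspace}.{table} {compaction_uuid}]"


# Sample values are made up for the example.
prefix = compaction_log_prefix("Compact", "ks", "tbl", uuid.UUID(int=1))
```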

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-07-16 13:55:23 +03:00
Benny Halevy
dec751cfbe compaction: report_(start|finish): just return description
Rather than logging the message in the virtual callee method
just return a string description and make the logger call in
the common caller.

1. There is no need to do the logger call in the callee,
it is simpler to format the log message in the caller
and just retrieve the per-compaction-type description.

2. Prepare to centrally print the compaction uuid.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-07-16 13:55:23 +03:00
Benny Halevy
e39fbe1849 compaction: move compaction uuid generation to compaction_info
We'd like to use the same uuid both for printing compaction log
messages and to update compaction_history.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-07-16 13:55:23 +03:00
Dejan Mircevski
0047e1e44d cql3/restrictions: Separate AST for single column
Existing AST assumes the single-column expression is a special case of
multi-column expressions, so it cannot distinguish `c=(0)` from
`(c)=(0)`.  This leads to incorrect behaviour and dtest failures.  Fix
it by separating the two cases explicitly in the AST representation.

Modify AST-creation code to create different AST for single- and
multi-column expressions.

Modify AST-consuming code to handle column_name separately from
vector<column_name>.  Drop code relying on cardinality testing to
distinguish single-column cases.
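
The idea can be sketched with a toy AST in Python. This is an
assumption-laden illustration, not the cql3 representation: the class
and function names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class ColumnName:
    name: str


@dataclass(frozen=True)
class BinaryOperator:
    # lhs is either one column (c = ...) or a tuple of columns ((c) = ...).
    lhs: Union[ColumnName, Tuple[ColumnName, ...]]
    op: str
    rhs: object


def is_single_column(expr):
    # Decided by the type of lhs, not by counting columns.
    return isinstance(expr.lhs, ColumnName)


single = BinaryOperator(ColumnName("c"), "=", (0,))     # c = (0)
multi = BinaryOperator((ColumnName("c"),), "=", (0,))   # (c) = (0)
```

With a cardinality test both expressions would look alike, since each
has exactly one column; with distinct lhs types they stay different.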

Add a new unit test for `c=(0)`.

Fixes #6825.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-07-16 12:27:25 +02:00
Dejan Mircevski
a23e43090f cql3/restrictions: Single-column helper functions
This commit is separated out for ease of review.  It introduces some
functions that subsequent commits will use, thereby reducing diff
complexity of those subsequent commits.  Because the new functions
aren't invoked anywhere, they are guarded by `#if 0` to avoid
unused-function errors.

The new functions perform the same work as their existing namesakes,
but assuming single-column expressions.  The old versions continue to
try handling single-column as a special case of multi-column,
remaining unable to distinguish between `c = (1)` and `(c) = (1)`.
This will change in the next commit, which will drop attempts to
handle single-column cases from existing functions.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-07-16 11:25:30 +02:00
Asias He
4d7faac350 repair: Add uuid to a repair job
Currently, repair uses an integer to identify a repair job. The repair
id starts from 1 after each node restart. As a result, different repair
jobs will have the same id across restarts.

To make the id more unique across restarts, we can use a uuid in
addition to the integer id. We cannot drop the use of the integer id
completely since the http api and nodetool use it.
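
A minimal sketch of the pairing, assuming a random (version 4) uuid;
the commit does not say which uuid flavour is used, and `RepairId` and
`RepairTracker` are hypothetical names.

```python
import itertools
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairId:
    id: int              # kept for the http api and nodetool
    job_uuid: uuid.UUID  # distinguishes jobs across restarts


class RepairTracker:
    def __init__(self):
        # A fresh tracker models a node restart: the counter starts at 1.
        self._next_id = itertools.count(1)

    def new_job(self):
        return RepairId(next(self._next_id), uuid.uuid4())


before_restart = RepairTracker().new_job()
after_restart = RepairTracker().new_job()
```

Both jobs carry integer id 1, but their uuids differ, so log lines from
the two jobs can be told apart.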

Fixes #6786
2020-07-16 11:03:19 +03:00
Pekka Enberg
8b5121ea0c README.md: Add Slack and Twitter social banners
Add social banners for Slack and Twitter in README that are easy to
find for people new to the project.

Fixes #6538

Message-Id: <20200716070449.630864-1-penberg@scylladb.com>
2020-07-16 10:55:15 +03:00
Pekka Enberg
ed71ebafe5 README.md: Improve contribution section
The markdown syntax in the contribution section is incorrect, which is
why the links appear on the same line.

Improve the contribution section by making it more explicit what the
links are about.

Message-Id: <20200714070716.143768-1-penberg@scylladb.com>
2020-07-16 10:53:12 +03:00
Nadav Har'El
dcf9c888a2 alternator test: disable test_streams.py::test_get_records
This test usually fails, with the following error. Marking it "xfail" until
we can get to the bottom of this.

dynamodb = dynamodb.ServiceResource()
dynamodbstreams = <botocore.client.DynamoDBStreams object at 0x7fa91e72de80>

    def test_get_records(dynamodb, dynamodbstreams):
        # TODO: add tests for storage/transactionable variations and global/local index
        with create_stream_test_table(dynamodb, StreamViewType='NEW_AND_OLD_IMAGES') as table:
            arn = wait_for_active_stream(dynamodbstreams, table)

            p = 'piglet'
            c = 'ninja'
            val = 'lucifers'
            val2 = 'flowers'
>           table.put_item(Item={'p': p, 'c': c, 'a1': val, 'a2': val2})
test_streams.py:316:
...
E           botocore.exceptions.ClientError: An error occurred (Internal Server Error) when calling the PutItem operation (reached max retries: 3): Internal server error: std::runtime_error (cdc::metadata::get_stream: could not find any CDC stream (current time: 2020/07/15 17:26:36). Are we in the middle of a cluster upgrade?)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-07-16 08:24:25 +03:00
Nadav Har'El
61f52da9b1 merge: Alternator/CDC: Implement streams support
Merged pull request https://github.com/scylladb/scylla/pull/6694
by Calle Wilund:

Implementation of DynamoDB streams using Scylla CDC.

Fixes #5065

Initial, naive implementation insofar as it uses a 1:1 mapping of CDC
stream to DynamoDB shard. I.e. there are a lot of shards.

Includes tests verified against both local DynamoDB server and actual AWS
remote one.

Note:
Because of how data put is implemented in alternator, currently we do not
get "proper" INSERT labels for first write of data, because to CDC it looks
like an update. The test compensates for this, but actual users might not
like it.
2020-07-16 08:18:25 +03:00
Nadav Har'El
c4497bf770 alternator test: enable experimental CDC
In the script test/alternator/run, which runs Scylla for the Alternator
tests, add the "--experimental-features=cdc" option, to allow us to test
the streams API whose implementation requires the experimental CDC feature.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2020-07-16 08:18:09 +03:00
Takuya ASADA
e52ae78f79 reloc: support unified relocatable package
This introduces a unified relocatable package, a single tarball to
install all Scylla packages.

Fixes #6626
See scylladb/scylla-pkg#1218
2020-07-15 20:29:31 +03:00
Avi Kivity
afd9b29627 Update tools/jmx and tools/java submodules
* tools/java 50dbf77123...3eca0e3511 (1):
  > config: Avoid checking options and filtering scylla.yaml

* tools/jmx 5820992...aa94fe5 (3):
  > dist: redhat: reduce log spam from unpacking sources when building rpm
  > Merge 'gitignore: fix typo and add scylla-apiclient/target/' from Benny
  > apiclient: Bump Jackson version to 2.10.4
2020-07-15 19:42:47 +03:00
Takuya ASADA
5da8784494 install.sh: support calling install.sh from other directory
In the .deb package with the new relocatable package format, all files
are moved under the scylla/ directory.
So we need to call ./scylla/install.sh from debian/rules, but that does
not work correctly, since install.sh does not support being called from
another directory.

To support this, we need to change directory to the scylla top directory
before copying files.
2020-07-15 18:55:12 +03:00
Nadav Har'El
09a71ccd84 merge: cql3/restrictions: exclude NULLs from comparison in filtering
Merge pull request https://github.com/scylladb/scylla/pull/6834 by
Juliusz Stasiewicz:

NULLs used to give false positives in GT, LT, GEQ and LEQ ops performed upon
ALLOW FILTERING. That was a consequence of not distinguishing NULL from an
empty buffer.

This patch excludes NULLs at a high level, preventing them from entering the LHS
of any comparison, i.e. it assumes that any binary operation should return
false whenever the LHS operand is NULL (note: at the moment filters with
RHS NULL, such as ...WHERE x=NULL ALLOW FILTERING, return empty sets anyway).
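
The rule can be sketched in Python, with None standing in for a CQL
NULL and b'' for an empty but non-NULL buffer. This is an illustration
of the stated behaviour only; `filter_matches` and `OPS` are
hypothetical names, not part of cql3.

```python
import operator

OPS = {
    '<': operator.lt,
    '>': operator.gt,
    '<=': operator.le,
    '>=': operator.ge,
}


def filter_matches(lhs, op, rhs):
    # A NULL on the LHS never satisfies a comparison.
    if lhs is None:
        return False
    # The message notes that RHS NULL filters already return empty sets;
    # that is modeled here as "no match".
    if rhs is None:
        return False
    return OPS[op](lhs, rhs)
```

An empty buffer still compares as an ordinary value, so
`filter_matches(b'', '<', b'a')` matches while
`filter_matches(None, '<', b'a')` does not.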

Fixes #6295

* '6295-do-not-compare-nulls-v2' of github.com:jul-stas/scylla:
  filtering_test: check that NULLs do not compare to normal values
  cql3/restrictions: exclude NULLs from comparison in filtering
2020-07-15 18:32:14 +03:00
Takuya ASADA
9b5f28a2e3 scylla_raid_setup: fix incorrect block device path
To use UUID, we need a tag "UUID=<uuid>".

reference: https://www.freedesktop.org/software/systemd/man/systemd.mount.html
reference: https://man7.org/linux/man-pages/man8/mount.8.html
2020-07-15 18:22:46 +03:00
Tomasz Grabiec
b8531fb885 Merge "Switch partitions cache from BST to B+tree & array" from Pavel E.
The data model is now

        bplus::tree<Key = int64_t, T = array<entry>>

where entry can be cache_entry or memtable_entry.

The whole thing is encapsulated into a collection called "double_decker"
from patch #3. The array<T> is an array of T-s with 0-bytes overhead used
to resolve hash conflicts (patch #2).

branch:
tests: unit(debug)
tests before v7:
        unit(debug) for new collections, memtable and row_cache
        unit(dev) for the rest
        perf(dev)

* https://github.com/xemul/scylla/commits/row-cache-over-bptree-9:
  test: Print more sizes in memory_footprint_test
  memtable: Switch onto B+ rails
  row_cache: Switch partition tree onto B+ rails
  memtable: Count partitions separately
  token: Introduce raw() helper and raw comparator
  row-cache: Use ring_position_comparator in some places
  dht: Detach ring_position_comparator_for_sstables
  double-decker: A combination of B+tree with array
  intrusive-array: Array with trusted bounds
  utils: B+ tree implementation
  test: Move perf measurement helpers into header
2020-07-15 14:54:29 +02:00
Calle Wilund
3b74b9585f cql3::lists: Fix setter_by_uuid not handling null value
Fixes #6828

When using the scylla list index from UUID extension,
null values were not handled properly causing throws
from underlying layer.
2020-07-15 13:52:09 +02:00
Raphael S. Carvalho
7a728803f7 cql3/functions: protect against uninitialized value
impl_count_function doesn't explicitly initialize _count, so its correctness
depends on default initialization. Let's explicitly initialize _count to
make the code future proof.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200714162604.64402-1-raphaelsc@scylladb.com>
2020-07-15 12:38:39 +03:00
Calle Wilund
ac4d0bb144 docs/alternator: Change the streams section
Small but updated blurb describing the state of streams in
alternator/scylla.
2020-07-15 08:21:34 +00:00
Calle Wilund
76f6fe679a alternator tests: Add streams test
Small set of positive and negative tests of streams
functionality. Verified against DynamoDB and Alternator.
2020-07-15 08:21:34 +00:00
Calle Wilund
cbb70f4af4 executor: "UpdateTable" support for streams
Partial implementation of the "UpdateTable" command.
Supports only enabling/disabling streams.
2020-07-15 08:21:34 +00:00
Calle Wilund
45ee73969d executor: Allow streams specification in CreateTable schema
2020-07-15 08:21:34 +00:00
Calle Wilund
3376209718 cdc::schema: Make extensions explicitly settable from builder
To make non-cql cdc schema options a reality.
2020-07-15 08:21:34 +00:00
Calle Wilund
bbc544748f alternator: Implement GetRecords
Simplistic variant, using 1:1 mapping of scylla stream id <-> shard
2020-07-15 08:21:34 +00:00
Calle Wilund
3756febbf5 alternator: expose describe_single_item and default_timeout
To be able to describe single alternator items from other files.
And query with the default timeout.
2020-07-15 08:10:23 +00:00
Calle Wilund
c45781de1e alternator: Implement GetShardIterator
2020-07-15 08:10:23 +00:00
Calle Wilund
8084b5a9b7 alternator: Implement DescribeStream
2020-07-15 08:10:23 +00:00
Calle Wilund
8fb9b32bd3 alternator: Implement ListStreams command
2020-07-15 08:10:23 +00:00
Calle Wilund
811b531e2d db::config: Add option to set streams confidence window
Option to control the alternator streams CDC query/shard range time
confidence interval, i.e. the period we enforce as a timestamp threshold
when reading. The default, 10s, should be sufficient on a normal
cluster, but across DCs, or with client timestamps, one might need a
larger window.
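
One possible reading of "timestamp threshold", sketched in Python: only
records at least one window older than the current time are returned.
The function names are hypothetical and the exact comparison used by
the implementation is an assumption here.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_CONFIDENCE_WINDOW = timedelta(seconds=10)  # the default named above


def read_high_water_mark(now, window=DEFAULT_CONFIDENCE_WINDOW):
    # Newest timestamp a reader is allowed to see (assumed semantics).
    return now - window


def visible(record_ts, now, window=DEFAULT_CONFIDENCE_WINDOW):
    return record_ts <= read_high_water_mark(now, window)


now = datetime(2020, 7, 15, 8, 0, 30, tzinfo=timezone.utc)
```

A larger window delays visibility further, which is the trade-off when
clocks or client timestamps are less trustworthy.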
2020-07-15 08:10:23 +00:00
Calle Wilund
a9641d4f02 system_distributed_keyspace: Add cdc topology/stream ids reader
To read the full topology (with expired and expirations etc)
from within.
2020-07-15 08:10:23 +00:00
Calle Wilund
0158f6473b cdc: Add stream ids structure with time and expiration
For reading the topology tables from within scylla.
2020-07-15 08:10:23 +00:00
Calle Wilund
331aa7c501 cdc: Add "is_cdc_metacolumn_name" predicate
To sift column names
2020-07-15 08:10:23 +00:00
Calle Wilund
8a728ce618 cdc: Add get_base_table helper
2020-07-15 08:10:23 +00:00
Calle Wilund
8f462e8606 CDC::log: Add base_name helper
To extract base table name from CDC log table name.
2020-07-15 08:10:23 +00:00
Calle Wilund
0708a9971a executor: Add system_distributed_keyspace as parameter/member
Streams implementation will require querying system tables
etc to do its work, thus will need access to this object.
2020-07-15 08:10:23 +00:00
Calle Wilund
e382d79bcd executor: Make some helper and subroutines class-visible
Subroutines needed by (in this case) streams implementation
moved from being file-static to class-static (exported).
To make putting handler routines in separate sources possible.
Because executor.cc is large and slow to compile.
Separation is nice.

Unfortunately, not all methods can be kept class-private,
since unrelated types also use them.

A reviewer suggested placing them in a top-level header for export
instead, i.e. not class-private at all.
I am skipping that for now, mainly because I can't come up
with a good file name. It can be part of a general refactor
of helper routine organization in executor.
2020-07-15 08:10:23 +00:00
Calle Wilund
8a7b24dea1 alternator::error: Add exception overloads for Dynamo types
Add typed exception overloads for ValidationException, ResourceNotFoundException, etc.,
to avoid writing the explicit error type as a string everywhere (with the potential for
spelling errors ever present).
Also allows intellisense etc. to complete the exception when coding.
2020-07-15 08:10:23 +00:00
Calle Wilund
699c4d2c7e rjson: Add templated get/set overloads and optional get<T>
To allow immediate json value conversion for types we
have TypeHelper<...>:s for.

Typed opt-get to get both automatic type conversion, _and_
find functionality in one call.
2020-07-15 08:10:23 +00:00
Calle Wilund
72ec525045 rjson: Add exception overloads
To avoid duplicating error message composition, as well as forcing
said code into rjson.cc.
Also helps the caller determine the fault by catch type.
2020-07-15 08:10:23 +00:00
Piotr Sarna
f1c1043701 README: update the alternator paragraph
Since alternator is no longer experimental, its paragraph
in README.md is rephrased to better reflect its current state.

Message-Id: <a89eb70c4350e021ad9d6f684e49f94e4c735c19.1594792604.git.sarna@scylladb.com>
2020-07-15 09:27:35 +03:00
Juliusz Stasiewicz
c25398e8cf filtering_test: check that NULLs do not compare to normal values
Tested operators are: `<` and `>`. Tests all types of NULLs except
`duration` because durations are explicitly not comparable. Values
to compare against were chosen arbitrarily.
2020-07-14 15:37:17 +02:00
Pavel Emelyanov
f8ffc31218 test: Print more sizes in memory_footprint_test
The row cache memory footprint changed after switch to B+
because we no longer have a sole cache_entry allocation, but
also the bplus::data and bplus::node. Knowing their sizes
helps analyzing the footprint changes.

Also print the size of memtable_entry that's now also stored
in B+'s data.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:30:02 +03:00
Pavel Emelyanov
4d2f5f93a4 memtable: Switch onto B+ rails
The change is the same as with row-cache -- use B+ with int64_t token
as key and array of memtable_entry-s inside it.

The changes are:

Similar to those for row_cache:

- compare() goes away, new collection uses ring_position_comparator

- insertion and removal happens with the help of double_decker, most
  of the places are about slightly changed semantics of it

- flags are added to memtable_entry, this makes its size larger than
  it could be, but still smaller than it was before

Memtable-specific:

- when the new entry is inserted into tree iterators _might_ get
  invalidated by double-decker inner array. This is easy to check
  when it happens, so the invalidation is avoided when possible

- the size_in_allocator_without_rows() is now not very precise. This
  is because after the patch memtable_entries are not allocated
  individually as they used to be. They can be squashed together with
  those having a token conflict, and asking the allocator for the
  occupied memory slot is not possible. As the closest (lower) estimate,
  the size of the enclosing B+ data node is used

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:30:02 +03:00
Pavel Emelyanov
174b101a49 row_cache: Switch partition tree onto B+ rails
The row_cache::partitions_type is replaced from boost::intrusive::set
to bplus::tree<Key = int64_t, T = array_trusted_bounds<cache_entry>>

Here the token is used to quickly locate the partition, and the
internal array to resolve hashing conflicts.

Summary of changes in cache_entry:

- compare() goes away, as the new collection needs a tri-compare one,
  which is provided by ring_position_comparator

- when initialized the dummy entry is added with "after_all_keys" kind,
  not "before_all_keys" as it was by default. This is to make tree
  entries sorted by token

- insertion and removal of cache_entries happens inside double_decker,
  most of the changes in row_cache.cc are about passing constructor args
  from current_allocator.construct into double_decker.emplace_before()

- the _flags is extended to keep array head/tail bits. There's a room
  for it, sizeof(cache_entry) remains unchanged

The rest fits smoothly into the double_decker API.

Also, as was told in the previous patch, insertion and removal _may_
invalidate iterators, but may leave them intact. However, currently
this doesn't seem to be a problem as the cache_tracker ::insert() and
::on_partition_erase do invalidate iterators unconditionally.

Later this can be optimized, as iterators are invalidated by double-decker
only in case of hash conflict, otherwise it doesn't change arrays and
the B+ tree doesn't invalidate its own.

tests: unit(dev), perf(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:30:02 +03:00
Pavel Emelyanov
dff5eb6f25 memtable: Count partitions separately
The B+ will not have a constant-time .size() call, so count by hand

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:30:02 +03:00
Pavel Emelyanov
ae28814b1c token: Introduce raw() helper and raw comparator
In next patches the entries having token on-board will be
moved onto B+-tree rails. For this the int64_t value of the
token will be used as B+ key, so prepare for this.

One corner case -- the after_all_keys tokens must be resolved
to int64::max value to appear at the "end" of the tree. This
is not the same as "before_all_keys" case, which maps to the
int64::min value which is not allowed for regular tokens. But
for the sake of B+ switch this is OK, the conflicts of token
raw values are explicitly resolved in next patches.
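
The mapping described above can be sketched as follows. The string
kinds and the `raw` signature are hypothetical stand-ins for the token
type; only the min/max mapping comes from the text.

```python
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1


def raw(kind, value=None):
    # before_all_keys maps to a value regular tokens never take.
    if kind == 'before_all_keys':
        return INT64_MIN
    # after_all_keys maps to int64 max so it sorts at the "end" of the
    # tree; a regular token may share this raw value, and the text says
    # such conflicts are resolved in later patches.
    if kind == 'after_all_keys':
        return INT64_MAX
    return value


tokens = [('after_all_keys', None), ('key', 42), ('before_all_keys', None), ('key', -7)]
ordered = sorted(tokens, key=lambda t: raw(*t))
```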

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:30:02 +03:00
Pavel Emelyanov
7b2754cf5f row-cache: Use ring_position_comparator in some places
The row cache (and memtable) code uses own comparators built on top
of the ring_position_comparator for collections of partitions. These
collections will be switched from the key less-compare to the pair
of token less-compare + key tri-compare.

Prepare for the switch by generalizing the ring_position_comparator
and by patching all the non-collections usage of less-compare to use
one.

The memtable code doesn't use it outside of collections, but patch it
anyway as a part of preparations.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:30:02 +03:00
Pavel Emelyanov
1e15c06889 dht: Detach ring_position_comparator_for_sstables
Next patches will generalize ring_position_comparator with templates
to replace cache_entry's and memtable_entry's comparators. The overload
of operator() for sstables has its own implementation, that differs from
the "generic" one, for smoother generalization it's better to detach it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:30:02 +03:00
Pavel Emelyanov
cf1315cde5 double-decker: A combination of B+tree with array
The collection is K:V store

   bplus::tree<Key = K, Value = array_trusted_bounds<V>>

It will be used as partitions cache. The outer tree is used to
quickly map token to cache_entry, the inner array -- to resolve
(expected to be rare) hash collisions.

It also must be equipped with two comparators -- a less one for
keys and a full one for values. The latter is not kept on-board,
but is required on all calls.

The core API consists of just 2 calls

- Heterogeneous lower_bound(search_key) -> iterator : finds the
  element that's greater or equal to the provided search key

  Other than the iterator the call returns a "hint" object
  that helps the next call.

- emplace_before(iterator, key, hint, ...) : the call constructs
  the element right before the given iterator. The key and hint
  are needed for a more optimal algorithm, but strictly speaking
  are not required.

  Adding an entry to the double_decker may result in growing the
  node's array. Here the B+ iterator's .reconstruct() method
  comes into play. The new array is created, old elements are
  moved onto it, then the fresh node replaces the old one.

// TODO: Ideally this should be turned into the
// template <typename OuterCollection, typename InnerCollection>
// but for now the double_decker still has some intimate knowledge
// about what outer and inner collections are.

Insertion into this collection _may_ invalidate iterators, but
may leave them intact. Invalidation only happens in case of hashing
conflict, which can be clearly seen from the hint object, so
there's a good room for improvement.

The main usage by row_cache (the find_or_create_entry) looks like

   cache_entry find_or_create_entry() {
       bound_hint hint;

       it = lower_bound(decorated_key, &hint);
       if (!hint.match) {
           it = emplace_before(it, decorated_key.token(), hint,
                                 <constructor args>);
       }
       return *it;
   }

Now the hint. It contains 3 booleans, which are

  - match: set to true when the "greater or equal" condition
    evaluated to "equal". This frees the caller from the need
    to manually check whether the entry returned matches the
    search key or the new one should be inserted.

    This is the "!hint.match" check from the above snippet.

To explain the next 2 bools, here's a small example. Consider
the tree containing two elements {token, partition key}:

   { 3, "a" }, { 5, "z" }

As the collection is sorted they go in the order shown. Next,
this is what the lower_bound would return for some cases:

   { 3, "z" } -> { 5, "z" }
   { 4, "a" } -> { 5, "z" }
   { 5, "a" } -> { 5, "z" }

Apparently, the lower bound for those 3 elements is the same,
but the code-flows of emplacing them before one differ drastically.

   { 3, "z" } : need to get previous element from the tree and
                push the element to its vector's back
   { 4, "a" } : need to create new element in the tree and populate
                its empty vector with the single element
   { 5, "a" } : need to put the new element in the found tree
                element right before the found vector position

To make one of the above decisions the .emplace_before would need
to perform another set of comparisons of keys and elements.
Fortunately, the needed information was already known inside the
lower_bound call and can be reported via the hint.

That said,

  - key_match: set to true if tree.lower_bound() found the element
    for the Key (which is token). For above examples this will be
    true for cases 3z and 5a.

  - key_tail: set to true if the tree element was found, but when
    comparing values from array the bounding element turned out
    to belong to the next tree element and the iterator was ++-ed.
    For above examples this would be true for case 3z only.
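
A minimal sketch of how the three flags could be computed, written in
Python for brevity rather than in the C++ of the patch. The tree is
modelled as a sorted list of (token, sorted keys) pairs; BoundHint,
lower_bound and the (tree index, array index) position are illustrative
assumptions, not the double_decker API.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class BoundHint:
    match: bool = False      # the bound is exactly the searched element
    key_match: bool = False  # the tree has an element for the token
    key_tail: bool = False   # bound fell past the token's array end


def lower_bound(tree, token, key):
    """Return ((tree_idx, array_idx), hint) for the first element that is
    greater than or equal to (token, key)."""
    hint = BoundHint()
    tokens = [t for t, _ in tree]
    ti = bisect_left(tokens, token)
    if ti == len(tree) or tree[ti][0] != token:
        # No tree element for this token: the bound is the head of the
        # next token's array (or the end of the collection).
        return (ti, 0), hint
    hint.key_match = True
    keys = tree[ti][1]
    ai = bisect_left(keys, key)
    if ai == len(keys):
        # Token found, but every key in its array is smaller: step to
        # the next tree element and remember that we did so.
        hint.key_tail = True
        return (ti + 1, 0), hint
    hint.match = keys[ai] == key
    return (ti, ai), hint


tree = [(3, ["a"]), (5, ["z"])]
for token, key in [(3, "z"), (4, "a"), (5, "a")]:
    print((token, key), lower_bound(tree, token, key))
```

All three lookups return the same position, and only the hint tells the
three emplace cases apart.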

Last, but not least, is the "erase self" feature: given only the
cache_entry pointer at hand, remove the entry from the collection.
To make this happen we need two steps:

1. get the array the entry sits in
2. get the b+ tree node the vector sits in

Both methods are provided by array_trusted_bounds and bplus::tree.
So, when we need to get iterator from the given T pointer, the algo
looks like

- Walk back the T array until hitting the head element
- Call array_trusted_bounds::from_element() getting the array
- Construct b+ iterator from obtained array
- Construct the double_decker iterator from b+ iterator and from
  the number of "steps back" from above
- Call double_decker::iterator.erase()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:29:53 +03:00
Pavel Emelyanov
eb70644c1c intrusive-array: Array with trusted bounds
A plain array of elements that grows and shrinks by
constructing the new instance from an existing one and
moving the elements from it.

Behaves similarly to vector's external array, but has
0-byte overhead. The array bounds (0-th and N-th
elements) are determined by checking the flags on the
elements themselves. For this the type must support
getters and setters for the flags.

To remove an element from the array there's also a nothrow
option that drops the requested element from the array,
shifts the elements to its right one step left and keeps the
trailing unused memory (the so-called "train") until
reconstruction or destruction.

Also comes with a lower_bound() helper that helps keep
the elements sorted and a from_element() one that
returns a reference back to the array in which the element
sits.
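
A small sketch of the flag-based bounds, in Python rather than the C++ of
the patch. Element, make_array, array_size and from_element are
hypothetical names, and a flat Python list stands in for raw memory in
which several arrays sit back to back.

```python
from dataclasses import dataclass


@dataclass
class Element:
    value: int
    head: bool = False  # set on the 0-th element of an array
    tail: bool = False  # set on the last element of an array


def make_array(values):
    """Build an array whose bounds are stored only in the element flags."""
    elems = [Element(v) for v in values]
    elems[0].head = True
    elems[-1].tail = True
    return elems


def array_size(memory, start):
    """Count elements from the head at `start` until the tail flag."""
    n = 1
    while not memory[start + n - 1].tail:
        n += 1
    return n


def from_element(memory, idx):
    """Walk back from any element to its array's head.

    Returns (head index, number of steps taken back)."""
    steps = 0
    while not memory[idx - steps].head:
        steps += 1
    return idx - steps, steps


memory = make_array([1, 2, 3]) + make_array([7, 9])
print(from_element(memory, 4), array_size(memory, 0))
```

No length or capacity field is stored anywhere; the walk relies purely on
the head and tail flags, which is why the bounds have to be "trusted".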

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:29:49 +03:00
Pavel Emelyanov
95f15ea383 utils: B+ tree implementation
// The story is at
// https://groups.google.com/forum/#!msg/scylladb-dev/sxqTHM9rSDQ/WqwF1AQDAQAJ

This is the B+ version which satisfies several specific requirements
to be suitable for row-cache usage.

1. Insert/Remove doesn't invalidate iterators
2. Elements should be LSA-compactable
3. Low overhead of data nodes (1 pointer)
4. External less-only comparator
5. As few actions on insert/delete as possible
6. Iterator walks the sorted keys

The design, briefly, is:

There are 3 types of nodes: inner, leaf and data. Inner and leaf nodes
keep a built-in array of N keys and N(+1) nodes. Leaf nodes sit in
a doubly linked list. Data nodes live separately from the leaf ones
and keep pointers to them. The tree handler keeps pointers to the root
and to the left-most and right-most leaves. Nodes do _not_ keep pointers
or references to the tree (except 3 of them, see below).

changes in v9:

- explicitly marked keys/kids indices with type aliases
- marked the whole erase/clear stuff noexcept
- disposers now accept object pointer instead of reference
- clear tree in destructor
- added more comments
- style/readability review comments fixed

Prior changes

**
- Add noexcepts where possible
- Restrict Less-comparator constraint -- it must be noexcept
- Generalized node_id
- Packed code for begin()/cbegin()

**
- Unsigned indices everywhere
- Cosmetics changes

**
- Const iterators
- C++20 concepts

**
- The index_for() implementation is templatized the other way
  to make it possible for AVX key search specialization (further
  patching)

**
- Insertion tries to push kids to siblings before split

  Before this change, insertion into a full node resulted in this
  node being split into two equal parts. Under a random-keys stress
  test this behaviour gives a tree with ~2/3 of nodes half-filled.

  With this change, before splitting the full node, try to push one
  element to each of the siblings (if they exist and are not full).
  This slows the insertion a bit (but it's still way faster than
  std::set), but gives 15% fewer nodes in total.

- Iterator method to reconstruct the data at the given position

  The helper creates a new data node, emplaces data into it and
  replaces the iterator's one with it. Needed to keep arrays of
  data in tree.

- Milli-optimize erase()
  - Return back an iterator that will likely be not re-validated
  - Do not try to update ancestors separation key for leftmost kid

  This caused clear()-like workloads to perform poorly compared to
  std::set. In particular the row_cache::invalidate() method does
  exactly this and this change improves its timing.

- Perf test to measure drain speed
- Helper call to collect tree counters

**
- Fix corner case of iterator.emplace_before()
- Clean heterogeneous lookup API
- Handle exceptions from nodes allocations
- Explicitly mark places where the key is copied (for future)
- Extend the tree.lower_bound() API to report back whether
  the bound hit the key or not
- Addressed style/cleanness review comments
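
The "push kids to siblings before split" idea from the list above can be
sketched as follows. This is a deliberately simplified Python model, not
the patch's code: only the leaf level is shown, as a list of sorted
lists, CAPACITY is a hypothetical node size, and the caller is assumed to
pass the index of the leaf whose key range covers the new key.

```python
from bisect import insort

CAPACITY = 4  # hypothetical maximum number of keys per leaf


def insert(leaves, idx, key):
    """Insert `key` into leaves[idx], rebalancing before splitting."""
    leaf = leaves[idx]
    insort(leaf, key)
    if len(leaf) <= CAPACITY:
        return
    # The leaf is overfull by one key. First try to hand the smallest
    # key to the left sibling, then the largest one to the right sibling.
    if idx > 0 and len(leaves[idx - 1]) < CAPACITY:
        leaves[idx - 1].append(leaf.pop(0))
        return
    if idx + 1 < len(leaves) and len(leaves[idx + 1]) < CAPACITY:
        leaves[idx + 1].insert(0, leaf.pop())
        return
    # Both siblings are missing or full: split into two halves.
    mid = len(leaf) // 2
    leaves[idx:idx + 1] = [leaf[:mid], leaf[mid:]]


leaves = [[1, 2], [5, 6, 7, 8]]
for k in (9, 10, 11):
    insert(leaves, 1, k)
print(leaves)
```

Moving a single key is enough here because one insertion overfills the
leaf by exactly one key; in a real tree the separation keys in the parent
nodes would have to be updated as well, which this sketch omits.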

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:29:43 +03:00
Pekka Enberg
7ef50d7c71 configure.py: Don't install dependencies when building submodules
Let's pass the "--nodeps" option to "build_reloc.sh" script of the
submodules to avoid the build system running "sudo"...

Reported-by: Piotr Sarna <sarna@scylladb.com>
Reported-by: Pavel Emelyanov <xemul@scylladb.com>
Tested-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200714114340.440781-1-penberg@scylladb.com>
2020-07-14 14:50:59 +03:00
Tomasz Grabiec
f20c77d0f8 Merge "Make handle_state_left more robust when tokens are empty" from Asias
1. storage_service: Make handle_state_left more robust when tokens are empty

In case the tokens for the node to be removed from the cluster are
empty, log the application_state of the leaving node to help understand
why the tokens are empty and try to get the tokens from token_metadata.

2. token_metadata: Do not throw if empty tokens are passed to remove_bootstrap_tokens

Gossip on_change callback calls storage_service::excise which calls
remove_bootstrap_tokens to remove the tokens of the leaving node from
bootstrap tokens. If empty tokens, e.g., due to gossip propagation issue
as we saw in #6468, are passed
to remove_bootstrap_tokens, it will throw. Since the on_change callback
is marked as noexcept, such a throw will cause the node to terminate, which
is overkill.

To avoid such an error bringing down the whole cluster in the worst case,
just log that empty tokens were passed to remove_bootstrap_tokens.

Refs #6468
2020-07-14 13:19:45 +02:00
Asias He
116f6141d5 token_metadata: Fix incorrect log in update_normal_tokens
Currently, when update_normal_tokens is called, a warning is logged:

   Token X changing ownership from A to B

Logging this is not correct because we can call update_normal_tokens
against a temporary token_metadata object during topology calculation.

Refs: #6437
2020-07-14 14:13:37 +03:00
Juliusz Stasiewicz
c69075bbef cql3/restrictions: exclude NULLs from comparison in filtering
NULLs used to give false positives in GT, LT, GEQ and LEQ ops
performed upon `ALLOW FILTERING`. That was a consequence of
not distinguishing NULL from an empty buffer. This patch excludes
NULLs on high level, preventing them from entering any comparison,
i.e. it assumes that any binary operator should return `false`
whenever one of the operands is NULL (note: ATM filters such as
`...WHERE x=NULL ALLOW FILTERING` return empty sets anyway).
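
The rule can be sketched in a few lines of Python (the patch itself is
C++). The function name satisfies and the operator table are illustrative
assumptions; None stands for NULL and b"" for an empty buffer.

```python
import operator

# Slice operators affected by the change.
_OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def satisfies(op, lhs, rhs):
    """Evaluate `lhs op rhs`, treating NULL (None) as never matching."""
    if lhs is None or rhs is None:
        # NULL is excluded before any byte comparison takes place,
        # so it can no longer be mistaken for an empty buffer.
        return False
    return _OPS[op](lhs, rhs)


print(satisfies("<", None, b"a"), satisfies("<", b"", b"a"))
```

An empty buffer still takes part in comparisons as an ordinary value;
only NULL is filtered out up front.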

`restriction_test/regular_col_slice` had to be updated accordingly.

Fixes #6295
2020-07-14 12:59:01 +02:00
Pekka Enberg
f0ae550553 configure.py: Add 'build' target for building artifacts
The default ninja build target now builds artifacts and packages. Let's
add a 'build' target that only builds the artifacts.

Message-Id: <20200714105042.416698-1-penberg@scylladb.com>
2020-07-14 13:55:32 +03:00
Pavel Emelyanov
9d38846ed2 test: Move perf measurement helpers into header
To use the code in new perf tests in next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 12:58:26 +03:00
Asias He
38d964352d repair: Relax node selection in bootstrap when nodes are less than RF
Consider a cluster with two nodes:

 - n1 (dc1)
 - n2 (dc2)

A third node is bootstrapped:

 - n3 (dc2)

n3 fails to bootstrap as follows:

 [shard 0] init - Startup failed: std::runtime_error
 (bootstrap_with_repair: keyspace=system_distributed,
 range=(9183073555191895134, 9196226903124807343], no existing node in
 local dc)

The system_distributed keyspace is using SimpleStrategy with RF 3. For
a keyspace that does not use NetworkTopologyStrategy, we should not
require the source node to be in the same DC.

Fixes: #6744
Backports: 4.0 4.1, 4.2
2020-07-14 11:54:34 +02:00
Pekka Enberg
16baf98d67 README.md: Add project description
This adds a short project description to README to make the git
repository more discoverable. The text is an edited version of a Scylla
blurb provided by Peter Corless.

Message-Id: <20200714065726.143147-1-penberg@scylladb.com>
2020-07-14 11:28:43 +03:00
Asias He
271fac56a3 repair: Add synchronous API to query repair status
This new API blocks until the repair job has finished, failed, or timed out.

E.g.,

- Without timeout
curl -X GET http://127.0.0.1:10000/storage_service/repair_status/?id=123

- With timeout
curl -X GET "http://127.0.0.1:10000/storage_service/repair_status/?id=123&timeout=5"

The timeout is in seconds.

The current asynchronous api returns immediately even if the repair is in progress.

E.g., curl -X GET http://127.0.0.1:10000/storage_service/repair_async/ks?id=123

Users can use the new synchronous API to avoid repeatedly sending the
query to poll whether the repair job has finished.
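
For scripted use, the request URL can be assembled as in the sketch
below. The endpoint path and the id and timeout parameters are taken from
the curl examples above; the helper name repair_status_url and the
default base address are assumptions made for illustration.

```python
from urllib.parse import urlencode


def repair_status_url(repair_id, timeout=None,
                      base="http://127.0.0.1:10000"):
    """Build the URL of the synchronous repair status query."""
    params = {"id": repair_id}
    if timeout is not None:
        params["timeout"] = timeout  # seconds
    return f"{base}/storage_service/repair_status/?{urlencode(params)}"


print(repair_status_url(123))
print(repair_status_url(123, timeout=5))
```

Building the query string with urlencode also avoids the shell quoting
problem that an unquoted "&" causes on a command line.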

Fixes #6445
2020-07-14 11:20:15 +03:00
Amnon Heiman
186301aff8 per table metrics: change estimated_histogram to time_estimated_histogram
This patch changes the per table latencies histograms: read, write,
cas_prepare, cas_accept, and cas_learn.

Besides changing the definition type and the insertion method, the API
was changed to support the new metrics.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-07-14 11:17:43 +03:00
Amnon Heiman
ea8d52b11c row_locking: change estimated histogram with time_estimated_histogram
This patch changes the row locking latencies to use
time_estimated_histogram.

The change consists of changing the histogram definition and changing how
values are inserted to the histogram.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-07-14 11:17:43 +03:00
Amnon Heiman
edd3c97364 alternator: change estimated_histogram to time_estimated_histogram
This patch moves the alternator latency histograms to use the time_estimated_histogram.
The change requires changing the defined type and using the simpler
insertion method.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-07-14 11:17:43 +03:00
Takuya ASADA
a233b0ab3b redis: add strlen command
Add the strlen command, which returns the length of the string stored at the key.

see: https://redis.io/commands/strlen
2020-07-14 10:56:23 +03:00
Asias He
a00ab8688f repair: Relax size check of get_row_diff and set_diff
In case of a row hash conflict, a hash in set_diff will get more than one
row from get_row_diff.

For example,

Node1 (Repair master):
row1  -> hash1
row2  -> hash2
row3  -> hash3
row3' -> hash3

Node2 (Repair follower):
row1  -> hash1
row2  -> hash2

We will have set_diff = {hash3} between node1 and node2, while
get_row_diff({hash3}) will return two rows: row3 and row3'. And the
error below was observed:

   repair - Got error in row level repair: std::runtime_error
   (row_diff.size() != set_diff.size())

In this case, node1 should send both row3 and row3' to the peer node
instead of failing the whole repair. Node2 cannot have row3 or
row3', because otherwise node1 would not send rows with hash3 to node2
in the first place.
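
The mismatch is easy to reproduce with a toy model. The sketch below is
Python, not the repair code; each node is a dict from row to hash, and
the helper names merely mirror set_diff and get_row_diff from the text.

```python
def set_diff(local, peer):
    """Hashes present locally but missing on the peer."""
    return set(local.values()) - set(peer.values())


def get_row_diff(local, hashes):
    """All local rows whose hash is in `hashes`."""
    return [row for row, h in local.items() if h in hashes]


node1 = {"row1": "hash1", "row2": "hash2", "row3": "hash3", "row3'": "hash3"}
node2 = {"row1": "hash1", "row2": "hash2"}

diff = set_diff(node1, node2)
rows = get_row_diff(node1, diff)
print(diff, rows)
```

One hash maps to two rows, so a strict size equality check between the
row list and the hash set fails even though sending both rows is the
correct outcome.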

Refs: #6252
2020-07-14 10:39:30 +03:00
Nadav Har'El
8e3be5e7d6 alternator test: configurable temporary directory
The test/alternator/run script creates a temporary directory for the Scylla
database in /tmp. The assumption was that this is the fastest disk (usually
even a ramdisk) on the test machine, and we didn't need anything else from
it.

But it turns out that on some systems, /tmp is actually a slow disk, so
this patch adds a way to configure the temporary directory - if the TMPDIR
environment variable exists, it is used instead of /tmp. As before this
patch, a temporary subdirectory is created in $TMPDIR, and this subdirectory
is automatically deleted when the test ends.

The test.py script already passes an appropriate TMPDIR (testlog/$mode),
which after this patch the Alternator test will use instead of /tmp.
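
The lookup described above amounts to something like the following
Python sketch. It is not the code of test/alternator/run; the function
name make_test_dir and the directory prefix are hypothetical.

```python
import os
import shutil
import tempfile


def make_test_dir(prefix="scylla-test-"):
    """Create a temporary subdirectory in $TMPDIR, or in /tmp if unset."""
    base = os.environ.get("TMPDIR", "/tmp")
    return tempfile.mkdtemp(prefix=prefix, dir=base)


path = make_test_dir()
try:
    print("temporary test data lives in", path)
finally:
    # Remove the subdirectory when the test ends.
    shutil.rmtree(path)
```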

Fixes #6750

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713193023.788634-1-nyh@scylladb.com>
2020-07-14 08:52:22 +03:00
Konstantin Osipov
e628da863d Export TMPDIR pointing at subdir of testlog/
Export TMPDIR environment variable pointing at a subdir of testlog.
This variable is used by seastar/scylla tests to create a
subdirectory with temporary test data. Normally a test cleans
up the temporary directory, but if it crashes or is killed the
directory remains.

By resetting the default location from /tmp to testlog/{mode}
we allow test.py to consolidate all test artefacts in a single
place.

Fixes #6062, "test.py uses tmpfs"
2020-07-13 22:22:43 +03:00
Avi Kivity
60c115add2 Update seastar submodule
* seastar 5632cf2146...0fe32ec596 (11):
  > futures: Add a test for a broken promise in a parallel_for_each
  > future: Simplify finally_body implementation
  > futures_test: Extend nested_exception test
  > Merge "make gate methods noexcept" from Benny
  > tutorial: fix service_loop example
  > sharded: fix doxygen \example clause for sharded_parameter
  > Merge "future: Don't call need_preempt in 'then' and 'then_impl'" from Rafael
  > future: Refactor a bit of duplicated code
  > Merge "Add with_file helpers" from Benny
  > Merge "Fix doxygen warnings" from Benny
  > build: add doxygen to install-dependencies.sh
2020-07-13 20:19:42 +03:00
Juliusz Stasiewicz
d1dec3fcd7 cdc: Retry generation fetching after read_failure_exception
While fetching CDC generations, various exceptions can occur. They
are divided into "fatal" and "nonfatal", where "fatal" ones prevent
retrying of the fetch operation.

This patch makes `read_failure_exception` "non-fatal", because such
an error may appear during restart. In general this type of error can
mean a few different things (e.g. an error code in a response from
replica, but also a broken connection) so retrying seems reasonable.
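
The fatal/non-fatal split can be pictured with a small retry loop. This
is a Python sketch under stated assumptions, not the CDC code: the
ReadFailure class, the NONFATAL tuple, the attempt limit and the delay
are all made up for illustration.

```python
import time


class ReadFailure(Exception):
    """Stand-in for read_failure_exception."""


# Exceptions after which the fetch is retried; anything else is fatal.
NONFATAL = (ReadFailure, TimeoutError)


def fetch_with_retry(fetch, attempts=5, delay=0.0):
    """Call `fetch` until it succeeds, retrying only non-fatal errors."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except NONFATAL:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)


calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ReadFailure("replica restarting")
    return "generation"


print(fetch_with_retry(flaky), len(calls))
```

A fatal exception is not caught at all, so it propagates on the first
attempt.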

Fixes #6804
2020-07-13 18:17:45 +03:00
Pekka Enberg
d67f4dba1e README.md: Consolidate Docker image build instructions
Consolidate the Docker image build instructions into the "Building Scylla"
section of the README instead of having it in a separate section in a different
place of the file.

Message-Id: <20200713132600.126360-1-penberg@scylladb.com>
2020-07-13 17:14:44 +03:00
Nadav Har'El
35f7048228 alternator: CreateTable with bad Tags shouldn't create a table
Currently, if a user tries to CreateTable with a forbidden set of tags,
e.g., the Tags list is too long or contains an invalid value for
system:write_isolation, then the CreateTable request fails but the table
is still created, without the tags of course.

This patch fixes this bug, and adds two test cases for it that fail
before this patch, and succeed with it. One of the test cases is
scylla_only because it checks the Scylla-specific system:write_isolation
tag, but the second test case works on DynamoDB as well.

What this patch does is to split the update_tags() function into two
parts - the first part just parses the Tags, validates them, and builds
a map. Only the second part actually writes the tags to the schema.
CreateTable now does the first part early, before creating the table,
so failure in parsing or validating the Tags will not leave a created
table behind.
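
The validate-first ordering can be sketched as follows, in Python rather
than the C++ of Alternator. MAX_TAGS and the contents of
VALID_WRITE_ISOLATION are placeholders chosen for the example, not the
real limits or values, and the dict named tables stands in for the
schema.

```python
MAX_TAGS = 50  # placeholder limit
VALID_WRITE_ISOLATION = {"mode_a", "mode_b"}  # placeholder values


def parse_tags(tags):
    """First part: validate the Tags list and build a map. No side effects."""
    if len(tags) > MAX_TAGS:
        raise ValueError("too many tags")
    tags_map = {}
    for tag in tags:
        key, value = tag["Key"], tag["Value"]
        if key == "system:write_isolation" and value not in VALID_WRITE_ISOLATION:
            raise ValueError(f"invalid write isolation: {value}")
        tags_map[key] = value
    return tags_map


def create_table(tables, name, tags):
    """Validate tags before creating anything, then write them."""
    tags_map = parse_tags(tags)         # may raise: nothing created yet
    tables[name] = {"tags": tags_map}   # second part: actual write


tables = {}
try:
    create_table(tables, "t", [{"Key": "system:write_isolation", "Value": "bad"}])
except ValueError as e:
    print("rejected:", e, "| tables:", tables)
```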

Fixes #6809.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200713120611.767736-1-nyh@scylladb.com>
2020-07-13 17:14:44 +03:00
Pekka Enberg
c6116c36e0 configure.py: Remove obsolete "--with-osv" option
The "--with-osv" option has been a no-op since commit cc17c44640
("Move seastar to a submodule"). Let's remove it as obsolete.

Message-Id: <20200713131333.125634-1-penberg@scylladb.com>
2020-07-13 17:14:44 +03:00
Nadav Har'El
21ae457e8a test.py: print test durations
When tests are run in parallel, it is hard to tell how much time each test
ran. The time difference between consecutive printouts (indicating a test's
end) says nothing about the test's duration.

This patch adds in "--verbose" mode, at the end of each test result, the
duration in seconds (in wall-clock time) of the test. For example,

$ ./test.py --mode dev --verbose alternator
================================================================================
[N/TOTAL] TEST                                                 MODE   RESULT
------------------------------------------------------------------------------
[1/2]     boost/alternator_base64_test                         dev    [ PASS ] 0.02s
[2/2]     alternator/run                                       dev    [ PASS ] 26.57s

These durations are useful for recognizing tests which are especially slow,
or runs where all the tests are unusually slow (which might indicate some
sort of misconfiguration of the test machine).

Fixes #6759

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200706142109.438905-1-nyh@scylladb.com>
2020-07-13 17:14:44 +03:00
Pekka Enberg
ace1b15ed6 configure.py: Make "dist" part of default target
This adds a new "dist-<mode>" target, which builds the server package in
selected build mode together with the other packages, and wires it to
the "<mode>" target, which is built as part of default "ninja"
invocation.

This allows us to perform a full build, package, and test cycle across
all build modes with:

  ./configure.py && ninja && ./test.py

Message-Id: <20200713101918.117692-1-penberg@scylladb.com>
2020-07-13 17:14:44 +03:00
Takuya ASADA
e6e4359414 scylla_raid_setup: switch to systemd mount unit
Since we already use systemd unit file for coredump bind mount and swapfile,
we should move to systemd mount unit for data partition as well.
2020-07-13 17:14:44 +03:00
Pekka Enberg
c807c903ab pull_github_pr.sh: Use "cherry-pick" for single-commit pull requests
Improve the "pull_github_pr.sh" to detect the number of commits in a
pull request, and use "git cherry-pick" to merge single-commit pull
requests.
Message-Id: <20200713093044.96764-1-penberg@scylladb.com>
2020-07-13 17:14:44 +03:00
Avi Kivity
d74582fbc5 move jmx/tools submodules to tools directory
Move all package repositories to tools directory.
2020-07-13 17:14:14 +03:00
Avi Kivity
06341d2528 dist: fix debian generated files for non-default PRODUCT setting
There are a bunch of renames that are done if PRODUCT is not the
default, but the Python code for them is incorrect. Path.glob()
is not a static method, and Path does not support .endswith().

Fix by constructing a Path object, and later casting to str.
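
The two pitfalls and the fix look roughly like this. The sketch is not
the actual dist code; the helper name files_ending_with, the glob pattern
and the suffix are hypothetical.

```python
from pathlib import Path


def files_ending_with(directory, pattern, suffix):
    """List files in `directory` matching `pattern` whose name ends with `suffix`."""
    result = []
    # glob() is an instance method, so a Path object is constructed first.
    for path in Path(directory).glob(pattern):
        # Path has no endswith(); cast to str before the string test.
        if str(path).endswith(suffix):
            result.append(str(path))
    return sorted(result)


print(files_ending_with(".", "*", ".install"))
```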
2020-07-13 11:51:31 +03:00
Pekka Enberg
f2b4c1a212 scylla_prepare: Improve error message on missing CPU features
Let's report each missing CPU feature individually, and improve the
error message a bit. For example, if the "clmul" instruction is missing,
the report looks as follows:

  ERROR: You will not be able to run Scylla on this machine because its CPU lacks the following features: pclmulqdq

  If this is a virtual machine, please update its CPU feature configuration or upgrade to a newer hypervisor.

Fixes #6528
2020-07-13 11:39:29 +03:00
Pekka Enberg
bc053b3cfa README.md: Add links to mailing lists and Slack
Add links to the users and developers mailing lists, and the Slack
channel in README.md to make them more discoverable.

Message-Id: <20200713074654.90204-1-penberg@scylladb.com>
2020-07-13 10:48:55 +03:00
Pekka Enberg
df6a0ec5e5 README.md: Update build and run instructions
Simplify the build and run instructions by splitting the text in three
sections (prerequisites, building, and running) and streamlining the
steps a bit.

Message-Id: <20200713065910.84582-1-penberg@scylladb.com>
2020-07-13 10:04:12 +03:00
Pekka Enberg
5476efabb3 configure.py: Make output less verbose by default
The configure.py script outputs the Seastar build command it executes:

['./cooking.sh', '-i', 'dpdk', '-d', '../build/release/seastar', '--', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DCMAKE_C_COMPILER=gcc', '-DCMAKE_CXX_COMPILER=g++', '-DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON', '-DSeastar_CXX_FLAGS=;-Wno-error=stack-usage=-ffile-prefix-map=/home/penberg/src/scylla/scylla=.;-march=westmere;-O3;-Wstack-usage=13312;--param;inline-unit-growth=300', '-DSeastar_LD_FLAGS=-Wl,--build-id=sha1,--dynamic-linker=/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////lib64/ld-linux-x86-64.so.2 ', '-DSeastar_CXX_DIALECT=gnu++20', '-DSeastar_API_LEVEL=4', '-DSeastar_UNUSED_RESULT_ERROR=ON', '-DSeastar_DPDK=ON', '-DSeastar_DPDK_MACHINE=wsm']

The output is mostly useful for debugging the build process itself, so
hide it behind a "--verbose" flag, and make it more human-readable while
at it:

./cooking.sh \
  -i \
  dpdk \
  -d \
  ../build/release/seastar \
  -- \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DCMAKE_C_COMPILER=gcc \
  -DCMAKE_CXX_COMPILER=g++ \
  -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON \
  -DSeastar_CXX_FLAGS=;-Wno-error=stack-usage=-ffile-prefix-map=/home/penberg/src/scylla/scylla=.;-march=westmere;-O3;-Wstack-usage=13312;--param;inline-unit-growth=300 \
  -DSeastar_LD_FLAGS=-Wl,--build-id=sha1,--dynamic-linker=/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////lib64/ld-linux-x86-64.so.2  \
  -DSeastar_CXX_DIALECT=gnu++20 \
  -DSeastar_API_LEVEL=4 \
  -DSeastar_UNUSED_RESULT_ERROR=ON \
  -DSeastar_DPDK=ON \
  -DSeastar_DPDK_MACHINE=wsm
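
One way to produce the layout shown above from an argument list is
sketched below. The helper name format_command is an assumption; the
actual implementation in configure.py may differ.

```python
def format_command(args):
    """Join arguments one per line, with shell line continuations."""
    return " \\\n  ".join(args)


print(format_command(["./cooking.sh", "-i", "dpdk", "-d", "../build/release/seastar"]))
```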
Message-Id: <20200713065509.83184-1-penberg@scylladb.com>
2020-07-13 09:57:38 +03:00
Botond Dénes
ef2c8f563b scylla-gdb.py: scylla fiber: add suggestion for further investigation
scylla fiber often fails to really unwind the entire fiber, stopping
sooner than expected. This is expected as scylla fiber only recognizes
the most standard continuations but can drop the ball as soon as there
is an unusual transmission.
This commit adds a message below the found tasks explaining that the
list might not be exhaustive and prints a command which can be used to
explain why the unwinding stopped at the last task.

While at it also rephrase an out-of-date comment.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200710120813.100009-1-bdenes@scylladb.com>
2020-07-12 15:43:21 +03:00
Dejan Mircevski
29fccd76ea cql/restrictions: Rename find_if to find_atom
As requested in #5763 feedback, rename to avoid clashes with
std::find_if and boost::find_if.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-07-12 14:12:30 +03:00
Dejan Mircevski
9dac9a25e5 cql/restrictions: Constrain find_if and count_if
As requested in #5763 feedback, require that Fn be callable with
binary_operator in the functions mentioned above.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-07-12 14:11:39 +03:00
Pavel Emelyanov
1331623465 test.py: Don't feed fail-on-abandoned-failed-futures to unit tests
The problem is that this option is defined in seastar testing wrapper,
while no unit tests use it, all just start themselves with app.run() and
would complain on unknown option.

"Would", because nowadays every single test in it declares its own options
in suite.yaml, that override test.py's defaults. Once an option-less unit
test is added (B+ tree ones) it will complain.

The proposal is to remove this option from the defaults. If any unit test
uses the seastar testing wrappers and needs this option, it can add it
to suite.yaml.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200709084602.8386-1-xemul@scylladb.com>
2020-07-10 16:21:14 +02:00
Tomasz Grabiec
883ac4a78c Merge "Some selective noexcept bombing" from Pavel E.
The goal is to make the lambdas, that are fed into partition cache's
clear_and_dispose() and erase_in_dispose(), to be noexcept.

This is to satisfy B+, which strictly requires those to be noexcept
(currently used collections don't care).

The set covers not only the strictly required minimum, but also some
other methods that happened to be nearby.

* https://github.com/xemul/scylla/tree/br-noexcepts-over-the-row-cache:
  row_cache: Mark invalidation lambda as noexcept
  cache_tracker: Mark methods noexcept
  cache_entry: Mark methods noexcept
  region: Mark trivial noexcept methods as such
  allocation_strategy: Mark returning lambda as noexcept
  allocation_strategy: Mark trivial noexcept methods as such
  dht: Mark noexcept methods
2020-07-10 15:02:52 +02:00
Nadav Har'El
f549d147ea alternator: fix Expected's "NULL" operator with missing AttributeValueList
The "NULL" operator in Expected (old-style conditional operations) doesn't
have any parameters, so we insisted that the AttributeValueList be empty.
However, we forgot to allow it to also be missing - a possibility which
DynamoDB allows.

This patch adds a test to reproduce this case (the test passes on DynamoDB,
fails on Alternator before this patch, and succeeds after this patch), and
a fix.

Fixes #6816.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200709161254.618755-1-nyh@scylladb.com>
2020-07-10 07:45:02 +02:00
Benny Halevy
3ce86a7160 test: restrictions_test: set_contains: uncomment check depending on #6797
Now that #6797 is fixed.

Refs #5763

Cc: Dejan Mircevski <dejan@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Test: restrictions_test(debug)
Message-Id: <20200709123703.955897-1-bhalevy@scylladb.com>
2020-07-09 17:56:09 +03:00
Benny Halevy
ec77777bda bytes: compare_unsigned: do not pass nullptr to memcmp
If either of the compared bytes_views is empty,
consider the empty prefixes equal and proceed to compare
the sizes of the suffixes.

A similar issue exists in legacy_compound_view::tri_comparator::operator().
It too must not pass nullptr to memcmp if any of the compared byte_view's
is empty.

Fixes #6797
Refs #6814

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Test: unit(dev)
Branches: all
Message-Id: <20200709123453.955569-1-bhalevy@scylladb.com>
2020-07-09 17:54:46 +03:00
Nadav Har'El
9042161ba3 merge: cdc: better pre/postimages for complicated batches
Merged pull request https://github.com/scylladb/scylla/pull/6741
by Piotr Dulikowski:

This PR changes the algorithm used to generate preimages and postimages
in CDC log. While its behavior is the same for non-batch operations
(with one exception described later), it generates pre/postimages that
are organized more nicely, and account for multiple updates to the same
row in one CQL batch.

Fixes #6597, #6598

Tests:
- unit(dev), for each consecutive commit
- unit(debug), for the last commit

Previous method

The previous method worked on a per delta row basis. First, the base
table is queried for the current state of the rows being modified in
the processed mutation (this is called the "preimage query"). Then,
for each delta row (representing a modification of a row):

  - If preimage is enabled and the row was already present in the table,
    a corresponding preimage row is inserted before the delta row.
    The preimage row contains data taken directly from the preimage
    query result. Only columns that are modified by the delta are
    included in the preimage.
  - If postimage is enabled, then a postimage row is inserted after the
    delta row. The postimage row contains data which was a result of
    taking row data directly from the preimage query result and applying
    the change the corresponding delta row represented. All columns
    of the row are included in the postimage.

The above works well for simple cases such as a singular CQL INSERT,
UPDATE, DELETE, or simple CQL BATCH-es. An example:

cqlsh:ks> BEGIN UNLOGGED BATCH
			INSERT INTO tbl (pk, ck, v) VALUES (0, 1, 111);
			INSERT INTO tbl (pk, ck, v) VALUES (0, 2, 222);
			APPLY BATCH;
cqlsh:ks> SELECT "cdc$batch_seq_no", "cdc$operation", "cdc$ttl",
			pk, ck, v from ks.tbl_scylla_cdc_log ;

 cdc$batch_seq_no | cdc$operation | cdc$ttl | pk | ck | v
------------------+---------------+---------+----+----+-----
...snip...
                0 |             0 |    null |  0 |  1 | 100
                1 |             2 |    null |  0 |  1 | 111
                2 |             9 |    null |  0 |  1 | 111
                3 |             0 |    null |  0 |  2 | 200
                4 |             2 |    null |  0 |  2 | 222
                5 |             9 |    null |  0 |  2 | 222

Preimage rows are represented by cdc operation 0, and postimage by 9.
Please note that all rows presented above share the same value of
cdc$time column, which was not shown here for brevity.

Problems with previous approach

This simple algorithm has some conceptual and implementational problems
which arise when processing more complicated CQL BATCH-es. Consider
the following example:

cqlsh:ks> BEGIN UNLOGGED BATCH
			INSERT INTO tbl (pk, ck, v1) VALUES (0, 0, 1) USING TTL 1000;
			INSERT INTO tbl (pk, ck, v2) VALUES (0, 0, 2) USING TTL 2000;
			APPLY BATCH;
cqlsh:ks> SELECT "cdc$batch_seq_no", "cdc$operation", "cdc$ttl",
			pk, ck, v1, v2 FROM tbl_scylla_cdc_log;

 cdc$batch_seq_no | cdc$operation | cdc$ttl | pk | ck | v1   | v2
------------------+---------------+---------+----+----+------+------
...snip...
                0 |             0 |    null |  0 |  0 | null |    0
                1 |             2 |    2000 |  0 |  0 | null |    2
                2 |             9 |    null |  0 |  0 |    0 |    2
                3 |             0 |    null |  0 |  0 |    0 | null
                4 |             1 |    1000 |  0 |  0 |    1 | null
                5 |             9 |    null |  0 |  0 |    1 |    0

A single cdc group (corresponding to rows sharing the same cdc$time)
might have more than one delta that modify the same row. For example,
this happens when modifying two columns of the same row with
different TTLs - due to our choice of CDC log schema, we must
represent such change with two delta rows.

It does not make sense to present a postimage after the first delta
and preimage before the second - both deltas are applied
simultaneously by the same CQL BATCH, so the middle "image" is purely
imaginary and does not appear at any point in the table.

Moreover, in this example, the last postimage is wrong - v1 is updated,
but v2 is not. None of the postimages presented above represent the
final state of the row.

New algorithm

The new algorithm now works on a per cdc group basis, not per delta row.
When starting processing a CQL BATCH:

    Load preimage query results into a data structure representing
    current state of the affected rows.

For each cdc group:

  - For each row modified within the group, a preimage is produced,
    regardless if the row was present in the table. The preimage
    is calculated based on the current state. Only include columns
    that are modified for this row within the group.
  - For each delta, produce a delta row and update the current state
    accordingly.
  - Produce postimages in the same way as preimages - but include all
    columns for each row in the postimage.

The new algorithm produces the postimage correctly when multiple deltas
affect one row, because the state of the row is updated on the fly.
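
The per-group steps above can be sketched as follows. This is a minimal
illustration in Python, not the C++ code from this series: the name
process_group, the dict-based row state and the (row_key, columns) shape
of a delta are all hypothetical, and TTLs and the cdc$ columns are left out.

```python
def process_group(state, deltas):
    """Process one cdc group (hypothetical sketch, not the Scylla code).

    state:  dict mapping row key -> dict of column -> value (current row state)
    deltas: list of (row_key, {column: new_value}) sharing one timestamp
    Returns the log rows of the group as (kind, row_key, columns) tuples.
    """
    log = []

    # Collect, per row, the columns modified anywhere within the group.
    touched = {}
    for key, changes in deltas:
        touched.setdefault(key, set()).update(changes)

    # Preimages first: only the modified columns, read from the current state.
    # A row that is absent from the table still gets a preimage, with None values.
    for key, columns in touched.items():
        current = state.get(key, {})
        log.append(("preimage", key, {c: current.get(c) for c in sorted(columns)}))

    # Deltas: emit each one and update the tracked state on the fly.
    for key, changes in deltas:
        log.append(("delta", key, dict(changes)))
        state.setdefault(key, {}).update(changes)

    # Postimages last: every tracked column of each modified row.
    for key in touched:
        log.append(("postimage", key, dict(state[key])))

    return log


# Two deltas touching the same row within one group.
state = {(0, 0): {"v1": 0, "v2": 0}}
deltas = [((0, 0), {"v1": 1}), ((0, 0), {"v2": 2})]
rows = process_group(state, deltas)
```

In this sketch the two deltas produce a single preimage, both delta rows,
and a single postimage that already contains both updated columns, so no
intermediate image appears between the deltas.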

This algorithm moves preimage and postimage rows to the beginning and
the end of the cdc group, respectively. This solves the problem of
imaginary preimages and postimages appearing inside a cdc group.

Unfortunately, it is possible for one CQL BATCH to contain changes that
use multiple timestamps. This will result in one CQL BATCH creating
multiple cdc groups, with different cdc$time. As it is impossible, with
our choice of schema, to tell that those cdc groups were created from
one CQL BATCH, instead we pretend as if those groups were separate CQL
operations. By tracking the state of the affected rows, we make sure
that preimage in later groups will reflect changes introduced in
previous groups.

One more thing - this algorithm should have the same results for
singular CQL operations and simple CQL BATCH-es, with one exception.
Previously, no preimage was produced if a row was not present in the
table. Now, the preimage row will appear unconditionally - it will have
nulls in place of column values.

* 'cdc-pre-postimage-persistence' of github.com:piodul/scylla:
  cdc: fix indentation
  cdc: don't update partition state when not needed
  cdc: implement pre/postimage persistence
  cdc: add interface for producing pre/postimages
  cdc: load preimage query result into partition state fields
  cdc: introduce fields for keeping partition state
  cdc: rename set_pk_columns -> allocate_new_log_row
  cdc: track batch_no inside transformer
  cdc: move cdc$time generation to transformer
  cdc: move find_timestamp to split.cc
  cdc: introduce change_processor interface
  cdc: remove redundant schema arguments from cdc functions
  cdc: move management of generated mutations inside transformer
  cdc: move preimage result set into a field of transformer
  cdc: keep ts and tuuid inside transformer
  cdc: track touched parts of mutations inside transformer
  cdc: always include preimage for affected rows
2020-07-09 16:55:55 +03:00
Pavel Emelyanov
bb32cff23d row_cache: Mark invalidation lambda as noexcept
It calls noexcept functions inside and handles the exception from the throwing one itself

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-09 14:46:38 +03:00
Pavel Emelyanov
1346289151 cache_tracker: Mark methods noexcept
All but a few are trivially such.

The clear_continuity() calls cache_entry::set_continuous() that had become noexcept
a patch ago.

The allocator() calls region.allocator() which had been marked noexcept a few
patches back.

The on_partition_erase() calls allocator().invalidate_references(), both had
been marked noexcept a few patches back.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-09 14:44:17 +03:00
Pavel Emelyanov
d4ef845136 cache_entry: Mark methods noexcept
All but one are trivially such, the position() one calls is_dummy_entry()
which has become noexcept right now.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-09 14:41:43 +03:00
Pavel Emelyanov
3237796e00 region: Mark trivial noexcept methods as such
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-09 14:41:37 +03:00
Pavel Emelyanov
2c4a94aeab allocation_strategy: Mark returning lambda as noexcept
It just calls current_allocator().destroy() which is noexcept

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-09 14:41:23 +03:00
Pavel Emelyanov
a497dfdd0b allocation_strategy: Mark trivial noexcept methods as such
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-09 14:41:03 +03:00
Pavel Emelyanov
6d7ae4ead1 dht: Mark noexcept methods
These are either trivially noexcept already, or call each other, thus becoming noexcept too

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-09 14:41:03 +03:00
Piotr Sarna
7ae3b25d8e alternator: cleanup raw GetString() calls
Instead of using raw GetString() from rapidjson, it's neater
to use a helper for creating string views: rjson::to_string_view().
Message-Id: <3afda97403d4601c9600f6838f2028bfabd2f2f9.1594289250.git.sarna@scylladb.com>
2020-07-09 13:58:40 +03:00
Piotr Sarna
75dbaa0834 test: add alternator test for incorrect numeric values
The test case is put inside test_manual_requests suite, because
boto3 validates numeric inputs and does not allow passing arbitrary
incorrect values.

Tests: unit(dev), alternator(local, remote)

Message-Id: <ac2baedc2ea61f0d857e7c01839f34cd15f7e02d.1594289250.git.sarna@scylladb.com>
2020-07-09 13:58:33 +03:00
Piotr Sarna
96426df72e alternator: translate number errors to ValidationException
In order to be consistent with returned error types, marshaling
exceptions thrown from parsing big decimals are translated
to ValidationException.

Message-Id: <1446878cd63ad8291327a399cf700e4f402d108c.1594289250.git.sarna@scylladb.com>
2020-07-09 13:58:25 +03:00
Dejan Mircevski
d956233a80 cql_query_test: Drop get() on cquery_nofail result
cquery_nofail returns the query result, not a future.  Invoking .get()
on its result is unnecessary.  This just happened to compile because
shared_ptr has a get() method with the same signature as future::get.

Tests: cql_query_test unit test (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-07-09 13:52:52 +03:00
Nadav Har'El
8b3dac040a alternator: add request headers to trace-level logging
When "trace"-level logging is enabled for Alternator, we log every request,
but currently only the request's body. For debugging, it is sometimes useful
to also see the headers - which are important to debug authentication,
for example. So let's print the headers as well.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200709103414.599883-1-nyh@scylladb.com>
2020-07-09 12:38:45 +02:00
Asias He
67f6da6466 repair: Switch to btree_set for repair_hash.
In one of the longevity tests, we observed 1.3s reactor stall which came from
repair_meta::get_full_row_hashes_source_op. It traced back to a call to
std::unordered_set::insert() which triggered big memory allocation and
reclaim.

I measured std::unordered_set, absl::flat_hash_set, absl::node_hash_set
and absl::btree_set. The absl::btree_set was the only one for which the
seastar oversized allocation checker did not warn in my tests, where
around 300K repair hashes were inserted into the container.

- unordered_set:
hash_sets=295634, time=333029199 ns

- flat_hash_set:
hash_sets=295634, time=312484711 ns

- node_hash_set:
hash_sets=295634, time=346195835 ns

- btree_set:
hash_sets=295634, time=341379801 ns

The btree_set is a bit slower than unordered_set but it does not make
huge memory allocations. I did not measure a real difference in the total
time to finish repair of the same dataset with unordered_set and btree_set.

To fix, switch to absl btree_set container.

Fixes #6190
2020-07-09 11:35:18 +03:00
Nadav Har'El
9ff9cd37c3 alternator test: tests for the number type
We had some tests for the number type in Alternator and how it can be
stored, retrieved, calculated and sorted, but only had rudimentary tests
for the allowed magnitude and precision of numbers.

This patch creates a new test file, test_number.py, with tests aiming to
check exactly the supported magnitudes and precision of numbers.

These tests verify two things:

1. That Alternator's number type supports the full precision and magnitude
   that DynamoDB's number type supports. We don't want to see precision
   or magnitude lost when storing and retrieving numbers, or when doing
   calculations on them.

2. That Alternator's number type does not have *better* precision or
   magnitude than DynamoDB does. If it did, users may be tempted to rely
   on that implementation detail.

The three tests of the first type pass; But all four tests of the second
type xfail: Alternator currently stores numbers using big_decimal which
has unlimited precision and almost-unlimited magnitude, and is not yet
limited by the precision and magnitude allowed by DynamoDB.
This is a known issue - Refs #6794 - and these four new xfailing tests
can be used to reproduce that issue.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200707204824.504877-1-nyh@scylladb.com>
2020-07-09 07:38:36 +02:00
Piotr Sarna
91a616968c Update seastar submodule
* seastar cbf88f59...5632cf21 (1):
  > Merge "Handle or avoid a few std::bad_alloc" from Rafael
2020-07-08 21:22:31 +02:00
Hagit Segev
aec910278f build-deb.sh: fix rm to erase only python
While building unified-deb we first use scylla/reloc/build_deb.sh to create the scylla core package, and after that scylla/reloc/python3/build_deb.sh to create python3.

On 058da69#diff-4a42abbd0ed654a1257c623716804c82 a new rm -rf command was added.
It causes python3 process to erase Scylla-core process.

Set python3 to erase its own dir scylla-python3-package only.
2020-07-08 17:58:38 +03:00
Piotr Dulikowski
ad811a48bf cdc: fix indentation 2020-07-08 15:36:41 +02:00
Piotr Dulikowski
20b236d27d cdc: don't update partition state when not needed
In some cases, tracking the state of processed rows inside `transformer`
is not needed at all. We don't need to do it if either:

- Preimage and postimage are disabled for the table,
- Only preimage is enabled and we are processing the last timestamp.

This commit disables updating the state in the cases listed above.
2020-07-08 15:36:41 +02:00
Piotr Dulikowski
246f8da6f6 cdc: implement pre/postimage persistence
Moves responsibility for generating pre/postimage rows from the
"process_change" method to "produce_preimage" and "produce_postimage".
This commit actually affects the contents of generated CDC log
mutations.

Added a unit test that verifies more complicated cases with CQL BATCH.
2020-07-08 15:36:41 +02:00
Piotr Dulikowski
24b50ffbc8 cdc: add interface for producing pre/postimages
Introduces new methods to the change_processor interface that will cause
it to produce pre/postimage rows for requested clustering key, or for
static row.

Introduces logic in split.cc responsible for calling pre/postimage
methods of the change_processor interface. This does not have any effect
on generated CDC log mutations yet, because the transformer class has
empty implementations in place of those methods.
2020-07-08 15:36:41 +02:00
Piotr Dulikowski
761c59d92a cdc: load preimage query result into partition state fields
Instead of looking up preimage data directly from the raw preimage query
results, use the raw results to populate current partition state data,
and read directly from the current partition state.
2020-07-08 15:36:41 +02:00
Piotr Dulikowski
946354ee74 cdc: introduce fields for keeping partition state
Introduces data structures that will be used for keeping the current
state of processed rows: _clustering_row_states, and _static_row_state.
2020-07-08 15:36:41 +02:00
Piotr Dulikowski
bb587a93be cdc: rename set_pk_columns -> allocate_new_log_row
The new name better describes what this function does.
2020-07-08 15:36:41 +02:00
Piotr Dulikowski
82ddeb1992 cdc: track batch_no inside transformer
Move tracking of batch_no inside the transformer.
2020-07-08 15:36:41 +02:00
Piotr Dulikowski
7b47f84965 cdc: move cdc$time generation to transformer
Generate the timeuuid on the transformer side, which allows simplifying
the change_processor interface.
2020-07-08 15:36:41 +02:00
Piotr Dulikowski
7691568b0a cdc: move find_timestamp to split.cc
The function is no longer used in log.cc, so instead it is moved to
split.cc.

Removed declaration of the function from the log.hh header, because it
is not used elsewhere - apart from testing code, but it already
declared find_timestamp in the cdc_test.cc file.
2020-07-08 15:36:40 +02:00
Piotr Dulikowski
51d97be0b3 cdc: introduce change_processor interface
This allows for a more refined use of the transformer by the
for_each_change function (now named "process_changes_with_splitting").

The change_processor interface exposes two methods so far:
begin_timestamp, and process_change (previously named "transform").
By separating those two and exposing them,
process_changes_with_splitting can cause the transformer to generate
fewer CDC log mutations - only one for each timestamp in the batch.
2020-07-08 15:36:40 +02:00
Piotr Dulikowski
f907cab156 cdc: remove redundant schema arguments from cdc functions
A `mutation` object already has a reference to its schema. It does not
make sense to call functions changed in this commit with a different
schema.
2020-07-08 15:36:40 +02:00
Piotr Dulikowski
fa00ea996a cdc: move management of generated mutations inside transformer
CDC log mutations are now stored inside `transformer`, and only moved to
the final set of mutations at the end of `transformer`'s lifetime.
2020-07-08 15:36:40 +02:00
Piotr Dulikowski
76a323a02d cdc: move preimage result set into a field of transformer
Instead of passing the preimage result set in each `transform` call, it
is now assigned to a field, and `transform` uses that field.
2020-07-08 15:36:40 +02:00
Piotr Dulikowski
79eabc04a8 cdc: keep ts and tuuid inside transformer
Adds a `begin_timestamp` method which tells the `transformer` to start
using the following timestamp and timeuuid when generating new log row
mutations.
2020-07-08 15:36:40 +02:00
Piotr Dulikowski
3c01b3c41d cdc: track touched parts of mutations inside transformer
Moves tracking of the "touched parts" statistics inside the transformer
class.

This commit is the first of multiple commits in this series which move
parts of the state used in CDC log row generation inside the
`transformer` class. There is a lot of state being passed to
`transformer` each time its methods are called, which could be as well
tracked by the `transformer` itself. This will result in a nicer
interface and will allow us to generate fewer CDC log mutations which
give the same result.
2020-07-08 15:36:40 +02:00
Piotr Dulikowski
027d20c654 cdc: always include preimage for affected rows
This changes the current algorithm so that the preimage row will not be
skipped if the corresponding row was not present in preimage query
results.
2020-07-08 15:36:40 +02:00
Rafael Ávila de Espíndola
b10beead61 memtable_snapshot_source: Avoid a std::bad_alloc crash
_should_compact is a condition_variable and condition_variable::wait()
allocates memory.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200706223201.903072-1-espindola@scylladb.com>
2020-07-08 15:21:50 +02:00
Avi Kivity
7ea9ee27dd Merge 'aggregates: Use type-specific comparators in min/max' from Juliusz
"
For collections and UDTs the `MIN()` and `MAX()` functions are
generated on the fly. Until now they worked by comparing just the
byte representations of their arguments.

This patch employs specific per-type comparators to provide semantically
sensible, dynamically created aggregates.

Fixes #6768
"

* jul-stas-6768-use-type-comparators-for-minmax:
  tests: Test min/max on set
  aggregate_fcts: Use per-type comparators for dynamic types
2020-07-08 15:07:57 +03:00
Juliusz Stasiewicz
f08e0e10be tests: Test min/max on set
Expected behavior is the lexicographical comparison of sets
(element by element), so this test was failing when raw byte
representations were compared.
2020-07-08 13:39:15 +02:00
Juliusz Stasiewicz
5b438e79be aggregate_fcts: Use per-type comparators for dynamic types
For collections and UDTs the `MIN()` and `MAX()` functions are
generated on the fly. Until now they worked by comparing just the
byte representations of arguments.

This patch uses specific per-type comparators to provide semantically
sensible, dynamically created aggregates.
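
The difference between the two comparisons can be illustrated with a small
Python sketch. It is only an illustration under stated assumptions: sorted
lists stand in for sets, the serialize helper is hypothetical, and each
element is assumed to be encoded as a 4-byte big-endian two's complement
integer with no collection framing.

```python
import struct


def serialize(values):
    # Assumed, simplified encoding: each int as 4-byte big-endian two's complement.
    return b"".join(struct.pack(">i", v) for v in values)


a = [-1, 2]
b = [1, 2]

# Raw byte comparison: -1 encodes as ff ff ff ff, so `a` compares as larger.
max_by_bytes = max(a, b, key=serialize)

# Element-by-element comparison of the typed values: `b` is larger.
max_by_value = max(a, b)
```

Here the byte-wise maximum is `a` while the typed maximum is `b`, which is
the kind of mismatch a per-type comparator avoids.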

Fixes #6768
2020-07-08 13:39:10 +02:00
Benny Halevy
d4615f4293 sstables: sstable_version_types: implement operator<=>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200707061715.578604-1-bhalevy@scylladb.com>
2020-07-08 14:23:11 +03:00
Avi Kivity
5a59bb948c Update seastar submodule
* seastar dbecfff5a4...cbf88f59f2 (14):
  > future: mark detach_promise noexcept
  > net/tls: wait for flush() in shutdown
  > httpd: Use handle_exception instead of then_wrapped
  > httpd: Use std::unique_ptr instead of a raw pointer
  > with_lock: Handle mutable lambdas
  > Merge "Make the coroutine implementation a bit more like seastar::thread" from Rafael
  > tests: Fix perf fair queue stuck
  > util/backtrace: don't get hash from moved simple_backtrace in tasktrace ctor
  > scheduling: Allow scheduling_group_get_specific(key) to access elements with queue_is_initialized = false
  > prometheus: be compatible with protobuf < 3.4.0
  > Merge "fix with_lock error handling" from Benny
  > Merge "Simplify the logic for detecting broken promises" from Rafael
  > Merge "make scheduling_group functions noexcept" from Benny
  > Merge "io_queue: Fixes and reworks for shared fair-queue" from Pavel E
2020-07-08 10:38:52 +03:00
Avi Kivity
b0698dfb38 Merge 'Rewrite CQL3 restriction representation' from dekimir
"
This is the first stage of replacing the existing restrictions code with a new representation. It adds a new class `expression` to replace the existing class `restriction`. Lots of the old code is deleted, though not all -- that will come in subsequent stages.

Tests: unit (dev, debug restrictions_test), dtest (next-gating)
"

* dekimir-restrictions-rewrite:
  cql3/restrictions: Drop dead code
  cql3/restrictions: Use free functions instead of methods
  cql3/restrictions: Create expression objects
  cql3/restrictions: Add free functions over new classes
  cql3/restrictions: Add new representation
2020-07-08 10:22:17 +03:00
Avi Kivity
bced68f187 Update tools and jmx submodules
* scylla-jmx b219573...5820992 (1):
  > dist/debian: apply generated package version for .orig.tar.gz file

* scylla-tools 1639b12061...50dbf77123 (1):
  > dist/debian: apply generated package version for .orig.tar.gz file
2020-07-08 08:49:42 +03:00
Dejan Mircevski
61288ea7db cql3/restrictions: Drop dead code
Delete unused parts of the old restrictions representation:

- drop all methods, members, and types from class restriction, but
  keep the class itself: it's the return type of
  relation::to_restriction, which we're keeping intact for now

- drop all subclasses of single_column_restriction and
  token_restriction, but keep multi_column_restriction subclasses for
  their bounds_ranges method

Keep the restrictions (plural) class, because statement_restrictions
still keeps partition/clustering/other columns in separate
collections.

Move the restriction::merge_with method to primary_key_restrictions,
where it's still being used.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-07-07 23:08:09 +02:00
Dejan Mircevski
37ebe521e3 cql3/restrictions: Use free functions instead of methods
Instead of `restriction` class methods, use the new free functions.
Specific replacement actions are listed below.

Note that class `restrictions` (plural) remains intact -- both its
methods and its type hierarchy remain intact for now.

Ensure full test coverage of the replacement code with new file
test/boost/restrictions_test.cc and some extra testcases in
test/cql/*.

Drop some existing tests because they codify buggy behaviour
(reference #6369, #6382).  Drop others because they forbid relation
combinations that are now allowed (eg, mixing equality and
inequality, comparing to NULL, etc.).

Here are some specific categories of what was replaced:

- restriction::is_foo predicates are replaced by using the free
  function find_if; sometimes it is used transitively (see, eg,
  has_slice)

- restriction::is_multi_column is replaced by dynamic casts (recall
  that the `restrictions` class hierarchy still exists)

- utility methods is_satisfied_by, is_supported_by, to_string, and
  uses_function are replaced by eponymous free functions; note that
  restrictions::uses_function still exists

- restriction::apply_to is replaced by free function
  replace_column_def

- when checking infinite_bound_range_deletions, the has_bound is
  replaced by local free function bounded_ck

- restriction::bounds and restriction::value are replaced by the more
  general free function possible_lhs_values

- using free functions allows us to simplify the
  multi_column_restriction and token_restriction hierarchies; their
  methods merge_with and uses_function became identical in all
  subclasses, so they were moved to the base class

- single_column_primary_key_restrictions<clustering_key>::needs_filtering
  was changed to reuse num_prefix_columns_that_need_not_be_filtered,
  which uses free functions

Fixes #5799.
Fixes #6369.
Fixes #6371.
Fixes #6372.
Fixes #6382.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-07-07 23:08:09 +02:00
Avi Kivity
4c221855a1 Merge 'hinted handoff: fix commitlog memory leak' from Piotr D
"
When commitlog is recreated in hints manager, only shutdown() method is
called, but not release(). Because of that, some internal commitlog
objects (`segment_manager` and `segment`s) may be left pointing to each
other through shared_ptr reference cycles, which may result in memory
leak when the parent commitlog object is destroyed.

This PR prevents memory leaks that may happen this way by calling
release() after shutdown() from the hints manager.

Fixes: #6409, Fixes #6776
"

* piodul-fix-commitlog-memory-leak-in-hinted-handoff:
  hinted handoff: disable warnings about segments left on disk
  hinted handoff: release memory on commitlog termination
2020-07-07 21:36:14 +03:00
Piotr Dulikowski
b955793088 hinted handoff: disable warnings about segments left on disk
When a mutation is written to the commitlog, a rp_handle object is
returned which keeps a reference to commitlog segment. A segment is
"dirty" when its reference count is not zero, otherwise it is "clean".

When commitlog object is being destroyed, a warning is being printed
for every dirty segment. On the other hand, clean segments are deleted.

In case of standard mutation writing path, the rp_handle moves
responsibility for releasing the reference to the memtable to which the
mutation is written. When the memtable is flushed to disk, all
references accumulated in the memtable are released. In this context, it
makes sense to warn about dirty segments, because such segments contain
mutations that are not written to sstables, and need to be replayed.

However, hinted handoff uses a different workflow - it recreates a
commitlog object periodically. When a hint is written to commitlog, the
rp_handle reference is not released, so that segments with hints are not
deleted when destroying the commitlog. When commitlog is created again,
we get a list of saved segments with hints that we can try to send at a
later time.

Although this is intended behavior, now that releasing the hints
commitlog is done properly, it causes the mentioned warning to
periodically appear in the logs.

This patch adds a parameter for the commitlog that allows disabling
this warning. It is only used when creating hinted handoff commitlogs.
2020-07-07 19:40:42 +02:00
Piotr Dulikowski
002e6c4056 hinted handoff: release memory on commitlog termination
When commitlog is recreated in hints manager, only shutdown() method is
called, but not release(). Because of that, some internal commitlog
objects (`segment_manager` and `segment`s) may be left pointing to each
other through shared_ptr reference cycles, which may result in memory
leak when the parent commitlog object is destroyed.

This commit prevents memory leaks that may happen this way by calling
release() after shutdown() from the hints manager.

Fixes: #6409, #6776
2020-07-07 19:40:32 +02:00
Nadav Har'El
0143aaa5a8 merge: Forbid internal schema changes for distributed tables
Merged patch set from Piotr Sarna:

This series addresses issue #6700 again (it was reopened),
by forbidding all non-local schema changes to be performed
from within the database via CQL interface. These changes
are dangerous since they are not directly propagated to other
nodes.

Tests: unit(dev)
Fixes #6700

Piotr Sarna (4):
  test: make schema changes in query_processor_test global
  cql3: refuse to change schema internally for distributed tables
  test: expand testing internal schema changes
  cql3: add explanatory comments to execute_internal

 cql3/query_processor.hh                      | 13 ++++++++++++-
 cql3/statements/alter_table_statement.cc     |  6 ------
 cql3/statements/schema_altering_statement.cc | 15 +++++++++++++++
 test/boost/cql_query_test.cc                 |  8 ++++++--
 test/boost/query_processor_test.cc           | 16 ++++++++--------
 5 files changed, 41 insertions(+), 17 deletions(-)
2020-07-07 18:27:16 +03:00
Takuya ASADA
967084b567 scylla_coredump_setup: support older version of coredumpctl message format
The "coredumpctl info" behavior changed in systemd-v232, and we need to
support both versions.

Before systemd-v232, it was simple.
It printed the 'Coredump' field only when the coredump existed on the filesystem.
Otherwise it printed nothing.

Since the change made in systemd-v232, it has become more complex.
It always prints the 'Storage' field, even if the coredump does not exist.
Beyond available/unavailable, it describes more:
 - Storage: none
 - Storage: journal
 - Storage: /path/to/file (inaccessible)
 - Storage: /path/to/file

To support both of them, we need to detect the message version first, then
try to detect the coredump path.
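
The detection described above could look roughly like this. It is a sketch
only, not the code from scylla_coredump_setup: the function name
find_coredump_path is hypothetical, and the parsing relies solely on the
'Coredump' and 'Storage' field shapes listed in this message.

```python
import re


def find_coredump_path(info_output):
    """Return the coredump file path from `coredumpctl info` text, or None."""
    # Newer format: a 'Storage' field is always present.
    m = re.search(r"^\s*Storage:\s*(.+?)\s*$", info_output, re.MULTILINE)
    if m:
        value = m.group(1)
        # 'none', 'journal' and '(inaccessible)' mean there is no readable file.
        if value in ("none", "journal") or value.endswith("(inaccessible)"):
            return None
        return value
    # Older format: a 'Coredump' field appears only when the file exists.
    m = re.search(r"^\s*Coredump:\s*(.+?)\s*$", info_output, re.MULTILINE)
    return m.group(1) if m else None
```

Checking for 'Storage' first doubles as the version detection: if the field
is present the newer rules apply, otherwise the sketch falls back to the
older 'Coredump' field.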

Fixes: #6789
reference: 47f5064207
2020-07-07 18:27:16 +03:00
Takuya ASADA
ef05ea8e91 node_exporter_install: stop service before force installing
Stop node-exporter.service before re-installing it, to avoid 'Text file busy' error.

Fixes #6782
2020-07-07 18:27:16 +03:00
Takuya ASADA
f34001ff14 debian: use symlink copying files to build/debian/debian/
Instead of running shutil.copy() for each *.{service,default},
create symlink for these files.
Python will copy original file when copying debian directory.
2020-07-07 18:27:16 +03:00
Asias He
0929a5e82b repair: Fix inaccurate exception message in check_failed_ranges
The failure can have causes other than a checksum failure.

Fixes #6785
2020-07-07 18:27:16 +03:00
Asias He
6e6e554944 repair: Use warn level for logs with recoverable failures
The failures these logs report are recoverable, not fatal. We should log
them at warn level instead of info level.

Fixes #5612
2020-07-07 18:27:16 +03:00
Piotr Sarna
86f8b83ece cql3: add explanatory comments to execute_internal
Executing internal CQL queries needs to be done with caution,
since they were designed to be used mainly for local tables
and have very specific semantics wrt. propagating
schema changes. A short comment is added in order to prevent
future misuse of this interface.
2020-07-07 11:54:36 +02:00
Piotr Sarna
8ecae38d6b test: expand testing internal schema changes
... in order to ensure that not only ALTER TABLE, but also other
schema altering statements are not allowed for distributed
tables/keyspaces.
2020-07-07 10:02:58 +02:00
Piotr Sarna
a544ca64e2 cql3: refuse to change schema internally for distributed tables
Changing the schemas via internal calls to CQL is dangerous,
since the changes are not propagated to other nodes. Thus, it should
never be used for regular distributed tables.
The guarding code was already added for ALTER TABLE statement
and it's now expanded to cover all schema altering statements.

Tests: unit(dev)
Fixes #6700
2020-07-07 09:32:33 +02:00
Piotr Sarna
9bdf17a804 test: make schema changes in query_processor_test global
Now that schema changes are going to be forbidden for non-local tables,
query_processor_test is updated accordingly.
2020-07-07 09:09:40 +02:00
Botond Dénes
5ebe2c28d1 db/view: view_update_generator: re-balance wait/signal on the register semaphore
The view update generator has a semaphore to limit concurrency. This
semaphore is waited on in `register_staging_sstable()` and later the
unit is returned after the sstable is processed in the loop inside
`start()`.
This was broken by 4e64002, which changed the loop inside `start()` to
process sstables in per table batches, however didn't change the
`signal()` call to return the amount of units according to the number of
sstables processed. This can cause the semaphore units to dry up, as the
loop can process multiple sstables per table but return just a single
unit. This can also block callers of `register_staging_sstable()`
indefinitely as some waiters will never be released as under the right
circumstances the units on the semaphore can permanently go below 0.
In addition to this, 4e64002 introduced another bug: table entries from
the `_sstables_with_tables` are never removed, so they are processed
every turn. If the sstable list is empty, there won't be any update
generated but due to the unconditional `signal()` described above, this
can cause the units on the semaphore to grow to infinity, allowing
future staging sstables producers to register a huge amount of sstables,
causing memory problems due to the amount of sstable readers that have
to be opened (#6603, #6707).
Both outcomes are equally bad. This patch fixes both issues and modifies
the `test_view_update_generator` unit test to reproduce them and hence
to verify that this doesn't happen in the future.
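
Both imbalances can be reproduced with a toy model. The sketch below is
Python and purely illustrative: Units and run are hypothetical names, the
counter never blocks (a real semaphore would make waiters stall instead of
going negative), and one loop iteration stands in for one per-table batch.

```python
class Units:
    """Toy counting semaphore: only tracks the number of available units."""

    def __init__(self, count):
        self.available = count

    def wait(self, n=1):
        self.available -= n

    def signal(self, n=1):
        self.available += n


def run(limit, batches, units_to_return):
    sem = Units(limit)
    for batch in batches:
        # register_staging_sstable(): one unit is taken per sstable.
        for _ in batch:
            sem.wait()
        # start(): a whole per-table batch is processed in one iteration.
        sem.signal(units_to_return(batch))
    return sem.available


# Buggy accounting: always return a single unit per batch.
lost = run(5, [["sst1", "sst2", "sst3"]], lambda batch: 1)
gained = run(5, [[]], lambda batch: 1)

# Balanced accounting: return exactly as many units as were taken.
balanced = run(5, [["sst1", "sst2", "sst3"], []], len)
```

With a limit of 5, the first buggy run ends with fewer units than it
started with, the second (an empty batch) ends with more, and the balanced
run ends where it started.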

Fixes: #6774
Refs: #6707
Refs: #6603

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200706135108.116134-1-bdenes@scylladb.com>
2020-07-07 08:53:00 +02:00
Wojciech Mitros
76038b8d8e view: differentiate identical error messages and change them to warnings
Modified the log message in view_builder::calculate_shard_build_step to make it distinct from the one in view_builder::execute, and changed their logging level to warning, since we continue even after handling an exception.

Fixes #4600
2020-07-06 20:50:34 +03:00
Dejan Mircevski
921dbd0978 cql/restrictions: Handle WHERE a>0 AND a<0
WHERE clauses with start point above the end point were handled
incorrectly.  When the slice bounds are transformed to interval
bounds, the resulting interval is interpreted as wrap-around (because
start > end), so it contains all values above 0 and all values below
0.  This is clearly incorrect, as the user's intent was to filter out
all possible values of a.

Fix it by explicitly short-circuiting to false when start > end.  Add
a test case.
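
The wrap-around misinterpretation and the short-circuit can be sketched in
Python. Both function names are hypothetical, and the bounds 5 and 3 are
used instead of 0 and 0 so that "start > end" holds for plain numbers.

```python
def in_wraparound_interval(value, start, end):
    """Exclusive interval that is treated as wrap-around when start > end."""
    if start <= end:
        return start < value < end
    # Wrap-around: everything above start plus everything below end.
    return value > start or value < end


def satisfies_slice(value, lower, upper):
    """Evaluate `a > lower AND a < upper`, short-circuiting empty slices."""
    if lower > upper:
        return False  # start above end: no value can satisfy both bounds
    return lower < value < upper


# WHERE a > 5 AND a < 3 should match nothing.
wrongly_matched = [v for v in (1, 4, 7) if in_wraparound_interval(v, 5, 3)]
correctly_matched = [v for v in (1, 4, 7) if satisfies_slice(v, 5, 3)]
```

The wrap-around reading matches 1 and 7, while the short-circuited check
matches no value at all.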

Fixes #5799.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-07-06 19:11:20 +03:00
Piotr Sarna
e4b74356bb Merge 'view_update_generator: use partitioned sstable set'
from Botond.

Recently it was observed (#6603) that since 4e6400293ea, the staging
reader is reading from a lot of sstables (200+). This consumes a lot of
memory, and after this reaches a certain threshold -- the entire memory
amount of the streaming reader concurrency semaphore -- it can cause a
deadlock within the view update generation. To reduce this memory usage,
we exploit the fact that the staging sstables are usually disjoint, and
use the partitioned sstable set to create the staging reader. This
should ensure that only the minimum number of sstable readers will be
opened at any time.

Refs: #6603
Fixes: #6707

Tests: unit(dev)

* 'view-update-generator-use-partitioned-set/v1' of https://github.com/denesb/scylla:
  db/view: view_update_generator: use partitioned sstable set
  sstables: make_partitioned_sstable_set(): return an sstable_set
2020-07-06 14:36:08 +02:00
Botond Dénes
62c6859b69 db/view: view_update_generator: use partitioned sstable set
And pass it to `make_range_sstable_reader()` when creating the reader,
thus allowing the incremental selector created therein to exploit the
fact that staging sstables are disjoint (in the case of repair and
streaming at least). This should reduce the memory consumption of the
staging reader considerably when reading from a lot of sstables.
2020-07-06 13:38:23 +03:00
Botond Dénes
84b5d6d6d0 sstables: make_partitioned_sstable_set(): return an sstable_set
Instead of an `std::unique_ptr<sstable_set_impl>`. The latter doesn't
have a publicly available destructor, so it can only be called from
within `sstables/compaction_strategy.cc` where its definition resides.
Thus it is not really usable as a public function in its current form,
which shows as it has no users either.
This patch makes it usable by returning an `sstable_set`. That is what
potential callers would want anyway. In fact this patch prepares the
ground for the next one, which wishes to use this function for just
that but can't in its current form.
2020-07-06 13:38:23 +03:00
Takuya ASADA
2d63acdd6a scylla_util.py: use correct ID value for distro.id()
It seems distro.id() does NOT always return the same value as ID in /etc/os-release.
We need to replace "ol" with "oracle" and "amzn" with "amazon".
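
A small sketch of the replacement described here. The table and function
names are hypothetical, and only the two pairs named in this message are
mapped; passing every other ID through unchanged is an assumption:

```python
# ID value from /etc/os-release -> name to compare against distro.id().
# Only the two pairs named in the commit message are listed.
OS_RELEASE_TO_DISTRO_ID = {
    "ol": "oracle",
    "amzn": "amazon",
}


def to_distro_id(os_release_id: str) -> str:
    """Translate an os-release ID; unknown IDs pass through unchanged."""
    return OS_RELEASE_TO_DISTRO_ID.get(os_release_id, os_release_id)
```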

Fixes #6761
2020-07-06 11:40:00 +03:00
Asias He
a19917eb91 gossiper: Drop replacement_quarantine
It is not used any more after "gossiper: Drop unused replaced_endpoint".

Refs #5482
2020-07-06 11:27:55 +03:00
Asias He
2bc73ad290 gossiper: Drop unused replaced_endpoint
It is not used any more after 75cf1d18b5
(storage_service: Unify handling of replaced node removal from gossip)
in the "Make replacing node take writes" series.

Refs #5482
2020-07-06 11:27:55 +03:00
Piotr Sarna
446b89f408 test: move json tests from manual/ to boost/
Manual tests are, as the name suggests, not run automatically,
which makes them more prone to regressions. JSON tests are
fast and correct, so there's no reason for them to be marked
as manual.

Message-Id: <dea75b0a0d1c238d12382a28840978884ac6ec2c.1594023481.git.sarna@scylladb.com>
2020-07-06 11:24:12 +03:00
Asias He
7926ff787b storage_service: Make handle_state_left more robust when tokens are empty
In case the tokens for the node to be removed from the cluster are
empty, log the application_state of the leaving node to help understand
why the tokens are empty and try to get the tokens from token_metadata.

Refs #6468
2020-07-06 15:51:19 +08:00
Avi Kivity
058b30b891 Merge "scylla-gdb.py: scylla_fiber: protect against reference loops" from Botond
"
This mini-series adds protection against reference loops between tasks,
preventing infinite recursion in this case.
It also contains some other improvements, like updating the task
whitelist as well as the task identification mechanism w.r.t. recent
changes in seastar.
It also improves verbose logging, which was found to not work well while
investigating the other issues fixed herein.
"

* 'scylla-gdb.py-scylla-fiber-update/v1' of https://github.com/denesb/scylla:
  scylla-gdb.py: scylla_fiber: add protection against reference loops
  scylla-gdb.py: scylla_fiber: relax requirement w.r.t. what object qualifies as task
  scylla-gdb.py: scylla_fiber: update whitelist
  scylla-gdb.py: scylla_fiber: improve verbose log output
2020-07-06 10:34:13 +03:00
Piotr Sarna
83ab41c76d test: add json test for parsing from map
Our JSON legacy helper functions for parsing documents to/from
string maps are indirectly tested by several unit tests, e.g.
caching_options_test.cc. They however lacked one corner case
detected only by dtest - parsing an empty map from a null JSON document.
This case is hereby added in order to prevent future regressions.

Message-Id: <df8243bd083b2ba198df665aeb944c8710834736.1594020411.git.sarna@scylladb.com>
2020-07-06 10:28:55 +03:00
Avi Kivity
cc7a906149 Merge "random_access_reader: futurize seek" from Benny
"
Rather than relying on a gate to serialize seek's
background work with close(), change seek() to return a
future<> and wait on it.

Also, now random_access_reader read_exactly(), seek(), and close()
are made noexcept.  This will be followed up by making
sstable parse methods noexcept.

Test: unit(dev)
"

* tag 'random_access_reader-v4' of github.com:bhalevy/scylla:
  sstables: random_access_reader: make methods noexcept
  sstables: random_access_reader: futurize seek
  sstables: random_access_reader: unify input stream close code
  sstables: random_access_reader: let file_random_access_reader set the input stream
  sstables: random_access_reader: move functions out of line
2020-07-06 10:16:18 +03:00
Asias He
027fa022e2 token_metadata: Do not throw if empty tokens are passed to remove_bootstrap_tokens
Gossip on_change callback calls storage_service::excise which calls
remove_bootstrap_tokens to remove the tokens of the leaving node from
bootstrap tokens. If empty tokens, e.g., due to gossip propagation issue
as we saw in https://github.com/scylladb/scylla/issues/6468, are passed
to remove_bootstrap_tokens, it will throw. Since the on_change callback
is marked as noexcept, such throw will cause the node to terminate which
is an overkill.

To avoid such an error bringing down the whole cluster in the worst case,
just log that the tokens passed to remove_bootstrap_tokens are empty.

Refs #6468
2020-07-06 14:28:23 +08:00
Botond Dénes
54bb9ddaae docs/debugging.md: drop --privileged from dbuild start instructions
Instead, label the mapped volume by passing `:z` options to `-v`
argument, like we do for other mapped volumes in the `dbuild` script.
Passing the `--privileged` flag doesn't work after the most recent
Fedora update and anyway, using `:z` is the proper way to make sure the
mounted volume is accessible. Historically it was needed to be able to
open cores as well, but since 5b08e91bd this is not necessary as the
container is created with SYS_PTRACE capability.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200703072703.10355-1-bdenes@scylladb.com>
2020-07-06 08:09:58 +02:00
Benny Halevy
fc89018146 sstables: random_access_reader: make methods noexcept
handle all exceptions in read_exactly, seek, and close
and specify them as noexcept.

Also, specify eof() as noexcept as it trivially is.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-07-05 19:40:48 +03:00
Benny Halevy
94460f3199 sstables: random_access_reader: futurize seek
And adjust its callers to wait on the returned future.

With this, there is no need for a gate to serialize close()
with the background work seek() used to leave behind.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-07-05 19:40:26 +03:00
Benny Halevy
765c5752c2 sstables: random_access_reader: unify input stream close code
Define a close_if_needed() helper function, to be called
from seek() and close().

A future patch will call it with a possibly disengaged
`_in` so it will close it only if it was engaged.

close_if_needed() captures the input stream unique ptr
so it will remain valid throughout close.
This was missing from close().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-07-05 19:37:39 +03:00
Benny Halevy
e7fdadd748 sstables: random_access_reader: let file_random_access_reader set the input stream
Allow file_random_access_reader constructor to set the
input stream to prepare for futurizing seek() by adding
a protected set() method.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-07-05 19:37:36 +03:00
Benny Halevy
0bb1c0f37d sstables: random_access_reader: move functions out of line
These are not good candidates for inlining.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-07-05 18:47:04 +03:00
Avi Kivity
36b6ee7b11 Merge 'python3: simplified .rpm/.deb build process' from Takuya
"
Following the scylla-server package changes, simplify the .rpm/.deb build process by merging the build scripts into a single script.
"

* syuu1228-python3_simplified_pkg_scripts:
  python3: simplified .deb build process
  python3: simplified .rpm build process
2020-07-05 18:09:17 +03:00
Avi Kivity
cc891a5de8 Merge "Convert a few uses of sstring to std::string_view" from Rafael
"
This series converts an API to use std::string_view and then converts
a few sstring variables to be constexpr std::string_view. This has the
advantage that constexpr variables cannot be part of any
initialization order problem.
"

* 'espindola/convert-to-constexpr' of https://github.com/espindola/scylla:
  auth: Convert sstring variables in common.hh to constexpr std::string_view
  auth: Convert sstring variables in default_authorizer to constexpr std::string_view
  cql_test_env: Make ks_name a constexpr std::string_view
  class_registry: Use std::string_view in (un)?qualified_name
2020-07-05 17:08:54 +03:00
Dmitry Kropachev
de82b3efae dist/common/scripts/scylla-housekeeping: wrap urllib.request with try ... except
We could hit "cannot serialize '_io.BufferedReader' object" when a request gets a 404 error from the server.
	Now you will get a legitimate error message in that case.
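
A hedged sketch of the try ... except wrapping, not the script's actual code.
The function name, the `opener` parameter and the message wording are
hypothetical; `urllib.error.HTTPError` and `urllib.error.URLError` are the
standard library exceptions raised by `urllib.request.urlopen`:

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen):
    """Return (body, error_message); exactly one of the two is None."""
    try:
        with opener(url) as response:
            return response.read(), None
    except urllib.error.HTTPError as e:
        # HTTPError must be caught first: it is a subclass of URLError.
        return None, f"HTTP error {e.code} while fetching {url}: {e.reason}"
    except urllib.error.URLError as e:
        return None, f"failed to reach {url}: {e.reason}"
```

Turning the exception into a plain string means the caller only ever handles
text, never the exception object itself.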

	Fixes #6690
2020-07-05 16:33:11 +03:00
Takuya ASADA
d94fe346ee scylla_coredump_setup: detect missing coredump file
Print an error message and exit with non-zero status under any of the following conditions:
    - coredumpctl says the coredump file is inaccessible
    - failed to detect coredump file path from 'coredumpctl info <pid>'
    - deleting coredump file failed because the file is missing

Fixes #6654
2020-07-05 14:24:51 +03:00
Takuya ASADA
d65b15f3b2 dist/debian/python3: apply version number fixup on scylla-python3
Sync version number fixup from main package, contains #6546 and #6752 fixes.

Note that scylla-python3 is likely not affected by this versioning issue,
since it uses python3 version, which normally does not contain 'rcX'.
2020-07-05 14:21:18 +03:00
Takuya ASADA
8750c5ccf3 python3: simplified .deb build process
We don't really need to have two build_deb.sh, merge it to reloc.
2020-07-04 23:41:33 +09:00
Takuya ASADA
fc320ac49d python3: simplified .rpm build process
We don't really need to have two build_rpm.sh, merge it to reloc.
2020-07-04 23:41:22 +09:00
Rafael Ávila de Espíndola
400212e81f auth: Convert sstring variables in common.hh to constexpr std::string_view
This converts the following variables:
DEFAULT_SUPERUSER_NAME AUTH_KS USERS_CF AUTH_PACKAGE_NAME

Since they are now constexpr they will not be part of any
initialization order problems.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-03 12:35:58 -07:00
Rafael Ávila de Espíndola
53ed39e64a auth: Convert sstring variables in default_authorizer to constexpr std::string_view
This converts the following variables:
ROLE_NAME RESOURCE_NAME PERMISSIONS_NAME PERMISSIONS_CF

Since they are now constexpr they will not be part of any
initialization order problems.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-03 12:33:33 -07:00
Rafael Ávila de Espíndola
33af0c293f cql_test_env: Make ks_name a constexpr std::string_view
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-03 12:28:20 -07:00
Rafael Ávila de Espíndola
a2110e413f class_registry: Use std::string_view in (un)?qualified_name
This gives more flexibility for constructing a qualified_name or
unqualified_name.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-03 12:28:14 -07:00
Nadav Har'El
8e3ecc30a9 merge: Migrate from libjsoncpp to rjson
Merged patch series by Piotr Sarna:

The alternator project was in need of a more optimized
JSON library, which resulted in creating "rjson" helper functions.
Scylla generally used libjsoncpp for its JSON handling, but in order
to reduce the dependency hell, the usage is now migrated
to rjson, which is faster and offers the same functionality.

The original plan was to be able to drop the dependency
on libjsoncpp-lib altogether and remove it from install-dependencies.sh,
but one last usage of it remains in our test suite,
namely cql_repl. The tool compares its output JSON textually,
so it depends on how a library presents JSON - what are the delimiters,
indentation, etc. It's possible to provide a layer of translation
to force rjson to print in an identical format, but the other issue
is that libjsoncpp keeps subobjects sorted by their name,
while rjson uses an unordered structure.
There are two possible solutions for the last remaining usage
of libjsoncpp:
 1. change our test suite to compare JSON documents with a JSON parser,
    so that we don't rely on internal library details
 2. provide a layer of translation which forces rjson to print
    its objects in a format identical to libjsoncpp.
(1.) would be preferred, since now we're also vulnerable to changes
inside libjsoncpp itself - if they change anything in their output
format, tests would start failing. The issue is not critical however,
so it's left for later.
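
Option (1.) can be illustrated with a short Python sketch. It is not the test
suite's code and the function name is hypothetical; it only shows that
comparing parsed documents ignores key order, indentation and delimiters,
which a textual comparison does not:

```python
import json


def same_json(expected: str, actual: str) -> bool:
    """Compare two JSON texts by value instead of character by character."""
    return json.loads(expected) == json.loads(actual)
```

Two outputs that differ only in member order or whitespace then compare
equal, whichever library printed them.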

Tests: unit(dev), manual(json_test),
       dtest(partitioner_tests.TestPartitioner.murmur3_partitioner_test)

Piotr Sarna (8):
  alternator,utils: move rjson.hh to utils/
  alternator: remove ambiguous string overloads in rjson
  rjson: add parse_to_map helper function
  rjson: add from_string_map function
  rjson: add non-throwing parsing
  rjson: move quote_json_string to rjson
  treewide: replace libjsoncpp usage with rjson
  configure: drop json.cc and json.hh helpers

 alternator/base64.hh                |   2 +-
 alternator/conditions.cc            |   2 +-
 alternator/executor.hh              |   2 +-
 alternator/expressions.hh           |   2 +-
 alternator/expressions_types.hh     |   2 +-
 alternator/rmw_operation.hh         |   2 +-
 alternator/serialization.cc         |   2 +-
 alternator/serialization.hh         |   2 +-
 alternator/server.cc                |   2 +-
 caching_options.hh                  |   9 +-
 cdc/log.cc                          |   4 +-
 column_computation.hh               |   5 +-
 configure.py                        |   3 +-
 cql3/functions/functions.cc         |   4 +-
 cql3/statements/update_statement.cc |  24 ++--
 cql3/type_json.cc                   | 212 ++++++++++++++++++----------
 cql3/type_json.hh                   |   7 +-
 db/legacy_schema_migrator.cc        |  12 +-
 db/schema_tables.cc                 |   1 -
 flat_mutation_reader.cc             |   1 +
 index/secondary_index.cc            |  80 +++++------
 json.cc                             |  80 -----------
 json.hh                             | 113 ---------------
 schema.cc                           |  25 ++--
 test/boost/cql_query_test.cc        |   9 +-
 test/manual/json_test.cc            |   4 +-
 test/tools/cql_repl.cc              |   1 +
 {alternator => utils}/rjson.cc      |  75 +++++++++-
 {alternator => utils}/rjson.hh      |  40 +++++-
 29 files changed, 344 insertions(+), 383 deletions(-)
 delete mode 100644 json.cc
 delete mode 100644 json.hh
 rename {alternator => utils}/rjson.cc (86%)
 rename {alternator => utils}/rjson.hh (81%)
2020-07-03 18:23:56 +02:00
Piotr Sarna
449e72826f configure: drop json.cc and json.hh helpers
Now that only rjson is used in the code, the old helper is not used
anywhere in the code, so it can be dropped.
2020-07-03 10:27:23 +02:00
Piotr Sarna
4cb79f04b0 treewide: replace libjsoncpp usage with rjson
In order to eventually switch to a single JSON library,
most of the libjsoncpp usage is dropped in favor of rjson.
Unfortunately, one usage still remains:
test/utils/test_repl utility heavily depends on the *exact textual*
format of its output JSON files, so replacing a library results
in all tests failing because of differences in formatting.
It is possible to force rjson to print its documents in the exact
matching format, but that's left for later, since the issue is not
critical. It would be nice though if our test suite compared
JSON documents with a real JSON parser, since there are more
differences - e.g. libjsoncpp keeps children of the object
sorted, while rapidjson uses an unordered data structure.
This change should cause no change in semantics, it strives
just to replace all usage of libjsoncpp with rjson.
2020-07-03 10:27:23 +02:00
Piotr Sarna
1b37517aab rjson: move quote_json_string to rjson
This utility function is used for type serialization,
but it also has a dedicated unit test, so it needs to be globally
reachable.
2020-07-03 10:27:23 +02:00
Piotr Sarna
f568fe869f rjson: add non-throwing parsing
Returning a disengaged optional instead of throwing an error
can be useful when the input string is expected not to be a valid
JSON in certain cases.
2020-07-03 10:27:23 +02:00
Piotr Sarna
3fda9908f2 rjson: add from_string_map function
This legacy function is needed because the existing implementation
relies on being able to parse flat JSON documents to and from maps
of strings.
2020-07-03 10:27:23 +02:00
Piotr Sarna
39b5408a84 rjson: add parse_to_map helper function
Existing infrastructure relies on being able to parse a JSON string
straight into a map of strings. In order to make rjson a drop-in
replacement(tm) for libjsoncpp, a similar helper function is provided.
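
The shape of such a helper, sketched in Python rather than C++. This is not
the rjson implementation. Returning an empty map for a null document and
rejecting non-string values are assumptions of this sketch:

```python
import json
from typing import Dict


def parse_to_map(text: str) -> Dict[str, str]:
    """Parse a flat JSON object into a map of strings."""
    doc = json.loads(text)
    if doc is None:
        # Assumption: a null document is treated as an empty map.
        return {}
    if not isinstance(doc, dict):
        raise ValueError("expected a JSON object")
    result = {}
    for key, value in doc.items():
        if not isinstance(value, str):
            raise ValueError(f"value of {key!r} is not a string")
        result[key] = value
    return result
```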
2020-07-03 10:27:23 +02:00
Piotr Sarna
1df6d98b1a alternator: remove ambiguous string overloads in rjson
It's redundant to provide function overloads for both string_view
and const string&, since both of them can be implicitly created from
const char*. Thus, only string_view overloads are kept.
Example code which was ambiguous before the patch, but compiles fine
after it:
  rjson::from_string("hello");
Without the patch, one had to explicitly state the type, e.g.:
  rjson::from_string(std::string_view("hello"));
which is excessive.
2020-07-03 08:30:01 +02:00
Piotr Sarna
4de23d256e alternator,utils: move rjson.hh to utils/
rjson is going to replace libjsoncpp, so it's moved from alternator
to the common utils/ directory.
2020-07-03 08:30:01 +02:00
Takuya ASADA
a107f086bc dist/debian: apply generated package version for .orig.tar.gz file
We are currently not able to apply the version number fixup to the .orig.tar.gz file,
even though we applied the correct fixup to debian/changelog, because it just reads
SCYLLA-VERSION-FILE.
We should parse debian/{changelog,control} instead.

Fixes #6736
2020-07-03 08:24:41 +02:00
Takuya ASADA
4769f30a11 python3: fix incorrect variable name
builddir should be BUILDDIR.
2020-07-03 08:24:41 +02:00
Avi Kivity
a3dd1ba76f build: thrift: avoid rebuild if cassandra.thrift is touched but not modified
Thrift 0.12 includes a change [1] that avoids writing the generated output
if it has not changed. As a result, if you touch cassandra.thrift
(but not change it), the generated files will not update, and as
a result ninja will try to rebuild them every time. The compilation
of thrift files will be fast due to ccache, but still we will re-link
everything.

This touching of cassandra.thrift can happen naturally when switching
to a different git branch and then switching back. The net result
is that cassandra.thrift's contents has not changed, but its timestamp
has.

Fix by adding the "restat" option to the thrift rule. This instructs
ninja to check if the output has changed as expected or not, and to
avoid unneeded rebuilds if it has not.

[1] https://issues.apache.org/jira/browse/THRIFT-4532
2020-07-03 08:24:41 +02:00
Rafael Ávila de Espíndola
6fe7706fce mutation_reader_test: Wait for a future
Nothing was waiting for this future. Found while testing another
patch.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200630183929.1704908-1-espindola@scylladb.com>
2020-07-03 08:24:41 +02:00
Rafael Ávila de Espíndola
b7f5e2e0dd big_decimal: Add more tests
It looks like an older version of my patch series was merged. The only
difference is that the new one had more tests. This patch adds the
missing ones.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200630141150.1286893-1-espindola@scylladb.com>
2020-07-03 08:24:41 +02:00
Botond Dénes
b91cb8cc60 scylla-gdb.py: scylla_fiber: add protection against reference loops
Remember all previously visited tasks and stop if one of them is seen
again. The walk algorithm is converted from recursive to iterative to
facilitate this.
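
The technique in isolation, as a minimal sketch rather than the scylla-gdb.py
code. It assumes tasks are hashable (for example raw addresses), and
`next_task` is a hypothetical callback returning the task the current one
references, or None at the end of the chain:

```python
def walk_fiber(start, next_task):
    """Follow task references iteratively, stopping at the first repeat."""
    visited = set()
    chain = []
    task = start
    while task is not None and task not in visited:
        visited.add(task)
        chain.append(task)
        task = next_task(task)
    return chain
```

An explicit loop makes the visited set easy to consult at each step and
avoids deep recursion on long chains.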
2020-07-01 16:37:47 +03:00
Botond Dénes
427dae61f8 scylla-gdb.py: scylla_fiber: relax requirement w.r.t. what object qualifies as task
Don't require that the object is located at the start of the allocation
block. Some tasks, like `seastar::internal::when_all_state_component`
might not.
2020-07-01 16:34:36 +03:00
Botond Dénes
bb5b0ccbd9 scylla-gdb.py: scylla_fiber: update whitelist
We have some new task derivatives.
2020-07-01 16:33:47 +03:00
Botond Dénes
6814f8c762 scylla-gdb.py: scylla_fiber: improve verbose log output
Gdb doesn't seem to handle multiple calls to `gdb.write()` writing to
the same line well, the content of some of these calls just disappears.
So make sure each log message is a separate line and use indentation
instead to illustrate references between messages.
2020-07-01 16:31:48 +03:00
Asias He
7f3eb8b4e8 repair: Handle dropped table in repair_range
In commit 12d929a5ae (repair: Add table_id
to row_level_repair), a call to find_column_family() was added in
repair_range. If the table is dropped, the call fails
repair_range, which in turn fails the bootstrap operation.

Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_test
Fixes: #5942
2020-07-01 12:13:14 +03:00
Takuya ASADA
03ce19d53a scylla_setup: follow hugepages package name change on Ubuntu 20.04LTS
hugepages package now renamed to libhugetlbfs-bin, we need to follow
the change.

Fixes #6673
2020-07-01 11:41:07 +03:00
Takuya ASADA
01f9be1ced scylla_setup: improve help message 2020-07-01 11:39:44 +03:00
Avi Kivity
c84217adaa Merge 'Compaction fix stall in perform cleanup' from Asias
"
compaction_manager: Avoid stall in perform_cleanup

The following stall was seen during a cleanup operation:

    scylla: Reactor stalled for 16262 ms on shard 4.

    | std::_MakeUniq<locator::tokens_iterator_impl>::__single_object std::make_unique<locator::tokens_iterator_impl, locator::tokens_iterator_impl&>(locator::tokens_iterator_impl&) at /usr/include/fmt/format.h:1158
    |  (inlined by) locator::token_metadata::tokens_iterator::tokens_iterator(locator::token_metadata::tokens_iterator const&) at ./locator/token_metadata.cc:1602
    | locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at simple_strategy.cc:?
    |  (inlined by) locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at ./locator/simple_strategy.cc:56
    | locator::abstract_replication_strategy::get_ranges(gms::inet_address, locator::token_metadata&) const at /usr/include/fmt/format.h:1158
    | locator::abstract_replication_strategy::get_ranges(gms::inet_address) const at /usr/include/fmt/format.h:1158
    | service::storage_service::get_ranges_for_endpoint(seastar::basic_sstring<char, unsigned int, 15u, true> const&, gms::inet_address const&) const at /usr/include/fmt/format.h:1158
    | service::storage_service::get_local_ranges(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const at /usr/include/fmt/format.h:1158
    |  (inlined by) operator() at ./sstables/compaction_manager.cc:691
    |  (inlined by) _M_invoke at /usr/include/c++/9/bits/std_function.h:286
    | std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>::operator()(table const&) const at /usr/include/fmt/format.h:1158
    |  (inlined by) compaction_manager::rewrite_sstables(table*, sstables::compaction_options, std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>) at ./sstables/compaction_manager.cc:604
    | compaction_manager::perform_cleanup(table*) at /usr/include/fmt/format.h:1158

To fix, we futurize the function to get sstables. If get_local_ranges()
is called inside a thread, get_local_ranges will yield automatically.

Fixes #6662
"

* asias-compaction_fix_stall_in_perform_cleanup:
  compaction_manager: Avoid stall in perform_cleanup
  compaction_manager: Return exception future in perform_cleanup
  abstract_replication_strategy: Add get_ranges_in_thread
2020-07-01 11:30:37 +03:00
Avi Kivity
7e9a3b08ac Merge "mutation_reader: shard_reader fix fast-forwarding with read-ahead" from Botond
"
Currently, the fast forwarding implementation of the shard reader is
 broken in some read-ahead related corner cases, namely:
* If the reader was not created yet, but there is an ongoing read-ahead
  (which is going to create it), the function bails out. This will
  result in this shard reader not being fast-forwarded to the new range
  at all.
* If the reader was already created and there is an ongoing read-ahead,
  the function will wait for this to complete, then fast-forward the
  reader, as it should. However, the buffer is cleared *before* the
  read-ahead is waited for. So if the read-ahead brings in new data,
  this will land in the buffer. This data will be outside of the
  fast-forwarded-to range and worse, as we just cleared the buffer, it
  might violate mutation fragment stream monotonicity requirements.

This series fixes these two bugs and adds a unit test which reproduces
both of them.

There are no known field issues related to these bugs. Only row-level
repair ever fast-forwards the multishard reader, but it only uses it in
heterogenous clusters. Even so, in theory none of these bugs affect
repair as it doesn't ever fast-forward the multishard reader before all
shards arrive at EOS.
The bugs were found while auditing the code, looking for the possible
cause of #6613.

Fixes: #6715

Tests: unit(dev)
"

* 'multishard-combining-reader-fast-forward-to-fixes/v1.1' of https://github.com/denesb/scylla:
  test/boost/mutation_reader_test: multishard reader: add tests for fast-forwarding with read-ahead
  test/boost/mutation_reader_test: extract multishard read-ahead test setup
  test/boost/mutation_reader_test: puppet_reader: fast-forward-support
  mutation_reader_test: puppet_reader: make interface more predictable
  dht::sharder: add virtual destructor
  mutation_reader: shard_reader: fix fast-forwarding with read-ahead
2020-07-01 11:22:41 +03:00
Botond Dénes
cb69406f6c test/boost/mutation_reader_test: multishard reader: add tests for fast-forwarding with read-ahead 2020-07-01 10:15:49 +03:00
Botond Dénes
d6e2033d8a test/boost/mutation_reader_test: extract multishard read-ahead test setup
Testing the multishard reader's various read-ahead related corner cases
requires a non-trivial setup. Currently there is just one such test,
but we plan to add more so in this patch we extract this setup code to a
free function to allow reuse across multiple tests.
2020-07-01 10:15:49 +03:00
Botond Dénes
851ae8c650 test/boost/mutation_reader_test: puppet_reader: fast-forward-support
A fast-forwarded puppet reader goes immediately to EOS. A counter is
added to the remote control to allow tests to check which readers were
actually fast forwarded.
2020-07-01 10:15:49 +03:00
Botond Dénes
741f0c276d mutation_reader_test: puppet_reader: make interface more predictable
Currently the puppet reader will do an automatic (half) buffer-fill in
the constructor. This makes it very hard to reason about when and how
the action that was passed to it will be executed. Refactor it to take a
list of actions and only execute those, no hidden buffer-fill anymore.
No better proof is needed for this than the fact that the test which is
supposed to test the multishard reader being destroyed with a pending
read-ahead was silently broken (not testing what it should).
This patch fixes this test too.

Also fixed in this patch are the `pending` and `destroyed` fields of the
remote control, tests can now rely on these to be correct and add
additional checkpoints to ensure the test is indeed doing what it was
intended to do.
2020-07-01 10:15:49 +03:00
Botond Dénes
6ae8e0bc7d dht::sharder: add virtual destructor
This is a class with virtual methods, it should have a virtual
destructor too.
2020-07-01 10:15:49 +03:00
Asias He
07e253542d compaction_manager: Avoid stall in perform_cleanup
The following stall was seen during a cleanup operation:

scylla: Reactor stalled for 16262 ms on shard 4.

| std::_MakeUniq<locator::tokens_iterator_impl>::__single_object std::make_unique<locator::tokens_iterator_impl, locator::tokens_iterator_impl&>(locator::tokens_iterator_impl&) at /usr/include/fmt/format.h:1158
|  (inlined by) locator::token_metadata::tokens_iterator::tokens_iterator(locator::token_metadata::tokens_iterator const&) at ./locator/token_metadata.cc:1602
| locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at simple_strategy.cc:?
|  (inlined by) locator::simple_strategy::calculate_natural_endpoints(dht::token const&, locator::token_metadata&) const at ./locator/simple_strategy.cc:56
| locator::abstract_replication_strategy::get_ranges(gms::inet_address, locator::token_metadata&) const at /usr/include/fmt/format.h:1158
| locator::abstract_replication_strategy::get_ranges(gms::inet_address) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_ranges_for_endpoint(seastar::basic_sstring<char, unsigned int, 15u, true> const&, gms::inet_address const&) const at /usr/include/fmt/format.h:1158
| service::storage_service::get_local_ranges(seastar::basic_sstring<char, unsigned int, 15u, true> const&) const at /usr/include/fmt/format.h:1158
|  (inlined by) operator() at ./sstables/compaction_manager.cc:691
|  (inlined by) _M_invoke at /usr/include/c++/9/bits/std_function.h:286
| std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>::operator()(table const&) const at /usr/include/fmt/format.h:1158
|  (inlined by) compaction_manager::rewrite_sstables(table*, sstables::compaction_options, std::function<std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > > (table const&)>) at ./sstables/compaction_manager.cc:604
| compaction_manager::perform_cleanup(table*) at /usr/include/fmt/format.h:1158

To fix, we futurize the function to get local ranges and sstables.

In addition, this patch removes the dependency on the global storage_service object.

Fixes #6662
2020-07-01 15:03:50 +08:00
Asias He
868e2da1c4 compaction_manager: Return exception future in perform_cleanup
We should return the exception future instead of throwing a plain
exception.

Refs #6662
2020-07-01 15:00:01 +08:00
Asias He
94995acedb abstract_replication_strategy: Add get_ranges_in_thread
Add a version that runs inside a seastar thread. The benefit is that
get_ranges can yield to avoid stalls.

Refs #6662
2020-07-01 15:00:01 +08:00
Botond Dénes
627054c3d7 mutation_reader: shard_reader: fix fast-forwarding with read-ahead
The current `fast_forward_to(const dht::partition_range&)`
implementation has two problems:
* If the reader was not created yet, but there is an ongoing read-ahead
  (which is going to create it), the function bails out. This will
  result in this shard reader not being fast-forwarded to the new range
  at all.
* If the reader was already created and there is an ongoing read-ahead,
  the function will wait for this to complete, then fast-forward the
  reader, as it should. However, the buffer is cleared *before* the
  read-ahead is waited for. So if the read-ahead brings in new data,
  this will land in the buffer. This data will be outside of the
  fast-forwarded-to range and worse, as we just cleared the buffer, it
  might violate mutation fragment stream monotonicity requirements.

This patch fixes both of these bugs. Targeted reproducer unit tests are
coming in the next patches.
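
The ordering issue from the second bullet can be modelled with a toy asyncio
sketch. It is not the shard reader's code: the class, its fields and the
`source` callable are hypothetical, and the first bullet (reader not created
yet) is not modelled. What it shows is waiting for the pending read-ahead
before clearing the buffer:

```python
import asyncio


class ShardReaderSketch:
    """Toy stand-in for a reader with a background read-ahead."""

    def __init__(self, source, initial_range):
        self._source = source      # hypothetical async callable: range -> list
        self._range = initial_range
        self._buffer = []
        self._read_ahead = None    # pending asyncio task, if any

    def start_read_ahead(self):
        async def fill():
            self._buffer.extend(await self._source(self._range))
        self._read_ahead = asyncio.ensure_future(fill())

    async def fast_forward_to(self, new_range):
        # Wait for the pending read-ahead first, so whatever it brings in
        # lands in the buffer before the buffer is cleared.
        if self._read_ahead is not None:
            await self._read_ahead
            self._read_ahead = None
        self._buffer.clear()
        self._range = new_range
```

Clearing first and waiting afterwards would leave fragments from the old
range in the buffer after the fast-forward.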
2020-07-01 09:51:02 +03:00
Takuya ASADA
bbd3ed9d47 scylla_util.py: switch to subprocess.run()
When we started porting the bash scripts to Python, we were not able to use
subprocess.run() since EPEL only provided python 3.4, but now we have
relocatable python, so we can switch to it.
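
For reference, a minimal sketch of a subprocess.run() based helper. It is not
the scylla_util.py code, the helper name is hypothetical, and the
`capture_output` and `text` arguments assume Python 3.7 or newer:

```python
import subprocess


def run_command(cmd, check=True):
    """Run cmd (a list of arguments) and return the CompletedProcess.

    With check=True a non-zero exit status raises CalledProcessError.
    """
    return subprocess.run(cmd, capture_output=True, text=True, check=check)
```

The returned object carries `returncode`, `stdout` and `stderr` together, so
callers do not need to manage pipes by hand.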
2020-06-30 20:13:30 +03:00
Takuya ASADA
a9de438b1f scylla_swap_setup: handle <1GB environment
Show better error message and exit with non-zero status when memory size <1GB.

Fixes #6659
2020-06-30 20:12:32 +03:00
Avi Kivity
5bcef44935 Update seastar submodule
* seastar 5db34ea8d3...dbecfff5a4 (3):
  > sharded: Do not hang on never set freed promise
Fixes #6606.
  > foreign_ptr: make constructors and methods conditionally noexcept
  > foreign_ptr: specify methods as noexcept
2020-06-30 19:27:21 +03:00
Botond Dénes
27a0772d71 docs/debugging.md: extend section on relocatable binaries
Currently the section on "Debugging coredumps" only briefly mentions
relocatable binaries, then starts with an extensive subsection on how to
open cores generated by non-relocatable binaries. There is a subsection
about relocatable binaries, but it just contains some out-of-date
workaround without any context.
In this patch we completely replace this outdated and not very useful
subsection on relocatable binaries, with a much more extensive one,
documenting step-by-step a procedure that is known to work. Also, this
subsection is moved above the non-relocatable one. All our current
releases except for 2019.1 use relocatable binaries, so the subsection
about these should be the more prominent one.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200630145655.159926-1-bdenes@scylladb.com>
2020-06-30 18:35:36 +03:00
Avi Kivity
40f722f4c7 Update seastar submodule
* seastar 11e86172ba...5db34ea8d3 (7):
  > scollectd: Avoid a deprecated warning
  > prometheus: Avoid protobuf deprecated warning
  > Merge "Avoid abandoned futures in tests" from Rafael
  > Merge "make circular_buffer methods noexcept" from Benny
  > futures_test: Wait for future in test
  > iotune: Report disk IOPS instead of kernel IOPS
  > io_tester: Ability to add dsync option to open_file_dma
2020-06-30 17:23:11 +03:00
Takuya ASADA
5e207696d9 scylla_ntp_setup: switch to distro package
Use distro API to simplify distribution detection.
2020-06-30 13:57:08 +03:00
Raphael S. Carvalho
cf352e7c14 sstables: optimize procedure that checks if a sstable needs cleanup
needs_cleanup() returns true if a sstable needs cleanup.

Turns out it's very slow because it iterates through all the local
ranges for all sstables in the set, making its complexity:
	O(num_sstables * local_ranges)

We can optimize it by taking into account that abstract_replication_strategy
documents that get_ranges() will return a list of ranges that is sorted
and non-overlapping. Compaction for cleanup already takes advantage of that
when checking if a given partition can be actually purged.

So needs_cleanup() can be optimized into O(num_sstables * log(local_ranges)).

With num_sstables=1000, RF=3, then local_ranges=256(num_tokens)*3, it means
the max # of checks performed will go from 768000 to ~9584.

Fixes #6730.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-2-raphaelsc@scylladb.com>
2020-06-30 12:58:43 +03:00
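The binary-search idea can be sketched as follows. This is an illustration only, not the actual C++ implementation: tokens are plain integers, ranges are closed `(start, end)` pairs, wrap-around is ignored, and all names are hypothetical. It relies on the same property the commit message cites, namely that the owned ranges are sorted and non-overlapping.

```python
import bisect
import math


def needs_cleanup(first_token, last_token, starts, ends):
    """Return True unless [first_token, last_token] lies inside one owned range.

    starts/ends are parallel lists describing sorted, non-overlapping
    closed ranges; they are precomputed once and reused for every sstable.
    """
    # Index of the last range whose start is <= first_token.
    idx = bisect.bisect_right(starts, first_token) - 1
    if idx < 0:
        return True                      # sstable begins before any owned range
    return last_token > ends[idx]        # sstable ends past the candidate range


owned = [(0, 9), (20, 29), (40, 49)]
starts = [start for start, _ in owned]
ends = [end for _, end in owned]

inside = needs_cleanup(21, 28, starts, ends)      # fully owned
straddling = needs_cleanup(8, 22, starts, ends)   # crosses a gap
outside = needs_cleanup(30, 35, starts, ends)     # lies in a gap

# Check counts from the commit message: 1000 sstables, 256 * 3 local ranges.
linear_checks = 1000 * 256 * 3
log_checks = 1000 * math.log2(256 * 3)
```

Each sstable costs one `bisect` lookup instead of a scan over every local range, which is where the drop from 768000 to roughly 9584 checks comes from.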
Raphael S. Carvalho
a9eebdc778 sstables: export needs_cleanup()
May be needed elsewhere, like in a unit test.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-1-raphaelsc@scylladb.com>
2020-06-30 12:58:43 +03:00
Avi Kivity
08d41ee841 Merge "Fix API snapshot details getter" from Pavel E
"
A recent branch introduced an issue with listing snapshots via the API
that the regular snapshot tests did not catch. This set fixes it (patch #1),
cleans up the indentation after the fix (#2) and polishes the way
things are captured nearby (#3).

Tests: dtest(nodetool_additional_tests)
"

* 'br-fix-snap-get-details' of https://github.com/xemul/scylla:
  api: Remove excessive capture
  api: Fix indentation after previous patch
  api: Fix wrongly captured map of snapshots
2020-06-30 12:56:00 +03:00
Botond Dénes
effa632743 scylla-gdb.py: scylla_find: use ptr to start of object to lookup vptr
And not the pointer to the offset where the searched-for value was found
in the object.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200625081242.486929-1-bdenes@scylladb.com>
2020-06-30 12:54:18 +03:00
Raphael S. Carvalho
68e12bd17e sstables: sstable_directory: place debug message in logger
this message, intended for debugging purposes, is not going through
the logger.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629184642.53348-1-raphaelsc@scylladb.com>
2020-06-30 12:47:17 +03:00
Benny Halevy
7dc3ce4994 init: init_ms_fd_gossiper: use logger for error message
Currently fmt::print is used to print an error message
if (broadcast_address != listen && seeds.count(listen))
and the logger should be used instead.

While at it, the information printed in this message is valuable
also in the error-free case, so this change logs it at `info`
level and then logs an error without repeating the said info.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Test: bootstrap_test.py:TestBootstrap.start_stop_test_node(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200630083826.153326-1-bhalevy@scylladb.com>
2020-06-30 12:46:44 +03:00
Avi Kivity
5125e9b51d Merge "Avoid varidic futures" from Rafael
"
These patches remove the last few uses of variadic futures in scylla.
"

* 'espindola/no-variadic-future' of https://github.com/espindola/scylla:
  row level repair: Don't return a variadic future from get_sink_source
  row level repair: Don't return a variadic future from read_rows_from_disk
  messaging_service: Don't return variadic futures from make_sink_and_source_for_*
  cql3: Don't use variadic futures in select_statement
2020-06-30 12:45:37 +03:00
Avi Kivity
293d4117c1 Merge "Initial cleanup work post off-strategy" from Raphael
"
Offstrategy work, on boot and refresh, guarantees that a shared SSTable
will not reach the table whatsoever. We have lots of extra code in
table to make it able to live with those shared SSTables.
Now we can fortunately get rid of all that code.

tests: mode(dev).
also manually tested it by triggering resharding both on boot/refresh.
"

* 'cleanup_post_offstrategy_v2' of https://github.com/raphaelsc/scylla:
  distributed_loader: kill unused invoke_shards_with_ptr()
  sstables:: kill unused sstables::sstable_open_info
  sstables: kill unused sstable::load_shared_components()
  distributed_loader: remove declaration of inexistent do_populate_column_family()
  table: simplify table::discard_sstables()
  table: simplify add_sstable()
  table: simplify update_stats_for_new_sstable()
  table: remove unused open_sstable function
  distributed_loader: remove unused code
  table: no longer keep track of sstables that need resharding
  table: Remove unused functions no longer used by resharding
  table: remove sstable::shared() condition from backlog tracker add/remove functions
  table: No longer accept a shared SSTable
2020-06-30 12:42:34 +03:00
Tomasz Grabiec
8bd7359d93 Merge "lwt: introduce LWT flag in prepared statement metadata" from Pavel
This patch set adds a few new features in order to fix issue

The list of changes is briefly as follows:
 - Add a new `LWT` flag to `cql3::prepared_metadata`,
   which allows clients to clearly distinguish between lwt and
   non-lwt statements without need to execute some custom parsing
   logic (e.g. parsing the prepared query with regular expressions),
   which is obviously quite fragile.
 - Introduce the negotiation procedure for cql protocol extensions.
   This is done via `cql_protocol_extension` enum and is expected
   to have an appropriate mirroring implementation on the client
   driver side in order to work properly.
 - Implement a `LWT_ADD_METADATA_MARK` cql feature on top of the
   aforementioned algorithm to make the feature negotiable and use
   it conditionally (iff both server and client agree with each
   other on the set of cql extensions).

The feature is meant to be further utilized by client drivers
to use primary replicas consistently when dealing with conditional
statements.

* git@github.com:ManManson/scylla feature/lwt_prepared_meta_flag_2:
  lwt: introduce "LWT" flag in prepared statement metadata
  transport: introduce `cql_protocol_extension` enum and cql protocol extensions negotiation
2020-06-30 12:40:19 +03:00
Takuya ASADA
eb405f0908 scylla_util.py: stop using /etc/os-release, use distro
Currently we mistakenly have two different ways to detect the distribution:
reading /etc/os-release directly and using the distro package.

The distro package provides well-abstracted APIs and still has full access to
os-release information, so we should switch to it.

Fixes #6691
2020-06-30 12:40:19 +03:00
Asias He
9abaf9bc2e boot_strapper: Ignore node to be replaced explicitly as stream source
After commit 7d86a3b208 (storage_service:
Make replacing node take writes), during replace operation, tokens in
_token_metadata for node being replaced are updated only after the replace
operation is finished. As a result, in range_streamer::add_ranges, the
node being replaced will be considered as a source to stream data from.

Before commit 7d86a3b208, the node being
replaced was not considered as a source node because it was already
replaced by the replacing node before the replace operation finished.
This is why it worked in the past.

To fix, filter out the node being replaced as a source node explicitly.

Tests: replace_first_boot_test and replace_stopped_node_test
Backports: 4.1
Fixes: #6728
2020-06-30 12:40:19 +03:00
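The explicit filter can be pictured with a small sketch. The function name, the range labels, and the addresses are made up for illustration and do not come from `range_streamer`.

```python
def filter_stream_sources(range_to_candidates, replaced_node):
    """Drop the node being replaced from every range's candidate sources."""
    return {
        token_range: [node for node in nodes if node != replaced_node]
        for token_range, nodes in range_to_candidates.items()
    }


candidates = {
    "range-a": ["10.0.0.1", "10.0.0.2"],
    "range-b": ["10.0.0.2", "10.0.0.3"],
}
sources = filter_stream_sources(candidates, replaced_node="10.0.0.2")
```

The filter has to be explicit because, per the commit message, the token metadata still lists the replaced node until the replace operation finishes.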
Rafael Ávila de Espíndola
3964b1a551 row level repair: Don't return a variadic future from get_sink_source 2020-06-29 16:51:41 -07:00
Rafael Ávila de Espíndola
eeee63a9a3 row level repair: Don't return a variadic future from read_rows_from_disk 2020-06-29 16:51:10 -07:00
Rafael Ávila de Espíndola
af44684418 messaging_service: Don't return variadic futures from make_sink_and_source_for_* 2020-06-29 16:50:45 -07:00
Rafael Ávila de Espíndola
abb36cc7d1 cql3: Don't use variadic futures in select_statement 2020-06-29 16:49:41 -07:00
Raphael S. Carvalho
18880af9ad distributed_loader: kill unused invoke_shards_with_ptr()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-06-29 14:23:50 -03:00
Raphael S. Carvalho
593c1e00c8 sstables:: kill unused sstables::sstable_open_info
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-06-29 14:23:48 -03:00
Raphael S. Carvalho
c7ba495691 sstables: kill unused sstable::load_shared_components()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-06-29 14:23:45 -03:00
Raphael S. Carvalho
4683cb06c2 distributed_loader: remove declaration of inexistent do_populate_column_family()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-06-29 14:23:42 -03:00
Raphael S. Carvalho
1e9c5b5295 table: simplify table::discard_sstables()
no longer need to have any special code for shared SSTables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-06-29 14:23:40 -03:00
Raphael S. Carvalho
ce210a4420 table: simplify add_sstable()
get_shards_for_this_sstable() can be called inside table::add_sstable()
because the shards for an sstable are precomputed, so the call is completely
exception safe. We want a central point for checking that table will
no longer add shared SSTables to its sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-06-29 14:23:32 -03:00
Raphael S. Carvalho
68b527f100 table: simplify update_stats_for_new_sstable()
no longer need to conditionally track the SSTable metadata,
as table will no longer accept shared SSTables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-06-29 14:22:04 -03:00
Raphael S. Carvalho
607c74dc95 table: remove unused open_sstable function
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-06-29 14:22:00 -03:00
Raphael S. Carvalho
6dfeb107ae distributed_loader: remove unused code
Remove code no longer used by population procedure.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-06-29 14:21:40 -03:00
Raphael S. Carvalho
60467a7e36 table: no longer keep track of sstables that need resharding
Now that table will no longer accept shared SSTables, it no longer
needs to keep track of them.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-06-29 14:21:38 -03:00
Raphael S. Carvalho
cd548c6304 table: Remove unused functions no longer used by resharding
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-06-29 14:21:36 -03:00
Raphael S. Carvalho
68a4739a42 table: remove sstable::shared() condition from backlog tracker add/remove functions
Now that table no longer accepts shared SSTables, those two functions can
be simplified by removing the shared condition.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-06-29 14:21:34 -03:00
Raphael S. Carvalho
343efe797d table: No longer accept a shared SSTable
With off-strategy work on reshard on boot and refresh, table no
longer needs to work with Shared SSTables. That will unlock
a host of cleanups.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-06-29 14:21:04 -03:00
Pavel Emelyanov
d0d2da6ccb api: Remove excessive capture
The "result" in this lambda is already not used and can be removed

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-29 19:08:59 +03:00
Pavel Emelyanov
4f5ffa980d api: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-29 19:08:59 +03:00
Pavel Emelyanov
d99969e0e0 api: Fix wrongly captured map of snapshots
The result of get_snapshot_details() is saved in do_with, then is
captured by the json callback by reference, then the do_with's
future resolves, so by the time the callback is called the map is already
freed and empty.

Fix by capturing the result directly on the callback.
Fixes recently merged b6086526.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-29 19:08:21 +03:00
Pavel Solodovnikov
6c6f3dbe42 lwt: introduce "LWT" flag in prepared statement metadata
This patch adds a new `LWT` flag to `cql3::prepared_metadata`.

That allows clients to clearly distinguish between lwt and
non-lwt statements without need to execute some custom parsing
logic (e.g. parsing the prepared query with regular expressions),
which is obviously quite fragile.

The feature is meant to be further utilized by client drivers
to use primary replicas consistently when dealing with conditional
statements.

Whether to use lwt optimization flag or not is handled by negotiation
procedure between scylla server and client library via SUPPORTED/STARTUP
messages (`LWT_ADD_METADATA_MARK` extension).

Tests: unit(dev, debug), manual testing with modified scylla/gocql driver

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-06-29 12:30:37 +03:00
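On the driver side, using the flag could look roughly like the sketch below. Everything here is hypothetical: the mask value `1 << 31` is an assumed example (the real mask is communicated during negotiation), and `choose_replica` is an invented helper, not part of any driver.

```python
ASSUMED_LWT_MASK = 1 << 31   # assumption for illustration; negotiated in practice


def is_lwt(metadata_flags, lwt_mask=ASSUMED_LWT_MASK):
    """Return True if the prepared-statement metadata carries the LWT bit."""
    return (metadata_flags & lwt_mask) != 0


def choose_replica(replicas, metadata_flags, round_robin_index,
                   lwt_mask=ASSUMED_LWT_MASK):
    """Always pick the primary replica for LWT, otherwise rotate."""
    if is_lwt(metadata_flags, lwt_mask):
        return replicas[0]                      # primary replica, consistently
    return replicas[round_robin_index % len(replicas)]


replicas = ["node-a", "node-b", "node-c"]
lwt_target = choose_replica(replicas, ASSUMED_LWT_MASK | 0x1, round_robin_index=2)
plain_target = choose_replica(replicas, 0x1, round_robin_index=2)
```

Reading one bit from the metadata replaces the fragile approach the message mentions, where the driver would parse the query text to guess whether a statement is conditional.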
Nadav Har'El
23ce6864a3 alternator test: ProjectionExpression test for BatchGetItem
The tests in test_projection_expression.py test that ProjectionExpression
works - including attribute paths - for the GetItem, Query and Scan
operations.

There is a fourth read operation - BatchGetItem, and it supports
ProjectionExpression too. We tested BatchGetItem + ProjectionExpression in
test_batch.py, but this only tests the basic feature, with top-level
attributes, and we were missing a test for nested document paths.

This patch adds such a test. It is still xfailing on Alternator (and passing
on DynamoDB), because attribute paths are still not supported (this is
issue #5024).

Refs #5024.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200629063244.287571-1-nyh@scylladb.com>
2020-06-29 08:51:05 +02:00
Nadav Har'El
b6fdd956bd alternator test: ProjectionExpression tests for document paths
This patch adds three more tests for the ProjectionExpression parameter
of GetItem. They are tests for nested document paths like a.b[2].c.

We don't support nested paths in Alternator yet (this is issue #5024),
so the new tests all xfail (and pass on DynamoDB).

We already had similar tests for UpdateExpression, which also needs to
support document paths, but the tests were missing for ProjectionExpression.
I am planning to start the implementation of document paths with
ProjectionExpression (which is the simplest use of document paths), so I
want the tests for this expression to be as complete as possible.

Refs #5024.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200628213208.275050-1-nyh@scylladb.com>
2020-06-29 08:50:55 +02:00
Avi Kivity
509442b128 Merge "Move snapshot code from storage_service into independent component" from Pavel E
"
The snapshotting code is already well isolated from the rest of
the storage_service, so it's relatively easy to move it into
independent component, thus de-bloating the storage_service.

As a side effect this allows painless removal of calls to global
get_storage_service() from schema::describe code.

Test: unit(debug), dtest.snapshot_test(dev), manual start-stop
"

* 'br-snapshot-controller-4' of https://github.com/xemul/scylla:
  snap: Get rid of storage_service reference in schema.cc
  main: Stop http server
  snapshot: Make check_snapshot_not_exist a method
  snapshots: Move ops gate from storage_service
  snapshot: Move lock from storage_service
  snapshot: Move all code into db::snapshot_ctl class
  storage_service: Move all snapshot code into snapshot-ctl.cc
  snapshots: Initial skeleton
  snapshots: Properly shutdown API endpoints
  api: Rewrap set_server_snapshot lambda
2020-06-28 13:17:32 +03:00
Takuya ASADA
a77882f075 scylla_setup: don't print prompt message multiple times when disk list passed
When a comma-separated disk list is passed to the RAID prompt, the prompt
message is shown multiple times.
It should always be shown just once.

Fixes #6724
2020-06-28 12:19:22 +03:00
Takuya ASADA
835e76fdfc scylla_setup: don't add same disk device twice
We shouldn't accept adding the same disk twice at the RAID prompt.

Fixes #6711
2020-06-28 12:19:22 +03:00
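A minimal sketch of the duplicate check, assuming the prompt answer is a comma-separated string of device paths. `parse_disk_list` is a hypothetical helper and is not the code in scylla_setup.

```python
def parse_disk_list(answer, already_selected=()):
    """Split a comma-separated answer into devices, rejecting duplicates.

    Returns (selected, rejected): the accumulated device list and the
    entries that were refused because they were already present.
    """
    selected = list(already_selected)
    rejected = []
    for name in answer.split(","):
        name = name.strip()
        if not name:
            continue                 # ignore empty items such as "a,,b"
        if name in selected:
            rejected.append(name)    # same disk given twice
        else:
            selected.append(name)
    return selected, rejected


selected, rejected = parse_disk_list("/dev/sdb, /dev/sdc,/dev/sdb")
```

Handling the whole list in one call also makes it natural to print the prompt once per answer rather than once per device.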
Botond Dénes
e31f7316c0 mutation_reader: evictable_reader: add assert against pause handle leak
We are currently investigating a segmentation fault, which is suspected
to be caused by a leaked pause handle. Although according to the latest
theory the handle leak is not the root cause of the issue, just a
symptom, it's better to catch any bug that would leak a handle
in the act, and not later when some side effect causes a segfault.

Refs: #6613
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200625153729.522811-1-bdenes@scylladb.com>
2020-06-28 12:08:25 +03:00
Avi Kivity
3e2eeec83a Merge "Fix handling of decimals with negative scales" from Rafael
"
Before this series scylla would effectively infinite loop when, for
example, casting a decimal with a negative scale to float.

Fixes #6720
"

* 'espindola/fix-decimal-issue' of https://github.com/espindola/scylla:
  big_decimal: Add a test for a corner case
  big_decimal: Correctly handle negative scales
  big_decimal: Add a as_rational member function
  big_decimal: Move constructors out of line
2020-06-28 12:06:35 +03:00
Dejan Mircevski
2d91e5f6a0 configure.py: Drop unused var cassandra_interface
The variable doesn't appear to be used anywhere.

Tests: manually run configure.py

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-06-27 21:20:05 +03:00
Dejan Mircevski
65030f1406 configure.py: Update gcc version check
As HACKING.md suggests, we now require gcc version >= 10.  Set the
minimum at 10.1.1, as that is the first official 10 release:

https://gcc.gnu.org/releases.html

Tests: manually run configure.py and ensure it passes/fails
appropriately.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-06-27 21:19:00 +03:00
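A version gate of this kind can be sketched with tuple comparison. The helper names are hypothetical and this is not the code in configure.py; only the minimum `10.1.1` comes from the commit message.

```python
MINIMUM_GCC = (10, 1, 1)   # minimum stated in the commit message


def parse_version(text):
    """Turn a dotted version string such as '10.2.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def gcc_is_new_enough(version_text, minimum=MINIMUM_GCC):
    """Compare versions component by component, not as strings."""
    return parse_version(version_text) >= minimum


checks = {v: gcc_is_new_enough(v) for v in ["9.3.0", "10.1.0", "10.1.1", "10.2.0"]}
```

Comparing integer tuples avoids the string-comparison trap where "9.3.0" would sort after "10.1.1".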
Dejan Mircevski
a12bbef980 README.md: dedupe "offers offers"
The word "offers" was inadvertently repeated in a sentence.

Signed-off-by: Dejan Mircevski <github@mircevski.com>
2020-06-27 21:17:33 +03:00
Pavel Emelyanov
f045cec586 snap: Get rid of storage_service reference in schema.cc
Now that snapshot stopping is correctly handled, we may pull the database
reference all the way down to the schema::describe().

One tricky place is in table::snapshot() -- the local db reference is pulled
through an smp::submit_to call, but thanks to the shard checks in the place
where it is needed the db is still "local"

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 20:28:25 +03:00
Pavel Emelyanov
8d2e05778c main: Stop http server
Currently it's not stopped at all, so issuing a REST request at shutdown time
may crash things at random places.

Fixes: #5702

But it's not the end of the story. Since the server stays up while we are
shutting things down, each subsystem should carefully handle the cases when
it's half-down, but a request comes. A better solution is to unregister
rest verbs eventually, but httpd's rules cannot do it now.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 20:27:28 +03:00
Pavel Emelyanov
9211df2cdf snapshot: Make check_snapshot_not_exist a method
Sanitation. It now can access the this->_db pointer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 20:26:15 +03:00
Pavel Emelyanov
ba47ef0397 snapshots: Move ops gate from storage_service
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 20:17:21 +03:00
Pavel Emelyanov
e439873319 snapshot: Move lock from storage_service
For this de-static run_snapshot_*_operation (because we no longer have
the static global to get the lock from) and make the snapshot_ctl be
peering_sharded_service to call invoke_on.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 20:17:19 +03:00
Pavel Emelyanov
d674baacef snapshot: Move all code into db::snapshot_ctl class
This includes
- rename namespace in snapshot-ctl.[cc|hh]
- move methods from storage_service to snapshot_ctl
- move snapshot_details struct
- temporarily make storage_service._snapshot_lock and ._snapshot_ops public
- replace two get_local_storage_service() occurrences with this._db

The latter is not 100% clear as the code that does this references "this"
from another shard, but the _db in question is the distributed object, so
they are all the same on all instances.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 19:59:53 +03:00
Pavel Emelyanov
8d36607044 storage_service: Move all snapshot code into snapshot-ctl.cc
This is a plain move; no other modifications are made, and even the
"service" namespace is kept. Only a few broken indentations are fixed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 19:54:15 +03:00
Pavel Emelyanov
d989d9c1c7 snapshots: Initial skeleton
A placeholder for snapshotting code that will be moved into it
from the storage_service.

Also -- pass it through the API for future use.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 19:54:14 +03:00
Pavel Emelyanov
9a8a1635b7 snapshots: Properly shutdown API endpoints
Now with the seastar httpd routes unset() at hand we
can shut down individual API endpoints. Do this for
snapshot calls; this will make stopping the snapshot
controller safe.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 17:27:45 +03:00
Pavel Emelyanov
b608652622 api: Rewrap set_server_snapshot lambda
The lambda calls the core snapshot method deep inside the
json marshalling callback. This will bring problems with
stopping the snapshot controller in the next patches.

To prepare for this -- call the .get_snapshot_details()
first, then keep the result in do_with() context. This
change doesn't affect the issue the lambda in question is
about to solve as the whole result set is anyway kept in
memory while being streamed outside.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 17:27:45 +03:00
Dejan Mircevski
0688f5c3f9 cql3/restrictions: Create expression objects
Add expression as a member of restriction.  Create or update
expression everywhere restrictions are created or updated.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-06-26 09:19:36 -04:00
Dejan Mircevski
d33053b841 cql3/restrictions: Add free functions over new classes
These functions will replace class methods from the existing
restriction classes.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-06-26 09:19:36 -04:00
Dejan Mircevski
1d66b33325 cql3/restrictions: Add new representation
These new classes will replace the existing restrictions hierarchy.
Instead of having member functions, they expose their data publicly,
for future free functions to operate on.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-06-26 09:19:36 -04:00
Rafael Ávila de Espíndola
85bb7ff743 big_decimal: Add a test for a corner case
This behavior is different from cassandra, but without arithmetic
operations it doesn't seem possible to notice the difference from
CQL. Using avg produces the same results, since we use an initial
value of 0 (scale = 0).

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-25 15:37:23 -07:00
Rafael Ávila de Espíndola
684f32c862 big_decimal: Correctly handle negative scales
A negative scale was being passed as a positive value to
boost::multiprecision::pow, which would never finish.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-25 15:34:10 -07:00
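The intended arithmetic can be sketched with exact fractions. This shows only the math, not the C++ fix itself: the function borrows the name `as_rational` from the neighbouring commit, and it assumes the usual convention that a decimal's value is `unscaled_value * 10**(-scale)`.

```python
from fractions import Fraction


def as_rational(unscaled_value, scale):
    """Return unscaled_value * 10**(-scale) as an exact Fraction."""
    if scale >= 0:
        # Positive scale: digits after the decimal point, so divide.
        return Fraction(unscaled_value, 10 ** scale)
    # Negative scale: multiply by 10**|scale| instead of feeding a
    # sign-confused exponent to the power function.
    return Fraction(unscaled_value * 10 ** (-scale))


small = as_rational(12345, 2)     # 123.45
large = as_rational(12, -3)       # 12000
as_float = float(large)
```

Branching on the sign keeps the exponent handed to the power operation small and non-negative in both cases.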
Rafael Ávila de Espíndola
bac0f3a9ee big_decimal: Add a as_rational member function
This just refactors some duplicated code so that it can be fixed in
one place.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-25 15:33:31 -07:00
Rafael Ávila de Espíndola
77725ce1a4 big_decimal: Move constructors out of line
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-25 15:33:01 -07:00
Benny Halevy
a843945115 comapction: restore % in compaction completion message
The % sign fell off in c4841fa735

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200625151352.736561-1-bhalevy@scylladb.com>
2020-06-25 18:11:59 +02:00
Avi Kivity
e5be3352cf database, streaming, messaging: drop streaming memtables
Before Scylla 3.0, we used to send streaming mutations using
individual RPC requests and flush them together using dedicated
streaming memtables. This mechanism is no longer in use and all
versions that use it have long reached end-of-life.

Remove this code.
2020-06-25 15:25:54 +02:00
Pavel Solodovnikov
6028588148 transport: introduce cql_protocol_extension enum and cql protocol extensions negotiation
The patch introduces two new features to aid with negotiating
protocol extensions for the CQL protocol:
 - `cql_protocol_extension` enum, which holds all supported
   extensions for the CQL protocol (currently contains only
   `LWT_ADD_METADATA_MARK` extension, which will be mentioned
   below).
 - An additional mechanism of negotiating cql protocol extensions
   to be used in a client connection between a scylla server
   and a client driver.

These extensions are propagated in SUPPORTED message sent from the
server side with "SCYLLA_" prefix and received back as a response
from the client driver in order to determine intersection between
the cql extensions that are both supported by the server and
acknowledged by a client driver.

This intersection of features is later determined to be a working
set of cql protocol extensions in use for the current `client_state`,
which is associated with a particular client connection.

This way we can easily settle on the used extensions set on
both sides of the connection.

Currently there is only one value: `LWT_ADD_METADATA_MARK`, which
regulates whether to set a designated bit in prepared statement
metadata indicating if the statement at hand is an lwt statement
or not (actual implementation for the feature will be in a later
patch).

Each extension can also propagate some custom parameters to the
corresponding key. The CQL protocol specification allows sending
a list of values with each key in the SUPPORTED message; we use
that to pass parameters to extensions as `PARAM=VALUE` strings.

In case of `LWT_ADD_METADATA_MARK` it's
`SCYLLA_LWT_OPTIMIZATION_META_BIT_MASK` which designates the
bitmask for LWT flag in prepared statement metadata in order to be
used for lookup in a client library. The associated bits of code in
`cql3::prepared_metadata` are adjusted to accommodate the feature.

The value for the flag is chosen on purpose to be the last bit
in the flags bitset since we don't want to possibly clash with
C* implementation in case they add more possible flag values to
prepared metadata (though there is an issue regarding that:
https://issues.apache.org/jira/browse/CASSANDRA-15746).

If it's fixed in upstream Cassandra, then we could synchronize
the value for the flag with them.

Also extend the underlying type of `flag` enum in
`cql3::prepared_metadata` to be `uint32_t` instead of `uint8_t`
because in either case flags mask is serialized as 32-bit integer.

In theory, shard-awareness extension support also should be
reworked in terms of provided minimal infrastructure, but for the
sake of simplicity, this is left to be done in a follow-up some
time later.

This solution eliminates the need to assume that all the client
drivers follow the CQL spec carefully because scylla-specific
features and protocol extensions could be enabled only in case both
server and client driver negotiate the supported feature set.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2020-06-16 11:35:52 +03:00
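The negotiation described in this commit can be sketched from the client's point of view. This is an illustration under stated assumptions: the exact wire layout of the keys and values is assumed, the sample option values are made up, and `negotiate` and `parse_params` are hypothetical helpers. Only the `SCYLLA_` prefix, the `PARAM=VALUE` convention, and the two identifier names come from the message above.

```python
def parse_params(values):
    """Turn ['A=1', 'B=2'] into {'A': '1', 'B': '2'}."""
    params = {}
    for item in values:
        name, _, value = item.partition("=")
        params[name] = value
    return params


def negotiate(server_supported, client_known):
    """Return {extension: params} for extensions both sides understand."""
    agreed = {}
    for key, values in server_supported.items():
        if key.startswith("SCYLLA_") and key in client_known:
            agreed[key] = parse_params(values)
    return agreed


# Options as a server might advertise them in SUPPORTED (sample values).
supported = {
    "CQL_VERSION": ["3.3.1"],
    "SCYLLA_LWT_ADD_METADATA_MARK": [
        "SCYLLA_LWT_OPTIMIZATION_META_BIT_MASK=2147483648",
    ],
    "SCYLLA_SOME_FUTURE_EXTENSION": [],   # unknown to this client, so ignored
}
client_known = {"SCYLLA_LWT_ADD_METADATA_MARK"}

agreed = negotiate(supported, client_known)
lwt_mask = int(
    agreed["SCYLLA_LWT_ADD_METADATA_MARK"]["SCYLLA_LWT_OPTIMIZATION_META_BIT_MASK"]
)
```

The client would echo the keys in `agreed` back to the server, and both sides then use only that intersection for the connection. An extension that one side does not recognize is simply left out, so neither side has to assume anything about the other.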
5225 changed files with 100309 additions and 44025 deletions

87
.github/CODEOWNERS vendored Normal file

@@ -0,0 +1,87 @@
# AUTH
auth/* @elcallio @vladzcloudius
# CACHE
row_cache* @tgrabiec @haaawk
*mutation* @tgrabiec @haaawk
tests/mvcc* @tgrabiec @haaawk
# CDC
cdc/* @haaawk @kbr- @elcallio @piodul @jul-stas
test/cql/cdc_* @haaawk @kbr- @elcallio @piodul @jul-stas
test/boost/cdc_* @haaawk @kbr- @elcallio @piodul @jul-stas
# COMMITLOG / BATCHLOG
db/commitlog/* @elcallio
db/batch* @elcallio
# COORDINATOR
service/storage_proxy* @gleb-cloudius
# COMPACTION
sstables/compaction* @raphaelsc @nyh
# CQL TRANSPORT LAYER
transport/* @penberg
# CQL QUERY LANGUAGE
cql3/* @tgrabiec @penberg @psarna
# COUNTERS
counters* @haaawk @jul-stas
tests/counter_test* @haaawk @jul-stas
# GOSSIP
gms/* @tgrabiec @asias
# DOCKER
dist/docker/* @penberg
# LSA
utils/logalloc* @tgrabiec
# MATERIALIZED VIEWS
db/view/* @nyh @psarna
cql3/statements/*view* @nyh @psarna
test/boost/view_* @nyh @psarna
# PACKAGING
dist/* @syuu1228
# REPAIR
repair/* @tgrabiec @asias @nyh
# SCHEMA MANAGEMENT
db/schema_tables* @tgrabiec @nyh
db/legacy_schema_migrator* @tgrabiec @nyh
service/migration* @tgrabiec @nyh
schema* @tgrabiec @nyh
# SECONDARY INDEXES
db/index/* @nyh @penberg @psarna
cql3/statements/*index* @nyh @penberg @psarna
test/boost/*index* @nyh @penberg @psarna
# SSTABLES
sstables/* @tgrabiec @raphaelsc @nyh
# STREAMING
streaming/* @tgrabiec @asias
service/storage_service.* @tgrabiec @asias
# ALTERNATOR
alternator/* @nyh @psarna
test/alternator/* @nyh @psarna
# HINTED HANDOFF
db/hints/* @haaawk @piodul @vladzcloudius
# REDIS
redis/* @nyh @syuu1228
redis-test/* @nyh @syuu1228
# READERS
reader_* @denesb
querier* @denesb
test/boost/mutation_reader_test.cc @denesb
test/boost/querier_cache_test.cc @denesb

33
.github/workflows/pages.yml vendored Normal file

@@ -0,0 +1,33 @@
name: "CI Docs"
on:
  push:
    branches:
      - master
    paths:
      - 'docs/**'
jobs:
  release:
    name: Build
    runs-on: ubuntu-latest
    env:
      LATEST_VERSION: master
    steps:
      - name: Checkout
        uses: actions/checkout@v2
        with:
          persist-credentials: false
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v1
        with:
          python-version: 3.7
      - name: Build docs
        run: |
          export PATH=$PATH:~/.local/bin
          cd docs
          make multiversion
      - name: Deploy
        run : ./docs/_utils/deploy.sh
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

5
.gitignore vendored

@@ -22,5 +22,8 @@ resources
.pytest_cache
/expressions.tokens
tags
testlog/*
testlog
test/*/*.reject
.vscode
docs/_build
docs/poetry.lock

9
.gitmodules vendored

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui
@@ -13,8 +13,11 @@
path = abseil
url = ../abseil-cpp
[submodule "scylla-jmx"]
path = scylla-jmx
path = tools/jmx
url = ../scylla-jmx
[submodule "scylla-tools"]
path = scylla-tools
path = tools/java
url = ../scylla-tools-java
[submodule "scylla-python3"]
path = tools/python3
url = ../scylla-python3


@@ -1,8 +1,5 @@
##
## For best results, first compile the project using the Ninja build-system.
##
cmake_minimum_required(VERSION 3.18)
cmake_minimum_required(VERSION 3.7)
project(scylla)
if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
@@ -20,136 +17,740 @@ else()
set(BUILD_TYPE "release")
endif()
if (NOT DEFINED FOR_IDE AND NOT DEFINED ENV{FOR_IDE} AND NOT DEFINED ENV{CLION_IDE})
message(FATAL_ERROR "This CMakeLists.txt file is only valid for use in IDEs, please define FOR_IDE to acknowledge this.")
endif()
# These paths are always available, since they're included in the repository. Additional DPDK headers are placed while
# Seastar is built, and are captured in `SEASTAR_INCLUDE_DIRS` through parsing the Seastar pkg-config file (below).
set(SEASTAR_DPDK_INCLUDE_DIRS
seastar/dpdk/lib/librte_eal/common/include
seastar/dpdk/lib/librte_eal/common/include/generic
seastar/dpdk/lib/librte_eal/common/include/x86
seastar/dpdk/lib/librte_ether)
find_package(PkgConfig REQUIRED)
set(ENV{PKG_CONFIG_PATH} "${CMAKE_SOURCE_DIR}/build/${BUILD_TYPE}/seastar:$ENV{PKG_CONFIG_PATH}")
pkg_check_modules(SEASTAR seastar)
if(NOT SEASTAR_INCLUDE_DIRS)
# Default value. A more accurate list is populated through `pkg-config` below if `seastar.pc` is available.
set(SEASTAR_INCLUDE_DIRS "seastar/include")
endif()
find_package(Boost COMPONENTS filesystem program_options system thread)
##
## Populate the names of all source and header files in the indicated paths in a designated variable.
##
## When RECURSIVE is specified, directories are traversed recursively.
##
## Use: scan_scylla_source_directories(VAR my_result_var [RECURSIVE] PATHS [path1 path2 ...])
##
function (scan_scylla_source_directories)
set(options RECURSIVE)
set(oneValueArgs VAR)
set(multiValueArgs PATHS)
cmake_parse_arguments(args "${options}" "${oneValueArgs}" "${multiValueArgs}" "${ARGN}")
set(globs "")
foreach (dir ${args_PATHS})
list(APPEND globs "${dir}/*.cc" "${dir}/*.hh")
endforeach()
if (args_RECURSIVE)
set(glob_kind GLOB_RECURSE)
function(default_target_arch arch)
set(x86_instruction_sets i386 i686 x86_64)
if(CMAKE_SYSTEM_PROCESSOR IN_LIST x86_instruction_sets)
set(${arch} "westmere" PARENT_SCOPE)
elseif(CMAKE_SYSTEM_PROCESSOR STREQUAL "aarch64")
set(${arch} "armv8-a+crc+crypto" PARENT_SCOPE)
else()
set(glob_kind GLOB)
set(${arch} "" PARENT_SCOPE)
endif()
endfunction()
default_target_arch(target_arch)
if(target_arch)
set(target_arch_flag "-march=${target_arch}")
endif()
file(${glob_kind} var
${globs})
# Configure Seastar compile options to align with Scylla
set(Seastar_CXX_FLAGS -fcoroutines ${target_arch_flag} CACHE INTERNAL "" FORCE)
set(Seastar_CXX_DIALECT gnu++20 CACHE INTERNAL "" FORCE)
set(${args_VAR} ${var} PARENT_SCOPE)
add_subdirectory(seastar)
add_subdirectory(abseil)
# Exclude absl::strerror from the default "all" target since it's not
# used in Scylla build and, moreover, makes use of deprecated glibc APIs,
# such as sys_nerr, which are not exposed from "stdio.h" since glibc 2.32,
# which happens to be the case for recent Fedora distribution versions.
#
# Need to use the internal "absl_strerror" target name instead of namespaced
# variant because `set_target_properties` does not understand the latter form,
# unfortunately.
set_target_properties(absl_strerror PROPERTIES EXCLUDE_FROM_ALL TRUE)
# System libraries dependencies
find_package(Boost COMPONENTS filesystem program_options system thread regex REQUIRED)
find_package(Lua REQUIRED)
find_package(ZLIB REQUIRED)
find_package(ICU COMPONENTS uc REQUIRED)
set(scylla_build_dir "${CMAKE_BINARY_DIR}/build/${BUILD_TYPE}")
set(scylla_gen_build_dir "${scylla_build_dir}/gen")
file(MAKE_DIRECTORY "${scylla_build_dir}" "${scylla_gen_build_dir}")
# Place libraries, executables and archives in ${buildroot}/build/${mode}/
foreach(mode RUNTIME LIBRARY ARCHIVE)
set(CMAKE_${mode}_OUTPUT_DIRECTORY "${scylla_build_dir}")
endforeach()
# Generate C++ source files from thrift definitions
function(scylla_generate_thrift)
set(one_value_args TARGET VAR IN_FILE OUT_DIR SERVICE)
cmake_parse_arguments(args "" "${one_value_args}" "" ${ARGN})
get_filename_component(in_file_name ${args_IN_FILE} NAME_WE)
set(aux_out_file_name ${args_OUT_DIR}/${in_file_name})
set(outputs
${aux_out_file_name}_types.cpp
${aux_out_file_name}_types.h
${aux_out_file_name}_constants.cpp
${aux_out_file_name}_constants.h
${args_OUT_DIR}/${args_SERVICE}.cpp
${args_OUT_DIR}/${args_SERVICE}.h)
add_custom_command(
DEPENDS
${args_IN_FILE}
thrift
OUTPUT ${outputs}
COMMAND ${CMAKE_COMMAND} -E make_directory ${args_OUT_DIR}
COMMAND thrift -gen cpp:cob_style,no_skeleton -out "${args_OUT_DIR}" "${args_IN_FILE}")
add_custom_target(${args_TARGET}
DEPENDS ${outputs})
set(${args_VAR} ${outputs} PARENT_SCOPE)
endfunction()
## Although Seastar is an external project, it is common enough to explore the sources while doing
## Scylla development that we'll treat the Seastar sources as part of this project for easier navigation.
scan_scylla_source_directories(
VAR SEASTAR_SOURCE_FILES
RECURSIVE
scylla_generate_thrift(
TARGET scylla_thrift_gen_cassandra
VAR scylla_thrift_gen_cassandra_files
IN_FILE interface/cassandra.thrift
OUT_DIR ${scylla_gen_build_dir}
SERVICE Cassandra)
PATHS
seastar/core
seastar/http
seastar/json
seastar/net
seastar/rpc
seastar/testing
seastar/util)
# Parse antlr3 grammar files and generate C++ sources
function(scylla_generate_antlr3)
set(one_value_args TARGET VAR IN_FILE OUT_DIR)
cmake_parse_arguments(args "" "${one_value_args}" "" ${ARGN})
scan_scylla_source_directories(
VAR SCYLLA_ROOT_SOURCE_FILES
PATHS .)
get_filename_component(in_file_pure_name ${args_IN_FILE} NAME)
get_filename_component(stem ${in_file_pure_name} NAME_WE)
scan_scylla_source_directories(
VAR SCYLLA_SUB_SOURCE_FILES
RECURSIVE
set(outputs
"${args_OUT_DIR}/${stem}Lexer.hpp"
"${args_OUT_DIR}/${stem}Lexer.cpp"
"${args_OUT_DIR}/${stem}Parser.hpp"
"${args_OUT_DIR}/${stem}Parser.cpp")
PATHS
api
auth
cql3
db
dht
exceptions
gms
index
io
locator
message
repair
service
sstables
streaming
test
thrift
tracing
transport
utils)
add_custom_command(
DEPENDS
${args_IN_FILE}
OUTPUT ${outputs}
# Remove #ifdef'ed code from the grammar source code
COMMAND sed -e "/^#if 0/,/^#endif/d" "${args_IN_FILE}" > "${args_OUT_DIR}/${in_file_pure_name}"
COMMAND antlr3 "${args_OUT_DIR}/${in_file_pure_name}"
# We replace many local `ExceptionBaseType* ex` variables with a single function-scope one.
# Because we add such a variable to every function, and because `ExceptionBaseType` is not a global
# name, we also add a global typedef to avoid compilation errors.
COMMAND sed -i -e "/^.*On :.*$/d" "${args_OUT_DIR}/${stem}Lexer.hpp"
COMMAND sed -i -e "/^.*On :.*$/d" "${args_OUT_DIR}/${stem}Lexer.cpp"
COMMAND sed -i -e "/^.*On :.*$/d" "${args_OUT_DIR}/${stem}Parser.hpp"
COMMAND sed -i
-e "s/^\\( *\\)\\(ImplTraits::CommonTokenType\\* [a-zA-Z0-9_]* = NULL;\\)$/\\1const \\2/"
-e "/^.*On :.*$/d"
-e "1i using ExceptionBaseType = int;"
-e "s/^{/{ ExceptionBaseType\\* ex = nullptr;/; s/ExceptionBaseType\\* ex = new/ex = new/; s/exceptions::syntax_exception e/exceptions::syntax_exception\\& e/"
"${args_OUT_DIR}/${stem}Parser.cpp"
VERBATIM)
scan_scylla_source_directories(
VAR SCYLLA_GEN_SOURCE_FILES
RECURSIVE
PATHS build/${BUILD_TYPE}/gen)
add_custom_target(${args_TARGET}
DEPENDS ${outputs})
set(SCYLLA_SOURCE_FILES
${SCYLLA_ROOT_SOURCE_FILES}
${SCYLLA_GEN_SOURCE_FILES}
${SCYLLA_SUB_SOURCE_FILES})
set(${args_VAR} ${outputs} PARENT_SCOPE)
endfunction()
set(antlr3_grammar_files
cql3/Cql.g
alternator/expressions.g)
set(antlr3_gen_files)
foreach(f ${antlr3_grammar_files})
get_filename_component(grammar_file_name "${f}" NAME_WE)
get_filename_component(f_dir "${f}" DIRECTORY)
scylla_generate_antlr3(
TARGET scylla_antlr3_gen_${grammar_file_name}
VAR scylla_antlr3_gen_${grammar_file_name}_files
IN_FILE ${f}
OUT_DIR ${scylla_gen_build_dir}/${f_dir})
list(APPEND antlr3_gen_files "${scylla_antlr3_gen_${grammar_file_name}_files}")
endforeach()
# Generate C++ sources from ragel grammar files
seastar_generate_ragel(
TARGET scylla_ragel_gen_protocol_parser
VAR scylla_ragel_gen_protocol_parser_file
IN_FILE redis/protocol_parser.rl
OUT_FILE ${scylla_gen_build_dir}/redis/protocol_parser.hh)
# Generate C++ sources from Swagger definitions
set(swagger_files
api/api-doc/cache_service.json
api/api-doc/collectd.json
api/api-doc/column_family.json
api/api-doc/commitlog.json
api/api-doc/compaction_manager.json
api/api-doc/config.json
api/api-doc/endpoint_snitch_info.json
api/api-doc/error_injection.json
api/api-doc/failure_detector.json
api/api-doc/gossiper.json
api/api-doc/hinted_handoff.json
api/api-doc/lsa.json
api/api-doc/messaging_service.json
api/api-doc/storage_proxy.json
api/api-doc/storage_service.json
api/api-doc/stream_manager.json
api/api-doc/system.json
api/api-doc/utils.json)
set(swagger_gen_files)
foreach(f ${swagger_files})
get_filename_component(fname "${f}" NAME_WE)
get_filename_component(dir "${f}" DIRECTORY)
seastar_generate_swagger(
TARGET scylla_swagger_gen_${fname}
VAR scylla_swagger_gen_${fname}_files
IN_FILE "${f}"
OUT_DIR "${scylla_gen_build_dir}/${dir}")
list(APPEND swagger_gen_files "${scylla_swagger_gen_${fname}_files}")
endforeach()
# Create C++ bindings for IDL serializers
function(scylla_generate_idl_serializer)
set(one_value_args TARGET VAR IN_FILE OUT_FILE)
cmake_parse_arguments(args "" "${one_value_args}" "" ${ARGN})
get_filename_component(out_dir ${args_OUT_FILE} DIRECTORY)
set(idl_compiler "${CMAKE_SOURCE_DIR}/idl-compiler.py")
find_package(Python3 COMPONENTS Interpreter)
add_custom_command(
DEPENDS
${args_IN_FILE}
${idl_compiler}
OUTPUT ${args_OUT_FILE}
COMMAND ${CMAKE_COMMAND} -E make_directory ${out_dir}
COMMAND Python3::Interpreter ${idl_compiler} --ns ser -f ${args_IN_FILE} -o ${args_OUT_FILE})
add_custom_target(${args_TARGET}
DEPENDS ${args_OUT_FILE})
set(${args_VAR} ${args_OUT_FILE} PARENT_SCOPE)
endfunction()
set(idl_serializers
idl/cache_temperature.idl.hh
idl/commitlog.idl.hh
idl/consistency_level.idl.hh
idl/frozen_mutation.idl.hh
idl/frozen_schema.idl.hh
idl/gossip_digest.idl.hh
idl/idl_test.idl.hh
idl/keys.idl.hh
idl/messaging_service.idl.hh
idl/mutation.idl.hh
idl/paging_state.idl.hh
idl/partition_checksum.idl.hh
idl/paxos.idl.hh
idl/query.idl.hh
idl/range.idl.hh
idl/read_command.idl.hh
idl/reconcilable_result.idl.hh
idl/replay_position.idl.hh
idl/result.idl.hh
idl/ring_position.idl.hh
idl/streaming.idl.hh
idl/token.idl.hh
idl/tracing.idl.hh
idl/truncation_record.idl.hh
idl/uuid.idl.hh
idl/view.idl.hh)
set(idl_gen_files)
foreach(f ${idl_serializers})
get_filename_component(idl_name "${f}" NAME)
get_filename_component(idl_target "${idl_name}" NAME_WE)
get_filename_component(idl_dir "${f}" DIRECTORY)
string(REPLACE ".idl.hh" ".dist.hh" idl_out_hdr_name "${idl_name}")
scylla_generate_idl_serializer(
TARGET scylla_idl_gen_${idl_target}
VAR scylla_idl_gen_${idl_target}_files
IN_FILE ${f}
OUT_FILE ${scylla_gen_build_dir}/${idl_dir}/${idl_out_hdr_name})
list(APPEND idl_gen_files "${scylla_idl_gen_${idl_target}_files}")
endforeach()
set(scylla_sources
absl-flat_hash_map.cc
alternator/auth.cc
alternator/base64.cc
alternator/conditions.cc
alternator/executor.cc
alternator/expressions.cc
alternator/serialization.cc
alternator/server.cc
alternator/stats.cc
alternator/streams.cc
api/api.cc
api/cache_service.cc
api/collectd.cc
api/column_family.cc
api/commitlog.cc
api/compaction_manager.cc
api/config.cc
api/endpoint_snitch.cc
api/error_injection.cc
api/failure_detector.cc
api/gossiper.cc
api/hinted_handoff.cc
api/lsa.cc
api/messaging_service.cc
api/storage_proxy.cc
api/storage_service.cc
api/stream_manager.cc
api/system.cc
atomic_cell.cc
auth/allow_all_authenticator.cc
auth/allow_all_authorizer.cc
auth/authenticated_user.cc
auth/authentication_options.cc
auth/authenticator.cc
auth/common.cc
auth/default_authorizer.cc
auth/password_authenticator.cc
auth/passwords.cc
auth/permission.cc
auth/permissions_cache.cc
auth/resource.cc
auth/role_or_anonymous.cc
auth/roles-metadata.cc
auth/sasl_challenge.cc
auth/service.cc
auth/standard_role_manager.cc
auth/transitional.cc
bytes.cc
canonical_mutation.cc
cdc/cdc_partitioner.cc
cdc/generation.cc
cdc/log.cc
cdc/metadata.cc
cdc/split.cc
clocks-impl.cc
collection_mutation.cc
compress.cc
connection_notifier.cc
converting_mutation_partition_applier.cc
counters.cc
cql3/abstract_marker.cc
cql3/attributes.cc
cql3/cf_name.cc
cql3/column_condition.cc
cql3/column_identifier.cc
cql3/column_specification.cc
cql3/constants.cc
cql3/cql3_type.cc
cql3/expr/expression.cc
cql3/functions/aggregate_fcts.cc
cql3/functions/castas_fcts.cc
cql3/functions/error_injection_fcts.cc
cql3/functions/functions.cc
cql3/functions/user_function.cc
cql3/index_name.cc
cql3/keyspace_element_name.cc
cql3/lists.cc
cql3/maps.cc
cql3/operation.cc
cql3/query_options.cc
cql3/query_processor.cc
cql3/relation.cc
cql3/restrictions/statement_restrictions.cc
cql3/result_set.cc
cql3/role_name.cc
cql3/selection/abstract_function_selector.cc
cql3/selection/selectable.cc
cql3/selection/selection.cc
cql3/selection/selector.cc
cql3/selection/selector_factories.cc
cql3/selection/simple_selector.cc
cql3/sets.cc
cql3/single_column_relation.cc
cql3/statements/alter_keyspace_statement.cc
cql3/statements/alter_table_statement.cc
cql3/statements/alter_type_statement.cc
cql3/statements/alter_view_statement.cc
cql3/statements/authentication_statement.cc
cql3/statements/authorization_statement.cc
cql3/statements/batch_statement.cc
cql3/statements/cas_request.cc
cql3/statements/cf_prop_defs.cc
cql3/statements/cf_statement.cc
cql3/statements/create_function_statement.cc
cql3/statements/create_index_statement.cc
cql3/statements/create_keyspace_statement.cc
cql3/statements/create_table_statement.cc
cql3/statements/create_type_statement.cc
cql3/statements/create_view_statement.cc
cql3/statements/delete_statement.cc
cql3/statements/drop_function_statement.cc
cql3/statements/drop_index_statement.cc
cql3/statements/drop_keyspace_statement.cc
cql3/statements/drop_table_statement.cc
cql3/statements/drop_type_statement.cc
cql3/statements/drop_view_statement.cc
cql3/statements/function_statement.cc
cql3/statements/grant_statement.cc
cql3/statements/index_prop_defs.cc
cql3/statements/index_target.cc
cql3/statements/ks_prop_defs.cc
cql3/statements/list_permissions_statement.cc
cql3/statements/list_users_statement.cc
cql3/statements/modification_statement.cc
cql3/statements/permission_altering_statement.cc
cql3/statements/property_definitions.cc
cql3/statements/raw/parsed_statement.cc
cql3/statements/revoke_statement.cc
cql3/statements/role-management-statements.cc
cql3/statements/schema_altering_statement.cc
cql3/statements/select_statement.cc
cql3/statements/truncate_statement.cc
cql3/statements/update_statement.cc
cql3/statements/use_statement.cc
cql3/token_relation.cc
cql3/tuples.cc
cql3/type_json.cc
cql3/untyped_result_set.cc
cql3/update_parameters.cc
cql3/user_types.cc
cql3/ut_name.cc
cql3/util.cc
cql3/values.cc
cql3/variable_specifications.cc
data/cell.cc
database.cc
db/batchlog_manager.cc
db/commitlog/commitlog.cc
db/commitlog/commitlog_entry.cc
db/commitlog/commitlog_replayer.cc
db/config.cc
db/consistency_level.cc
db/cql_type_parser.cc
db/data_listeners.cc
db/extensions.cc
db/heat_load_balance.cc
db/hints/manager.cc
db/hints/resource_manager.cc
db/large_data_handler.cc
db/legacy_schema_migrator.cc
db/marshal/type_parser.cc
db/schema_tables.cc
db/size_estimates_virtual_reader.cc
db/snapshot-ctl.cc
db/sstables-format-selector.cc
db/system_distributed_keyspace.cc
db/system_keyspace.cc
db/view/row_locking.cc
db/view/view.cc
db/view/view_update_generator.cc
dht/boot_strapper.cc
dht/i_partitioner.cc
dht/murmur3_partitioner.cc
dht/range_streamer.cc
dht/token.cc
distributed_loader.cc
duration.cc
exceptions/exceptions.cc
flat_mutation_reader.cc
frozen_mutation.cc
frozen_schema.cc
gms/application_state.cc
gms/endpoint_state.cc
gms/failure_detector.cc
gms/feature_service.cc
gms/gossip_digest_ack.cc
gms/gossip_digest_ack2.cc
gms/gossip_digest_syn.cc
gms/gossiper.cc
gms/inet_address.cc
gms/version_generator.cc
gms/versioned_value.cc
hashers.cc
index/secondary_index.cc
index/secondary_index_manager.cc
init.cc
keys.cc
lister.cc
locator/abstract_replication_strategy.cc
locator/ec2_multi_region_snitch.cc
locator/ec2_snitch.cc
locator/everywhere_replication_strategy.cc
locator/gce_snitch.cc
locator/gossiping_property_file_snitch.cc
locator/local_strategy.cc
locator/network_topology_strategy.cc
locator/production_snitch_base.cc
locator/rack_inferring_snitch.cc
locator/simple_snitch.cc
locator/simple_strategy.cc
locator/snitch_base.cc
locator/token_metadata.cc
lua.cc
main.cc
memtable.cc
message/messaging_service.cc
multishard_mutation_query.cc
mutation.cc
raft/fsm.cc
raft/log.cc
raft/progress.cc
raft/raft.cc
raft/server.cc
mutation_fragment.cc
mutation_partition.cc
mutation_partition_serializer.cc
mutation_partition_view.cc
mutation_query.cc
mutation_reader.cc
mutation_writer/multishard_writer.cc
mutation_writer/shard_based_splitting_writer.cc
mutation_writer/timestamp_based_splitting_writer.cc
mutation_writer/feed_writers.cc
partition_slice_builder.cc
partition_version.cc
querier.cc
query-result-set.cc
query.cc
range_tombstone.cc
range_tombstone_list.cc
reader_concurrency_semaphore.cc
redis/abstract_command.cc
redis/command_factory.cc
redis/commands.cc
redis/keyspace_utils.cc
redis/lolwut.cc
redis/mutation_utils.cc
redis/options.cc
redis/query_processor.cc
redis/query_utils.cc
redis/server.cc
redis/service.cc
redis/stats.cc
repair/repair.cc
repair/row_level.cc
row_cache.cc
schema.cc
schema_mutations.cc
schema_registry.cc
service/client_state.cc
service/migration_manager.cc
service/migration_task.cc
service/misc_services.cc
service/pager/paging_state.cc
service/pager/query_pagers.cc
service/paxos/paxos_state.cc
service/paxos/prepare_response.cc
service/paxos/prepare_summary.cc
service/paxos/proposal.cc
service/priority_manager.cc
service/storage_proxy.cc
service/storage_service.cc
sstables/compaction.cc
sstables/compaction_manager.cc
sstables/compaction_strategy.cc
sstables/compress.cc
sstables/integrity_checked_file_impl.cc
sstables/kl/writer.cc
sstables/leveled_compaction_strategy.cc
sstables/m_format_read_helpers.cc
sstables/metadata_collector.cc
sstables/mp_row_consumer.cc
sstables/mx/writer.cc
sstables/partition.cc
sstables/prepended_input_stream.cc
sstables/random_access_reader.cc
sstables/size_tiered_compaction_strategy.cc
sstables/sstable_directory.cc
sstables/sstable_version.cc
sstables/sstables.cc
sstables/sstables_manager.cc
sstables/time_window_compaction_strategy.cc
sstables/writer.cc
streaming/progress_info.cc
streaming/session_info.cc
streaming/stream_coordinator.cc
streaming/stream_manager.cc
streaming/stream_plan.cc
streaming/stream_reason.cc
streaming/stream_receive_task.cc
streaming/stream_request.cc
streaming/stream_result_future.cc
streaming/stream_session.cc
streaming/stream_session_state.cc
streaming/stream_summary.cc
streaming/stream_task.cc
streaming/stream_transfer_task.cc
table.cc
table_helper.cc
thrift/controller.cc
thrift/handler.cc
thrift/server.cc
thrift/thrift_validation.cc
timeout_config.cc
tracing/trace_keyspace_helper.cc
tracing/trace_state.cc
tracing/traced_file.cc
tracing/tracing.cc
tracing/tracing_backend_registry.cc
transport/controller.cc
transport/cql_protocol_extension.cc
transport/event.cc
transport/event_notifier.cc
transport/messages/result_message.cc
transport/server.cc
types.cc
unimplemented.cc
utils/UUID_gen.cc
utils/arch/powerpc/crc32-vpmsum/crc32_wrapper.cc
utils/array-search.cc
utils/ascii.cc
utils/big_decimal.cc
utils/bloom_calculations.cc
utils/bloom_filter.cc
utils/buffer_input_stream.cc
utils/build_id.cc
utils/config_file.cc
utils/directories.cc
utils/disk-error-handler.cc
utils/dynamic_bitset.cc
utils/error_injection.cc
utils/exceptions.cc
utils/file_lock.cc
utils/generation-number.cc
utils/gz/crc_combine.cc
utils/human_readable.cc
utils/i_filter.cc
utils/large_bitset.cc
utils/like_matcher.cc
utils/limiting_data_source.cc
utils/logalloc.cc
utils/managed_bytes.cc
utils/multiprecision_int.cc
utils/murmur_hash.cc
utils/rate_limiter.cc
utils/rjson.cc
utils/runtime.cc
utils/updateable_value.cc
utils/utf8.cc
utils/uuid.cc
validation.cc
vint-serialization.cc
zstd.cc
release.cc)
set(scylla_gen_sources
"${scylla_thrift_gen_cassandra_files}"
"${scylla_ragel_gen_protocol_parser_file}"
"${swagger_gen_files}"
"${idl_gen_files}"
"${antlr3_gen_files}")
add_executable(scylla
${SEASTAR_SOURCE_FILES}
${SCYLLA_SOURCE_FILES})
${scylla_sources}
${scylla_gen_sources})
# If the Seastar pkg-config information is available, append to the default flags.
#
# For ease of browsing the source code, we always pretend that DPDK is enabled.
target_compile_options(scylla PUBLIC
-std=gnu++20
-DHAVE_DPDK
-DHAVE_HWLOC
"${SEASTAR_CFLAGS}")
target_link_libraries(scylla PRIVATE
seastar
# Boost dependencies
Boost::filesystem
Boost::program_options
Boost::system
Boost::thread
Boost::regex
Boost::headers
# Abseil libs
absl::hashtablez_sampler
absl::raw_hash_set
absl::synchronization
absl::graphcycles_internal
absl::stacktrace
absl::symbolize
absl::debugging_internal
absl::demangle_internal
absl::time
absl::time_zone
absl::int128
absl::city
absl::hash
absl::malloc_internal
absl::spinlock_wait
absl::base
absl::dynamic_annotations
absl::raw_logging_internal
absl::exponential_biased
absl::throw_delegate
# System libs
ZLIB::ZLIB
ICU::uc
systemd
zstd
snappy
${LUA_LIBRARIES}
thrift
crypt)
# The order matters here: prefer the "static" DPDK directories to any dynamic paths from pkg-config. Some files are only
# available dynamically, though.
target_include_directories(scylla PUBLIC
.
${SEASTAR_DPDK_INCLUDE_DIRS}
${SEASTAR_INCLUDE_DIRS}
${Boost_INCLUDE_DIRS}
xxhash
libdeflate
build/${BUILD_TYPE}/gen)
target_link_libraries(scylla PRIVATE
-Wl,--build-id=sha1 # Force SHA1 build-id generation
# TODO: Use lld linker if it's available, otherwise gold, else bfd
-fuse-ld=lld)
# TODO: patch dynamic linker to match configure.py behavior
target_compile_options(scylla PRIVATE
-std=gnu++20
-fcoroutines # TODO: Clang does not have this flag, adjust to both variants
${target_arch_flag})
# Hacks needed to expose internal APIs for xxhash dependencies
target_compile_definitions(scylla PRIVATE XXH_PRIVATE_API HAVE_LZ4_COMPRESS_DEFAULT)
target_include_directories(scylla PRIVATE
"${CMAKE_CURRENT_SOURCE_DIR}"
libdeflate
abseil
"${scylla_gen_build_dir}")
###
### Create crc_combine_table helper executable.
### Use it to generate crc_combine_table.cc to be used in scylla at build time.
###
add_executable(crc_combine_table utils/gz/gen_crc_combine_table.cc)
target_link_libraries(crc_combine_table PRIVATE seastar)
target_include_directories(crc_combine_table PRIVATE "${CMAKE_CURRENT_SOURCE_DIR}")
target_compile_options(crc_combine_table PRIVATE
-std=gnu++20
-fcoroutines
${target_arch_flag})
add_dependencies(scylla crc_combine_table)
# Generate an additional source file at build time that is needed for Scylla compilation
add_custom_command(OUTPUT "${scylla_gen_build_dir}/utils/gz/crc_combine_table.cc"
COMMAND $<TARGET_FILE:crc_combine_table> > "${scylla_gen_build_dir}/utils/gz/crc_combine_table.cc"
DEPENDS crc_combine_table)
target_sources(scylla PRIVATE "${scylla_gen_build_dir}/utils/gz/crc_combine_table.cc")
###
### Generate version file and supply appropriate compile definitions for release.cc
###
execute_process(COMMAND ${CMAKE_SOURCE_DIR}/SCYLLA-VERSION-GEN RESULT_VARIABLE scylla_version_gen_res)
if(scylla_version_gen_res)
message(SEND_ERROR "Version file generation failed. Return code: ${scylla_version_gen_res}")
endif()
file(READ build/SCYLLA-VERSION-FILE scylla_version)
string(STRIP "${scylla_version}" scylla_version)
file(READ build/SCYLLA-RELEASE-FILE scylla_release)
string(STRIP "${scylla_release}" scylla_release)
get_property(release_cdefs SOURCE "${CMAKE_SOURCE_DIR}/release.cc" PROPERTY COMPILE_DEFINITIONS)
list(APPEND release_cdefs "SCYLLA_VERSION=\"${scylla_version}\"" "SCYLLA_RELEASE=\"${scylla_release}\"")
set_source_files_properties("${CMAKE_SOURCE_DIR}/release.cc" PROPERTIES COMPILE_DEFINITIONS "${release_cdefs}")
###
### Custom command for building libdeflate. Link the library to scylla.
###
set(libdeflate_lib "${scylla_build_dir}/libdeflate/libdeflate.a")
add_custom_command(OUTPUT "${libdeflate_lib}"
COMMAND make -C libdeflate
BUILD_DIR=../build/${BUILD_TYPE}/libdeflate/
CC=${CMAKE_C_COMPILER}
"CFLAGS=${target_arch_flag}"
../build/${BUILD_TYPE}/libdeflate//libdeflate.a) # The double slash is important!
# Hack to force generating custom command to produce libdeflate.a
add_custom_target(libdeflate DEPENDS "${libdeflate_lib}")
target_link_libraries(scylla PRIVATE "${libdeflate_lib}")
# TODO: create cmake/ directory and move utilities (generate functions etc) there
# TODO: Build tests if BUILD_TESTING=on (using CTest module)
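The version step in the CMake file above reads `build/SCYLLA-VERSION-FILE` and `build/SCYLLA-RELEASE-FILE` and attaches `SCYLLA_VERSION` and `SCYLLA_RELEASE` to `release.cc` as quoted compile definitions, so each macro expands to a string literal. A minimal sketch of how a translation unit might consume such definitions follows. The helper name `full_version` and the fallback values are hypothetical and are not taken from `release.cc`.

```cpp
#include <cassert>
#include <string>

// Fallbacks so the sketch compiles without the build system. In the real build
// these would come from the COMPILE_DEFINITIONS source property set by CMake.
#ifndef SCYLLA_VERSION
#define SCYLLA_VERSION "0.0.0"  // hypothetical placeholder
#endif
#ifndef SCYLLA_RELEASE
#define SCYLLA_RELEASE "dev"    // hypothetical placeholder
#endif

// Hypothetical helper: joins the two string-literal macros as "<version>-<release>".
inline std::string full_version() {
    return std::string(SCYLLA_VERSION) + "-" + SCYLLA_RELEASE;
}
```

Because the definitions are set only on `release.cc`, a change to the version files would presumably force a rebuild of that one source file rather than the whole tree.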


@@ -1,11 +1,18 @@
# Asking questions or requesting help
# Contributing to Scylla
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
## Asking questions or requesting help
# Reporting an issue
Use the [Scylla Users mailing list](https://groups.google.com/g/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
Please use the [Issue Tracker](https://github.com/scylladb/scylla/issues/) to report issues. Fill in as much information as you can in the issue template, especially for performance problems.
Join the [Scylla Developers mailing list](https://groups.google.com/g/scylladb-dev) for deeper technical discussions and to discuss your ideas for contributions.
# Contributing Code to Scylla
## Reporting an issue
To contribute code to Scylla, you need to sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send your changes as [patches](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches) to the [mailing list](https://groups.google.com/forum/#!forum/scylladb-dev). We don't accept pull requests on GitHub.
Please use the [issue tracker](https://github.com/scylladb/scylla/issues/) to report issues or to suggest features. Fill in as much information as you can in the issue template, especially for performance problems.
## Contributing code to Scylla
Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form to cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on GitHub](https://github.com/scylladb/scylla).
If you need help formatting or sending patches, [check out these instructions](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches).
The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md), so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so you should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.
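The namespace convention described above can be illustrated with a small self-contained sketch. The `seastar` namespace below is a stand-in defined only so the example compiles on its own, the function names are hypothetical, and the using-directive is an assumption about what `seastarx.hh` amounts to rather than a copy of it.

```cpp
#include <cassert>
#include <string>

// Stand-in for the real Seastar library, defined here only so the sketch is
// self-contained. make_name() is a hypothetical symbol, not a Seastar API.
namespace seastar {
inline std::string make_name(const char* s) { return std::string(s); }
}

// Assumed effect of including "seastarx.hh": Seastar symbols become visible
// without qualification.
using namespace seastar;

// Scylla-style code: call the Seastar symbol without the seastar:: prefix.
inline std::string greet() {
    return make_name("hello");  // preferred over seastar::make_name("hello")
}
```

Both spellings compile, so the choice is purely stylistic: the unqualified form is the one the contributing guidelines ask for.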

DEDICATION.txt Normal file

@@ -0,0 +1 @@
Dedicated to the memory of Alberto José Araújo, a coworker and a friend.


@@ -1,114 +0,0 @@
M: Maintainer with commit access
R: Reviewer with subsystem expertise
F: Filename, directory, or pattern for the subsystem
---
AUTH
R: Calle Wilund <calle@scylladb.com>
R: Vlad Zolotarov <vladz@scylladb.com>
R: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
F: auth/*
CACHE
M: Tomasz Grabiec <tgrabiec@scylladb.com>
R: Piotr Jastrzebski <piotr@scylladb.com>
F: row_cache*
F: *mutation*
F: tests/mvcc*
COMMITLOG / BATCHLOG
R: Calle Wilund <calle@scylladb.com>
F: db/commitlog/*
F: db/batch*
COORDINATOR
R: Gleb Natapov <gleb@scylladb.com>
F: service/storage_proxy*
COMPACTION
R: Raphael S. Carvalho <raphaelsc@scylladb.com>
R: Glauber Costa <glauber@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
F: sstables/compaction*
CQL TRANSPORT LAYER
M: Pekka Enberg <penberg@scylladb.com>
F: transport/*
CQL QUERY LANGUAGE
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Pekka Enberg <penberg@scylladb.com>
F: cql3/*
COUNTERS
F: counters*
F: tests/counter_test*
GOSSIP
M: Tomasz Grabiec <tgrabiec@scylladb.com>
R: Asias He <asias@scylladb.com>
F: gms/*
DOCKER
M: Pekka Enberg <penberg@scylladb.com>
F: dist/docker/*
LSA
M: Tomasz Grabiec <tgrabiec@scylladb.com>
F: utils/logalloc*
MATERIALIZED VIEWS
M: Pekka Enberg <penberg@scylladb.com>
M: Nadav Har'El <nyh@scylladb.com>
F: db/view/*
F: cql3/statements/*view*
PACKAGING
R: Takuya ASADA <syuu@scylladb.com>
F: dist/*
REPAIR
M: Tomasz Grabiec <tgrabiec@scylladb.com>
R: Asias He <asias@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
F: repair/*
SCHEMA MANAGEMENT
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Pekka Enberg <penberg@scylladb.com>
F: db/schema_tables*
F: db/legacy_schema_migrator*
F: service/migration*
F: schema*
SECONDARY INDEXES
M: Pekka Enberg <penberg@scylladb.com>
M: Nadav Har'El <nyh@scylladb.com>
R: Pekka Enberg <penberg@scylladb.com>
F: db/index/*
F: cql3/statements/*index*
SSTABLES
M: Tomasz Grabiec <tgrabiec@scylladb.com>
R: Raphael S. Carvalho <raphaelsc@scylladb.com>
R: Glauber Costa <glauber@scylladb.com>
R: Nadav Har'El <nyh@scylladb.com>
F: sstables/*
STREAMING
M: Tomasz Grabiec <tgrabiec@scylladb.com>
R: Asias He <asias@scylladb.com>
F: streaming/*
F: service/storage_service.*
ALTERNATOR
M: Nadav Har'El <nyh@scylladb.com>
F: alternator/*
F: alternator-test/*
THE REST
M: Avi Kivity <avi@scylladb.com>
M: Tomasz Grabiec <tgrabiec@scylladb.com>
M: Nadav Har'El <nyh@scylladb.com>
F: *


@@ -5,3 +5,5 @@ It includes files from https://github.com/antonblanchard/crc32-vpmsum (author An
These files are located in utils/arch/powerpc/crc32-vpmsum. Their license may be found in licenses/LICENSE-crc32-vpmsum.TXT.
It includes modified code from https://gitbox.apache.org/repos/asf?p=cassandra-dtest.git (owned by The Apache Software Foundation)
It includes modified tests from https://github.com/etcd-io/etcd.git (owned by The etcd Authors)

README.md

@@ -1,64 +1,84 @@
# Scylla
## Quick-start
[![Slack](https://img.shields.io/badge/slack-scylla-brightgreen.svg?logo=slack)](http://slack.scylladb.com)
[![Twitter](https://img.shields.io/twitter/follow/ScyllaDB.svg?style=social&label=Follow)](https://twitter.com/intent/follow?screen_name=ScyllaDB)
## What is Scylla?
Scylla is the real-time big data database that is API-compatible with Apache Cassandra and Amazon DynamoDB.
Scylla embraces a shared-nothing approach that increases throughput and storage capacity to realize order-of-magnitude performance improvements and reduce hardware costs.
For more information, please see the [ScyllaDB web site].
[ScyllaDB web site]: https://www.scylladb.com
## Build Prerequisites
Scylla is fairly fussy about its build environment, requiring very recent
versions of the C++20 compiler and of many libraries to build. The document
[HACKING.md](HACKING.md) includes detailed information on building and
developing Scylla, but to get Scylla building quickly on (almost) any build
machine, Scylla offers offers a [frozen toolchain](tools/toolchain/README.md),
machine, Scylla offers a [frozen toolchain](tools/toolchain/README.md).
This is a pre-configured Docker image which includes recent versions of all
the required compilers, libraries and build tools. Using the frozen toolchain
allows you to avoid changing anything in your build machine to meet Scylla's
requirements - you just need to meet the frozen toolchain's prerequisites
(mostly, Docker or Podman being available).
Building and running Scylla with the frozen toolchain is as easy as:
## Building Scylla
Building Scylla with the frozen toolchain `dbuild` is as easy as:
```bash
$ ./tools/toolchain/dbuild ./configure.py
$ ./tools/toolchain/dbuild ninja build/release/scylla
$ ./tools/toolchain/dbuild ./build/release/scylla --developer-mode 1
$ git submodule update --init --force --recursive
$ ./tools/toolchain/dbuild ./configure.py
$ ./tools/toolchain/dbuild ninja build/release/scylla
```
For further information, please see:
* [Developer documentation] for more information on building Scylla.
* [Build documentation] on how to build Scylla binaries, tests, and packages.
* [Docker image build documentation] for information on how to build Docker images.
[developer documentation]: HACKING.md
[build documentation]: docs/guides/building.md
[docker image build documentation]: dist/docker/redhat/README.md
## Running Scylla
* Run Scylla
```
./build/release/scylla
To start Scylla server, run:
```bash
$ ./tools/toolchain/dbuild ./build/release/scylla --workdir tmp --smp 1 --developer-mode 1
```
* run Scylla with one CPU and ./tmp as work directory
This will start a Scylla node with one CPU core allocated to it and data files stored in the `tmp` directory.
The `--developer-mode` is needed to disable the various checks Scylla performs at startup to ensure the machine is configured for maximum performance (not relevant on development workstations).
Please note that you need to run Scylla with `dbuild` if you built it with the frozen toolchain.
```
./build/release/scylla --workdir tmp --smp 1
```
For more run options, run:
* For more run options:
```
./build/release/scylla --help
```bash
$ ./tools/toolchain/dbuild ./build/release/scylla --help
```
## Testing
See [test.py manual](docs/testing.md).
See [test.py manual](docs/guides/testing.md).
## Scylla APIs and compatibility
By default, Scylla is compatible with Apache Cassandra and its APIs - CQL and
Thrift. There is also experimental support for the API of Amazon DynamoDB,
but being experimental it needs to be explicitly enabled to be used. For more
information on how to enable the experimental DynamoDB compatibility in Scylla,
and the current limitations of this feature, see
Thrift. There is also support for the API of Amazon DynamoDB,
which needs to be enabled and configured in order to be used. For more
information on how to enable the DynamoDB™ API in Scylla,
and the current compatibility of this feature as well as Scylla-specific extensions, see
[Alternator](docs/alternator/alternator.md) and
[Getting started with Alternator](docs/alternator/getting-started.md).
## Documentation
Documentation can be found in [./docs](./docs) and on the
[wiki](https://github.com/scylladb/scylla/wiki). There is currently no clear
definition of what goes where, so when looking for something be sure to check
both.
Documentation can be found [here](https://scylla.docs.scylladb.com).
Seastar documentation can be found [here](http://docs.seastar.io/master/index.html).
User documentation can be found [here](https://docs.scylladb.com/).
@@ -69,27 +89,22 @@ The courses are free, self-paced and include hands-on examples. They cover a var
administration, architecture, basic NoSQL concepts, using drivers for application development, Scylla setup, failover, compactions,
multi-datacenters and how Scylla integrates with third-party applications.
## Building a CentOS-based Docker image
Build a Docker image with:
```
cd dist/docker/redhat
docker build -t <image-name> .
```
This build is based on executables downloaded from downloads.scylladb.com,
**not** on the executables built in this source directory. See further
instructions in dist/docker/redhat/README.md to build a docker image from
your own executables.
Run the image with:
```
docker run -p $(hostname -i):9042:9042 -i -t <image name>
```
## Contributing to Scylla
[Hacking howto](HACKING.md)
[Guidelines for contributing](CONTRIBUTING.md)
If you want to report a bug or submit a pull request or a patch, please read the [contribution guidelines].
If you are a developer working on Scylla, please read the [developer guidelines].
[contribution guidelines]: CONTRIBUTING.md
[developer guidelines]: HACKING.md
## Contact
* The [users mailing list] and [Slack channel] are for users to discuss configuration, management, and operations of the ScyllaDB open source.
* The [developers mailing list] is for developers and people interested in following the development of ScyllaDB to discuss technical topics.
[Users mailing list]: https://groups.google.com/forum/#!forum/scylladb-users
[Slack channel]: http://slack.scylladb.com/
[Developers mailing list]: https://groups.google.com/forum/#!forum/scylladb-dev


@@ -1,7 +1,7 @@
#!/bin/sh
PRODUCT=scylla
VERSION=666.development
VERSION=4.5.7
if test -f version
then

abseil

Submodule abseil updated: 2069dc796a...9c6a50fdd8


@@ -62,6 +62,14 @@ static std::string apply_sha256(std::string_view msg) {
return to_hex(hasher.finalize());
}
static std::string apply_sha256(const std::vector<temporary_buffer<char>>& msg) {
sha256_hasher hasher;
for (const temporary_buffer<char>& buf : msg) {
hasher.update(buf.get(), buf.size());
}
return to_hex(hasher.finalize());
}
static std::string format_time_point(db_clock::time_point tp) {
time_t time_point_repr = db_clock::to_time_t(tp);
std::string time_point_str;
@@ -78,12 +86,12 @@ void check_expiry(std::string_view signature_date) {
std::string expiration_str = format_time_point(db_clock::now() - 15min);
std::string validity_str = format_time_point(db_clock::now() + 15min);
if (signature_date < expiration_str) {
throw api_error("InvalidSignatureException",
throw api_error::invalid_signature(
fmt::format("Signature expired: {} is now earlier than {} (current time - 15 min.)",
signature_date, expiration_str));
}
if (signature_date > validity_str) {
throw api_error("InvalidSignatureException",
throw api_error::invalid_signature(
fmt::format("Signature not yet current: {} is still later than {} (current time + 15 min.)",
signature_date, validity_str));
}
@@ -91,16 +99,16 @@ void check_expiry(std::string_view signature_date) {
std::string get_signature(std::string_view access_key_id, std::string_view secret_access_key, std::string_view host, std::string_view method,
std::string_view orig_datestamp, std::string_view signed_headers_str, const std::map<std::string_view, std::string_view>& signed_headers_map,
std::string_view body_content, std::string_view region, std::string_view service, std::string_view query_string) {
const std::vector<temporary_buffer<char>>& body_content, std::string_view region, std::string_view service, std::string_view query_string) {
auto amz_date_it = signed_headers_map.find("x-amz-date");
if (amz_date_it == signed_headers_map.end()) {
throw api_error("InvalidSignatureException", "X-Amz-Date header is mandatory for signature verification");
throw api_error::invalid_signature("X-Amz-Date header is mandatory for signature verification");
}
std::string_view amz_date = amz_date_it->second;
check_expiry(amz_date);
std::string_view datestamp = amz_date.substr(0, 8);
if (datestamp != orig_datestamp) {
throw api_error("InvalidSignatureException",
throw api_error::invalid_signature(
format("X-Amz-Date date does not match the provided datestamp. Expected {}, got {}",
orig_datestamp, datestamp));
}
@@ -126,19 +134,18 @@ std::string get_signature(std::string_view access_key_id, std::string_view secre
future<std::string> get_key_from_roles(cql3::query_processor& qp, std::string username) {
static const sstring query = format("SELECT salted_hash FROM {} WHERE {} = ?",
auth::meta::roles_table::qualified_name(), auth::meta::roles_table::role_col_name);
auth::meta::roles_table::qualified_name, auth::meta::roles_table::role_col_name);
auto cl = auth::password_authenticator::consistency_for_user(username);
auto timeout = auth::internal_distributed_timeout_config();
return qp.execute_internal(query, cl, timeout, {sstring(username)}, true).then_wrapped([username = std::move(username)] (future<::shared_ptr<cql3::untyped_result_set>> f) {
return qp.execute_internal(query, cl, auth::internal_distributed_query_state(), {sstring(username)}, true).then_wrapped([username = std::move(username)] (future<::shared_ptr<cql3::untyped_result_set>> f) {
auto res = f.get0();
auto salted_hash = std::optional<sstring>();
if (res->empty()) {
throw api_error("UnrecognizedClientException", fmt::format("User not found: {}", username));
throw api_error::unrecognized_client(fmt::format("User not found: {}", username));
}
salted_hash = res->one().get_opt<sstring>("salted_hash");
if (!salted_hash) {
throw api_error("UnrecognizedClientException", fmt::format("No password found for user: {}", username));
throw api_error::unrecognized_client(fmt::format("No password found for user: {}", username));
}
return make_ready_future<std::string>(*salted_hash);
});
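
The `check_expiry()` hunk above compares timestamps as plain strings. That only works when the timestamps are fixed-width UTC strings, because such strings sort lexicographically in chronological order. Below is a minimal standalone sketch of the idea. It assumes the ISO 8601 basic format (`YYYYMMDDTHHMMSSZ`) carried by the `X-Amz-Date` header and uses `std::chrono::system_clock` in place of `db_clock`. `format_utc` and `within_window` are hypothetical names for illustration, not functions from this change.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <ctime>
#include <string>

// Format a time point as a fixed-width UTC timestamp, e.g. "20220608T135051Z".
// Hypothetical helper; the real format_time_point() body is not shown in the diff.
std::string format_utc(std::chrono::system_clock::time_point tp) {
    std::time_t t = std::chrono::system_clock::to_time_t(tp);
    std::tm tm_utc = *std::gmtime(&t);   // note: std::gmtime is not thread-safe
    char buf[17];                        // 16 characters plus the terminating NUL
    std::size_t n = std::strftime(buf, sizeof(buf), "%Y%m%dT%H%M%SZ", &tm_utc);
    assert(n == 16);                     // fixed width is what makes string comparison valid
    return std::string(buf, n);
}

// True if signature_date lies within +/- window of now.
// String comparison is enough because both sides share the same fixed-width format.
bool within_window(const std::string& signature_date,
                   std::chrono::system_clock::time_point now,
                   std::chrono::minutes window = std::chrono::minutes(15)) {
    return signature_date >= format_utc(now - window)
        && signature_date <= format_utc(now + window);
}
```

A timestamp 16 minutes in the past falls outside the window, which corresponds to the "Signature expired" branch in the hunk.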


@@ -39,7 +39,7 @@ using key_cache = utils::loading_cache<std::string, std::string>;
std::string get_signature(std::string_view access_key_id, std::string_view secret_access_key, std::string_view host, std::string_view method,
std::string_view orig_datestamp, std::string_view signed_headers_str, const std::map<std::string_view, std::string_view>& signed_headers_map,
std::string_view body_content, std::string_view region, std::string_view service, std::string_view query_string);
const std::vector<temporary_buffer<char>>& body_content, std::string_view region, std::string_view service, std::string_view query_string);
future<std::string> get_key_from_roles(cql3::query_processor& qp, std::string username);


@@ -32,13 +32,13 @@
// and the character used in base64 encoding to represent it.
static class base64_chars {
public:
static constexpr const char* to =
static constexpr const char to[] =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
int8_t from[255];
base64_chars() {
static_assert(strlen(to) == 64);
static_assert(sizeof(to) == 64 + 1);
for (int i = 0; i < 255; i++) {
from[i] = 255; // signal invalid character
from[i] = -1; // signal invalid character
}
for (int i = 0; i < 64; i++) {
from[(unsigned) to[i]] = i;
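
The base64 hunk above makes two small correctness changes. The alphabet becomes a character array, so its length can be checked with `sizeof` at compile time (`strlen` is not guaranteed to be usable in a `static_assert`). The "invalid character" marker becomes `-1`, since `255` does not fit in an `int8_t`. The sketch below shows the same technique in isolation. It is not the code from the diff: the 256-entry `std::array`, the `unsigned char` indexing, and the names `base64_alphabet`, `make_decode_table`, `decode_symbol` and `encode_symbol` are assumptions made for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// A character array (not a pointer) lets sizeof() count the symbols at compile time.
inline constexpr char base64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
static_assert(sizeof(base64_alphabet) == 64 + 1, "64 symbols plus the terminating NUL");

// Build the reverse table: byte value -> 6-bit index, or -1 for an invalid byte.
constexpr std::array<std::int8_t, 256> make_decode_table() {
    std::array<std::int8_t, 256> table{};
    for (auto& entry : table) {
        entry = -1;  // representable in int8_t, unlike 255
    }
    for (std::size_t i = 0; i < 64; ++i) {
        // Index through unsigned char so bytes >= 0x80 cannot produce a negative index.
        table[static_cast<unsigned char>(base64_alphabet[i])] = static_cast<std::int8_t>(i);
    }
    return table;
}

inline constexpr auto decode_table = make_decode_table();

// Map a 6-bit index to its base64 symbol.
constexpr char encode_symbol(unsigned index) {
    assert(index < 64);
    return base64_alphabet[index];
}

// Map a base64 symbol back to its 6-bit index, or -1 if it is not in the alphabet.
constexpr int decode_symbol(char c) {
    return decode_table[static_cast<unsigned char>(c)];
}
```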


@@ -23,7 +23,7 @@
#include <string_view>
#include "bytes.hh"
#include "rjson.hh"
#include "utils/rjson.hh"
std::string base64_encode(bytes_view);


@@ -26,7 +26,7 @@
#include "alternator/error.hh"
#include "cql3/constants.hh"
#include <unordered_map>
#include "rjson.hh"
#include "utils/rjson.hh"
#include "serialization.hh"
#include "base64.hh"
#include <stdexcept>
@@ -57,12 +57,12 @@ comparison_operator_type get_comparison_operator(const rjson::value& comparison_
{"NOT_CONTAINS", comparison_operator_type::NOT_CONTAINS},
};
if (!comparison_operator.IsString()) {
throw api_error("ValidationException", format("Invalid comparison operator definition {}", rjson::print(comparison_operator)));
throw api_error::validation(format("Invalid comparison operator definition {}", rjson::print(comparison_operator)));
}
std::string op = comparison_operator.GetString();
auto it = ops.find(op);
if (it == ops.end()) {
throw api_error("ValidationException", format("Unsupported comparison operator {}", op));
throw api_error::validation(format("Unsupported comparison operator {}", op));
}
return it->second;
}
@@ -98,11 +98,16 @@ struct nonempty : public size_check {
// Check that array has the expected number of elements
static void verify_operand_count(const rjson::value* array, const size_check& expected, const rjson::value& op) {
if (!array && expected(0)) {
// If expected() allows an empty AttributeValueList, it is also fine
// that it is missing.
return;
}
if (!array || !array->IsArray()) {
throw api_error("ValidationException", "With ComparisonOperator, AttributeValueList must be given and an array");
throw api_error::validation("With ComparisonOperator, AttributeValueList must be given and an array");
}
if (!expected(array->Size())) {
throw api_error("ValidationException",
throw api_error::validation(
format("{} operator requires AttributeValueList {}, instead found list size {}",
op, expected.what(), array->Size()));
}
@@ -118,7 +123,7 @@ struct rjson_engaged_ptr_comp {
// as internally they're stored in an array, and the order of elements is
// not important in set equality. See issue #5021
static bool check_EQ_for_sets(const rjson::value& set1, const rjson::value& set2) {
if (set1.Size() != set2.Size()) {
if (!set1.IsArray() || !set2.IsArray() || set1.Size() != set2.Size()) {
return false;
}
std::set<const rjson::value*, rjson_engaged_ptr_comp> set1_raw;
@@ -126,7 +131,40 @@ static bool check_EQ_for_sets(const rjson::value& set1, const rjson::value& set2
set1_raw.insert(&*it);
}
for (const auto& a : set2.GetArray()) {
if (set1_raw.count(&a) == 0) {
if (!set1_raw.contains(&a)) {
return false;
}
}
return true;
}
// Moreover, the JSON being compared can be a nested document with outer
// layers of lists and maps and some inner set - and we need to get to that
// inner set to compare it correctly with check_EQ_for_sets() (issue #8514).
static bool check_EQ(const rjson::value* v1, const rjson::value& v2);
static bool check_EQ_for_lists(const rjson::value& list1, const rjson::value& list2) {
if (!list1.IsArray() || !list2.IsArray() || list1.Size() != list2.Size()) {
return false;
}
auto it1 = list1.Begin();
auto it2 = list2.Begin();
while (it1 != list1.End()) {
// Note: Alternator limits an item's depth (rjson::parse() limits
// it to around 37 levels), so this recursion is safe.
if (!check_EQ(&*it1, *it2)) {
return false;
}
++it1;
++it2;
}
return true;
}
static bool check_EQ_for_maps(const rjson::value& list1, const rjson::value& list2) {
if (!list1.IsObject() || !list2.IsObject() || list1.MemberCount() != list2.MemberCount()) {
return false;
}
for (auto it1 = list1.MemberBegin(); it1 != list1.MemberEnd(); ++it1) {
auto it2 = list2.FindMember(it1->name);
if (it2 == list2.MemberEnd() || !check_EQ(&it1->value, it2->value)) {
return false;
}
}
@@ -135,42 +173,71 @@ static bool check_EQ_for_sets(const rjson::value& set1, const rjson::value& set2
// Check if two JSON-encoded values match with the EQ relation
static bool check_EQ(const rjson::value* v1, const rjson::value& v2) {
if (!v1) {
return false;
}
if (v1->IsObject() && v1->MemberCount() == 1 && v2.IsObject() && v2.MemberCount() == 1) {
if (v1 && v1->IsObject() && v1->MemberCount() == 1 && v2.IsObject() && v2.MemberCount() == 1) {
auto it1 = v1->MemberBegin();
auto it2 = v2.MemberBegin();
if ((it1->name == "SS" && it2->name == "SS") || (it1->name == "NS" && it2->name == "NS") || (it1->name == "BS" && it2->name == "BS")) {
return check_EQ_for_sets(it1->value, it2->value);
if (it1->name != it2->name) {
return false;
}
if (it1->name == "SS" || it1->name == "NS" || it1->name == "BS") {
return check_EQ_for_sets(it1->value, it2->value);
} else if(it1->name == "L") {
return check_EQ_for_lists(it1->value, it2->value);
} else if(it1->name == "M") {
return check_EQ_for_maps(it1->value, it2->value);
} else {
// Other, non-nested types (number, string, etc.) can be compared
// literally, comparing their JSON representation.
return it1->value == it2->value;
}
} else {
// If v1 and/or v2 are missing (IsNull()) the result should be false.
// In the unlikely case that the object is malformed (issue #8070),
// let's also return false.
return false;
}
return *v1 == v2;
}
// Check if two JSON-encoded values match with the NE relation
static bool check_NE(const rjson::value* v1, const rjson::value& v2) {
return !v1 || *v1 != v2; // null is unequal to anything.
return !check_EQ(v1, v2);
}
// Check if two JSON-encoded values match with the BEGINS_WITH relation
static bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2) {
// BEGINS_WITH requires that its single operand (v2) be a string or
// binary - otherwise it's a validation error. However, problems with
// the stored attribute (v1) will just return false (no match).
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error("ValidationException", format("BEGINS_WITH operator encountered malformed AttributeValue: {}", v2));
}
auto it2 = v2.MemberBegin();
if (it2->name != "S" && it2->name != "B") {
throw api_error("ValidationException", format("BEGINS_WITH operator requires String or Binary type in AttributeValue, got {}", it2->name));
}
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (v1_from_query) {
throw api_error::validation("begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v1->MemberBegin()->name != "S" && v1->MemberBegin()->name != "B") {
if (v1_from_query) {
throw api_error::validation(format("begins_with supports only string or binary type, got: {}", *v1));
} else {
bad = true;
}
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
if (v2_from_query) {
throw api_error::validation("begins_with() encountered malformed argument");
} else {
bad = true;
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
if (v2_from_query) {
throw api_error::validation(format("begins_with() supports only string or binary type, got: {}", v2));
} else {
bad = true;
}
}
if (bad) {
return false;
}
auto it1 = v1->MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name != it2->name) {
return false;
}
@@ -228,12 +295,12 @@ static bool check_NOT_CONTAINS(const rjson::value* v1, const rjson::value& v2) {
// Check if a JSON-encoded value equals any element of an array, which must have at least one element.
static bool check_IN(const rjson::value* val, const rjson::value& array) {
if (!array[0].IsObject() || array[0].MemberCount() != 1) {
throw api_error("ValidationException",
throw api_error::validation(
format("IN operator encountered malformed AttributeValue: {}", array[0]));
}
const auto& type = array[0].MemberBegin()->name;
if (type != "S" && type != "N" && type != "B") {
throw api_error("ValidationException",
throw api_error::validation(
"IN operator requires AttributeValueList elements to be of type String, Number, or Binary ");
}
if (!val) {
@@ -242,7 +309,7 @@ static bool check_IN(const rjson::value* val, const rjson::value& array) {
bool have_match = false;
for (const auto& elem : array.GetArray()) {
if (!elem.IsObject() || elem.MemberCount() != 1 || elem.MemberBegin()->name != type) {
throw api_error("ValidationException",
throw api_error::validation(
"IN operator requires all AttributeValueList elements to have the same type ");
}
if (!have_match && *val == elem) {
@@ -274,24 +341,40 @@ static bool check_NOT_NULL(const rjson::value* val) {
return val != nullptr;
}
// Only types S, N or B (string, number or bytes) may be compared by the
// various comparison operators - lt, le, gt, ge, and between.
// Note that in particular, if the value is missing (v->IsNull()), this
// check returns false.
static bool check_comparable_type(const rjson::value& v) {
if (!v.IsObject() || v.MemberCount() != 1) {
return false;
}
const rjson::value& type = v.MemberBegin()->name;
return type == "S" || type == "N" || type == "B";
}
// Check if two JSON-encoded values match with cmp.
template <typename Comparator>
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp) {
if (!v2.IsObject() || v2.MemberCount() != 1) {
throw api_error("ValidationException",
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
bool check_compare(const rjson::value* v1, const rjson::value& v2, const Comparator& cmp,
bool v1_from_query, bool v2_from_query) {
bool bad = false;
if (!v1 || !check_comparable_type(*v1)) {
if (v1_from_query) {
throw api_error::validation(format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
}
const auto& kv2 = *v2.MemberBegin();
if (kv2.name != "S" && kv2.name != "N" && kv2.name != "B") {
throw api_error("ValidationException",
format("{} requires a single AttributeValue of type String, Number, or Binary",
cmp.diagnostic));
if (!check_comparable_type(v2)) {
if (v2_from_query) {
throw api_error::validation(format("{} allow only the types String, Number, or Binary", cmp.diagnostic));
}
bad = true;
}
if (!v1 || !v1->IsObject() || v1->MemberCount() != 1) {
if (bad) {
return false;
}
const auto& kv1 = *v1->MemberBegin();
const auto& kv2 = *v2.MemberBegin();
if (kv1.name != kv2.name) {
return false;
}
@@ -305,7 +388,8 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
if (kv1.name == "B") {
return cmp(base64_decode(kv1.value), base64_decode(kv2.value));
}
clogger.error("check_compare panic: LHS type equals RHS type, but one is in {N,S,B} while the other isn't");
// cannot reach here, as check_comparable_type() verifies the type is one
// of the above options.
return false;
}
@@ -336,57 +420,71 @@ struct cmp_gt {
static constexpr const char* diagnostic = "GT operator";
};
// True if v is between lb and ub, inclusive. Throws if lb > ub.
// True if v is between lb and ub, inclusive. Throws or returns false
// (depending on bounds_from_query parameter) if lb > ub.
template <typename T>
static bool check_BETWEEN(const T& v, const T& lb, const T& ub) {
static bool check_BETWEEN(const T& v, const T& lb, const T& ub, bool bounds_from_query) {
if (cmp_lt()(ub, lb)) {
throw api_error("ValidationException",
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
if (bounds_from_query) {
throw api_error::validation(
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
} else {
return false;
}
}
return cmp_ge()(v, lb) && cmp_le()(v, ub);
}
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub) {
if (!v) {
static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const rjson::value& ub,
bool v_from_query, bool lb_from_query, bool ub_from_query) {
if ((v && v_from_query && !check_comparable_type(*v)) ||
(lb_from_query && !check_comparable_type(lb)) ||
(ub_from_query && !check_comparable_type(ub))) {
throw api_error::validation("between allow only the types String, Number, or Binary");
}
if (!v || !v->IsObject() || v->MemberCount() != 1 ||
!lb.IsObject() || lb.MemberCount() != 1 ||
!ub.IsObject() || ub.MemberCount() != 1) {
return false;
}
if (!v->IsObject() || v->MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", *v));
}
if (!lb.IsObject() || lb.MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", lb));
}
if (!ub.IsObject() || ub.MemberCount() != 1) {
throw api_error("ValidationException", format("BETWEEN operator encountered malformed AttributeValue: {}", ub));
}
const auto& kv_v = *v->MemberBegin();
const auto& kv_lb = *lb.MemberBegin();
const auto& kv_ub = *ub.MemberBegin();
bool bounds_from_query = lb_from_query && ub_from_query;
if (kv_lb.name != kv_ub.name) {
throw api_error(
"ValidationException",
if (bounds_from_query) {
throw api_error::validation(
format("BETWEEN operator requires the same type for lower and upper bound; instead got {} and {}",
kv_lb.name, kv_ub.name));
} else {
return false;
}
}
if (kv_v.name != kv_lb.name) { // Cannot compare different types, so v is NOT between lb and ub.
return false;
}
if (kv_v.name == "N") {
const char* diag = "BETWEEN operator";
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag));
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag), bounds_from_query);
}
if (kv_v.name == "S") {
return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),
std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()));
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()),
bounds_from_query);
}
if (kv_v.name == "B") {
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value));
return check_BETWEEN(base64_decode(kv_v.value), base64_decode(kv_lb.value), base64_decode(kv_ub.value), bounds_from_query);
}
throw api_error("ValidationException",
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
if (v_from_query) {
throw api_error::validation(
format("BETWEEN operator requires AttributeValueList elements to be of type String, Number, or Binary; instead got {}",
kv_lb.name));
} else {
return false;
}
}
// Verify one Expect condition on one attribute (whose content is "got")
@@ -404,24 +502,24 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
// and requires a different combinations of parameters in the request
if (value) {
if (exists && (!exists->IsBool() || exists->GetBool() != true)) {
throw api_error("ValidationException", "Cannot combine Value with Exists!=true");
throw api_error::validation("Cannot combine Value with Exists!=true");
}
if (comparison_operator) {
throw api_error("ValidationException", "Cannot combine Value with ComparisonOperator");
throw api_error::validation("Cannot combine Value with ComparisonOperator");
}
return check_EQ(got, *value);
} else if (exists) {
if (comparison_operator) {
throw api_error("ValidationException", "Cannot combine Exists with ComparisonOperator");
throw api_error::validation("Cannot combine Exists with ComparisonOperator");
}
if (!exists->IsBool() || exists->GetBool() != false) {
throw api_error("ValidationException", "Exists!=false requires Value");
throw api_error::validation("Exists!=false requires Value");
}
// Remember Exists=false, so we're checking that the attribute does *not* exist:
return !got;
} else {
if (!comparison_operator) {
throw api_error("ValidationException", "Missing ComparisonOperator, Value or Exists");
throw api_error::validation("Missing ComparisonOperator, Value or Exists");
}
comparison_operator_type op = get_comparison_operator(*comparison_operator);
switch (op) {
@@ -433,19 +531,19 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NE(got, (*attribute_value_list)[0]);
case comparison_operator_type::LT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_lt{});
return check_compare(got, (*attribute_value_list)[0], cmp_lt{}, false, true);
case comparison_operator_type::LE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_le{});
return check_compare(got, (*attribute_value_list)[0], cmp_le{}, false, true);
case comparison_operator_type::GT:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_gt{});
return check_compare(got, (*attribute_value_list)[0], cmp_gt{}, false, true);
case comparison_operator_type::GE:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_compare(got, (*attribute_value_list)[0], cmp_ge{});
return check_compare(got, (*attribute_value_list)[0], cmp_ge{}, false, true);
case comparison_operator_type::BEGINS_WITH:
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
return check_BEGINS_WITH(got, (*attribute_value_list)[0]);
return check_BEGINS_WITH(got, (*attribute_value_list)[0], false, true);
case comparison_operator_type::IN:
verify_operand_count(attribute_value_list, nonempty(), *comparison_operator);
return check_IN(got, *attribute_value_list);
@@ -457,7 +555,8 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
return check_NOT_NULL(got);
case comparison_operator_type::BETWEEN:
verify_operand_count(attribute_value_list, exact_size(2), *comparison_operator);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1]);
return check_BETWEEN(got, (*attribute_value_list)[0], (*attribute_value_list)[1],
false, true, true);
case comparison_operator_type::CONTAINS:
{
verify_operand_count(attribute_value_list, exact_size(1), *comparison_operator);
@@ -466,7 +565,7 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
const rjson::value& arg = (*attribute_value_list)[0];
const auto& argtype = (*arg.MemberBegin()).name;
if (argtype != "S" && argtype != "N" && argtype != "B") {
throw api_error("ValidationException",
throw api_error::validation(
format("CONTAINS operator requires a single AttributeValue of type String, Number, or Binary, "
"got {} instead", argtype));
}
@@ -480,7 +579,7 @@ static bool verify_expected_one(const rjson::value& condition, const rjson::valu
const rjson::value& arg = (*attribute_value_list)[0];
const auto& argtype = (*arg.MemberBegin()).name;
if (argtype != "S" && argtype != "N" && argtype != "B") {
throw api_error("ValidationException",
throw api_error::validation(
format("CONTAINS operator requires a single AttributeValue of type String, Number, or Binary, "
"got {} instead", argtype));
}
@@ -497,7 +596,7 @@ conditional_operator_type get_conditional_operator(const rjson::value& req) {
return conditional_operator_type::MISSING;
}
if (!conditional_operator->IsString()) {
throw api_error("ValidationException", "'ConditionalOperator' parameter, if given, must be a string");
throw api_error::validation("'ConditionalOperator' parameter, if given, must be a string");
}
auto s = rjson::to_string_view(*conditional_operator);
if (s == "AND") {
@@ -505,7 +604,7 @@ conditional_operator_type get_conditional_operator(const rjson::value& req) {
} else if (s == "OR") {
return conditional_operator_type::OR;
} else {
throw api_error("ValidationException",
throw api_error::validation(
format("'ConditionalOperator' parameter must be AND, OR or missing. Found {}.", s));
}
}
@@ -520,13 +619,13 @@ bool verify_expected(const rjson::value& req, const rjson::value* previous_item)
auto conditional_operator = get_conditional_operator(req);
if (conditional_operator != conditional_operator_type::MISSING &&
(!expected || (expected->IsObject() && expected->GetObject().ObjectEmpty()))) {
throw api_error("ValidationException", "'ConditionalOperator' parameter cannot be specified for missing or empty Expression");
throw api_error::validation("'ConditionalOperator' parameter cannot be specified for missing or empty Expression");
}
if (!expected) {
return true;
}
if (!expected->IsObject()) {
throw api_error("ValidationException", "'Expected' parameter, if given, must be an object");
throw api_error::validation("'Expected' parameter, if given, must be an object");
}
bool require_all = conditional_operator != conditional_operator_type::OR;
return verify_condition(*expected, require_all, previous_item);
@@ -569,7 +668,8 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Wrong number of values {} in BETWEEN primitive_condition", cond._values.size()));
}
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2]);
return check_BETWEEN(&calculated_values[0], calculated_values[1], calculated_values[2],
cond._values[0].is_constant(), cond._values[1].is_constant(), cond._values[2].is_constant());
case parsed::primitive_condition::type::IN:
return check_IN(calculated_values);
case parsed::primitive_condition::type::VALUE:
@@ -584,7 +684,7 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
return it->value.GetBool();
}
}
throw api_error("ValidationException",
throw api_error::validation(
format("ConditionExpression: condition results in a non-boolean value: {}",
calculated_values[0]));
default:
@@ -600,13 +700,17 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
case parsed::primitive_condition::type::NE:
return check_NE(&calculated_values[0], calculated_values[1]);
case parsed::primitive_condition::type::GT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_gt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::GE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_ge{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::LT:
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_lt{},
cond._values[0].is_constant(), cond._values[1].is_constant());
case parsed::primitive_condition::type::LE:
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{});
return check_compare(&calculated_values[0], calculated_values[1], cmp_le{},
cond._values[0].is_constant(), cond._values[1].is_constant());
default:
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Unknown type {} in primitive_condition object", (int)(cond._op)));
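
Note on the `is_constant()` flags threaded through the hunks above: a comment elsewhere in this diff (in the old `begins_with` handler) says that an unsupported operand is a validation error when it came from the user's query, and a silent non-match when it came from the stored item. The following is a minimal sketch of that rule only, not the actual `check_BEGINS_WITH` or `check_compare` code. The names `typed_value`, `usable_operand` and `begins_with_sketch` are hypothetical, `std::invalid_argument` stands in for `api_error::validation`, and only the "S" type is handled here.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical, simplified value: a DynamoDB-style type tag plus a payload.
struct typed_value {
    std::string type;  // e.g. "S" or "N"
    std::string data;
};

// Returns true if the operand can be used. An unusable operand that came
// from the query is an error; one that came from the item is just "no match".
static bool usable_operand(const typed_value& v, bool from_query) {
    if (v.type == "S") {
        return true;
    }
    if (from_query) {
        throw std::invalid_argument("begins_with() supports only strings in this sketch, got type " + v.type);
    }
    return false;
}

bool begins_with_sketch(const typed_value& v1, const typed_value& v2,
                        bool v1_from_query, bool v2_from_query) {
    // Check both operands first, so a bad query operand is reported even
    // when the item operand is also unusable.
    const bool ok1 = usable_operand(v1, v1_from_query);
    const bool ok2 = usable_operand(v2, v2_from_query);
    if (!ok1 || !ok2) {
        return false;
    }
    assert(v1.type == "S" && v2.type == "S");
    // True when v2.data is a prefix of v1.data.
    return v1.data.compare(0, v2.data.size(), v2.data) == 0;
}
```

Passing the origin of each operand as a separate flag keeps the comparison itself independent of where the values were read from.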

View File

@@ -52,6 +52,7 @@ bool verify_expected(const rjson::value& req, const rjson::value* previous_item)
bool verify_condition(const rjson::value& condition, bool require_all, const rjson::value* previous_item);
bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2);
bool check_BEGINS_WITH(const rjson::value* v1, const rjson::value& v2, bool v1_from_query, bool v2_from_query);
bool verify_condition_expression(
const parsed::condition_expression& condition_expression,

View File

@@ -26,12 +26,15 @@
namespace alternator {
// DynamoDB's error messages are described in detail in
// api_error contains a DynamoDB error message to be returned to the user.
// It can be returned by value (see executor::request_return_type) or thrown.
// DynamoDB's error messages are described in detail in
// https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html
// An error message has a "type", e.g., "ResourceNotFoundException", a coarser
// HTTP code (almost always, 400), and a human readable message. Eventually these
// will be wrapped into a JSON object returned to the client.
class api_error : public std::exception {
// An error message has an HTTP code (almost always 400), a type, e.g.,
// "ResourceNotFoundException", and a human readable message.
// Eventually alternator::api_handler will convert a returned or thrown
// api_error into a JSON object, and that is returned to the user.
class api_error final {
public:
using status_type = httpd::reply::status_type;
status_type _http_code;
@@ -42,8 +45,47 @@ public:
, _type(std::move(type))
, _msg(std::move(msg))
{ }
api_error() = default;
virtual const char* what() const noexcept override { return _msg.c_str(); }
// Factory functions for some common types of DynamoDB API errors
static api_error validation(std::string msg) {
return api_error("ValidationException", std::move(msg));
}
static api_error resource_not_found(std::string msg) {
return api_error("ResourceNotFoundException", std::move(msg));
}
static api_error resource_in_use(std::string msg) {
return api_error("ResourceInUseException", std::move(msg));
}
static api_error invalid_signature(std::string msg) {
return api_error("InvalidSignatureException", std::move(msg));
}
static api_error missing_authentication_token(std::string msg) {
return api_error("MissingAuthenticationTokenException", std::move(msg));
}
static api_error unrecognized_client(std::string msg) {
return api_error("UnrecognizedClientException", std::move(msg));
}
static api_error unknown_operation(std::string msg) {
return api_error("UnknownOperationException", std::move(msg));
}
static api_error access_denied(std::string msg) {
return api_error("AccessDeniedException", std::move(msg));
}
static api_error conditional_check_failed(std::string msg) {
return api_error("ConditionalCheckFailedException", std::move(msg));
}
static api_error expired_iterator(std::string msg) {
return api_error("ExpiredIteratorException", std::move(msg));
}
static api_error trimmed_data_access_exception(std::string msg) {
return api_error("TrimmedDataAccessException", std::move(msg));
}
static api_error request_limit_exceeded(std::string msg) {
return api_error("RequestLimitExceeded", std::move(msg));
}
static api_error internal(std::string msg) {
return api_error("InternalServerError", std::move(msg), reply::status_type::internal_server_error);
}
};
}
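
The header above turns `api_error` into a plain value type with named factory functions, usable either as a return value or as a thrown object. Below is a minimal, self-contained sketch of that pattern, assuming a plain `int` in place of `httpd::reply::status_type`. The `request_result` alias and both helper functions are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

// Simplified stand-in for alternator::api_error (no Seastar types).
class api_error final {
public:
    int _http_code;
    std::string _type;
    std::string _msg;
    api_error(std::string type, std::string msg, int http_code = 400)
        : _http_code(http_code), _type(std::move(type)), _msg(std::move(msg)) {}
    // Factory functions spare callers from spelling out the type string.
    static api_error validation(std::string msg) {
        return api_error("ValidationException", std::move(msg));
    }
    static api_error internal(std::string msg) {
        return api_error("InternalServerError", std::move(msg), 500);
    }
};

// Hypothetical result type: a response body, or an error returned by value.
using request_result = std::variant<std::string, api_error>;

// Hypothetical handler that reports failure without throwing.
request_result check_expected_is_object(bool is_object) {
    if (!is_object) {
        return api_error::validation("'Expected' parameter, if given, must be an object");
    }
    return std::string("{}");
}

// The same type can still be thrown and caught by its concrete type.
int http_code_of_thrown_error() {
    try {
        throw api_error::internal("unexpected failure");
    } catch (const api_error& e) {
        assert(e._type == "InternalServerError");
        return e._http_code;
    }
}
```

Returning the error by value avoids the cost of exception unwinding on paths where failures are common, while throwing stays available for deeply nested code.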

File diff suppressed because it is too large

View File

@@ -30,16 +30,121 @@
#include "service/storage_proxy.hh"
#include "service/migration_manager.hh"
#include "service/client_state.hh"
#include "db/timeout_clock.hh"
#include "alternator/error.hh"
#include "stats.hh"
#include "rjson.hh"
#include "utils/rjson.hh"
namespace db {
class system_distributed_keyspace;
}
namespace query {
class partition_slice;
class result;
}
namespace cql3::selection {
class selection;
}
namespace service {
class storage_service;
}
namespace cdc {
class metadata;
}
namespace alternator {
class rmw_operation;
struct make_jsonable : public json::jsonable {
rjson::value _value;
public:
explicit make_jsonable(rjson::value&& value);
std::string to_json() const override;
};
struct json_string : public json::jsonable {
std::string _value;
public:
explicit json_string(std::string&& value);
std::string to_json() const override;
};
namespace parsed {
class path;
};
// An attribute_path_map object is used to hold data for various attributes
// paths (parsed::path) in a hierarchy of attribute paths. Each attribute path
// has a root attribute, which is then modified by member and index operators -
// for example in "a.b[2].c" we have "a" as the root, then ".b" member, then
// "[2]" index, and finally ".c" member.
// Data can be added to an attribute_path_map using the add() function, but
// requires that attributes with data not be *overlapping* or *conflicting*:
//
// 1. Two attribute paths which are identical or an ancestor of one another
// are considered *overlapping* and not allowed. If a.b.c has data,
// we can't add more data in a.b.c or any of its descendants like a.b.c.d.
//
// 2. Two attribute paths which need the same parent to have both a member and
// an index are considered *conflicting* and not allowed. E.g., if a.b has
// data, you can't add a[1]. The meaning of adding both would be that the
// attribute a is both a map and an array, which isn't sensible.
//
// These two requirements are common to the two places where Alternator uses
// this abstraction to describe how a hierarchical item is to be transformed:
//
// 1. In ProjectionExpression: for keeping, out of a full top-level attribute,
// only the parts the user asked for in ProjectionExpression.
//
// 2. In UpdateExpression: for taking the previous value of a top-level
// attribute, and modifying it based on the instructions the user
// wrote in UpdateExpression.
template<typename T>
class attribute_path_map_node {
public:
using data_t = T;
// We need the extra unique_ptr<> here because libstdc++ unordered_map
// doesn't work with incomplete types :-(
using members_t = std::unordered_map<std::string, std::unique_ptr<attribute_path_map_node<T>>>;
// The indexes list is sorted because DynamoDB requires handling writes
// beyond the end of a list in index order.
using indexes_t = std::map<unsigned, std::unique_ptr<attribute_path_map_node<T>>>;
// The prohibition on "overlap" and "conflict" explained above means
// that only one of data, members or indexes is non-empty.
std::optional<std::variant<data_t, members_t, indexes_t>> _content;
bool is_empty() const { return !_content; }
bool has_value() const { return _content && std::holds_alternative<data_t>(*_content); }
bool has_members() const { return _content && std::holds_alternative<members_t>(*_content); }
bool has_indexes() const { return _content && std::holds_alternative<indexes_t>(*_content); }
// get_members() assumes that has_members() is true
members_t& get_members() { return std::get<members_t>(*_content); }
const members_t& get_members() const { return std::get<members_t>(*_content); }
indexes_t& get_indexes() { return std::get<indexes_t>(*_content); }
const indexes_t& get_indexes() const { return std::get<indexes_t>(*_content); }
T& get_value() { return std::get<T>(*_content); }
const T& get_value() const { return std::get<T>(*_content); }
};
template<typename T>
using attribute_path_map = std::unordered_map<std::string, attribute_path_map_node<T>>;
using attrs_to_get_node = attribute_path_map_node<std::monostate>;
using attrs_to_get = attribute_path_map<std::monostate>;
class executor : public peering_sharded_service<executor> {
service::storage_proxy& _proxy;
service::migration_manager& _mm;
db::system_distributed_keyspace& _sdks;
service::storage_service& _ss;
cdc::metadata& _cdc_metadata;
// An smp_service_group to be used for limiting the concurrency when
// forwarding Alternator requests between shards - if necessary for LWT.
smp_service_group _ssg;
@@ -52,12 +157,13 @@ public:
static constexpr auto KEYSPACE_NAME_PREFIX = "alternator_";
static constexpr std::string_view INTERNAL_TABLE_PREFIX = ".scylla.alternator.";
executor(service::storage_proxy& proxy, service::migration_manager& mm, smp_service_group ssg)
: _proxy(proxy), _mm(mm), _ssg(ssg) {}
executor(service::storage_proxy& proxy, service::migration_manager& mm, db::system_distributed_keyspace& sdks, service::storage_service& ss, cdc::metadata& cdc_metadata, smp_service_group ssg)
: _proxy(proxy), _mm(mm), _sdks(sdks), _ss(ss), _cdc_metadata(cdc_metadata), _ssg(ssg) {}
future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> delete_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> update_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> put_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> get_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> delete_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
@@ -71,13 +177,48 @@ public:
future<request_return_type> tag_resource(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> untag_resource(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> list_tags_of_resource(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> list_streams(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> describe_stream(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> get_shard_iterator(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> get_records(client_state& client_state, tracing::trace_state_ptr, service_permit permit, rjson::value request);
future<> start();
future<> stop() { return make_ready_future<>(); }
future<> create_keyspace(std::string_view keyspace_name);
static tracing::trace_state_ptr maybe_trace_query(client_state& client_state, sstring_view op, sstring_view query);
static sstring table_name(const schema&);
static db::timeout_clock::time_point default_timeout();
static void set_default_timeout(db::timeout_clock::duration timeout);
private:
static db::timeout_clock::duration s_default_timeout;
public:
static schema_ptr find_table(service::storage_proxy&, const rjson::value& request);
private:
friend class rmw_operation;
static bool is_alternator_keyspace(const sstring& ks_name);
static sstring make_keyspace_name(const sstring& table_name);
static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr);
static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&);
public:
static std::optional<rjson::value> describe_single_item(schema_ptr,
const query::partition_slice&,
const cql3::selection::selection&,
const query::result&,
const attrs_to_get&);
static void describe_single_item(const cql3::selection::selection&,
const std::vector<bytes_opt>&,
const attrs_to_get&,
rjson::value&,
bool = false);
void add_stream_options(const rjson::value& stream_spec, schema_builder&) const;
void supplement_table_info(rjson::value& descr, const schema& schema) const;
void supplement_table_stream_info(rjson::value& descr, const schema& schema) const;
};
}
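
The "overlapping" and "conflicting" rules described in the `attribute_path_map` comment above can be sketched as follows. The `add()` function itself is not shown in this diff, so `add_path` below is a hypothetical, simplified version written only to illustrate the two rules. `path_op` is an assumed alias, and `std::invalid_argument` stands in for whatever error the real code raises.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <variant>
#include <vector>

template<typename T>
struct attribute_path_map_node {
    using members_t = std::unordered_map<std::string, std::unique_ptr<attribute_path_map_node<T>>>;
    using indexes_t = std::map<unsigned, std::unique_ptr<attribute_path_map_node<T>>>;
    std::optional<std::variant<T, members_t, indexes_t>> _content;
    bool is_empty() const { return !_content; }
    bool has_value() const { return _content && std::holds_alternative<T>(*_content); }
    bool has_members() const { return _content && std::holds_alternative<members_t>(*_content); }
    bool has_indexes() const { return _content && std::holds_alternative<indexes_t>(*_content); }
    members_t& get_members() { assert(has_members()); return std::get<members_t>(*_content); }
    indexes_t& get_indexes() { assert(has_indexes()); return std::get<indexes_t>(*_content); }
    T& get_value() { assert(has_value()); return std::get<T>(*_content); }
};

// One path step: ".member" (string) or "[index]" (unsigned).
using path_op = std::variant<std::string, unsigned>;

// Hypothetical add(): stores value at the path `ops` below `root`.
template<typename T>
void add_path(attribute_path_map_node<T>& root, const std::vector<path_op>& ops, T value) {
    using node = attribute_path_map_node<T>;
    node* n = &root;
    for (const path_op& op : ops) {
        if (n->has_value()) {
            throw std::invalid_argument("overlapping paths");  // an ancestor already holds data
        }
        std::unique_ptr<node>* child = nullptr;
        if (const std::string* member = std::get_if<std::string>(&op)) {
            if (n->has_indexes()) {
                throw std::invalid_argument("conflicting paths");  // parent is already a list
            }
            if (n->is_empty()) {
                n->_content.emplace(std::in_place_type<typename node::members_t>);
            }
            child = &n->get_members()[*member];
        } else {
            if (n->has_members()) {
                throw std::invalid_argument("conflicting paths");  // parent is already a map
            }
            if (n->is_empty()) {
                n->_content.emplace(std::in_place_type<typename node::indexes_t>);
            }
            child = &n->get_indexes()[std::get<unsigned>(op)];
        }
        if (!*child) {
            *child = std::make_unique<node>();
        }
        n = child->get();
    }
    if (!n->is_empty()) {
        throw std::invalid_argument("overlapping paths");  // same path, or a descendant holds data
    }
    n->_content.emplace(std::in_place_type<T>, std::move(value));
}
```

In this sketch the root node plays the role of the top-level attribute `a`, so adding `a.b.c` and then `a.b.d` succeeds, while a later `a.b.c.d` (overlap) or `a[1]` (conflict) is rejected.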

View File

@@ -130,6 +130,27 @@ void condition_expression::append(condition_expression&& a, char op) {
}, _expression);
}
void path::check_depth_limit() {
if (1 + _operators.size() > depth_limit) {
throw expressions_syntax_error(format("Document path exceeded {} nesting levels", depth_limit));
}
}
std::ostream& operator<<(std::ostream& os, const path& p) {
os << p.root();
for (const auto& op : p.operators()) {
std::visit(overloaded_functor {
[&] (const std::string& member) {
os << '.' << member;
},
[&] (unsigned index) {
os << '[' << index << ']';
}
}, op);
}
return os;
}
} // namespace parsed
// The following resolve_*() functions resolve references in parsed
@@ -151,22 +172,44 @@ void condition_expression::append(condition_expression&& a, char op) {
// we need to resolve the expression just once but then use it many times
// (once for each item to be filtered).
static void resolve_path(parsed::path& p,
static std::optional<std::string> resolve_path_component(const std::string& column_name,
const rjson::value* expression_attribute_names,
std::unordered_set<std::string>& used_attribute_names) {
const std::string& column_name = p.root();
if (column_name.size() > 0 && column_name.front() == '#') {
if (!expression_attribute_names) {
throw api_error("ValidationException",
throw api_error::validation(
format("ExpressionAttributeNames missing, entry '{}' required by expression", column_name));
}
const rjson::value* value = rjson::find(*expression_attribute_names, column_name);
if (!value || !value->IsString()) {
throw api_error("ValidationException",
throw api_error::validation(
format("ExpressionAttributeNames missing entry '{}' required by expression", column_name));
}
used_attribute_names.emplace(column_name);
p.set_root(std::string(rjson::to_string_view(*value)));
return std::string(rjson::to_string_view(*value));
}
return std::nullopt;
}
static void resolve_path(parsed::path& p,
const rjson::value* expression_attribute_names,
std::unordered_set<std::string>& used_attribute_names) {
std::optional<std::string> r = resolve_path_component(p.root(), expression_attribute_names, used_attribute_names);
if (r) {
p.set_root(std::move(*r));
}
for (auto& op : p.operators()) {
std::visit(overloaded_functor {
[&] (std::string& s) {
r = resolve_path_component(s, expression_attribute_names, used_attribute_names);
if (r) {
s = std::move(*r);
}
},
[&] (unsigned index) {
// nothing to resolve
}
}, op);
}
}
@@ -176,16 +219,16 @@ static void resolve_constant(parsed::constant& c,
std::visit(overloaded_functor {
[&] (const std::string& valref) {
if (!expression_attribute_values) {
throw api_error("ValidationException",
throw api_error::validation(
format("ExpressionAttributeValues missing, entry '{}' required by expression", valref));
}
const rjson::value* value = rjson::find(*expression_attribute_values, valref);
if (!value) {
throw api_error("ValidationException",
throw api_error::validation(
format("ExpressionAttributeValues missing entry '{}' required by expression", valref));
}
if (value->IsNull()) {
throw api_error("ValidationException",
throw api_error::validation(
format("ExpressionAttributeValues null value for entry '{}' required by expression", valref));
}
validate_value(*value, "ExpressionAttributeValues");
@@ -348,6 +391,39 @@ bool condition_expression_on(const parsed::condition_expression& ce, std::string
}, ce._expression);
}
// for_condition_expression_on() runs a given function over all the attributes
// mentioned in the expression. If the same attribute is mentioned more than
// once, the function will be called more than once for the same attribute.
static void for_value_on(const parsed::value& v, const noncopyable_function<void(std::string_view)>& func) {
std::visit(overloaded_functor {
[&] (const parsed::constant& c) { },
[&] (const parsed::value::function_call& f) {
for (const parsed::value& value : f._parameters) {
for_value_on(value, func);
}
},
[&] (const parsed::path& p) {
func(p.root());
}
}, v._value);
}
void for_condition_expression_on(const parsed::condition_expression& ce, const noncopyable_function<void(std::string_view)>& func) {
std::visit(overloaded_functor {
[&] (const parsed::primitive_condition& cond) {
for (const parsed::value& value : cond._values) {
for_value_on(value, func);
}
},
[&] (const parsed::condition_expression::condition_list& list) {
for (const parsed::condition_expression& cond : list.conditions) {
for_condition_expression_on(cond, func);
}
}
}, ce._expression);
}
// The following calculate_value() functions calculate, or evaluate, a parsed
// expression. The parsed expression is assumed to have been "resolved", with
// the matching resolve_* function.
@@ -359,7 +435,7 @@ static rjson::value list_concatenate(const rjson::value& v1, const rjson::value&
const rjson::value* list1 = unwrap_list(v1);
const rjson::value* list2 = unwrap_list(v2);
if (!list1 || !list2) {
throw api_error("ValidationException", "UpdateExpression: list_append() given a non-list");
throw api_error::validation("UpdateExpression: list_append() given a non-list");
}
rjson::value cat = rjson::copy(*list1);
for (const auto& a : list2->GetArray()) {
@@ -380,28 +456,28 @@ static rjson::value calculate_size(const rjson::value& v) {
// must come from the request itself, not from the database, so it makes
// sense to throw a ValidationException if we see such a problem.
if (!v.IsObject() || v.MemberCount() != 1) {
throw api_error("ValidationException", format("invalid object: {}", v));
throw api_error::validation(format("invalid object: {}", v));
}
auto it = v.MemberBegin();
int ret;
if (it->name == "S") {
if (!it->value.IsString()) {
throw api_error("ValidationException", format("invalid string: {}", v));
throw api_error::validation(format("invalid string: {}", v));
}
ret = it->value.GetStringLength();
} else if (it->name == "NS" || it->name == "SS" || it->name == "BS" || it->name == "L") {
if (!it->value.IsArray()) {
throw api_error("ValidationException", format("invalid set: {}", v));
throw api_error::validation(format("invalid set: {}", v));
}
ret = it->value.Size();
} else if (it->name == "M") {
if (!it->value.IsObject()) {
throw api_error("ValidationException", format("invalid map: {}", v));
throw api_error::validation(format("invalid map: {}", v));
}
ret = it->value.MemberCount();
} else if (it->name == "B") {
if (!it->value.IsString()) {
throw api_error("ValidationException", format("invalid byte string: {}", v));
throw api_error::validation(format("invalid byte string: {}", v));
}
ret = base64_decoded_len(rjson::to_string_view(it->value));
} else {
@@ -445,11 +521,11 @@ static const
std::unordered_map<std::string_view, function_handler_type*> function_handlers {
{"list_append", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::UpdateExpression) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: list_append() not allowed here", caller));
}
if (f._parameters.size() != 2) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: list_append() accepts 2 parameters, got {}", caller, f._parameters.size()));
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
@@ -459,15 +535,15 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
},
{"if_not_exists", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::UpdateExpression) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: if_not_exists() not allowed here", caller));
}
if (f._parameters.size() != 2) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: if_not_exists() accepts 2 parameters, got {}", caller, f._parameters.size()));
}
if (!std::holds_alternative<parsed::path>(f._parameters[0]._value)) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: if_not_exists() must include path as its first argument", caller));
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
@@ -477,11 +553,11 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
},
{"size", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::ConditionExpression) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: size() not allowed here", caller));
}
if (f._parameters.size() != 1) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: size() accepts 1 parameter, got {}", caller, f._parameters.size()));
}
rjson::value v = calculate_value(f._parameters[0], caller, previous_item);
@@ -490,15 +566,15 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
},
{"attribute_exists", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::ConditionExpressionAlone) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: attribute_exists() not allowed here", caller));
}
if (f._parameters.size() != 1) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: attribute_exists() accepts 1 parameter, got {}", caller, f._parameters.size()));
}
if (!std::holds_alternative<parsed::path>(f._parameters[0]._value)) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: attribute_exists()'s parameter must be a path", caller));
}
rjson::value v = calculate_value(f._parameters[0], caller, previous_item);
@@ -507,15 +583,15 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
},
{"attribute_not_exists", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::ConditionExpressionAlone) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: attribute_not_exists() not allowed here", caller));
}
if (f._parameters.size() != 1) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: attribute_not_exists() accepts 1 parameter, got {}", caller, f._parameters.size()));
}
if (!std::holds_alternative<parsed::path>(f._parameters[0]._value)) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: attribute_not_exists()'s parameter must be a path", caller));
}
rjson::value v = calculate_value(f._parameters[0], caller, previous_item);
@@ -524,18 +600,18 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
},
{"attribute_type", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::ConditionExpressionAlone) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: attribute_type() not allowed here", caller));
}
if (f._parameters.size() != 2) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: attribute_type() accepts 2 parameters, got {}", caller, f._parameters.size()));
}
// There is no real reason for the following check (not
// allowing the type to come from a document attribute), but
// DynamoDB does this check, so we do too...
if (!f._parameters[1].is_constant()) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: attribute_types()'s first parameter must be an expression attribute", caller));
}
rjson::value v0 = calculate_value(f._parameters[0], caller, previous_item);
@@ -544,7 +620,7 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
// If the type parameter is not one of the legal types
// we should generate an error, not a failed condition:
if (!known_type(rjson::to_string_view(v1.MemberBegin()->value))) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: attribute_types()'s second parameter, {}, is not a known type",
caller, v1.MemberBegin()->value));
}
@@ -554,77 +630,33 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
return to_bool_json(false);
}
} else {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: attribute_type() second parameter must refer to a string, got {}", caller, v1));
}
}
},
{"begins_with", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::ConditionExpressionAlone) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: begins_with() not allowed here", caller));
}
if (f._parameters.size() != 2) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: begins_with() accepts 2 parameters, got {}", caller, f._parameters.size()));
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
rjson::value v2 = calculate_value(f._parameters[1], caller, previous_item);
// TODO: There's duplication here with check_BEGINS_WITH().
// But unfortunately, the two functions differ a bit.
// If one of v1 or v2 is malformed or has an unsupported type
// (not B or S), what we do depends on whether it came from
// the user's query (is_constant()), or the item. Unsupported
// values in the query result in an error, but if they are in
// the item, we silently return false (no match).
bool bad = false;
if (!v1.IsObject() || v1.MemberCount() != 1) {
bad = true;
if (f._parameters[0].is_constant()) {
throw api_error("ValidationException", format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v1));
}
} else if (v1.MemberBegin()->name != "S" && v1.MemberBegin()->name != "B") {
bad = true;
if (f._parameters[0].is_constant()) {
throw api_error("ValidationException", format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v1));
}
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
bad = true;
if (f._parameters[1].is_constant()) {
throw api_error("ValidationException", format("{}: begins_with() encountered malformed AttributeValue: {}", caller, v2));
}
} else if (v2.MemberBegin()->name != "S" && v2.MemberBegin()->name != "B") {
bad = true;
if (f._parameters[1].is_constant()) {
throw api_error("ValidationException", format("{}: begins_with() supports only string or binary in AttributeValue: {}", caller, v2));
}
}
bool ret = false;
if (!bad) {
auto it1 = v1.MemberBegin();
auto it2 = v2.MemberBegin();
if (it1->name == it2->name) {
if (it2->name == "S") {
std::string_view val1 = rjson::to_string_view(it1->value);
std::string_view val2 = rjson::to_string_view(it2->value);
ret = val1.starts_with(val2);
} else /* it2->name == "B" */ {
ret = base64_begins_with(rjson::to_string_view(it1->value), rjson::to_string_view(it2->value));
}
}
}
return to_bool_json(ret);
return to_bool_json(check_BEGINS_WITH(v1.IsNull() ? nullptr : &v1, v2,
f._parameters[0].is_constant(), f._parameters[1].is_constant()));
}
},
{"contains", [] (calculate_value_caller caller, const rjson::value* previous_item, const parsed::value::function_call& f) {
if (caller != calculate_value_caller::ConditionExpressionAlone) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: contains() not allowed here", caller));
}
if (f._parameters.size() != 2) {
throw api_error("ValidationException",
throw api_error::validation(
format("{}: contains() accepts 2 parameters, got {}", caller, f._parameters.size()));
}
rjson::value v1 = calculate_value(f._parameters[0], caller, previous_item);
@@ -634,6 +666,55 @@ std::unordered_map<std::string_view, function_handler_type*> function_handlers {
},
};
// Given a parsed::path and an item read from the table, extract the value
// of a certain attribute path, such as "a" or "a.b.c[3]". Returns a null
// value if the item or the requested attribute does not exist.
// Note that the item is assumed to be encoded in JSON using DynamoDB
// conventions - each level of a nested document is a map with one key -
// a type (e.g., "M" for map) - and its value is the representation of
// that value.
static rjson::value extract_path(const rjson::value* item,
const parsed::path& p, calculate_value_caller caller) {
if (!item) {
return rjson::null_value();
}
const rjson::value* v = rjson::find(*item, p.root());
if (!v) {
return rjson::null_value();
}
for (const auto& op : p.operators()) {
if (!v->IsObject() || v->MemberCount() != 1) {
// This shouldn't happen. We shouldn't have stored malformed
// objects. But today Alternator does not validate the structure
// of nested documents before storing them, so this can happen on
// read.
throw api_error::validation(format("{}: malformed item read: {}", caller, *item));
}
const char* type = v->MemberBegin()->name.GetString();
v = &(v->MemberBegin()->value);
std::visit(overloaded_functor {
[&] (const std::string& member) {
if (type[0] == 'M' && v->IsObject()) {
v = rjson::find(*v, member);
} else {
v = nullptr;
}
},
[&] (unsigned index) {
if (type[0] == 'L' && v->IsArray() && index < v->Size()) {
v = &(v->GetArray()[index]);
} else {
v = nullptr;
}
}
}, op);
if (!v) {
return rjson::null_value();
}
}
return rjson::copy(*v);
}
// Given a parsed::value, which can refer either to a constant value from
// ExpressionAttributeValues, to the value of some attribute, or to a function
// of other values, this function calculates the resulting value.
@@ -650,22 +731,13 @@ rjson::value calculate_value(const parsed::value& v,
[&] (const parsed::value::function_call& f) -> rjson::value {
auto function_it = function_handlers.find(std::string_view(f._function_name));
if (function_it == function_handlers.end()) {
throw api_error("ValidationException",
format("UpdateExpression: unknown function '{}' called.", f._function_name));
throw api_error::validation(
format("{}: unknown function '{}' called.", caller, f._function_name));
}
return function_it->second(caller, previous_item, f);
},
[&] (const parsed::path& p) -> rjson::value {
if (!previous_item) {
return rjson::null_value();
}
std::string update_path = p.root();
if (p.has_operators()) {
// FIXME: support this
throw api_error("ValidationException", "Reading attribute paths not yet implemented");
}
const rjson::value* previous_value = rjson::find(*previous_item, update_path);
return previous_value ? rjson::copy(*previous_value) : rjson::null_value();
return extract_path(previous_item, p, caller);
}
}, v._value);
}
@@ -674,7 +746,7 @@ rjson::value calculate_value(const parsed::value& v,
// either a single value, or v1+v2 or v1-v2.
rjson::value calculate_value(const parsed::set_rhs& rhs,
const rjson::value* previous_item) {
switch(rhs._op) {
switch (rhs._op) {
case 'v':
return calculate_value(rhs._v1, calculate_value_caller::UpdateExpression, previous_item);
case '+': {
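
The path walk in the hunk above applies each operator to the current value and gives up with a null result as soon as a step does not fit. A minimal sketch of the same idea, assuming a toy `node` type in place of `rjson::value` (the names `node`, `path_operator`, `overloaded`, and `walk` are hypothetical and not part of Alternator):

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Toy stand-in for a nested document. This is NOT rjson::value.
struct node {
    char type = 'S';             // 'M' = map, 'L' = list, 'S' = scalar
    std::string name;            // key of this node inside its parent map, if any
    std::string scalar;          // payload when type == 'S'
    std::vector<node> children;  // members (type 'M') or elements (type 'L')
};

// One step of a path: ".member" or "[index]".
using path_operator = std::variant<std::string, unsigned>;

// Helper that merges several lambdas into a single visitor.
template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

// Follows ops starting at root. Returns nullptr as soon as a step does not
// apply (wrong container type, missing member, index out of range).
const node* walk(const node& root, const std::vector<path_operator>& ops) {
    const node* v = &root;
    for (const auto& op : ops) {
        v = std::visit(overloaded {
            [v] (const std::string& member) -> const node* {
                if (v->type != 'M') {
                    return nullptr;
                }
                for (const node& child : v->children) {
                    if (child.name == member) {
                        return &child;
                    }
                }
                return nullptr;
            },
            [v] (unsigned index) -> const node* {
                if (v->type != 'L' || index >= v->children.size()) {
                    return nullptr;
                }
                return &v->children[index];
            }
        }, op);
        if (!v) {
            return nullptr;
        }
    }
    return v;
}
```

As in the hunk, a type mismatch (for example an index applied to a map) is not treated as an error; it simply yields "no value".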


@@ -27,8 +27,10 @@
#include <unordered_set>
#include <string_view>
#include <seastar/util/noncopyable_function.hh>
#include "expressions_types.hh"
#include "rjson.hh"
#include "utils/rjson.hh"
namespace alternator {
@@ -59,6 +61,11 @@ void validate_value(const rjson::value& v, const char* caller);
bool condition_expression_on(const parsed::condition_expression& ce, std::string_view attribute);
// for_condition_expression_on() runs the given function on the attributes
// that the expression uses. It may run for the same attribute more than once
// if the same attribute is used more than once in the expression.
void for_condition_expression_on(const parsed::condition_expression& ce, const noncopyable_function<void(std::string_view)>& func);
// calculate_value() behaves slightly differently (especially, different
// functions supported) when used in different types of expressions, as
// enumerated in this enum:


@@ -27,7 +27,7 @@
#include <seastar/core/shared_ptr.hh>
#include "rjson.hh"
#include "utils/rjson.hh"
/*
* Parsed representation of expressions and their components.
@@ -49,15 +49,23 @@ class path {
// dot (e.g., ".xyz").
std::string _root;
std::vector<std::variant<std::string, unsigned>> _operators;
// It is useful to limit the depth of a user-specified path, because it
// allows us to use recursive algorithms without worrying about recursion
// depth. DynamoDB officially limits the length of paths to 32 components
// (including the root) so let's use the same limit.
static constexpr unsigned depth_limit = 32;
void check_depth_limit();
public:
void set_root(std::string root) {
_root = std::move(root);
}
void add_index(unsigned i) {
_operators.emplace_back(i);
check_depth_limit();
}
void add_dot(std::string(name)) {
_operators.emplace_back(std::move(name));
check_depth_limit();
}
const std::string& root() const {
return _root;
@@ -65,6 +73,13 @@ public:
bool has_operators() const {
return !_operators.empty();
}
const std::vector<std::variant<std::string, unsigned>>& operators() const {
return _operators;
}
std::vector<std::variant<std::string, unsigned>>& operators() {
return _operators;
}
friend std::ostream& operator<<(std::ostream&, const path&);
};
// When an expression is first parsed, all constants are references, like
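
The hunk above only declares `check_depth_limit()`; its body is not shown here. A minimal sketch of how such a check could look, assuming the root counts as one component and that exceeding the limit throws (the class name `demo_path`, the `depth()` accessor, and the use of `std::length_error` are assumptions made for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>
#include <vector>

class demo_path {
    std::string _root;
    std::vector<std::variant<std::string, unsigned>> _operators;
    static constexpr unsigned depth_limit = 32;
    // Assumed implementation: the root plus every operator is one component.
    void check_depth_limit() const {
        if (1 + _operators.size() > depth_limit) {
            throw std::length_error("path exceeds " +
                    std::to_string(depth_limit) + " components");
        }
    }
public:
    void set_root(std::string root) {
        _root = std::move(root);
    }
    void add_index(unsigned i) {
        _operators.emplace_back(i);
        check_depth_limit();
    }
    void add_dot(std::string name) {
        _operators.emplace_back(std::move(name));
        check_depth_limit();
    }
    // Number of components, including the root.
    std::size_t depth() const {
        return 1 + _operators.size();
    }
};
```

Checking on every `add_index()` and `add_dot()` call means an over-long path is rejected while it is being built, before any recursive algorithm sees it.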


@@ -1,300 +0,0 @@
/*
* Copyright 2019 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "rjson.hh"
#include "error.hh"
#include <seastar/core/print.hh>
#include <seastar/core/thread.hh>
namespace rjson {
static allocator the_allocator;
/*
* This wrapper class adds nested level checks to rapidjson's handlers.
* Each rapidjson handler implements functions for accepting JSON values,
* which includes strings, numbers, objects, arrays, etc.
* Parsing objects and arrays needs to be performed carefully with regard
* to stack overflow - each object/array layer adds another stack frame
* to parsing, printing and destroying the parent JSON document.
* To prevent stack overflow, a rapidjson handler can be wrapped with
* guarded_yieldable_json_handler, which accepts an additional max_nested_level parameter.
* After trying to exceed the max nested level, a proper rjson::error will be thrown.
*/
template<typename Handler, bool EnableYield>
struct guarded_yieldable_json_handler : public Handler {
size_t _nested_level = 0;
size_t _max_nested_level;
public:
using handler_base = Handler;
explicit guarded_yieldable_json_handler(size_t max_nested_level) : _max_nested_level(max_nested_level) {}
guarded_yieldable_json_handler(string_buffer& buf, size_t max_nested_level)
: handler_base(buf), _max_nested_level(max_nested_level) {}
void Parse(const char* str, size_t length) {
rapidjson::MemoryStream ms(static_cast<const char*>(str), length * sizeof(typename encoding::Ch));
rapidjson::EncodedInputStream<encoding, rapidjson::MemoryStream> is(ms);
rapidjson::GenericReader<encoding, encoding, allocator> reader(&the_allocator);
reader.Parse(is, *this);
if (reader.HasParseError()) {
throw rjson::error(format("Parsing JSON failed: {}", rapidjson::GetParseError_En(reader.GetParseErrorCode())));
}
//NOTICE: The handler has parsed the string, but in case of rapidjson::GenericDocument
// the data now resides in an internal stack_ variable, which is private instead of
// protected... which means we cannot simply access its data. Fortunately, another
// function for populating documents from SAX events can be abused to extract the data
// from the stack via gadget-oriented programming - we use an empty event generator
// which does nothing, and use it to call Populate(), which assumes that the generator
// will fill the stack with something. It won't, but our stack is already filled with
// data we want to steal, so once Populate() ends, our document will be properly parsed.
// A proper solution could be programmed once rapidjson declares this stack_ variable
// as protected instead of private, so that this class can access it.
auto dummy_generator = [](handler_base&){return true;};
handler_base::Populate(dummy_generator);
}
bool StartObject() {
++_nested_level;
check_nested_level();
maybe_yield();
return handler_base::StartObject();
}
bool EndObject(rapidjson::SizeType elements_count = 0) {
--_nested_level;
return handler_base::EndObject(elements_count);
}
bool StartArray() {
++_nested_level;
check_nested_level();
maybe_yield();
return handler_base::StartArray();
}
bool EndArray(rapidjson::SizeType elements_count = 0) {
--_nested_level;
return handler_base::EndArray(elements_count);
}
bool Null() { maybe_yield(); return handler_base::Null(); }
bool Bool(bool b) { maybe_yield(); return handler_base::Bool(b); }
bool Int(int i) { maybe_yield(); return handler_base::Int(i); }
bool Uint(unsigned u) { maybe_yield(); return handler_base::Uint(u); }
bool Int64(int64_t i64) { maybe_yield(); return handler_base::Int64(i64); }
bool Uint64(uint64_t u64) { maybe_yield(); return handler_base::Uint64(u64); }
bool Double(double d) { maybe_yield(); return handler_base::Double(d); }
bool String(const value::Ch* str, size_t length, bool copy = false) { maybe_yield(); return handler_base::String(str, length, copy); }
bool Key(const value::Ch* str, size_t length, bool copy = false) { maybe_yield(); return handler_base::Key(str, length, copy); }
protected:
static void maybe_yield() {
if constexpr (EnableYield) {
thread::maybe_yield();
}
}
void check_nested_level() const {
if (RAPIDJSON_UNLIKELY(_nested_level > _max_nested_level)) {
throw rjson::error(format("Max nested level reached: {}", _max_nested_level));
}
}
};
std::string print(const rjson::value& value) {
string_buffer buffer;
guarded_yieldable_json_handler<writer, false> writer(buffer, 78);
value.Accept(writer);
return std::string(buffer.GetString());
}
rjson::value copy(const rjson::value& value) {
return rjson::value(value, the_allocator);
}
rjson::value parse(std::string_view str) {
guarded_yieldable_json_handler<document, false> d(78);
d.Parse(str.data(), str.size());
if (d.HasParseError()) {
throw rjson::error(format("Parsing JSON failed: {}", GetParseError_En(d.GetParseError())));
}
rjson::value& v = d;
return std::move(v);
}
rjson::value parse_yieldable(std::string_view str) {
guarded_yieldable_json_handler<document, true> d(78);
d.Parse(str.data(), str.size());
if (d.HasParseError()) {
throw rjson::error(format("Parsing JSON failed: {}", GetParseError_En(d.GetParseError())));
}
rjson::value& v = d;
return std::move(v);
}
rjson::value& get(rjson::value& value, std::string_view name) {
// Although FindMember() has a variant taking a StringRef, it ignores the
// given length (see https://github.com/Tencent/rapidjson/issues/1649).
// Luckily, the variant taking a GenericValue doesn't share this bug,
// and we can create a string GenericValue without copying the string.
auto member_it = value.FindMember(rjson::value(name.data(), name.size()));
if (member_it != value.MemberEnd())
return member_it->value;
else {
throw rjson::error(format("JSON parameter {} not found", name));
}
}
const rjson::value& get(const rjson::value& value, std::string_view name) {
auto member_it = value.FindMember(rjson::value(name.data(), name.size()));
if (member_it != value.MemberEnd())
return member_it->value;
else {
throw rjson::error(format("JSON parameter {} not found", name));
}
}
rjson::value from_string(const std::string& str) {
return rjson::value(str.c_str(), str.size(), the_allocator);
}
rjson::value from_string(const sstring& str) {
return rjson::value(str.c_str(), str.size(), the_allocator);
}
rjson::value from_string(const char* str, size_t size) {
return rjson::value(str, size, the_allocator);
}
rjson::value from_string(std::string_view view) {
return rjson::value(view.data(), view.size(), the_allocator);
}
const rjson::value* find(const rjson::value& value, std::string_view name) {
// Although FindMember() has a variant taking a StringRef, it ignores the
// given length (see https://github.com/Tencent/rapidjson/issues/1649).
// Luckily, the variant taking a GenericValue doesn't share this bug,
// and we can create a string GenericValue without copying the string.
auto member_it = value.FindMember(rjson::value(name.data(), name.size()));
return member_it != value.MemberEnd() ? &member_it->value : nullptr;
}
rjson::value* find(rjson::value& value, std::string_view name) {
auto member_it = value.FindMember(rjson::value(name.data(), name.size()));
return member_it != value.MemberEnd() ? &member_it->value : nullptr;
}
bool remove_member(rjson::value& value, std::string_view name) {
// Although RemoveMember() has a variant taking a StringRef, it ignores the
// given length (see https://github.com/Tencent/rapidjson/issues/1649).
// Luckily, the variant taking a GenericValue doesn't share this bug,
// and we can create a string GenericValue without copying the string.
return value.RemoveMember(rjson::value(name.data(), name.size()));
}
void set_with_string_name(rjson::value& base, const std::string& name, rjson::value&& member) {
base.AddMember(rjson::value(name.c_str(), name.size(), the_allocator), std::move(member), the_allocator);
}
void set_with_string_name(rjson::value& base, std::string_view name, rjson::value&& member) {
base.AddMember(rjson::value(name.data(), name.size(), the_allocator), std::move(member), the_allocator);
}
void set_with_string_name(rjson::value& base, const std::string& name, rjson::string_ref_type member) {
base.AddMember(rjson::value(name.c_str(), name.size(), the_allocator), rjson::value(member), the_allocator);
}
void set_with_string_name(rjson::value& base, std::string_view name, rjson::string_ref_type member) {
base.AddMember(rjson::value(name.data(), name.size(), the_allocator), rjson::value(member), the_allocator);
}
void set(rjson::value& base, rjson::string_ref_type name, rjson::value&& member) {
base.AddMember(name, std::move(member), the_allocator);
}
void set(rjson::value& base, rjson::string_ref_type name, rjson::string_ref_type member) {
base.AddMember(name, rjson::value(member), the_allocator);
}
void push_back(rjson::value& base_array, rjson::value&& item) {
base_array.PushBack(std::move(item), the_allocator);
}
bool single_value_comp::operator()(const rjson::value& r1, const rjson::value& r2) const {
auto r1_type = r1.GetType();
auto r2_type = r2.GetType();
// null is the smallest type and compares with every other type, nothing is lesser than null
if (r1_type == rjson::type::kNullType || r2_type == rjson::type::kNullType) {
return r1_type < r2_type;
}
// only null, true, and false are comparable with each other, other types are not compatible
if (r1_type != r2_type) {
if (r1_type > rjson::type::kTrueType || r2_type > rjson::type::kTrueType) {
throw rjson::error(format("Types are not comparable: {} {}", r1, r2));
}
}
switch (r1_type) {
case rjson::type::kNullType:
// fall-through
case rjson::type::kFalseType:
// fall-through
case rjson::type::kTrueType:
return r1_type < r2_type;
case rjson::type::kObjectType:
throw rjson::error("Object type comparison is not supported");
case rjson::type::kArrayType:
throw rjson::error("Array type comparison is not supported");
case rjson::type::kStringType: {
const size_t r1_len = r1.GetStringLength();
const size_t r2_len = r2.GetStringLength();
size_t len = std::min(r1_len, r2_len);
int result = std::strncmp(r1.GetString(), r2.GetString(), len);
return result < 0 || (result == 0 && r1_len < r2_len);
}
case rjson::type::kNumberType: {
if (r1.IsInt() && r2.IsInt()) {
return r1.GetInt() < r2.GetInt();
} else if (r1.IsUint() && r2.IsUint()) {
return r1.GetUint() < r2.GetUint();
} else if (r1.IsInt64() && r2.IsInt64()) {
return r1.GetInt64() < r2.GetInt64();
} else if (r1.IsUint64() && r2.IsUint64()) {
return r1.GetUint64() < r2.GetUint64();
} else {
// it's safe to call GetDouble() on any number type
return r1.GetDouble() < r2.GetDouble();
}
}
default:
return false;
}
}
} // end namespace rjson
std::ostream& std::operator<<(std::ostream& os, const rjson::value& v) {
return os << rjson::print(v);
}
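
The nesting guard described in the file above can be illustrated without rapidjson. The sketch below assumes a toy handler (`counting_handler`) and a bracket-string driver (`feed_brackets`), both hypothetical. Only the wrapping pattern, which counts levels in the Start/End callbacks and throws past the limit, mirrors the original.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical minimal handler standing in for a rapidjson handler.
struct counting_handler {
    std::size_t events = 0;
    bool StartObject() { ++events; return true; }
    bool EndObject() { ++events; return true; }
    bool StartArray() { ++events; return true; }
    bool EndArray() { ++events; return true; }
};

// Wraps any handler and rejects input nested deeper than max_nested_level.
template <typename Handler>
struct depth_guard : public Handler {
    std::size_t _nested_level = 0;
    std::size_t _max_nested_level;
    explicit depth_guard(std::size_t max_nested_level)
        : _max_nested_level(max_nested_level) {}
    bool StartObject() { enter(); return Handler::StartObject(); }
    bool EndObject() { --_nested_level; return Handler::EndObject(); }
    bool StartArray() { enter(); return Handler::StartArray(); }
    bool EndArray() { --_nested_level; return Handler::EndArray(); }
private:
    void enter() {
        if (++_nested_level > _max_nested_level) {
            throw std::runtime_error("Max nested level reached: " +
                    std::to_string(_max_nested_level));
        }
    }
};

// Drives a handler from a string of brackets, e.g. "{[]}".
// Characters other than brackets are ignored.
template <typename Handler>
void feed_brackets(Handler& h, const std::string& s) {
    for (char c : s) {
        switch (c) {
        case '{': h.StartObject(); break;
        case '}': h.EndObject(); break;
        case '[': h.StartArray(); break;
        case ']': h.EndArray(); break;
        default: break;
        }
    }
}
```

Because the guard derives from the wrapped handler and forwards every call, it can be applied both to a parser-side handler and to a writer, which is how the file above uses it for `parse()` and `print()`.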


@@ -1,177 +0,0 @@
/*
* Copyright 2019 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
/*
* rjson is a wrapper over rapidjson library, providing fast JSON parsing and generation.
*
* rapidjson has strict copy elision policies, which, among other things, involves
* using provided char arrays without copying them and allows copying objects only explicitly.
* As such, one should be careful when passing strings with limited liveness
* (e.g. data underneath local std::strings) to rjson functions, because created JSON objects
* may end up relying on dangling char pointers. All rjson functions that create JSON values from strings
* have APIs both for string_ref_type (more optimal, used when the string is known to live
* at least as long as the object, e.g. a static char array) and for std::strings. The more optimal
* variants should be used *only* if the liveness of the string is guaranteed, otherwise it will
* result in undefined behaviour.
* Also, bear in mind that methods exposed by rjson::value are generic, but some of them
* work fine only for specific types. In case the type does not match, an rjson::error will be thrown.
* Examples of such mismatched usage are calling MemberCount() on a JSON value that is not an object
* or calling Size() on a non-array value.
*/
#include <string>
#include <stdexcept>
namespace rjson {
class error : public std::exception {
std::string _msg;
public:
error() = default;
error(const std::string& msg) : _msg(msg) {}
virtual const char* what() const noexcept override { return _msg.c_str(); }
};
}
// rapidjson configuration macros
#define RAPIDJSON_HAS_STDSTRING 1
// Default rjson policy is to use assert() - which is dangerous for two reasons:
// 1. assert() can be turned off with -DNDEBUG
// 2. assert() crashes a program
// Fortunately, the default policy can be overridden, and so rapidjson errors will
// throw an rjson::error exception instead.
#define RAPIDJSON_ASSERT(x) do { if (!(x)) throw rjson::error(std::string("JSON error: condition not met: ") + #x); } while (0)
#include <rapidjson/document.h>
#include <rapidjson/writer.h>
#include <rapidjson/stringbuffer.h>
#include <rapidjson/error/en.h>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
namespace rjson {
using allocator = rapidjson::CrtAllocator;
using encoding = rapidjson::UTF8<>;
using document = rapidjson::GenericDocument<encoding, allocator>;
using value = rapidjson::GenericValue<encoding, allocator>;
using string_ref_type = value::StringRefType;
using string_buffer = rapidjson::GenericStringBuffer<encoding>;
using writer = rapidjson::Writer<string_buffer, encoding>;
using type = rapidjson::Type;
// Returns an object representing JSON's null
inline rjson::value null_value() {
return rjson::value(rapidjson::kNullType);
}
// Returns an empty JSON object - {}
inline rjson::value empty_object() {
return rjson::value(rapidjson::kObjectType);
}
// Returns an empty JSON array - []
inline rjson::value empty_array() {
return rjson::value(rapidjson::kArrayType);
}
// Returns an empty JSON string - ""
inline rjson::value empty_string() {
return rjson::value(rapidjson::kStringType);
}
// Convert the JSON value to a string with JSON syntax, the opposite of parse().
// The representation is dense - without any redundant indentation.
std::string print(const rjson::value& value);
// Returns a string_view to the string held in a JSON value (which is
// assumed to hold a string, i.e., v.IsString() == true). This is a view
// to the existing data - no copying is done.
inline std::string_view to_string_view(const rjson::value& v) {
return std::string_view(v.GetString(), v.GetStringLength());
}
// Copies given JSON value - involves allocation
rjson::value copy(const rjson::value& value);
// Parses a JSON value from given string or raw character array.
// The string/char array liveness does not need to be persisted,
// as parse() will allocate member names and values.
// Throws rjson::error if parsing failed.
rjson::value parse(std::string_view str);
// Needs to be run in thread context
rjson::value parse_yieldable(std::string_view str);
// Creates a JSON value (of JSON string type) out of internal string representations.
// The string value is copied, so str's liveness does not need to be persisted.
rjson::value from_string(const std::string& str);
rjson::value from_string(const sstring& str);
rjson::value from_string(const char* str, size_t size);
rjson::value from_string(std::string_view view);
// Returns a pointer to JSON member if it exists, nullptr otherwise
rjson::value* find(rjson::value& value, std::string_view name);
const rjson::value* find(const rjson::value& value, std::string_view name);
// Returns a reference to JSON member if it exists, throws otherwise
rjson::value& get(rjson::value& value, std::string_view name);
const rjson::value& get(const rjson::value& value, std::string_view name);
// Sets a member in given JSON object by moving the member - allocates the name.
// Throws if base is not a JSON object.
void set_with_string_name(rjson::value& base, const std::string& name, rjson::value&& member);
void set_with_string_name(rjson::value& base, std::string_view name, rjson::value&& member);
// Sets a string member in given JSON object by assigning its reference - allocates the name.
// NOTICE: member string liveness must be ensured to be at least as long as base's.
// Throws if base is not a JSON object.
void set_with_string_name(rjson::value& base, const std::string& name, rjson::string_ref_type member);
void set_with_string_name(rjson::value& base, std::string_view name, rjson::string_ref_type member);
// Sets a member in given JSON object by moving the member.
// NOTICE: name liveness must be ensured to be at least as long as base's.
// Throws if base is not a JSON object.
void set(rjson::value& base, rjson::string_ref_type name, rjson::value&& member);
// Sets a string member in given JSON object by assigning its reference.
// NOTICE: name liveness must be ensured to be at least as long as base's.
// NOTICE: member liveness must be ensured to be at least as long as base's.
// Throws if base is not a JSON object.
void set(rjson::value& base, rjson::string_ref_type name, rjson::string_ref_type member);
// Adds a value to a JSON list by moving the item to its end.
// Throws if base_array is not a JSON array.
void push_back(rjson::value& base_array, rjson::value&& item);
// Remove a member from a JSON object. Throws if value isn't an object.
bool remove_member(rjson::value& value, std::string_view name);
struct single_value_comp {
bool operator()(const rjson::value& r1, const rjson::value& r2) const;
};
} // end namespace rjson
namespace std {
std::ostream& operator<<(std::ostream& os, const rjson::value& v);
}
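
The header above replaces rapidjson's default `assert()` with a macro that throws. The same pattern in isolation, as a sketch: `DEMO_ASSERT`, `demo_error`, and `checked_at` are hypothetical names used only to show that a failed condition becomes a catchable exception carrying the stringified condition.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical error type, playing the role of rjson::error.
struct demo_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Throws instead of aborting, and stays active even when NDEBUG is defined.
#define DEMO_ASSERT(x) do { if (!(x)) throw demo_error(std::string("condition not met: ") + #x); } while (0)

// Example of a library-style accessor protected by the macro.
int checked_at(const int* data, int size, int index) {
    DEMO_ASSERT(index >= 0 && index < size);
    return data[index];
}
```

The `do { ... } while (0)` wrapper makes the macro behave like a single statement, so it can be used safely inside an unbraced `if`/`else`.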


@@ -24,7 +24,7 @@
#include "seastarx.hh"
#include "service/storage_proxy.hh"
#include "service/storage_proxy.hh"
#include "rjson.hh"
#include "utils/rjson.hh"
#include "executor.hh"
namespace alternator {
@@ -87,7 +87,11 @@ protected:
// When _returnvalues != NONE, apply() should store here, in JSON form,
// the values which are to be returned in the "Attributes" field.
// The default null JSON means do not return an Attributes field at all.
rjson::value _return_attributes;
// This field is marked "mutable" so that the const apply() can modify
// it (see explanation below), but note that because apply() may be
// called more than once, if apply() will sometimes set this field it
// must set it (even if just to the default empty value) every time.
mutable rjson::value _return_attributes;
public:
// The constructor of a rmw_operation subclass should parse the request
// and try to discover as many input errors as it can before really
@@ -100,7 +104,12 @@ public:
// conditional expression, apply() should return an empty optional.
// apply() may throw if it encounters input errors not discovered during
// the constructor.
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) = 0;
// apply() may be called more than once in case of contention, so it must
// not change the state saved in the object (issue #7218 was caused by
// violating this). We mark apply() "const" to let the compiler validate
// this for us. The output-only field _return_attributes is marked
// "mutable" above so that apply() can still write to it.
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const = 0;
// Convert the above apply() into the signature needed by cas_request:
virtual std::optional<mutation> apply(foreign_ptr<lw_shared_ptr<query::result>> qr, const query::partition_slice& slice, api::timestamp_type ts) override;
virtual ~rmw_operation() = default;
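
The comments above describe a `const` `apply()` that may run more than once and writes only to a `mutable` output field. A minimal sketch of that contract, using strings instead of `rjson::value` and `mutation` (the class `demo_operation` and its members are hypothetical):

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

class demo_operation {
    std::string _new_value;                  // parsed once, in the constructor
    mutable std::string _return_attributes;  // output-only, rewritten by apply()
public:
    explicit demo_operation(std::string new_value)
        : _new_value(std::move(new_value)) {}

    // May be called several times (e.g. on contention), so it must not
    // depend on state left behind by an earlier call.
    std::optional<std::string> apply(const std::string* previous_item) const {
        // Reset the output on every call, even when there is nothing to
        // return, so that a retry never sees stale data.
        _return_attributes.clear();
        if (previous_item) {
            _return_attributes = *previous_item;
        }
        return _new_value;
    }

    const std::string& return_attributes() const {
        return _return_attributes;
    }
};
```

With `apply()` declared `const`, an accidental write to any other member fails to compile, which is the check the comment in the hunk relies on.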


@@ -65,7 +65,7 @@ struct from_json_visitor {
void operator()(const reversed_type_impl& t) const { visit(*t.underlying_type(), from_json_visitor{v, bo}); };
void operator()(const string_type_impl& t) {
bo.write(t.from_string(sstring_view(v.GetString(), v.GetStringLength())));
bo.write(t.from_string(rjson::to_string_view(v)));
}
void operator()(const bytes_type_impl& t) const {
bo.write(base64_decode(v));
@@ -74,23 +74,27 @@ struct from_json_visitor {
bo.write(boolean_type->decompose(v.GetBool()));
}
void operator()(const decimal_type_impl& t) const {
bo.write(t.from_string(sstring_view(v.GetString(), v.GetStringLength())));
try {
bo.write(t.from_string(rjson::to_string_view(v)));
} catch (const marshal_exception& e) {
throw api_error::validation(format("The parameter cannot be converted to a numeric value: {}", v));
}
}
// default
void operator()(const abstract_type& t) const {
bo.write(from_json_object(t, Json::Value(rjson::print(v)), cql_serialization_format::internal()));
bo.write(from_json_object(t, v, cql_serialization_format::internal()));
}
};
bytes serialize_item(const rjson::value& item) {
if (item.IsNull() || item.MemberCount() != 1) {
throw api_error("ValidationException", format("An item can contain only one attribute definition: {}", item));
throw api_error::validation(format("An item can contain only one attribute definition: {}", item));
}
auto it = item.MemberBegin();
type_info type_info = type_info_from_string(rjson::to_string_view(it->name)); // JSON keys are guaranteed to be strings
if (type_info.atype == alternator_type::NOT_SUPPORTED_YET) {
slogger.trace("Non-optimal serialization of type {}", it->name.GetString());
slogger.trace("Non-optimal serialization of type {}", it->name);
return bytes{int8_t(type_info.atype)} + to_bytes(rjson::print(item));
}
@@ -128,7 +132,7 @@ struct to_json_visitor {
rjson::value deserialize_item(bytes_view bv) {
rjson::value deserialized(rapidjson::kObjectType);
if (bv.empty()) {
throw api_error("ValidationException", "Serialized value empty");
throw api_error::validation("Serialized value empty");
}
alternator_type atype = alternator_type(bv[0]);
@@ -164,7 +168,7 @@ bytes get_key_column_value(const rjson::value& item, const column_definition& co
std::string column_name = column.name_as_text();
const rjson::value* key_typed_value = rjson::find(item, column_name);
if (!key_typed_value) {
throw api_error("ValidationException", format("Key column {} not found", column_name));
throw api_error::validation(format("Key column {} not found", column_name));
}
return get_key_from_typed_value(*key_typed_value, column);
}
@@ -175,20 +179,20 @@ bytes get_key_column_value(const rjson::value& item, const column_definition& co
bytes get_key_from_typed_value(const rjson::value& key_typed_value, const column_definition& column) {
if (!key_typed_value.IsObject() || key_typed_value.MemberCount() != 1 ||
!key_typed_value.MemberBegin()->value.IsString()) {
throw api_error("ValidationException",
throw api_error::validation(
format("Malformed value object for key column {}: {}",
column.name_as_text(), key_typed_value));
}
auto it = key_typed_value.MemberBegin();
if (it->name != type_to_string(column.type)) {
throw api_error("ValidationException",
throw api_error::validation(
format("Type mismatch: expected type {} for key column {}, got type {}",
type_to_string(column.type), column.name_as_text(), it->name.GetString()));
type_to_string(column.type), column.name_as_text(), it->name));
}
std::string_view value_view = rjson::to_string_view(it->value);
if (value_view.empty()) {
throw api_error("ValidationException",
throw api_error::validation(
format("The AttributeValue for a key attribute cannot contain an empty string value. Key: {}", column.name_as_text()));
}
if (column.type == bytes_type) {
@@ -247,20 +251,24 @@ clustering_key ck_from_json(const rjson::value& item, schema_ptr schema) {
big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic) {
if (!v.IsObject() || v.MemberCount() != 1) {
throw api_error("ValidationException", format("{}: invalid number object", diagnostic));
throw api_error::validation(format("{}: invalid number object", diagnostic));
}
auto it = v.MemberBegin();
if (it->name != "N") {
throw api_error("ValidationException", format("{}: expected number, found type '{}'", diagnostic, it->name));
throw api_error::validation(format("{}: expected number, found type '{}'", diagnostic, it->name));
}
if (it->value.IsNumber()) {
// FIXME(sarna): should use big_decimal constructor with numeric values directly:
return big_decimal(rjson::print(it->value));
try {
if (it->value.IsNumber()) {
// FIXME(sarna): should use big_decimal constructor with numeric values directly:
return big_decimal(rjson::print(it->value));
}
if (!it->value.IsString()) {
throw api_error::validation(format("{}: improperly formatted number constant", diagnostic));
}
return big_decimal(rjson::to_string_view(it->value));
} catch (const marshal_exception& e) {
throw api_error::validation(format("The parameter cannot be converted to a numeric value: {}", it->value));
}
if (!it->value.IsString()) {
throw api_error("ValidationException", format("{}: improperly formatted number constant", diagnostic));
}
return big_decimal(it->value.GetString());
}
const std::pair<std::string, const rjson::value*> unwrap_set(const rjson::value& v) {
@@ -312,10 +320,10 @@ rjson::value set_sum(const rjson::value& v1, const rjson::value& v2) {
auto [set1_type, set1] = unwrap_set(v1);
auto [set2_type, set2] = unwrap_set(v2);
if (set1_type != set2_type) {
throw api_error("ValidationException", format("Mismatched set types: {} and {}", set1_type, set2_type));
throw api_error::validation(format("Mismatched set types: {} and {}", set1_type, set2_type));
}
if (!set1 || !set2) {
throw api_error("ValidationException", "UpdateExpression: ADD operation for sets must be given sets as arguments");
throw api_error::validation("UpdateExpression: ADD operation for sets must be given sets as arguments");
}
rjson::value sum = rjson::copy(*set1);
std::set<rjson::value, rjson::single_value_comp> set1_raw;
@@ -323,7 +331,7 @@ rjson::value set_sum(const rjson::value& v1, const rjson::value& v2) {
set1_raw.insert(rjson::copy(*it));
}
for (const auto& a : set2->GetArray()) {
if (set1_raw.count(a) == 0) {
if (!set1_raw.contains(a)) {
rjson::push_back(sum, rjson::copy(a));
}
}
@@ -340,10 +348,10 @@ std::optional<rjson::value> set_diff(const rjson::value& v1, const rjson::value&
auto [set1_type, set1] = unwrap_set(v1);
auto [set2_type, set2] = unwrap_set(v2);
if (set1_type != set2_type) {
throw api_error("ValidationException", format("Mismatched set types: {} and {}", set1_type, set2_type));
throw api_error::validation(format("Mismatched set types: {} and {}", set1_type, set2_type));
}
if (!set1 || !set2) {
throw api_error("ValidationException", "UpdateExpression: DELETE operation can only be performed on a set");
throw api_error::validation("UpdateExpression: DELETE operation can only be performed on a set");
}
std::set<rjson::value, rjson::single_value_comp> set1_raw;
for (auto it = set1->Begin(); it != set1->End(); ++it) {
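
The `set_sum()` hunk above builds the union by copying the first set and appending only those elements of the second set that the first does not contain. A sketch of the same approach on plain strings, assuming both inputs are already duplicate-free (the function name `demo_set_sum` is hypothetical, and `std::vector<std::string>` stands in for a JSON array):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Union of two duplicate-free lists, keeping set1's order and appending
// the new elements of set2 in their original order.
std::vector<std::string> demo_set_sum(const std::vector<std::string>& set1,
                                      const std::vector<std::string>& set2) {
    std::vector<std::string> sum = set1;
    // Index of set1 for fast membership tests.
    std::set<std::string> set1_raw(set1.begin(), set1.end());
    for (const auto& a : set2) {
        if (set1_raw.find(a) == set1_raw.end()) {
            sum.push_back(a);
        }
    }
    return sum;
}
```

The lookup set is built from the first input only, so the result is duplicate-free only if the second input is; this matches the shape of the loop in the hunk.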


@@ -26,7 +26,7 @@
#include "types.hh"
#include "schema_fwd.hh"
#include "keys.hh"
#include "rjson.hh"
#include "utils/rjson.hh"
#include "utils/big_decimal.hh"
namespace alternator {


@@ -22,10 +22,12 @@
#include "alternator/server.hh"
#include "log.hh"
#include <seastar/http/function_handlers.hh>
#include <seastar/http/short_streams.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/json/json_elements.hh>
#include "seastarx.hh"
#include "error.hh"
#include "rjson.hh"
#include "utils/rjson.hh"
#include "auth.hh"
#include <cctype>
#include "cql3/query_processor.hh"
@@ -59,6 +61,40 @@ inline std::vector<std::string_view> split(std::string_view text, char separator
return tokens;
}
// Handle CORS (Cross-origin resource sharing) in the HTTP request:
// If the request has the "Origin" header specifying where the script which
// makes this request comes from, we need to reply with the header
// "Access-Control-Allow-Origin: *" saying that this (and any) origin is fine.
// Additionally, if preflight==true (i.e., this is an OPTIONS request),
// the script can also "request" in headers that the server allows it to use
// some HTTP methods and headers in the followup request, and the server
// should respond by "allowing" them in the response headers.
// We also add the header "Access-Control-Expose-Headers" to let the script
// access additional headers in the response.
// This handle_CORS() should be used when handling any HTTP method - both the
// usual GET and POST, and also the "preflight" OPTIONS method.
static void handle_CORS(const request& req, reply& rep, bool preflight) {
if (!req.get_header("origin").empty()) {
rep.add_header("Access-Control-Allow-Origin", "*");
// This is the list that DynamoDB returns for expose headers. I am
// not sure why not just return "*" here, what's the risk?
rep.add_header("Access-Control-Expose-Headers", "x-amzn-RequestId,x-amzn-ErrorType,x-amzn-ErrorMessage,Date");
if (preflight) {
sstring s = req.get_header("Access-Control-Request-Headers");
if (!s.empty()) {
rep.add_header("Access-Control-Allow-Headers", std::move(s));
}
s = req.get_header("Access-Control-Request-Method");
if (!s.empty()) {
rep.add_header("Access-Control-Allow-Methods", std::move(s));
}
// Our CORS response never changes anyway, so let the browser cache it
// for two hours (Chrome's maximum):
rep.add_header("Access-Control-Max-Age", "7200");
}
}
}
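The decision logic of handle_CORS() above can be tried in isolation with the minimal sketch below. It is not the Alternator code: `headers`, `get_header()` and `cors_reply_headers()` are hypothetical stand-ins built on `std::map`, and header lookup here is exact-case, which real HTTP header handling is not.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the request/reply header containers.
using headers = std::map<std::string, std::string>;

// Returns the header value, or an empty string if the header is absent.
static std::string get_header(const headers& h, const std::string& name) {
    auto it = h.find(name);
    return it == h.end() ? std::string() : it->second;
}

// Computes the CORS headers to add to a reply, mirroring the branches of
// handle_CORS(): nothing without "Origin", extra headers for a preflight.
headers cors_reply_headers(const headers& req, bool preflight) {
    headers rep;
    if (get_header(req, "Origin").empty()) {
        return rep; // not a cross-origin request: add nothing
    }
    rep["Access-Control-Allow-Origin"] = "*";
    rep["Access-Control-Expose-Headers"] =
        "x-amzn-RequestId,x-amzn-ErrorType,x-amzn-ErrorMessage,Date";
    if (preflight) {
        // Echo back whatever the script asked permission for.
        std::string s = get_header(req, "Access-Control-Request-Headers");
        if (!s.empty()) {
            rep["Access-Control-Allow-Headers"] = s;
        }
        s = get_header(req, "Access-Control-Request-Method");
        if (!s.empty()) {
            rep["Access-Control-Allow-Methods"] = s;
        }
        rep["Access-Control-Max-Age"] = "7200";
    }
    return rep;
}
```

Calling `cors_reply_headers(req, true)` corresponds to the OPTIONS path and `cors_reply_headers(req, false)` to the ordinary GET/POST path.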
// DynamoDB HTTP error responses are structured as follows
// https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html
// Our handlers throw an exception to report an error. If the exception
@@ -75,20 +111,17 @@ public:
// returned to the client as expected. Other types of
// exceptions are unexpected, and returned to the user
// as an internal server error:
api_error ret;
try {
resf.get();
} catch (api_error &ae) {
ret = ae;
generate_error_reply(*rep, ae);
} catch (rjson::error & re) {
ret = api_error("ValidationException", re.what());
generate_error_reply(*rep,
api_error::validation(re.what()));
} catch (...) {
ret = api_error(
"Internal Server Error",
format("Internal server error: {}", std::current_exception()),
reply::status_type::internal_server_error);
generate_error_reply(*rep,
api_error::internal(format("Internal server error: {}", std::current_exception())));
}
generate_error_reply(*rep, ret);
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
}
auto res = resf.get0();
@@ -96,6 +129,10 @@ public:
[&] (const json::json_return_type& json_return_value) {
slogger.trace("api_handler success case");
if (json_return_value._body_writer) {
// Unfortunately, write_body() forces us to choose
// from a fixed and irrelevant list of "mime-types"
// at this point. But we'll override it with the
// one (application/x-amz-json-1.0) below.
rep->write_body("json", std::move(json_return_value._body_writer));
} else {
rep->_content += json_return_value._res;
@@ -108,14 +145,16 @@ public:
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
});
}), _type("json") { }
}) { }
api_handler(const api_handler&) = default;
future<std::unique_ptr<reply>> handle(const sstring& path,
std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
handle_CORS(*req, *rep, false);
return _f_handle(std::move(req), std::move(rep)).then(
[this](std::unique_ptr<reply> rep) {
rep->done(_type);
rep->set_mime_type("application/x-amz-json-1.0");
rep->done();
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
});
}
@@ -129,7 +168,6 @@ protected:
}
future_handler_function _f_handle;
sstring _type;
};
class gated_handler : public handler_base {
@@ -149,6 +187,7 @@ public:
health_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}
protected:
virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
handle_CORS(*req, *rep, false);
rep->set_status(reply::status_type::ok);
rep->write_body("txt", format("healthy: {}", req->get_header("Host")));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
@@ -181,38 +220,61 @@ protected:
}
};
future<> server::verify_signature(const request& req) {
// The CORS (Cross-origin resource sharing) protocol can send an OPTIONS
// request before ("pre-flight") the main request. The response to this
// request can be empty, but needs to have the right headers (which we
// fill with handle_CORS())
class options_handler : public gated_handler {
public:
options_handler(seastar::gate& pending_requests) : gated_handler(pending_requests) {}
protected:
virtual future<std::unique_ptr<reply>> do_handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override {
handle_CORS(*req, *rep, true);
rep->set_status(reply::status_type::ok);
rep->write_body("txt", sstring(""));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
}
};
future<> server::verify_signature(const request& req, const chunked_content& content) {
if (!_enforce_authorization) {
slogger.debug("Skipping authorization");
return make_ready_future<>();
}
auto host_it = req._headers.find("Host");
if (host_it == req._headers.end()) {
throw api_error("InvalidSignatureException", "Host header is mandatory for signature verification");
throw api_error::invalid_signature("Host header is mandatory for signature verification");
}
auto authorization_it = req._headers.find("Authorization");
if (authorization_it == req._headers.end()) {
throw api_error("InvalidSignatureException", "Authorization header is mandatory for signature verification");
throw api_error::missing_authentication_token("Authorization header is mandatory for signature verification");
}
std::string host = host_it->second;
std::vector<std::string_view> credentials_raw = split(authorization_it->second, ' ');
std::string_view authorization_header = authorization_it->second;
auto pos = authorization_header.find_first_of(' ');
if (pos == std::string_view::npos || authorization_header.substr(0, pos) != "AWS4-HMAC-SHA256") {
throw api_error::invalid_signature(format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header));
}
authorization_header.remove_prefix(pos+1);
std::string credential;
std::string user_signature;
std::string signed_headers_str;
std::vector<std::string_view> signed_headers;
for (std::string_view entry : credentials_raw) {
do {
// Either one of a comma or space can mark the end of an entry
pos = authorization_header.find_first_of(" ,");
std::string_view entry = authorization_header.substr(0, pos);
if (pos != std::string_view::npos) {
authorization_header.remove_prefix(pos + 1);
}
if (entry.empty()) {
continue;
}
std::vector<std::string_view> entry_split = split(entry, '=');
if (entry_split.size() != 2) {
if (entry != "AWS4-HMAC-SHA256") {
throw api_error("InvalidSignatureException", format("Only AWS4-HMAC-SHA256 algorithm is supported. Found: {}", entry));
}
continue;
}
std::string_view auth_value = entry_split[1];
// Commas appear as an additional (quite redundant) delimiter
if (auth_value.back() == ',') {
auth_value.remove_suffix(1);
}
if (entry_split[0] == "Credential") {
credential = std::string(auth_value);
} else if (entry_split[0] == "Signature") {
@@ -222,10 +284,11 @@ future<> server::verify_signature(const request& req) {
signed_headers = split(auth_value, ';');
std::sort(signed_headers.begin(), signed_headers.end());
}
}
} while (pos != std::string_view::npos);
std::vector<std::string_view> credential_split = split(credential, '/');
if (credential_split.size() != 5) {
throw api_error("ValidationException", format("Incorrect credential information format: {}", credential));
throw api_error::validation(format("Incorrect credential information format: {}", credential));
}
std::string user(credential_split[0]);
std::string datestamp(credential_split[1]);
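The new Authorization-header scan (check the algorithm prefix, then treat either a space or a comma as an entry delimiter) can be sketched standalone as follows. `parse_authorization()` and its map return type are hypothetical, introduced only for illustration; entries without `=` are simply skipped here, which is an assumption and not necessarily what the server does.

```cpp
#include <map>
#include <optional>
#include <string>
#include <string_view>

// Splits "AWS4-HMAC-SHA256 Key=Value, Key=Value,Key=Value" into a key/value
// map. Returns std::nullopt when the algorithm prefix is missing or wrong.
std::optional<std::map<std::string, std::string>>
parse_authorization(std::string_view header) {
    auto pos = header.find_first_of(' ');
    if (pos == std::string_view::npos || header.substr(0, pos) != "AWS4-HMAC-SHA256") {
        return std::nullopt;
    }
    header.remove_prefix(pos + 1);
    std::map<std::string, std::string> fields;
    do {
        // Either a comma or a space can mark the end of an entry.
        pos = header.find_first_of(" ,");
        std::string_view entry = header.substr(0, pos);
        if (pos != std::string_view::npos) {
            header.remove_prefix(pos + 1);
        }
        if (entry.empty()) {
            continue; // e.g. the space that follows a comma
        }
        auto eq = entry.find('=');
        if (eq == std::string_view::npos) {
            continue; // assumption: ignore entries that are not Key=Value
        }
        fields.emplace(std::string(entry.substr(0, eq)),
                       std::string(entry.substr(eq + 1)));
    } while (pos != std::string_view::npos);
    return fields;
}
```

Because the delimiter is consumed together with the entry, a trailing comma never ends up inside a value, which is what made the old `auth_value.back() == ','` fix-up unnecessary.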
@@ -246,10 +309,10 @@ future<> server::verify_signature(const request& req) {
}
}
auto cache_getter = [] (std::string username) {
return get_key_from_roles(cql3::get_query_processor().local(), std::move(username));
auto cache_getter = [&qp = _qp] (std::string username) {
return get_key_from_roles(qp, std::move(username));
};
return _key_cache.get_ptr(user, cache_getter).then([this, &req,
return _key_cache.get_ptr(user, cache_getter).then([this, &req, &content,
user = std::move(user),
host = std::move(host),
datestamp = std::move(datestamp),
@@ -259,53 +322,100 @@ future<> server::verify_signature(const request& req) {
service = std::move(service),
user_signature = std::move(user_signature)] (key_cache::value_ptr key_ptr) {
std::string signature = get_signature(user, *key_ptr, std::string_view(host), req._method,
datestamp, signed_headers_str, signed_headers_map, req.content, region, service, "");
datestamp, signed_headers_str, signed_headers_map, content, region, service, "");
if (signature != std::string_view(user_signature)) {
_key_cache.remove(user);
throw api_error("UnrecognizedClientException", "The security token included in the request is invalid.");
throw api_error::unrecognized_client("The security token included in the request is invalid.");
}
});
}
future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request>&& req) {
static tracing::trace_state_ptr create_tracing_session(tracing::tracing& tracing_instance) {
tracing::trace_state_props_set props;
props.set<tracing::trace_state_props::full_tracing>();
props.set_if<tracing::trace_state_props::log_slow_query>(tracing_instance.slow_query_tracing_enabled());
return tracing_instance.create_session(tracing::trace_type::QUERY, props);
}
// truncated_content_view() prints a potentially long chunked_content for
// debugging purposes. In the common case when the content is not excessively
// long, it just returns a view into the given content, without any copying.
// But when the content is very long, it is truncated after some arbitrary
// max_len (or one chunk, whichever comes first), with "<truncated>" added at
// the end. To do this modification to the string, we need to create a new
// std::string, so the caller must pass us a reference to one, "buf", where
// we can store the content. The returned view is only alive for as long as this
// buf is kept alive.
static std::string_view truncated_content_view(const chunked_content& content, std::string& buf) {
constexpr size_t max_len = 1024;
if (content.empty()) {
return std::string_view();
} else if (content.size() == 1 && content.begin()->size() <= max_len) {
return std::string_view(content.begin()->get(), content.begin()->size());
} else {
buf = std::string(content.begin()->get(), std::min(content.begin()->size(), max_len)) + "<truncated>";
return std::string_view(buf);
}
}
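The same truncation rule can be exercised without Seastar using the rough sketch below. `chunks` (a `std::vector<std::string>`) is a hypothetical stand-in for `chunked_content`, and `max_len` is made a parameter only to keep the example easy to test.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for rjson::chunked_content.
using chunks = std::vector<std::string>;

// Returns a view of the content for logging. Short single-chunk content is
// viewed in place; anything else is cut to the first chunk (at most max_len
// bytes) and copied into buf with a "<truncated>" marker appended.
std::string_view truncated_view(const chunks& content, std::string& buf,
                                std::size_t max_len = 1024) {
    if (content.empty()) {
        return {};
    }
    const std::string& first = content.front();
    if (content.size() == 1 && first.size() <= max_len) {
        return first; // no copy: the view points into the content itself
    }
    buf = first.substr(0, std::min(first.size(), max_len)) + "<truncated>";
    return buf; // the view is valid only while buf is alive and unmodified
}
```

As in the original, the caller owns `buf`, so the returned view must not outlive either `buf` or the content it may point into.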
static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_state, sstring_view op, const chunked_content& query) {
tracing::trace_state_ptr trace_state;
tracing::tracing& tracing_instance = tracing::tracing::get_local_tracing_instance();
if (tracing_instance.trace_next_query() || tracing_instance.slow_query_tracing_enabled()) {
trace_state = create_tracing_session(tracing_instance);
std::string buf;
tracing::add_session_param(trace_state, "alternator_op", op);
tracing::add_query(trace_state, truncated_content_view(query, buf));
tracing::begin(trace_state, format("Alternator {}", op), client_state.get_client_address());
}
return trace_state;
}
future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request> req) {
_executor._stats.total_operations++;
sstring target = req->get_header(TARGET);
std::vector<std::string_view> split_target = split(target, '.');
//NOTICE(sarna): Target consists of Dynamo API version followed by a dot '.' and operation type (e.g. CreateTable)
std::string op = split_target.empty() ? std::string() : std::string(split_target.back());
slogger.trace("Request: {} {}", op, req->content);
return verify_signature(*req).then([this, op, req = std::move(req)] () mutable {
auto callback_it = _callbacks.find(op);
if (callback_it == _callbacks.end()) {
_executor._stats.unsupported_operations++;
throw api_error("UnknownOperationException",
format("Unsupported operation {}", op));
}
return with_gate(_pending_requests, [this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] () mutable {
//FIXME: Client state can provide more context, e.g. client's endpoint address
// We use unique_ptr because client_state cannot be moved or copied
return do_with(std::make_unique<executor::client_state>(executor::client_state::internal_tag()),
[this, callback_it = std::move(callback_it), op = std::move(op), req = std::move(req)] (std::unique_ptr<executor::client_state>& client_state) mutable {
tracing::trace_state_ptr trace_state = executor::maybe_trace_query(*client_state, op, req->content);
tracing::trace(trace_state, op);
// JSON parsing can allocate up to roughly 2x the size of the raw document, + a couple of bytes for maintenance.
// FIXME: by this time, the whole HTTP request was already read, so some memory is already occupied.
// Once HTTP allows working on streams, we should grab the permit *before* reading the HTTP payload.
size_t mem_estimate = req->content.size() * 3 + 8000;
auto units_fut = get_units(*_memory_limiter, mem_estimate);
if (_memory_limiter->waiters()) {
++_executor._stats.requests_blocked_memory;
}
return units_fut.then([this, callback_it = std::move(callback_it), &client_state, trace_state, req = std::move(req)] (semaphore_units<> units) mutable {
return _json_parser.parse(req->content).then([this, callback_it = std::move(callback_it), &client_state, trace_state,
units = std::move(units), req = std::move(req)] (rjson::value json_request) mutable {
return callback_it->second(_executor, *client_state, trace_state, make_service_permit(std::move(units)), std::move(json_request), std::move(req)).finally([trace_state] {});
});
});
});
});
});
// JSON parsing can allocate up to roughly 2x the size of the raw
// document, + a couple of bytes for maintenance.
// TODO: consider the case where req->content_length is missing. Maybe
// we need to take the content_length_limit and return some of the units
// when we finish read_content_and_verify_signature?
size_t mem_estimate = req->content_length * 2 + 8000;
auto units_fut = get_units(*_memory_limiter, mem_estimate);
if (_memory_limiter->waiters()) {
++_executor._stats.requests_blocked_memory;
}
auto units = co_await std::move(units_fut);
assert(req->content_stream);
chunked_content content = co_await httpd::read_entire_stream(*req->content_stream);
co_await verify_signature(*req, content);
if (slogger.is_enabled(log_level::trace)) {
std::string buf;
slogger.trace("Request: {} {} {}", op, truncated_content_view(content, buf), req->_headers);
}
auto callback_it = _callbacks.find(op);
if (callback_it == _callbacks.end()) {
_executor._stats.unsupported_operations++;
co_return api_error::unknown_operation(format("Unsupported operation {}", op));
}
if (_pending_requests.get_count() >= _max_concurrent_requests) {
_executor._stats.requests_shed++;
co_return api_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}", _pending_requests.get_count()));
}
_pending_requests.enter();
auto leave = defer([this] { _pending_requests.leave(); });
//FIXME: Client state can provide more context, e.g. client's endpoint address
// We use unique_ptr because client_state cannot be moved or copied
executor::client_state client_state{executor::client_state::internal_tag()};
tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, op, content);
tracing::trace(trace_state, op);
rjson::value json_request = co_await _json_parser.parse(std::move(content));
co_return co_await callback_it->second(_executor, client_state, trace_state,
make_service_permit(std::move(units)), std::move(json_request), std::move(req));
}
void server::set_routes(routes& r) {
@@ -327,15 +437,17 @@ void server::set_routes(routes& r) {
// scan an entire subnet for nodes responding to the health request,
// or even just scan for open ports.
r.put(operation_type::GET, "/localnodes", new local_nodelist_handler(_pending_requests));
r.put(operation_type::OPTIONS, "/", new options_handler(_pending_requests));
}
//FIXME: A way to immediately invalidate the cache should be considered,
// e.g. when the system table which stores the keys is changed.
// For now, this propagation may take up to 1 minute.
server::server(executor& exec)
server::server(executor& exec, cql3::query_processor& qp)
: _http_server("http-alternator")
, _https_server("https-alternator")
, _executor(exec)
, _qp(qp)
, _key_cache(1024, 1min, slogger)
, _enforce_authorization(false)
, _enabled_servers{}
@@ -350,6 +462,9 @@ server::server(executor& exec)
{"DeleteTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.delete_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
{"UpdateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.update_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
{"PutItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.put_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
@@ -389,13 +504,26 @@ server::server(executor& exec)
{"ListTagsOfResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.list_tags_of_resource(client_state, std::move(permit), std::move(json_request));
}},
{"ListStreams", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.list_streams(client_state, std::move(permit), std::move(json_request));
}},
{"DescribeStream", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_stream(client_state, std::move(permit), std::move(json_request));
}},
{"GetShardIterator", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.get_shard_iterator(client_state, std::move(permit), std::move(json_request));
}},
{"GetRecords", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.get_records(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
}},
} {
}
future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
bool enforce_authorization, semaphore* memory_limiter) {
bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
_memory_limiter = memory_limiter;
_enforce_authorization = enforce_authorization;
_max_concurrent_requests = std::move(max_concurrent_requests);
if (!port && !https_port) {
return make_exception_future<>(std::runtime_error("Either regular port or TLS port"
" must be specified in order to init an alternator HTTP server instance"));
@@ -407,12 +535,14 @@ future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std:
if (port) {
set_routes(_http_server._routes);
_http_server.set_content_length_limit(server::content_length_limit);
_http_server.set_content_streaming(true);
_http_server.listen(socket_address{addr, *port}).get();
_enabled_servers.push_back(std::ref(_http_server));
}
if (https_port) {
set_routes(_https_server._routes);
_https_server.set_content_length_limit(server::content_length_limit);
_https_server.set_content_streaming(true);
_https_server.set_tls_credentials(creds->build_reloadable_server_credentials([](const std::unordered_set<sstring>& files, std::exception_ptr ep) {
if (ep) {
slogger.warn("Exception loading {}: {}", files, ep);
@@ -450,7 +580,7 @@ server::json_parser::json_parser() : _run_parse_json_thread(async([this] {
return;
}
try {
_parsed_document = rjson::parse_yieldable(_raw_document);
_parsed_document = rjson::parse_yieldable(std::move(_raw_document));
_current_exception = nullptr;
} catch (...) {
_current_exception = std::current_exception();
@@ -460,12 +590,12 @@ server::json_parser::json_parser() : _run_parse_json_thread(async([this] {
})) {
}
future<rjson::value> server::json_parser::parse(std::string_view content) {
future<rjson::value> server::json_parser::parse(chunked_content&& content) {
if (content.size() < yieldable_parsing_threshold) {
return make_ready_future<rjson::value>(rjson::parse(content));
return make_ready_future<rjson::value>(rjson::parse(std::move(content)));
}
return with_semaphore(_parsing_sem, 1, [this, content] {
_raw_document = content;
return with_semaphore(_parsing_sem, 1, [this, content = std::move(content)] () mutable {
_raw_document = std::move(content);
_document_waiting.signal();
return _document_parsed.wait().then([this] {
if (_current_exception) {

View File

@@ -28,10 +28,13 @@
#include <optional>
#include "alternator/auth.hh"
#include "utils/small_vector.hh"
#include "utils/updateable_value.hh"
#include <seastar/core/units.hh>
namespace alternator {
using chunked_content = rjson::chunked_content;
class server {
static constexpr size_t content_length_limit = 16*MB;
using alternator_callback = std::function<future<executor::request_return_type>(executor&, executor::client_state&,
@@ -41,6 +44,7 @@ class server {
http_server _http_server;
http_server _https_server;
executor& _executor;
cql3::query_processor& _qp;
key_cache _key_cache;
bool _enforce_authorization;
@@ -49,10 +53,11 @@ class server {
alternator_callbacks_map _callbacks;
semaphore* _memory_limiter;
utils::updateable_value<uint32_t> _max_concurrent_requests;
class json_parser {
static constexpr size_t yieldable_parsing_threshold = 16*KB;
std::string_view _raw_document;
chunked_content _raw_document;
rjson::value _parsed_document;
std::exception_ptr _current_exception;
semaphore _parsing_sem{1};
@@ -62,21 +67,24 @@ class server {
future<> _run_parse_json_thread;
public:
json_parser();
future<rjson::value> parse(std::string_view content);
// Moving a chunked_content into parse() allows parse() to free each
// chunk as soon as it is parsed, so when chunks are relatively small,
// we don't need to store the sum of unparsed and parsed sizes.
future<rjson::value> parse(chunked_content&& content);
future<> stop();
};
json_parser _json_parser;
public:
server(executor& executor);
server(executor& executor, cql3::query_processor& qp);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
bool enforce_authorization, semaphore* memory_limiter);
bool enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
future<> stop();
private:
void set_routes(seastar::httpd::routes& r);
future<> verify_signature(const seastar::httpd::request& r);
future<executor::request_return_type> handle_api_request(std::unique_ptr<request>&& req);
future<> verify_signature(const seastar::httpd::request&, const chunked_content&);
future<executor::request_return_type> handle_api_request(std::unique_ptr<request> req);
};
}
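The comment on json_parser::parse() says that taking the chunked content by rvalue lets the parser free each chunk as soon as it has been consumed. A minimal sketch of that ownership pattern, assuming a hypothetical `chunks` type and a `consume()` function that only counts bytes in place of real parsing:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for rjson::chunked_content.
using chunks = std::vector<std::string>;

// Takes ownership of the chunks and releases each one right after it has
// been processed, so consumed input does not stay resident until the end.
std::size_t consume(chunks&& content) {
    chunks owned = std::move(content);
    std::size_t total = 0;
    for (std::string& chunk : owned) {
        total += chunk.size();      // stand-in for "parse this chunk"
        std::string().swap(chunk);  // free this chunk's buffer now
    }
    return total;
}
```

With a `std::string_view` parameter, as in the old signature, the callee could not release anything, because it did not own the buffer.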

View File

@@ -20,7 +20,7 @@
*/
#include "stats.hh"
#include "utils/histogram_metrics_helper.hh"
#include <seastar/core/metrics.hh>
namespace alternator {
@@ -37,7 +37,8 @@ stats::stats() : api_operations{} {
seastar::metrics::description("number of operations via Alternator API"), {op(CamelCaseName)}),
#define OPERATION_LATENCY(name, CamelCaseName) \
seastar::metrics::make_histogram("op_latency", \
seastar::metrics::description("Latency histogram of an operation via Alternator API"), {op(CamelCaseName)}, [this]{return api_operations.name.get_histogram(1,20);}),
seastar::metrics::description("Latency histogram of an operation via Alternator API"), {op(CamelCaseName)}, [this]{return to_metrics_histogram(api_operations.name);}),
OPERATION(batch_get_item, "BatchGetItem")
OPERATION(batch_write_item, "BatchWriteItem")
OPERATION(create_backup, "CreateBackup")
OPERATION(create_global_table, "CreateGlobalTable")
@@ -77,6 +78,11 @@ stats::stats() : api_operations{} {
OPERATION_LATENCY(get_item_latency, "GetItem")
OPERATION_LATENCY(delete_item_latency, "DeleteItem")
OPERATION_LATENCY(update_item_latency, "UpdateItem")
OPERATION(list_streams, "ListStreams")
OPERATION(describe_stream, "DescribeStream")
OPERATION(get_shard_iterator, "GetShardIterator")
OPERATION(get_records, "GetRecords")
OPERATION_LATENCY(get_records_latency, "GetRecords")
});
_metrics.add_group("alternator", {
seastar::metrics::make_total_operations("unsupported_operations", unsupported_operations,
@@ -91,6 +97,8 @@ stats::stats() : api_operations{} {
seastar::metrics::description("number writes that had to be bounced from this shard because of LWT requirements")),
seastar::metrics::make_total_operations("requests_blocked_memory", requests_blocked_memory,
seastar::metrics::description("Counts a number of requests blocked due to memory pressure.")),
seastar::metrics::make_total_operations("requests_shed", requests_shed,
seastar::metrics::description("Counts a number of requests shed due to overload.")),
seastar::metrics::make_total_operations("filtered_rows_read_total", cql_stats.filtered_rows_read_total,
seastar::metrics::description("number of rows read during filtering operations")),
seastar::metrics::make_total_operations("filtered_rows_matched_total", cql_stats.filtered_rows_matched_total,

View File

@@ -74,11 +74,16 @@ public:
uint64_t update_item = 0;
uint64_t update_table = 0;
uint64_t update_time_to_live = 0;
uint64_t list_streams = 0;
uint64_t describe_stream = 0;
uint64_t get_shard_iterator = 0;
uint64_t get_records = 0;
utils::estimated_histogram put_item_latency;
utils::estimated_histogram get_item_latency;
utils::estimated_histogram delete_item_latency;
utils::estimated_histogram update_item_latency;
utils::time_estimated_histogram put_item_latency;
utils::time_estimated_histogram get_item_latency;
utils::time_estimated_histogram delete_item_latency;
utils::time_estimated_histogram update_item_latency;
utils::time_estimated_histogram get_records_latency;
} api_operations;
// Miscellaneous event counters
uint64_t total_operations = 0;
@@ -87,6 +92,7 @@ public:
uint64_t write_using_lwt = 0;
uint64_t shard_bounce_for_lwt = 0;
uint64_t requests_blocked_memory = 0;
uint64_t requests_shed = 0;
// CQL-derived stats
cql3::cql_stats cql_stats;
private:

alternator/streams.cc (normal file, 1116 lines changed)

File diff suppressed because it is too large.

View File

@@ -2925,6 +2925,10 @@
"id":"toppartitions_query_results",
"description":"nodetool toppartitions query results",
"properties":{
"read_cardinality":{
"type":"long",
"description":"Number of the unique operations in the sample set"
},
"read":{
"type":"array",
"items":{
@@ -2932,6 +2936,10 @@
},
"description":"Read results"
},
"write_cardinality":{
"type":"long",
"description":"Number of the unique operations in the sample set"
},
"write":{
"type":"array",
"items":{

View File

@@ -148,6 +148,30 @@
]
}
]
},
{
"path":"/gossiper/force_remove_endpoint/{addr}",
"operations":[
{
"method":"POST",
"summary":"Force remove an endpoint from gossip",
"type":"void",
"nickname":"force_remove_endpoint",
"produces":[
"application/json"
],
"parameters":[
{
"name":"addr",
"description":"The endpoint address",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
}
]
}

View File

@@ -76,7 +76,7 @@
"items":{
"type":"message_counter"
},
"nickname":"get_completed_messages",
"nickname":"get_replied_messages",
"produces":[
"application/json"
],
@@ -249,7 +249,7 @@
"MIGRATION_REQUEST",
"PREPARE_MESSAGE",
"PREPARE_DONE_MESSAGE",
"STREAM_MUTATION",
"UNUSED__STREAM_MUTATION",
"STREAM_MUTATION_DONE",
"COMPLETE_MESSAGE",
"REPAIR_CHECKSUM_RANGE",

View File

@@ -68,7 +68,7 @@
"summary":"Get the hinted handoff enabled by dc",
"type":"array",
"items":{
"type":"mapper_list"
"type":"array"
},
"nickname":"get_hinted_handoff_enabled_by_dc",
"produces":[

View File

@@ -104,6 +104,68 @@
}
]
},
{
"path":"/storage_service/toppartitions/",
"operations":[
{
"method":"GET",
"summary":"Toppartitions query",
"type":"toppartitions_query_results",
"nickname":"toppartitions_generic",
"produces":[
"application/json"
],
"parameters":[
{
"name":"table_filters",
"description":"Optional list of table name filters in keyspace:name format",
"required":false,
"allowMultiple":false,
"type":"array",
"items":{
"type":"string"
},
"paramType":"query"
},
{
"name":"keyspace_filters",
"description":"Optional list of keyspace filters",
"required":false,
"allowMultiple":false,
"type":"array",
"items":{
"type":"string"
},
"paramType":"query"
},
{
"name":"duration",
"description":"Duration (in milliseconds) of monitoring operation",
"required":true,
"allowMultiple":false,
"type": "long",
"paramType":"query"
},
{
"name":"list_size",
"description":"number of the top partitions to list",
"required":false,
"allowMultiple":false,
"type": "long",
"paramType":"query"
},
{
"name":"capacity",
"description":"capacity of stream summary: determines amount of resources used in query processing",
"required":false,
"allowMultiple":false,
"type": "long",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/nodes/leaving",
"operations":[
@@ -833,6 +895,43 @@
}
]
},
{
"path":"/storage_service/repair_status/",
"operations":[
{
"method":"GET",
"summary":"Query the repair status and return when the repair is finished or timeout",
"type":"string",
"enum":[
"RUNNING",
"SUCCESSFUL",
"FAILED"
],
"nickname":"repair_await_completion",
"produces":[
"application/json"
],
"parameters":[
{
"name":"id",
"description":"The repair ID to check for status",
"required":true,
"allowMultiple":false,
"type": "long",
"paramType":"query"
},
{
"name":"timeout",
"description":"Seconds to wait before the query returns even if the repair is not finished. The value -1 or not providing this parameter means no timeout",
"required":false,
"allowMultiple":false,
"type": "long",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/repair_async/{keyspace}",
"operations":[
@@ -933,6 +1032,14 @@
"type":"string",
"paramType":"query"
},
{
"name":"ignore_nodes",
"description":"Which hosts to ignore in this repair. Multiple hosts can be listed separated by commas.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"trace",
"description":"If the value is the string 'true' with any capitalization, enable tracing of the repair.",
@@ -1068,6 +1175,14 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"ignore_nodes",
"description":"List of dead nodes to ignore in removenode operation",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
@@ -1719,6 +1834,22 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"load_and_stream",
"description":"Load the sstables and stream to all replica nodes that own the data",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"primary_replica_only",
"description":"Load the sstables and stream to primary replica node that owns the data. Repair is needed after the load and stream process",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
@@ -1829,6 +1960,14 @@
"allowMultiple":false,
"type":"long",
"paramType":"query"
},
{
"name":"fast",
"description":"Lightweight tracing mode: if true, slow queries tracing records only session headers",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
},
@@ -2327,6 +2466,10 @@
"threshold":{
"type":"long",
"description":"The slow query logging threshold in microseconds. Queries that take longer will be logged"
},
"fast":{
"type":"boolean",
"description":"Is lightweight tracing mode enabled. In that mode tracing ignores events and tracks only sessions."
}
}
},
@@ -2431,7 +2574,7 @@
"version":{
"type":"string",
"enum":[
"ka", "la", "mc"
"ka", "la", "mc", "md"
],
"description":"SSTable version"
},


@@ -52,6 +52,22 @@
}
]
},
{
"path":"/system/drop_sstable_caches",
"operations":[
{
"method":"POST",
"summary":"Drop in-memory caches for data which is in sstables",
"type":"void",
"nickname":"drop_sstable_caches",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/system/uptime_ms",
"operations":[


@@ -113,8 +113,20 @@ future<> set_server_storage_service(http_context& ctx) {
return register_api(ctx, "storage_service", "The storage service API", set_storage_service);
}
future<> set_server_snapshot(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { set_snapshot(ctx, r); });
future<> set_server_repair(http_context& ctx, sharded<netw::messaging_service>& ms) {
return ctx.http_server.set_routes([&ctx, &ms] (routes& r) { set_repair(ctx, r, ms); });
}
future<> unset_server_repair(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_repair(ctx, r); });
}
future<> set_server_snapshot(http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl) {
return ctx.http_server.set_routes([&ctx, &snap_ctl] (routes& r) { set_snapshot(ctx, r, snap_ctl); });
}
future<> unset_server_snapshot(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_snapshot(ctx, r); });
}
future<> set_server_snitch(http_context& ctx) {
@@ -131,9 +143,14 @@ future<> set_server_load_sstable(http_context& ctx) {
"The column family API", set_column_family);
}
future<> set_server_messaging_service(http_context& ctx) {
future<> set_server_messaging_service(http_context& ctx, sharded<netw::messaging_service>& ms) {
return register_api(ctx, "messaging_service",
"The messaging service API", set_messaging_service);
"The messaging service API", [&ms] (http_context& ctx, routes& r) {
set_messaging_service(ctx, r, ms);
});
}
future<> unset_server_messaging_service(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_messaging_service(ctx, r); });
}
future<> set_server_storage_proxy(http_context& ctx) {


@@ -256,4 +256,6 @@ public:
operator T() const { return value; }
};
utils_json::estimated_histogram time_to_json_histogram(const utils::time_estimated_histogram& val);
}


@@ -24,9 +24,11 @@
#include <seastar/http/httpd.hh>
namespace service { class load_meter; }
namespace locator { class token_metadata; }
namespace locator { class shared_token_metadata; }
namespace cql_transport { class controller; }
class thrift_controller;
namespace db { class snapshot_ctl; }
namespace netw { class messaging_service; }
namespace api {
@@ -37,27 +39,33 @@ struct http_context {
distributed<database>& db;
distributed<service::storage_proxy>& sp;
service::load_meter& lmeter;
sharded<locator::token_metadata>& token_metadata;
const sharded<locator::shared_token_metadata>& shared_token_metadata;
http_context(distributed<database>& _db,
distributed<service::storage_proxy>& _sp,
service::load_meter& _lm, sharded<locator::token_metadata>& _tm)
: db(_db), sp(_sp), lmeter(_lm), token_metadata(_tm) {
service::load_meter& _lm, const sharded<locator::shared_token_metadata>& _stm)
: db(_db), sp(_sp), lmeter(_lm), shared_token_metadata(_stm) {
}
const locator::token_metadata& get_token_metadata();
};
future<> set_server_init(http_context& ctx);
future<> set_server_config(http_context& ctx);
future<> set_server_snitch(http_context& ctx);
future<> set_server_storage_service(http_context& ctx);
future<> set_server_repair(http_context& ctx, sharded<netw::messaging_service>& ms);
future<> unset_server_repair(http_context& ctx);
future<> set_transport_controller(http_context& ctx, cql_transport::controller& ctl);
future<> unset_transport_controller(http_context& ctx);
future<> set_rpc_controller(http_context& ctx, thrift_controller& ctl);
future<> unset_rpc_controller(http_context& ctx);
future<> set_server_snapshot(http_context& ctx);
future<> set_server_snapshot(http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl);
future<> unset_server_snapshot(http_context& ctx);
future<> set_server_gossip(http_context& ctx);
future<> set_server_load_sstable(http_context& ctx);
future<> set_server_messaging_service(http_context& ctx);
future<> set_server_messaging_service(http_context& ctx, sharded<netw::messaging_service>& ms);
future<> unset_server_messaging_service(http_context& ctx);
future<> set_server_storage_proxy(http_context& ctx);
future<> set_server_stream_manager(http_context& ctx);
future<> set_server_gossip_settle(http_context& ctx);


@@ -28,6 +28,7 @@
#include <algorithm>
#include "db/system_keyspace_view_types.hh"
#include "db/data_listeners.hh"
#include "storage_service.hh"
extern logging::logger apilog;
@@ -180,7 +181,7 @@ static future<json::json_return_type> get_cf_unleveled_sstables(http_context& ct
static int64_t min_partition_size(column_family& cf) {
int64_t res = INT64_MAX;
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res = std::min(res, i->get_stats_metadata().estimated_partition_size.min());
}
return (res == INT64_MAX) ? 0 : res;
@@ -188,7 +189,7 @@ static int64_t min_partition_size(column_family& cf) {
static int64_t max_partition_size(column_family& cf) {
int64_t res = 0;
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res = std::max(i->get_stats_metadata().estimated_partition_size.max(), res);
}
return res;
@@ -196,7 +197,7 @@ static int64_t max_partition_size(column_family& cf) {
static integral_ratio_holder mean_partition_size(column_family& cf) {
integral_ratio_holder res;
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
auto c = i->get_stats_metadata().estimated_partition_size.count();
res.sub += i->get_stats_metadata().estimated_partition_size.mean() * c;
res.total += c;
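Reviewer note on the `for (auto sstables = cf.get_sstables(); auto& i : *sstables)` rewrite used throughout this file: in the old form `*cf.get_sstables()`, the owning pointer returned by value is a temporary that is destroyed before the loop body runs (this is the rule before C++23), so the loop relies on some other owner keeping the set alive. A minimal sketch of the new form, assuming C++20 and using a hypothetical `table_stub` with `std::shared_ptr` in place of the real `column_family` and `lw_shared_ptr`:

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <memory>
#include <vector>

// Hypothetical stand-in for column_family. get_sstables() hands back a
// freshly allocated snapshot, so the returned pointer is the only owner.
struct table_stub {
    std::vector<int> sizes{3, 1, 2};
    std::shared_ptr<std::vector<int>> get_sstables() const {
        return std::make_shared<std::vector<int>>(sizes);
    }
};

int min_size(const table_stub& t) {
    int res = std::numeric_limits<int>::max();
    // The init-statement names the owning pointer, so the snapshot stays
    // alive for the whole loop. Writing `: *t.get_sstables()` here would
    // iterate over a vector that has already been freed.
    for (auto sstables = t.get_sstables(); const auto& i : *sstables) {
        res = std::min(res, i);
    }
    return res == std::numeric_limits<int>::max() ? 0 : res;
}
```

Calling `min_size(table_stub{})` walks the snapshot safely because `sstables` outlives the loop.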
@@ -249,6 +250,12 @@ static future<json::json_return_type> sum_sstable(http_context& ctx, bool total)
});
}
future<json::json_return_type> map_reduce_cf_time_histogram(http_context& ctx, const sstring& name, std::function<utils::time_estimated_histogram(const column_family&)> f) {
return map_reduce_cf_raw(ctx, name, utils::time_estimated_histogram(), f, utils::time_estimated_histogram_merge).then([](const utils::time_estimated_histogram& res) {
return make_ready_future<json::json_return_type>(time_to_json_histogram(res));
});
}
template <typename T>
class sum_ratio {
uint64_t _n = 0;
@@ -268,7 +275,7 @@ public:
static double get_compression_ratio(column_family& cf) {
sum_ratio<double> result;
for (auto i : *cf.get_sstables()) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
auto compression_ratio = i->get_compression_ratio();
if (compression_ratio != sstables::metadata_collector::NO_COMPRESSION_RATIO) {
result(compression_ratio);
@@ -304,8 +311,8 @@ void set_column_family(http_context& ctx, routes& r) {
return res;
});
cf::get_column_family.set(r, [&ctx] (const_req req){
vector<cf::column_family_info> res;
cf::get_column_family.set(r, [&ctx] (std::unique_ptr<request> req){
std::list<cf::column_family_info> res;
for (auto i: ctx.db.local().get_column_families_mapping()) {
cf::column_family_info info;
info.ks = i.first.first;
@@ -313,7 +320,7 @@ void set_column_family(http_context& ctx, routes& r) {
info.type = "ColumnFamilies";
res.push_back(info);
}
return res;
return make_ready_future<json::json_return_type>(json::stream_range_as_array(std::move(res), std::identity()));
});
cf::get_column_family_name_keyspace.set(r, [&ctx] (const_req req){
@@ -325,15 +332,15 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], 0, [](column_family& cf) {
return map_reduce_cf(ctx, req->param["name"], uint64_t{0}, [](column_family& cf) {
return cf.active_memtable().partition_count();
}, std::plus<int>());
}, std::plus<>());
});
cf::get_all_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, 0, [](column_family& cf) {
return map_reduce_cf(ctx, uint64_t{0}, [](column_family& cf) {
return cf.active_memtable().partition_count();
}, std::plus<int>());
}, std::plus<>());
});
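Reviewer note on the switch from `0` and `std::plus<int>()` to `uint64_t{0}` and `std::plus<>()`: with an `int` seed and an `int` reducer, every 64-bit count is narrowed to `int` before it is added. The sketch below shows the effect with `std::accumulate` as a stand-in for `map_reduce_cf` (an assumption made for illustration; the function names are hypothetical and this is not the Seastar reducer).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Old shape: int seed and std::plus<int>, so each uint64_t operand is
// converted to int before the addition and large counts are mangled.
int reduce_narrow(const std::vector<uint64_t>& counts) {
    return std::accumulate(counts.begin(), counts.end(), 0, std::plus<int>());
}

// New shape: uint64_t seed and the transparent std::plus<>, which adds
// the operands in their own type.
uint64_t reduce_wide(const std::vector<uint64_t>& counts) {
    return std::accumulate(counts.begin(), counts.end(), uint64_t{0}, std::plus<>());
}
```

For a single count of 5,000,000,000, `reduce_wide` returns the count unchanged, while `reduce_narrow` cannot represent it in an `int`.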
cf::get_memtable_on_heap_size.set(r, [] (const_req req) {
@@ -418,7 +425,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_estimated_row_size_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
utils::estimated_histogram res(0);
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res.merge(i->get_stats_metadata().estimated_partition_size);
}
return res;
@@ -430,7 +437,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_estimated_row_count.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], int64_t(0), [](column_family& cf) {
uint64_t res = 0;
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res += i->get_stats_metadata().estimated_partition_size.count();
}
return res;
@@ -441,7 +448,7 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_estimated_column_count_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
utils::estimated_histogram res(0);
for (auto i: *cf.get_sstables() ) {
for (auto sstables = cf.get_sstables(); auto& i : *sstables) {
res.merge(i->get_stats_metadata().estimated_cells_count);
}
return res;
@@ -593,7 +600,8 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_false_positive();
});
}, std::plus<uint64_t>());
@@ -601,7 +609,8 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_false_positive();
});
}, std::plus<uint64_t>());
@@ -609,7 +618,8 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_recent_false_positive();
});
}, std::plus<uint64_t>());
@@ -617,7 +627,8 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_all_recent_bloom_filter_false_positives.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_get_recent_false_positive();
});
}, std::plus<uint64_t>());
@@ -649,48 +660,54 @@ void set_column_family(http_context& ctx, routes& r) {
cf::get_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_size();
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
});
cf::get_all_bloom_filter_disk_space_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_size();
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_size();
});
}, std::plus<uint64_t>());
});
cf::get_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_memory_size();
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
cf::get_all_bloom_filter_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->filter_memory_size();
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->filter_memory_size();
});
}, std::plus<uint64_t>());
});
cf::get_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->get_summary().memory_footprint();
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});
cf::get_all_index_summary_off_heap_memory_used.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, uint64_t(0), [] (column_family& cf) {
return std::accumulate(cf.get_sstables()->begin(), cf.get_sstables()->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return sst->get_summary().memory_footprint();
auto sstables = cf.get_sstables();
return std::accumulate(sstables->begin(), sstables->end(), uint64_t(0), [](uint64_t s, auto& sst) {
return s + sst->get_summary().memory_footprint();
});
}, std::plus<uint64_t>());
});
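Reviewer note on the `return s + sst->...` changes in this hunk: the binary operation passed to `std::accumulate` must return the updated accumulator. The old lambdas returned only the current element's value, so the result was the last sstable's value instead of the sum. A minimal sketch with plain integers in place of sstables (the function names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Old shape: the accumulator `s` is ignored, so each step overwrites the
// running value and only the last element survives.
uint64_t last_only(const std::vector<uint64_t>& sizes) {
    return std::accumulate(sizes.begin(), sizes.end(), uint64_t(0),
        [](uint64_t s, uint64_t size) { (void)s; return size; });
}

// New shape: each step adds the element to the accumulator.
uint64_t total(const std::vector<uint64_t>& sizes) {
    return std::accumulate(sizes.begin(), sizes.end(), uint64_t(0),
        [](uint64_t s, uint64_t size) { return s + size; });
}
```

For the sizes 10, 20 and 30, `last_only` yields 30 and `total` yields 60.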
@@ -796,24 +813,21 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_cas_prepare.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {
return cf.get_stats().estimated_cas_prepare;
},
utils::estimated_histogram_merge, utils_json::estimated_histogram());
});
});
cf::get_cas_propose.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {
return cf.get_stats().estimated_cas_accept;
},
utils::estimated_histogram_merge, utils_json::estimated_histogram());
});
});
cf::get_cas_commit.set(r, [&ctx] (std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {
return cf.get_stats().estimated_cas_learn;
},
utils::estimated_histogram_merge, utils_json::estimated_histogram());
});
});
cf::get_sstables_per_read_histogram.set(r, [&ctx] (std::unique_ptr<request> req) {
@@ -862,7 +876,9 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_built_indexes.set(r, [&ctx](std::unique_ptr<request> req) {
auto [ks, cf_name] = parse_fully_qualified_cf_name(req->param["name"]);
auto ks_cf = parse_fully_qualified_cf_name(req->param["name"]);
auto&& ks = std::get<0>(ks_cf);
auto&& cf_name = std::get<1>(ks_cf);
return db::system_keyspace::load_view_build_progress().then([ks, cf_name, &ctx](const std::vector<db::system_keyspace::view_build_progress>& vb) mutable {
std::set<sstring> vp;
for (auto b : vb) {
@@ -875,7 +891,7 @@ void set_column_family(http_context& ctx, routes& r) {
column_family& cf = ctx.db.local().find_column_family(uuid);
res.reserve(cf.get_index_manager().list_indexes().size());
for (auto&& i : cf.get_index_manager().list_indexes()) {
if (vp.find(secondary_index::index_table_name(i.metadata().name())) == vp.end()) {
if (!vp.contains(secondary_index::index_table_name(i.metadata().name()))) {
res.emplace_back(i.metadata().name());
}
}
@@ -909,17 +925,15 @@ void set_column_family(http_context& ctx, routes& r) {
});
cf::get_read_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {
return cf.get_stats().estimated_read;
},
utils::estimated_histogram_merge, utils_json::estimated_histogram());
});
});
cf::get_write_latency_estimated_histogram.set(r, [&ctx](std::unique_ptr<request> req) {
return map_reduce_cf(ctx, req->param["name"], utils::estimated_histogram(0), [](column_family& cf) {
return map_reduce_cf_time_histogram(ctx, req->param["name"], [](const column_family& cf) {
return cf.get_stats().estimated_write;
},
utils::estimated_histogram_merge, utils_json::estimated_histogram());
});
});
cf::set_compaction_strategy_class.set(r, [&ctx](std::unique_ptr<request> req) {
@@ -970,42 +984,20 @@ void set_column_family(http_context& ctx, routes& r) {
});
});
cf::toppartitions.set(r, [&ctx] (std::unique_ptr<request> req) {
auto name_param = req->param["name"];
auto [ks, cf] = parse_fully_qualified_cf_name(name_param);
auto name = req->param["name"];
auto [ks, cf] = parse_fully_qualified_cf_name(name);
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
api::req_param<unsigned> capacity(*req, "capacity", 256);
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: name={} duration={} list_size={} capacity={}",
name_param, duration.param, list_size.param, capacity.param);
name, duration.param, list_size.param, capacity.param);
return seastar::do_with(db::toppartitions_query(ctx.db, ks, cf, duration.value, list_size, capacity), [&ctx](auto& q) {
return q.scatter().then([&q] {
return sleep(q.duration()).then([&q] {
return q.gather(q.capacity()).then([&q] (auto topk_results) {
apilog.debug("toppartitions query: processing results");
cf::toppartitions_query_results results;
for (auto& d: topk_results.read.top(q.list_size())) {
cf::toppartitions_record r;
r.partition = sstring(d.item);
r.count = d.count;
r.error = d.error;
results.read.push(r);
}
for (auto& d: topk_results.write.top(q.list_size())) {
cf::toppartitions_record r;
r.partition = sstring(d.item);
r.count = d.count;
r.error = d.error;
results.write.push(r);
}
return make_ready_future<json::json_return_type>(results);
});
});
});
return seastar::do_with(db::toppartitions_query(ctx.db, {{ks, cf}}, {}, duration.value, list_size, capacity), [&ctx] (db::toppartitions_query& q) {
return run_toppartitions_query(q, ctx, true);
});
});


@@ -68,6 +68,8 @@ future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& n
});
}
future<json::json_return_type> map_reduce_cf_time_histogram(http_context& ctx, const sstring& name, std::function<utils::time_estimated_histogram(const column_family&)> f);
struct map_reduce_column_families_locally {
std::any init;
std::function<std::unique_ptr<std::any>(column_family&)> mapper;
@@ -114,4 +116,7 @@ future<json::json_return_type> get_cf_stats(http_context& ctx, const sstring& n
future<json::json_return_type> get_cf_stats(http_context& ctx,
int64_t column_family_stats::*f);
std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name);
}


@@ -58,6 +58,7 @@ void set_compaction_manager(http_context& ctx, routes& r) {
for (const auto& c : cm.get_compactions()) {
cm::summary s;
s.id = c->compaction_uuid.to_sstring();
s.ks = c->ks_name;
s.cf = c->cf_name;
s.unit = "keys";


@@ -66,6 +66,13 @@ void set_gossiper(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(json_void());
});
});
httpd::gossiper_json::force_remove_endpoint.set(r, [](std::unique_ptr<request> req) {
gms::inet_address ep(req->param["addr"]);
return gms::get_local_gossiper().force_remove_endpoint(ep).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
}
}


@@ -26,6 +26,7 @@
#include <seastar/http/exception.hh>
#include "utils/logalloc.hh"
#include "log.hh"
#include "database.hh"
namespace api {


@@ -53,8 +53,8 @@ std::vector<message_counter> map_to_message_counters(
* according to a function that it gets as a parameter.
*
*/
future_json_function get_client_getter(std::function<uint64_t(const shard_info&)> f) {
return [f](std::unique_ptr<request> req) {
future_json_function get_client_getter(sharded<netw::messaging_service>& ms, std::function<uint64_t(const shard_info&)> f) {
return [&ms, f](std::unique_ptr<request> req) {
using map_type = std::unordered_map<gms::inet_address, uint64_t>;
auto get_shard_map = [f](messaging_service& ms) {
std::unordered_map<gms::inet_address, unsigned long> map;
@@ -63,15 +63,15 @@ future_json_function get_client_getter(std::function<uint64_t(const shard_info&)
});
return map;
};
return get_messaging_service().map_reduce0(get_shard_map, map_type(), map_sum<map_type>).
return ms.map_reduce0(get_shard_map, map_type(), map_sum<map_type>).
then([](map_type&& map) {
return make_ready_future<json::json_return_type>(map_to_message_counters(map));
});
};
}
future_json_function get_server_getter(std::function<uint64_t(const rpc::stats&)> f) {
return [f](std::unique_ptr<request> req) {
future_json_function get_server_getter(sharded<netw::messaging_service>& ms, std::function<uint64_t(const rpc::stats&)> f) {
return [&ms, f](std::unique_ptr<request> req) {
using map_type = std::unordered_map<gms::inet_address, uint64_t>;
auto get_shard_map = [f](messaging_service& ms) {
std::unordered_map<gms::inet_address, unsigned long> map;
@@ -80,53 +80,57 @@ future_json_function get_server_getter(std::function<uint64_t(const rpc::stats&)
});
return map;
};
return get_messaging_service().map_reduce0(get_shard_map, map_type(), map_sum<map_type>).
return ms.map_reduce0(get_shard_map, map_type(), map_sum<map_type>).
then([](map_type&& map) {
return make_ready_future<json::json_return_type>(map_to_message_counters(map));
});
};
}
void set_messaging_service(http_context& ctx, routes& r) {
get_timeout_messages.set(r, get_client_getter([](const shard_info& c) {
void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging_service>& ms) {
get_timeout_messages.set(r, get_client_getter(ms, [](const shard_info& c) {
return c.get_stats().timeout;
}));
get_sent_messages.set(r, get_client_getter([](const shard_info& c) {
get_sent_messages.set(r, get_client_getter(ms, [](const shard_info& c) {
return c.get_stats().sent_messages;
}));
get_dropped_messages.set(r, get_client_getter([](const shard_info& c) {
get_replied_messages.set(r, get_client_getter(ms, [](const shard_info& c) {
return c.get_stats().replied;
}));
get_dropped_messages.set(r, get_client_getter(ms, [](const shard_info& c) {
// We don't have the same drop message mechanism
// as origin has.
// hence we can always return 0
return 0;
}));
get_exception_messages.set(r, get_client_getter([](const shard_info& c) {
get_exception_messages.set(r, get_client_getter(ms, [](const shard_info& c) {
return c.get_stats().exception_received;
}));
get_pending_messages.set(r, get_client_getter([](const shard_info& c) {
get_pending_messages.set(r, get_client_getter(ms, [](const shard_info& c) {
return c.get_stats().pending;
}));
get_respond_pending_messages.set(r, get_server_getter([](const rpc::stats& c) {
get_respond_pending_messages.set(r, get_server_getter(ms, [](const rpc::stats& c) {
return c.pending;
}));
get_respond_completed_messages.set(r, get_server_getter([](const rpc::stats& c) {
get_respond_completed_messages.set(r, get_server_getter(ms, [](const rpc::stats& c) {
return c.sent_messages;
}));
get_version.set(r, [](const_req req) {
return netw::get_local_messaging_service().get_raw_version(req.get_query_param("addr"));
get_version.set(r, [&ms](const_req req) {
return ms.local().get_raw_version(req.get_query_param("addr"));
});
get_dropped_messages_by_ver.set(r, [](std::unique_ptr<request> req) {
get_dropped_messages_by_ver.set(r, [&ms](std::unique_ptr<request> req) {
shared_ptr<std::vector<uint64_t>> map = make_shared<std::vector<uint64_t>>(num_verb);
return netw::get_messaging_service().map_reduce([map](const uint64_t* local_map) mutable {
return ms.map_reduce([map](const uint64_t* local_map) mutable {
for (auto i = 0; i < num_verb; i++) {
(*map)[i]+= local_map[i];
}
@@ -151,5 +155,19 @@ void set_messaging_service(http_context& ctx, routes& r) {
});
});
}
void unset_messaging_service(http_context& ctx, routes& r) {
get_timeout_messages.unset(r);
get_sent_messages.unset(r);
get_replied_messages.unset(r);
get_dropped_messages.unset(r);
get_exception_messages.unset(r);
get_pending_messages.unset(r);
get_respond_pending_messages.unset(r);
get_respond_completed_messages.unset(r);
get_version.unset(r);
get_dropped_messages_by_ver.unset(r);
}
}


@@ -23,8 +23,11 @@
#include "api.hh"
namespace netw { class messaging_service; }
namespace api {
void set_messaging_service(http_context& ctx, routes& r);
void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging_service>& ms);
void unset_messaging_service(http_context& ctx, routes& r);
}


@@ -201,29 +201,39 @@ void set_storage_proxy(http_context& ctx, routes& r) {
});
sp::get_hinted_handoff_enabled.set(r, [&ctx](std::unique_ptr<request> req) {
auto enabled = ctx.db.local().get_config().hinted_handoff_enabled();
return make_ready_future<json::json_return_type>(enabled);
const auto& filter = service::get_storage_proxy().local().get_hints_host_filter();
return make_ready_future<json::json_return_type>(!filter.is_disabled_for_all());
});
sp::set_hinted_handoff_enabled.set(r, [](std::unique_ptr<request> req) {
//TBD
unimplemented();
auto enable = req->get_query_param("enable");
return make_ready_future<json::json_return_type>(json_void());
auto filter = (enable == "true" || enable == "1")
? db::hints::host_filter(db::hints::host_filter::enabled_for_all_tag {})
: db::hints::host_filter(db::hints::host_filter::disabled_for_all_tag {});
return service::get_storage_proxy().invoke_on_all([filter = std::move(filter)] (service::storage_proxy& sp) {
return sp.change_hints_host_filter(filter);
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
sp::get_hinted_handoff_enabled_by_dc.set(r, [](std::unique_ptr<request> req) {
//TBD
unimplemented();
std::vector<sp::mapper_list> res;
std::vector<sstring> res;
const auto& filter = service::get_storage_proxy().local().get_hints_host_filter();
const auto& dcs = filter.get_dcs();
res.reserve(dcs.size());
std::copy(dcs.begin(), dcs.end(), std::back_inserter(res));
return make_ready_future<json::json_return_type>(res);
});
sp::set_hinted_handoff_enabled_by_dc_list.set(r, [](std::unique_ptr<request> req) {
//TBD
unimplemented();
auto enable = req->get_query_param("dcs");
return make_ready_future<json::json_return_type>(json_void());
auto dcs = req->get_query_param("dcs");
auto filter = db::hints::host_filter::parse_from_dc_list(std::move(dcs));
return service::get_storage_proxy().invoke_on_all([filter = std::move(filter)] (service::storage_proxy& sp) {
return sp.change_hints_host_filter(filter);
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
sp::get_max_hint_window.set(r, [](std::unique_ptr<request> req) {


@@ -22,10 +22,14 @@
#include "storage_service.hh"
#include "api/api-doc/storage_service.json.hh"
#include "db/config.hh"
#include <optional>
#include "db/schema_tables.hh"
#include "utils/hash.hh"
#include <sstream>
#include <time.h>
#include <boost/range/adaptor/map.hpp>
#include <boost/range/adaptor/filtered.hpp>
#include <boost/algorithm/string/trim_all.hpp>
#include <boost/functional/hash.hpp>
#include "service/storage_service.hh"
#include "service/load_meter.hh"
#include "db/commitlog/commitlog.hh"
@@ -41,11 +45,20 @@
#include "sstables/sstables.hh"
#include "database.hh"
#include "db/extensions.hh"
#include "db/snapshot-ctl.hh"
#include "transport/controller.hh"
#include "thrift/controller.hh"
#include "locator/token_metadata.hh"
#include "cdc/generation_service.hh"
extern logging::logger apilog;
namespace api {
const locator::token_metadata& http_context::get_token_metadata() {
return *shared_token_metadata.local().get();
}
namespace ss = httpd::storage_service_json;
using namespace json;
@@ -87,6 +100,37 @@ static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
};
}
seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request) {
namespace cf = httpd::column_family_json;
return q.scatter().then([&q, legacy_request] {
return sleep(q.duration()).then([&q, legacy_request] {
return q.gather(q.capacity()).then([&q, legacy_request] (auto topk_results) {
apilog.debug("toppartitions query: processing results");
cf::toppartitions_query_results results;
results.read_cardinality = topk_results.read.size();
results.write_cardinality = topk_results.write.size();
for (auto& d: topk_results.read.top(q.list_size())) {
cf::toppartitions_record r;
r.partition = (legacy_request ? "" : "(" + d.item.schema->ks_name() + ":" + d.item.schema->cf_name() + ") ") + sstring(d.item);
r.count = d.count;
r.error = d.error;
results.read.push(r);
}
for (auto& d: topk_results.write.top(q.list_size())) {
cf::toppartitions_record r;
r.partition = (legacy_request ? "" : "(" + d.item.schema->ks_name() + ":" + d.item.schema->cf_name() + ") ") + sstring(d.item);
r.count = d.count;
r.error = d.error;
results.write.push(r);
}
return make_ready_future<json::json_return_type>(results);
});
});
});
}
future<json::json_return_type> set_tables_autocompaction(http_context& ctx, const sstring &keyspace, std::vector<sstring> tables, bool enabled) {
if (tables.empty()) {
tables = map_keys(ctx.db.local().find_keyspace(keyspace).metadata().get()->cf_meta_data());
@@ -149,6 +193,104 @@ void unset_rpc_controller(http_context& ctx, routes& r) {
ss::is_rpc_server_running.unset(r);
}
void set_repair(http_context& ctx, routes& r, sharded<netw::messaging_service>& ms) {
ss::repair_async.set(r, [&ctx, &ms](std::unique_ptr<request> req) {
static std::vector<sstring> options = {"primaryRange", "parallelism", "incremental",
"jobThreads", "ranges", "columnFamilies", "dataCenters", "hosts", "ignore_nodes", "trace",
"startToken", "endToken" };
std::unordered_map<sstring, sstring> options_map;
for (auto o : options) {
auto s = req->get_query_param(o);
if (s != "") {
options_map[o] = s;
}
}
// The repair process is asynchronous: repair_start only starts it and
// returns immediately, not waiting for the repair to finish. The user
// then has other mechanisms to track the ongoing repair's progress,
// or stop it.
return repair_start(ctx.db, ms, validate_keyspace(ctx, req->param),
options_map).then([] (int i) {
return make_ready_future<json::json_return_type>(i);
});
});
ss::get_active_repair_async.set(r, [&ctx](std::unique_ptr<request> req) {
return get_active_repairs(ctx.db).then([] (std::vector<int> res){
return make_ready_future<json::json_return_type>(res);
});
});
ss::repair_async_status.set(r, [&ctx](std::unique_ptr<request> req) {
return repair_get_status(ctx.db, boost::lexical_cast<int>( req->get_query_param("id")))
.then_wrapped([] (future<repair_status>&& fut) {
ss::ns_repair_async_status::return_type_wrapper res;
try {
res = fut.get0();
} catch(std::runtime_error& e) {
throw httpd::bad_param_exception(e.what());
}
return make_ready_future<json::json_return_type>(json::json_return_type(res));
});
});
ss::repair_await_completion.set(r, [&ctx](std::unique_ptr<request> req) {
int id;
using clock = std::chrono::steady_clock;
clock::time_point expire;
try {
id = boost::lexical_cast<int>(req->get_query_param("id"));
// If timeout is not provided, it means no timeout.
sstring s = req->get_query_param("timeout");
int64_t timeout = s.empty() ? int64_t(-1) : boost::lexical_cast<int64_t>(s);
if (timeout < 0 && timeout != -1) {
return make_exception_future<json::json_return_type>(
httpd::bad_param_exception("timeout can only be -1 (means no timeout) or non negative integer"));
}
if (timeout < 0) {
expire = clock::time_point::max();
} else {
expire = clock::now() + std::chrono::seconds(timeout);
}
} catch (std::exception& e) {
return make_exception_future<json::json_return_type>(httpd::bad_param_exception(e.what()));
}
return repair_await_completion(ctx.db, id, expire)
.then_wrapped([] (future<repair_status>&& fut) {
ss::ns_repair_async_status::return_type_wrapper res;
try {
res = fut.get0();
} catch (std::exception& e) {
return make_exception_future<json::json_return_type>(httpd::bad_param_exception(e.what()));
}
return make_ready_future<json::json_return_type>(json::json_return_type(res));
});
});
ss::force_terminate_all_repair_sessions.set(r, [](std::unique_ptr<request> req) {
return repair_abort_all(service::get_local_storage_service().db()).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::force_terminate_all_repair_sessions_new.set(r, [](std::unique_ptr<request> req) {
return repair_abort_all(service::get_local_storage_service().db()).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
}
void unset_repair(http_context& ctx, routes& r) {
ss::repair_async.unset(r);
ss::get_active_repair_async.unset(r);
ss::repair_async_status.unset(r);
ss::repair_await_completion.unset(r);
ss::force_terminate_all_repair_sessions.unset(r);
ss::force_terminate_all_repair_sessions_new.unset(r);
}
void set_storage_service(http_context& ctx, routes& r) {
ss::local_hostid.set(r, [](std::unique_ptr<request> req) {
return db::system_keyspace::get_local_host_id().then([](const utils::UUID& id) {
@@ -157,14 +299,14 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::get_tokens.set(r, [&ctx] (std::unique_ptr<request> req) {
return make_ready_future<json::json_return_type>(stream_range_as_array(ctx.token_metadata.local().sorted_tokens(), [](const dht::token& i) {
return make_ready_future<json::json_return_type>(stream_range_as_array(ctx.get_token_metadata().sorted_tokens(), [](const dht::token& i) {
return boost::lexical_cast<std::string>(i);
}));
});
ss::get_node_tokens.set(r, [&ctx] (std::unique_ptr<request> req) {
gms::inet_address addr(req->param["endpoint"]);
return make_ready_future<json::json_return_type>(stream_range_as_array(ctx.token_metadata.local().get_tokens(addr), [](const dht::token& i) {
return make_ready_future<json::json_return_type>(stream_range_as_array(ctx.get_token_metadata().get_tokens(addr), [](const dht::token& i) {
return boost::lexical_cast<std::string>(i);
}));
});
@@ -182,8 +324,58 @@ void set_storage_service(http_context& ctx, routes& r) {
}));
});
ss::toppartitions_generic.set(r, [&ctx] (std::unique_ptr<request> req) {
bool filters_provided = false;
std::unordered_set<std::tuple<sstring, sstring>, utils::tuple_hash> table_filters {};
if (req->query_parameters.contains("table_filters")) {
filters_provided = true;
auto filters = req->get_query_param("table_filters");
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
table_filters.emplace(parse_fully_qualified_cf_name(filter));
}
}
std::unordered_set<sstring> keyspace_filters {};
if (req->query_parameters.contains("keyspace_filters")) {
filters_provided = true;
auto filters = req->get_query_param("keyspace_filters");
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
keyspace_filters.emplace(std::move(filter));
}
}
// when the query is empty return immediately
if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {
apilog.debug("toppartitions query: processing results");
httpd::column_family_json::toppartitions_query_results results;
results.read_cardinality = 0;
results.write_cardinality = 0;
return make_ready_future<json::json_return_type>(results);
}
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
api::req_param<unsigned> capacity(*req, "capacity", 256);
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: #table_filters={} #keyspace_filters={} duration={} list_size={} capacity={}",
!table_filters.empty() ? std::to_string(table_filters.size()) : "all", !keyspace_filters.empty() ? std::to_string(keyspace_filters.size()) : "all", duration.param, list_size.param, capacity.param);
return seastar::do_with(db::toppartitions_query(ctx.db, std::move(table_filters), std::move(keyspace_filters), duration.value, list_size, capacity), [&ctx] (db::toppartitions_query& q) {
return run_toppartitions_query(q, ctx);
});
});
ss::get_leaving_nodes.set(r, [&ctx](const_req req) {
return container_to_vec(ctx.token_metadata.local().get_leaving_endpoints());
return container_to_vec(ctx.get_token_metadata().get_leaving_endpoints());
});
ss::get_moving_nodes.set(r, [](const_req req) {
@@ -192,7 +384,7 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::get_joining_nodes.set(r, [&ctx](const_req req) {
auto points = ctx.token_metadata.local().get_bootstrap_tokens();
auto points = ctx.get_token_metadata().get_bootstrap_tokens();
std::unordered_set<sstring> addr;
for (auto i: points) {
addr.insert(boost::lexical_cast<std::string>(i.second));
@@ -220,11 +412,26 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::get_range_to_endpoint_map.set(r, [&ctx](std::unique_ptr<request> req) {
//TBD
unimplemented();
auto keyspace = validate_keyspace(ctx, req->param);
std::vector<ss::maplist_mapper> res;
return make_ready_future<json::json_return_type>(res);
return make_ready_future<json::json_return_type>(stream_range_as_array(service::get_local_storage_service().get_range_to_address_map(keyspace),
[](const std::pair<dht::token_range, std::vector<gms::inet_address>>& entry){
ss::maplist_mapper m;
if (entry.first.start()) {
m.key.push(entry.first.start().value().value().to_sstring());
} else {
m.key.push("");
}
if (entry.first.end()) {
m.key.push(entry.first.end().value().value().to_sstring());
} else {
m.key.push("");
}
for (const gms::inet_address& address : entry.second) {
m.value.push(address.to_sstring());
}
return m;
}));
});
ss::get_pending_range_to_endpoint_map.set(r, [&ctx](std::unique_ptr<request> req) {
@@ -246,7 +453,7 @@ void set_storage_service(http_context& ctx, routes& r) {
ss::get_host_id_map.set(r, [&ctx](const_req req) {
std::vector<ss::mapper> res;
return map_to_key_value(ctx.token_metadata.local().get_endpoint_to_host_id_map_for_reading(), res);
return map_to_key_value(ctx.get_token_metadata().get_endpoint_to_host_id_map_for_reading(), res);
});
ss::get_load.set(r, [&ctx](std::unique_ptr<request> req) {
@@ -280,7 +487,7 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::cdc_streams_check_and_repair.set(r, [&ctx] (std::unique_ptr<request> req) {
return service::get_local_storage_service().check_and_repair_cdc_streams().then([] {
return service::get_local_storage_service().get_cdc_generation_service().check_and_repair_cdc_streams().then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
@@ -322,8 +529,8 @@ void set_storage_service(http_context& ctx, routes& r) {
for (auto cf : column_families) {
column_families_vec.push_back(&db.find_column_family(keyspace, cf));
}
return parallel_for_each(column_families_vec, [&cm] (column_family* cf) {
return cm.perform_cleanup(cf);
return parallel_for_each(column_families_vec, [&cm, &db] (column_family* cf) {
return cm.perform_cleanup(db, cf);
});
}).then([]{
return make_ready_future<json::json_return_type>(0);
@@ -338,7 +545,7 @@ void set_storage_service(http_context& ctx, routes& r) {
return do_for_each(column_families, [=, &db](sstring cfname) {
auto& cm = db.get_compaction_manager();
auto& cf = db.find_column_family(keyspace, cfname);
return cm.perform_sstable_upgrade(&cf, exclude_current_version);
return cm.perform_sstable_upgrade(db, &cf, exclude_current_version);
});
}).then([]{
return make_ready_future<json::json_return_type>(0);
@@ -361,59 +568,6 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::repair_async.set(r, [&ctx](std::unique_ptr<request> req) {
static std::vector<sstring> options = {"primaryRange", "parallelism", "incremental",
"jobThreads", "ranges", "columnFamilies", "dataCenters", "hosts", "trace",
"startToken", "endToken" };
std::unordered_map<sstring, sstring> options_map;
for (auto o : options) {
auto s = req->get_query_param(o);
if (s != "") {
options_map[o] = s;
}
}
// The repair process is asynchronous: repair_start only starts it and
// returns immediately, not waiting for the repair to finish. The user
// then has other mechanisms to track the ongoing repair's progress,
// or stop it.
return repair_start(ctx.db, validate_keyspace(ctx, req->param),
options_map).then([] (int i) {
return make_ready_future<json::json_return_type>(i);
});
});
ss::get_active_repair_async.set(r, [&ctx](std::unique_ptr<request> req) {
return get_active_repairs(ctx.db).then([] (std::vector<int> res){
return make_ready_future<json::json_return_type>(res);
});
});
ss::repair_async_status.set(r, [&ctx](std::unique_ptr<request> req) {
return repair_get_status(ctx.db, boost::lexical_cast<int>( req->get_query_param("id")))
.then_wrapped([] (future<repair_status>&& fut) {
ss::ns_repair_async_status::return_type_wrapper res;
try {
res = fut.get0();
} catch(std::runtime_error& e) {
throw httpd::bad_param_exception(e.what());
}
return make_ready_future<json::json_return_type>(json::json_return_type(res));
});
});
ss::force_terminate_all_repair_sessions.set(r, [](std::unique_ptr<request> req) {
return repair_abort_all(service::get_local_storage_service().db()).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::force_terminate_all_repair_sessions_new.set(r, [](std::unique_ptr<request> req) {
return repair_abort_all(service::get_local_storage_service().db()).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::decommission.set(r, [](std::unique_ptr<request> req) {
return service::get_local_storage_service().decommission().then([] {
return make_ready_future<json::json_return_type>(json_void());
@@ -429,7 +583,22 @@ void set_storage_service(http_context& ctx, routes& r) {
ss::remove_node.set(r, [](std::unique_ptr<request> req) {
auto host_id = req->get_query_param("host_id");
return service::get_local_storage_service().removenode(host_id).then([] {
std::vector<sstring> ignore_nodes_strs= split(req->get_query_param("ignore_nodes"), ",");
auto ignore_nodes = std::list<gms::inet_address>();
for (std::string n : ignore_nodes_strs) {
try {
std::replace(n.begin(), n.end(), '\"', ' ');
std::replace(n.begin(), n.end(), '\'', ' ');
boost::trim_all(n);
if (!n.empty()) {
auto node = gms::inet_address(n);
ignore_nodes.push_back(node);
}
} catch (...) {
throw std::runtime_error(format("Failed to parse ignore_nodes parameter: ignore_nodes={}, node={}", ignore_nodes_strs, n));
}
}
return service::get_local_storage_service().removenode(host_id, std::move(ignore_nodes)).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
@@ -649,11 +818,19 @@ void set_storage_service(http_context& ctx, routes& r) {
ss::load_new_ss_tables.set(r, [&ctx](std::unique_ptr<request> req) {
auto ks = validate_keyspace(ctx, req->param);
auto cf = req->get_query_param("cf");
auto stream = req->get_query_param("load_and_stream");
auto primary_replica = req->get_query_param("primary_replica_only");
boost::algorithm::to_lower(stream);
boost::algorithm::to_lower(primary_replica);
bool load_and_stream = stream == "true" || stream == "1";
bool primary_replica_only = primary_replica == "true" || primary_replica == "1";
// No need to add the keyspace, since all we want is to avoid always sending this to the same
// CPU. Even then I am being overzealous here. This is not something that happens all the time.
auto coordinator = std::hash<sstring>()(cf) % smp::count;
return service::get_storage_service().invoke_on(coordinator, [ks = std::move(ks), cf = std::move(cf)] (service::storage_service& s) {
return s.load_new_sstables(ks, cf);
return service::get_storage_service().invoke_on(coordinator,
[ks = std::move(ks), cf = std::move(cf),
load_and_stream, primary_replica_only] (service::storage_service& s) {
return s.load_new_sstables(ks, cf, load_and_stream, primary_replica_only);
}).then_wrapped([] (auto&& f) {
if (f.failed()) {
auto msg = fmt::format("Failed to load new sstables: {}", f.get_exception());
@@ -671,9 +848,12 @@ void set_storage_service(http_context& ctx, routes& r) {
});
ss::reset_local_schema.set(r, [](std::unique_ptr<request> req) {
//TBD
unimplemented();
return make_ready_future<json::json_return_type>(json_void());
// FIXME: We should truncate schema tables if more than one node in the cluster.
auto& sp = service::get_storage_proxy();
auto& fs = service::get_local_storage_service().features();
return db::schema_tables::recalculate_schema_version(sp, fs).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::set_trace_probability.set(r, [](std::unique_ptr<request> req) {
@@ -706,6 +886,7 @@ void set_storage_service(http_context& ctx, routes& r) {
res.enable = tracing::tracing::get_local_tracing_instance().slow_query_tracing_enabled();
res.ttl = tracing::tracing::get_local_tracing_instance().slow_query_record_ttl().count() ;
res.threshold = tracing::tracing::get_local_tracing_instance().slow_query_threshold().count();
res.fast = tracing::tracing::get_local_tracing_instance().ignore_trace_events_enabled();
return res;
});
@@ -713,8 +894,9 @@ void set_storage_service(http_context& ctx, routes& r) {
auto enable = req->get_query_param("enable");
auto ttl = req->get_query_param("ttl");
auto threshold = req->get_query_param("threshold");
auto fast = req->get_query_param("fast");
try {
return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold] (auto& local_tracing) {
return tracing::tracing::tracing_instance().invoke_on_all([enable, ttl, threshold, fast] (auto& local_tracing) {
if (threshold != "") {
local_tracing.set_slow_query_threshold(std::chrono::microseconds(std::stol(threshold.c_str())));
}
@@ -724,6 +906,9 @@ void set_storage_service(http_context& ctx, routes& r) {
if (enable != "") {
local_tracing.set_slow_query_enabled(strcasecmp(enable.c_str(), "true") == 0);
}
if (fast != "") {
local_tracing.set_ignore_trace_events(strcasecmp(fast.c_str(), "true") == 0);
}
}).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
@@ -893,7 +1078,7 @@ void set_storage_service(http_context& ctx, routes& r) {
tst.keyspace = schema->ks_name();
tst.table = schema->cf_name();
for (auto sstable : *t->get_sstables_including_compacted_undeleted()) {
for (auto sstables = t->get_sstables_including_compacted_undeleted(); auto sstable : *sstables) {
auto ts = db_clock::to_time_t(sstable->data_file_write_time());
::tm t;
::gmtime_r(&ts, &t);
@@ -921,7 +1106,7 @@ void set_storage_service(http_context& ctx, routes& r) {
e.value = p.second;
nm.attributes.push(std::move(e));
}
if (!cp->options().count(compression_parameters::SSTABLE_COMPRESSION)) {
if (!cp->options().contains(compression_parameters::SSTABLE_COMPRESSION)) {
ss::mapper e;
e.key = compression_parameters::SSTABLE_COMPRESSION;
e.value = cp->name();
@@ -979,31 +1164,29 @@ void set_storage_service(http_context& ctx, routes& r) {
}
void set_snapshot(http_context& ctx, routes& r) {
ss::get_snapshot_details.set(r, [](std::unique_ptr<request> req) {
std::function<future<>(output_stream<char>&&)> f = [](output_stream<char>&& s) {
return do_with(output_stream<char>(std::move(s)), true, [] (output_stream<char>& s, bool& first){
return s.write("[").then([&s, &first] {
return service::get_local_storage_service().get_snapshot_details().then([&s, &first] (std::unordered_map<sstring, std::vector<service::storage_service::snapshot_details>>&& result) {
return do_with(std::move(result), [&s, &first](const std::unordered_map<sstring, std::vector<service::storage_service::snapshot_details>>& result) {
return do_for_each(result, [&s, &result,&first](std::tuple<sstring, std::vector<service::storage_service::snapshot_details>>&& map){
return do_with(ss::snapshots(), [&s, &first, &result, &map](ss::snapshots& all_snapshots) {
all_snapshots.key = std::get<0>(map);
future<> f = first ? make_ready_future<>() : s.write(", ");
first = false;
std::vector<ss::snapshot> snapshot;
for (auto& cf: std::get<1>(map)) {
ss::snapshot snp;
snp.ks = cf.ks;
snp.cf = cf.cf;
snp.live = cf.live;
snp.total = cf.total;
snapshot.push_back(std::move(snp));
}
all_snapshots.value = std::move(snapshot);
return f.then([&s, &all_snapshots] {
return all_snapshots.write(s);
});
void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_ctl) {
ss::get_snapshot_details.set(r, [&snap_ctl](std::unique_ptr<request> req) {
return snap_ctl.local().get_snapshot_details().then([] (std::unordered_map<sstring, std::vector<db::snapshot_ctl::snapshot_details>>&& result) {
std::function<future<>(output_stream<char>&&)> f = [result = std::move(result)](output_stream<char>&& s) {
return do_with(output_stream<char>(std::move(s)), true, [&result] (output_stream<char>& s, bool& first){
return s.write("[").then([&s, &first, &result] {
return do_for_each(result, [&s, &first](std::tuple<sstring, std::vector<db::snapshot_ctl::snapshot_details>>&& map){
return do_with(ss::snapshots(), [&s, &first, &map](ss::snapshots& all_snapshots) {
all_snapshots.key = std::get<0>(map);
future<> f = first ? make_ready_future<>() : s.write(", ");
first = false;
std::vector<ss::snapshot> snapshot;
for (auto& cf: std::get<1>(map)) {
ss::snapshot snp;
snp.ks = cf.ks;
snp.cf = cf.cf;
snp.live = cf.live;
snp.total = cf.total;
snapshot.push_back(std::move(snp));
}
all_snapshots.value = std::move(snapshot);
return f.then([&s, &all_snapshots] {
return all_snapshots.write(s);
});
});
});
@@ -1013,12 +1196,13 @@ void set_snapshot(http_context& ctx, routes& r) {
});
});
});
});
};
return make_ready_future<json::json_return_type>(std::move(f));
};
return make_ready_future<json::json_return_type>(std::move(f));
});
});
ss::take_snapshot.set(r, [](std::unique_ptr<request> req) {
ss::take_snapshot.set(r, [&snap_ctl](std::unique_ptr<request> req) {
auto tag = req->get_query_param("tag");
auto column_families = split(req->get_query_param("cf"), ",");
@@ -1026,7 +1210,7 @@ void set_snapshot(http_context& ctx, routes& r) {
auto resp = make_ready_future<>();
if (column_families.empty()) {
resp = service::get_local_storage_service().take_snapshot(tag, keynames);
resp = snap_ctl.local().take_snapshot(tag, keynames);
} else {
if (keynames.empty()) {
throw httpd::bad_param_exception("The keyspace of column families must be specified");
@@ -1034,37 +1218,37 @@ void set_snapshot(http_context& ctx, routes& r) {
if (keynames.size() > 1) {
throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");
}
resp = service::get_local_storage_service().take_column_family_snapshot(keynames[0], column_families, tag);
resp = snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag);
}
return resp.then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::del_snapshot.set(r, [](std::unique_ptr<request> req) {
ss::del_snapshot.set(r, [&snap_ctl](std::unique_ptr<request> req) {
auto tag = req->get_query_param("tag");
auto column_family = req->get_query_param("cf");
std::vector<sstring> keynames = split(req->get_query_param("kn"), ",");
return service::get_local_storage_service().clear_snapshot(tag, keynames, column_family).then([] {
return snap_ctl.local().clear_snapshot(tag, keynames, column_family).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::true_snapshots_size.set(r, [](std::unique_ptr<request> req) {
return service::get_local_storage_service().true_snapshots_size().then([] (int64_t size) {
ss::true_snapshots_size.set(r, [&snap_ctl](std::unique_ptr<request> req) {
return snap_ctl.local().true_snapshots_size().then([] (int64_t size) {
return make_ready_future<json::json_return_type>(size);
});
});
ss::scrub.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<sstring> column_families) {
ss::scrub.set(r, wrap_ks_cf(ctx, [&snap_ctl] (http_context& ctx, std::unique_ptr<request> req, sstring keyspace, std::vector<sstring> column_families) {
const auto skip_corrupted = req_param<bool>(*req, "skip_corrupted", false);
auto f = make_ready_future<>();
if (!req_param<bool>(*req, "disable_snapshot", false)) {
auto tag = format("pre-scrub-{:d}", db_clock::now().time_since_epoch().count());
f = parallel_for_each(column_families, [keyspace, tag](sstring cf) {
return service::get_local_storage_service().take_column_family_snapshot(keyspace, cf, tag);
f = parallel_for_each(column_families, [&snap_ctl, keyspace, tag](sstring cf) {
return snap_ctl.local().take_column_family_snapshot(keyspace, cf, tag);
});
}
@@ -1082,4 +1266,12 @@ void set_snapshot(http_context& ctx, routes& r) {
}));
}
void unset_snapshot(http_context& ctx, routes& r) {
ss::get_snapshot_details.unset(r);
ss::take_snapshot.unset(r);
ss::del_snapshot.unset(r);
ss::true_snapshots_size.unset(r);
ss::scrub.unset(r);
}
}


@@ -21,18 +21,26 @@
#pragma once
#include <seastar/core/sharded.hh>
#include "api.hh"
#include "db/data_listeners.hh"
namespace cql_transport { class controller; }
class thrift_controller;
namespace db { class snapshot_ctl; }
namespace netw { class messaging_service; }
namespace api {
void set_storage_service(http_context& ctx, routes& r);
void set_repair(http_context& ctx, routes& r, sharded<netw::messaging_service>& ms);
void unset_repair(http_context& ctx, routes& r);
void set_transport_controller(http_context& ctx, routes& r, cql_transport::controller& ctl);
void unset_transport_controller(http_context& ctx, routes& r);
void set_rpc_controller(http_context& ctx, routes& r, thrift_controller& ctl);
void unset_rpc_controller(http_context& ctx, routes& r);
void set_snapshot(http_context& ctx, routes& r);
void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_ctl);
void unset_snapshot(http_context& ctx, routes& r);
seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request = false);
}


@@ -25,6 +25,9 @@
#include <seastar/core/reactor.hh>
#include <seastar/http/exception.hh>
#include "log.hh"
#include "database.hh"
extern logging::logger apilog;
namespace api {
@@ -70,6 +73,16 @@ void set_system(http_context& ctx, routes& r) {
}
return json::json_void();
});
hs::drop_sstable_caches.set(r, [&ctx](std::unique_ptr<request> req) {
apilog.info("Dropping sstable caches");
return ctx.db.invoke_on_all([] (database& db) {
return db.drop_caches();
}).then([] {
apilog.info("Caches dropped");
return json::json_return_type(json::json_void());
});
});
}
}


@@ -24,142 +24,130 @@
#include "counters.hh"
#include "types.hh"
/// LSA migrator for cells with irrelevant type
///
///
const data::type_imr_descriptor& no_type_imr_descriptor() {
static thread_local data::type_imr_descriptor state(data::type_info::make_variable_size());
return state;
}
atomic_cell atomic_cell::make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_dead(timestamp, deletion_time), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_dead(timestamp, deletion_time);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, single_fragment_range(value));
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value, atomic_cell::collection_member cm) {
return atomic_cell_type::make_live(timestamp, fragment_range(value));
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, value);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value, collection_member cm)
{
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, value);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, single_fragment_range(value), expiry, ttl);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
return atomic_cell_type::make_live(timestamp, fragment_range(value), expiry, ttl);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
gc_clock::time_point expiry, gc_clock::duration ttl, atomic_cell::collection_member cm) {
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, value, expiry, ttl);
}
atomic_cell atomic_cell::make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member cm)
{
auto& imr_data = type.imr_state();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live(imr_data.type_info(), timestamp, value, expiry, ttl, bool(cm)), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live(timestamp, value, expiry, ttl);
}
atomic_cell atomic_cell::make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live_counter_update(timestamp, value), &imr_data.lsa_migrator())
);
return atomic_cell_type::make_live_counter_update(timestamp, value);
}
atomic_cell atomic_cell::make_live_uninitialized(const abstract_type& type, api::timestamp_type timestamp, size_t size) {
auto& imr_data = no_type_imr_descriptor();
return atomic_cell(
imr_data.type_info(),
imr_object_type::make(data::cell::make_live_uninitialized(imr_data.type_info(), timestamp, size), &imr_data.lsa_migrator())
);
}
static imr::utils::object<data::cell::structure> copy_cell(const data::type_imr_descriptor& imr_data, const uint8_t* ptr)
{
using imr_object_type = imr::utils::object<data::cell::structure>;
// If the cell doesn't own any memory it is trivial and can be copied with
// memcpy.
auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
if (!f.template get<data::cell::tags::external_data>()) {
data::cell::context ctx(f, imr_data.type_info());
// XXX: We may be better off storing the total cell size in memory. Measure!
auto size = data::cell::structure::serialized_object_size(ptr, ctx);
return imr_object_type::make_raw(size, [&] (uint8_t* dst) noexcept {
std::copy_n(ptr, size, dst);
}, &imr_data.lsa_migrator());
}
return imr_object_type::make(data::cell::copy_fn(imr_data.type_info(), ptr), &imr_data.lsa_migrator());
return atomic_cell_type::make_live_uninitialized(timestamp, size);
}
atomic_cell::atomic_cell(const abstract_type& type, atomic_cell_view other)
: atomic_cell(type.imr_state().type_info(),
copy_cell(type.imr_state(), other._view.raw_pointer()))
{ }
: _data(other._view) {
set_view(_data);
}
// Based on:
// - org.apache.cassandra.db.AbstractCell#reconcile()
// - org.apache.cassandra.db.BufferExpiringCell#reconcile()
// - org.apache.cassandra.db.BufferDeletedCell#reconcile()
int
compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {
if (left.timestamp() != right.timestamp()) {
return left.timestamp() > right.timestamp() ? 1 : -1;
}
if (left.is_live() != right.is_live()) {
return left.is_live() ? -1 : 1;
}
if (left.is_live()) {
auto c = compare_unsigned(left.value(), right.value());
if (c != 0) {
return c;
}
if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) {
// prefer expiring cells.
return left.is_live_and_has_ttl() ? 1 : -1;
}
if (left.is_live_and_has_ttl()) {
if (left.expiry() != right.expiry()) {
return left.expiry() < right.expiry() ? -1 : 1;
} else {
// prefer the cell that was written later,
// so it survives longer after it expires, until purged.
if (left.ttl() != right.ttl()) {
return left.ttl() < right.ttl() ? 1 : -1;
} else {
return 0;
}
}
}
} else {
// Both are deleted
if (left.deletion_time() != right.deletion_time()) {
// Origin compares big-endian serialized deletion time. That's because it
// delegates to AbstractCell.reconcile() which compares values after
// comparing timestamps, which in case of deleted cells will hold
// serialized expiry.
return (uint64_t) left.deletion_time().time_since_epoch().count()
< (uint64_t) right.deletion_time().time_since_epoch().count() ? -1 : 1;
}
}
return 0;
}
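// Illustrative sketch only, not part of the change: the merge ordering
// implemented by compare_atomic_cell_for_merge() above, restated on a
// simplified, hypothetical cell_model struct (the real code works on
// atomic_cell_view). The field names and the use of std::string for the
// value are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for atomic_cell_view; not the real type.
struct cell_model {
    int64_t timestamp = 0;
    bool live = true;
    std::string value;         // meaningful for live cells only
    bool has_ttl = false;      // expiring cell; only valid when live
    int64_t expiry = 0;        // meaningful when has_ttl
    int32_t ttl = 0;           // meaningful when has_ttl
    int64_t deletion_time = 0; // meaningful for dead cells only
};

// Returns >0 if left wins the merge, <0 if right wins, 0 if neither does.
inline int compare_for_merge(const cell_model& left, const cell_model& right) {
    assert(left.live || !left.has_ttl);
    assert(right.live || !right.has_ttl);
    // 1. The higher write timestamp wins.
    if (left.timestamp != right.timestamp) {
        return left.timestamp > right.timestamp ? 1 : -1;
    }
    // 2. On a timestamp tie, a dead cell wins over a live one.
    if (left.live != right.live) {
        return left.live ? -1 : 1;
    }
    if (left.live) {
        // 3. Both live: the byte-wise greater value wins.
        if (int c = left.value.compare(right.value); c != 0) {
            return c < 0 ? -1 : 1;
        }
        // 4. Same value: prefer the expiring cell.
        if (left.has_ttl != right.has_ttl) {
            return left.has_ttl ? 1 : -1;
        }
        if (left.has_ttl) {
            // 5. Both expiring: later expiry wins, then the smaller ttl
            //    (the cell that was written later).
            if (left.expiry != right.expiry) {
                return left.expiry < right.expiry ? -1 : 1;
            }
            if (left.ttl != right.ttl) {
                return left.ttl < right.ttl ? 1 : -1;
            }
        }
        return 0;
    }
    // 6. Both dead: the later deletion time wins (compared as unsigned).
    if (left.deletion_time != right.deletion_time) {
        return static_cast<uint64_t>(left.deletion_time)
             < static_cast<uint64_t>(right.deletion_time) ? -1 : 1;
    }
    return 0;
}
```

// The sketch keeps the same order of checks as the function above, so a
// tombstone and a live cell with equal timestamps resolve to the tombstone.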
atomic_cell_or_collection atomic_cell_or_collection::copy(const abstract_type& type) const {
if (!_data.get()) {
if (_data.empty()) {
return atomic_cell_or_collection();
}
auto& imr_data = type.imr_state();
return atomic_cell_or_collection(
copy_cell(imr_data, _data.get())
);
return atomic_cell_or_collection(managed_bytes(_data));
}
atomic_cell_or_collection::atomic_cell_or_collection(const abstract_type& type, atomic_cell_view acv)
: _data(copy_cell(type.imr_state(), acv._view.raw_pointer()))
: _data(acv._view)
{
}
bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_cell_or_collection& other) const
{
auto ptr_a = _data.get();
auto ptr_b = other._data.get();
if (!ptr_a || !ptr_b) {
return !ptr_a && !ptr_b;
if (_data.empty() || other._data.empty()) {
return _data.empty() && other._data.empty();
}
if (type.is_atomic()) {
auto a = atomic_cell_view::from_bytes(type.imr_state().type_info(), _data);
auto b = atomic_cell_view::from_bytes(type.imr_state().type_info(), other._data);
auto a = atomic_cell_view::from_bytes(type, _data);
auto b = atomic_cell_view::from_bytes(type, other._data);
if (a.timestamp() != b.timestamp()) {
return false;
}
@@ -191,28 +179,7 @@ bool atomic_cell_or_collection::equals(const abstract_type& type, const atomic_c
size_t atomic_cell_or_collection::external_memory_usage(const abstract_type& t) const
{
if (!_data.get()) {
return 0;
}
auto ctx = data::cell::context(_data.get(), t.imr_state().type_info());
auto view = data::cell::structure::make_view(_data.get(), ctx);
auto flags = view.get<data::cell::tags::flags>();
size_t external_value_size = 0;
if (flags.get<data::cell::tags::external_data>()) {
if (flags.get<data::cell::tags::collection>()) {
external_value_size = as_collection_mutation().data.size_bytes();
} else {
auto cell_view = data::cell::atomic_cell_view(t.imr_state().type_info(), view);
external_value_size = cell_view.value_size();
}
// Add overhead of chunk headers. The last one is a special case.
external_value_size += (external_value_size - 1) / data::cell::maximum_external_chunk_length * data::cell::external_chunk_overhead;
external_value_size += data::cell::external_last_chunk_overhead;
}
return data::cell::structure::serialized_object_size(_data.get(), ctx)
+ imr_object_type::size_overhead + external_value_size;
return _data.external_memory_usage();
}
std::ostream&
@@ -221,7 +188,7 @@ operator<<(std::ostream& os, const atomic_cell_view& acv) {
return fmt_print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",
acv.is_counter_update()
? "counter_update_value=" + to_sstring(acv.counter_update_value())
: to_hex(acv.value().linearize()),
: to_hex(to_bytes(acv.value())),
acv.timestamp(),
acv.is_live_and_has_ttl() ? acv.expiry().time_since_epoch().count() : -1,
acv.is_live_and_has_ttl() ? acv.ttl().count() : 0);
@@ -247,12 +214,11 @@ operator<<(std::ostream& os, const atomic_cell_view::printer& acvp) {
cell_value_string_builder << "counter_update_value=" << acv.counter_update_value();
} else {
cell_value_string_builder << "shards: ";
counter_cell_view::with_linearized(acv, [&cell_value_string_builder] (counter_cell_view& ccv) {
cell_value_string_builder << ::join(", ", ccv.shards());
});
auto ccv = counter_cell_view(acv);
cell_value_string_builder << ::join(", ", ccv.shards());
}
} else {
cell_value_string_builder << type.to_string(acv.value().linearize());
cell_value_string_builder << type.to_string(to_bytes(acv.value()));
}
return fmt_print(os, "atomic_cell{{{},ts={:d},expiry={:d},ttl={:d}}}",
cell_value_string_builder.str(),
@@ -271,12 +237,11 @@ operator<<(std::ostream& os, const atomic_cell::printer& acp) {
}
std::ostream& operator<<(std::ostream& os, const atomic_cell_or_collection::printer& p) {
if (!p._cell._data.get()) {
if (p._cell._data.empty()) {
return os << "{ null atomic_cell_or_collection }";
}
using dc = data::cell;
os << "{ ";
if (dc::structure::get_member<dc::tags::flags>(p._cell._data.get()).get<dc::tags::collection>()) {
if (p._cdef.type->is_multi_cell()) {
os << "collection ";
auto cmv = p._cell.as_collection_mutation();
os << collection_mutation_view::printer(*p._cdef.type, cmv);

View File

@@ -26,54 +26,205 @@
#include "tombstone.hh"
#include "gc_clock.hh"
#include "utils/managed_bytes.hh"
#include "utils/fragment_range.hh"
#include <seastar/net/byteorder.hh>
#include <seastar/util/bool_class.hh>
#include <cstdint>
#include <iosfwd>
#include "data/cell.hh"
#include "data/schema_info.hh"
#include "imr/utils.hh"
#include <concepts>
#include "utils/fragmented_temporary_buffer.hh"
#include "serializer.hh"
class abstract_type;
class collection_type_impl;
class atomic_cell_or_collection;
using atomic_cell_value_view = data::value_view;
using atomic_cell_value_mutable_view = data::value_mutable_view;
using atomic_cell_value = managed_bytes;
template <mutable_view is_mutable>
using atomic_cell_value_basic_view = managed_bytes_basic_view<is_mutable>;
using atomic_cell_value_view = atomic_cell_value_basic_view<mutable_view::no>;
using atomic_cell_value_mutable_view = atomic_cell_value_basic_view<mutable_view::yes>;
template <typename T>
requires std::is_trivial_v<T>
static void set_field(atomic_cell_value_mutable_view& out, unsigned offset, T val) {
auto out_view = managed_bytes_mutable_view(out);
out_view.remove_prefix(offset);
write<T>(out_view, val);
}
template <typename T>
requires std::is_trivial_v<T>
static void set_field(atomic_cell_value& out, unsigned offset, T val) {
auto out_view = atomic_cell_value_mutable_view(out);
set_field(out_view, offset, val);
}
template <FragmentRange Buffer>
static void set_value(managed_bytes& b, unsigned value_offset, const Buffer& value) {
auto v = managed_bytes_mutable_view(b).substr(value_offset, value.size_bytes());
for (auto frag : value) {
write_fragmented(v, single_fragmented_view(frag));
}
}
template <typename T, FragmentedView Input>
requires std::is_trivial_v<T>
static T get_field(Input in, unsigned offset = 0) {
in.remove_prefix(offset);
return read_simple<T>(in);
}
/*
* Represents atomic cell layout. Works on serialized form.
*
* Layout:
*
* <live> := <int8_t:flags><int64_t:timestamp>(<int64_t:expiry><int32_t:ttl>)?<value>
* <dead> := <int8_t: 0><int64_t:timestamp><int64_t:deletion_time>
*/
class atomic_cell_type final {
private:
static constexpr int8_t LIVE_FLAG = 0x01;
static constexpr int8_t EXPIRY_FLAG = 0x02; // When set, the expiry and ttl fields are present. Set only for live cells
static constexpr int8_t COUNTER_UPDATE_FLAG = 0x08; // Cell is a counter update.
static constexpr unsigned flags_size = 1;
static constexpr unsigned timestamp_offset = flags_size;
static constexpr unsigned timestamp_size = 8;
static constexpr unsigned expiry_offset = timestamp_offset + timestamp_size;
static constexpr unsigned expiry_size = 8;
static constexpr unsigned deletion_time_offset = timestamp_offset + timestamp_size;
static constexpr unsigned deletion_time_size = 8;
static constexpr unsigned ttl_offset = expiry_offset + expiry_size;
static constexpr unsigned ttl_size = 4;
friend class counter_cell_builder;
private:
static bool is_counter_update(atomic_cell_value_view cell) {
return cell.front() & COUNTER_UPDATE_FLAG;
}
static bool is_live(atomic_cell_value_view cell) {
return cell.front() & LIVE_FLAG;
}
static bool is_live_and_has_ttl(atomic_cell_value_view cell) {
return cell.front() & EXPIRY_FLAG;
}
static bool is_dead(atomic_cell_value_view cell) {
return !is_live(cell);
}
// Can be called on live and dead cells
static api::timestamp_type timestamp(atomic_cell_value_view cell) {
return get_field<api::timestamp_type>(cell, timestamp_offset);
}
static void set_timestamp(atomic_cell_value_mutable_view& cell, api::timestamp_type ts) {
set_field(cell, timestamp_offset, ts);
}
// Can be called on live cells only
private:
template <mutable_view is_mutable>
static managed_bytes_basic_view<is_mutable> do_get_value(managed_bytes_basic_view<is_mutable> cell) {
auto expiry_field_size = bool(cell.front() & EXPIRY_FLAG) * (expiry_size + ttl_size);
auto value_offset = flags_size + timestamp_size + expiry_field_size;
cell.remove_prefix(value_offset);
return cell;
}
public:
static atomic_cell_value_view value(managed_bytes_view cell) {
return do_get_value(cell);
}
static atomic_cell_value_mutable_view value(managed_bytes_mutable_view cell) {
return do_get_value(cell);
}
// Can be called on live counter update cells only
static int64_t counter_update_value(atomic_cell_value_view cell) {
return get_field<int64_t>(cell, flags_size + timestamp_size);
}
// Can be called only when is_dead() is true.
static gc_clock::time_point deletion_time(atomic_cell_value_view cell) {
assert(is_dead(cell));
return gc_clock::time_point(gc_clock::duration(get_field<int64_t>(cell, deletion_time_offset)));
}
// Can be called only when is_live_and_has_ttl() is true.
static gc_clock::time_point expiry(atomic_cell_value_view cell) {
assert(is_live_and_has_ttl(cell));
auto expiry = get_field<int64_t>(cell, expiry_offset);
return gc_clock::time_point(gc_clock::duration(expiry));
}
// Can be called only when is_live_and_has_ttl() is true.
static gc_clock::duration ttl(atomic_cell_value_view cell) {
assert(is_live_and_has_ttl(cell));
return gc_clock::duration(get_field<int32_t>(cell, ttl_offset));
}
static managed_bytes make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
managed_bytes b(managed_bytes::initialized_later(), flags_size + timestamp_size + deletion_time_size);
b[0] = 0;
set_field(b, timestamp_offset, timestamp);
set_field(b, deletion_time_offset, static_cast<int64_t>(deletion_time.time_since_epoch().count()));
return b;
}
template <FragmentRange Buffer>
static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size_bytes());
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_value(b, value_offset, value);
return b;
}
static managed_bytes make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + sizeof(value));
b[0] = LIVE_FLAG | COUNTER_UPDATE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, value_offset, value);
return b;
}
template <FragmentRange Buffer>
static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value, gc_clock::time_point expiry, gc_clock::duration ttl) {
auto value_offset = flags_size + timestamp_size + expiry_size + ttl_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size_bytes());
b[0] = EXPIRY_FLAG | LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, expiry_offset, static_cast<int64_t>(expiry.time_since_epoch().count()));
set_field(b, ttl_offset, static_cast<int32_t>(ttl.count()));
set_value(b, value_offset, value);
return b;
}
static managed_bytes make_live_uninitialized(api::timestamp_type timestamp, size_t size) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + size);
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
return b;
}
template <mutable_view is_mutable>
friend class basic_atomic_cell_view;
friend class atomic_cell;
};
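// Illustrative sketch only, not part of the change: the flat cell layout
// documented for atomic_cell_type above, modelled on a plain
// std::vector<uint8_t> instead of managed_bytes. The cell_sketch namespace
// and its helpers are hypothetical, and fields are stored in host byte
// order here for simplicity, which may differ from what the real
// write<T>() / read_simple<T>() helpers do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

namespace cell_sketch {

constexpr uint8_t live_flag = 0x01;
constexpr uint8_t expiry_flag = 0x02;
constexpr std::size_t flags_size = 1;
constexpr std::size_t timestamp_offset = flags_size;
constexpr std::size_t timestamp_size = 8;
constexpr std::size_t expiry_offset = timestamp_offset + timestamp_size;
constexpr std::size_t expiry_size = 8;
constexpr std::size_t ttl_offset = expiry_offset + expiry_size;
constexpr std::size_t ttl_size = 4;

// Write a trivially copyable field at a fixed offset.
template <typename T>
void set_field(std::vector<uint8_t>& out, std::size_t offset, T val) {
    assert(offset + sizeof(T) <= out.size());
    std::memcpy(out.data() + offset, &val, sizeof(T));
}

// Read a trivially copyable field from a fixed offset.
template <typename T>
T get_field(const std::vector<uint8_t>& in, std::size_t offset) {
    assert(offset + sizeof(T) <= in.size());
    T val;
    std::memcpy(&val, in.data() + offset, sizeof(T));
    return val;
}

// <live> without expiry: flags, timestamp, value.
inline std::vector<uint8_t> make_live(int64_t timestamp, std::string_view value) {
    const std::size_t value_offset = flags_size + timestamp_size;
    std::vector<uint8_t> b(value_offset + value.size());
    b[0] = live_flag;
    set_field(b, timestamp_offset, timestamp);
    std::copy(value.begin(), value.end(), b.begin() + value_offset);
    return b;
}

// <live> with expiry: flags, timestamp, expiry, ttl, value.
inline std::vector<uint8_t> make_live(int64_t timestamp, std::string_view value,
                                      int64_t expiry, int32_t ttl) {
    const std::size_t value_offset = ttl_offset + ttl_size;
    std::vector<uint8_t> b(value_offset + value.size());
    b[0] = live_flag | expiry_flag;
    set_field(b, timestamp_offset, timestamp);
    set_field(b, expiry_offset, expiry);
    set_field(b, ttl_offset, ttl);
    std::copy(value.begin(), value.end(), b.begin() + value_offset);
    return b;
}

// The value starts after the optional expiry and ttl fields.
inline std::string_view value(const std::vector<uint8_t>& cell) {
    const std::size_t off = flags_size + timestamp_size
        + ((cell[0] & expiry_flag) ? expiry_size + ttl_size : 0);
    return std::string_view(reinterpret_cast<const char*>(cell.data()) + off,
                            cell.size() - off);
}

} // namespace cell_sketch
```

// As in do_get_value() above, the value offset depends only on the flags
// byte, so a reader never needs type information to locate the value.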
/// View of an atomic cell
template<mutable_view is_mutable>
class basic_atomic_cell_view {
protected:
data::cell::basic_atomic_cell_view<is_mutable> _view;
friend class atomic_cell;
public:
using pointer_type = std::conditional_t<is_mutable == mutable_view::no, const uint8_t*, uint8_t*>;
managed_bytes_basic_view<is_mutable> _view;
friend class atomic_cell;
protected:
explicit basic_atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> v)
: _view(std::move(v)) { }
basic_atomic_cell_view(const data::type_info& ti, pointer_type ptr)
: _view(data::cell::make_atomic_cell_view(ti, ptr))
{ }
void set_view(managed_bytes_basic_view<is_mutable> v) {
_view = v;
}
basic_atomic_cell_view() = default;
explicit basic_atomic_cell_view(managed_bytes_basic_view<is_mutable> v) : _view(std::move(v)) { }
friend class atomic_cell_or_collection;
public:
operator basic_atomic_cell_view<mutable_view::no>() const noexcept {
return basic_atomic_cell_view<mutable_view::no>(_view);
}
void swap(basic_atomic_cell_view& other) noexcept {
using std::swap;
swap(_view, other._view);
}
bool is_counter_update() const {
return _view.is_counter_update();
return atomic_cell_type::is_counter_update(_view);
}
bool is_live() const {
return _view.is_live();
return atomic_cell_type::is_live(_view);
}
bool is_live(tombstone t, bool is_counter) const {
return is_live() && !is_covered_by(t, is_counter);
@@ -82,73 +233,72 @@ public:
return is_live() && !is_covered_by(t, is_counter) && !has_expired(now);
}
bool is_live_and_has_ttl() const {
return _view.is_expiring();
return atomic_cell_type::is_live_and_has_ttl(_view);
}
bool is_dead(gc_clock::time_point now) const {
return !is_live() || has_expired(now);
return atomic_cell_type::is_dead(_view) || has_expired(now);
}
bool is_covered_by(tombstone t, bool is_counter) const {
return timestamp() <= t.timestamp || (is_counter && t.timestamp != api::missing_timestamp);
}
// Can be called on live and dead cells
api::timestamp_type timestamp() const {
return _view.timestamp();
return atomic_cell_type::timestamp(_view);
}
void set_timestamp(api::timestamp_type ts) {
_view.set_timestamp(ts);
atomic_cell_type::set_timestamp(_view, ts);
}
// Can be called on live cells only
data::basic_value_view<is_mutable> value() const {
return _view.value();
atomic_cell_value_basic_view<is_mutable> value() const {
return atomic_cell_type::value(_view);
}
// Can be called on live cells only
size_t value_size() const {
return _view.value_size();
return atomic_cell_type::value(_view).size();
}
bool is_value_fragmented() const {
return _view.is_value_fragmented();
return _view.is_fragmented();
}
// Can be called on live counter update cells only
int64_t counter_update_value() const {
return _view.counter_update_value();
return atomic_cell_type::counter_update_value(_view);
}
// Can be called only when is_dead(gc_clock::time_point)
gc_clock::time_point deletion_time() const {
return !is_live() ? _view.deletion_time() : expiry() - ttl();
return !is_live() ? atomic_cell_type::deletion_time(_view) : expiry() - ttl();
}
// Can be called only when is_live_and_has_ttl()
gc_clock::time_point expiry() const {
return _view.expiry();
return atomic_cell_type::expiry(_view);
}
// Can be called only when is_live_and_has_ttl()
gc_clock::duration ttl() const {
return _view.ttl();
return atomic_cell_type::ttl(_view);
}
// Can be called on live and dead cells
bool has_expired(gc_clock::time_point now) const {
return is_live_and_has_ttl() && expiry() <= now;
}
bytes_view serialize() const {
return _view.serialize();
managed_bytes_view serialize() const {
return _view;
}
};
class atomic_cell_view final : public basic_atomic_cell_view<mutable_view::no> {
atomic_cell_view(const data::type_info& ti, const uint8_t* data)
: basic_atomic_cell_view<mutable_view::no>(ti, data) {}
atomic_cell_view(managed_bytes_view v)
: basic_atomic_cell_view(v) {}
template<mutable_view is_mutable>
atomic_cell_view(data::cell::basic_atomic_cell_view<is_mutable> view)
: basic_atomic_cell_view<mutable_view::no>(view) { }
atomic_cell_view(basic_atomic_cell_view<is_mutable> view)
: basic_atomic_cell_view<mutable_view::no>(view) {}
friend class atomic_cell;
public:
static atomic_cell_view from_bytes(const data::type_info& ti, const imr::utils::object<data::cell::structure>& data) {
return atomic_cell_view(ti, data.get());
static atomic_cell_view from_bytes(const abstract_type& t, managed_bytes_view v) {
return atomic_cell_view(v);
}
static atomic_cell_view from_bytes(const data::type_info& ti, bytes_view bv) {
return atomic_cell_view(ti, reinterpret_cast<const uint8_t*>(bv.begin()));
static atomic_cell_view from_bytes(const abstract_type& t, bytes_view v) {
return atomic_cell_view(managed_bytes_view(v));
}
friend std::ostream& operator<<(std::ostream& os, const atomic_cell_view& acv);
@@ -163,11 +313,11 @@ public:
};
class atomic_cell_mutable_view final : public basic_atomic_cell_view<mutable_view::yes> {
atomic_cell_mutable_view(const data::type_info& ti, uint8_t* data)
: basic_atomic_cell_view<mutable_view::yes>(ti, data) {}
atomic_cell_mutable_view(managed_bytes_mutable_view data)
: basic_atomic_cell_view(data) {}
public:
static atomic_cell_mutable_view from_bytes(const data::type_info& ti, imr::utils::object<data::cell::structure>& data) {
return atomic_cell_mutable_view(ti, data.get());
static atomic_cell_mutable_view from_bytes(const abstract_type& t, managed_bytes_mutable_view v) {
return atomic_cell_mutable_view(v);
}
friend class atomic_cell;
@@ -176,26 +326,31 @@ public:
using atomic_cell_ref = atomic_cell_mutable_view;
class atomic_cell final : public basic_atomic_cell_view<mutable_view::yes> {
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
atomic_cell(const data::type_info& ti, imr::utils::object<data::cell::structure>&& data)
: basic_atomic_cell_view<mutable_view::yes>(ti, data.get()), _data(std::move(data)) {}
managed_bytes _data;
atomic_cell(managed_bytes b) : _data(std::move(b)) {
set_view(_data);
}
public:
class collection_member_tag;
using collection_member = bool_class<collection_member_tag>;
atomic_cell(atomic_cell&&) = default;
atomic_cell& operator=(const atomic_cell&) = delete;
atomic_cell& operator=(atomic_cell&&) = default;
void swap(atomic_cell& other) noexcept {
basic_atomic_cell_view<mutable_view::yes>::swap(other);
_data.swap(other._data);
atomic_cell(atomic_cell&& o) noexcept : _data(std::move(o._data)) {
set_view(_data);
}
operator atomic_cell_view() const { return atomic_cell_view(_view); }
atomic_cell& operator=(const atomic_cell&) = delete;
atomic_cell& operator=(atomic_cell&& o) {
_data = std::move(o._data);
set_view(_data);
return *this;
}
operator atomic_cell_view() const { return atomic_cell_view(managed_bytes_view(_data)); }
atomic_cell(const abstract_type& t, atomic_cell_view other);
static atomic_cell make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, bytes_view value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, managed_bytes_view value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type& type, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,
@@ -207,6 +362,8 @@ public:
static atomic_cell make_live_counter_update(api::timestamp_type timestamp, int64_t value);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, managed_bytes_view value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, ser::buffer_view<bytes_ostream::fragment_iterator> value,
gc_clock::time_point expiry, gc_clock::duration ttl, collection_member = collection_member::no);
static atomic_cell make_live(const abstract_type&, api::timestamp_type timestamp, const fragmented_temporary_buffer::view& value,

View File

@@ -52,9 +52,7 @@ struct appending_hash<atomic_cell_view> {
feed_hash(h, cell.timestamp());
if (cell.is_live()) {
if (cdef.is_counter()) {
counter_cell_view::with_linearized(cell, [&] (counter_cell_view ccv) {
::feed_hash(h, ccv);
});
::feed_hash(h, counter_cell_view(cell));
return;
}
if (cell.is_live_and_has_ttl()) {

View File

@@ -26,20 +26,14 @@
#include "schema.hh"
#include "hashing.hh"
#include "imr/utils.hh"
// A variant type that can hold either an atomic_cell, or a serialized collection.
// Which type is stored is determined by the schema.
// Has an "empty" state.
// Objects moved-from are left in an empty state.
class atomic_cell_or_collection final {
// FIXME: This has made us lose small-buffer optimisation. Unfortunately,
// due to the changed cell format it would be less effective now, anyway.
// Measure the actual impact because any attempts to fix this will become
// irrelevant once rows are converted to the IMR as well, so maybe we can
// live with this like that.
using imr_object_type = imr::utils::object<data::cell::structure>;
imr_object_type _data;
managed_bytes _data;
private:
atomic_cell_or_collection(imr::utils::object<data::cell::structure>&& data) : _data(std::move(data)) {}
atomic_cell_or_collection(managed_bytes&& data) : _data(std::move(data)) {}
public:
atomic_cell_or_collection() = default;
atomic_cell_or_collection(atomic_cell_or_collection&&) = default;
@@ -49,20 +43,16 @@ public:
atomic_cell_or_collection(atomic_cell ac) : _data(std::move(ac._data)) {}
atomic_cell_or_collection(const abstract_type& at, atomic_cell_view acv);
static atomic_cell_or_collection from_atomic_cell(atomic_cell data) { return { std::move(data._data) }; }
atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_ref as_atomic_cell_ref(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(cdef.type->imr_state().type_info(), _data); }
atomic_cell_view as_atomic_cell(const column_definition& cdef) const { return atomic_cell_view::from_bytes(*cdef.type, _data); }
atomic_cell_mutable_view as_mutable_atomic_cell(const column_definition& cdef) { return atomic_cell_mutable_view::from_bytes(*cdef.type, _data); }
atomic_cell_or_collection(collection_mutation cm) : _data(std::move(cm._data)) { }
atomic_cell_or_collection copy(const abstract_type&) const;
explicit operator bool() const {
return bool(_data);
return !_data.empty();
}
static constexpr bool can_use_mutable_view() {
return true;
}
void swap(atomic_cell_or_collection& other) noexcept {
_data.swap(other._data);
}
static atomic_cell_or_collection from_collection_mutation(collection_mutation data) { return std::move(data._data); }
collection_mutation_view as_collection_mutation() const;
bytes_view serialize() const;
@@ -82,12 +72,3 @@ public:
};
friend std::ostream& operator<<(std::ostream&, const printer&);
};
namespace std {
inline void swap(atomic_cell_or_collection& a, atomic_cell_or_collection& b) noexcept
{
a.swap(b);
}
}

View File

@@ -26,10 +26,7 @@
namespace auth {
const sstring& allow_all_authenticator_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "AllowAllAuthenticator";
return name;
}
constexpr std::string_view allow_all_authenticator_name("org.apache.cassandra.auth.AllowAllAuthenticator");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<

View File

@@ -37,7 +37,7 @@ class migration_manager;
namespace auth {
const sstring& allow_all_authenticator_name();
extern const std::string_view allow_all_authenticator_name;
class allow_all_authenticator final : public authenticator {
public:
@@ -53,7 +53,7 @@ public:
}
virtual std::string_view qualified_java_name() const override {
return allow_all_authenticator_name();
return allow_all_authenticator_name;
}
virtual bool require_authentication() const override {

View File

@@ -26,10 +26,7 @@
namespace auth {
const sstring& allow_all_authorizer_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "AllowAllAuthorizer";
return name;
}
constexpr std::string_view allow_all_authorizer_name("org.apache.cassandra.auth.AllowAllAuthorizer");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<

View File

@@ -34,7 +34,7 @@ class migration_manager;
namespace auth {
const sstring& allow_all_authorizer_name();
extern const std::string_view allow_all_authorizer_name;
class allow_all_authorizer final : public authorizer {
public:
@@ -50,7 +50,7 @@ public:
}
virtual std::string_view qualified_java_name() const override {
return allow_all_authorizer_name();
return allow_all_authorizer_name;
}
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const override {

View File

@@ -34,10 +34,9 @@ namespace auth {
namespace meta {
const sstring DEFAULT_SUPERUSER_NAME("cassandra");
const sstring AUTH_KS("system_auth");
const sstring USERS_CF("users");
const sstring AUTH_PACKAGE_NAME("org.apache.cassandra.auth.");
constexpr std::string_view AUTH_KS("system_auth");
constexpr std::string_view USERS_CF("users");
constexpr std::string_view AUTH_PACKAGE_NAME("org.apache.cassandra.auth.");
}
@@ -83,7 +82,7 @@ static future<> create_metadata_table_if_missing_impl(
b.set_uuid(uuid);
schema_ptr table = b.build();
return ignore_existing([&mm, table = std::move(table)] () {
return mm.announce_new_column_family(table, false);
return mm.announce_new_column_family(table);
});
}
@@ -109,10 +108,17 @@ future<> wait_for_schema_agreement(::service::migration_manager& mm, const datab
});
}
const timeout_config& internal_distributed_timeout_config() noexcept {
::service::query_state& internal_distributed_query_state() noexcept {
#ifdef DEBUG
// Give the much slower debug tests more headroom for completing auth queries.
static const auto t = 30s;
#else
static const auto t = 5s;
#endif
static const timeout_config tc{t, t, t, t, t, t, t};
return tc;
static thread_local ::service::client_state cs(::service::client_state::internal_tag{}, tc);
static thread_local ::service::query_state qs(cs, empty_service_permit());
return qs;
}
}

View File

@@ -35,6 +35,7 @@
#include "log.hh"
#include "seastarx.hh"
#include "utils/exponential_backoff_retry.hh"
#include "service/query_state.hh"
using namespace std::chrono_literals;
@@ -53,10 +54,10 @@ namespace auth {
namespace meta {
extern const sstring DEFAULT_SUPERUSER_NAME;
extern const sstring AUTH_KS;
extern const sstring USERS_CF;
extern const sstring AUTH_PACKAGE_NAME;
constexpr std::string_view DEFAULT_SUPERUSER_NAME("cassandra");
extern const std::string_view AUTH_KS;
extern const std::string_view USERS_CF;
extern const std::string_view AUTH_PACKAGE_NAME;
}
@@ -87,6 +88,6 @@ future<> wait_for_schema_agreement(::service::migration_manager&, const database
///
/// Time-outs for internal, non-local CQL queries.
///
const timeout_config& internal_distributed_timeout_config() noexcept;
::service::query_state& internal_distributed_query_state() noexcept;
}

View File

@@ -65,15 +65,14 @@ extern "C" {
namespace auth {
const sstring& default_authorizer_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "CassandraAuthorizer";
return name;
std::string_view default_authorizer::qualified_java_name() const {
return "org.apache.cassandra.auth.CassandraAuthorizer";
}
static const sstring ROLE_NAME = "role";
static const sstring RESOURCE_NAME = "resource";
static const sstring PERMISSIONS_NAME = "permissions";
static const sstring PERMISSIONS_CF = "role_permissions";
static constexpr std::string_view ROLE_NAME = "role";
static constexpr std::string_view RESOURCE_NAME = "resource";
static constexpr std::string_view PERMISSIONS_NAME = "permissions";
static constexpr std::string_view PERMISSIONS_CF = "role_permissions";
static logging::logger alogger("default_authorizer");
@@ -104,7 +103,6 @@ future<bool> default_authorizer::any_granted() const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{},
true).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return !results->empty();
@@ -117,8 +115,7 @@ future<> default_authorizer::migrate_legacy_metadata() const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config).then([this](::shared_ptr<cql3::untyped_result_set> results) {
db::consistency_level::LOCAL_ONE).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
return do_with(
row.get_as<sstring>("username"),
@@ -198,7 +195,6 @@ default_authorizer::authorize(const role_or_anonymous& maybe_role, const resourc
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{*maybe_role.name, r.name()}).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return permissions::NONE;
@@ -227,7 +223,7 @@ default_authorizer::modify(
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{permissions::to_strings(set), sstring(role_name), resource.name()}).discard_result();
});
}
@@ -252,7 +248,7 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
std::vector<permission_details> all_details;
@@ -279,7 +275,7 @@ future<> default_authorizer::revoke_all(std::string_view role_name) const {
return _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name)}).discard_result().handle_exception([role_name](auto ep) {
try {
std::rethrow_exception(ep);
@@ -299,7 +295,6 @@ future<> default_authorizer::revoke_all(const resource& resource) const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{resource.name()}).then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get0();
@@ -316,7 +311,6 @@ future<> default_authorizer::revoke_all(const resource& resource) const {
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
infinite_timeout_config,
{r.get_as<sstring>(ROLE_NAME), resource.name()}).discard_result().handle_exception(
[resource](auto ep) {
try {

View File

@@ -51,8 +51,6 @@
namespace auth {
const sstring& default_authorizer_name();
class default_authorizer : public authorizer {
cql3::query_processor& _qp;
@@ -71,9 +69,7 @@ public:
virtual future<> stop() override;
virtual std::string_view qualified_java_name() const override {
return default_authorizer_name();
}
virtual std::string_view qualified_java_name() const override;
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const override;

View File

@@ -62,15 +62,12 @@
namespace auth {
const sstring& password_authenticator_name() {
static const sstring name = meta::AUTH_PACKAGE_NAME + "PasswordAuthenticator";
return name;
}
constexpr std::string_view password_authenticator_name("org.apache.cassandra.auth.PasswordAuthenticator");
// name of the hash column.
static const sstring SALTED_HASH = "salted_hash";
static const sstring DEFAULT_USER_NAME = meta::DEFAULT_SUPERUSER_NAME;
static const sstring DEFAULT_USER_PASSWORD = meta::DEFAULT_SUPERUSER_NAME;
static constexpr std::string_view SALTED_HASH = "salted_hash";
static constexpr std::string_view DEFAULT_USER_NAME = meta::DEFAULT_SUPERUSER_NAME;
static const sstring DEFAULT_USER_PASSWORD = sstring(meta::DEFAULT_SUPERUSER_NAME);
static logging::logger plogger("password_authenticator");
@@ -98,7 +95,7 @@ static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
static const sstring& update_row_query() {
static const sstring update_row_query = format("UPDATE {} SET {} = ? WHERE {} = ?",
meta::roles_table::qualified_name(),
meta::roles_table::qualified_name,
SALTED_HASH,
meta::roles_table::role_col_name);
return update_row_query;
@@ -117,7 +114,7 @@ future<> password_authenticator::migrate_legacy_metadata() const {
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_query_state()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
auto username = row.get_as<sstring>("username");
auto salted_hash = row.get_as<sstring>(SALTED_HASH);
@@ -125,7 +122,7 @@ future<> password_authenticator::migrate_legacy_metadata() const {
return _qp.execute_internal(
update_row_query(),
consistency_for_user(username),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{std::move(salted_hash), username}).discard_result();
}).finally([results] {});
}).then([] {
@@ -142,7 +139,7 @@ future<> password_authenticator::create_default_if_missing() const {
return _qp.execute_internal(
update_row_query(),
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt), DEFAULT_USER_NAME}).then([](auto&&) {
plogger.info("Created default superuser authentication record.");
});
@@ -198,7 +195,7 @@ db::consistency_level password_authenticator::consistency_for_user(std::string_v
}
std::string_view password_authenticator::qualified_java_name() const {
return password_authenticator_name();
return password_authenticator_name;
}
bool password_authenticator::require_authentication() const {
@@ -215,10 +212,10 @@ authentication_option_set password_authenticator::alterable_options() const {
future<authenticated_user> password_authenticator::authenticate(
const credentials_map& credentials) const {
if (!credentials.count(USERNAME_KEY)) {
if (!credentials.contains(USERNAME_KEY)) {
throw exceptions::authentication_exception(format("Required key '{}' is missing", USERNAME_KEY));
}
if (!credentials.count(PASSWORD_KEY)) {
if (!credentials.contains(PASSWORD_KEY)) {
throw exceptions::authentication_exception(format("Required key '{}' is missing", PASSWORD_KEY));
}
@@ -233,13 +230,13 @@ future<authenticated_user> password_authenticator::authenticate(
return futurize_invoke([this, username, password] {
static const sstring query = format("SELECT {} FROM {} WHERE {} = ?",
SALTED_HASH,
meta::roles_table::qualified_name(),
meta::roles_table::qualified_name,
meta::roles_table::role_col_name);
return _qp.execute_internal(
query,
consistency_for_user(username),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{username},
true);
}).then_wrapped([=](future<::shared_ptr<cql3::untyped_result_set>> f) {
@@ -273,7 +270,7 @@ future<> password_authenticator::create(std::string_view role_name, const authen
return _qp.execute_internal(
update_row_query(),
consistency_for_user(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
}
@@ -283,26 +280,26 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
}
static const sstring query = format("UPDATE {} SET {} = ? WHERE {} = ?",
meta::roles_table::qualified_name(),
meta::roles_table::qualified_name,
SALTED_HASH,
meta::roles_table::role_col_name);
return _qp.execute_internal(
query,
consistency_for_user(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{passwords::hash(*options.password, rng_for_salt), sstring(role_name)}).discard_result();
}
future<> password_authenticator::drop(std::string_view name) const {
static const sstring query = format("DELETE {} FROM {} WHERE {} = ?",
SALTED_HASH,
meta::roles_table::qualified_name(),
meta::roles_table::qualified_name,
meta::roles_table::role_col_name);
return _qp.execute_internal(
query, consistency_for_user(name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(name)}).discard_result();
}


@@ -52,7 +52,7 @@ class migration_manager;
namespace auth {
const sstring& password_authenticator_name();
extern const std::string_view password_authenticator_name;
class password_authenticator : public authenticator {
cql3::query_processor& _qp;


@@ -45,16 +45,13 @@ std::string_view creation_query() {
" member_of set<text>,"
" salted_hash text"
")",
qualified_name(),
qualified_name,
role_col_name);
return instance;
}
std::string_view qualified_name() noexcept {
static const sstring instance = AUTH_KS + "." + sstring(name);
return instance;
}
constexpr std::string_view qualified_name("system_auth.roles");
}
@@ -64,21 +61,20 @@ future<bool> default_role_row_satisfies(
cql3::query_processor& qp,
std::function<bool(const cql3::untyped_result_set_row&)> p) {
static const sstring query = format("SELECT * FROM {} WHERE {} = ?",
meta::roles_table::qualified_name(),
meta::roles_table::qualified_name,
meta::roles_table::role_col_name);
return do_with(std::move(p), [&qp](const auto& p) {
return qp.execute_internal(
query,
db::consistency_level::ONE,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{meta::DEFAULT_SUPERUSER_NAME},
true).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
@@ -97,13 +93,13 @@ future<bool> default_role_row_satisfies(
future<bool> any_nondefault_role_row_satisfies(
cql3::query_processor& qp,
std::function<bool(const cql3::untyped_result_set_row&)> p) {
static const sstring query = format("SELECT * FROM {}", meta::roles_table::qualified_name());
static const sstring query = format("SELECT * FROM {}", meta::roles_table::qualified_name);
return do_with(std::move(p), [&qp](const auto& p) {
return qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_query_state()).then([&p](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
return false;
}


@@ -43,7 +43,7 @@ std::string_view creation_query();
constexpr std::string_view name{"roles", 5};
std::string_view qualified_name() noexcept;
extern const std::string_view qualified_name;
constexpr std::string_view role_col_name{"role", 4};


@@ -31,9 +31,7 @@
#include "auth/allow_all_authenticator.hh"
#include "auth/allow_all_authorizer.hh"
#include "auth/common.hh"
#include "auth/password_authenticator.hh"
#include "auth/role_or_anonymous.hh"
#include "auth/standard_role_manager.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "db/consistency_level_type.hh"
@@ -125,18 +123,7 @@ service::service(
, _authorizer(std::move(z))
, _authenticator(std::move(a))
, _role_manager(std::move(r))
, _migration_listener(std::make_unique<auth_migration_listener>(*_authorizer)) {
// The password authenticator requires that the `standard_role_manager` is running so that the roles metadata table
// it manages is created and updated. This cross-module dependency is rather gross, but we have to maintain it for
// the sake of compatibility with Apache Cassandra and its choice of auth. schema.
if ((_authenticator->qualified_java_name() == password_authenticator_name())
&& (_role_manager->qualified_java_name() != standard_role_manager_name())) {
throw incompatible_module_combination(
format("The {} authenticator must be loaded alongside the {} role-manager.",
password_authenticator_name(),
standard_role_manager_name()));
}
}
, _migration_listener(std::make_unique<auth_migration_listener>(*_authorizer)) {}
service::service(
permissions_cache_config c,
@@ -167,7 +154,7 @@ future<> service::create_keyspace_if_missing(::service::migration_manager& mm) c
// We use min_timestamp so that default keyspace metadata will lose to any manual adjustments.
// See issue #2129.
return mm.announce_new_keyspace(ksm, api::min_timestamp, false);
return mm.announce_new_keyspace(ksm, api::min_timestamp);
}
return make_ready_future<>();
@@ -223,7 +210,6 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.execute_internal(
default_user_query,
db::consistency_level::ONE,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
if (!results->empty()) {
@@ -233,7 +219,6 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.execute_internal(
default_user_query,
db::consistency_level::QUORUM,
infinite_timeout_config,
{meta::DEFAULT_SUPERUSER_NAME},
true).then([this](auto results) {
if (!results->empty()) {
@@ -242,8 +227,7 @@ future<bool> service::has_existing_legacy_users() const {
return _qp.execute_internal(
all_users_query,
db::consistency_level::QUORUM,
infinite_timeout_config).then([](auto results) {
db::consistency_level::QUORUM).then([](auto results) {
return make_ready_future<bool>(!results->empty());
});
});
@@ -376,25 +360,28 @@ future<permission_set> get_permissions(const service& ser, const authenticated_u
}
bool is_enforcing(const service& ser) {
const bool enforcing_authorizer = ser.underlying_authorizer().qualified_java_name() != allow_all_authorizer_name();
const bool enforcing_authorizer = ser.underlying_authorizer().qualified_java_name() != allow_all_authorizer_name;
const bool enforcing_authenticator = ser.underlying_authenticator().qualified_java_name()
!= allow_all_authenticator_name();
!= allow_all_authenticator_name;
return enforcing_authorizer || enforcing_authenticator;
}
bool is_protected(const service& ser, const resource& r) noexcept {
return ser.underlying_role_manager().protected_resources().count(r)
|| ser.underlying_authenticator().protected_resources().count(r)
|| ser.underlying_authorizer().protected_resources().count(r);
bool is_protected(const service& ser, command_desc cmd) noexcept {
if (cmd.type_ == command_desc::type::ALTER_WITH_OPTS) {
return false; // Table attributes are OK to modify; see #7057.
}
return ser.underlying_role_manager().protected_resources().contains(cmd.resource)
|| ser.underlying_authenticator().protected_resources().contains(cmd.resource)
|| ser.underlying_authorizer().protected_resources().contains(cmd.resource);
}
static void validate_authentication_options_are_supported(
const authentication_options& options,
const authentication_option_set& supported) {
const auto check = [&supported](authentication_option k) {
if (supported.count(k) == 0) {
if (!supported.contains(k)) {
throw unsupported_authentication_option(k);
}
};
@@ -474,7 +461,7 @@ future<bool> has_role(const service& ser, std::string_view grantee, std::string_
return when_all_succeed(
validate_role_exists(ser, name),
ser.get_roles(grantee)).then_unpack([name](role_set all_roles) {
return make_ready_future<bool>(all_roles.count(sstring(name)) != 0);
return make_ready_future<bool>(all_roles.contains(sstring(name)));
});
}
future<bool> has_role(const service& ser, const authenticated_user& u, std::string_view name) {
@@ -531,14 +518,9 @@ future<std::vector<permission_details>> list_filtered_permissions(
? auth::expand_resource_family(r)
: auth::resource_set{r};
all_details.erase(
std::remove_if(
all_details.begin(),
all_details.end(),
[&resources](const permission_details& pd) {
return resources.count(pd.resource) == 0;
}),
all_details.end());
std::erase_if(all_details, [&resources](const permission_details& pd) {
return !resources.contains(pd.resource);
});
}
std::transform(
@@ -551,11 +533,9 @@ future<std::vector<permission_details>> list_filtered_permissions(
});
// Eliminate rows with an empty permission set.
all_details.erase(
std::remove_if(all_details.begin(), all_details.end(), [](const permission_details& pd) {
return pd.permissions.mask() == 0;
}),
all_details.end());
std::erase_if(all_details, [](const permission_details& pd) {
return pd.permissions.mask() == 0;
});
if (!role_name) {
return make_ready_future<std::vector<permission_details>>(std::move(all_details));
@@ -567,14 +547,9 @@ future<std::vector<permission_details>> list_filtered_permissions(
return do_with(std::move(all_details), [&ser, role_name](auto& all_details) {
return ser.get_roles(*role_name).then([&all_details](role_set all_roles) {
all_details.erase(
std::remove_if(
all_details.begin(),
all_details.end(),
[&all_roles](const permission_details& pd) {
return all_roles.count(pd.role_name) == 0;
}),
all_details.end());
std::erase_if(all_details, [&all_roles](const permission_details& pd) {
return !all_roles.contains(pd.role_name);
});
return make_ready_future<std::vector<permission_details>>(std::move(all_details));
});


@@ -181,10 +181,21 @@ future<permission_set> get_permissions(const service&, const authenticated_user&
///
bool is_enforcing(const service&);
/// A description of a CQL command from which auth::service can tell whether or not this command could endanger
/// internal data on which auth::service depends.
struct command_desc {
auth::permission permission; ///< Nature of the command's alteration.
const ::auth::resource& resource; ///< Resource impacted by this command.
enum class type {
ALTER_WITH_OPTS, ///< Command is ALTER ... WITH ...
OTHER
} type_ = type::OTHER;
};
///
/// Protected resources cannot be modified even if the performer has permissions to do so.
///
bool is_protected(const service&, const resource&) noexcept;
bool is_protected(const service&, command_desc) noexcept;
///
/// Create a role with optional authentication information.


@@ -49,11 +49,7 @@ namespace meta {
namespace role_members_table {
constexpr std::string_view name{"role_members" , 12};
static std::string_view qualified_name() noexcept {
static const sstring instance = AUTH_KS + "." + sstring(name);
return instance;
}
constexpr std::string_view qualified_name("system_auth.role_members");
}
@@ -84,13 +80,13 @@ static db::consistency_level consistency_for_role(std::string_view role_name) no
static future<std::optional<record>> find_record(cql3::query_processor& qp, std::string_view role_name) {
static const sstring query = format("SELECT * FROM {} WHERE {} = ?",
meta::roles_table::qualified_name(),
meta::roles_table::qualified_name,
meta::roles_table::role_col_name);
return qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name)},
true).then([](::shared_ptr<cql3::untyped_result_set> results) {
if (results->empty()) {
@@ -124,13 +120,8 @@ static bool has_can_login(const cql3::untyped_result_set_row& row) {
return row.has("can_login") && !(boolean_type->deserialize(row.get_blob("can_login")).is_null());
}
std::string_view standard_role_manager_name() noexcept {
static const sstring instance = meta::AUTH_PACKAGE_NAME + "CassandraRoleManager";
return instance;
}
std::string_view standard_role_manager::qualified_java_name() const noexcept {
return standard_role_manager_name();
return "org.apache.cassandra.auth.CassandraRoleManager";
}
const resource_set& standard_role_manager::protected_resources() const {
@@ -148,7 +139,7 @@ future<> standard_role_manager::create_metadata_tables_if_missing() const {
" member text,"
" PRIMARY KEY (role, member)"
")",
meta::role_members_table::qualified_name());
meta::role_members_table::qualified_name);
return when_all_succeed(
@@ -168,13 +159,13 @@ future<> standard_role_manager::create_default_role_if_missing() const {
return default_role_row_satisfies(_qp, &has_can_login).then([this](bool exists) {
if (!exists) {
static const sstring query = format("INSERT INTO {} ({}, is_superuser, can_login) VALUES (?, true, true)",
meta::roles_table::qualified_name(),
meta::roles_table::qualified_name,
meta::roles_table::role_col_name);
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{meta::DEFAULT_SUPERUSER_NAME}).then([](auto&&) {
log.info("Created default superuser role '{}'.", meta::DEFAULT_SUPERUSER_NAME);
return make_ready_future<>();
@@ -201,7 +192,7 @@ future<> standard_role_manager::migrate_legacy_metadata() const {
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_query_state()).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
role_config config;
config.is_superuser = row.get_or<bool>("super", false);
@@ -256,13 +247,13 @@ future<> standard_role_manager::stop() {
future<> standard_role_manager::create_or_replace(std::string_view role_name, const role_config& c) const {
static const sstring query = format("INSERT INTO {} ({}, is_superuser, can_login) VALUES (?, ?, ?)",
meta::roles_table::qualified_name(),
meta::roles_table::qualified_name,
meta::roles_table::role_col_name);
return _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name), c.is_superuser, c.can_login},
true).discard_result();
}
@@ -301,11 +292,11 @@ standard_role_manager::alter(std::string_view role_name, const role_config_updat
return _qp.execute_internal(
format("UPDATE {} SET {} WHERE {} = ?",
meta::roles_table::qualified_name(),
meta::roles_table::qualified_name,
build_column_assignments(u),
meta::roles_table::role_col_name),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name)}).discard_result();
});
}
@@ -319,12 +310,12 @@ future<> standard_role_manager::drop(std::string_view role_name) const {
// First, revoke this role from all roles that are members of it.
const auto revoke_from_members = [this, role_name] {
static const sstring query = format("SELECT member FROM {} WHERE role = ?",
meta::role_members_table::qualified_name());
meta::role_members_table::qualified_name);
return _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name)}).then([this, role_name](::shared_ptr<cql3::untyped_result_set> members) {
return parallel_for_each(
members->begin(),
@@ -357,13 +348,13 @@ future<> standard_role_manager::drop(std::string_view role_name) const {
// Finally, delete the role itself.
auto delete_role = [this, role_name] {
static const sstring query = format("DELETE FROM {} WHERE {} = ?",
meta::roles_table::qualified_name(),
meta::roles_table::qualified_name,
meta::roles_table::role_col_name);
return _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name)}).discard_result();
};
@@ -383,14 +374,14 @@ standard_role_manager::modify_membership(
const auto modify_roles = [this, role_name, grantee_name, ch] {
const auto query = format(
"UPDATE {} SET member_of = member_of {} ? WHERE {} = ?",
meta::roles_table::qualified_name(),
meta::roles_table::qualified_name,
(ch == membership_change::add ? '+' : '-'),
meta::roles_table::role_col_name);
return _qp.execute_internal(
query,
consistency_for_role(grantee_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{role_set{sstring(role_name)}, sstring(grantee_name)}).discard_result();
};
@@ -399,24 +390,24 @@ standard_role_manager::modify_membership(
case membership_change::add:
return _qp.execute_internal(
format("INSERT INTO {} (role, member) VALUES (?, ?)",
meta::role_members_table::qualified_name()),
meta::role_members_table::qualified_name),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
case membership_change::remove:
return _qp.execute_internal(
format("DELETE FROM {} WHERE role = ? AND member = ?",
meta::role_members_table::qualified_name()),
meta::role_members_table::qualified_name),
consistency_for_role(role_name),
internal_distributed_timeout_config(),
internal_distributed_query_state(),
{sstring(role_name), sstring(grantee_name)}).discard_result();
}
return make_ready_future<>();
};
return when_all_succeed(modify_roles(), modify_role_members()).discard_result();
return when_all_succeed(modify_roles(), modify_role_members).discard_result();
}
future<>
@@ -425,7 +416,7 @@ standard_role_manager::grant(std::string_view grantee_name, std::string_view rol
return this->query_granted(
grantee_name,
recursive_role_query::yes).then([role_name, grantee_name](role_set roles) {
if (roles.count(sstring(role_name)) != 0) {
if (roles.contains(sstring(role_name))) {
throw role_already_included(grantee_name, role_name);
}
@@ -437,7 +428,7 @@ standard_role_manager::grant(std::string_view grantee_name, std::string_view rol
return this->query_granted(
role_name,
recursive_role_query::yes).then([role_name, grantee_name](role_set roles) {
if (roles.count(sstring(grantee_name)) != 0) {
if (roles.contains(sstring(grantee_name))) {
throw role_already_included(role_name, grantee_name);
}
@@ -460,7 +451,7 @@ standard_role_manager::revoke(std::string_view revokee_name, std::string_view ro
return this->query_granted(
revokee_name,
recursive_role_query::no).then([revokee_name, role_name](role_set roles) {
if (roles.count(sstring(role_name)) == 0) {
if (!roles.contains(sstring(role_name))) {
throw revoke_ungranted_role(revokee_name, role_name);
}
@@ -504,7 +495,7 @@ future<role_set> standard_role_manager::query_granted(std::string_view grantee_n
future<role_set> standard_role_manager::query_all() const {
static const sstring query = format("SELECT {} FROM {}",
meta::roles_table::role_col_name,
meta::roles_table::qualified_name());
meta::roles_table::qualified_name);
// To avoid many copies of a view.
static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);
@@ -512,7 +503,7 @@ future<role_set> standard_role_manager::query_all() const {
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_timeout_config()).then([](::shared_ptr<cql3::untyped_result_set> results) {
internal_distributed_query_state()).then([](::shared_ptr<cql3::untyped_result_set> results) {
role_set roles;
std::transform(


@@ -42,8 +42,6 @@ class migration_manager;
namespace auth {
std::string_view standard_role_manager_name() noexcept;
class standard_role_manager final : public role_manager {
cql3::query_processor& _qp;
::service::migration_manager& _migration_manager;


@@ -101,7 +101,7 @@ public:
virtual future<authenticated_user> authenticate(const credentials_map& credentials) const override {
auto i = credentials.find(authenticator::USERNAME_KEY);
if ((i == credentials.end() || i->second.empty())
&& (!credentials.count(PASSWORD_KEY) || credentials.at(PASSWORD_KEY).empty())) {
&& (!credentials.contains(PASSWORD_KEY) || credentials.at(PASSWORD_KEY).empty())) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}


@@ -100,3 +100,7 @@ std::ostream& operator<<(std::ostream& os, const bytes_view& b) {
}
}
std::ostream& operator<<(std::ostream& os, const fmt_hex& b) {
return os << to_hex(b.v);
}


@@ -28,6 +28,7 @@
#include <iosfwd>
#include <functional>
#include "utils/mutable_view.hh"
#include <xxhash.h>
using bytes = basic_sstring<int8_t, uint32_t, 31, false>;
using bytes_view = std::basic_string_view<int8_t>;
@@ -35,20 +36,24 @@ using bytes_mutable_view = basic_mutable_view<bytes_view::value_type>;
using bytes_opt = std::optional<bytes>;
using sstring_view = std::string_view;
inline bytes to_bytes(bytes&& b) {
return std::move(b);
}
inline sstring_view to_sstring_view(bytes_view view) {
return {reinterpret_cast<const char*>(view.data()), view.size()};
}
namespace std {
inline bytes_view to_bytes_view(sstring_view view) {
return {reinterpret_cast<const int8_t*>(view.data()), view.size()};
}
template <>
struct hash<bytes_view> {
size_t operator()(bytes_view v) const {
return hash<sstring_view>()({reinterpret_cast<const char*>(v.begin()), v.size()});
}
struct fmt_hex {
bytes_view& v;
fmt_hex(bytes_view& v) noexcept : v(v) {}
};
}
std::ostream& operator<<(std::ostream& os, const fmt_hex& hex);
bytes from_hex(sstring_view s);
sstring to_hex(bytes_view b);
@@ -83,10 +88,37 @@ struct appending_hash<bytes_view> {
}
};
struct bytes_view_hasher : public hasher {
XXH64_state_t _state;
bytes_view_hasher(uint64_t seed = 0) noexcept {
XXH64_reset(&_state, seed);
}
void update(const char* ptr, size_t length) noexcept {
XXH64_update(&_state, ptr, length);
}
size_t finalize() {
return static_cast<size_t>(XXH64_digest(&_state));
}
};
namespace std {
template <>
struct hash<bytes_view> {
size_t operator()(bytes_view v) const {
bytes_view_hasher h;
appending_hash<bytes_view>{}(h, v);
return h.finalize();
}
};
} // namespace std
inline int32_t compare_unsigned(bytes_view v1, bytes_view v2) {
auto n = memcmp(v1.begin(), v2.begin(), std::min(v1.size(), v2.size()));
auto size = std::min(v1.size(), v2.size());
if (size) {
auto n = memcmp(v1.begin(), v2.begin(), size);
if (n) {
return n;
}
}
return (int32_t) (v1.size() - v2.size());
}
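
The rewritten `compare_unsigned` above calls `memcmp` only when the common length is non-zero. One likely motivation (an assumption; the diff does not state it) is that an empty view may hold a null data pointer, and passing a null pointer to `memcmp` is undefined behavior even for a zero length. A minimal sketch of the guarded comparison, using `std::string_view` in place of `bytes_view`:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string_view>

// Sketch only: std::string_view stands in for bytes_view.
inline int32_t compare_unsigned_sketch(std::string_view a, std::string_view b) {
    const auto common = std::min(a.size(), b.size());
    if (common) {
        // memcmp compares bytes as unsigned char, which gives unsigned ordering.
        if (int n = std::memcmp(a.data(), b.data(), common)) {
            return n;
        }
    }
    // Equal common prefix: the shorter view sorts first.
    return static_cast<int32_t>(a.size()) - static_cast<int32_t>(b.size());
}
```

With the guard in place, comparing a default-constructed view against any other view never reaches `memcmp`.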


@@ -24,9 +24,10 @@
#include <boost/range/iterator_range.hpp>
#include "bytes.hh"
#include <seastar/core/unaligned.hh>
#include "hashing.hh"
#include <seastar/core/simple-stream.hh>
#include <concepts>
/**
* Utility for writing data into a buffer when its final size is not known up front.
*
@@ -39,7 +40,7 @@ public:
using size_type = bytes::size_type;
using value_type = bytes::value_type;
using fragment_type = bytes_view;
static constexpr size_type max_chunk_size() { return 128 * 1024; }
static constexpr size_type max_chunk_size() { return max_alloc_size() - sizeof(chunk); }
private:
static_assert(sizeof(value_type) == 1, "value_type is assumed to be one byte long");
struct chunk {
@@ -59,13 +60,21 @@ private:
void operator delete(void* ptr) { free(ptr); }
};
static constexpr size_type default_chunk_size{512};
static constexpr size_type max_alloc_size() { return 128 * 1024; }
private:
std::unique_ptr<chunk> _begin;
chunk* _current;
size_type _size;
size_type _initial_chunk_size = default_chunk_size;
public:
class fragment_iterator : public std::iterator<std::input_iterator_tag, bytes_view> {
class fragment_iterator {
public:
using iterator_category = std::input_iterator_tag;
using value_type = bytes_view;
using difference_type = std::ptrdiff_t;
using pointer = bytes_view*;
using reference = bytes_view&;
private:
chunk* _current = nullptr;
public:
fragment_iterator() = default;
@@ -125,16 +134,15 @@ private:
return _current->size - _current->offset;
}
// Figure out next chunk size.
// - must be enough for data_size
// - must be enough for data_size + sizeof(chunk)
// - must be at least _initial_chunk_size
// - try to double each time to prevent too many allocations
// - do not exceed max_chunk_size
// - should not exceed max_alloc_size, unless data_size requires so
size_type next_alloc_size(size_t data_size) const {
auto next_size = _current
? _current->size * 2
: _initial_chunk_size;
next_size = std::min(next_size, max_chunk_size());
// FIXME: check for overflow?
next_size = std::min(next_size, max_alloc_size());
return std::max<size_type>(next_size, data_size + sizeof(chunk));
}
// Makes room for a contiguous region of given size.
@@ -226,9 +234,9 @@ public:
};
// Returns a place holder for a value to be written later.
template <typename T>
template <std::integral T>
inline
std::enable_if_t<std::is_fundamental<T>::value, place_holder<T>>
place_holder<T>
write_place_holder() {
return place_holder<T>{alloc(sizeof(T))};
}
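
The sizing rule described in the `next_alloc_size` comment earlier in this file can be sketched in isolation. The constants below are assumptions for illustration: `header_size` stands in for `sizeof(chunk)`, and a `current_size` of zero means no chunk has been allocated yet.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using size_type = uint32_t;

// Assumed values for illustration; the real header size is sizeof(chunk).
constexpr size_type header_size = 16;
constexpr size_type initial_chunk_size = 512;
constexpr size_type max_alloc_size = 128 * 1024;

// Doubles the previous allocation, caps it at max_alloc_size, and then
// grows past the cap only if data_size plus the header would not fit.
constexpr size_type next_alloc_size(size_type current_size, size_type data_size) {
    size_type next = current_size ? current_size * 2 : initial_chunk_size;
    next = std::min(next, max_alloc_size);
    return std::max<size_type>(next, data_size + header_size);
}
```

Under these assumed constants, a first allocation for 10 bytes yields 512, while a 200000-byte write yields 200016, exceeding the cap because the data requires it.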


@@ -28,7 +28,6 @@
#include "partition_version.hh"
#include "utils/logalloc.hh"
#include "query-request.hh"
#include "partition_snapshot_reader.hh"
#include "partition_snapshot_row_cursor.hh"
#include "read_context.hh"
#include "flat_mutation_reader.hh"
@@ -103,7 +102,7 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
// Points to the underlying reader conforming to _schema,
// either to *_underlying_holder or _read_context->underlying().underlying().
flat_mutation_reader* _underlying = nullptr;
std::optional<flat_mutation_reader> _underlying_holder;
flat_mutation_reader_opt _underlying_holder;
future<> do_fill_buffer(db::timeout_clock::time_point);
future<> ensure_underlying(db::timeout_clock::time_point);
@@ -113,6 +112,7 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
void move_to_next_range();
void move_to_range(query::clustering_row_ranges::const_iterator);
void move_to_next_entry();
void maybe_drop_last_entry() noexcept;
void add_to_buffer(const partition_snapshot_row_cursor&);
void add_clustering_row_to_buffer(mutation_fragment&&);
void add_to_buffer(range_tombstone&&);
@@ -123,6 +123,7 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
bool can_populate() const;
// Marks the range between _last_row (exclusive) and _next_row (exclusive) as continuous,
// provided that the underlying reader still matches the latest version of the partition.
// Invalidates _last_row.
void maybe_update_continuity();
// Tries to ensure that the lower bound of the current population range exists.
// Returns false if it failed and range cannot be populated.
@@ -134,7 +135,7 @@ class cache_flat_mutation_reader final : public flat_mutation_reader::impl {
void maybe_add_to_cache(const static_row& sr);
void maybe_set_static_row_continuous();
void finish_reader() {
push_mutation_fragment(partition_end());
push_mutation_fragment(*_schema, _permit, partition_end());
_end_of_stream = true;
_state = state::end_of_stream;
}
@@ -146,7 +147,7 @@ public:
lw_shared_ptr<read_context> ctx,
partition_snapshot_ptr snp,
row_cache& cache)
: flat_mutation_reader::impl(std::move(s))
: flat_mutation_reader::impl(std::move(s), ctx->permit())
, _snp(std::move(snp))
, _position_cmp(*_schema)
, _ck_ranges(std::move(crr))
@@ -158,17 +159,18 @@ public:
, _read_context(std::move(ctx))
, _next_row(*_schema, *_snp)
{
clogger.trace("csm {}: table={}.{}", this, _schema->ks_name(), _schema->cf_name());
push_mutation_fragment(partition_start(std::move(dk), _snp->partition_tombstone()));
clogger.trace("csm {}: table={}.{}", fmt::ptr(this), _schema->ks_name(), _schema->cf_name());
push_mutation_fragment(*_schema, _permit, partition_start(std::move(dk), _snp->partition_tombstone()));
}
cache_flat_mutation_reader(const cache_flat_mutation_reader&) = delete;
cache_flat_mutation_reader(cache_flat_mutation_reader&&) = delete;
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override;
virtual void next_partition() override {
virtual future<> next_partition() override {
clear_buffer_to_next_partition();
if (is_buffer_empty()) {
_end_of_stream = true;
}
return make_ready_future<>();
}
virtual future<> fast_forward_to(const dht::partition_range&, db::timeout_clock::time_point timeout) override {
clear_buffer();
@@ -188,7 +190,7 @@ future<> cache_flat_mutation_reader::process_static_row(db::timeout_clock::time_
return _snp->static_row(_read_context->digest_requested());
});
if (!sr.empty()) {
push_mutation_fragment(mutation_fragment(std::move(sr)));
push_mutation_fragment(mutation_fragment(*_schema, _permit, std::move(sr)));
}
return make_ready_future<>();
} else {
@@ -232,7 +234,7 @@ future<> cache_flat_mutation_reader::fill_buffer(db::timeout_clock::time_point t
return after_static_row();
}
}
clogger.trace("csm {}: fill_buffer(), range={}, lb={}", this, *_ck_ranges_curr, _lower_bound);
clogger.trace("csm {}: fill_buffer(), range={}, lb={}", fmt::ptr(this), *_ck_ranges_curr, _lower_bound);
return do_until([this] { return _end_of_stream || is_buffer_full(); }, [this, timeout] {
return do_fill_buffer(timeout);
});
@@ -265,6 +267,9 @@ future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_poin
}
_state = state::reading_from_underlying;
_population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema);
if (!_read_context->partition_exists()) {
return read_from_underlying(timeout);
}
auto end = _next_row_in_range ? position_in_partition(_next_row.position())
: position_in_partition(_upper_bound);
return _underlying->fast_forward_to(position_range{_lower_bound, std::move(end)}, timeout).then([this, timeout] {
@@ -277,7 +282,7 @@ future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_poin
// assert(_state == state::reading_from_cache)
return _lsa_manager.run_in_read_section([this] {
auto next_valid = _next_row.iterators_valid();
clogger.trace("csm {}: reading_from_cache, range=[{}, {}), next={}, valid={}", this, _lower_bound,
clogger.trace("csm {}: reading_from_cache, range=[{}, {}), next={}, valid={}", fmt::ptr(this), _lower_bound,
_upper_bound, _next_row.position(), next_valid);
// We assume that if there was eviction, and thus the range may
// no longer be continuous, the cursor was invalidated.
@@ -291,7 +296,7 @@ future<> cache_flat_mutation_reader::do_fill_buffer(db::timeout_clock::time_poin
}
}
_next_row.maybe_refresh();
clogger.trace("csm {}: next={}, cont={}", this, _next_row.position(), _next_row.continuous());
clogger.trace("csm {}: next={}, cont={}", fmt::ptr(this), _next_row.position(), _next_row.continuous());
_lower_bound_changed = false;
while (_state == state::reading_from_cache) {
copy_from_cache_to_buffer();
@@ -329,7 +334,6 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
}
if (_next_row_in_range) {
maybe_update_continuity();
_last_row = _next_row;
add_to_buffer(_next_row);
try {
move_to_next_entry();
@@ -342,14 +346,14 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
if (no_clustering_row_between(*_schema, _upper_bound, _next_row.position())) {
this->maybe_update_continuity();
} else if (can_populate()) {
rows_entry::compare less(*_schema);
rows_entry::tri_compare cmp(*_schema);
auto& rows = _snp->version()->partition().clustered_rows();
if (query::is_single_row(*_schema, *_ck_ranges_curr)) {
with_allocator(_snp->region().allocator(), [&] {
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(_ck_ranges_curr->start()->value()));
// Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.
auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);
auto insert_result = rows.insert_before_hint(_next_row.get_iterator_in_latest_version(), *e, cmp);
auto inserted = insert_result.second;
auto it = insert_result.first;
if (inserted) {
@@ -357,7 +361,7 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
e.release();
auto next = std::next(it);
it->set_continuous(next->continuous());
clogger.trace("csm {}: inserted dummy at {}, cont={}", this, it->position(), it->continuous());
clogger.trace("csm {}: inserted dummy at {}, cont={}", fmt::ptr(this), it->position(), it->continuous());
}
});
} else if (ensure_population_lower_bound()) {
@@ -365,16 +369,17 @@ future<> cache_flat_mutation_reader::read_from_underlying(db::timeout_clock::tim
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_schema, _upper_bound, is_dummy::yes, is_continuous::yes));
// Use _next_row iterator only as a hint, because there could be insertions after _upper_bound.
auto insert_result = rows.insert_check(_next_row.get_iterator_in_latest_version(), *e, less);
auto insert_result = rows.insert_before_hint(_next_row.get_iterator_in_latest_version(), *e, cmp);
auto inserted = insert_result.second;
if (inserted) {
clogger.trace("csm {}: inserted dummy at {}", this, _upper_bound);
clogger.trace("csm {}: inserted dummy at {}", fmt::ptr(this), _upper_bound);
_snp->tracker()->insert(*e);
e.release();
} else {
clogger.trace("csm {}: mark {} as continuous", this, insert_result.first->position());
clogger.trace("csm {}: mark {} as continuous", fmt::ptr(this), insert_result.first->position());
insert_result.first->set_continuous(true);
}
maybe_drop_last_entry();
});
}
} else {
@@ -405,15 +410,15 @@ bool cache_flat_mutation_reader::ensure_population_lower_bound() {
if (!_last_row.is_in_latest_version()) {
with_allocator(_snp->region().allocator(), [&] {
auto& rows = _snp->version()->partition().clustered_rows();
rows_entry::compare less(*_schema);
rows_entry::tri_compare cmp(*_schema);
// FIXME: Avoid the copy by inserting an incomplete clustering row
auto e = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_schema, *_last_row));
e->set_continuous(false);
auto insert_result = rows.insert_check(rows.end(), *e, less);
auto insert_result = rows.insert_before_hint(rows.end(), *e, cmp);
auto inserted = insert_result.second;
if (inserted) {
clogger.trace("csm {}: inserted lower bound dummy at {}", this, e->position());
clogger.trace("csm {}: inserted lower bound dummy at {}", fmt::ptr(this), e->position());
_snp->tracker()->insert(*e);
e.release();
}
@@ -428,6 +433,7 @@ void cache_flat_mutation_reader::maybe_update_continuity() {
with_allocator(_snp->region().allocator(), [&] {
rows_entry& e = _next_row.ensure_entry_in_latest().row;
e.set_continuous(true);
maybe_drop_last_entry();
});
} else {
_read_context->cache().on_mispopulate();
@@ -453,20 +459,20 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
_read_context->cache().on_mispopulate();
return;
}
clogger.trace("csm {}: populate({})", this, clustering_row::printer(*_schema, cr));
clogger.trace("csm {}: populate({})", fmt::ptr(this), clustering_row::printer(*_schema, cr));
_lsa_manager.run_in_update_section_with_allocator([this, &cr] {
mutation_partition& mp = _snp->version()->partition();
rows_entry::compare less(*_schema);
rows_entry::tri_compare cmp(*_schema);
if (_read_context->digest_requested()) {
cr.cells().prepare_hash(*_schema, column_kind::regular_column);
}
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(
current_allocator().construct<rows_entry>(*_schema, cr.key(), cr.tomb(), cr.marker(), cr.cells()));
current_allocator().construct<rows_entry>(*_schema, cr.key(), cr.as_deletable_row()));
new_entry->set_continuous(false);
auto it = _next_row.iterators_valid() ? _next_row.get_iterator_in_latest_version()
: mp.clustered_rows().lower_bound(cr.key(), less);
auto insert_result = mp.clustered_rows().insert_check(it, *new_entry, less);
: mp.clustered_rows().lower_bound(cr.key(), cmp);
auto insert_result = mp.clustered_rows().insert_before_hint(it, *new_entry, cmp);
if (insert_result.second) {
_snp->tracker()->insert(*new_entry);
new_entry.release();
@@ -475,7 +481,7 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const clustering_row& cr) {
rows_entry& e = *it;
if (ensure_population_lower_bound()) {
clogger.trace("csm {}: set_continuous({})", this, e.position());
clogger.trace("csm {}: set_continuous({})", fmt::ptr(this), e.position());
e.set_continuous(true);
} else {
_read_context->cache().on_mispopulate();
@@ -494,14 +500,14 @@ bool cache_flat_mutation_reader::after_current_range(position_in_partition_view
inline
void cache_flat_mutation_reader::start_reading_from_underlying() {
clogger.trace("csm {}: start_reading_from_underlying(), range=[{}, {})", this, _lower_bound, _next_row_in_range ? _next_row.position() : _upper_bound);
clogger.trace("csm {}: start_reading_from_underlying(), range=[{}, {})", fmt::ptr(this), _lower_bound, _next_row_in_range ? _next_row.position() : _upper_bound);
_state = state::move_to_underlying;
_next_row.touch();
}
inline
void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
clogger.trace("csm {}: copy_from_cache, next={}, next_row_in_range={}", this, _next_row.position(), _next_row_in_range);
clogger.trace("csm {}: copy_from_cache, next={}, next_row_in_range={}", fmt::ptr(this), _next_row.position(), _next_row_in_range);
_next_row.touch();
position_in_partition_view next_lower_bound = _next_row.dummy() ? _next_row.position() : position_in_partition_view::after_key(_next_row.key());
for (auto &&rts : _snp->range_tombstones(_lower_bound, _next_row_in_range ? next_lower_bound : _upper_bound)) {
@@ -509,7 +515,7 @@ void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
// This guarantees that rts starts after any emitted clustering_row
// and not before any emitted range tombstone.
if (!less(_lower_bound, rts.position())) {
rts.set_start(*_schema, _lower_bound);
rts.set_start(_lower_bound);
} else {
_lower_bound = position_in_partition(rts.position());
_lower_bound_changed = true;
@@ -517,12 +523,11 @@ void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
return;
}
}
push_mutation_fragment(std::move(rts));
push_mutation_fragment(*_schema, _permit, std::move(rts));
}
// We add the row to the buffer even when it's full.
// This simplifies the code. For more info see #3139.
if (_next_row_in_range) {
_last_row = _next_row;
add_to_buffer(_next_row);
move_to_next_entry();
} else {
@@ -533,7 +538,7 @@ void cache_flat_mutation_reader::copy_from_cache_to_buffer() {
inline
void cache_flat_mutation_reader::move_to_end() {
finish_reader();
clogger.trace("csm {}: eos", this);
clogger.trace("csm {}: eos", fmt::ptr(this));
}
inline
@@ -558,7 +563,7 @@ void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::con
_ck_ranges_curr = next_it;
auto adjacent = _next_row.advance_to(_lower_bound);
_next_row_in_range = !after_current_range(_next_row.position());
clogger.trace("csm {}: move_to_range(), range={}, lb={}, ub={}, next={}", this, *_ck_ranges_curr, _lower_bound, _upper_bound, _next_row.position());
clogger.trace("csm {}: move_to_range(), range={}, lb={}, ub={}, next={}", fmt::ptr(this), *_ck_ranges_curr, _lower_bound, _upper_bound, _next_row.position());
if (!adjacent && !_next_row.continuous()) {
// FIXME: We don't insert a dummy for singular range to avoid allocating 3 entries
// for a hit (before, at and after). If we supported the concept of an incomplete row,
@@ -568,11 +573,11 @@ void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::con
// Insert dummy for lower bound
if (can_populate()) {
// FIXME: _lower_bound could be adjacent to the previous row, in which case we could skip this
clogger.trace("csm {}: insert dummy at {}", this, _lower_bound);
clogger.trace("csm {}: insert dummy at {}", fmt::ptr(this), _lower_bound);
auto it = with_allocator(_lsa_manager.region().allocator(), [&] {
auto& rows = _snp->version()->partition().clustered_rows();
auto new_entry = current_allocator().construct<rows_entry>(*_schema, _lower_bound, is_dummy::yes, is_continuous::no);
return rows.insert_before(_next_row.get_iterator_in_latest_version(), *new_entry);
auto new_entry = alloc_strategy_unique_ptr<rows_entry>(current_allocator().construct<rows_entry>(*_schema, _lower_bound, is_dummy::yes, is_continuous::no));
return rows.insert_before(_next_row.get_iterator_in_latest_version(), std::move(new_entry));
});
_snp->tracker()->insert(*it);
_last_row = partition_snapshot_row_weakref(*_snp, it, true);
@@ -584,28 +589,64 @@ void cache_flat_mutation_reader::move_to_range(query::clustering_row_ranges::con
}
}
// Drops _last_row entry when possible without changing logical contents of the partition.
// Call only when _last_row and _next_row are valid.
// Calling after ensure_population_lower_bound() is ok.
// _next_row must have a greater position than _last_row.
// Invalidates references but keeps the _next_row valid.
inline
void cache_flat_mutation_reader::maybe_drop_last_entry() noexcept {
// Drop dummy entry if it falls inside a continuous range.
// This prevents unnecessary dummy entries from accumulating in cache and slowing down scans.
//
// Eviction can happen only from oldest versions to preserve the continuity non-overlapping rule
// (See docs/design-notes/row_cache.md)
//
if (_last_row
&& _last_row->dummy()
&& _last_row->continuous()
&& _snp->at_latest_version()
&& _snp->at_oldest_version()) {
with_allocator(_snp->region().allocator(), [&] {
_last_row->on_evicted(_read_context->cache()._tracker);
});
_last_row = nullptr;
// There could be iterators pointing to _last_row, invalidate them
_snp->region().allocator().invalidate_references();
// Don't invalidate _next_row, move_to_next_entry() expects it to be still valid.
_next_row.force_valid();
}
}
// _next_row must be inside the range.
inline
void cache_flat_mutation_reader::move_to_next_entry() {
clogger.trace("csm {}: move_to_next_entry(), curr={}", this, _next_row.position());
clogger.trace("csm {}: move_to_next_entry(), curr={}", fmt::ptr(this), _next_row.position());
if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) {
move_to_next_range();
} else {
auto new_last_row = partition_snapshot_row_weakref(_next_row);
if (!_next_row.next()) {
move_to_end();
return;
}
_last_row = std::move(new_last_row);
_next_row_in_range = !after_current_range(_next_row.position());
clogger.trace("csm {}: next={}, cont={}, in_range={}", this, _next_row.position(), _next_row.continuous(), _next_row_in_range);
clogger.trace("csm {}: next={}, cont={}, in_range={}", fmt::ptr(this), _next_row.position(), _next_row.continuous(), _next_row_in_range);
if (!_next_row.continuous()) {
start_reading_from_underlying();
} else {
maybe_drop_last_entry();
}
}
}
inline
void cache_flat_mutation_reader::add_to_buffer(mutation_fragment&& mf) {
clogger.trace("csm {}: add_to_buffer({})", this, mutation_fragment::printer(*_schema, mf));
clogger.trace("csm {}: add_to_buffer({})", fmt::ptr(this), mutation_fragment::printer(*_schema, mf));
if (mf.is_clustering_row()) {
add_clustering_row_to_buffer(std::move(mf));
} else {
@@ -618,7 +659,14 @@ inline
void cache_flat_mutation_reader::add_to_buffer(const partition_snapshot_row_cursor& row) {
if (!row.dummy()) {
_read_context->cache().on_row_hit();
add_clustering_row_to_buffer(row.row(_read_context->digest_requested()));
add_clustering_row_to_buffer(mutation_fragment(*_schema, _permit, row.row(_read_context->digest_requested())));
} else {
position_in_partition::less_compare less(*_schema);
if (less(_lower_bound, row.position())) {
_lower_bound = row.position();
_lower_bound_changed = true;
}
_read_context->cache()._tracker.on_dummy_row_hit();
}
}
@@ -627,7 +675,7 @@ void cache_flat_mutation_reader::add_to_buffer(const partition_snapshot_row_curs
// (2) If _lower_bound > mf.position(), mf was emitted
inline
void cache_flat_mutation_reader::add_clustering_row_to_buffer(mutation_fragment&& mf) {
clogger.trace("csm {}: add_clustering_row_to_buffer({})", this, mutation_fragment::printer(*_schema, mf));
clogger.trace("csm {}: add_clustering_row_to_buffer({})", fmt::ptr(this), mutation_fragment::printer(*_schema, mf));
auto& row = mf.as_clustering_row();
auto new_lower_bound = position_in_partition::after_key(row.key());
push_mutation_fragment(std::move(mf));
@@ -637,7 +685,7 @@ void cache_flat_mutation_reader::add_clustering_row_to_buffer(mutation_fragment&
inline
void cache_flat_mutation_reader::add_to_buffer(range_tombstone&& rt) {
clogger.trace("csm {}: add_to_buffer({})", this, rt);
clogger.trace("csm {}: add_to_buffer({})", fmt::ptr(this), rt);
// This guarantees that rt starts after any emitted clustering_row
// and not before any emitted range tombstone.
position_in_partition::less_compare less(*_schema);
@@ -645,18 +693,18 @@ void cache_flat_mutation_reader::add_to_buffer(range_tombstone&& rt) {
return;
}
if (!less(_lower_bound, rt.position())) {
rt.set_start(*_schema, _lower_bound);
rt.set_start(_lower_bound);
} else {
_lower_bound = position_in_partition(rt.position());
_lower_bound_changed = true;
}
push_mutation_fragment(std::move(rt));
push_mutation_fragment(*_schema, _permit, std::move(rt));
}
inline
void cache_flat_mutation_reader::maybe_add_to_cache(const range_tombstone& rt) {
if (can_populate()) {
clogger.trace("csm {}: maybe_add_to_cache({})", this, rt);
clogger.trace("csm {}: maybe_add_to_cache({})", fmt::ptr(this), rt);
_lsa_manager.run_in_update_section_with_allocator([&] {
_snp->version()->partition().row_tombstones().apply_monotonically(*_schema, rt);
});
@@ -668,7 +716,7 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const range_tombstone& rt) {
inline
void cache_flat_mutation_reader::maybe_add_to_cache(const static_row& sr) {
if (can_populate()) {
clogger.trace("csm {}: populate({})", this, static_row::printer(*_schema, sr));
clogger.trace("csm {}: populate({})", fmt::ptr(this), static_row::printer(*_schema, sr));
_read_context->cache().on_static_row_insert();
_lsa_manager.run_in_update_section_with_allocator([&] {
if (_read_context->digest_requested()) {
@@ -684,7 +732,7 @@ void cache_flat_mutation_reader::maybe_add_to_cache(const static_row& sr) {
inline
void cache_flat_mutation_reader::maybe_set_static_row_continuous() {
if (can_populate()) {
clogger.trace("csm {}: set static row continuous", this);
clogger.trace("csm {}: set static row continuous", fmt::ptr(this));
_snp->version()->partition().set_static_row_continuous(true);
} else {
_read_context->cache().on_mispopulate();

View File

@@ -23,7 +23,7 @@
#include <seastar/core/sstring.hh>
#include <boost/lexical_cast.hpp>
#include "exceptions/exceptions.hh"
#include "json.hh"
#include "utils/rjson.hh"
#include "seastarx.hh"
class schema;
@@ -76,7 +76,7 @@ public:
}
sstring to_sstring() const {
return json::to_json(to_map());
return rjson::print(rjson::from_string_map(to_map()));
}
static caching_options get_disabled_caching_options() {
@@ -97,13 +97,14 @@ public:
} else if (p.first == "enabled") {
e = p.second == "true";
} else {
throw exceptions::configuration_exception("Invalid caching option: " + p.first);
throw exceptions::configuration_exception(format("Invalid caching option: {}", p.first));
}
}
return caching_options(k, r, e);
}
static caching_options from_sstring(const sstring& str) {
return from_map(json::to_map(str));
return from_map(rjson::parse_to_map<std::map<sstring, sstring>>(str));
}
bool operator==(const caching_options& other) const {

View File

@@ -37,7 +37,7 @@
#include "idl/mutation.dist.impl.hh"
#include <iostream>
canonical_mutation::canonical_mutation(bytes data)
canonical_mutation::canonical_mutation(bytes_ostream data)
: _data(std::move(data))
{ }
@@ -45,8 +45,7 @@ canonical_mutation::canonical_mutation(const mutation& m)
{
mutation_partition_serializer part_ser(*m.schema(), m.partition());
bytes_ostream out;
ser::writer_of_canonical_mutation<bytes_ostream> wr(out);
ser::writer_of_canonical_mutation<bytes_ostream> wr(_data);
std::move(wr).write_table_id(m.schema()->id())
.write_schema_version(m.schema()->version())
.write_key(m.key())
@@ -54,7 +53,6 @@ canonical_mutation::canonical_mutation(const mutation& m)
.partition([&] (auto wr) {
part_ser.write(std::move(wr));
}).end_canonical_mutation();
_data = to_bytes(out.linearize());
}
utils::UUID canonical_mutation::column_family_id() const {

View File

@@ -32,9 +32,9 @@
// Safe to access from other shards via const&.
// Safe to pass serialized across nodes.
class canonical_mutation {
bytes _data;
bytes_ostream _data;
public:
explicit canonical_mutation(bytes);
explicit canonical_mutation(bytes_ostream);
explicit canonical_mutation(const mutation&);
canonical_mutation(canonical_mutation&&) = default;
@@ -51,7 +51,7 @@ public:
utils::UUID column_family_id() const;
const bytes& representation() const { return _data; }
const bytes_ostream& representation() const { return _data; }
friend std::ostream& operator<<(std::ostream& os, const canonical_mutation& cm);
};

View File

@@ -33,9 +33,13 @@ template<typename T>
struct cartesian_product {
const std::vector<std::vector<T>>& _vec_of_vecs;
public:
class iterator : public std::iterator<std::forward_iterator_tag, std::vector<T>> {
class iterator {
public:
using iterator_category = std::forward_iterator_tag;
using value_type = std::vector<T>;
using difference_type = std::ptrdiff_t;
using pointer = std::vector<T>*;
using reference = std::vector<T>&;
private:
size_t _pos;
const std::vector<std::vector<T>>* _vec_of_vecs;

View File

@@ -20,10 +20,16 @@
#pragma once
#include <map>
#include <seastar/core/sstring.hh>
#include "bytes.hh"
#include "serializer.hh"
#include "db/extensions.hh"
#include "cdc/cdc_options.hh"
#include "schema.hh"
#include "serializer_impl.hh"
namespace cdc {
@@ -33,6 +39,7 @@ public:
static constexpr auto NAME = "cdc";
cdc_extension() = default;
cdc_extension(const options& opts) : _cdc_options(opts) {}
explicit cdc_extension(std::map<sstring, sstring> tags) : _cdc_options(std::move(tags)) {}
explicit cdc_extension(const bytes& b) : _cdc_options(cdc_extension::deserialize(b)) {}
explicit cdc_extension(const sstring& s) {

View File

@@ -27,10 +27,32 @@
namespace cdc {
enum class delta_mode : uint8_t {
keys,
full,
};
/**
* Image collection mode (for now used only for the pre-image).
* States how much information to record:
* off == none
* on == changed columns
* full == all (changed and unmodified columns)
*/
enum class image_mode : uint8_t {
off,
on,
full,
};
std::ostream& operator<<(std::ostream& os, delta_mode);
std::ostream& operator<<(std::ostream& os, image_mode);
class options final {
bool _enabled = false;
bool _preimage = false;
image_mode _preimage = image_mode::off;
bool _postimage = false;
delta_mode _delta_mode = delta_mode::full;
int _ttl = 86400; // 24h in seconds
public:
options() = default;
@@ -40,10 +62,19 @@ public:
sstring to_sstring() const;
bool enabled() const { return _enabled; }
bool preimage() const { return _preimage; }
bool preimage() const { return _preimage != image_mode::off; }
bool full_preimage() const { return _preimage == image_mode::full; }
bool postimage() const { return _postimage; }
delta_mode get_delta_mode() const { return _delta_mode; }
void set_delta_mode(delta_mode m) { _delta_mode = m; }
int ttl() const { return _ttl; }
void enabled(bool b) { _enabled = b; }
void preimage(bool b) { preimage(b ? image_mode::on : image_mode::off); }
void preimage(image_mode m) { _preimage = m; }
void postimage(bool b) { _postimage = b; }
void ttl(int v) { _ttl = v; }
bool operator==(const options& o) const;
bool operator!=(const options& o) const;
};
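
The change above widens the pre-image setting from a bool to a three-state mode while keeping the old bool accessors working. A minimal sketch of that pattern follows; `toy_image_mode` and `toy_options` are hypothetical stand-ins written for illustration, not the actual `cdc::options` class.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for cdc::image_mode.
enum class toy_image_mode : std::uint8_t {
    off,   // record nothing
    on,    // record changed columns
    full,  // record changed and unmodified columns
};

// Hypothetical stand-in for the pre-image part of cdc::options.
class toy_options {
    toy_image_mode _preimage = toy_image_mode::off;
public:
    // Bool view kept for callers that only ask "is it enabled?".
    bool preimage() const { return _preimage != toy_image_mode::off; }
    bool full_preimage() const { return _preimage == toy_image_mode::full; }

    // The bool setter maps true to `on`, never to `full`.
    void preimage(bool b) { preimage(b ? toy_image_mode::on : toy_image_mode::off); }
    void preimage(toy_image_mode m) { _preimage = m; }
};

// Applies the bool setter and reports whether the full mode was selected.
inline bool full_after_bool_enable() {
    toy_options o;
    o.preimage(true);
    return o.full_preimage();
}
```

In this sketch, enabling through the bool setter yields `on`, so `full_preimage()` stays false until a caller explicitly passes the `full` mode.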

283
cdc/change_visitor.hh Normal file
View File

@@ -0,0 +1,283 @@
/*
* Copyright (C) 2020 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include "mutation.hh"
/*
* This file contains a general abstraction for walking over mutations,
* deconstructing them into ``atomic'' pieces, and consuming these pieces.
*
* The pieces considered atomic are:
* - atomic_cells, either in collections or in atomic columns
* (see `live_collection_cell`, `dead_collection_cell`, `live_atomic_cell`, `dead_atomic_cell`),
* - collection tombstones (see `collection_tombstone`)
* - row markers (see `marker`)
* - row tombstones (see `clustered_row_delete`),
* - range tombstones (see `range_delete`),
* - partition tombstones (see `partition_delete`).
* We use the term ``changes'' to refer to these atomic pieces, hence the name ``ChangeVisitor''.
*
* IMPORTANT: this doesn't understand all possible states that a mutation can have, e.g. it doesn't understand
* the concept of ``continuity''. However, it is sufficient for analyzing mutations created by a write coordinator,
* e.g. obtained by parsing a CQL statement.
*
* To analyze a mutation, create a visitor (described by the `ChangeVisitor` concept below) and pass it
* together with the mutation to `inspect_mutation`.
*
* To analyze certain fragments of the mutation, the inspecting code requires further visitors to be passed.
* For example, when it encounters a clustered row update, it calls `clustered_row_cells` on the visitor,
* passing it the row's key and the callback. The visitor can then decide:
* - if it's not interested in the row's cells, it can simply not call the callback,
* - otherwise, it can call the callback with a value of type that satisfies the ``RowCellsVisitor'' concept.
* If the callback is called, the inspector walks over the row and passes the changes into the ``row cells visitor''.
* In either case, it will then proceed to analyze further parts of the mutation, if any.
*
* Note that the type passed to the callbacks provided by the inspector (such as in the example above)
* can be decided at runtime. This can be especially useful with the callback passed to `collection_column`
* in RowCellsVisitor, if different collection types require different logic to handle.
*
* The dummy visitors below are there only to define the concepts.
* For example, in the RowCellsVisitor concept I wanted to express that `visit_collection` in RowCellsVisitor
* is a function that handles *any* type which satisfies CollectionVisitor. I didn't find a way to do that
* other than providing a ``most generic'' concrete type which satisfies the interface (`dummy_collection_visitor`).
* Unfortunately C++ is still not Haskell.
*
* The inspector calls `finished()` after visiting each change, and sometimes before (e.g. when it starts
* visiting a static row, but before it visits any of its cells). If it returns true, the inspector
* will stop the visitation. Thus, if at any point during the walk the visitor decides it's not interested
* in any more changes, it can inform the inspector by returning `true` from `finished()`.
*
* IMPORTANT: if the visitor returns `true` from `finished()`, it should keep returning `true`. This is because
* the inspector may call `finished()` multiple times when exiting some nested loops.
*
* The order of visitation is as follows:
* - First the static row is visited, if it has any cells.
* Within the row, its columns are visited in order of increasing column IDs.
*
* - Then, for each clustering key, if a change (row marker, cell, or tombstone) exists for this key:
* - The row marker is visited, if there is one.
* - Columns are visited in order of increasing column IDs.
* - The row tombstone is visited, if there is one.
*
* For both the static row and a clustering row, for each column:
* - If the column is atomic, a corresponding atomic_cell is visited (if there is one).
* - Otherwise (the column is non-atomic):
* - The collection tombstone is visited first.
* - Cells are visited in order of increasing keys
* (assuming that the mutation was correctly constructed, i.e. it stores cells in key order).
*
* WARNING: the visited collection tombstone and cells
* are guaranteed to live only for the duration of the `collection_column` call.
*
* - Then range tombstones are visited. The order is unspecified
* (more accurately: if it's specified, I don't know what it is)
*
* - Finally, the partition tombstone is visited, if it exists.
*/
namespace cdc {
template <typename V>
concept CollectionVisitor = requires(V v,
const tombstone& t,
bytes_view key,
const atomic_cell_view& cell) {
{ v.collection_tombstone(t) } -> std::same_as<void>;
{ v.live_collection_cell(key, cell) } -> std::same_as<void>;
{ v.dead_collection_cell(key, cell) } -> std::same_as<void>;
{ v.finished() } -> std::same_as<bool>;
};
struct dummy_collection_visitor {
void collection_tombstone(const tombstone&) {}
void live_collection_cell(bytes_view, const atomic_cell_view&) {}
void dead_collection_cell(bytes_view, const atomic_cell_view&) {}
bool finished() { return false; }
};
template <typename V>
concept RowCellsVisitor = requires(V v,
const column_definition& cdef,
const atomic_cell_view& cell,
noncopyable_function<void(dummy_collection_visitor&)> visit_collection) {
{ v.live_atomic_cell(cdef, cell) } -> std::same_as<void>;
{ v.dead_atomic_cell(cdef, cell) } -> std::same_as<void>;
{ v.collection_column(cdef, std::move(visit_collection)) } -> std::same_as<void>;
{ v.finished() } -> std::same_as<bool>;
};
struct dummy_row_cells_visitor {
void live_atomic_cell(const column_definition&, const atomic_cell_view&) {}
void dead_atomic_cell(const column_definition&, const atomic_cell_view&) {}
void collection_column(const column_definition&, auto&& visit_collection) {
dummy_collection_visitor v;
visit_collection(v);
}
bool finished() { return false; }
};
template <typename V>
concept ClusteredRowCellsVisitor = requires(V v,
const row_marker& rm) {
requires RowCellsVisitor<V>;
{ v.marker(rm) } -> std::same_as<void>;
};
struct dummy_clustered_row_cells_visitor : public dummy_row_cells_visitor {
void marker(const row_marker&) {}
};
template <typename V>
concept ChangeVisitor = requires(V v,
api::timestamp_type ts,
const clustering_key& ckey,
const range_tombstone& rt,
const tombstone& t,
noncopyable_function<void(dummy_clustered_row_cells_visitor&)> visit_clustered_row_cells,
noncopyable_function<void(dummy_row_cells_visitor&)> visit_row_cells) {
{ v.static_row_cells(std::move(visit_row_cells)) } -> std::same_as<void>;
{ v.clustered_row_cells(ckey, std::move(visit_clustered_row_cells)) } -> std::same_as<void>;
{ v.clustered_row_delete(ckey, t) } -> std::same_as<void>;
{ v.range_delete(rt) } -> std::same_as<void>;
{ v.partition_delete(t) } -> std::same_as<void>;
{ v.finished() } -> std::same_as<bool>;
};
template <RowCellsVisitor V>
void inspect_row_cells(const schema& s, column_kind ckind, const row& r, V& v) {
r.for_each_cell_until([&s, ckind, &v] (column_id id, const atomic_cell_or_collection& acoc) {
auto& cdef = s.column_at(ckind, id);
if (cdef.is_atomic()) {
auto cell = acoc.as_atomic_cell(cdef);
if (cell.is_live()) {
v.live_atomic_cell(cdef, cell);
} else {
v.dead_atomic_cell(cdef, cell);
}
return stop_iteration(v.finished());
}
acoc.as_collection_mutation().with_deserialized(*cdef.type, [&v, &cdef] (collection_mutation_view_description view) {
v.collection_column(cdef, [&view] (CollectionVisitor auto& cv) {
if (cv.finished()) {
return;
}
if (view.tomb) {
cv.collection_tombstone(view.tomb);
if (cv.finished()) {
return;
}
}
for (auto& [key, cell]: view.cells) {
if (cell.is_live()) {
cv.live_collection_cell(key, cell);
} else {
cv.dead_collection_cell(key, cell);
}
if (cv.finished()) {
return;
}
}
});
});
return stop_iteration(v.finished());
});
}
template <ChangeVisitor V>
void inspect_mutation(const mutation& m, V& v) {
auto& p = m.partition();
auto& s = *m.schema();
if (!p.static_row().empty()) {
v.static_row_cells([&s, &p] (RowCellsVisitor auto& srv) {
if (srv.finished()) {
return;
}
inspect_row_cells(s, column_kind::static_column, p.static_row().get(), srv);
});
if (v.finished()) {
return;
}
}
for (auto& cr: p.clustered_rows()) {
auto& r = cr.row();
if (r.marker().is_live() || !r.cells().empty()) {
v.clustered_row_cells(cr.key(), [&s, &r] (ClusteredRowCellsVisitor auto& crv) {
if (crv.finished()) {
return;
}
auto& rm = r.marker();
if (rm.is_live()) {
crv.marker(rm);
if (crv.finished()) {
return;
}
}
inspect_row_cells(s, column_kind::regular_column, r.cells(), crv);
});
if (v.finished()) {
return;
}
}
if (r.deleted_at()) {
auto t = r.deleted_at().tomb();
assert(t.timestamp != api::missing_timestamp);
v.clustered_row_delete(cr.key(), t);
if (v.finished()) {
return;
}
}
}
for (auto& rt: p.row_tombstones()) {
assert(rt.tomb.timestamp != api::missing_timestamp);
v.range_delete(rt);
if (v.finished()) {
return;
}
}
if (p.partition_tombstone()) {
v.partition_delete(p.partition_tombstone());
}
}
} // namespace cdc
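
The traversal above checks `finished()` after each visitor callback so that a visitor can stop the walk early. A minimal standalone sketch of that pattern follows; `counting_visitor` and `inspect_rows` are hypothetical names used only for illustration and are not part of the source, and rows are modelled as plain vectors of integers rather than mutation rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical visitor (not from the source): counts the cells it is shown
// and asks the traversal to stop once `limit` cells have been seen.
struct counting_visitor {
    std::size_t limit;
    std::size_t seen = 0;

    void cell(int /*value*/) { ++seen; }
    bool finished() const { return seen >= limit; }
};

// Hypothetical traversal: visits every cell of every row, and returns as soon
// as the visitor reports finished(), mirroring the early-return checks above.
template <typename Visitor>
void inspect_rows(const std::vector<std::vector<int>>& rows, Visitor& v) {
    for (const auto& row : rows) {
        for (int c : row) {
            v.cell(c);
            if (v.finished()) {
                return;
            }
        }
    }
}
```

A visitor with a limit of 3 sees exactly three cells even when more are available; a visitor whose limit is never reached sees all of them.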


@@ -22,10 +22,13 @@
#include <boost/type.hpp>
#include <random>
#include <unordered_set>
#include <algorithm>
#include <seastar/core/sleep.hh>
#include <seastar/core/coroutine.hh>
#include "keys.hh"
#include "schema_builder.hh"
#include "database.hh"
#include "db/config.hh"
#include "db/system_keyspace.hh"
#include "db/system_distributed_keyspace.hh"
@@ -36,6 +39,8 @@
#include "gms/gossiper.hh"
#include "cdc/generation.hh"
#include "cdc/cdc_options.hh"
#include "cdc/generation_service.hh"
extern logging::logger cdc_log;
@@ -59,14 +64,57 @@ static void copy_int_to_bytes(int64_t i, size_t offset, bytes& b) {
std::copy_n(reinterpret_cast<int8_t*>(&i), sizeof(int64_t), b.begin() + offset);
}
stream_id::stream_id(int64_t first, int64_t second)
static constexpr auto stream_id_version_bits = 4;
static constexpr auto stream_id_random_bits = 38;
static constexpr auto stream_id_index_bits = sizeof(uint64_t)*8 - stream_id_version_bits - stream_id_random_bits;
static constexpr auto stream_id_version_shift = 0;
static constexpr auto stream_id_index_shift = stream_id_version_shift + stream_id_version_bits;
static constexpr auto stream_id_random_shift = stream_id_index_shift + stream_id_index_bits;
/**
* Responsibility for encoding stream_id moved from the factory method to
* this constructor, to keep knowledge of the composition in a single place.
* Note that this constructor is private and friended to
* topology_description_generator, because that class defines the "order"
* in which we view vnodes etc.
*/
stream_id::stream_id(dht::token token, size_t vnode_index)
: _value(bytes::initialized_later(), 2 * sizeof(int64_t))
{
copy_int_to_bytes(first, 0, _value);
copy_int_to_bytes(second, sizeof(int64_t), _value);
static thread_local std::mt19937_64 rand_gen(std::random_device{}());
static thread_local std::uniform_int_distribution<uint64_t> rand_dist;
auto rand = rand_dist(rand_gen);
auto mask_shift = [](uint64_t val, size_t bits, size_t shift) {
return (val & ((1ull << bits) - 1u)) << shift;
};
/**
* Low qword (bit positions, counted from the least significant bit):
* 0-3: version
* 4-25: vnode index as when created (see generation below). This excludes shards
* 26-63: random value (maybe to be replaced with timestamp)
*/
auto low_qword = mask_shift(version_1, stream_id_version_bits, stream_id_version_shift)
| mask_shift(vnode_index, stream_id_index_bits, stream_id_index_shift)
| mask_shift(rand, stream_id_random_bits, stream_id_random_shift)
;
copy_int_to_bytes(dht::token::to_int64(token), 0, _value);
copy_int_to_bytes(low_qword, sizeof(int64_t), _value);
// not a hot code path. make sure we did not mess up the shifts and masks.
assert(version() == version_1);
assert(index() == vnode_index);
}
stream_id::stream_id(bytes b) : _value(std::move(b)) { }
stream_id::stream_id(bytes b)
: _value(std::move(b))
{
// This is not a very solid check. IDs created before GA (before versioned IDs)
// have fully random bits in the low qword, so this could go either way...
if (version() > version_1) {
throw std::invalid_argument("Unknown CDC stream id version");
}
}
bool stream_id::is_set() const {
return !_value.empty();
@@ -76,6 +124,10 @@ bool stream_id::operator==(const stream_id& o) const {
return _value == o._value;
}
bool stream_id::operator!=(const stream_id& o) const {
return !(*this == o);
}
bool stream_id::operator<(const stream_id& o) const {
return _value < o._value;
}
@@ -87,18 +139,26 @@ static int64_t bytes_to_int64(bytes_view b, size_t offset) {
return net::ntoh(res);
}
int64_t stream_id::first() const {
return token_from_bytes(_value);
}
int64_t stream_id::second() const {
return bytes_to_int64(_value, sizeof(int64_t));
dht::token stream_id::token() const {
return dht::token::from_int64(token_from_bytes(_value));
}
int64_t stream_id::token_from_bytes(bytes_view b) {
return bytes_to_int64(b, 0);
}
static uint64_t unpack_value(bytes_view b, size_t off, size_t shift, size_t bits) {
return (uint64_t(bytes_to_int64(b, off)) >> shift) & ((1ull << bits) - 1u);
}
uint8_t stream_id::version() const {
return unpack_value(_value, sizeof(int64_t), stream_id_version_shift, stream_id_version_bits);
}
size_t stream_id::index() const {
return unpack_value(_value, sizeof(int64_t), stream_id_index_shift, stream_id_index_bits);
}
const bytes& stream_id::to_bytes() const {
return _value;
}
@@ -119,26 +179,38 @@ bool topology_description::operator==(const topology_description& o) const {
return _entries == o._entries;
}
const std::vector<token_range_description>& topology_description::entries() const {
const std::vector<token_range_description>& topology_description::entries() const& {
return _entries;
}
static stream_id create_stream_id(dht::token t) {
static thread_local std::mt19937_64 rand_gen(std::random_device().operator()());
static thread_local std::uniform_int_distribution<int64_t> rand_dist(std::numeric_limits<int64_t>::min());
std::vector<token_range_description>&& topology_description::entries() && {
return std::move(_entries);
}
return {dht::token::to_int64(t), rand_dist(rand_gen)};
static std::vector<stream_id> create_stream_ids(
size_t index, dht::token start, dht::token end, size_t shard_count, uint8_t ignore_msb) {
std::vector<stream_id> result;
result.reserve(shard_count);
dht::sharder sharder(shard_count, ignore_msb);
for (size_t shard_idx = 0; shard_idx < shard_count; ++shard_idx) {
auto t = dht::find_first_token_for_shard(sharder, start, end, shard_idx);
// Compose the id from the token and the "index" of the vnode that owns
// the range end, as defined by token sort order. This effectively groups
// the ids within this shard set.
result.emplace_back(stream_id(t, index));
}
return result;
}
class topology_description_generator final {
const db::config& _cfg;
const std::unordered_set<dht::token>& _bootstrap_tokens;
const locator::token_metadata& _token_metadata;
const locator::token_metadata_ptr _tmptr;
const gms::gossiper& _gossiper;
// Compute a set of tokens that split the token ring into vnodes
auto get_tokens() const {
auto tokens = _token_metadata.sorted_tokens();
auto tokens = _tmptr->sorted_tokens();
auto it = tokens.insert(
tokens.end(), _bootstrap_tokens.begin(), _bootstrap_tokens.end());
std::sort(it, tokens.end());
@@ -150,10 +222,10 @@ class topology_description_generator final {
// Fetch sharding parameters for a node that owns vnode ending with this.end
// Returns <shard_count, ignore_msb> pair.
std::pair<size_t, uint8_t> get_sharding_info(dht::token end) const {
if (_bootstrap_tokens.count(end) > 0) {
if (_bootstrap_tokens.contains(end)) {
return {smp::count, _cfg.murmur3_partitioner_ignore_msb_bits()};
} else {
auto endpoint = _token_metadata.get_endpoint(end);
auto endpoint = _tmptr->get_endpoint(end);
if (!endpoint) {
throw std::runtime_error(
format("Can't find endpoint for token {}", end));
@@ -163,32 +235,26 @@ class topology_description_generator final {
}
}
token_range_description create_description(dht::token start, dht::token end) const {
token_range_description create_description(size_t index, dht::token start, dht::token end) const {
token_range_description desc;
desc.token_range_end = end;
auto [shard_count, ignore_msb] = get_sharding_info(end);
desc.streams.reserve(shard_count);
desc.streams = create_stream_ids(index, start, end, shard_count, ignore_msb);
desc.sharding_ignore_msb = ignore_msb;
dht::sharder sharder(shard_count, ignore_msb);
for (size_t shard_idx = 0; shard_idx < shard_count; ++shard_idx) {
auto t = dht::find_first_token_for_shard(sharder, start, end, shard_idx);
desc.streams.push_back(create_stream_id(t));
}
return desc;
}
public:
topology_description_generator(
const db::config& cfg,
const std::unordered_set<dht::token>& bootstrap_tokens,
const locator::token_metadata& token_metadata,
const locator::token_metadata_ptr tmptr,
const gms::gossiper& gossiper)
: _cfg(cfg)
, _bootstrap_tokens(bootstrap_tokens)
, _token_metadata(token_metadata)
, _tmptr(std::move(tmptr))
, _gossiper(gossiper)
{}
@@ -213,10 +279,10 @@ public:
vnode_descriptions.reserve(tokens.size());
vnode_descriptions.push_back(
create_description(tokens.back(), tokens.front()));
create_description(0, tokens.back(), tokens.front()));
for (size_t idx = 1; idx < tokens.size(); ++idx) {
vnode_descriptions.push_back(
create_description(tokens[idx - 1], tokens[idx]));
create_description(idx, tokens[idx - 1], tokens[idx]));
}
return {std::move(vnode_descriptions)};
@@ -243,24 +309,68 @@ future<db_clock::time_point> get_local_streams_timestamp() {
});
}
// Run inside seastar::async context.
db_clock::time_point make_new_cdc_generation(
// non-static for testing
size_t limit_of_streams_in_topology_description() {
// Each stream takes 16B and we don't want to exceed 4MB so we can have
// at most 262144 streams but not less than 1 per vnode.
return 4 * 1024 * 1024 / 16;
}
// non-static for testing
topology_description limit_number_of_streams_if_needed(topology_description&& desc) {
int64_t streams_count = 0;
for (auto& tr_desc : desc.entries()) {
streams_count += tr_desc.streams.size();
}
size_t limit = std::max(limit_of_streams_in_topology_description(), desc.entries().size());
if (limit >= streams_count) {
return std::move(desc);
}
size_t streams_per_vnode_limit = limit / desc.entries().size();
auto entries = std::move(desc).entries();
auto start = entries.back().token_range_end;
for (size_t idx = 0; idx < entries.size(); ++idx) {
auto end = entries[idx].token_range_end;
if (entries[idx].streams.size() > streams_per_vnode_limit) {
entries[idx].streams =
create_stream_ids(idx, start, end, streams_per_vnode_limit, entries[idx].sharding_ignore_msb);
}
start = end;
}
return topology_description(std::move(entries));
}
future<db_clock::time_point> make_new_cdc_generation(
const db::config& cfg,
const std::unordered_set<dht::token>& bootstrap_tokens,
const locator::token_metadata& tm,
const locator::token_metadata_ptr tmptr,
const gms::gossiper& g,
db::system_distributed_keyspace& sys_dist_ks,
std::chrono::milliseconds ring_delay,
bool for_testing) {
auto gen = topology_description_generator(cfg, bootstrap_tokens, tm, g).generate();
bool add_delay) {
using namespace std::chrono;
auto gen = topology_description_generator(cfg, bootstrap_tokens, tmptr, g).generate();
// If the cluster is large we may end up with a generation that contains
// a large number of streams. This is problematic because we store the
// generation in a single row. For a generation with a large number of streams
// this will lead to a row that can be as big as 32MB. This is much more
// than the limit imposed by commitlog_segment_size_in_mb. If the size of
// the row that describes a new generation grows above
// commitlog_segment_size_in_mb, the write will fail and the new node won't
// be able to join. To avoid this problem we make sure that the row is
// always smaller than 4MB. We do that by removing some CDC streams from
// each vnode if the total number of streams is too large.
gen = limit_number_of_streams_if_needed(std::move(gen));
// Begin the race.
auto ts = db_clock::now() + (
for_testing ? std::chrono::milliseconds(0) : (
2 * ring_delay + std::chrono::duration_cast<std::chrono::milliseconds>(generation_leeway)));
sys_dist_ks.insert_cdc_topology_description(ts, std::move(gen), { tm.count_normal_token_owners() }).get();
(!add_delay || ring_delay == milliseconds(0)) ? milliseconds(0) : (
2 * ring_delay + duration_cast<milliseconds>(generation_leeway)));
co_await sys_dist_ks.insert_cdc_topology_description(ts, std::move(gen), { tmptr->count_normal_token_owners() });
return ts;
co_return ts;
}
std::optional<db_clock::time_point> get_streams_timestamp_for(const gms::inet_address& endpoint, const gms::gossiper& g) {
@@ -269,63 +379,581 @@ std::optional<db_clock::time_point> get_streams_timestamp_for(const gms::inet_ad
return gms::versioned_value::cdc_streams_timestamp_from_string(streams_ts_string);
}
// Run inside seastar::async context.
static void do_update_streams_description(
static future<> do_update_streams_description(
db_clock::time_point streams_ts,
db::system_distributed_keyspace& sys_dist_ks,
db::system_distributed_keyspace::context ctx) {
if (sys_dist_ks.cdc_desc_exists(streams_ts, ctx).get0()) {
cdc_log.debug("update_streams_description: description of generation {} already inserted", streams_ts);
return;
if (co_await sys_dist_ks.cdc_desc_exists(streams_ts, ctx)) {
cdc_log.info("Generation {}: streams description table already updated.", streams_ts);
co_return;
}
// We might race with another node also inserting the description, but that's ok. It's an idempotent operation.
auto topo = sys_dist_ks.read_cdc_topology_description(streams_ts, ctx).get0();
auto topo = co_await sys_dist_ks.read_cdc_topology_description(streams_ts, ctx);
if (!topo) {
throw std::runtime_error(format("could not find streams data for timestamp {}", streams_ts));
throw no_generation_data_exception(streams_ts);
}
std::set<cdc::stream_id> streams_set;
for (auto& entry: topo->entries()) {
streams_set.insert(entry.streams.begin(), entry.streams.end());
}
std::vector<cdc::stream_id> streams_vec(streams_set.begin(), streams_set.end());
sys_dist_ks.create_cdc_desc(streams_ts, streams_vec, ctx).get();
co_await sys_dist_ks.create_cdc_desc(streams_ts, *topo, ctx);
cdc_log.info("CDC description table successfully updated with generation {}.", streams_ts);
}
void update_streams_description(
future<> update_streams_description(
db_clock::time_point streams_ts,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
try {
do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
} catch(...) {
co_await do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
} catch (...) {
cdc_log.warn(
"Could not update CDC description table with generation {}: {}. Will retry in the background.",
streams_ts, std::current_exception());
// It is safe to discard this future: we keep system distributed keyspace alive.
(void)seastar::async([
streams_ts, sys_dist_ks, get_num_token_owners = std::move(get_num_token_owners), &abort_src
] {
(void)(([] (db_clock::time_point streams_ts,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) -> future<> {
while (true) {
sleep_abortable(std::chrono::seconds(60), abort_src).get();
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
try {
do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
return;
co_await do_update_streams_description(streams_ts, *sys_dist_ks, { get_num_token_owners() });
co_return;
} catch (...) {
cdc_log.warn(
"Could not update CDC description table with generation {}: {}. Will try again.",
streams_ts, std::current_exception());
}
}
});
})(streams_ts, std::move(sys_dist_ks), std::move(get_num_token_owners), abort_src));
}
}
static db_clock::time_point as_timepoint(const utils::UUID& uuid) {
return db_clock::time_point{std::chrono::milliseconds(utils::UUID_gen::get_adjusted_timestamp(uuid))};
}
static future<std::vector<db_clock::time_point>> get_cdc_desc_v1_timestamps(
db::system_distributed_keyspace& sys_dist_ks,
abort_source& abort_src,
const noncopyable_function<unsigned()>& get_num_token_owners) {
while (true) {
try {
co_return co_await sys_dist_ks.get_cdc_desc_v1_timestamps({ get_num_token_owners() });
} catch (...) {
cdc_log.warn(
"Failed to retrieve generation timestamps for rewriting: {}. Retrying in 60s.",
std::current_exception());
}
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
}
}
// Contains a CDC log table's creation time (extracted from its schema's id)
// and its CDC TTL setting.
struct time_and_ttl {
db_clock::time_point creation_time;
int ttl;
};
/*
* See `maybe_rewrite_streams_descriptions`.
* This is the long-running-in-the-background part of that function.
* It returns the timestamp of the last rewritten generation (if any).
*/
static future<std::optional<db_clock::time_point>> rewrite_streams_descriptions(
std::vector<time_and_ttl> times_and_ttls,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
cdc_log.info("Retrieving generation timestamps for rewriting...");
auto tss = co_await get_cdc_desc_v1_timestamps(*sys_dist_ks, abort_src, get_num_token_owners);
cdc_log.info("Generation timestamps retrieved.");
// Find first generation timestamp such that some CDC log table may contain data before this timestamp.
// This predicate is monotonic w.r.t. the timestamps.
auto now = db_clock::now();
std::sort(tss.begin(), tss.end());
auto first = std::partition_point(tss.begin(), tss.end(), [&] (db_clock::time_point ts) {
// partition_point finds first element that does *not* satisfy the predicate.
return std::none_of(times_and_ttls.begin(), times_and_ttls.end(),
[&] (const time_and_ttl& tat) {
// In this CDC log table there are no entries older than the table's creation time
// or (now - the table's ttl). We subtract 10s to account for some possible clock drift.
// If ttl is set to 0 then entries in this table never expire. In that case we look
// only at the table's creation time.
auto no_entries_older_than =
(tat.ttl == 0 ? tat.creation_time : std::max(tat.creation_time, now - std::chrono::seconds(tat.ttl)))
- std::chrono::seconds(10);
return no_entries_older_than < ts;
});
});
// Find first generation timestamp such that some CDC log table may contain data in this generation.
// This and all later generations need to be written to the new streams table.
if (first != tss.begin()) {
--first;
}
if (first == tss.end()) {
cdc_log.info("No generations to rewrite.");
co_return std::nullopt;
}
cdc_log.info("First generation to rewrite: {}", *first);
bool each_success = true;
co_await max_concurrent_for_each(first, tss.end(), 10, [&] (db_clock::time_point ts) -> future<> {
while (true) {
try {
co_return co_await do_update_streams_description(ts, *sys_dist_ks, { get_num_token_owners() });
} catch (const no_generation_data_exception& e) {
cdc_log.error("Failed to rewrite streams for generation {}: {}. Giving up.", ts, e);
each_success = false;
co_return;
} catch (...) {
cdc_log.warn("Failed to rewrite streams for generation {}: {}. Retrying in 60s.", ts, std::current_exception());
}
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
}
});
if (each_success) {
cdc_log.info("Rewriting stream tables finished successfully.");
} else {
cdc_log.info("Rewriting stream tables finished, but some generations could not be rewritten (check the logs).");
}
if (first != tss.end()) {
co_return *std::prev(tss.end());
}
co_return std::nullopt;
}
future<> maybe_rewrite_streams_descriptions(
const database& db,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
if (!db.has_schema(sys_dist_ks->NAME, sys_dist_ks->CDC_DESC_V1)) {
// This cluster never went through a Scylla version which used this table
// or the user deleted the table. Nothing to do.
co_return;
}
if (co_await db::system_keyspace::cdc_is_rewritten()) {
co_return;
}
if (db.get_config().cdc_dont_rewrite_streams()) {
cdc_log.warn("Stream rewriting disabled. Manual administrator intervention may be required...");
co_return;
}
// For each CDC log table get the TTL setting (from CDC options) and the table's creation time
std::vector<time_and_ttl> times_and_ttls;
for (auto& [_, cf] : db.get_column_families()) {
auto& s = *cf->schema();
auto base = cdc::get_base_table(db, s.ks_name(), s.cf_name());
if (!base) {
// Not a CDC log table.
continue;
}
auto& cdc_opts = base->cdc_options();
if (!cdc_opts.enabled()) {
// This table is named like a CDC log table but it's not one.
continue;
}
times_and_ttls.push_back(time_and_ttl{as_timepoint(s.id()), cdc_opts.ttl()});
}
if (times_and_ttls.empty()) {
// There's no point in rewriting old generations' streams (they don't contain any data).
cdc_log.info("No CDC log tables present, not rewriting stream tables.");
co_return co_await db::system_keyspace::cdc_set_rewritten(std::nullopt);
}
// It's safe to discard this future: the coroutine keeps system_distributed_keyspace alive
// and the abort source's lifetime extends the lifetime of any other service.
(void)(([_times_and_ttls = std::move(times_and_ttls), _sys_dist_ks = std::move(sys_dist_ks),
_get_num_token_owners = std::move(get_num_token_owners), &_abort_src = abort_src] () mutable -> future<> {
auto times_and_ttls = std::move(_times_and_ttls);
auto sys_dist_ks = std::move(_sys_dist_ks);
auto get_num_token_owners = std::move(_get_num_token_owners);
auto& abort_src = _abort_src;
// This code is racing with node startup. At this point, we're most likely still waiting for gossip to settle
// and some nodes that are UP may still be marked as DOWN by us.
// Let's sleep a bit to increase the chance that the first attempt at rewriting succeeds (it's still ok if
// it doesn't - we'll retry - but it's nice if we succeed without any warnings).
co_await sleep_abortable(std::chrono::seconds(10), abort_src);
cdc_log.info("Rewriting stream tables in the background...");
auto last_rewritten = co_await rewrite_streams_descriptions(
std::move(times_and_ttls),
std::move(sys_dist_ks),
std::move(get_num_token_owners),
abort_src);
co_await db::system_keyspace::cdc_set_rewritten(last_rewritten);
})());
}
static void assert_shard_zero(const sstring& where) {
if (this_shard_id() != 0) {
on_internal_error(cdc_log, format("`{}`: must be run on shard 0", where));
}
}
class and_reducer {
private:
bool _result = true;
public:
future<> operator()(bool value) {
_result = value && _result;
return make_ready_future<>();
}
bool get() {
return _result;
}
};
class or_reducer {
private:
bool _result = false;
public:
future<> operator()(bool value) {
_result = value || _result;
return make_ready_future<>();
}
bool get() {
return _result;
}
};
class generation_handling_nonfatal_exception : public std::runtime_error {
using std::runtime_error::runtime_error;
};
constexpr char could_not_retrieve_msg_template[]
= "Could not retrieve CDC streams with timestamp {} upon gossip event. Reason: \"{}\". Action: {}.";
generation_service::generation_service(
const db::config& cfg, gms::gossiper& g, sharded<db::system_distributed_keyspace>& sys_dist_ks,
abort_source& abort_src, const locator::shared_token_metadata& stm)
: _cfg(cfg), _gossiper(g), _sys_dist_ks(sys_dist_ks), _abort_src(abort_src), _token_metadata(stm) {
}
future<> generation_service::stop() {
if (this_shard_id() == 0) {
co_await _gossiper.unregister_(shared_from_this());
}
_stopped = true;
}
generation_service::~generation_service() {
assert(_stopped);
}
future<> generation_service::after_join(std::optional<db_clock::time_point>&& startup_gen_ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
assert(db::system_keyspace::bootstrap_complete());
_gen_ts = std::move(startup_gen_ts);
_gossiper.register_(shared_from_this());
_joined = true;
// Retrieve the latest CDC generation seen in gossip (if any).
co_await scan_cdc_generations();
}
void generation_service::on_join(gms::inet_address ep, gms::endpoint_state ep_state) {
assert_shard_zero(__PRETTY_FUNCTION__);
auto val = ep_state.get_application_state_ptr(gms::application_state::CDC_STREAMS_TIMESTAMP);
if (!val) {
return;
}
on_change(ep, gms::application_state::CDC_STREAMS_TIMESTAMP, *val);
}
void generation_service::on_change(gms::inet_address ep, gms::application_state app_state, const gms::versioned_value& v) {
assert_shard_zero(__PRETTY_FUNCTION__);
if (app_state != gms::application_state::CDC_STREAMS_TIMESTAMP) {
return;
}
auto ts = gms::versioned_value::cdc_streams_timestamp_from_string(v.value);
cdc_log.debug("Endpoint: {}, CDC generation timestamp change: {}", ep, ts);
handle_cdc_generation(ts).get();
}
future<> generation_service::check_and_repair_cdc_streams() {
if (!_joined) {
throw std::runtime_error("check_and_repair_cdc_streams: node not initialized yet");
}
auto latest = _gen_ts;
const auto& endpoint_states = _gossiper.get_endpoint_states();
for (const auto& [addr, state] : endpoint_states) {
if (!_gossiper.is_normal(addr)) {
throw std::runtime_error(format("All nodes must be in NORMAL state while performing check_and_repair_cdc_streams"
" ({} is in state {})", addr, _gossiper.get_gossip_status(state)));
}
const auto ts = get_streams_timestamp_for(addr, _gossiper);
if (!latest || (ts && *ts > *latest)) {
latest = ts;
}
}
bool should_regenerate = false;
std::optional<topology_description> gen;
static const auto timeout_msg = "Timeout while fetching CDC topology description";
static const auto topology_read_error_note = "Note: this is likely caused by"
" node(s) being down or unreachable. It is recommended to check the network and"
" restart/remove the failed node(s), then retry checkAndRepairCdcStreams command";
static const auto exception_translating_msg = "Translating the exception to `request_execution_exception`";
const auto tmptr = _token_metadata.get();
auto sys_dist_ks = get_sys_dist_ks();
try {
gen = co_await sys_dist_ks->read_cdc_topology_description(
*latest, { tmptr->count_normal_token_owners() });
} catch (exceptions::request_timeout_exception& e) {
cdc_log.error("{}: \"{}\". {}.", timeout_msg, e.what(), exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::READ_TIMEOUT,
format("{}. {}.", timeout_msg, topology_read_error_note));
} catch (exceptions::unavailable_exception& e) {
static const auto unavailable_msg = "Node(s) unavailable while fetching CDC topology description";
cdc_log.error("{}: \"{}\". {}.", unavailable_msg, e.what(), exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::UNAVAILABLE,
format("{}. {}.", unavailable_msg, topology_read_error_note));
} catch (...) {
const auto ep = std::current_exception();
if (is_timeout_exception(ep)) {
cdc_log.error("{}: \"{}\". {}.", timeout_msg, ep, exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::READ_TIMEOUT,
format("{}. {}.", timeout_msg, topology_read_error_note));
}
// On exotic errors proceed with regeneration
cdc_log.error("Exception while reading CDC topology description: \"{}\". Regenerating streams anyway.", ep);
should_regenerate = true;
}
if (!gen) {
cdc_log.error(
"Could not find CDC generation with timestamp {} in distributed system tables (current time: {}),"
" even though some node gossiped about it.",
latest, db_clock::now());
should_regenerate = true;
} else {
std::unordered_set<dht::token> gen_ends;
for (const auto& entry : gen->entries()) {
gen_ends.insert(entry.token_range_end);
}
for (const auto& metadata_token : tmptr->sorted_tokens()) {
if (!gen_ends.contains(metadata_token)) {
cdc_log.warn("CDC generation {} missing token {}. Regenerating.", latest, metadata_token);
should_regenerate = true;
break;
}
}
}
if (!should_regenerate) {
if (latest != _gen_ts) {
co_await do_handle_cdc_generation(*latest);
}
cdc_log.info("CDC generation {} does not need repair", latest);
co_return;
}
const auto new_gen_ts = co_await make_new_cdc_generation(_cfg,
{}, std::move(tmptr), _gossiper, *sys_dist_ks,
std::chrono::milliseconds(_cfg.ring_delay_ms()), true /* add delay */);
// Need to artificially update our STATUS so other nodes handle the timestamp change
auto status = _gossiper.get_application_state_ptr(
utils::fb_utilities::get_broadcast_address(), gms::application_state::STATUS);
if (!status) {
cdc_log.error("Our STATUS is missing");
cdc_log.error("Aborting CDC generation repair due to missing STATUS");
co_return;
}
// Update _gen_ts first, so that do_handle_cdc_generation (which will get called due to the status update)
// won't try to update the gossiper, which would result in a deadlock inside add_local_application_state
_gen_ts = new_gen_ts;
co_await _gossiper.add_local_application_state({
{ gms::application_state::CDC_STREAMS_TIMESTAMP, gms::versioned_value::cdc_streams_timestamp(new_gen_ts) },
{ gms::application_state::STATUS, *status }
});
co_await db::system_keyspace::update_cdc_streams_timestamp(new_gen_ts);
}
future<> generation_service::handle_cdc_generation(std::optional<db_clock::time_point> ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
if (!ts) {
co_return;
}
if (!db::system_keyspace::bootstrap_complete() || !_sys_dist_ks.local_is_initialized()
|| !_sys_dist_ks.local().started()) {
// The service should not be listening for generation changes until after the node
// is bootstrapped. Therefore we would previously assume that this condition
// can never become true and call on_internal_error here, but it turns out that
// it may become true on decommission: the node enters NEEDS_BOOTSTRAP
// state before leaving the token ring, so bootstrap_complete() becomes false.
// In that case we can simply return.
co_return;
}
if (co_await container().map_reduce(and_reducer(), [ts = *ts] (generation_service& svc) {
return !svc._cdc_metadata.prepare(ts);
})) {
co_return;
}
bool using_this_gen = false;
try {
using_this_gen = co_await do_handle_cdc_generation_intercept_nonfatal_errors(*ts);
} catch (generation_handling_nonfatal_exception& e) {
cdc_log.warn(could_not_retrieve_msg_template, ts, e.what(), "retrying in the background");
async_handle_cdc_generation(*ts);
co_return;
} catch (...) {
cdc_log.error(could_not_retrieve_msg_template, ts, std::current_exception(), "not retrying");
co_return; // Exotic ("fatal") exception => do not retry
}
if (using_this_gen) {
cdc_log.info("Starting to use generation {}", *ts);
co_await update_streams_description(*ts, get_sys_dist_ks(),
[tmptr = _token_metadata.get()] { return tmptr->count_normal_token_owners(); },
_abort_src);
}
}
void generation_service::async_handle_cdc_generation(db_clock::time_point ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
(void)(([] (db_clock::time_point ts, shared_ptr<generation_service> svc) -> future<> {
while (true) {
co_await sleep_abortable(std::chrono::seconds(5), svc->_abort_src);
try {
bool using_this_gen = co_await svc->do_handle_cdc_generation_intercept_nonfatal_errors(ts);
if (using_this_gen) {
cdc_log.info("Starting to use generation {}", ts);
co_await update_streams_description(ts, svc->get_sys_dist_ks(),
[tmptr = svc->_token_metadata.get()] { return tmptr->count_normal_token_owners(); },
svc->_abort_src);
}
co_return;
} catch (generation_handling_nonfatal_exception& e) {
cdc_log.warn(could_not_retrieve_msg_template, ts, e.what(), "continuing to retry in the background");
} catch (...) {
cdc_log.error(could_not_retrieve_msg_template, ts, std::current_exception(), "not retrying anymore");
co_return; // Exotic ("fatal") exception => do not retry
}
if (co_await svc->container().map_reduce(and_reducer(), [ts] (generation_service& svc) {
return svc._cdc_metadata.known_or_obsolete(ts);
})) {
co_return;
}
}
})(ts, shared_from_this()));
}
future<> generation_service::scan_cdc_generations() {
assert_shard_zero(__PRETTY_FUNCTION__);
std::optional<db_clock::time_point> latest;
for (const auto& ep: _gossiper.get_endpoint_states()) {
auto ts = get_streams_timestamp_for(ep.first, _gossiper);
if (!latest || (ts && *ts > *latest)) {
latest = ts;
}
}
if (latest) {
cdc_log.info("Latest generation seen during startup: {}", *latest);
co_await handle_cdc_generation(latest);
} else {
cdc_log.info("No generation seen during startup.");
}
}
future<bool> generation_service::do_handle_cdc_generation_intercept_nonfatal_errors(db_clock::time_point ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
try {
co_return co_await do_handle_cdc_generation(ts);
} catch (exceptions::request_timeout_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (exceptions::unavailable_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (exceptions::read_failure_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (...) {
const auto ep = std::current_exception();
if (is_timeout_exception(ep)) {
throw generation_handling_nonfatal_exception(format("{}", ep));
}
throw;
}
}
future<bool> generation_service::do_handle_cdc_generation(db_clock::time_point ts) {
assert_shard_zero(__PRETTY_FUNCTION__);
auto sys_dist_ks = get_sys_dist_ks();
auto gen = co_await sys_dist_ks->read_cdc_topology_description(
ts, { _token_metadata.get()->count_normal_token_owners() });
if (!gen) {
throw std::runtime_error(format(
"Could not find CDC generation with timestamp {} in distributed system tables (current time: {}),"
" even though some node gossiped about it.",
ts, db_clock::now()));
}
// If we're not gossiping our own generation timestamp (because we've upgraded from a non-CDC/old version,
// or we somehow lost it due to a byzantine failure), start gossiping someone else's timestamp.
// This is to avoid the upgrade check on every restart (see `should_propose_first_cdc_generation`).
// And if we notice that `ts` is higher than our timestamp, we will start gossiping it instead,
// so if the node that initially gossiped `ts` leaves the cluster while `ts` is still the latest generation,
// the cluster will remember.
if (!_gen_ts || *_gen_ts < ts) {
_gen_ts = ts;
co_await db::system_keyspace::update_cdc_streams_timestamp(ts);
co_await _gossiper.add_local_application_state(
gms::application_state::CDC_STREAMS_TIMESTAMP, gms::versioned_value::cdc_streams_timestamp(ts));
}
// Return `true` iff the generation was inserted on any of our shards.
co_return co_await container().map_reduce(or_reducer(), [ts, &gen] (generation_service& svc) {
auto gen_ = *gen;
return svc._cdc_metadata.insert(ts, std::move(gen_));
});
}
shared_ptr<db::system_distributed_keyspace> generation_service::get_sys_dist_ks() {
assert_shard_zero(__PRETTY_FUNCTION__);
if (!_sys_dist_ks.local_is_initialized()) {
throw std::runtime_error("system distributed keyspace not initialized");
}
return _sys_dist_ks.local_shared();
}
} // namespace cdc
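
The split between retryable and fatal failures used by `do_handle_cdc_generation_intercept_nonfatal_errors` and its caller can be illustrated outside Seastar. The following is a minimal synchronous sketch, not the Scylla implementation: `timeout_error`, `unavailable_error`, `nonfatal_error`, `intercept_nonfatal`, `attempt` and `outcome` are hypothetical names, and the coroutine machinery and the `is_timeout_exception` check are left out.

```cpp
#include <exception>
#include <stdexcept>

// Hypothetical stand-ins for the transient errors the wrapper recognizes.
struct timeout_error : std::runtime_error { using std::runtime_error::runtime_error; };
struct unavailable_error : std::runtime_error { using std::runtime_error::runtime_error; };

// Single exception type that tells the caller "retrying may help".
struct nonfatal_error : std::runtime_error { using std::runtime_error::runtime_error; };

// Runs `op`. Transient errors are rethrown as nonfatal_error;
// every other exception propagates unchanged.
template <typename Op>
auto intercept_nonfatal(Op&& op) {
    try {
        return op();
    } catch (const timeout_error& e) {
        throw nonfatal_error(e.what());
    } catch (const unavailable_error& e) {
        throw nonfatal_error(e.what());
    }
}

enum class outcome { done, retry, give_up };

// Mirrors the caller's decision: keep retrying only after a nonfatal error.
template <typename Op>
outcome attempt(Op&& op) {
    try {
        intercept_nonfatal(op);
        return outcome::done;
    } catch (const nonfatal_error&) {
        return outcome::retry;
    } catch (...) {
        return outcome::give_up;
    }
}
```

Funnelling all transient errors into one type keeps the retry loop down to two `catch` clauses, which is the shape visible in the background task above.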

View File

@@ -40,6 +40,8 @@
#include "database_fwd.hh"
#include "db_clock.hh"
#include "dht/token.hh"
#include "locator/token_metadata.hh"
#include "utils/chunked_vector.hh"
namespace seastar {
class abort_source;
@@ -55,26 +57,26 @@ namespace gms {
class gossiper;
} // namespace gms
namespace locator {
class token_metadata;
} // namespace locator
namespace cdc {
class stream_id final {
bytes _value;
public:
static constexpr uint8_t version_1 = 1;
stream_id() = default;
stream_id(int64_t, int64_t);
stream_id(bytes);
stream_id(dht::token, size_t);
bool is_set() const;
bool operator==(const stream_id&) const;
bool operator!=(const stream_id&) const;
bool operator<(const stream_id&) const;
int64_t first() const;
int64_t second() const;
uint8_t version() const;
size_t index() const;
const bytes& to_bytes() const;
dht::token token() const;
partition_key to_partition_key(const schema& log_schema) const;
static int64_t token_from_bytes(bytes_view);
@@ -110,7 +112,30 @@ public:
topology_description(std::vector<token_range_description> entries);
bool operator==(const topology_description&) const;
const std::vector<token_range_description>& entries() const;
const std::vector<token_range_description>& entries() const&;
std::vector<token_range_description>&& entries() &&;
};
/**
* The set of streams for a single topology version/generation,
* i.e. the stream ids at a given time.
*/
class streams_version {
public:
utils::chunked_vector<stream_id> streams;
db_clock::time_point timestamp;
streams_version(utils::chunked_vector<stream_id> s, db_clock::time_point ts)
: streams(std::move(s))
, timestamp(ts)
{}
};
class no_generation_data_exception : public std::runtime_error {
public:
no_generation_data_exception(db_clock::time_point generation_ts)
: std::runtime_error(format("could not find generation data for timestamp {}", generation_ts))
{}
};
/* Should be called when we're restarting and we noticed that we didn't save any streams timestamp in our local tables,
@@ -130,8 +155,8 @@ bool should_propose_first_generation(const gms::inet_address& me, const gms::gos
*/
future<db_clock::time_point> get_local_streams_timestamp();
/* Generate a new set of CDC streams and insert it into the distributed cdc_generations table.
* Returns the timestamp of this new generation.
/* Generate a new set of CDC streams and insert it into the distributed cdc_generation_descriptions table.
* Returns the timestamp of this new generation
*
* Should be called when starting the node for the first time (i.e., joining the ring).
*
@@ -142,14 +167,14 @@ future<db_clock::time_point> get_local_streams_timestamp();
* (not guaranteed in the current implementation, but expected to be the common case;
* we assume that `ring_delay` is enough for other nodes to learn about the new generation).
*/
db_clock::time_point make_new_cdc_generation(
future<db_clock::time_point> make_new_cdc_generation(
const db::config& cfg,
const std::unordered_set<dht::token>& bootstrap_tokens,
const locator::token_metadata& tm,
const locator::token_metadata_ptr tmptr,
const gms::gossiper& g,
db::system_distributed_keyspace& sys_dist_ks,
std::chrono::milliseconds ring_delay,
bool for_testing);
bool add_delay);
/* Retrieves CDC streams generation timestamp from the given endpoint's application state (broadcasted through gossip).
* We might be during a rolling upgrade, so the timestamp might not be there (if the other node didn't upgrade yet),
@@ -161,17 +186,26 @@ std::optional<db_clock::time_point> get_streams_timestamp_for(const gms::inet_ad
/* Inform CDC users about a generation of streams (identified by the given timestamp)
* by inserting it into the cdc_streams table.
*
* Assumes that the cdc_generations table contains this generation.
* Assumes that the cdc_generation_descriptions table contains this generation.
*
* Returning from this function does not mean that the table update was successful: the function
* might run an asynchronous task in the background.
*
* Run inside seastar::async context.
*/
void update_streams_description(
future<> update_streams_description(
db_clock::time_point,
shared_ptr<db::system_distributed_keyspace>,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source&);
/* Part of the upgrade procedure. Useful in case where the version of Scylla that we're upgrading from
* used the "cdc_streams_descriptions" table. This procedure ensures that the new "cdc_streams_descriptions_v2"
* table contains streams of all generations that were present in the old table and may still contain data
* (i.e. there exist CDC log tables that may contain rows with partition keys being the stream IDs from
* these generations). */
future<> maybe_rewrite_streams_descriptions(
const database&,
shared_ptr<db::system_distributed_keyspace>,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source&);
} // namespace cdc
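
The hunk above replaces the single `entries() const` accessor of `topology_description` with a `const&` / `&&` pair. A minimal sketch of what such ref-qualified overloads allow, using a hypothetical `description` class and `take_entries` helper with `std::string` entries rather than the real `token_range_description`:

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature of an object that owns a vector of entries.
class description {
    std::vector<std::string> _entries;
public:
    explicit description(std::vector<std::string> entries) : _entries(std::move(entries)) {}

    // Selected for lvalues: hands out a read-only view, no copy.
    const std::vector<std::string>& entries() const& { return _entries; }

    // Selected for rvalues (temporaries, std::move(x)): lets the caller steal the storage.
    std::vector<std::string>&& entries() && { return std::move(_entries); }
};

// Consumes a description that is no longer needed and takes ownership of its entries.
std::vector<std::string> take_entries(description d) {
    return std::move(d).entries();
}
```

With only the `const` overload, `take_entries` would have to copy the vector; the `&&` overload turns that into a move.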

138
cdc/generation_service.hh Normal file
View File

@@ -0,0 +1,138 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* Modified by ScyllaDB
* Copyright (C) 2021 ScyllaDB
*
*/
#pragma once
#include "cdc/metadata.hh"
#include "gms/i_endpoint_state_change_subscriber.hh"
namespace db {
class system_distributed_keyspace;
}
namespace gms {
class gossiper;
}
namespace cdc {
class generation_service : public peering_sharded_service<generation_service>
, public async_sharded_service<generation_service>
, public gms::i_endpoint_state_change_subscriber {
bool _stopped = false;
// The node has joined the token ring. Set to `true` on `after_join` call.
bool _joined = false;
const db::config& _cfg;
gms::gossiper& _gossiper;
sharded<db::system_distributed_keyspace>& _sys_dist_ks;
abort_source& _abort_src;
const locator::shared_token_metadata& _token_metadata;
/* Maintains the set of known CDC generations used to pick streams for log writes (i.e., the partition keys of these log writes).
* Updated in response to certain gossip events (see the handle_cdc_generation function).
*/
cdc::metadata _cdc_metadata;
/* The latest known generation timestamp and the timestamp that we're currently gossiping
* (as CDC_STREAMS_TIMESTAMP application state).
*
* Only shard 0 manages this, hence it will be std::nullopt on all shards other than 0.
* This timestamp is also persisted in the system.cdc_local table.
*
* On shard 0 this may be nullopt only in one special case: rolling upgrade, when we upgrade
* from an old version of Scylla that didn't support CDC. In that case one node in the cluster
* will create the first generation and start gossiping it; it may be us, or it may be some
* different node. In any case, eventually - after one of the nodes gossips the first timestamp
* - we'll catch on and this variable will be updated with that generation.
*/
std::optional<db_clock::time_point> _gen_ts;
public:
generation_service(const db::config&, gms::gossiper&,
sharded<db::system_distributed_keyspace>&, abort_source&, const locator::shared_token_metadata&);
future<> stop();
~generation_service();
/* After the node bootstraps and creates a new CDC generation, or restarts and loads the last
* known generation timestamp from persistent storage, this function should be called with
* that generation timestamp moved in as the `startup_gen_ts` parameter.
* This passes the responsibility of managing generations from the node startup code to this service;
* until then, the service remains dormant.
* At the time of writing this comment, the startup code is in `storage_service::join_token_ring`, hence
* `after_join` should be called at the end of that function.
* Precondition: the node has completed bootstrapping and system_distributed_keyspace is initialized.
* Must be called on shard 0 - that's where the generation management happens.
*/
future<> after_join(std::optional<db_clock::time_point>&& startup_gen_ts);
cdc::metadata& get_cdc_metadata() {
return _cdc_metadata;
}
virtual void before_change(gms::inet_address, gms::endpoint_state, gms::application_state, const gms::versioned_value&) override {}
virtual void on_alive(gms::inet_address, gms::endpoint_state) override {}
virtual void on_dead(gms::inet_address, gms::endpoint_state) override {}
virtual void on_remove(gms::inet_address) override {}
virtual void on_restart(gms::inet_address, gms::endpoint_state) override {}
virtual void on_join(gms::inet_address, gms::endpoint_state) override;
virtual void on_change(gms::inet_address, gms::application_state, const gms::versioned_value&) override;
future<> check_and_repair_cdc_streams();
private:
/* Retrieve the CDC generation which starts at the given timestamp (from a distributed table created for this purpose)
* and start using it for CDC log writes if it's not obsolete.
*/
future<> handle_cdc_generation(std::optional<db_clock::time_point>);
/* If `handle_cdc_generation` fails, it schedules an asynchronous retry in the background
* using `async_handle_cdc_generation`.
*/
void async_handle_cdc_generation(db_clock::time_point);
/* Wrapper around `do_handle_cdc_generation` which intercepts timeout/unavailability exceptions.
* Returns: do_handle_cdc_generation(ts). */
future<bool> do_handle_cdc_generation_intercept_nonfatal_errors(db_clock::time_point);
/* Returns `true` iff we started using the generation (it was not obsolete or already known),
* which means that this node might write some CDC log entries using streams from this generation. */
future<bool> do_handle_cdc_generation(db_clock::time_point);
/* Scan CDC generation timestamps gossiped by other nodes and retrieve the latest one.
* This function should be called once at the end of the node startup procedure
* (after the node is started and running normally, it will retrieve generations on gossip events instead).
*/
future<> scan_cdc_generations();
/* generation_service code might be racing with system_distributed_keyspace deinitialization
* (the deinitialization order is broken).
* Therefore, whenever we want to access sys_dist_ks in a background task,
* we need to check if the instance is still there. Storing the shared pointer will keep it alive.
*/
shared_ptr<db::system_distributed_keyspace> get_sys_dist_ks();
};
} // namespace cdc
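
Two small rules from this service can be sketched in isolation: picking the latest gossiped timestamp at startup (`scan_cdc_generations`), and adopting a timestamp only when none is held yet or the new one is newer (the `_gen_ts` comment). The sketch below is an assumption-laden simplification: `timestamp` is a plain integer standing in for `db_clock::time_point`, and `latest_seen` and `should_adopt` are hypothetical helper names.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

using timestamp = std::int64_t; // hypothetical stand-in for db_clock::time_point

// Returns the newest timestamp among those that are present,
// or nullopt if no endpoint gossiped one.
std::optional<timestamp> latest_seen(const std::vector<std::optional<timestamp>>& gossiped) {
    std::optional<timestamp> latest;
    for (const auto& ts : gossiped) {
        if (!ts) {
            continue; // endpoint has not published a timestamp (e.g. not upgraded yet)
        }
        if (!latest || *ts > *latest) {
            latest = ts;
        }
    }
    return latest;
}

// Adopt `seen` only if we hold no timestamp yet or `seen` is strictly newer.
bool should_adopt(const std::optional<timestamp>& current, timestamp seen) {
    return !current || *current < seen;
}
```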

1585
cdc/log.cc

File diff suppressed because it is too large

View File

@@ -41,7 +41,6 @@
#include "exceptions/exceptions.hh"
#include "timestamp.hh"
#include "tracing/trace_state.hh"
#include "cdc_options.hh"
#include "utils/UUID.hh"
class schema;
@@ -63,6 +62,7 @@ class query_state;
class mutation;
class partition_key;
class database;
namespace cdc {
@@ -80,7 +80,7 @@ class cdc_service final : public async_sharded_service<cdc::cdc_service> {
std::unique_ptr<impl> _impl;
public:
future<> stop();
cdc_service(service::storage_proxy&);
cdc_service(service::storage_proxy&, cdc::metadata&);
cdc_service(db_context);
~cdc_service();
@@ -100,20 +100,16 @@ public:
struct db_context final {
service::storage_proxy& _proxy;
service::migration_notifier& _migration_notifier;
locator::token_metadata& _token_metadata;
cdc::metadata& _cdc_metadata;
class builder final {
service::storage_proxy& _proxy;
cdc::metadata& _cdc_metadata;
std::optional<std::reference_wrapper<service::migration_notifier>> _migration_notifier;
std::optional<std::reference_wrapper<locator::token_metadata>> _token_metadata;
std::optional<std::reference_wrapper<cdc::metadata>> _cdc_metadata;
public:
builder(service::storage_proxy& proxy);
builder(service::storage_proxy& proxy, cdc::metadata&);
builder& with_migration_notifier(service::migration_notifier& migration_notifier);
builder& with_token_metadata(locator::token_metadata& token_metadata);
builder& with_cdc_metadata(cdc::metadata&);
db_context build();
};
@@ -129,7 +125,12 @@ enum class operation : int8_t {
};
bool is_log_for_some_table(const sstring& ks_name, const std::string_view& table_name);
seastar::sstring log_name(const seastar::sstring& table_name);
schema_ptr get_base_table(const database&, const schema&);
schema_ptr get_base_table(const database&, sstring_view, std::string_view);
seastar::sstring base_name(std::string_view log_name);
seastar::sstring log_name(std::string_view table_name);
seastar::sstring log_data_column_name(std::string_view column_name);
seastar::sstring log_meta_column_name(std::string_view column_name);
bytes log_data_column_name_bytes(const bytes& column_name);
@@ -141,6 +142,8 @@ bytes log_data_column_deleted_name_bytes(const bytes& column_name);
seastar::sstring log_data_column_deleted_elements_name(std::string_view column_name);
bytes log_data_column_deleted_elements_name_bytes(const bytes& column_name);
bool is_cdc_metacolumn_name(const sstring& name);
utils::UUID generate_timeuuid(api::timestamp_type t);
} // namespace cdc

View File

@@ -51,7 +51,8 @@ static cdc::stream_id get_stream(
return entry.streams[shard_id];
}
static cdc::stream_id get_stream(
// non-static for testing
cdc::stream_id get_stream(
const std::vector<cdc::token_range_description>& entries,
dht::token tok) {
if (entries.empty()) {
@@ -77,6 +78,12 @@ cdc::metadata::container_t::const_iterator cdc::metadata::gen_used_at(api::times
return std::prev(it);
}
bool cdc::metadata::streams_available() const {
auto now = api::new_timestamp();
auto it = gen_used_at(now);
return it != _gens.end();
}
cdc::stream_id cdc::metadata::get_stream(api::timestamp_type ts, dht::token tok) {
auto now = api::new_timestamp();
if (ts > now + generation_leeway.count()) {
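
Only the tail of `gen_used_at` (`return std::prev(it);`) is visible in this hunk. Assuming the generations are kept in an ordered map keyed by start timestamp, the lookup of "the latest generation whose timestamp is not greater than `ts`" can be sketched as follows. The `generations` alias, the string payload and the integer `timestamp` are placeholders, not the real `cdc::metadata` types.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

using timestamp = std::int64_t;                        // hypothetical stand-in for api::timestamp_type
using generations = std::map<timestamp, std::string>;  // start time -> placeholder payload

// Returns the generation in effect at `ts`: the latest entry whose start is <= ts,
// or end() if every known generation starts after `ts` (or the map is empty).
generations::const_iterator gen_used_at(const generations& gens, timestamp ts) {
    auto it = gens.upper_bound(ts); // first generation that starts strictly after ts
    if (it == gens.begin()) {
        return gens.end();
    }
    return std::prev(it);
}

// Streams are available at `now` iff some generation is already in effect.
bool streams_available(const generations& gens, timestamp now) {
    return gen_used_at(gens, now) != gens.end();
}
```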

View File

@@ -57,6 +57,10 @@ public:
/* Is a generation with the given timestamp already known or superseded by a newer generation? */
bool known_or_obsolete(db_clock::time_point) const;
/* Are there streams available, i.e. valid for time == now? If this is false, any writes to
* CDC logs will fail fast.
*/
bool streams_available() const;
/* Return the stream for the base partition whose token is `tok` to which a corresponding log write should go
* according to the generation used at time `ts` (i.e., the latest generation whose timestamp is less than or equal to `ts`).
*

View File

@@ -22,8 +22,14 @@
#include "mutation.hh"
#include "schema.hh"
#include "concrete_types.hh"
#include "types/user.hh"
#include "split.hh"
#include "log.hh"
#include "change_visitor.hh"
#include <type_traits>
struct atomic_column_update {
column_id id;
@@ -70,6 +76,37 @@ struct partition_deletion {
tombstone t;
};
using clustered_column_set = std::map<clustering_key, cdc::one_kind_column_set, clustering_key::less_compare>;
template<typename Container>
concept EntryContainer = requires(Container& container) {
// Parenthesized due to https://bugs.llvm.org/show_bug.cgi?id=45088
{ (container.atomic_entries) } -> std::same_as<std::vector<atomic_column_update>&>;
{ (container.nonatomic_entries) } -> std::same_as<std::vector<nonatomic_column_update>&>;
};
template<EntryContainer Container>
static void add_columns_affected_by_entries(cdc::one_kind_column_set& cset, const Container& cont) {
for (const auto& entry : cont.atomic_entries) {
cset.set(entry.id);
}
for (const auto& entry : cont.nonatomic_entries) {
cset.set(entry.id);
}
}
/* Given a mutation with multiple timestamps/ttl/types of changes, we split it into multiple mutations
* before passing it into `process_change` (see comment above `should_split_visitor` for more details).
*
* The first step of the splitting is to walk over the mutation and put each change into an appropriate bucket
* (see `batch`). The buckets are sorted by timestamps (see `set_of_changes`), and within each bucket,
* the changes are split according to their types (`static_updates`, `clustered_inserts`, and so on).
* Within each type, the changes are sorted w.r.t. TTLs. Changes without a TTL are treated as if they had TTL = 0.
*
* The function that puts changes into buckets is called `extract_changes`. Underneath, it uses
* `extract_changes_visitor`, `extract_collection_visitor` and `extract_row_visitor`.
*/
struct batch {
std::vector<static_row_update> static_updates;
std::vector<clustered_row_insert> clustered_inserts;
@@ -77,6 +114,40 @@ struct batch {
std::vector<clustered_row_deletion> clustered_row_deletions;
std::vector<clustered_range_deletion> clustered_range_deletions;
std::optional<partition_deletion> partition_deletions;
clustered_column_set get_affected_clustered_columns_per_row(const schema& s) const {
clustered_column_set ret{clustering_key::less_compare(s)};
if (!clustered_row_deletions.empty()) {
// When deleting a row, all columns are affected
cdc::one_kind_column_set all_columns{s.regular_columns_count()};
all_columns.set(0, s.regular_columns_count(), true);
for (const auto& change : clustered_row_deletions) {
ret.insert(std::make_pair(change.key, all_columns));
}
}
auto process_change_type = [&] (const auto& changes) {
for (const auto& change : changes) {
auto& cset = ret[change.key];
cset.resize(s.regular_columns_count());
add_columns_affected_by_entries(cset, change);
}
};
process_change_type(clustered_inserts);
process_change_type(clustered_updates);
return ret;
}
cdc::one_kind_column_set get_affected_static_columns(const schema& s) const {
cdc::one_kind_column_set ret{s.static_columns_count()};
for (const auto& change : static_updates) {
add_columns_affected_by_entries(ret, change);
}
return ret;
}
};
using set_of_changes = std::map<api::timestamp_type, batch>;
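
The bucketing described in the comment above `struct batch` (timestamp first, TTL second, no TTL treated as 0) can be sketched with a map keyed by a pair. This is a simplified illustration with hypothetical names (`cell_change`, `change_key`, `bucket_by_ts_and_ttl`) and integer stand-ins for `api::timestamp_type` and `gc_clock::duration`; it ignores the per-type split into `static_updates`, `clustered_inserts` and so on.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using timestamp = std::int64_t; // hypothetical stand-in for api::timestamp_type
using ttl = std::int64_t;       // seconds; 0 means "no TTL"

// One changed cell: which column, written when, expiring after how long.
struct cell_change {
    std::string column;
    timestamp ts;
    ttl expires_in;
};

using change_key = std::pair<timestamp, ttl>;

// Groups changed columns into buckets ordered by timestamp first and TTL second.
// std::pair compares lexicographically, so std::map provides exactly that order.
std::map<change_key, std::vector<std::string>> bucket_by_ts_and_ttl(const std::vector<cell_change>& cells) {
    std::map<change_key, std::vector<std::string>> buckets;
    for (const auto& c : cells) {
        buckets[{c.ts, c.expires_in}].push_back(c.column);
    }
    return buckets;
}
```

Iterating the resulting map visits the buckets in increasing timestamp order and, within one timestamp, in increasing TTL order.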
@@ -86,100 +157,179 @@ struct row_update {
std::vector<nonatomic_column_update> nonatomic_entries;
};
static
std::map<std::pair<api::timestamp_type, gc_clock::duration>, row_update>
extract_row_updates(const row& r, column_kind ckind, const schema& schema) {
std::map<std::pair<api::timestamp_type, gc_clock::duration>, row_update> result;
r.for_each_cell([&] (column_id id, const atomic_cell_or_collection& cell) {
auto& cdef = schema.column_at(ckind, id);
if (cdef.is_atomic()) {
auto view = cell.as_atomic_cell(cdef);
auto timestamp_and_ttl = std::pair(
view.timestamp(),
view.is_live_and_has_ttl() ? view.ttl() : gc_clock::duration(0)
);
result[timestamp_and_ttl].atomic_entries.push_back({id, atomic_cell(*cdef.type, view)});
return;
static gc_clock::duration get_ttl(const atomic_cell_view& acv) {
return acv.is_live_and_has_ttl() ? acv.ttl() : gc_clock::duration(0);
}
static gc_clock::duration get_ttl(const row_marker& rm) {
return rm.is_expiring() ? rm.ttl() : gc_clock::duration(0);
}
using change_key_t = std::pair<api::timestamp_type, gc_clock::duration>;
/* Visits the cells and tombstone of a collection, putting the encountered changes into buckets
* sorted by timestamp first and ttl second (see `_updates`).
*/
template <typename V>
struct extract_collection_visitor {
private:
const column_id _id;
std::map<change_key_t, row_update>& _updates;
nonatomic_column_update& get_or_append_entry(api::timestamp_type ts, gc_clock::duration ttl) {
auto& updates = this->_updates[std::pair(ts, ttl)].nonatomic_entries;
if (updates.empty() || updates.back().id != _id) {
updates.push_back({_id});
}
cell.as_collection_mutation().with_deserialized(*cdef.type, [&] (collection_mutation_view_description mview) {
auto desc = mview.materialize(*cdef.type);
for (auto& [k, v]: desc.cells) {
auto timestamp_and_ttl = std::pair(
v.timestamp(),
v.is_live_and_has_ttl() ? v.ttl() : gc_clock::duration(0)
);
auto& updates = result[timestamp_and_ttl].nonatomic_entries;
if (updates.empty() || updates.back().id != id) {
updates.push_back({id, {}});
}
updates.back().cells.push_back({std::move(k), std::move(v)});
}
if (desc.tomb) {
auto timestamp_and_ttl = std::pair(desc.tomb.timestamp + 1, gc_clock::duration(0));
auto& updates = result[timestamp_and_ttl].nonatomic_entries;
if (updates.empty() || updates.back().id != id) {
updates.push_back({id, {}});
}
updates.back().t = std::move(desc.tomb);
}
});
});
return result;
};
set_of_changes extract_changes(const mutation& base_mutation, const schema& base_schema) {
set_of_changes res;
auto& p = base_mutation.partition();
auto sr_updates = extract_row_updates(p.static_row().get(), column_kind::static_column, base_schema);
for (auto& [k, up]: sr_updates) {
auto [timestamp, ttl] = k;
res[timestamp].static_updates.push_back({
ttl,
std::move(up.atomic_entries),
std::move(up.nonatomic_entries)
});
return updates.back();
}
for (const rows_entry& cr : p.clustered_rows()) {
auto cr_updates = extract_row_updates(cr.row().cells(), column_kind::regular_column, base_schema);
/* To copy a value from a collection/non-frozen UDT (in order to put it into a bucket) we need to know the value's type.
* The method of obtaining the type depends on the collection type; in particular, for non-frozen UDT, each value
* might have a different type, thus in general we need a method that, given a key (identifying the value in the collection),
* returns the value's type.
*
* We use the `Curiously Recurring Template Pattern' to avoid performing a dynamic dispatch on the collection's type for each visited cell.
* Instead we perform a single dynamic dispatch at the beginning, when encountering the collection column;
* the dispatch provides us with a correct `get_value_type` method.
* See `extract_row_visitor::collection_column` where the dispatch is done.
const auto& marker = cr.row().marker();
auto marker_timestamp = marker.timestamp();
auto marker_ttl = marker.is_expiring() ? marker.ttl() : gc_clock::duration(0);
if (marker.is_live()) {
// make sure that an entry corresponding to the row marker's timestamp and ttl is in the map
(void)cr_updates[std::pair(marker_timestamp, marker_ttl)];
data_type get_value_type(bytes_view);
*/
void cell(bytes_view key, const atomic_cell_view& c) {
auto& entry = get_or_append_entry(c.timestamp(), get_ttl(c));
entry.cells.emplace_back(to_bytes(key), atomic_cell(*static_cast<V&>(*this).get_value_type(key), c));
}
public:
extract_collection_visitor(column_id id, std::map<change_key_t, row_update>& updates)
: _id(id), _updates(updates) {}
void collection_tombstone(const tombstone& t) {
auto& entry = get_or_append_entry(t.timestamp + 1, gc_clock::duration(0));
entry.t = t;
}
void live_collection_cell(bytes_view key, const atomic_cell_view& c) {
cell(key, c);
}
void dead_collection_cell(bytes_view key, const atomic_cell_view& c) {
cell(key, c);
}
constexpr bool finished() const { return false; }
};
/* Visits all cells and tombstones in a row, putting the encountered changes into buckets
* sorted by timestamp first and ttl second (see `_updates`).
*/
struct extract_row_visitor {
std::map<change_key_t, row_update> _updates;
void cell(const column_definition& cdef, const atomic_cell_view& cell) {
_updates[std::pair(cell.timestamp(), get_ttl(cell))].atomic_entries.push_back({cdef.id, atomic_cell(*cdef.type, cell)});
}
void live_atomic_cell(const column_definition& cdef, const atomic_cell_view& c) {
cell(cdef, c);
}
void dead_atomic_cell(const column_definition& cdef, const atomic_cell_view& c) {
cell(cdef, c);
}
void collection_column(const column_definition& cdef, auto&& visit_collection) {
visit(*cdef.type, make_visitor(
[&] (const collection_type_impl& ctype) {
struct collection_visitor : public extract_collection_visitor<collection_visitor> {
data_type _value_type;
collection_visitor(column_id id, std::map<change_key_t, row_update>& updates, const collection_type_impl& ctype)
: extract_collection_visitor<collection_visitor>(id, updates), _value_type(ctype.value_comparator()) {}
data_type get_value_type(bytes_view) {
return _value_type;
}
} v(cdef.id, _updates, ctype);
visit_collection(v);
},
[&] (const user_type_impl& utype) {
struct udt_visitor : public extract_collection_visitor<udt_visitor> {
const user_type_impl& _utype;
udt_visitor(column_id id, std::map<change_key_t, row_update>& updates, const user_type_impl& utype)
: extract_collection_visitor<udt_visitor>(id, updates), _utype(utype) {}
data_type get_value_type(bytes_view key) {
return _utype.type(deserialize_field_index(key));
}
} v(cdef.id, _updates, utype);
visit_collection(v);
},
[&] (const abstract_type& o) {
throw std::runtime_error(format("extract_changes: unknown collection type: {}", o.name()));
}
));
}
auto is_insert = [&] (api::timestamp_type timestamp, gc_clock::duration ttl) {
if (!marker.is_live()) {
return false;
constexpr bool finished() const { return false; }
};
struct extract_changes_visitor {
set_of_changes _result;
void static_row_cells(auto&& visit_row_cells) {
extract_row_visitor v;
visit_row_cells(v);
for (auto& [ts_ttl, row_update]: v._updates) {
_result[ts_ttl.first].static_updates.push_back({
ts_ttl.second,
std::move(row_update.atomic_entries),
std::move(row_update.nonatomic_entries)
});
}
}
void clustered_row_cells(const clustering_key& ckey, auto&& visit_row_cells) {
struct clustered_cells_visitor : public extract_row_visitor {
api::timestamp_type _marker_ts;
gc_clock::duration _marker_ttl;
std::optional<row_marker> _marker;
void marker(const row_marker& rm) {
_marker_ts = rm.timestamp();
_marker_ttl = get_ttl(rm);
_marker = rm;
// make sure that an entry corresponding to the row marker's timestamp and ttl is in the map
(void)_updates[std::pair(_marker_ts, _marker_ttl)];
}
} v;
visit_row_cells(v);
return timestamp == marker_timestamp && ttl == marker_ttl;
};
for (auto& [k, up]: cr_updates) {
for (auto& [ts_ttl, row_update]: v._updates) {
// It is important that changes in the resulting `set_of_changes` are listed
// in increasing TTL order. The reason is explained in a comment in cdc/log.cc,
// search for "#6070".
auto [timestamp, ttl] = k;
auto [ts, ttl] = ts_ttl;
if (is_insert(timestamp, ttl)) {
res[timestamp].clustered_inserts.push_back({
if (v._marker && ts == v._marker_ts && ttl == v._marker_ttl) {
_result[ts].clustered_inserts.push_back({
ttl,
cr.key(),
marker,
std::move(up.atomic_entries),
ckey,
*v._marker,
std::move(row_update.atomic_entries),
{}
});
auto& cr_insert = res[timestamp].clustered_inserts.back();
auto& cr_insert = _result[ts].clustered_inserts.back();
bool clustered_update_exists = false;
for (auto& nonatomic_up: up.nonatomic_entries) {
for (auto& nonatomic_up: row_update.nonatomic_entries) {
// Updating a collection column with an INSERT statement implies inserting a tombstone.
//
// For example, suppose that we have:
@@ -205,9 +355,9 @@ set_of_changes extract_changes(const mutation& base_mutation, const schema& base
cr_insert.nonatomic_entries.push_back(std::move(nonatomic_up));
} else {
if (!clustered_update_exists) {
res[timestamp].clustered_updates.push_back({
_result[ts].clustered_updates.push_back({
ttl,
cr.key(),
ckey,
{},
{}
});
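
The comment inside `extract_collection_visitor` explains that the Curiously Recurring Template Pattern is used so that the per-cell type lookup needs no dynamic dispatch. A stripped-down sketch of that pattern follows; all names (`collecting_visitor`, `uniform_visitor`, `per_field_visitor`, `value_type_of`) are hypothetical, and type names are plain strings instead of `data_type`.

```cpp
#include <string>
#include <utility>
#include <vector>

// Base visitor: shared bookkeeping. It asks Derived for the value type of each key
// through a static_cast, so the call is resolved at compile time (no virtual function).
template <typename Derived>
struct collecting_visitor {
    std::vector<std::pair<std::string, std::string>> seen; // (key, type name) pairs

    void cell(const std::string& key) {
        seen.emplace_back(key, static_cast<Derived&>(*this).value_type_of(key));
    }
};

// Collection-like case: every value has the same type.
struct uniform_visitor : collecting_visitor<uniform_visitor> {
    std::string value_type;
    explicit uniform_visitor(std::string t) : value_type(std::move(t)) {}
    std::string value_type_of(const std::string&) const { return value_type; }
};

// UDT-like case: the type depends on which field the key names (here: a decimal index).
struct per_field_visitor : collecting_visitor<per_field_visitor> {
    std::vector<std::string> field_types;
    explicit per_field_visitor(std::vector<std::string> t) : field_types(std::move(t)) {}
    std::string value_type_of(const std::string& key) const {
        return field_types.at(std::stoul(key));
    }
};
```

The one dynamic decision, which derived visitor to construct, is made once per collection column; every subsequent `cell` call is a direct call.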
@@ -228,201 +378,239 @@ set_of_changes extract_changes(const mutation& base_mutation, const schema& base
clustered_update_exists = true;
}
auto& cr_update = res[timestamp].clustered_updates.back();
auto& cr_update = _result[ts].clustered_updates.back();
cr_update.nonatomic_entries.push_back(std::move(nonatomic_up));
}
}
} else {
res[timestamp].clustered_updates.push_back({
_result[ts].clustered_updates.push_back({
ttl,
cr.key(),
std::move(up.atomic_entries),
std::move(up.nonatomic_entries)
ckey,
std::move(row_update.atomic_entries),
std::move(row_update.nonatomic_entries)
});
}
}
auto row_tomb = cr.row().deleted_at().regular();
if (row_tomb) {
res[row_tomb.timestamp].clustered_row_deletions.push_back({cr.key(), row_tomb});
}
}
for (const auto& rt: p.row_tombstones()) {
if (rt.tomb.timestamp != api::missing_timestamp) {
res[rt.tomb.timestamp].clustered_range_deletions.push_back({rt});
}
void clustered_row_delete(const clustering_key& ckey, const tombstone& t) {
_result[t.timestamp].clustered_row_deletions.push_back({ckey, t});
}
auto partition_tomb_timestamp = p.partition_tombstone().timestamp;
if (partition_tomb_timestamp != api::missing_timestamp) {
res[partition_tomb_timestamp].partition_deletions = {p.partition_tombstone()};
void range_delete(const range_tombstone& rt) {
_result[rt.tomb.timestamp].clustered_range_deletions.push_back({rt});
}
return res;
void partition_delete(const tombstone& t) {
_result[t.timestamp].partition_deletions = {t};
}
constexpr bool finished() const { return false; }
};
set_of_changes extract_changes(const mutation& m) {
extract_changes_visitor v;
cdc::inspect_mutation(m, v);
return std::move(v._result);
}
namespace cdc {
bool should_split(const mutation& base_mutation, const schema& base_schema) {
auto& p = base_mutation.partition();
struct find_timestamp_visitor {
api::timestamp_type _ts = api::missing_timestamp;
api::timestamp_type found_ts = api::missing_timestamp;
std::optional<gc_clock::duration> found_ttl; // 0 = "no ttl"
bool finished() const { return _ts != api::missing_timestamp; }
auto check_or_set = [&] (api::timestamp_type ts, gc_clock::duration ttl) {
if (found_ts != api::missing_timestamp && found_ts != ts) {
return true;
}
found_ts = ts;
void visit(api::timestamp_type ts) { _ts = ts; }
void visit(const atomic_cell_view& cell) { visit(cell.timestamp()); }
if (found_ttl && *found_ttl != ttl) {
return true;
}
found_ttl = ttl;
void live_atomic_cell(const column_definition&, const atomic_cell_view& cell) { visit(cell); }
void dead_atomic_cell(const column_definition&, const atomic_cell_view& cell) { visit(cell); }
void collection_tombstone(const tombstone& t) {
// A collection tombstone with timestamp T can be created with:
// UPDATE ks.t USING TIMESTAMP T + 1 SET X = null WHERE ...
// (where X is a collection column).
// This is, among others, the reason why we show it in the CDC log
// with cdc$time using timestamp T + 1 instead of T.
visit(t.timestamp + 1);
}
void live_collection_cell(bytes_view, const atomic_cell_view& cell) { visit(cell); }
void dead_collection_cell(bytes_view, const atomic_cell_view& cell) { visit(cell); }
void collection_column(const column_definition&, auto&& visit_collection) { visit_collection(*this); }
void marker(const row_marker& rm) { visit(rm.timestamp()); }
void static_row_cells(auto&& visit_row_cells) { visit_row_cells(*this); }
void clustered_row_cells(const clustering_key&, auto&& visit_row_cells) { visit_row_cells(*this); }
void clustered_row_delete(const clustering_key&, const tombstone& t) { visit(t.timestamp); }
void range_delete(const range_tombstone& t) { visit(t.tomb.timestamp); }
void partition_delete(const tombstone& t) { visit(t.timestamp); }
};
return false;
};
/* Find some timestamp inside the given mutation.
*
* If this mutation was created using a single insert/update/delete statement, then it will have a single,
* well-defined timestamp (even if this timestamp occurs multiple times, e.g. in a cell and row_marker).
*
* This function shouldn't be used for mutations that have multiple different timestamps: the function
* would only find one of them. When dealing with such mutations, the caller should first split the mutation
* into multiple ones, each with a single timestamp.
*/
api::timestamp_type find_timestamp(const mutation& m) {
find_timestamp_visitor v;
bool had_static_row = false;
cdc::inspect_mutation(m, v);
bool should_split = false;
p.static_row().get().for_each_cell([&] (column_id id, const atomic_cell_or_collection& cell) {
had_static_row = true;
auto& cdef = base_schema.column_at(column_kind::static_column, id);
if (cdef.is_atomic()) {
auto view = cell.as_atomic_cell(cdef);
if (check_or_set(view.timestamp(), view.is_live_and_has_ttl() ? view.ttl() : gc_clock::duration(0))) {
should_split = true;
}
return;
}
cell.as_collection_mutation().with_deserialized(*cdef.type, [&] (collection_mutation_view_description mview) {
auto desc = mview.materialize(*cdef.type);
for (auto& [k, v]: desc.cells) {
if (check_or_set(v.timestamp(), v.is_live_and_has_ttl() ? v.ttl() : gc_clock::duration(0))) {
should_split = true;
return;
}
}
if (desc.tomb) {
if (check_or_set(desc.tomb.timestamp + 1, gc_clock::duration(0))) {
should_split = true;
return;
}
}
});
});
if (should_split) {
return true;
if (v._ts == api::missing_timestamp) {
throw std::runtime_error("cdc: could not find timestamp of mutation");
}
bool had_clustered_row = false;
if (!p.clustered_rows().empty() && had_static_row) {
return true;
}
for (const rows_entry& cr : p.clustered_rows()) {
had_clustered_row = true;
const auto& marker = cr.row().marker();
if (marker.is_live() && check_or_set(marker.timestamp(), marker.is_expiring() ? marker.ttl() : gc_clock::duration(0))) {
return true;
}
bool is_insert = marker.is_live();
bool had_cells = false;
cr.row().cells().for_each_cell([&] (column_id id, const atomic_cell_or_collection& cell) {
had_cells = true;
auto& cdef = base_schema.column_at(column_kind::regular_column, id);
if (cdef.is_atomic()) {
auto view = cell.as_atomic_cell(cdef);
if (check_or_set(view.timestamp(), view.is_live_and_has_ttl() ? view.ttl() : gc_clock::duration(0))) {
should_split = true;
}
return;
}
cell.as_collection_mutation().with_deserialized(*cdef.type, [&] (collection_mutation_view_description mview) {
for (auto& [k, v]: mview.cells) {
if (check_or_set(v.timestamp(), v.is_live_and_has_ttl() ? v.ttl() : gc_clock::duration(0))) {
should_split = true;
return;
}
if (is_insert) {
// nonatomic updates cannot be expressed with an INSERT.
should_split = true;
return;
}
}
if (mview.tomb) {
if (check_or_set(mview.tomb.timestamp + 1, gc_clock::duration(0))) {
should_split = true;
return;
}
}
});
});
if (should_split) {
return true;
}
auto row_tomb = cr.row().deleted_at().regular();
if (row_tomb) {
if (had_cells) {
return true;
}
// there were no cells, so no ttl
assert(!found_ttl);
if (found_ts != api::missing_timestamp && found_ts != row_tomb.timestamp) {
return true;
}
found_ts = row_tomb.timestamp;
}
}
if (!p.row_tombstones().empty() && (had_static_row || had_clustered_row)) {
return true;
}
for (const auto& rt: p.row_tombstones()) {
if (rt.tomb) {
if (found_ts != api::missing_timestamp && found_ts != rt.tomb.timestamp) {
return true;
}
found_ts = rt.tomb.timestamp;
}
}
if (p.partition_tombstone().timestamp != api::missing_timestamp
&& (!p.row_tombstones().empty() || had_static_row || had_clustered_row)) {
return true;
}
// A mutation with no timestamp will be split into 0 mutations
return found_ts == api::missing_timestamp;
return v._ts;
}
void for_each_change(const mutation& base_mutation, const schema_ptr& base_schema,
seastar::noncopyable_function<void(mutation, api::timestamp_type, bytes, int&)> f) {
auto changes = extract_changes(base_mutation, *base_schema);
/* If a mutation contains multiple timestamps, multiple ttls, or multiple types of changes
* (e.g. it was created from a batch that both updated a clustered row and deleted a clustered row),
* we split it into multiple mutations, each with exactly one timestamp, at most one ttl, and a single type of change.
* We also split if we find both a change with no ttl (e.g. a cell tombstone) and a change with ttl (e.g. a ttled cell update).
*
* The `should_split` function checks whether the mutation requires such splitting, using `should_split_visitor`.
* The visitor uses the order in which the mutation is being visited (see the documentation of ChangeVisitor),
* remembers a bunch of state based on whatever was visited until now (e.g. was there a static row update?
* Was there a clustered row update? Was there a clustered row delete? Was there a TTL?)
* and tells the caller to stop on the first occurrence of a second timestamp/ttl/type of change.
*/
struct should_split_visitor {
bool _had_static_row = false;
bool _had_clustered_row = false;
bool _had_upsert = false;
bool _had_row_marker = false;
bool _had_range_delete = false;
bool _result = false;
// This becomes a valid (non-missing) timestamp after visiting the first change.
// Then, if we encounter any different timestamp, it means that we should split.
api::timestamp_type _ts = api::missing_timestamp;
// This becomes non-null after visiting the first change.
// If the change did not have a ttl (e.g. a non-ttled cell, or a tombstone), we store gc_clock::duration(0) there,
// because specifying ttl = 0 is equivalent to not specifying a TTL.
// Otherwise we store the change's ttl.
std::optional<gc_clock::duration> _ttl = std::nullopt;
inline bool finished() const { return _result; }
inline void stop() { _result = true; }
void visit(api::timestamp_type ts, gc_clock::duration ttl = gc_clock::duration(0)) {
if (_ts != api::missing_timestamp && _ts != ts) {
return stop();
}
_ts = ts;
if (_ttl && *_ttl != ttl) {
return stop();
}
_ttl = { ttl };
}
void visit(const atomic_cell_view& cell) { visit(cell.timestamp(), get_ttl(cell)); }
void live_atomic_cell(const column_definition&, const atomic_cell_view& cell) { visit(cell); }
void dead_atomic_cell(const column_definition&, const atomic_cell_view& cell) { visit(cell); }
void collection_tombstone(const tombstone& t) { visit(t.timestamp + 1); }
void live_collection_cell(bytes_view, const atomic_cell_view& cell) {
if (_had_row_marker) {
// nonatomic updates cannot be expressed with an INSERT.
return stop();
}
visit(cell);
}
void dead_collection_cell(bytes_view, const atomic_cell_view& cell) { visit(cell); }
void collection_column(const column_definition&, auto&& visit_collection) { visit_collection(*this); }
void marker(const row_marker& rm) {
_had_row_marker = true;
visit(rm.timestamp(), get_ttl(rm));
}
void static_row_cells(auto&& visit_row_cells) {
_had_static_row = true;
visit_row_cells(*this);
}
void clustered_row_cells(const clustering_key&, auto&& visit_row_cells) {
if (_had_static_row) {
return stop();
}
_had_clustered_row = _had_upsert = true;
visit_row_cells(*this);
}
void clustered_row_delete(const clustering_key&, const tombstone& t) {
if (_had_static_row || _had_upsert) {
return stop();
}
_had_clustered_row = true;
visit(t.timestamp);
}
void range_delete(const range_tombstone& t) {
if (_had_static_row || _had_clustered_row) {
return stop();
}
_had_range_delete = true;
visit(t.tomb.timestamp);
}
void partition_delete(const tombstone&) {
if (_had_range_delete || _had_static_row || _had_clustered_row) {
return stop();
}
}
};
bool should_split(const mutation& m) {
should_split_visitor v;
cdc::inspect_mutation(m, v);
return v._result
// A mutation with no timestamp will be split into 0 mutations:
|| v._ts == api::missing_timestamp;
}
void process_changes_with_splitting(const mutation& base_mutation, change_processor& processor,
bool enable_preimage, bool enable_postimage) {
const auto base_schema = base_mutation.schema();
auto changes = extract_changes(base_mutation);
auto pk = base_mutation.key();
if (changes.empty()) {
return;
}
const auto last_timestamp = changes.rbegin()->first;
for (auto& [change_ts, btch] : changes) {
auto tuuid = timeuuid_type->decompose(generate_timeuuid(change_ts));
int batch_no = 0;
const bool is_last = change_ts == last_timestamp;
processor.begin_timestamp(change_ts, is_last);
clustered_column_set affected_clustered_columns_per_row{clustering_key::less_compare(*base_schema)};
one_kind_column_set affected_static_columns{base_schema->static_columns_count()};
if (enable_preimage || enable_postimage) {
affected_static_columns = btch.get_affected_static_columns(*base_schema);
affected_clustered_columns_per_row = btch.get_affected_clustered_columns_per_row(*base_mutation.schema());
}
if (enable_preimage) {
if (affected_static_columns.count() > 0) {
processor.produce_preimage(nullptr, affected_static_columns);
}
for (const auto& [ck, affected_row_cells] : affected_clustered_columns_per_row) {
processor.produce_preimage(&ck, affected_row_cells);
}
}
for (auto& sr_update : btch.static_updates) {
mutation m(base_schema, pk);
@@ -434,7 +622,7 @@ void for_each_change(const mutation& base_mutation, const schema_ptr& base_schem
auto& cdef = base_schema->column_at(column_kind::static_column, nonatomic_update.id);
m.set_static_cell(cdef, collection_mutation_description{nonatomic_update.t, std::move(nonatomic_update.cells)}.serialize(*cdef.type));
}
f(std::move(m), change_ts, tuuid, batch_no);
processor.process_change(m);
}
for (auto& cr_insert : btch.clustered_inserts) {
@@ -451,7 +639,7 @@ void for_each_change(const mutation& base_mutation, const schema_ptr& base_schem
}
row.apply(cr_insert.marker);
f(std::move(m), change_ts, tuuid, batch_no);
processor.process_change(m);
}
for (auto& cr_update : btch.clustered_updates) {
@@ -467,27 +655,86 @@ void for_each_change(const mutation& base_mutation, const schema_ptr& base_schem
row.apply(cdef, collection_mutation_description{nonatomic_update.t, std::move(nonatomic_update.cells)}.serialize(*cdef.type));
}
f(std::move(m), change_ts, tuuid, batch_no);
processor.process_change(m);
}
for (auto& cr_delete : btch.clustered_row_deletions) {
mutation m(base_schema, pk);
m.partition().apply_delete(*base_schema, cr_delete.key, cr_delete.t);
f(std::move(m), change_ts, tuuid, batch_no);
processor.process_change(m);
}
for (auto& crange_delete : btch.clustered_range_deletions) {
mutation m(base_schema, pk);
m.partition().apply_delete(*base_schema, crange_delete.rt);
f(std::move(m), change_ts, tuuid, batch_no);
processor.process_change(m);
}
if (btch.partition_deletions) {
mutation m(base_schema, pk);
m.partition().apply(btch.partition_deletions->t);
f(std::move(m), change_ts, tuuid, batch_no);
processor.process_change(m);
}
if (enable_postimage) {
if (affected_static_columns.count() > 0) {
processor.produce_postimage(nullptr);
}
for (const auto& [ck, crow] : affected_clustered_columns_per_row) {
processor.produce_postimage(&ck);
}
}
processor.end_record();
}
}
void process_changes_without_splitting(const mutation& base_mutation, change_processor& processor,
bool enable_preimage, bool enable_postimage) {
auto ts = find_timestamp(base_mutation);
processor.begin_timestamp(ts, true);
const auto base_schema = base_mutation.schema();
if (enable_preimage) {
const auto& p = base_mutation.partition();
one_kind_column_set columns{base_schema->static_columns_count()};
if (!p.static_row().empty()) {
p.static_row().get().for_each_cell([&] (column_id id, const atomic_cell_or_collection& cell) {
columns.set(id);
});
processor.produce_preimage(nullptr, columns);
}
columns.resize(base_schema->regular_columns_count());
for (const rows_entry& cr : p.clustered_rows()) {
columns.reset();
if (cr.row().deleted_at().regular()) {
// Row deleted - include all columns in preimage
columns.set(0, base_schema->regular_columns_count(), true);
} else {
cr.row().cells().for_each_cell([&] (column_id id, const atomic_cell_or_collection& cell) {
columns.set(id);
});
}
processor.produce_preimage(&cr.key(), columns);
}
}
processor.process_change(base_mutation);
if (enable_postimage) {
const auto& p = base_mutation.partition();
if (!p.static_row().empty()) {
processor.produce_postimage(nullptr);
}
for (const rows_entry& cr : p.clustered_rows()) {
processor.produce_postimage(&cr.key());
}
}
processor.end_record();
}
} // namespace cdc
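
The timestamp/TTL part of the rule described in the comment above `should_split_visitor` can be sketched outside the code base as follows. This is a simplified stand-in, not the project's implementation: `toy_change`, `toy_split_detector` and `toy_should_split` are hypothetical names, timestamps and TTLs are plain integers, and the checks on the kind of change (static row, upsert, row delete, range delete) are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for one visited change (cell, row marker, tombstone...).
struct toy_change {
    int64_t timestamp;
    int64_t ttl = 0; // 0 means "no TTL", mirroring gc_clock::duration(0) in the diff
};

// Mirrors the shape of should_split_visitor::visit(): remember the first
// timestamp and TTL seen, and stop as soon as a different one shows up.
class toy_split_detector {
    std::optional<int64_t> _ts;
    std::optional<int64_t> _ttl;
    bool _result = false;
public:
    bool finished() const { return _result; }
    bool saw_timestamp() const { return _ts.has_value(); }
    void visit(const toy_change& c) {
        if (_ts && *_ts != c.timestamp) { _result = true; return; }
        _ts = c.timestamp;
        if (_ttl && *_ttl != c.ttl) { _result = true; return; }
        _ttl = c.ttl;
    }
};

// True when the changes carry more than one timestamp or TTL, or none at all
// (the diff treats a mutation with no timestamp as "split into 0 mutations").
inline bool toy_should_split(const std::vector<toy_change>& changes) {
    toy_split_detector d;
    for (const auto& c : changes) {
        d.visit(c);
        if (d.finished()) {
            return true;
        }
    }
    return !d.saw_timestamp();
}
```

Under these assumptions, two changes with the same timestamp and no TTL do not require splitting, while a second timestamp, a second TTL, or an empty list of changes does.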


@@ -22,6 +22,7 @@
#pragma once
#include <vector>
#include <boost/dynamic_bitset.hpp>
#include "schema_fwd.hh"
#include "timestamp.hh"
#include "bytes.hh"
@@ -31,8 +32,61 @@ class mutation;
namespace cdc {
bool should_split(const mutation& base_mutation, const schema& base_schema);
void for_each_change(const mutation& base_mutation, const schema_ptr& base_schema,
seastar::noncopyable_function<void(mutation, api::timestamp_type, bytes, int&)>);
// Represents a set of column ids of one kind (partition key, clustering key, regular row or static row).
// There already exists a column_set type, but it keeps ordinal_column_ids, not column_ids (ordinal column ids
// are unique across whole table, while kind-specific ids are unique only within one column kind).
// To avoid converting back and forth between ordinal and kind-specific ids, one_kind_column_set is used instead.
using one_kind_column_set = boost::dynamic_bitset<uint64_t>;
// An object that processes changes from a single, big mutation.
// It is intended to be used with process_changes_xxx_splitting. Those functions define the order and layout in which
// changes should appear in CDC log, and change_processor is responsible for producing CDC log rows from changes given
// by those two functions.
//
// The flow of calling its methods should go as follows:
// -> begin_timestamp #1
// -> produce_preimage (one call for each preimage row to be generated)
// -> process_change (one call for each part generated by the splitting function)
// -> produce_postimage (one call for each postimage row to be generated)
// -> begin_timestamp #2
// ...
class change_processor {
protected:
~change_processor() {};
public:
// Tells the processor that changes that follow from now on will be of given timestamp.
// This method must be called in increasing timestamp order.
// begin_timestamp can be called only once for a given timestamp and change_processor object.
// ts - timestamp of mutation parts
// is_last - determines if this will be the last timestamp to be processed by this change_processor instance.
virtual void begin_timestamp(api::timestamp_type ts, bool is_last) = 0;
// Tells the processor to produce a preimage for a given clustering/static row.
// ck - clustering key of the row for which to produce a preimage; if nullptr, static row preimage is requested
// columns_to_include - include information about the current state of those columns only, leave others as null
virtual void produce_preimage(const clustering_key* ck, const one_kind_column_set& columns_to_include) = 0;
// Tells the processor to produce a postimage for a given clustering/static row.
// Contrary to preimage, this requires data from all columns to be present.
// ck - clustering key of the row for which to produce a postimage; if nullptr, static row postimage is requested
virtual void produce_postimage(const clustering_key* ck) = 0;
// Processes a smaller mutation which is a subset of the big mutation.
// The mutation provided to process_change should be simple enough for it to be possible to convert it
// into CDC log rows - for example, it cannot represent a write to two columns of the same row, where
// both columns have different timestamp or TTL set.
// m - the small mutation to be converted into CDC log rows.
virtual void process_change(const mutation& m) = 0;
// Tells processor we have reached end of record - last part
// of a given timestamp batch
virtual void end_record() = 0;
};
bool should_split(const mutation& base_mutation);
void process_changes_with_splitting(const mutation& base_mutation, change_processor& processor,
bool enable_preimage, bool enable_postimage);
void process_changes_without_splitting(const mutation& base_mutation, change_processor& processor,
bool enable_preimage, bool enable_postimage);
}
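
The call order documented for `change_processor` (begin_timestamp, preimages, changes, postimages, end_record) can be illustrated with a small recording implementation. Everything here is a hypothetical simplification: `toy_change_processor`, `recording_processor` and `toy_process_batch` are made-up names, and rows and changes are plain strings rather than clustering keys, column sets and mutations.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified interface: the real change_processor takes
// mutations, clustering keys and column sets instead of strings.
class toy_change_processor {
protected:
    ~toy_change_processor() = default;
public:
    virtual void begin_timestamp(int64_t ts, bool is_last) = 0;
    virtual void produce_preimage(const std::string& row) = 0;
    virtual void process_change(const std::string& change) = 0;
    virtual void produce_postimage(const std::string& row) = 0;
    virtual void end_record() = 0;
};

// Records the order of calls so that the sequence can be inspected.
class recording_processor final : public toy_change_processor {
public:
    std::vector<std::string> calls;
    void begin_timestamp(int64_t ts, bool is_last) override {
        calls.push_back("begin " + std::to_string(ts) + (is_last ? " last" : ""));
    }
    void produce_preimage(const std::string& row) override { calls.push_back("preimage " + row); }
    void process_change(const std::string& change) override { calls.push_back("change " + change); }
    void produce_postimage(const std::string& row) override { calls.push_back("postimage " + row); }
    void end_record() override { calls.push_back("end_record"); }
};

// Drives the processor for one timestamp batch in the documented order.
inline void toy_process_batch(toy_change_processor& p, int64_t ts, bool is_last,
                              const std::vector<std::string>& affected_rows,
                              const std::vector<std::string>& changes) {
    p.begin_timestamp(ts, is_last);
    for (const auto& row : affected_rows) {
        p.produce_preimage(row);
    }
    for (const auto& change : changes) {
        p.process_change(change);
    }
    for (const auto& row : affected_rows) {
        p.produce_postimage(row);
    }
    p.end_record();
}
```

A caller that splits a mutation into several timestamps would invoke `toy_process_batch` once per timestamp, in increasing order, passing `is_last = true` only for the final one.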


@@ -31,10 +31,7 @@ class checked_file_impl : public file_impl {
public:
checked_file_impl(const io_error_handler& error_handler, file f)
: _error_handler(error_handler), _file(f) {
_memory_dma_alignment = f.memory_dma_alignment();
_disk_read_dma_alignment = f.disk_read_dma_alignment();
_disk_write_dma_alignment = f.disk_write_dma_alignment();
: file_impl(*get_file_impl(f)), _error_handler(error_handler), _file(f) {
}
virtual future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) override {


@@ -67,8 +67,8 @@ public:
int operator()(const clustering_key_prefix& p1, int32_t w1, const clustering_key_prefix& p2, int32_t w2) const {
auto type = _s.get().clustering_key_prefix_type();
auto res = prefix_equality_tri_compare(type->types().begin(),
type->begin(p1), type->end(p1),
type->begin(p2), type->end(p2),
type->begin(p1.representation()), type->end(p1.representation()),
type->begin(p2.representation()), type->end(p2.representation()),
::tri_compare);
if (res) {
return res;


@@ -72,7 +72,14 @@ public:
}
return result;
}
class position_range_iterator : public std::iterator<std::input_iterator_tag, const position_range> {
class position_range_iterator {
public:
using iterator_category = std::input_iterator_tag;
using value_type = const position_range;
using difference_type = std::ptrdiff_t;
using pointer = const position_range*;
using reference = const position_range&;
private:
set_type::iterator _i;
public:
position_range_iterator(set_type::iterator i) : _i(i) {}
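
The hunk above drops the `std::iterator` base class (deprecated since C++17) and declares the five iterator member types directly. A minimal, self-contained sketch of the same pattern follows; `toy_set_iterator` and `toy_sum` are hypothetical names, and the wrapped container is a plain `std::set<int>` rather than the project's range set.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <set>
#include <type_traits>

// Input iterator over std::set<int> that declares its traits explicitly
// instead of inheriting them from the deprecated std::iterator.
class toy_set_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = const int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = const int&;
private:
    std::set<int>::const_iterator _i;
public:
    explicit toy_set_iterator(std::set<int>::const_iterator i) : _i(i) {}
    reference operator*() const { return *_i; }
    pointer operator->() const { return &*_i; }
    toy_set_iterator& operator++() { ++_i; return *this; }
    toy_set_iterator operator++(int) { auto tmp = *this; ++_i; return tmp; }
    bool operator==(const toy_set_iterator& o) const { return _i == o._i; }
    bool operator!=(const toy_set_iterator& o) const { return _i != o._i; }
};

// std::iterator_traits picks up the member types without any base class.
static_assert(std::is_same<std::iterator_traits<toy_set_iterator>::iterator_category,
                           std::input_iterator_tag>::value,
              "member typedefs are visible through iterator_traits");

// Standard algorithms accept the iterator as usual.
inline int toy_sum(const std::set<int>& s) {
    return std::accumulate(toy_set_iterator(s.begin()), toy_set_iterator(s.end()), 0);
}
```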


@@ -65,6 +65,11 @@ private:
_current_start = position_in_partition_view::for_range_start(_current_range.front());
_current_end = position_in_partition_view::for_range_end(_current_range.front());
}
} else {
// If the first range is contiguous with the static row, then advance _current_end as much as we can
if (_current_range && !_current_range.front().start()) {
_current_end = position_in_partition_view::for_range_end(_current_range.front());
}
}
}


@@ -22,7 +22,6 @@
#include "types/collection.hh"
#include "types/user.hh"
#include "concrete_types.hh"
#include "atomic_cell_or_collection.hh"
#include "mutation_partition.hh"
#include "compaction_garbage_collector.hh"
#include "combine.hh"
@@ -30,40 +29,28 @@
#include "collection_mutation.hh"
collection_mutation::collection_mutation(const abstract_type& type, collection_mutation_view v)
: _data(imr_object_type::make(data::cell::make_collection(v.data), &type.imr_state().lsa_migrator())) {}
: _data(v.data) {}
collection_mutation::collection_mutation(const abstract_type& type, const bytes_ostream& data)
: _data(imr_object_type::make(data::cell::make_collection(fragment_range_view(data)), &type.imr_state().lsa_migrator())) {}
static collection_mutation_view get_collection_mutation_view(const uint8_t* ptr)
{
auto f = data::cell::structure::get_member<data::cell::tags::flags>(ptr);
auto ti = data::type_info::make_collection();
data::cell::context ctx(f, ti);
auto view = data::cell::structure::get_member<data::cell::tags::cell>(ptr).as<data::cell::tags::collection>(ctx);
auto dv = data::cell::variable_value::make_view(view, f.get<data::cell::tags::external_data>());
return collection_mutation_view { dv };
}
collection_mutation::collection_mutation(const abstract_type& type, managed_bytes data)
: _data(std::move(data)) {}
collection_mutation::operator collection_mutation_view() const
{
return get_collection_mutation_view(_data.get());
return collection_mutation_view{managed_bytes_view(_data)};
}
collection_mutation_view atomic_cell_or_collection::as_collection_mutation() const {
return get_collection_mutation_view(_data.get());
return collection_mutation_view{managed_bytes_view(_data)};
}
bool collection_mutation_view::is_empty() const {
auto in = collection_mutation_input_stream(data);
auto in = collection_mutation_input_stream(fragment_range(data));
auto has_tomb = in.read_trivial<bool>();
return !has_tomb && in.read_trivial<uint32_t>() == 0;
}
template <typename F>
requires std::is_invocable_r_v<const data::type_info&, F, collection_mutation_input_stream&>
static bool is_any_live(const atomic_cell_value_view& data, tombstone tomb, gc_clock::time_point now, F&& read_cell_type_info) {
auto in = collection_mutation_input_stream(data);
bool collection_mutation_view::is_any_live(const abstract_type& type, tombstone tomb, gc_clock::time_point now) const {
auto in = collection_mutation_input_stream(fragment_range(data));
auto has_tomb = in.read_trivial<bool>();
if (has_tomb) {
auto ts = in.read_trivial<api::timestamp_type>();
@@ -73,9 +60,10 @@ static bool is_any_live(const atomic_cell_value_view& data, tombstone tomb, gc_c
auto nr = in.read_trivial<uint32_t>();
for (uint32_t i = 0; i != nr; ++i) {
auto& type_info = read_cell_type_info(in);
auto key_size = in.read_trivial<uint32_t>();
in.skip(key_size);
auto vsize = in.read_trivial<uint32_t>();
auto value = atomic_cell_view::from_bytes(type_info, in.read(vsize));
auto value = atomic_cell_view::from_bytes(type, in.read(vsize));
if (value.is_live(tomb, now, false)) {
return true;
}
@@ -84,33 +72,8 @@ static bool is_any_live(const atomic_cell_value_view& data, tombstone tomb, gc_c
return false;
}
bool collection_mutation_view::is_any_live(const abstract_type& type, tombstone tomb, gc_clock::time_point now) const {
return visit(type, make_visitor(
[&] (const collection_type_impl& ctype) {
auto& type_info = ctype.value_comparator()->imr_state().type_info();
return ::is_any_live(data, tomb, now, [&type_info] (collection_mutation_input_stream& in) -> const data::type_info& {
auto key_size = in.read_trivial<uint32_t>();
in.skip(key_size);
return type_info;
});
},
[&] (const user_type_impl& utype) {
return ::is_any_live(data, tomb, now, [&utype] (collection_mutation_input_stream& in) -> const data::type_info& {
auto key_size = in.read_trivial<uint32_t>();
auto key = in.read(key_size);
return utype.type(deserialize_field_index(key))->imr_state().type_info();
});
},
[&] (const abstract_type& o) -> bool {
throw std::runtime_error(format("collection_mutation_view::is_any_live: unknown type {}", o.name()));
}
));
}
template <typename F>
requires std::is_invocable_r_v<const data::type_info&, F, collection_mutation_input_stream&>
static api::timestamp_type last_update(const atomic_cell_value_view& data, F&& read_cell_type_info) {
auto in = collection_mutation_input_stream(data);
api::timestamp_type collection_mutation_view::last_update(const abstract_type& type) const {
auto in = collection_mutation_input_stream(fragment_range(data));
api::timestamp_type max = api::missing_timestamp;
auto has_tomb = in.read_trivial<bool>();
if (has_tomb) {
@@ -120,39 +83,16 @@ static api::timestamp_type last_update(const atomic_cell_value_view& data, F&& r
auto nr = in.read_trivial<uint32_t>();
for (uint32_t i = 0; i != nr; ++i) {
auto& type_info = read_cell_type_info(in);
const auto key_size = in.read_trivial<uint32_t>();
in.skip(key_size);
auto vsize = in.read_trivial<uint32_t>();
auto value = atomic_cell_view::from_bytes(type_info, in.read(vsize));
auto value = atomic_cell_view::from_bytes(type, in.read(vsize));
max = std::max(value.timestamp(), max);
}
return max;
}
api::timestamp_type collection_mutation_view::last_update(const abstract_type& type) const {
return visit(type, make_visitor(
[&] (const collection_type_impl& ctype) {
auto& type_info = ctype.value_comparator()->imr_state().type_info();
return ::last_update(data, [&type_info] (collection_mutation_input_stream& in) -> const data::type_info& {
auto key_size = in.read_trivial<uint32_t>();
in.skip(key_size);
return type_info;
});
},
[&] (const user_type_impl& utype) {
return ::last_update(data, [&utype] (collection_mutation_input_stream& in) -> const data::type_info& {
auto key_size = in.read_trivial<uint32_t>();
auto key = in.read(key_size);
return utype.type(deserialize_field_index(key))->imr_state().type_info();
});
},
[&] (const abstract_type& o) -> api::timestamp_type {
throw std::runtime_error(format("collection_mutation_view::last_update: unknown type {}", o.name()));
}
));
}
std::ostream& operator<<(std::ostream& os, const collection_mutation_view::printer& cmvp) {
fmt::print(os, "{{collection_mutation_view ");
cmvp._cmv.with_deserialized(cmvp._type, [&os, &type = cmvp._type] (const collection_mutation_view_description& cmvd) {
@@ -278,28 +218,31 @@ static collection_mutation serialize_collection_mutation(
auto size = accumulate(cells, (size_t)4, element_size);
size += 1;
if (tomb) {
size += sizeof(tomb.timestamp) + sizeof(tomb.deletion_time);
size += sizeof(int64_t) + sizeof(int64_t);
}
bytes_ostream ret;
ret.reserve(size);
auto out = ret.write_begin();
*out++ = bool(tomb);
managed_bytes ret(managed_bytes::initialized_later(), size);
managed_bytes_mutable_view out(ret);
write<uint8_t>(out, uint8_t(bool(tomb)));
if (tomb) {
write(out, tomb.timestamp);
write(out, tomb.deletion_time.time_since_epoch().count());
write<int64_t>(out, tomb.timestamp);
write<int64_t>(out, tomb.deletion_time.time_since_epoch().count());
}
auto writeb = [&out] (bytes_view v) {
serialize_int32(out, v.size());
out = std::copy_n(v.begin(), v.size(), out);
auto writek = [&out] (bytes_view v) {
write<int32_t>(out, v.size());
write_fragmented(out, single_fragmented_view(v));
};
auto writev = [&out] (managed_bytes_view v) {
write<int32_t>(out, v.size());
write_fragmented(out, v);
};
// FIXME: overflow?
serialize_int32(out, boost::distance(cells));
write<int32_t>(out, boost::distance(cells));
for (auto&& kv : cells) {
auto&& k = kv.first;
auto&& v = kv.second;
writeb(k);
writek(k);
writeb(v.serialize());
writev(v.serialize());
}
return collection_mutation(type, ret);
}
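
The byte layout produced by `serialize_collection_mutation` in the hunk above (a one-byte tombstone flag, two 64-bit tombstone fields when the flag is set, a 32-bit cell count, then a length-prefixed key and value per cell) can be sketched with standard containers. This is an illustration only: the `toy_*` names are hypothetical, keys and values are opaque strings, and big-endian encoding is an assumption, since the diff does not show which byte order `write<>` uses.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct toy_tombstone {
    int64_t timestamp;
    int64_t deletion_time;
};

using toy_cells = std::vector<std::pair<std::string, std::string>>;

// Appends the low `n` bytes of `v`, most significant first. Big-endian is an
// assumption made for this sketch; the diff does not show the byte order.
inline void toy_put(std::vector<uint8_t>& out, uint64_t v, int n) {
    for (int shift = (n - 1) * 8; shift >= 0; shift -= 8) {
        out.push_back(static_cast<uint8_t>(v >> shift));
    }
}

// Writes a 32-bit length followed by the raw bytes.
inline void toy_put_blob(std::vector<uint8_t>& out, const std::string& s) {
    toy_put(out, s.size(), 4);
    out.insert(out.end(), s.begin(), s.end());
}

inline std::vector<uint8_t> toy_serialize(const std::optional<toy_tombstone>& tomb,
                                          const toy_cells& cells) {
    std::vector<uint8_t> out;
    out.push_back(tomb ? 1 : 0);
    if (tomb) {
        toy_put(out, static_cast<uint64_t>(tomb->timestamp), 8);
        toy_put(out, static_cast<uint64_t>(tomb->deletion_time), 8);
    }
    toy_put(out, cells.size(), 4);
    for (const auto& [k, v] : cells) {
        toy_put_blob(out, k);
        toy_put_blob(out, v);
    }
    return out;
}

// Same check as collection_mutation_view::is_empty() in the diff: no
// tombstone flag and a zero cell count. Buffers that are too short are
// reported as non-empty here, which is a choice made for this sketch.
inline bool toy_is_empty(const std::vector<uint8_t>& buf) {
    if (buf.size() < 5 || buf[0] != 0) {
        return false;
    }
    return buf[1] == 0 && buf[2] == 0 && buf[3] == 0 && buf[4] == 0;
}
```

With this layout an empty collection without a tombstone occupies 5 bytes, and a tombstone adds 16 bytes, which matches the `sizeof(int64_t) + sizeof(int64_t)` term in the size computation above.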
@@ -448,13 +391,12 @@ deserialize_collection_mutation(const abstract_type& type, collection_mutation_i
return visit(type, make_visitor(
[&] (const collection_type_impl& ctype) {
// value_comparator(), ugh
auto& type_info = ctype.value_comparator()->imr_state().type_info();
return deserialize_collection_mutation(in, [&type_info] (collection_mutation_input_stream& in) {
return deserialize_collection_mutation(in, [&ctype] (collection_mutation_input_stream& in) {
// FIXME: we could probably avoid the need for size
auto ksize = in.read_trivial<uint32_t>();
auto key = in.read(ksize);
auto vsize = in.read_trivial<uint32_t>();
auto value = atomic_cell_view::from_bytes(type_info, in.read(vsize));
auto value = atomic_cell_view::from_bytes(*ctype.value_comparator(), in.read(vsize));
return std::make_pair(key, value);
});
},
@@ -464,8 +406,7 @@ deserialize_collection_mutation(const abstract_type& type, collection_mutation_i
auto ksize = in.read_trivial<uint32_t>();
auto key = in.read(ksize);
auto vsize = in.read_trivial<uint32_t>();
auto value = atomic_cell_view::from_bytes(
utype.type(deserialize_field_index(key))->imr_state().type_info(), in.read(vsize));
auto value = atomic_cell_view::from_bytes(*utype.type(deserialize_field_index(key)), in.read(vsize));
return std::make_pair(key, value);
});
},

Some files were not shown because too many files have changed in this diff