Get all flush permits to serialize with any
ongoing flushes and to prevent further flushes
during table::clear, in particular when calling
discard_completed_segments for every table and
clearing the memtables in clear_and_add.
Fixes #10423
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit aae532a96b)
Dtest triggers the problem by:
1) creating table with LCS
2) disabling regular compaction
3) writing a few sstables
4) running maintenance compaction, e.g. cleanup
Once the maintenance compaction completes, the disengaged optional
_last_compacted_keys triggers an exception in notify_completion().
_last_compacted_keys is used by regular compaction for its round-robin
file picking policy. It stores the last compacted key for each level,
meaning it's irrelevant for any other compaction type.
Regular compaction is responsible for initializing it when it runs for
the first time to pick files. But with it disabled, notify_completion()
will find it uninitialized, therefore resulting in bad_optional_access.
To fix this, the procedure is skipped if _last_compacted_keys is
disengaged. Regular compaction, once re-enabled, will be able to
fill _last_compacted_keys by looking at metadata of the files.
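A minimal sketch of that guard, with the strategy's state simplified
(the real code lives in the leveled compaction strategy and stores
sstable keys, not ints):
```cpp
#include <optional>
#include <vector>

// Simplified stand-in for leveled_compaction_strategy: _last_compacted_keys
// is only engaged once regular compaction has run at least once.
struct lcs_sketch {
    std::optional<std::vector<int>> _last_compacted_keys; // last compacted key per level

    void notify_completion(unsigned level, int last_key) {
        if (!_last_compacted_keys) {
            return; // regular compaction disabled: skip instead of bad_optional_access
        }
        (*_last_compacted_keys)[level] = last_key;
    }
};
```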
compaction_test.py::TestCompaction::test_disable_autocompaction_doesnt_block_user_initiated_compactions[CLEANUP-LeveledCompactionStrategy]
now passes.
Fixes #10378.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#10508
(cherry picked from commit 8e99d3912e)
One user observed this assertion fail, but it's an extremely rare event.
The root cause - interlacing of processing STARTUP and OPTIONS messages -
is still there, but now it's harmless enough to leave it as is.
Fixes #10487
Closes #10503
(cherry picked from commit 603dd72f9e)
It seems that batch prepared statements always return false for
depends_on; this in turn renders the removal criteria of the
prepared statements cache always false, which results in the
queries not being evicted.
Here we change the function to return the true state, meaning
it will return true if one of the sub-queries is dependent
upon the keyspace and/or column family.
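A sketch of the corrected behavior, with the statement interface
reduced to the one call (the real signatures differ):
```cpp
#include <algorithm>
#include <memory>
#include <optional>
#include <string_view>
#include <vector>

struct statement_sketch {
    virtual ~statement_sketch() = default;
    virtual bool depends_on(std::string_view ks, std::optional<std::string_view> cf) const = 0;
};

// The batch depends on a keyspace/table iff any of its sub-statements does.
struct batch_statement_sketch : statement_sketch {
    std::vector<std::unique_ptr<statement_sketch>> _statements;

    bool depends_on(std::string_view ks, std::optional<std::string_view> cf) const override {
        return std::any_of(_statements.begin(), _statements.end(),
                           [&](const auto& s) { return s->depends_on(ks, cf); });
    }
};
```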
Fixes #10129
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
(cherry picked from commit 4eb0398457)
Purpose
Cql statements used to have two API functions, depends_on_keyspace and
depends_on_column_family. The former took as a parameter only a table
name, which makes no sense: there could be multiple tables with the
same name, each in a different keyspace, and it doesn't make sense to
generalize the test - i.e. to ask "Does a statement depend on any table
named XXX?"
In this change we unify the two calls into one - depends_on - which
takes a keyspace name and optionally also a table name; that way every
logical dependency test that makes sense is supported by a single API
call.
(cherry picked from commit bf50dbd35b)
Ref #10129
If we are redefining the log table, we need to ensure any dropped
columns are registered in "dropped_columns" table, otherwise clients will not
be able to read data older than now.
Includes unit test.
Should probably be backported to all CDC enabled versions.
Fixes #10473
Closes #10474
(cherry picked from commit 78350a7e1b)
Backport notes: removed cql-pytest test from the original commit, since
the cql-pytest framework on branch-4.5 is missing the `nodetool`
package.
There are two issues with current implementation of remove/remove_if:
1) If it happens concurrently with get_ptr(), the latter may still
populate the cache using value obtained from before remove() was
called. remove() is used to invalidate caches, e.g. the prepared
statements cache, and the expected semantic is that values
calculated from before remove() should not be present in the cache
after invalidation.
2) As long as there is any active pointer to the cached value
(obtained by get_ptr()), the old value from before remove() will be
still accessible and returned by get_ptr(). This can make remove()
have no effect indefinitely if there is persistent use of the cache.
One of the user-perceived effects of this bug is that some prepared
statements may not get invalidated after a schema change and still use
the old schema (until next invalidation). If the schema change was
modifying a UDT, this can cause statement execution failures. The CQL
coordinator will try to interpret bound values using the old set of
fields. If the driver uses the new schema, the coordinator will fail
to process the value with the following exception:
User Defined Type value contained too many fields (expected 5, got 6)
The patch fixes the problem by making remove()/remove_if() erase old
entries from _loading_values immediately.
The predicate-based remove_if() variant also has to invalidate values
which are concurrently loading, to be safe: the predicate cannot be
evaluated on values which are not ready. This may invalidate some
values unnecessarily, but I think it's fine.
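A minimal sketch of these semantics, with the loading map reduced to a
plain hash map (names and types are stand-ins for the real
_loading_values):
```cpp
#include <string>
#include <unordered_map>

struct entry_sketch {
    bool ready = false;   // still loading if false
    std::string value;
};

// Erase matching entries immediately; entries that are still loading are
// erased as well, since the predicate cannot be evaluated on them.
template <typename Pred>
void remove_if_sketch(std::unordered_map<std::string, entry_sketch>& loading_values, Pred pred) {
    for (auto it = loading_values.begin(); it != loading_values.end();) {
        if (!it->second.ready || pred(it->second.value)) {
            it = loading_values.erase(it); // invisible to subsequent get_ptr() right away
        } else {
            ++it;
        }
    }
}
```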
Fixes #10117
Message-Id: <20220309135902.261734-1-tgrabiec@scylladb.com>
(cherry picked from commit 8fa704972f)
The loading_shared_values/loading_cache iterator interface is
dangerous/fragile because an iterator doesn't "lock" the entry it
points to, and if there is a preemption point between acquiring a
non-end() iterator and dereferencing it, the corresponding cache entry
may have already been evicted (for whatever reason, e.g. cache size
constraints or expiration), and then dereferencing may end up in a
use-after-free - we don't have any protection against that in the
value_extractor_fn today.
And this is in addition to #8920.
So, instead of trying to fix the iterator interface, this patch kills
two birds with one stone: we ditch the iterator interface
completely and return value_ptr from find(...) instead - the same one
we return from loading_cache::get_ptr(...) asynchronous APIs.
A similar rework is done to loading_shared_values, which loading_cache
is based on: we drop the iterator interface and return
loading_shared_values::entry_ptr from find(...) instead.
loading_cache::value_ptr already takes care of "lock"ing the returned
value so that it would remain readable even if it's evicted from the
cache by the time one tries to read it. And of course it also takes
care of updating the last-read timestamp and moving the corresponding
item to the top of the MRU list.
Fixes #8920
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20210817222404.3097708-1-vladz@scylladb.com>
(cherry picked from commit 7bd1bcd779)
[avi: prerequisite to backporting #10117]
Said method has to evict all querier cache entries belonging to the
to-be-dropped table. This is already the case, but there was a window
where new entries could sneak in, causing a stale reference to the
table to be dereferenced later, when they are evicted due to TTL. This
window is now closed: the entries are evicted after the method has
waited for all ongoing operations on said table to stop.
Fixes: #10450
Closes #10451
* github.com:scylladb/scylla:
replica/database: drop_column_family(): drop querier cache entries after waiting for ops
replica/database: finish coroutinizing drop_column_family()
replica/database: make remove(const column_family&) private
(cherry picked from commit 7f1e368e92)
Fixes the case of make_room() invoked with last_chunk_capacity_deficit
but _size not in the last reserved chunk.
Found during code review, no known user impact.
Fixes #10363.
Message-Id: <20220411222605.641614-1-tgrabiec@scylladb.com>
(cherry picked from commit 01eeb33c6e)
[avi: make max_chunk_capacity() public for backport]
Protocol v4 added WRITE_FAILURE and READ_FAILURE. When running under v3
we downgrade these exceptions to WRITE_TIMEOUT and READ_TIMEOUT (since
the client won't understand the v4 errors), but we still send the new
error codes. This causes the client to become confused.
Fix by updating the error codes.
A better fix is to move the error code from the constructor parameter
list and hard-code it in the constructor, but that is left for a follow-up
after this minimal fix.
Fixes #5610.
Closes #10362
(cherry picked from commit 987e6533d2)
When a query contains IN restriction on its partition key,
it's currently not eligible for indexing. It was however
erroneously qualified as such, which led to fetching incorrect
results. This commit fixes the issue by not allowing such queries
to undergo indexing, and comes with a regression test.
Fixes #10300
Closes #10302
(cherry picked from commit c0fd53a9d7)
(cherry picked from commit ded169476b0498aeedeeca40a612f2cebb54348a)
Following up on a57c087c89,
compare_atomic_cell_for_merge should compare the ttl value in the
reverse order: when comparing two cells that are identical
in all attributes but their ttl, we want to keep the cell with the
smaller ttl value rather than the larger one, since it was written
at a later (wall-clock) time, and so would remain longer after it
expires, until purged after gc_grace seconds.
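A simplified sketch of the resulting ordering, written with the C++20
spaceship operator (the 4.5 backport itself retains pre-spaceship
comparison code, and real cells carry more attributes):
```cpp
#include <compare>
#include <cstdint>

struct cell_sketch {
    int64_t timestamp;
    int32_t expiry;
    int32_t ttl;
};

// All attributes compare in their natural direction except ttl, which is
// reversed: the cell with the smaller ttl (written later in wall-clock
// time) orders as greater and therefore wins the merge.
std::strong_ordering compare_for_merge(const cell_sketch& a, const cell_sketch& b) {
    if (auto c = a.timestamp <=> b.timestamp; c != 0) { return c; }
    if (auto c = a.expiry <=> b.expiry; c != 0) { return c; }
    return b.ttl <=> a.ttl; // reversed on purpose
}
```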
Fixes #10173
Test: mutation_test.test_cell_ordering, unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220306091913.106508-1-bhalevy@scylladb.com>
(cherry picked from commit a085ef74ff)
Unlike atomic_cell_or_collection::equals, compare_atomic_cell_for_merge
currently returns std::strong_ordering::equal if two cells are equal in
every way except their ttls.
The problem with that is that the cells' hashes are different, and this
will cause repair to keep trying to repair discrepancies caused by the
ttl being different.
This may be triggered by e.g. the spark migrator, which computes the
ttl from the expiry time by subtracting the current time from the
expiry time.
If the cell is migrated multiple times at different times, it will
generate cells that have the same expiry (by design) but different
ttl values.
Fixes #10156
Test: mutation_test.test_cell_ordering, unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302154328.2400717-1-bhalevy@scylladb.com>
(cherry picked from commit a57c087c89)
compare_atomic_cell_for_merge() was placed in database.cc, before
atomic_cell.cc existed. Move it to its correct place.
Closes #9889
(cherry picked from commit 6c53717a39)
[avi: 4.5 backport: retain pre-spaceship-operator code]
Currently any unhandled error during deferred shutdown
is rethrown in a noexcept context (in ~deferred_action),
generating a core dump.
The core dump is not helpful if the cause of the
error is "environmental", i.e. in the system, rather
than in scylla itself.
This change detects several such errors and calls
_Exit(255) to exit the process early, without leaving
a coredump behind. Otherwise, call abort() explicitly,
rather than letting terminate() be called implicitly
by the destructor exception handling code.
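A sketch of this policy, with the error classification reduced to one
exception type (the real change inspects several kinds of
"environmental" errors):
```cpp
#include <cstdlib>
#include <iostream>
#include <system_error>

void run_deferred_shutdown(void (*shutdown)()) {
    try {
        shutdown();
    } catch (const std::system_error& e) {
        // Environmental error: exit early, without leaving a coredump behind.
        std::cerr << "shutdown failed: " << e.what() << "\n";
        std::_Exit(255);
    } catch (...) {
        // Scylla bug: abort explicitly rather than letting the destructor's
        // exception handling reach terminate().
        std::abort();
    }
}
```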
Fixes #9573
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220227101054.1294368-1-bhalevy@scylladb.com>
(cherry picked from commit 132c9d5933)
Currently, the view will not be updated, because the streaming reason
is set to streaming::stream_reason::rebuild. On the receiver side, only
streaming with the reason streaming::stream_reason::repair will trigger
a view update.
Change the stream reason to repair to trigger view updates for load and
stream. This makes load_and_stream behave the same as nodetool refresh.
Note: however, this is not very efficient.
Consider RF = 3, sst1, sst2, sst3 from the older cluster. When sst1 is
loaded, it streams to 3 replica nodes, if we generate view updates, we
will have 3 view updates for this replica (each of the peer nodes finds
its peer and writes the view update to peer). After loading sst2 and
sst3, we will have 9 view updates in total for a single partition.
If we create the view after the load and stream process, we will only
have 3 view updates for a single partition.
Fixes #9205
Closes #9213
(cherry picked from commit eaf4d2afb4)
Add the missing partition-key validation in INSERT JSON statements.
Scylla, following the lead of Cassandra, forbids an empty-string partition
key (please note that this is not the same as a null partition key, and
that null clustering keys *are* allowed).
Trying to INSERT, UPDATE or DELETE a partition with an empty string as
the partition key fails with a "Key may not be empty" error. However, we had a
loophole - you could insert such empty-string partition keys using an
"INSERT ... JSON" statement.
The problem was that the partition key validation was done in one place -
`modification_statement::build_partition_keys()`. The INSERT, UPDATE and
DELETE statements all inherited this same method and got the correct
validation. But the INSERT JSON statement - insert_prepared_json_statement -
overrode the build_partition_keys() method and this override forgot to call
the validation function. So in this patch we add the missing validation.
Note that the validation function checks for more than just empty strings -
there is also a length limit for partition keys.
This patch also adds a cql-pytest reproducer for this bug. Before this
patch, the test passed on Cassandra but failed on Scylla.
Reported by @FortTell
Fixes #9853.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220116085216.21774-1-nyh@scylladb.com>
(cherry picked from commit 8fd5041092)
Currently the exception handling code of feed_writer() assumes
consume_end_of_stream() doesn't throw. This is false, and an exception
from said method can currently lead to an unclean destruction of the
writer and reader. Fix by handling exceptions from
consume_end_of_stream() too.
Closes #10147
We found that the monitor mode of mdadm does not work on RAID0; this
is not a bug but expected behavior, according to a RHEL developer.
Therefore, we should revert the workaround patch which downgrades
mdadm, and stop enabling mdmonitor, since we always use RAID0.
See #9540
Closes #10077
DynamoDB allows an UpdateItem operation "REMOVE x.y" when a map x
exists in the item, but x.y doesn't - the removal silently does
nothing. Alternator incorrectly generated an error in this case,
and unfortunately we didn't have a test for this case.
So in this patch we add the missing test (which fails on Alternator
before this patch - and passes on DynamoDB) and then fix the behavior.
After this patch, "REMOVE x.y" will remain an error if "x" doesn't
exist (saying "document paths not valid for this item"), but if "x"
exists and is a map, but "x.y" doesn't, the removal will silently
do nothing and will not be an error.
Fixes #10043.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220207133652.181994-1-nyh@scylladb.com>
(cherry picked from commit 9982a28007)
If the Docker startup script is passed both "--alternator-port" and
"--alternator-https-port", a combination which is supposed to be
allowed, it passes to Scylla the "--alternator-address" option twice.
This isn't necessary, and worse - not allowed.
So this patch fixes the scyllasetup.py script to only pass this
parameter once.
Fixes #10016.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220202165814.1700047-1-nyh@scylladb.com>
(cherry picked from commit cb6630040d)
We found that the monitor mode of mdadm does not work on RAID0; this
is not a bug but expected behavior, according to a RHEL developer.
Therefore, we should stop enabling mdmonitor when RAID0 is specified.
Fixes #9540
----
This reverts 0d8f932 and introduces the correct fix.
Closes #9970
* github.com:scylladb/scylla:
scylla_raid_setup: use mdmonitor only when RAID level > 0
Revert "scylla_raid_setup: workaround for mdmonitor.service issue on CentOS8"
(cherry picked from commit df22396a34)
The "Authorization" HTTP header is used in DynamoDB API to sign
requests. Our parser for this header, in server::verify_signature(),
required the different components of this header to be separated by
a comma followed by a whitespace - but it turns out that in DynamoDB
both spaces and commas are optional - one of them is enough.
At least one DynamoDB client library - the old "boto" (which predated
boto3) - builds this header without spaces.
In this patch we add a test that shows that an Authorization header
with spaces removed works fine in DynamoDB but didn't work in
Alternator, and after this patch modifies the parsing code for this
header, the test begins to pass (and the other tests show that the
previously-working cases didn't break).
Fixes #9568
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211101214114.35693-1-nyh@scylladb.com>
(cherry picked from commit 56eb994d8f)
Although the DynamoDB API responses are JSON, additional conventions apply
to these responses - such as how error codes are encoded in JSON. For this
reason, DynamoDB uses the content type `application/x-amz-json-1.0` instead
of the standard `application/json` in its responses.
Until this patch, Scylla used `application/json` in its responses. This
unexpected content-type didn't bother any of the AWS libraries which we
tested, but it does bother the aiodynamo library (see HENNGE/aiodynamo#27).
Moreover, we should return the x-amz-json-1.0 content type for future
proofing: It turns out that AWS already defined x-amz-json-1.1 - see:
https://awslabs.github.io/smithy/1.0/spec/aws/aws-json-1_1-protocol.html
The 1.1 content type differs (only) in how it encodes error replies.
If one day DynamoDB starts to use this new reply format (it doesn't yet)
and DynamoDB libraries need to differentiate between the two
reply formats, Alternator had better return the right one.
This patch also includes a new test that the Content-Type header is
returned with the expected value. The test passes on DynamoDB, and
after this patch it starts to pass on Alternator as well.
Fixes #9554.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211031094621.1193387-1-nyh@scylladb.com>
(cherry picked from commit 6ae0ea0c48)
On CentOS8, mdmonitor.service does not work correctly when using
mdadm-4.1-15.el8.x86_64 and later versions.
Until we find a solution, let's pin the package version to an older one
which does not cause the issue (4.1-14.el8.x86_64).
Fixes #9540
Closes #9782
(cherry picked from commit 0d8f932f0b)
We have two identical "Truncated frame" errors, at:
* read_frame_size() in serialization_visitors.hh;
* cql_server::connection::read_and_decompress_frame() in
transport/server.cc;
When such an exception is thrown, it is impossible to tell where it was
thrown from, and it doesn't carry any further information
(beyond the basic information that its being thrown implies).
This patch solves both problems: it makes the exception messages unique
per location and it adds information about why it was thrown (the
expected vs. real size of the frame).
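A sketch of one of the resulting throw sites (the message text here is
illustrative, not the exact one used):
```cpp
#include <cstddef>
#include <fmt/format.h>
#include <stdexcept>

void check_frame_size(std::size_t expected, std::size_t actual) {
    if (actual < expected) {
        throw std::runtime_error(fmt::format(
            "Truncated frame in read_and_decompress_frame: expected {} bytes, got {}",
            expected, actual));
    }
}
```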
Ref: #9482
Closes #9520
(cherry picked from commit 9ec55e054d)
The B-tree's insert_before() is a throwing operation; its caller
must account for that. When the rows_entry's collection was
switched to a B-tree, all the risky places were fixed by ee9e1045,
but a few places slipped under the radar.
In the cache_flat_mutation_reader there's a place where a C-pointer
is inserted into the tree, thus potentially leaking the entry.
In the partition_snapshot_row_cursor there are two places that not
only leak the entry, but also leave it in the LRU list. The latter
is quite nasty, because such an entry can be evicted; the eviction code
tries to get the rows_entry iterator from "this", but the hook happens
to be unattached (because the insertion threw) and fails the assert.
fixes: #9728
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit ee103636ac)
Both places get the C-pointer to the freshly allocated rows_entry,
insert it where needed and return the dereferenced pointer.
The C-pointer is going to become a smart pointer that would go out
of scope before the return. This change prepares for that by
constructing the ensure_result from the iterator that is returned
from inserting the entry.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 9fd8db318d)
Ref #9728
Unfortunately, defining metrics in Scylla requires some code
duplication, with the metrics declared in one place but exported in a
different place in the code. When we duplicated this code in Alternator,
we accidentally dropped the first metric - for BatchGetItem. The metric
was accounted in the code, but not exported to Prometheus.
In addition to fixing the missing metric, this patch also adds a test
that confirms that the BatchGetItem metric increases when the
BatchGetItem operation is used. This test failed before this patch, and
passes with it. The test only currently tests this for BatchGetItem
(and BatchWriteItem) but it can be later expanded to cover all the other
operations as well.
Fixes #9406
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210929121611.373074-1-nyh@scylladb.com>
(cherry picked from commit 5cbe9178fd)
The shard reader can outlive its parent reader (the multishard reader).
This creates a problem for lifecycle management: readers take the range
and slice parameters by reference, and users keep these alive only as
long as the reader is alive. The shard reader outliving the top-level
reader means that any background read-ahead that it has to wait on will
potentially hold stale references to the range and the slice. This was
seen in the wild recently, when the evictable reader wrapped by the
shard reader hit a use-after-free while wrapping up a background
read-ahead.
This problem was solved by fa43d76 but any previous versions are
susceptible to it.
This patch solves this problem by having the shard reader copy and keep
the range and slice parameters in stable storage, before passing them
further down.
Fixes: #9719
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211202113910.484591-1-bdenes@scylladb.com>
We were silently ignoring INSERTs with NULL values for primary-key
columns, which Cassandra rejects. Fix it by rejecting any
modification_statement that would operate on empty partition or
clustering range.
This is the most direct fix, because range and slice are calculated in
one place for all modification statements. It covers not only NULL
cases, but also impossible restrictions like c>0 AND c<0.
Unfortunately, Cassandra doesn't treat all modification statements
consistently, so this fix cannot fully match its behavior. We err on
the side of tolerance, accepting some DELETE statements that Cassandra
rejects. We add a TODO for rejecting such DELETEs later.
Fixes #7852.
Tests: unit (dev), cql-pytest against Cassandra 4.0
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes #9286
(cherry picked from commit 1fdaeca7d0)
Before this patch when writing an index block, the sstables writer was
storing range tombstones that span the boundary of the block in order
of end bounds. This led to a range tombstone being ignored by a reader
if there was a row tombstone inside it.
This patch sorts the range tombstones based on start bound before
writing them to the index file.
The assumption is that writing an index block is rare, so we can afford
sorting the tombstones at that point. Additionally, this is a writer of
an old format; writing to it will be dropped in the next major
release, so it should rarely be used already.
Kudos to Kamil Braun <kbraun@scylladb.com> for finding the reproducer.
Test: unit(dev)
Fixes #9690
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit scylladb/scylla-enterprise@eb093afd6f)
Indexed queries are using paging over the materialized view
table. Results of the view read are then used to issue reads of the
base table. If base table reads are short reads, the page is returned
to the user and paging state is adjusted accordingly so that when
paging is resumed it will query the view starting from the row
corresponding to the next row in the base which was not yet
returned. However, paging state's "remaining" count was not reset, so
if the view read was exhausted the reading will stop even though the
base table read was short.
Fix by restoring the "remaining" count when adjusting the paging state
on short read.
Tests:
- index_with_paging_test
- secondary_index_test
Fixes #9198
Message-Id: <20210818131840.1160267-1-tgrabiec@scylladb.com>
(cherry picked from commit 1e4da2dcce)
We need stopwaitsecs just like we need TimeoutStopSec=900 on
scylla-server.service, to avoid a timeout on scylla-server shutdown.
Fixes #9485
Closes #9545
(cherry picked from commit c9499230c3)
In commit 11a8912093 (gossiper:
get_gossip_status: return string_view and make noexcept)
get_gossip_status returns a pointer to an endpoint_state in
endpoint_state_map.
After commit 425e3b1182 (gossip: Introduce
direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive
can yield in the middle of the function. It is possible that the
endpoint_state can be removed meanwhile, causing a use-after-free when
accessing it.
To fix, make a copy before we yield.
Fixes #8859
Closes #8862
(cherry picked from commit 7a32cab524)
The GKE metadata server does not provide the same metadata as GCE, so
we should not return True from is_gce().
Instead, try to fetch the machine-type from the metadata server, and
return False if it returns 404 Not Found.
Fixes #9471
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes #9582
(cherry picked from commit 9b4cf8c532)
There are two APIs for checking the repair status and they behave
differently in case the id is not found.
```
{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_async/system_auth?id=999", "duration": "1ms",
"status": 400, "bytes": 49, "dump": "HTTP/1.1 400 Bad
Request\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 400}"}
{"host": "192.168.100.11:10001", "method": "GET", "uri":
"/storage_service/repair_status?id=999&timeout=1", "duration": "0ms",
"status": 500, "bytes": 49, "dump": "HTTP/1.1 500 Internal Server
Error\r\nContent-Length: 49\r\nContent-Type: application/json\r\nDate:
Wed, 03 Nov 2021 10:49:33 GMT\r\nServer: Seastar
httpd\r\n\r\n{\"message\": \"unknown repair id 999\", \"code\": 500}"}
```
The correct status code is 400 as this is a parameter error and should
not be retried.
Returning status code 500 makes smarter http clients retry the request
in hopes of server recovering.
After this patch:
curl -X GET
'http://127.0.0.1:10000/storage_service/repair_async/system_auth?id=9999'
{"message": "unknown repair id 9999", "code": 400}
curl -X GET
'http://127.0.0.1:10000/storage_service/repair_status?id=9999'
{"message": "unknown repair id 9999", "code": 400}
Fixes #9576
Closes #9578
(cherry picked from commit f5f5714aa6)
Fixes #9103
The compare overload was declared as "bool" even though it is a tri-cmp.
This causes us to never use the speed-up shortcut (lessening the search
set), in turn meaning more overhead for collections.
Closes #9104
(cherry picked from commit 59555fa363)
The copying and comparing utilities for FragmentedView are not prepared
to deal with empty fragments in non-empty views, and will fall into an
infinite loop in such a case.
But data coming in a result_row_view can contain such fragments, so we
need to fix that.
Fixes #8398.
Closes #8397
(cherry picked from commit f23a47e365)
node_exporter is packaged with some random uid/gid in the tarball.
When extracting it as an ordinary user this isn't a problem, since
the uid/gid are reset to the current user, but that doesn't happen
under dbuild since `tar` thinks the current user is root. This causes
a problem if one wants to delete the build directory later, since it
becomes owned by some random user (see /etc/subuid)
Reset the uid/gid information so this doesn't happen.
Closes #9579
Fixes #9610.
(cherry picked from commit e1817b536f)
This patch fixes a bug in UpdateItem's ReturnValues=ALL_NEW, which in
some cases returned the OLD (pre-modification) value of some of the
attributes, instead of its NEW value.
The bug was caused by a confusion in our JSON utility function,
rjson::set(), which sounds like it can set any member of a map, but in
fact may only be used to add a *new* member - if a member with the same
name (key) already existed, the result is undefined (two values for the
same key). In ReturnValues=ALL_NEW we did exactly this: we started with
a copy of the original item, and then used set() to override some of the
members. This is not allowed.
So in this patch, we introduce a new function, rjson::replace(), which
does what we previously thought that rjson::set() does - i.e., replace a
member if it exists, or if not, add it. We call this function in
the ReturnValues=ALL_NEW code.
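A sketch of the replace-or-add semantics, written against plain
rapidjson (which rjson wraps); Scylla's actual rjson::replace() may
differ in detail:
```cpp
#include <rapidjson/document.h>
#include <utility>

// Unlike set(), which must only be used for members that do not exist yet,
// this replaces the member if present and adds it otherwise.
void replace_member(rapidjson::Document& doc, const char* key, rapidjson::Value&& v) {
    auto it = doc.FindMember(key);
    if (it != doc.MemberEnd()) {
        it->value = std::move(v); // replace in place: no duplicate keys
    } else {
        doc.AddMember(rapidjson::Value(key, doc.GetAllocator()),
                      std::move(v), doc.GetAllocator());
    }
}
```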
This patch also adds a test case that reproduces the incorrect ALL_NEW
results - and gets fixed by this patch.
In an upcoming patch, we should rename the confusingly-named set()
functions and audit all their uses. But we don't do this in this patch
yet. We just add some comments to clarify what set() does - but don't
change it, and just add one new function for replace().
Fixes #9542
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211104134937.40797-1-nyh@scylladb.com>
(cherry picked from commit b95e431228)
Consider the following procedure:
- n1, n2, n3
- n3 is down
- n1 runs nodetool removenode uuid_of_n3 to remove node n3 from the
cluster
- n1 is down in the middle of removenode operation
Node n1 will set n3 to removing gossip status during removenode
operation. Whenever existing nodes learn a node is in removing gossip
status, they will call restore_replica_count to stream data from other
nodes for the ranges n3 loses if n3 was removed from the cluster. If
the streaming fails, the streaming will sleep and retry. The current
max number of retry attempts is 5. The sleep interval starts at 60
seconds and increases 1.5 times per sleep.
This can leave the cluster in a bad state. For example, nodes can go
out of disk space if the streaming continues. We need a way to abort
such streaming attempts.
To abort the removenode operation and forcibly remove the node, users
can run `nodetool removenode force` on any existing node to move the
node from removing gossip status to removed gossip status. However,
restore_replica_count will not be aborted.
In this patch, a status checker is added in restore_replica_count, so
that once a node is in removed gossip status, restore_replica_count
will be aborted.
This patch is for older releases without the new NODE_OPS_CMD
infrastructure where such abort will happen automatically in case of
error.
Fixes #8651
Closes #8655
(cherry picked from commit 0858619cba)
Although the sstable name is part of the system.large_* records,
it is not printed in the log.
In particular, this is essential for the "too many rows" warning
that currently does not record a row in any large_* table,
so we can't correlate it with an sstable.
Fixes #9524
Test: unit(dev)
DTest: wide_rows_test.py
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211027074104.1753093-1-bhalevy@scylladb.com>
(cherry picked from commit a21b1fbb2f)
The everywhere_topology returns the number of nodes in the cluster as
the RF. This makes streaming only from the node losing the range
impossible, since no node loses the range after bootstrap.
Shortcut by streaming from all nodes in the local dc in case the
keyspace uses everywhere_topology.
Fixes #8503
(cherry picked from commit 3c36517598)
The sstable_list is destroyed right after the temporary
lw_shared_ptr<sstable_list> returned from `cf.get_sstables()`
is dereferenced.
Fixes #9138
Test: unit(dev)
DTest: resharding_test.py:ReshardingTombstones_with_DateTieredCompactionStrategy.disable_tombstone_removal_during_reshard_test (debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210804075813.42526-1-bhalevy@scylladb.com>
(cherry picked from commit 3ad0067272)
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.
This is fixed by using new conditions in cases where
we are using an index.
Fixes #8991.
Fixes #7708.
For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 54149242b4)
As a function returning a future, simplify
its interface by handling any exceptions and
returning an exceptional future instead of
propagating the exception.
In this specific case, throwing from advance_and_await()
will propagate through table::await_pending_* calls
short-circuiting a .finally clause in table::stop().
Also, mark as noexcept the methods of class table that call
advance_and_await and table::await_pending_ops, which depend on them.
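A minimal sketch of that convention (do_advance_and_await here is a
hypothetical throwing helper):
```cpp
#include <seastar/core/future.hh>

seastar::future<> do_advance_and_await(); // hypothetical, may throw synchronously

// Handle any synchronous exception and return an exceptional future instead
// of propagating it, so callers' .finally() chains still run.
seastar::future<> advance_and_await_sketch() noexcept {
    try {
        return do_advance_and_await();
    } catch (...) {
        return seastar::current_exception_as_future<>();
    }
}
```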
Fixes #8636
A followup patch will convert advance_and_await to a coroutine.
This is done separately to facilitate backporting of this patch.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210511161407.218402-1-bhalevy@scylladb.com>
(cherry picked from commit c0dafa75d9)
On Debian variants, mdmonitor.service cannot be enabled because it is
missing an [Install] section, so 'systemctl enable mdmonitor.service'
will fail and mdmonitor will not run after the system is restarted.
To force running the service, add Wants=mdmonitor.service on
var-lib-scylla.mount.
Fixes #8494
Closes #8530
(cherry picked from commit c9324634ca)
This series adds a wrapper for the default rjson allocator which throws on allocation/reallocation failures. It's done to work around several rapidjson (the underlying JSON parsing library) bugs - in a few cases, the malloc/realloc return value is not checked, which results in dereferencing a null pointer (or an arbitrary pointer computed as 0 + `size`, with the `size` parameter being provided by the user). The new allocator will throw an `rjson::error` if it fails to allocate or reallocate memory.
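A sketch of such a wrapper against rapidjson's allocator concept
(Scylla's version throws rjson::error; std::bad_alloc is used here to
stay self-contained):
```cpp
#include <cstddef>
#include <new>
#include <rapidjson/allocators.h>

class throwing_allocator {
    rapidjson::CrtAllocator _base;
public:
    static const bool kNeedFree = true;
    void* Malloc(std::size_t size) {
        void* p = _base.Malloc(size);
        if (size && !p) { throw std::bad_alloc(); } // never hand rapidjson a nullptr
        return p;
    }
    void* Realloc(void* orig, std::size_t orig_size, std::size_t new_size) {
        void* p = _base.Realloc(orig, orig_size, new_size);
        if (new_size && !p) { throw std::bad_alloc(); }
        return p;
    }
    static void Free(void* p) { rapidjson::CrtAllocator::Free(p); }
};
```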
This series comes with unit tests which check the new allocator behavior and also validate that an internal rapidjson structure which we indirectly rely upon (Stack) is not left in an invalid state after throwing. The last part is verified by the fact that its destructor runs without errors.
Fixes #8521
Refs #8515
Tests:
* unit(release)
* YCSB: inserting data similar to the one mentioned in #8515 - 1.5MB objects clustered in partitions 30k objects in size - nothing crashed during various YCSB workloads, but nothing also crashed for me locally before this patch, so it's not 100% robust
relevant YCSB workload config for using 1.5MB objects:
```yaml
fieldcount=150
fieldlength=10000
```
Closes#8529
* github.com:scylladb/scylla:
test: add a test for rjson allocation
test: rename alternator_base64_test to alternator_unit_test
rjson: add a throwing allocator
(cherry picked from commit c36549b22e)
Fixes #8749
If a table::clear() was issued while we were flushing a memtable,
the memtable is already gone from the list. We need to check this
before erasing; otherwise we get random memory corruption via
std::vector::erase.
v2:
* Make the interface more set-like (tolerate non-existence in erase).
Closes #8904
(cherry picked from commit 373fa3fa07)
We start background reclaim after we bootstrap, so bootstrap doesn't
benefit from it, and sees long stalls.
Fix by moving background reclaim initialization early, before
storage_service::join_cluster().
(storage_service::join_cluster() is quite odd in that main waits
for it synchronously, compared to everything else which is just
a background service that is only initialized in main).
Fixes #8473.
Closes #8474
(cherry picked from commit 935378fa53)
The current code std::move()-s the range tombstone into the consumer,
thus moving the tombstone's linkage to the containing list as well. As
a result, the original range tombstone itself leaks, as it leaves
the tree and cannot be reached on .clear(). Another danger is that
the iterator pointing to the tombstone becomes invalid while it's
then ++-ed to advance to the next entry.
The immediate fix is to keep the tombstone linked to the list while
moving.
fixes: #9207
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210825100834.3216-1-xemul@scylladb.com>
(cherry picked from commit b012040a76)
The extra status print is not needed in the log.
Fixes the following error:
ERROR 2021-08-10 10:54:21,088 [shard 0] storage_service -
service/storage_service.cc:3150 @do_receive: failed to log message:
fmt='send_meta_data: got error code={}, from node={}, status={}':
fmt::v7::format_error (argument not found)
Fixes #9183
Closes #9189
(cherry picked from commit ce8fd051c9)
To build an Ubuntu AMI with CPU scaling configuration, we need a
forced running mode for scylla_cpuscaling_setup, which runs setup
without checking for scaling_governor support.
See scylladb/scylla-machine-image#204
Closes #9326
(cherry picked from commit f928dced0c)
On Ubuntu, scaling_governor becomes powersave after a reboot, even when we configured cpufrequtils.
This is because of ondemand.service, which unconditionally changes scaling_governor to ondemand or powersave.
cpufrequtils starts before ondemand.service, so scaling_governor is overwritten by ondemand.service.
To configure scaling_governor correctly, we have to disable this service.
Fixes #9324
Closes #9325
(cherry picked from commit cd7fe9a998)
There will be unbounded growth of pending tasks if they are submitted
faster than they are retired. That can potentially happen if memtables
are frequently flushed too early. It was observed that this unbounded
growth caused task queue violations as the queue will be filled
with tons of tasks being reevaluated. By avoiding duplication in
pending task list for a given table T, growth is no longer unbounded
and consequently reevaluation is no longer aggressive.
Refs #9331.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210930125718.41243-1-raphaelsc@scylladb.com>
(cherry picked from commit 52302c3238)
The Red Hat packages were missing two things: first, the metapackage
wasn't dependent at all on the python3 package, and second, the
scylla-server package dependencies didn't contain a version as part
of the dependency, which can cause some problems during upgrade.
Doing both of the things listed here is a bit of overkill, as either
one of them separately would solve the problem described in #XXXX,
but both should be applied in order to express the correct concept.
Fixes #8829
Closes #8832
(cherry picked from commit 9bfb2754eb)
"
This mini-series fixes two loosely related bugs in the evictable
reader (related by both being around reader recreation). A unit test
is also added which reproduces both of them and checks that the fixes
indeed work. More details in the patches themselves.
This series replaces the two independent patches sent before:
* [PATCH v1] evictable_reader: always reset static row drop flag
* [PATCH v1] evictable_reader: relax partition key check on reader
recreation
As they depend on each other, it is easier to add a test if they are in
a series.
Fixes: #8923
Fixes: #8893
Tests: unit(dev, mutation_reader_test:debug)
"
* 'evictable-reader-recreation-more-bugs/v1' of https://github.com/denesb/scylla:
test: mutation_reader_test: add more test for reader recreation
evictable_reader: relax partition key check on reader recreation
evictable_reader: always reset static row drop flag
(cherry picked from commit 4209dfd753)
The node in this place is not yet attached to its parent, so
in btree::debug::yes (tests-only) mode the node::drop()'s parent
checks will access a null parent pointer.
However, in a non-testing runtime there's a chance that a linear
node fails to clone one of its keys and gets here. In this case
it will carry both the leftmost and rightmost flags and the assertion
in drop will fire.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 1d857d604a)
Ref #9248.
This PR changes the `can_send` function so that it looks at the `token_metadata` in order to tell if the destination node is in the ring. Previously, gossiper state was used for that purpose and required a relatively complicated condition to check. The new logic just uses `token_metadata::is_member` which reduces complexity of the `can_send` function.
Additionally, `storage_service` is slightly modified so that during a removenode operation the `token_metadata` is first updated and only then endpoint lifecycle subscribers are notified. This was done in order to prevent a race just like the one which happened in #5087 - hints manager is a lifecycle subscriber and starts a draining operation when a node is removed, and in order for draining to work correctly, `can_send` should keep returning true for that node.
Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)
- dtest(topology_test.py)
Closes #8387
* github.com:scylladb/scylla:
hints: clarify docstring comment for can_send
hints: use token_metadata to tell if node is in the ring
hints: slightly reogranize "if" statement in can_send
storage_service: release token_metadata lock before notify_left
storage_service: notify_left after token_metadata is replicated
(cherry picked from commit 307bd354d2)
Ref #5087.
When a failed-to-be-cloned node cleans itself up, it must also clear
all its child nodes. Plain destroy() doesn't do that; it only
frees the provided node.
fixes: #9248
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit d1a1a2dac2)
lw_shared_ptr must not be copied on a foreign shard.
Copying the vector on shard 0 increases the reference count of
lw_shared_ptr<sstable> elements that were created on other shards,
as seen in https://github.com/scylladb/scylla/issues/9278.
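The general rule behind the fix, as a sketch (types simplified, helper
names hypothetical): copy lw_shared_ptr instances - and thus touch
their refcounts - only on the shard that owns them, and move
foreign_ptr wrappers across shards.
```cpp
#include <functional>
#include <seastar/core/sharded.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/smp.hh>
#include <vector>

struct sstable_sketch {}; // stand-in for sstables::sstable

using sst_vector = std::vector<seastar::foreign_ptr<seastar::lw_shared_ptr<sstable_sketch>>>;

seastar::future<sst_vector>
collect_from(unsigned owner_shard,
             std::function<std::vector<seastar::lw_shared_ptr<sstable_sketch>>()> get_sstables) {
    return seastar::smp::submit_to(owner_shard, [g = std::move(get_sstables)] {
        sst_vector out;
        for (auto& sst : g()) {
            // Copy (and touch the refcount) on the owning shard only.
            out.push_back(seastar::make_foreign(sst));
        }
        return out;
    });
}
```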
Fixes #9278
DTest: migration_test.py:TestLoadAndStream_with_3_0_md.load_and_stream_increase_cluster_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210902084313.2003328-1-bhalevy@scylladb.com>
(cherry picked from commit 33f579f783)
Whenever a node_head_ptr is assigned to nil_root, the _backref inside it is
overwritten. But since nil_root is shared between shards, this causes severe
cache line bouncing. (It was observed to reduce the total write throughput
of Scylla by 90% on a large NUMA machine).
This backreference is never read anyway, so fix this bug by not writing it.
Fixes #9252
Closes #9246
(cherry picked from commit 126baa7850)
Introduce `raft` experimental option.
Adjust the tests accordingly to accommodate the new option.
It's not enabled by default when providing
`--experimental=true` config option and should be
requested explicitly via `--experimental-options=raft`
config option.
Hide the code related to `raft_group_registry` behind
the switch. The service object is still constructed
but no initialization is performed (`init()` is not
called) if the flag is not set.
Later, other raft-related things, such as raft schema
changes, will also use this flag.
Also, don't introduce a corresponding gossiper feature
just yet, because again, it should be done after the
raft schema changes API contract is stabilized.
This will be done in a separate series, probably related to
implementing the feature itself.
Tests: unit(dev)
Ref #9239.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210823121956.167682-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit 22794efc22)
Refs #9053
Flips the default for commitlog disk footprint hard limit enforcement
to off, due to observed latency stalls with stress runs. Instead, adds
an optional flag "commitlog_use_hard_size_limit" which can be turned on
to in fact enforce it.
A sort of tape-and-string fix until we can properly tweak the balance
between cl & sstable flush rate.
Closes #9195
(cherry picked from commit 3633c077be)
On some environments, /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
does not exist even if CPU scaling is supported.
Instead, /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor is
available on both environments, so we should switch to it.
Fixes #9191
Closes #9193
(cherry picked from commit e5bb88b69a)
The reader is used by load and stream to read sstables from the upload
directory, which are not guaranteed to belong to the local shard.
Use make_range_sstable_reader instead of
make_local_shard_sstable_reader.
Tests:
backup_restore_tests.py:TestBackupRestore.load_and_stream_using_snapshot_test
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_2_test
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_1_test
migration_test.py:TestLoadAndStream.load_and_stream_asymmetric_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_decrease_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_frozen_pk_test
migration_test.py:TestLoadAndStream.load_and_stream_increase_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_primary_replica_only_test
Fixes #9173
Closes #9185
(cherry picked from commit 040b626235)
The compaction manager can start tons of compactions of fully expired
sstables in parallel, which may consume a significant amount of
resources.
This problem is caused by the weight being released too early in
compaction: after the data is all compacted, but before the table is
called to update its state, e.g. replacing sstables and so on.
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
releases all the resources.
This happens because, with expired sstable, compaction will release weight
faster than it will update table state, as there's nothing to be compacted.
With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can easily be worse depending on the amount of sstables
with fully expired data, across all tables. This high parallelism can happen
only with a couple of tables, if there are many time windows with expired data,
as they can be compacted in parallel.
Prior to 55a8b6e3c9, weight was released earlier in compaction, before
last sstable was sealed, but right now, there's no need to release weight
earlier. Weight can be released in a much simpler way, after the compaction is
actually done. So such compactions will be serialized from now on.
Fixes #8710.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>
[avi: drop now unneeded storage_service_for_tests]
(cherry picked from commit a7cdd846da)
Partition count is of type size_t, but we use std::plus<int>
to reduce the partition-count values of the various column families.
This patch changes the argument of std::plus to the right type.
Using std::plus<int> for size_t compiles but does not work as expected:
for example, plus<int>(2147483648LL, 1LL) = -2147483647, while the code
would probably want 2147483649.
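A self-contained illustration of the overflow (summing a vector the way
the reduce did):
```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

int main() {
    std::vector<std::size_t> counts{2147483648ULL, 1ULL};
    // Wrong: each operand is truncated to int, so the sum wraps negative.
    auto bad = std::accumulate(counts.begin(), counts.end(),
                               std::size_t(0), std::plus<int>{});
    // Right: the functor operates on the actual element type.
    auto good = std::accumulate(counts.begin(), counts.end(),
                                std::size_t(0), std::plus<std::size_t>{});
    return bad == good; // 0 on 64-bit targets: the results differ
}
```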
Fixes #9090
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes #9074
(cherry picked from commit 90a607e844)
The recent commit 0ef0a4c78d added helpful
error messages in case an index cannot be created because the intended
name of its materialized view is already taken - but accidentally broke
the "CREATE INDEX IF NOT EXISTS" feature.
The checking code was correct, but in the wrong place: we need to first
check whether the index already exists and "IF NOT EXISTS" was chosen -
and only do this new error checking if this is not the case.
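A sketch of the corrected ordering (the helpers here are hypothetical
stand-ins for the real checks):
```cpp
#include <stdexcept>
#include <string>

bool index_exists(const std::string& ks, const std::string& index_name);             // hypothetical
void check_view_name_conflict(const std::string& ks, const std::string& index_name); // hypothetical

void validate_create_index(const std::string& ks, const std::string& index_name,
                           bool if_not_exists) {
    if (index_exists(ks, index_name)) {
        if (if_not_exists) {
            return; // CREATE INDEX IF NOT EXISTS: silently succeed
        }
        throw std::invalid_argument("index " + index_name + " already exists");
    }
    // Only now run the new, more helpful name-conflict diagnostics.
    check_view_name_conflict(ks, index_name);
}
```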
This patch also includes a cql-pytest test for reproducing this bug.
The bug is also reproduced by the translated Cassandra unit tests
cassandra_tests/validation/entities/secondary_index_test.py::
testCreateAndDropIndex
and this is how I found this bug. After this patch, all these tests
pass.
Fixes #8717.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210526143635.624398-1-nyh@scylladb.com>
(cherry picked from commit 97e827e3e1)
When an index is created without an explicit name, a default name
is chosen. However, there was no check if a table with conflicting
name already exists. The check is now in place and if any conflicts
are found, a new index name is chosen instead.
When an index is created *with* an explicit name and a conflicting
regular table is found, index creation should simply fail.
This series comes with a test.
Fixes #8620
Tests: unit(release)
Closes #8632
* github.com:scylladb/scylla:
cql-pytest: add regression tests for index creation
cql3: fail to create an index if there is a name conflict
database: check for conflicting table names for indexes
(cherry picked from commit cee4c075d2)
Refs #8418
Broadcast can (apparently) be an address not actually on the machine,
but on the other side of a NAT. Thus binding the local side of an
outgoing connection to it will fail.
Bind instead to listen_address (or broadcast, if listen_to_broadcast);
this will require routing + NAT so that, to the node connected to, the
connection looks like it comes from the broadcast address, to allow
the connection (if using partial encryption).
Note: this is verified only somewhat limitedly. I would suggest
verifying various multi rack/dc setups before relying on it.
Closes #8974
(cherry picked from commit b8b5f69111)
The repair parallelism is calculated from the amount of memory
allocated to repair and the memory usage per repair instance. Currently,
it does not consider memory bloat issues (e.g., issue #8640) which
cause repair to use more memory and cause std::bad_alloc.
Be more conservative when calculating the parallelism, to avoid repair
using too much memory.
Fixes #8641
Closes #8652
(cherry picked from commit b8749f51cb)
To avoid restarting scylla-server.service unexpectedly, drop BindsTo=
from scylla-fstrim.timer.
Fixes #8921
Closes #8973
(cherry picked from commit def81807aa)
Fixes #8952
In 5ebf5835b0 we added a segment
prune after flushing, to deal with deadlocks in shutdown.
This means that calls that issue sync/flush-like ops "for-all"
need to operate on a defensive copy of the list.
Closes#8980
(cherry picked from commit ce45ffdffb)
systemd unit name of scylla-node-exporter is
scylla-node-exporter.service, not node-exporter.service.
Fixes #8966
Closes #8967
(cherry picked from commit f19ebe5709)
Listing /etc/systemd/system/*.mount as ghost files seems incorrect,
since the user may want to keep using the RAID volume / coredump
directory after uninstalling Scylla, or may want to upgrade to the
enterprise version.
Also, we mixed two types of files as ghost files; they should be
handled differently:
1. automatically generated by the postinst scriptlet
2. generated by user-invoked scylla_setup
The package should remove only 1, since 2 is generated by user decision.
However, just dropping .mount from the %files section causes another
problem: rpm will remove these files during upgrade instead of on
uninstall (#8924).
To fix both problems, specify the .mount files as "%ghost %config".
This keeps the files on both package upgrade and package removal.
See scylladb/scylla-enterprise#1780
Closes #8810
Closes #8924
Closes #8959
(cherry picked from commit f71f9786c7)
Commit 5adb8e555c marked ::feed_hash() and a visitor lambda of
digester::feed_hash() as noexcept. This was quite reckless, as the
appending_hash<>::operator()s called by ::feed_hash() are not all
marked noexcept. In particular, appending_hash<row>() is not
such and seems to throw.
The original intent of the mentioned commit was to facilitate the
partition_hasher in the repair/ code. The hasher itself had been
removed by 0af7a22c21, so it no longer needs the feed_hash-s to be
noexcept.
The fix is to inherit noexcept from the called hashers, but for the
digester::feed_hash part the noexcept is just removed until clang
compilation bug #50994 is fixed.
fixes: #8983
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210706153608.4299-1-xemul@scylladb.com>
(cherry picked from commit 63a2fed585)
With strict mode, it could happen that an sstable alone in level 0 is
selected for offstrategy compaction, which means that we could run
into an infinite reshape process.
This is fixed by respecting the offstrategy threshold. A unit test is
added.
Fixes #8573.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210506181324.49636-1-raphaelsc@scylladb.com>
(cherry picked from commit 8480839932)
The wrong comparison operator is used when checking for overlap. It
would miss an overlap when the last key of an sstable is equal to the
first key of the sstable that comes next in the set, which is sorted
by first key.
Fixes #8531.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 39ecddbd34)
Fixes #8270
If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have disk usage that is below the threshold, yet still get a disk footprint that is over the limit, causing new segment allocation to stall.
We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
4.) (v2) Must ensure discard/delete routines are executed. Because we can race with background disk syncs, we may need to
issue segment prunes from end_flush() so we wake up actual file deletion/recycling
5.) (v2) Shutdown must ensure discard/delete is run after we've disabled background task etc, otherwise we might fail waking up replenish and get stuck in gate
6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For same reason as above.
7.) (v3) Signal recycle/delete queues/promise on shutdown (with recognized marker) to handle edge case where we only have a single (allocating) segment in the list, and cannot wake up replenisher in any more civilized way.
Also fix an edge case (for tests), when we have too few segments to have an active one (i.e. we need to flush everything).
New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test.
Closes #8764
* github.com:scylladb/scylla:
commitlog_test: Add test case for usage/disk size threshold mismatch
commitlog_test: Improve test assertion
commitlog: Add waitable future for background sync/flush
commitlog: abort queues on shutdown
commitlog: break out "abort" calls into member functions
commitlog: Do explicit discard+delete in shutdown
commitlog: Recycle or not should not depend on shutdown state
commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
commitlog: Flush all segments if we only have one.
commitlog: Always force flush if segment allocation is waiting
commitlog: Include segment wasted (slack) size in footprint check
commitlog: Adjust (lower) usage threshold
(cherry picked from commit 14252c8b71)
Shutdown must never fail, otherwise it may cause hangs
as seen in https://github.com/scylladb/scylla/issues/8577.
This change wraps the file created in `allocate_segment_ex` in `make_checked_file` so that scylla will abort when failing to write to the commitlog files.
In case other errors are seen during shutdown, just log them and continue with shutting down to prevent scylla from hanging.
Fixes #8577
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #8578
* github.com:scylladb/scylla:
commitlog: segment_manager::shutdown: abort on errors
commitlog: allocate_segment_ex: make_checked_file
(cherry picked from commit 48ff641f67)
This was triggered by the test_total_space_limit_of_commitlog dtest.
When it passes a very large commitlog_segment_size_in_mb (1/6th of the
free memory size, in mb), the segment_manager constructor limits
max_size to std::numeric_limits<position_type>::max(), which is
0xffffffff.
This causes allocate_segment_ex to loop forever when writing the
segment file, since `dma_write` returns 0 when the count is unaligned
(seen 4095).
The fix here is to select a slightly smaller max_size that is aligned
down to a multiple of 1MB.
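A sketch of that selection (position_type here stands in for the
commitlog's 32-bit offset type):
```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

using position_type = uint32_t; // as in the commitlog's segment offsets

uint64_t clamp_segment_size(uint64_t requested) {
    constexpr uint64_t MB = uint64_t(1) << 20;
    uint64_t max_size = std::min<uint64_t>(requested, std::numeric_limits<position_type>::max());
    return max_size / MB * MB; // align down so segment writes stay dma-aligned
}
```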
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210407121059.277912-1-bhalevy@scylladb.com>
(cherry picked from commit 705f9c4f79)
Backport of 6726fe79b6.
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.
Fixes#8830
Tests: unit(release),
dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
Closes#8834
* backport-6726fe7:
view: fix use-after-move when handling view update failures
db,view: explicitly move the mutation to its helper function
db,view: pass base token by value to mutate_MV
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.
Refs #8830
Tests: unit(release),
dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
The `apply_to_remote_endpoints` helper function used to take
its `mut` parameter by reference, but then moved the value from it,
which is confusing and prone to errors. Since the value is moved-from,
let's pass it to the helper function as rvalue ref explicitly.
The base token is passed cross-continuations, so the current way
of passing it by const reference probably only works because the token
copying is cheap enough to optimize the reference out.
Fix by explicitly taking the token by value.
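A self-contained sketch of the pattern with stand-in types (the real signatures live in the view update code):
```
#include <string>
#include <utility>

struct token { long t; };  // stand-in for dht::token

// Take the consumed value by rvalue reference so the move is explicit at
// the call site, and take the token by value so it stays valid across
// continuations.
void apply_to_remote_endpoints(std::string&& mut, token base_token) {
    std::string owned = std::move(mut);  // ownership transfer is visible
    (void)owned; (void)base_token;
}

int main() {
    std::string mut = "view update";
    apply_to_remote_endpoints(std::move(mut), token{42});
    // `mut` is moved-from here; any later use is easy to flag in review.
}
```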
This backport fixes the following issue:
Cassandra stress fails to achieve consistency during replace node operation #8013
without the NODE_OPS_CMD infrastructure.
The commit c82250e0cf (gossip: Allow deferring advertise of local node to be up), which fixes
"During replace node operation - replacing node is used to respond to read queries" #7312,
is already present in the 4.5 branch.
Closes#8703
* github.com:scylladb/scylla:
storage_service: Delay update pending ranges for replacing node
gossip: Add helper to wait for a node to be up
In issue #5021 we noticed that the equality check in Alternator's condition
expressions needs to handle sets differently - we need to compare the set's
elements ignoring their order. But the implementation we added to fix that
issue was only correct when the entire attribute was a set... In the
general case, an attribute can be a nested document, with only some
inner set. The equality-checking function needs to traverse this nested
document, and compare the sets inside it as appropriate. This is what
we do in this patch.
This patch also adds a new test comparing equality of a nested document with
some inner sets. This test passes on DynamoDB, failed on Alternator before
this patch, and passes with this patch.
Refs #5021
Fixes#8514
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419184840.471858-1-nyh@scylladb.com>
(cherry picked from commit dae7528fe5)
In issue #5021 we noted that Alternator's equality operator needs to be
fixed for the case of comparing two sets, because the equality check needs
to take into account the possibility of different element order.
Unfortunately, we fixed only the equality check operator, but forgot there
is also an inequality operator!
So in this patch we fix the inequality operator, and also add a test for
it that was previously missing.
The implementation of the inequality operator is trivial - it's just the
negation of the equality test. Our pre-existing tests verify that this is
the correct implementation (e.g., if attribute x doesn't exist, then "x = 3"
is false but "x <> 3" is true).
Refs #5021
Fixes#8513
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419141450.464968-1-nyh@scylladb.com>
(cherry picked from commit 50f3201ee2)
When a condition expression (ConditionExpression, FilterExpression, etc.)
checks for equality of two item attributes, i.e., "x = y", and when one of
these attributes was missing we correctly returned false.
However, we also need to return false when *both* attributes are missing in
the item, because this is what DynamoDB does in this case. In other words
an unset attribute is never equal to anything - not even to another unset
attribute. This was not happening before this patch:
When x and y were both missing attributes, Alternator incorrectly returned
true for "x = y", and this patch fixes this case. It also fixes "x <> y"
which should be true when both x and y are unset (but was false
before this patch).
The other comparison operators - <, <=, >, >=, BETWEEN, were all
implemented correctly even before this patch.
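A minimal sketch of the fixed rule, with stand-in types (the helper name is hypothetical):
```
#include <optional>
#include <string>

using attr = std::optional<std::string>;  // stand-in for an item attribute

// Equality requires both attributes to be present; an unset attribute is
// never equal to anything, not even to another unset attribute.
bool check_eq(const attr& x, const attr& y) {
    if (!x || !y) {
        return false;
    }
    return *x == *y;
}

int main() {
    attr x, y;                    // both missing from the item
    bool eq = check_eq(x, y);     // false (was true before the fix)
    bool ne = !check_eq(x, y);    // "x <> y" is true, as on DynamoDB
    (void)eq; (void)ne;
}
```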
This patch also includes tests for all the two-unset-attribute cases of
all the operators listed above. As usual, we check that these tests pass
on both DynamoDB and Alternator to confirm our new behavior is the correct
one - before this patch, two of the new tests failed on Alternator and
passed on DynamoDB.
Fixes#8511
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210419123911.462579-1-nyh@scylladb.com>
(cherry picked from commit 46448b0983)
This is a follow up change to #8512.
Let's add the aio conf file during the scylla installation process and make sure
we also remove this file when uninstalling Scylla.
As per Avi Kivity's suggestion, let's set aio value as static
configuration, and make it large enough to work with 500 cpus.
Closes#8650
Refs: #8713
(cherry picked from commit dd453ffe6a)
On several instance types in AWS and Azure, we get the following failure
during scylla_io_setup process:
```
ERROR 2021-04-14 07:50:35,666 [shard 5] seastar - Could not setup Async
I/O: Resource temporarily unavailable. The most common cause is not
enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that
number or reducing the amount of logical CPUs available for your
application
```
We have scylla_prepare:configure_io_slots() running before
scylla-server.service starts, but scylla_io_setup takes place before that.
1) Let's move configure_io_slots() to scylla_util.py, since both
scylla_io_setup and scylla_prepare import functions from it
2) Clean up scylla_prepare, since we don't need the same function twice
3) Let's use configure_io_slots() during scylla_io_setup to avoid such
failures
Fixes: #8587
Closes#8512
Refs: #8713
(cherry picked from commit 588a065304)
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node is changed to postpone
the responding of gossip echo message to avoid other nodes sending read
requests to the replacing node. It works as following:
1) replacing node does not respond echo message to avoid other nodes to
mark replacing node as alive
2) replacing node advertises hibernate state so other nodes knows
replacing node is replacing
3) replacing node responds echo message so other nodes can mark
replacing node as alive
This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.
For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)
```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)
c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```
To solve this problem for older releases without the patch "repair:
Switch to use NODE_OPS_CMD for replace operation", a minimum fix is
implemented in this patch. Once existing nodes learn the replacing node
is in HIBERNATE state, they record it as replacing, but add it to the
pending list only after the replacing node is marked as alive.
With this patch, when the existing nodes start to write to the replacing
node, the replacing node is already alive.
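A stand-in sketch of the ordering (illustrative types; the real logic lives in gossip/storage_service):
```
#include <string>
#include <unordered_set>

struct replace_state {
    std::unordered_set<std::string> replacing;  // learned via HIBERNATE
    std::unordered_set<std::string> pending;    // targets of pending writes

    void on_hibernate(const std::string& ep) {
        replacing.insert(ep);          // known to be replacing, not a target yet
    }
    void on_alive(const std::string& ep) {
        if (replacing.count(ep)) {
            pending.insert(ep);        // writes start only once it is alive
        }
    }
};
```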
Tests: replace_address_test.py:TestReplaceAddress.replace_node_same_ip_test + manual test
Fixes: #8013
Closes#8614
(cherry picked from commit e4872a78b5)
Currently, var-lib-scylla.mount may fail because it can start before the
MDRAID volume is initialized.
We may be able to add "After=dev-disk-by\x2duuid-<uuid>.device" to wait for the
device to become available, but the systemd manual says it automatically
configures the dependency for a mount unit when we specify the filesystem path
by "absolute path of a device node".
So we need to replace What=UUID=<uuid> with What=/dev/disk/by-uuid/<uuid>.
Fixes#8279
Closes#8681
(cherry picked from commit 3d307919c3)
Currently, the unified installer does not apply the correct file security context
while copying files, which causes a permission error in scylla-server.service.
We should apply the default file security context while copying files, using the
'-Z' option of /usr/bin/install.
Also, because install -Z requires a normalized path to apply the correct security
context, use 'realpath -m <PATH>' on the path variables in the script.
Fixes#8589
Closes#8602
(cherry picked from commit 60c0b37a4c)
Since we have added scylla-node-exporter, we need to run 'install -d'
for the systemd and sysconfig directories before copying files.
Fixes#8663
Closes#8664
(cherry picked from commit 6faa8b97ec)
When recreating the paging state from an indexed query,
a bunch of panic checks were introduced to make sure that
the code is correct. However, one of the checks is too eager -
namely, it throws an error if the base column type is not equal
to the view column type. It usually works correctly, unless the
base column type is a clustering key with DESC clustering order,
in which case the type is actually "reversed". From the point of view
of the paging state generation it's not important, because both
types deserialize in the same way, so the check should be less
strict and allow the base type to be reversed.
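A toy sketch of the relaxed comparison (illustrative structure; the real check compares Scylla's type objects):
```
struct type_desc {
    int id;
    const type_desc* reversed_inner = nullptr;  // set for DESC (reversed) types
};

// Both types deserialize the same way, so compare the underlying types
// instead of requiring exact equality.
bool compatible(const type_desc& base, const type_desc& view) {
    const type_desc& b = base.reversed_inner ? *base.reversed_inner : base;
    const type_desc& v = view.reversed_inner ? *view.reversed_inner : view;
    return b.id == v.id;
}
```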
Tests: unit(release), along with the additional test case
introduced in this series; the test also passes
on Cassandra
Fixes#8666
Closes#8667
* github.com:scylladb/scylla:
test: add a test case for paging with desc clustering order
cql3: relax a type check for index paging
(cherry picked from commit 593ad4de1e)
run_custom_job() was swallowing all exceptions, which is definitely
wrong because failure in a resharding or reshape would be incorrectly
interpreted as success, which means upper layer will continue as if
everything is ok. For example, ignoring a failure in resharding could
result in a shared sstable being left unresharded, so when that sstable
reaches a table, scylla would abort as shared ssts are no longer
accepted in the main sstable set.
Let's allow the exception to be propagated, so failure will be
communicated, and resharding and reshape will be all or nothing, as
originally intended.
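A minimal sketch of the after behavior (stand-in code; the real function is part of the compaction machinery):
```
#include <cstdio>
#include <stdexcept>

// After the fix, the wrapper no longer swallows exceptions, so a failed
// reshard/reshape is reported instead of being treated as success.
void run_custom_job(void (*job)()) {
    job();  // any exception now propagates to the caller
}

int main() try {
    run_custom_job(+[] { throw std::runtime_error("resharding failed"); });
} catch (const std::exception& e) {
    std::puts(e.what());  // the failure is visible; upper layers can react
    return 1;
}
```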
Fixes#8657.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210515015721.384667-1-raphaelsc@scylladb.com>
(cherry picked from commit 10ae77966c)
The timestamp_type is an int64_t, so it has to be explicitly
initialized before use.
This missing initialization prevented the major compaction
from happening when a time window finishes, as described in #8569.
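A sketch of the bug class (illustrative member and check, not the actual TWCS code):
```
#include <cstdint>

struct window_state {
    int64_t max_timestamp = 0;  // was left uninitialized before the fix
    bool window_finished(int64_t now, int64_t window_span) const {
        // with garbage in max_timestamp this comparison could silently
        // misfire, blocking the major compaction
        return now - max_timestamp > window_span;
    }
};
```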
Fixes#8569
Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>
Closes#8590
(cherry picked from commit 15f72f7c9e)
Tracing is created in two steps and destroyed in two steps too.
The 2nd step doesn't have a corresponding stop part, so here
it is -- defer the tracing stop after it is started.
But keep in mind that tracing is also shut down on
drain, so the stop should handle this.
Fixes#8382
tests: unit(dev), manual(start-stop, aborted-start)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210331092221.1602-1-xemul@scylladb.com>
(cherry picked from commit 887a1b0d3d)
This is a backport of 8aaa3a7bb8 to <= branch-4.5. The main conflicts were around Benny's reader close series (fa43d7680), but it also turned out that an additional patch (2f1d65ca11) had to be backported as well, to make sure admission on signaling resources doesn't deadlock.
Refs: https://github.com/scylladb/scylla/issues/8493
Closes#8558
* github.com:scylladb/scylla:
test: mutation_reader_test: add test_reader_concurrency_semaphore_forward_progress
test: mutation_reader_test: add test_reader_concurrency_semaphore_readmission_preserves_units
reader_concurrency_semaphore: add dump_diagnostics()
reader_permit: always forward resources
test: multishard_mutation_query_test: fuzzy-test: don't consume resource up-front
reader_concurrency_semaphore: make admission conditions consistent
This unit test checks that the semaphore doesn't get into a deadlock
when contended, in the presence of many memory-only reads (that don't
wait for admission). This is tested by simulating the 3 kind of reads we
currently have in the system:
* memory-only: reads that don't pass admission and only own memory.
* admitted: reads that pass admission.
* evictable: admitted reads that are furthermore evictable.
The test creates and runs a large number of these reads in parallel,
read kinds being selected randomly, then creates a watchdog which
kills the test if no progress is being made.
(cherry picked from commit 45d580f056)
This unit test passes a read through admission again-and-again, just
like an evictable reader would be during its lifetime. When readmitted
the read sometimes has to wait and sometimes not. This checks that
readmitting a previously admitted reader doesn't leak any units.
(cherry picked from commit cadc26de38)
Allow semaphore related tests to include a diagnostics printout in error
messages to help determine why the test failed.
(cherry picked from commit d246e2df0a)
This commit conceptually reverts 4c8ab10. Said commit was meant to
prevent the scenario where memory-only permits -- those that don't pass
admission but still consume memory -- completely prevent the admission
of reads, possibly even causing a deadlock because a permit might even
block its own admission. The protection introduced by said commit
however proved to be very problematic. It made the status of resources
on the permit very hard to reason about and created loopholes via which
permits could accumulate resources without tracking, or even leak them.
Instead of continuing to patch this broken system, this
commit does away with this "protection" based on the observation that
deadlocks are now prevented anyway by the admission criteria introduced
by 0fe75571d9, which admits a read anyway when all the initial count
resources are available (meaning no admitted reader is alive),
regardless of availability of memory.
The benefit of this revert is that the semaphore now knows about all
the resources and is able to do its job better, as it is not "lied to"
about resources by the permits. Furthermore, the status of a permit's
resources is much simpler to reason about, there are no more loopholes
in unexpected state transitions to swallow/leak resources.
To prove that this revert is indeed safe, in the next commit we add
robust tests that stress test admission on a highly contested semaphore.
This patch also does away with the registered/admitted differentiation
of permits, as this doesn't make much sense anymore, instead these two
are unified into a single "active" state. One can always tell whether a
permit was admitted or not from whether it owns count resources anyway.
(cherry picked from commit caaa8ef59a)
The fuzzy test consumes a large chunk of resource from the semaphore
up-front to simulate a contested semaphore. This isn't an accurate
simulation, because no permit will have more than 1 unit in reality.
Furthermore this can even cause a deadlock since 8aaa3a7 as now we rely
on all count units being available to make forward progress when memory
is scarce.
This patch just cuts out this part of the test; we now have a dedicated
unit test that checks a heavily contested semaphore properly, so there is
no need to try to fix this clumsy attempt that is just making trouble at
this point.
Refs: #8493
Tests: release(multishard_mutation_query_test:fuzzy_test)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429084458.40406-1-bdenes@scylladb.com>
(cherry picked from commit 26ae9555d1)
The migration manager has a function to get a schema (for read or write);
this function queries a peer node and retrieves the schema from it. One
scenario where this can happen is when an old node queries an old, unfixed
index.
This makes a hole through which views that are only adjusted for reading
can slip through.
Here we plug the hole by fixing such views before they are registered.
Closes#8509
(cherry picked from commit 480a12d7b3)
Fixes#8554.
Currently there are two places where we check admission conditions:
`do_wait_admission()` and `signal()`. Both use `has_available_units()`
to check resource availability, but the former has some additional
resource related conditions on top (in `may_proceed()`), which lead to
the two paths working with slightly different conditions. To fix, push
down all resource availability related checks to `has_available_units()`
to ensure admission conditions are consistent across all paths.
(cherry picked from commit d90cd6402c)
Fixes#8363
Fixes#8376
delete_segments has two issues when running with a size-limited
commit log and strict adherence to said limit.
1.) It uses parallel processing, with deferral. This means that
the disk usage variables it looks at might not be fully valid
- i.e. we might have already issued a file delete that will
reduce disk footprint such that a segment could instead be
recycled, but since vars are (and should be) only updated
_post_ delete, we don't know.
2.) It does not take into account edge conditions, when we only
delete a single segment, and this segment is the border segment
- i.e. the one pushing us over the limit, yet allocation is
desperately waiting for recycling. In this case we should
allow it to live on, and assume that next delete will reduce
footprint. Note: to ensure exact size limit, make sure
total size is a multiple of segment size.
If we had an error in recycling (disk rename?) and no elements
are available, we could have waiters hoping they will get segments.
Abort the queue (not permanent, but it wakes up waiters) and let them
retry. Since we did deletions instead, the disk footprint should allow
for new allocs at least. Or, more likely, everything is broken, but
we will at least make more noise.
Closes#8372
* github.com:scylladb/scylla:
commitlog: Add signalling to recycle queue iff we fail to recycle
commitlog: Fix race and edge condition in delete_segments
commitlog: coroutinize delete_segments
commitlog_test: Add test for deadlock in recycle waiter
(cherry picked from commit 8e808a56d2)
It is possible that a partition is in cache but is not present in sstables that are underneath.
In such case:
1. cache_flat_mutation_reader will fast forward underlying reader to that partition
2. The underlying reader will enter a state where it's empty and its is_end_of_stream() returns true
3. Previously cache_flat_mutation_reader::do_fill_buffer would try to fast forward such empty underlying reader
4. This PR fixes that
Test: unit(dev)
Fixes#8435
Fixes#8411
Closes#8437
* github.com:scylladb/scylla:
row_cache: remove redundant check in make_reader
cache_flat_mutation_reader: fix do_fill_buffer
read_context: add _partition_exists
read_context: remove skip_first_fragment arg from create_underlying
read_context: skip first fragment in ensure_underlying
(cherry picked from commit 163f2be277)
When a particular partition exists in at least one sstable, the cache
expects any single-partition query to this partition to return a `partition_start`
fragment, even if the result is empty.
In `time_series_sstable_set::create_single_key_sstable_reader` it could
happen that all sstables containing data for the given query get
filtered out and only sstables without the relevant partition are left,
resulting in a reader which immediately returns end-of-stream (while it
should return a `partition_start` and if not in forwarding mode, a
`partition_end`). This commit fixes that.
We do it by extending the reader queue (used by the clustering reader
merger) with a `dummy_reader` which will be returned by the queue as
the very first reader. This reader only emits a `partition_start` and,
if not in forwarding mode, a `partition_end` fragment.
Fixes#8447.
Closes#8448
(cherry picked from commit 5c7ed7a83f)
The merger could return end-of-stream if some (but not all) of the
underlying readers were empty (i.e. not even returning a
`partition_start`). This could happen in places where it was used
(`time_series_sstable_set::create_single_key_sstable_reader`) if we
opened an sstable which did not have the queried partition but passed
all the filters (specifically, the bloom filter returned a false
positive for this sstable).
The commit also extends the random tests for the merger to include empty
readers and adds an explicit test case that catches this bug (in a
limited scope: when we merge a single empty reader).
It also modifies `test_twcs_single_key_reader_filtering` (regression
test for #8432) because the time where the clustering key filter is
invoked changes (some invocations move from the constructor of the
merger to operator()). I checked manually that it still catches the bug
when I reintroduce it.
Fixes#8445.
Closes#8446
(cherry picked from commit 7ffb0d826b)
The filter passed to `min_position_reader_queue`, which was used by
`clustering_order_reader_merger`, would incorrectly include sstables as
soon as they passed through the PK (bloom) filter, and would include
sstables which didn't pass the PK filter (if they passed the CK
filter). Fortunately this wouldn't cause incorrect data to be returned,
but it would cause sstables to be opened unnecessarily (these sstables
would immediately return eof), resulting in a performance drop. This commit
fixes the filter and adds a regression test which uses statistics to
check how many times the CK filter was invoked.
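A sketch of the corrected predicate (illustrative shape):
```
struct sst_filter_result {
    bool passes_pk;  // partition-key (bloom) filter
    bool passes_ck;  // clustering-key filter
};

// Include an sstable only when it passes BOTH filters; the old filter
// effectively OR-ed them, opening sstables that immediately returned eof.
bool include_in_merge(const sst_filter_result& s) {
    return s.passes_pk && s.passes_ck;
}
```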
Fixes#8432.
Closes#8433
(cherry picked from commit 3687757115)
The compound set isn't overriding create_single_key_sstable_reader(), so
the default implementation is always called. Although the default impl
provides correct behavior, specialized ones which provide better perf
(currently only available for TWCS) were being ignored.
The compound set impl of the single key reader will basically combine the
single key readers of all sets managed by it.
Fixes#8415.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210406205009.75020-1-raphaelsc@scylladb.com>
(cherry picked from commit 8e0a1ca866)
If a table that is not replicated to a certain DC (rf=0) is accessed
with LOCAL_QUORUM on that DC, the current code will crash, since the
'targets' array will be empty and the read executor does not handle that.
Fix it by replying with an empty result.
Fixes#8354
Message-Id: <YGro+l2En3fF80CO@scylladb.com>
(cherry picked from commit cd24dfc7e5)
Don't allow users to disable MC sstables format any more.
We would like to retire some old cluster features that have been around
for years. Namely MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES. To do this
we first have to make sure that all existing clusters have them enabled.
It is impossible to know that unless we stop supporting
enable_sstables_mc_format flag.
Test: unit(dev)
Refs #8352
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes#8360
(cherry picked from commit 57c7964d6c)
* seastar 48376c76a...72e3baed9 (3):
> file: Add RFW_NOWAIT detection case for AuFS
> sharded: provide type info on no sharded instance exception
> iotune: Estimate accuarcy of measurement
Added missing include "database.hh" to api/lsa.cc since seastar::sharded<>
now needs full type information.
This patch is another series on removing big allocations from scylla.
The buffers in `compare_visitor` were replaced with `managed_bytes_view`; a similar change was also needed in tuple_deserializing_iterator and listlike_partial_deserializing_iterator, and was applied as well.
Tests: unit(dev)
Closes#8357
* github.com:scylladb/scylla:
types: remove linearization from abstract_type::compare
types: replace buffers in tuple_deserializing_iterator with fragmented ones
types: make tuple_type_impl::split work with any FragmentedViews
types: move read_collection_size/value specialization to header file
To avoid high latencies caused by large contiguous allocations
needed by linearizing, work on fragmented buffers instead.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
In preparation for removing linearization from abstract_type::compare,
add options to avoid linearization in tuple_deserializing_iterator.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
We may want to store a tuple in a fragmented buffer. To split it
into a vector of optional bytes, tuple_type_impl::split can be used.
To split a contiguous buffer (bytes_view), simply pass
single_fragmented_view(bytes_view).
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Addressed https://github.com/scylladb/scylla/pull/8314#issuecomment-803671234
(write issue: "Tracing: slow query fast mode documentation request")
Adds documentation for the fast slow-query tracing mode to docs/guide/tracing.md.
The patch to scylla-doc will be dup-ed after this one is merged.
cc @nyh
cc @vladzcloudius
Closes#8373
* github.com:scylladb/scylla:
tracing: api: fast mode doc improvement
tracing: fast slow query tracing mode docs
Following the work started in 253a7640e, a new batch of old features is assumed to be always available. They are all still announced via gossip, but the code assumes that the feature is always true, because we only support upgrades from a previous release, and the release window is considerably smaller than 2 years.
Features picked this time via `git blame`, along with the date of their introduction:
* fe4afb1aa3 (Asias He 2018-09-05 14:52:10 +0800 109) static const sstring ROW_LEVEL_REPAIR = "ROW_LEVEL_REPAIR";
* ff5e541335 (Calle Wilund 2019-02-05 13:06:07 +0000 110) static const sstring TRUNCATION_TABLE = "TRUNCATION_TABLE";
* fefef7b9eb (Tomasz Grabiec 2019-03-05 19:08:07 +0100 111) static const sstring CORRECT_STATIC_COMPACT_IN_MC = "CORRECT_STATIC_COMPACT_IN_MC";
Tests: unit(dev)
Closes#8235
* github.com:scylladb/scylla:
sstables,test: remove variables depending on old features
gms: make CORRECT_STATIC_COMPACT_IN_MC ft unconditionally true
sstables: stop relying on CORRECT_STATIC_COMPACT_IN_MC feature
gms: make TRUNCATION_TABLE feature unconditionally true
gms: make ROW_LEVEL_REPAIR feature unconditionally true
repair: stop relying on ROW_LEVEL_REPAIR feature
Fixes#8369
This was originally found (and fixed) by @gleb-cloudius, but the patch set with
the fix was reverted at some point, and the fix went away. Now the error remains
even in new, nice coroutine code.
We check the wrong var in the inner loop of the pre-fill path of
allocate_segment_ex, often causing us to generate giant writev calls of more or
less the whole file. Not intended.
Closes#8370
8a8589038c ("test: increase quota for tests to 6GB") increased
the quota for tests from 2GB to 6GB. I later found that the increased
requirement is related to the page size: Address Sanitizer allocates
at least a page per object, and so if the page size is larger the
memory requirement is also larger.
Make use of this by only increasing the quota if the page size
is greater than 4096 (I've only seen 4096 and 65536 in the wild).
This allows greater parallelism when the page size is small.
Closes#8371
Tests are short-lived and use a small amount of data. They
are also often run repeatedly, and the data is deleted immediately
after the test. This is a good scenario for using the kernel page
cache, as it can cache read-only data from test to test, and avoid
spilling write data to disk if it is deleted quickly.
Acknowledge this by using the new --kernel-page-cache option for
tests.
This is expected to help on large machines, where the disk can be
overloaded. Smaller machines with NVMe disks probably will not see
a difference.
Closes#8347
In order to maintain backward compatibility wrt. cluster features,
two boolean variables were kept in sstable writers:
- correctly_serialize_non_compound_range_tombstones
- correctly_serialize_static_compact_in_mc
Since these features are assumed to always be present now,
the above variables are no longer needed and can be purged.
This pull request adds partial admission control to Thrift frontend. The solution is partial mostly because the Thrift layer, aside from allowing Thrift messages, may also be used as a base protocol for CQL messages. Coupling admission control to this one is a little bit more complicated due to how the layer currently works - a Thrift handler, created once per connection, keeps a local `query_state` instance for the occasion of handling CQL requests. However, `query_state` should be kept per query, not per connection, so adding admission control to this aspect of the frontend is left for later.
Finally, the way service permits are passed from the server, via the handler factory, handler and then to queries is hacky. I haven't figured out how to force Thrift to pass custom context per query, so the way it works now is by relying on the fact that the server does not yield (in Seastar sense) between having read the request and launching the proper handler. Due to that, it's possible to just store the service permit in the server itself, pass the reference (address) to it down to the handler, and then read it back from the handling code and claim ownership of it. It works, but if anyone has a better idea, please share.
Refs #4826Closes#8313
* github.com:scylladb/scylla:
thrift: add support for max_concurrent_requests_per_shard
thrift: add metrics for admission control
thrift: add a counter for in-flight requests
thrift: add a counter for blocked requests
thrift: partially add admission control
service_permit: add a getter for the number of units held
thrift: coroutinize processing a request
memory_limiter: add a missing seastarx include
LCS reshape is currently inefficient for repair-based operations, because
the disjoint run of 256 sstables is reshaped into bigger L0 files, which
will then be integrated into the main sstable set.
On reshape completion, LCS has to compact those big L0 files onto higher
levels, until last level is reached, producing bad write amplification.
A much better approach is to instead compact that disjoint run into the
best possible level L, which can be figured out with:
log (base fan_out) of (total_size / max_sstable_size)
This compaction will be essentially a copy operation. It's important
to do it rather than only mutating the level of sstables because we have
to reshape the input run according to LCS parameters like sstable size.
For repair-based bootstrap/replace, the input disjoint run is now efficiently
reshaped into an ideal level L, so there's no compaction backlog once
reshape completes.
This behavior will manifest in the log as this:
LeveledManifest - Reshaping 256 disjoint sstables in level 0 into level 2
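A sketch of the quoted formula (the rounding and the fan_out default are illustrative):
```
#include <algorithm>
#include <cmath>
#include <cstdint>

// Target level for the reshaped run: log_fan_out(total_size / max_sstable_size).
// With fan_out 10 and 256 max-size sstables, floor(log10(256)) = 2, matching
// the log line above.
int ideal_level(uint64_t total_size, uint64_t max_sstable_size, double fan_out = 10.0) {
    double ratio = double(total_size) / double(max_sstable_size);
    return std::max(0, int(std::log(ratio) / std::log(fan_out)));
}
```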
For repair-based decommission/removenode though, for which reshape wasn't
wired up yet, level L may temporarily hold 2 disjoint runs, which overlap
one another, but LCS itself will incrementally merge them through either
promotion of L-1 into L, or by detecting overlapping in level L and
merging the overlapping sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210329171826.42873-1-raphaelsc@scylladb.com>
Failures in this test typically happen inside the test consumer object.
These however don't stop the test as the code invoking the consumer
object handles exceptions coming from it. So the test will run to
completion and will fail again when comparing the produced output with
the expected one. This results in distracting failures. The real problem
is not the difference in the output, but the first check that failed,
which is however buried in the noise. To prevent this add an "ok" flag
which is set to false if the consumer fails. In this case the additional
checks are skipped in the end to not generate useless noise.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210326083147.26113-2-bdenes@scylladb.com>
"
When a permit is destroyed we check if it still holds on to any
resources in the destructor. Any resources the permit still holds on are
leaked resources, as users should have released these. Currently we just
invoke `on_internal_error_noexcept()` to handle this, which -- depending
on the configuration -- will result in an error message or an assert. In
the former case, the resources will be leaked for good. This mini-series
fixes this, by signaling back these resources to the semaphore. This
helps avoid an eventual complete dry-up of all semaphore resources and a
subsequent complete shutdown of reads.
Tests: unit(release, debug)
"
* 'reader-permit-signal-leaked-resources/v1' of https://github.com/denesb/scylla:
reader_permit: signal leaked resources
test: test_reader_lifecycle_policy: keep semaphores alive until all ops cease
sstables: generate_summary(): extend the lifecycle of the reader concurrency semaphore
Said test has two separate logical readers, but they share the same
permit, which is illegal. This didn't cause any problems yet, but soon
the semaphore will start to keep score of active/inactive permits which
will be confused by such sharing, so have them use separate permits.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210326083147.26113-1-bdenes@scylladb.com>
Clean up step() function by moving state specific processing into per
state functions. This way it is easier to see how each state handles
individual messages. No functional changes here.
Message-Id: <YGHCiTWjq+L/jVCB@scylladb.com>
The command can be used to inspect IO queues of a local reactor.
Example output:
```
(gdb) scylla io-queues
Dev 0:
Class: |shares: |ptr:
--------------------------------------------------------------------------------
"default" |1 |(seastar::priority_class_data *)0x6000002c6500
"commitlog" |1000 |(seastar::priority_class_data *)0x6000003ad940
"memtable_flush" |1000 |(seastar::priority_class_data *)0x6000005cb300
"streaming" |200 |(seastar::priority_class_data *)0x0
"query" |1000 |(seastar::priority_class_data *)0x600000718580
"compaction" |1000 |(seastar::priority_class_data *)0x6000030ef0c0
Max request size: 2147483647
Max capacity: Ticket(weight: 4194303, size: 4194303)
Capacity tail: Ticket(weight: 73168384, size: 100561888)
Capacity head: Ticket(weight: 77360511, size: 104242143)
Resources executing: Ticket(weight: 2176, size: 514048)
Resources queued: Ticket(weight: 384, size: 98304)
Handles: (1)
Class 0x6000005d7278:
Ticket(weight: 128, size: 32768)
Ticket(weight: 128, size: 32768)
Ticket(weight: 128, size: 32768)
Pending in sink: (0)
```
Created when debugging a core dump. Turned out not to be immediately useful for this use case, but I'm publishing it since it may come in handy in future investigations.
Closes#8362
* github.com:scylladb/scylla:
scylla-gdb: add io-queues command
scylla-gdb.py: add parsing std::priority_queue
scylla-gdb.py: add parsing std::atomic
scylla-gdb.py: add parsing std::shared_ptr
scylla-db.py: add parsing intrusive_slist
`flat_mutation_reader::consume_pausable` is widely used in Scylla. Some places
worth mentioning are memtables and combined readers but there are others as
well.
This patchset improves `consume_pausable` in three ways:
1. it removes unnecessary allocation
2. it rearranges ifs to not check the same thing twice
3. for a consumer that returns plain stop_iteration not a future<stop_iteration>
it reduces the amount of future usage
Test: unit(dev, release, debug)
Combined reader microbenchmark has shown from 2% to 22% improvement in median
execution time while memtable microbenchmark has shown from 3.6% to 7.8%
improvement in median execution time.
Before the change:
```
./build/release/test/perf/perf_mutation_readers --random-seed 3549335083
single run iterations: 0
single run duration: 1.000s
number of runs: 5
number of cores: 16
random seed: 3549335083
test iterations median mad min max
combined.one_row 1316234 140.120ns 0.020ns 140.074ns 140.141ns
combined.single_active 7332 91.484us 31.890ns 91.453us 91.778us
combined.many_overlapping 945 870.973us 429.720ns 868.625us 871.403us
combined.disjoint_interleaved 7102 85.989us 7.847ns 85.973us 85.997us
combined.disjoint_ranges 7129 85.570us 7.840ns 85.562us 85.596us
combined.overlapping_partitions_disjoint_rows 5458 124.787us 56.738ns 124.731us 125.370us
clustering_combined.ranges_generic 1920688 217.940ns 0.184ns 217.742ns 218.275ns
clustering_combined.ranges_specialized 1935318 194.610ns 0.199ns 194.210ns 195.228ns
memtable.one_partition_one_row 624001 1.600us 1.405ns 1.599us 1.605us
memtable.one_partition_many_rows 79551 12.555us 1.829ns 12.549us 12.558us
memtable.many_partitions_one_row 40557 24.748us 77.083ns 24.644us 25.135us
memtable.many_partitions_many_rows 3220 310.429us 57.628ns 310.295us 311.189us
```
After the change:
```
./build/release/test/perf/perf_mutation_readers --random-seed 3549335083
single run iterations: 0
single run duration: 1.000s
number of runs: 5
number of cores: 16
random seed: 3549335083
test iterations median mad min max
combined.one_row 1358839 109.222ns 0.122ns 109.089ns 109.348ns
combined.single_active 7525 87.305us 25.540ns 87.273us 87.362us
combined.many_overlapping 962 853.195us 1.904us 851.244us 855.142us
combined.disjoint_interleaved 7310 81.988us 28.877ns 81.949us 82.032us
combined.disjoint_ranges 7315 81.699us 37.144ns 81.662us 81.874us
combined.overlapping_partitions_disjoint_rows 5591 120.964us 15.294ns 120.949us 121.120us
clustering_combined.ranges_generic 1954722 211.993ns 0.052ns 211.883ns 212.084ns
clustering_combined.ranges_specialized 2042194 187.807ns 0.066ns 187.732ns 188.289ns
memtable.one_partition_one_row 648701 1.542us 0.339ns 1.542us 1.543us
memtable.one_partition_many_rows 85007 11.759us 1.168ns 11.752us 11.782us
memtable.many_partitions_one_row 43893 22.805us 17.147ns 22.782us 22.843us
memtable.many_partitions_many_rows 3441 290.220us 41.720ns 290.172us 290.306us
```
Closes#8359
* github.com:scylladb/scylla:
flat_mutation_reader: optimize consume_pausable for some consumers
flat_mutation_reader: special case consumers in consume_pausable
flat_mutation_reader: Change order of checks in consume_pausable
flat_mutation_reader: fix indentation in consume_pausable
flat_mutation_reader: Remove allocation in consume_pausable
perf: Add benchmarks for large partitions
This commit adds admission control in the form of passing
service permits to the Thrift server.
The support is partial, because Thrift also supports running CQL
queries, and for that purpose a query_state object is kept
in the Thrift handler. However, the handler is generally created
once per connection, not once per query, and the query_state object
is supposed to keep the state of a single query only.
In order to keep this series simpler, the CQL-on-top-of-Thrift
layer is not touched and is left as TODO.
Moreover, the Thrift layer does not make it easy to pass custom
per-query context (like service_permit), so the implementation
uses a trick: the service permit is created on the server
and then passed as reference to its connections and their respective
Thrift handlers. Then, each time a query is read from the socket,
this service permit is overwritten and then read back from the Thrift
handler. This mechanism heavily relies on the fact that there are
zero preemption points between overwriting the service permit
and reading it back by the handler. Otherwise, races may occur.
This assumption was verified by code inspection + empirical tests,
but if somebody is aware that it may not always hold, please speak up.
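A stand-in sketch of the hand-off (illustrative types; the real code threads this through the Thrift handler factory):
```
#include <utility>

struct service_permit { int units = 0; };

struct thrift_server {
    service_permit current;  // overwritten for each request read
};

struct request_handler {
    service_permit* slot;    // address passed down from the owning server
    // Safe only because nothing yields between the server storing the
    // permit and the handler claiming it.
    service_permit claim() {
        return std::exchange(*slot, service_permit{});
    }
};
```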
The Thrift layer is functional, but it's not usually the first-choice protocol for Scylla users, so it's hereby disabled by default.
Fixes#8336
Closes#8338
* github.com:scylladb/scylla:
docs: mention disabling Thrift by default
db,config: disable Thrift by default
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are not the part of the most recent
configuration anymore.
The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.
Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.
There is also another problem to be solved: in Raft an instance may
need to communicate with a peer outside its current configuration.
This may happen, e.g., when a follower falls out of sync with the
majority and then a configuration is changed and a leader not present
in the old configuration is elected.
The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.
When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.
An outgoing communication to an unconfigured peer is impossible.
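A toy sketch of an expiring address map (the in-tree raft_address_map is more involved; names here are illustrative):
```
#include <chrono>
#include <cstdint>
#include <string>
#include <unordered_map>

struct addr_map {
    using clock = std::chrono::steady_clock;
    struct entry { std::string addr; clock::time_point expires; };
    std::unordered_map<uint64_t, entry> map;

    // Called when a message arrives from an unknown peer.
    void learn(uint64_t id, std::string addr, std::chrono::seconds ttl) {
        map[id] = {std::move(addr), clock::now() + ttl};
    }
    // Returns the address if the mapping is still alive, dropping it otherwise.
    const std::string* find(uint64_t id) {
        auto it = map.find(id);
        if (it == map.end()) return nullptr;
        if (clock::now() > it->second.expires) { map.erase(it); return nullptr; }
        return &it->second.addr;
    }
};
```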
* manmanson/raft_mappings_wiring_v12:
raft: update README.md with info on RPC server address mappings
raft: wire up `rpc::add_server` and `rpc::remove_server` for configuration changes
raft/fsm: add optional `rpc_configuration` field to fsm_output
raft: maintain current rpc context in `server_impl`
raft: use `.contains` instead of `.count` for std::set in `raft::configuration::diff`
raft: unit-tests for `raft_address_map`
raft: support expiring server address mappings for rpc module
Consumers that return stop_iteration rather than future<stop_iteration> don't
have to consume a single fragment per iteration of repeat. They can
consume the whole buffer in each iteration.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
consume_pausable works with consumers that return either stop_iteration
or future<stop_iteration>. So far it was calling futurize_invoke for
both. This patch special cases consumers that return
future<stop_iteration> and doesn't call futurize_invoke for them, as this
is unnecessary work.
More importantly, this will allow the following patch to optimize
consumers that return plain stop_iteration.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This way we can avoid checking is_buffer_empty twice.
The compiler might be able to optimize this out, but why depend on it
when the alternative is no less readable.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Code was left with wrong indentation by the previous commit that
removed do_with call around the code that's currently present.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The allocation was introduced in 515bed90bb but I couldn't figure out
why it's needed. It seems that the consumer can just be captured inside
the lambda. Tests seem to support the idea.
Indentation will be fixed in the following commit to make the review
easier.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
dbuild detects if the kernel is using cgroupv2 by checking if the
cgroup2 filesystem is mounted on /sys/fs/cgroup. However, on Ubuntu
20.10, the cgroup filesystem is mounted on /sys/fs/cgroup and the
cgroup2 filesystem is mounted on /sys/fs/cgroup/unified. This second
mount matches the search expression and gives a false positive.
Fix by adding a space at the end; this will fail to match
/sys/fs/cgroup/unified.
Closes#8355
Describe the high-level scheme of managing RPC mappings and
also expand on the introduction of "expirable" RPC mappings concept
and why these are needed.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are not the part of the most recent
configuration anymore.
The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.
Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The field is set in `fsm.get_output` whenever
`_log.last_conf_idx()` or the term changes.
Also, add `_last_conf_idx` and `_last_term` to
`fsm::last_observed_state`, they are utilized
in the condition to evaluate current rpc
configuration in `fsm.get_output()`.
This will be used later to update rpc config state
stored in `server_impl` and maintain rpc address map.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The start-stop code is drifting towards a straightforward
scheme of a bunch of
service foo;
foo.start();
auto stop_foo = defer([&foo] { foo.stop(); });
blocks. The drain_on_shutdown() and its relation to drain()
and decommission() is a big hurdle on the way of this effort.
This set unifies drain() and drain_on_shutdown() so that drain
really becomes just some first steps of the regular shutdown,
i.e. -- what it should be. Some synchronisation bits around it
are still needed, though.
This unification also closes a bunch of not-yet-caught bugs where
parts of the system remained running in case shutdown happens
after nodetool drain. In this case the whole drain_on_shutdown()
becomes a noop (just returns drain()'s future) and what's
missing in drain() becomes missing on shutdown.
tests: unit(dev), dtest(simple_boot_shutdown : dev),
manual(start+stop, start+drain+stop : dev)
refs: #2737
* xemul/br-drain-on-shutdown:
drain_on_shutdown: Simplify
drain: Fix indentation
storage_service: Unify drain and drain_on_shutdown
storage_proxy: Drain and unsubscribe in main.cc
migration_manager: Stop it in two phases
stream_manager: Stop instances on drain
batchlog_manager: Stop its instances on shutdown
tracing: Shutdown tracing in drain
tracing: Stop it in main.cc
system_distributed_keyspace: Stop it in main.cc
storage_service: Move (un)subscription to migration events
Introduce rpc server_address that represents the
last observed state of address mappings
for RPC module.
It does not correspond to any kind of configuration
in the raft sense, just an artificial construct
corresponding to the largest set of server
addresses coming from both previous and current
raft configurations (to be able to contact both
joining and leaving servers).
This will be used later to update rpc module mappings
when cluster configuration changes.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
`std::unordered_set::contains` was introduced in C++20 and provides
clearer semantics for checking the existence of a given element in a set.
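For example:
```
#include <unordered_set>

int main() {
    std::unordered_set<int> ids{1, 2, 3};
    bool present = ids.contains(2);  // C++20; clearer than ids.count(2) != 0
    (void)present;
}
```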
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This patch introduces `raft_address_map` class to abstract
the notion of expirable address mappings for a raft rpc module.
In Raft an instance may need to communicate with a peer outside
its current configuration. This may happen, e.g., when a follower
falls out of sync with the majority and then a configuration is
changed and a leader not present in the old configuration is elected.
The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.
When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.
An outgoing communication to an unconfigured peer is impossible.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The modern version of this method doesn't need the
run_with_no_api_lock(), as it's launched on shard 0
anyway, nor does it need logging before and after,
as that's done by the deferred action from main that
calls it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now they only differ in one bit -- compaction manager is
drained on drain and is left running (until regular stop)
on shutdown. So this unification adds a boolean flag for
this case.
Also the indentation is deliberately left broken for the
sake of patch readability.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently shutdown after drain leaves storage proxy
subscribed to storage_service events and without
storage_proxy::drain_on_shutdown being called. So it
seems safe if the whole thing is relocated closer to
its starting peers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before the patch the migration manager was stopped in
two ways and one was buggy.
Plain shutdown -- it's just sharded::stop-ed by defer
in main(), but this happens long after the shutdown
of commitlog, which is not correct.
Shutdown after drain -- it's stopped twice, first time
right before the commitlog shutdown, second -- the
same defer in main(). And since the sharded::stop is
reentrable, the 2nd stop works noop.
This patch splits the stop into two phases: first it
stops the instances and does this in _both_ -- plain
shutdown and shutdown after drain. This phase is done
before commitlog shutdown in both cases. Second, the
existing deferred sharded::stop in main.cc.
This change needs migration_manager::stop() to
become reentrant, but that's easily achieved with the
help of the abort_source the migration_manager has.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's not seen directly from this patch itself, but the only
difference between the first several calls that drain() makes
and stop_transport() is the do_stop_stream_manager()
in the latter.
Again, it's partially a bugfix (shutdown after drain leaves
streaming running), partially a must-have thing (streaming
is not expected in the background after drain), partially a
unification of two drains out there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now stopped (not sharded::stop(), but batchlog_manager::stop)
on plain drain, but plain shutdown leaves it running, so fill this
gap.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
First of all, shutdown that happens after nodetool drain leaves
tracing up-n-running, so it's effectively a bugfix. But also
a step towards unified drain and drain_on_shutdown.
Keeping this bit in drain seems to be required because drain
stops transport, flushes column families and shuts commitlog
down. Any tracing activity happening after it looks uncalled
for.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The tracing::stop() just checks that it was shutdown()-ed and
otherwise a noop, so it's OK to stop tracing later. This brings
drain() and drain_on_shutdown() closer to each other and makes
main.cc look more like it should.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's now stopped in drain_on_shutdown, but since its stop()
method is a noop, it doesn't matter where it is. Keeping it
in main.cc next to related start brings drain_on_shutdown()
closer to drain() and the whole thing closer to the Ideal
start-stop sequence.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After the patch the subscription effectively happens at
the same time as before, but is now located in main.cc,
so no real change here.
The unsubscription was in the drain_on_shutdown before
the patch, but after it it happens to be a defer next
to its peer, i.e. later, but it shouldn't be disastrous
for two reasons. First -- client services and migration
manager are already stopped. Second -- before the patch
this subscription was _not_ cancelled if shutdown ran
after nodetool drain and it didn't cause troubles.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When destroying a permit with leaked resources we call
`on_internal_error_noexcept()` in the destructor. This method logs an
error or asserts depending on the configuration. When not asserting, we
need to return the leaked units to the semaphore, otherwise they will be
leaked for good. We can do this because we know exactly how many
resources the user of the permit leaked (never signalled).
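A toy sketch of the idea (stand-in types; the real code signals the reader concurrency semaphore):
```
#include <cstdint>
#include <cstdio>

struct semaphore {
    int64_t free = 0;
    void signal(int64_t n) { free += n; }
};

struct permit {
    semaphore* sem;
    int64_t held = 0;
    ~permit() {
        if (held) {
            // report the bug, but return the units instead of losing them
            std::fprintf(stderr, "permit leaked %lld units\n", (long long)held);
            sem->signal(held);
        }
    }
};
```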
Currently the reader copies and merges all range tombstones from all partition
versions (those that match the given range, but still) when it is initialized or
decides to refresh its iterators. This is a lot of potentially useless work and
memory, as the reader may be dropped before it emits all the mutations from
the given range(s).
It's better to walk the tombstones step-by-step, like it's done for rows.
fixes: #1671
tests: unit(dev)
* xemul/br-partiion-snapshot-reader-on-demand-range-tombstones-2:
range_tombstone_stream: Remove unused methods
partition_snapshot_reader: Emit range tombstones on demand
partition_snapshot_reader: Introduce maybe_refresh_state
partition_snapshot_reader: Move range tombstone stream member
partition_snapshot_reader: Add reset_state method to helper class
partition_snapshot_reader: Downgrade heap comparator
partition_snapshot_reader: Use on-demand comparators
range_tombstone_list: Add new slice() helper
range_tombstone_list: Introduce iterator_range alias
Constructors of classes inheriting from file_impl copy the alignment
values by hand, but miss the overwrite one; thus on a new file
it remains default-initialized.
To fix this, and to not forget to properly initialize future fields
of file_impl, use the impl's copy constructor.
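A toy sketch of the idea (illustrative field names):
```
struct file_impl_base {
    unsigned read_dma_alignment = 512;
    unsigned write_dma_alignment = 512;
    unsigned overwrite_dma_alignment = 4096;  // the field that was missed
};

struct wrapped_file_impl : file_impl_base {
    // Delegating to the base copy constructor copies every alignment field,
    // including ones added in the future, instead of copying by hand.
    explicit wrapped_file_impl(const file_impl_base& from)
        : file_impl_base(from) {}
};
```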
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210325104830.31923-1-xemul@scylladb.com>
Seven etcd unit tests as boost tests.
* alejo/raft-tests-etcd-08-v4-communicate-v5:
raft: etcd unit tests: test proposal handling scenarios
raft: etcd unit tests: test old messages ignored
raft: etcd unit tests: test single node precandidate
raft: etcd unit tests: test dueling precandidates
raft: etcd unit tests: test dueling candidates
raft: etcd unit tests: test cannot commit without new term
raft: etcd unit tests: test single node commit
raft: etcd unit tests: update test_leader_election_overwrite_newer_logs
raft: etcd unit tests: fix test_progress_leader
raft: testing: log comparison helper functions
raft: testing: helper to make fsm candidate
raft: testing: expose log for test verification
raft: testing: use server_address_set
raft: testing: add prevote configuration
raft: testing: make become_follower() available for tests
TestProposal
For multiple scenarios, check proposal handling.
Note, instead of expecting an explicit result for each specified case,
the test automatically checks for expected behavior when quorum is
reached or not.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
TestSingleNodePreCandidate
Checks a single node configuration with precandidate on works to
automatically elect the node.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
TestDuelingPreCandidates
In a configuration of 3 nodes, two nodes don't see each other and they
compete for leadership. Loser (3) should revert to follower when prevote
is rejected and revert to term 1.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
TestDuelingCandidates
In a configuration of 3 nodes, two nodes don't see each other and they
compete for leadership. Once reconnected, loser should not disrupt.
But note it will remain candidate with current algorithm without
prevoting and other fsms will not bump term.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
TestCannotCommitWithoutNewTermEntry tests the entries cannot be
committed when leader changes, no new proposal comes in and ChangeTerm
proposal is filtered.
NOTE: this doesn't check committed but it's implicit for next round;
this could also use communicate() providing committed output map
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Port etcd TestSingleNodeCommit
In a single node configuration elect the node, add 2 entries and check
number of committed entries.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Make test_leader_election_overwrite_newer_logs use newer communicate()
and other new helpers.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Make implementation follow closer to original test.
Use newer boost test helpers.
NOTE: in etcd it seems a leader's self progress is in PIPELINE state.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Two helper functions to compare logs. For now only index, term, and data
type are used; comparing data content does not seem necessary yet.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The current election_timeout() helper might bump the term twice.
It's convenient and less error-prone to have a finer-grained helper
that stops right when the candidate state is reached.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
One INSERT statement was unnecessary for the test, so delete it.
Another was necessary, so explain it.
Tests: cql-pytest/test_null on both Scylla and Cassandra
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8304
Test raft configuration changes:
a node with empty configuration, transitioning
to an entirely different cluster, transitioning
in presence of down nodes, leader change during
configuration change, stray replies, etc.
* scylla-dev/raft-empty-confchange-v5: (21 commits)
raft: (testing) stray replies from removed followers
raft: always return a non-zero configuration index from the log
raft: (testing) leader change during configuration change
raft: (testing) test confchange {ABCDE} -> {ABCDEFG}
raft: (testing) test confchange {ABCDEF} -> {ABCGH}
raft: (testing) test confchange {ABC} -> {CDE}
raft: (testing) test confchange {AB} -> {CD}
raft: (testing) test confchange {A} -> {B}
raft: (testing) test a server with empty configuration
raft: (testing) introduce testing utilities
raft: (testing) simplify id allocation in test
raft: (testing) add select_leader() helper
raft: (testing) introduce communicate() helper
raft: (testing) style cleanup in raft_fsm_test
raft: (testing) fix bug in election_threshold
raft: minor style changes & comments
raft: do not assert when transitioning to empty config
raft: assert we never apply a snapshot over uncommitted entries (leader)
raft: improve tracing
raft: add fsm_output::empty() helper to aid testing
...
The template method needs to be specialized in each file that is
using it. To avoid rewriting the specialization in multiple files,
move it to the header file.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing a list of column families to be specified.
We provide three ways of filtering in the query parameter "name_list":
1. A specific column family to include, in the form "ks:cf"
2. A keyspace, telling the server to include all column families in it.
Specified by omitting the cf name, i.e. "ks:"
3. All column families, represented by an empty list
The list can include any number of entries of forms 1 and 2, in any
mix; a hedged parsing sketch follows.
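A minimal parsing sketch of the three forms above (struct and function
names are illustrative, not the actual Scylla code):
```c++
#include <optional>
#include <string>

// One "name_list" entry: "ks:cf" selects one table, "ks:" a whole
// keyspace; an empty overall list means all column families.
struct cf_filter {
    std::string keyspace;
    std::optional<std::string> cf; // disengaged => every cf in the keyspace
};

cf_filter parse_name_list_entry(const std::string& entry) {
    auto colon = entry.find(':');
    cf_filter f{entry.substr(0, colon), std::nullopt};
    if (colon != std::string::npos && colon + 1 < entry.size()) {
        f.cf = entry.substr(colon + 1);
    }
    return f;
}
```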
Fixes#4520Closes#7864
LCS reshape is basically 'major compacting' level 0 until it contains
fewer than N sstables.
That produces terrible write amplification, because any given byte will be
compacted (initial # of sstables / max_threshold (32)) times. So if L0 initially
contained 256 ssts, there would be a WA of about 8.
This terrible write amplification can be reduced by performing STCS instead on
L0, which will leave L0 in a good shape without hurting WA as it happens
now.
Fixes#8345.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322150655.27011-1-raphaelsc@scylladb.com>
During the retry mechanism, it's possible to encounter a gate
closed exception, which should simply be ignored, because
it indicates that the server is shutting down.
Closes#8337
This open option tells seastar that the file in question
will be truncated to the needed size right away and that all
subsequent writes will happen within this size. This
hint turns off seastar's append optimization, which is not
that cheap, and thus helps save a few CPU cycles.
The option was introduced in seastar by 8bec57bc.
tests: unit(dev), dtest(commitlog:
test_batch_commitlog,
test_periodic_commitlog,
test_commitlog_replay_on_startup)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210323115409.31215-1-xemul@scylladb.com>
Add comments explaining the rationale for transfer_leadership()
(more PhD quotes), encapsulate stable leader check in tick()
into a lambda and add more detailed comments to it.
Until clang figures things out with the now infamous
`-mllvm -inline-threshold=X` parameter, let's allow customizing
it to make the compilation of release builds less tiresome.
For instance, scylla's row_level.o object file currently does not compile
for me until I decrease the inline threshold to a low value (e.g. 50).
Message-Id: <54113db9438e3c3371410996f49b7fbe9a1b7257.1616422536.git.sarna@scylladb.com>
Row marker has a cell name which sorts after the row tombstone's start
bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.
This was harmless to our sstable reader, which recognizes both as
belonging to the current clustering row fragment and collects both
fine.
However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries which violate the cell
name ordering. It's very unlikely to run into in practice, since to
trigger promoted index entries for both atoms, the clustering key
would have to be so large that the size of the marker cell exceeds the
desired promoted index block size, which is 64KB by default (but
user-controlled via the column_index_size_in_kb option). 64KB is also
the limit on clustering key size accepted by the system.
This was caught by one of our unit tests:
sstable_conforms_to_mutation_source_test
...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.
The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:
assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));
It compares mutations read from the same sstable using different
methods: slicing using clustering key restrictions, and fast
forwarding. The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.
After reverting the commit which introduced the dynamic adjustments,
the test passes, but both mutations are missing the marker; both are
wrong!
They are wrong because the promoted index contains entries whose
starting positions violate the ordering, so the binary search gets
confused and selects the row tombstone's position, which is emitted
after the marker, thus skipping over the row marker.
The explanation for why the test started to fail after the dynamic
adjustments is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which the cursor touches during binary search depend
on the size of the block read from the file. The commit which enabled
dynamic adjustments causes the block size to differ between subsequent
reads, which allows one of the reads to walk over the corrupted entries
and read the correct data by selecting the entry corresponding to the
row marker.
Fixes#8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>
Clustering rows are now stored in intrusive btree, cells are
now stored in radix tree, but scylla-gdb tries to walk the
intrusive_set and vector/set union respectively.
For the former case -- the btree wrapper is introduced.
For the latter -- the compiler optimizes away too many important
bits, and walking the tree turns into a bunch of hard-coded
hacks and reinterpret-casts. Until a better solution is found,
just print the address of the tree root.
* xemul/br-gdb-btree-rows:
gdb: Show address of the row::_cells tree (or "empty" mark)
gdb: Add support for intrusive B tree
gdb: Use helper to get rows from mutation_partition
Currently clang optimizes out lots of critical stuff from
the compact radix tree. Until we find a way to walk the
tree in gdb, it's better to at least show where it is in
memory.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rows inside partition are now stored in an intrusive B-tree,
so here's the helper class that wraps this collection.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
validate_partial() is declared in the internal namespace, but defined
outside it. This causes calls to validate_partial() to be ambiguous
on architectures that haven't been SIMD-optimized yet (e.g. s390x).
Fix by defining it in the internal namespace.
Closes#8268
The manifest manipulation commands stopped working with podman 3;
the containers-storage: prefix now throws errors.
Switch to `buildah manifest`; since we're building with buildah,
we might as well maintain the manifest with buildah as well.
Closes#8231
crc has some code to reverse endianness on big-endian machines, but does
not handle the case of a 1-byte object (which doesn't need any adjustment).
This causes clang to complain that the switch statement doesn't handle that
case.
Fix by adding a no-op case.
Closes#8269
While a seastar::smart_ptr is trivial to dereference manually, adding
support for it to dereference_smart_ptr() is just as trivial, and avoids
the annoying (but brief) detour which is currently needed.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210322150149.84534-1-bdenes@scylladb.com>
Concepts are easier to read and result in better error messages.
This change also tightens the constraint from "std::is_fundamental" to
"std::integral". The differences are floating point values, nullptr_t,
and void. The latter two are illegal/useless to write, and nobody uses
floating point values for list lengths, so everything still compiles.
Closes#8326
It will still be possible to use Thrift once it's enabled
in the yaml file, but it's better to not open this port
by default, since Thrift is definitely not the first choice
for Scylla users.
Fixes#8336
This series implements the leader stepdown extension. See patch 4 for
the justification for its existence. The first three patches either
implement cleanups to existing code that a later patch will touch, or
fix bugs that need to be fixed for the stepdown test to work.
* 'raft-leader-stepdown-v3' of github.com:scylladb/scylla-dev:
raft: add test for leader stepdown
raft: introduce leader stepdown procedure
raft: fix replication when leader is not part of current config
raft: do not update last election time if current leader is not a part of current configuration
raft: move log limiting semaphore into the leader state
"
Scylla suffers from aggressive compaction after a repair-based operation has been initiated. That translates into bad latency and slowness for the operation itself.
This aggressiveness comes from the fact that:
1) new sstables are immediately added to the compaction backlog, thus reducing the bandwidth available for the operation.
2) new sstables are in bad shape when integrated into the main sstable set, not conforming to the strategy invariant.
To solve this problem, new sstables will be incrementally reshaped, off the compaction strategy, until finally integrated into the main set.
The solution takes advantage of the fact that there's only one sstable per vnode range, meaning sstables generated by repair-based operations are disjoint.
NOTE: off-strategy for repair-based decommission and removenode will follow this series and require little work as the infrastructure is introduced in this series.
Refs #5226.
"
* 'offstrategy_v7' of github.com:raphaelsc/scylla:
tests: Add unit test for off-strategy sstable compaction
table: Wire up off-strategy compaction on repair-based bootstrap and replace
table: extend add_sstable_and_update_cache() for off-strategy
sstables/compaction_manager: Add function to submit off-strategy work
table: Introduce off-strategy compaction on maintenance sstable set
table: change build_new_sstable_list() to accept other sstable sets
table: change non_staging_sstables() to filter out off-strategy sstables
table: Introduce maintenance sstable set
table: Wire compound sstable set
table: prepare make_reader_excluding_sstables() to work with compound sstable set
table: prepare discard_sstables() to work with compound sstable set
table: extract add_sstable() common code into a function
sstable_set: Introduce compound sstable set
reshape: STCS: preserve token contiguity when reshaping disjoint sstables
Section 3.10 of the PhD describes two cases for which the extension can
be helpful:
1. Sometimes the leader must step down. For example, it may need to reboot
for maintenance, or it may be removed from the cluster. When it steps
down, the cluster will be idle for an election timeout until another
server times out and wins an election. This brief unavailability can be
avoided by having the leader transfer its leadership to another server
before it steps down.
2. In some cases, one or more servers may be more suitable to lead the
cluster than others. For example, a server with high load would not make
a good leader, or in a WAN deployment, servers in a primary datacenter
may be preferred in order to minimize the latency between clients and
the leader. Other consensus algorithms may be able to accommodate these
preferences during leader election, but Raft needs a server with a
sufficiently up-to-date log to become leader, which might not be the
most preferred one. Instead, a leader in Raft can periodically check
to see whether one of its available followers would be more suitable,
and if so, transfer its leadership to that server. (If only human leaders
were so graceful.)
The patch here implements the extension and employs it automatically
when a leader removes itself from a cluster.
When a leader orchestrates its own removal from a cluster there is a
situation where the leader is still responsible for replication, but it
is no longer part of the active configuration. The current code,
however, skips replication in this case. Fix it by always replicating
in the leader state.
Since we use an external failure detector instead of relying on empty
AppendRequests from a leader, there can be a situation where a node
is no longer part of a certain raft group but is still alive (and may
also be part of other raft groups). In such a case the last election
time should not be updated even if the node is alive. It is the same as
if it had stopped sending empty AppendRequests in original raft.
Since we switched the scylla-python3 build directory to tools/python3/build
on Jenkins, we no longer need the compat-python3 targets, so drop them.
Related scylladb/scylla-pkg#1554
Closes#8328
"
Due to a bad interaction of recent changes (913d970 and 4c8ab10), inactive
readers that are not admitted have managed to completely fly under the
radar, avoiding any sort of limitation. The reason is that pre-admission
the permits don't forward their resource cost to the semaphore, to
prevent them possibly blocking their own admission later. However this
meant that if such a reader is registered as inactive, it completely
avoids the normal resource based eviction mechanism and can accumulate
without bounds.
The real solution to this is to move the semaphore before the cache and
make all reads pass admission before they get started (#4758). Although
work has been started towards this, it is still a while until it lands.
In the meanwhile this patchset provides a workaround in the form of a
new inactive state, which -- like admitted -- causes the permit to
forward its cost to the semaphore, making sure these un-admitted
inactive reads are accounted for and evicted if there are too many of
them.
Fixes: #8258
Tests: unit(release), dtest(oppartitions_test.py:TestTopPartitions.test_read_by_gause_key_distribution_for_compound_primary_key_and_large_rows_number)
"
* 'reader-concurrency-semaphore-limit-inactive-reads/v4' of https://github.com/denesb/scylla:
test: mutation_reader_test: add test for permit cleanup
test: querier_cache_test: add memory based cache eviction test
reader_permit: add inactive state
querier: insert(): account immediately evicted querier as resource based eviction
reader_concurrency_semaphore: fix clear_inactive_reads()
reader_concurrency_semaphore: make inactive_read_handle a weak reference
reader_concurrency_semaphore: make evict() noexcept
reader_concurrency_semaphore: update out-of-date comments
After commit 0bd201d3ca ("cql3: Skip indexed
column for CK restrictions") fixed issue #7888, the test
cassandra_tests/validation/entities/frozen_collections_test.py::testClusteringColumnFiltering
began passing, as expected. So we can remove its "xfail" label.
Refs #7888.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210321080522.1831115-1-nyh@scylladb.com>
* seastar ea5e529f30...83339edb04 (21):
> cmake: filter out -Wno-error=#warnings from pkgconfig (seastar.pc)
> Merge 'utils/log.cc: fix nested_exception logging (again)' from Vlad Zolotarov
Fixes#8327.
> file: Add option to refuse the append-challenged file
> Merge "Teach io-tester to work on block device" from Pavel E
> Merge "Cleanup files code" from Pavel E
> install-dependencies: Support rhel-8.3
> install-dependencies: Add some missing rh packages
> file, reactor: reinstate RWF_NOWAIT support
> file: Prevent fsxattr.fsx_extsize from overflow
> cmake: enable clang's -Wno-error=#warnings if supported
> cmake: harden seastar_supports_flag aginst inputs with spaces or #
> cmake: fix seastar_supports_flag failing after first invocation
> thread: Stop backtraces in main() on s390x architecture
> intent: Explicitly declare constructors for references
> test: file_io_test: parallel_overwrite: use testing::local_random_engine
> util: log-impl: rework log_buf::inserter_iterator
> rwlock: pass timeout parameter to get_units
> concepts: require lib support to enable concepts
> rpc: print more info on bad protocol magic
> seastar-addr2line: strip input line to restore multiline support
> log: skip on unknown nested mixing instead of stopping the logging
Ref #8327.
This is a translation of Cassandra's CQL unit test source file
validation/entities/SecondaryIndexOnMapEntriesTest.java into our
cql-pytest framework.
This test file checks various features of indexing (with secondary index)
individual entries of maps. All these tests pass on Cassandra, but fail on
Scylla because of issue #2962 - we do not yet support indexing of the content
of unfrozen collections. The failing tests currently fail as soon as they
try to create the index, with the message:
"Cannot create secondary index on non-frozen collection or UDT column v".
Refs #2962.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210310124638.1653606-1-nyh@scylladb.com>
Thrift used to be quite unsafe with regard to its retry mechanism, which caused very rapid use of resources, namely the number of file descriptors. It was also prone to use-after-free due to spawning futures without guarding the captured objects with anything.
The mechanism is now cleaned up, and a simple exponential backoff replaced the previous constant-backoff policy.
Fixes#8317
Tests: unit(dev), manual(see #8317 for a simple reproducer)
Closes#8318
* github.com:scylladb/scylla:
thrift: add exponential backoff for retries
thrift: fix and simplify retry logic
When a tuple value is serialized, we go through every element type and
use it to serialize element values. But an element type can be
reversed, which is artificially different from the type of the value
being read. This results in a server error due to the type mismatch.
Fix it by unreversing the element type prior to comparing it to the
value type.
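A hedged sketch of the fix, using stand-in types (Scylla's real types
are shared pointers to abstract_type; names here are assumptions):
```c++
// Strip the artificial "reversed" wrapper from the declared element type
// before comparing it with the type of the value being serialized.
struct type_info {
    bool reversed = false;
    const type_info* underlying = nullptr; // set when reversed
};

const type_info* unreversed(const type_info* t) {
    return t->reversed ? t->underlying : t;
}

// per tuple element i (illustrative):
//   if (unreversed(declared[i]) != value_type) -> report type mismatch
```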
Fixes#7902
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8316
When querying an index table, we assemble clustering-column
restrictions for that query by going over the base table token,
partition columns, and clustering columns. But if one of those
columns is the indexed column, there is a problem; the indexed column
is the index table's partition key, not its clustering key. We end up
with an invalid clustering slice, which can cause problems downstream.
Fix this by skipping the indexed column when assembling the clustering
restrictions.
Tests: unit (dev)
Fixes#7888
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8320
A trichotomic comparator returning an int can easily be mistaken
for a less comparator, as the return types are convertible.
Use the new std::strong_ordering instead.
A caller in cql3's update_parameters.hh is also converted, following
the path of least resistance.
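A minimal sketch of the hazard (illustrative comparators, not Scylla
code):
```c++
#include <compare>

// An int tri-comparator silently converts to bool, so it can be passed
// where a less-comparator is expected and still compile -- with wrong
// results. std::strong_ordering is not convertible to bool, so the same
// mistake becomes a compile error.
int tri_cmp_int(int a, int b) { return a < b ? -1 : a > b ? 1 : 0; }
std::strong_ordering tri_cmp(int a, int b) { return a <=> b; }

// std::sort(v.begin(), v.end(), tri_cmp_int); // compiles; sorts wrongly
// std::sort(v.begin(), v.end(), tri_cmp);     // fails to compile: caught
```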
Ref #1449.
Test: unit (dev)
Closes#8323
Constrain inject() with a requires clause rather than enable_if,
simplifying the code and compiler diagnostics.
Note that the second instance could not have been called, since
the template argument does not appear in the function parameter
list and thus could not be deduced. This is corrected here.
Closes#8322
time_point_to_string ensures its input is a time_point with
millisecond resolution (though it neglects to verify the epoch
is what it expects). Change the test from a clunky enable_if to
a nicer concept.
Closes#8321
Currently send_snapshot is the only two-way RPC used by Raft.
However, the sender (the leader) does not look at the receiver's
reply, other than checking that it's not an error. This has the following
issues:
- if the follower has a newer term and rejects the snapshot for
that reason, the leader will not learn about a newer follower
term and will not step down
- the send_snapshot message doesn't pass through a single-endpoint
fsm::step() and thus may not follow the general Raft rules
which apply for all messages.
- making a general purpose transport that simply calls fsm::step()
for every message becomes impossible.
Fix it by actually responding with snapshot_reply to send_snapshot
RPC, generating this reply in fsm::step() on the follower,
and feeding into fsm::step() on the leader.
* scylla-dev/raft-send-snapshot-v2:
raft: pass snapshot_reply into fsm::step()
raft: respond with snapshot_reply to send_snapshot RPC
raft: set follower's next_idx when switching to SNAPSHOT mode
raft: set the current leader upon getting InstallSnapshot
The original backoff mechanism, which just retried after 1ms,
may still lead to rapid resource depletion.
Instead, an exponential backoff is used, with a cap of ~2s.
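A minimal sketch of the policy (constants from the text above, helper
name assumed):
```c++
#include <algorithm>
#include <chrono>

// Double the delay on every retry, capped at roughly two seconds.
std::chrono::milliseconds next_backoff(std::chrono::milliseconds current) {
    using namespace std::chrono_literals;
    return std::min(current * 2, 2000ms);
}
```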
Tests: manual, with cassandra-stress and browsing logs
The retry logic for Thrift frontend had two bugs:
1. Due to a missing break in a switch statement,
two retry calls were always performed instead of one,
which acts a little bit like a Seastar forkbomb
2. The delayed action was not guarded with any gate,
so it was theoretically possible to access a captured `this`
pointer of an object which already got deallocated.
In order to fix the above, the logic is simplified to always
retry with backoff - it makes very little sense to skip the backoff
and immediate retries are not needed by anyone, while they cause
severe overload risk.
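For illustration, the shape of the missing-break bug described in
point 1 (hypothetical states, not the actual Thrift frontend code):
```c++
enum class connection_state { transient_error, closing };

void on_failure(connection_state s, void (*schedule_retry)()) {
    switch (s) {
    case connection_state::transient_error:
        schedule_retry();
        break; // the fix: without this break, the next case also ran,
               // scheduling two retries per failure (a mini forkbomb)
    case connection_state::closing:
        schedule_retry();
        break;
    }
}
```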
Tests: manual - a simple cassandra-stress invocation was able to crash
scylla with a segfault:
$ cassandra-stress write -mode thrift -rate threads=2000
Fixes#8317
enable_if is hard to understand, especially its error messages. Convert
enable_if in sstable code to concepts.
A new concept is introduced, self_describing, for the case of a type
that follows the obj.describe_type() protocol. Otherwise this is quite
straightforward.
Closes#8315
* github.com:scylladb/scylla:
sstables: vector write: convert to concepts
sstables: check_truncated_and_assign: convert to concept
sstables: convert write() to concepts
sstables: convert write_vint() to concepts
sstables: vector parse(): convert to concept
sstables: convert parse() for a self-describing type to concept
sstables: read_vint(): convert enable_if to concepts
sstables: add concept for self-describing type
We have an integral and a non-integral overload, each constrained
with enable_if. We use std::integral to constrain the integral
overload and leave the other unconstrained, as C++ will choose the
more constrained version when applicable.
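A toy demonstration of that overload-resolution rule (function name
illustrative):
```c++
#include <concepts>
#include <iostream>

// Unconstrained fallback overload...
template <typename T>
void write(const T&) { std::cout << "generic\n"; }

// ...and the constrained one. When both are viable, C++ prefers the
// more constrained candidate, so integral arguments land here.
template <std::integral T>
void write(const T&) { std::cout << "integral\n"; }

int main() {
    write(42);   // prints "integral"
    write(3.14); // prints "generic"
}
```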
There are three variants: integral, enum, and self-describing
(currently expressed as not integral and not enum). Convert to
concepts by using the standard concepts or the new self_describing
concept.
By default the boto3 library waits up to 60 seconds for a response,
and if it got no response, it sends the same request again, multiple
times. We already noticed in the past that it retries too many times,
thus slowing down failures, so in our test configuration we lowered the
number of retries to 3, but the setting of a 60-second timeout plus
3 retries still causes two problems:
1. When the test machine and the build are extremely slow, and the
operation is long (usually, CreateTable or DeleteTable involving
multiple views), the 60 second timeout might not be enough.
2. If the timeout is reached, boto3 silently retries the same operation.
This retry may fail because the previous one really succeeded at
least partially! The symptom is tests which report an error when
creating a table which already exists, or deleting a table which
doesn't exist.
The solution in this patch is first of all to never do retries - if
a query fails on internal server error, or times out, just report this
failure immediately. We don't expect to see transient errors during
local tests, so this is exactly the right behavior.
The second thing we do is to increase the default timeout. If 1 minute
was not enough, let's raise it to 5 minutes. 5 minutes should be enough
for every operation (famous last words...).
Even if 5 minutes is not enough for something, at least we'll now see
the timeout errors instead of some weird errors caused by retrying an
operation which was already almost done.
Fixes#8135
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210222125630.1325011-1-nyh@scylladb.com>
The two vector parse() overloads select between integral members
and non-integral members. Use std::integral to constrain the
integral overload and leave the other unconstrained; C++ will choose
the more constrained version when it applies.
This parse() overload uses "not integral and not enum" to reject
non-self-describing types. Express it directly with the self_describing
concept instead.
Convert read_vint() to a concept. The explicitly deleted version
is no longer needed since wrongly-typed inputs will be rejected
by the constraint. Similarly the static assert can be dropped
for the same reason.
Our sstable parsing and writing code contains a self-describing
type concept, where a type can advertise its members via a
describe_types() member function with a specific protocol.
Formalize that into a C++ concept. This is a little tricky, since
describe_type() accepts a parameter that is itself a template, and
requires clauses only work with concrete type. To handle this problem,
create such a concrete example type and use it in the concept.
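A hedged sketch of the trick (the probe type and concept body are
illustrative, not the exact sstables code):
```c++
// describe_type() takes a templated describer argument, so the concept
// cannot quantify over all describers; instead it is checked against one
// concrete probe type that stands in for them.
struct probe_describer {
    template <typename... T>
    void operator()(T&&...) {} // accepts whatever members are described
};

template <typename T>
concept self_describing = requires (T& obj, probe_describer& d) {
    obj.describe_type(d);
};
```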
Now, sstables created by bootstrap and replace will be added to the
maintenance set, and once the operation completes, off-strategy compaction
will be started.
We wait until the end of operation to trigger off-strategy, as reshaping
can be more efficient if we wait for all sstables before deciding what
to compact. Also, waiting for completion is no longer an issue because
we're able to read from new sstables using partitioned_sstable_set, and
their existence isn't accounted for by the compaction backlog tracker yet.
Refs #5226.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new variant will allow its caller to submit an off-strategy job
asynchronously on behalf of a given table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Off-strategy compaction is about incrementally reshaping the off-strategy
sstables in maintenance set, using our existing reshape mechanism, until
the set is ready for integration into the main sstable set.
The whole operation is done in maintenance mode, using the streaming
scheduling group.
We can do it this way because data in the maintenance set is disjoint,
so the effect on read amplification is avoided by using
partitioned_sstable_set, which is able to efficiently and incrementally
retrieve data from disjoint sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
SSTables that are off-strategy should be excluded by this function as
it's used to select candidates for regular compaction.
So in addition to only returning candidates from the main set, let's
also rename it to precisely reflect its behavior.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new sstable set will hold sstables created by repair-based
operations. A repair-based op creates 1 sstable per vrange (256),
so sstables added to this new set are disjoint, therefore they
can be efficiently read from using partitioned_sstable_set.
Compound set is changed to include this new set, so sstables in
this new set are automatically included when creating readers,
computing statistics, and so on.
This new set is not backlog tracked, so changes were needed to
prevent an sstable in this set from being added to or removed from
the tracker.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
From now on, _sstables becomes the compound set, and _main_sstables refers
only to the main sstables of the table. In the near future, the maintenance
set will be introduced and will also be managed by the compound set.
So add_sstable() and on_compaction_completion() are changed to
explicitly insert and remove sstables from the main set.
By storing the compound set in _sstables, functions which used _sstables
for creating readers, computing statistics, etc., will not have to be
changed when we introduce the maintenance set, so this approach greatly
minimizes the code changes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compound set will not be inserted or erased directly, so let's change
this function to build a new set from scratch instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
After compound set, discard_sstables() will have to prune each set
individually and later refresh the compound set. So let's change
the function to support multiple sstable sets, taking into account
that an sstable set may not want to be backlog tracked.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The purpose is to allow the code to be eventually reused by maintenance
sstable set, which will be soon introduced.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new sstable set implementation is useful for combining the operation
of multiple sstable sets, each of which can still be referenced
individually via its shared ptr reference.
It will be used when maintenance set is introduced in table, so a
compound set is required to allow both sets to have their operations
efficiently combined.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When reshaping hundreds of disjoint sstables, like on bootstrap,
contiguity wasn't being preserved because the heuristic for picking
candidates didn't take into account their token range, which resulted
in reshape messing with the contiguity that could otherwise be
preserved by respecting the token order of the disjoint sstables.
In other words, sstables with the smallest first tokens should be
compacted first. By doing that, the contiguity is preserved even
across size tiers, after reshape has completed its possible multiple
rounds to get all the data in shape.
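A hedged sketch of the candidate ordering (stand-in sstable type; the
accessor is an assumption, not Scylla's actual API):
```c++
#include <algorithm>
#include <memory>
#include <vector>

struct sstable { long first_token; }; // stand-in for the real sstable
using sstable_ptr = std::shared_ptr<sstable>;

// Sort reshape candidates by their smallest first token so compaction
// proceeds in token order, preserving contiguity across size tiers.
void order_candidates(std::vector<sstable_ptr>& candidates) {
    std::sort(candidates.begin(), candidates.end(),
              [] (const sstable_ptr& a, const sstable_ptr& b) {
                  return a->first_token < b->first_token;
              });
}
```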
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reshape's relaxed mode, used during initialization, only tolerates
min_threshold (default: 4) L0 sstables. However, relaxed mode should
tolerate more sstables in level 0, otherwise boot will have to reshape
level 0 every time it crosses the min threshold. So let's make LCS
reshape tolerate the maximum of max_threshold and 32.
This change is beneficial because once the table is populated, LCS
regular compaction can decide to merge those sstables in level 0 into
level 1 instead, therefore reducing WA.
Refs #8297.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210318131442.17935-1-raphaelsc@scylladb.com>
By the time we receive snapshot_reply from a follower
we may no longer be the leader. The follower's term may be
different from the snapshot term, e.g. the follower may
already be aware of a new leader and have a higher term.
We should pass this information into (possibly ex-) leader FSM via
fsm::step() so that it can correctly change its state, and
not call FSM directly.
Raft's send_snapshot RPC is actually two-way: the follower
responds with a snapshot_reply message. Until now, however,
this message was muted by the RPC layer.
Do not mute snapshot_reply any more:
- to make it obvious the RPC is two way
- to feed the follower response directly into leader's FSM and
thus ensure that FSM testing results produced when using a test
transport are representative of the real world uses of
raft::rpc.
Set follower's next_idx to snapshot index + 1 when switching
it to snapshot mode. If snapshot transfer succeeds, that's the
best match for the follower's next replication index. If it fails,
the leader will send a new probe to find out the follower position
again and re-try sending a possibly newer snapshot.
The change helps reduce protocol state managed outside FSM.
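A one-function sketch of the rule (field and type names assumed, not
raft's actual members):
```c++
#include <cstdint>

struct follower_progress { uint64_t next_idx; };

// When switching a follower to SNAPSHOT mode, optimistically point
// next_idx just past the snapshot being sent; on failure the leader
// re-probes, so the optimistic guess is safe.
void switch_to_snapshot(follower_progress& p, uint64_t snapshot_idx) {
    p.next_idx = snapshot_idx + 1;
}
```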
If tracing::tracing::_ignore_trace_events is enabled, then
the tracing system must ignore all session events
for non-full_tracing sessions (probabilistic and
user-requested tracing) and must not create subsessions
with make_trace_info.
The patch introduces a slow-query-tracing fast mode that
omits all events during tracing.
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
This state will be used for permits that are not in admitted state when
registered as inactive. We can have such reads if a read can be served
entirely from cache/memtables and it doesn't have to go to disk and
hence doesn't go through admission. These permits currently don't
forward their cost to the semaphore, so that they won't prevent their
own admission, which would create a deadlock. However, when in the
inactive state, we do want to keep tabs on their resource consumption,
so we don't accumulate too many of these inactive reads. So introduce a
new state for these
non-admitted inactive reads. When entering the inactive state, the
permit registers its cost with the semaphore, and when unregistered as
inactive, it retracts it. This is a workaround (khm hack) until #4758 is
solved and all permits will be admitted on creation.
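A hedged sketch of the resulting permit lifecycle (enum and state names
assumed):
```c++
// The key property: both "admitted" and the new "inactive" state forward
// the permit's cost to the semaphore; only pre-admission "waiting"
// doesn't, to avoid the self-admission deadlock described above.
enum class permit_state {
    waiting,  // pre-admission: cost not counted against the semaphore
    admitted, // cost counted
    inactive, // new: not admitted, but registered inactive -> cost
              // counted, so resource-based eviction can see these reads
};
```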
`reader_concurrency_semaphore::register_inactive_read()` drops the
registered inactive read immediately if there is a resource shortage.
This is in effect a resource based eviction, so account it as such in
`querier::insert()`.
Broken by the move to an intrusive container (9cbbf40), which caused
said method to only clear the container but not destroy the inactive
reads contained therein. This patch restores the previous behaviour and
also adds a call to the destructor (to ensure inactive reads are cleaned up
under any circumstances), as well as a unit test.
Having the handle keep an owning reference to the inactive read led to
awkward situations, where the inactive read was destroyed during
eviction in certain situations only (querier cache) and not in other
cases. Although the users didn't notice anything from this, it led to
very brittle code inside the reader concurrency semaphore. Among
others, the inactive read destructor had to be open-coded in evict(),
which already led to mistakes.
This patch goes back to the weak pointer paradigm used a while ago,
which is a much more natural fit for this. Inactive reads are still
kept in an intrusive list in the semaphore, but the handle now keeps a
weak pointer to them. When destroyed, the handle will destroy the
inactive read if it is still alive. When evicting the inactive read, it
will set the pointer in the handle to null.
As #1449 notes, trichotomic comparators returning int are dangerous as they
can be mistaken for less comparators. This series converts dht::ring_position
and dht::decorated_key, as well as a few closely related downstream types, to
return std::strong_ordering.
Closes#8225
* github.com:scylladb/scylla:
dht: ring_position, decorated_key: convert tri_comparators to std::strong_ordering
pager: rephrase misleading comparison check
test: total_order_checks: prepare for std::strong_ordering
test: mutation_test: prepare merge_container for std::strong_ordering
intrusive_array: prepare for std::strong_ordering
utils: collection-concepts: prepare for std::strong_ordering
Convert tri_comparators to return std::strong_ordering rather than int,
to prevent confusion with less comparators. Downstream users are either
also converted, or adjust the return type back to int, whichever happens
to be simpler; in all cases the change is trivial.
We check !result_of_tri_compare, which makes it look like we're
checking a boolean predicate, whereas we're really checking for
equality. Change to result_of_tri_compare == 0, which is less likely
to be confusing, and is also compatible with std::strong_ordering.
Adjust the total_order_check template to work with comparators
returning either int (as a temporary compatibility measure) or
std::strong_ordering (for #1449 safety).
The function merge_container() accepts a trichotomic comparator returning
an int. As #1449 explains, this is dangerous as it could be mistaken for
a less comparator. Switch to std::strong_ordering, but leave a compatible
merge_container() in place as it is still needed (even after this series).
collection-concepts includes a Comparable concept for a trichotomic
comparator function, used in intrusive btree and double_decker. Prepare
for std::strong_ordering by also allowing std::strong_ordering as a
return type. Once we've cleaned the code base, we can tighten it to
only allow std::strong_ordering.
This set removes a few more calls to the global storage service and
prevents more of them from appearing in thrift, which is about to start
using the memory limiter semaphore too.
The set turns this semaphore into a sharded one living in the scope of
main(), makes others use the local instance and removes the no longer
needed bits from storage service.
tests: unit(dev)
branch: https://github.com/xemul/scylla/commits/br-global-memory-limiter-sem
* xemul_drop_memory_limiter:
storage_service: Drop memory limiter
memory_limiter: Use main-local instance everywhere
main: Have local memory limiter and carry where needed
memory_limiter: Encapsulate memory limiting facility
cql_server: Remove semaphore getter fn from config
The cql_server and alternator both need the limiter, so
patch them to stop using storage service's one and use
the main-local one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Prepare memory limiters to have a non-global instance of
the service. For now the main-local instance is not
used and (!) is not stopped for real, just like the
storage_service's one is.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage service carries a semaphore and a size_t value
to facilitate memory limiting for client services.
This patch encapsulates both fields in a separate helper
class that will be used by whoever needs it, without
messing with the storage service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cql_server() needs to get the memory limiter semaphore
from the local storage service instance. To make this happen,
a callback is introduced on the config structure. The same
can be achieved in a simpler manner -- by providing the
local storage service instance directly.
Actually, the storage service will be removed in further
patches from this place, so this patch is mostly to get
rid of the callback from the config.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The slowest test in test_streams.py is test_list_streams_paged. It is meant
to test the ListStreams operation with paging. The existing test repeated
its test four times, for four different stream types. However, there is
no reason to suspect that the ListStreams operation might somehow be
different for the four stream types... We already have other tests which
create streams of the four types, and uses these streams - we don't
need the test for ListStreams to also test creating the four types.
By doing this test just once, not four times, we can save around 1.5
seconds of test time.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210318073755.1784349-1-nyh@scylladb.com>
In the test test_tracing.py::test_tracing_all, we do some operations and
then need to wait until they appear in the tracing table.
The current code used an exponentially-increasing delay during this wait,
starting with 0.1 seconds and then doubling the delay until we find what
we're looking for.
However, it turns out that the delay until the data appears in the table
is deliberately chosen by Scylla - and is always around 2 seconds.
In this case, an exponential delay is really bad - we will usually wait
for around 1 second too long after the needed wait of 2 seconds.
So in this patch we replace the exponential delay by a constant delay -
we wait 0.3 seconds between each retry.
This change makes the test test_tracing.py::test_tracing_all finish
in a little over 2 seconds, instead of a little over 3 seconds
before this patch. We cannot reduce this 2 second time any further
unless we make the 2-second tracing delay configurable.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210318000040.1782933-1-nyh@scylladb.com>
The test test_table.py::test_table_streams_on creates tables with various
stream types, and then immediately deletes them without testing anything.
This is a slow test (taking almost a full second on my laptop), and is
redundant because in test_streams.py we have tests which create tables
with streams in the same way - but then actually test that things work
with these streams. So this test might as well be removed, and this is
what we do in this patch.
Removing this test shaves another second from the Alternator test suite's
run time.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317230530.1780849-1-nyh@scylladb.com>
The test
test_condition_expression.py::test_condition_expression_with_forbidden_rmw
takes half a second to run (dev build, on my laptop), one of the slowest
tests in Alternator's test suite. Part of the reason was that it needlessly
set the same table to forbid_rmw mode multiple times.
Instead of doing that, we switch to using the test_table_s_forbid_rmw
fixture, which is a table like test_table_s but created just once in
forbid_rmw mode.
The result is a faster test (0.05 seconds instead of 0.5 seconds), but
also safer if we ever want to run tests in parallel. It also fixes a
bug in the test: At the end of the test, we intended to double-check
that although the forbid_rmw table forbids read-modify-write operations,
it does allow pure writes. Yet the test did this after clearing the
forbid_rmw mode... So after this patch the test verifies this on the
forbid_rmw table, as intended.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317222703.1779992-1-nyh@scylladb.com>
The test
test_condition_expression.py::test_condition_expression_with_permissive_write_isolation
Currently takes (on my laptop, dev build) a full two seconds, one of
the slowest tests. It is not surprising it is slow - it runs five other
tests three times each (for three different write isolation modes),
but it doesn't have to be this slow. Before this patch, for each of
the five tests we switch the write isolation mode three times, and
these switches involve schema changes and are fairly slow. So in
this patch we reverse the loops, moving the write isolation mode switch
to the outer loop.
This patch halves the runtime of this test - from two seconds to one.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210317221045.1779329-1-nyh@scylladb.com>
This series makes configure.py output slightly more helpful in case of incorrect parameters passed to the compiler/linker.
Closes#8267
* github.com:scylladb/scylla:
configure: print more context if the linking attempt failed
configure: provide more context on failed ./configure.py run
configure: add verbose option to try_compile_and_link
We should run install-dependencies.sh with the -e option so that errors
in the script are not ignored.
Also, we need to add tools/jmx/install-dependencies.sh and
tools/java/install-dependencies.sh, to fix the 'No such file or
directory' error on them.
Fixes#8293Closes#8294
[avi: did not regenerate toolchain image, since no new packages are
installed]
In some places we use the `*reinterpret_cast<const net::packed<T>*>(&x)`
pattern to reinterpret memory. This is a violation of C++'s aliasing rules,
which invokes undefined behaviour.
The blessed way to correctly reinterpret memory is to copy it into a new
object. Let's do that.
Note: the reinterpret_cast way has no performance advantage. Compilers
recognize the memory copy pattern and optimize it away.
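A minimal sketch of the blessed pattern:
```c++
#include <cstdint>
#include <cstring>

// Copying into a fresh object is the aliasing-safe way to reinterpret
// memory; compilers recognize the pattern and emit a plain load, so it
// costs nothing compared to the (undefined) reinterpret_cast version.
uint32_t read_u32(const char* p) {
    uint32_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}
```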
Closes#8241
* github.com:scylladb/scylla:
treewide: get rid of unaligned_cast
treewide: get rid of incorrect reinterpret casts
The existing implementation wrongfully shares _all sstables
rather than cloning it. This caused a use-after-free
in `repair_meta::do_estimate_partitions_on_local_shard`
when traversing a shared sstable_set, during which
`table::make_reader_excluding_sstables` erased an entry.
The erase should have happened on a cloned copy
of the sstable_list, not on a shared copy.
The regression was introduced in
c3b8757fa1.
Added a unit test that reproduces the share-on-copy issue
for partitioned_sstable_set (sstables::sstable_set).
Fixes#8274
Test: unit(release, debug)
DTest: materialized_views_test.py:TestMaterializedViews.simple_repair_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210317145552.701559-1-bhalevy@scylladb.com>
This series adds slow query logging capability to alternator. Queries which last longer than the specified threshold are logged in `system_traces.node_slow_log` and traced.
In order to be better prepared for https://github.com/scylladb/scylla/issues/2572, this series also expands the tracing API to allow custom key-value params and adds a custom `alternator_op` parameter to the slow node log. This information can also be deduced from the tracing session id by consulting the system_traces.events table, but https://github.com/scylladb/scylla/issues/2572 's assumption is that this tracing might not always be available in the future.
This series comes with a simple test case which checks if operation logs indeed end up in `system_traces.node_slow_log`.
Tests:
unit(dev, alternator pytest)
manual: verified that no operations are logged if slow query logging is disabled; verified that operations that take less time than the threshold are not logged; verified with test_batch.py::test_batch_write_item_large that a large-enough operation is indeed logged and traced.
Fixes#8292
Example trace:
```cql
cqlsh> select parameters, duration from system_traces.node_slow_log where start_time=b7a44589-8711-11eb-8053-14c6c5faf955;
parameters | duration
---------------------------------------------------------------------------------------------+----------
{'alternator_op': 'DeleteTable', 'query': '{"TableName": "alternator_Test_1615979572905"}'} | 75732
```
Closes#8298
* github.com:scylladb/scylla:
alternator: add test for slow query logging
alternator: allow enabling slow query logging
tracing: allow providing a custom session record param
"
All compaction types can now be stopped with the nodetool stop
command, example: nodetool stop SCRUB
Supported types are: COMPACTION, CLEANUP, VALIDATION, SCRUB,
INDEX_BUILD, RESHARD, UPGRADE, RESHAPE.
"
* 'stop_compaction_types_v2' of github.com:raphaelsc/scylla:
compaction: Allow all supported compaction types to be stopped
compaction: introduce function to map compaction name to respective type
compaction: refactor mapping of compaction type to string
compaction: move compaction_name() out of line
index_entry.hh (the current home of `promoted_index_blocks_reader`) is
included in `sstables.hh` and thus in half our code-base. All that code
really doesn't need the definition of the promoted index blocks reader
which also pulls in the sstables parser mechanism. Move it into its own
header and only include it where it is actually needed: the promoted
index cursor implementations.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210317093654.34196-1-bdenes@scylladb.com>
If the current leader is set, the follower will not vote
for another candidate. This is also known as the "sticky leadership" rule.
Before this change, the rule was enacted only upon receiving
AppendEntries RPC from the leader. Turn it on also upon receiving
InstallSnapshot RPC.
The test checks whether slow queries are properly logged
in the system_traces.node_slow_log system table.
The test is deterministic because it uses the threshold of 0ms
to qualify a query as slow, which effectively makes all queries
"slow enough".
The mechanism of session record params is currently only used
to store query strings and a couple more params like consistency level,
but since we now have more frontends than just CQL and Thrift,
it would be nice to also allow the users to put custom parameters in
there.
An immediate first user of this mechanism would be alternator,
which is going to put the operation type under the "alternator_op" key.
The operation type is not part of the query string due to how DynamoDB's
protocol works - the op type is stored separately in the HTTP header.
While it's possible to extract the operation type from the session_id,
it might not be the case once #2572 is implemented.
As the existing comment explains, a progress can be deleted at the point
of logging. The logging should only be done if the progress still
exists.
Message-Id: <YFDFVRQU1iVYhFdM@scylladb.com>
Previously, we crashed when the IN marker was bound to null. Throw
invalid_request_exception instead.
Fixes#8265
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8287
... `writes_coordinator_outside_replica_set`' from Juliusz Stasiewicz
With this change, the coordinator prefers itself as the "counter leader",
so if another endpoint is chosen as the leader, we know that the
coordinator was not a member of the replica set. With this guarantee we
can increment
`scylla_storage_proxy_coordinator_writes_coordinator_outside_replica_set` metric
after electing different leader (that metric used to neglect the counter
updates).
The motivation for this change is to have a more reliable way of
counting non-token-aware queries.
Fixes#4337Closes#8282
* github.com:scylladb/scylla:
storage_proxy: Include counter writes in `writes_coordinator_outside_replica_set`
counters: Favor coordinator as leader
Refs #7794
If we need to pre-fill a segment file in O_DSYNC mode, we should
drop O_DSYNC for the pre-fill, to avoid issuing flushes until the file
is filled. This is done by temporarily closing, re-opening in "normal"
mode, filling, then re-opening.
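A plain-POSIX sketch of the dance (the real code uses seastar's file
API; error handling and exact sizes are elided for brevity):
```c++
#include <fcntl.h>
#include <unistd.h>

// Pre-fill without O_DSYNC so the fill doesn't flush on every write,
// then reopen with O_DSYNC for the normal append path.
int open_prefilled_segment(const char* path, off_t size,
                           const char* buf, size_t buf_len) {
    int fd = ::open(path, O_CREAT | O_WRONLY, 0644); // "normal" mode
    for (off_t off = 0; off < size; off += buf_len) {
        ::pwrite(fd, buf, buf_len, off); // cheap: no per-write flush
    }
    ::close(fd);
    return ::open(path, O_WRONLY | O_DSYNC); // re-open for real use
}
```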
Closes#8250
* github.com:scylladb/scylla:
commitlog: Make pre-allocation drop O_DSYNC while pre-filling
commitlog: coroutinize allocate_segment_ex
Before this change, we would print an expression like this:
((ColumnDefinition{name=c, type=org.apache.cassandra.db.marshal.Int32Type, kind=CLUSTERING_COLUMN, componentIndex=0, droppedAt=-9223372036854775808}) = 0000007b)
Now, we print the same expression like this:
(c = 0000007b)
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8285
There are a bunch of such calls in schema altering statements and
there's currently no way to obtain the migration manager for such
statements, so a relatively big rework is needed.
The solution in this set is -- all statements' execute() methods are
called with query processor as first argument (now the storage proxy
is there), query processor references and provides migration manager
for statements. Those statements that need proxy can already get it
from the query processor.
Afterwards table_helper and thrift code can also stop using the global
migration manager instance, since they both have query processor in
needed places. While patching them a couple of calls to global storage
proxy also go away.
The new query processor -> migration manager dependency fits into
current start-stop sequence: the migration manager is started early,
the query processor is started after it. On stop the query processor
remains alive, but the migration manager stops. But since no code
currently (should) call get_local_migration_manager() it will _not_
call the query_processor::get_migration_manager() either, so this
dangling reference is ugly, but safe.
Another option could be to make storage proxy reference migration
manager, but this dependency doesn't look correct -- migration manager
is a higher-level service than the storage proxy, and it is the
migration manager that currently calls the storage proxy, not vice versa.
* xemul/br-kill-some-migration-managers-2:
cql3: Get database directly from query processor
thrift: Use query_processor::get_migration_manager()
table_helper: Use query_processor::get_migration_manager()
cql3: Use query_processor::get_migration_manager() (lambda captures cases)
cql3: Use query_processor::get_migration_manager() (alter_type statement)
cql3: Use query_processor::get_migration_manager() (trivial cases)
query_processor: Keep migration manager onboard
cql3: Pass query processor to announce_migration:s
cql3: Switch to qp (almost) in schema-altering-stmt
cql3: Change execute()'s 1st arg to query_processor
Compound set will select runs from all of its managed sets, so let's
adjust select_sstable_runs() to only return runs which belong to it.
Without this adjustment, selection of runs would fail, because the
function would try to unconditionally retrieve a run which may
live somewhere else.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312042255.111060-3-raphaelsc@scylladb.com>
After the compound set is introduced, select_sstable_runs() will no
longer work, because the sstable runs live in sstable_set but should
actually live in the sstable_set being written to.
Given that runs are a concept belonging only to strategies which
use partitioned_sstable_set, let's move the implementation of
select_sstable_runs() there.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312042255.111060-2-raphaelsc@scylladb.com>
When internode_encryption is "rack" or "dc", we should enforce that
incoming connections come from the appropriate address spaces iff
answering on a non-TLS socket.
This is implemented by having two protocol handlers. One for tls/full notls,
and one for mixed (needs checking) connections. The latter will ask
snitch if remote address is kosher, and refuse the connection otherwise.
Note: requires seastar patches:
"rpc: Make is possible for rpc server instance to refuse connection"
"RPC: (client) retain local address and use on stream creation"
Note that IP-level checks are not exhaustive. If a user is also using
"require_client_auth" with the dc/rack TLS setting, we should warn them
that there is a possibility that someone could spoof their address and
get past the authentication.
Closes#8051
Basically SLES support was already done in f20736d93d, but it was for the offline installer.
This fixes a few more problems with installing our rpm on SLES.
After this change, we can just install our rpm on both CentOS/RHEL and SLES in a single image, like the unified deb.
SLES uses its own package manager, called 'zypper', but it does support yum repositories, so no change is required for the repo.
Closes#8277
* github.com:scylladb/scylla:
scylla_coredump_setup: support SLES
scylla_setup: use rpm to check package availability for SLES
dist: install optional packages for SLES
Automatically initialize and start a timer in
`raft_services::add_server` for each raft server instance created.
The patch set also changes several other things in order
for tickers to work:
1. A fix for a bug in `raft_sys_table_storage` which caused an exception
if `raft::server::start` was called without any persisted state.
2. `raft_services::add_server` now automatically calls
`raft::server::start()` since a server instance should be started
before any of its methods can be called.
3. Raft servers can now start with initial term = 0. There was an
artificial restriction which is now lifted.
4. Raft schema state machine now returns a ready future instead of
throwing "not implemented" exception in `abort()`.
* github.com/ManManson/scylla.git/raft_services_tickers_v9_next_rebase:
raft/raft_services: provide a ticker for each raft server
raft/raft_services: switch from plain `throw` to `on_internal_error`
raft/raft_services: start server instance automatically in `add_server`
raft: return ready future instead of throwing in schema_raft_state_machine
raft: allow raft server to start with initial term 0
raft/raft_sys_table_storage: fix loading term/vote and snapshot from empty state
The log structured allocator's background reclaimer tries to
allocate CPU power proportional to memory demand, but a
bug made that not happen. Fix the bug, add some logging,
and future-proof the timer. Also, harden the test against
overcommitted test machines.
Fixes#8234.
Test: logalloc_test(dev), 20 concurrent runs on 2 cores (1 hyperthread each)
Closes#8281
* github.com:scylladb/scylla:
test: logalloc_test: harden background reclaim test against cpu overcommit
logalloc: background reclaim: use default scheduling group for adjusting shares
logalloc: background reclaim: log shares adjustment under trace level
logalloc: background reclaim: fix shares not updated by periodic timer
Automatically initialize a ticker for each raft server
instance when `raft_services::add_server` is called.
A ticker is a timer which regularly calls `raft::server::tick`
in order to tick its raft protocol state machine.
Note that the timer should start only after the server's
`start()` method has been called, because otherwise it would crash
since the fsm is not initialized yet.
Currently, the tick interval is hardcoded to be 100ms.
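A minimal sketch of such a ticker, assuming Seastar's `timer` and the `raft::server` interface; `_tickers` and `make_raft_server` are illustrative names, not the actual Scylla code:
```
#include <seastar/core/future.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/timer.hh>
#include <chrono>

future<> raft_services::add_server(raft::server_id id) {
    auto srv = make_raft_server(id);   // hypothetical factory
    co_await srv->start();             // must complete before tick() is legal
    auto& ticker = _tickers[id];       // assumed per-server seastar::timer<>
    ticker.set_callback([srv] { srv->tick(); });
    ticker.arm_periodic(std::chrono::milliseconds(100)); // hardcoded interval
}
```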
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
A raft server instance cannot be used in any way prior
to calling the `start()` method, which initializes
its internal state, e.g. the raft protocol state machine.
Otherwise, it will likely result in a crash.
Also, properly stop the servers on shutdown via
`raft_services::stop_servers()`.
In case some exception happened inside `add_server`,
the `init` function will de-initialize what it already
initialized, i.e. raft rpc verbs. This is important
since otherwise it would break the further initialization
process and, more importantly, would prevent raft
rpc verb deinitialization. That would cause a crash in
the `messaging_service` uninit procedure, because the raft rpc
handlers would still be initialized.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The current implementation throws an exception, which will cause
a crash when stopping scylla. This will be used in the next patch.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Prior to the fix there was an assert in
`raft::server_impl::start` checking that the initial term is not 0.
This restriction is completely artificial and can be lifted
without any problems, which will be described below.
The only place that is dependent on this corner case is in
`server_impl::io_fiber`. Whenever the term or vote has changed,
they will both be set in `fsm::get_output`. `io_fiber` checks
whether it needs to persist term and vote by validating that
the term field is set (by actually executing a `term != 0`
condition).
This particular check is based on the unobvious fact that the
term will never be 0 when `fsm::get_output` saves the
term and vote values, indicating that they need to be
persisted.
Vote and term can change independently of each other, so
checking only the term obscures even more what is happening
and why.
In either case term will never be 0, because:
1. If the term has changed, then it's naturally greater than 0,
since it's a monotonically increasing value.
2. If the vote has changed, it means that we received
a vote request message. In such case we have already updated
our term to the requester's term.
Switch to using an explicit optional in `fsm_output` so that
a reader doesn't have to think about the motivation behind this `if`
and just checks that the `term_and_vote` optional is engaged.
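A minimal sketch of the resulting shape, following the description above; the storage call is illustrative:
```
#include <optional>
#include <utility>

struct fsm_output {
    // Engaged only when term/vote actually changed and need persisting.
    std::optional<std::pair<raft::term_t, raft::server_id>> term_and_vote;
    // ... entries to persist, messages to send, etc.
};

// io_fiber then tests engagement instead of the implicit `term != 0`:
if (output.term_and_vote) {
    co_await _storage.store_term_and_vote(output.term_and_vote->first,
                                          output.term_and_vote->second);
}
```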
Given the motivation described above, the corresponding
assert(_fsm->get_current_term() != term_t(0));
in `server_impl::start` is removed.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
When a raft server is started for the first time and there isn't
any persisted state yet, provide default return values for
`load_term_and_vote` and `load_snapshot`. The code currently
does not handle this corner case correctly and fails with an
exception.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The coordinator prefers itself as the "counter leader", so if another
endpoint is chosen as the leader, we know that the coordinator was
not a member of the replica set. We can use this information to
increment relevant metric (which used to neglect the counters
completely).
Fixes#4337
This not only reduces internode traffic but is also needed for a
later change in this PR: metrics for non-token-aware writes
including counter updates.
Both methods apply a list of tombstones to the stream. One
was unused even before this patch set; the other one became unused
after the previous patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the reader gets all range tombstones from the given
range and places them into a stream. When filling the buffer
with fragments the range tombstones are extracted from the
stream one by one.
This is memory consuming, the reader's memory usage shouldn't
depend on the number of inhabitants in the partition range.
The patch implements the heap-based cursor for range tombstones
almost like it's done for rows.
The heap contains range_tombstone_list::iterator_ranges, the
tombstones are popped from the heap when needed, are applied
into the stream and then are emitted from it into the buffer.
The refresh_state() is called on each new range to set up the
iterators, and when lsa reports references invalidation to
refresh the iterators. To let the refresh_state revalidate the
iterators, the position at which the last range tombstone was
emitted is maintained.
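A simplified sketch of the heap-based cursor idea; the types and calls are illustrative (the real code keeps range_tombstone_list::iterator_ranges and revalidates them through refresh_state() on lsa invalidation):
```
#include <algorithm>
#include <vector>

struct rt_cursor {
    range_tombstone_list::const_iterator it, end;
};

// Pop the position-wise smallest tombstone, apply it to the stream, and
// push the cursor back if it still has tombstones left.
template <typename Less>
void emit_one(std::vector<rt_cursor>& heap, range_tombstone_stream& stream,
              Less pos_less) {
    auto cmp = [&] (const rt_cursor& a, const rt_cursor& b) {
        return pos_less(b.it->position(), a.it->position()); // min-heap
    };
    std::pop_heap(heap.begin(), heap.end(), cmp);
    auto& c = heap.back();
    stream.apply(range_tombstone(*c.it++));
    if (c.it == c.end) {
        heap.pop_back();
    } else {
        std::push_heap(heap.begin(), heap.end(), cmp);
    }
}
```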
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The existing refresh_state() is supposed to set up or revalidate
iterators to rows inside partition versions if needed. It will soon
be called in more than one place, so here's the helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lsa_partition_reader is the helper sub-class of
partition_snapshot_reader that, among other things, is
responsible for filling the stream of range tombstones,
which is then used by the reader itself.
Next patches will change the way range tombstones are
emitted by the reader, so hide the stream inside the
helper subclass in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method "notifies" the lsa_reader helper class when the owning
reader moves to a new range. This method is now empty, but will be
used by next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The next patch will extend the comparator to manage a heap of
range tombstones. To avoid adding yet another comparator to
it (and creating another heap comparator class), just
use the comparator that's common for both rows and range
tombstones.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are already two raii-ish comparators on the reader, and the next
patch would need to add a third. This just bloats the reader; the
comparators in question are stateless and can be created on demand for free.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of them now -- one returns an iterator_range that
covers the given query::clustering_range, the other returns
it for two given positions.
In the next patch a 3rd one is needed -- a slice() to get an
iterator_range that
a) starts strictly after a given position
b) ends after the given clustering_range's end
It will be used to refresh the range tombstone iterators after
some of them have been emitted. The same thing is currently
done by partition_snapshot_reader's refresh_state wrt rows:
    if (last_row)
        start = rows.upper_bound(last_row)    // continuation
    else
        start = rows.lower_bound(range.start) // initial
    end = rows.upper_bound(range.end)         // end is the same in either case
The goal for range tombstones is respectively the same.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The range_tombstone_list::slice() set of methods returns a
pair of iterators representing a range. In the next
patches this pair will be actively used, and it's handy
to have a shorter alias for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Previously, when a linking attempt failed, configure.py immediately
printed that neither lld nor gold was found, which might be misleading
if the linkers are installed but the compilation failed for another reason.
The printed information is now more specific, and combined with the
previous commit, it will also provide more information about why the
compilation attempt failed.
If the configuration step failed, it used to only inform that
it must be due to the wrong GCC version, which can be misleading.
For instance, trying to compile on clang with incorrect flags
also resulted in a "wrong GCC version" message.
Now, the message is more generic, but it also prints the stderr
output from the miscompilation, which may help pinpoint the problem:
$ ./configure.py --mode release --cflags='-fhello -fcolor-diagnostics -mllvm -opt-bisect-limit=10000' --compiler=clang++ --c-compiler=clang
Note: neither lld nor gold found; using default system linker
Compilation failed: clang++ -x c++ -o build/tmp/tmp1177gojf /home/sarna/repo/scylla/build/tmp/tmp_u3voys6 -fhello -fcolor-diagnostics -mllvm -opt-bisect-limit=10000 []
// clang pretends to be gcc (defined __GNUC__), so we
// must check it first
#ifdef __clang__
#if __clang_major__ < 10
#error "MAJOR"
#endif
#elif defined(__GNUC__)
#if __GNUC__ < 10
#error "MAJOR"
#elif __GNUC__ == 10
#if __GNUC_MINOR__ < 1
#error "MINOR"
#elif __GNUC_MINOR__ == 1
#if __GNUC_PATCHLEVEL__ < 1
#error "PATCHLEVEL"
#endif
#endif
#endif
#else
#error "Unrecognized compiler"
#endif
int main() { return 0; }
clang-11: error: unknown argument: '-fhello'
distcc[4085341] ERROR: compile (null) on localhost failed
Wrong compiler version or incorrect flags. Scylla needs GCC >= 10.1.1 with coroutines (-fcoroutines) or clang >= 10.0.0 to compile.
After the previous patches, some places in the cql3 code take a
long path to get the database reference:
query processor -> storage proxy -> database
The query processor can provide the database reference
by itself, so take this chance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Thrift needs the migration manager to call announce_<something> on
it, and currently it grabs the global migration manager instance.
Since the thrift handler has a query processor reference onboard and
the query processor can provide the migration manager reference,
it's time to remove a few more globals from thrift code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the migration manager can be obtained from the query
processor, the table helper can also benefit from it and stop
calling for the global migration manager instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are a few schema altering statements that need to have
the query processor inside lambda continuations. Fortunately,
they are all continuations of make_ready_future<>()s, so the
query processor can simply be captured by reference and used.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This statement needs the query processor one step below in the
stack from its .announce_migration method. So here's the
dedicated patch for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the schema altering statement implementations can now
stop calling for the global migration manager instance and get it
from the query processor.
Here are the trivial cases where the query processor is just
available at the place where it's needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The query processor sits above the migration manager
in the services layering; it's started after and (will be)
stopped before the migration manager.
The migration manager is needed in schema altering statements,
which are called with a query processor argument. They will
later get the migration manager from the query processor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the only call to .announce_migration has the
query processor at hand -- pass it to the real statements.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The schema altering statements are all inherited from the same
base class, which declares a pure virtual .announce_migration()
method. All the real statements are called with a storage proxy
argument, while they need the migration manager. So like in the
previous patch -- replace the storage proxy with the query processor.
While doing the replacement, also get the database instance from
the query processor, not from the proxy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the statement's execute() method accepts the storage
proxy as the first argument. This is enough for all of them
but the schema altering ones, because the latter need to call
the migration manager's announce.
To provide the migration manager to those who need it, some
higher-level service than the proxy is needed. The
query processor seems to be a good candidate for it.
That said -- all the .execute()s now accept the query
processor instead of the proxy and get the proxy itself from
the query processor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If the shares are currently low, we might not get enough CPU time to
adjust the shares in time.
This is currently no-op, since Seastar runs the callback outside
scheduling groups (and only uses the scheduling group for inherited
continuations); but better be insulated against such details.
adjust_shares() thinks it needs to do nothing if the main loop
is running, but in reality it can only avoid waking the main loop;
it still needs to adjust the shares unconditionally. Otherwise,
the background reclaim shares can get locked into a low value.
Fix by splitting the conditional into two.
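A hedged sketch of the split, with assumed member names:
```
void background_reclaimer::adjust_shares(float shares) {
    _scheduling_group.set_shares(shares);  // always update, unconditionally
    if (!_main_loop_running) {
        _worker_cv.signal();               // only the wake-up is conditional
    }
}
```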
Move boost tests to tests/raft and factor out common helpers.
* alejo/raft-tests-reorg-5-rebase-next-2:
raft: tests: move common helpers to header
raft: tests: move boost tests to tests/raft
Refs #7794
If we need to pre-fill the segment file in O_DSYNC mode, we should
drop this mode for the pre-fill, to avoid issuing flushes until the file
is filled. Done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.
v2:
* More comment
v3:
* Add missing flush
v4:
* comment
v5:
* Split coroutine and fix into separate patches
Prior to 463d0ab, only one table could be cleaned up at a time on a given shard.
Since then, all tables belonging to a given keyspace are cleaned up in parallel.
Cleanup serialization on each shard was enforced with a semaphore, which was
incorrectly removed by the aforementioned patch.
So the space requirement for cleanup to succeed can be up to the size of the
keyspace, increasing the chances of the node running out of space.
The node could also run out of memory if there are tons of tables in the keyspace.
The memory requirement is at least #_of_tables * 128k (not taking into account write
behind, etc). With 5k tables, it's ~0.64G per shard.
Also, all tables being cleaned up in parallel will compete for the same
disk and cpu bandwidth, making them all much slower and consequently
increasing the operation time significantly.
This problem was detected with cleanup, but scrub and upgrade go through the
same rewrite procedure, so they're affected by exactly the same problem.
Fixes#8247.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210312162223.149993-1-raphaelsc@scylladb.com>
When failing to rebuild a node, we would print the error with the useless
explanation "<no exception>". The problem was a typo in the logging command
which used std::current_exception() - which wasn't relevant at that point -
instead of "ep".
Refs #8089
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210314113118.1690132-1-nyh@scylladb.com>
This is how the Raft PhD dissertation explains the need for the prevoting stage:
One downside of Raft's leader election algorithm is that a server that
has been partitioned from the cluster is likely to cause a disruption
when it regains connectivity. When a server is partitioned, it will
not receive heartbeats. It will soon increment its term to start
an election, although it won't be able to collect enough votes to
become leader. When the server regains connectivity sometime later, its
larger term number will propagate to the rest of the cluster (either
through the server's RequestVote requests or through its AppendEntries
response). This will force the cluster leader to step down, and a new
election will have to take place to select a new leader.
The prevoting stage addresses that. In the Prevote algorithm, a
candidate only increments its term if it first learns from a majority of
the cluster that they would be willing to grant the candidate their votes
(if the candidate's log is sufficiently up-to-date, and the voters have
not received heartbeats from a valid leader for at least a baseline
election timeout).
The Prevote algorithm solves the issue of a partitioned server disrupting
the cluster when it rejoins. While a server is partitioned, it won't
be able to increment its term, since it can't receive permission
from a majority of the cluster. Then, when it rejoins the cluster, it
still won't be able to increment its term, since the other servers
will have been receiving regular heartbeats from the leader. Once the
server receives a heartbeat from the leader itself, it will return to
the follower state (in the same term).
In our implementation we have a "stable leader" extension that prevents
a spurious RequestVote from deposing an active leader, but an AppendEntries
with a higher term will still do that, so the prevoting extension is also required.
* scylla-dev/raft-prevote-v5:
raft: store leader and candidate state in state variant
raft: add boost tests for prevoting
raft: implement prevoting stage in leader election
raft: reset the leader on entering candidate state
raft: use modern unordered_set::contains instead of find in become_candidate
We already have server-state-dependent state in the fsm, so there is no need
to maintain the "voters" and "tracker" optionals as well. The upside is that
the optional and variant states cannot drift apart now.
This new method is preferred over all() for iteration purposes, because
all() may have to copy sstables into a temporary.
For example, all() implementation of the upcoming compound_sstable_set
will have no choice but to merge all sstables from N managed sets into
a temporary.
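A usage sketch; the method name for_each_sstable() is an assumption, since the message above doesn't spell it out:
```
// Visit every sstable in place, without all() materializing a temporary:
sstable_set.for_each_sstable([] (const sstables::shared_sstable& sst) {
    // process sst
});
```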
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210311163009.42210-1-raphaelsc@scylladb.com>
"
Currently the sstable reader code is scattered across several source
files as follows (paths are relative to sstables/):
* partition.cc - generic reader code;
* row.hh - format specific code related to building mutation fragments
from cells;
* mp_row_consumer.hh - format specific code related to parsing the raw
byte stream;
This is a strange organization scheme given that the generic sstable
reader is a template and as such it doesn't itself depend on the other
headers where the consumer and context implementations live. Yet these
are all included in partition.cc just so the reader factory function can
instantiate the sstable reader template with the format specific
objects.
This patchset reorganizes this code such that the generic sstable reader
is exposed in a header. Furthermore, format specific code is moved to
the kl/ and mx/ directories respectively. Each directory has a
reader.hh with a single factory function which creates the reader, all
the format specific code is hidden from sight. The added benefit is that
now reader code specific to a format is centralized in the format
specific folder, just like the writer code.
This patchset only moves code around, no logical changes are made.
Tests: unit(dev)
"
* 'sstable-reader-separation/v1' of https://github.com/denesb/scylla:
sstables: get rid of mp_row_consumer.{hh,cc}
sstables: get rid of row.hh
sstables/mp_row_consumer.hh: remove unused struct new_mutation
sstables: move mx specific context and consumer to mx/reader.cc
sstables: move kl specific context and consumer to kl/reader.cc
sstables: mv partition.cc sstable_mutation_reader.hh
Let's make stop_compaction() use sstables::to_compaction_type(),
so all supported compaction types can now be aborted.
Refs #7738.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This will make it easier to introduce a new type and also to map a type to
a string and vice-versa, using reverse lookup.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which
will serve as the collection point of utility stuff needed by all reader
implementations.
Move all the mx format specific context and consumer code to
mx/reader.cc and add a factory function `mx::make_reader()` which takes
over the job of instantiating the `sstable_mutation_reader` with the mx
specific context and consumer.
Move all the kl format specific context and consumer code to
kl/reader* and add a factory function `kl::make_reader()` which takes
over the job of instantiating the `sstable_mutation_reader` with the kl
specific context and consumer. Code which is used by tests is moved to
kl/reader_impl.hh, while code that can be hidden is moved to
kl/reader.cc. Users who just want to create a reader only have to
include kl/reader.hh.
The sstable reader currently knows the definition of all the different
consumers and contexts. But it doesn't really need to, as it is a
template. Exploit this and prepare for an organization scheme where the
consumers and contexts live hidden in a cc file which includes and
instantiates the sstable reader template. As a first step expose
`sstable_mutation_reader` in a header.
- 3 nodes in the cluster with rf = 3
- run repair on node1 with ignore_nodes to ignore node2 and node3
- node1 has no followers to repair with
However, currently node1 will walk through the repair procedure to read
data from disk and calculate hashes, which is unnecessary work.
This patch fixes this issue, so that in case there are no followers, we
skip the range and avoid the unnecessary work.
Before:
$ curl -X POST http://127.0.0.1:10000/storage_service/repair_async/myks3?ignore_nodes="127.0.0.2,127.0.0.3"
repair - repair id [id=1, uuid=ff39151b-2ce9-4885-b7e9-89158b14b5c2] on shard 0 stats:
repair_reason=repair, keyspace=myks3, tables={standard1},
ranges_nr=769, sub_ranges_nr=769, round_nr=1456,
round_nr_fast_path_already_synced=1456,
round_nr_fast_path_same_combined_hashes=0,
round_nr_slow_path=0, rpc_call_nr=0, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.19 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.0.1, 2822972}},
row_from_disk_nr={{127.0.0.1, 6218}},
row_from_disk_bytes_per_sec={{127.0.0.1, 14.1695}} MiB/s,
row_from_disk_rows_per_sec={{127.0.0.1, 32726.3}} Rows/s,
tx_row_nr_peer={}, rx_row_nr_peer={}
Data was read from disk.
After:
$ curl -X POST http://127.0.0.1:10000/storage_service/repair_async/myks3?ignore_nodes="127.0.0.2,127.0.0.3"
repair - repair id [id=1, uuid=c6df8b23-bd3b-4ebc-8d4c-a11d1ebcca39] on shard 0 stats:
repair_reason=repair, keyspace=myks3, tables={standard1}, ranges_nr=769,
sub_ranges_nr=0, round_nr=0, round_nr_fast_path_already_synced=0,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=0, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.0 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={},
row_from_disk_nr={},
row_from_disk_bytes_per_sec={} MiB/s,
row_from_disk_rows_per_sec={} Rows/s,
tx_row_nr_peer={}, rx_row_nr_peer={}
No data was read from disk.
Fixes#8256
Closes#8257
Instead of using the `restrictions` class hierarchy, calculate the clustering slice using the `expr::expression` representation of the WHERE clause. This will allow us to eventually drop the `restrictions` hierarchy altogether.
Tests: unit (dev, debug)
Closes#8227
* github.com:scylladb/scylla:
cql3: Make get_clustering_bounds() use expressions
cql3/expr: Add is_multi_column()
cql3/expr: Add more operators to needs_filtering
cql3: Replace CK-bound mode with comparison_order
cql3/expr: Make to_range globally visible
cql3: Gather slice-defining WHERE expressions
cql3: Add statement_restrictions::_where
test: Add unit tests for get_clustering_bounds
Not resetting a leader causes vote requests to be ignored instead of
rejected, which makes the voting round take more time to fail and may
slow down new leader election.
Use expressions instead of _clustering_columns_restrictions. This is
a step towards replacing the entire restrictions class hierarchy with
expressions.
Update some expected results in unit tests to reflect the new code.
These new results are equivalent to the old ones in how
storage_proxy::query() will process them (details:
bound_view::from_range() returns the same result for an empty-prefix
singular as for (-inf,+inf)).
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Omitting these operators didn't cause bugs, because needs_filtering()
is never invoked on them. But that will likely change in the future,
so add them now to prevent problems down the road.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Instead of defining this enum in multi_column_restriction::slice, put
it in the expr namespace and add it to binary_operator. We will need
it when we switch bounds calculation from multi_column_restriction to
expr classes.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
It will be used in statement_restrictions for calculating clustering
bounds. And it will come in handy elsewhere in the future, I'm sure.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Add statement_restrictions::_clustering_prefix_restrictions and fill
it with relevant expressions. Explain how to find all such
expressions in the WHERE clause.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Fixes#8212
Some snapshotting operations call in on a single table at a time.
When checking for existing snapshots in this case, we should not
bother with snapshots in other tables. Add an optional "filter"
to the check routine, which, if non-empty, lists the tables to check.
The use case is "scrub", which calls with a limited set of tables
to snapshot.
Closes#8240
Currently, if the data_size is greater than
max_chunk_size - sizeof(chunk), we end up
allocating up to max_chunk_size + sizeof(chunk) bytes,
exceeding buf.max_chunk_size().
This may lead to allocation failures, as seen in
https://github.com/scylladb/scylla/issues/7950,
where we couldn't allocate 131088 (= 128K + 16) bytes.
This change adjusts the exposed max_chunk_size()
to be max_alloc_size (128KB) - sizeof(chunk),
so that chunks in the write() path would normally
be allocated as 128KB allocations.
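The arithmetic behind the fix, assuming the 128 KiB allocator limit and the 16-byte chunk header implied by the 131088-byte failure quoted above:
```
#include <cstddef>

constexpr size_t max_alloc_size = 128 * 1024;  // 131072
constexpr size_t chunk_header   = 16;          // sizeof(chunk), assumed
// Before: a chunk could grow to max_alloc_size + chunk_header = 131088 bytes.
// After: max_chunk_size() shrinks so header + data fit the allocator limit.
constexpr size_t max_chunk_size = max_alloc_size - chunk_header; // 131056
static_assert(max_chunk_size + chunk_header <= max_alloc_size);
```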
Added a unit test - test_large_placeholder that
stresses the chunk allocation path from the
write_place_holder(size) entry point to make
sure it handles large chunk allocations correctly.
Refs #7950
Refs #8081
Test: unit(release), bytes_ostream_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210303143413.902968-1-bhalevy@scylladb.com>
On environments where the hard limit for coredumps is set to zero, the coredump test
script will fail since the system does not generate a coredump.
To avoid this issue, set ulimit -c 0 before generating SEGV in the script.
Note that scylla-server.service can generate a coredump even with ulimit -c 0,
because we set LimitCORE=infinity in its systemd unit file.
Fixes#8238
Closes#8245
* 'preparatory_work_for_compound_set' of github.com:raphaelsc/scylla:
sstable_set: move all() implementation into sstable_set_impl
sstable_set: preparatory work to change sstable_set::all() api
sstables: remove bag_sstable_set
This test checks that `mutation_partition::difference()` works correctly.
One of the checks it does is: m1 + m2 == m1 + (m2 - m1).
If the two mutations are identical but have compactable data, e.g. a
shadowable tombstone shadowed by a row marker, the apply will collapse
these, causing the above equality check to fail (as m2 - m1 is null).
To prevent this, compact the two input mutations.
Fixes: #8221
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210310141118.212538-1-bdenes@scylladb.com>
The main motivation behind this is that by moving all() impl into
sstable_set_impl, sstable_set no longer needs to maintain a list
with all sstables, which in turn may disagree with the respective
sstable_set_impl.
This will be very important for compound_sstable_set_impl which
will be built from existing sets, and will implement all() by
combining the all() of its managed sets.
Without this patch, we'd have to insert the same sstable into
both the compound set and the set managed by it, to guarantee that
all() of the compound set would return the correct data, which would
be expensive and error prone.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Users of sstable_set::all() rely on the set itself keeping a reference
to the returned list, so a user can iterate through the list assuming
that it is alive all the way through.
This will change in the future though, because there will be a
compound set impl which will have to merge the all() of multiple
managed sets, and the result is a temporary value.
So even range-based loops on all() have to keep a ref to the returned
list, to prevent the list from being prematurely destroyed.
So the following code
for (auto& sst : *sstable_set.all()) { ... }
becomes
for (auto sstables = sstable_set.all(); auto& sst : *sstables) { ... }
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"
This class is basically a wrapper around a unique pointer and a few
short convenience methods, but is otherwise a distraction in trying to
untangle the maze that is the sstable reader class hierarchy.
So this patchset folds it into its only real user: the sstable reader.
"
* 'data_consume_context_bye' of https://github.com/denesb/scylla:
sstable: move data_consume_* factory methods to row.hh
sstables: fold data_consume_context: into its users
sstables: partition.cc: remove data_consume_* forward declarations
The indentation level is significantly reduced, and so is the number
of allocations. The function signature is changed from taking an rvalue
ref to taking the unique_ptr by value, because otherwise the coroutine
captures the request as a reference, which results in use-after-free.
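An illustration of the lifetime rule, with an assumed handler shape:
```
#include <memory>
#include <seastar/core/future.hh>
#include <seastar/core/coroutine.hh>

// Taking the unique_ptr by value moves it into the coroutine frame:
seastar::future<> handle_api_request(std::unique_ptr<request> req) {
    co_await process(*req);  // req stays alive across suspension points
}
// With `std::unique_ptr<request>&& req`, the frame would hold only a
// reference to the caller's object, which may already be gone after the
// first suspension point - the use-after-free described above.
```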
Tests: unit(dev)
Closes#8249
* github.com:scylladb/scylla:
alternator: drop read_content_and_verify_signature
alternator: coroutinize handle_api_request
`data_consume_context` is a thin wrapper over the real context object
and it does little more than forward method calls to it. The few
methods doing more than mere forwarding can be folded into its single
real user: `sstable_reader`.
Alternator request sizes can be up to 16 MB, but the current implementation
had the Seastar HTTP server read the entire request as a contiguous string,
and then process it. We can't avoid reading the entire request up-front -
we want to verify its integrity before doing any additional processing on it.
But there is no reason why the entire request needs to be stored in one big
*contiguous* allocation. This is always a bad idea. We should use a
non-contiguous buffer, and that's the goal of this patch.
We use a new Seastar HTTPD feature where we can ask for an input stream,
instead of a string, for the request's body. We then begin the request
handling by reading the content of this stream into a
vector<temporary_buffer<char>> (which we alias "chunked_content"). We then
use this non-contiguous buffer to verify the request's signature and
if successful - parse the request JSON and finally execute it.
Beyond avoiding contiguous allocations, another benefit of this patch is
that while parsing a long request composed of chunks, we free each chunk
as soon as its parsing has completed. This reduces the peak amount of memory
used by the query - we no longer need to store both unparsed and parsed
versions of the request at the same time.
Although we already had tests with requests of different lengths, most
of them were short enough to only have one chunk, and only a few had
2 or 3 chunks. So we also add a test which makes a much longer request
(a BatchWriteItem with large items), which in my experiment had 17 chunks.
The goal of this test is to verify that the new signature and JSON parsing
code which needs to cross chunk boundaries work as expected.
Fixes#7213.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210309222525.1628234-1-nyh@scylladb.com>
The test populates the cache, then invalidates it, then tries to push
huge (10x the segment size) chunks into seastar memory hoping that
the invalid entries will be evicted. The exit condition on the last
stage is -- total memory of the region (sum of both -- used and free)
becomes less than the size of one chunk.
However, the condition is wrong, because the cache usually contains a dummy
entry that's not necessarily on the lru, and on some test iteration it may
happen that
evictable size < chunk size < evictable size + dummy size
In this case the test fails with bad_alloc, being unable to evict the memory
from under the dummy.
fixes: #7959
tests: unit(row_cache_test), unit(the failing case with the triggering
seed from the issue + 200 times more with random seeds)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210309134138.28099-1-xemul@scylladb.com>
Change the test/raft directory to the Boost test type.
Run replication_test cases as their own tests.
The RAFT_TEST_CASE macro creates 2 test cases, one with random 20% packet
loss named name_drops.
The directory test/raft is changed to host Boost tests instead of unit tests.
While there, improve the documentation.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
In raft the UUID 0 is a special case, so server ids start at 1.
Add two helper functions: one to convert a local 0-based id to a raft
1-based UUID, and one to convert from a UUID back to a raft_id.
Change the global map of disconnected servers to a more intuitive class,
connected. The class is callable for the most common case:
connected(id).
Methods connect(), disconnect(), and all() are provided for readability
instead of directly calling map methods (insert, erase, clear). They
also support both numerical (0-based) and server_id (UUID, 1-based) ids.
The actual shared map is kept in a lw_shared_ptr.
The class is passed around by copy-construction, which is practically
just creating a new lw_shared_ptr.
Internally it tracks disconnected servers, but externally it's more
intuitive to use connect instead of disconnect. So it reads
"connected id" rather than "not disconnected id", without double negatives.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This is a reworked submission of #7686, which has been reverted. This series
fixes some race conditions in MV/SI schema creation and load; we spotted some
places where a schema without a base table reference can sneak into the
registry. This can lead to an unrecoverable error, since write commands with
those schemas can't be issued from other nodes. Most of those cases can occur in
two main but uncommon scenarios: in a mixed cluster (during an upgrade) and in a small
window after a view or base table alteration.
Fixes#7709
Closes#8091
* github.com:scylladb/scylla:
database: Fix view schemas in place when loading
global_schema_ptr: add support for view's base table
materialized views: create view schemas with proper base table reference.
materialized views: Extract fix legacy schema into its own logic
In some cases, a user may want to repair the cluster, ignoring a node
that is down. For example, run repair before running a removenode operation to
remove a dead node.
Currently, repair will ignore the dead node and keep running
without it, but will report the repair as partial and as failed.
It is hard to tell if the repair failed only because
the dead node is not present or due to some other errors.
In order to exclude the dead node, one can use the hosts option. But it
is hard to understand and use, because one needs to list all the "good"
hosts including the node itself. It is much simpler if one can
just specify the node to exclude explicitly.
In addition, we support an ignore nodes option in other node operations
like removenode. This change makes the interface to ignore a node
explicitly more consistent.
Refs: #7806
Closes#8233
Add tests to check if quorum (for leader election and commit index
purposes) is calculated correctly in the presence of non-voting members.
Message-Id: <20210304101158.1237480-3-gleb@scylladb.com>
This patch adds support for non-voting members. A non-voting member is a
member whose vote is not counted for leader election purposes and commit
index calculation purposes, and which cannot become a leader. But otherwise
it is a normal raft node. The state is needed to let new nodes catch
up their log without disturbing the cluster.
All kinds of transitions are allowed. A node may be added as a voting member
directly, or it may be added as non-voting and then changed to a voting
one through an additional configuration change. A node can be demoted from a
voting to a non-voting member through a configuration change as well.
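An illustrative quorum computation that counts only voting members; the shape is assumed, not the actual tracker code:
```
#include <algorithm>
#include <vector>

size_t quorum_size(const std::vector<raft::server_address>& members) {
    size_t voters = std::count_if(members.begin(), members.end(),
            [] (const raft::server_address& m) { return m.can_vote; });
    return voters / 2 + 1;  // majority of the *voting* members only
}
```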
Message-Id: <20210304101158.1237480-2-gleb@scylladb.com>
bag_sstable_set can be replaced with partitioned_sstable_set, which
will provide the same functionality, given that L0 sstables go to
a "bag" rather than the interval map. STCS, for example, will only
have L0 sstables, so it will get exactly the same behavior with
partitioned_sstable_set.
It also gives us the benefit of keeping the leveled sstables in
the interval map if the user has switched from LCS to STCS, until
they're all compacted into size-tiered sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since Scylla requires C++20, there is no need to protect
concept definitions or usages with SEASTAR_CONCEPT; it just
clutters the code. This patch therefore removes all uses.
Closes#8236
Currently, the rpc timeout error for the GOSSIP_GET_ENDPOINT_STATES verb
is not handled in gossiper::do_shadow_round. If the
GOSSIP_GET_ENDPOINT_STATES rpc call to any of the remote nodes times
out, gossiper::do_shadow_round will throw an exception and fail the
whole boot-up process.
It is fine for some of the remote nodes to time out in the shadow round; it is
not a must to talk to all nodes.
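A hedged sketch of tolerating per-node timeouts in the shadow round; the verb and helper names are illustrative:
```
try {
    auto states = co_await ms.send_gossip_get_endpoint_states(node, timeout);
    apply_states(std::move(states));
} catch (seastar::rpc::timeout_error&) {
    // A single slow node must not fail the whole boot.
    logger.warn("node {} timed out in shadow round, continuing without it", node);
}
```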
This patch fixes an issue we saw recently in our sct tests:
```
INFO | scylla[1579]: [shard 0] init - Shutting down gossiping
INFO | scylla[1579]: [shard 0] gossip - gossip is already stopped
INFO | scylla[1579]: [shard 0] init - Shutting down gossiping was successful
...
ERR | scylla[1579]: [shard 0] init - Startup failed: seastar::rpc::timeout_error (rpc call timed out)
```
Fixes#8187
Closes#8213
We already have a test which verifies that DynamoDB and Alternator
do not allow an index in an attribute path - like a[0].b - to be
a value reference - a[:xyz].b. We forgot to verify that the index
also can't be a name reference - a[#xyz].b is a syntax error. So here
we add a test which confirms that this is indeed the case - DynamoDB
doesn't allow it, and neither does Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210219123310.1240271-1-nyh@scylladb.com>
On the Ubuntu 20.04 AMI, scylla_raid_setup --raiddev /dev/md0 causes
'/dev/md0 is already using' (issue #7627).
So we merged a patch to find a free mdX (587b909).
However, looking into /proc/mdstat on the AMI, it actually says no active md device is available:
ubuntu@ip-10-0-0-43:~$ cat /proc/mdstat
Personalities :
unused devices: <none>
We currently decide mdX is used when os.path.exists('/sys/block/mdX/md/array_state') == True,
but according to the kernel doc, the file may be available even when the array is STOPPED:
clear
No devices, no size, no level
Writing is equivalent to STOP_ARRAY ioctl
https://www.kernel.org/doc/html/v4.15/admin-guide/md.html
So we should also check array_state != 'clear', not just array_state
existence.
Fixes#8219
Closes#8220
has_monotonic_positions() wants to check for a greater-than-or-equal-to
relation, but actually tests for not-equal, since it treats a
trichotomic comparator as a less-than comparator. This is clearly seen
in the BOOST_FAIL message just below.
Fix by aligning the test with the intended invariant. Luckily, the tests
still pass.
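A self-contained illustration of the bug class, assuming nothing about the actual test code:
```
#include <cassert>

// A trichotomic comparator returns negative, zero or positive.
int tri_compare(int a, int b) { return a - b; }

int main() {
    int a = 1, b = 2;
    bool not_equal = tri_compare(a, b) != 0; // what the buggy check tested
    bool ge = tri_compare(a, b) >= 0;        // the intended invariant
    assert(not_equal && !ge);                // the two disagree for a < b
}
```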
Ref #1449.
Closes#8222
On restart the view schemas are loaded and might contain old
views with an unmarked computed column. We already have code to
update the schema, but before we do it we load the view as is. This
is not desired since once registered, this view version can be used
for writes which is forbidden since we will spot a none computed
column which is in the view's primary key but not in the base table
at all. To solve this, in addition to altering the persistent schema,
we fix the view's loaded schema in place. This is safe since computed
column is just involved in generating a value for this column when
creating a view update so the effect of this manipulation stays
internal.
The second stage of the in-place fixing is to persist the
changes made, in particular to the `computed_columns` table, so the
view is ready for the next node restart.
Up until now, the global_schema_ptr object was a crack
through which a view schema with an uninitialized base
reference could sneak. Even if the schema itself contained a
base reference, the base schema didn't carry over to shards
different from the shard on which the global_schema_ptr was
created.
Since once the schema is in the registry it might be used for
everything (reads and writes), we also need to make sure that
global schemas for incomplete view schemas will not be created.
Newly created view schemas don't always have their base info;
this is bad since such schemas support neither reads nor writes.
This leaves us vulnerable to a race condition where there is
an attempt to use such a schema for a read or write. Here we initialize
the base reference and also reconfigure the view to conform to the
new computed column type, which makes it usable for writes and not only
reads. We do it for views created in the migration manager following
announcements, and also for copied schemas.
We extract the logic for fixing the view schema into its own
routine, as we will need to use it in more places in the code.
This makes 'maybe_update_legacy_secondary_index_mv_schema' redundant, since
it becomes a two-liner wrapper for this logic. We also
remove it here and replace the call to it with the equivalent code.
The current fs.aio-max-nr value, cpu_count() * 11026, is exactly the amount
scylla uses; if other apps on the environment also try to use aio, the aio
slots will run out.
So increase the value by 65536 for other apps.
Related #8133
Closes#8228
Fixes an sstableloader bug where we quoted twice column names that
had to be quoted, and therefore failed on such tables - and in particular
Alternator tables which always have a column called ":attrs".
Fixes#8229
* tools/java 142f517a23...c5d9e8513e (1):
> sstableloader: Only escape column names once
034cb81323 and 0f0c3be disallowed reverse partition-range scans based on
the observation that the CQL frontend disallows them, assuming that
other client APIs also disallow them. As it turns out this is not true
and there is at least one client API (Thrift) which does allow reverse
range scans. So re-enable them.
Fixes: #8211
Tests: unit(release), dtest(thrift_tests.py)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210304142249.164247-1-bdenes@scylladb.com>
Before this patch, if Scylla crashes during some test in cql-pytest, all
tests after it will fail because they can't connect to Scylla - and we can
get a report on hundreds of failures without a clear sign of where the real
problem was.
This patch introduces an autouse fixture (i.e., a fixture automatically
used by every test) which tries to run a do-nothing CQL command after each
test. If this CQL command fails, we conclude that Scylla crashed and
report the test in which this happened - and exit pytest instead of failing
a hundred more tests.
Fixes#8080
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210304132804.1527977-1-nyh@scylladb.com>
In a scenario where a node is running out of disk space, which is a common
cause of cluster expansion, it's very important to clean up the smallest
files first to increase the chances of success when the biggest files are
reached down the road. That's possible given that cleanup operates on a
single file at a time, and that the smaller the file the smaller the
space requirement.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210303165520.55563-1-raphaelsc@scylladb.com>
This function currently eagerly decrements `_size`, before `func()` is
invoked. If `func()` throws, the consumption fails but the size remains
decremented. If this happens right at the last element in the row,
`row::empty()` will incorrectly return `true`, even though there is
still one cell left in it. Move the decrement after the `func()`
invocation to avoid this, so that we only decrement if the consumption
was successful.
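A hedged sketch of the fix, with a simplified, free-standing signature:
```
// Simplified _size bookkeeping for a row's cells (names assumed).
template <typename Func>
void consume_cell(Func&& func, cell_entry& e, size_t& _size) {
    func(e);   // may throw; _size is untouched on failure
    --_size;   // reached only after successful consumption
}
```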
Fixes: #8154
Tests: unit(mutation_test:release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210304125318.143323-1-bdenes@scylladb.com>
A recent change to the commitlog (4082f57) caused its configurable size limit to
be strictly enforced - after reaching the limit, new segments wouldn't be
allocated until some of the previous segments are freed. This flow can work for
the regular commitlog, however the hints commitlog does not delete the segments
itself - instead, hints manager recreates its commitlog every 10 seconds, picks
up segments left by the previous instance and deletes each segment manually only
after all hints are sent out from a segment.
Because of the non-standard flow, it is possible that the hints commitlog fills
up and stops accepting more hints. Hints manager uses a relatively low limit for
each commitlog instance (128MB divided by shard count), so it's not hard to fill
it up. What's worse, hints manager tries to acquire file_update_mutex in
exclusive mode before re-creating the commitlog, while hints waiting to be
written acquire this lock in shared mode - which causes hints flushing to
completely deadlock, with no more hints admitted to the commitlog. The queue of
hints waiting to be admitted grows very quickly and soon all writes which could
result in a hint being generated are rejected with OverloadedException.
To solve this problem, it is now possible to bring back the soft disk space
limit by setting a flag in commitlog's configuration.
Tests:
- unit(dev)
- wrote hints for 15 minutes in order to see if it gets stuck again
Fixes#8137
Closes#8206
* github.com:scylladb/scylla:
hints_manager: don't use commitlog hard space limit
commitlog: add an option to allow going over size limit
tri_compare() returns an int, which is dangerous as a tri_compare can
be misused where a less_compare is expected. To prevent such misuse,
convert the interval<> template to accept comparators that return
std::strong_ordering, and then convert dht::token's comparator to do
the same.
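A sketch of the comparator style the template now accepts; the declarations are illustrative:
```
#include <compare>

namespace dht { class token; }

struct token_tri_compare {
    std::strong_ordering operator()(const dht::token& a,
                                    const dht::token& b) const;
};
// A std::strong_ordering result cannot be used where a bool-returning
// less-comparator is expected, so the misuse now fails to compile instead
// of silently testing "not equal".
```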
Ref #1449.
Closes#8181
* github.com:scylladb/scylla:
dht: convert token tri_compare to std::strong_ordering
interval: support C++20 three-way comparisons
This pull request removes seastar namespace imports from the header
files. There are some additional cleanups to make that easier and to
remove some commented out code.
Closes#8202
* github.com:scylladb/scylla:
redis: Remove seastar namespace import from query_processor.hh
redis: Switch to seastar::sharded<> in query_procesor.hh
redis: Remove seastar namespace import from query_utils.hh
redis: Remove seastar namespace import from reply.hh
redis: Remove commented out code from options.hh
redis: Remove seastar namespace import from options.hh
redis: Remove seastar namespace import from service.hh
redis: Switch to seastar::sharded<> in service.{hh,cc}
redis: Remove unneeded include from keyspace_utils.hh
redis: Remove seastar namespace import from keyspace_utils.hh
redis: Remove seastar namespace import from command_factory.hh
redis: Fix include path in command_factory.hh
redis: Remove unneeded includes from command_factory.hh
Refs: #8012
Fixes: #8210
With the update to CDC generation management, the way we retrieve and process these changed.
One very bad bug slipped through though; the code for getting versioned streams did not take into
account the late-in-pr change to make clustering of CDC gen timestamps reversed. So our alternator
shard info became quite broken, leading to more or less no data, depending on when generations
changed w.r.t. the data.
Also, the way we track the above timestamps changed, so we should utilize this for our end-of-iterator check.
Closes#8209
* github.com:scylladb/scylla:
alternator::streams: Use better method for generation timestamp
system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp
system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range
With the new scheme for cdc generation management, one of the last
changes was to make the time ordering of the stream timestamps reversed.
However, cdc_get_versioned_streams forgot to take this into account
when sifting out timestamp ranges for stream retrieval (based on
low mark).
Fixed by doing reverse iteration.
Test log consistency after apply_snapshot() is called.
Ensure log::last_term(), log::last_conf_index() and log::size()
work as expected.
Misc cleanups.
* scylla-dev.git/raft-confchange-test-v4:
raft: fix spelling
raft: add a unit test for voting
raft: do not account for the same vote twice
raft: remove fsm::set_configuration()
raft: consistently use configuration from the log
raft: add ostream serialization for enum vote_result
raft: advance commit index right after leaving joint configuration
raft: add tracker test
raft: tidy up follower_progress API
raft: update raft::log::apply_snapshot() assert
raft: add a unit test for raft::log
raft: rename log::non_snapshoted_length() to log::in_memory_size()
raft: inline raft::log::truncate_tail()
raft: ignore AppendEntries RPC with a very old term
raft: remove log::start_idx()
raft: return a correct last term on an empty log
raft: do not use raft::log::start_idx() outside raft::log()
raft: rename progress.hh to tracker.hh
raft: extend single_node_is_quiet test
This reverts commit f94f70cda8, reversing
changes made to 5206a97915.
The merged version of the series was not the latest one. Revert prior to
merging the latest one.
Refs #7961
Fixes#8014
The "untyped_result_set" object was created for small, internal access to cql-stored metadata.
It is nowadays used for rather more than that (cdc).
This has the potential of mixing badly with the fact that the type does deep copying of data
and linearizes all (not to mention handles multiple rows rather inefficiently).
Instead of doing a deep copy of input, we keep assume ownership and build
rows of the views therein, potentially retaining fragmented data as-is
avoiding premature linearization.
Note that this is not all sugar and flowers though. Any data access will
by nature be more expensive, and the view collections we create are
potentially just as expensive as copying for small cells.
Otoh, it allows writing code using this that avoids data copying,
depending on destination.
v2:
* Fixed wrong collection reserved in visitor
* Changed row index from shared ptr to ref
* Moved typedef
* Removed non-existing constructors
* Added const ref to index build
* Fixed raft usage after rebase
v3:
* Changed shared_ptr to unique
Closes#8015
* github.com:scylladb/scylla:
untyped_result_set: Do not copy data from input store (retain fragmented views)
result_generator: make visitor callback args explicit optionals
listlike_partial_deserializing_iterator: expose templated collection routines
Previously, we had two tests demonstrating issue #7966. But since then,
our understanding of this issue has improved which resulted in issue #8203,
so this patch improves those tests and makes them reproduce the new issue.
Importantly, we now know that this problem is not specific to a full-table
scan, and also happens in a single-partition scan, so we fix the test to
demonstrate this (instead of the old test, which missed the problem so
the test passed).
Both tests pass on Cassandra, and fail on Scylla.
Refs #8203.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210302224020.1498868-1-nyh@scylladb.com>
When shedding requests (e.g. due to their size or number exceeding the
limits), errors were returned right after parsing their headers, which
resulted in their bodies lingering in the socket. The server always
expects a correct request header when reading from the socket after the
processing of a single request is finished, so shedding the requests
should also take care of draining their bodies from the socket.
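A hedged sketch of the shedding path; the stream and helper names are illustrative:
```
// Drain the oversized body so the next read starts at a frame header,
// then report the error on the request's own stream id.
co_await in.skip(frame.body_length);
co_await write_error(frame.stream_id, "request shed: too large or overloaded");
```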
Fixes#8193
Closes#8194
* github.com:scylladb/scylla:
cql-pytest: add a shedding test
transport: return error on correct stream during size shedding
transport: return error on correct stream during shedding
transport: skip the whole request if it is too large
transport: skip the whole request during shedding
This scylla-only test case tries to push a too-large request
to Scylla, and then retries with a smaller request, expecting
a success this time.
Refs #8193
Accidentally introduced in 9eed26ca3d, the condition can never be true
due to the code above it.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#8201
Like DynamoDB, Alternator rejects requests larger than some fixed maximum
size (16MB). We had a test for this feature - test_too_large_request,
but it was too blunt, and missed two issues:
Refs #8195
Refs #8196
So this patch adds two better tests that reproduce these two issues:
First, test_too_large_request_chunked verifies that an oversized request
is detected even if the body is sent with chunked encoding.
Second, both tests - test_too_large_request_chunked and
test_too_large_request_content_length - verify that the rather limited
(and arguably buggy) Python HTTP client is able to read the 413 status
code - and doesn't report some generic I/O error.
Both tests pass on DynamoDB, but fail on Alternator because of these two
open issues.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210302154555.1488812-1-nyh@scylladb.com>
The main goal of this patch is to add a reproducer for issue #7966, where
partition-range scan with filtering that begins with a long string of
non-matches aborts the query prematurely - but the same thing is fine with
a single-partition scan. The test, test_filtering_with_few_matches, is
marked as "xfail" because it still fails on Scylla. It passes on Cassandra.
I put a lot of effort into making this reproducer *fast* - the dev-build
test takes 0.4 seconds on my laptop. Earlier reproducers for the same
problem took as much as 30 seconds, but 0.4 seconds turns this test into
a viable regression test.
We also add a test, test_filter_on_unset, which reproduces issue #6295 (or
its duplicate #8122); that issue was already solved, so this test passes.
Refs #6295
Refs #7966
Refs #8122
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210301170451.1470824-1-nyh@scylladb.com>
Refs #8093
Refs /scylladb/scylla-tools-java#218
Adds a keyword that can preface value tuples in (a, b, c) > (1, 2, 3)
expressions, forcing the restriction to bypass column sort-order
treatment and instead create the raw clustering-key bounds accordingly.
This is a very limited and simple version, but since we only need
to cover exactly the above syntax, it should be sufficient.
v2:
* Add small cql test
v3:
* Added comment in multi_column_restriction::slice, on what "mode" means and is for
* Added small document of our internal CQL extension keywords, including this.
v4:
* Added a few more cases to tests to verify multi-column restrictions
* Reworded docs a bit
v5:
* Fixed copy-paste error in comment
v6:
* Added negative (error) test cases
v7:
* Added check + reject of trying to combine SCYLLA_CLUST... slice and
normal one
Closes#8094
The Redis transport layer seems to have originated as a copy-paste of
the CQL transport layer. This pull request removes a bunch of unused
and commented out bits of code, and also does some minor cleanups
like organizing includes, to make the code more readable.
Closes#8198
* github.com:scylladb/scylla:
redis: Remove unused to_bytes_view() function from server.cc
redis: Remove unused tracing_request_type enum
redis: Remove unneeded connection friend declaration
redis: Remove unused process_request_executor friend declaration
redis: Remove unused _request_cpu class member
redis: Remove commented out code from server.hh
redis: Remove duplicate request.hh include
redis: Remove unused db::config forward declaration
redis: Remove unused fmt_visitor forward declaration
redis: Organize includes in server.{cc,hh}
redis: Switch to seastar::sharded<>
redis: Remove redundant access modifiers from server.hh
This commit disables the hard space limit applied by commitlogs created
to store hints. The hard limit causes problems for hints because the hints
manager uses small-sized commitlogs (128MB, currently) to store them. Instead of
letting the commitlog delete the segments itself, the hints manager recreates the
commitlog every 10 seconds and manually deletes old segments after all
hints are sent out from them.
If the 128MB limit is reached, the hints manager will get stuck: a
future which puts a hint into the commitlog holds a shared lock, and commitlog
recreation needs to take an exclusive lock, which results in a deadlock.
No more hints will be admitted, and eventually we will start rejecting
writes with OverloadedException due to too many hints waiting to be
admitted to the commitlog.
By disabling the hard limit for hints commitlog, the old behavior is
brought back - commitlog becomes more conservative with the space used
after going over its size limit, but does not block until some of its
segments are deleted.
When a request is shed due to being too large, its response
was sent with stream id 0 instead of the stream id that matches
the communication lane. That in turn confused the client,
which is no longer the case.
When a request is shed due to exceeding the max number of concurrent
requests, its response was sent with stream id 0 instead of
the stream id that matches the communication lane.
That in turn confused the client, which is no longer the case.
On the RAID prompt, we can type a disk list like this:
/dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1
However, if the list contains spaces, it doesn't work:
/dev/sda1, /dev/sdb1, /dev/sdc1, /dev/sdd1
because the script mistakenly recognizes the space as part of a device path.
So we need to strip() each item of the input.
Fixes#8174
Closes#8190
When a request is shed due to being too large, only the header
was actually read, and the body was still stuck in the socket
- and would be read in the next iteration, which would expect
to actually read a new request header.
Instead, the whole message is now skipped, so that a new request
can be correctly read and parsed.
Fixes#8193
When a request is shed due to exceeding the number of max concurrent
requests, only its header was actually read, and the body was still
stuck in the socket - and would be read in the next iteration,
which would expect to actually read a new request header.
Instead, the whole message is now skipped, so that a new request
can be correctly read and parsed.
Refs #8193
"
Currently range scans build their results on the replica in the
`reconcilable_result` format, that -- as its name suggests -- is
normally used for reconciliation (read repair). As such this result
format is quite inefficient for normal queries: it contains all columns
and all tombstones in the requested range. These are all unnecessary for
normal queries which only want live data and only those columns that are
requested by the user.
Furthermore, as the coordinator works in terms of `query::result` for
normal queries anyway, this intermediate result has to be converted to
the final `query::result` format adding an unnecessary intermediate
conversion step.
This series gets rid of this problem by introducing
`query_data_on_all_shards()`, a variant of
`query_mutations_on_all_shards()` that builds `query::result` directly.
Reverse queries still use the old intermediate method behind the scenes.
Fixes#8061
Refs #7434
Tests: unit(release, debug)
"
* 'range-scan-data-variant/v5-rebased' of https://github.com/denesb/scylla:
cql_query_test: add unit test for the more efficient range scan result format
test/cql_test_env: do_with_cql_test_env(): add thread_attributes parameter
cql_query_test: test_query_limit: clean up scheduling groups
storage_proxy: use query_data_on_all_shards() for data range scan queries
query: partition_slice: add range_scan_data_variant option
gms: add RANGE_SCAN_DATA_VARIANT cluster feature
multishard_mutation_query: query_mutations_on_all_shards(): refuse reverse queries
multishard_mutation_query: add query_data_on_all_shards()
mutation_partition.cc: fix indentation
query_result_builder: make it a public type
multishard_mutation_query: generalize query code w.r.t. the result builder used
multishard_mutation_query: query_mutations_on_all_shards(): extract logic into new method
multishard_mutation_query: query_mutations_on_all_shards(): convert to coroutine
multishar_mutation_query: do_query_mutations(): convert to coroutine
multishard_mutation_query: read_page(): convert to coroutine
multishard_mutation_query: extract page reading logic into separate method
The most user-visible aspect of this change is range scans which select
a small subset of the columns. These queries work as the user expects
them to work: unselected columns are not included in determining the
size of the result (or that of the page). This is the aspect this test
is checking for. While at it, we also test single-partition queries.
Currently range scans build their result using the `reconcilable_result`
format and then convert it to `query::result`. This is inefficient for
multiple reasons:
1) it introduces an additional intermediate result format and a
subsequent conversion to the final one;
2) the reconcilable result format was designed for reconciliation so it
contains all data, including columns unselected by the query, dead
rows and tombstones, which takes much more memory to build;
There is no reason to go through all this trouble; if there ever was one
in the past, it doesn't stand anymore. So switch to the newly introduced
`query_data_on_all_shards()` when doing normal data range scans, but
only if all the nodes in the cluster support it, to avoid artificial
differences in page sizes (due to how reconcilable result and
query::result calculate result sizes) and the consequent false-positive
read repair.
The transition to this new more efficient method is coordinated by a
cluster feature and whether to use it is decided by the coordinator
(instead of each replica individually). This is to avoid needless
reconciliation due to the different page sizes the two formats will
produce.
Switching to the data variant of range scans has to be coordinated by
the coordinator, to avoid replicas noticing the availability of the
respective feature at different times, resulting in some using the
mutation variant and some using the data variant.
So the plan is that it will be the coordinator's job to check the
cluster feature and set the option in the partition slice which will
tell the replicas to use the data variant for the query.
To control the transition to the data variant of range scans. As there
is a difference in how the data and mutation variants calculate page
sizes, the transition to the former has to happen in a controlled
manner, when all nodes in the cluster support it, to avoid artificial
differences in page content and subsequently triggering false-positive
read repair.
Refuse reverse queries just like in the new
`query_data_on_all_shards()`. The reason is the same, reverse range
scans are not supported on the client API level and hence they are
underspecified and more importantly: not tested.
A data query variant of the existing `query_mutations_on_all_shards()`.
This variant builds a `query::result`, instead of `reconcilable_result`.
This is actually the result format coordinators want when executing
range scans, the reason for using the reconcilable result for these
queries is historic, and it just introduces an unnecessary intermediate
format.
This new method allows the storage proxy to skip this intermediate
format and the associated conversion to `query::result`, just like we do
for single partition queries.
Reverse queries are refused because they are not supported on the client
API (CQL) level anyway and hence it is unspecified how they should work
and more importantly: they are not tested.
We want to add support for building `query::result` directly, reusing
the code path we currently use to build reconcilable results.
So templatize said code path on the result builder used. Since the
different result builders don't have a source-level compatible interface,
an adaptor class is used.
In the next patches we are going to generalize the query logic w.r.t.
the result builder used, so query_mutations_on_all_shards() will be just
a facade parametrizing the actual query code with the right result
builder.
So that it can be modified while being walked to dispatch
subscribed event notifications.
In #8143, there is a race between scylla shutdown and
notify_down(), causing use-after-free of cql_server.
Using an atomic vector instead and futurizing
unregister_subscriber allows deleting from _lifecycle_subscribers
while it is being walked using atomic_vector::for_each.
Fixes#8143
Test: unit(release)
DTest:
update_cluster_layout_tests:TestUpdateClusterLayout.add_node_with_large_partition4_test(release)
materialized_views_test.py:TestMaterializedViews.double_node_failure_during_mv_insert_4_nodes_test(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-2-bhalevy@scylladb.com>
Move unregister_subscriber from the destructor to stop,
as preparation for moving storage_service lifecycle_subscribers
to atomic_vector and futurizing unregister_subscriber.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-1-bhalevy@scylladb.com>
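As a rough standalone sketch of the pattern (not Scylla's actual utils::atomic_vector, which also serializes modifications through futures), removal during iteration can be made safe by nulling slots and deferring compaction:
```
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: removal only nulls out the slot while a walk is in
// progress; the vector is compacted once no walk remains.
template <typename T>
class walk_safe_vector {
    std::vector<T*> _items;
    int _walkers = 0;
public:
    void add(T* t) { _items.push_back(t); }
    void remove(T* t) {
        std::replace(_items.begin(), _items.end(), t, static_cast<T*>(nullptr));
        if (_walkers == 0) { compact(); }
    }
    template <typename Func>
    void for_each(Func f) {
        ++_walkers;
        for (size_t i = 0; i < _items.size(); ++i) {  // index-based: survives remove()
            if (_items[i]) { f(*_items[i]); }
        }
        if (--_walkers == 0) { compact(); }
    }
private:
    void compact() {
        _items.erase(std::remove(_items.begin(), _items.end(), nullptr), _items.end());
    }
};
```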
* seastar 803e790598...ea5e529f30 (3):
> Merge "Teach io_tester to generate YAML output" from Pavel E
> bitset: set_range: mark constructor constexpr
> Update dpdk submodule
This series is extracted from #7913 as it may prove useful to other series as well, and #7913 might take a while until it's merged, given that it also depends on other unmerged pull requests.
The idea of this series is to move timeouts to the client state, which will allow changing them independently for each session - e.g. by setting per-service-level timeouts and initializing the values from attached service levels (see #7867).
Closes#8140
* github.com:scylladb/scylla:
treewide: remove timeout config from query options
cql3: use timeout config from client state instead of query options
cql3: use timeout config from client state instead of query options
cql3: use timeout config from client state instead of query options
service: add timeout config to client state
Due to a regression introduced by 463d0ab, regular compaction can compact in
parallel an sstable being compacted by cleanup, scrub or upgrade.
This redundancy wastes resources, increases write amplification, extends
the operation time, etc.
That's a potential source of data resurrection, because the no-longer-owned
data from an sstable being compacted by both cleanup and regular compaction
will still exist in the node afterwards, so resurrection can happen if the
node regains ownership.
Fixes#8155.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210225172641.787022-1-raphaelsc@scylladb.com>
This will prevent accumulation of unnecessary dummy entries.
A single-partition populating scan with clustering key restrictions
will insert dummy entries positioned at the boundaries of the
clustering query range to mark the newly populated range as
continuous.
Those dummy entries may accumulate with time, increasing the cost of
the scan, which needs to walk over them.
In some workloads we could prevent this. If a populating query
overlaps with existing dummy entries, we could erase the old dummy entry,
since it will not be needed: it will fall inside a broader continuous
range. This will be the case for time-series workloads, which scan with
a decreasing (newest) lower bound.
Refs #8153.
_last_row is now updated atomically with _next_row. Before, _last_row
was moved first. If an exception was thrown and the section was retried,
this could cause the wrong entry to be removed (new next instead of
old last) by the new algorithm. I don't think this was causing
problems before this patch.
The problem is not solved for all the cases. After this patch, we
remove dummies only when there is a single MVCC version. We could
patch apply_monotonically() to also do it, so that dummies which are
inside continuous ranges are eventually removed, but this is left for
later.
perf_row_cache_reads output after that patch shows that the second
scan touches no dummies:
$ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 265320
Scanning
read: 142.621613 [ms], preemption: {count: 639, 99%: 0.545791 [ms], max: 0.526929 [ms]}, cache: 0/0 [MB]
read: 0.023197 [ms], preemption: {count: 1, 99%: 0.035425 [ms], max: 0.032736 [ms]}, cache: 0/0 [MB]
Message-Id: <20210226172801.800264-1-tgrabiec@scylladb.com>
Evicts objects from caches which reflect sstable content, like the row
cache. In the future, it will also drop the page cache
and sstable index caches.
Unlike lsa/compact, it doesn't cause reactor stalls.
The old lsa/compact call invokes memory reclamation, which is
non-preemptible. It also compacts LSA segments, so it does more
work. Some use cases don't need to compact LSA segments; they just want the
row cache to be wiped.
Message-Id: <20210301120211.36195-1-tgrabiec@scylladb.com>
This commit adds an option which, when turned on, allows the commitlog
to go over configured size limit. After reaching the limit, commitlog
will be more conservative with its usage of the disk space - for
example, it won't increase the segment reserve size or reuse recycled
segments. Most importantly, it won't block writes until the space used
by the commitlog goes down.
This change is necessary for hinted handoff to keep its current
behavior. Hinted handoff does not let the commitlog free segments
itself - instead, it re-creates it every 10 seconds and manually deletes
segments after all hints are sent from a segment.
Current aio-max-nr is set up statically to 1048576 in
/etc/sysctl.d/99-scylla-aio.conf.
This is sufficient for most use cases, but falls short on larger machines
such as i3en.24xlarge on AWS that has 96 vCPUs.
We need to tune the parameter based on the number of cpus, instead of
using a static setting.
Fixes#8133
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes#8188
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that keeps all its
versions that are referenced somewhere and provides a way of getting
a reference to an immutable version of the set.
Each sstable in the set is associated with the versions it is alive in,
and is removed when none of those versions have references anymore.
To avoid copying, the object holding all sstables in the set version is
changed to a new structure, sstable_list, which was previously an alias
for std::unordered_set<shared_sstable>, and which implements most of the
methods of an unordered_set, but its iterator uses the actual set with
all sstables from all referenced versions and iterates over those
sstables that belong to the captured version.
The methods that modify the set's contents give the strong exception guarantee
by first inserting new sstables into its containers, and erasing them in
case an exception is caught.
To release shared_sstables as soon as possible (i.e. when all references
to versions that contain them die), each time a version is removed, all
sstables that were referenced exclusively by this version are erased. We
are able to find these sstables efficiently by storing, for each version,
all sstables that were added and erased in it, and, when a version is
removed, merging it with the next one. When a version that adds an sstable
gets merged with a version that removes it, this sstable is erased.
Fixes#2622
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes#8111
* github.com:scylladb/scylla:
sstables: add test for checking the latency of updating the sstable_set in a table
sstables: move column_family_test class from test/boost to test/lib
sstables: use fast copying of the sstable_set instead of rebuilding it
sstables: replace the sstable_set with a versioned structure
sstables: remove potential ub
sstables: make sstable_set constructor less error-prone
"
We have recently seen out-of-order partitions getting into sstables
causing major disruption later on. Given the damage caused, it was again
raised that we should enable partition key monotonicity validation
unconditionally in the sstable write path. This was also raised in the
past but dismissed as key validation was suspected (but not measured) to
add considerable per-fragment overhead. One of the problems was that the
key monotonicity validation was all or nothing. It either validated all
(clustering and partition) key monotonicity or none of it.
This series takes a second look at this and solves the all-or-nothing
problem by making the configuration of the key monotonicity check more
fine grained, allowing for enabling just token monotonicity validation
separately, then enables it unconditionally.
Refs: #7623
Tests: unit(release)
"
* 'sstable-writer-validate-partition-keys-unconditionally/v3' of https://github.com/denesb/scylla:
sstables: enable token monotonicity validation by default
mutation_fragment_stream_validator: add token validation level
mutation_fragment_stream_validating_filter: make validation levels more fine-grained
This patch adds the compaction id to the get_compaction structure.
While the field was supported, it was not used and until now wasn't needed.
After this patch a call to curl -X GET 'http://localhost:10000/compaction_manager/compactions'
will include the compaction id.
Relates to #7927
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes#8186
The error now contains information about the view table that failed,
as well as base and view tokens.
Example:
view - Error applying view update to 127.0.0.1 (view: ks.testme_v_idx_index,
base token: -4069959284402364209, view token: -3248873570005575792): std::runtime_error (manually injected error)
Fixes#8177
Closes#8178
Partition key order validation in data written to sstables can be very
disruptive. All our components in the storage layers assume that
partitions are in order, which means that reading out-of-order
partitions triggers undefined behaviour. Computer scientists often joke
that undefined behaviour can erase your hard drive and in this case the
damage done by undefined behaviour caused by out-of-order partitions is
very close to that. The corruption is known to mutate causing crashes,
corrupting more data and even loose data. For this reason it is
imperative that out-of-order partitions cannot get into sstables. This
patch enables token monotonicity validation unconditionally in
the sstable writer. As partition key monotonicity checks involve a key
copy per partition, which might have an impact on the performance, we do
the next best thing instead and enable only token monotonicity
validation.
In some cases the full-blown partition key validation and especially the
associated key copy per partition might be deemed too costly. As a next
best thing this patch adds a token only validation, which should cover
99% (number pulled out of my sleeve) of the cases. Let's hope no one
gets unlucky.
Currently key order validation for the mutation fragment stream
validating filter is all or nothing. Either no keys (partition or
clustering) are validated or all of them. As we suspect that clustering
key order validation would add a significant overhead, this discourages
turning key validation on, which means we miss out on partition key
monotonicity validation which has a much more moderate cost.
This patch makes this configurable in a more fine-grained fashion,
providing separate levels for partition and clustering key monotonicity
validation.
As the choice for the default validation level is not as clear-cut as
before, the default value for the validation level is removed in the
validating filter's constructor.
Change token's tri_compare functions to return std::strong_ordering,
which is not convertible to bool and therefore not susceptible to
being misused where a less-comparator is expected.
Two of the users (ring_position and decorated_key) have to undo
the conversion, since they still return int. A follow up will
convert them too.
Ref #1449.
Allow the tri-comparator input to range functions to return
std::strong_ordering, e.g. the result of operator<=>. An int
input is still allowed, and coerced to std::strong_ordering by
tri-comparing it against zero. Once all users are converted, this
will be disallowed.
The clever code that performs boundary comparisons unfortunately
has to be dumbed down to conditionals. A helper
require_ordering_and_on_equal_return() is introduced that accepts
a comparison result between bound values, an expected comparison
result, and what to return if the bound value matches (this depends
on whether individual bounds are exclusive or inclusive, on
whether the bounds are start bounds or end bounds, and on the
sense of the comparison).
Unfortunately, the code is somewhat pessimized, and there is no
way around it, as the enum underlying std::strong_ordering
is hidden.
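A minimal sketch of the coercion described above; `to_ordering` is an illustrative name:
```
#include <compare>

// Coerce a legacy int tri-comparison result into std::strong_ordering by
// tri-comparing it against zero, as described above.
constexpr std::strong_ordering to_ordering(int c) {
    return c <=> 0;
}

// Unlike int, std::strong_ordering is not convertible to bool, so code
// that mistakenly passes a tri-comparator where a less-predicate is
// expected now fails to compile:
//   if (to_ordering(cmp(a, b))) { ... }   // error: no conversion to bool
static_assert(to_ordering(-1) == std::strong_ordering::less);
static_assert(to_ordering(0) == std::strong_ordering::equal);
static_assert(to_ordering(7) == std::strong_ordering::greater);
```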
fill_buffer() will keep scanning until _lower_bound_changed is true,
even if preemption is signaled, so that the reader makes forward
progress.
Before the patch, we did not update _lower_bound on touching a dummy
entry. The read will not respect preemption until we hit a non-dummy
row. If there are a lot of dummy rows, that can cause reactor stalls.
Fix that by updating _lower_bound on dummy entries as well.
Refs #8153.
Tested with perf_row_cache_reads:
```
$ build/release/test/perf/perf_row_cache_reads -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB]
read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB]
```
Notice that max preemption latency is low in the second "read:" line.
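A toy standalone model of the loop shape after the change; the names are illustrative, not the actual row_cache code:
```
#include <vector>

struct entry { long position; bool dummy; };

struct reader_state {
    long lower_bound = 0;
    bool lower_bound_changed = false;
    std::vector<long> buffer;
};

// Returns true if the scan stopped to yield (i.e. made recorded progress).
bool fill_buffer(reader_state& st, const std::vector<entry>& entries,
                 bool (*need_preempt)()) {
    for (const auto& e : entries) {
        if (!e.dummy) {
            st.buffer.push_back(e.position);
        }
        // The fix: advance the lower bound on dummy entries too, so the
        // preemption check below can fire inside a long run of dummies.
        st.lower_bound = e.position + 1;
        st.lower_bound_changed = true;
        if (need_preempt() && st.lower_bound_changed) {
            return true;   // safe to yield; resumption restarts at lower_bound
        }
    }
    return false;
}
```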
Closes#8167
* github.com:scylladb/scylla:
row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows
tests: perf: Introduce perf_row_cache_reads
row_cache: Add metric for dummy row hits
The optimal path of said method mistakenly captures `pos` (a local
variable) in its reader factory method and passes a temporary range
implicitly constructed from said `pos` as the range parameter to the
sstable reader. This leads to the sstable reader using a dangling
range, and results in queries returning no results. This patch
fixes the bug and adds a unit test to cover this code path.
Fixes#8138.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226143111.104591-2-bdenes@scylladb.com>
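The bug belongs to a well-known C++ lifetime class; here is a standalone illustration (not the actual Scylla code) of capturing a local and handing out a non-owning range built from it:
```
#include <functional>
#include <iostream>

struct range { const int* start; };   // non-owning, like a range over views

std::function<void()> make_reader_factory() {
    int pos = 42;                      // local "position"
    range r{&pos};                     // range built from the local
    return [r] {                       // r outlives pos: dangling when invoked
        std::cout << *r.start << "\n"; // undefined behaviour
    };
}

// The fix is to make the deferred factory own the position it needs,
// building the range only inside the closure.
std::function<void()> make_reader_factory_fixed() {
    int pos = 42;
    return [pos] {                     // own a copy of the position
        range r{&pos};                 // now points into the closure itself
        std::cout << *r.start << "\n";
    };
}

int main() {
    auto good = make_reader_factory_fixed();
    good();                            // prints 42
    // make_reader_factory()();        // would read a dead stack frame
}
```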
The `result_memory_accounter` terminates a query if it reaches either
the global or shard-local limit. This used to be so only for paged
queries, unpaged ones could grow indefinitely (until the node OOM'd).
This was changed in fea5067 which enforces the local limit on unpaged
queries as well, by aborting them. However a loophole remained in the
code: `result_memory_accounter::check_and_update()` has another stop
condition, besides `check_local_limit()`, it also checks the global
limit. This stop condition was not updated to enforce itself on unpaged
queries by aborting them; instead it silently terminated them, causing
them to return less data than requested. This was masked by most queries
reaching the local limit first.
This patch fixes this by aborting unpaged mutation queries when they hit
the global limit.
Fixes: #8162
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226102202.51275-1-bdenes@scylladb.com>
The multishard combining reader currently assumes that all shards have
data for the read range. This however is not always true and in extreme
cases (like reading a single token) it can lead to huge read
amplification. Avoid this by not pushing shards to
`_shard_selection_min_heap` if the first token they are expected to
produce falls outside of the read range. Also change the read ahead
algorithm to select the shards from `_shard_selection_min_heap`, instead
of walking them in shard order. This was wrong in two ways:
* Shards may be ordered differently with respect to the first partition
they will produce; reading ahead on the next shard in shard order
might not bring in data on the next shard the read will continue on.
Shard order is only correct when starting a new range and shards are
iterated over in the order they own tokens according to the sharding
algorithm.
* Shards that may not have data relevant to the read range are also
considered for read ahead.
After this patch, the multishard reader will only read from shards that
have data relevant to the read range, both in the case of normal reads
and also for read-ahead.
Fixes: #8161
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210226132536.85438-1-bdenes@scylladb.com>
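An illustrative sketch of the selection logic (not the actual multishard combining reader); here `first_token` stands for the first token a shard is expected to produce at or after the start of the read range:
```
#include <cstdint>
#include <queue>
#include <vector>

struct shard_entry {
    unsigned shard;
    int64_t first_token;
    bool operator>(const shard_entry& o) const { return first_token > o.first_token; }
};

using shard_min_heap = std::priority_queue<shard_entry,
        std::vector<shard_entry>, std::greater<shard_entry>>;

shard_min_heap select_shards(const std::vector<shard_entry>& shards, int64_t range_end) {
    shard_min_heap heap;
    for (const auto& s : shards) {
        if (s.first_token <= range_end) {
            heap.push(s);   // shard has data relevant to the range
        }
        // otherwise the shard is never read from, so a narrow range
        // (e.g. a single token) no longer touches every shard
    }
    return heap;   // both reading and read-ahead pop from this heap
}
```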
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.
This PR introduces a new service crafted to handle CDC generation
management: listening and reacting to generation changes in the cluster.
We plug the service in, initializing it in main and test code,
passing a reference to storage_service and having storage_service call
the service (using the `after_join` method): the service only starts
doing its job after the node joins the token ring (either on bootstrap
or restart).
Some parts of generation management still remain in storage_service:
the bootstrap procedure, which happens inside storage_service,
must also do some initialization regarding CDC generations,
for example: on restart it must retrieve the latest known generation
timestamp from disk; on bootstrap it must create a new generation
and announce it to other nodes. The order of these operations w.r.t
the rest of the startup procedure is important, hence the startup
procedure is the only right place for them. We may try decoupling
these services even more in follow-up PRs, but that requires a bit
of careful reasoning. What this PR does is pick the low-hanging fruit.
Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these handling functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.
This PR is a prerequisite to fixing #7985. The fact that all the CDC generation
management code was in storage_service is technical debt. It will be easier
to modify the management algorithms when they sit in their own module.
Tests: unit (dev) and cdc_tests.py dtest (dev), and local replication test using scylla-cdc-java
Closes#8172
* github.com:scylladb/scylla:
cdc: move (most of) CDC generation management code to the new service
cdc: coroutinize make_new_cdc_generation
cdc: coroutinize update_streams_description
cdc: introduce cdc::generation_service
main: move cdc_service initialization just prior to storage_service initialization
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.
Previous commits have introduced a new service for managing CDC
generations. This code moves most of the relevant code to this new
service.
However, some part still remains in storage_service: the bootstrap
procedure, which happens inside storage_service, must also do some
initialization regarding CDC generations, for example: on restart it
must retrieve the latest known generation timestamp from disk; on
bootstrap it must create a new generation and announce it to other
nodes. The order of these operations w.r.t the rest of the startup
procedure is important, hence the startup procedure is the only right
place for them.
Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.
Timeout config is now stored in each connection, so there's no point
in tracking it inside each query as well. This patch removes
timeout_config from query_options and follows up by removing now-unnecessary
parameters of many functions and constructors.
dh_installinit --name <service> is for force-installing debian/*.service
and debian/*.default files that do not match the package name.
And if we have subpackages, the packager is responsible for renaming
debian/*.service to debian/<subpackage>.*service.
However, we currently mistakenly run
dh_installinit --name scylla-node-exporter for
debian/scylla-node-exporter.service;
the packaging system tries to find the destination package for the .service
file, does not find a subpackage name on it, and so picks the first
subpackage ordered by name: scylla-conf.
To solve the issue, we just need to run dh_installinit without --name
when $product == 'scylla'.
Fixes#8163
Closes#8164
We currently deny running scylla_setup when umask != 0022.
To remove this limitation, run os.chmod(0o644) on every file we create,
to allow the scylla user to read it.
Note that perftune.yaml does not really need to be 0644, since perftune.py
runs as root, but we set it anyway to align its permissions with the other files.
Fixes#8049
Closes#8119
TWCS reshape was silently ignoring windows which contain at least
min_threshold sstables (which can happen with data segregation).
When resizing candidates, the size of multi_window was incorrectly used;
it is always empty in this path, which means candidates was always
cleared.
Fixes#8147.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210224125322.637128-1-raphaelsc@scylladb.com>
This series adds background reclaim to lsa, with the goal
that most large allocations can be satisfied from available
free memory, and reclaim work can be done from a preemptible
context.
If the workload has free cpu, then background reclaim will
utilize that free cpu, reducing latency for the main workload.
Otherwise, background reclaim will compete with the main
workload, but since that work needs to happen anyway,
throughput will not be reduced.
A unit test is added to verify it works.
Fixes#1634.
Closes#8044
* github.com:scylladb/scylla:
test: logalloc_test: test background reclaim
logalloc: reduce gap between std min_free and logalloc min_free
logalloc: background reclaim
logalloc: preemptible reclaim
Test that the background reclaimer is able to compete with a
fake load and reclaim 10 MB/s. The test is quite stressful as the "LRU"
is fully randomized.
If the background reclaimer is disabled, the test fails as soon as the
20MB "gap" is exhausted. With the reclaimer enabled, it is able to
free memory ahead of the allocations.
This patch adds to Alternator support for the CORS (Cross-Origin Resource
Sharing) protocol - a simple extension over the HTTP protocol which
browsers use when Javascript code contacts HTTP-based servers.
Although we usually think of Alternator as being used in a three-tier
application, in some setups there is no middle layer and the user's
browser, running Javascript code, wants to communicate directly with the
database. However, for security reasons, by default Javascript loaded
from domain X is not allowed to communicate with different domains Y.
The CORS protocol is meant to allow this, and Alternator needs to
participate in this protocol if it is to be used directly from Javascript
in browsers.
To implement CORS, Alternator needs to respond to the OPTIONS method
which it didn't allow before - with certain headers based on the
input headers. It also needs to do some of these things for the
regular methods (mostly, POST). The patch includes a comprehensive
test that runs against both Alternator and DynamoDB and shows that
Alternator handles these headers and methods the same as DynamoDB.
Additionally, I tested manually a Javascript DynamoDB client - which
didn't work prior to this patch (the browser reported CORS errors),
and works after this patch.
Fixes#8025.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210217222027.1219319-1-nyh@scylladb.com>
The schema used to create the sstable writer has to be the same as the
schema used by the reader, as the former is used to interpret mutation
fragments produced by the reader.
Commit 9124a70 introduced a deferring point between reader creation
and writer creation, which can result in a schema mismatch if there was a
concurrent alter.
This could cause the sstable writer to crash, or generate a corrupted
sstable.
Fixes#7994
Message-Id: <20210222153149.289308-1-tgrabiec@scylladb.com>
Compaction manager allows compaction of different weights to proceed in
parallel. For example, a small-sized compaction job can happen in parallel to a
large-sized one, but similar-sized jobs are serialized.
The problem is the current definition of weight, which is the log (base 4) of
total size (size of all sstables) of a job.
This is what we get with the current weight definition:
weight=5 for sizes=[1K, 3K]
weight=6 for sizes=[4K, 15K]
weight=7 for sizes=[16K, 63K]
weight=8 for sizes=[64K, 255K]
weight=9 for sizes=[258K, 1019K]
weight=10 for sizes=[1M, 3M]
weight=11 for sizes=[4M, 15M]
weight=12 for sizes=[16M, 63M]
weight=13 for sizes=[64M, 254M]
weight=14 for sizes=[256M, 1022M]
weight=15 for sizes=[1033M, 4078M]
weight=16 for sizes=[4119M, 10188M]
total weights: 12
Note that for jobs smaller than 1MB, we have 5 different weights, meaning 5
jobs smaller than 1MB could proceed in parallel. High number of parallel
compactions can be observed after repair, which potentially produces tons of
small sstables of varying sizes. That causes compaction to use a significant
amount of resources.
To fix this problem, let's add a fixed tax to the size before taking the log,
so that jobs smaller than 1M will all have the same weight.
Look at what we get with the new weight definition:
weight=10 for sizes=[1K, 2M]
weight=11 for sizes=[3M, 14M]
weight=12 for sizes=[15M, 62M]
weight=13 for sizes=[63M, 254M]
weight=14 for sizes=[256M, 1022M]
weight=15 for sizes=[1033M, 4078M]
weight=16 for sizes=[4119M, 10188M]
total weights: 7
Fixes#8124.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210217123022.241724-1-raphaelsc@scylladb.com>
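The tables above are consistent with the following sketch of the weight function; the 1MB tax constant is taken from the description, and the names are illustrative:
```
#include <cmath>
#include <cstdint>

constexpr uint64_t size_tax = 1024 * 1024;   // 1MB, collapses sub-1MB jobs

int compaction_weight(uint64_t job_size_bytes) {
    double adjusted = double(job_size_bytes + size_tax);
    return int(std::log(adjusted) / std::log(4.0));   // log base 4
}
// e.g. 1K + 1MB -> weight 10, 3M + 1MB = 4^11 -> weight 11,
// matching the "new weight definition" table above.
```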
Currently, init_server and join_cluster, which initiate the bootstrap and
replace operations on the new node, run inside the main scheduling group.
We should run them inside the maintenance scheduling group to reduce the
impact on the user workload.
This patch fixes a scheduling group leak for bootstrap and replace operation.
Before:
[shard 0] storage_service - storage_service::bootstrap sg=main
[shard 0] repair - bootstrap_with_repair sg=main
After:
[shard 0] storage_service - storage_service::bootstrap sg=streaming
[shard 0] repair - bootstrap_with_repair sg=streaming
Fixes#8130
Closes#8131
operator<< used the wrong criterion for deciding whether the data is
stored as atomic_cell or collection_mutation, resulting in
catastrophic failure if it was used with frozen collections or UDTs.
Since frozen collections and UDTs are stored as atomic_cell, not
collection_mutation, the correct criterion is not is_collection(),
but is_multi_cell().
Closes#8134
This commit introduces a new service crafted to handle CDC generation
management: listening and reacting to generation changes in the cluster.
The implementation is a stub for now, the service reacts to generation
changes by simply logging the event.
The commit plugs the service in, initializing it in main and test code,
passing a reference to storage_service and having storage_service start
the service (using the `after_join` method): the service only starts
doing its job after the node joins the token ring (either on bootstrap
or restart).
As preparation for introducing the CDC generation management service:
cdc_service will depend on the generation service.
But the generation service needs some other services to work
properly. In particular, it uses the local database, so it should be
initialized after the local database.
The only service that will need the cdc generation service is
storage_service, so we can place the generation service initialization
code right before storage_service initialization code. So the order will
be cdc_generation_service -> cdc_service -> storage_service.
"
Current storage of cells in a row is a union of vector and set. The
vector holds 5 cell_and_hash's inline, up to 32 in external
storage, and is then switched to std::set. Once switched, the whole
union becomes a waste of space, as its size is
sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes
and only 3 pointers of it are used (the std::set header). Also the
overhead of keeping cell_and_hash as a set entry is more than the size
of the structure itself.
Column ids are 32-bit integers that most likely come sequentially.
For this kind of search key a radix tree (with some care for
non-sequential cases) can be beneficial; see the sketch after the
commit list below.
This set introduces a compact radix tree that uses 7-bit sub-values
from the search key to index on each node, and compacts the nodes
themselves for better memory usage. Then row::_storage is replaced
with the new tree.
The most notable result is the memory footprint decrease, for wide
rows down to 2x. The performance of micro-benchmarks is a bit
lower for small rows and (!) higher for longer ones (8+ cells). The numbers
are in patch #12 (spoiler: they are better than for v2).
v3:
- trimmed size of radix down to 7 bits
- simplified the nodes layouts, now there are 2 of them (was 4)
- enhanced perf_mutation to test N-cells schema
- added AVX intra-nodes search for medium-sized nodes
- added .clone_from() method that helped to improve perf_mutation
- minor
- changed functions not to return values via refs-arguments
- fixed nested classes to properly use language constructors
- renamed index_to to key_t to distinguish from node_index_t
- improved recursive variadic templates not to use a sentinel argument
- use standard concepts
v2:
- fixed potential mis-compilation due to strict-aliasing violation
- added oracle test (radix tree is compared with std::map)
- added radix to perf_collection
- cosmetic changes (concepts, comments, names)
A note on item 1 from the v2 changelog: the nodes are no longer packed
perfectly, each has grown by 3 bytes. But it turned out that when used
as the cells container, most of this growth drowns in lsa alignments.
next todo:
- aarch64 version of 16-keys node search
tests: unit(dev), unit(debug for radix*), perf(dev)
"
* 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla:
test/memory_footpring: Print radix tree node sizes
row: Remove old storages
row: Prepare row::equal for switch
row: Prepare row::difference for switch
row: Introduce radix tree storage type
row-equal: Re-declare the cells_equal lambda
test: Add tests for radix tree
utils: Compact radix tree
array-search: Add helpers to search for a byte in array
test/perf_collection: Add callback to check the speed of clone
test/perf_mutation: Add option to run with more than 1 columns
test/perf_mutation: Prepare to have several regular columns
test/perf_mutation: Use builder to build schema
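As referenced above, a sketch of the 7-bit key slicing; this is illustrative, not the actual utils::compact_radix_tree code:
```
#include <cstdint>

// Each level consumes 7 bits of the 32-bit column id, so a node needs at
// most 128 slots and the tree is at most ceil(32 / 7) = 5 levels deep.
constexpr unsigned radix_bits = 7;
constexpr unsigned radix_fanout = 1u << radix_bits;   // 128
constexpr uint32_t radix_mask = radix_fanout - 1;

inline unsigned slot_for(uint32_t key, unsigned level) {
    // level 0 holds the least significant bits, so sequential column
    // ids land in adjacent slots of the same node.
    return (key >> (level * radix_bits)) & radix_mask;
}
```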
In this patch, we port validation/entities/json_test.java, containing
21 tests for various JSON-related operations - SELECT JSON, INSERT JSON,
and the fromJson() and toJson() functions.
In porting these tests, I uncovered 19 (!!) previously unknown bugs in
Scylla:
Refs #7911: Failed fromJson() should result in FunctionFailure error, not
an internal error.
Refs #7912: fromJson() should allow null parameter.
Refs #7914: fromJson() integer overflow should cause an error, not silent
wrap-around.
Refs #7915: fromJson() should accept "true" and "false" also as strings.
Refs #7944: fromJson() should not accept the empty string "" as a number.
Refs #7949: fromJson() fails to set a map<ascii, int>.
Refs #7954: fromJson() fails to set null tuple elements.
Refs #7972: toJson() truncates some doubles to integers.
Refs #7988: toJson() produces invalid JSON for columns with "time" type.
Refs #7997: toJson() is missing a timezone on timestamp.
Refs #8001: Documented unit "µs" not supported for assigning a "duration"
type.
Refs #8002: toJson() of decimal type doesn't use exponents so can produce
huge output.
Refs #8077: SELECT JSON output for function invocations should be
compatible with Cassandra.
Refs #8078: SELECT JSON ignores the "AS" specification.
Refs #8085: INSERT JSON with bad arguments should yield InvalidRequest
error, not internal error.
Refs #8086: INSERT JSON cannot handle user-defined types with case-
sensitive component names.
Refs #8087: SELECT JSON incorrectly quotes strings inside map keys.
Refs #8092: SELECT JSON missing null component after adding field to
UDT definition.
Refs #8100: SELECT JSON with IN and ORDER BY does not obey the ORDER BY.
Due to these bugs, 8 out of the 21 tests here currently xfail and one
has to be skipped (issue #8100 causes the sanitizer to detect a use
after free, and crash Scylla).
As usual with this sort of test, all 21 tests pass when running against
Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210217130732.1202811-1-nyh@scylladb.com>
When psutil.disk_partitions() reports / is /dev/root, aws_instance mistakenly
reports the root partition as part of the ephemeral disks, and RAID construction
will fail.
This patch prevents the error and reports the correct free disks.
Fixes#8055
Closes#8040
Due to small value optimization used in `bytes`, views to `bytes` stored
in `vector` can be invalidated when the vector resizes, resulting in
use-after-free and data corruption. Fix that.
Closes#8105
* github.com:scylladb/scylla:
cdc: log: avoid an unnecessary copy
cdc: log: fix use-after-free in process_bytes_visitor
Due to small value optimization used in `bytes`, views to `bytes` stored
in `vector` can be invalidated when the vector resizes, resulting in
use-after-free and data corruption. Fix that.
Fixes#8117
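The failure mode can be demonstrated standalone with std::string, whose small value optimization behaves like the one in `bytes`:
```
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

int main() {
    std::vector<std::string> store;
    store.push_back("tiny");           // fits in the SSO buffer inside the element
    std::string_view v = store[0];     // view points into the element itself
    store.resize(1000);                // vector reallocates; the element moves
    // v now points into freed memory, a use-after-free:
    // std::cout << v << "\n";         // don't do this
    std::string_view safe = store[0];  // take views only once the container
    std::cout << safe << "\n";         // is stable
}
```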
While a duplicate vote from the same server is not possible in a
conforming Raft implementation, Raft's assumptions about the network
permit duplicates.
So, in theory, it is possible that a vote message is delivered
multiple times.
The current voting implementation does reject votes from non-members,
but doesn't check for duplicate votes.
Keep track of who has already voted, and reject duplicate votes.
A unit test follows.
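A minimal sketch of the deduplicating tally; names are illustrative, though similar in spirit to raft's vote tracking:
```
#include <cstddef>
#include <cstdint>
#include <unordered_set>

using server_id = uint64_t;

class election_tracker {
    std::unordered_set<server_id> _responded;
    size_t _granted = 0;
public:
    bool register_vote(server_id from, bool granted) {
        if (!_responded.insert(from).second) {
            return false;   // duplicate vote from `from`: ignore it
        }
        if (granted) {
            ++_granted;
        }
        return true;
    }
    bool tally_votes(size_t cluster_size) const {
        return _granted >= cluster_size / 2 + 1;
    }
};
```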
Imagine the cluster is in joint configuration {{A, B}, {A, B, C, D, E}}.
The leader's view of stable indexes is:
Server Match Index
A 5
B 5
C 6
D 7
E 8
The commit index would be 5 if we use the joint configuration, and 6
if we assume we have left it. Let this happen without an extra FSM
step.
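The example works out as follows under a sketch of the joint-majority rule; this is illustrative, not raft's actual tracker code:
```
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

using index_t = uint64_t;

// Highest index replicated on a majority: sort match indexes descending
// and take the (n/2)-th element (0-based), e.g. {8,7,6,5,5} -> 6.
index_t majority_replicated(std::vector<index_t> match) {
    std::sort(match.begin(), match.end(), std::greater<index_t>());
    return match[match.size() / 2];
}

// Under a joint configuration each half must reach a majority on its own,
// so the commit index is the minimum of the two majority indexes:
//   old {A:5, B:5} -> 5, new {A:5, B:5, C:6, D:7, E:8} -> 6, joint -> 5.
// Once the entry leaving the joint configuration commits, only the new
// configuration's majority (6) applies, without an extra FSM step.
index_t joint_commit_index(const std::vector<index_t>& old_cfg,
                           const std::vector<index_t>& new_cfg) {
    return std::min(majority_replicated(old_cfg), majority_replicated(new_cfg));
}
```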
The old name was incorrect: if apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.
Fix spelling in related comments.
Rename fsm::wait() to fsm::wait_max_log_size(), it's a more
specific name. Rename max_log_length to max_log_size to use
'size' rather than 'length' consistently for log size.
Replace it with a private _first_idx, which is maintained
along with the rest of class log state.
_first_idx is a name consistent with counterpart last_idx().
Do not use a function, since going forward we may want
to remove the Raft index from struct log_entry, so we should rely
less on it.
This fixes a bug when _last_conf_idx was not reset
after apply_snapshot() because start_idx() was pointing
to a non-existent entry.
If the log is empty, we must use snapshot's term,
since the log could be right after taking a snapshot
when no trailing entries were kept.
This fixes a rare possible bug where the log matching
rule could be violated during elections by a follower
whose log was just truncated after a snapshot.
A separate unit test for the issue will follow.
raft::log::start_idx() is currently not meaningful
in case the log is empty.
Avoid using it in fsm::replicate_to() and avoid manually searching for
the previous log term; instead, encapsulate the search in log::term_for().
As a side effect we currently return a correct term (0)
when log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.
This change happens to reduce the overall line count.
While at it, improve the comments in raft::replicate_to().
Until now, the lists of streams in the `cdc_streams_descriptions` table
for a given generation were stored in a single collection. This solution
has multiple problems when dealing with large clusters (which produce
large lists of streams):
1. large allocations
2. reactor stalls
3. mutations too large to even fit in commitlog segments
This commit changes the schema of the table as described in issue #7993.
The streams are grouped according to token ranges, each token range
being represented by a separate clustering row. Rows are inserted in
reasonably large batches for efficiency.
The table is renamed to enable easy upgrade. On upgrade, the latest CDC
generation's list of streams will be (re-)inserted into the new table.
Yet another table is added: one that contains only the generation
timestamps clustered in a single partition. This makes it easy for CDC
clients to learn about new generations. It also enables an elegant
two-phase insertion procedure of the generation description: first we
insert the streams; only after ensuring that a quorum of replicas
contains them, we insert the timestamp. Thus, if any client observes a
timestamp in the timestamps table (even using a ONE query),
it means that a quorum of replicas must contain the list of streams.
---
Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.
However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. We add an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).
Closes#8116
* github.com:scylladb/scylla:
tests: add a simple CDC cql pytest
cdc: add config option to disable streams rewriting
cdc: rewrite streams to the new description table
cql3: query_processor: improve internal paged query API
cdc: introduce no_generation_data_exception exception type
docs: cdc: mention system.cdc_local table
cdc: coroutinize do_update_streams_description
sys_dist_ks: split CDC streams table partitions into clustered rows
cdc: use chunked_vector for streams in streams_version
cdc: remove `streams_version::expired` field
system_distributed_keyspace: use mutation API to insert CDC streams
storage_service: don't use `sys_dist_ks` before it is started
Rewriting stream descriptions is a long, expensive, and prone-to-failure
operation. Due to #8061 it may consume a lot of memory. In general, it
may keep failing (and being retried) endlessly, straining the cluster.
As a backdoor we add this flag for potential future needs of admins or
field engineers.
I don't expect it will ever be used, but it won't hurt and may save us
some work in the worst case scenario.
The `query_processor::query` method allowed internal paged queries.
However, it was quite limited, hardcoding a number of parameters:
consistency level, timeout config, page size.
This commit does the following improvements:
1. Rename `query` to `query_internal` to make it obvious that this API
is supposed to be used for internal queries only
2. Extend the method to take consistency level, timeout config, and page
size as parameters
3. Remove unused overloads of `query_internal`
4. Fix a bunch of typos / grammar issues in the docstring
It could happen that system_distributed_keyspace was used by
storage_service before it was fully started (inside
`handle_cdc_generation`), i.e. before sys_dist_ks' `start()` returned
(on shard 0). It only checked whether `local_is_initialized()` returns
true, so it only ensured that the service is constructed.
Currently, sys_dist_ks' `start` only announces migrations, so this was
mostly harmless. More concretely: it could result in the node trying to
send CQL requests using a table that it didn't yet recognize by calling
sys_dist_ks' methods before the `announce_migration` call inside `start`
has returned. This would result in an exception; however, the exception
would be caught by the caller and the procedure would be retried,
succeeding eventually. See `handle_cdc_generation` for details.
Still, the initial intention of the code was to wait for the sys_dist_ks
service to be fully started before it was used. This commit fixes that.
Test log consistency after apply_snapshot() is called.
Ensure log::last_term(), log::last_conf_index() and log::size()
work as expected.
Misc cleanups.
* scylla-dev/raft-confchange-test:
raft: add a unit test for voting
raft: do not account for the same vote twice
raft: remove fsm::set_configuration()
raft: consistently use configuration from the log
raft: add ostream serialization for enum vote_result
raft: advance commit index right after leaving joint configuration
raft: add tracker test
raft: tidy up follower_progress API
raft: update raft::log::apply_snapshot() assert
raft: add a unit test for raft::log
raft: rename log::non_snapshoted_length() to log::length()
raft: inline raft::log::truncate_tail()
raft: ignore AppendEntries RPC with a very old term
raft: remove log::start_idx()
raft: return a correct last term on an empty log
raft: do not use raft::log::start_idx() outside raft::log()
raft: rename progress.hh to tracker.hh
raft: extend single_node_is_quiet test
Expired sstables are skipped in the compaction setup phase, because they don't
need to be actually compacted, but rather only deleted at the end.
That causes such sstables to not be removed from the backlog tracker,
meaning that the backlog caused by expired sstables will not be removed even
after their deletion, which means shares will be higher than needed, making
compaction potentially more aggressive than it has to be.
To fix this bug, let's manually register these sstables with the monitor,
so that they'll be removed from the tracker once compaction completes.
Fixes#6054.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210216203700.189362-1-raphaelsc@scylladb.com>
The current renaming rule for debian/scylla-* files is buggy: it fails to
install some .service files when a custom product name is specified.
Introduce regex-based rewriting instead of ad-hoc renaming, and fix the
wrong renaming rule.
Fixes#8113
Closes#8114
`_range_override` is used to store the modified range the reader reads
after it has to be recreated (when recreating a reader, its read range
is reduced to account for partitions it has already read). When engaged,
this field overrides the `_pr` field as the definitive range the reader
is supposed to be currently reading. Fast forwarding conceptually
overrides the range the reader is currently reading, however currently
it doesn't reset the `_range_override` field. This resulted in
`_range_override` (containing the modified pre-fast-forward range)
incorrectly overriding the fast-forwarded-to range in `_pr` when
validating the first partition produced by the just recreated reader,
resulting in a false-positive validation failure.
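A minimal sketch of the shape of the fix, using the member names from the description above (the body is illustrative, not the actual patch):
```cpp
// Fast forwarding overrides the current read range, so it must also
// drop the stale override; otherwise _range_override keeps shadowing
// the fast-forwarded-to range in _pr during validation.
future<> fast_forward_to(const dht::partition_range& pr) {
    _pr = pr;
    _range_override.reset(); // previously left engaged -> false positive
    return _reader->fast_forward_to(_pr);
}
```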
Fixes: #8059
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210217164744.420100-1-bdenes@scylladb.com>
Instead of resetting _reader in scanning_and_populating_reader::fill_buffer
in the `reader_finished` case, use a gentler, _read_next_partition flag
on which `read_next_partition` will be called in the next iteration.
Then, read_next_partition can close _reader only before overwriting it
with a new reader. Otherwise, if _reader is always closed in the
`reader_finished` case, we end up hitting a premature end_of_stream.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-30-bhalevy@scylladb.com>
Unlike flat_mutation_reader_opt, which is defined using
optimized_optional<flat_mutation_reader>, std::optional<T> does not evaluate
to `false` after being moved from, only after it is explicitly reset.
Use flat_mutation_reader_opt rather than std::optional<flat_mutation_reader>
to make it easier to check if it was closed before it's destroyed
or being assigned-over.
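The difference in moved-from semantics, in a small hypothetical snippet (make_reader() is assumed):
```cpp
std::optional<flat_mutation_reader> o = make_reader();
auto r1 = std::move(*o);
assert(o);   // still engaged: moved-from, but never reset

flat_mutation_reader_opt fo = make_reader();
auto r2 = std::move(fo);
assert(!fo); // optimized_optional evaluates to false after the move
```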
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-6-bhalevy@scylladb.com>
Currently, whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.
This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.
This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.
This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.
For more details see #7961, #7993 and #7985.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes#8048
* github.com:scylladb/scylla:
cdc: Limit size of topology description
cdc: Extract create_stream_ids from topology_description_generator
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), will speed up migrating other components to this
IMR format, as the IMR infrastructure reduces code bloat, makes the code
more readable via declarative type descriptions as well as safer.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.
This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because related code evolved a lot since the merge. Also, going
back to the previous code would mean we regress as we'd revert the move
to fragmented buffers. So this revert is only conceptual, it changes the
underlying infrastructure back to the previous open-coded one, but keeps
the fragmented buffers, as well as the interface of the related
components (to the extent possible).
Fixes: #5578
Closes#8106
* github.com:scylladb/scylla:
imr: switch back to open-coded description of structures
utils: managed_bytes: add a few trivial helper methods
utils: fragment_range: move FragmentedView helpers to fragment_range.hh
utils: fragment_range: add single_fragmented_mutable_view
utils: fragment_range: implement FragmentRange for fragment_range
utils: mutable_view: add front()
types: remove an unused helper function
test: mutation_test: fix memory calculations in make_fragments_with_non_monotonic_positions
test: mutation_test: remove an obsolete assertion
test: mutation_test: initialize an uninitialized variable
test: sstable_datafile_test: fix tracking of closed sstables in sstable_run_based_compaction_test
In the upcoming IMR removal patch we will need read_simple() and similar helpers
for FragmentedView outside of types.hh. For now, let's move them to
fragment_range.hh, where FragmentedView is defined. Since it's a widely included
header, we should consider moving them to a more specialized header later.
The off-by-one error would cause
test_multishard_combining_reader_non_strictly_monotonic_positions to fail if
the added range_tombstones filled the buffer exactly to the end.
In such a situation, with the old loop condition,
make_fragments_with_non_monotonic_positions would add one range_tombstone too
many to the deque, violating the test assumptions.
Due to small value optimizations, the removed assertions are not true in
general. Until now, atomic_cell did not use small value optimizations, but
it will after upcoming changes.
sstable_run_based_compaction_test assumed that sstables are freed immediately
after they are fully processed.
However, since commit b524f96a74,
mutation_reader_merger releases sstables in batches of 4, which breaks the
assumption. This fix adjusts the test accordingly.
Until now, the test only kept working by chance: by coincidence, the number of
test sstables processed by merging_reader in a single fill_buffer() call was
divisible by 4. Since the test checks happen between those calls,
the test never witnessed a situation when an sstable was fully processed,
but not released yet.
The error was noticed during the work on an upcoming patch which changes the
size of mutation_fragment, and reduces the number of test sstables processed
in a single fill_buffer() call, which breaks the test.
While duplicate votes are not allowed by Raft rules, it is possible
that a vote message is delivered multiple times.
The current voting implementation does reject votes from non-members,
but doesn't check for duplicate votes.
Keep track of who has already voted, and reject duplicate votes.
A unit test follows.
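A self-contained sketch of such a guard (names are illustrative, not the actual raft code):
```cpp
#include <cstddef>
#include <unordered_set>

using server_id = int; // stand-in for raft's server id type

// Count each voter at most once, however many times the vote message
// is delivered.
struct votes_tracker {
    std::unordered_set<server_id> _voted; // who has already voted
    size_t _granted = 0;

    void register_vote(server_id from, bool granted) {
        if (!_voted.insert(from).second) {
            return; // duplicate delivery of the same vote: ignore it
        }
        if (granted) {
            ++_granted;
        }
    }
};
```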
Imagine the cluster is in joint configuration {{A, B}, {A, B, C, D, E}}.
Server stable indexes are:
Server Stable Index
A 5
B 5
C 6
D 7
E 8
The commit index would be 5 if we use the joint configuration, and 6
if we assume we have left it. Let that happen without an extra FSM
step.
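For reference, a small illustrative sketch of how the commit index of a single configuration is computed, and how the joint value follows from the example above:
```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using index_t = uint64_t;

// Highest index replicated to a majority = lower median of the servers'
// stable (match) indexes.
index_t majority_index(std::vector<index_t> stable) {
    std::sort(stable.begin(), stable.end());
    return stable[(stable.size() - 1) / 2];
}

// Joint configuration: an entry is committed only when both majorities
// hold it. With {A,B} = {5,5} and {A,B,C,D,E} = {5,5,6,7,8}:
//   min(majority_index({5,5}), majority_index({5,5,6,7,8}))
//   == min(5, 6) == 5
```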
The old name was incorrect: in case apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.
Fix spelling in related comments.
Rename fsm::wait() to fsm::wait_max_log_length(), it's a more
specific name.
Replace it with a private _first_idx, which is maintained
along with the rest of class log state.
_first_idx is a name consistent with counterpart last_idx().
Do not use a function, since going forward we may want
to remove the Raft index from struct log_entry, so we should rely
less on it.
This fixes a bug when _last_conf_idx was not reset
after apply_snapshot() because start_idx() was pointing
to a non-existent entry.
If the log is empty, we must use the snapshot's term,
since the log could be right after taking a snapshot
when no trailing entries were kept.
This fixes a rare possible bug when a log matching
rule could be violated during elections by a follower
with a log which was just truncated after a snapshot.
A separate unit test for the issue will follow.
raft::log::start_idx() is currently not meaningful
in case the log is empty.
Avoid using it in fsm::replicate_to() and avoid the manual search for
the previous log term; instead, encapsulate the search in log::term_for().
As a side effect we currently return a correct term (0)
when the log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.
This change happens to reduce the overall line count.
While at it, improve the comments in raft::replicate_to().
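The encapsulated lookup could look roughly like this (a sketch with assumed member names, not the actual implementation):
```cpp
// Term of entry `idx`: from the log if the entry is still there, from
// the snapshot if the entry was truncated away at the snapshot point,
// and nothing otherwise. The empty-log case falls out naturally: right
// after a snapshot with no trailing entries, idx == _snapshot.idx
// returns the snapshot's term (0 for the very first snapshot).
std::optional<term_t> log::term_for(index_t idx) const {
    if (!_log.empty() && idx >= _first_idx && idx <= last_idx()) {
        return _log[idx - _first_idx]->term;
    }
    if (idx == _snapshot.idx) {
        return _snapshot.term;
    }
    return std::nullopt;
}
```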
This patch adds several additional tests to test/cql-pytest/test_json.py
to reproduce additional bugs or clarify some non-bugs.
First, it adds a reproducer for issue #8087, where SELECT JSON may create
invalid JSON - because it doesn't quote a string which is part of a map's
key. As usual for these reproducers, the test passes on Cassandra, and fails
on Scylla (so marked xfail).
We have a bigger test translated from Cassandra's unit tests,
cassandra_tests/validation/entities/json_test.py::testInsertJsonSyntaxWithNonNativeMapKeys
which demonstrates the same problem, but the test added in this patch is much
shorter and focuses on demonstrating exactly where the problem is.
Second, this patch adds a test that verifies that SELECT JSON works correctly
for UDTs or tuples where one of their components was never set - in such a
case the SELECT JSON should also output this component, with a "null" value.
And this test works (i.e., produces the same result in Cassandra and Scylla).
This test is interesting because it shows that issue #8092 is specific to the
case of an altered UDT, and doesn't happen for every case of null
component in a UDT.
Refs #8087
Refs #8092
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210216150329.1167335-1-nyh@scylladb.com>
* seastar 76cff58964...e53a1059f9 (18):
> rpc: streaming sink: order outgoing messages
Fixes#7552.
> http: fix compilation issues when using clang++
> http/file_handler: normalize file-type for mime detection
> http/mime_types: add support for svg+xml
> reactor: simplify get_sched_stats()
> Merge "output_stream: make api noexcept" from Benny
> Merge " input_stream: make api noexcept" from Benny
> rpc: mark 'protocol' class as final
> tls: reloadable_certificate inotify flag is wrong
Fixes#8082.
> cli: Ignore the --num-io-queues option
> io_queue: Do not carry start time in lambda capture
> fstream: Cancel all IO-s on file_data_source_impl close
> http: add "Transfer-Encoding: chunked" handling
> http: add ragel parsers for chunks used in messages with Transfer-Encoding: chunked
> http: add request content streaming
> http: add reading/skipping all bytes in an input_stream
> Merge "Reduce per-io-queue container for prio classes" from Pavel Emelyanov
> seastar-addr2line: split multiple addresses on the same line
messaging_service's rpc_protocol_server_wrapper inherits from
seastar::rpc::protocol::server. This is unfortunate, as protocol.hh
wasn't designed for inheritance, and the class is not marked final.
Avoid this inheritance by hiding the class as a member. This causes
a lot of boilerplate code, which is unfortunate, but this random
inheritance is bad practice and should be avoided.
Closes#8084
Currently there are places that call
keyspace_element_name::get_keyspace() without checking that _ks_name is
engaged. Fix those places.
Message-Id: <20210216085545.54753-1-gleb@scylladb.com>
repair_writer::do_write() already has a partition compare for each
mutation fragment written, to determine whether the fragment belongs to
another partition or not. This equality compare can be converted to a
tri_compare at no extra cost, allowing out-of-order partitions to be
detected, in which case `on_internal_error()` is called.
Refs: #7623
Refs: #7552
Test: dtest(RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test:debug)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210216074523.318217-1-bdenes@scylladb.com>
We see long reactor stalls from `logalloc::prime_segment_pool`
in debug mode, yet the stall detector's purpose is to detect
reactor stalls during normal operation where they can increase
the latency of other queries running in parallel.
Since this change doesn't actually fix the stalls but rather
hides them, the following annotations will just reference
the respective github issues rather than auto-close them.
Refs #7150
Refs #5192
Refs #5960
Restore blocked_reactor_notify_ms right before
starting storage_proxy. Once storage_proxy is up, this node
affects cluster latency, and so stalls should be reported so
they can be fixed.
Test: secondary_index_test --blocked-reactor-notify-ms 1 (release)
DTest: CASSANDRA_DIR=../scylla/build/release SCYLLA_EXT_OPTS="--blocked-reactor-notify-ms 2" ./scripts/run_test.sh materialized_views_test:TestMaterializedViews.interrupt_build_process_with_resharding_half_to_max_test
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210216112052.27672-1-bhalevy@scylladb.com>
This series provides a `raft_services` class to create and store
raft schema-changes server instances, and also wires up the RPC handlers
for Raft RPC verbs.
* manmanson/raft-api-server-handlers-v10:
raft: share `raft_gossip_failure_detector` instance across multiple raft rpc instances
raft: move server address handling from `raft_rpc` to `raft_services` class
raft: wire up schema Raft RPC handlers
raft: raft_rpc: provide `update_address_mapping` and dispatcher functions
raft: pass `group_id` as an argument to raft rpc messages
raft: use a named constant for pre-defined schema raft group
Store an instance inside `raft_services` and reuse it for
all raft groups created and managed by `raft_services` instance.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This decouples `raft_gossip_failure_detector` from
a particular rpc instance and thus makes it possible
to share the same failure detector instance among all raft servers
since they are managed in a centralized way by a `raft_services`
instance.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This patch adds registration and de-registration of the
corresponding Raft RPC verbs handlers.
There is a new `raft_services` class that
is responsible for initializing the raft RPC verbs and
managing raft server instances.
The service inherits `seastar::peering_sharded_service<T>`,
because we need to route the request to the appropriate shard
which is handled by the `shard_for_group` function (currently
it only handles the schema raft group, which lands on shard 0;
otherwise it throws an exception).
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
We had Alternator's current compatibility with DynamoDB described in
two places - alternator.md and compatibility.md. This duplication was
not only unnecessary, in some places it led to inconsistent claims.
In general, the better description was in compatibility.md, so in
this patch we remove the compatibility section from alternator.md
and instead link to compatibility.md. There was a bit of information
that was missing in compatibility.md, so this patch adds it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210215203057.1132162-1-nyh@scylladb.com>
After switching cell storage onto the compact radix tree, it
becomes useful to know the tree nodes' sizes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the 3rd storage type (radix tree) is all in, the old
storage can be safely removed. The result is:
1. memory footprint
sizeof(class row): 112 => 16 bytes
sizeof(rows_entry): 126 => 120 bytes
the "in cache" value depends on the number of cells:
num of cells master patch
1 752 656
2 808 712
3 864 768
4 920 824
5 968 936
6 1136 992
...
16 1840 1672
17 1904 1992 (+88)
18 1976 2048 (+72)
19 2048 2104 (+56)
20 2120 2160 (+40)
21 2184 2208 (+24)
22 2256 2264 ( +8)
23 2328 2320
...
32 2960 2808
After 32 cells the storage switches to the rbtree, with a
24-byte per-cell overhead, and the radix tree improvement
takes off:
64 7872 6056
128 15040 9512
256 29376 18568
2. perf_mutation test is enhanced by this series and the
results differ depending on the number of columns used
tps value
--column-count master patch
1 59.9k 57.6k (-3.8%)
2 59.9k 57.5k
4 59.8k 57.6k
8 57.6k 57.7k <- eq
16 56.3k 57.6k
32 53.2k 57.4k (+7.9%)
A note on this: last time the 1-column test was ~5% worse, which
was explained by the inline storage of 5 cells that's present in
the current implementation and was absent in the radix tree.
An attempt to add inline storage for small radix trees
resulted in a complete loss of the memory footprint gain, but gave
a fraction of a percent to perf_mutation performance. So this
version doesn't have inline nodes.
The 1.2% improvement over v2 surprisingly came from
tree::clone_from(), which in v2 was worked around by a slow
walk+emplace sequence, while this version has an optimized
API call for cloning.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Same as the previous patch, re-implement row::equal to use
the radix_tree iterator for comparing two index:cell sequences.
std::equal() doesn't work here, since the predicate needs
to look at both iterators to call it.key() on them (a radix tree API
feature), while std::equal provides only the dereferenced values.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method effectively walks two pairs of <column_id, cell> and
applies the difference to a separate row instance. The code added
is a copy of the same code below this hunk with the mechanical
substitution:
c.first -> c.key()
c.second -> c->cell
it->first -> it.key()
it->second -> it.cell
because the first-s are column_id-s reported by the radix tree
iterator's .key() method, and the second-s are cells, which were
referenced by the current code in get_..._vector() from boost::irange
and are now directly pointed to by the radix tree iterator.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently class row uses a union of a vector and a set to keep
the cells and switches between them. Add the 3rd type with the
radix tree, but never switch to it, just to show how the operations
would look. Later on the vector and set will be removed and the
whole row will be immediately switched to the radix tree storage.
NB: All the added places have indentation deliberately broken, so
that the next patch will just remove the surrounding (old) code
and (most of) the new one will happen in its place instantly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For further patching it's handy to have this helper to accept
column_id and atomic_cell_or_collection arguments, instead of
an std::pair of these two.
This is to facilitate next patching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The tree uses an integral type as a search key. On each level the local index
is the next 7 bits of the key, so for a 32-bit key we have 5 levels.
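In other words (an illustrative snippet; the actual tree may consume the key's bits in a different order):
```cpp
#include <cstdint>

constexpr unsigned radix_bits = 7;            // bits consumed per level
constexpr unsigned fanout = 1u << radix_bits; // up to 128 slots per node
static_assert((32 + radix_bits - 1) / radix_bits == 5,
              "a 32-bit key needs 5 levels");

unsigned local_index(uint32_t key, unsigned level) {
    return (key >> (radix_bits * level)) & (fanout - 1);
}
```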
The tree uses 2 memory packing techniques -- prefix compaction and growing
node layouts.
The prefix compaction is used when a node has only one child. In this case
such a node is replaced in its parent with this only child, and the child in
question keeps a "prefix:length" pair on board that's used to check whether
the short-cut lookup took the correct path.
The growing node layouts make the nodes occupy as much memory as needed
to keep the _present_ keys, and there are 2 kinds of layouts.
The direct layout is an array, and intra-node search is plain indexing. The
layout storage grows in a vector-like manner, but there's a special case
for the maximum-sized layout that helps avoid some boundary checks.
The indirect layout keeps two arrays on board -- one with values and one
with indices. Intra-node search is thus a lookup in the latter array first.
This layout is used to save memory for sparse keys. Lookup is optimized
with SIMD instructions.
Inner nodes use direct layouts, as they occupy ~1% of memory and thus
need not be memory efficient. At the same time, a key lookup in the
tree potentially walks several inner nodes, so speeding up search for
them is beneficial.
Leaf nodes are indirect, since they are 99% of memory and thus need to
be packed well. The single indirect lookup when searching the tree
doesn't slow things down notably, even in an insertion stress test.
That said,
* inner nodes are: header + 4 / 8 / 16 / 32 / 64 / 128 pointers
* leaf nodes are : header + 4 / 8 / 16 / 32 bytes + <same nr> objects
or header + 16 bytes bitmap + 128 objects
The header is
- backreference (8 bytes)
- prefix (4 bytes)
- size, layout, capacity (1 byte each)
The iterator is one-directional (for simplicity) but that is enough for its
main target -- the sparse array of cells in a row. The iterator also has an
.index() method that reports the index of the entry at which it points.
This greatly simplifies the tree scans done by the class row.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The attrs_to_get object was previously copied, but that's quite
a heavyweight operation, since this object may contain an
instance of std::map or std::unordered_map.
To avoid copying whole maps, the object is wrapped in a shared
const pointer.
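A sketch of the idea with a stand-in map type (the real attrs_to_get lives in Alternator; names here are illustrative):
```cpp
#include <seastar/core/shared_ptr.hh>
#include <string>
#include <unordered_map>

using attrs_to_get = std::unordered_map<std::string, unsigned>; // stand-in

void pass_around(attrs_to_get parsed) {
    // One allocation up front; each further copy of the pointer is a
    // cheap (per-shard, non-atomic) refcount bump instead of a map copy.
    auto shared = seastar::make_lw_shared<const attrs_to_get>(std::move(parsed));
    auto another_ref = shared; // no map copy here
}
```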
Message-Id: <75ad810de16c630b65ae8d319cb4b37e1de8085f.1613398751.git.sarna@scylladb.com>
As Gleb suggested in a previous review, remove ticker from raft and
leave calling tick() to external code.
While there, tick faster to speed up tests.
* https://github.com/alecco/scylla/tree/tests-17-remove-ticker:
raft: replication test: reduce ticker from 100ms to 1ms
raft: drop ticker from raft
Use the `_interval` member instead of the old `_range` field, but stay
compatible with pre 4.2 releases, falling back to `_range` when
`_interval` doesn't exist.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210215104008.166746-1-bdenes@scylladb.com>
The radix tree code will need helpers to find an 8-bit value
in an array of some fixed size, so here they are.
Those that allow for SIMD implementations use them on x86_64.
TODO: Add aarch64
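On x86_64 such a helper might look like the following sketch (SSE2, 16-byte array; not the actual Scylla code):
```cpp
#include <emmintrin.h>
#include <cstdint>

// Return the index of the first occurrence of `val` in arr[0..15],
// or -1 if it is not present.
int find_in_16(const uint8_t* arr, uint8_t val) {
    __m128i needle = _mm_set1_epi8(static_cast<char>(val));
    __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(arr));
    int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(data, needle));
    return mask ? __builtin_ctz(mask) : -1;
}
```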
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In some places scylla clones collections of objects, so it's
sometimes necessary to measure the speed of this operation.
This patch adds a placeholder for it, but no implementation
for any supported collection. One will be added soon for the
radix tree.
The --column-count option makes the test generate a schema with
the given number of columns and makes the mutation maker
fill a random column with a value on each iteration.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Teach the schema builder and test itself to work on more
than one regular column, but for now only use 1, as before.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Merged 'nested attribute paths in all expressions' from Nadav Har'El.
This series fixes#5024 - which is about adding support for nested attribute
paths (e.g., a.b.c[2]) to Alternator. The series adds complete support for this
feature in ProjectionExpression, ConditionExpression, FilterExpression and
UpdateExpression - and also its combination with ReturnValues. Many relevant
tests - and also some new tests added in this series - now pass.
The first patch in the series fixes#8043, a bug in some error cases in
conditions, which was discovered while working on this series and is
conceptually separate from the rest of the series.
Closes#8066
* github.com:scylladb/scylla:
alternator: correct implementation of UpdateItem with nested attributes and ReturnValues
alternator: fix bug in ReturnValues=UPDATED_NEW
alternator: implemented nested attribute paths in UpdateExpression
alternator: limit the depth of nested paths
alternator: prepare for UpdateItem nested attribute paths
alternator: overhaul ProjectionExpression hierarchy implementation
alternator: make parsed::path object printable
alternator-test: a few more ProjectionExpression conflict test cases
alternator-test: improve tests for nested attributes in UpdateExpression
alternator: support attribute paths in ConditionExpression, FilterExpression
alternator-test: improve tests for nested attributes in ConditionExpression
alternator: support attribute paths in ProjectionExpression
alternator: overhaul attrs_to_get handling
alternator-test: additional tests for attribute paths in ProjectionExpression
alternator-test: harden attribute-path tests for ProjectionExpression
alternator: fix ValidationException in FilterExpression - and more
The canonical_mutation type can contain a large mutation, particularly
when the mutation is a result of converting a big schema. Its data
was stored in a field of type 'bytes', which is non-contiguous and
may cause a large allocation.
This is fixed by simply changing the type to 'bytes_ostream', which is
fragmented. The change is compatible because the idl type 'bytes' is compatible
with 'bytes_ostream' as a result of dcf794b, and all of canonical_mutation's
methods use the field as an input stream (ser::as_input_stream), which can
be used on 'bytes_ostream' too.
Fixes#8074
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes#8075
This patch adds several more tests reproducing bugs in toJson() and
SELECT JSON.
First add two xfailing tests reproducing two toJson() issues - #7988
and #8002. The first is that toJson() incorrectly formats values of the
"time" type - it should be a string but Scylla forgets the quotes.
The second is that toJson() formats "decimal" values as JSON numbers
without using an exponent, resulting in memory allocation failure
for numbers with high exponents, like 1e1000000000.
The actual test for 1e1000000000 has to be skipped because in
debug build mode we get a crash trying this huge allocation.
So instead, we check 1e1000 - this generates a string of 1000
characters, which is much too much (should just be "1e1000")
but doesn't crash.
Then we add a reproducing test for issue #8077: When using SELECT JSON
on a function, such as count(*), ttl(v) or intAsBlob(v), Cassandra has
a specific way how it formats the result in JSON, and Scylla should do
it the same way unless we have a good reason not to.
As usual, the new tests pass on Cassandra and fail on Scylla, so they are
marked xfail.
Refs #7988
Refs #8002
Refs #8077.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210214210727.1098388-1-nyh@scylladb.com>
Start improving CONTRIBUTING.md, as suggested in issue #8037:
1. Incorporate the few lines we had in coding-style.md into CONTRIBUTING.md.
This was mostly a pointer to Seastar's coding style anyway, so it's not
helpful to have a separate file which hopeful developers will not find
anyway.
2. Mention the Scylla developers mailing list, not just the Scylla users
mailing list. The Scylla developers mailing list is where all the action
happens, and it's very odd not to mention it.
3. The decision that GitHub pull requests are forbidden was retracted
a long time ago, so change the explanation of pull requests.
4. Some smaller phrasing changes.
Refs #8037.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210214152752.1071313-1-nyh@scylladb.com>
Merged patch series from Pavel Emelyanov:
The default -O<> levels are considered to produce slow and tedious-to-test
code, so it's tempting to increase the level. On the other hand,
there were some complaints that recompile-mostly workflows would suffer
from slower builds.
This set tries to find a reasonable compromise -- raise the default opt
levels and provide the ability to configure them if needed.
* 'br-cxx-o-levels-2' of github.com:xemul/scylla:
configure: Switch debug build from -O0 to -Og
configure: Switch dev build from -O1 to -O2
configure: Make -O flag configurable
With the larger gap, logalloc reserved more memory for std than
the background reclaim threshold for running, so it was triggered
rarely.
With the gap reduced, background reclaim is constantly running in
an allocating workload (e.g. cache misses).
Set up a coroutine in a new scheduling group to ensure there is
a "cushion" of free memory. It reclaims in preemptible mode in
order to reduce reactor stalls (contrast with synchronous reclaim,
which cannot preempt until it has achieved its goal).
The free memory target is arbitrarily set at 60MB. The reclaimer's
shares are proportional to the distance from the free memory target;
so a workload that allocates memory rapidly will have the background
reclaimer working harder.
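A sketch of the proportional-shares idea; the 60MB target is from the description above, while the formula and names are illustrative:
```cpp
#include <cstddef>

static constexpr size_t target_free = 60 * 1024 * 1024; // free-memory target

// The further free memory falls below the target, the more shares (and
// thus CPU time) the background reclaimer's scheduling group receives.
float reclaimer_shares(size_t free_now) {
    if (free_now >= target_free) {
        return 1.0f; // at or above the target: minimal shares
    }
    float deficit = float(target_free - free_now) / float(target_free);
    return 1.0f + deficit * 999.0f; // scales up to ~1000 shares
}
```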
I rolled my own condition variable here, mostly as an experiment.
seastar::condition_variable requires several allocations, while
the one here requires none. We should formalize it after we gain
more experience with it.
Add an option (currently unused by all callers) to preempt
reclaim. If reclaim is preempted, it just stops what it is
doing, even if it reclaimed nothing. This is useful for background
reclaim.
Currently, preemption checks are on segment granularity. This is
probably too coarse, and should be refined later, but is already
better than the current granularity which does not allow preemption
until the entire requested memory size was reclaimed.
To speed up the replication test, reduce the tick time from 100ms to 1ms.
Speed up: debug 3.7 to 2.5, dev 2.9 to 2.1 seconds
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This test reproduces issue #7987, where Scylla cannot set a time column
with an integer - whereas the documentation says this should be possible
and it also works in Cassandra.
The test file includes tests for both ways of setting a time column
(using an integer and a string), with both prepared and unprepared
statements, and demonstrates that only one combination fails in Scylla -
an unprepared statement with an integer. This test xfails on Scylla
and passes on Cassandra, and the rest pass on both.
Refs #7987.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210128215103.370723-1-nyh@scylladb.com>
This patch fixes the last missing part of nested attribute support in
UpdateItem - returning the correct attributes when ReturnValues is requested.
When the expression says "a.b = :val" and ReturnValues is set to UPDATED_OLD
or UPDATED_NEW, only the actual updated attribute a.b should be returned, not
the entire top-level attribute a as we did before this patch.
This patch was made very simple because our existing hierarchy_filter()
function already does exactly the right thing, and can trivially be made to
accept any attribute_path_map<T> (in our case attribute_path_map<action>),
not just attrs_to_get as it did until now.
This patch also adds several more checks to the test in test_returnvalues.py
to improve the test's coverage even more. Interestingly, I discovered two
esoteric cases where DynamoDB does something which makes little sense, but
apparently simplified their implementation - but the beautiful thing is that
it also simplifies our implementation! See long comments about these two
cases in the test code.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Commit 0c460927bf broke UpdateItem's
ReturnValues=UPDATED_NEW by moving previous_item while it is still
needed. None of the existing tests broke because none of them needed
previous_item after it was moved - but it started to break when we
added support for nested attribute paths, which need this previous_item.
So this patch reverts the move back to a copy, as it was before the
aforementioned patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds full support for nested attribute paths (e.g., a.b[3].c)
in UpdateExpression. Since previous patches already added such
support for ProjectionExpression, ConditionExpression and FilterExpression,
the nested attribute paths feature is now complete, so we
remove the warning from the documents. However, there is one last loose
end to tie and we will do it in the next patch: After this patch, the
combination of UpdateExpression with nested attributes and ReturnValues
is still wrong, and the test for it in test_returnvalues.py still xfails.
Note that previous patches already implemented support for attribute paths
in expression evaluations - i.e., the right-hand side of UpdateExpression
actions, and in this patch we just needed to implement the left hand side:
When an update action is on an attribute a.b we need to read the entire
content of the top-level a (an RWM operation), modify just the b part of
its json with the result of the action, and finally write back the entire
content of a. Of course everything gets complicated by the fact that we
can have multiple actions on multiple pieces of the same JSON, and we also
need to detect overlapping and conflicting actions (we already have this
detection in the attribute_path_map<> class we introduced in a previous
patch).
I decided to leave one small esoteric difference, reproduced by the xfailing
test_update_expression.py::test_nested_attribute_remove_from_missing_item:
As expected, "SET x.y = :val" fails for an item if its attribute x doesn't
exist or the item itself does not exist. For the update expression
"REMOVE x.y", DynamoDB fails if the attribute x doesn't exist, but oddly
silently passes if the entire item doesn't exist. Alternator does not
currently reproduce this oddity - it will fail this write as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB limits the depth of a nested path in expressions (e.g. "a.b.c.d")
to 32 levels. This patch adds the same limit also to Alternator.
The exact value of this limit is less important (although it did make
sense to choose the same limit as DynamoDB does), but it's important
to have *some* limit: It's often convenient to handle paths with a
recursive algorithm, and if we allow unlimited path depth, it can
result in unlimited recursion depth, and a crash. Let's avoid this
possibility.
We detect the over-long path while building the parsed::path object
in the parser, and generate a parse error.
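The check itself can be as simple as this sketch (class and member names assumed, not the actual Alternator parser):
```cpp
constexpr unsigned max_path_depth = 32; // same limit as DynamoDB

void path::add_operator(path_operator op) { // hypothetical shape
    // The root attribute is one level; each dereference operator
    // (".member" or "[index]") adds another.
    if (_operators.size() + 2 > max_path_depth) {
        throw parse_error("attribute path nested too deeply");
    }
    _operators.push_back(std::move(op));
}
```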
This patch also includes a test that verifies that both Alternator
and DynamoDB have the same 32-level nesting limit on paths.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch prepares UpdateItem for updating of nested attribute paths
(e.g., "SET a.b = :val"), but does not yet support them.
Instead of _update_expression holding an unsorted list of "actions",
we change it to hold a attribute_path_map of actions. This will allow
us to process all the actions on a top-level attribute together, and
moreover gets us "for free" the correct checking for overlapping and
conflicting updates - exactly the same checking we already had in
attribute_path_map for ProjectionExpression. Other than this change,
most of this patch is just code movement, not functional changes.
After this patch, the tests for update path overlap and conflict pass:
test_update_expression_multi_overlap_nested and
test_update_expression_multi_conflict_nested.
We can also mark test_update_expression_nested_attribute_rhs as passing -
this test involves an attribute path in the right-hand-side of an update,
but the left-hand-side is still a top-level attribute, so it works (it
actually worked before this patch - it started working when we implemented
attribute paths in expressions, for ConditionExpression and
FilterExpression).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
For ProjectionExpression we implemented a hierarchical filter object which
can be used to hold a tree of attribute paths grouped by the top-level
attributes, and also to detect overlapping and conflicting entries.
For UpdateExpression, we need almost exactly the same object: We need to
group update actions (e.g., SET a.b=3) by the top-level attribute, and
also detect and fail overlapping or conflicting paths.
So in this patch we rewrite the data structure we had for ProjectionExpression
in a more generic manner, using the template attribute_path_map<T> - which
holds data of type T for each attribute path. We also implement a template
function attribute_path_map_add() to add a path/value pair to this map,
which includes all the overlap and conflict detecting logic.
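The shape of the structure is roughly the following (a sketch, not the actual definition):
```cpp
#include <map>
#include <optional>
#include <string>

// For each node, either a value of type T (a full path terminates here)
// or deeper children keyed by member name / array index. Adding a path
// that passes through a node holding a value, or that terminates on a
// node which already has children, is an overlap/conflict error.
template <typename T>
struct attribute_path_map_node {
    std::optional<T> value;
    std::map<std::string, attribute_path_map_node<T>> members;
    std::map<unsigned, attribute_path_map_node<T>> indexes;
};

template <typename T>
using attribute_path_map = std::map<std::string, attribute_path_map_node<T>>;
```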
There shouldn't be functional changes in this patch. The ProjectionExpression
code uses the new generic code instead of the specific code, but should work
the same. In the next patch we can use the new generic code to implement
UpdateExpression as well.
The only somewhat functional change is better error messages for
conflicting or overlapping paths - which now include one of the
conflicting paths.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already had many tests for nested attributes in UpdateExpression, but
this patch adds even more:
* Test nested attribute in right-hand-side in assignment: z = a.c.x.
* Test for making multiple changes to the same and different top-level
attributes in the same update.
* Additional cases of overlap between multiple changes.
* Tests for conflict between multiple changes.
* Tests for writing to a nested path on a non-existent attribute or item.
* A stronger test for array append sorts the added items.
As this feature was not yet implemented, these tests fail on Alternator,
and pass on DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Provide several utility functions which will be used in rpc message
handlers:
1. `update_address_mapping` -- add a new (server_id -> inet_address)
mapping for a `raft_rpc` instance.
This is used to update rpc module with a caller address
upon receiving an rpc message from a yet unknown server.
2. A set of dispatcher functions for every rpc call that forward calls
to an appropriate `raft::rpc_server` instance (for which `raft::rpc`
has a back-pointer).
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The reference is used by range streamer and (!) storage
service itself to find out if the consistent_rangemovement
option is ON/OFF.
Both places already have the database with the config at hand
and can be simplified.
v2: spellchecking
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210212095403.22662-1-xemul@scylladb.com>
The test timeuuid_test.py::testTimeuuid sporadically failed, and it turns out
the reason was a bug in the test - which this patch fixes.
The buggy test created a timeuuid and then compared the time stored in it
to the result of the dateOf() CQL function. The problem is that dateOf()
returns a CQL "timestamp", which has millisecond resolution, while the
timeuuid *may* have finer than millisecond resolution. The reason why this
test rarely failed is that in our implementation, the timeuuid almost
always gets a millisecond-resolution timestamp. Only if now() gets called
more than once in one millisecond does it pick a higher time, incremented
by less than a millisecond.
What this patch does is to truncate the time read from the timeuuid to
millisecond resolution, and only then compare it to the result of dateOf().
We cannot hope for more.
Fixes#8060
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210211165046.878371-1-nyh@scylladb.com>
In certain situations, where barely enough nodes to elect a new leader
are connected, a disruptive candidate can occasionally block the
election.
For example, take servers A B C D E where only A B C are active in a
partition. If the test wants to elect A, it has to first make all 3
servers reach the election timeout threshold (to make B and C receptive).
Then A is ticked till it becomes a candidate and has to send vote
requests to the other servers.
But all servers have a timer (_ticker) calling their periodic tick()
functions. If one of the other servers, say B, gets its timer tick
before A sends vote requests, B becomes a (disruptive) candidate and
will refuse to vote for A. In our case of only having 3 out of 5 servers
connected a single missing vote can hang the election.
This patch disables timer ticks for all servers when running custom
elections and partitioning.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This will be used later to filter the requests which belong
to the schema raft group and route them to shard 0.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Introduce a static `schema_raft_state_machine::group_id` constant,
which denotes the raft group id for the schema changes server.
Also fix the comment on the state machine class declaration
to emphasize that the instance will be managed by shard 0.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
When a table is altered in a mixed cluster by a node with a more
recent version, the request can fail if there is a difference in
schema_features between the two versions. This miniset handles the
two problems that prevent the sync.
Closes#8011
* github.com:scylladb/scylla:
schema: recalculate digest when computed_columns feature is enabled
schema tables: Remove mutations to unknown tables when adapting schema mutations
schema tables: Register 'scylla_tables' versions that were sent to other nodes
schema tables: Remove mutations to unknown tables when adapting schema mutations
Whenever an alter table occurs, the mutations for the just altered table
are sent over to all of the replicas from the coordinator.
In a mixed cluster the mutations should be adapted to a specific version
of the schema. However, the adaptation that happens today doesn't omit
mutations to newly added schema tables, to be more specific, mutations
to the `computed_columns` table which doesn't exist for example in
version 2019.1
This makes altering a table during a rolling upgrade from 2019.1 to
2020.1 dangerous.
schema tables: Register 'scylla_tables' versions that were sent to other nodes
In a mixed cluster there can be a situation where `scylla_tables` needs
to be sent over to another node because of a schema sync, or because the
node pulls it because it is referenced by a frozen_mutation. The former
is not a problem since the sending node chooses the version to send.
However, the latter is problematic, since `scylla_tables` versions are not
registered anywhere.
This registers every `scylla_tables` schema version that is used to adapt
mutations, since after this happens a schema pull for this version might
follow.
Added a test which measures the time it takes to replace sstables in a table's
sstable_set, using the leveled compaction strategy.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Column_family_test allows calling private methods on column_family's
sstable_set. It may be useful not only in the boost tests, so it's moved
from test/boost/sstable_test.hh to test/lib/sstable_test_env.hh.
sstable_test.hh includes sstable_test_env.hh, so no includes need to be
changed.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The sstable_set enables copying without iterating over all its elements,
so it's faster to copy a set and modify it than copy all its elements
while filtering the ones that were erased.
The modifications are done on a temporary version of the set, so that
if an operation fails the base version remains unchanged.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Currently, the sstable_set in a table is copied before every change
to allow accessing the unchanged version by existing sstable readers.
This patch changes the sstable_set to a structure that allows copying
without actually copying all the sstables in the set, while providing
the same methods (and some extra) without majorly decreasing their speed.
This is achieved by associating all copies with sstable_set versions
which hold the changes that were performed in them, and references to
the versions that were copied, a.k.a. their parents. The set represented
by a version is the result of combining all changes of its ancestors.
This causes most methods of the version to have a time complexity
dependent on the number of its ancestors. To limit this number, versions
that represent copies that have already been deleted are merged with its
descendants.
The strategy used for deciding when and with which of its children
should a version be merged heavily depends on the use case of sstable_sets:
there is a main copy of the set in a table class which undergoes many
insertions and deletions, and there are copies of it in compaction or
mutation readers which are further copied or edited few or zero times.
It's worth mentioning that when a copy is made, the copied set should not
be modified anymore, because it would also modify the results given by the
copy. In order to still allow modifying the copied set, if a change is
to be performed on it, the version associated with this set is replaced
with a new version depending on the previous one.
As we can see, in our use case there is a main chain of versions (with
changes from the table), and smaller branches of versions that start
from a version in this chain but are deleted soon after.
In such a case we can merge a version when it has exactly one descendant,
as this limits the number of concurrent ancestors of a version to the
number of copies of its ancestors that are concurrently used. During each
such merge, the parent version is removed and the child version is
modified so that all operations on it give the same results.
In order to preserve the same interface, the sstable_set still contains a
lw_shared_ptr<sstable_list>, but sstable_list (previously an alias for
unordered_set<shared_sstable>) is now a new structure. Each sstable_set
contains a sstable_list but not every sstable_list has to be contained
by a sstable_set, and we also want to allow fast copying of sstable_lists,
so the reference to the sstable_set_version is kept by the sstable_lists
and the sstable_set can access the sstable_set_version it's associated
with through its sstable_list.
Accessing sstables that are elements of a certain sstable_set copy (so
the select, select_sstable_runs and sstable_list's iterator) gets results
from containers that hold all sstables from all versions (which are stored
in a single, shared "versioned_sstable_set_data" structure), and then
filters out those sstables that aren't present in the version in question.
This version of the sstable_set allows adding and erasing the same sstable
repeatedly. Inserting and erasing from the set modifies the containers in
a version only when it has an actual effect: if an sstable has been added
in the parent version, and hasn't been erased in the child version, adding
it again will have no effect. This ensures that when merging versions, the
versions have disjoint sets of added and erased sstables (an sstable can
still be added in one and erased in the second). It's worth noting that if
an sstable has been added in one of the merged sets and erased in the
second, the version that remains after merging doesn't need to have any
info about the sstable's inclusion in the set - it can be inferred from
the changes in previous versions (and it doesn't matter if the sstable has
been erased before or after being added).
To release pointers to sstables as soon as possible (i.e. when all references
to versions that contain them die), if an sstable is added/erased in all
child versions that are based on a version which has no external references,
this change gets removed from these versions and added to the parent version.
If an sstable's insertion gets overwritten as a result, we might be able
to remove the sstable completely from the set. We know how many times this
needs to happen by counting, for each sstable, how many different versions
it has been added in. When a change that adds an sstable gets merged with
a change that removes it, or when such a change simply gets deleted alongside its
associated version, this count is reduced, and when an sstable gets added to a
version that doesn't already contain it, this count is increased.
The methods that modify the set's contents give a strong exception guarantee
by trying to insert new sstables into its containers, and erasing them in
case an exception is caught.
Fixes#2622
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
If the range expression in a range based for loop returns a temporary,
its lifetime is extended until the end of the loop. The same can't be said
about temporaries created within the range expression. In our case,
*t->get_sstables_including_compacted_undeleted() returns a reference to a
const sstable_list, but the t->get_sstables_including_compacted_undeleted()
is a temporary lw_shared_ptr, so its lifetime may not be prolonged until the
end of the loop, and it may be the sole owner of the referenced sstable_list,
so the referenced sstable_list may be already deleted inside the loop too.
Fix by creating a local copy of the lw_shared_ptr, and getting the reference
from it in the loop.
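The bug distills to a few lines (illustrative, with shared_ptr standing in for lw_shared_ptr):
```cpp
#include <memory>
#include <vector>

std::shared_ptr<std::vector<int>> get_list(); // owner returned by value

void broken() {
    // The shared_ptr temporary dies at the end of the range expression
    // (before C++23), so the loop iterates over a destroyed vector.
    for (int v : *get_list()) { (void)v; } // use-after-free
}

void fixed() {
    auto list = get_list(); // the local copy keeps the owner alive
    for (int v : *list) { (void)v; }
}
```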
Fixes#7605
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Adding a non-empty set of sstables as the set of all sstables in
an sstable_set could cause inconsistencies with the values returned
by select_sstable_runs because the _all_runs map would still be
initialized empty. For similar reasons, the provided sstable_set_impl
should also be empty.
Dispel doubts by removing the unordered_set from the constructor, and
adding a check of emptiness of the sstable_set_impl.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The scylla_io_setup condition for nr_disks was using the bitwise operator
(&) instead of the logical and operator (and), causing the io_properties
files to have incorrect values.
Fixes#7341
Reviewed-by: Lubos Kosco <lubos@scylladb.com>
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Closes#8019
"
Currently, register_inactive_read accepts an eviction_notify_handler
to be called when the inactive_read is evicted.
However, in case there was an error in register_inactive_read
the notification function isn't called, leaving behind
state that needs to be cleaned up.
This series separates the register_inactive_reader interface
into 2 parts:
1. register_inactive_reader(flat_mutation_reader) - which just registers
the reader and returns an inactive_read_handle, *if permitted*.
Otherwise, the notification handler is not called (it is not known yet)
and the caller is not expected to do anything fance at this point
that will require cleanup.
This optimizes the server when overloaded since we do less work
that we'd need to undo in case the reader_concurrecy_semaphore
runs out of resources.
2. After register_inactive_reader succeeded to return a valid
inactive_read_handle, the caller sets up its local state
and may call `set_notify_handler` to set the optional
notify_handler and ttl on the o_r_h.
After this state, the notify_handler will be called when
the inactive_reader is evicted, for any reason.
querier_cache::insert_querier was modified to use the
above procedure and to handle (and log/ignore) any error
in the process.
The mutual tracking between inactive_read_handle and inactive_read
was simplified by keeping an iterator in the handle and a backpointer
in the inactive_read object. The former is used to evict the reader
and to set the notify_handler and/or ttl without having to lookup the i_r.
The latter is used to invalidate the i_r_h when the i_r is destroyed.
Test: unit(release), querier_cache_test(debug)
"
* tag 'register_inactive_read-error-handling-v6' of github.com:bhalevy/scylla:
querier_cache: insert_querier: ignore errors to register inactive reader
querier_cache: insert_querier: handle errors
querier_utils: mark functions noexcept
reader_concurrency_semaphore: register_inactive_read: make noexcept
reader_concurrency_semaphore: separate set_notify_handler from register_inactive_reader
reader_concurrency_semaphore: inactive_read: make ttl_timer non-optional
reader_concurrency_semaphore: inactive_read: use intrusive list
reader_concurrency_semaphore: do_wait_admission: use try_evict_one_inactive_read
reader_concurrency_semaphore: try_evict_one_inactive_read: pass evict_reason
reader_concurrency_semaphore: unregister_inactive_read: calling on wrong semaphore is an internal error
reader_concurrency_semaphore: unregister_inactive_read: do nothing if disengaged
reader_concurrency_semaphore: inactive_read_handle: swap definition order
reader_lifecycle_policy: retire low level try_resume method
reader_concurrency_semaphore: inactive_read: keep a flat_mutation_reader
One of the USING TIMEOUT tests relied on a specific TTL value,
but that's fragile if the test runs on the boundary of 2 seconds.
Instead, the test case simply checks if the TTL value is present
and is greater than 0, which makes the test robust unless its execution
lasts for more than 1 million seconds, which is highly unlikely.
Fixes#8062
Closes#8063
Previous version, merged and dequeued due to a dependency bug: https://github.com/scylladb/scylla/pull/7297
Note: this pull request is temporarily created against /next, because it depends on https://github.com/scylladb/scylla/pull/7279.
This series adds support for `max_concurrent_requests_per_shard` config variable to alternator. Excessive requests are shed and RequestLimitExceeded is sent back to the client.
Tested manually by reloading Scylla multiple times and editing the config, while bombarding alternator with many concurrent requests. The observed, expected failures are:
`botocore.errorfactory.RequestLimitExceeded: An error occurred (RequestLimitExceeded) when calling the CreateTable operation: too many in-flight requests: 17`
Fixes#7294
Closes#8039
* github.com:scylladb/scylla:
alternator: server: return api_error instead of throwing
alternator: add requests_shed metrics
alternator: add handling max_concurrent_requests_per_shard
alternator: add RequestLimitExceeded error
The code that creates system keyspaces open-codes a lot of things from
database::create_keyspace(). The patch makes create_keyspace() suitable
for both system and non-system keyspaces and uses it to create system
keyspaces as well.
Message-Id: <20210209160506.1711177-1-gleb@scylladb.com>
The error is benign but if it is not handled "unhandled exception" error
will be printed in the logs.
Message-Id: <20210209150313.GA1708015@scylladb.com>
The interface of the failure detector service is cleaned up a little:
- an unimplemented method is removed (is_alive)
- a return type of another method is fixed (arrival_samples)
- a getter for the most recent successful update is added (last_update)
This code was tested manually during various overload protection
experiments, which check if the failure detector can be used to reject
requests which have a very small chance of succeeding within their
timeout.
Closes#8052
* github.com:scylladb/scylla:
failure_detector: add getting last update time point
failure_detector: return arrival samples by const reference
failure_detector: remove unimplemented is_alive method
The same trick is used as in C*:
79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)
The edited CQL test relied on quietly accepting non-existing DCs, so it had to
be removed. Also, one boost-test referred to nonexistent `datacenter2` and had
to be removed.
Fixes #7595
Closes #8056
* github.com:scylladb/scylla:
tests: Adjusted tests for DC checking in NTS
locator: Check DC names in NTS
Since fea5067df we enforce a limit on the memory consumption of
otherwise non-limited queries like reverse and non-paged queries. This
limit is sent down to the replicas by the coordinator, ensuring that
each replica is working with the same limit. This however doesn't work
in a mixed cluster, when upgrading from a version which doesn't have
this series. This has been worked around by falling back to the old
max_result_size constant of 1MB in mixed clusters. This however resulted
in a regression when upgrading from a pre-fea5067df version to a
post-fea5067df one. Pre-fea5067df already had a limit for reverse queries,
which was generalized to also cover non-paged ones by fea5067df.
The regression manifested in previously working reverse queries being
aborted. This happened because even though the user had set a generous
limit for them before the upgrade, in the mixed cluster replicas fall back
to the much stricter 1MB limit temporarily ignoring the configured limit
if the coordinator is an old node. This patch solves this problem by
using the locally configured limit instead of the max_result_size
constant. This means that the user has to take extra care to configure
the same limit on all replicas, but at least they will have working
reverse queries during the upgrade.
Fixes: #8022
Tests: unit(release), manual test by user who reported the issue
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210209075947.1004164-1-bdenes@scylladb.com>
* seastar 4c7c5c7c4...76cff5896 (6):
> rpc: Make is possible for rpc server instance to refuse connection
> reactor: expose cumulative tasks processed statistic
> fair_queue: add missing #include <optional>
> reactor: optimize need_preempt() thread-local-storage access
> Merge " Use reference for backend->reactor link" from Pavel E
> test: coroutines: failed coroutine does not throw
CQL test relied on quietly accepting non-existing DCs, so it had
to be removed. Also, one boost-test referred to a nonexistent
`datacenter2` and had to be removed.
Since the reader may normally be dropped upon
registration, hitting an error is equivalent to having
it evicted at any time, so just log the exception
and ignore it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
They all are trivially noexcept.
Mark them so to simplify error handling assumptions in the
next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Catch errors when allocating an inactive_read and just log them.
Return an empty inactive_read_handle in
this case, as if the inactive reader was evicted due to
lack of resources.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Register the inactive reader first with no
evict_notify_handler and ttl.
Those can be set later, only if registration succeeded.
Otherwise, as in the querier example, there is no need
to place the querier in the index and erase it
on eviction.
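A hedged sketch of the two-step flow this enables, with stub types standing
in for the real semaphore and querier index:
```
#include <chrono>
#include <functional>
#include <map>
#include <optional>
#include <string>

struct handle { bool engaged = false; };

struct semaphore_t {
    // May fail; failure is reported as a disengaged handle, not an exception.
    handle register_inactive_read(std::string reader) {
        return handle{!reader.empty()};
    }
    void set_notify_handler(handle&, std::function<void()> on_evict,
                            std::optional<std::chrono::seconds> ttl) {
        (void)on_evict; (void)ttl;
    }
};

void insert_querier(semaphore_t& sem, std::map<int, std::string>& index, int key) {
    // 1) Register first, with no notify handler and no ttl.
    auto h = sem.register_inactive_read("reader");
    if (!h.engaged) {
        return; // as if evicted immediately: no index entry to clean up
    }
    // 2) Only after success, set up local state and the eviction hook.
    index.emplace(key, "querier");
    sem.set_notify_handler(h, [&index, key] { index.erase(key); },
                           std::chrono::seconds(30));
}
```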
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
By default it will be unarmed and with no callback
so there's no need to wrap it in a std::optional.
This saves an allocation and another potential
error case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To simplify insertion and eviction into the inactive_reads container,
use an intrusive list that requires a single allocation for the
inactive_read object itself.
This allows passing a reference to the inactive_read
to evict it.
Note that the reader will be unlinked automatically from
the inactive_readers list if the inactive_read_handle is destroyed.
This is okay since there is no need to track the inactive_read
if the caller loses the i_r_h (e.g. if an error is thrown).
It is also safe to evict the inactive_reader while the
i_r_h is alive. In this case the i_r will be unlinked
after the flat_mutation_reader it holds is moved out of it.
bi::auto_unlink will detect that it's already unlinked
when destroyed and do nothing.
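The auto_unlink pattern referred to above looks roughly like this (type
names are placeholders):
```
#include <boost/intrusive/list.hpp>

namespace bi = boost::intrusive;

// The hook unlinks itself from whatever list it is in when the object is
// destroyed; unlinking an already-unlinked hook is a no-op.
struct inactive_read : public bi::list_base_hook<bi::link_mode<bi::auto_unlink>> {
    // ... flat_mutation_reader, notify handler, ttl timer, etc.
};

// auto_unlink hooks cannot maintain a size counter, hence constant_time_size<false>.
using inactive_reads_list =
        bi::list<inactive_read, bi::constant_time_size<false>>;

int main() {
    inactive_reads_list readers;
    {
        inactive_read r;
        readers.push_back(r); // a single allocation: the object itself
    }                         // destroyed here: unlinked automatically
    return readers.empty() ? 0 : 1;
}
```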
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Calling unregister_inactive_read on the wrong semaphore is a blatant
bug so better call on_internal_error so it'd be easier to catch and fix.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There is no need to lookup the inactive_read if the i_r_h
is disengaged, it should not be registered so just return
quickly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There's no need to hold a unique_ptr<flat_mutation_reader> as
flat_mutation_reader itself holds a unique_ptr<flat_mutation_reader::impl>
and functions as a unique ptr via flat_mutation_reader_opt.
With that, unregister_inactive_read was modified to return a
flat_mutation_reader_opt rather than a std::unique_ptr<flat_mutation_reader>,
keeping exactly the same semantics.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch fully implements support for attribute paths (e.g. a.b.c, a.d[3])
for the ConditionExpression in conditional updates, and FilterExpression in
queries and scans. After this patch, all previously-xfailing tests in
test_projection_expression.py and test_filter_expression.py now pass.
The fix is simple: Both ConditionExpression and FilterExpression use the
function calculate_value() to calculate the value of the expression. When
this function calculates the value of a path, it mustn't just take the
top-level attribute - it needs to walk into the specific sub-object as
specified by the attribute path.
This is not the end of attribute path support, UpdateExpression and
ReturnValues are not yet fully supported. This will come in following
patches.
Refs #5024
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Strengthen the tests in test_condition_expression.py for nested attribute
paths (e.g., b.y[1]):
1. The test test_update_condition_nested_attributes only tested successful
conditions involving nested attributes. Let's also add an *unsuccessful*
condition, to verify we don't accidentally pass every condition involving
a nested attribute.
2. Test a case where a non-existent nested attribute is involved in the
condition.
3. In the test for an attribute path with references - "#name1.#name2",
make sure the test doesn't pass if #name2 is silently ignored.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This fixes the problem where the coordinator already knows about the
new schema and issues a read which uses new objects, but the replica
doesn't know those objects yet. The read will fail in this case. We
can avoid this if we propagate schema changes with reads, like we
already do for writes.
Message-Id: <20210205163422.414275-1-tgrabiec@scylladb.com>
Refs #6148
Commitlog disk limit was previously a "soft" limit, in that we allowed allocating new segments, even if we were over
disk usage max. This would also cause us sometimes to create new segments and delete old ones, if badly timed in
needing and releasing segments, in turn causing useless disk IO for pre-allocation/zeroing.
This patch set does:
* Make limit a hard limit. If we have disk usage > max, we wait for delete or recycle.
* Make flush threshold configurable. Default is to ask for flush when over 50% usage. (We do not wait for results; see the sketch after these notes.)
* Make flush "partial". We flush X% of the used space (used - thres/2), and make the rp limit accordingly. This means we will try to clear the N oldest segments, not all. I.e. "lighter" flush. Of course, if the CL is wholly dominated by a single CF, this will not really help much. But when > 1 cf is used, it means we can skip those not having unflushed data < req rp.
* Force more eager flush/recycle if we're out of segments
Note: flush threshold is not exposed in scylla config (yet). Because I am unsure of wording, and even if it should.
Note: testing is sparse, esp. in regard to latency/timeouts added in high usage scenarios. While I can fairly easily provoke "stalls" (i.e. forced waiting for segments to free up) with simple C-S, it is hard to say exactly where in a more sane config (I set my limits looow) latencies will start accumulating.
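A back-of-the-envelope sketch of the partial-flush arithmetic above (names
assumed; the real code works in segments and replay positions, not raw bytes):
```
#include <cstdint>

// How many bytes of the oldest data to ask tables to flush, or 0 if usage
// is still below the threshold. Default threshold: 50% of the hard limit.
uint64_t bytes_to_flush(uint64_t used, uint64_t max_disk_size) {
    const uint64_t threshold = max_disk_size / 2;
    if (used <= threshold) {
        return 0;
    }
    // Partial flush: aim at used - threshold/2, i.e. clear only the N
    // oldest segments. The result is translated into a replay-position
    // limit and the flush is requested, not waited for.
    return used - threshold / 2;
}
```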
Closes#7879
* github.com:scylladb/scylla:
commitlog: Force earlier cycle/flush iff segment reserve is empty
commitlog: Make segment allocation wait iff disk usage > max
commitlog: Do partial (memtable) flushing based on threshold
commitlog: Make flush threshold configurable
table: Add a flush RP mark to table, and shortcut if not above
This patch fully implements support for attribute paths (e.g. a.b.c, a.d[3])
for the ProjectionExpression in the various operations where this parameter
is supported - GetItem, BatchGetItem, Query and Scan. After this patch, all
xfailing tests in test_projection_expression.py now pass.
In the previous patch we remembered in the "attrs_to_get" object not only
the top-level attributes to read from the table, but also how to filter
from it only the desired pieces of the nested document. In this patch we
add a filter() function to do this filtering, and call it in the right
places to post-process the JSON objects we read from the table.
We also had to fix reference resolution in paths to resolve all the
components of the path (e.g., #name1.#name2) and not just the top-level
attribute.
This is not the end of attribute path support, there are still other
expressions (ConditionExpression, UpdateExpression, FilterExpression,
ReturnValues) where they are not yet supported. This will come in following
patches.
Refs #5024
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the existing code, the variable "attrs_to_get" is a list of top-level
attributes to fetch for an item. It is used to implement features like
ProjectionExpression or AttributesToGet in GetItem and other places.
However, to support attribute paths (e.g., a.b.c[2]) in ProjectionExpression,
i.e., issue #5024, we need more than that. We still need to know the top-
level attribute "a", because this is the granularity we have in the Scylla
table (all the content inside "a" is serialized as a single JSON); But we
also need to remember exactly which parts *inside* "a" we will need to
extract and return.
So in this patch we add a new type, "attrs_to_get", which is more than
just a list of top-level attributes. Instead, it is a *map*, whose keys
are the top-level attributes, and the value for each of them is a
"hierarchy_filter", an object which describes which part of the attribute
is needed.
This patch includes the code which converts the AttributesToGet and
ProjectionExpression into the new attrs_to_get structure. During this
conversion, we recognize two kinds of errors which DynamoDB complains
about: We recognize "overlapping" attributes (e.g., requesting both
a.b and a.b.c) and "conflicting" attributes (e.g, requesting both
a.b and a[1]). After this, two xfailing tests we had for detecting
these overlap and conflicts finally pass and their "xfail" label is
removed.
After this patch, we have the attrs_to_get object which can allow us
to filter only the requested pieces of the top-level attributes, but
we don't use it yet - so this patch is not enough for complete support
of attribute paths in ProjectionExpression. We will complete this
support in the next patch.
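A sketch of the shape this gives attrs_to_get; hierarchy_filter's exact
definition here is an assumption, only its role comes from the text:
```
#include <map>
#include <memory>
#include <string>

struct hierarchy_filter {
    // Empty maps mean "return this whole (sub-)attribute".
    std::map<std::string, std::unique_ptr<hierarchy_filter>> members; // a.b
    std::map<unsigned, std::unique_ptr<hierarchy_filter>> indexes;    // a[2]
};

// Keys are top-level attribute names -- the granularity stored in the
// Scylla table -- and each value says which pieces inside the attribute's
// serialized JSON must be extracted and returned.
using attrs_to_get = std::map<std::string, hierarchy_filter>;
```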
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds more tests for attribute paths in ProjectionExpression,
that deal with document paths which do not fit the content of the item -
e.g., trying to ask for "a.b[3]" when a.b is not a list but rather an
integer or a dictionary.
Moreover, we note that if you try to ask for "a.b, a[2]", DynamoDB
fails this request as a "conflict". The reasoning is that no single
item can ever have both a.b and a[2] (the first is only valid for
dictionaries, the second for lists). It's not clear to me why we
still can't return whichever of the two actually is relevant, but
the fact is that DynamoDB does not allow it.
The new tests fail on Alternator (marked xfailed) and pass on DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We have 7 xfailing tests for usage of nested attribute paths (e.g.,
"a.b.c[7]") in a ProjectionExpression. But some of these tests were too
"easy" to pass - a trivial and *wrong* implementation that just ignores
the path and uses the top level attribute (in the above example, "a"),
would cause some of them to start passing.
So this patch strengthens these tests. They still pass on AWS DynamoDB,
and now continue to fail with the aforementioned broken implementation.
Refs #5024.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The first condition expressions we implemented in Alternator were the old
"Expected" syntax of conditional updates. That implementation had some
specific assumptions on how it handles errors: For example, in the "LT"
operator in "Expected", the second operand is always part of the query, so
an error in it (e.g., an unsupported type) resulted in a ValidationException
error.
When we implemented ConditionExpression and FilterExpression, we wrongly
used the same functions check_compare(), check_BETWEEN(), etc., to implement
them. This results in some inaccurate error handling. The worst example is
what happens when you use a FilterExpression with an expression such as
"x < y" - this filter is supposed to silently skip items whose "x" and "y"
attributes have unsupported or different types, but in our implementation
a bad type (e.g., a list) for y resulted in a ValidationException which
aborted the entire scan! Interestingly, in one case (that of BEGINS_WITH)
we actually noticed the slightly different behavior needed and implemented
the same operator twice - with ugly code duplication. But in other operators
we missed this problem completely.
This patch first adds extensive tests of how the different expressions
(Expected, QueryFilter, FilterExpression, ConditionExpression) and the
different operators handle various input errors - unsupported types,
missing items, incompatible types, etc. Importantly, the tests demonstrate
that there is often different behavior depending on whether the bad
input comes from the query, or from the item. Some of the new tests
fail before this patch, but others pass and were useful to verify that
the patch doesn't break anything that already worked correctly previously.
As usual, all the tests pass on DynamoDB.
Finally, this patch *fixes* all these problems. The comparison functions
like check_compare() and check_BETWEEN() now not only take the operands,
they also take booleans saying if each of the operands came from the
query or from an item. The old-syntax caller (Expected or QueryFilter)
always say that the first operand is from the item and the second is
from the query - but in the new-syntax caller (ConditionExpression or
FilterExpression) any or all of the operands can come from the query
and need verification.
The old duplicated code for check_BEGINS_WITH() - which had a TODO to remove
it - is finally removed. Instead we use the same idea of passing booleans
saying if each of its operands came from an item or from the query.
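A toy sketch of the provenance-aware checking described above (names and
types are invented for illustration; the real functions operate on DynamoDB
values):
```
#include <stdexcept>

enum class cmp_op { LT, LE, GT, GE, EQ };

struct operand {
    int value;
    bool from_query; // did this operand come from the request, or the item?
};

bool check_compare(operand a, operand b, cmp_op op, bool types_supported) {
    if (!types_supported) {
        // A bad operand from the query is a hard error; a bad operand from
        // the item silently fails the condition (the item is skipped).
        if (a.from_query || b.from_query) {
            throw std::invalid_argument("ValidationException: unsupported operand");
        }
        return false;
    }
    switch (op) {
    case cmp_op::LT: return a.value <  b.value;
    case cmp_op::LE: return a.value <= b.value;
    case cmp_op::GT: return a.value >  b.value;
    case cmp_op::GE: return a.value >= b.value;
    default:         return a.value == b.value;
    }
}
```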
Fixes#8043
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
db::update_keyspace() needs a sharded<storage_proxy>
reference, but its only caller already has it and
can pass one as an argument.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210205175611.13464-3-xemul@scylladb.com>
On start the transport controller keeps the storage service
in the server config's lambda just to let the server grab a
database config option.
The same can be achieved by passing the sharded database
reference to sharded<server>::start, so that each server
instance gets the local database with its config.
As a nice side effect transport::server's config looks
more like a config with simple values and without methods
and/or lambdas on board.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210205175611.13464-1-xemul@scylladb.com>
A few methods in the column_family API were doing the aggregation wrong,
specifically, bloom filter disk size.
The issue is not always visible, it happens when there are multiple
filter files per shard.
Fixes#4513
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes#8007
we're unconditionally using make_combined_mutation_source(), which causes extra
allocations, even if memtable was flushed into a single sstable, which is the
most common case. memtable will only be flushed into more than one sstable if
TWCS is used and memtable had old data written into it due to out-of-order
writes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210205182028.439948-1-raphaelsc@scylladb.com>
Previous patch changed the -O flag for dev builds. This
had no effect on unit tests compile+run time, and was
aimed at improving the individual tests, dtest, stress-
and other tests runtimes.
This change is mainly focused on improving the debug-mode
full unit tests running, while keeping the debuggability:
the compile+run time gets ~10 minutes shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Based on the original patch from Nadav.
The -O1-generated code is too slow. Raising the opt level
slows compilation down ~9%, but greatly improves the
testing time. E.g. running the alternator test alone is
2.5 times faster with -O2 (118 vs 48 seconds).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It was noticed that current optimization levels do not
generate fast enough code for dev builds. On the other
hand just increasing the default optimization level will
make re-compile-mostly work much more frustrating.
The new configure.py option allows selecting the desired
-O option value by hand. Current hard-coded values are
used as defaults.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
std_list has an iterator object which provides the python3 `__next__()`
method only. Python2 wants a method called `next()`. As it is trivial to
provide both, do that to allow debugging on centos7.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210205073549.734362-1-bdenes@scylladb.com>
Fix raft_fsm_test failure in debug mode. ASAN complained
that follower_progress is used in append_entries_reply()
after it was destroyed. This could happen if in maybe_commit()
we switched to a new configuration and destroyed old progress
objects.
The fix is to lookup the object one more time after maybe_commit().
Throwing a C++ exception creates unnecessary overhead, so when
an unsupported operation is encountered, the api error is directly
returned instead of being thrown.
The config value is already used to set an upper limit on concurrent
CQL requests, and now alternator abides by it as well.
Excessive requests result in returning RequestLimitExceeded error
to the client.
Tests: manual
Running multiple concurrent requests via the test suite results in:
botocore.errorfactory.RequestLimitExceeded: An error occurred (RequestLimitExceeded) when calling the CreateTable operation: too many in-flight requests: 17
"
Before this patch, each index reader had its own cache of partition
index pages. Now there is a shared cache, owned by the sstable object.
This allows concurrent reads to share partition index pages and thus
reduce the amount of I/O.
It used to be like that a few years ago, but we moved to per-reader
cache to implement incremental promoted index parsing, to avoid OOMs
with large partitions. At that time, the solution involved caching
input streams inside partition index entries, which couldn't be reused
between readers. This could have been solved differently. Instead of
caching input streams, we can cache information needed to create them
(temporary_buffer<>). This solution takes this approach.
This series is also needed before we can implement promoted index
caching. That's because before the promoted index can be shared by
readers, the partition index entries, which hold the promoted index,
must also be shareable.
The pages live as long as there is at least one index reader
referencing them. So it only helps when there is concurrent access. In
the future we will keep them for longer and evict on memory pressure.
Promoted index cursor is no longer created when the partition index
entry is parsed, but it's created on-demand when the top-level cursor
enters the partition. The promoted index cursor is owned by the
top-level cursor, not by the partition index entry.
Below are the results of an experiment performed on my laptop which
demonstrates the improvement in performance.
Load driver command line:
./scylla-bench \
-workload uniform \
-mode read \
--partition-count=10 \
-clustering-row-count=1 \
-concurrency 100
Scylla command line:
scylla --developer-mode=1 -c1 -m1G --enable-cache=0
The workload is IO-bound.
Before, we needed 2 I/O per read, now we need 1 (amortized).
The throughput is ~70% higher.
Before:
time ops/s rows/s errors max 99.9th 99th 95th 90th median mean
1s 4706 4706 0 35ms 30ms 27ms 25ms 24ms 21ms 21ms
2s 4646 4646 0 42ms 31ms 31ms 27ms 25ms 21ms 22ms
3.1s 4670 4670 0 40ms 27ms 26ms 25ms 25ms 21ms 21ms
4.1s 4581 4581 0 39ms 33ms 33ms 27ms 26ms 21ms 22ms
5.1s 4345 4345 0 40ms 37ms 35ms 32ms 31ms 21ms 23ms
6.1s 4328 4328 0 49ms 40ms 34ms 32ms 31ms 22ms 23ms
7.1s 4198 4198 0 45ms 36ms 35ms 31ms 30ms 22ms 24ms
8.2s 3913 3913 0 51ms 50ms 50ms 39ms 35ms 24ms 26ms
9.2s 4524 4524 0 34ms 31ms 30ms 28ms 27ms 21ms 22ms
After:
time ops/s rows/s errors max 99.9th 99th 95th 90th median mean
1s 7913 7913 0 25ms 25ms 20ms 15ms 14ms 12ms 13ms
2s 7913 7913 0 18ms 18ms 18ms 16ms 14ms 12ms 13ms
3s 8125 8125 0 20ms 20ms 17ms 15ms 14ms 12ms 12ms
4s 5609 5609 0 41ms 35ms 29ms 28ms 27ms 13ms 18ms
5.1s 8020 8020 0 18ms 17ms 17ms 15ms 14ms 12ms 13ms
6.1s 7102 7102 0 27ms 27ms 24ms 19ms 18ms 13ms 14ms
7.1s 5780 5780 0 26ms 26ms 26ms 23ms 22ms 17ms 18ms
8.1s 6530 6530 0 37ms 34ms 26ms 22ms 20ms 15ms 15ms
9.1s 7937 7937 0 19ms 19ms 17ms 17ms 16ms 12ms 13ms
Tests:
- unit [release]
- scylla-bench
"
* tag 'share-partition-index-v1' of github.com:tgrabiec/scylla:
sstables: Share partition index pages between readers
sstables: index_reader: Drop now unnecessary index_entry::close_pi_stream()
sstables: index_reader: Do not store cluster index cursor inside partition indexes
Our test for tracing Alternator requests can't be sure when tracing a request
finished, because tracing is asynchronous and has no official ending signal.
So before we can conclude that tracing failed, we need to wait until a
timeout, which in the current code was roughly 6.4 seconds (the timeout
logic is unnecessarily convoluted, but to make a long story short it has
exponential sleeps starting with 0.1 second and ending with 3.2 seconds,
totaling 6.4 seconds).
It turns out that sporadically, in test runs on overcommitted test machines
with the very slow debug build, we fail this test with this timeout.
So this patch increases the timeout to 51.2 seconds. It should be more
than enough for everyone. Famous last words :-)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210204151554.582260-1-nyh@scylladb.com>
Before this patch, each index reader had its own cache of partition
index pages. Now there is a shared cache, owned by the sstable object.
This allows concurrent reads to share partition index pages and thus
reduce the amount of I/O.
This change is also needed before we can implement promoted index caching.
That's because before the promoted index can be shared by readers, the
partition index entries, which hold the promoted index, must also be
shareable.
The pages live as long as there is at least one index reader
referencing them. So it only helps when there is concurrent access. In
the future we will keep them for longer and evict on memory pressure.
Promoted index cursor is no longer created when the partition index entry
is parsed, but it's created on-demand when the top-level cursor enters
the partition. The promoted index cursor is owned by the top-level cursor,
not by the partition index entry.
Currently, the partition index page parser will create and store
promoted index cursors for each entry. The assumption is that
partition index pages are not shared by readers so each promoted index
cursor will be used by a single index_reader (the top-level cursor).
In order to be able to share partition index entries we must make the
entries immutable and thus move the cursor outside. The promoted index
cursor is now created and owned by each index_reader. There is at most
one such active cursor per index_reader bound (lower/upper).
The current procedure for building images is complicated, as it
requires access to x86_64, aarch64, and s390x machines. Add an alternative
procedure that is fully automated, as it relies on emulation on a single
machine.
It is slow, but requires less attention.
Closes#8024
The supplementary groups are removed by default, so add them back.
Supplementary groups are useful for group-shared directories like
ccache.
I added them to the podman-only branch since I don't know if this
works for docker. If a docker user verifies it works there too,
we can move it to the generic code.
Closes#8020
The patch adds a set of counters for various events inside the raft
implementation to facilitate monitoring and debugging.
Message-Id: <20210204125313.GA1513786@scylladb.com>
We can do with a forward declaration instead to reduce
the dependency, and include reader_concurrency_semaphore.hh
in test/lib/reader_permit.cc instead.
We need to include "../../reader_permit.hh" to get the
definition of class reader_permit. We need the include
path to prevent recursive include (or rename test/lib/reader_permit.hh
but this creates a lot of code churn).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210204122002.1041808-1-bhalevy@scylladb.com>
commitlog was changed to use fragmented_temporary_buffer::ostream (db::commitlog::output).
So if there are discontiguous small memory blocks, they can be used to satisfy
an allocation even if no contiguous memory blocks are available.
To prevent that, as Avi suggested, this change allocates in 128K blocks
and frees the last one to succeed (so that we won't fail on allocating continuations).
Fixes#8028
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210203100333.862036-1-bhalevy@scylladb.com>
This is an implementation of `raft::failure_detector` for Scylla
that uses gms::gossiper to query `is_alive` state for a given
raft server id.
Server ids are translated to `gms::inet_address` to be consumed
by `gms::gossiper` with the help of `raft_rpc` class,
which manages the mapping.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210129223109.2142072-1-pa.solodovnikov@scylladb.com>
This series provides additional RPC verbs and corresponding methods in
`messaging_service` class, as well as a scylla-specific Raft RPC module
implementation that uses `netw::messaging_service` under the hood to
dispatch RPC messages.
* https://github.com/ManManson/scylla/commits/raft-api-rpc-impl-v6:
raft: introduce `raft_rpc` class
raft: add Raft RPC verbs to `messaging_service` and wire up the RPC calls
configure.py: compile serializer.cc
Currently we call firstNvmeSize before checking that we have enough
(at least 1) ephemeral disks. When none are found, we hit the following
error (see #7971):
```
File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in
if idata.is_recommended_instance():
File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance
diskSize = self.firstNvmeSize
File "/opt/scylladb/scripts/scylla_util.py", line 291, in firstNvmeSize
firstDisk = ephemeral_disks[0]
IndexError: list index out of range
```
This change reverses the order and first checks that we found
enough disks before getting the first disk size.
Fixes#7971
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#8027
`mutation::consume()` is used by range scans to convert the immediate
`reconcilable_result` to the final `query::result` format. When the
range scan is in reverse, `mutation::consume()` has to feed the
clustering fragments to the consumer in reverse order, but currently
`mutation::consume()` always uses the natural order, breaking reverse
range scans.
This patch fixes this by adding a `consume_in_reverse` parameter to
`mutation::consume()`, and consequently support for consuming clustering
fragments in reverse order.
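A toy illustration of the direction switch (the real mutation::consume()
feeds clustering fragments, not ints):
```
#include <vector>

enum class consume_in_reverse { no, yes };

template <typename Consumer>
void consume_rows(const std::vector<int>& rows, Consumer consume,
                  consume_in_reverse reverse) {
    if (reverse == consume_in_reverse::yes) {
        for (auto it = rows.rbegin(); it != rows.rend(); ++it) {
            consume(*it); // feed fragments in reverse clustering order
        }
    } else {
        for (int r : rows) {
            consume(r);
        }
    }
}
```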
Fixes: #8000
Tests: unit(release, debug),
dtest(thrift_tests.py:TestMutations.test_get_range_slice)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210203081659.622424-1-bdenes@scylladb.com>
"
Currently inactive readers are stored in two different places:
* reader concurrency semaphore
* querier cache
With the latter registering its inactive readers with the former. This
is an unnecessarily complex (and possibly surprising) setup that we want
to move away from. This series solves this by moving the responsibility
of storing inactive reads solely to the reader concurrency semaphore,
including all supported eviction policies. The querier cache is now only
responsible for indexing queriers and maintaining relevant stats.
This makes the ownership of the inactive readers much more clear,
hopefully making Benny's work on introducing close() and abort() a
little bit easier.
Tests: unit(release, debug:v1)
"
* 'unify-inactive-readers/v2' of https://github.com/denesb/scylla:
reader_concurrency_semaphore: store inactive readers directly
querier_cache: store readers in the reader concurrency semaphore directly
querier_cache: retire memory based cache eviction
querier_cache: delegate expiry to the reader_concurrency_semaphore
reader_concurrency_semaphore: introduce ttl for inactive reads
querier_cache: use new eviction notify mechanism to maintain stats
reader_concurrency_semaphore: add eviction notification facility
reader_concurrency_semaphore: extract evict code into method evict()
This is the continuation of the row-cache performance
improvements, this time -- the rework of clustering keys part.
The goal is to solve the same set of problems:
- logN eviction complexity
- deep and sparse tree
Unlike partitions, this cache has one big feature that makes it
impossible to just use existing B+ tree:
There's no copyable key at hand. The clustering key is
managed_bytes(), which is not nothrow-copy-constructible, nor is
it hashable for lookup, due to prefix lookups.
Thus the choice is the B-tree, which is also N-ary one, but
doesn't copy keys around.
B-trees are like B+ trees, but can have key:data pairs in inner nodes,
thus those nodes may be significantly bigger than B+ ones, which
have data only in leaf nodes. To avoid making the memory footprint
worse, the tree assumes that keys and data live in the same object
(the rows_entry one), and the tree itself manages only the key
pointers.
To avoid invalidating iterators on insert/remove, the tree nodes keep
pointers to keys, not the keys themselves.
The tree uses tri-compare instead of less-compare. This makes the
.find and .lower_bound methods do ~10% fewer comparisons on random
insert/lookup test.
Numbers:
- memory_footprint: B-tree master
rows_entry size: 216 232
1 row
in-cache: 968 960 (because of dummy entry)
in-memtable: 1006 1022
100 rows
in-cache: 50774 50856
in-memtable: 50620 50918
- mutation_test: B-tree master
tps.average: 891177 833896
- simple_query: B-tree master
tps.median: 71807 71656
tps.maximum: 71847 71708
* xemul/clustering-cache-over-btree-4:
mutation_partition: Save one keys comparison
partition_snapshot_row_cursor: Remove rows pointer
mutation_partition: Use B-tree insertion sugar
perf-test : Print B-tree sizes
mutation_partition: Switch cache of rows onto B-tree
partition_snapshot_reader: Rename cmp to less for explicity
mutation_partition: Make insertion bullet-proof
mutation_partition: Use tri-compare in non-set places
flat_mutation_reader: Use clear() in destroy_current_mutation()
rows_entry: Generalize compare
utils: Intrusive B-tree (with tests)
tests: Generalize bptree compaction test
tests: Generalize bptree stress test
Fixes#7615
Makes the CL writer interface N-valued (though still 1 for the "old" paths). Adds a new write path to input N mutations -> N rp_handles.
Guarantees that all entries are written or none are, and that they will be flushed to disk together.
Small test included.
Closes#7616
* github.com:scylladb/scylla:
commitlog_test: Add multi-entry write test
commitlog: Add "add_entries" call to allow inputting N mutations
commitlog: Make commitlog entries optionally multi-entry
commitlog: Move entry_writer definition to cc file
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().
Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
if the partition has no rows.
Unlike the previous attempt, push_ready_fragments() is not touched.
The extra preemption opportunities triggered a preexisting bug in
clustering_ranges_walker; it is fixed in the first patch of the series.
I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.
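The pause condition boils down to something like the following sketch (the
function shape is an assumption; `proceed` mirrors the name used above):
```
#include <seastar/core/preempt.hh>

enum class proceed { no, yes };

proceed after_pushing_fragment(bool buffer_full) {
    // Pause the state machine when the buffer is full *or* the reactor
    // wants to preempt; either way we drop back to fill_buffer's outer
    // loop, which yields via seastar::do_until().
    if (buffer_full || seastar::need_preempt()) {
        return proceed::no;
    }
    return proceed::yes;
}
```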
Test: unit (dev, debug, release)
Fixes#7883.
Closes#7928
* github.com:scylladb/scylla:
sstable: reader: preempt after every fragment
clustering_range_walker: fix false discontiguity detected after a static row
Rather than asserting, as seen in #7977.
This shouldn't crash the server in production.
Add unit test that reproduces this scenario
and verifies the internal error exception.
Fixes#7977
Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210201163051.1775536-1-bhalevy@scylladb.com>
Fixes#7615
Allows N mutations to be written "atomically" (i.e. in the same
call). Either all are added to the segment, or none.
Returns rp_handle vector corresponding to the call vector.
Allows writing more than one blob of data using a single
"add" call into segment. The old call sites will still
just provide a single entry.
To ensure we can determine the health of all the entries
as a unit, we need to wrap them in a "parent" entry.
For this, we bump the commitlog segment format and
introduce a magic marker which, if present, means
we have entries within the entry, totalling "size" bytes.
We checksum the extra header, and also checksum
the individual checksums of each sub-entry (faster).
This is added as a post-word.
When parsing/replaying, if v2+ and marker, we have to
read all entries + checksums into memory, verify, and
_then_ we can actually send the info to caller.
Fedora version of systemd macros does not work correctly on CentOS7,
since CentOS7 does not support "file trigger" feature.
To fix the issue we need to stop using systemd macros, call systemctl
directly.
See scylladb/scylla-jmx#94
Closes#8005
"
This series extends the scylla_metadata sstable component
to hold an optional textual description of the sstable origin.
It describes where the sstables originated from
(e.g. memtable, repair, streaming, compaction, etc.)
The origin string is provided by the sstable writer via
sstable_writer_config, written to the scylla_metadata component,
and loaded on sstable::load().
A get_origin() method was added to class sstable to retrieve
its origin. It returns an empty string by default if the origin
is missing.
Compaction now logs the sstable origin for each sstable it
compacts, and it generates the sstable origin for all sstables
in generates. Regular compaction origin is simply set to "compaction"
while other compaction types are mentioned by name, as
"cleanup", "resharding", "reshaping", etc.
A unit test was added to test the sstable_origin by writing either
an empty origin or a random string, and then comparing
the origin retrieved by sstable::load to the one written.
Test: unit(release)
Fixes#7880
"
* tag 'sstable-origin-v2' of github.com:bhalevy/scylla:
compaction: log sstable origin
sstables: scylla_metadata: add support for sstable_origin
sstables: sstable_writer_config: add origin member
The apply_monotonically checks if the cursor is behind the source
position to decide whether or not to push it forward (with the
lower_bound call). The 2nd comparison is done to check if either
the cursor was ahead or if lower_bound result actually hit the key.
This 2nd comparison can be avoided:
- the 1st case needs a B-tree lower_bound API extension that reports
whether the bound is a match or not.
- the 2nd one is covered with reusing tri-compare result from the
1st comparison
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The pointer is needed to erase an element by its iterator from the
rows container. The B-tree has this method on iterator and it does
NOT need to walk up the tree to find its root, so the complexity
is still amortized constant.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After the switch from BST to B-tree the memory footprint includes inner/leaf nodes
from the B-tree, so it's useful to know their sizes too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The switch is pretty straightforward, and consists of
- change less-compare into tri-compare
- rename insert/insert_check into insert_before_hint
- use tree::key_grabber in mutation_partition::apply_monotonically to
exception-safely transfer a row from one tree to another
- explicitly erase the row from tree in rows_entry::on_evicted, there's
an O(1) tree::iterator method for this
- rewrite rows_entry -> cache_entry transformation in the on_evicted to
fit the B-tree API
- include the B-tree's external memory usage into stats
That's it. The number of keys per node is set to 12 with linear search
and a linear extension of 20 because
- experimenting with tree shows that numbers 8 through 10 keys with linear
search show the best performance on stress tests for insert/find-s of
keys that are memcmp-able arrays of bytes (which is an approximation of
current clustering key compare). More keys work slower, but still better
than any bigger value with any type of search up to 64 keys per node
- having 12 keys per node is the threshold at which the memory footprint
for B-tree becomes smaller than for boost::intrusive::set for partitions
with 32+ keys
- 20 keys for linear root eats the first-split peak and still performs
well in linear search
As a result the footprint for the B-tree is bigger than the one for BST only
for trees filled with 21...32 keys, by 0.1...0.7 bytes per key.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The boost::intrusive::set insert-s are non-throwing, so it's safe to add a
new entry like this
auto* ne = new entry;
set.insert(ne);
and not worry about memory leak. B-tree's insert will be throwing, so we
need some way to free the new entries in case of exception. There's already
a way for this:
std::unique_ptr<entry> ne = std::make_unique<entry>();
set.insert(*ne);
ne.release();
so make every insertion into the set work this way in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The mutation_partition::_rows will be switched to a B-tree with a
tri-comparator, so to clearly identify the places not affected by it,
switch them onto tri-compare in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the code uses a loop of unlink_leftmost_without_rebalance
calls. B-tree does have it, but plain clearing of the tree is a bit
faster with clear().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Turn the rows_entry less-comparator's calls into a template as
they are nothing but wrappers on top of the rows_entry tri-comparator.
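The wrapper amounts to roughly this (names assumed):
```
// Any tri-comparator (returning <0 / 0 / >0) can serve as a less-comparator.
template <typename TriCompare>
struct less_from_tri {
    TriCompare cmp;
    template <typename A, typename B>
    bool operator()(const A& a, const B& b) const {
        return cmp(a, b) < 0;
    }
};
```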
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The design of the tree follows from the row-cache needs, which are
1. Insert/Remove do not invalidate iterators
2. Elements are LSA-manageable
3. Low key overhead
4. External tri-comparator
5. As little actions on insert/remove as possible
With the above the design is
Two types of nodes -- inner and leaf. Both types keep a pointer to the parent
node and N pointers to keys (not the keys themselves). Two differences: inner
nodes have an array of pointers to kids, leaf nodes keep a pointer to the tree
(to update the left- and rightmost tree pointers on node move).
Nodes do not keep pointers/references on trees, thus we have O(1) move of any
object, but O(logN) to get the tree size. Fortunately, with big keys-per-node
value this won't result in too many steps.
In turn, the tree has 3 pointers -- root, left- and rightmost leaves. The latter
is for constant-time begin() and end().
Keys are managed by user with the help of embeddable member_hook instance,
which is 1 pointer in size.
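A rough sketch of the node layout this describes; every name below is an
assumption:
```
#include <cstddef>

template <typename Key, size_t NodeSize> struct btree;

template <typename Key, size_t NodeSize>
struct node {
    node* parent = nullptr;
    size_t nr_keys = 0;
    Key* keys[NodeSize]; // pointers to keys, never copies: moving a node
                         // (e.g. by the LSA compactor) never moves a key
};

template <typename Key, size_t NodeSize>
struct inner_node : node<Key, NodeSize> {
    node<Key, NodeSize>* kids[NodeSize + 1];
};

template <typename Key, size_t NodeSize>
struct leaf_node : node<Key, NodeSize> {
    btree<Key, NodeSize>* tree; // to fix left-/rightmost pointers on node move
};

template <typename Key, size_t NodeSize>
struct btree {
    node<Key, NodeSize>* root = nullptr;
    leaf_node<Key, NodeSize>* leftmost = nullptr;  // O(1) begin()
    leaf_node<Key, NodeSize>* rightmost = nullptr; // O(1) end()
};
```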
The code was copied from the B+ tree one, then heavily reworked, as the
internal algorithms turned out to differ quite significantly.
For the sake of mutation_partition::apply_monotonically(), which needs to move
an element from one tree into another, there's a key_grabber helper wrapper
that allows doing this move while respecting the exception-safety requirement.
As measured by the perf_collections test the B-tree with 8 keys is faster than
the std::set, but slower than the B+tree:
vs set vs b+tree
fill: +13% -6%
find: +23% -35%
Another neat thing is that 1-key insertion-removal is ~40% faster than
for BST (the same number of allocations, but the key object is smaller,
fewer pointers to set up and fewer instructions to execute when linking
a node with the root).
v4:
- equip insertion methods with on_alloc_point() calls to catch
potential exception-guarantee violations earlier
- add unlink_leftmost_without_rebalance. The method is borrowed from
boost intrusive set, and is added to kill two birds -- provide it,
as it turns out to be popular, and use a bit faster step-by-step
tree destruction than plain begin+erase loop
v3:
- introduce "inline" root node that is embedded into tree object and in
which the 1st key is inserted. This greatly improves the 1-key-tree
performance, which is pretty common case for rows cache
v2:
- introduce "linear" root leaf that grows on demand
This improves the memory consumption for small trees. This linear node may
and should over-grow the NodeSize parameter. This comes from the fact that
there are two big per-key memory spikes on small trees -- 1-key root leaf
and the first split, when the tree becomes 1-key root with two half-filled
leaves. If the linear extension goes above NodeSize it can flatten even the
2nd peak
- mitigate the keys indirection a bit
Prefetching the keys while doing the intra-node linear scan and the nodes
while descending the tree gives ~+5% of fill and find
- generalize stress tests for B and B+ trees
- cosmetic changes
TODO:
- fix a few inefficiencies in the core code (walks the sub-tree twice sometimes)
- try to optimize the leaf nodes that are not left-/rightmost not to carry an
unused tree pointer on board
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Whenever we push a fragment, we check whether the buffer is
full and return proceed::no if so, so that the state machine pauses
and lets the consumer continue. This patch adds an additional
condition - if preemption is needed, we also return proceed::no.
This drops us back to the outer loop
(in sstable_mutation_reader::fill_buffer), which will yield to
the reactor as part of seastar::do_until().
Two cases (partition_start and partition_end) did not have the
check for is_buffer_full(); it is added now. This can trigger
if the partition has no rows.
Unlike the previous attempt, push_ready_fragments() is not touched.
I tested this by reading from a large partition with a simple
schema (pk int, ck int, primary key(pk, ck)) with BYPASS CACHE.
However, even without the patch I only got sporadic stalls
with the detector set to 1ms, so it's possible I'm not testing
correctly.
Test: unit (dev)
Fixes#7883.
clustering_range_walker detects when we jump from one row range to another. When
a static row is included in the query, the constructor sets up the first before/after
bounds to be exactly that static row. That creates an artificial range crossing if
the first clustering range is contiguous with the static row.
This can cause the index to be consulted needlessly if we happen to fall back
to sstable_mutation_reader after reading the static row.
A unit test is added.
Ref #7883.
The database class has two duplicated functions, keyspaces() and
get_keyspaces(). Drop the former since it is used in one place only.
Message-Id: <20210201135333.GA1403508@scylladb.com>
Support configuration changes based on joint consensus.
When a user adds a configuration entry, commit an interim "joint
consensus" configuration to the log first, and transition to the
final configuration once both C_old and C_new configurations
accept the joint entry.
Misc cleanups.
* scylla-dev/raft-config-changes-v2:
raft: update README.md
raft: add a simple test for configuration changes
raft: joint consensus, wire up configuration changes in the API
raft: joint consensus, count votes using joint config
raft: joint consensus, wire up configuration changes in FSM
raft: joint consensus, update progress tracker with joint configuration
raft: joint consensus, don't store configuration in FSM
raft: joint consensus, keep track of the last confchange index in the log
raft: joint consensus, implement helpers in class configuration
raft: joint consensus, use unordered_set for server_address list
raft: joint consensus, switch configuration to joint
raft: rename check_committed() to maybe_commit()
raft: fix spelling and add comments
Add new scylla_metadata_type::SSTableOrigin.
Store and retrive a sstring to the scylla metadata component.
Pass sstable_writer_config::origin from the mx sstable writer
and ignore it in the k_l writer.
Add unit test to verify the sstable_origin extension
using both empty and a random string.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a string describing where the sstables originated
from (e.g. memtable, repair, streaming, compaction, etc.)
If configure_writer is called with a nullptr, the origin
will be equal to an empty string.
Introduce test_env_sstables_manager that provides an overload
of configure_writer with no parameters that calls the base-class'
configure_writer with "test" origin. This was to reduce the
code churn in this patch and to keep the tests simple.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch adds a test for the different units which are supposed to
be usable for assigning a "duration" type in CQL. It turns out that
all documented units are supported correctly except µs (with a unicode
mu), so the test reproduces issue #8001.
The test xfails on Scylla (because µs is not supported) and passes
on Cassandra.
Refs: #8001.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210131192220.407481-1-nyh@scylladb.com>
Since makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need to use the latest version of makeself, and specify --keep-umask
option.
Fixes #6243
Closes #6244
* github.com:scylladb/scylla:
dist/offline_redhat: fix umask error
dist/offline_installer/redhat: support cross build
Since makeself script changes current umask, scylla_setup causes
"scylla does not work with current umask setting (0077)" error.
To fix that we need to use the latest version of makeself, and specify --keep-umask
option.
Fixes#6243
Supported cross build by running CentOS7 on docker; now it's able to build
on Fedora.
It also supports switching the container image, tested on Oracle Linux 7 and
CentOS 7/8.
* seastar 52d41277a...cb3aaf07e (2):
> tls: reloadable_credentials_base: add_dir_watch: fix root dir detection
> scripts/perftune.py: convert nic option in old perftune.yaml to list for compatibility
The last fragment is unconditionally pushed to the set of fragments, so if the
data size is fragment-aligned, an empty fragment will be needlessly pushed to
the back of the fragment set.
Note: I haven't tested whether an empty fragment at the back of the set will
cause issues; I think it won't, but this should be avoided anyway.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-3-raphaelsc@scylladb.com>
1) reuse default_fragment_size for knowledge of max fragment size
2) fragments_count is not a good name as it doesn't include last non-full
fragment (if present), so rename it.
3) simplify calculation of last fragment size
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-1-raphaelsc@scylladb.com>
The patch contains a skeleton implementation for the Scylla-specific
Raft RPC module.
It uses `netw::messaging_service` as underlying mechanism to send
RPC messages.
The instance is supposed to be bound to a single raft group.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
All RPC module APIs except for `send_snapshot` should resolve as
soon as the message is sent, so these messages are passed via
`send_message_oneway_timeout`.
`send_snapshot` message is sent via `send_message_timeout` and
returns a `future<>`, which resolves when snapshot transfer
finishes or fails with an exception.
All necessary functions to wire the new Raft RPC verbs are also
provided (such as `register` and `unregister` handlers).
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This file was not added to the configure.py,
which `raft_sys_table_storage` series was supposed to do.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Send RequestVote to a joint config.
We need to exclude self from the list of peers
if we're not part of the current configuration.
Avoid disrupting the cluster in this case.
Maintain separate status for previous and current config when counting
votes.
When add_entry() with a new configuration is submitted,
create a joint configuration and switch to it immediately.
Refuse to enter joint configuration if a configuration
change is already in progress.
When the leader has committed an entry with the joint configuration,
append a new entry with final configuration and switch to it.
Resign leadership if the current leader is not part of a new
configuration.
When we change from A, B, C to B, C, D and the leader is A,
then, when C_new starts to be used, the leader is not part of
the current configuration, so it doesn't have to be in the tracker.
Do not try to find & advance leader progress unconditionally then.
The leader doesn't have to be part of the current
configuration, so add a way to access follower_progress for the leader
only if it is present.
Upon configuration changes, preserve progress information
for intact nodes, remove for removed, and create a new progress
object for added nodes.
When tracking commit progress in joint configuration mode,
calculate two commit indexes for two configurations, and
choose the smallest one.
In follower state, FSM doesn't know the current cluster
configuration. Instead of trying to watch the follower log for
configuration changes to keep FSM copy up to date, remove it from
FSM altogether since the follower doesn't need it anyway.
When entering candidate or leader state, fetch the most recent
configuration from the log and initialize the state specific
state with it.
When initializing the log, find the most recent configuration
change index, if present.
Maintain the most recent configuration change index when
the log is truncated or entries are appended to it.
The last configuration change index will be used by FSM when it enters
candidate or leader state to fetch the current configuration.
We never truncate beyond a single in-progress configuration
change, so storing the previous value of last_conf_idx
helps avoid log backward scan on truncation in 100% of cases.
Remove all unused log constructors.
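A toy sketch of this bookkeeping (names assumed):
```
#include <cstddef>
#include <vector>

struct entry { bool is_config_change; };

struct log {
    std::vector<entry> entries;
    size_t last_conf_idx = 0; // latest config-change index (1-based, 0 = none)
    size_t prev_conf_idx = 0; // the one before it

    void append(entry e) {
        entries.push_back(e);
        if (e.is_config_change) {
            prev_conf_idx = last_conf_idx;
            last_conf_idx = entries.size();
        }
    }

    void truncate(size_t new_size) {
        entries.resize(new_size);
        // At most one in-progress configuration change can ever be truncated
        // away, so falling back to prev_conf_idx avoids a backward scan.
        if (last_conf_idx > new_size) {
            last_conf_idx = prev_conf_idx <= new_size ? prev_conf_idx : 0;
            prev_conf_idx = 0;
        }
    }
};
```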
In order to work correctly in transitional configuration,
participants must enter it after crashes, restarts and
state changes.
This means it must be stored in Raft log and snapshot
on the leader and followers.
This is most easily done if transitional configuration
is just a flavour of standard configuration.
In FSM, rename _current_config to _configuration,
it now contains both current and future configuration
at all times.
The idea of the monotonicity checking test is: try to apply
one random partition to another random one, sequentially
failing allocations. Each time allocation fails (with the
bad_alloc exception) -- check the exception guarantee is
respected, then apply (!) the very same two partitions to
each other. At the end of the test we make sure, that an
exception may pop up at any point of application and it
will be safe.
This idea is flawed currently. When verifying the guarantee
the test moves the 2nd partition and leaves it empty for the
next loop iteration. So right on the 2nd attempt to apply
partitions it becomes a no-op, doesn't fail and no more
exceptions arise.
Fix by restoring both partitions at the end of each check.
Broken since 74db08165d.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210129153641.5449-1-xemul@scylladb.com>
This series contains an initial implementation of raft persistency module
that uses `raft` system table as the underlying storage model.
"system.raft" table will be used as a backend storage for implementing
raft persistence module in Scylla. It combines both raft log,
persisted vote and term, and snapshot info.
The table is partitioned by group id, thus allowing multi-raft
operation. The rest of the table structure mirrors the fields of
corresponding core raft structures defined in `raft.hh`, such as
`raft::log_entry`.
The raft table stores only the latest snapshot id while
the actual snapshot will be available in a separate table
called `system.raft_snapshots`. The schema of `raft_snapshots`
mirrors the fields of `raft::snapshot` structure.
IDL definitions are also added for every raft struct so that we
automatically provide serialization and deserialization facilities
needed both for the persistency module and for future RPC implementation.
The first patch is a side-change needed to provide complete
serialization/deserialization for `bytes_ostream`, which we
need when persisting the raft log in the table (since `data`
is a variant containing `raft::command` (aka `bytes_ostream`)
among others).
`bytes_ostream` was lacking `deserialize` function, which is
added in the patch.
The second patch provides serializer for `lw_shared_ptr<T>`
which will be used for `raft::append_entries`, which has
a field with `std::vector<const lw_shared_ptr<raft::log_entry>>`
type.
There is also a patch to extend `fragmented_temporary_buffer`
with a static function `allocate_to_fit` that allocates an
instance of the fragmented buffer that has a specified size.
Individual fragment size is limited to 128kb.
The patch-set also contains the test suite covering basic
functionality of the persistency module.
* manmanson/raft-api-impl-v11:
raft/sys_table_storage: add basic tests for raft_sys_table_storage
raft: introduce `raft_sys_table_storage` class
utils: add `fragmented_temporary_buffer::allocate_to_fit`
raft: add IDL definitions for raft types
raft: create `system.raft` and `system.raft_snapshots` tables
serializer: add `serializer<lw_shared_ptr<T>>` specialization
serializer: add `deserialize` function overload for `bytes_ostream`
The test suite covers the most basic use cases for the system table
backed raft persistency module:
* store/load vote and term
* store/load snapshot
* store snapshot with log tail truncation
* store/load log entries
* log truncation
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This is the implementation of raft persistency module that
uses `raft` system table as the underlying storage model.
The instance is supposed to be bound to a single raft group.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Introduce `fragmented_temporary_buffer::allocate_to_fit` static
function returning an instance of the buffer of a specified size.
The allocated buffer fragments have a size of at most 128kb.
`bytes_ostream` has the same hard-coded limit, so just use the
same here.
This patch will be later needed for `raft::log_entry` raw data
serialization when writing to the underlying persistent storage.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
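For illustration, a minimal sketch of such a fragmenting allocation, with the
buffer simplified to a plain vector of chunks (`allocate_to_fit` here only
models the real static function):

    #include <algorithm>
    #include <cstddef>
    #include <memory>
    #include <vector>

    constexpr size_t max_fragment_size = 128 * 1024; // 128kb, as in bytes_ostream

    // Allocate `size` bytes as a chain of fragments of at most 128kb each.
    std::vector<std::unique_ptr<char[]>> allocate_to_fit(size_t size) {
        std::vector<std::unique_ptr<char[]>> fragments;
        while (size > 0) {
            const size_t n = std::min(size, max_fragment_size);
            fragments.push_back(std::make_unique<char[]>(n));
            size -= n;
        }
        return fragments;
    }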
Changes to the `configuration` and `tagged_uint64` classes are needed
to overcome limitations of the IDL compiler tool, i.e. we need to
supply a constructor to the struct initializing all the
members (raft::configuration) and also need to make an accessor
function for private members (in case of raft::tagged_uint64).
All other structs mirror raft definitions in exactly the same way
they are declared in `raft.hh`.
`tagged_id` and `tagged_uint64` are used directly instead of their
typedef-ed companions defined in `raft.hh` since we don't want
to introduce indirect dependencies. That way it is guaranteed
that no accidental changes made outside of the idl file will affect the idl
definitions.
This patch also fixes a minor typo in `snapshot_id_tag` struct used
in `snapshot_id` typedef.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This one works similarly to `serializer<optional<T>>` and will
later be needed for serializing `raft::append_request`, which has
a field containing `lw_shared_ptr`.
Users should be warned, though: this code assumes that the pointer
is never null. This is done to mirror the serialize implementation
for `lw_shared_ptr`s in messaging_service.cc, which is
subject to being deleted in favor of the impl in
`serializer_impl.hh`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
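A header-style sketch of the shape of such a specialization -- `std::shared_ptr`
stands in for seastar's `lw_shared_ptr`, and the Input/Output parameters are
placeholders, not Scylla's exact `ser::` interface:

    #include <memory>

    template <typename T>
    struct serializer; // primary template, specialized per type elsewhere

    template <typename T>
    struct serializer<std::shared_ptr<T>> {
        template <typename Output>
        static void write(Output& out, const std::shared_ptr<T>& p) {
            // Precondition: p is never null. Unlike serializer<optional<T>>,
            // no engaged/disengaged flag is written.
            serializer<T>::write(out, *p);
        }
        template <typename Input>
        static std::shared_ptr<T> read(Input& in) {
            return std::make_shared<T>(serializer<T>::read(in));
        }
    };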
We use a custom sharder for all schema tables: every table under
the `system_schema` keyspace, plus `system.scylla_table_schema_history`.
This sharder puts all data on shard 0.
To achieve this, we hardcode the sharder in initial schema object
definitions. Furthermore - since the sharder is not stored inside schema
mutations yet - whenever we deserialize schema objects from mutations,
we modify the sharder based on the schema's keyspace and table names.
A regression test is added to ensure no one forgets to set the special
sharder for newly added schema tables. This test assumes that all newly
added schema tables will end up in the `system_schema` keyspace (other
tables may go unnoticed, unfortunately).
Closes#7947
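A minimal sketch of the special scheme; the types are illustrative, not the
exact dht:: interface:

    struct token { long raw; };

    struct sharder {
        virtual ~sharder() = default;
        virtual unsigned shard_of(const token& t) const = 0;
    };

    // The sharder hardcoded into the schema-table definitions: every
    // token, and therefore all schema data, lands on shard 0.
    struct single_shard_sharder final : sharder {
        unsigned shard_of(const token&) const override { return 0; }
    };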
"
Currently there are three different methods for creating an sstable
reader:
* one for single key reads
* one for ranged reads
* and one nobody uses
This patch-set consolidates all these into a single `make_reader()`
method, which behind the scenes uses the same logic to dispatch to the
right sstable reader constructor that `sstables::as_mutation_source()`
uses.
This patch-set is part of an effort to clean up the jungle that is the
various reader creation methods. The next step is to clean up the
sstable_set, which has even more methods.
One very sad discovery I made while working on this patch-set is that
we still default `mutation_reader::forwarding` to `yes` in the sstable
range reader creator method and in `mutation_source::make_reader()`.
I couldn't assume that all callers are passing what they mean as the
value for that parameter. I found many sites in tests that create
forwardable single partition readers. This is also something we should
address soon.
Tests: unit(release, debug:v3)
"
* 'sstables-consolidate-reader-factory-methods-v4' of https://github.com/denesb/scylla:
cql_query_test: add unit test covering the non-optimal TWCS sstable read path
sstable_mutation_reader: consolidate constructors
tests: don't pass temporary ranges to readers
sstables: sstable_mutation_reader: remove now unused whole sstable constructor
sstables: stats: remove now unused sstable_partition_reads counter
sstable: remove read_.*row.*_flat() methods
tree-wide: use sstables::make_reader() instead of the read_.*row.*_flat() methods
sstables: pass partition_range to create_single_key_sstable_reader()
sstables: sstable: add make_reader()
The sstable read path for TWCS tables takes a different path when the
optimized read path cannot be used. This path was found to be not
covered at all by unit tests which allowed a trivial use-after-free to
slip in. Add a unit test to cover this path as well, so ASAN can catch
such bugs in the future.
* seastar a287bb1a3...52d41277a (8):
> fair_queue: Preempted requests got re-queued too far
> scripts/perftune.py: remove repeated items after merging options from file
> file.hh: Remove fair_queue.hh
> Merge "Reloadable TLS certificate tolerance" from Calle
> Merge "Cancellable IO" from Pavel E
> abort-source: Improve the subscriptions management
> fair_queue: Improve requests preemption while in pending state
> http: add support for Default handler (/*)
The two remaining sstable constructors are very similar apart from the
content of the initialize lambda. Speaking of which, the two remaining
initializer lambdas can be easily merged into one too. So this patch
does just that: it consolidates the two constructors into one, merges
the initializer lambdas, and extracts the initializer into a member
method. This means we have to store the previously captured variables as
members, but this is actually a good thing: when debugging we can see
the range and slice the reader is reading, and we are not actually
paying for it either -- they were already stored, just out of sight.
The sstable_mutation_reader, like all other mutation readers, expects
that the partition-range passed to it is kept alive by its creator
for the duration of its lifetime. However, the single-key constructor
of the sstable reader was more tolerant, as it only extracted the key
from the range, essentially requiring only the key to be kept alive (but
not the containing range). Naturally, in time some code came to rely on
this and ended up passing temporary ranges to the reader. This behaviour
will no longer be acceptable as we are about to consolidate the various
sstable reader constructors, uniformly requiring that the range is kept
alive. So this patch fixes up the tests so they work with this stricter
requirement. Only two occurrences were found.
We want to unify the various sstable reader creation methods and this
method taking a ring position instead of a partition range like
everybody else stands in the way of that.
This in effect reverts 68663d0de.
Follow-up to #7917
The size of a cf::column_family_info is 224 bytes, so a std::vector that
contains one for each column family may be very large, causing allocations
of over 1MB.
Considering the vector is used only for iteration, it can be changed to
a non-contiguous list instead.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes#7973
"
Currently we have two parallel query paths:
* database::query() -> table::query() -> data_query()
* mutation::query()
The former is used by single partition queries, the latter by range
scans, as mutation::query() is used to convert reconcilable_result to
query::result (which means it is also used in single partition queries
if it triggers read repair). This is a rather unfortunate situation as
we have two parallel implementations of the query code, which means they
are prone to diverge, and in fact they already have -- more on that
later.
This patchset aims to remedy this situation by retiring
`mutation::query()` and migrating users to an implementation based on
the "standard" query path, in other words one using the same building
blocks as the `database::query()` path. This means using
`compact_mutation` for compacting and `query_result_builder` for result
building. These components, however, were created to work with
`flat_mutation_reader`, and introducing a reader into this pipeline
would mean that we'd have to make all the related APIs asynchronous,
which would cause an insane amount of churn. To avoid this, this
patchset adds an API compatible `consume()` method to `mutation`, which
can accept a `compact_mutation` instance as-is. This allows an elegant
and succinct reimplementation. So far so good.
As mentioned above, the two implementations have diverged in time, or
have been different from the start. The difference manifests when
calculating digests, more precisely in which tombstones are included in
the digest. The retired `mutation::query()` path incorporates only
non-purgeable tombstones in the digest. The standard query path however
incorporates all tombstones, even those that can be purged. After some
scrutiny however this difference proved to be completely theoretical, as
the code path where this would matter -- converting reconcilable result
to query result -- passes min timestamp as the query time to the
compaction, so nothing is compacted and hence the difference has no
chance to manifest.
This patch-set was motivated by the desire to provide a single solution
to #7434, instead of two, one for each path.
Tests: unit(release:v2, debug:v2, dev:v3)
"
* 'unified-query-path/v3' of https://github.com/denesb/scylla:
mutation: remove now unused query() and query_compacted()
treewide: use query_mutations() instead of mutation::query()
mutation_test: test_query_digest: ensure digest is produced consistently
mutation_query: introduce query_mutation()
mutation_query: to_data_query_result(): migrate to standard query code
mutation_query: move to_data_query_result() to mutation_partition.cc
mutation: add consume()
flat_mutation_reader: move mutation consumer concepts to separate header
mutation compactor: query compaction: ignore purgeable tombstones
This will be the only method to create sstable readers with. For now we
leave the other variants, they as well as their users will be removed in
a following patch.
This patch adds a cql-pytest, test_json.py::test_tojson_double(),
which reproduces issue #7972 - where toJson() prints some doubles
incorrectly, truncated to integers, while others it prints fine (I
still don't know why; this will need to be debugged).
The test is marked xfail: It fails on Scylla, and passes on Cassandra.
Refs #7972.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210127124338.297544-1-nyh@scylladb.com>
This patch introduces `schema_raft_state_machine` class
which is currently just a dummy implementation throwing a
"not implemented" exception for every call.
Will be needed later to construct an instance of `raft::server`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210126193413.1520948-1-pa.solodovnikov@scylladb.com>
For some reason we had a distinct specialization of `serialize`
function to handle `bytes_ostream` but not `deserialize`.
This will be used in the following patches.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Currently the replacing node sets the status as STATUS_UNKNOWN when it
starts gossip service for the first time before it sets the status to
HIBERNATE to start the replacing operation. This introduces the
following race:
1) Replacing node using the same IP address of the node to be replaced
starts gossip service without setting the gossip STATUS (will be seen as
STATUS_UNKNOWN by other nodes)
2) Replacing node waits for gossip to settle and learns status and
tokens of existing nodes
3) Replacing node announces the HIBERNATE STATUS.
After Step 1 and before Step 3, existing nodes will mark the replacing
node as UP, but will not have marked it as doing a replace
yet. As a result, the replacing node will not be excluded from the read
replicas and will be considered a target node to serve CQL reads.
To fix, we make the replacing node avoid responding to echo messages when it is not
ready.
Fixes#7312Closes#7714
I see a miscompile on aarch64 where a call to format("{}", uuid)
translates a function pointer to -1. When called, this crashes.
Reduce the inline threshold from 2500 to 600. This doesn't guarantee
no miscompiles but all the tests pass with this parameter.
Closes#7953
The following patches fix issues seen occasionally in debug mode.
Notes:
- In debug mode there's still the UB nullptr arithmetic warning.
* https://github.com/alecco/scylla/tree/raft-ale-tests-07h-wait-propagation:
raft: replication test: wait for log propagation
raft: replication test: move wait for log to a function
raft: replication test: remove unused member
raft: replication test: use later()
raft: testing: remove election wait time and just yield
test_cell_external_memory_usage uses with_allocator() to observe how some
types allocate memory. However, compiler reordering (observed with clang 11
on aarch64) can move the various thread-local CQL type object initialization
into the with_allocator() scope; so any managed object allocated as part of
this initialization also gets measured, and the test fails. The code movement
is legal, as far as I can tell.
Fix this by initializing the type object early; use an atomic_thread_fence
as an optimization barrier so the compiler doesn't eliminate or move
the early initialization.
Closes#7951
This patch adds a test for trying to set a tuple element to null with
fromJson(), which works on Cassandra but fails on Scylla. So the test
xfails on Scylla. Reproduces issue #7954.
Refs #7954.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210124082311.126300-1-nyh@scylladb.com>
"
`next_partition()` used to return void, so readers that had to call
future returning code had to work around this. Now that
`next_partition()` returns a future, we can get rid of these
workarounds.
Tests: unit(release, debug)
"
* 'next-partition-cross-shard-readers/v1' of https://github.com/denesb/scylla:
mutation_reader: reader_lifecycle_policy::stopped_reader: drop pending_next_partition flag
mutation_reader: evictable_reader: remove next_partition() workaround
mutation_reader: shard_reader: remove next_partition() workaround
mutation_reader: foreign_reader: remove next_partition() workaround
If the message is larger than the current buffer size, we need to consume
more data until we reach the tail of the message.
To do so, we need to return nullptr when we are not at the tail.
Fixes#7273Closes#7903
* github.com:scylladb/scylla:
redis: rename _args_size/_size_left
redis: fix large message handling
There are two types of numerical parameter in redis protocol:
- *[0-9]+ defined array size
- $[0-9]+ defined string size
Currently, the array size is stored in args_count, and the string size is
stored in _arg_size / _size_left.
It's a bit hard to understand since both use the same word "arg(s)", so let's
rename the string size variables to _bytes_count / _bytes_left.
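A toy sketch of the two prefixes and of the need-more-data signal described
above; std::optional models the real parser returning nullptr, and the names
are illustrative:

    #include <cstddef>
    #include <optional>
    #include <string_view>

    // Parse "$<n>\r\n<payload>" from `buf` ('$<n>' is the string size;
    // '*<n>' -- the array size -- would be handled analogously). Returns
    // the payload only when the buffer already holds all <n> bytes;
    // std::nullopt models the real parser returning nullptr to say
    // "not at the tail yet, consume more data".
    std::optional<std::string_view> parse_bulk_string(std::string_view buf) {
        if (buf.empty() || buf[0] != '$') {
            return std::nullopt;
        }
        const size_t eol = buf.find("\r\n");
        if (eol == std::string_view::npos) {
            return std::nullopt;
        }
        size_t bytes_count = 0; // the renamed _bytes_count
        for (char c : buf.substr(1, eol - 1)) {
            bytes_count = bytes_count * 10 + size_t(c - '0');
        }
        const std::string_view payload = buf.substr(eol + 2);
        if (payload.size() < bytes_count) {
            return std::nullopt; // _bytes_left still pending
        }
        return payload.substr(0, bytes_count);
    }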
Wait until entries propagate after adding and before changing leader
using the same code as done for partitioning.
This fixes occasional hangs in debug mode when a test switches to a
different leader without leaving enough time for full propagation.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Instead of sleep 1us use later()
Also use later to yield after sending append entries in rpc test impl.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
After support for the mixed cluster compatibility feature
DIGEST_MULTIPARTITION_READ was dropped in 854a44ff9b,
range_slice_read_executor and never_speculating_read_executor became
identical, so remove the former for good.
Message-Id: <20210124122731.GA1122499@scylladb.com>
If possible, test the highest sstable format version,
as it's the most used.
If there are pre-written sstables we need to load from the
test directory from an older version, either specify their
version explicitly, or use the new test_env::reusable_sst
method that looks up the latest sstable version in the
given directory and generation.
Test: unit(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201210161822.2833510-1-bhalevy@scylladb.com>
`next_partition()` now returns a future<>, so we can forward it to the
remote shard in the scope of the next partition call, remove the
now obsolete workaround for the synchronous next partition.
The fromJson() function can take a JSON map and use it to set a map column.
However, the specific example of a map<ascii, int> doesn't work in Scylla
(it does work in Cassandra). The xfailing tests in this patch demonstrate
this. Although the tests use perfectly legal ASCII, Scylla fails the
fromJson() function with a misleading error.
Refs #7949.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210121233855.100640-1-nyh@scylladb.com>
Before we retire the mutation::query() code, expand the digest test to
check that the new code replacing it produces identical digest on all
possible equivalent mutations.
This is a replacement of `mutation::query()`, but with an implementation
based on the standard query result building code.
This will allow us to migrate the remaining `mutation::query()` users
off of said method, which in turn will allow us to retire it finally.
Reimplement in terms of the standard query result building code. We want
to retire the alternative query result code in `mutation::query()` and
`to_data_query_result()` is one of the main users.
We want to rewrite the above mentioned method's implementation in terms
of the standard query result building code (that of the `data_query()`
path), in order to retire the alternative query code in the mutation
class.
The `data_query()` code uses classes private to `mutation_partition.cc`
and instead of making these public, just move `to_data_query_result()`
to `mutation_partition.cc`.
This consume method accepts a `FlattenedConsumer`, the same one that the
name-sake `flat_mutation_reader::consume()` does. Indeed the main
purpose of this method is to allow using the standard query result
building stack with a mutation, the same way said stack is used with
mutation readers currently. This will allow us to replace the parallel
query result building code that currently exists in the
`mutation::query()` and friends, with the standard one.
In the next patch we will want to use these concepts in `mutation.hh`. To
avoid pulling in the entire `flat_mutation_reader.hh` just for these,
and create a circular dependency in doing so, move them to a dedicated
header instead.
This behaviour makes query result building sensitive to whether the
data was recently compacted or not; in particular, different digests will
be produced depending on whether purgeable tombstones happened to be
compacted (and thus purged) or not. This means that two replicas can
produce different digests for the same data if one has compacted some
purgeable tombstones and the other has not.
To avoid this, drop purgeable tombstones during query compaction as
well.
This method was marked with 'FIXME -- should not be public'
when it was introduced. Since then it has stopped being used,
so now it can even be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210122083146.5886-1-xemul@scylladb.com>
`multishard_combining_reader` currently only works under the assumption
that every table uses the same sharder, configured using the node's number
of shards. But we could potentially specify a different sharder for a chosen table,
e.g. one that puts everything on shard 0.
Then this assumption will be broken and the reader will cause a segfault.
Fixes#7945.
When writing to an integer column, Cassandra's fromJson() function allows
not just JSON number constants, it also allows a string containing a
number. Strings which do not hold a number fail with a FunctionFailure.
In particular, the empty string "" is an invalid number, and should fail.
The tests in this patch check this for two integer types: int and
varint.
Curiously, Cassandra and Scylla have opposite bugs here: Scylla fails
to recognize the error for varint, while Cassandra fails to recognize
the error for int. The tests in this patch reproduce these bugs.
The tests demonstrating Scylla's bug are marked xfail, and the test
demonstrating Cassandra's bug is marked "cassandra_bug" (which means
it is marked xfail only when running against Cassandra, but expected
to succeed on Scylla).
Refs #7944.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210121133833.66075-1-nyh@scylladb.com>
As reproduced in cql-pytest/test_json.py and reported in issue #7911,
failing fromJson() calls should return a FUNCTION_FAILURE error, but
currently produce a generic SERVER_ERROR, which can lead the client
to think the server experienced some unknown internal error and the
query can be retried on another server.
This patch adds a new cassandra_exception subclass that we were missing -
function_execution_exception - properly formats this error message (as
described in the CQL protocol documentation), and uses this exception
in two cases:
1. Parse errors in fromJson()'s parameters are converted into a
function_execution_exception.
2. Any exception during the execute() of a native_scalar_function_for
function is converted into a function_execution_exception.
In particular, fromJson() uses a native_scalar_function_for.
Note, however, that for functions which already took care to produce
a specific Cassandra error, that error is passed through and not
converted to a function_execution_exception. An example is
blobAsText(), which can return an invalid_request error, so
it is left as such and not converted. This also happens in Cassandra.
All relevant tests in cql-pytest/test_json.py now pass, and are
no longer marked xfail. This patch also includes a few more improvements
to test_json.py.
Fixes#7911
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210118140114.4149997-1-nyh@scylladb.com>
Merged patch series by Konstantin Osipov:
"These series improve uniqueness of generated timeuuids and change
list append/prepend logic to use client/LWT timestamp in timeuuids
generated for list keys. Timeuuid compare functions are
optimized.
The test coverage is extended for all of the above."
uuid: add a comment warning against UUID::operator<
uuid: replace slow versions of timeuuid compare with optimized/tested versions.
test: add tests for legacy uuid compare & msb monotonicity
test: add a test case for append/prepend limit
test: add a test case for monotonicity of timeuuid least significant bits
uuid: implement optimized timeuuid compare
test: add a test case for list prepend/append with custom timestamp
lists: rewrite list prepend to use append machinery
lists: use query timestamp for list cell values during append
uuid: fill in UUID node identifier part of UUID
test: add a CQL test for list append/prepend operations
Introduce uint64_t based comparator for serialized timeuuids.
Respect Cassandra legacy for timeuuid compare order.
Scylla uses two versions of timeuuid compare:
- one for timeuuid values stored in uuid columns
- a different one for timeuuid values stored in timeuuid columns.
This commit re-implements these comparators and deprecates the
respective implementations in types.cc. They
will be removed in a following patch.
A micro-benchmark at https://github.com/alecco/timeuuid-bench/
shows 2-4x speed up of the new comparators.
Rewrite list prepend to use the same machinery
as append, and thus produce correct results when used in LWT.
After this patch, list prepend begins to honor user supplied timestamps.
If a user supplied timestamp for prepend is less than 2010-01-01 00:00:00
an exception is thrown.
Fixes#7611
Scylla list cells are represented internally as a map of
timeuuid => value. To append a new value to a list
the coordinator generates a timeuuid reflecting the current time as key
and adds a value to the map using this key.
Before this patch, Scylla always generated a timeuuid for a new
value, even if the query had a user supplied or LWT timestamp.
This could break LWT linearizability. User supplied timestamps were
ignored.
This is reported as https://github.com/scylladb/scylla/issues/7611
A statement which appended multiple values to a list, or a BATCH,
generated its own microsecond-resolution timeuuid for each value:
BEGIN BATCH
UPDATE ... SET a = a + [3]
UPDATE ... SET a = a + [4]
APPLY BATCH
UPDATE ... SET a = a + [3, 4]
To fix the bug, it's necessary to preserve monotonicity of
timeuuids within a batch or multi-value append, but make sure
they all use the microsecond time, as is set by LWT or user.
To explain the fix, it's first necessary to recall the structure
of time-based UUIDs:
60 bits: time since start of GMT epoch, year 1582, represented
in 100-nanosecond units
4 bits: version
14 bits: clock sequence, a random number to avoid duplicates
in case system clock is adjusted
2 bits: type
48 bits: MAC address (or other hardware address)
The purpose of the clockseq bits, as defined in
https://tools.ietf.org/html/rfc4122#section-4.1.5,
is to reduce the probability of UUID collision in case the clock
goes back in time or the node id changes. The implementation should reset it
whenever one of these events may occur.
Since LWT microsecond time is guaranteed to be
unique by Paxos, the RFC provisioning for clockseq and MAC
slots becomes excessive.
The fix thus changes timeuuid slot content in the following way:
- time component now contains the same microsecond time for all
values of a statement or a batch. The time is unique and monotonic in
case of LWT. Otherwise it's almost always monotonic, but may not be
unique if two timestamps are created on different coordinators.
- clockseq component is used to store a sequence number which is
unique and monotonic for all values within the statement/batch.
- to protect against time back-adjustments and duplicates
if time is auto-generated, MAC component contains a random (spoof)
MAC address, re-created on each restart. The address is different
at each shard.
The change is made for all sources of time: user, generated, LWT.
Conditioning the list key generation algorithm on the source of
time would unnecessarily complicate the code while not increasing
the quality (uniqueness) of created list keys.
Since 14 bits of clockseq provide us with only 16383 distinct slots
per statement or batch, 3 extra bits in nanosecond part of the time
are used to extend the range to 131071 values per statement/batch.
If the range is exceeded, an exception is produced.
A twist on the use of clockseq to extend timeuuid uniqueness is
that Scylla, like Cassandra, uses int8 compare to compare lower
bits of timeuuid for ordering. The patch takes this into account
and sign-complements the clockseq value to make it monotonic
according to the legacy compare function.
Fixes#7611
test: unit (dev)
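A back-of-the-envelope model of the widened sequence range; the bit placement
is illustrative and omits the sign-complement step mentioned above:

    #include <cstdint>
    #include <stdexcept>

    // 14 clockseq bits plus 3 spare slots in the 100ns part of the time
    // give 131071 distinct list keys per statement/batch.
    constexpr uint32_t max_seq_per_statement = (1u << 17) - 1; // 131071

    void encode_seq(uint32_t seq, uint64_t& time_100ns, uint16_t& clockseq) {
        if (seq > max_seq_per_statement) {
            throw std::overflow_error("too many list values in one statement/batch");
        }
        clockseq = uint16_t(seq & 0x3fff); // low 14 bits -> clockseq slot
        // high 3 bits -> the 100ns sub-slots of the microsecond timestamp:
        // a microsecond converted to 100ns units leaves 10 slots free, of
        // which these 3 bits use 8
        time_100ns += uint64_t(seq >> 14);
    }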
Before this patch, UUID generation code was not creating
sufficiently unique IDs: the 6 byte node identifier was mostly
empty, i.e. only containing shard id. This could lead to
collisions between queries executed concurrently at different
coordinators, and, since timeuuid is used as key in list append
and prepend operations, lead to lost updates.
To generate a unique node id, the patch uses a combination of
hardware MAC address (or a random number if no hardware address is
available) and the current shard id.
The shard id is mixed into the higher bits of the MAC, to reduce the
chances of colliding with a real NIC within the same network.
With sufficiently unique timeuuids as list cell keys, such updates
are no longer lost, but multi-value update can still be "merged"
with another multi-value update.
E.g. if node A executes SET l = l + [4, 5] and node B executes SET
l = l + [6, 7], the list value could be any of [4, 5, 6, 7], [4,
6, 5, 7], [6, 4, 5, 7] and so on.
At least we are now less likely to get any value lost.
Fixes#6208.
@todo: initialize UUID subsystem explicitly in main()
and switch to using seastar::engine().net().network_interfaces()
test: unit (dev)
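A sketch of the node-id construction under these assumptions (the bit
placement is illustrative):

    #include <cstdint>
    #include <random>

    // Build the 48-bit node identifier from the hardware MAC (or a random
    // number when no hardware address is available), mixing the shard id
    // into the high bits.
    uint64_t make_node_id(uint64_t mac48, unsigned shard_id) {
        if (mac48 == 0) { // no hardware address available
            std::mt19937_64 rng{std::random_device{}()};
            mac48 = rng() & 0xffffffffffffULL;
        }
        // Mixing the shard into the *high* bits reduces the chance of
        // colliding with a real NIC on the same network.
        return (mac48 ^ (uint64_t(shard_id) << 40)) & 0xffffffffffffULL;
    }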
Now that managed_bytes and its users do not assume that a managed_bytes
instance allocated using standard_allocation_strategy is non-fragmented,
we can set the preferred max contiguous allocation to 128k. This causes
managed_bytes to fragment instances that are larger than this size.
Note that managed_bytes is the only user.
Closes#7943
This patch set adds etcd unit tests for raft.
It also includes a fix for replication test in debug mode and a
simplification for append_request.
Tests: unit ({dev}), unit ({debug}), unit ({release})
* https://github.com/alecco/scylla/tree/raft-ale-tests-09b:
raft: etcd unit tests: test log replication
raft: boost test etcd: test fsm can vote from any state
raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
raft: replication test: add etcd test for cycling leaders
raft: testing: provide primitives to wait for log propagation
raft: etcd unit tests: initial boost tests
raft: combine append_request _receive and _send
Podman doesn't correctly support --pids-limit with cgroupsv1. Some
versions ignore it, and some versions reject the option.
To avoid the error, don't supply --pids-limit if cgroupsv2 is not
available (detected by its presence in /proc/filesystems). The user
is required to configure the pids limit in
/etc/containers/containers.conf.
Fixes#7938.
Closes#7939
"
_consumer_fut is expected to return an exception
on the abort path. Wait for it and drop any exception
so it won't be abandoned as seen in #7904.
A future<> close() method was added to return
_consumer_fut. It is called both after abort()
in the error path, and after consume_end_of_stream,
on the success path.
With that, consume_end_of_stream was made void
as it doesn't return a future<> anymore.
Fixes#7904
Test: unit(release)
"
* tag 'close-bucket-writer-v5' of github.com:bhalevy/scylla:
mutation_writer: bucket_writer: add close
mutation_writer/feed_writers: refactor bucket/shard writers
mutation_writer: update bucket/shard writers consume_end_of_stream
This adds a "--build-mode" command line option to "scylla" executable:
$ ./build/dev/scylla --build-mode
dev
This allows you to discover the build mode of a "scylla" executable
without resorting to "readelf", for example, to verify that you are
looking at the correct executable while debugging packaging issues.
Closes#7865
Just like scylla-sstable-index, scylla-types accepts types in (short)
cassandra class name notation. The mapping from the cql3 type names to
the class names is not straightforward in all cases, so provide a link
to a table which lists the cassandra class name of all supported types
(and more).
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210120083816.37774-2-bdenes@scylladb.com>
This reverts commit df2f67626b. The fix
is correct, but has an unfortunate side effect with O_DSYNC: each
128k write also needs to flush the XFS log. This translates to
32MB/128k = 256 flushes, compared to one flush with the original code.
A better fix would be to prezero without O_DSYNC, then reopen the file
with O_DSYNC, but we can do that later.
Reopens#5857.
"
Currently storage service and snitch implicitly depend on each
other. Storage service gossips snitch data on start, snitch
kicks the storage service when its configuration changes.
This interdependency is relaxed:
- snitch gossips all its state itself without using the
storage service as a mediator
- storage service listens for snitch updates with the help
of self-breaking subscription
Both changes make snitch independent from storage service,
remove yet another call for global storage service from the
codebase and make the storage service -> snitch reference
robust against dangling pointers/references
tests: unit(dev), dtest.rebuild.TestRebuild.simple_rebuild(dev)
"
* 'br-snitch-gossip-2' of https://github.com/xemul/scylla:
storage-service: Subscribe to snitch to update topology
snitch: Introduce reconfiguration signal
snitch: Always gossip snitch info itself
snitch: Do gossip DC and RACK itself
snitch: Add generic gossiping helper
The new naming scheme more clearly communicates to the client of
the raft library that the `persistence` interface implements
persistency layer of the fsm that is powering the raft
protocol itself rather than the client-side workflow and
user-provided `state_machine`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>
The current code that checks when a snapshot has to be transferred does not
take into account the case where there can be log entries preceding the
snapshot. Fix the code to correctly test for the snapshot transfer
condition.
Message-Id: <20210117095801.GB733394@scylladb.com>
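A hedged sketch of the corrected test (the index names are illustrative):

    #include <cstdint>

    // A snapshot must be transferred only when the entries the follower
    // needs are no longer retained in the log; comparing against the
    // snapshot index alone is wrong when log entries preceding the
    // snapshot are still kept.
    bool need_snapshot_transfer(uint64_t follower_next_idx,
                                uint64_t first_retained_log_idx) {
        return follower_next_idx < first_retained_log_idx;
    }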
replication_test's state machine is not commutative, so if commands are
applied in a different order the states will be different as well. Since
the preemption check was added into co_await in seastar, even waiting for
a ready future can preempt, which will cause reordering of simultaneously
submitted entries in debug mode. For a long time we tried to keep entries
submission parallel in the test, but with the above seastar change it
is no longer possible to maintain it without changing the state machine
to be commutative. The patch changes the test to submit entries one by
one.
Message-Id: <20210117095147.GA733394@scylladb.com>
bucket_writer::close waits for the _consumer_fut.
It is called both after consume_end_of_stream()
and after abort().
_consumer_fut is expected to return an exception
on the abort path. Wait for it and drop any exception
so it won't be abandoned as seen in #7904.
With that moved to close() time, consume_end_of_stream
doesn't need to return a future and is made void
all the way in the stack. This is ok since
queue_reader_handle::push_end_of_stream is synchronous too.
Added a unit test that aborts the reader consumer
during `segregate_by_timestamp`, reproducing the
Exceptional future ignored issue without the fix.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
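A toy shape of the close() pattern: wait for the pending consumer future and
swallow its exception, which is expected on the abort path. handle_exception
is seastar's real API; the struct is a stand-in for bucket_writer:

    #include <exception>
    #include <seastar/core/future.hh>

    struct bucket_writer_sketch {
        seastar::future<> _consumer_fut = seastar::make_ready_future<>();

        // Dropping the exception here avoids the "Exceptional future
        // ignored" warning seen in #7904.
        seastar::future<> close() {
            return std::move(_consumer_fut).handle_exception(
                    [] (std::exception_ptr) { /* expected after abort() */ });
        }
    };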
Consolidate shard_based_splitting_writer::shard_writer
and timestamp_based_splitting_writer::bucket_writer
common code into mutation_writer::bucket_writer.
This provides a common place to handle consume_end_of_stream()
and abort(), and in particular the handling of the underlying
_consumer_fut.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
After 61520a33d6,
feed_writers doesn't call consume_end_of_stream
after abort(), so there is no need to test
`if (!_handle.is_terminated())`,
and consume_end_of_stream is now called in then_wrapped
rather than in `finally`, so it's ok if it throws.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The code would already silence broken pipe exceptions, since they are
expected when the other side closes the connection or when we shut down the
socket during Scylla shutdown, but the code wouldn't handle the following:
1. "Connection reset by peer" errors: these can also happen in the
aforementioned two scenarios; the conditions that determine which of
the two types of errors occur are unclear.
2. The scenarios would sometimes result in a `seastar::nested_exception`,
mainly during shutdown. The errors could happen once when trying to send
a response to a request (`_write_buf.write(...)/flush(...)`) and then
again when trying to close the connection in a `finally` block. These
nested exceptions were not silenced.
The commit handles each of these cases.
Closes#7907.
Closes#7931
The main motivation for this patchset is to prepare
for adding an async close() method to flat_mutation_reader.
In order to close the reader before destroying it
in all paths, we need to make next_partition asynchronous
so it can asynchronously close a current reader before
destroying it, e.g. by reassignment of flat_mutation_reader_opt,
as done in scanning_reader::next_partition.
Test: unit(release, debug)
* git@github.com:bhalevy/scylla.git futurize-next-partition-v1:
flat_mutation_reader: return future from next_partition
multishard_mutation_query: read_context: save_reader: destroy reader_meta from the calling shard
mutation_reader: filtering_reader: fill_buffer: futurize inner loop
flat_mutation_reader::impl: consumer_adapter: futurize handle_result
flat_mutation_reader: consume_pausable/in_thread: futurize_invoke consumer
flat_mutation_reader: FlatMutationReaderConsumer: support also async consumer
flat_mutation_reader:impl: get rid of _consume_done member
For tests to be able to transition to a consistent state, in some cases
it's needed to allow the followers to catch up with the leader.
This prevents occasional hangs in debug mode for incoming tests.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Combine structs for append request send and receive into a single
struct.
Author: Gleb Natapov <gleb@scylladb.com>
Date: Mon Nov 23 14:33:14 2020 +0200
Test single- and multi- value list append, prepend,
append and prepend in a batch, conditional statements.
This covers the parts of Cassandra which are working as documented
and which we intend to preserve compatibility with.
storage_service: Introduce load_and_stream
=== Introduction ===
This feature extends the nodetool refresh to allow loading arbitrary sstables
that do not belong to a node into the cluster. It loads the sstables from disk
and calculates the owning nodes of the data and streams to the owners
automatically.
For example, say the old cluster has 6 nodes and the new cluster has 3 nodes.
We can copy the sstables from the old cluster to any of the new nodes and
trigger the load and stream process.
This can make restores and migrations much easier.
=== Performance ===
I managed to get 40MB/s per shard on my build machine.
CPU: AMD Ryzen 7 1800X Eight-Core Processor
DISK: Samsung SSD 970 PRO 512GB
Assume 1TB of sstables per node; each shard can do 40MB/s and each node has 32
shards, so we can finish the load and stream of 1TB of data in 13 mins on each
node:
1 TB / (40 MB/s per shard * 32 shards) ≈ 780 s ≈ 13 mins
=== Tests ===
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_test,
which creates a cluster with 4 nodes and inserts data, then uses
load_and_stream to restore to a 2-node cluster.
=== Usage ===
curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"
=== Notes ===
Btw, with the old nodetool refresh, the node will not pick up the data
that does not belong to it, but it will not delete that data either. One
has to run nodetool cleanup to remove it manually, which is a
surprise to me and probably to users as well. With load and stream, the
process will delete the sstables once it finishes streaming, so no nodetool
cleanup is needed.
The name of this feature, load and stream, follows load and store in the CPU world.
Fixes#7831Closes#7846
* github.com:scylladb/scylla:
storage_service: Introduce load_and_stream
distributed_loader: Add get_sstables_from_upload_dir
table: Add make_streaming_reader for given sstables set
This is a revival of #7490.
Quoting #7490:
The managed_bytes class now uses implicit linearization: outside LSA, data is never fragmented, and within LSA, data is linearized on-demand, as long as the code is running within with_linearized_managed_bytes() scope.
We would like to stop linearizing managed_bytes and keep it fragmented at all times, since linearization can require large contiguous chunks. Large contiguous allocations are hard to satisfy and cause latency spikes.
As a first step towards that, we remove all implicitly linearizing accessors and replace them with an explicit linearization accessor, with_linearized().
Some of the linearization happens long before use, by creating a bytes_view of the managed_bytes object and passing it onwards, perhaps storing it for later use. This does not work with with_linearized(), which creates a temporary linearized view, and does not work towards the longer term goal of never linearizing. As a substitute a managed_bytes_view class is introduced that acts as a view for managed_bytes (for interoperability it can also be a view for bytes and is compatible with bytes_view).
By the end of the series, all linearizations are temporary, within the scope of a with_linearized() call and can be converted to fragmented consumption of the data at leisure.
This has limited practical value directly, as current uses of managed_bytes are limited to keys (which are limited to 64k). However, it enables converting the atomic_cell layer back to managed_bytes (so we can remove IMR) and the CQL layer to managed_bytes/managed_bytes_view, removing contiguous allocations from the coordinator.
Closes#7820
* github.com:scylladb/scylla:
test: add hashers_test
memtable: fix accounting of managed_bytes in partition_snapshot_accounter
test: add managed_bytes_test
utils: fragment_range: add a fragment iterator for FragmentedView
keys: update comments after changes and remove an unused method
mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator
row_cache: more indentation fixes
utils: remove unused linearization facilities in `managed_bytes` class
misc: fix indentation
treewide: remove remaining `with_linearized_managed_bytes` uses
memtable, row_cache: remove `with_linearized_managed_bytes` uses
utils: managed_bytes: remove linearizing accessors
keys, compound: switch from bytes_view to managed_bytes_view
sstables: writer: add write_* helpers for managed_bytes_view
compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view
types: change equal() to accept managed_bytes_view
types: add parallel interfaces for managed_bytes_view
types: add to_managed_bytes(const sstring&)
serializer_impl: handle managed_bytes without linearizing
utils: managed_bytes: add managed_bytes_view::operator[]
utils: managed_bytes: introduce managed_bytes_view
utils: fragment_range: add serialization helpers for FragmentedMutableView
bytes: implement std::hash using appending_hash
utils: mutable_view: add substr()
utils: fragment_range: add compare_unsigned
utils: managed_bytes: make the constructors from bytes and bytes_view explicit
utils: managed_bytes: introduce with_linearized()
utils: managed_bytes: constrain with_linearized_managed_bytes()
utils: managed_bytes: avoid internal uses of managed_bytes::data()
utils: managed_bytes: extract do_linearize_pure()
thrift: do not depend on implicit conversion of keys to bytes_view
clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view
cql3: expression: linearize get_value_from_mutation() earlier
bytes: add to_bytes(bytes)
cql3: expression: mark do_get_value() as static
This test is a sanity check. It verifies that our wrappers over well known
hashes (xxhash, md5, sha256) actually calculate exactly those hashes.
It also checks that the `update()` methods of used hashers are linear with
respect to concatenation: that is, `update(a + b)` must be equivalent to
`update(a); update(b)`. This wasn't relied on before, but now we need to
confirm that hashing fragmented keys without linearizing them won't break
backward compatibility.
managed_bytes has a small overhead per each fragment. Due to that, managed_bytes
containing the same data can have different total memory usage in different
allocators. The smaller the preferred max allocation size setting is, the more
fragments are needed and the greater total per-fragment overhead is.
In particular, managed_bytes allocated in the LSA could grow in
memory usage when copied to the standard allocator, if the standard allocator
had a preferred max allocation setting smaller than the LSA.
partition_snapshot_accounter calculates the amount of memory used by
mutation fragments in the memtable (where they are allocated with LSA) based
on the memory usage after they are copied to the standard allocator.
This could result in an overestimation, as explained above.
But partition_snapshot_accounter must not overestimate the amount of freed
memory, as doing otherwise might result in OOM situations.
This patch prevents the overaccounting by adding minimal_external_memory_usage():
a new version of external_memory_usage(), which ignores allocator-dependent
overhead. In particular, it includes the per-fragment overhead in managed_bytes
only once, no matter how many fragments there are.
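A sketch of the two accounting flavours, with an assumed per-fragment overhead
constant (illustrative, not the real values):

    #include <cstddef>

    constexpr size_t per_fragment_overhead = 16;  // illustrative, not the real value
    constexpr size_t preferred_max_alloc = 128 * 1024;

    // Allocator-dependent accounting: the smaller the preferred max
    // allocation, the more fragments and the more total overhead.
    size_t external_memory_usage(size_t data_size) {
        const size_t fragments =
                (data_size + preferred_max_alloc - 1) / preferred_max_alloc;
        return data_size + fragments * per_fragment_overhead;
    }

    // Lower bound: count the per-fragment overhead only once, so the
    // amount of freed memory is never overestimated.
    size_t minimal_external_memory_usage(size_t data_size) {
        return data_size + per_fragment_overhead;
    }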
The comments were outdated after the latest changes (bytes_view vs
managed_bytes_view).
compound_view_wrapper::get_component() is unused, so we remove it.
Since we introduced the relocatable package and offline installer, the scylla binary itself can run on almost any distribution.
However, the setup scripts are not designed to run on unsupported distributions, which causes errors on such environments.
This PR adds minimal support to run offline installation on unsupported distributions, tested on SLES, Arch Linux and Gentoo.
Closes#7858
* github.com:scylladb/scylla:
dist: use sysconfig_parser to parse gentoo config file
dist: add package name translation
dist: support SLES/OpenSUSE
install.sh: add systemd existence check
install.sh: ignore errors on missing sysctl entries
dist: show warning on unsupported distributions
dist: drop Ubuntu 14.04 code
dist: move back is_amzn2() to scylla_util.py
dist: rename is_gentoo_variant() to is_gentoo()
dist: support Arch Linux
dist: make sysconfig directory detectable
Flush is facing stalls because partition_snapshot_flat_reader::fill_buffer()
generates mutation fragments until the buffer is full [1] without yielding.
This is the code path:
flush_reader::fill_buffer() <---------|
flat_mutation_reader::consume_pausable() <--------|
partition_snapshot_flat_reader::fill_buffer() -|
[1]: https://github.com/scylladb/scylla/blob/6cfc949e/partition_snapshot_reader.hh#L261
This is fixed by breaking the loop in do_fill_buffer() if preemption is needed,
allowing do_until() to yield in sequence; when it resumes, it continues from
where it left off, until the buffer is full.
Fixes#7885.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210114141417.285175-1-raphaelsc@scylladb.com>
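A minimal model of the fix, using seastar::need_preempt() with toy stand-ins
for the buffer machinery:

    #include <seastar/core/preempt.hh>

    // Toy stand-ins for the reader's buffer machinery.
    static int fragments_left = 100;
    static bool buffer_full() { return fragments_left <= 0; }
    static void emit_next_mutation_fragment() { --fragments_left; }

    // Simplified do_fill_buffer() step: stop when the buffer fills up *or*
    // when the scheduler asks for preemption; the outer do_until() can
    // then yield, and the loop resumes from where it left off.
    void do_fill_buffer_step() {
        while (!buffer_full()) {
            emit_next_mutation_fragment();
            if (seastar::need_preempt()) {
                break;
            }
        }
    }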
implicit revert of 6322293263
sshd was previously used by scylla manager 1.0.
The new version does not need it, so there is no point in
having it currently. It also confuses everyone.
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Closes#7921
The client_state::check_access() calls for global storage service
to get the features from it and check if the CDC feature is on.
The latter is needed to perform CDC-specific checks.
However it was noticed that the check for the feature is excessive,
as all the guarded if-s will resolve to false in case CDC is off,
and check_access will effectively work as it would with the
feature check.
With that observation, it's possible to ditch one more global storage
service reference.
tests: unit(dev), dtest(dev, auth)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210105063651.7081-1-xemul@scylladb.com>
The reader_meta in _readers[shard] is created on shard 0 and must
be destroyed on it as well.
A following patch changes next_partition() to return a future<>
thus it introduces a continuation that requires access to `rm`.
We cannot move it down to the continuation safely, since it will be
wrongly destroyed in the invoked shard, so use do_with to hold it
in the scope of the calling shard until the invoked function
completes.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently snitch explicitly calls storage service (if
it's initialized) to update topology on snitch data
change.
Instead of that -- make the storage service subscribe to the
snitch reconfigure signal upon creation.
This finally makes snitch fully independent from storage
service.
In tests the snitch instance is not created, so check
for it before subscribing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add a notifier to snitch_base, to which others may subscribe,
that gets triggered when the snitch configuration changes.
For now only the gossiping-file-snitch triggers it when it
re-reads its config file. Other existing snitches are kinda
static in this sense.
The subscribe-trigger engine is based on scoped connection
from boost::signals2.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The gossiping_property_file_snitch updates the gossip RACK and DC
values upon config change. Right now this is done with the help
of storage service, but the needed code to gossip rack and dc is
already available in the snitch itself.
That said -- gossip snitch info via the snitch helper and remove the
storage_service one. This is the 2nd step in decoupling snitch
and storage service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the 2nd step in generalizing the snitch data gossiping
and at the same time the 1st step in decoupling storage service and
snitch.
During start, the storage service starts the gossiper, which notifies the
snitch with a .gossiper_starting() call, then the storage service
calls gossip_snitch_info.
This patch makes snitch itself do the last step.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays some snitch implementations gossip the INTERNAL_IP
value, and storage_service gossips RACK and DC for all of them.
This functionality is going to be generalized and the first
step is in making a common method for a snitch to gossip its
data.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The offline installer can run on non-systemd distributions, but it won't
work since we only have systemd units.
So check for systemd existence and print an error message.
is_redhat_variant() is the function to detect RHEL/CentOS/Fedora/OEL,
and is_debian_variant() is the function to detect Debian/Ubuntu.
Unlike these functions, is_gentoo_variant() does not detect "Gentoo variants",
so we should rename it to is_gentoo().
Currently, install.sh provides a way to customize the sysconfig directory,
but the sysconfig directory is hardcoded in the script.
Also, /etc/sysconfig seems the correct default value, but the current
code specifies /etc/default for non-redhat distributions.
Instead of hardcoding, generate a python script in install.sh
to save the specified sysconfig directory path in python code.
The reply to a /column_family/ GET request contains info about all
column families. Currently, all this info is stored in a single
string when replying, and this string may require a big allocation
when there are many column families.
To avoid that allocation, instead of a single string, use a
body_writer function, which writes chunks of the message content
to the output stream.
Fixes#7916
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Closes#7917
After these changes the generated code deserializes the stream into a chunked vector, instead of a contiguous one, so even if there are many fields in it, there won't be any big allocations.
I haven't run the scylla cluster test with it yet but it passes the unit tests.
Closes#7919
* github.com:scylladb/scylla:
idl: change the type of mutation_partition_view::rows() to a chunked_vector
idl-compiler: allow fields of type utils::chunked_vector
Numbers in JSON are not limited in range, so when the fromJson() function
converts a number to a limited-range integer column in Scylla, this
conversion can overflow. The following tests check that this conversion
should result in an error (FunctionFailure), not silent truncation.
Scylla today silently wraps the number around, so these tests
xfail. They pass on Cassandra.
Refs #7914.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112151041.3940361-1-nyh@scylladb.com>
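A sketch of the desired checked conversion (the names are illustrative; in
Scylla the failure would surface as a FunctionFailure):

    #include <cstdint>
    #include <limits>
    #include <stdexcept>

    // Converting a JSON number to a bounded integer column must fail on
    // overflow rather than wrap around silently.
    int32_t json_number_to_int(int64_t parsed) {
        if (parsed < std::numeric_limits<int32_t>::min() ||
            parsed > std::numeric_limits<int32_t>::max()) {
            throw std::range_error("fromJson: number out of range for type int");
        }
        return static_cast<int32_t>(parsed);
    }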
This patch adds more (failing) tests for issue #7911, where fromJson()
failures should be reported as a clean FunctionFailure error, not an
internal server error.
The previous tests we had were about JSON parse failures, but a
different type of error we should support is valid JSON which returned
the wrong type - e.g., the JSON returning a string when an integer
was expected, or the JSON returning a string with non-ASCII characters
when ASCII was expected. So this patch adds more such tests. All of
them xfail on Scylla, and pass on Cassandra.
Refs #7911.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112122211.3932201-1-nyh@scylladb.com>
This patch adds a reproducer test for issue #7912: passing
a null parameter to the fromJson() function is supposed to be legal (and
return a null value), and is legal in Cassandra, but isn't allowed in
Scylla.
There are two tests - for a prepared and unprepared statement - which
fail in different ways. The issue is still open so the tests xfail on
Scylla - and pass on Cassandra.
Refs #7912.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112114254.3927671-1-nyh@scylladb.com>
Related issue scylladb/sphinx-scylladb-theme#88
Once this commit is merged, the docs will be published under the new domain name https://scylla.docs.scylladb.com
Frequently asked questions:
Should we change the links in the README/docs folder?
GitHub automatically handles the redirections. For example, https://scylladb.github.io/sphinx-scylladb-theme/stable/examples/index.html redirects to https://sphinx-theme.scylladb.com/stable/examples/index.html
Nevertheless, it would be great to change URLs progressively to avoid the 301 redirections.
Do I need to add this new domain in the custom dns domain section on GitHub settings?
It is not necessary. We have already edited the DNS for this domain and the theme creates programmatically the required CNAME file. If everything goes well, GitHub should detect the new URL after this PR is merged.
The DNS doesn't seem to have the right SSL certificates
GitHub handles the certificate provisioning but is not aware of the subdomain for this repo yet. make multi-version will create a new file "CNAME". This is published in gh-pages branch, therefore GitHub should create the missing cert.
Closes#7877
Use the thread_local seastar::testing::local_random_engine
in all seastar tests so they can be reproduced using
the --random-seed option.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210112103713.578301-2-bhalevy@scylladb.com>
The min/max aggregators use aggregate_type_for comparators, and the
aggregate_type_for<timeuuid> is regular uuid. But that yields wrong
results; timeuuids should be compared as timestamps.
Fix it by changing aggregate_type_for<timeuuid> from uuid to timeuuid,
so aggregators can distinguish between the two. Then specialize the
aggregation utilities for timeuuid.
Add a cql-pytest test and change some unit tests that relied on naive uuid comparators.
Fixes#7729.
Tests: unit (dev, debug)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#7910
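For reference, a version-1 (time-based) UUID packs a 60-bit timestamp into its most significant 64 bits; a minimal sketch of extracting it for chronological comparison (layout per RFC 4122, not Scylla's actual comparator):
```
#include <cstdint>

// msb layout of a version-1 UUID: time_low(32) | time_mid(16) | version(4) | time_hi(12).
// Reassemble the 60-bit timestamp so two timeuuids can be ordered chronologically.
uint64_t timeuuid_timestamp(uint64_t msb) {
    uint64_t time_low = msb >> 32;
    uint64_t time_mid = (msb >> 16) & 0xFFFF;
    uint64_t time_hi  = msb & 0x0FFF;
    return (time_hi << 48) | (time_mid << 32) | time_low;
}

// Byte-wise uuid comparison would order by time_low first and get this wrong.
bool timeuuid_less(uint64_t msb_a, uint64_t msb_b) {
    return timeuuid_timestamp(msb_a) < timeuuid_timestamp(msb_b);
}
```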
"
Without interposer consumer on flush, it could happen that a new sstable,
produced by memtable flush, will not conform to the strategy invariant.
For example, with TWCS, this new sstable could span multiple time windows,
making it hard for the strategy to purge expired data. If interposer is
enabled, the data will be correctly segregated into different sstables,
each one spanning a single window.
Fixes#4617.
tests:
- mode(dev).
- manually tested it by forcing a flush of memtable spanning many windows
"
* 'segregation_on_flush_v2' of github.com:raphaelsc/scylla:
test: Add test for TWCS interposer on memtable flush
table: Wire interposer consumer for memtable flush
table: Add write_memtable_to_sstable variant which accepts flat_mutation_reader
table: Allow sstable write permit to be shared across monitors
memtable: Track min timestamp
table: Extend cache update to operate a memtable split into multiple sstables
This patch adds a reproducer test for issue #7911, which is about a parse
error in JSON string passed to the fromJson() function causing an
internal error instead of the expected FunctionFailure error.
The issue is still open so the test xfails on Scylla (and passes on
Cassandra).
Refs #7911.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210112094629.3920472-1-nyh@scylladb.com>
The option can only take integer values >= 0, since negative
TTL is meaningless and is expected to fail the query when used
with the `USING TTL` clause.
It's better to fail early on `CREATE TABLE` and `ALTER TABLE`
statement with a descriptive message rather than catch the
error during the first lwt `INSERT` or `UPDATE` while trying
to insert to system.paxos table with the desired TTL.
Tests: unit(dev)
Fixes: #7906
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210111202942.69778-1-pa.solodovnikov@scylladb.com>
Unfortunately snapshot checking still does not work in the presence of log entry reordering. It is impossible to know when exactly the snapshot will be taken, and if it is taken before all entries with indexes smaller than the snapshot index are applied, the check will fail, since it assumes they have been.
This patch disables snapshot checking for the SUM state machine that is used in the backpressure test.
Message-Id: <20201126122349.GE1655743@scylladb.com>
The value of mutation_partition_view::rows() may be very large, but is
used almost exclusively for iteration, so in order to avoid a big allocation
for an std::vector, we change its type to a utils::chunked_vector.
Fixes#7918
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The utils::chunked_vector has practically the same methods
as a std::vector, so the same code can be generated for it.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
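The allocation pattern that motivates both patches can be illustrated with a toy chunked vector (illustration only, not the real utils::chunked_vector):
```
#include <array>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Elements live in fixed-size chunks, so growing allocates one small chunk
// at a time and never one huge contiguous buffer (assumes T is
// default-constructible; the real implementation is more careful).
template <typename T, std::size_t ChunkSize = 1024>
class chunked_vector_sketch {
    std::vector<std::unique_ptr<std::array<T, ChunkSize>>> _chunks;
    std::size_t _size = 0;
public:
    void push_back(T v) {
        if (_size % ChunkSize == 0) {
            _chunks.push_back(std::make_unique<std::array<T, ChunkSize>>());
        }
        (*_chunks.back())[_size % ChunkSize] = std::move(v);
        ++_size;
    }
    T& operator[](std::size_t i) { return (*_chunks[i / ChunkSize])[i % ChunkSize]; }
    std::size_t size() const { return _size; }
};
```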
We have recently seen a suspected corrupt mutation fragment stream get
into an sstable undetected, causing permanent corruption. One of the
suspected ways this could happen is the compaction sstable write path not
being covered with a validator. To prevent events like this in the future
make sure all sstable write paths are validated by embedding the validator
right into the sstable writer itself.
Refs: #7623
Refs: #7640
Tests: unit(release)
* https://github.com/denesb/scylla.git sstable-writer-fragment-stream-validation/v2:
sstable_writer: add validation
test/boost/sstable_datafile_test: sstable_scrub_test: disable key validation
mutation_fragment_stream_validator: make it easier to validate concrete fragment types
flat_mutation_reader: extract fragment stream validator into its own header
Attempt to hurry flushing/segment delete/recycle if we are trying to get a segment for allocation and the reserve is empty while above the disk threshold. This is to minimize the time waited in the allocation semaphore.
Cassandra constructs `QueryOptions.SpecificOptions` in the same
way that we do (by not providing `serial_consistency`), but they
do have a user-defined constructor which does the following thing:
this.serialConsistency = serialConsistency == null ? ConsistencyLevel.SERIAL : serialConsistency;
This effectively means that DEFAULT `SpecificOptions` always
have `SerialConsistency` set to `SERIAL`, while we leave this
`std::nullopt`, since we don't have a constructor for
`specific_options` which does this.
Supply `db::consistency_level::SERIAL` explicitly to the
`specific_options::DEFAULT` value.
Tests: unit(dev)
Fixes: #7850
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201231104018.362270-1-pa.solodovnikov@scylladb.com>
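The defaulting rule amounts to a value_or; a rough C++ analogue (stand-in types, not the actual specific_options code):
```
#include <optional>

enum class consistency_level { SERIAL, LOCAL_SERIAL };

// Default the serial consistency to SERIAL when the client did not set it,
// mirroring Cassandra's constructor behavior.
consistency_level effective_serial_consistency(std::optional<consistency_level> requested) {
    return requested.value_or(consistency_level::SERIAL);
}
```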
This adds a simple reproducer for a bug involving a CONTAINS relation on
frozen collection clustering columns when the query is restricted to a
single partition - resulting in a strange "marshalling error".
This bug still exists, so the test is marked xfail.
Refs #7888.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210107191417.3775319-1-nyh@scylladb.com>
We add a reproducer for issues #7868 and #7875 which are about bugs when
a table has a frozen collection as its clustering key, and it is sorted
in *reverse order*: If we try to insert an item into such a table using an unprepared statement, it fails with a wrong error ("invalid set literal"), but if we try to set up a prepared statement, the result is even worse - an assertion failure and a crash.
Interestingly, neither of these problems happen without reversed sort order
(WITH CLUSTERING ORDER BY (b DESC)), and we also add a test which
demonstrates that with default (increasing) order, everything works fine.
All tests pass successfully when run against Cassandra.
The fix for both issues was already committed, so I verified these tests
reproduced the bug before that commit, and pass now.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210110232312.3844408-1-nyh@scylladb.com>
In this patch, we port validation/entities/frozen_collections_test.java,
containing 33 tests for frozen collections of all types, including
nesting collections.
In porting these tests, I uncovered four previously unknown bugs in Scylla:
Refs #7852: Inserting a row with a null key column should be forbidden.
Refs #7868: Assertion failure (crash) when clustering key is a frozen
collection and reverse order.
Refs #7888: Certain combination of filtering, index, and frozen collection,
causes "marshalling error" failure.
Refs #7902: Failed SELECT with tuple of reversed-ordered frozen collections.
These tests also provide two more reproducers for an already known bug:
Refs #7745: Length of map keys and set items are incorrectly limited to
64K in unprepared CQL.
Due to these bugs, 7 out of the 33 tests here currently xfail. We actually
had more failing tests, but we fixed issue #7868 before this patch went in,
so its tests are passing at the time of this submission.
As usual in these sorts of tests, all 33 pass when run against Cassandra.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210110231350.3843686-1-nyh@scylladb.com>
In test_streams.py we had some code to get a list of shards and iterators
duplicated three times. Put it in a function, shards_and_latest_iterators(),
to reduce this duplication.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201006112421.426096-1-nyh@scylladb.com>
Add a mutation_fragment_stream_validating_filter to
sstables::writer_impl and use it in sstable_writer to validate the
fragment stream passed down to the writer implementation. This ensures
that all fragment streams written to disk are validated, and we don't
have to worry about validating each source separately.
The current validator from sstable::write_components() is removed; it covered only part of the write paths. Ad-hoc validations in the reader implementations are removed as well, as they are now redundant.
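Conceptually, the writer now funnels every fragment through the validator before the implementation sees it; a minimal sketch of the wrapping idea (hypothetical types, not the actual sstables:: classes):
```
#include <utility>

// Wrap any fragment consumer so every fragment is validated before it is
// written; an invalid stream fails here instead of reaching the disk.
template <typename Fragment, typename Writer, typename Validator>
class validating_writer {
    Writer _impl;
    Validator _validate;
public:
    validating_writer(Writer impl, Validator v)
        : _impl(std::move(impl)), _validate(std::move(v)) {}
    void consume(Fragment frag) {
        _validate(frag);                // throws on an out-of-order fragment
        _impl.consume(std::move(frag)); // only validated fragments are written
    }
};
```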
The test violates clustering key order on purpose to produce a corrupt
sstable (to test scrub). Disable key validation so when we move the
validator into the writer itself in the next patch it doesn't abort the
test.
The current API is tailored to the `mutation_fragment` type. In
the next patch we will want to use the validator from a context where
the mutation fragments are already decomposed into their respective
concrete types, e.g. static_row, clustering_row, etc. To avoid having to
reconstruct a mutation fragment type just to use the validator, add an
API which allows validating these concrete types conveniently too.
Replace two methods for unreversal (`as` and `self_or_reversed`) with
a new one (`without_reversed`). More flexible and better named.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#7889
Currently, frozen mutations that contain partitions with out-of-order or duplicate rows will trigger (if anything) an assert in `row::append_cell()`. However, this results in poor diagnostics (if any), as the context doesn't contain enough information on what exactly went wrong. This results in a cryptic error message and an investigation that can only start after looking at a coredump.
This series remedies this problem by explicitly checking for
out-of-order and duplicate rows, as early as possible, when the supposedly empty row is created. If the row already existed (a duplicate) or it is not the last row in the partition (an out-of-order row), an exception is thrown and the deserialization is aborted. To further
improve diagnostics, the partition context is also added to the
exception.
Tests: unit(release)
* botond/frozen-mutation-bad-row-diagnostics/v3:
frozen_mutation: add partition context to errors coming from deserializing
partition_builder: accept_row(): use append_clustering_row()
mutation_partition: add append_clustered_row()
measuring_allocator is a wrapper around standard_allocator, but it exposed
the default preferred_max_contiguous_allocation, not the one from
standard_allocator. Thus managed_bytes allocated with those two allocators had fragments of different sizes, and their total memory usage differed,
causing test_external_memory_usage to fail if
standard_allocator::preferred_max_contiguous_allocation was changed from the
default. Fix that.
Remove the following bits of `managed_bytes` since they are unused:
* `with_linearized_managed_bytes` function template
* `linearization_context_guard` RAII wrapper class for managing
`linearization_context` instances.
* `do_linearize` function
* `linearization_context` class
Since there are no more public or private methods in `managed_bytes` to linearize the value except for the explicit `with_linearized()`, which doesn't use any of the aforementioned parts, we can safely remove these.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The patch fixes indentation issues introduced in previous patches
related to removing `with_linearized_managed_bytes` uses from the
code tree.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
There is no point in calling the wrapper since the linearization code is private in the `managed_bytes` class and there is no one to call `managed_bytes::data` because it was deleted recently.
This patch is a prerequisite for removing
`with_linearized_managed_bytes` function completely, alongside with
the corresponding parts of implementation in `managed_bytes`.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Since `managed_bytes::data()` is deleted, as well as the other public APIs of `managed_bytes` that would linearize stored values (except for the explicit `with_linearized`), there is no point in invoking the `with_linearized_managed_bytes` hack, which would trigger automatic linearization under the hood of `managed_bytes`.
Remove useless `with_linearized_managed_bytes` wrapper from
memtable and row_cache code.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The keys classes (partition_key et al) already use managed_bytes,
but they assume the data is not fragmented and make liberal use
of that by casting to bytes_view. The view classes use bytes_view.
Change that to managed_bytes_view, and adjust return values
to managed_bytes/managed_bytes_view.
The callers are adjusted. In some places linearization (to_bytes())
is needed, but this isn't too bad as keys are always <= 64k and thus
will not be fragmented when out of LSA. We can remove this
linearization later.
The serialize_value() template is called from a long chain, and
can be reached with either bytes_view or managed_bytes_view.
Rather than trace and adjust all the callers, we patch it now
with constexpr if.
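A minimal sketch of that constexpr-if dispatch (stand-in types; the real serialize_value() does more than copy bytes):
```
#include <string>
#include <string_view>
#include <type_traits>
#include <vector>

// Stand-ins for Scylla's byte types, for illustration only.
using bytes = std::string;
using bytes_view = std::string_view;
struct managed_bytes_view_stub {      // hypothetical fragmented view
    std::vector<bytes_view> fragments;
};

// One template serves both callers; the branch is resolved at compile time,
// so nothing along the long call chain needs to change.
template <typename View>
void serialize_value(const View& v, bytes& out) {
    if constexpr (std::is_same_v<View, managed_bytes_view_stub>) {
        for (bytes_view frag : v.fragments) { // fragmented: copy piecewise
            out.append(frag);
        }
    } else {
        out.append(v);                        // contiguous: copy directly
    }
}
```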
operator bytes_view (in keys) is converted to operator
managed_bytes_view, allowing callers to defer or avoid
linearization.
bytes_view can convert to managed_bytes_view, so the change
is compatible with the existing representation and the next
patches, which change compound types to use managed_bytes_view.
This operator has a single purpose: an easier port of legacy_compound_view
from bytes_view to managed_bytes_view.
It is inefficient and should be removed as soon as legacy_compound_view stops
using operator[].
managed_bytes_view is a non-owning view into managed_bytes.
It can also be implicitly constructed from bytes_view.
It conforms to the FragmentedView concept and is mainly used through that
interface.
It will be used as a replacement for bytes_view occurrences currently
obtained by linearizing managed_bytes.
This is a preparation for the upcoming introduction of managed_bytes_view,
intended as a fragmented replacement for bytes_view.
To ease the transition, we want both types to give equal hashes for equal
contents.
Unset values for key and value were not handled. Handle them in a
manner matching Cassandra.
This fixes all cases in testMapWithUnsetValues, so re-enable it (and
fix a comment typo in it).
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the right-hand side of IN is an unset value, we must report an
error, like Cassandra does.
This fixes testListWithUnsetValues, so re-enable it.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Make the bind() operation of the scalar marker handle the unset-value
case (which it previously didn't).
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Avoid crash described in #7740 by ignoring the update when the
element-to-remove is UNSET_VALUE.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Since we hadn't implemented parse-error handling in the redis protocol parser, the reply message was broken on a parse error.
Implement parse-error handling, so the error message is replied correctly.
Fixes#7861
Fixes#7114
Closes#7862
When the clustering order is reversed on a map column, the column type
is reversed_type_impl, not map_type_impl. Therefore, we have to check
for both reversed type and map type in some places.
This patch handles reverse types in enough places to make
test_clustering_key_reverse_frozen_map pass. However, it leaves
other places (invocations of is_map() and *_cast<map_type_impl>())
as they currently are; some are protected by callers from being
invoked on reverse types, but some are quite possibly bugs untriggered
by existing tests.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the clustering order is reversed on a list column, the column type
is reversed_type_impl, not list_type_impl. Therefore, we have to check
for both reversed type and list type in some places.
This patch handles reverse types in enough places to make
test_clustering_key_reverse_frozen_list pass. However, it leaves
other places (invocations of is_list() and *_cast<list_type_impl>())
as they currently are; some are protected by callers from being
invoked on reverse types, but some are quite possibly bugs untriggered
by existing tests.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the clustering order is reversed on a set column, the column type
is reversed_type_impl, not set_type_impl. Therefore, we have to check
for both reversed type and set type in some places.
To make such checks easier, add convenience methods self_or_reversed()
and as() to abstract_type. Invoke those methods (instead of is_set()
and casts) enough to make test_clustering_key_reverse_frozen_set pass.
Leave other invocations of is_set() and *_cast<set_type_impl>() as
they are; some are protected by callers from being invoked on reverse
types, but some are quite possibly bugs untriggered by existing tests.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
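The unwrap-then-check pattern these three patches converge on can be sketched like this (hypothetical stand-ins for the abstract_type hierarchy):
```
// Hypothetical stand-in for Scylla's abstract_type hierarchy.
struct abstract_type {
    virtual ~abstract_type() = default;
    virtual bool is_set() const { return false; }
    // For a reversed type, the wrapped underlying type; otherwise *this.
    virtual const abstract_type& self_or_reversed() const { return *this; }
};

struct set_type : abstract_type {
    bool is_set() const override { return true; }
};

struct reversed_type : abstract_type {
    const abstract_type& underlying;
    explicit reversed_type(const abstract_type& u) : underlying(u) {}
    const abstract_type& self_or_reversed() const override { return underlying; }
};

// Checks must look through the reversed wrapper, or a column declared
// WITH CLUSTERING ORDER BY (c DESC) is never recognized as a set.
bool column_is_set(const abstract_type& t) {
    return t.self_or_reversed().is_set();
}
```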
Adds a second RP (replay position) to table, marking where we flushed last. If a new flush request comes in that is below this mark, we can skip a second flush.
This is to (in the future) support incremental CL flush.
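The skip condition is essentially a replay-position comparison; a minimal sketch, assuming a totally ordered replay position:
```
#include <cstdint>

// Hypothetical replay position: (segment id, offset), ordered lexicographically.
struct replay_position {
    uint64_t segment;
    uint32_t offset;
    friend bool operator<=(replay_position a, replay_position b) {
        return a.segment < b.segment ||
               (a.segment == b.segment && a.offset <= b.offset);
    }
};

// If everything up to `request` was already flushed, a second flush is a no-op.
bool flush_needed(replay_position request, replay_position flushed_mark) {
    return !(request <= flushed_mark);
}
```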
This patch enables select cql statements where collection columns are
selected columns in queries where the clustering column is restricted by the "IN" cql operator. Such queries are accepted by cassandra since v4.0.
The internals actually provide correct support for this feature already,
this patch simply removes relevant cql query check.
Tests: cql-pytest (testInRestrictionWithCollection)
Fixes#7743
Fixes#4251
Signed-off-by: Vojtech Havel <vojtahavel@gmail.com>
Message-Id: <20210104223422.81519-1-vojtahavel@gmail.com>
* seastar d1b5d41b...a2fc9d72 (6):
> perftune.py: support passing multiple --nic options to tune multiple interfaces at once
> perftune.py recognize and sort IRQs for Mellanox NICs
> perftune.py: refactor getting of driver name into __get_driver_name()
Fixes#6266
> install-dependencies: support Manjaro
> append_challenged_posix_file_impl: optimize_queue: use max of sloppy_size_hint and speculative_size
> future: do_until: handle exception in stop condition
"
The size_estimates_mutation_reader calls the global proxy to get the database from it. The database is used to find the keyspaces to work with. However, it's safe to keep a local database reference on the reader itself.
tests: unit(debug)
"
* 'br-no-proxy-in-size-estimate-reader' of https://github.com/xemul/scylla:
size_estimate_reader: Use local db reference not global
size_estimate_reader: Keep database reference on mutation reader
size_estimate_reader: Keep database reference on virtual_reader
Conversions from views to owners have no business being implicit.
Besides, they would also cause various ambiguity problems when adding
managed_bytes_view.
From now on, memtable flush will use the strategy's interposer consumer
iff split_during_flush is enabled (disabled by default).
It has an effect only for TWCS users, as TWCS is the only strategy that implements this interposer consumer, which consists of segregating data according to the window configuration.
Fixes#4617.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
As a preparation for interposer on flush, let's allow database write monitor
to store a shared sstable write permit, which will be released as soon as
any of the sstable writers reach the sealing stage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In this patch, we port validation/entities/collection_test.java, containing
7 tests for CQL counters. Happily, these tests did not uncover any bugs in
Scylla and all pass on both Cassandra and Scylla.
There is one small difference that I decided to ignore instead of reporting
a bug. If you try a CREATE TABLE with both counter and non-counter columns,
Scylla gives a ConfigurationException error, while Cassandra gives a more
reasonable InvalidRequest. The ported test currently allows both.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201223181325.3148928-1-nyh@scylladb.com>
In issue #7843 there were questions raised about how much Scylla supports the notion of Unicode Equivalence, a.k.a. Unicode normalization.
Consider the Spanish letter ñ - it can be represented by a single Unicode
character 00F1, but can also be represented as a 006E (lowercase "n")
followed by a 0303 ("combining tilde"). Unicode specifies that these
two representations should be considered "equivalent" for purposes of
sorting or searching. But the following tests demonstrate that this
is not, in fact, supported in Scylla or Cassandra:
1. If you use one representation as the key, then looking up the other one
will not find the row. Scylla (and Cassandra) do *not* consider
the two strings equivalent.
2. The LIKE operator (a Scylla-only extension) doesn't know that
the single-character ñ begins with an n, or that the two-character
ñ is just a single character.
This is despite the thinking on #7843 that, by using ICU in the implementation of LIKE, we somehow got support for this. We didn't.
Refs #7843
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201229125330.3401954-1-nyh@scylladb.com>
This patch adds a reproducer for issue #7856, which is about frozen sets and how in Scylla (but not in Cassandra) we can insert one in the "wrong" order, but only in very specific circumstances which this reproducer demonstrates: the bug can only be reproduced in a nested frozen collection, and using prepared statements.
Refs #7856
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201231085500.3514263-1-nyh@scylladb.com>
Tracking both min and max timestamp will be required for memtable flush
to short-circuit interposer consumer if needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This extension is needed for future work where a memtable will be segregated
during flush into one sstable or more. So now multiple sstables can be added
to the set after a memtable flush, and compaction is only triggered at the
end.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
- the mystical `accept_predicate` is renamed to `accept_keyspace`
to be more self-descriptive
- a short comment is added to the original calculate_schema_digest
function header, mentioning that it computes schema digest
for non-system keyspaces
Refs #7854
Message-Id: <04f1435952940c64afd223bd10a315c3681b1bef.1609763443.git.sarna@scylladb.com>
In scylla-jmx, we fixed a hardcoded sysconfdir in the EnvironmentFile path; realpath was used to convert the path. This patch changes the scylla repo to use realpath as well, to make it consistent with scylla-jmx.
Suggested-by: Pekka Enberg <penberg@scylladb.com>
Signed-off-by: Amos Kong <amos@scylladb.com>
Closes#7860
The original idea for `schema_change_test` was to ensure that **if** schema hasn't changed, the digest also remained unchanged. However, a cumbersome side effect of adding an internal distributed table (or altering one) is that all digests in `schema_change_test` are immediately invalid, because the schema changed.
Until now, each time a distributed system table was added/amended, a new test case for `schema_change_test` was generated, but this effort is not worth the effect - when a distributed system table is added, it will always propagate on its own, so generating a new test case does not bring any tangible new test coverage - it's just a pain.
To avoid this pain, `schema_change_test` now explicitly skips all internal keyspaces - which includes internal distributed tables - when calculating schema digest. That way, patches which change the way of computing the digest itself will still require adding a new test case, which is good, but, at the same time, changes to distributed tables will not force the developers to introduce needless schema features just for the sake of this test.
Tests:
* unit(dev)
* manual(rebasing on top of a change which adds two distributed system tables - all tests still passed)
Refs #7617
Closes#7854
* github.com:scylladb/scylla:
schema_change_test: skip distributed system tables in digest
schema_tables: allow custom predicates in schema digest calc
alternator: drop unneeded sstring creation
system_keyspace: migrate helper functions to string_view
database: migrate find_keyspace to string views
With previous design of the schema change test, a regeneration
was necessary each time a new distributed system table was added.
It was not the original purpose of the test to keep track of new
distributed tables which simply propagate on their own,
so the test case is now modified: internal distributed tables
are not part of the schema digest anymore, which means that
changes inside them will not cause mismatches.
This change involves a one-shot regeneration of all digests,
which due to historical reasons included internal distributed
tables in the digest, but no further regenerations should ever
be necessary when a new internal distributed table is added.
For testing purposes it would be useful to be able to skip computing
schema for certain tables (namely, internal distributed tables).
In order to allow that, a function which accepts a custom predicate
is added.
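The predicate-based calculation can be sketched as follows (a toy digest over keyspace names; the real function hashes schema mutations):
```
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Fold only keyspaces accepted by the caller-supplied predicate into the
// digest, so tests can skip internal distributed tables.
size_t calculate_schema_digest(
        const std::vector<std::string>& keyspaces,
        const std::function<bool(std::string_view)>& accept_keyspace) {
    size_t digest = 0;
    std::hash<std::string_view> h;
    for (const auto& ks : keyspaces) {
        if (!accept_keyspace(ks)) {
            continue; // e.g. skip system/internal keyspaces
        }
        digest ^= h(ks) + 0x9e3779b9 + (digest << 6) + (digest >> 2); // hash combine
    }
    return digest;
}
```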
It's now possible to use string views to check if a particular
table is a system table, so it's no longer needed to explicitly
create an sstring instance.
Functions for checking if the keyspace is system/internal were based
on sstring references, which is impractical compared to string views
and may lead to unnecessary creation of sstring instances.
It looks like the history of the flag begins in Cassandra's
https://issues.apache.org/jira/browse/CASSANDRA-7327 where it is
introduced to speed up tests by not needing to start the gossiper.
The thing is, we always start the gossiper in our cql tests, so the flag only
introduces noise. And, of course, since we want to move schema to use raft,
applying modifications only locally goes against the nature of raft,
so we'd better get rid of the capability ASAP.
Tests: units(dev, debug)
Message-Id: <20201230111101.4037543-2-gleb@scylladb.com>
When a node notices that it uses legacy SI tables, it converts them to the new format, but it updates only the local schema. This will only cause schema discrepancy between nodes; the schema change should propagate globally.
Fixes#7857.
Message-Id: <20201230111101.4037543-1-gleb@scylladb.com>
After 13fa2bec4c, every compaction will be performed through a filtering
reader because consumers cannot do the filtering if interposer consumer is
enabled.
It turns out that filtering_reader is adding significant overhead when regular
compactions are running. As no other compaction type need to actually do
any filtering, let's limit filtering_reader to cleanup compaction.
Alternatively, we could disable interposer consumer on behalf of cleanup,
or add support for the consumers to do the filtering themselves but that
would add lots of complexity.
Fixes#7748.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201230194516.848347-2-raphaelsc@scylladb.com>
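The gist of the change is a per-type decision (a hedged sketch; the actual compaction code wires this through the reader setup):
```
enum class compaction_type { regular, cleanup, scrub, reshape };

// Only cleanup discards data that this node/shard no longer owns, so only
// cleanup pays for wrapping the compacting reader in a filtering reader.
bool needs_filtering_reader(compaction_type type) {
    return type == compaction_type::cleanup;
}
```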
This filter is used to discard data that doesn't belong to the current shard, but scylla will only make an sstable available to regular compaction after it was resharded on either boot or refresh.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201230194516.848347-1-raphaelsc@scylladb.com>
This reverts commit ceb67e7728. The
"epel-release" package is needed to install the "supervisord"
package, which I somehow missed in testing...
Fixes#7851
This patch adds two simple tests for what happens when a user tries to
insert a row with one of the key columns missing. The first test confirms
that if the column is completely missing, we correctly print an error
(this was issue #3665, that was already marked fixed).
However, the second test demonstrates that we still have a bug when
the key column appears on the command, but with a null value.
In this case, instead of failing the insert (as Cassandra does),
we silently ignore it. This is the proper behavior for UNSET_VALUE,
but not for null. So the second test is marked xfail, and I opened
issue #7852 about it.
Refs #3665
Refs #7852
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201230132350.3463906-1-nyh@scylladb.com>
In a previous version of test_using_timeout.py, we had tables pre-filled with some content labeled "everything". The current version of the tests doesn't use it, so drop it completely.
One test, test_per_query_timeout_large_enough, still had code that did
res = list(cql.execute(f"SELECT * FROM {table} USING TIMEOUT 24h"))
assert res == everything
This was a bug - it only works as expected if this test is run before any other test, and will fail if we ever reorder or parallelize these tests. So drop these two lines.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201229145435.3421185-1-nyh@scylladb.com>
This miniseries rewrites two alternator request handlers from seastar threads to coroutines - since these handlers are not on a hot path and using seastar threads is way too heavy for such a simple routine.
NOTE: this pull request obviously has to wait until coroutines are fully supported in Seastar/Scylla.
Closes#7453
* github.com:scylladb/scylla:
alternator: coroutinize untagging a resource
alternator: coroutinize tagging a resource
node_exporter had been added to scylla-server package by commit
95197a09c9.
So we can enable it by default for offline installation.
Closes#7832
* github.com:scylladb/scylla:
scylla_setup: cleanup if judgments
scylla_setup: enable node_exporter for offline installation
On every compaction completion, sstable set is rebuilt from scratch.
With LCS and ~160G of data per shard, it means we'll have to create
a new sstable set with ~1000 entries whenever compaction completes,
which will likely result in reactor stalling for a significant
amount of time.
Fixes#7758.
Closes#7842
* github.com:scylladb/scylla:
table: Fix potential reactor stall on LCS compaction completion
table: decouple preparation from execution when updating sstable set
table: change rebuild_sstable_list to return new sstable set
row_cache: allow external updater to decouple preparation from execution
The range_tombstone_list always (unless misused?) contains de-overlapped
entries. There's a test_add_random that checks this, but it suffers from
several problems:
- generated "random" ranges are sequential and may only overlap on
their borders
- the test uses keys of the same prefix length
Enhance the generator part to produce a purely random sequence of ranges with bound keys of arbitrary length. Just take care to generate "valid" individual ranges, whose start is not ahead of the end.
Also -- rename the test to reflect what it's doing and increase the
number of iterations.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201228115525.20327-1-xemul@scylladb.com>
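The generator's core invariant - arbitrary random bounds, but start never ahead of end - can be sketched like this (integers standing in for clustering bound keys):
```
#include <random>
#include <utility>

// Produce a purely random yet valid range: both bounds are independent and
// arbitrary, but start <= end always holds.
std::pair<int, int> random_valid_range(std::mt19937& rng, int max_key) {
    std::uniform_int_distribution<int> dist(0, max_key);
    int start = dist(rng);
    int end = dist(rng);
    if (start > end) {
        std::swap(start, end); // keep the individual range valid
    }
    return {start, end};
}
```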
On every compaction completion, sstable set is rebuilt from scratch.
With LCS and ~160G of data per shard, it means we'll have to create
a new sstable set with ~1000 entries whenever compaction completes,
which will likely result in reactor stalling for a significant
amount of time.
This is fixed by futurizing build_new_sstable_list(), so it will
yield whenever needed.
Fixes#7758.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Row cache now allows the updater to first prepare the work, and then execute the update atomically as the last step. Let's do that when rebuilding the set: the new set is created in the preparation phase, and it replaces the old one in the execution phase, satisfying the atomicity requirement of row cache.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The procedure is changed to return the new set, so the caller will be responsible for replacing the old set with the new one. This will allow our future work where building the new set and enabling it will be decoupled.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
External updater may do some preparatory work like constructing a new sstable list,
and at the end atomically replace the old list by the new one.
Decoupling the preparation from execution will give us the following benefits:
- the preparation step can now yield if needed to avoid reactor stalls, as it's
been futurized.
- the execution step will now be able to provide strong exception guarantees, as
it's now decoupled from the preparation step which can be non-exception-safe.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
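A hedged sketch of the decoupled interface (hypothetical names; the real updater lives in the row_cache code):
```
#include <seastar/core/future.hh>

// Preparation is futurized: it may yield to avoid reactor stalls and may
// fail. Execution atomically installs the prepared state and must not throw.
struct external_updater_sketch {
    virtual ~external_updater_sketch() = default;
    virtual seastar::future<> prepare() = 0;  // e.g. build the new sstable set
    virtual void execute() noexcept = 0;      // swap the old set for the new one
};
```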
The CQL tests in test/cql-pytest use the Python CQL driver's default
timeout for execute(), which is 10 seconds. This usually more than
enough. However, in extreme cases like noted in issue #7838, 10
seconds may not be enough. In that issue, we run a very slow debug
build on a very slow test machine, and encounter a very slow request
(a DROP KEYSPACE that needs to drop multiple tables).
So this patch increases the default timeout to an even larger
120 seconds. We don't care that this timeout is ridiculously
large - under normal operations it will never be reached, there
is no code which loops for this amount of time for example.
Tested that this patch fixes#7838 by choosing a much lower timeout
(1 second) and reproducing test failures caused by timeouts.
Fixes#7838.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201228090847.3234862-1-nyh@scylladb.com>
MAX_LEVELS is the level count, but the sstable level (index) starts from 0. So the maximum valid level is MAX_LEVELS - 1.
Signed-off-by: Amos Kong <amos@scylladb.com>
Closes#7833
sstable_writer may depend on the sstable throughout its whole lifecycle. If the sstable is freed before the sstable_writer, we might hit a use-after-free as in the following case:
```
std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator+=(long) at /usr/include/c++/10/bits/stl_deque.h:240
(inlined by) std::operator+(std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*> const&, long) at /usr/include/c++/10/bits/stl_deque.h:378
(inlined by) std::_Deque_iterator<sstables::compression::segmented_offsets::bucket, sstables::compression::segmented_offsets::bucket&, sstables::compression::segmented_offsets::bucket*>::operator[](long) const at /usr/include/c++/10/bits/stl_deque.h:252
(inlined by) std::deque<sstables::compression::segmented_offsets::bucket, std::allocator<sstables::compression::segmented_offsets::bucket> >::operator[](unsigned long) at /usr/include/c++/10/bits/stl_deque.h:1327
(inlined by) sstables::compression::segmented_offsets::push_back(unsigned long, sstables::compression::segmented_offsets::state&) at ./sstables/compress.cc:214
sstables::compression::segmented_offsets::writer::push_back(unsigned long) at ./sstables/compress.hh:123
(inlined by) compressed_file_data_sink_impl<crc32_utils, (compressed_checksum_mode)1>::put(seastar::temporary_buffer<char>) at ./sstables/compress.cc:519
seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at table.cc:?
(inlined by) seastar::output_stream<char>::put(seastar::temporary_buffer<char>) at ././seastar/include/seastar/core/iostream-impl.hh:432
seastar::output_stream<char>::flush() at table.cc:?
seastar::output_stream<char>::close() at table.cc:?
sstables::file_writer::close() at sstables.cc:?
sstables::mc::writer::~writer() at writer.cc:?
(inlined by) sstables::mc::writer::~writer() at ./sstables/mx/writer.cc:790
sstables::mc::writer::~writer() at writer.cc:?
flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at compaction.cc:?
(inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_destroy() at /usr/include/c++/10/optional:260
(inlined by) std::_Optional_payload_base<sstables::compaction_writer>::_M_reset() at /usr/include/c++/10/optional:280
(inlined by) std::_Optional_payload<sstables::compaction_writer, false, false, false>::~_Optional_payload() at /usr/include/c++/10/optional:401
(inlined by) std::_Optional_base<sstables::compaction_writer, false, false>::~_Optional_base() at /usr/include/c++/10/optional:474
(inlined by) std::optional<sstables::compaction_writer>::~optional() at /usr/include/c++/10/optional:659
(inlined by) sstables::compacting_sstable_writer::~compacting_sstable_writer() at ./sstables/compaction.cc:229
(inlined by) compact_mutation<(emit_only_live_rows)0, (compact_for_sstables)1, sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_mutation() at ././mutation_compactor.hh:468
(inlined by) compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>::~compact_for_compaction() at ././mutation_compactor.hh:538
(inlined by) std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::operator()(compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>*) const at /usr/include/c++/10/bits/unique_ptr.h:85
(inlined by) std::unique_ptr<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer>, std::default_delete<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~unique_ptr() at /usr/include/c++/10/bits/unique_ptr.h:361
(inlined by) stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >::~stable_flattened_mutations_consumer() at ././mutation_reader.hh:342
(inlined by) flat_mutation_reader::impl::consumer_adapter<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >::~consumer_adapter() at ././flat_mutation_reader.hh:201
auto flat_mutation_reader::impl::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:272
(inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter>(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, flat_mutation_reader::no_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:383
(inlined by) auto flat_mutation_reader::consume_in_thread<stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> > >(stable_flattened_mutations_consumer<compact_for_compaction<sstables::compacting_sstable_writer, noop_compacted_fragments_consumer> >, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at ././flat_mutation_reader.hh:389
(inlined by) seastar::future<void> sstables::compaction::setup<noop_compacted_fragments_consumer>(noop_compacted_fragments_consumer)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader)::{lambda()#1}::operator()() at ./sstables/compaction.cc:612
```
What happens here is that:
```
compressed_file_data_sink_impl(output_stream<char> out, sstables::compression* cm, sstables::local_compression lc)
    : _out(std::move(out))
    , _compression_metadata(cm)
    , _offsets(_compression_metadata->offsets.get_writer())
    , _compression(lc)
    , _full_checksum(ChecksumType::init_checksum())
```
_compression_metadata points to a buffer held by the sstable object, and _compression_metadata->offsets.get_writer() returns a writer that keeps a reference to the segmented_offsets in the sstables::compression that is used in the ~writer -> close path.
Fixes#7821
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201227145726.33319-1-bhalevy@scylladb.com>
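One way to express the required lifetime relationship (a sketch of the idea, not necessarily the committed fix) is to have the writer co-own the sstable it writes:
```
#include <memory>
#include <utility>

struct sstable { /* owns compression metadata, segmented offsets, ... */ };

// The writer keeps the sstable alive for its whole lifecycle, so members like
// _compression_metadata cannot dangle in the ~writer -> close() path.
class sstable_writer_sketch {
    std::shared_ptr<sstable> _sst; // co-ownership instead of a raw back-pointer
public:
    explicit sstable_writer_sketch(std::shared_ptr<sstable> sst)
        : _sst(std::move(sst)) {}
};
```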
When applying a mutation partition to another, if a dummy entry from the source falls into a continuous destination range, it can just be dropped. However, the current implementation still inserts it and then instantly removes it.
Relax this code-flow by dropping the unwanted entry without tossing it around.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224130438.11389-1-xemul@scylladb.com>
node_exporter had been added to scylla-server package by commit
95197a09c9.
So we can enable it by default for offline installation.
Signed-off-by: Amos Kong <amos@scylladb.com>
When a rows_entry is added to row_cache it's constructed from a clustering_row by unpacking all its internals and putting them into the rows_entry's deletable_row. There's a shorter way -- the clustering_row already has the deletable_row onboard, from which rows_entry can copy-construct its own.
This lets us keep the rows_entry and deletable_row sets of constructors a bit shorter.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224161112.20394-1-xemul@scylladb.com>
"
This series does a lot of cleanups, dead code removal, and, most importantly, fixes the following things in the IDL compiler tool:
* The grammar now rejects invalid identifiers, which, in some cases, allowed writing things like `std:vector`.
* Error reporting is improved significantly and failures now point to the place of failure much more accurately. This is done by disabling rule backtracking for those rules which don't need it.
"
* 'idl-compiler-minor-fixes-v4' of https://github.com/ManManson/scylla:
idl: move enum and class serializer code writers to the corresponding AST classes
idl: extract writer functions for `write`, `read` and `skip` impls for classes and enums
idl: minor fixes and code simplification
idl: change argument name from `hout` to `cout` in all dependencies of `add_visitors` fn
idl: fix parsing of basic types and discard unneeded terminals
idl: remove unused functions
idl: improve error tracing in the grammar and tighten-up some grammar rules
idl: remove redundant `set_namespace` function
idl: remove unused `declare_class` function
idl: slightly change `str` and `repr` for AST types
idl: place directly executed init code into if __name__=="__main__"
For a connection-less environment, we need to add the node_exporter binary to the scylla-server package, not download it from the internet.
Related #7765
Fixes#2190
Closes#7796
It turns out that `cql_table_large_data_handler::record_large_rows`
and `cql_table_large_data_handler::record_large_cells` were broken
for reporting static cells and static rows from the very beginning:
In case a large static cell or a large static row is encountered,
it tries to execute `db::try_record` with `nullptr` additional values,
denoting that there is no clustering key to be recorded.
These values are next passed to `qctx.execute_cql()`, which
creates `data_value` instances for each statement parameter,
hence invoking `data_value(nullptr)`.
This uses the `const char*` overload, which delegates to the `std::string_view` ctor overload. It is UB to pass a `nullptr` pointer to the `std::string_view` ctor, hence leading to segmentation faults in the aforementioned large data reporting code.
What we want here is to make a null `data_value` instead, so
just add an overload specifically for `std::nullptr_t`, which
will create a null `data_value` with `text` type.
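The overload-resolution trap and the fix can be sketched with a simplified stand-in for data_value:
```
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Without the nullptr_t overload, data_value(nullptr) picks the const char*
// ctor and feeds nullptr to std::string_view - undefined behavior.
class data_value_sketch {
    std::optional<std::string> _value; // nullopt models a CQL null
public:
    data_value_sketch(std::string_view v) : _value(std::string(v)) {}
    data_value_sketch(const char* s) : data_value_sketch(std::string_view(s)) {} // UB if s == nullptr
    data_value_sketch(std::nullptr_t) {} // explicit null: no string_view involved
    bool is_null() const { return !_value.has_value(); }
};
```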
A regression test is provided for the issue (written in
`cql-pytest` framework).
Tests: test/cql-pytest/test_large_cells_rows.py
Fixes: #6780
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201223204552.61081-1-pa.solodovnikov@scylladb.com>
In Alternator's expression parser in alternator/expressions.g, a list can be
indexed by a '[' INTEGER ']'. I had doubts whether maybe a value-reference
for the index, e.g., "something[:xyz]", should also work. So this patch adds
a test that checks whether "something[:xyz]" works, and confirms that both
DynamoDB and Alternator don't accept it and consider it a syntax error.
So Alternator's parser is correct to insist that the index be a literal
integer.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201214100302.2807647-1-nyh@scylladb.com>
* seastar 2bd8c8d088...1f5e3d3419 (5):
> Merge "Avoid fair-queue rovers overflow if not configured" from Pavel E
> doc: add a coroutines section to the tutorial
> Merge "tests/perf: add random-seed config option" from Benny
> iotune: Print parameters affecting the measurement results
> cook: Add patch cmd for ragel build (signed char confusion on aarch64)
Expand the role of AST classes to also supply methods for actually
generating the code. More changes will follow eventually until
all generation code is handled by these classes.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
"
We've encountered a number of reactor stalls
related to token_metadata that were fixed
in 052a8d036d.
This is a follow-up series that adds a clear_gently
method to token_metadata that uses continuations
to prevent reactor stalls when destroying token_metadata
objects.
Test: unit(dev), {network_topology_strategy,storage_proxy}_test(debug)
"
* tag 'token_metadata_clear_gently-v3' of github.com:bhalevy/scylla:
token_metadata: add clear_gently
token_metadata: shared_token_metadata: add mutate_token_metadata
token_metdata: futurize update_normal_tokens
abstract_replication_strategy: get_pending_address_ranges: invoke clone_only_token_map if can_yield
repair: replace_with_repair: convert to coroutine
A previous patch added test/cql-pytest/cassandra_tests - a framework for
porting Cassandra's unit tests to Python - but only ported two tiny test
files with just 3 tests. In this patch, we finally port a much larger
test file validation/entities/collection_test.java. This file includes
50 separate tests, which cover a lot of aspects of collection support,
as well as how other features interact with collections.
As of now, 23 (!) of these 50 tests fail; they exposed six new issues in Scylla which I carefully documented:
Refs #7735: CQL parser missing support for Cassandra 3.10's new "+=" syntax
Refs #7740: CQL prepared statements incomplete support for "unset" values
Refs #7743: Restrictions missing support for "IN" on tables with
collections, added in Cassandra 4.0
Refs #7745: Length of map keys and set items are incorrectly limited to 64K
in unprepared CQL
Refs #7747: Handling of multiple list updates in a single request differs
from recent Cassandra
Refs #7751: Allow selecting map values and set elements, like in
Cassandra 4.0
These issues vary in severity - some are simply new Cassandra 4.0 features
that Scylla never implemented, but one (#7740) is an old Cassandra 2.2
feature which it seems we did not implement correctly in some cases that
involve collections.
Note that there are some things that the ported tests do not include.
In a handful of places there are things which the Python driver checks before sending a request - not giving us an opportunity to check how
the server handles such errors. Another notable change in this port is
that the original tests repeated a lot of tests with and without a
"nodetool flush". In this port I chose to stub the flush() function -
it does NOT flush. I think the point of these tests is to check the
correctness of the CQL features - *not* to verify that memtable flush
works correctly. Doing a real memtable flush is not only slow, it also
doesn't really check much (Scylla may still serve data from cache,
not sstables). So I decided it is pointless.
An important goal of this patch is that all 50 tests (except three
skipped tests because Python has client-side checking), pass when
run on Cassandra (with test/cql-pytest/run-cassandra). This is very
important: It was very easy to make mistakes while porting the tests,
and I did make many such mistakes; but running them against Cassandra
allowed me to fix those mistakes - because the correct tests should
pass on Cassandra. And now they do.
Unfortunately, the new tests are significantly slower than what we've been accustomed to in Alternator/CQL tests. The 50 tests create more than a hundred tables, udfs, udts, and similar slow operations - they do not reuse anything via fixtures. The total time for these 50 tests (in dev build mode) is around 18 seconds. Just one test - testMapWithLargePartition - is responsible for almost half (!) of that time; we should consider in the future whether it's worth it or can be made smaller.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201215155802.2867386-1-nyh@scylladb.com>
clear_gently gently clears the token_metadata members.
It uses continuations to allow yielding if needed
to prevent reactor stalls.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
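The yielding pattern can be sketched with a seastar coroutine (a sketch, assuming a recent seastar with coroutine::maybe_yield()):
```
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <unordered_map>

// Erase entries one at a time, yielding to the reactor between steps, so
// clearing a huge container never stalls the reactor.
template <typename K, typename V>
seastar::future<> clear_gently_sketch(std::unordered_map<K, V>& m) {
    for (auto it = m.begin(); it != m.end();) {
        it = m.erase(it);
        co_await seastar::coroutine::maybe_yield();
    }
}
```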
mutate_token_metadata acquires the shared_token_metadata lock,
clones the token_metadata (using clone_async)
and calls an asynchronous functor on
the cloned copy of the token_metadata to mutate it.
If the functor is successful, the mutated clone
is set back to the shared_token_metadata,
otherwise, the clone is destroyed.
With that, get rid of shared_token_metadata::clone
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The function's complexity is O(#tokens) in the worst case, as for each endpoint token it traverses _token_to_endpoint_map linearly to erase the endpoint mapping if it exists.
This change renames the current implementation of
update_normal_tokens to update_normal_tokens_sync
and clones the code as a coroutine that returns a future
and may yield if needed.
Eventually we should futurize the whole token_metadata
and abstract_replication_strategy interface and get rid
of the synchronous functions. Until then, the sync version is still required for call sites that neither return a future nor run in a seastar thread.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add the seastar-cpu-map.sh to the SBINFILES variable, which is used to
create symbolic links to scripts so that they appear in $PATH.
Please note that there are additional Python scripts (like perftune.py),
which are not in $PATH. That's because Python scripts are handled
separately in "install.sh" and no Python script has a "sbin" symlink. We
might want to change this in the future, though.
Fixes#6731
Closes#7809
This is a temporary scaffold for weaning ourselves off
linearization. It differs from with_linearized_managed_bytes in
that it does not rely on the environment (linearization_context)
and so is easier to remove.
We use managed_bytes::data() in a few places when we know the
data is non-fragmented (such as when the small buffer optimization
is in use). We'd like to remove managed_bytes::data() as linearization
is bad, so in preparation for that, replace internal uses of data()
with the equivalent direct access.
do_linearize() is an impure function as it changes state
in linearization_context. Extract the pure parts into a new
do_linearize_pure(). This will be used to linearize managed_bytes
without a linearization_context, during the transition period where
fragmented and non-fragmented values coexist.
do_get_value() is careful to return a fragmented view, but
its only caller get_value_from_mutation() linearizes it immediately
afterwards. Linearize it sooner; this prevents mixing in
fragmented values from cells (now via IMR) and fragmented values
from partition/clustering keys. It only works now because
keys are not fragmented outside LSA, and value_view has a special
case for single-fragment values.
This helps when keys become fragmented.
Converting from bytes to bytes is nonsensical, but it helps
when transitioning to other types (managed_bytes/managed_bytes_view),
and these types will have to_bytes() conversions.
We introduce a new single-key sstable reader for sstables created by `TimeWindowCompactionStrategy`.
The reader uses the fact that sstables created by TWCS are mostly disjoint with respect to the contained `position_in_partition`s in order to avoid having multiple sstable readers opened at the same time unnecessarily. In case there are overlapping ranges (for example, in the current time-window), it performs the necessary merging (it uses `clustering_order_reader_merger`, introduced recently).
The reader uses min/max clustering key metadata present in `md` sstables in order to decide when to open or close a sstable reader.
The following experiment was performed:
1. create a TWCS table with 1 minute windows
2. fill the table with 8 equal windows of data
(each window flushed to a separate sstable)
3. perform `select * from ks.t where pk = 0 limit 1` query
with and without the change
The expectation is that with the commit, only one sstable will be opened
to fetch that one row; without the commit all 8 sstables would be opened at once.
The difference in the value of `scylla_reactor_aio_bytes_read` was measured
(value after the query minus value before the query), both with and without the commit.
With the commit, the difference was 67584.
Without the commit, the difference was 528384.
528384 / 67584 ~= 7.8.
Fixes#6418.
Closes#7437
* github.com:scylladb/scylla:
sstables: gather clustering key filtering statistics in TWCS single key reader
sstables: use time_series_sstable_set in time_window_compaction_strategy
sstable_set: new reader for TWCS single partition queries
mutation_reader_test: test clustering_order_reader_merger with time_series_sstable_set
sstable_set: introduce min_position_reader_queue
sstable_set: introduce time_series_sstable_set
sstables: add min_position and max_position accessors
sstable_set: make create_single_key_sstable_reader a virtual method
clustering_order_reader_merger: fix the 0 readers case
The following experiment was performed:
1. create a TWCS table with 1 minute windows
2. fill the table with 8 windows of data
(each window flushed to a separate sstable)
3. perform `select * from ks.t where pk = 0 limit 1` query
with and without the change
The expectation is that with the commit, only one sstable will be opened
to fetch that one row; without the commit all 8 sstables would be opened at once.
The difference in the value of `scylla_reactor_aio_bytes_read` was measured
(value after the query minus value before the query), both with and without the commit.
With the commit, the difference was 67584.
Without the commit, the difference was 528384.
528384 / 67584 ~= 7.8.
Fixes https://github.com/scylladb/scylla/issues/6418.
This commit introduces a new implementation of `create_single_key_sstable_reader`
in `time_series_sstable_set` dedicated for TWCS-created sstables.
It uses the fact that such sstables are mostly disjoint with respect to
contained `position_in_partition`s in order to decrease the number of
sstable readers that are opened at the same time.
The implementation uses `clustering_order_reader_merger` under the hood.
The reader assumes that the schema does not have static columns and none
of the queried sstables contain partition tombstones; also, it assumes
that the sstables have the min/max clustering key metadata in order for
the implementation to be efficient. Thus, if we detect that some of
these assumptions aren't true, we fall back to the old implementation.
This is a queue of readers of sstables in a time_series_sstable_set,
returning the readers in order of the smallest position_in_partition
that the sstables have. It uses the min/max clustering key sstable
metadata.
The readers are opened lazily, at the moment of being returned.
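The ordering idea - pop sstables by their minimum position and open readers only then - can be sketched as follows (integers standing in for position_in_partition):
```
#include <functional>
#include <queue>
#include <vector>

// Each entry carries the sstable's minimum position (from its metadata) and
// a factory that opens the reader only when the entry is actually popped.
struct queued_sstable {
    int min_position;                  // stand-in for position_in_partition
    std::function<void()> open_reader; // lazily opened on demand
};

struct by_min_position {
    bool operator()(const queued_sstable& a, const queued_sstable& b) const {
        return a.min_position > b.min_position; // min-heap on min_position
    }
};

using min_position_reader_queue_sketch =
    std::priority_queue<queued_sstable, std::vector<queued_sstable>, by_min_position>;
```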
At this moment it is a slightly less efficient version of
bag_sstable_set, but in following commits we will use the new data
structures to gain advantage in single partition queries
for sstables created by TimeWindowCompactionStrategy.
... of sstable_set_impl.
Soon we shall provide a specialized implementation in one of the
`sstable_set_impl` derived classes.
The existing implementation is used as the default one.
There were two problems with handling conflicting equalities on the same PK column (eg, c=1 AND c=0):
1. When the column is indexed, Scylla crashed (#7772)
2. Computing ranges and slices was throwing an exception
This series fixes them both; it also happens to resolve some old TODOs from restriction_test.
Tests: unit (dev, debug)
Closes#7804
* github.com:scylladb/scylla:
cql3: Fix value_for when restriction is impossible
cql3: Fix range computation for p=1 AND p=1
Previously, single_column_restrictions::value_for() assumed that a
column's restriction specifies exactly one value for the column. But
since 37ebe521e3, multiple equalities on the same column are allowed,
so the restriction could be a conjunction of conflicting
equalities (eg, c=1 AND c=0). That violates an assert and crashes
Scylla.
This patch fixes value_for() by gracefully handling the
impossible-restriction case.
Fixes#7772
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Previously compute_bounds was assuming that primary-key columns are
restricted by exactly one equality, resulting in the following error:
query 'select p from t where p=1 and p=1' failed:
std::bad_variant_access (std::get: wrong index for variant)
This patch removes that assumption and deals correctly with the
multiple-equalities case. As a byproduct, it also stops raising
"invalid null value" exceptions for null RHS values.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Split `write`, `read` and `skip` serializer function writers into
separate functions in `handle_class` and `handle_enum` functions,
which slightly improves readability.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
* Introduce `ns_qualified_name` and `template_params_str` functions
to simplify code a little bit in `handle_enum` and `handle_class`
functions.
* Previously each serializer had a separate namespace open-close
statements, unify them into a single namespace scope.
* Fix a few more `hout` -> `cout` argument names.
* Rename `template` pattern to `template_decl` to improve clarity.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Prior to the patch all functions that are called from `add_visitors`
and this function itself declared the argument denoting the output
file as `hout`. This was quite misleading, though, since `hout`
is meant to be the header file with declarations, while `cout` is an
implementation file.
These functions write to the implementation file, hence `hout` should
be changed to `cout` to avoid confusion.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Prior to the patch `btype` production was using `with_colon`
rule, which accidentally supported parsing both numbers and
identifiers (along with other invalid inputs, such as "123asd").
It was changed to use `ns_qualified_ident`, and those places
which can accept numeric constants are explicitly listing
it as an alternative, e.g. the template parameter list.
Unfortunately, I had to make TemplateType explicitly construct
`BasicType` instances from numeric constants in the template arguments
list. This is exactly the way it was handled before, though.
But nonetheless, this should be addressed sometime later.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Remove the following functions since they are not used:
* `open_namespaces`
* `close_namespaces`
* `flat_template`
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This patch replaces some handwritten rules with their
alternatives already defined in the `pyparsing.pyparsing_common`
class, i.e. the `number` and `identifier` productions.
Changed ignore patterns for comments to use pre-defined
`pp.cppStyleComment` instead of hand-written combination of
'//'-style and C-style comment rules.
Operator '-' is now used whenever possible to improve the debugging
experience: it disables default backtracking for productions,
so the parser fails earlier and can point more precisely
to the place in the input string where it failed, instead of
backtracking to the top-level rule and reporting the error there.
Template names and class names now use `ns_qualified_ident`
rule instead of `with_colon` which prevents grammar from
matching invalid identifiers, such as `std:vector`.
Many places are using the updated `identifier` production, which
is working correctly unlike its predecessor: now inputs
such as `1ident` are considered invalid.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Surround the string representation with angle brackets. This improves
readability when printing debug output.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Since the idl compiler is not intended to be used as a module by other
python build scripts, move the initialization code under an if statement
checking that the current module name is "__main__".
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
When recycling a segment in O_DSYNC mode, if the size of the segment
is neither shrunk nor grown, avoid calling file::truncate() or
file::allocate().
Message-Id: <20201215182332.1017339-2-kostja@scylladb.com>
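A rough sketch of the size check described above; the file type here is an invented stand-in for the real file API:
```c++
#include <cstdint>

// Stand-in for the file API; truncate/allocate model file::truncate()
// and file::allocate().
struct file_stub {
    uint64_t size = 0;
    void truncate(uint64_t n) { size = n; }
    void allocate(uint64_t n) { size = n; }
};

// Only touch the file when the recycled segment actually changes size;
// an unchanged size needs neither truncation nor allocation.
void resize_for_recycle(file_stub& f, uint64_t new_size) {
    if (new_size < f.size) {
        f.truncate(new_size);
    } else if (new_size > f.size) {
        f.allocate(new_size);
    }
    // new_size == f.size: skip the syscall entirely.
}
```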
"
This patch series consists of the following patches:
1. The first one turned out to be a massive rewrite of almost
everything in `idl-compiler.py`. It aims to decouple parser
structures from the internal representation which is used
in the code-generation itself.
Prior to the patch everything was working with raw token lists and
the code was extremely fragile and hard to understand and modify.
Moreover, every change in the parser code caused a cascade effect
of breaking things at many different places, since they were relying
on the exact format of output produced by parsing rules.
Now there is a bunch of supplementary AST structures which provide
hierarchical and strongly typed structure as the output of parsing
routine.
It is much easier to verify (by means of `isinstance`, for example)
and extend since the internal structures used in code-generation are
decoupled from the structure of parsing rules, which are now controlled
by custom parse actions providing high-level abstractions.
It is tested manually by checking that the old code produces exactly
the same autogenerated sources for all Scylla IDLs as the new one.
2 and 3. Cosmetic changes only: fixed a few typos and moved from
old-fashioned `string.Template` to python f-strings.
This improves readability of the idl-compiler code by a lot.
Only one non-functional whitespace change introduced.
4. This patch adds very basic support for the parser to
understand `const` specifier in case it's used with a template
parameter for a data member in a class, e.g.
```c++
struct my_struct {
    std::vector<const raft::log_entry> entries;
};
```
It actually does two things:
* Adjusts `static_asserts` in corresponding serializer methods
to match const-ness of fields.
* Defines a second serializer specialization for const type in
`.dist.hh` right next to non-const one.
This seems to be sufficient for raft-related uses for now.
Please note there is no support for the following cases, though:
```c++
const std::vector<raft::log_entry> entries;
const raft::term_t term;
```
None of the existing IDLs are affected by the change, so
we can gradually improve on the feature and write idl
unit-tests to increase test coverage over time.
5. A basic unit-test that writes a test struct with an
`std::vector<S<const T>>` field and reads it back to verify
that serialization works correctly.
6. Basic documentation for AST classes.
TODO: should also update the docs in `docs/IDL.md`. But it is already
quite outdated, and some changes would even be out of scope for this
patch set.
"
* 'idl-compiler-refactor-v5' of https://github.com/ManManson/scylla:
idl: add docstrings for AST classes
idl: add unit-test for `const` specifiers feature
idl: allow to parse `const` specifiers for template arguments
idl: fix a few typos in idl-compiler
idl: switch from `string.Template` to python f-strings and format string in idl-compiler
idl: Decouple idl-compiler data structures from grammar structure
feed_writer() eats the exception and transforms it into an end of stream
instead. Downstream validators hate when this happens.
Fixes#7482
Message-Id: <20201216090038.GB3244976@scylladb.com>
A tool which lists all partitions contained in an sstable index. As all
partitions in an sstable are indexed, this tool can be used to find out
what partitions are contained in a given sstable.
The printout has the following format:
$pos: $human_readable_value (pk{$raw_hex_value})
Where:
* $pos: the position of the partition in the (decompressed) data file
* $human_readable_value: the human readable partition key
* $raw_hex_value: the raw hexadecimal value of the binary representation
of the partition key
For now the tool requires the types making up the partition key to be
specified on the command line via the `--type|-t` argument,
using the Cassandra type class name notation for types.
As these are not assumed to be widely known, this patch includes a
document mapping all cql3 types to their Cassandra type class name
equivalents (and more).
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201208092323.101349-1-bdenes@scylladb.com>
Fixes#7732
When truncating with auto_snapshot on, we try to verify the low rp mark
from the CF against the sstables discarded by the truncation timestamp.
However, in a scenario like:
Fill memtables
Flush
Truncate with snapshot A
Fill memtables some more
Truncate
Move snapshot A to upload + refresh (load old tables)
Truncate
The last op will assert, because while we have sstables loaded, which
will be discarded now, we did not in fact generate any _new_ ones
(since memtables are empty), and the RP we get back from discard is
one from an earlier generation set.
(Any permutation of events that create the situation "empty memtable" +
"non-empty sstables with only old tables" will generate the same error).
Added a check that, before flushing, verifies whether we actually have
any data, and if not, does not enforce the RP relation assert.
Closes#7799
This series makes sure that before the table is dropped, all pending memtable flushes related to its memtables would finish.
Normally, flushes are not problematic in Scylla, because all tables are by default `auto_snapshot=true`, which also implies that a table is flushed before being dropped. However, with `auto_snapshot=false` the flush is not attempted at all. It leads to the following race:
1. Run a node with `auto_snapshot=false`
2. Schedule a memtable flush (e.g. via nodetool)
3. Get preempted in the middle of the flush
4. Drop the table
5. The flush that already started wakes up and starts operating on freed memory, which causes a segfault
Tests: manual (artificially preempting for a long time in bullet point 2, to ensure that the race occurs; segfaults were 100% reproducible before the series and do not happen anymore after the series is applied)
Fixes#7792
Closes#7798
* github.com:scylladb/scylla:
database: add flushes to waiting for pending operations
table: unify waiting for pending operations
database: add a phaser for flush operations
database: add waiting for pending streams on table drop
This patch introduces very limited support for declaring `const`
template parameters in data members.
It's not covering all the cases, e.g.
`const type member_variable` and `const template_def<T1, T2, ...>`
syntax is not supported at the moment.
Though the changes are enough for raft-related use: this makes it
possible to declare `std::vector<raft::log_entries_ptr>` (aka
`std::vector<lw_shared_ptr<const raft::log_entry>>`) in the IDL.
Existing IDL files are not affected in any way.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Move to a modern and lightweight syntax of f-strings
introduced in python 3.6. It improves readability and provides
greater flexibility.
A few places are now using format strings instead, though.
When a multiline substitution variable is used, the template
string should first be re-indented, and only after that should the
formatting be applied, or we can end up with broken
indentation in the generated sources.
This change introduces one invisible whitespace change
in `query.dist.impl.hh`, otherwise all generated code is exactly
the same.
Tests: build(dev) and diff generated IDL sources by hand
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Instead of operating on the raw lists of tokens, transform them into
a typed structures representation, which makes the code many orders of
magnitude simpler to read, understand and extend.
This includes sweeping changes throughout the whole source code of the
tool, because almost every function was tightly coupled to the way
data was passed down from the parser right to the code generation
routines.
Tested manually by checking that old generated sources are precisely
the same as the new generated sources.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Pending flushes can participate in races when a table
with auto_snapshot==false is dropped. The race is as follows:
1. A flush of table T is initiated
2. The flush operation is preempted
3. Table T is dropped without flushing, because it has auto_snapshot off
4. The flush operation from (2.) wakes up and continues
working on table T, which is already dropped
5. Segfault/memory corruption
To prevent such races, a phaser for pending flushes is introduced.
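The idea of the phaser can be sketched with standard primitives (Scylla itself builds on seastar; the names here are illustrative):
```c++
#include <condition_variable>
#include <mutex>

// Flushes call enter()/leave(); dropping the table calls await_pending()
// and proceeds only when no flush is in flight.
class flush_phaser {
    std::mutex _m;
    std::condition_variable _cv;
    int _pending = 0;
public:
    void enter() {
        std::lock_guard<std::mutex> l(_m);
        ++_pending;
    }
    void leave() {
        std::lock_guard<std::mutex> l(_m);
        if (--_pending == 0) {
            _cv.notify_all();
        }
    }
    void await_pending() {
        std::unique_lock<std::mutex> l(_m);
        _cv.wait(l, [this] { return _pending == 0; });
    }
};
```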
We already wait for pending reads and writes, so for completeness
we should also wait for all pending stream operations to finish
before dropping the table to avoid inconsistencies.
Download node_exporter in the frozen image to prepare for adding
node_exporter to the relocatable package.
Related #2190
Closes#7765
[avi: updated toolchain, x86_64/aarch64/s390x]
Alternator tracing tests require the cluster to have the 'always'
isolation level configured to work properly. If that's not the case,
the tests will fail due to not having CAS-related traces present
in the logs. In order to help the users fix their configuration,
a helper message is printed before the test case is performed.
Automatic tests do not need this, because they are all run with
a matching isolation level, but this message could greatly improve
the user experience for manual tests.
Message-Id: <62bcbf60e674f57a55c9573852b6a28f99cbf408.1607949754.git.sarna@scylladb.com>
The outcome of alternator tracing tests was that the tracing probability
was always set to 0 after the test finished. That makes sense
for most test runs, but manual tests can work on existing clusters
with the tracing probability set to some other value. To preserve the
previous trace probability, the value is now extracted and stored,
so that it can be restored after the test is done.
Message-Id: <94f829b63f92847b4abb3b16f228bf9870f90c2e.1607949754.git.sarna@scylladb.com>
Normally a file size should be aligned to the block size, since
we never write any unaligned size to it. However, we're not
protected against partial writes.
Just to be safe, align up the amount of bytes to zerofill
when recycling a segment.
Message-Id: <20201211142628.608269-4-kostja@scylladb.com>
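For reference, the usual power-of-two align-up computation this relies on:
```c++
#include <cstdint>

// Round v up to the next multiple of a power-of-two alignment.
constexpr uint64_t align_up(uint64_t v, uint64_t a) {
    return (v + a - 1) & ~(a - 1);
}

// A partial write that left 4097 bytes in a 4096-byte-block segment
// gets zerofilled up to the next block boundary:
static_assert(align_up(4097, 4096) == 8192);
static_assert(align_up(4096, 4096) == 4096);
```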
Three tests in test_streams.py run update_table() on a table without
waiting for it to complete, and then call update_table() on the same
table or delete it. This always works in Scylla, and usually works in
AWS, but if we reach the second call, it may fail because the previous
update_table() did not take effect yet. We sometimes see these failures
when running the Alternator test suite against AWS.
So in this patch, after each update_table() we wait for the table
to return from UPDATING to ACTIVE status.
The entire Alternator test suite now passes (or is skipped) on AWS,
so: Fixes#7778.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213164931.2767236-1-nyh@scylladb.com>
The test test_query_filter.py::test_query_filter_paging fails on AWS
and shouldn't fail, so this patch fixes the test. Note that this is
only a test problem - no fix is needed for Alternator itself.
The test reads 20 results with 1-result pages, and assumed that
21 pages are returned. The 21st page may happen because when the
server returns the 20th, it might not yet know there will be no
additional results, so another page is needed - and will be empty.
Still, a different implementation might notice that the last page
completed the iteration, and not return an extra empty page. This is
perfectly fine, and this is what AWS DynamoDB does today - and should
not be considered an error.
Refs #7778
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213143612.2761943-1-nyh@scylladb.com>
When request signature checking is enabled in Alternator, each request
should come with the appropriate Authorization header. Most errors in
preparing this header will result in an InvalidSignatureException
response; but DynamoDB returns a more specific error when this header is
completely missing: MissingAuthenticationTokenException. We should do the
same, but before this patch we returned InvalidSignatureException also for
a missing header.
The test test_authorization.py::test_no_authorization_header used to
enshrine our wrong error message, and failed when run against AWS.
After this patch, we fix the error message and the test - which now
passes against both Alternator and AWS.
Refs #7778.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201213133825.2759357-1-nyh@scylladb.com>
This series allows setting per-query timeout via CQL. It's possible via the existing `USING` clause, which is extended to be available for `SELECT` statement as well. This parameter accepts a duration and can also be provided as a marker.
The parameter acts as a regular part of the `USING` clause, which means that it can be used along with `USING TIMESTAMP` and `USING TTL` without issues.
The series comes with a pytest test suite.
Examples:
```cql
SELECT * FROM t USING TIMEOUT 200ms;
```
```cql
INSERT INTO t(a,b,c) VALUES (1,2,3) USING TIMESTAMP 42 AND TIMEOUT 50ms;
```
Working with prepared statements works as usual - the timeout parameter can be
explicitly defined or provided as a marker:
```cql
SELECT * FROM t USING TIMEOUT ?;
```
```cql
INSERT INTO t(a,b,c) VALUES (?,?,?) USING TIMESTAMP 42 AND TIMEOUT 50ms;
```
Tests: unit(dev)
Fixes#7777
Closes#7781
* github.com:scylladb/scylla:
test: add prepared statement tests to USING TIMEOUT suite
docs: add an entry about USING TIMEOUT
test: add a test suite for USING TIMEOUT
storage_proxy: start propagating local timeouts as timeouts
cql3: allow USING clause for SELECT statement
cql3: add TIMEOUT attribute to the parser
cql3: add per-query timeout to select statement
cql3: add per-query timeout to batch statement
cql3: add per-query timeout to modification statement
cql3: add timeout to cql attributes
First of all, the select statement is extended with an 'attrs' field,
which keeps the per-query attributes. Currently, only the TIMEOUT
parameter is legal to use, since TIMESTAMP and TTL bear no meaning
for reads.
Secondly, if TIMEOUT attribute is set, it will be used as the effective
timeout for a particular query.
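A sketch of how the effective timeout could be picked (hypothetical names; the real attribute plumbing differs):
```c++
#include <chrono>
#include <optional>

using namespace std::chrono;

// Stand-in for the statement's attributes: an optional per-query TIMEOUT.
struct query_attributes {
    std::optional<milliseconds> timeout;  // set by USING TIMEOUT, if present
};

// Pick the per-query timeout when set, otherwise the configured default.
auto effective_deadline(const query_attributes& attrs, milliseconds default_timeout) {
    return steady_clock::now() + attrs.timeout.value_or(default_timeout);
}
```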
1. It's unused since cbe510d1b8
2. It's unsafe to keep a reference to token_metadata&
potentially across yield points.
The higher-level motivation is to make
storage_service::get_token_metadata() private so we
can better control how it's used.
For cdc, if the token_metadata is going to be needed
in the future, it'd be better to get it from
db_context::_proxy.get_token_metadata_ptr().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201213162351.52224-2-bhalevy@scylladb.com>
"
This series fixes use-after-free via token_metadata&
We may currently get a token_metadata& via get_token_metadata() and
use it across yield points in a couple of sites:
- do_decommission_removenode_with_repair
- get_new_source_ranges
To fix that, use get_token_metadata_ptr() and hold on to it
across yielding.
Fixes#7790
Dtest: update_cluster_layout_tests:TestUpdateClusterLayout.simple_removenode_2_test(debug)
Test: unit(dev)
"
* tag 'storage_service-token_metadata_ptr-v2' of github.com:bhalevy/scylla:
storage_service: get_new_source_ranges: don't hold token_metadata& across yield point
storage_service: get_changed_ranges_for_leaving: no need to maybe_yield for each token_range
storage_service: get_changed_ranges_for_leaving: release token_metadata_ptr sooner
storage_service: get_changed_ranges_for_leaving: don't hold token_metadata& across yield
Have the caller provide the token_metadata& to get_new_source_ranges,
keeping it valid throughout the call.
Note that there is no need to clone_only_token_map
since the token_metadata_ptr is immutable and can be
used just as well for calling strat.get_range_addresses.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When yielding in clone_only_token_map or clone_after_all_left,
the token_metadata obtained with get_token_metadata() may go away.
Use get_token_metadata_ptr() instead to hold on to it.
And with that, we don't need to clone_only_token_map.
`metadata` is not modified by calculate_natural_endpoints, so we
can just refer to the immutable copy retrieved with
get_token_metadata_ptr.
Fixes#7790
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
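The lifetime issue and its fix can be illustrated without coroutines by deferring work into a callback; capturing the shared pointer (rather than a reference) keeps the immutable snapshot alive across the deferral. Type names are stand-ins:
```c++
#include <functional>
#include <memory>
#include <vector>

struct token_metadata {};  // immutable snapshot, as in the series above
using token_metadata_ptr = std::shared_ptr<const token_metadata>;

token_metadata_ptr get_token_metadata_ptr() {
    static token_metadata_ptr current = std::make_shared<token_metadata>();
    return current;  // the "current" snapshot may be replaced at any time
}

std::vector<std::function<void()>> deferred_work;  // models a yield point

void schedule_range_calculation() {
    // Capturing a token_metadata& here could dangle once we defer.
    // Capturing the shared pointer keeps the snapshot alive, and since
    // the snapshot is immutable, no clone is needed.
    auto tmptr = get_token_metadata_ptr();
    deferred_work.push_back([tmptr] {
        const token_metadata& tm = *tmptr;  // still valid when run later
        (void)tm;
    });
}
```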
"
The validate_column_family() helper uses the global proxy
reference to get the database from. Fortunately, all of its callers
can provide one via an argument.
tests: unit(dev)
"
* 'br-no-proxy-in-validate' of https://github.com/xemul/scylla:
validation: Remove get_local_storage_proxy call
client_state: Call validate_column_family() with database arg
client_state: Add database& arg to has_column_family_access
storage_proxy: Add .local_db() getters
validate: Mark database argument const
"
The initial intent was to remove the call to the global storage service from
the secondary index manager's create_view_for_index(), but while fixing it,
one of the intermediate schema-table helpers managed to benefit from it
as well by re-using the database reference flying by.
The cleanup is done by simply pushing the database reference along the
stack from the code that already has it down to create_view_for_index().
tests: unit(dev)
"
* 'br-no-storages-in-index-and-schema' of https://github.com/xemul/scylla:
schema-tables: Use db from make_update_table_mutations in make_update_indices_mutations
schema-tables: Add database argument to make_update_table_mutations
schema-tables: Factor out calls getting database instance
index-manager: Move feature evaluation one level up
`ops` might be passed as a disengaged shared_ptr when called
from `decommission_with_repair`.
In this case we need to propagate to sync_data_using_repair a
disengaged std::optional<utils::UUID>.
Fixes#7788
DTest: update_cluster_layout_tests:TestUpdateClusterLayout.verify_latest_copy_decommission_node_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201213073743.331253-1-bhalevy@scylladb.com>
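A minimal sketch of the propagation, with an invented ops_info type standing in for the real repair structures:
```c++
#include <memory>
#include <optional>
#include <string>

struct ops_info { std::string uuid; };  // invented stand-in

// A disengaged shared_ptr must become a disengaged optional downstream,
// never a dereference.
std::optional<std::string> ops_uuid(const std::shared_ptr<ops_info>& ops) {
    if (!ops) {
        return std::nullopt;
    }
    return ops->uuid;
}
```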
* seastar 8b400c7b45...2de43eb6bf (3):
> core: show span free sizes correctly in diagnostics
> Merge "IO queues to share capacities" from Pavel E
> file: make_file_impl: determine blockdev using st_mode
The switch to clang disabled the clang-specific -Wunused-value
since it generated some harmless warnings. Unfortunately, that also
prevented [[nodiscard]] violations from warning.
Fix by clearing all instances of the warning (including [[nodiscard]]
violations that crept in while it was disabled) and reinstating the warning.
Closes#7767
* github.com:scylladb/scylla:
build: reinstate -Wunused-value warning for [[nodiscard]]
test: lib: don't ignore future in compare_readers()
test: mutation_test: check both ranges when comparing summaries
serialializer: silence unused value warning in variant deserializer
tuned 2.11.0-9 and later writes to kernel.sched_wakeup_granularity_ns
and other sysctl tunables that we so laboriously tuned, dropping
performance by a factor of 5 (due to increased latency). Fix by
obsoleting tuned during install (in effect, we are a better tuned,
at least for us).
Not needed for .deb, since Debian/Ubuntu do not install tuned by
default.
Fixes#7696
Closes#7776
Two halves of the tunnel finally connect -- the
latter helper needs the local database instance and
is only called by the former one which already has it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are 3 callers of this helper (cdc, migration manager and tests)
and all of them already have the database object at hand.
The argument will be used by next patch to remove call for global
storage proxy instance from make_update_indices_mutations.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The make_update_indices_mutations gets the database instance
for two things -- to find the cf to work with and to get
the value of a feature for index view creation.
To suit both, and to remove the calls to the global storage proxy
and service instances, get the database once at the
function entrance. The next patch will clean this up further.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The create_view_for_index needs to know the state of the
correct-idx-token-in-secondary-index feature. To get one
it takes quite a long route through the global storage service
instance.
Since there's only one caller of the method in question,
and the method is called in a loop, it's a bit faster to
get the feature value in the caller and pass it as an argument.
This will also help to get rid of the call to the global
storage service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_next_partition uses the global proxy instance to get
the local database reference. Now it's available in the
reader object itself, so it's possible to remove this
call to the global storage proxy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This reader uses the local database instance in its get_next_partition
method to find the keyspaces to work with.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is used in validate_column_family. The last caller of it was removed by
the previous patch, so we can kill the helper itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The previous patch brought the database reference argument. And since
the currently called validate_column_family() overload _just_
gets the database from the global proxy, it's better to shortcut.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is called from cql3/statements' check_access methods and from thrift
handlers. The former have proxy argument from which they can get the
database. The latter already have the database itself on board.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A sequel to #7692.
This series gets rid of linearization when validating collections and tuple types. (Other types were already validated without linearizing).
The necessary helpers for reading from fragmented buffers were introduced in #7692. All this series does is put them to use in `validate()`.
Refs: #6138
Closes#7770
* github.com:scylladb/scylla:
types: add single-fragment optimization in validate()
utils: fragment_range: add with_simplified()
cql3: statements: select_statement: remove unnecessary use of with_linearized
cql3: maps: remove unnecessary use of with_linearized
cql3: lists: remove unnecessary use of with_linearized
cql3: tuples: remove unnecessary use of with_linearized
cql3: sets: remove unnecessary use of with_linearized
cql3: tuples: remove unnecessary use of with_linearized
cql3: attributes: remove unnecessary uses of with_linearized
types: validate lists without linearizing
types: validate tuples without linearizing
types: validate sets without linearizing
types: validate maps without linearizing
types: template abstract_type::validate on FragmentedView
types: validate_visitor: transition from FragmentRange to FragmentedView
utils: fragmented_temporary_buffer: add empty() to FragmentedView
utils: fragmented_temporary_buffer: don't add to null pointer
Manipulating fragmented views is costlier than manipulating contiguous views,
so let's detect the common situation when the fragmented view is actually
contiguous underneath, and make use of that.
Note: this optimization is only useful for big types. For trivial types,
validation usually only checks the size of the view.
Reading from contiguous memory (bytes_view) is significantly simpler
runtime-wise than reading from a fragmented view, due to less state and less
branching, so we often want to convert a fragmented view to a simple view before
processing it, if the fragmented view contains at most one fragment, which is
common. with_simplified() does just that.
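A toy version of the with_simplified() idea (the real fragmented view types in Scylla are richer; the callable must accept both the contiguous and the fragmented form and return the same type):
```c++
#include <string_view>
#include <utility>

// Toy fragmented view: at most two fragments, single-fragment is common.
struct toy_fragmented_view {
    std::string_view fragments[2];
    int n_fragments = 0;
};

// If the view has at most one fragment, hand the callable a plain
// contiguous view, which is much cheaper to read from; otherwise pass
// the fragmented view through unchanged.
template <typename Fn>
decltype(auto) with_simplified(const toy_fragmented_view& v, Fn&& fn) {
    if (v.n_fragments <= 1) {
        auto contiguous = v.n_fragments ? v.fragments[0] : std::string_view{};
        return std::forward<Fn>(fn)(contiguous);
    }
    return std::forward<Fn>(fn)(v);
}
```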
This is primarily a stylistic change. It makes the interface more consistent
with deserialize(). It will also allow us to call `validate()` for collection
elements in `validate_aux()`.
This will allow us to easily get rid of linearizations when validating
collections and tuples, because the helpers used in validate_aux() already
have FragmentedView overloads.
When fragmented_temporary_buffer::view is created from a bytes_view,
_current is null. In that case, in remove_current(), null pointer offset
happens, and ubsan complains. Fix that.
The heuristic of STCS reshape is correct, and it built the compaction
descriptor correctly, but forgot to return it to the caller, so no
reshape was ever done on behalf of STCS even when the strategy
needed it.
Fixes#7774.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201209175044.1609102-1-raphaelsc@scylladb.com>
Currently removenode works like below:
- The coordinator node advertises the node to be removed in
REMOVING_TOKEN status in gossip
- Existing nodes learn the node in REMOVING_TOKEN status
- Existing nodes sync data for the range it owns
- Existing nodes send notification to the coordinator
- The coordinator node waits for notifications and announces the node in
REMOVED_TOKEN
Current problems:
- Existing nodes do not tell the coordinator whether the data sync succeeded or failed.
- The coordinator can not abort the removenode operation in case of error
- A failed removenode operation will leave the node being removed in
REMOVING_TOKEN forever.
- The removenode runs in best effort mode which may cause data
consistency issues.
It means that if a node that owns the range after the removenode
operation is down during the operation, the removenode operation
will continue to succeed without requiring that node to perform data
syncing. This can cause data consistency issues.
For example: five nodes in the cluster, RF = 3; for a given range, n1, n2,
n3 are the old replicas, n2 is being removed, and after the removenode
operation the new replicas are n1, n5, n3. If n3 is down during the
removenode operation, only n1 will be used to sync data with the new
owner n5. This will break QUORUM read consistency if n1 happens to
miss some writes.
Improvements in this patch:
- This patch makes the removenode safe by default.
We require all nodes in the cluster to participate in the removenode operation and
sync data if needed. We fail the removenode operation if any of them is down or
fails.
If the user wants the removenode operation to succeed even if some of the nodes
are not available, the user has to explicitly pass a list of nodes that can be
skipped for the operation.
$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id>
Example restful api:
$ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5"
- The coordinator can abort data sync on existing nodes
For example, if one of the nodes fails to sync data, it makes no sense for
the other nodes to continue to sync data because the whole operation will
fail anyway.
- The coordinator can decide which nodes to ignore and pass the decision
to other nodes
Previously, there was no way for the coordinator to tell existing nodes
to run in strict mode or best effort mode. Users would have to modify
the config file or run a restful api cmd on all the nodes to select strict
or best effort mode. With this patch, the cluster wide configuration is
eliminated.
Fixes#7359
Closes#7626
Verify that the input types are iterators and their value types are compatible
with the compare function (see the sketch after the list below).
Because some of the inputs were not actually valid iterators, they are adjusted
too.
Closes#7631
* github.com:scylladb/scylla:
types: add constraint on lexicographical_tri_compare()
composite: make composite::iterator a real input_iterator
compound: make compound_type::iterator a real input_iterator
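An illustrative shape for such a constraint in C++20 (not Scylla's actual signature):
```c++
#include <compare>
#include <concepts>
#include <iterator>

template <std::input_iterator It1, std::input_iterator It2, typename TriCmp>
requires requires (TriCmp cmp, std::iter_value_t<It1> a, std::iter_value_t<It2> b) {
    { cmp(a, b) } -> std::convertible_to<std::strong_ordering>;
}
std::strong_ordering lexicographical_tri_compare(It1 f1, It1 l1, It2 f2, It2 l2, TriCmp cmp) {
    for (; f1 != l1 && f2 != l2; ++f1, ++f2) {
        if (auto r = cmp(*f1, *f2); r != 0) {
            return r;
        }
    }
    if (f1 != l1) return std::strong_ordering::greater;  // first range is longer
    if (f2 != l2) return std::strong_ordering::less;     // second range is longer
    return std::strong_ordering::equal;
}
```
A comparator such as `[](int a, int b) { return a <=> b; }` satisfies the constraint.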
UpdateItem's "ADD" operation usually adds elements to an existing set
or adds a number to an existing counter. But it can *also* be used
to create a new set or counter (as if adding to an empty set or zero).
We unfortunately did not have a test for this case (creating a new set
or counter), and when I wrote such a test now, I discovered the
implementation was missing. So this patch adds both the test and the
implementation. The new test used to fail before this patch, and passes
with it - and passes on DynamoDB.
Note that we only had this bug for the newer UpdateItem syntax.
For the old AttributeUpdates syntax, we already support ADD actions
on missing attributes, and already tested it in test_update_item_add().
I just forgot to test the same thing for the newer syntax, so I missed
this bug :-(
Fixes#7763.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207085135.2551845-1-nyh@scylladb.com>
The snitch name needs to be exchanged within the cluster once, on the shadow
round, so joining nodes cannot use a wrong snitch. The snitch names
are compared on bootstrap and on normal node start.
If the cluster already used mixed snitches, the upgrade to this
version will fail. In this case the customer needs to add a node with
the correct snitch for every node with the wrong snitch, then take
down the nodes with the wrong snitch and only then do the upgrade.
Fixes#6832
Closes#7739
Whereas in CQL the client can pass a timeout parameter to the server, in
the DynamoDB API there is no such feature; The server needs to choose
reasonable timeouts for its own internal operations - e.g., writes to disk,
querying other replicas, etc.
Until now, Alternator had a fixed timeout of 10 seconds for its
requests. This choice was reasonable - it is much higher than we expect
during normal operations, and still lower than the client-side timeouts
that some DynamoDB libraries have (boto3 has a one-minute timeout).
However, there's nothing holy about this number of 10 seconds, some
installations might want to change this default.
So this patch adds a configuration option, "--alternator-timeout-in-ms",
to choose this timeout. As before, it defaults to 10 seconds (10,000ms).
In particular, some test runs are unusually slow - consider for example
testing a debug build (which is already very slow) in an extremely
over-committed test host. In some cases (see issue #7706) we noticed
the 10 second timeout was not enough. So in this patch we increase the
default timeout chosen in the "test/alternator/run" script to 30 seconds.
Please note that as the code is structured today, this timeout only
applies to some operations, such as GetItem, UpdateItem or Scan, but
does not apply to CreateTable, for example. This is a pre-existing
issue that this patch does not change.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207122758.2570332-1-nyh@scylladb.com>
This reverts commit dc77d128e9. It was reverted
due to a strange and unexplained diff, which is now explained. The
HEAD on the working directory being pulled from was set back, so git
thought it was merging the intended commits, plus all the work that was
committed from HEAD to master. So it is safe to restore it.
"
The multishard_mutation_query test is too slow when built
with clang in dev mode. By reducing the number of scans it's
possible to shrink the full suite run time from half an hour
down to ~3 minutes.
tests: unit(dev)
"
* 'br-devel-mode-tests' of https://github.com/xemul/scylla:
test: Make multishard_mutation_query test do less scans
configure: Add -DDEVEL to dev build flags
When built by clang this dev-mode test takes ~30 minutes to
complete. Let's reduce this time by reducing the scale of
the test if DEVEL is set.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The scylla_setup command suggestion does not show an argument for --io-setup,
because we mistakenly store a bool value for it (recognized as 'store_true').
We always need to print '--io-setup X' in the suggestion instead.
Also, --nic is currently ignored by the command suggestion; we need to print it just like the other options.
Related #7395
Closes#7724
* github.com:scylladb/scylla:
scylla_setup: print --swap-directory and --swap-size on command suggestion
scylla_setup: print --nic on command suggestion
scylla_setup: fix wrong command suggestion on --io-setup
A sequel to #7692.
This series gets rid of linearization in `serialize_for_cql`, which serializes collections and user types from `collection_mutation_view` to CQL. We switch from `bytes` to `bytes_ostream` as the intermediate buffer type.
The only user of `serialize_for_cql` immediately copies the result to another `bytes_ostream`. We could avoid some copies and allocations by writing to the final `bytes_ostream` directly, but it's currently hidden behind a template.
Before this series, `serialize_for_cql_aux()` delegated the actual writing to `collection_type_impl::pack` and `tuple_type_impl::build_value`, by passing them an intermediate `vector`. After this patch, the writing is done directly in `serialize_for_cql_aux()`. Pros: we avoid the overhead of creating an intermediate vector, without bloating the source code (because creating that intermediate vector requires just as much code as serializing the values right away). Cons: we duplicate the CQL collection format knowledge contained in `collection_type_impl::pack` and `tuple_type_impl::build_value`.
Refs: #6138
Closes#7771
* github.com:scylladb/scylla:
types: switch serialize_for_cql from bytes to bytes_ostream
types: switch serialize_for_cql_aux from bytes to bytes_ostream
types: serialize user types to bytes_ostream
types: serialize lists to bytes_ostream
types: serialize sets to bytes_ostream
types: serialize maps to bytes_ostream
utils: fragment_range: use range-based for loop instead of boost::for_each
types: add write_collection_value() overload for bytes_ostream and value_view
Increase the accepted disk-to-RAM ratio to 105 to accommodate even 7.5GB of
RAM for one NVMe. Log the various reasons for not recommending the instance
type.
Fixes#7587
Closes#7600
This change enhances the toppartitions api to also return
the cardinality of the read and write sample sets. It now uses
the size() method of the space_saving_top_k class, counting the unique
operations in the sampled set up to the given capacity.
Fixes#4089
Closes#7766
When an Alternator table has partition keys or sort keys of type "bytes"
(blobs), a Scan or Query which required paging used to fail - we used
an incorrect function to output LastEvaluatedKey (which tells the user
where to continue at the next page), and this incorrect function was
correct for strings and numbers - but NOT for bytes (for bytes, we
need to encode them as base-64).
This patch also includes two tests - for bytes partition key and
for bytes sort key - that failed before this patch and now pass.
The test test_fetch_from_system_tables also used to fail after a
Limit was added to it, because one of the tables it scans had a bytes
key. That test is also fixed by this patch.
Fixes#7768
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207175957.2585456-1-nyh@scylladb.com>
Up until now, Scylla's debian package dependency versions were
unspecified. This was due to a technical difficulty in determining
the version of the depended-upon packages (such as scylla-python3
or scylla-jmx). Now that those packages are also built as part of
this repo, with a version identical to the server package
itself, we can give all of our packages explicit version dependencies.
The motivation for this change is that if a user tries to install
a specific Scylla version by installing a specific meta package,
it will silently drag in the latest components instead of the ones
of the requested versions.
The expected change in behavior is that after this change an attempt
to install a metapackage with a version which is not the latest will fail
with an explicit error hinting to the user which other packages of the same
version should be explicitly included in the command line.
Fixes#5514
Closes#7727
The switch to clang disabled the clang-specific -Wunused-value
since it generated some harmless warnings. Unfortunately, that also
prevented [[nodiscard]] violations from warning.
Fix by reinstating the warning, now that all instances of the warning
have been fixed.
A copy/paste error means we ignore the termination of one of the
ranges. Change the comma expression to a disjunction to avoid
the unused value warning from clang.
The code is not perfect, since if the two ranges are not the same
size we'll invoke undefined behavior, but it is no worse than before
(where we ignored the comparison completely).
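The bug shape, reduced to a toy function (as the commit notes, differently-sized ranges still invoke undefined behavior, no worse than before):
```c++
#include <cassert>
#include <vector>

// With the comma operator, `a != ae, b != be` discards the first test and
// only the second range terminates the loop; the disjunction lets both
// participate.
template <typename It>
bool equal_ranges(It a, It ae, It b, It be) {
    for (; a != ae || b != be; ++a, ++b) {
        if (*a != *b) {
            return false;
        }
    }
    return true;
}

int main() {
    std::vector<int> x{1, 2, 3}, y{1, 2, 3};
    assert(equal_ranges(x.begin(), x.end(), y.begin(), y.end()));
}
```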
The variant deserializer uses a fold expression to implement
an if-tree with a short-circuit, producing an intermediate boolean
value to terminate evaluation. This intermediate value is unneeded,
but evokes a warning from clang when -Wunused-value is enabled.
Since we want to enable the warning, add a cast to void to ignore
the intermediate value.
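A reduced example of the fold-expression if-tree and the (void) cast (illustrative, not the actual deserializer):
```c++
#include <cstddef>
#include <variant>

// Try each alternative in turn; the || short-circuits once the matching
// index is found. The booleans exist only to drive the short-circuit,
// hence the (void) cast silencing -Wunused-value.
template <typename... Ts>
std::variant<Ts...> make_alternative(std::size_t index) {
    std::variant<Ts...> v;
    std::size_t i = 0;
    (void)((index == i++ ? (v = Ts{}, true) : false) || ...);
    return v;
}

// make_alternative<int, double, char>(1) yields a variant holding double{}.
```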
We want to pass bytes_ostream to this loop in later commits.
bytes_ostream does not conform to some boost concepts required by
boost::for_each, so let's just use C++'s native loop.
When getting local ranges, an assumption is made that
if a range does not contain an end, or when its end is the maximum token,
then it must contain a start. This assumption was proven not true
during manual tests, so it's now fortified with an additional check.
Here's a gdb output for a set of local ranges which causes an assertion
failure when calling `get_local_ranges` on it:
```
(gdb) p ranges
$1 = std::vector of length 2, capacity 2 = {
  {_interval = {
    _start = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {_kind = dht::token_kind::before_all_keys, _data = 0}, _inclusive = false}},
    _end = std::optional<interval_bound<dht::token>> [no contained value],
    _singular = false}},
  {_interval = {
    _start = std::optional<interval_bound<dht::token>> [no contained value],
    _end = std::optional<interval_bound<dht::token>> = {[contained value] = {_value = {_kind = dht::token_kind::before_all_keys, _data = 0}, _inclusive = true}},
    _singular = false}}}
```
Closes#7764
The test test_fetch_from_system_tables tests Alternator's system-table
feature by reading from all system tables. The intention was to confirm
we don't crash reading any of them - as they have different schemas and
can run into different problems (we had such problems in the initial
implementation). The intention was not to read *a lot* from each table -
we only make a single "Scan" call on each, to read one page of data.
However, the Scan call did not set a Limit, so the single page can get
pretty big.
This is not normally a problem, but in extremely slow runs - such as when
running the debug build on an extremely overcommitted test machine (e.g.,
issue #7706) reading this large page may take longer than our default
timeout. I'll send a separate patch for the timeout issue, but for now,
there is really no reason why we need to read a big page. It is good
enough to just read 50 rows (with Limit=50). This will still read all
the different types and make the test faster.
As an example, in the debug run on my laptop, this test spent 2.4
seconds to read the "compaction_history" table before this patch,
and only 0.1 seconds after this patch. 2.4 seconds is close to our
default timeout (10 seconds), 0.1 is very far.
Fixes#7706
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201207075112.2548178-1-nyh@scylladb.com>
The original goal of this patch was to replace the two single-node dtests
allow_filtering_test and allow_filtering_secondary_indexes_test, which
recently caused us problems when we wanted to change the ALLOW FILTERING
behavior but the tests were outside the tree. I'm hoping that after this
patch, those two tests could be removed from dtest.
But this patch actually tests more cases than those original dtests, and
moreover tests not just whether ALLOW FILTERING is required or not, but
also that the results of the filtering are correct.
Currently, four of the included tests are expected to fail ("xfail") on
Scylla, reproducing two issues:
1. Refs #5545:
"WHERE x IN ..." on indexed column x wrongly requires ALLOW FILTERING
2. Refs #7608:
"WHERE c=1" on clustering key c should require ALLOW FILTERING, but
doesn't.
All tests, except the one for issue #5545, pass on Cassandra. That one
fails on Cassandra because it doesn't support IN on an indexed column at all
(regardless of whether ALLOW FILTERING is used or not).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201115124631.1224888-1-nyh@scylladb.com>
There is a typo in a snapshot's schema.cql: a missing comma after the
compaction strategy. Restoring the schema from the file will fail:
AND compaction = {'class': 'SizeTieredCompactionStrategy''max_compaction_threshold': '32'}
The map_as_cql_param() function has a `first` parameter to smartly add the
comma; compaction_strategy_options is never the first.
Fixes#7741
Signed-off-by: Amos Kong <amos@scylladb.com>
Closes#7734
"
The storage service is called there to get the cached value
of db::system_keyspace::get_local_host_id(). Keeping the value
on the database decouples it from the storage service and kills one
more global storage service reference.
tests: unit(dev)
"
* 'br-remove-storage-service-from-counters-2' of https://github.com/xemul/scylla:
counters: Drop call to get_local_storage_service and related
counters: Use local id arg in transform_counter_update_to_shards
database: Have local id arg in transform_counter_updates_to_shards()
storage_service: Keep local host id to database
This PR adds the Sphinx documentation generator and the custom theme ``sphinx-scylladb-theme``. Once merged, the GitHub Actions workflow should automatically publish the developer notes stored under the ``docs`` directory on http://scylladb.github.io/scylla
1. Run the command ``make preview`` from the ``docs`` directory.
2. Check the terminal where you have executed the previous command. It should not raise warnings.
3. Open in a new browser tab http://127.0.0.1:5500/ to see the generated documentation pages.
The table of contents displays the files sorted as they appear on GitHub. In a subsequent iteration, @lauranovich and I will submit an additional PR proposing a new folder organization structure.
Closes#7752
* github.com:scylladb/scylla:
docs: fixed warnings
docs: added theme
The previous way of deleting records based on the whole
sstable data_size causes overzealous deletions (#7668)
and inefficiency in the rows cache due to the large number
of range tombstones created.
Therefore we'd be better off just letting the
records expire using the 30-day TTL.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201206083725.1386249-1-bhalevy@scylladb.com>
This reverts commit 0aa1f7c70a, reversing
changes made to 72c59e8000. The diff is
strange, including unrelated commits. There is no understanding of the
cause, so to be safe, revert and try again.
The local host id is now passed by argument, so we don't
need the counter_id::local() and some other methods that
call or are called by it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Only a few places in it need the uuid. And since it's only 16 bytes,
it's possible to safely capture it by value in the called lambdas.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two places that call it -- database code itself and
tests. The former already has the local host id, so just pass
one.
The latter are a bit trickier. Currently they use the value from
storage_service created by storage_service_for_tests, but since
this version of service doesn't pass through prepare_to_join()
the local_host_id value there is default-initialized, so just
default-initialize the needed argument in place.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The value in question is cached from db::system_keyspace
for places that want to have it without waiting for
futures. So far the only place is database counters code,
so keep the value on the database itself. The next patches will
make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Citing #6138:
> In the past few years we have converted most of our codebase to
> work in terms of fragmented buffers, instead of linearised ones, to help avoid
> large allocations that put large pressure on the memory allocator.
> One prominent component that still works exclusively in terms of linearised buffers
> is the types hierarchy, more specifically the de/serialization code to/from CQL
> format. Note that for most types, this is the same as our internal format,
> notable exceptions are non-frozen collections and user types.
> Most types are expected to contain reasonably small values, but texts, blobs and especially
> collections can get very large. Since the entire hierarchy shares a common
> interface we can either transition all or none to work with fragmented buffers.
This series gets rid of intermediate linearizations in deserialization. The next
steps are removing linearizations from serialization, validation and comparison
code.
Series summary:
- Fix a bug in `fragmented_temporary_buffer::view::remove_prefix`. (Discovered
while testing. Since it wasn't discovered earlier, I guess it doesn't occur in
any code path in master.)
- Add a `FragmentedView` concept to allow uniform handling of various types of
fragmented buffers (`bytes_view`, `temporary_fragmented_buffer::view`,
`ser::buffer_view` and likely `managed_bytes_view` in the future).
- Implement `FragmentedView` for relevant fragmented buffer types.
- Add helper functions for reading from `FragmentedView`.
- Switch `deserialize()` and all its helpers from `bytes_view` to
`FragmentedView`.
- Remove `with_linearized()` calls which just became unnecessary.
- Add an optimization for single-fragment cases.
The addition of `FragmentedView` might be controversial, because another concept
meant for the same purpose - `FragmentRange` - is already used. Unfortunately,
it lacks the functionality we need. The main (only?) thing we want to do with a
fragmented buffer is to extract a prefix from it and `FragmentRange` gives us no
way to do that, because it's immutable by design. We can work around that by
wrapping it into a mutable view which will track the offset into the immutable
`FragmentRange`, and that's exactly what `linearizing_input_stream` is. But it's
wasteful. `linearizing_input_stream` is a heavy type, unsuitable for passing
around as a view - it stores a pair of fragment iterators, a fragment view and a
size (11 words) to conform to the iterator-based design of `FragmentRange`, when
one fragment iterator (4 words) already contains all needed state, just hidden.
I suggest we replace `FragmentRange` with `FragmentedView` (or something
similar) altogether.
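An approximation of the concept sketched above, to make the "consumable prefix" point concrete (the real definition differs in details):
```c++
#include <concepts>
#include <cstddef>
#include <string_view>

// A mutable view over a fragmented buffer, from which a prefix can be
// consumed in place; remove_prefix() is exactly what FragmentRange,
// being immutable by design, cannot express.
template <typename T>
concept FragmentedViewLike = requires (T v, std::size_t n) {
    { v.current_fragment() } -> std::convertible_to<std::string_view>;
    { v.size_bytes() } -> std::convertible_to<std::size_t>;
    { v.empty() } -> std::convertible_to<bool>;
    v.remove_prefix(n);
};
```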
Refs: #6138
Closes#7692
* github.com:scylladb/scylla:
types: collection: add an optimization for single-fragment buffers in deserialize
types: add an optimization for single-fragment buffers in deserialize
cql3: tuples: don't linearize in in_value::from_serialized
cql3: expr: expression: replace with_linearize with linearized
cql3: constants: remove unneeded uses of with_linearized
cql3: update_parameters: don't linearize in prefetch_data_builder::add_cell
cql3: lists: remove unneeded use of with_linearized
query-result-set: don't linearize in result_set_builder::deserialize
types: remove unneeded collection deserialization overloads
types: switch collection_type_impl::deserialize from bytes_view to FragmentedView
cql3: sets: don't linearize in value::from_serialized
cql3: lists: don't linearize in value::from_serialized
cql3: maps: don't linearize in value::from_serialized
types: remove unused deserialize_aux
types: deserialize: don't linearize tuple elements
types: deserialize: don't linearize collection elements
types: switch deserialize from bytes_view to FragmentedView
types: deserialize tuple types from FragmentedView
types: deserialize set type from FragmentedView
types: deserialize map type from FragmentedView
types: deserialize list type from FragmentedView
types: add FragmentedView versions of read_collection_size and read_collection_value
types: deserialize varint type from FragmentedView
types: deserialize floating point types from FragmentedView
types: deserialize decimal type from FragmentedView
types: deserialize duration type from FragmentedView
types: deserialize IP address types from FragmentedView
types: deserialize uuid types from FragmentedView
types: deserialize timestamp type from FragmentedView
types: deserialize simple date type from FragmentedView
types: deserialize time type from FragmentedView
types: deserialize boolean type from FragmentedView
types: deserialize integer types from FragmentedView
types: deserialize string types from FragmentedView
types: remove unused read_simple_opt
types: implement read_simple* versions for FragmentedView
utils: fragmented_temporary_buffer: implement FragmentedView for view
utils: fragment_range: add single_fragmented_view
serializer: implement FragmentedView for buffer_view
utils: fragment_range: add linearized and with_linearized for FragmentedView
utils: fragment_range: add FragmentedView
utils: fragmented_temporary_buffer: fix view::remove_prefix
Values usually come in a single fragment, but we pay the cost of fragmented
deserialization nevertheless: bigger view objects (4 words instead of 2 words),
more state to keep updated (i.e. the total view size in addition to the current
fragment size), and more branches.
This patch adds a special case for single-fragment buffers to
abstract_type::deserialize. They are converted to a single_fragmented_view
before doing anything else. Templates instantiated with single_fragmented_view
should compile to better code than their multi-fragmented counterparts. If
abstract_type::deserialize is inlined, this patch should completely prevent any
performance penalties for switching from with_linearized to fragmented
deserialization.
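A sketch of the dispatch described above; `is_single_fragment()` and the view interface here are illustrative stand-ins:
```c++
#include <cstddef>
#include <string_view>

// A view type that statically has exactly one fragment; templates
// instantiated with it compile down to contiguous-memory code with no
// fragment-advancing branches.
struct single_fragmented_view {
    std::string_view frag;
    std::string_view current_fragment() const { return frag; }
    std::size_t size_bytes() const { return frag.size(); }
    void remove_prefix(std::size_t n) { frag.remove_prefix(n); }
};

// A generic, fragment-aware consumer works unchanged on the simple view.
template <typename View>
std::size_t count_bytes(View v) {
    std::size_t n = 0;
    while (v.size_bytes() > 0) {
        n += v.current_fragment().size();
        v.remove_prefix(v.current_fragment().size());
    }
    return n;
}

// Dispatch: take the fast path when the buffer is a single fragment.
template <typename FragView>
std::size_t process(FragView v) {
    if (v.is_single_fragment()) {
        return count_bytes(single_fragmented_view{v.current_fragment()});
    }
    return count_bytes(v);
}
```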
with_linearized creates an additional internal `bytes` when the input is
fragmented. linearized copies the data directly to the output `bytes`, so it's
more efficient.
Devirtualizes collection_type_impl::deserialize (so it can be templated) and
adds a FragmentedView overload. This will allow us to deserialize collections
with explicit cql_serialization_format directly from fragmented buffers.
The final part of the transition of deserialize from bytes_view to
FragmentedView.
Adds a FragmentedView overload to abstract_type::deserialize and
switches deserialize_visitor from bytes_view to FragmentedView, allowing
deserialization of all types with no intermediate linearization.
The partition builder doesn't expect the looked-up row to exist. In fact,
the row already existing is a sign of a bug. Currently bugs resulting in
duplicate rows will manifest by tripping an assert in
`row::append_cell()`. This however results in poor diagnostics, so we
want to catch these errors sooner to be able to provide higher level
diagnostics. To this end, switch to the freshly introduced
`append_clustering_row()` so that duplicate rows are found early and in
a context where their identity is known.
This abstraction is used to merge the output of multiple readers, each
opened for a single partition query, into a non-decreasing stream
of mutation_fragments.
It is similar to `mutation_reader_merger`,
but an important difference is that the new merger may select new readers
in the middle of a partition after it already returned some fragments
from that partition. It uses the new `position_reader_queue` abstraction
to select new readers. It doesn't support multi-partition (ring range) queries.
The new merger will be later used when reading from sstable sets created
by TimeWindowCompactionStrategy. This strategy creates many sstables
that are mostly disjoint w.r.t the contained clustering keys, so we can
delay opening sstable readers when querying a partition until after we have
processed all mutation fragments with positions before the keys
contained by these sstables.
A microbenchmark was added that compares the existing combining reader
(which uses `mutation_reader_merger` underneath) with a new combining reader
built using the new `clustering_order_reader_merger` and a simple queue of readers
that returns readers from some supplied set. The used set of readers is built from the following
ranges of keys (each range corresponds to a single reader):
`[0, 31]`, `[30, 61]`, `[60, 91]`, `[90, 121]`, `[120, 151]`.
The microbenchmark runs the reader and divides the result by the number of mutation fragments.
The results on my laptop were:
```
$ build/release/test/perf/perf_mutation_readers -t clustering_combined.* -r 10
single run iterations: 0
single run duration: 1.000s
number of runs: 10
test iterations median mad min max
clustering_combined.ranges_generic 2911678 117.598ns 0.685ns 116.175ns 119.482ns
clustering_combined.ranges_specialized 3005618 111.015ns 0.349ns 110.063ns 111.840ns
```
`ranges_generic` denotes the existing combining reader, `ranges_specialized` denotes the new reader.
Split from https://github.com/scylladb/scylla/pull/7437.
Closes#7688
* github.com:scylladb/scylla:
tests: mutation_source_test for clustering_order_reader_merger
perf: microbenchmark for clustering_order_reader_merger
mutation_reader_test: test clustering_order_reader_merger in memory
test: generalize `random_subset` and move to header
mutation_reader: introduce clustering_order_reader_merger
In issue #7722, it was suggested that we should port Cassandra's CQL unit
tests into our own repository, by translating the Java tests into Python
using the new cql-pytest framework. Cassandra's CQL unit test framework is
orders of magnitude faster than dtest, and in-tree, so Cassandra has been
moving many CQL correctness tests there, and we can also benefit from their
test cases.
In this patch, we take the first step in a long journey:
1. I created a subdirectory, test/cql-pytest/cassandra_tests, where all the
translated Cassandra tests will reside. The structure of this directory
will mirror that of the test/unit/org/apache/cassandra/cql3 directory in
the Cassandra repository.
pytest conveniently looks for test files recursively, so when all the
cql-pytest tests are run, the cassandra_tests files will be run as well.
As usual, one can also run only a subset of all the tests, e.g.,
"test/cql-pytest/run -vs cassandra_tests" runs only the tests in the
cassandra_tests subdirectory (and its subdirectories).
2. I translated into Python two of the smallest test files -
validation/entities/{TimeuuidTest,DataTypeTest}.java - containing just
three test functions.
The plan is to translate entire Java test files one by one, and to mirror
their original location in our own repository, so it will be easier
to remember what we already translated and what remains to be done.
3. I created a small library, porting.py, of functions which resemble the
common functions of the Java tests (CQLTester.java). These functions aim
to make porting the tests easier. Despite the resemblance, the ported code
is not 100% identical (of course) and some effort is still required in
this porting. As we continue this porting effort, we'll probably need
more of these functions, and can also continue to improve them to reduce
the porting effort.
Refs #7722.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201201192142.2285582-1-nyh@scylladb.com>
This series introduces a `large_data_counters` element to `scylla_metadata` component to explicitly count the number of `large_{partitions,rows,cells}` and `too_many_rows` in the sstable. These are accounted for in the sstable writer whenever the respective large data entry is encountered.
It is taken into account in `large_data_handler::maybe_delete_large_data_entries`, when engaged.
Otherwise, if deleting a legacy sstable that has no such entry in `scylla_metadata`, just revert to using the current method of comparing the sstable's `data_size` to the various thresholds.
Fixes#7668
Test: unit(dev)
Dtest: wide_rows_test.py (in progress)
Closes#7669
* github.com:scylladb/scylla:
docs: sstable-scylla-format: add large_data_stats subcomponent
large_data_handler: maybe_delete_large_data_entries: use sstable large data stats
large_data_handler: maybe_delete_large_data_entries: accept shared_sstable
large_data_handler: maybe_delete_large_data_entries: move out of line
sstables: load large_data_stats from scylla_metadata
sstables: store large_data_stats in scylla_metadata
sstables: writer: keep track of large data stats
large_data_handler: expose methods to get threshold
sstables: kl/writer: never record too many rows
large_data_handler: indicate recording of large data entries
large_data_handler: move constructor out of line
In test/cql-pytest/run.py we have a 200 second timeout to boot Scylla.
I never expected to reach this timeout - it normally takes (in dev
build mode) around 2 seconds, but in one run on Jenkins we did reach it.
It turns out that the code did not recognize this timeout correctly,
thought that Scylla had booted correctly - and then failed all the
subtests when they failed to connect to Scylla.
This patch fixes the timeout logic. After the timeout, if Scylla's
CQL port is still not responsive, the test run is failed - without
trying to run many individual tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201201150927.2272077-1-nyh@scylladb.com>
When a row was inserted into a table with no regular columns, and no
such row existed in the first place, postimage would not be produced.
Fix this.
Fixes#7716.
Closes#7723
If the sstable has scylla_metadata::large_data_stats use them
to determine whether to delete the corresponding large data records.
Otherwise, defer to the current method of comparing the sstable
data_size to the respective thresholds.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
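For illustration, a small self-contained sketch of the decision described above; the field names and the single threshold are assumptions, not Scylla's actual large_data_handler API:
```
#include <cstdint>
#include <optional>

struct large_data_stats {
    uint64_t partitions_above_threshold; // counted by the sstable writer
    uint64_t rows_above_threshold;
};

struct sstable_info {
    std::optional<large_data_stats> stats; // engaged for new-format sstables
    uint64_t data_size;
};

bool may_have_large_data_entries(const sstable_info& sst,
                                 uint64_t threshold_bytes) {
    if (sst.stats) {
        // New sstables: the writer counted large data entries explicitly.
        return sst.stats->partitions_above_threshold > 0
            || sst.stats->rows_above_threshold > 0;
    }
    // Legacy sstables: fall back to comparing data_size to the threshold.
    return sst.data_size > threshold_bytes;
}
```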
Since the actual deletion of the large data entries
is done in the background, and we don't capture the shared_sstable,
we can safely pass it to maybe_delete_large_data_entries when
deleting the sstable in sstable::unlink, and it will be released
as soon as maybe_delete_large_data_entries returns.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Load the large data stats from the scylla_metadata component
if they are present. Otherwise, if we're opening a legacy sstable
that has no scylla_metadata_type::LargeDataStats entry, leave
sstable::_large_data_stats disengaged.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Store the large data statistics in the scylla_metadata component.
These will be retrieved when loading the sstable and be
used for determining whether to delete the corresponding
large data entries upon sstable deletion.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Previously, statement_restrictions::find_idx() would happily return an
index for a non-EQ restriction (because it checked only the column
name, not the operator). This is incorrect: when the selected index
is for a non-EQ restriction, it is impossible to query that index
table.
Fixes#7659.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#7665
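For illustration, a self-contained sketch of the fixed selection rule (require EQ, not just a matching column name); the types here are stand-ins for Scylla's restriction machinery:
```
#include <optional>
#include <string>
#include <vector>

enum class oper { EQ, GT, LT };

struct restriction {
    std::string column;
    oper op;
};

std::optional<std::string>
find_index(const std::vector<restriction>& restrictions,
           const std::vector<std::string>& indexed_columns) {
    for (const auto& r : restrictions) {
        for (const auto& col : indexed_columns) {
            // Checking only r.column == col reproduces the bug; the
            // r.op == oper::EQ test is the fix.
            if (r.column == col && r.op == oper::EQ) {
                return col;
            }
        }
    }
    return std::nullopt;
}
```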
* seastar 010fb0df1e...8b400c7b45 (6):
> append_challenged_posix_file_impl::read_dma: allow iovec to cross _logical_size
> Merge "Extend per task-queue timing statistics" from Pavel E
> tls_test: Create test certs at build time
> cook: upgrade hwloc version
> memory: rate-limit diagnostics messages
> util/log: add rate-limited version of writer version of log()
Currently, if the user provides a cell name with too many components,
we will accept it and construct an invalid clustering key. This may
result in undefined behavior downstream.
It was caught by ASAN in a debug build when executing dtest
cql_tests.py:MiscellaneousCQLTester.cql3_insert_thrift_test with
nodetool flush manually added after the write. Triggered during
sstable writing to an MC-format sstable:
seastar::shared_ptr<abstract_type const>::operator*() const at ././seastar/include/seastar/core/shared_ptr.hh:577
sstables::mc::clustering_blocks_input_range::next() const at ./sstables/mx/writer.cc:180
To prevent corrupting the state in this way, we should fail
early. This patch adds validation which will fail thrift requests
which attempt to create invalid clustering keys.
Fixes#7568.
Example error:
Internal server error: Cell name of ks.test has too many components, expected 1 got 2 in 0x0004000000040000017600
Message-Id: <1605550477-24810-1-git-send-email-tgrabiec@scylladb.com>
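For illustration, a minimal sketch of the early validation described above; the function and the exception text are illustrative, not the real thrift handler:
```
#include <stdexcept>
#include <string>
#include <vector>

// Fail the request before an invalid clustering key can be built.
void validate_cell_name(const std::vector<std::string>& components,
                        size_t clustering_key_size) {
    if (components.size() > clustering_key_size) {
        throw std::runtime_error(
            "Cell name has too many components: expected "
            + std::to_string(clustering_key_size)
            + " got " + std::to_string(components.size()));
    }
}
```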
This patch adds an option to scylla_setup to configure an rsyslog destination.
The monitoring stack has an option to get information from rsyslog; it
requires that rsyslog on the Scylla machines send the trace lines to
it.
The configuration will be in a Scylla configuration file, so it is safe to run it multiple times.
Fixes#7589
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes#7634
* github.com:scylladb/scylla:
dist/common/scripts/scylla_setup: Optionally config rsyslog destination
Adding dist/common/scripts/scylla_rsyslog_setup utility
This patch adds an option to scylla_setup to configure an rsyslog
destination.
The monitoring stack has an option to get information from rsyslog; it
requires that rsyslog on the Scylla machines send the trace lines to
it.
If the /etc/rsyslog.d/ directory exists (which means the current system
runs rsyslog), scylla_setup will ask whether to add rsyslog configuration
and, if yes, run scylla_rsyslog_setup.
Fixes#7589
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The scylla_setup command suggestion does not show an argument for --io-setup,
because we mistakenly store a bool value in it (recognized as 'store_true').
We always need to print '--io-setup X' in the suggestion instead.
Related #7395
* scylla-dev/snapshot_fixes_v1:
raft: ignore append_reply from a peer in SNAPSHOT state
raft: Ignore outdated snapshots
raft: set next_idx to correct value after snapshot transfer
This abstraction is used to merge the output of multiple readers, each
opened for a single partition query, into a non-decreasing stream
of mutation_fragments.
It is similar to `mutation_reader_merger`,
but an important difference is that the new merger may select new readers
in the middle of a partition after it already returned some fragments
from that partition. It uses the new `position_reader_queue` abstraction
to select new readers. It doesn't support multi-partition (ring range) queries.
The new merger will be later used when reading from sstable sets created
by TimeWindowCompactionStrategy. This strategy creates many sstables
that are mostly disjoint w.r.t the contained clustering keys, so we can
delay opening sstable readers when querying a partition until after we have
processed all mutation fragments with positions before the keys
contained by these sstables.
Fix #7680 by never using a secondary index for multi-column restrictions.
Modify expr::is_supported_by() to handle multi-column correctly.
Tests: unit (dev)
Closes#7699
* github.com:scylladb/scylla:
cql3/expr: Clarify multi-column doesn't use indexing
cql3: Don't use index for multi-column restrictions
test: Add eventually_require_rows
The first two patches in this series are small improvements to cql-pytest to prepare for the third and main patch. This third patch adds cql-pytest tests which check that we fail CQL queries that try to inject non-ASCII and non-UTF-8 strings for ascii and text columns, respectively.
The tests do not discover any unknown bugs in Scylla; however, they do show that Scylla is stricter in its definition of "valid UTF-8" than Cassandra.
Closes#7719
* github.com:scylladb/scylla:
test/cql-pytest: add tests for validation of inserted strings
test/cql-pytest: add "scylla_only" fixture
test/cpy-pytest: enable experimental features
This change adds tracking of all the CQL errors that can be
raised in response to a CQL message from a client, as described
in the CQL v4 protocol and with Scylla's CDC_WRITE_FAILUREs
included.
Fixes#5859
Closes#7604
We have "Conflicts: kernel < 3.10.0-514" on rpm package to make sure
the environment is running newer kernel.
However, user may use non-standard kernel which has different package name,
like kernel-ml or kernel-uek.
On such environment Conflicts tag does not works correctly.
Even the system running with newer kernel, rpm only checks "kernel" package
version number.
To avoid such issue, we need to drop Conflicts tag.
Fixes#7675
This patch adds comprehensive cql-pytest tests for checking the validation
of strings - ASCII or UTF-8 - in CQL. Strings can be represented in CQL
using several methods - a string can be a string literal as
part of the statement, can be encoded as a blob (0x...), can be
a binding parameter for a prepared statement, or can be returned
by user-defined functions - and these tests check all of them.
We already have low-level unit tests for UTF-8 parsing in
test/boost/utf8_test.cc, but the new tests here confirm that we really
call these low-level functions in the correct way. Moreover, since these
are CQL tests, they can also be run against Cassandra, and doing that
demonstrated that Scylla's UTF-8 parsing is *stricter* than Cassandra's -
Scylla's UTF-8 parser rejects the following sequences which Cassandra's
accepts:
1. \xC0\x80 as another non-minimal representation of null. Note that other
non-minimal encodings are rejected by Cassandra, as expected.
2. Characters beyond the official Unicode range (or what Scylla considers
the end of the range).
3. UTF-16 surrogates - these are not considered valid UTF-8, but Cassandra
accepts them, and Scylla does not.
In the future, we should consider whether Scylla is more correct than
Cassandra here (so we're fine), or whether compatibility is more important
than correctness (so this exposed a bug).
The ASCII tests reproduce issue #5421 - that trying to insert a
non-ASCII string into an "ascii" column should produce an error on
insert - not later when fetching the string. This test now passes,
because issue 5421 was already fixed.
These tests did not expose any bug in Scylla (other than the differences
with Cassandra mentioned above), so all of them pass on Scylla. Two
of the tests fail on Cassandra, because Cassandra does not recognize
some invalid UTF-8 (according to Scylla's definition) as invalid.
Refs #5421.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reject the previously accepted case where the multi-column restriction
applied to just a single column, as it causes a crash downstream. The
user can drop the parentheses to avoid the rejection.
Fixes#7710
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#7712
"
This series adds maybe_yield called from
cleanup_compaction::get_ranges_for_invalidation
to avoid reactor stalls.
To achieve that, we first extract bool_class can_yield
to utils/maybe_yield.hh, and add a convenience helper:
utils::maybe_yield(can_yield) that conditionally calls
seastar::thread::maybe_yield if it can (when called in a
seastar thread).
With that, we add a can_yield parameter to dht::to_partition_ranges
and dht::partition_range::deoverlap (defaults to false), and
use it from cleanup_compaction::get_ranges_for_invalidation,
as the latter is always called from `consume_in_thread`.
Fixes#7674
Test: unit(dev)
"
* tag 'unstall-get_ranges_for_invalidation-v2' of github.com:bhalevy/scylla:
compaction: cleanup_compaction: get_ranges_for_invalidation: add yield points
dht/i_partitioner: to_partition_ranges: support yielding
locator: extract can_yield to utils/maybe_yield.hh
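For illustration, a sketch of what such a helper can look like, assuming Seastar's seastar::thread::maybe_yield() and bool_class; the exact Scylla header layout may differ:
```
#include <seastar/core/thread.hh>
#include <seastar/util/bool_class.hh>

namespace utils {

using can_yield = seastar::bool_class<class can_yield_tag>;

// Conditionally yield; safe only when the caller runs inside a
// seastar::thread (e.g. code driven by consume_in_thread).
inline void maybe_yield(can_yield yield) {
    if (yield) {
        seastar::thread::maybe_yield();
    }
}

} // namespace utils
```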
It is used to force remove a node from gossip membership if something
goes wrong.
Note: run the force_remove_endpoint api at the same time on _all_ the
nodes in the cluster in order to prevent the removed nodes from coming back.
This is because nodes on which the force_remove_endpoint api cmd was not run
can gossip the removed node's information to other nodes within 2 *
ring_delay (2 * 30 seconds by default).
For instance, in a 3 node cluster where node 3 is decommissioned, to remove
node 3 from gossip membership prior to the auto removal (3 days by
default), run the api cmd on both node 1 and node 2 at the same time.
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.1:10000/gossiper/force_remove_endpoint/127.0.0.3"
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.2:10000/gossiper/force_remove_endpoint/127.0.0.3"
Then run 'nodetool gossipinfo' on all the nodes to check the removed nodes
are not present.
Fixes#2134
Closes#5436
This patch adds a fixture "scylla_only" which can be used to mark tests
for Scylla-specific features. These tests are skipped when running against
other CQL servers - like Apache Cassandra.
We recognize Scylla by looking at whether any system table exists with
the name "scylla" in its name - Scylla has several of those, and Cassandra
has none.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
bytes_view is one of the types we want to deserialize from (at least for now),
so we want to be able to pass it to deserialize() after it's transitioned to
FragmentedView.
single_fragmented_view is a wrapper implementing FragmentedView for bytes_view.
It's constructed from bytes_view explicitly, because it's typically used in
contexts where we want to phase linearization (and by extension, bytes_view) out.
This patch introduces FragmentedView - a concept intended as a general-purpose
interface for fragmented buffers.
Another concept made for this purpose, FragmentRange, already exists in the
codebase. However, it's unwieldy. The iterator-based design of FragmentRange is
harder to implement and requires more code, but more importantly it makes
FragmentRange immutable.
Usually we want to read the beginning of the buffer and pass the rest of it
elsewhere. This is impossible with FragmentRange.
FragmentedView can do everything FragmentRange can do and more, except for
playing nicely with iterator-based collection methods, but those are useless for
fragmented buffers anyway.
Disk parsing expects output from a recursive listing of the GCP
metadata REST call. The method used to do this by default,
but now it requires a boolean flag to run in recursive mode.
Fixes#7684
Closes#7685
Since f3bcd4d205 ("Merge 'Support SSL Certificate Hot
Reloading' from Calle"), we reload certificates as they are
modified on disk. This uses inotify, which is limited by a
sysctl fs.inotify.max_user_instances, with a default of 128.
This is enough for 64 shards only, if both rpc and cql are
encrypted; above that startup fails.
Increase to 1200, which is enough for 6 instances * 200 shards.
Fixes#7700.
Closes#7701
When we introduced dependencies.conf, we mistakenly added it to the rpm as
%ghost, but it should be a normal file, installed normally on package
installation.
Fixes#7703
Closes#7704
Fixes#7211
If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
Closes#7697
* github.com:scylladb/scylla:
redis::service: Shut down sharded<> subobject on startup exception
transport::controller: Shut down distributed object on startup exception
Refs #7211
If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
Fixes#7211
If we start a sharded<> object, then proceed to do potentially
exceptional stuff, we should destroy it on said exception.
Otherwise, the exception propagation will abort on RAII
destruction of the sharded<>. And we get no exception logging.
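For illustration, a self-contained sketch of this pattern; `service` and do_risky_setup() are hypothetical stand-ins, not the real redis/transport controllers:
```
#include <stdexcept>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>

struct service {
    seastar::future<> stop() { return seastar::make_ready_future<>(); }
};

seastar::future<> do_risky_setup() {
    // Stands in for the "potentially exceptional stuff" done after start().
    return seastar::make_exception_future<>(std::runtime_error("boom"));
}

seastar::future<> start_service(seastar::sharded<service>& svc) {
    co_await svc.start();
    try {
        co_await do_risky_setup();
    } catch (...) {
        // Stop the sharded<> before propagating; otherwise its RAII
        // destruction aborts and the exception is never logged.
        co_await svc.stop();
        throw;
    }
}
```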
The downstream code expects a single-column restriction when using an
index. We could fix it, but we'd still have to filter the rows
fetched from the index table, unlike the code that queries the base
table directly. For instance, WHERE (c1,c2,c3) = (1,2,3) with an
index on c3 can fetch just the right rows from the base table but all
the c3=3 rows from the index table.
Fixes#7680
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
After a snapshot is transferred, progress::next_idx is set to its index,
but the code uses the current snapshot to set it instead of the snapshot
that was transferred. Those can be different snapshots.
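For illustration, a tiny sketch of the fix; the struct, field names, and the +1 next_idx convention are assumptions about the raft types, not the real implementation:
```
#include <cstdint>

struct snapshot_descriptor { uint64_t idx; };

// Called when a snapshot transfer to a follower completes.
void on_snapshot_transfer_done(uint64_t& next_idx,
                               const snapshot_descriptor& transferred) {
    // Use the snapshot that was actually transferred, not the server's
    // current snapshot, which may already be a different (newer) one.
    next_idx = transferred.idx + 1; // assumed next_idx semantics
}
```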
After a node becomes leader it needs to do two things: send an append
message to establish its leadership and commit one entry to make sure
all previous entries with smaller terms are committed as well.
Snapshot index cannot be used to check snapshot correctness since some
entries may not be commands and thus do not affect the snapshot value. Let's
use the applied entries count instead.
Move the definition of bool_class can_yield to a standalone
header file and define there a maybe_yield(can_yield) helper.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In commit 9b28162f88 (repair: Use label
for node ops metrics), we switched to using labels for different node
operations. We should use the same description for the same metric name.
Fixes#7681
Closes#7682
1. sstables: move `sstable_set` implementations to a separate module
All the implementations were kept in sstables/compaction_strategy.cc
which is quite large even without them. `sstable_set` already had its
own header file, now it gets its own implementation file.
The declarations of implementation classes and interfaces (`sstable_set_impl`,
`bag_sstable_set`, and so on) were also exposed in a header file,
sstable_set_impl.hh, for the purposes of potential unit testing.
2. mutation_reader: move `mutation_reader::forwarding` to flat_mutation_reader.hh
Files which need this definition won't have to include
mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are
in total smaller; mutation_reader.hh includes flat_mutation_reader.hh).
3. sstables: move sstable reader creation functions to `sstable_set`
Lower level functions such as `create_single_key_sstable_reader`
were made methods of `sstable_set`.
The motivation is that each concrete sstable_set
may decide to use a better sstable reading algorithm specific to the
data structures used by this sstable_set. For this it needs to access
the set's internals.
A nice side effect is that we moved some code out of table.cc
and database.hh which are huge files.
4. sstables: pass `ring_position` to `create_single_key_sstable_reader`
instead of `partition_range`.
It would be best to pass `partition_key` or `decorated_key` here.
However, the implementation of this function needs a `partition_range`
to pass into `sstable_set::select`, and `partition_range` must be
constructed from `ring_position`s. We could create the `ring_position`
internally from the key but that would involve a copy which we want to
avoid.
5. sstable_set: refactor `filter_sstable_for_reader_by_pk`
Introduce a `make_pk_filter` function, which given a ring position,
returns a boolean function (a filter) that given a sstable, tells
whether the sstable may contain rows with the given position.
The logic has been extracted from `filter_sstable_for_reader_by_pk`.
Split from #7437.
Closes#7655
* github.com:scylladb/scylla:
sstable_set: refactor filter_sstable_for_reader_by_pk
sstables: pass ring_position to create_single_key_sstable_reader
sstables: move sstable reader creation functions to `sstable_set`
mutation_reader: move mutation_reader::forwarding to flat_mutation_reader.hh
sstables: move sstable_set implementations to a separate module
This piece of logic was wrong for two unrelated reasons:
1. When fragmented_temporary_buffer::view is constructed from bytes_view,
_current is null. When remove_prefix was used on such view, null pointer
dereference happened.
2. It only worked for the first remove_prefix call. A second call would put a
wrong value in _current_position.
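For illustration, a self-contained sketch of a remove_prefix that avoids both bugs; the member layout here is an assumption, not the real fragmented_temporary_buffer::view:
```
#include <cstddef>
#include <string_view>

struct view {
    const std::string_view* _current = nullptr; // null when built from bytes_view
    std::string_view _current_chunk;            // unconsumed part of the fragment
    size_t _size = 0;                           // total unconsumed bytes

    // Precondition: n <= _size.
    void remove_prefix(size_t n) {
        _size -= n;
        // Bug 1 fix: never dereference _current when there is no fragment
        // array to advance (the bytes_view-backed case).
        while (_current && n >= _current_chunk.size() && _size > 0) {
            n -= _current_chunk.size();
            _current_chunk = *++_current; // Bug 2 fix: keep state consistent
        }                                 // across repeated calls
        _current_chunk.remove_prefix(n);
    }
};
```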
For sstable versions greater than or equal to md, the `min_max_column_names`
sstable metadata gives a range of position-in-partitions such that all
clustering rows stored in this sstable have positions in this range.
Partition tombstones in this context are understood as covering the
entire range of clustering keys; thus, if the sstable contains at least
one partition tombstone, the sstable position range is set to be the
range of all clustered rows.
Therefore, by checking that the position range is *not* the range of all
clustered rows we know that the sstable cannot have any partition tombstones.
Closes#7678
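For illustration, a tiny sketch of that inference; `covers_all_positions` is an illustrative stand-in for comparing min_max_column_names against the full clustering range:
```
struct sstable_metadata {
    // True iff the position range covers all clustered rows.
    bool covers_all_positions;
};

bool may_have_partition_tombstones(const sstable_metadata& m) {
    // A partition tombstone forces the metadata range to be the full
    // range, so a narrower range proves there are none.
    return m.covers_all_positions;
}
```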
It is not legal to fast forward a reader before it enters a partition.
One must ensure that there even is a partition in the first place. For
this one must fetch a `partition_start` fragment.
Closes#7679
Fixes, features needed for testing, snapshot testing.
Free election after partitioning (replication test).
* https://github.com/alecco/scylla/tree/raft-ale-tests-05e:
raft: replication test: partitioning with leader
raft: replication test: run free election after partitioning
raft: expose fsm tick() to server for testing
raft: expose is_leader() for testing
raft: replication test: test take and load snapshot
raft: fix a bug in leader election
raft: fix default randomized timeout
raft: replication test: fix custom next leader
raft: replication test: custom next leader noop for same
raft: replication test: fix failure detector for disconnected
Introduce a `make_pk_filter` function, which given a ring position,
returns a boolean function (a filter) that given a sstable, tells
whether the sstable may contain rows with the given position.
The logic has been extracted from `filter_sstable_for_reader_by_pk`.
instead of partition_range.
It would be best to pass `partition_key` or `decorated_key` here.
However, the implementation of this function needs a `partition_range`
to pass into `sstable_set::select`, and `partition_range` must be
constructed from `ring_position`s. We could create the `ring_position`
internally from the key but that would involve a copy which we want to
avoid.
Currently, each internal page fetched during aggregation
gets a timeout based on the time the page fetch was started,
rather than the query start time. This means the query can
continue processing long after the client has abandoned it
due to its own timeout, which is based on the query start time.
Fix by establishing the timeout once when the query starts, and
not advancing it.
Test: manual (SELECT count(*) FROM a large table).
Fixes#1175.
Closes#7662
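For illustration, a minimal sketch of the fixed timeout handling; all names are illustrative:
```
#include <chrono>

using query_clock = std::chrono::steady_clock;

struct query_state {
    query_clock::time_point timeout; // fixed once, when the query starts
};

query_state start_query(std::chrono::milliseconds client_timeout) {
    return query_state{query_clock::now() + client_timeout};
}

bool page_timed_out(const query_state& qs) {
    // Every internal page checks the original deadline; it never advances.
    return query_clock::now() >= qs.timeout;
}
```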
The C and C++ sub-builds were placed in submodule_pool to
reduce concurrency, as they are memory intensive (well, at least
the C++ jobs are), and we choose build concurrency based on memory.
But the other submodules are not memory intensive, and certainly
the packaging jobs are not (and they are single-threaded too).
To allow these simple jobs to utilize multicores more efficiently,
remove them from submodule_pool so they can run in parallel.
Closes#7671
The unified package is quite large (1GB compressed), and it
is the last step in the build, so its build time cannot be
parallelized with other tasks. Compress it with pigz to take
advantage of multiple cores and speed up the build a little.
Closes#7670
We initially implemented the run() and out() functions because we couldn't use
subprocess.run() since we were on Python 3.4.
But since we moved to relocatable python3, we don't need to implement them ourselves.
The reason we kept using these functions is that we needed to set an environment
variable to set PATH. Since we recently moved that code to the python thunk, we
are finally able to drop run() and out() and switch to subprocess.run().
When partitioning without keeping the existing leader, run an election
without forcing a particular leader.
To force a leader after partitioning, a test can just set it with new_leader{X}.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
For tests to advance servers they need to invoke tick().
This is needed to advance free elections.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Through configuration trigger automatic snapshotting.
For now, handle expected log index within the test's state machine and
pass it with snapshot_value (within the test file).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
If a server responds favourably to a RequestVote RPC, it should
reset its election timer, otherwise it has a very high chance of becoming
a candidate with an even newer term, despite successful elections.
A candidate with a term larger than the leader's rejects AppendEntries
RPCs and cannot become a leader itself (because of the protection
against disruptive leaders), so it is stuck in this state.
Range after election timeout should start at +1.
This matches existing update_current_term() code adding dist(1, 2*n).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
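For illustration, a sketch of the assumed behavior (reset the timer when granting a vote, randomized range starting at +1); the state layout and timeout formula are assumptions, not the real raft::fsm:
```
#include <random>

struct follower_state {
    int election_elapsed = 0;   // ticks since last heard from a leader
    int randomized_timeout = 0; // when to become a candidate

    void on_vote_granted(std::mt19937& rng, int election_timeout) {
        election_elapsed = 0;   // granting a vote resets the timer
        // Randomized range starts at +1, matching dist(1, 2*n).
        std::uniform_int_distribution<int> dist(1, 2 * election_timeout);
        randomized_timeout = election_timeout + dist(rng);
    }
};
```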
Adjustments after changes due to free election in partitioning and changes in
the code.
Elapse previous leader after isolating it.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
scylla_rsyslog_setup adds a configuration file to rsyslog to forward the
traces to a remote server.
It will override any existing file, so it is safe to run it multiple
times.
It takes an IP, or an IP and port, from the user for that configuration. If
no port is provided, the default port of the Scylla-Monitoring promtail is
used.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Follow-up to https://github.com/scylladb/scylla/pull/6916.
- Fixes wrong usage of `resource_manager::prepare_per_device_limits`,
- Improves locking in `resource_manager` so that it is safer to call its methods concurrently,
- Adds comments around `resource_manager::register_manager` so that it's more clear what this method does and why.
Closes#7660
* github.com:scylladb/scylla:
hints/resource_manager: add comments to register_manager
hints/resource_manager: fix indentation
hints/resource_manager: improve mutual exclusion
hints/resource_manager: correct prepare_per_device_limits usage
The "ninja dist-server-tar" command is a full replacement for
"build_reloc.sh" script. We release engineering infrastructure has been
switched to ninja, so let's remove "build_reloc.sh" as obsolete.
Now that CDC is GA, it should be enabled in all the tests by default.
To achieve that, the PR adds a special db::config::add_cdc_extension()
helper which is used in cql_test_env to make sure CDC is usable in
all the tests that use cql_test_env. As a result, cdc_tests can be
simplified.
Finally, some trailing whitespaces are removed from cdc_tests.
Tests: unit(dev)
Closes#7657
* github.com:scylladb/scylla:
cdc: Remove trailing whitespaces from cdc_tests
cdc: Remove mk_cdc_test_config from tests
config: Add add_cdc_extension function for testing
cdc: Add missing includes to cdc_extension.hh
The patch which introduces build-dependent testing
has a regression: it quietly filters out all tests
which are not part of ninja output. Since ninja
doesn't build any CQL tests (including CQL-pytest),
all such tests were quietly disabled.
Fix the regression by only doing the filtering
in unit and boost test suites.
test: dev (unit), dev + --build-raft
Message-Id: <20201119224008.185250-1-kostja@scylladb.com>
Some systems (at least, Centos 7, aarch64) block the membarrier()
syscall via seccomp. This causes Scylla or unit tests to burn cpu
instead of sleeping when there is nothing to do.
Fix by instructing podman/docker not to block any syscalls. I
tested this with podman, and it appears [1] to be supported on
docker.
[1] https://docs.docker.com/engine/security/seccomp/#run-without-the-default-seccomp-profile
Closes#7661
Lower level functions such as `create_single_key_sstable_reader`
were made methods of `sstable_set`.
The motivation is that each concrete sstable_set
may decide to use a better sstable reading algorithm specific to the
data structures used by this sstable_set. For this it needs to access
the set's internals.
A nice side effect is that we moved some code out of table.cc
and database.hh which are huge files.
Files which need this definition won't have to include
mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are
in total smaller; mutation_reader.hh includes flat_mutation_reader.hh).
All the implementations were kept in sstables/compaction_strategy.cc
which is quite large even without them. `sstable_set` already had its
own header file, now it gets its own implementation file.
The declarations of implementation classes and interfaces (`sstable_set_impl`,
`bag_sstable_set`, and so on) were also exposed in a header file,
sstable_set_impl.hh, for the purposes of potential unit testing.
"
The qctx is a global object that references the query processor and
the database to let the rest of the code query the system keyspace.
As the first step of de-globalizing it -- remove the database
reference from it. After that, the qctx remains a simple
wrapper over the query processor (which is already de-globalized)
and the query processor in turn is mostly needed only to parse
the query string into prepared statement only. This, in turn,
makes it possible to remove the qctx later by parsing the
query strings on boot and carrying _them_ around, not the qctx
itself.
tests: unit(dev), dtest(simple_cluster_driver_test:dev), manual start/stop
"
* 'br-remove-database-from-qctx' of https://github.com/xemul/scylla:
query-context: Remove database from qctx
schema-tables: Use query processor reference in save_system(_keyspace)?_schema
system-keyspace: Rewrite force_blocking_flush
system-keyspace: Use cluster_name string in check_health
system-keyspace: Use db::config in setup_version
query-context: Kill global helpers
test: Use cql_test_env::execute_cql instead of qctx version
code: Use qctx::execute_cql methods, not global ones
system-keyspace: Do not call minimal_setup for the 2nd time
system-keyspace: Fix indentation after previous patch
system-keyspace: Do not do invoke_on_all by hands
system-keyspace: Remove dead code
The save_system_schema and save_system_keyspace_schema are both
called on start and can get the needed query processor reference
from arguments.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method is called after query_processor::execute_internal
to flush the cf. Encapsulating this flush inside database and
getting the database from query_processor lets us remove the
database reference from the global qctx object.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The check_health needs the global qctx to get db.config.cluster_name,
which is already available at the caller side.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the beginning of de-globalizing the global qctx thing.
The setup_version() needs the global qctx to get the config from.
It's possible to get the config from the caller instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similar to the previous patch, but for tests. Since cql_test_env
doesn't have qctx on board, the patch makes one step forward
and calls what is called by qctx::execute_cql.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are global db::execute_cql() helpers that just forward
the args into qctx::execute_cql(). The former are going away,
so patch all callers to use qctx themselves.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The system_keyspace::minimal_setup is already called by main.cc by hand,
some steps before the regular ::setup().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cache_truncation_record needs to run cf.cache_truncation_record
on each shard's DB, so invoke_on_all can be used.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This commit causes start, stop and register_manager methods of the
resource_manager to be serialized with respect to each other using the
_operation_lock.
Those functions modify internal state, so it's best if they are
protected with a semaphore. Additionally, those functions are not going
to be used frequently, therefore it's perfectly fine to protect them in
such a coarse manner.
Now, space_watchdog has a dedicated lock for serializing its on_timer
logic with resource_manager::register_manager. The reason for a separate
lock is that resource_manager::stop cannot use the same lock as the
space_watchdog - otherwise a situation could occur in which
space_watchdog waits for semaphore units held by
resource_manager::stop(), and resource_manager::stop() waits until the
space_watchdog stops its asynchronous event loop.
The resource_manager::prepare_per_device_limits function calculates disk
quota for registered hints managers, and creates an association map:
from a storage device id to those hints managers which store hints on
that device (_per_device_limits_map).
This function was used with an assumption that it is idempotent - which
is a wrong assumption. In resource_manager::register_manager, if the
resource_manager is already started, prepare_per_device_limits would be
called, and those hints managers which were previously added to the
_per_device_limits_map would be added again. This would cause the space
used by those managers to be calculated twice, which would artificially
lower the limit which we impose on the space hints are allowed to occupy
on disk.
This patch fixes this problem by changing the prepare_per_device_limits
function to operate on a hints manager passed by argument. Now, we make
sure that this function is called on each hints manager only once.
Now that CDC is GA and enabled by default, there's no longer a need
for a specific config in CDC tests.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It is now called `merging_reader`, and is used to change a `FragmentProducer`
that produces a non-decreasing stream of mutation fragment batches into
a `flat_mutation_reader` producing a non-decreasing stream of fragments.
The resulting stream of fragments is increasing except for places where
we encounter range tombstones (multiple range tombstones may be produced
with the same position_in_partition).
`merging_reader` is a simple adapter over `mutation_fragment_merger`.
The old `combined_mutation_reader` is simply a specialization of `merging_reader`
where the used `FragmentProducer` is `mutation_reader_merger`, an abstraction that
merges the output of multiple readers into one non-decreasing stream of fragment
batches.
There is no separate class for `combined_mutation_reader` now. Instead,
`make_combined_reader` works directly with `merging_reader`.
The PR also improves some comments.
Split from https://github.com/scylladb/scylla/pull/7437.
Closes#7656
* github.com:scylladb/scylla:
mutation_reader: `generalize combined_mutation_reader`
mutation_reader: fix description of mutation_fragment_merger
After the concept of seed nodes was removed, we can distinguish
whether a node is the first node in the cluster or not.
Thanks to this we can avoid adding delay to the timestamp of the first
CDC generation.
The delay is added to the timestamp to make sure that all the nodes
in the cluster manage to learn about it before the timestamp is in the past.
It is safe to not add the delay for the first node because we know it's the only node
in the cluster and no one else has to learn about the timestamp.
Fixes#7645
Tests: unit(dev)
Closes#7654
* github.com:scylladb/scylla:
cdc: Don't add delay to the timestamp of the first generation
cdc: Change for_testing to add_delay in make_new_cdc_generation
It is now called `merging_reader`, and is used to change a `FragmentProducer`
that produces a non-decreasing stream of mutation fragment batches into
a `flat_mutation_reader` producing a non-decreasing stream of fragments.
The resulting stream of fragments is increasing except for places where
we encounter range tombstones (multiple range tombstones may be produced
with the same position_in_partition).
`merging_reader` is a simple adapter over `mutation_fragment_merger`.
The old `combined_mutation_reader` is simply a specialization of `merging_reader`
where the used `FragmentProducer` is `mutation_reader_merger`, an abstraction that
merges the output of multiple readers into one non-decreasing stream of fragment
batches.
There is no separate class for `combined_mutation_reader` now. Instead,
`make_combined_reader` works directly with `merging_reader`.
"
We've recently seen failures in this unit test as follows:
```
test/boost/network_topology_strategy_test.cc(0): Entering test case "testCalculateEndpoints"
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
./seastar/src/testing/seastar_test.cc(43): last checkpoint
test/boost/network_topology_strategy_test.cc(0): Leaving test case "testCalculateEndpoints"; testing time: 15192us
test/boost/network_topology_strategy_test.cc(0): Entering test case "test_invalid_dcs"
network_topology_strategy_test: ./seastar/include/seastar/core/future.hh:634: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.
Aborting on shard 0.
```
This series fixes 2 issues in this test:
1. The core issue where std::out_of_range exception
is not handled in calculate_natural_endpoints().
2. A secondary issue where the static `snitch_inst` isn't
stopped when the first exception is hit, failing
the next time the snitch is started, as it wasn't
stopped properly.
Test: network_topology_strategy_test(release)
"
* tag 'nts_test-harden-calculate_natural_endpoints-v1' of github.com:bhalevy/scylla:
test: network_topology_strategy_test: has_sufficient_replicas: handle empty dc endpoints case
test: network_topology_strategy_test: fixup indentation
test: network_topology_strategy_test: always stop_snitch after create_snitch
After the concept of seed nodes was removed, we can distinguish
whether a node is the first node in the cluster or not.
Thanks to this we can avoid adding delay to the timestamp of the first
CDC generation.
Fixes#7645
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The meaning of the parameter changes from defining whether the function
is called in a testing environment to deciding whether a delay should be
added to the timestamp of a newly created CDC generation.
This is a preparation for the improvement in the following patch, which
does not add the delay on every node but only on non-first nodes.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Asias He reports that git on Windows filesystem is unhappy about the
colon character (":") present in dist-check files:
$ git reset --hard origin/master
error: invalid path 'tools/testing/dist-check/docker.io/centos:7.sh'
fatal: Could not reset index file to revision 'origin/master'.
Rename the script to use a dash instead.
Closes#7648
Current tests use a hash state machine that checks for a specific order of
entry application. The order is not always guaranteed, though.
Backpressure may delay some entries from being submitted, and when they are
released together they may be reordered in debug mode due to
SEASTAR_SHUFFLE_TASK_QUEUE. Introduce the ability for a test to choose the
state machine type, and implement a commutative state machine that does
not care about ordering.
To prevent the log from taking too much memory, introduce a mechanism that
limits the log to a certain size. If the size is reached no new log
entries can be submitted until previous entries are committed and
snapshotted.
If the scylla_raid_setup script is called without the --raiddev argument,
try to use any of the /dev/md[0-9] devices instead of only
/dev/md0. Do it this way because on Ubuntu 20.04
/dev/md0 is already used by the OS.
Closes#7628
gcc fails to compile current master like this
In file included from ./service/client_state.hh:44,
from ./cql3/cql_statement.hh:44,
from ./cql3/statements/prepared_statement.hh:47,
from ./cql3/statements/raw/select_statement.hh:45,
from build/dev/gen/cql3/CqlParser.hpp:64,
from build/dev/gen/cql3/CqlParser.cpp:44:
./auth/service.hh:188:21: error: declaration of ‘const auth::resource& auth::command_desc::resource’ changes meaning of ‘resource’ [-fpermissive]
188 | const resource& resource; ///< Resource impacted by this command.
| ^~~~~~~~
In file included from ./auth/authenticator.hh:57,
from ./auth/service.hh:33,
from ./service/client_state.hh:44,
from ./cql3/cql_statement.hh:44,
from ./cql3/statements/prepared_statement.hh:47,
from ./cql3/statements/raw/select_statement.hh:45,
from build/dev/gen/cql3/CqlParser.hpp:64,
from build/dev/gen/cql3/CqlParser.cpp:44:
./auth/resource.hh:98:7: note: ‘resource’ declared here as ‘class auth::resource’
98 | class resource final {
| ^~~~~~~~
clang doesn't fail
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201118155905.14447-1-xemul@scylladb.com>
If a list of target endpoints for sending view updates contains
duplicates, it results in benign (but annoying) broken promise
errors happening due to duplicated write response handlers being
instantiated for a single endpoint.
In order to avoid such errors, target remote endpoints are deduplicated
from the list of pending endpoints.
A similar issue (#5459) solved the case for duplicated local endpoints,
but that didn't solve the general case.
Fixes#7572
Closes#7641
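For illustration, a minimal sketch of the deduplication step; `endpoint` and the function name are illustrative stand-ins for Scylla's inet_address handling:
```
#include <algorithm>
#include <string>
#include <vector>

using endpoint = std::string; // stand-in for inet_address

std::vector<endpoint> dedup_targets(std::vector<endpoint> targets) {
    std::sort(targets.begin(), targets.end());
    // One write-response handler is created per endpoint, so duplicates
    // would instantiate duplicated handlers and break promises.
    targets.erase(std::unique(targets.begin(), targets.end()), targets.end());
    return targets;
}
```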
This PR allows changing the hinted_handoff_enabled option in runtime, either by modifying and reloading YAML configuration, or through HTTP API.
This PR also introduces an important change in semantics of hinted_handoff_enabled:
- Previously, hinted_handoff_enabled controlled whether _both writing and sending_ hints is allowed at all, or to particular DCs,
- Now, hinted_handoff_enabled only controls whether _writing hints_ is enabled. Sending hints from disk is now always enabled.
Fixes: #5634
Tests:
- unit(dev) for each commit of the PR
- unit(debug) for the last commit of the PR
Closes#6916
* github.com:scylladb/scylla:
api: allow changing hinted handoff configuration
storage_proxy: fix wrong return type in swagger
hints_manager: implement change_host_filter
storage_proxy: always create hints manager
config: plug in hints::host_filter object into configuration
db/hints: introduce host_filter
hints/resource_manager: allow registering managers after start
hints: introduce db::hints::directory_initializer
directories.cc: prepare for use outside main.cc
Fixes#7064
Iff broadcast address is set to ipv6 from main (meaning prefer
ipv6), determine the "public" ipv6 address (which should be
the same, but might not be), via aws metadata query.
Closes#7633
available_memory is used to seed many caches and controllers. Usually
it's detected from the environment, but unit tests configure it
on their own with fake values. If they forget, then the undefined
behavior sanitizer will kick in in random places (see 8aa842614a
("test: gossip_test: configure database memory allocation correctly")
for an example).
Prevent this early by asserting that available_memory is nonzero.
Closes#7612
std::iterator has been deprecated since C++17, so define all the required iterator_traits directly and stop using std::iterator at all.
More context: https://www.fluentcpp.com/2018/05/08/std-iterator-deprecated
Tests: unit(dev)
Closes#7635
* github.com:scylladb/scylla:
log_heap: Remove std::iterator from hist_iterator
types: Remove std::iterator from tuple_deserializing_iterator
types: Remove std::iterator from listlike_partial_deserializing_iterator
sstables: remove std::iterator from const_iterator
token_metadata: Remove std::iterator from tokens_iterator
size_estimates_virtual_reader: Remove std::iterator
token_metadata: Remove std::iterator from tokens_iterator_impl
counters: Remove std::iterator from iterators
compound_compat: Remove std::iterator from iterators
compound: Remove std::iterator from iterator
clustering_interval_set: Remove std::iterator from position_range_iterator
cdc: Remove std::iterator from collection_iterator
cartesian_product: Remove std::iterator from iterator
bytes_ostream: Remove std::iterator from fragment_iterator
We saw this intermittent failure in testCalculateEndpoints:
```
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
```
It turns out that there are no endpoints associated with the dc passed
to has_sufficient_replicas in the `all_endpoints` map.
Handle this case by returning true.
The dc is still required to appear in `dc_replicas`,
so if it's not found there, fail the test gracefully.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently stop_snitch is not called if the test fails on exception.
This causes a failure in create_snitch where snitch_inst fails to start
since it wasn't stopped earlier.
For example:
```
test/boost/network_topology_strategy_test.cc(0): Entering test case "testCalculateEndpoints"
unknown location(0): fatal error: in "testCalculateEndpoints": std::out_of_range: _Map_base::at
./seastar/src/testing/seastar_test.cc(43): last checkpoint
test/boost/network_topology_strategy_test.cc(0): Leaving test case "testCalculateEndpoints"; testing time: 15192us
test/boost/network_topology_strategy_test.cc(0): Entering test case "test_invalid_dcs"
network_topology_strategy_test: ./seastar/include/seastar/core/future.hh:634: void seastar::future_state<seastar::internal::monostate>::set(A &&...) [T = seastar::internal::monostate, A = <>]: Assertion `_u.st == state::future' failed.
Aborting on shard 0.
Backtrace:
0x0000000002825e94
0x000000000282ffa9
0x00007fd065f971df
/lib64/libc.so.6+0x000000000003dbc4
/lib64/libc.so.6+0x00000000000268a3
/lib64/libc.so.6+0x0000000000026788
/lib64/libc.so.6+0x0000000000035fc5
0x0000000000b484cf
0x0000000002a7c69f
0x0000000002a7c62f
0x0000000000b47b9e
0x0000000002595da2
0x0000000002595913
0x0000000002a83a31
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
scoped_critical_alloc_section was recently introduced to replace
disable_failure_guard and made the old class deprecated.
This patch replaces all occurrences of disable_failure_guard with
scoped_critical_alloc_section.
Without this patch the build prints many warnings like:
warning: 'disable_failure_guard' is deprecated: Use scoped_critical_section instead [-Wdeprecated-declarations]
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <ca2a91aaf48b0f6ed762a6aa687e6ac5e936355d.1605621284.git.piotr@scylladb.com>
As requested in #7057, allow certain alterations of system_auth tables. Potentially destructive alterations are still rejected.
Tests: unit (dev)
Closes#7606
* github.com:scylladb/scylla:
auth: Permit ALTER options on system_auth tables
auth: Add command_desc
auth: Add tests for resource protections
And since now there is no danger of them filling the logs, the log-level
is promoted to info, so users can see the diagnostics messages by
default.
The rate-limit chosen is 1/30s.
Refs: #7398
Tests: manual
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201117091253.238739-1-bdenes@scylladb.com>
This commit makes it possible to change hints manager's configuration at
runtime through HTTP API.
To preserve backwards compatibility, we keep the old behavior of not
creating and checking hints directories if they are not enabled at
startup. Instead, hint directories are lazily initialized when hints are
enabled for the first time through HTTP API.
The GET `hinted_handoff_enabled_by_dc` endpoint had an incorrect return
type specified. Although it does not have an implementation, yet, it was
supposed to return a list of strings with DC names for which generating
hints is enabled - not a list of string pairs. Such a return type is
expected by the JMX.
Implements a function which is responsible for changing hints manager
configuration while it is running.
It first starts new endpoint managers for endpoints which weren't
allowed by previous filter but are now, and then stops endpoint managers
which are rejected by the new filter.
The function is blocking and waits until all relevant ep managers are
started or stopped.
Now, the hints manager object for regular hints is always created, even
if hints are disabled in configuration. Please note that the behavior of
hints will be unchanged - no hints will be sent when they are disabled.
The intent of this change is to make enabling and disabling hints in
runtime easier to implement.
Uses db::hints::host_filter as the type of hinted_handoff_enabled
configuration option.
Previously, hinted_handoff_enabled used to be a string option, and it
was parsed later in a separate function during startup. The function
returned a std::optional<std::unordered_set<sstring>>, whose meaning in
the context of hints is rather enigmatic for an observer not familiar
with hints.
Now, hinted_handoff_enabled has type of db::hints::host_filter, and it
is plugged into the config parsing framework, so there is no need for
later post-processing.
Adds a db::hints::host_filter structure, which determines if generating
hints towards a given target is currently allowed. It supports
serialization and deserialization between the hinted_handoff_enabled
configuration/cli option.
This patch only introduces this structure, but does not make other code
use it. It will be plugged into the configuration architecture in the
following commits.
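For illustration, a self-contained sketch of what such a filter might look like; the real db::hints::host_filter differs in detail:
```
#include <string>
#include <unordered_set>

class host_filter {
    enum class mode { all, none, dc_list } _mode = mode::all;
    std::unordered_set<std::string> _dcs; // allowed DCs in dc_list mode
public:
    static host_filter all()  { return {}; }
    static host_filter none() {
        host_filter f; f._mode = mode::none; return f;
    }
    static host_filter dcs(std::unordered_set<std::string> dcs) {
        host_filter f; f._mode = mode::dc_list; f._dcs = std::move(dcs);
        return f;
    }
    // Is generating hints towards a node in this DC currently allowed?
    bool can_hint_for(const std::string& dc) const {
        switch (_mode) {
        case mode::all:     return true;
        case mode::none:    return false;
        case mode::dc_list: return _dcs.count(dc) > 0;
        }
        return false;
    }
};
```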
This change modifies db::hints::resource_manager so that it is now
possible to add hints::managers after it was started.
This change will make it possible to register the regular hints manager
later in runtime, if it wasn't enabled at boot time.
Introduces a db::hints::directory_initializer object, which encapsulates
the logic of initializing directories for hints (creating/validating
directories, segment rebalancing). It will be useful for lazy
initialization of hints manager.
Currently, the `directories` class is used exclusively during
initialization, in the main() function. This commit refactors this class
so that it is possible to use it to initialize directories much later
after startup.
The intent of this change is to make it possible for hints manager to
create directories for hints lazily. Currently, when Scylla is booted
with hinted handoff disabled, the `hints_directory` config parameter is
ignored and directories for hints are neither created nor verified.
Because we would like to preserve this behavior and introduce
possibility to switch hinted handoff on in runtime, the hints
directories will have to be created lazily the first time hinted handoff
is enabled.
* seastar 043ecec7...c861dbfb (3):
> Merge "memory: allow configuring when to dump memory diagnostics on allocation failures" from Botond
> perftune.py: support kvm-clock on tune-clock
> execution_stage: inheriting_concrete_execution_stage: add get_stats()
These alterations cannot break the database irreparably, so allow
them.
Expand command_desc as required.
Add a type (rather than command_desc) parameter to
has_column_family_access() to minimize code changes.
Fixes#7057
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When a node bootstraps or upgrades from a pre-CDC version, it creates a
new CDC generation, writes it to a distributed table
(system_distributed.cdc_generation_descriptions), and starts gossiping
its timestamp. When other nodes see the timestamp being gossiped, they
retrieve the generation from the table.
The bootstrapping/upgrading node therefore assumes that the generation
is made durable and other nodes will be able to retrieve it from the
table. This assumption could be invalidated if periodic commitlog mode
was used: replicas would acknowledge the write and then immediately
crash, losing the write if they were unlucky (i.e. commitlog wasn't
synced to disk before the write was acknowledged).
This commit forces all writes to the generations table to be
synced to commitlog immediately. It does not matter for performance as
these writes are very rare.
Fixes https://github.com/scylladb/scylla/issues/7610.
Closes#7619
An entry can be snapshotted before the outgoing message is sent, so the
message has to hold on to it to avoid use-after-free.
Message-Id: <20201116113323.GA1024423@scylladb.com>
Materialized view updates participate in a retirement program,
which makes sure that they are immediately taken down once their
target node is down, without having to wait for timeout (since
views are a background operation and it's wasteful to wait in the
background for minutes). However, this mechanism has very delicate
lifetime issues, and it already caused problems more than once,
most recently in #5459.
In order to make another bug in this area less likely, the two
implementations of the mechanism, in on_down() and drain_on_shutdown(),
are unified.
Possibly refs #7572
Closes#7624
Commit e5be3352cf ("database, streaming, messaging: drop
streaming memtables") removed streaming memtables; this removes
the mechanisms to synchronize them: _streaming_flush_gate and
_streaming_flush_phaser. The memory manager for streaming is removed,
and its 10% reserve is evenly distributed between memtables and
general use (e.g. cache).
Note that _streaming_flush_phaser and _streaming_flush_gate are
no longer used to synchronize anything - the gate is only used
to protect the phaser, and the phaser isn't used for anything.
Closes#7454
The DEBIAN_FRONTEND environment variable was added just to prevent opening a
dialog when running 'apt-get install mdadm'; no other program depends on it.
So we can move it inside apt_install()/apt_uninstall() and drop scylla_env,
since we don't have any other environment variables.
To pass the variable, an env argument was added to run()/out().
"
This is a follow-up on 052a8d036d
"Avoid stalls in token_metadata and replication strategy"
The added mutate_token_metadata helper combines:
- with_token_metadata_lock
- get_mutable_token_metadata_ptr
- replicate_to_all_cores
Test: unit(dev)
"
* tag 'mutate_token_metadata-v1' of github.com:bhalevy/scylla:
storage_service: fixup indentation
storage_service: mutate_token_metadata: do replicate_to_all_cores
storage_service: add mutate_token_metadata helper
Replicate the mutated token_metadata to all cores on success.
This moves replication out of update_pending_ranges(mutable_token_metadata_ptr, sstring),
so add explicit call to replicate_to_all_cores where it is called outside
of mutate_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Replace a repeating pattern of:
with_token_metadata_lock([] {
return get_mutable_token_metadata_ptr([] (mutable_token_metadata_ptr tmptr) {
// mutate token_metadata via tmptr
});
});
With a call to mutate_token_metadata that does both
and calls the function with the mutable_token_metadata_ptr.
A following patch will also move the replication to all
cores to mutate_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
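A minimal sketch of the combined helper, assuming the member names used in
this series (with_token_metadata_lock, get_mutable_token_metadata_ptr,
replicate_to_all_cores); not the exact Scylla code:
```
future<> storage_service::mutate_token_metadata(std::function<future<> (mutable_token_metadata_ptr)> func) {
    return with_token_metadata_lock([this, func = std::move(func)] () mutable {
        // clone the current token_metadata out of line, under the lock
        return get_mutable_token_metadata_ptr().then([this, func = std::move(func)] (mutable_token_metadata_ptr tmptr) mutable {
            return func(tmptr).then([this, tmptr] () mutable {
                // publish the mutated copy on all shards before releasing the lock
                return replicate_to_all_cores(std::move(tmptr));
            });
        });
    });
}
```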
The "dist" target fails as follows:
$ ./tools/toolchain/dbuild ninja dist
ninja: error: 'build/dev/scylla-unified-package-..tar.gz', needed by 'dist-unified-tar', missing and no known rule to make it
Fix two issues:
- Fix Python variable references to "scylla_version" and
"scylla_release", broken by commit bec0c15ee9 ("configure.py: Add
version to unified tarball filename"). The breakage went unnoticed
because ninja default target does not call into dist...
- Remove the dependency on build/<mode>/scylla-unified-package.tar.gz. The
file is now in the build/<mode>/dist/tar/ directory and contains the version
and release in the filename.
Message-Id: <20201113110706.150533-1-penberg@scylladb.com>
To test handling of connectivity issues and recovery add support for
disconnecting servers.
This is not full partitioning yet, as it doesn't allow connectivity
across the disconnected servers (having multiple active partitions).
* https://github.com/alecco/scylla/pull/new/raft-ale-partition-simple-v3:
raft: replication test: connectivity partitioning support
raft: replication test: block rpc calls to disconnected servers
raft: replication test: add is_disconnected helper
raft: replication test: rename global variable
raft: replication test: relocate global connection state map
We currently keep a copy of scylla-package.tar.gz in "build/<mode>" for
compatibility. However, we've long since switched our CI system over to
the new location, so let's remove the duplicate and use the one from
"build/<mode>/dist/tar" instead.
Message-Id: <20201113075146.67265-1-penberg@scylladb.com>
Add to the DynamoDB compatibility document, docs/alternator/compatibility.md,
a mention that Alternator streams are still an experimental feature, and
how to turn it on (at this point CDC is no longer an experimental feature,
but Alternator Streams are).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112184436.940497-1-nyh@scylladb.com>
Drop the adjective "experimental" used to describe Alternator in
docs/alternator/getting-started.md.
In Scylla, the word "experimental" carries a specific meaning (no support
for upgrades, not enough QA, not ready for general use), and Alternator is
no longer experimental in that sense.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112185249.941484-1-nyh@scylladb.com>
This adds some test cases for ALTER KEYSPACE:
- ALTER KEYSPACE happy path
- ALTER KEYSPACE with invalid options
- ALTER KEYSPACE for non-existing keyspace
- CREATE and ALTER KEYSPACE using NetworkTopologyStrategy with
non-existing data center in configuration, which triggers a bug in
Scylla:
https://github.com/scylladb/scylla/issues/7595
Message-Id: <20201112073110.39475-1-penberg@scylladb.com>
Introduce partition update command consisting of nodes still seeing
each other. Nodes not included are disconnected from everything else.
If the previous leader is not part of the new partition, the first node
specified in the partition will become leader.
For other nodes to accept a new leader it has to have a committed log.
For example, if the desired leader is being re-connected and it missed
entries other nodes saw, it will not win the election. Example with A, B, C:
partition{A,C},entries{2},partition{B,C}
In this case node C won't accept B as the new leader, as B is missing 2
entries.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
In main.cc, we spawn a future which starts the hints manager, but we
don't wait for it to complete. This can have the following consequences:
- The hints manager does some asynchronous operations during startup,
so it can take some time to start. If it is started after we start
handling requests, and we admit some requests which would result in
hints being generated, those hints will be dropped instead because we
check if hints manager is started before writing them.
- Initialization of hints manager may fail, and Scylla won't be stopped
because of it (e.g. we don't have permissions to create hints
directories). The consequence of this is that hints manager won't be
started, and hints will be dropped instead of being written. This may
affect both regular hints manager, and the view hints manager.
This commit makes us wait until the hints manager starts and check whether
there were any errors during initialization.
Fixes#7598
Closes#7599
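Schematically, the change looks like this (a hedged sketch; the names are
illustrative of the pattern, not the exact main.cc code):
```
// Before: the startup future was discarded, so failures went unnoticed:
//   (void)start_hints_manager();
// After: keep the future and wait on it (inside main()'s seastar::async
// context) before admitting requests, so any initialization error,
// e.g. missing permissions for the hints directories, aborts startup.
auto hints_started = start_hints_manager();
// ... other initialization ...
hints_started.get();
```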
CDC is ready to be a non-experimental feature so remove the experimental flag for it.
Also, guard Alternator Streams with their own experimental flag. Previously, they were using CDC experimental flag as they depend on CDC.
Tests: unit(dev)
Closes#7539
* github.com:scylladb/scylla:
alternator: guard streams with an experimental flag
Mark CDC as GA
cdc: Make it possible for CDC generation creation to fail
Add new alternator-streams experimental flag for
alternator streams control.
CDC becomes GA and won't be guarded by an experimental flag any more.
Alternator Streams stay experimental so now they need to be controlled
by their own experimental flag.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The following patch enables CDC by default, which means CDC has to work
with all clusters now.
There is a problematic case when existing cluster with no CDC support
is stopped, all the binaries are updated to newer version with
CDC enabled by default. In such case, nodes know that they are already
members of the cluster but they can't find any CDC generation so they
will try to create one. This creation may fail due to lack of QUORUM
for the write.
Before this patch such situation would lead to node failing to start.
After the change, the node will start but CDC generation will be
missing. This will mean CDC won't be able to work on such cluster before
nodetool checkAndRepairCdcStreams is run to fix the CDC generation.
We still fail to bootstrap if the creation of CDC generation fails.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
When a missing base column happens to be named `idx_token`,
an additional helper message is printed in logs.
This additional message does not need to have `error` severity,
since the previous, generic message is already marked as `error`.
This patch simply makes it easier to write tests, because in case
this error is expected, only one message needs to be explicitly
ignored instead of two.
Closes#7597
This miniseries adds metrics which can help the users detect potential overloads:
* due to having too many in-flight hints
* due to exceeding the capacity of the read admission queue, on replica side
Closes#7584
* github.com:scylladb/scylla:
reader_concurrency_semaphore: add metrics for shed reads
storage_proxy: add metrics for too many in-flight hints failures
If the interposer consumer is enabled, partition filtering will be done by the
consumer instead, but that's not possible because only the producer is able
to skip to the next partition if the current one is filtered out, so scylla
crashes when that happens with a bad function call in queue_reader.
This is a regression which started here: 55a8b6e3c9
To fix this problem, let's make sure that partition filtering will only
happen on the producer side.
Fixes#7590.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201111221513.312283-1-raphaelsc@scylladb.com>
"
This series is a rebased version of 3 patchsets that were sent
separately before:
1. [PATCH v4 00/17] Cleanup storage_service::update_pending_ranges et al.
This patchset cleans up service/storage_service's use of
update_pending_ranges and replicate_to_all_cores.
It also moves some functionality from gossiping_property_file_snitch::reload_configuration
into a new method - storage_service::update_topology.
This prepares storage_service for using a shared ptr to token_metadata,
updating a copy out of line under a semaphore that serializes writers,
and eventually replicating the updated copy to all shards and releasing
the lock. This is a follow up to #7044.
2. [PATCH v8 00/20] token_metadata versioned shared ptr
Rather than keeping references on token_metadata use a shared_token_metadata
containing a lw_shared_ptr<token_metadata> (a.k.a token_metadata_ptr)
to keep track of the token_metadata.
Get token_metadata_ptr for a read-only snapshot of the token_metadata
or clone one for a mutable snapshot that is later used to safely update
the base versioned_shared_object.
token_metadata_ptr is used to modify token_metadata out of line, possibly with
multiple calls, that could be preempted in-between, so that readers can keep a consistent
snapshot of it while writers prepare an updated version.
Introduce a token_metadata_lock used to serialize mutators of token_metadata_ptr.
It's taken by the storage_service before cloning token_metadata_ptr and held
until the updated copy is replicated on all shards.
In addition, this series introduces the token_metadata::clone_async() method
to copy the token_metadata class using an asynchronous function with
continuations to avoid reactor stalls as seen in #7220.
Fixes#7044
3. [PATCH v3 00/17] Avoid stalls in token_metadata and replication strategy
This series uses the shared_token_metadata infrastructure.
First patches in the series deal with cloning token_metadata
using continuations to allow preemption while cloning (See #7220).
Then, the rest of the series makes sure to always run
`update_pending_ranges` and `calculate_pending_ranges_for_*` in a thread,
it then adds a `can_yield` parameter to the token_metadata and abstract_replication_strategy
`get_pending_ranges` and friends, and finally it adds `maybe_yield` calls
in potentially long loops.
Fixes#7313
Fixes#7220
Test: unit (dev)
Dtest: gating(dev)
"
* tag 'replication_strategy_can_yield-v4' of github.com:bhalevy/scylla: (54 commits)
token_metadata_impl: set_pending_ranges: add can_yield_param
abstract_replication_strategy: get rid of get_ranges_in_thread
repair: call get_ranges_in_thread where possible
abstract_replication_strategy: add can_yield param to get_pending_ranges and friends
abstract_replication_strategy: define can_yield bool_class
token_metadata_impl: calculate_pending_ranges_for_* reindent
token_metadata_impl: calculate_pending_ranges_for_* pass new_pending_ranges by ref
token_metadata_impl: calculate_pending_ranges_for_* call in thread
token_metadata: update_pending_ranges: create seastar thread
abstract_replication_strategy: add get_address_ranges method for specific endpoint
token_metadata_impl: clone_after_all_left: sort tokens only once
token_metadata: futurize clone_after_all_left
token_metadata: futurize clone_only_token_map
token_metadata: use mutable_token_metadata_ptr in calculate_pending_ranges_for_*
repair: replace_with_repair: use token_metadata::clone_async
storage_service: reindent token_metadata blocks
token_metadata: add clone_async
abstract_replication_strategy: accept a token_metadata_ptr in get_pending_address_ranges methods
abstract_replication_strategy: accept a token_metadata_ptr in get_ranges methods
boot_strapper: get_*_tokens: use token_metadata_ptr
...
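As an aside, the `maybe_yield` pattern referenced above can be sketched as
follows (hypothetical per-token work, assuming the loop runs inside a
seastar::thread):
```
#include <seastar/core/thread.hh>

void calculate_in_thread(const std::vector<dht::token>& tokens) {
    for (const auto& t : tokens) {
        process_token(t);               // synchronous per-token work
        seastar::thread::maybe_yield(); // preempt if we have run too long
    }
}
```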
Add a test that better clarifies what StartingSequenceNumber returned by
DescribeStream really guarantees (this question was raised in a review
of a different patch). The main thing we can guarantee is that reading a
shard from that position returns all the information in that shard -
similar to TRIM_HORIZON. This test verifies this, and it passes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201112081250.862119-1-nyh@scylladb.com>
When the admission queue capacity reaches its limits, excessive
reads are shed in order to avoid overload. Each such operation
now bumps the metrics, which can help the user judge if a replica
is overloaded.
This change adds metrics for counting request message types
listed in the CQL v.4 spec under section 4.1
(https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v4.spec).
To organize things properly, we introduce a new cql_server::transport_stats
object type for aggregating the message and server statistics.
Fixes#4888
Closes#7574
Rewrite in a more readable way that will later allow us to split the WHERE expression in two: a storage-reading part and a post-read filtering part.
Tests: unit (dev,debug)
Closes#7591
* github.com:scylladb/scylla:
cql3: Rewrite need_filtering() from scratch
cql3: Store index info in statement_restrictions
The name of the utility function test_object_name() is confusing - by
starting with the word "test", pytest can think (if it's imported to the
top-level namespace) that it is a test... So this patch gives it a better
name - unique_name().
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201111140638.809189-1-nyh@scylladb.com>
This patch adds a new document, docs/alternator/compatibility.md,
which focuses on what users switching from DynamoDB to Alternator
need to know about where Alternator differs from DynamoDB and which
features are missing.
The compatibility information in the old alternator.md is not deleted
yet. It probably should be.
Fixes#7556
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110180242.716295-1-nyh@scylladb.com>
* seastar a62a80ba1d...043ecec732 (8):
> semaphore: make_expiry_handler: explicitly use this lambda capture
> configure: add --{enable,disable}-debug-shared-ptr option
> cmake: add SEASTAR_DEBUG_SHARED_PTR also in dev mode
> tls_test: Update the certificates to use sha256
> logger: allow applying a rate-limit to log messages
> Merge "Handle CPUs not attached to any NUMA nodes" from Pavel E
> memory: fix malloc_usable_size() during early initialization
> Merge "make semaphore related functions noexcept" from Benny
Makes it easier to understand, in preparation for separating the WHERE
expression into filtering and storage-reading parts.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
To rewrite need_filtering() in a more readable way, we need to store
info on found indexes in statement_restrictions data members.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The functions can be simplified as they are all now being called
from a seastar thread.
Make them sequential, returning void, and yielding if necessary.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Some of the callers of get_address_ranges are interested in the ranges
of a specific endpoint.
Rather than building a map for all endpoints and then traversing
it looking for this specific endpoint, build a multimap of token ranges
relating only to the specified endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the sorted tokens are copied needlessly on this path
by `clone_only_token_map` and then recalculated after calling
remove_endpoint for each leaving endpoint.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Call the futurized clone_only_token_map and
remove the _leaving_endpoints from the cloned token_metadata_impl.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Does part of clone_async() using continuations to prevent stalls.
Rename the synchronous variant to clone_only_token_map_sync;
it is going to be deprecated once all its users are futurized.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Replacing old code using lw_shared_ptr<token_metadata> with the "modern"
mutable_token_metadata_ptr alias.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Clone the input token_metadata asynchronously using
clone_async() before modifying it using update_normal_tokens.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Many code blocks using with_token_metadata_lock
and get_mutable_token_metadata_ptr now need re-indenting.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Only replace_with_repair needs to clone the token_metadata
and update the local copy, so we can safely pass a read-only
snapshot of the token_metadata rather than copying it in all cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Perform replication in 2 phases.
First phase just clones the mutable_token_metadata_ptr on all shards.
Second phase applies the cloned copies onto each local_ss._shared_token_metadata.
That phase should never fail.
To add suspenders over the belt: in the impossible case that we do get an
exception, it is logged and we abort.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
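A hedged sketch of the two-phase scheme (illustrative names, not the exact
Scylla code):
```
future<> replicate_to_all_cores(mutable_token_metadata_ptr tmptr) {
    auto pending = std::make_unique<std::vector<mutable_token_metadata_ptr>>(smp::count);
    auto& vec = *pending;
    // Phase 1: clone on every shard; cloning allocates, so it may fail.
    return smp::invoke_on_all([&vec, &tm = *tmptr] {
        return tm.clone_async().then([&vec] (token_metadata clone) {
            vec[this_shard_id()] = make_mutable_token_metadata_ptr(std::move(clone));
        });
    }).then([&vec] {
        // Phase 2: swap the clones in. This should never fail; if it
        // somehow does, we log and abort rather than run with
        // inconsistent copies.
        return smp::invoke_on_all([&vec] () noexcept {
            set_local_token_metadata(std::move(vec[this_shard_id()]));
        });
    }).finally([pending = std::move(pending)] {});
}
```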
Clone _token_metadata for updating into _updated_token_metadata
and use it to update the local token_metadata on all shards via
do_update_pending_ranges().
Adjust get_token_metadata to return either the updated_token_metadata,
if available, or the base token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than using `serialized_action`, grab a lock before mutating
_token_metadata and hold it until it's replicated to all shards.
A following patch will use a mutable token_metadata_ptr
that is updated out of line under the lock.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the replication to other shards happens later in `prepare_to_join`
that is called in `init_server`.
We should isolate the changes made by init_server and replicate them
to all shards first, so that we can serialize them easily using a lock
and a mutable_token_metadata_ptr, otherwise the lock and the mutable_token_metadata_ptr
will have to be handed over (from this call path) to `prepare_to_join`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
get() the latest token_metadata_ptr from the
shared_token_metadata before each use.
Expose get_token_metadata_ptr() rather than get_token_metadata()
so that callers can keep it across continuations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To facilitate that, keep a const shared_token_metadata& in class database
rather than a const token_metadata&
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In preparation for changing network_topology_strategy to
accept a const shared_token_metadata& rather than token_metadata&.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than accessing abstract_replication_strategy::_token_metadata directly.
In preparation to changing it to a shared_token_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And use it to get a token_metadata& compatible
with current usage, until the services are converted to
use token_metadata_ptr.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that `replicate_tm_only` doesn't throw, we handle all errors
in `replicate_tm_only().handle_exception`.
We can't just proceed with business as usual if we failed to replicate
token_metadata on all shards and continue working with inconsistent
copies.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And with that, mark do_replicate_to_all_cores as noexcept too.
The motivation is to catch all errors in replicate_tm_only
by calling on_internal_error in the `handle_exception` continuation
in do_replicate_to_all_cores.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than calling invalidate_cached_rings and update_topology
on all shards do that only on shard 0 and then replicate
to all other shards using replicate_to_all_cores as we do
in all other places that modify token_metadata.
Do this in preparation to using a token_metadata_ptr
with which updating of token_metadata is done on a cloned
copy (serialized under a lock) that becomes visible only when
applied with replicate_to_all_cores.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the functionality from gossiping_property_file_snitch::reload_configuration
to the storage_service class.
With that we can make get_mutable_token_metadata private.
TODO: update token_metadata on shard 0 and then
replicate_to_all_cores rather than updating on all shards
in parallel.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
keyspace_changed just calls update_pending_ranges (ignoring any
errors returned from it), so invoke it on shard 0; with that,
update_pending_ranges() is always called on shard 0
and doesn't need to use `invoke_on` shard 0 by itself.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We need to assert in only 2 places:
do_update_pending_ranges, that updates token metadata,
and replicate_tm_only, that copies the token metadata
to all other shards.
Currently we throw errors if this is violated
but it should never happen and it's not really recoverable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently update_pending_ranges involves 2 serialized actions:
do_update_pending_ranges, and then replicate_to_all_cores.
These can be combined by calling do_replicate_to_all_cores
directly from do_update_pending_ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It was introduced in 74b4035611
as part of the fix for #3203.
However, the reactor stalls have nothing to do with gossip
waiting for update_pending_ranges - they are related to it being
synchronous and quadratic in the number of tokens
(e.g. get_address_ranges calling calculate_natural_endpoints
for every token then simple_strategy::calculate_natural_endpoints
calling get_endpoint for every token)
There is nothing special in handle_state_leaving that requires
moving update_pending_ranges to the background, we call
update_pending_ranges in many other places and wait for it
so if gossip loop waiting on it was a real problem, then it'd
be evident in many other places.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently _update_pending_ranges_action is called only on shard 0
and only later update_pending_ranges() updates shard 0 again and replicates
the result to all shards.
There is no need to wait between the two, and call _update_pending_ranges_action
again, so just call update_pending_ranges() in the first place.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
so that the updated host_id (on shard 0) will get replicated to all shards
via update_pending_ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It's already done by each handle_state_* function,
either by directly calling replicate_to_all_cores or indirectly, via
update_pending_ranges.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the updates to token_metadata are immediately visible
on shard 0, but not to other shards until replicate_to_all_cores
syncs them.
To prepare for converting to using shared token_metadata.
In the new world the updated token_metadata is not visible
until committed to the shared_token_metadata, so
commit it here and replicate to all other shards.
It is not clear this isn't needed presently too.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the consumer happens to check the EOS flag before it hits the
exception injected by the abort (by calling fill_buffer()), they can
think the stream ended normally and expect it to be valid. However this
is not guaranteed when the reader is aborted. To avoid consumers falsely
thinking the stream ended normally, don't set the EOS flag on abort at
all.
Additionally make sure the producer is aborted too on abort. In theory
this is not needed as they are the one initiating the abort, but better
to be safe than sorry.
Fixes: #7411
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20201102100732.35132-1-bdenes@scylladb.com>
"
Today, if scylla crashes mid-way in sstable::idempotent-move-sstable
or sstable::create_links we may end up in an inconsistent state
where it refuses to restart due to the presence of the moved-
sstable component files in both the staging directory and
main directory.
This series hardens scylla against this scenario by:
1. Improving sstable::create_links to identify the replay condition
and support it.
2. Modifying the algorithm for moving sstables between directories
to never be in a state where we have two valid sstables with the
same generation, in both the source and destination directories.
Instead, it uses the temporary TOC file as a marker for rolling
backwards or forwards, and renames it atomically from the
destination directory back to the source directory as a commit
point. Before which it is preparing the sstable in the destination
dir, and after which it starts the process of deleting the sstable
in the source dir.
Fixes#7429
Refs #5714
"
* tag 'idempotent-move-sstable-v3' of github.com:bhalevy/scylla:
sstable: create_links: support for move
sstable_directory: support sstables with both TemporaryTOC and TOC
sstable: create_links: move automatic sstring variables
sstable: create_links: use captured comps
sstable: create_links: capture dir by reference
sstable: create_links: fix indentation
sstable: create_links: no need to roll-back on failure anymore
sstable: create_links: support idempotent replay
sstable: create_links: cleanup style
sstable: create_links: add debug/trace logging
sstable: move_to_new_dir: rm TOC last
sstable: move_to_new_dir: io check remove calls
test: add sstable_move_test
Previously, test/cql-pytest/run was a Python script, while
test/cql-pytest/run-cassandra (to run the tests against Cassandra)
was still a shell script - modeled after test/alternator/run.
This patch rewrites run-cassandra in Python.
A lot of the same code is needed for both run and run-cassandra
tools. test/cql-pytest/run was already written in a way that this
common code was separate functions. For example, functions to start a
server in a temporary directory, to check when it finishes booting,
and to clean up at the end. This patch moves this common code to
a new file, "run.py" - and the tools "run" and "cassandra-run" are
very short programs which mostly use functions from run.py (run-cassandra
also has some unique code to run Cassandra, that no other test runner
will need).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110215210.741753-1-nyh@scylladb.com>
We currently set PATH for relocatable CLI tools in scylla_util.run() and
scylla_util.out(), but this doesn't work for perftune.py, since it's not part
of Scylla and does not use the scylla_util module.
We can set PATH in the python thunk instead, which sets PATH for all python scripts.
Fixes#7350
After full cluster shutdown, the node which is being replaced will not have its
STATUS set to NORMAL (bug #6088), so listeners will not update _token_metadata.
The bootstrap procedure of replacing node has a workaround for this
and calls update_normal_tokens() on token metadata on behalf of the
replaced node based on just its TOKENS state obtained in the shadow
round.
It does this only for the replacing_a_node_with_same_ip case, but not
for replacing_a_node_with_diff_ip. As a result, replacing the node
with the same ip after full cluster shutdown fails.
We can always call update_normal_tokens(). If the cluster didn't
crash, token_metadata would get the tokens.
Fixes#4325
Message-Id: <1604675972-9398-1-git-send-email-tgrabiec@scylladb.com>
This patch introduces a new way to do functional testing on Scylla,
similar to Alternator's test/alternator but for the CQL API:
The new tests, in test/cql-pytest, are written in Python (using the pytest
framework), and use the standard Python CQL driver to connect to any CQL
implementation - be it Scylla, Cassandra, Amazon Keyspaces, or whatever.
The use of standard CQL allows the test developer to easily run the same
test against both Scylla and Cassandra, to confirm that the behaviour that
our test expects from Scylla is really the "correct" (meaning Cassandra-
compatible) behavior.
A developer can run Scylla or Cassandra manually, and run "pytest"
to connect to them (see README.md for more instructions). But even more
usefully, this patch also provides two scripts: test/cql-pytest/run and
test/cql-pytest/run-cassandra. These scripts automate the task of running
Scylla or Cassandra (respectively) in a random IP address and temporary
directory, and running the tests against it.
The script test/cql-pytest/run is inspired by the existing test run
scripts of Alternator and Redis, but rewritten in Python in a way that
will make it easy to rewrite - in a future patch - all these other run
scripts to use the same common code to safely run a test server in a
temporary directory.
"run" is extremely quick, taking around two seconds to boot Scylla.
"run-cassandra" is slower, taking 13 seconds to boot Cassandra (maybe
this can be improved in the future, I still don't know how).
The tests themselves take milliseconds.
Although the 'run' script runs a single Scylla node, the developer
can also bring up any size of Scylla or Cassandra cluster manually
and run the tests (with "pytest") against this cluster.
This new test framework differs from the existing alternatives in the
following ways:
dtest: dtest focuses on testing correctness of *distributed* behavior,
involving clusters of multiple nodes and often cluster changes
during the test. In contrast, cql-pytest focuses on testing the
*functionality* of a large number of small CQL features - which
can usually be tested on a single-node cluster.
Additionally, dtest is out-of-tree, while cql-pytest is in-tree,
making it much easier to add or change tests together with code
patches.
Finally, dtest tests are notoriously slow. Hundreds of tests in
the new framework can finish faster than a single dtest.
Slow and out-of-tree tests are difficult to write, and I believe
this explains why no developer loves writing dtests and maintainers
do not insist on having them. I hope cql-pytest can change that.
test/cql: The defining difference between the existing test/cql suite
and the new test/cql-pytest is that the new framework is programmatic
Python code, not a text file with desired output. Tests written as
code allow things like looping, repeating the same test with different
parameters. Also, when a test fails, it makes it easier to understand
why it failed beyond just the fact that the output changed.
Moreover, in some cases, the output changes benignly and cql-pytest
may check just the desired features of the output.
Beyond this, the current version of test/cql cannot run against
Cassandra. test/cql-pytest can.
The primary motivation for this new framework was
https://github.com/scylladb/scylla/issues/7443 - where we had an
esoteric feature (sort order of *partitions* when an index is added),
which can be shown in Cqlsh to have what we think is incorrect behavior,
and yet: 1. We didn't catch this bug because we never wrote a test for it,
possibly because it is too difficult to contribute tests, and 2. We *thought*
that we knew what Cassandra does in this case, but nobody actually tested
it. Yes, we can test it manually with cqlsh, but wouldn't everything be
better if we could just run the same test that we wrote for Scylla against
Cassandra?
So one of the tests we add in this patch confirms issue #7443 in Scylla,
and that our hunch was correct and Cassandra indeed does not have this
problem. I also add a few trivial tests for keyspace create and drop,
as additional simple examples.
Refs #7443.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201110110301.672148-1-nyh@scylladb.com>
I noticed that we require filtering for a continuous clustering key, which is not necessary. I dropped the requirement and made sure the correct data is read from the storage proxy.
The corresponding dtest PR: https://github.com/scylladb/scylla-dtest/pull/1727
Tests: unit (dev,debug), dtest (next-gating, cql*py)
Closes#7460
* github.com:scylladb/scylla:
cql3: Delete some newlines
cql3: Drop superfluous ALLOW FILTERING
cql3: Drop unneeded filtering for continuous CK
When there are too many in-flight hints, writes start returning
overloaded exceptions. We're missing metrics for that, and these could
be useful when judging if the system is in overloaded state.
This commit changes the build file generation and the package
creation scripts to be product aware. This will change the
relocatable package archives to be named after the product.
This commit deals with two main things:
1. Creating the actual Scylla server relocatable with a product
prefixed name - which is independent of any other change
2. Expect all other packages to create product-prefixed archives -
which is dependent upon the actual submodules creating
product-prefixed archives.
If the support is not introduced in the submodules first this
will break the package build.
Tests: Scylla full build with the original product and a
different product name.
Closes#7581
Currently debian_files_gen.py mistakenly renames scylla-server.service to
"scylla-server." in non-standard product name environments such as
scylla-enterprise; it should be fixed to produce the correct filename.
Fixes#7423
This patch introduces many changes to the Scylla `CMakeLists.txt`
to enable building Scylla without resorting to pre-building
with a previous configure.py build, i.e. cmake script can now
be used as a standalone solution to build and execute scylla.
Submodules, such as Seastar and Abseil, are also dealt with
by importing their CMake scripts directly via `add_subdirectory`
calls. Other submodules, such as `libdeflate` now have a
custom command to build the library at runtime.
There are still a lot of things that are incomplete, though:
* Missing auxiliary packaging targets
* Unit-tests are not built (First priority to address in the
following patches)
* Compile and link flags are mostly hardcoded to the values
appropriate for the most recent Fedora 33 installation.
System libraries should be found via built-in `Find*` scripts,
compiler and linker flags should be observed and tested by
executing feature tests.
* The current build is aimed at GCC; we need to support
Clang since we are moving to it.
* Utility cmake functions should be moved to a separate "cmake"
directory.
The script is updated to use the most recent CMake version available
in Fedora 33, which is 3.18.
Right now this is more of a PoC rather than a full-fledged solution,
but as long as it's not widely used, we are free to evolve it in
a relaxed manner, improving it step by step to achieve feature
parity with `configure.py` solution.
The value in this patch is that now we are able to use any
C++ IDE capable of dealing with CMake solutions and take
advantage of their built-in capabilities, such as:
* Building a code model to efficiently navigate code.
* Find references to symbols.
* Use pretty-printers, beautifiers and other tools conveniently.
* Run scylla and debug it right from the IDE.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201103221619.612294-1-pa.solodovnikov@scylladb.com>
DescribeTable should return a UUID "TableId" in its response.
We already had it for CreateTable, and now this patch adds it to
DescribeTable.
The test for this feature is no longer xfail. Moreover, I improved
the test to not only check that the TableId field is present - it
should also match the documented regular expression (the standard
representation of a UUID).
Refs #5026
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201104114234.363046-1-nyh@scylladb.com>
When moving a sstable between directories, we would like to
be able to crash at any point during the algorithm with a
clear way to either roll the operation forwards or backwards.
To achieve that, define sstable::create_links_common that accepts
a `mark_for_removal` flag, implementing the following algorithm:
1. link src.toc to dst.temp_toc.
until removed, the destination sstable is marked for removal.
2. link all src components to dst.
crashing here will leave dst with both temp_toc and toc.
3.
a. if mark_for_removal is unset then just remove dst.temp_toc.
this commits the destination sstable and completes create_links.
b. if mark_for_removal is set then move dst.temp_toc to src.temp_toc.
this will atomically toggle recovery after crash from roll-back
to roll-forward.
here too, crashing at this point will leave src with both
temp_toc and toc.
Adjust the unit test for the revised algorithm.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
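A hedged sketch of that ordering (helper names are illustrative):
```
future<> create_links_common(sstring src, sstring dst, bool mark_for_removal) {
    // 1. While dst holds only a TemporaryTOC, the destination sstable
    //    is marked for removal on crash recovery.
    return idempotent_link_file(toc_path(src), temp_toc_path(dst)).then([=] {
        // 2. Link every component; a crash here leaves dst with both
        //    TemporaryTOC and TOC, which recovery can still resolve.
        return link_all_components(src, dst);
    }).then([=] {
        return mark_for_removal
            // 3b. Atomically flip recovery from roll-back to roll-forward.
            ? rename_file(temp_toc_path(dst), temp_toc_path(src))
            // 3a. Commit the destination sstable.
            : remove_file(temp_toc_path(dst));
    });
}
```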
Keep descriptors in a map so they can be searched easily by generation,
and possibly delete the descriptor, if found, in the presence of
a temporary toc component.
A following patch will add support to create_links for moving
sstables between directories. It is based on keeping a TemporaryTOC
file in the destination directory while linking all source components.
If scylla crashes here, the destination sstable will have both
its TemporaryTOC and TOC components and it needs to be removed
to roll the move backwards.
Then, create_links will atomically move the TemporaryTOC from
the destination back to the source directory, to toggle rolling
back to rolling forward by marking the source sstable for removal.
If scylla crashes here, the source sstable will have both
its TemporaryTOC and TOC components and it needs to be removed
to roll the move forward.
Add unit test for this case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that we use `idempotent_link_file`, it'll no longer
fail with EEXIST in a replay scenario.
It may fail with ENOENT and return an exceptional future.
This will be propagated up the stack. Since it may indicate
a parallel invocation of move_to_new_dir, which deletes the source
sstable while this thread links it to the same destination,
rolling back by removing the destination links would
be dangerous.
For any other error, the node is going to be isolated
and stop operation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Handle the case where create_link is replayed after crashing in the middle.
In particular, if we restart when moving sstables from staging to the base dir,
right after create_links completes, and right before deleting the source links,
we end up with seemingly 2 valid sstables, one still in staging and the other
already in the base table directory, both are hard linked to the same inodes.
Make create_links idempotent so it can replay the operation safely if crashed and
restarted at any point of its operation.
Add unit tests for replay after a partial create_links that is expected to
succeed, and a test for replay when an sstable exists in the destination that is not
hard-linked to the source sstable; create_links is expected to fail in this case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To facilitate cleanup on crash, first rename the TOC file to TOC.tmp,
and keep until all other files are removed, finally remove TOC.tmp.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This use-after-move was apparently exposed after switching to clang
in commit eb861e68e9.
The directory_entry is required for std::stoi(de.name.c_str())
and later in the catch{} clause.
This shows in the node logs as an "Ignore invalid directory" debug
log message with an empty name, and caused the hintedhandoff_rebalance_test
to fail when hints files aren't rebalanced.
Test: unit(dev)
DTest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test (dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201106172017.823577-1-bhalevy@scylladb.com>
Older distributions such as CentOS7 do not support systemd user mode.
On such distributions nonroot mode does not work, so show a warning message
and skip running systemctl --user.
Fixes#7071
... to config descriptions
We allow setting the transitional auth as one of the options
in scylla.yaml, but don't mention it at all in the field's
description. Let's change that.
Closes#7565
The function is used by raft and fails with ubsan and clang.
The ub is harmless. Let's wait for it to be fixed in boost.
Message-Id: <20201109090353.GZ3722852@scylladb.com>
The retry mechanism didn't work when a URLError happened. For example:
urllib.error.URLError: <urlopen error [Errno 101] Network is unreachable>
Let's catch URLError instead of HTTPError, since URLError is the base exception
for all exceptions in the urllib module.
Fixes: #7569
Closes#7567
If _offset falls beyond compound_type->types().size(),
ignore the extra components instead of accessing beyond the types
vector's range.
FIXME: we should validate the thrift key against the schema
and reject it in the thrift handler layer.
Refs #7568
Test: unit(dev)
DTest: cql_tests.py:MiscellaneousCQLTester.cql3_insert_thrift_test (dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201108175738.1006817-1-bhalevy@scylladb.com>
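The essence of the fix is a bounds check of this shape (a fragment inside
the component-consuming loop; member names taken from the message above,
the rest hypothetical):
```
// Stop consuming key components once _offset reaches the number of
// component types; extra thrift-supplied components are ignored instead
// of being used to index past the end of the types vector.
if (_offset >= compound_type->types().size()) {
    return;  // see the FIXME: validation belongs in the thrift handler
}
```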
Users can change `durable_writes` anytime with ALTER KEYSPACE.
Cassandra reads the value of `durable_writes` every time when applying
a mutation, so changes to that setting take effect immediately. That is,
mutations are added to the commitlog only when `durable_writes` is `true`
at the moment of their application.
Scylla reads the value of `durable_writes` only at `keyspace` construction time,
so changes to that setting take effect only after Scylla is restarted.
This patch fixes the inconsistency.
Fixes#3034
Closes#7533
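A hedged sketch of the fixed behaviour (names illustrative, not the exact
Scylla code): consult durable_writes on every apply instead of caching it
at keyspace construction.
```
future<> apply_mutation(database& db, schema_ptr s, frozen_mutation m) {
    // Read the *current* keyspace metadata, so ALTER KEYSPACE ... WITH
    // durable_writes = ... takes effect immediately, without a restart.
    auto& ks = db.find_keyspace(s->ks_name());
    if (ks.metadata()->durable_writes()) {
        return apply_with_commitlog(db, std::move(s), std::move(m));
    }
    return apply_in_memory(db, std::move(s), std::move(m));
}
```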
This series provides assorted fixes which are a
pre-requisite for the joint consensus implementation
series which follows.
* scylla-dev/raft-misc:
raft: fix raft_fsm_test flakiness
raft: drop a waiter of snapshotted entry
raft: use correct type for node info in add_server()
raft: overload operator<< for debugging
An index that is being waited on can be included in an installed snapshot,
in which case there is no way to know if the entry was committed or not.
Abort such waiters with an appropriate error.
Overload operator<< for ostream and print relevant state for server, fsm, log,
and typed_uint64 types.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The query processor is present in the global namespace and is
widely accessed with global get(_local)?_query_processor().
There's a long-term task to get rid of this globality and make
services and components reference each other and, due to this,
start and stop in a specific order. This set does this
for the query processor.
The remaining users of it are -- alternator, controllers for
client services, schema_tables and sys_dist_ks. All of them
except for the schema_tables are fixed just by passing the
reference on query processor with small patches. The schema
tables accessing qp sit deep inside the paxos code, but can
be "fixed" with the qctx thing until the qctx itself is
de-globalized.
* https://github.com/xemul/scylla/tree/br-rip-global-query-processor:
code: RIP global query processor instance
cql test env: Keep query processor reference on board
system distributed keyspace: Start sharded service earlier
schema_tables: Use qctx to make internal requests
transport: Keep sharded query processor reference on controller
thrift: Keep sharded query processor reference on controller
alternator: Use local query processor reference to get keys
alternator: Keep local query processor reference in server
The only purpose of this change is to compile (git-bisect
safety) and thus prove that the next patch is correct.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the view builder cannot read view building progress from an
internal CQL table it produces an error message, but that only confuses
the user and the test suite -- this situation is entirely recoverable,
because the builder simply assumes that there is no progress and the
view building should start from scratch.
Fixes#7527
Closes#7558
repair: Use single writer for all followers
Currently, the repair master creates one writer for each follower to write
rows from followers to sstables. That is RF - 1 writers in total. Each
writer creates 1 sstable for the range repaired, usually a vnode range.
Those sstables for a given vnode range are disjoint.
To reduce the compaction work, we can create one writer for all the
followers. This reduces the number of sstables generated by repair
significantly to one per vnode range from RF - 1 per vnode range.
Fixes#7525
Closes#7528
* github.com:scylladb/scylla:
repair: No more vector for _writer_done and friends
repair: Use single writer for all followers
The default of DBUILD_TOOL=docker requires passwordless access to docker
by the user of dbuild. This is insecure, as any user with unconstrained
access to docker is root equivalent. Therefore, users might prefer to
run docker as root (e.g. by setting DBUILD_TOOL="sudo docker").
However, `$tool -e HOME` exports HOME as seen by $tool.
This breaks dbuild when `$tool` runs docker as another user.
`$tool -e HOME="$HOME"` exports HOME as seen by dbuild, which is
the intended behaviour.
Closes#7555
Instead of invoking `$tool`, as is done everywhere else in dbuild,
kill_it() invoked `docker` explicitly. This was slightly breaking the
script for DBUILD_TOOL other than `docker`.
Closes#7554
Cleanup compaction is using consume_pausable_in_thread() to skip over
disowned partitions, which uses flat_mutation_reader::next_partition().
The implementation of next_partition() for the sstable reader has a
bug which may cause the following assertion failure:
scylla: sstables/mp_row_consumer.hh:422: row_consumer::proceed sstables::mp_row_consumer_k_l::flush(): Assertion `!_ready' failed.
This happens when the sstable reader's buffer gets full when we reach
the partition end. The last fragment of the partition won't be pushed
into the buffer but will stay in the _ready variable. When
next_partition() is called in this state, _ready will not be cleared
and the fragment will be carried over to the next partition. This will
cause assertion failure when the reader attempts to emit the first
fragment of the next partition.
The fix is to clear _ready when entering a partition, just like we
clear _range_tombstones there.
Fixes#7553.
Message-Id: <1604534702-12777-1-git-send-email-tgrabiec@scylladb.com>
Fixes the ordering of returned rows to proper signed token ordering. Before this change, rows were sorted by token, but using unsigned comparison, meaning that negative tokens appeared after positive tokens.
Rename `token_column_computation` to `legacy_token_column_computation` and add some comments describing this computation.
Added (new) `token_column_computation` which returns token as `long_type`, which is sorted using signed comparison - the correct ordering of tokens.
Add new `correct_idx_token_in_secondary_index` feature, which flags that the whole cluster is able to use new `token_column_computation`.
Switch token computation in secondary indexes to the (new) `token_column_computation`, which fixes the ordering. This column computation type is only set if the cluster supports the `correct_idx_token_in_secondary_index` feature, to make sure that all nodes
will be able to compute the new `token_column_computation`. Also, old indexes will need to be rebuilt to take advantage of this fix, as the new token column computation type is only set for new indexes.
Fix tests according to new token ordering and add one new test to validate this aspect explicitly.
Fixes#7443
Manually tested a scenario where someone created an index on an old version of Scylla and then migrated to the new Scylla. The old index continued to work properly (but returned rows in the wrong order). Upon dropping and re-creating the index, it still returned the same data, but now in correct order.
Closes#7534
* github.com:scylladb/scylla:
tests: add token ordering test of indexed selects
tests: fix tests according to new token ordering
secondary_index: use new token_column_computation
feature: add correct_idx_token_in_secondary_index
column_computation: add token_column_computation
token_column_computation: rename as legacy
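The root cause is easy to demonstrate in isolation: encoding an int64 token
as big-endian bytes and comparing the bytes lexicographically (as CQL
`bytes` comparison does) yields unsigned order, so negative tokens sort
after positive ones. A self-contained illustration:
```
#include <array>
#include <cstdint>
#include <iostream>

static std::array<uint8_t, 8> to_be_bytes(int64_t v) {
    std::array<uint8_t, 8> b;
    uint64_t u = static_cast<uint64_t>(v);
    for (int i = 7; i >= 0; --i) { b[i] = u & 0xff; u >>= 8; }
    return b;
}

int main() {
    int64_t neg = -42, pos = 42;
    std::cout << "signed order:   " << (neg < pos) << "\n";  // 1
    std::cout << "bytewise order: " << (to_be_bytes(neg) < to_be_bytes(pos))
              << "\n";  // 0: 0xff... compares greater than 0x00...
}
```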
The shared_from_this lw_shared_ptr must not be accessed
across shards. Capturing it in the lambda passed to
mutation_writer::distribute_reader_and_consume_on_shards
causes exactly that, since the captured lw_shared_ptr
is copied on other shards, and ends up in memory corruption
as seen in #7535 (probably due to lw_shared_ptr._count
going out of sync when incremented/decremented in parallel
on other shards with no synchronization).
This was introduced in 289a08072a.
The writer is not needed in the body of this lambda anyway,
so it doesn't need to capture it. It is already held
by the continuations until the end of the chain.
Fixes#7535
Test: repair_additional_test:RepairAdditionalTest.repair_disjoint_row_3nodes_diff_shard_count_test (dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201104142216.125249-1-bhalevy@scylladb.com>
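Schematically, the hazard looks like this (hedged, illustrative pseudocode;
lw_shared_ptr's reference count is a plain non-atomic counter):
```
auto w = make_lw_shared<writer_type>(/* ... */);
// BUG: capturing w by value copies the lw_shared_ptr onto every target
// shard; each copy and destruction touches w's _count without any
// synchronization, racing with the owning shard.
(void)smp::submit_to(other_shard, [w] { /* use *w */ });
// OK: don't capture it; the continuation chain on the owning shard
// already keeps the writer alive until the end of the operation.
```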
"
Since we are switching to clang due to raft, make it actually compile
with clang.
"
tgrabiec: Dropped the patch "raft: compile raft by default" because
the replication_test still fails in debug mode:
/usr/include/boost/container/deque.hpp:1802:63: runtime error: applying non-zero offset 8 to null pointer
* 'raft-clang-v2' of github.com:scylladb/scylla-dev:
raft: Use different type to create type dependent statement for static assertion
raft: drop use of <ranges> for clang
raft: make test compile with clang
raft: drop -fcoroutines support from configure.py
Now that both repair followers and the repair master use a single writer, we
can get rid of the vector associated with _writer_done and friends.
Fixes#7525
Currently, the repair master creates one writer for each follower to write
rows from followers to sstables. That is RF - 1 writers in total. Each
writer creates 1 sstable for the range repaired, usually a vnode range.
Those sstables for a given vnode range are disjoint.
To reduce the compaction work, we can create one writer for all the
followers. This reduces the number of sstables generated by repair
significantly to one per vnode range from RF - 1 per vnode range.
Fixes#7525
A gcc bug [1] caused objects built by different versions of gcc
not to interoperate. Gcc helpfully warns when it encounters code that
could be affected.
Since we build everything with one version, and as that version is far
newer than the last version generating incorrect code, we can silence
that warning without issue.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728
Closes#7495
Do not run tests which are not built.
For that, pass the test list from configure.py to test.py
via ninja unit_test_list target.
Minor cleanups.
* scylla-dev.git/test.py-list:
test: enable raft tests
test.py: do not run tests which are not built
configure.py: add a ninja command to print unit test list
test.py: handle ninja mode_list failure
configure.py: don't pass modes_list unless it's used
Add new test validating that rows returned from both non-indexed selects
and indexed selects return rows sorted in token order (making sure
that both positive and negative tokens are present to test if signed
comparison order is maintained).
Switches token column computation to (new) token_column_computation,
which fixes#7443, because new token column will be compared using
signed comparisons, not the previous unsigned comparison of CQL bytes
type.
This column computation type is only set if cluster supports
correct_idx_token_in_secondary_index feature to make sure that all nodes
will be able to compute (new) token_column_computation. Also old
indexes will need to be rebuilt to take advantage of this fix, as new
token column computation type is only set for new indexes.
Add new correct_idx_token_in_secondary_index feature, which will be used
to determine if all nodes in the cluster support new
token_column_computation. This column computation will replace
legacy_token_column_computation in secondary indexes, which was
incorrect as this column computation produced values that when compared
with unsigned comparison (CQL type bytes comparison) resulted in
different ordering than token signed comparison. See issue:
https://github.com/scylladb/scylla/issues/7443
Introduce new token_column_computation class which is intended to
replace legacy_token_column_computation. The new column computation
returns token as long_type, which means that it will be ordered
according to signed comparison (not unsigned comparison of bytes), which
is the correct ordering of tokens.
Rename token_column_computation to legacy_token_column_computation, as
it will be replaced with the new column_computation. The reason is that this
computation returns bytes, but all tokens in Scylla can now be
represented by int64_t. Moreover, returning bytes causes invalid token
ordering as bytes comparison is done in unsigned way (not signed as
int64_t). See issue:
https://github.com/scylladb/scylla/issues/7443
When computing moving average rates too early after startup, the
rate can be infinite; this is simply because the sample interval
since the system started is too small to generate meaningful results.
Here we check for this situation and keep the rate at 0 if it happens
to signal that there are still no meaningful results.
This incident is unlikely since it can happen only during a
very small time window after restart, so we add a hint to the compiler
to optimize for that in order to have a minimum impact on the normal
usecase.
Fixes#4469
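An illustrative sketch of the guard (not Scylla's exact code; the minimum
interval is an assumed threshold):
```
#include <chrono>

double mean_rate(double count, std::chrono::duration<double> elapsed) {
    constexpr auto min_interval = std::chrono::duration<double>(0.001); // assumed
    // Right after startup the elapsed interval is tiny, which would make
    // the rate effectively infinite; report 0 until the sample interval
    // is meaningful. The branch is cold, hence the compiler hint.
    if (__builtin_expect(elapsed < min_interval, false)) {
        return 0.0;
    }
    return count / elapsed.count();
}
```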
The memory configuration for the database object was left at zero.
This can cause the following chain of failures:
- the test is a little slow due to the machine being overloaded,
and debug mode
- this causes the memtable flush_controller timer to fire before
the test completes
- the backlog computation callback is called
- this calculates the backlog as dirty_memory / total_memory; this
is 0.0/0.0, which resolves to NaN
- eventually this gets converted to an integer
- UBSAN doesn't like the conversion from NaN to integer, and complains
Fix by initializing dbcfg.available_memory.
Test: gossip_test(debug), 1000 repetitions with concurrency 6
Closes#7544
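A minimal reproduction of the failure mode (self-contained, not Scylla code):
```
#include <cmath>
#include <iostream>

int main() {
    double dirty_memory = 0.0;
    double total_memory = 0.0;                    // dbcfg left at zero
    double backlog = dirty_memory / total_memory; // 0.0/0.0 == NaN
    std::cout << std::isnan(backlog) << "\n";     // prints 1
    // int ticks = static_cast<int>(backlog);     // UB: NaN -> integer,
    //                                            // which UBSAN reports
    return 0;
}
```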
Fixes#7325
When building with clang on fedora32, calling the string_view constructor
of bignum generates broken IDs (i.e. parsing breaks). Creating a temp
std::string fixes it.
Closes#7542
Since 11a8912093, get_gossip_status
returns a std::string_view rather than a sstring.
As seen in dtest we may print garbage to the log
if we print the string_view after preemption (calling
_gossiper.reset_endpoint_state_map().get())
Test: update_cluster_layout_tests:TestUpdateClusterLayout.simple_add_two_nodes_in_parallel_test (dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201103132720.559168-1-bhalevy@scylladb.com>
"
Introduce a gentle (yielding) implementation of reserve for chunked
vector and use it when reserving the backing storage vector for large
bitset. Large bitset is used by bloom filters, which can be quite large
and have been observed to cause stalls when allocating memory for the
storage.
Fixes: #6974
Tests: unit(dev)
"
* 'gentle-reserve/v1' of https://github.com/denesb/scylla:
utils/large_bitset: use reserve_partial() to reserve _storage
utils/chunked_vector: add reserve_partial()
There's a perf_bptree test that compares the B+ tree collection with the
std::set and std::map ones. More will come; also, the "patterns"
to compare are not just "fill with keys" and "drain to empty", so
here's the perf_collection test, which measures timings of
- fill with keys
- drain key by key
- empty with .clear() call
- full scan with iterator
- insert-and-remove of a single element
for currently used collections
- std::set
- std::map
- intrusive_set_external_comparator
- bplus::tree
* https://github.com/xemul/scylla/tree/br-perf-collection-test:
test: Generalize perf_bptree into perf_collection
perf_collection: Clear collection between iterations
perf_collection: Add intrusive_set_external_comparator
perf_collection: Add test for single element insertion
perf_collection: Add test for destruction with .clear()
perf_collection: Add test for full scan time
To avoid stalls when reserving memory for a large bloom filter. The
filter creation already has a yielding loop for initialization, this
patch extends it to reservation of memory too.
A variant of reserve() which allows gentle reserving of memory. This
variant will allocate just one chunk at a time. To drive it to
completion, one should call it repeatedly with the return value of the
previous call, until it returns 0.
This variant will be used in the next patch by the large bitset creation
code, to avoid stalls when allocating large bloom filters (which are
backed by large bitset).
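A hedged usage sketch of that contract (assuming the loop runs inside a
seastar::thread):
```
#include <seastar/core/thread.hh>

template <typename ChunkedVector>
void reserve_gently(ChunkedVector& v, size_t n) {
    // reserve_partial() allocates one chunk per call and returns how
    // much is still unreserved; keep calling until it returns 0,
    // yielding between chunks to avoid a reactor stall.
    for (size_t left = v.reserve_partial(n); left; left = v.reserve_partial(left)) {
        seastar::thread::maybe_yield();
    }
}
```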
Although the code for it existed already, the validation function
hasn't been invoked properly. This change fixes that, adding
a validating check when converting from text to specific value
type and throwing a marshal exception if some characters
are not ASCII.
Fixes#5421
Closes#7532
When unit tests fail, test.py dumps their output on the screen. It is impossible
to read this output from the terminal, all the more so since the logs are saved in
the testlog/ directory anyway. At the same time, the names of the failed tests are all left
_before_ these logs, and if the terminal history is not large enough, it becomes
quite annoying to find the names.
The proposal is not to spoil the terminal with raw logs -- just names and summaries.
Logs themselves are at testlog/$mode/$name_of_the_test.log
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201031154518.22257-1-xemul@scylladb.com>
Otherwise all readers will be created with the default forwarding::yes.
This inhibits some optimizations (e.g. results in more sstable read-ahead).
It will also be problematic when we introduce mutation sources which don't support
forwarding::yes in the future.
Message-Id: <1604065206-3034-1-git-send-email-tgrabiec@scylladb.com>
Clang brings us working support for coroutines, which are
needed for Raft and for code simplification.
perf_simple_query as well as full system tests show no
significant performance regression.
Test: unit(dev, release, debug)
Closes#7531
Fixes#7496
Since cdc log now has an end-of-batch/record marker that tells
us explicitly that we've read the last row of a change, we
can use this instead of timestamp checks + limit extra to
ensure we have complete records.
Note that this does not try to fulfill the user query limit
exactly. To do this we would need to add a loop and potentially
re-query if the queried rows are not enough. But that is a
separate exercise, and superbly suited for coroutines!
Closes#7498
* github.com:scylladb/scylla:
alternator::streams: Reduce the query limit depending on cdc opts
alternator::streams: Use end-of-record info in get_records
This series cleans up the gossiper endpoint_state interface
marking methods const and const noexcept where possible.
To achieve that, endpoint_state::get_status was changed to
return a string_view rather than a sstring so it won't
need to allocate memory.
Also, get_cluster_name and get_partitioner_name were
changed to return a const sstring& rather than sstring
so they won't need to allocate memory.
The motivation for the series stems from #7339
where an exception in get_host_id within a storage_service
notification handler, called from seastar::defer, crashed
the server.
With this series, get_host_id may still throw exceptions on
logical error, but not from calling get_application_state_ptr.
Refs #7339
Test: unit(dev)
* tag 'gossiper-endpoint-noexcept-v2':
gossiper: mark trivial methods noexcept
gossiper: get_cluster_name, get_partitioner_name: make noexcept
gossiper: get_gossip_status: return string_view and make noexcept
gms/endpoint_state: mark methods using get_status noexcept
gms/endpoint_state: get_status: return string_view and make noexcept
gms/endpoint_state: mark get_application_state_ptr and is_cql_ready noexcept
gms/endpoint_state: mark trivial methods noexcept
gms/heart_beat_state: mark methods noexcept
gms/versioned_value: mark trivial methods noexcept
gms/version_generator: mark get_next_version noexcept
fb_utilities.hh: mark methods noexcept
messaging: msg_addr: mark methods noexcept
gms/inet_address: mark methods noexcept
When logging in to a GCE instance created from the GCE image, it takes 10 seconds to determine that we are not running on AWS. Also, some unnecessary debug logging messages are printed:
```
bentsi@bentsi-G3-3590:~/devel/scylladb$ ssh -i ~/.ssh/scylla-qa-ec2 bentsi@35.196.8.86
Warning: Permanently added '35.196.8.86' (ECDSA) to the list of known hosts.
Last login: Sun Nov 1 22:14:57 2020 from 108.128.125.4
_____ _ _ _____ ____
/ ____| | | | | __ \| _ \
| (___ ___ _ _| | | __ _| | | | |_) |
\___ \ / __| | | | | |/ _` | | | | _ <
____) | (__| |_| | | | (_| | |__| | |_) |
|_____/ \___|\__, |_|_|\__,_|_____/|____/
__/ |
|___/
Version:
666.development-0.20201101.6be9f4938
Nodetool:
nodetool help
CQL Shell:
cqlsh
More documentation available at:
http://www.scylladb.com/doc/
By default, Scylla sends certain information about this node to a data collection server. For information, see http://www.scylladb.com/privacy/
WARNING:root:Failed to grab http://169.254.169.254/latest/...
WARNING:root:Failed to grab http://169.254.169.254/latest/...
Initial image configuration failed!
To see status, run
'systemctl status scylla-image-setup'
[bentsi@artifacts-gce-image-jenkins-db-node-aa57409d-0-1 ~]$
```
This PR fixes that.
Closes#7523
* github.com:scylladb/scylla:
scylla_util.py: remove unnecessary logging
scylla_util.py: make is_aws_instance faster
scylla_util.py: added ability to control sleep time between retries in curl()
Old secondary index schemas did not have their idx_token column
marked as computed, and there already exists code which updates
them. Unfortunately, the fix itself contains an error and doesn't
fire if computed columns are not yet supported by the whole cluster,
which is a very common situation during upgrades.
Fixes#7515Closes#7516
Fixes#7496
Since cdc log now has an end-of-batch/record marker that tells
us explicitly that we've read the last row of a change, we
can use this instead of timestamp checks + limit extra to
ensure we have complete records.
Note that this does not try to fulfill the user query limit
exactly. To do this we would need to add a loop and potentially
re-query if the queried rows are not enough. But that is a
separate exercise, and superbly suited for coroutines!
The instructions are updated for multiarch images (images that
can be used on x86 and ARM machines).
Additionally,
- docker is replaced with podman, since that is what developers now
use. Docker is still supported, but
the image creation instructions are only tested with podman.
- added instructions about updating submodules
- `--format docker` is removed. It is not necessary with
more recent versions of docker.
Closes#7521
connection_notifier.hh defines a number of template-specialized
variables in a header. This is illegal since you're allowed to
define something multiple times if it's a template, but not if it's
fully specialized. gcc doesn't care but clang notices and complains.
Fix by defining the variables as inline variables, which are
allowed to have definitions in multiple translation units.
Closes#7519
Some ARM cores are slow, and trip our current timeout of 3000
seconds in debug mode. Quadrupling the timeout is enough to make
debug-mode tests pass on those machines.
Since the timeout's role is to catch rare infinite loops in unsupervised
testing, increasing the timeout has no ill effect (other than to
delay the report of the failure).
Closes#7518
The main goal of this series is to fix issue #6951 - a Query (or Scan) with
a combination of filtering and projection parameters produced wrong results if
the filter needs some attributes which weren't projected.
This series also adds new tests for various corner cases of this issue. These
new tests also pass after this fix, or still fail because of some other missing
feature (namely, nested attributes). These additional tests will be important if
we ever want to refactor or optimize this code, because they exercise some rare
corner code paths at the intersection of filtering and projection.
This series also fixes some additional problems related to this issue, like
combining old and new filtering/projection syntaxes (should be forbidden), and
even one fix to a wrong comment.
Closes#7328
* github.com:scylladb/scylla:
alternator test: tests for nested attributes in FilterExpression
alternator test: fix comment
alternator tests: additional tests for filter+projection combination
alternator: forbid combining old and new-style parameters
alternator: fix query with both projection and filtering
When calling curl and an exception is raised, we can see unnecessary log messages that we can't control.
For example, when used in scylla_login, we can see the following messages:
WARNING:root:Failed to grab http://169.254.169.254/latest/...
WARNING:root:Failed to grab http://169.254.169.254/latest/...
Initial image configuration failed!
To see status, run
'systemctl status scylla-image-setup'
These methods can return a const sstring& rather than
allocating a sstring. And with that they can be marked noexcept.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Change get_gossip_status to return string_view,
and with that it can be noexcept now that it doesn't
allocate memory via sstring.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that get_status returns string_view, just compare it with a const char*
rather than making a sstring out of it; consequently, the methods can be marked noexcept.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
get_status doesn't need to allocate a sstring, it can just
return a std::string_view to the status string, if found.
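A self-contained model of the change, with standard types standing in for the gossip state map and sstring:
```
#include <map>
#include <string>
#include <string_view>

struct endpoint_state_model {
    std::map<int, std::string> application_state;   // models the gossip state map

    const std::string* get_application_state_ptr(int key) const noexcept {
        auto it = application_state.find(key);      // int comparison cannot throw
        return it == application_state.end() ? nullptr : &it->second;
    }

    // Return a view into the owned string instead of copying into an sstring;
    // no allocation means the function can be noexcept.
    std::string_view get_status() const noexcept {
        constexpr int STATUS = 0;                   // models application_state::STATUS
        auto* p = get_application_state_ptr(STATUS);
        return p ? std::string_view(*p) : std::string_view{};
    }
};
```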
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although std::map::find is not guaranteed to be noexcept,
it depends on the comparator used, and in this case comparing application_state
is noexcept. Therefore, we can safely mark get_application_state_ptr noexcept.
is_cql_ready depends on get_application_state_ptr and otherwise
handles exceptions from boost::lexical_cast, so it can be marked
noexcept as well.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that get_next_version() is noexcept,
update_heart_beat can be noexcept too.
All others are trivially noexcept.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Based on gms::inet_address.
With that, gossiper::get_msg_addr can be marked noexcept (and const while at it).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Clang does not implement P1814R0 (class template argument deduction
for alias templates), so it can't deduce the template arguments
for range_bound, but it can for interval_bound, so switch to that.
Using the modern name rather than the compatibility alias is preferred
anyway.
Closes#7422
In commit de38091827 the two IO priority classes streaming_read
and streaming_write were merged into just one. The document docs/isolation.md
leaves a lot to be desired (hint, hint, to anyone reading this who
can write content!) but let's at least not have incorrect information
there.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201101102220.2943159-1-nyh@scylladb.com>
This series adds more context to debugging information in case a view gets out of sync with its base table.
A test was conducted manually, by:
1. creating a table with a secondary index
2. manually deleting computed column information from system_schema.computed_columns
3. restarting the target node
4. trying to write to the index
Here's what's logged right after the index metadata is loaded from disk:
```
ERROR 2020-10-30 12:30:42,806 [shard 0] view - Column idx_token in view ks.t_c_idx_index was not found in the base table ks.t
ERROR 2020-10-30 12:30:42,806 [shard 0] view - Missing idx_token column is caused by an incorrect upgrade of a secondary index. Please recreate index ks.t_c_idx_index to avoid future issues.
```
And here's what's logged during the actual failure - when Scylla notices that there exists
a column which is not computed, but it's also not found in the base table:
```
ERROR 2020-10-30 12:31:25,709 [shard 0] storage_proxy - exception during mutation write to 127.0.0.1: seastar::internal::backtraced<std::runtime_error> (base_schema(): operation unsupported when initialized only for view reads. Missing column in the base table: idx_token Backtrace: 0x1d14513
0x1d1468b
0x1d1492b
0x109bbad
0x109bc97
0x109bcf4
0x1bc4370
0x1381cd3
0x1389c38
0xaf89bf
0xaf9b20
0xaf1654
0xaf1afe
0xb10525
0xb10ad8
0xb10c3a
0xaaefac
0xabf525
0xabf262
0xac107f
0x1ba8ede
0x1bdf749
0x1be338c
0x1bfe984
0x1ba73fa
0x1ba77a4
0x9ea2c8
/lib64/libc.so.6+0x27041
0x9d11cd
--------
seastar::lambda_task<seastar::execution_stage::flush()::{lambda()#1}>
```
Hopefully, this information will make it much easier to solve future problems with out-of-sync views.
Tests: unit(dev)
Fixes#7512Closes#7513
* github.com:scylladb/scylla:
view: add printing missing base column on errors
view: simplify creating base-dependent info for reads only
view: fix typo: s/dependant/dependent
view: add error logs if a view is out of sync with its base
In certain CQL statements it's possible to provide a custom timestamp via the USING TIMESTAMP clause. Those values are accepted in microseconds; however, there's no limit on the timestamp (apart from the type size constraint), and providing a timestamp in a different unit like nanoseconds can lead to creating an entry with a timestamp way ahead in the future, thus compromising the table.
To avoid this, this change introduces a sanity check for modification and batch statements that raises an error when a timestamp more than 3 days into the future is provided.
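A sketch of such a sanity check, using the 3-day threshold from the text; the exception type and the wiring into modification/batch statements are assumptions:
```
#include <chrono>
#include <cstdint>
#include <stdexcept>

void validate_user_timestamp(int64_t ts_micros) {
    using namespace std::chrono;
    auto now_us = duration_cast<microseconds>(
            system_clock::now().time_since_epoch()).count();
    constexpr int64_t max_ahead_us = 3LL * 24 * 60 * 60 * 1'000'000;
    if (ts_micros > now_us + max_ahead_us) {
        throw std::invalid_argument(
                "USING TIMESTAMP is more than 3 days into the future");
    }
}
```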
Fixes#5619Closes#7475
The constructors just set up the references; the real start happens in .start(),
so it is safe to do this early. This helps avoid carrying the migration manager
and query processor down the storage service cluster joining code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The query processor global instance is going away. The schema_tables usage
of it requires a huge rework to push the qp reference to the needed places.
However, those places talk to system keyspace and are thus the users of the
"qctx" thing -- the query context for local internal requests.
To make cql tests not crash on null qctx pointer, its initialization should
come earlier (conforming to the main start sequence).
The qctx itself is a global pointer, which waits for its fix too, of course.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a write operation attempts to use an out-of-sync view,
the whole operation needs to be aborted with an error. After this patch,
the error contains more context - namely, the missing column.
The code which created base-dependent info for materialized views
can be expressed with fewer branches. Also, the constructor
which takes a single parameter is made explicit.
When Scylla finds out that a materialized view contains columns
which are not present in the base table (and they are not computed),
it now presents comprehensible errors in the log.
* seastar 6973080cd1...57b758c2f9 (11):
> http: handle 'match all' rule correctly
> http: add missing HTTP methods
> memory: remove unused lambda capture in on_allocation_failure()
> Support seastar allocator when seastar::alien is used
> Merge "make timer related functions noexcept" from Benny
> script: update dependency packages for centos7/8
> tutorial: add linebreak between sections
> doc: add nav for the second last chap
> doc: add nav bar at the bottom also
> doc: rename add_prologue() to add_nav_to_body()
> Wrong name used in an example in mini tutorial.
An upcoming change in Seastar only initializes the Seastar allocator in
reactor threads. This causes imr_test and double_decker_test to fail:
1. Those tests rely on LSA working
2. LSA requires the Seastar allocator
3. Seastar is not initialized, so the Seastar allocator is not initialized.
Fix by switching to the Seastar test framework, which initializes Seastar.
Closes#7486
test.py estimates the amount of memory needed per test
in order not to overload the machine, but it underestimates
badly and so machines with many cores but not a lot of memory
fail the tests (in debug mode principally) due to running out
of memory.
Increase the estimate from 2GB per test to 6GB.
Closes#7499
gcc collects all the initialization code for thread-local storage
and puts it in one giant function. In combination with debug mode,
this creates a very large stack frame that overflows the stack
on aarch64.
Work around the problem by placing each initializer expression in
its own function, thus reusing the stack.
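A self-contained model of the workaround (the initializers here are placeholders): each initializer expression is wrapped in its own function, so each initialization uses and then releases its own stack frame instead of all of them sharing one giant frame.
```
#include <string>

// Before: letting gcc collect all thread-local initializer expressions into
// one TLS-init function merges every initializer's locals into one huge frame.
// After: each expression lives in a function of its own, so the frames are
// created and torn down one at a time.
static std::string make_name() { return std::string(64, 'x'); }
static std::string make_path() { return std::string(64, '/'); }

thread_local std::string tls_name = make_name();
thread_local std::string tls_path = make_path();
```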
Closes#7509
Add additional comments to select_statement_utils, fix formatting, add
missing #pragma once and introduce set_internal_paging_size_guard to
set internal_paging in RAII fashion.
Closes#7507
Gossip currently runs inside the default (main) scheduling group. It is
functionally fine there, but from time to time we see
many tasks in the main scheduling group and suspect gossip. It is best
to move gossip to a dedicated scheduling group, so that we can catch
bugs that leak tasks to the main group more easily.
After this patch, we can check:
scylla_scheduler_time_spent_on_task_quota_violations_ms{group="gossip",shard="0"}
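A sketch of the mechanism, assuming Seastar's scheduling-group API and an illustrative shares value:
```
#include <functional>
#include <seastar/core/future.hh>
#include <seastar/core/scheduling.hh>

// Create a dedicated group and run the gossip loop inside it, so gossip work
// is accounted to "gossip" rather than to the default (main) group.
seastar::future<> run_gossip_in_own_group(std::function<seastar::future<>()> loop) {
    return seastar::create_scheduling_group("gossip", 100).then(
            [loop = std::move(loop)] (seastar::scheduling_group sg) mutable {
                return seastar::with_scheduling_group(sg, std::move(loop));
            });
}
```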
Fixes: #7154
Tests: unit(dev)
We require a kernel that is at least 3.10.0-514, because older
kernels have an XFS-related bug that causes data corruption. However
this Requires: clause pulls in a kernel even in Docker installations,
where it (and especially the associated firmware) occupies a lot of
space.
Change to a Conflicts: instead. This prevents installation when
the really old kernel is present, but doesn't pull it in for the
Docker image.
Closes#7502
Overview
Fixes#7355.
Before this change, there were a few invalid results of aggregates/GROUP BY on tables with secondary indexes (see below).
Unfortunately, it still does NOT fix the problem in issue #7043. Although this PR moves fixing that issue forward, there is still a bug with `TOKEN(...)` in `WHERE` clauses of indexed selects that is not addressed in this PR. It will be fixed in my next PR.
It does NOT fix the problems in issues #7432, #7431, as those are out of scope for this PR and do not affect the correctness of results (they only return a too-large page).
GROUP BY (first commit)
Before the change, `GROUP BY` `SELECT`s with some `WHERE` restrictions on an indexed column would return invalid results (same grouped column values appearing multiple times):
```
CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ks_t on ks.t(v);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;
pk
----
1
1
```
This is fixed by correctly passing `_group_by_cell_indices` to `result_set_builder`. Fixes the third failing example from issue #7355.
Paging (second commit)
Fixes two issues related to improper paging on indexed `SELECT`s. As those two issues are closely related (fixing one without fixing the other causes invalid results of queries), they are in a single commit (second commit).
The first issue is that when using `slice.set_range`, the existing `_row_ranges` (which specify clustering key prefixes) are not taken into account. This caused the wrong rows to be included in the result, as the clustering key bound was set to a half-open range:
```
CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
CREATE INDEX kst_index ON ks.t(c);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
SELECT COUNT(*) FROM ks.t WHERE c = 3;
count
-------
2
```
The second commit fixes this issue by properly trimming `row_ranges`.
The second fixed problem is related to setting the `paging_state` to `internal_options`. It was improperly set to the value just after reading from index, making the base query start from invalid `paging_state`.
The second commit fixes this issue by setting the `paging_state` after both index and base table queries are done. Moreover, the `paging_state` is now set based on `paging_state` of index query and the results of base table query (as base query can return more rows than index query).
The second commit fixes the first two failing examples from issue #7355.
Tests (fourth commit)
Extensively tests queries on tables with secondary indices with aggregates and `GROUP BY`s.
Tests three cases that are implemented in `indexed_table_select_statement::do_execute` - `partition_slices`,
`whole_partitions` and (non-`partition_slices` and non-`whole_partitions`). As some of the issues found were related to paging, the tests check scenarios where the inserted data is smaller than a page, larger than a page and larger than two pages (and some in-between page boundaries scenarios).
I found all those parameters (case of `do_execute`, number of inserted rows) to have an impact on those fixed bugs; therefore the tests validate a large number of those scenarios.
Configurable internal_paging_size (third commit)
Before this change, internal `page_size` when doing aggregate, `GROUP BY` or nonpaged filtering queries was hard-coded to `DEFAULT_COUNT_PAGE_SIZE` (10,000). This change adds new internal_paging_size variable, which is configurable by `set_internal_paging_size` and `reset_internal_paging_size` free functions. This functionality is only meant for testing purposes.
Closes#7497
* github.com:scylladb/scylla:
tests: Add secondary index aggregates tests
select_statement: Introduce internal_paging_size
select_statement: Fix paging on indexed selects
select_statement: Fix GROUP BY on indexed select
Update the toolchain to Fedora 33 with clang 11 (note the
build still uses gcc).
The image now creates a /root/.m2/repository directory; without
this the tools/jmx build fails on aarch64.
Add java-1.8.0-openjdk-devel since that is where javac lives now.
Add a JAVA8_HOME environment variable; without this, ant is not
able to find javac.
The toolchain is enabled for x86_64 and aarch64.
Extensively tests queries on tables with secondary indices with
aggregates and GROUP BYs. Tests three cases that are implemented
in indexed_table_select_statement::do_execute - partition_slices,
whole_partitions and (non-partition_slices and non-whole_partitions).
As some of the issues found were related to paging, the tests check
scenarios where the inserted data is smaller than a page, larger than
a page and larger than two pages (and some boundary scenarios).
Before this change, internal page_size when doing aggregate, GROUP BY
or nonpaged filtering queries was hard-coded to DEFAULT_COUNT_PAGE_SIZE.
This made testing hard (timeouts in debug build), because the tests had
to be large to test cases when there are multiple internal pages.
This change adds new internal_paging_size variable, which is
configurable by set_internal_paging_size and reset_internal_paging_size
free functions. This functionality is only meant for testing purposes.
Fixes two issues related to improper paging on indexed SELECTs. As those
two issues are closely related (fixing one without fixing the other
causes invalid results of queries), they are in a single commit.
The first issue is that when using slice.set_range, the existing
_row_ranges (which specify clustering key prefixes) are not taken into
account. This caused the wrong rows to be included in the result, as the
clustering key bound was set to a half-open range:
CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
CREATE INDEX kst_index ON ks.t(c);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
SELECT COUNT(*) FROM ks.t WHERE c = 3;
count
-------
2
This change fixes this issue by properly trimming row_ranges.
The second fixed problem is related to setting the paging_state
to internal_options. It was improperly set just after reading from
index, making the base query start from invalid paging_state.
This change fixes this issue by setting the paging_state after both
index and base table queries are done. Moreover, the paging_state is
now set based on paging_state of index query and the results of base
table query (as base query can return more rows than index query).
Fixes the first two failing examples from issue #7355.
Before the change, GROUP BY SELECTs with some WHERE restrictions on an
indexed column would return invalid results (same grouped column values
appearing multiple times):
CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
CREATE INDEX ks_t on ks.t(v);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;
pk
----
1
1
This is fixed by correctly passing _group_by_cell_indices to
result_set_builder. Fixes the third failing example from issue #7355.
This is the continuation of 30722b8c8e, so let me re-cite Rafael:
The constructors of these global variables can allocate memory. Since
the variables are thread_local, they are initialized at first use.
There is nothing we can do if these allocations fail, so use
disable_failure_guard.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201028140553.21709-1-xemul@scylladb.com>
The future of the fiber that writes data into sstables inside
the repair_writer is stored in _writer_done like below:
class repair_writer {
_writer_done[node_idx] =
mutation_writer::distribute_reader_and_consume_on_shards().then([this] {
...
}).handle_exception([this] {
...
});
}
The fiber accesses the repair_writer object in the error handling path. We
wait for _writer_done to finish before we destroy the repair_meta
object, which contains the repair_writer object, to avoid the fiber
accessing an already-freed repair_writer object.
To be safer, we can make repair_writer a shared pointer and take a
reference in the distribute_reader_and_consume_on_shards code path.
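A self-contained model of that safer variant, with standard-library stand-ins for the Seastar primitives:
```
#include <future>
#include <memory>

struct writer_model {
    int rows_written = 0;
};

// The background task captures a shared_ptr, so the writer cannot be freed
// while the task (including its error-handling path) may still touch it.
std::future<void> start_background_write(std::shared_ptr<writer_model> w) {
    return std::async(std::launch::async, [w] {
        w->rows_written += 1;   // safe even if the creator drops its reference
    });
}
```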
Fixes#7406Closes#7430
LD_PRELOAD libraries usually have dependencies in the host system,
which they will not have access to in a relocatable environment
since we use a different libc. Detect that LD_PRELOAD is in use and if
so, abort with an error.
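A sketch of such a check (the real detection may live in the relocatable package's startup wrapper rather than in a function like this):
```
#include <cstdio>
#include <cstdlib>

void abort_if_ld_preload() {
    if (const char* preload = std::getenv("LD_PRELOAD")) {
        std::fprintf(stderr,
                "LD_PRELOAD is set (%s); preloaded host libraries are "
                "incompatible with the relocatable package's libc\n", preload);
        std::abort();
    }
}
```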
Fixes#7493.
Closes#7494
Clang complains if it sees linker-only flags when called for compilation,
so move the compile-time flags from cxx_ld_flags to cxxflags, and remove
cxx_ld_flags from the compiler command line.
The linker flags are also passed to Seastar so that the build-id and
interpreter hacks still apply to iotune.
Closes#7466
python3 has its own relocatable package, so there is no need to include it
in scylla-package.tar.gz.
Closes#7467
We already have a test for the behavior of a closed shard and how
iterators previously created for it are still valid. In this patch
we add to this also checking that the shard id itself, not just the
iterator, is still valid.
Additionally, although the aforementioned test used a disabled stream
to create a closed shard, it was not a complete test for the behavior
of a disabled stream, and this patch adds such a test. We check that
although the stream is disabled, it is still fully usable (for 24 hours) -
its original ARN is still listed on ListStreams, the ARN is still usable,
its shards can be listed, all are marked as closed but still fully readable.
Both tests pass on DynamoDB, and xfail on Alternator because of
issue #7239 - CDC drops the CDC log table as soon as CDC is disabled,
so the stream data is lost immediately instead of being retained for
24 hours.
Refs #7239
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20201006183915.434055-1-nyh@scylladb.com>
This patch fills the following columns in `system.clients` table:
* `connection_stage`
* `driver_name`
* `driver_version`
* `protocol_version`
It also improves:
* `client_type` - distinguishes cql from thrift just in case
* `username` - now it displays the correct username iff `PasswordAuthenticator` is configured.
What is still missing:
* SSL params (I'll happily get some advice here)
* `hostname` - I didn't find it in tested drivers
Refs #6946Closes#7349
* github.com:scylladb/scylla:
transport: Update `connection_stage` in `system.clients`
transport: Retrieve driver's name and version from STARTUP message
transport: Notify `system.clients` about "protocol_version"
transport: On successful authentication add `username` to system.clients
This patch changes the code that iterates over the metrics to use a copy
of the metric names, to make it safe to remove metrics from the
metrics object.
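A self-contained model of the approach, with a plain map standing in for the metrics object:
```
#include <functional>
#include <map>
#include <string>
#include <vector>

using metrics_map = std::map<std::string, double>;

// Iterate over a snapshot of the names so the callback may erase entries
// from the live container without invalidating our iteration.
void for_each_metric(metrics_map& metrics,
                     const std::function<void(metrics_map&, const std::string&)>& fn) {
    std::vector<std::string> names;
    names.reserve(metrics.size());
    for (const auto& kv : metrics) {
        names.push_back(kv.first);
    }
    for (const auto& name : names) {
        fn(metrics, name);
    }
}
```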
Fixes#7488
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Allows QA to bypass the normal hardcoded 24h ttl of data and still
get "proper" behaviour w.r.t. the available stream set/generations.
I.e., one can manually change the cdc ttl option for an alternator table
after streams are enabled. Should not be exposed, but perhaps useful for
testing.
Closes#7483
Refs #7364
The number of tombstones can be large. As a stopgap measure, short of
just returning a source range (with keepalive), we can at least
alleviate the problem by using a chunked vector.
Closes#7433
Fixes#7435
Adds an "eor" (end-of-record) column to cdc log. This is non-null only on
last-in-timestamp group rows, i.e. end of a singular source "event".
A client can use this as a shortcut to knowing whether or not he has a
full cdc "record" for a given source mutation (single row change).
Closes#7436
Makes files shorter while still keeping the lines under 120 columns.
Separate from other commits to make review easier.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Don't require filtering when a continuous slice of the clustering key
is requested, even if partition is unrestricted. The read command we
generate will fetch just the selected data; filtering is unnecessary.
Some tests needed to update the expected results now that we're not
fetching the extra data needed for filtering. (Because tests don't do
the final trim to match selectors and assert instead on all the data
read.)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The username becomes known in the course of resolving challenges
from `PasswordAuthenticator`. That's why username is being set on
successful authentication; until then all users are "anonymous".
Meanwhile, `AllowAllAuthenticator` (the default) does not request a
username, so users logged in with it will remain "anonymous" in
`system.clients`.
Shuffling of code was necessary to unify existing infrastructure
for INSERTing entries into `system.clients` with later UPDATEs.
In some cases a collection is used to keep several elements,
so it's good to know this timing.
For example, a mutation_partition keeps a set of rows; if used
in the cache it can grow large, while if used in a mutation to apply it's
typically small. Plain replacement of the bst with a b-tree caused
performance degradation of mutation application because the b-tree
is only better at big sizes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This collection is widely used, any replacement should be
compared against it to better understand pros-n-cons.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Alternator does not yet support direct access to nested attributes in
expressions (this is issue #5024). But it's still good to have tests
covering this feature, to make it easier to check the implementation
of this feature when it comes.
Until now we did not have tests for using nested attributes in
*FilterExpression*. This patch adds a test for the straightforward case,
and also adds tests for the more elaborate combination of FilterExpression
and ProjectionExpression. This combination - see issue #6951 - means that
some attributes need to be retrieved despite not being projected (because
they are needed in a filter). When we support nested attributes there will
be special cases when the projected and filtered attributes are parts of
the same top-level attribute, so the code will need to handle those cases
correctly. As I was working on issue #6951 now, it is a good time to write
a test for these special cases, even if nested attributes aren't yet
supported - so we don't forget to handle these special cases later.
Both new tests pass on DynamoDB, and xfail on Alternator.
Refs #5024 (nested attributes)
Refs #6951 (FilterExpression with ProjectionExpression)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
A comment in test/alternator/test_lsi.py wrongly described the schema
of one of the test tables. Fix that comment.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch provides two more tests for issue #6951. As this issue was
already fixed, the two new tests pass.
The two new tests check two special cases which were handled correctly
but not yet tested - when the projected attribute is a key attribute of
the table or of one of its LSIs. Having these two additional tests will
ensure that any future refactoring or optimizations in this area of
the code (filtering, projection, and their combination) will not break these
special cases.
Refs #6951.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The DynamoDB API has for the Query and Scan requests two filtering
syntaxes - the old (QueryFilter or ScanFilter) and the new (FilterExpression).
Also for projection, it has an old syntax (AttributesToGet) and a new
one (ProjectionExpression). Combining an old-style and new-style parameter
is forbidden by DynamoDB, and should also be forbidden by Alternator.
This patch fixes, and removes the "xfail" tag from, two tests:
test_query_filter.py::test_query_filter_and_projection_expression
test_filter_expression.py::test_filter_expression_and_attributes_to_get
Refs #6951
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We had a bug when a Query/Scan had both projection (ProjectionExpression
or AttributesToGet) and filtering (FilterExpression or Query/ScanFilter).
The problem was that projection left only the requested attributes, and
the filter might have needed - and not got - additional attributes.
The solution in this patch is to add to the generated JSON item also
the extra attributes needed by filtering (if any), run the filter on
that, and only at the end remove the extra filtering attributes from
the item to be returned.
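A self-contained model of the fix, with string attributes and a predicate standing in for Alternator's JSON items and filter expressions:
```
#include <functional>
#include <map>
#include <set>
#include <string>

using item = std::map<std::string, std::string>;

// Materialize the projected attributes plus anything the filter needs,
// evaluate the filter on that, then strip the filter-only extras.
bool project_and_filter(const item& row,
                        const std::set<std::string>& projected,
                        const std::set<std::string>& filter_needs,
                        const std::function<bool(const item&)>& filter,
                        item& out) {
    for (const auto& [k, v] : row) {
        if (projected.count(k) || filter_needs.count(k)) {
            out.emplace(k, v);
        }
    }
    if (!filter(out)) {
        out.clear();
        return false;                 // row is filtered out entirely
    }
    for (const auto& k : filter_needs) {
        if (!projected.count(k)) {
            out.erase(k);             // was only needed for filtering
        }
    }
    return true;
}
```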
The two tests
test_query_filter.py::test_query_filter_and_attributes_to_get
test_filter_expression.py::test_filter_expression_and_projection_expression
which failed before this patch now pass, so we drop their "xfail" tag.
Fixes#6951.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
## Asking questions or requesting help
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
## Reporting an issue
Please use the [issue tracker](https://github.com/scylladb/scylla/issues/) to report issues or to suggest features. Fill in as much information as you can in the issue template, especially for performance problems.
## Contributing code to Scylla
Join the [Scylla Developers mailing list](https://groups.google.com/g/scylladb-dev) for deeper technical discussions and to discuss your ideas for contributions.
Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form to cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).
If you need help formatting or sending patches, [check out these instructions](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches).
The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md), so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so it should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.