If system_keyspace::stop() is called before system_keyspace::shutdown(),
it will never finish, because the uncleared shared pointers will keep
it alive indefinitely.
Currently this can happen if an exception is thrown before the construction
of the shutdown() defer. This patch moves the shutdown() call to immediately
before stop(). I see no reason why it should be elsewhere.
Fixesscylladb/scylla-enterprise#4380
(cherry picked from commit eeaf4c3443)
Closesscylladb/scylladb#20146
There are two schemas associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).
It's easy to mix up the two and break something as a result.
The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.
The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.
This series fixes the known mixups between the two — when setting up compression,
and when setting up the bloom filters.
Fixesscylladb/scylladb#16065
The bug is present in all supported versions, so the patch has to be backported to all of them.
(cherry picked from commit a1834efd82)
(cherry picked from commit d10b38ba5b)
(cherry picked from commit 1a8ee69a43)
Refs scylladb/scylladb#19695Closesscylladb/scylladb#19878
* github.com:scylladb/scylladb:
sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's
sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's
sstables: for i_filter downcasts, use dynamic_cast instead of static_cast
There are two schema's associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).
It's easy to mix up the two and break something as a result.
The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.
The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.
The problem fixed by this patch is that the writer was wrongly creating
the compressor objects based on its own schema, but using them based
based on the sstable's schema the sstable's schema.
This patch forces the writer to use the sstable's schema for both.
(cherry picked from commit 1a8ee69a43)
There are two schema's associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).
It's easy to mix up the two and break something as a result.
The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.
The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.
The problem fixed by this patch is that the writer was wrongly creating
the filter based on its own schema, while the layer outside the writer
was interpreting it as if it was created with the sstable's schema.
This patch forces the writer to pick the filter's parameters based on the
sstable's schema instead.
(cherry picked from commit d10b38ba5b)
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.
The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.
Fixes https://github.com/scylladb/scylladb/issues/19722
Closes scylladb/scylladb#19717
* github.com:scylladb/scylladb:
boost/bloom_filter_test: add testcase to verify unlinked sstables are not reloaded
sstables: do not reload components of unlinked sstables
sstables/sstables_manager: introduce on_unlink method
(cherry picked from commit 591876b44e)
Backported from #19717 to 5.4
Closesscylladb/scylladb#19831
When writing a mutation, it might happen that there are no live targets
to send the mutation to, yet the request can be satisfied. For example,
when writing with CL=ANY to a dead node, the request is completed by
storing a local hint.
Currently, in that case, a write response handler is created for the
request and it remains active until it timeouts because it is not
removed anywhere, even though the write is completed successfuly after
storing the hint. The response handler should be removed usually when
receiving responses from all targets, but in this case there are no
targets to trigger the removal.
In this commit we check if we don't have live targets to send the
mutation to. If so, we remove the response handler immediately.
Fixesscylladb/scylladb#19529
(cherry picked from commit a9fdd0a93a)
Closesscylladb/scylladb#19679
Currently, when we calculate the number of deactivated segments
in test_commitlog_delete_when_over_disk_limit, we only count the
segments that were active during the first flush. However, during
the test, there may have been more than one flush, and a segment
could have been created between them. This segment would sometimes
get deactivated and even destroyed, and as a result, the count of
destroyed segments would appear larger than the count of deactivated
ones.
This patch fixes this behavior by accounting for all segments that
were active during any flush instead of just segments active during
the first flush.
Fixes#10527
(cherry-picked from commit 39a8f4310d)
Closesscylladb/scylladb#19706
apply_monotonically() is run with reclaim disabled. So with some bad luck,
sentinel insertion might fail with bad_alloc even on a perfectly healthy node.
We can't deal with the failure of sentinel insertion, so this will result in a
crash.
This patch prevents the spurious OOM by reserving some memory (1 LSA segment)
and only making it available right before the critical allocations.
Fixes https://github.com/scylladb/scylladb/issues/19552
(cherry picked from commit f784be6a7e)
(cherry picked from commit 7b3f55a65f)
(cherry picked from commit 78d6471ce4)
Refs #19617Closesscylladb/scylladb#19676
* github.com:scylladb/scylladb:
mutation_partition_v2: in apply_monotonically(), avoid bad_alloc on sentinel insertion
logalloc: add hold_reserve
logalloc: generalize refill_emergency_reserve()
The reader concurrency semaphore restricts the concurrency of reads that require CPU (intention: they read from the cache) to 1, meaning that if there is even a single active read which declares that it needs just CPU to proceed, no new read is admitted. This is meant to keep the concurrency of reads in the cache at 1. The idea is that concurrency in the cache is not useful: it just leads to the reactor rotating between these reads, all of the finishing later then they could if they were the only active read in the cache.
This was observed to backfire in the case where there reads from a single table are mostly very fast, but on some keys are very slow (hint: collection full of tombstones). In this case the slow read keeps up the fast reads in the queue, increasing the 99th percentile latencies significantly.
This series proposes to fix this, by making the CPU concurrency configurable. We don't like tunables like this and this is not a proper fix, but a workaround. The proper fix would be to allow to cut any page early, but we cannot cut a page in the middle of a row. We could maybe have a way of detecting slow reads and excluding them from the CPU concurrency. This would be a heuristic and it would be hard to get right. So in this series a robust and simple configurable is offered, which can be used on those few clusters which do suffer from the too strict concurrency limit. We have seen it in very few cases so far, so this doesn't seem to be wide-spread.
Fixes: https://github.com/scylladb/scylladb/issues/19017
This PR backports https://github.com/scylladb/scylladb/pull/19018 and its follow-up https://github.com/scylladb/scylladb/pull/19600.
Closesscylladb/scylladb#19646
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: execution_loop(): move maybe_admit_waiters() to the inner loop
test/boost/reader_concurrency_semaphore_test: add test for live-configurable cpu concurrency
test/boost/reader_concurrency_semaphore_test: hoist require_can_admit
reader_concurrency_semaphore: wire in the configurable cpu concurrency
reader_concurrency_semaphore: add cpu_concurrency constructor parameter
db/config: introduce reader_concurrency_semahore_cpu_concurrency
mutation_partition_v2::apply_monotonically() needs to perform some allocations
in a destructor, to ensure that the invariants of the data structure are
restored before returning. But it is usually called with reclaiming disabled,
so the allocations might fail even in a perfectly healthy node with plenty of
reclaimable memory.
This patch adds a mechanism which allows to reserve some LSA memory (by
asking the allocator to keep it unused) and make it available for allocation
right when we need to guarantee allocation success.
(cherry picked from commit 7b3f55a65f)
It was observed that some use cases might append old data constantly to
memtable, blocking GC of expired tombstones.
That's because timestamp of memtable is unconditionally used for
calculating max purgeable, even when the memtable doesn't contain the
key of the tombstone we're trying to GC.
The idea is to treat memtable as we treat L0 sstables, i.e. it will
only prevent GC if it contains data that is possibly shadowed by the
expired tombstone (after checking for key presence and timestamp).
Memtable will usually have a small subset of keys in largest tier,
so after this change, a large fraction of keys containing expired
tombstones can be GCed when memtable contains old data.
Fixes#17599.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 38699f6c3d)
Closesscylladb/scylladb#19551
Normally, the space overhead for TWCS is 1/N, where is number of windows. But during off-strategy, the overhead is 100% because input sstables cannot be released earlier.
Reshaping a TWCS table that takes ~50% of available space can result in system running out of space.
That's fixed by restricting every TWCS off-strategy job to 10% of free space in disk. Tables that aren't big will not be penalized with increased write amplification, as all input (disjoint) sstables can still be compacted in a single round.
Fixes https://github.com/scylladb/scylladb/issues/16514.
(cherry picked from commit b8bd4c51c2)
(cherry picked from commit 51c7ee889e)
(cherry picked from commit 0ce8ee03f1)
(cherry picked from commit ace4e5111e)
Refs https://github.com/scylladb/scylladb/pull/18137
note for maintainer: first cleanup (conflicting) patch was removed and doesn't change anything.
Closesscylladb/scylladb#19549
* github.com:scylladb/scylladb:
compaction: Reduce twcs off-strategy space overhead to 10% of free space
compaction: wire storage free space into reshape procedure
sstables: Allow to get free space from underlying storage
This is currently a lambda in a test, hoist it into the global scope and
make it into a function, so other tests can use it too (in the next
patch).
(cherry picked from commit 9cbdd8ef92)
This patch adds a check if aggregation query is doing single-partition read and if so, makes the query to not use forward_service and do not parallelize the request.
Fixesscylladb/scylladb#19349
(cherry picked from commit e9ace7c203)
(cherry picked from commit 8eb5ca8202)
Refs scylladb/scylladb#19350Closesscylladb/scylladb#19500
* github.com:scylladb/scylladb:
test/boost/cql_query_test: add test for single-partition aggregation
cql3/select_statement: do not parallelize single-partition aggregations
TWCS off-strategy suffers with 100% space overhead, so a big TWCS table
can cause scylla to run out of disk space during node ops.
To not penalize TWCS tables, that take a small percentage of disk,
with increased write ampl, TWCS off-strategy will be restricted to
10% of free disk space. Then small tables can still compact all
disjoint sstables in a single round.
Fixes#16514.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit ace4e5111e)
After this, TWCS reshape procedure can be changed to limit job
to 10% of available space.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 0ce8ee03f1)
That will be used in turn to restrict reshape to 10% of available space
in underlying storage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 51c7ee889e)
The name of the Scylla table backing an Alternator LSI looks like `basename:!lsiname`. Some REST API clients (including Scylla Manager) when they send a "!" character in the REST API request path may decide to "URL encode" it - convert it to `%21`.
Because of a Seastar bug (https://github.com/scylladb/seastar/issues/725) Scylla's REST API server forgets to do the URL decoding on the path part of the request, which leads to the REST API request failing to address the LSI table.
The first patch in this PR fixes the bug by using a new Seastar API introduced in https://github.com/scylladb/seastar/pull/2125 that does the URL decoding as appropriate. The second patch in the PR is a new test for this bug, which fails without the fix, and passes afterwards.
Fixes#5883.
Closesscylladb/scylladb#18286
* github.com:scylladb/scylladb:
test/alternator: test addressing LSI using REST API
REST API: stop using deprecated, buggy, path parameter
(cherry picked from commit 0438febdc9)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Fixes#19334
Current impl uses hardcoded printing of a few extensions.
Instead, use extension options to string and print all.
(cherry picked from commit 73abc56d79)
This test in topology_experimental_raft/test_alternator.py wants to
check that during Alternator TTL's expiration scans, ALL of the CPU was
used in the "streaming" scheduling group and not in the "statement"
scheduling group. But to allow for some fluke requests (e.g., from the
driver), the test actually allows work in the statement group to be
up to 1% of the work.
Unfortunately, in one test run - a very slow debug+aarch64 run - we
saw the work on the statement group reach 1.4%, failing the test.
I don't know exactly where this work comes from, perhaps the driver,
but before this bug was fixed we saw more than 58% of the work in the
wrong scheduling group, so neither 1% or 1.4% is a sign that the bug
came back. In fact, let's just change the threshold in the test to 10%,
which is also much lower than the pre-fix value of 58%, so is still a
valid regression test.
Fixes#19307
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19323
(cherry picked from commit 9fc70a28ca)
Fixesscylladb/scylla-pkg#3845
Don't overwrite (or rather change) AWS credentials variables if already set in
enclosing environment. Ensures EAR tests for AWS KMS can run properly in CI.
v2:
* Allow environment variables in reading obj storage config - allows CI to
use real credentials in env without risking putting them info less seure
files
* Don't write credentials info from miniserver into config, instead use said
environment vars to propagate creds.
v3:
* Fix python launch scripts to not clear environment, thus retaining above aws envs.
(cherry picked from commit 5056a98289)
Closesscylladb/scylladb#19336
utils/chunked_vector::reserve_partial: fix usage in callers
The method reserve_partial(), when used as documented, quits before the
intended capacity can be reserved fully. This can lead to overallocation
of memory in the last chunk when data is inserted to the chunked vector.
The method itself doesn't have any bug but the way it is being used by
the callers needs to be updated to get the desired behaviour.
Instead of calling it repeatedly with the value returned from the
previous call until it returns zero, it should be repeatedly called with
the intended size until the vector's capacity reaches that size.
This PR updates the method comment and all the callers to use the
right way.
Fixes#19254Closesscylladb/scylladb#19279
* github.com:scylladb/scylladb:
utils/large_bitset: remove unused includes identified by clangd
utils/large_bitset: use thread::maybe_yield()
test/boost/chunked_managed_vector_test: fix testcase tests_reserve_partial
utils/lsa/chunked_managed_vector: fix reserve_partial()
utils/chunked_vector: return void from reserve_partial and make_room
test/boost/chunked_vector_test: fix testcase tests_reserve_partial
utils/chunked_vector::reserve_partial: fix usage in callers
(cherry picked from commit b2ebc172d0)
Backported from #19308 to 5.4
Closesscylladb/scylladb#19355
Store that maintenance scheduling group inside the sstables_manager. The
next patch will use this to run the components reloader fiber.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 79f6746298)
This patch adds a test for issue #18719: Although the Alternator TTL
work is supposedly done in the "streaming" scheduling group, it turned
out we had a bug where work sent on behalf of that code to other nodes
failed to inherit the correct scheduling group, and was done in the
normal ("statement") group.
Because this problem only happens when more than one node is involved,
the test is in the multi-node test framework test/topology_experimental_raft.
The test uses the Alternator API. We already had in that framework a
test using the Alternator API (a test for alternator+tablets), so in
this patch we move the common Alternator utility functions to a common
file, test_alternator.py, where I also put the new test.
The test is based on metrics: We write expiring data, wait for it to expire,
and then check the metrics on how much CPU work was done in the wrong
scheduling group ("statement"). Before #18719 was fixed, a lot of work
was done there (more than half of the work done in the right group).
After the issue was fixed in the previous patch, the work on the wrong
scheduling group went down to zero.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 1fe8f22d89)
Modifications in the cherry-pick:
* Moved test to topology_custom directory, so it runs without tablets
* use the server_add() function instead of the newer add_servers() which
didn't yet exist in this branch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This branch (5.4) does NOT support tablets, and we don't want to run
any tests with the "tablets" experimental feature. When we made test.py
enable that feature by default, it was probably considered harmless -
the partial implementation we had in this branch won't do anything if
tablets aren't actually enabled for a specific keyspace.
But unfortunately, Alternator doesn't work with tablets enabled (there
was a bug in the callback during table creation), so we can't run any
Alternator tests from test.py (like the one we we wan to backport for
Alternator TTL scheduling groups) unless we drop that experimental
feature.
Note that one specific test subdirectory,
test/topology_experimental_raft, does enable this experimental
flag. The others shouldn't.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Fetching only the first page is not the intuitive behavior expected by users.
This causes flakiness in some tests which generate variable amount of
keys depending on execution speed and verify later that all keys were
written using a single SELECT statement. When the amount of keys
becomes larger than page size, the test fails.
Fixes#18774
(cherry picked from commit 43b907b499)
Closesscylladb/scylladb#19129
before this change, when performing memtable_test, we expect that
the memtables of ks.cf is the only memtables being flushed. and
we inject 4 failures in the code path of flush, and wait until 4
of them are triggered. but in the background, `dirty_memory_manager`
performs flush on all tables when necessary. so, the total number of
failures is not necessary the total number of failures triggered
when flushing ks.cf, some of them could be triggered when flushing
system tables. that's why we have sporadict test failures from
this test. as we might check `t.min_memtable_timestamp()` too soon.
after this change, we increase `unspooled_dirty_soft_limit` setting,
in order to disable `dirty_memory_manager`, so that the only flush
is performed by the test.
Fixes#19034
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 223fba3243)
before this change, we verify the behavior of design under test using
`BOOST_ASSERT()`, which is a wrapper around `assert()`, so if a test
fails, the test just aborts. this is not very helpful for postmortem
debugging.
after this change, we use `BOOST_REQUIRE` macro for verifying the
behavior, so that Boost.Test prints out the condition if it does not
hold when we test it.
Refs #19034
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 2df4e9cfc2)
Incremented the components_memory_reclaim_threshold config's default
value to 0.2 as the previous value was too strict and caused unnecessary
eviction in otherwise healthy clusters.
Fixes#18607
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 3d7d1fa72a)
Closesscylladb/scylladb#19013
PR #17771 introduced a threshold for the total memory used by all bloom filters across SSTables. When the total usage surpasses the threshold, the largest bloom filter will be removed from memory, bringing the total usage back under the threshold. This PR adds support for reloading such reclaimed bloom filters back into memory when memory becomes available (i.e., within the 10% of available memory earmarked for the reclaimable components).
The SSTables manager now maintains a list of all SSTables whose bloom filter was removed from memory and attempts to reload them when an SSTable, whose bloom filter is still in memory, gets deleted. The manager reloads from the smallest to the largest bloom filter to maximize the number of filters being reloaded into memory.
Backported from https://github.com/scylladb/scylladb/pull/18186 to 5.4.
Closesscylladb/scylladb#18660
* github.com:scylladb/scylladb:
sstable_datafile_test: add testcase to test reclaim during reload
sstable_datafile_test: add test to verify auto reload of reclaimed components
sstables_manager: reload previously reclaimed components when memory is available
sstables_manager: start a fiber to reload components
sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables
sstable_datafile_test: add test to verify reclaimed components reload
sstables: support reloading reclaimed components
sstables_manager: add new intrusive set to track the reclaimed sstables
sstable: add link and comparator class to support new instrusive set
sstable: renamed intrusive list link type
sstable: track memory reclaimed from components per sstable
sstable: rename local variable in sstable::total_reclaimable_memory_size
When a table has secondary indexes on *multiple* columns, and several
such columns are used for filtering in a query, Scylla chooses one
of these indexes as the main driver of the query, and the second
column's restriction is implemented as filtering.
Before this patch, the index to use was chosen fairly randomly, based on
the order of the indexes in the schema. This order may be different in
different coordinators, and may even change across restarts on the same
coordinators. This is not only inconsistent, it can cause outright wrong
results when using *paging* and switching (or restarting) coordinates
in the middle of a paged scan... One coordinator saves one index's key
in the paging state, and then the other coordinator gets this paging
state and wrongly believes it is supposed to be a key of a *different*
index.
The fix in this patch is to pick the index suitable for the first
indexed column mentioned in the query. This has two benefits over
the situation before the patch:
1. The decision of which index to use no longer changes between
coordinators or across restarts - it just depends on the schema
and the specific query.
2. Different indexes can have different "specificity" so using one
or the other can change the query's performance. After this patch,
the user is in control over which index is used by changing the
order of terms in the query. A curious user can use tracing to
check which index was used to implement a particular query.
An xfailing test we had for this issue no longer fails, so the "xfail"
marker is removed.
Fixes#7969
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 77c61f907e)
Closesscylladb/scylladb#18963
The function intersection(r1,r2) in statement_restrictions.cc is used
when several WHERE restrictions were applied to the same column.
For example, for "WHERE b<1 AND b<2" the intersection of the two ranges
is calculated to be b<1.
As noted in issue #18690, Scylla is inconsistent in where it allows or
doesn't allow these intersecting restrictions. But where they are
allowed they must be implemented correctly. And it turns out the
function intersection() had a bug that caused it to sometimes enter
an infinite loop - when the intent was only to call itself once with
swapped parameters.
This patch includes a test reproducing this bug, and a fix for the
bug. The test hangs before the fix, and passes after the fix.
While at it, I carefully reviewed the entire code used to implement
the intersection() function to try to make sure that the bug we found
was the only one. I also added a few more comments where I thought they
were needed to understand complicated logic of the code.
The bug, the fix and the test were originally discovered by
Michał Chojnowski.
Fixes#18688
Refs #18690
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 27ab560abd)
Closesscylladb/scylladb#18717
We have to account for moved objects as well
as copied objects so they will be balanced with
the respective `del_live_object` calls called
by the destructor.
However, since chunked_vector requires the
value_type to be nothrow_move_constructible,
just count the additional live object, but
do not modify _countdown or, respectively, throw
an exception, as this should be considered only
for the default and copy constructors.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
when migrating to the uuid-based identifiers, the mapping from the
integer-based generation to the shard-id is preserved. we used to have
"gen % smp_count" for calculating the shard which is responsible to host
a given sstable. despite that this is not a documented behavior, this is
handy when we try to correlate an sstable to a shard, typically when
looking at a performance issue.
in this change, a new subcommand is added to expose the connection
between the sstable and its "owner" shards.
Fixes#16343
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16345
(cherry picked from commit 273ee36bee)
The testcase uses an sstable whose mutation key and the generation are
owned by different shards. Due to this, when process_sstable_dir is
called, the sstable gets loaded into a different shard than the one that
was intended. This also means that the sstable and the sstable manager
end up in different shards.
The following patch will introduce a condition variable in sstables
manager which will be signalled from the sstables. If the sstable and
the sstable manager are in different shards, the signalling will cause
the testcase to fail in debug mode with this error : "Promise task was
set on shard x but made ready on shard y". So, fix it by supplying
appropriate generation number owned by the same shard which owns the
mutation key as well.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 24064064e9)
The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.
Given the sequential nature, the previous ping must finish so the next
ping can start. We timeout pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. 3 subsequent timed
out pings to a given node were sufficient for the Raft listener to "mark
server as down" (the listener used a threshold of 1s).
Increase the ping timeout to 600ms which should be enough even for
pinging the opposite side of Earth, and make it tunable.
Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed out pings which
makes it more robust in presence of transient network hiccups.
In the future we'll most likely want to decrease the Raft listener
threshold again, if we use Raft for data path -- so leader elections
start quickly after leader failures. (Faster than 2s). To do that we'll
have to improve the design of the direct failure detector.
Ref: scylladb/scylladb#16410Fixes: scylladb/scylladb#16607
---
I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 subsequent failed pings, which is happens with certain (high)
probability depending on the latency jitter. Then as soon as we get a
successful ping, we mark server back as alive.
With the change, the phenomenon no longer appears.
(cherry picked from commit 8df6d10e88)
Closesscylladb/scylladb#18559
More than three years ago, in issue #7949, we noticed that trying to
set a `map<ascii, int>` from JSON input (i.e., using INSERT JSON or the
fromJson() function) fails - the ascii key is incorrectly parsed.
We fixed that issue in commit 75109e9519
but unfortunately, did not do our due diligence: We did not write enough
tests inspired by this bug, and failed to discover that actually we have
the same bug for many other key types, not just for "ascii". Specifically,
the following key types have exactly the same bug:
* blob
* date
* inet
* time
* timestamp
* timeuuid
* uuid
Other types, like numbers or boolean worked "by accident" - instead of
parsing them as a normal string, we asked the JSON parser to parse them
again after removing the quotes, and because unquoted numbers and
unquoted true/false happwn to work in JSON, this didn't fail.
The fix here is very simple - for all *native* types (i.e., not
collections or tuples), the encoding of the key in JSON is simply a
quoted string - and removing the quotes is all we need to do and there's
no need to run the JSON parser a second time. Only for more elaborate
types - collections and tuples - we need to run the JSON parser a
second time on the key string to build the more elaborate object.
This patch also includes tests for fromJson() reading a map with all
native key types, confirming that all the aforementioned key types
were broken before this patch, and all key types (including the numbers
and booleans which worked even befoe this patch) work with this patch.
Fixes#18477.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 21557cfaa6)
Closesscylladb/scylladb#18522
When reclaiming memory from bloom filters, do not remove them from
_recognised_components, as that leads to the on-disk filter component
being left back on disk when the SSTable is deleted.
Fixes#18398
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#18400
(cherry picked from commit 6af2659b57)
Add a positive test, as it turns out we had some false-positive
validation bugs in the validator and we need a regression test for this.
(cherry picked from commit 2335f42b2b)