The operation after gate.enter() in tracker::start() can fail and throw,
we should call gate.leave() in such case to avoid unbalanced enter and
leave calls. tracker::done() has similar issue too.
Fix it by removing the gate enter and leave logic in tracker start and
done. A helper tracker::run() is introduced to take care of the gate and
repair status.
In addition, the error log is improved. It now logs exceptions on all
shards in the summary. e.g.,
[shard 0] repair - repair id 1 failed: std::runtime_error
({shard 0: std::runtime_error (error0), shard 1: std::runtime_error (error1)})
Fixes#5074
Currently, the population stat is not increased for entries that are
evicted immediately on insert, however the code that does the eviction
still decreases the population stat, leading to an imbalance and in some
cases the underflow of the population stat. To fix, unconditionally
increase the population stat upon inserting an entry, regardless of
whether it is immediately evicted or not.
Fixes: #5123
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191001153215.82997-1-bdenes@scylladb.com>
The example Python code had wrong indentation, and wouldn't actually
work if naively copy-pasted. Noticed by Noam Hasson.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190929091440.28042-1-nyh@scylladb.com>
"
This is a collection of assorted patches that will be needed for LWT.
Most of them are trivial, but one touches a lot of files, so have a
good chance to cause rebase headache (I already had to rebase it on
top of Alternator). Lets push them earlier instead of carrying them in
the lwt branch.
"
* 'gleb/lwt-prepare-v2' of github.com:scylladb/seastar-dev:
lwt: make _last_timestamp_micros static
lwt: Add client_state::get_timestamp_for_paxos() function
lwt: Pass client_state reference all the way to storage_proxy::query
exceptions: Add a constructor for unavailable_exception that allows providing a custom message
serializer: Add std::variant support
lwt: Add missing functions to utils/UUID_gen.hh
"
This is the second version of the patch series. The previous one was just the second patch, this one adds more tests an another patch to make it easier to test that the new code has the same behavior as the old one.
"
* 'espindola/overflow-is-intentional' of https://github.com/espindola/scylla:
types: Simplify and explain from_varint_to_integer
Add more cast tests
Affects single-partition reads only.
Refs #5113
When executing a query on the replica we do several things in order to
narrow down the sstable set we read from.
For tables which use LeveledCompactionStrategy, we store sstables in
an interval set and we select only sstables whose partition ranges
overlap with the queried range. Other compaction strategies don't
organize the sstables and will select all sstables at this stage. The
reasoning behind this is that for non-LCS compaction strategies the
sstables' ranges will typically overlap and using interval sets in
this case would not be effective and would result in quadratic (in
sstable count) memory consumption.
The assumption for overlap does not hold if the sstables come from
repair or streaming, which generates non-overlapping sstables.
At a later stage, for single-partition queries, we use the sstables'
bloom filter (kept in memory) to drop sstables which surely don't
contain given partition. Then we proceed to sstable indexes to narrow
down the data file range.
Tables which don't use LCS will do unnecessary I/O to read index pages
for single-partition reads if the partition is outside of the
sstable's range and the bloom filter is ineffective (Refs #5112).
This patch fixes the problem by consulting sstable's partition range
in addition to the bloom filter, so that the non-overlapping sstables
will be filtered out with certainty and not depend on bloom filter's
efficiency.
It's also faster to drop sstables based on the keys than the bloom
filter.
Tests:
- unit (dev)
- manual using cqlsh
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190927122505.21932-1-tgrabiec@scylladb.com>
The method sstable::estimated_keys_for_range() was severely
under-estimating the number of partitions in an sstable for a given
token range.
The first reason is that it underestimated the number of sstable index
pages covered by the range, by one. In extreme, if the requested range
falls into a single index page, we will assume 0 pages, and report 1
partition. The reason is that we were using
get_sample_indexes_for_range(), which returns entries with the keys
falling into the range, not entries for pages which may contain the
keys.
A single page can have a lot of partitions though. By default, there
is a 1:20000 ratio between summary entry size and the data file size
covered by it. If partitions are small, that can be many hundreds of
partitions.
Another reason is that we underestimate the number of partitions in an
index page. We multiply the number of pages by:
(downsampling::BASE_SAMPLING_LEVEL * _components->summary.header.min_index_interval)
/ _components->summary.header.sampling_level
Using defaults, that means multiplying by 128. In the cassandra-stress
workload a single partition takes about 300 bytes in the data file and
summary entry is 22 bytes. That means a single page covers 22 * 20'000
= 440'000 bytes of the data file, which contains about 1'466
partitions. So we underestimate by an order of magnitude.
Underestimating the number of partitions will result in too small
bloom filters being generated for the sstables which are the output of
repair or streaming. This will make the bloom filters ineffective
which results in reads selecting more sstables than necessary.
The fix is to base the estimation on the number of index pages which
may contain keys for the range, and multiply that by the average key
count per index page.
Fixes#5112.
Refs #4994.
The output of test_key_count_estimation:
Before:
count = 10000
est = 10112
est([-inf; +inf]) = 512
est([0; 0]) = 128
est([0; 63]) = 128
est([0; 255]) = 128
est([0; 511]) = 128
est([0; 1023]) = 128
est([0; 4095]) = 256
est([0; 9999]) = 512
est([5000; 5000]) = 1
est([5000; 5063]) = 1
est([5000; 5255]) = 1
est([5000; 5511]) = 1
est([5000; 6023]) = 128
est([5000; 9095]) = 256
est([5000; 9999]) = 256
est(non-overlapping to the left) = 1
est(non-overlapping to the right) = 1
After:
count = 10000
est = 10112
est([-inf; +inf]) = 10112
est([0; 0]) = 2528
est([0; 63]) = 2528
est([0; 255]) = 2528
est([0; 511]) = 2528
est([0; 1023]) = 2528
est([0; 4095]) = 5056
est([0; 9999]) = 10112
est([5000; 5000]) = 2528
est([5000; 5063]) = 2528
est([5000; 5255]) = 2528
est([5000; 5511]) = 2528
est([5000; 6023]) = 5056
est([5000; 9095]) = 7584
est([5000; 9999]) = 7584
est(non-overlapping to the left) = 0
est(non-overlapping to the right) = 0
Tests:
- unit (dev)
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190927141339.31315-1-tgrabiec@scylladb.com>
`dbuild` was recently (24c732057) updated to run in interactive mode
when given no arguments; we can now update the README to mention that.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the toppartitions operation gathers results, it copies partition
keys with their schema_ptr:s. When these schema_ptr:s are copies
or destroyed, they can cause leaks or premature frees of the schema
in its original shard since reference count operations in are not atomic.
Fix that by converting the schema_ptr to a global_schema_ptr during
transportation.
Fixes#5104 (direct bug)
Fixes#5018 (schema prematurely freed, toppartitions previously executed on that node)
Fixes#4973 (corrupted memory pool of the same size class as schema, toppartitions previously executed on that node)
Tests: new test added that fails with the existing code in debug mode,
manual toppartitions test
Copying schema_ptrs across shards results in memory corruption since
lw_shared_ptr does not use atomic operations for reference counts.
Prevent that by converting schema_ptr:s to global_schema_ptr:s before
shipping them across shards in the map operation, and converting them
back to local schema_ptr:s in the reduce operation.
This allows keys from different stages in the schema's like to compare equal.
This is safe since the partition key cannot change, unlike the rest of the schema.
More importantly, it will allow us to compare keys made local after a pass through
global_schema_ptr, which does not guarantee that the schema_ptr conversion will be
the same even when starting with the same global_schema_ptr.
Throwing move constructors are a a pain; so we should try to make
them noexcept. Currently, global_schema_ptr's move constructor
throws an exception if used illegaly (moving from a different shard);
this patch changes it to an assert, on the grounds that this error
is impossible to recover from.
The direct motivation for the patch is the desire to store objects
containing a global_schema_ptr in a chunked_vector, to move lists
of partition keys across shards for the topppartitions functionality.
chunked_vector currently requires noexcept move constructors for its
value_type.
When a user type changes we were not recreating other uses types that
use it. This patch series fixes that and makes it clear which code is
responsible for it.
In the system.types table a user type refers to another by name. When
a user type is modified, only its entry in the table is changed.
At runtime a user type has direct pointer to the types it uses. To
handle the discrepancy we need to recreate any dependent types when a
entry in system.types changes.
Fixes#5049
If each client_state has its own copy of the variable two clients may
generate timestamps that clash and needlessly create contention. Making
the variable shared between all client_state on the same shard will make
sure this will not happen to two clients on the same shard. It may still
happen for two client on two different shards or two different nodes.
Paxos needs a unique timestamp that is greater than some other
timestamp, so that the next round had more chances to succeed.
Add a function that returns such a timestamp.
client_state holds a state to generate monotonically increasing unique
timestamp. Queries with a SERIAL consistency level need it to generate
a paxos round.
In the system.types table a user type refers to another by name. When
a user type is modified, only its entry in the table is changed.
At runtime a user type has direct pointer to the types it uses. To
handle the discrepancy we need to recreate any dependent types when a
entry in system.types changes.
Fixes#5049
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The way schema changes propagate is by editing the system tables and
comparing the before and after state.
When a user type A uses another user type B and we modify B, the
representation of A in the system table doesn't change, so this code
was not producing any changes on the diff that the receiving side
uses.
Deleting it makes it clear that it is the receiver's responsibility to
handle dependent user types.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this patch db::cql_type_parser::raw_builder creates a local copy
of the list of existing types and uses that internally. By doing that
build() should have no observable behavior other than returning the
new types.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We were never passing a null pointer and never saving a copy of the
lw_shared_ptr. Passing a reference is more flexible as not all callers
are required to hold the user_types_metadata in a lw_shared_ptr.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
* seastar b56a8c5045...c21a7557f9 (3):
> net: socket::{set,get}_reuseaddr() should not be virtual
> iotune: print verbose message in case of shutdown errors
> iotune: close test file on shutdown
Fixes#4946.
1. Add assert in remove_response_handler to make crashes like in #5032 easier to understand.
2. Lookup the view_update_write_response_handler id before calling timeout_cb and tolerate it not found.
Just log a warning if this happened.
Fixes#5032
"
Currently affects only counter tables.
Introduced in 27014a2.
mutation_partition(s, mp) is incorrect because it uses s to interpret
mp, while it should use mp_schema.
We may hit this if the current node has a newer schema than the
incoming mutation. This can happen during table schema altering when we receive the
mutation from a node which hasn't processed the schema change yet.
This is undefined behavior in general. If the alter was adding or
removing columns, this may result in corruption of the write where
values of one column are inserted into a different column.
Fixes#5095.
"
* 'fix-schema-alter-counter-tables' of https://github.com/tgrabiec/scylla:
mvcc: Fix incorrect schema verison being used to copy the mutation when applying
mutation_partition: Track and validate schema version in debug builds
tests: Use the correct schema to access mutation_partition
Currently affects only counter tables.
Introduced in 27014a2.
mutation_partition(s, mp) is incorrect, because it uses s to interpret
mp, while it should use mp_schema.
We may hit this if the current node has a newer schema than the
incoming mutation. This can happen during alter when we receive the
mutation from a node which hasn't processed the schema change yet.
This is undefined behavior in general. If the alter was adding or
removing columns, this may result in corruption of the write where
values of one column are inserted into a different column.
Fixes#5095.
This patch makes mutation_partition validate the invariant that it's
supposed to be accessed only with the schema version which it conforms
to.
Refs #5095
* seastar e51a1a8ed9...b56a8c5045 (3):
> net: add support for UNIX-domain sockets
> future: Warn on promise::set_exception with no corresponding future or task
> Merge "Handle exceptions in repeat_until_value and misc cleanups" from Rafael
Handle a race where a write handler is removed from _response_handlers
but not yet from _view_update_handlers_list.
Fixes#5032
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor remove_response_handler_entry out of remove_response_handler,
to be called on a valid iterator found by _response_handlers.find(id).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Help identify cases like seen in #5032 where the handler id
wasn't found from the on_down -> timeout_cb path.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
compaction_manager::perform_sstable_upgrade() fails when it feeds
compaction mechanism with shared sstables. Shared sstables should
be ignored when performing upgrade and so wait for reshard to pick
them up in parallel. Whenever a shared sstable is brought up either
on restart or via refresh, reshard procedure kicks in.
Reshard picks the highest supported format so the upgrade for
shared sstable will naturally take place.
Fixes#5056.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190925042414.4330-1-raphaelsc@scylladb.com>
- Update the outdated comments in do_stop_gossiping. It was
storage_service not storage_proxy that used the lock. More
importantly, storage_service does not use it any more.
- Drop the unused timer_callback_lock and timer_callback_unlock API
- Use with_semaphore to make sure the semaphore usage is balanced.
- Add log in gossiper::do_stop_gossiping when it tries to take the
semaphore to help debug hang during the shutdown.
Refs: #4891
Refs: #4971
A documentation file that is intended to be a place for anything
debugging related: getting started tutorial, tips and tricks and
advanced guides.
For now it contains a short introductions, some selected links to
more in-depth documentation and some trips and tricks that I could think
off the top of my head.
One of those tricks describes how to load cores obtained from
relocatable packages inside the `dbuild` container. I originally
intended to add that to `tools/toolchain/README.md` but was convinced
that `docs/debugging.md` would be a better place for this.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190924133110.15069-1-bdenes@scylladb.com>