Fixes#7345Fixes#7346
Do a more efficient collection skip when doing paging, instead of
iterating the full sets.
Ensure some semblance of sanity in the parent-child relationship
between shards by ensuring token order sorting and finding the
apparent previous ID coverting the approximate range of new gen.
Fix endsequencenumber generation by looking at whether we are
last gen or not, instead of the (not filled in) 'expired' column.
Fixes#7344
It is not data really needed, as shard_id:s are not required
to be unique across streams, and also because the length limit
on shard_id text representation.
As a side effect, shard iter instead carries the stream arn.
Fixes#7347
If cdc stream id:s are older than either table creation or now - 24h
we can skip them in describe_stream, to minimize the amount of
shards being returned.
"
Misc. cleanups and minor optimizations of token_metadata methods
in preparation to futurizing parts of the api around
update_pending_ranges and abstract_replication_strategy::calculate_natural_endpoints,
to prevent reactor stalls on these paths
Test: unit(dev)
"
* 'token_metadata_cleanup' of github.com:bhalevy/scylla:
token_metadata: get rid of unused calculate_pending_ranges_for_* methods
token_metadata: get rid of clone_after_all_settled
token_metadata_impl: remove_endpoint: do not sort tokens
token_metadata_impl: always sort_tokens in place
On some environment such as CentOS8, journalctl --user -xe does not work
since journald is running in volatile mode.
The issue cannnot fix in non-root mode, as a workaround we should logging to
a file instead of journal.
Also added scylla_logrotate to ExecStartPre which rename previous log file,
since StandardOutput=file:/path/to/file will erase existing file when service
restarted.
Fixes#7131Closes#7326
Instead of relying on SCYLLA-VERSION-GEN to be consistently
updated in each submodule, propagate the top-level product-version-release
to all submodules. This reduces the churn required for each release,
and makes the release strings consistent (previously, the git hash in each
was different).
Closes#7268
* seastar 292ba734bc...8c8fd3ed28 (15):
> semaphore_units: add return_units and return_all
> semaphore_units: release: mark as noexcept
> circular_buffer: support non-default-constructible allocators correctly
> core/shared_ptr: Expose use count through {lw_}enable_shared_from_this
> memory: align allocations to std::max_align_t
> util/log: logger::do_log(): go easier on allocations
> doc: add link to multipage version of tutorial
> doc: fix the output directories of split and tutorial.html
> build: do not copy htmlsplit.py to build dir
> doc: add "--input" and "--output-dir" options to htmlsplit.py
> doc: update split script to use xml.etree.ElementTree
> Merge "shared_future: make functions noexcept" from Benny
> tutorial: add linebreak between sections
> tutorial: format "future<int>" as inline code block
> docs: specify HTML language code for tutorial.html
This patch change the way the request object is passed,
using a reference instead of temporaries.
'exists' test is passing in debug mode, whereas it was
always failing before.
Fixes#7261 by ensuring request object is alive for all commands
during the whole request duration.
Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200924202034.30399-1-etienne.adam@gmail.com>
Lua 5.4 added an extra parameter to lua_resume()[1]. The parameter
denotes the number of arguments yielded, but our coroutines
don't yield any arguments, so we can just ignore it.
Define a macro to allow adding extra stuff with Lua 5.4,
and use it to supply the extra parameter.
[1] https://www.lua.org/manual/5.4/manual.html#8.3Closes#7324
Clang eagerly instantiates templates, apparently with the following
algorithm:
- if both the declaration and definition are seen at the time of
instantiation, instantiate the template
- if only the declaration is see at the time of instantiation, just emit
a reference to the template; even if the definition is later seen,
it is not instantiated
The "reference" in the second case is a relocation entry in the object file
that is satisfied at link time by the linker, but if no other object file
instantiated the needed template, a link error results.
These problems are hard to diagnose but easy to fix. This series fixes all
known such issues in the code base. It was tested on gcc as well.
Closes#7322
* github.com:scylladb/scylla:
query-result-reader: order idl implementations correctly
frozen_schema: order idl implementations correctly
idl-compiler: generate views after serializers
On Ubuntu 16/18 and Debian 9, LimitNOFILE is set to 4096 and not able to override from
user unit.
To run scylla-server in such environment, we need to turn on developer mode and show
warnings.
Fixes#7133Closes#7323
Clang eagerly instantiates templates, so if it needs a template
function for which it has a declaration but not a definition, it
will not instantiate the definition when it sees it. This causes
link errors.
Fix by ordering the idl implementation files so that definitions
come before uses.
Clang eagerly instantiates templates, so if it needs a template
function for which it has a declaration but not a definition, it
will not instantiate the definition when it sees it. This causes
link errors.
Fix by ordering the idl implementation files so that definitions
come before uses.
Clang eagerly instantiates templates, so if it needs a template
function for which it has a declaration but not a definition, it
will not instantiate the definition when it sees it. This causes
link errors.
In this case, the views use the serializer implementations, but are
generated before them.
Fix by generating the view implementations after the serializer
implementations that they use.
Add disable option for test configuration.
Tests in this list will be disabled for all modes.
* alejo/next-disable-raft-tests-01:
Raft: disable boost tests for now
Tests: add disable to configuration
Raft: Remove tests for now
For suite.yaml add an extra configuration option disable.
Tests in this list will disabled for all modes.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Remove raft C++ tests until raft is included in build process.
[tgrabiec]: Fixes test.py failure. Tests are not compiled unless --build-raft is
passed to configure.py and we cannot enable it by default yet.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20201002102847.1140775-1-alejo.sanchez@scylladb.com>
Unavailable exception means that operation was not started and it can be
retried safely. If lwt fails in the learn stage though it most
certainly means that its effect will be observable already. The patch
returns timeout exception instead which means uncertainty.
Fixes#7258
Message-Id: <20201001130724.GA2283830@scylladb.com>
This is the beginning of raft protocol implementation. It only supports
log replication and voter state machine. The main difference between
this one and the RFC (besides having voter state machine) is that the
approach taken here is to implement raft as a deterministic state
machine and move all the IO processing away from the main logic.
To do that some changes to RPC interface was required: all verbs are now
one way meaning that sending a request does not wait for a reply and
the reply arrives as a separate message (or not at all, it is safe to
drop packets).
* scylla-dev/raft-v4:
raft: add a short readme file
raft: compile raft tests
raft: add raft tests
raft: Implement log replication and leader election
raft: Introduce raft interface header
Compilation is not enabled by default as it requires coroutines support
and may require special compiler (until distributed one fixes all the
bugs related to coroutines). To enable raft tests compilation new
configure.py option is added (--build-raft).
Add test for currently implemented raft features. replication_test
tests replication functionality with various initial log configurations.
raft_fsm_test test voting state machine functionality.
This patch introduces partial RAFT implementation. It has only log
replication and leader election support. Snapshotting and configuration
change along with other, smaller features are not yet implemented.
The approach taken by this implementation is to have a deterministic
state machine coded in raft::fsm. What makes the FSM deterministic is
that it does not do any IO by itself. It only takes an input (which may
be a networking message, time tick or new append message), changes its
state and produce an output. The output contains the state that has
to be persisted, messages that need to be sent and entries that may
be applied (in that order). The input and output of the FSM is handled
by raft::server class. It uses raft::rpc interface to send and receive
messages and raft::storage interface to implement persistence.
This commit introduce public raft interfaces. raft::server represents
single raft server instance. raft::state_machine represents a user
defined state machine. raft::rpc, raft::rpc_client and raft::storage are
used to allow implementing custom networking and storage layers.
A shared failure detector interface defines keep-alive semantics,
required for efficient implementation of thousands of raft groups.
Recently, the cql_server_config::max_concurrent_requests field was
changed to be an updateable_value, so that it is updated when the
corresponding option in Scylla's configuration is live-reloaded.
Unfortunately, due to how cql_server is constructed, this caused
cql_server instances on all shards to store an updateable_value which
pointed to an updateable_value_source on shard 0. Unsynchronized
cross-shard memory operations ensue.
The fix changes the cql_server_config so that it holds a function which
creates an updateable_value appropriate for the given shard. This
pattern is similar to another, already existing option in the config:
get_service_memory_limiter_semaphore.
This fix can be reverted if updateable_value becomes safe to use across
shards.
Tests: unit(dev)
Fixes: #7310
The 'redis_database_count' was already existing, but
was not used when initializing the keyspaces. This
patch merely uses it. I think it's better that way, it
seems cleaner not to create 15 x 5 tables when we
use only one redis database.
Also change a test to test with a higher max number
of database.
Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200930210256.4439-1-etienne.adam@gmail.com>
In order to improve observability, add a username field to the the
system_traces.sessions table. The system table should be change
while upgrading by running the fix_system_distributed_tables.py
script. Until the table is updated, the old behaviour is preserved.
Fixes#6737.
This miniseries enhances the code from #7279 by:
* adding metrics for shed requests, which will allow to pinpoint the problem if the max concurrent requests threshold is too low
* making the error message more comprehensive by pointing at the variable used to set max concurrent requests threshold
Example of an ehanced error message:
```
ConnectionException('Failed to initialize new connection to 127.0.0.1: Error from server: code=1001 [Coordinator node overloaded] message="too many in-flight requests (configured via max_concurrent_requests_per_shard): 18"',)})
```
Closes#7299
* github.com:scylladb/scylla:
transport: make _requests_serving param uint32_t
transport: make overloaded error message more descriptive
transport: add requests_shed metrics
Call sort_tokens at the caller as all call sites from
within token_metadata_impl call remove_endpoint for multiple
endpoints so the tokens can be re-sorted only once, when done
removing all tokens.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It's not realistic for a shard to have over 4 billion concurrent
requests, so this value can be safely represented in 32 bits.
Also, since the current concurrency limit is represented in uint32_t,
it makes sense for these two to have matching types.
"
The last major untracked area of the reader pipeline is the reader
buffers. These scale with the number of readers as well as with the size
and shape of data, so their memory consumption is unpredictable varies
wildly. For example many small rows will trigger larger buffers
allocated within the `circular_buffer<mutation_fragment>`, while few
larger rows will consume a lot of external memory.
This series covers this area by tracking the memory consumption of both
the buffer and its content. This is achieved by passing a tracking
allocator to `circular_buffer<mutation_fragment>` so that each
allocation it makes is tracked. Additionally, we now track the memory
consumption of each and every mutation fragment through its whole
lifetime. Initially I contemplated just tracking the `_buffer_size` of
`flat_mutation_reader::impl`, but concluded that as our reader trees are
typically quite deep, this would result in a lot of unnecessary
`signal()`/`consume()` calls, that scales with the number of mutation
fragments and hence adds to the already considerable per mutation
fragment overhead. The solution chosen in this series is to instead
track the memory consumption of the individual mutation fragments, with
the observation that these are typically always moved and very rarely
copied, so the number of `signal()`/`consume()` calls will be minimal.
This additional tracking introduces an interesting dilemma however:
readers will now have significant memory on their account even before
being admitted. So it may happen that they can prevent their own
admission via this memory consumption. To prevent this, memory
consumption is only forwarded to the semaphore upon admission. This
might be solved when the semaphore is moved to the front -- before the
cache.
Another consequence of this additional, more complete tracking is that
evictable readers now consume memory even when the underlying reader is
evicted. So it may happen that even though no reader is currently
admitted, all memory is consumed from the semaphore. To prevent any such
deadlocks, the semaphore now admits a reader unconditionally if no
reader is admitted -- that is if all count resources all available.
Refs: #4176
Tests: unit(dev, debug, release)
"
* 'track-reader-buffers/v2' of https://github.com/denesb/scylla: (37 commits)
test/manual/sstable_scan_footprint_test: run test body in statement sched group
test/manual/sstable_scan_footprint_test: move test main code into separate function
test/manual/sstable_scan_footprint_test: sprinkle some thread::maybe_yield():s
test/manual/sstable_scan_footprint_test: make clustering row size configurable
test/manual/sstable_scan_footprint_test: document sstable related command line arguments
mutation_fragment_test: add exception safety test for mutation_fragment::mutate_as_*()
test: simple_schema: add make_static_row()
reader_permit: reader_resources: add operator==
mutation_fragment: memory_usage(): remove unused schema parameter
mutation_fragment: track memory usage through the reader_permit
reader_permit: resource_units: add permit() and resources() accessors
mutation_fragment: add schema and permit
partition_snapshot_row_cursor: row(): return clustering_row instead of mutation_fragment
mutation_fragment: remove as_mutable_end_of_partition()
mutation_fragment: s/as_mutable_partition_start/mutate_as_partition_start/
mutation_fragment: s/as_mutable_range_tombstone/mutate_as_range_tombstone/
mutation_fragment: s/as_mutable_clustering_row/mutate_as_clustering_row/
mutation_fragment: s/as_mutable_static_row/mutation_as_static_row/
flat_mutation_reader: make _buffer a tracked buffer
mutation_reader: extract the two fill_buffer_result into a single one
...
Today when ever we are building scylla in a singel mode we still
building jmx, tools and python3 for all dev,release and debug.
Let's make sure we build only in relevant build mode
Also adding unified-tar to ninja build
Closes#7260
`trace_keyspace_helper::make_slow_query_mutation_data` expected a
"query" key in its parameters, which does not appear in case of
e.g. batches of prepared statements. This is example of failing
`record.parameters`:
```
...{"query[0]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"},
{"query[1]" : "INSERT INTO ks.tbl (pk, i) values (?, ?);"}...
```
In such case Scylla recorded no trace and said:
```
ERROR 2020-09-28 10:09:36,696 [shard 3] trace_keyspace_helper - No
"query" parameter set for a session requesting a slow_query_log record
```
Fix here is to leave query empty if not found. The users can still
retrieve the query contents from existing info.
Fixes#5843Closes#7293
The blacklog of current and max in VIEW_BACKLOG is not update but the
nodes are updating VIEW_BACKLOG all the time. For example:
```
INFO 2020-03-06 17:13:46,761 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486026590,718)
INFO 2020-03-06 17:13:46,821 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486026531,742)
INFO 2020-03-06 17:13:47,765 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486027590,721)
INFO 2020-03-06 17:13:47,825 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486027531,745)
INFO 2020-03-06 17:13:48,772 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486028590,726)
INFO 2020-03-06 17:13:48,833 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486028531,750)
INFO 2020-03-06 17:13:49,772 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.3, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486029590,729)
INFO 2020-03-06 17:13:49,832 [shard 0] storage_service - Update system.peers table: endpoint=127.0.0.2, app_state=VIEW_BACKLOG, versioned_value=Value(0:18446744073709551615:1583486029531,753)
```
The downside of such updates:
- Introduces more gossip exchange traffic
- Updates system.peers all the time
The extra unnecessary gossip traffic is fine to a cluster in a good
shape but when some of the nodes or shards are loaded, such messages and
the handling of such messages can make the system even busy.
With this patch, VIEW_BACKLOG is updated only when the backlog is really
updated.
Btw, we can even make the update only when the change of the backlog is
great than a threshold, e.g., 5%, which can reduce the traffic even
further.
Fixes#5970
We used to calculate the number of endpoints for quorum and local_quorum
unconditionally as ((rf / 2) + 1). This formula doesn't take into
account the corner case where RF = 0, in this situation quorum should
also be 0.
This commit adds the missing corner case.
Tests: Unit Tests (dev)
Fixes#6905Closes#7296