start node1, node2 and node3
shutdown node2
shutdown node1 and node3
start node1 and node3
nodetool removenode node2
clean up all scylla data on node2
bootstrap node2 as a new node
I saw that node2 could not bootstrap; it was stuck forever waiting for schema information to complete:
On node1, node3
[shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704
On node2
[shard 0] storage_service - JOINING: waiting for schema information to complete
This is because the nodetool removenode operation increased node2's generation from 0 to 2:
gossiper::advertise_removing() calls eps.get_heart_beat_state().force_newer_generation_unsafe();
gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe();
Each force_newer_generation_unsafe() call increases the generation by 1.
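A minimal standalone model of the mechanism (simplified; the real heart_beat_state lives in Scylla's gms code) shows the arithmetic:
```
#include <cstdint>

// Simplified model only -- just enough to show why the removed node's
// locally cached generation ends up at 2.
struct heart_beat_state {
    int32_t generation = 0;   // 0: we never saw this node boot
    void force_newer_generation_unsafe() { ++generation; }
};

int main() {
    heart_beat_state node2;                  // node2 as cached on node1/node3
    node2.force_newer_generation_unsafe();   // gossiper::advertise_removing()
    node2.force_newer_generation_unsafe();   // gossiper::advertise_token_removed()
    // node2.generation == 2, while a rebooted node2 advertises an
    // epoch-based generation such as 1521779704.
}
```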
Here is an example.
Before nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
"addrs": "127.0.0.2",
"generation": 0,
"is_alive": false,
"update_time": 1521778757334,
"version": 0
},
```
After nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
"addrs": "127.0.0.2",
"application_state": [
{
"application_state": 0,
"value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246",
"version": 214
},
{
"application_state": 6,
"value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a",
"version": 153
}
],
"generation": 2,
"is_alive": false,
"update_time": 1521779276246,
"version": 0
},
```
In gossiper::apply_state_locally, we have this check:
```
if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
// assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}",ep, local_generation, remote_generation);
}
```
to skip the gossip update.
To fix, we relax the max generation difference check to allow the
generation of a removed node.
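The intent can be sketched as a standalone predicate (illustrative only, not necessarily the exact patch; the MAX_GENERATION_DIFFERENCE value here is an assumption):
```
#include <chrono>
#include <cstdint>

// Assumed to be one year of seconds, as in Cassandra; illustrative only.
constexpr int64_t MAX_GENERATION_DIFFERENCE = 86400 * 365;

// A real generation is a boot-time epoch timestamp; one forced up from
// zero by removenode (e.g. 2) is far below any plausible timestamp, so
// in that case we accept the remote generation instead of skipping it.
bool should_skip_update(int64_t local_generation, int64_t remote_generation) {
    auto now_gen = std::chrono::duration_cast<std::chrono::seconds>(
            std::chrono::system_clock::now().time_since_epoch()).count();
    bool local_gen_forced = local_generation < now_gen - MAX_GENERATION_DIFFERENCE;
    return local_generation != 0 && !local_gen_forced &&
           remote_generation > local_generation + MAX_GENERATION_DIFFERENCE;
}
```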
After this patch, the removed node bootstraps successfully.
Tests: dtest:update_cluster_layout_tests.py
Fixes #3331
Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>
Just saw this today during a crash when creating Materialized Views.
It is still unclear why this happened. But the message says:
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: scylla: sstables/sstables.cc:2973: sstables::sstable::remove_sstable_with_temp_toc(seastar::sstring, seastar::sstring, seastar::sstring, int64_t, sstables::sstable::version_types, sstables::sstable::format_types)::<lambda()>: Assertion `tmptoc == true' failed.
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: Aborting on shard 0.
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: Backtrace:
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4b4c
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4df5
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000005b4ea3
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libpthread.so.0+0x000000000000f0ff
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x00000000000355f6
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x0000000000036ce7
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x000000000002e565
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: /lib64/libc.so.6+0x000000000002e611
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x00000000015969d0
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x0000000001596f7a
Mar 28 15:55:58 ip-172-31-24-9 scylla[14055]: 0x000000000051ca8d
I can't even guess which table caused the problem, let alone which
SSTable, because those asserts are the very first thing we do. We can
discuss whether or not asserting is the right behaviour (usually we
can't guarantee the state is sane if that file is missing, so I don't
see a problem with it), but it would be nice to see which SSTable we
are processing before we assert.
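For example, something along these lines (a sketch; sstlog and the parameter names follow the assertion's signature above, but the exact message is hypothetical):
```
// Name the offending SSTable before aborting, using the arguments
// remove_sstable_with_temp_toc() already receives.
sstlog.error("remove_sstable_with_temp_toc: {}/{} generation {}: temp TOC not found",
             ks, cf, generation);
assert(tmptoc == true);
```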
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180328160856.10717-1-glauber@scylladb.com>
"
The configuration API is part of the Scylla API v2.
It uses the new definition capabilities of the API to dynamically create
the swagger definition for the configuration.
This means that the swagger file will contain an entry with a description
and type for each of the config values.
To get the v2 swagger file:
http://localhost:10000/v2
If using it with the swagger UI, change http://localhost:10000/api-doc to http://localhost:10000/v2
It takes longer to load because the file is much bigger now.
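Conceptually, the generator just walks the config metadata and emits one swagger definition per value; a hypothetical sketch (names are illustrative, not the actual generator code):
```
#include <map>
#include <string>

// One swagger definition entry (description + type) per config value.
struct config_meta { std::string description, type; };

std::string make_swagger_definitions(const std::map<std::string, config_meta>& cfg) {
    std::string out = "{\n";
    for (const auto& [name, meta] : cfg) {
        out += "  \"" + name + "\": { \"description\": \"" + meta.description
             + "\", \"type\": \"" + meta.type + "\" },\n";
    }
    out += "}";
    return out;
}
```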
"
* 'amnon/config_api_v5' of github.com:scylladb/seastar-dev:
Explanation about the API V2
API: add the config API as part of the v2 API.
Defining the config api
The config API is created dynamically from the config. This means that
the swagger definition file will contain the descriptions and types based on the
configuration.
The config.json file is used by the code generator to define a path that is
used to register the handler function.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"
This series introduces the view_builder class, a sharded service
responsible for building all defined materialized views. This process
entails walking over the existing data in a given base table, and using
it to calculate and insert the respective entries for one or more views.
The view_builder uses the migration_manager to subscribe to schema
change events, and update its bookkeeping accordingly. We prefer this
to having the database call into the view_builder, as that would
create a cyclic dependency.
We serialize changes to the views of a particular base table, such
that schema changes do not interfere with the view building process.
We employ a flat_mutation_reader for each base table for which we're
building views.
We consume from the reader associated with each base table until all
its views are built. If the reader reaches the end and there are
incomplete views, then a view was added while others were being built.
In such cases, when the view is added, we restart the reader at the
beginning of the current token, but not at the beginning of the token
range. Then, when we exhaust the reader, we simply create a new one
for the whole token range, and resume building the pending views.
We aim to be resource-conscious. On a given shard, at any given moment,
we consume at most from one reader. We also strive for fairness, in that
each build step inserts entries for the views of a different base. Each
build step reads and generates updates for batch_size rows. We lack a
controller, which could potentially allow us to go faster (to execute
multiple steps at the same time, or consume more rows per batch), and
also which would apply backpressure, so we could, for example, delay
executing a build step.
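The build loop therefore has roughly the following shape (a sketch with hypothetical names, not the actual view_builder code):
```
// One reader consumed at a time, fair rotation across base tables,
// batch_size rows per build step, progress persisted after each step.
future<> view_builder::do_build() {
    return repeat([this] {
        auto& base = next_base_in_round_robin();       // fairness across bases
        return consume_batch(base.reader, batch_size)  // at most one reader
                .then([this, &base] (bool exhausted) {
            return record_progress(base).then([this, &base, exhausted] {
                if (exhausted) {
                    finish_built_views(base);  // or restart the reader for
                }                              // views added mid-build
                return stop_iteration::no;
            });
        });
    });
}
```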
Interaction with the system tables:
- When we start building a view, we add an entry to the
scylla_views_builds_in_progress system table. If the node restarts
at this point, we'll consider these newly inserted views as having
made no progress, and we'll treat them as new views;
- When we finish a build step, we update the progress of the views
that we built during this step by writing the next token to the
scylla_views_builds_in_progress table. If the node restarts here,
we'll start building the views at the token in the next_token
column.
- When we finish building a view, we mark it as completed in the
built views system table, and remove it from the in-progress system
table. Under failure, the following can happen:
* When we fail to mark the view as built, we'll redo the last
step upon node reboot;
* When we fail to delete the in-progress record, upon reboot
we'll remove this record.
A view is marked as completed only when all shards have finished
their share of the work, that is, if a view is not built, then all
shards will still have an entry in the in-progress system table;
- A view that a shard finished building, but not all other shards,
remains in the in-progress system table, with first_token ==
next_token.
Interaction with the distributed system table (view_build_status):
- When we start building a view, we mark the view build as being
in-progress;
- When we finish building a view, we mark the view as being built.
Upon failure, we ensure that if the view is in the in-progress
system table, then it may not have been written to view_build_status. We
don't load the built views from view_build_status when starting. When
starting, the following happens:
* If the view is in the system.built_views table and not the
in-progress system table, then it will be in view_build_status;
* If the view is in the system.built_views table and not in
view_build_status, it will still be in the in-progress system table -
we detect this and mark it as built in view_build_status too,
keeping the invariant;
* If the view is in view_build_status but not in system.built_views,
then it will also be in the in-progress system table - we
don't detect this and will redo the missing step, for
simplicity.
View building is necessarily a sharded process. That means that on
restart, if the number of shards has changed, we need to calculate
the most conservative token range that has been built, and build
the remainder.
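For example (a sketch with hypothetical names): take the smallest next_token recorded by any of the previous shards as the common resume point, and rebuild everything past it:
```
// The most conservative resume point across the old shard layout: every
// old shard is known to have built its share up to at least this token.
auto resume_token = progress_rows.front().next_token;
for (const auto& row : progress_rows) {
    resume_token = std::min(resume_token, row.next_token);
}
// Each new shard builds its share of the ring past resume_token.
```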
When building view updates, we consider that everything is new and
nothing pre-existing is there (which means no tombstones will be sent
out to the paired view replicas).
Tests:
unit (debug)
dtest (materialized_view_test.py(smp=1, smp=2))
"
* 'view-building/v4' of https://github.com/duarten/scylla: (22 commits)
tests/view_build_test: Add tests for view building
tests/cql_test_env: Move eventually() to this file
tests/cql_assertions: Assert result set is not empty
tests/cql_test_env: Start the view_builder
db/view/view_builder: Allow synchronizing with the end of a build
db/view/view_builder: Actually build views
flat_mutation_reader: Make reader from mutation fragments
db/view/view_builder: React to schema changes
service/migration_listener: Add class for view notifications
db/view: Introduce view_builder
column_family: Add function to populate views
column_family: Allow synchronizing with in-progress writes
database: Compare view id instead of name in find_views()
database: Add get_views() function
db/view: Return a future when sending view updates
service/storage_service: Allow querying the view build status
db: Introduce system_distributed_keyspace
tests: Add unit test for build_progress_virtual_reader
db/system_keyspace: Add API for MV-related system tables
db/system_keyspace: Add virtual reader for MV in-progress build status
...
This is a separate file from view_schema_test because that one is
already becoming too long to run; also, having multiple test files
means they can be executed in parallel.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Intended for use by unit tests, this patch allows synchronizing with
the end of a build for a particular view.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the missing view building code to the eponymous class.
We consume from the reader associated with each base table until all
its views are built. If the reader reaches the end and there are
incomplete views, then a view was added while others were being built.
In such cases, when the view is added, we restart the reader at the
beginning of the current token, but not at the beginning of the token
range. Then, when we exhaust the reader, we simply create a new one
for the whole token range, and resume building the pending views.
We aim to be resource-conscious. On a given shard, at any given moment,
we consume at most from one reader. We also strive for fairness, in that
each build step inserts entries for the views of a different base. Each
build step reads and generates updates for batch_size rows. We lack a
controller, which could potentially allow us to go faster (to execute
multiple steps at the same time, or consume more rows per batch), and
also which would apply backpressure, so we could, for example, delay
executing a build step.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Builds a reader from a set of ordered mutation fragments. This is
useful for building a reader out of a subset of fragments returned by a
different reader. It is equivalent to building a mutation out of the
set of mutation fragments, and calling
make_flat_mutation_reader_from_mutations, except that it does not yet
support fast-forwarding.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The view_builder now uses the migration_manager to subscribe to schema
change events, and update its bookkeeping accordingly. We prefer this
to having the database call into the view_builder, as that would
create a cyclic dependency.
We serialize changes to the views of a particular base table, such
that schema changes do not interfere with the upcoming view building
code.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Add a convenience base class for view notifications, which provides
a default implementation for all other types of notifications.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the view_builder class, a sharded service
responsible for building all defined materialized views. This process
entails walking over the existing data in a given base table, and using
it to calculate and insert the respective entries for one or more views.
This patch introduces only the bootstrap functionality, which is
responsible for loading the data stored in the system tables and
filling the in-memory data structures with the relevant information,
to be used in subsequent patches for the actual view building. The
interaction with the system tables is as follows.
Interaction with the tables in system_keyspace:
- When we start building a view, we add an entry to the
scylla_views_builds_in_progress system table. If the node restarts
at this point, we'll consider these newly inserted views as having
made no progress, and we'll treat them as new views;
- When we finish a build step, we update the progress of the views
that we built during this step by writing the next token to the
scylla_views_builds_in_progress table. If the node restarts here,
we'll start building the views at the token in the next_token
column.
- When we finish building a view, we mark it as completed in the
built views system table, and remove it from the in-progress system
table. Under failure, the following can happen:
* When we fail to mark the view as built, we'll redo the last
step upon node reboot;
* When we fail to delete the in-progress record, upon reboot
we'll remove this record.
A view is marked as completed only when all shards have finished
their share of the work, that is, if a view is not built, then all
shards will still have an entry in the in-progress system table;
- A view that a shard finished building, but not all other shards,
remains in the in-progress system table, with first_token ==
next_token.
Interaction with the distributed system table (view_build_status):
- When we start building a view, we mark the view build as being
in-progress;
- When we finish building a view, we mark the view as being built.
Upon failure, we ensure that if the view is in the in-progress
system table, then it may not have been written to view_build_status. We
don't load the built views from view_build_status when starting. When
starting, the following happens:
* If the view is in the system.built_views table and not the
in-progress system table, then it will be in view_build_status;
* If the view is in the system.built_views table and not in
view_build_status, it will still be in the in-progress system table -
we detect this and mark it as built in view_build_status too,
keeping the invariant;
* If the view is in view_build_status but not in system.built_views,
then it will also be in the in-progress system table - we
don't detect this and will redo the missing step, for
simplicity.
View building is necessarily a sharded process. That means that on
restart, if the number of shards has changed, we need to calculate
the most conservative token range that has been built, and build
the remainder.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The populate_views() function takes a set of views to update, a
token to select base table partitions, and the set of sstables to
query. This lays the foundation for a view building mechanism,
which walks over a given base table, reads data token-by-token,
calculates view updates (in a simplified way, compared to the existing
functions that push view updates), and sends them to the paired view
replicas.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a mechanism to class column_family through which we
can synchronize with in-progress writes. This is useful for code that,
after some modification, needs to ensure that new writes will see it
before it can proceed.
In particular, this will be used by the view building code, which needs
to wait until in-progress writes that may have missed the existence of
a new view become observable to it.
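The mechanism is essentially a phased barrier; a sketch assuming an interface like Scylla's utils::phased_barrier (names here are illustrative, not the actual column_family API):
```
// Writers hold an RAII "operation" while they run; await_pending_writes()
// resolves once every write that started before the call has finished.
class pending_writes {
    utils::phased_barrier _barrier;
public:
    auto write_in_progress() { return _barrier.start(); }
    future<> await_pending_writes() { return _barrier.advance_and_await(); }
};
```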
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
While we now send view mutations asynchronously in the normal view
write path, other processes interested in sending view updates, such
as streaming or view building, may wish to do it synchronously.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds support for the nodetool viewbuildstatus command,
which shows the progress of a materialized view build across the
cluster.
A view can be absent from the result, successfully built, or
currently being built.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces a distributed system keyspace, used to hold
system tables that need to be replicated across a set of replicas
(that is, can't use the LocalStrategy).
In following patches, we will use this keyspace to hold a table
containing view building status updates for each node, used to support
range movements and a new nodetool command.
Fixes #3237
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements an API to access the MV-related system tables,
which pertain to the view building process.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Provide a virtual reader so users can query the in-progress view table
in a way compatible with Apache Cassandra.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When building a materialized view, we divide our work by shard, so we
need to register which shard did what work in the in-progress system
table. We also add the token we started at, which will enable some
optimizations in the view building code.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
Additional extension points.
* Allows wrapping commitlog file io (including hinted handoff).
* Allows system schema modification on boot, allowing extensions
to inject extensions into hardcoded schemas.
Note: to make commitlog file extensions work, we need to ensure
we can be notified on segment delete, and thus need to fix the
old issue of the hard ::unlink call in the segment destructor.
Segment deletes are therefore moved to a batch routine, run at
intervals/flush. Replay segments and hints are also deleted via
the commitlog object, ensuring an extension is notified (metadata).
Configurable listeners are now allowed to inject configuration
objects into the main config. I.e. a local object can, either
by becoming a "configurable" or manually, add references to
self-describing values that will be parsed from the scylla.yaml
file, effectively extending it.
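The flow looks roughly like this (a sketch; the names and signatures are illustrative, not the actual hooks):
```
// An extension declares a self-describing value and registers it with the
// main config via the new "add" overload, so it is parsed from scylla.yaml
// like any built-in option.
class my_extension : public configurable {
    named_value<std::string> _key{"my_extension_key", "",
                                  "Key material used by my extension"};
public:
    void append_options(db::config& cfg) override {
        cfg.add(_key);  // scylla.yaml may now contain: my_extension_key: ...
    }
};
```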
All these wonderful abstractions courtesy of encryption of course.
But super generalized!
"
* 'calle/commitlog_ext' of github.com:scylladb/seastar-dev:
db::extensions: Allow extensions to modify (system) schemas
db::commitlog: Add commitlog/hints file io extension
db::commitlog: Do segment delete async + force replay delete go via CL
main/init: Change configurable callbacks and calls to allow adding opts
util::config_file: Add "add" config item overload
Allows extensions/config listeners to potentially augment
(system) schemas at boot time. This is only useful for schemas
that do not pass through the system_schema tables.
Refs #2858
Push segment files to be deleted onto a pending list, and process it at
intervals, on flush requests, or at shutdown. Note that we do _not_
indiscriminately do deletes in non-anchored tasks, because we need
to guarantee that finished segments are fully deleted and gone on CL
shutdown, not to be mistaken for replayables.
Also make sure we delete segments replayed via the commitlog call,
so IFF we add metadata processing for CL, we can clear it out.
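In rough outline (hypothetical member names, not the actual commitlog code):
```
// Segments queue their files instead of unlinking in the destructor; the
// queue is drained at intervals, on flush requests, and on shutdown.
void segment_manager::discard(sstring filename) {
    _files_to_delete.push_back(std::move(filename));  // no inline ::unlink
}

future<> segment_manager::delete_pending_segments() {
    auto pending = std::exchange(_files_to_delete, {});
    return parallel_for_each(std::move(pending), [] (sstring& name) {
        // let any registered extension see the delete, then remove the file
        return notify_segment_delete(name).then([name] {
            return remove_file(name);
        });
    });
}
```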
Since we just keep retrying, this can cause Scylla to not shut down
for a while.
The data will be safe in the commit log.
Note that this patch doesn't fix the issue when shutdown goes through
storage_service::drain_on_shutdown - more work is required to handle
that case.
Ref #3318.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-3-duarte@scylladb.com>
In column_family::try_flush_memtable_to_sstable, the handle_exception()
block is on the inside of the continuations to
write_memtable_to_sstable(). If that call fails, the sstable is left in
the compaction_backlog_tracker::_ongoing_writes map, which wastes disk
space; worse, that sstable then maps to a dangling pointer to a
destroyed database_sstable_write_monitor, which causes a segfault when
accessed (for example, through the backlog_controller, which accounts
for the _ongoing_writes when calculating the backlog).
Fix this by increasing the scope of handle_exception().
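Schematically (a hypothetical sketch of the bug shape, not the actual code):
```
// Before: the cleanup handler only covers the inner continuation, so a
// failure of write_memtable_to_sstable() itself leaves the monitor
// registered in _ongoing_writes.
return write_memtable_to_sstable(*sst, mt).then([sst, monitor] {
    return seal(sst).handle_exception([monitor] (std::exception_ptr ep) {
        monitor->write_failed();             // never reached for write errors
        return make_exception_future<>(ep);
    });
});

// After: widen the scope so any failure in the chain performs the cleanup.
return write_memtable_to_sstable(*sst, mt).then([sst] {
    return seal(sst);
}).handle_exception([monitor] (std::exception_ptr ep) {
    monitor->write_failed();
    return make_exception_future<>(ep);
});
```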
Fixes #3315
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-2-duarte@scylladb.com>
* Don't dump output of failed tests immediately, print the output
for failed tests in the end instead.
* Fix exception printing in run_test(): don't assume the passed-in error
object is a `bytes` (or bytes-like) object; call the object's str
operator and let callers encode bytes objects instead.
* Don't assume the Exception object has an `out` member; use operator str
instead to convert it to a string.
* Don't print progress in run_test() directly because it results in
incomprehensible output as the executors race to print to stdout. Leave
progress report to the caller who can serialize progress prints.
* Automatically detect non-tty stdout and don't try to edit already
printed text.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7bb7e0003ded9b28710250bff851ea849bb99f7d.1522062795.git.bdenes@scylladb.com>
"
This series does not add or change any features of access-control and
roles, but addresses some bugs and finalizes the switch to roles.
"auth: Wait for schema agreement" and the patch prior help avoid false
negatives for integration tests and error messages in logs.
"auth: Remove ordering dependence" fixes an important bug in `auth` that
could leave the default superuser in a corrupted state when it is first
created.
Since roles are feature-complete (to the best of the author's knowledge
as of this writing), the final patch in the series removes any warnings
about them being unimplemented.
Tests: unit (release), dtest (PENDING)
"
* 'jhk/auth_fixes/v1' of https://github.com/hakuch/scylla:
Roles are implemented
auth: Increase delay before background tasks start
auth: Remove ordering dependence
auth: Don't warn on rescheduled task
auth: Wait for schema agreement
Single-node clusters can agree on schema
I've observed failures due to "missing" the peer nodes by about 1
second. Adding 5 seconds to the existing delay should cover most false
negative test results.
Fixes #3320.
If `auth::password_authenticator` also creates `system_auth.roles` and
we fix the existence check for the default superuser in
`auth::standard_role_manager` to only search for the columns that it
owns (instead of the column itself), then the two modules'
initializations are independent of one another.
Fixes #3319.
Apache Cassandra also prints at the `info` level. This change prevents
tasks which we expect to be rescheduled from failing tests and scaring
users.
A good example of the importance of this change is when queries with a
quorum consistency level (for the default superuser) fail because a
quorum is not available. We will try again in this case, and this should
not cause integration tests to fail.
Some modules of `auth` create a default superuser if it does not already
exist.
The existence check is through a SELECT query with quorum consistency
level. If the schema for the applicable tables has not yet propagated to
a peer node at the time that it processes this query, then the
`storage_proxy` will print an error message to the log and the query
will be retried.
Eventually, the schema will propagate and the default superuser will be
created. However, the error message in the log causes integration tests
to fail (and is somewhat annoying).
Now, prior to querying for existing data, we wait for all gossip peers
to have the same schema version as we do.
Fixes #2852.
At some points while bootstrapping [1], new non-seed Scylla nodes wait
for schema agreement among all known endpoints in the cluster.
The check for schema agreement was in
`service::migration_manager::is_ready_for_bootstrap`. This function
would return `true` if, at the time of its invocation, the node was
aware of at least one `UP` peer (not itself) and that all `UP` peers had
the same schema version as the node.
We wish to re-use this check in the `auth` sub-system to ensure that
the schema for internal system tables used for access-control have
propagated to the entire cluster.
Unlike in `service/storage_service.cc`, where `is_ready_for_bootstrap`
was only invoked for seed nodes, we wish to wait for schema agreement
for all nodes regardless of whether or not they are seeds.
For a single-node cluster with itself as a seed,
`is_ready_for_bootstrap` would always return `false`.
We therefore change the conditions for schema agreement. Schema
agreement is now reached when there are no known peers (so the endpoint
map of the gossiper consists only of ourselves), or when there is at
least one `UP` peer and all `UP` peers have the same schema version as
us.
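In sketch form, the new predicate looks like this (hypothetical helper names):
```
// Agreement holds for a lone node, or when at least one peer is UP and
// every UP peer reports our schema version.
bool schema_agreement_reached() {
    auto peers = known_endpoints_except_self();
    if (peers.empty()) {
        return true;                 // single-node cluster
    }
    bool saw_up_peer = false;
    for (const auto& ep : peers) {
        if (!is_alive(ep)) {
            continue;                // DOWN peers don't block agreement
        }
        saw_up_peer = true;
        if (schema_version_of(ep) != local_schema_version()) {
            return false;
        }
    }
    return saw_up_peer;
}
```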
This change should not impact any bootstrap behavior in
`storage_service` because seed nodes do not invoke the function and
non-seed nodes wait for peer visibility before checking for schema
agreement.
Since this function is no longer checking for schema agreement only in
the context of bootstrapping non-seed nodes, we rename it to reflect its
generality.
[1] http://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html
I see the following error:
seastar/core/future-util.hh:597:10: note: constraints not satisfied
seastar/core/future-util.hh:597:10: note: with ‘sstables::sstable_version_types* c’
seastar/core/future-util.hh:597:10: note: with ‘sub_partitions_read::run_test_case()::<lambda(sstables::sstable::version_types)> aa’
seastar/core/future-util.hh:597:10: note: the required expression ‘seastar::futurize_apply(aa, (* c.begin()))’ would be ill-formed
seastar/core/future-util.hh:597:10: note: ‘seastar::futurize_apply(aa, (* c.begin()))’ is not implicitly convertible to ‘seastar::future<>’
The C array all_sstable_versions decayed to a pointer (see second gcc note)
and of course doesn't support std::begin().
Fix by replacing the C array with an std::array<>, which supports std::begin().
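A minimal standalone illustration of the decay (not Scylla code; the helper is hypothetical):
```
#include <array>
#include <iterator>

template <typename C>
auto first_of(C c) {           // takes its argument by value, like the
    return *std::begin(c);     // failing future-util helper
}

int main() {
    int carr[] = {1, 2, 3};
    // first_of(carr);         // ill-formed: C deduces to int*, and
                               // std::begin() does not accept a pointer
    std::array<int, 3> arr = {1, 2, 3};
    return first_of(arr);      // fine: std::array keeps its extent
}
```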
Not clear what made this break again, or why it worked before.
Message-Id: <20180325095239.12407-1-avi@scylladb.com>