Always enable lightweight transactions. Remove the check for the command
line switch from the feature service, assuming LWT is always enabled.
Remove the check for LWT from Alternator.
Note that in order for the cluster to work with LWT, all nodes need
to support it.
Rename LWT to UNUSED in db/config.hh to keep accepting the lwt keyword in the
--experimental-features command line option, but do nothing with it.
Changes in v2:
* remove enable_lwt feature flag, it's always there
Closes #6102
test: unit (dev, debug)
Message-Id: <20200401071149.41921-1-kostja@scylladb.com>
(cherry picked from commit 9948f548a5)
The learning stage of the PAXOS protocol leaves behind an entry in the
system.paxos table with the last learned value (which can be large). If
not all participants learned it successfully, the next round on the same
key may complete the learning using this info. But if all nodes learned
the value, the entry no longer serves any useful purpose.
The patch adds another round, "prune", which is executed in the background
(limited to 1000 simultaneous instances) and removes the entry if all
nodes replied successfully to the "learn" round. It uses the ballot's
timestamp for the deletion, so as not to interfere with the next round.
Since the deletion happens very close to the previous writes, it will
likely happen in the memtable and never reach an sstable, which reduces
memtable flush and compaction overhead.
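A minimal standalone sketch of the pruning policy described above (the names and the synchronous flow are illustrative; the real code runs the prune asynchronously in Scylla's Paxos layer):

    #include <atomic>
    #include <cstdint>
    #include <cstdio>

    // Cap on simultaneous background prunes, as described above.
    constexpr int max_concurrent_prunes = 1000;
    std::atomic<int> prunes_in_flight{0};

    // Stand-in for "delete this key's entry from system.paxos with the given
    // write timestamp"; using the ballot's timestamp keeps the tombstone from
    // shadowing a newer round on the same key.
    void delete_paxos_entry(std::uint64_t key, std::int64_t ballot_timestamp) {
        std::printf("prune key %llu at ts %lld\n",
                    (unsigned long long)key, (long long)ballot_timestamp);
    }

    void maybe_prune(std::uint64_t key, std::int64_t ballot_timestamp, bool all_learned) {
        if (!all_learned) {
            return;   // some replica missed the value; keep the entry around
        }
        if (prunes_in_flight.load() >= max_concurrent_prunes) {
            return;   // enough prunes already running; skip this one
        }
        ++prunes_in_flight;
        delete_paxos_entry(key, ballot_timestamp);
        --prunes_in_flight;
    }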
Fixes #5779
Message-Id: <20200330154853.GA31074@scylladb.com>
(cherry picked from commit 8a408ac5a8)
The value that is stored in the "in_progress_ballot" cell is the value of
the promised ballot, so name the cell accordingly to avoid confusion,
especially as we have a notion of an "in progress" proposal in the code
which is not the same as in_progress_ballot here.
We can still do this without caring about backwards compatibility since
LWT is still marked as experimental.
Fixes #6087.
Message-Id: <20200326095758.GA10219@scylladb.com>
(cherry picked from commit b3db6f5b04)
Fixes #5899
When terminating (closing) a segment, we write a trailing block of zeros
so the reader has an empty region after the last used chunk as an end
marker. This is needed because we use recycled, pre-allocated segments
with potentially non-zero data extending past the point where we are
ending the segment (i.e. we are not fully filling the segment, due to a
huge mutation or similar).
However, if we reach the end of the segment while writing the final block
(typically many small mutations), the file ends naturally after the data
written, and any trailing zero block would in fact just extend the file
further. While this will only happen once per segment recycled
(independent of how many times it is recycled), it still slightly breaks
the disk usage contract and can also cause some disk stalls due to
metadata changes (though very infrequently).
We should only write the trailing zero block if we are below the max_size
file size when terminating.
Adds a small size check to the commitlog test to verify size bounds
(which fails without the patch).
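The rule this patch adds, as a standalone sketch with hypothetical names:

    #include <cstdint>
    #include <cstdio>

    void write_zero_block(std::uint64_t offset) {
        std::printf("writing end marker at offset %llu\n", (unsigned long long)offset);
    }

    // Only write the zeroed end-marker block if the data has not already
    // reached max_size; otherwise the file ends naturally and an extra block
    // would only grow it past the contracted size.
    void terminate_segment(std::uint64_t file_pos, std::uint64_t max_size) {
        if (file_pos < max_size) {
            write_zero_block(file_pos);   // reader sees zeros after the last used chunk
        }
        // else: the end of the file itself acts as the end marker
    }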
v2:
- Fix test to take into account that files might be deleted
behind our backs.
v3:
- Fix test better, by doing verification _before_ segments are
queued for delete.
Message-Id: <20200226121601.15347-2-calle@scylladb.com>
Message-Id: <20200324100235.23982-1-calle@scylladb.com>
(cherry picked from commit 9fee712d62)
Consider 3 nodes in the cluster, n1, n2, n3, with gossip generation
numbers g1, g2, g3.
n1, n2, n3 are running a scylla version without commit
0a52ecb6df (gossip: Fix max generation
drift measure).
One year later, the user wants to upgrade n1, n2, n3 to a new version.
When n3 does a rolling restart with the new version, n3 will use a new
generation number g3'. Because g3' - g2 > MAX_GENERATION_DIFFERENCE and
g3' - g1 > MAX_GENERATION_DIFFERENCE, n1 and n2 will reject n3's
gossip update and mark n3 as down.
Such unnecessary marking of nodes as down can cause availability issues.
For example:
DC1: n1, n2
DC2: n3, n4
When n3 and n4 restart, n1 and n2 will mark n3 and n4 as down, which
makes the whole DC2 unavailable.
To fix, we can start the node with a gossip generation that stays within
the MAX_GENERATION_DIFFERENCE limit.
Once all the nodes run a version with commit
0a52ecb6df, the option is no longer
needed.
Fixes #5164
(cherry picked from commit 743b529c2b)
This reverts commit 0b34d88957. According
to Rafael Avila de Espindola:
"I have bisected the recent failures [in commitlog_test] on next to this
patch."
Before this patch, when db/view/view.hh was modified, 89 source files had to
be recompiled. After this patch, this number is down to 5.
Most of the irrelevant source files got view.hh by including database.hh,
which included view.hh just for the definition of statistics. So in this
patch we split the view statistics to a separate header file, view_stats.hh,
and database.hh only includes that. A few source files which included
only database.hh and also needed view.hh (for materialized-view related
functions) now need to include view.hh explicitly.
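Roughly what the split looks like (the member names here are illustrative, not the exact Scylla declarations):

    // view_stats.hh - a tiny header holding only what database.hh actually needs.
    #pragma once
    #include <cstdint>

    namespace db::view {
    struct stats {
        std::uint64_t view_updates_pushed_local = 0;
        std::uint64_t view_updates_pushed_remote = 0;
        std::uint64_t view_updates_failed_local = 0;
        std::uint64_t view_updates_failed_remote = 0;
    };
    }

    // view.hh keeps the heavy materialized-view machinery (and includes
    // view_stats.hh itself); database.hh now includes only view_stats.hh, so a
    // change to view.hh no longer rebuilds everything that includes database.hh.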
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200319121031.540-1-nyh@scylladb.com>
View updates sent as part of the view building process should never
be ignored, but fd49fd7 introduced a bug which may cause exactly that:
the updates are mistakenly sent to the background, so the view builder
will not receive negative feedback if an update fails, which in turn
will not trigger a retry. Consequently, view building may report
that it "finished" building a view while some of the updates were
lost. A simple fix is to restore the previous behaviour - all updates
triggered by view building are now waited for.
Fixes #6038
Tests: unit(dev),
dtest: interrupt_build_process_with_resharding_low_to_half_test
"
This PR makes it possible to use a different partitioner for each table. If no table-specific partitioner is set for a given table, the default partitioner is used.
The PR is composed of the following parts:
- Introduction of schema::get_partitioner, which still returns dht::global_partitioner
- Replacement of all usages of dht::global_partitioner with schema::get_partitioner
- Making it possible to set a table-specific partitioner in a schema_builder
- Removal of all the places that were setting the default partitioner, except for main.cc (mostly tests)
- Moving the default partitioner from i_partitioner to schema.cc and hiding it from the rest of the codebase
- Removal of dht::global_partitioner
After this PR there's no such thing as a global partitioner at all. There is only a default partitioner, but it still has to be accessed through schema::get_partitioner.
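An illustrative shape of the accessor (the surrounding types are assumptions made for the sketch, not the exact Scylla API):

    #include <memory>
    #include <utility>

    struct partition_key {};
    struct token {};

    struct i_partitioner {
        virtual ~i_partitioner() = default;
        virtual token get_token(const partition_key& pk) const = 0;
    };

    class schema {
        std::shared_ptr<const i_partitioner> _partitioner;   // default unless set per table
    public:
        explicit schema(std::shared_ptr<const i_partitioner> p) : _partitioner(std::move(p)) {}
        const i_partitioner& get_partitioner() const { return *_partitioner; }
    };

    token token_for(const schema& s, const partition_key& pk) {
        // Before: dht::global_partitioner().get_token(pk)
        return s.get_partitioner().get_token(pk);
    }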
There are some intermediate states in which i_partitioner is stored as a shared_ptr in the schema, but the final version keeps it by const&.
The PR does not enable per-table partitioners end-to-end; only the internals of a single node are covered. I still have to deal with:
- Making sure a table has the same partitioner on each node
- Allowing the user to set a table-specific partitioner on a table
- Signalling the driver about which partitioner is used by a given table
- Persisting the partitioner info for each table that does not use the default partitioner.
Fixes #5493
Tests: unit(dev, release, debug), dtest(byo)
"
* 'per_table_partitioner' of https://github.com/haaawk/scylla:
schema: drop optional from _partitioner field
make_multishard_combining_reader: stop taking partitioner
split_range_to_single_shard: stop taking partitioner as argument
tests: remove unused murmur3 includes
partitioner: move default_partitioner to schema.cc
partitioner: hide dht::default_partitioner
schema: include partitioner name in scylla tables mutation
schema: make it possible to set custom partitioner
scylla_tables: add partitioner column
schema_features: add PER_TABLE_PARTITIONERS feature
features: add PER_TABLE_PARTITIONERS feature
There are two results of this patch:
1. The new partitioner name column is persisted on the node's disk in scylla_tables
2. The new partitioner name column is included in the schema digest
This is achieved by including this new column in the scylla_tables mutation.
For that we:
1. Add the partitioner name to the result of make_scylla_tables_mutation.
If a table does not have a specific partitioner set and uses the default
partitioner, we don't include the name of that default partitioner.
The name is only added if a table has a custom partitioner.
2. In create_table_from_mutations we check whether the scylla_tables mutation
has a partitioner name set. If so, we use it as a parameter for the
schema_builder.
Note that previous patches have ensured that this new column is included
in the schema digest only after the whole cluster supports per-table
partitioners. Before that, during a rolling upgrade, the new partitioner
name column is hidden and not shared with other nodes.
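A sketch of that persistence rule (the types and helper names are hypothetical):

    #include <optional>
    #include <string>

    struct scylla_tables_row {
        std::optional<std::string> partitioner;   // absent when the table uses the default
    };

    // make_scylla_tables_mutation side: only a custom partitioner name is written.
    void fill_partitioner_column(scylla_tables_row& row,
                                 const std::string& table_partitioner,
                                 const std::string& default_partitioner) {
        if (table_partitioner != default_partitioner) {
            row.partitioner = table_partitioner;
        }
    }

    // create_table_from_mutations side: use the stored name if present.
    std::string partitioner_for_schema(const scylla_tables_row& row,
                                       const std::string& default_partitioner) {
        return row.partitioner.value_or(default_partitioner);
    }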
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The following commits make it possible to set a specific
partitioner for a table. We want to persist that information
and include it in the schema digest. For that, a new column
in scylla_tables is needed; this commit adds that column.
We add the new column to scylla_tables because it's a Scylla-specific
extension.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
With per-table partitioners, the partitioner name becomes part
of the table schema. To allow rolling upgrades we need special
logic that hides the new partitioner name schema column
during the upgrade. This commit adds a new schema feature that
controls this logic.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
While CQL does not allow creation of a materialized view with more than one
base regular column in the view's key, in Alternator we do allow this - both
partition and clustering key may be a base regular column. We had a bug in
the logic handling this case:
If the new base row is missing a value for *one* of the view key columns,
we shouldn't create a view row. Similarly, if the existing base row was
missing a value for *one* of the view key columns, a view row does not
exist and doesn't need to be deleted. This was done incorrectly: the code
made its decisions based on just one of the key columns. The logic is now
fixed (and, I think, simplified) in this patch.
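The corrected rule boils down to checking every view key column rather than just one of them; a hypothetical sketch:

    #include <optional>
    #include <string>
    #include <vector>

    using cell = std::optional<std::string>;

    // A view row exists for a base row only if *every* view key column has a
    // value in that base row. Both the "create the new view row" and the
    // "delete the old view row" decisions must use this check.
    bool has_complete_view_key(const std::vector<cell>& base_values_of_view_key_columns) {
        for (const auto& c : base_values_of_view_key_columns) {
            if (!c) {
                return false;   // one missing key column already rules the row out
            }
        }
        return true;
    }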
With this patch, the Alternator test which previously failed because of
this problem now passes. The patch also includes new tests in the existing
C++ unit test test_view_with_two_regular_base_columns_in_key. This test
was already supposed to be testing various cases of two-new-key-columns
updates, but missed the cases explained above. These new tests failed
badly before this patch - some of them had clean write errors, others
caused crashes. With this patch, they pass.
Fixes #6008.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200312162503.8944-1-nyh@scylladb.com>
Local view updates (updates applied to a local node,
without remote communication) are from now on performed
synchronously - which adds consistency guarantees, as a local
write failure will be returned to the client instead of being
silently ignored.
Currently, launching view updates as an asynchronous background job
is done by not waiting for the mutate_MV() future in
table::generate_and_propagate_view_updates. That has a big downside:
since mutate_MV() handles *all* view updates for *all* views of a table,
it's not possible to wait for each view independently.
Per-view granularity is required in order to implement synchronous
view updates of local views - because then we'll synchronously
wait for all views that write to the local node (due to having a matching
partition key with the base), while remote view updates will still
be sent asynchronously.
In order to do that, instead of not waiting for mutate_MV,
we now wait for it properly, and instead launch the asynchronous,
unwaited-for futures inside mutate_MV.
Effectively that means no change for view updates so far - all updates
are still fired in the background. Later, another patch will introduce
a way to wait for selected updates to finish.
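Where the series is heading, combined with the local-update patch above, looks roughly like this (a simplified sketch with stand-in types; the real code uses seastar futures):

    #include <exception>
    #include <vector>

    // Stand-in for a future covering one view update.
    struct update_handle {
        std::exception_ptr error;   // set if that update failed
    };

    update_handle apply_view_update(bool) { return {}; }

    // mutate_MV-style entry point: the caller waits on what is returned, and
    // this function decides which updates that wait actually covers.
    std::vector<update_handle> mutate_mv(const std::vector<bool>& update_is_local) {
        std::vector<update_handle> waited_for;
        for (bool is_local : update_is_local) {
            update_handle h = apply_view_update(is_local);
            if (is_local) {
                waited_for.push_back(h);   // local view writes: failures reach the client
            }
            // remote view writes: handle dropped, they complete in the background
        }
        return waited_for;
    }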
It may not be safe to move sharded services, so this will be prohibited
in a future seastar update. Remove all current cases where we do it.
Fixes #5814.
Message-Id: <20200301095423.GY434@scylladb.com>
If the feature service is stopped without enabling some features,
the latter may end up with a "broken promise" exception on futures
attached to the _pr promise. Fix this by switching the only user
of it to the 'listener' API and removing the future-based one.
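The listener style looks roughly like this (a simplified sketch, not the actual seastar/feature-service code):

    #include <functional>
    #include <utility>
    #include <vector>

    struct feature {
        bool enabled = false;
        std::vector<std::function<void()>> listeners;

        // Register a callback instead of obtaining a future.
        void when_enabled(std::function<void()> cb) {
            if (enabled) { cb(); } else { listeners.push_back(std::move(cb)); }
        }

        void enable() {
            enabled = true;
            for (auto& cb : listeners) { cb(); }
            listeners.clear();
        }

        // Stopping the service simply drops 'listeners'; there is no promise
        // left behind to break, unlike the future-based subscription.
    };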
Tests: unit(debug), manual start-stop and aborted-start
Fixes #5899
When terminating (closing) a segment, we write a trailing block of zeros
so the reader has an empty region after the last used chunk as an end
marker. This is needed because we use recycled, pre-allocated segments
with potentially non-zero data extending past the point where we are
ending the segment (i.e. we are not fully filling the segment, due to a
huge mutation or similar).
However, if we reach the end of the segment while writing the final block
(typically many small mutations), the file ends naturally after the data
written, and any trailing zero block would in fact just extend the file
further. While this will only happen once per segment recycled
(independent of how many times it is recycled), it still slightly breaks
the disk usage contract and can also cause some disk stalls due to
metadata changes (though very infrequently).
We should only write the trailing zero block if we are below the max_size
file size when terminating.
Adds a small size check to the commitlog test to verify size bounds
(which fails without the patch).
Message-Id: <20200226121601.15347-2-calle@scylladb.com>
Previously the tokens were stored as strings
because a token could be represented in multiple ways.
Now the token representation is always int64_t, so we can
store them as ints in the CDC description as well.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
With #5950 we changed the representation of stream_id
in the CDC log from two int columns to a single blob column.
This PR cleans up the stream_id representation internally.
Now stream_id is stored as a blob both in memory and in the
internal CDC tables.
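Conceptually the unified representation is just an opaque byte string; an illustrative class (not the exact one):

    #include <string>
    #include <utility>

    class stream_id {
        std::string _bytes;   // the blob, exactly as stored in the CDC tables
    public:
        explicit stream_id(std::string bytes) : _bytes(std::move(bytes)) {}
        const std::string& to_bytes() const { return _bytes; }
        bool operator==(const stream_id& o) const { return _bytes == o._bytes; }
        bool operator<(const stream_id& o) const { return _bytes < o._bytes; }
    };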
Tests: unit(dev)
* hawk/stream_id_representation:
cdc: store stream_ids as blobs in internal tables
cdc: improve do_update_streams_description
cdc: Fix generate_topology_description
cdc: add stream_id::operator<
cdc: change stream_id representation
Fixes #5891
Refs #5899
When creating segments with the o_dsync option active, we write max_size
bytes of zeros to disk to ensure actual disk blocks are allocated.
However, when we recycle a segment and thus don't actually create a new
file, we should check the existing size on disk and only zero any blocks
not already allocated (i.e. if the recycled file was smaller than
max_size, due to segment truncation on close).
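A sketch of that pre-allocation rule (names are hypothetical):

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>

    void zero_fill(std::uint64_t from, std::uint64_t to) {
        std::printf("zeroing [%llu, %llu)\n",
                    (unsigned long long)from, (unsigned long long)to);
    }

    // A brand-new file has existing_size == 0 and gets fully zeroed; a recycled
    // segment only needs zeros beyond what is already allocated on disk.
    void ensure_allocated(std::uint64_t existing_size, std::uint64_t max_size) {
        std::uint64_t start = std::min(existing_size, max_size);
        if (start < max_size) {
            zero_fill(start, max_size);
        }
    }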
test: unit
Message-Id: <20200226121601.15347-1-calle@scylladb.com>
In the new CDC log format, stream_id is represented by a single
blob column, so it makes sense to store it in the same form
everywhere - including the internal CDC tables.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The schema.hh header sits in many other headers, but there's a handy
schema_fwd.hh that's tiny and contains the declarations needed
by other headers. So replace schema.hh with schema_fwd.hh
in most of the headers (and remove it completely from some).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200303102050.18462-1-xemul@scylladb.com>
The function in question uses a future-based .when_enabled() subscription
on the cluster_supports_truncation_table feature. This method is considered
unsafe, so this patch changes it to use feature::listener.
The completion of the migration is only awaited by a single test, so
this waiting mechanism is also slightly simplified.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We have a few kinds of queries whose memory consumption is not limited at
all. One of these is reverse queries, which read entire partitions into
memory before reversing them. These partitions can be larger than
memory, and thus such a query can single-handedly cause OOM.
This patch introduces a configuration option for a memory limit on such
queries. It serves as a hard limit: queries which attempt to
use more memory than this will be aborted.
The limit is propagated to table objects, with the intention of keeping
system tables unlimited. These tables are usually small, and initiators
of system queries are not prepared for failures.
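One way such a hard limit can be enforced while the reversed partition is being buffered (an illustrative sketch, not the actual accounting code):

    #include <cstdint>
    #include <stdexcept>

    struct query_memory_accounter {
        std::uint64_t used = 0;
        std::uint64_t limit;   // 0 means "unlimited", e.g. for system tables

        explicit query_memory_accounter(std::uint64_t max_bytes) : limit(max_bytes) {}

        // Called as fragments of the partition are buffered for reversal.
        void consume(std::uint64_t bytes) {
            used += bytes;
            if (limit != 0 && used > limit) {
                throw std::runtime_error("reverse query exceeded its memory limit; aborting");
            }
        }
    };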
"
Here is a simple introduction to the node operations scylla supports and
some of the issues.
- Replace operation
It is used to replace a dead node. The token ring does not change. It
pulls data from only one of the replicas, which might not have the
latest copy.
- Rebuild operation
It is used to get all the data this node owns from other nodes. It
pulls data from only one of the replicas, which might not have the
latest copy.
- Bootstrap operation
It is used to add a new node to the cluster. The token ring
changes. It does not suffer from the "not the latest replica" issue. The
new node pulls data from existing nodes that are losing the token range.
It suffers from failed streaming: we split the ranges into 10 groups and
stream one group at a time, restreaming a group if it fails, which causes
unnecessary data transmission on the wire.
Bootstrap is not resumable. If it fails after 99.99% of the data is
streamed and we restart the node again, we need to stream all the data
again even though the node already has 99.99% of it.
- Decommission operation
It is used to remove a live node from the cluster. The token ring
changes. It does not suffer from the "not the latest replica" issue. The
leaving node pushes data to existing nodes.
It suffers from the same resumability issue as the bootstrap operation.
- Removenode operation
It is used to remove a dead node from the cluster. Existing nodes
pull data from other existing nodes for the new ranges they own. They
pull from one of the replicas, which might not have the latest copy.
To solve all the issues above, we can use repair based node operations.
The idea behind repair based node operations is simple: use repair to
sync data between replicas instead of streaming.
The benefits:
- The latest copy is guaranteed
- Resumable in nature
- No extra data is streamed on the wire
  (e.g., rebuilding twice will not stream the same data twice)
- Unified code path for all the node operations
- A free repair during bootstrap, replace and so on
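In rough terms, each operation gains a repair-based path selected by a config switch; a sketch using the helper names from the commit list below (the bodies are illustrative only, and bootstrap_with_streaming is a made-up stand-in for the legacy path):

    #include <cstdio>

    // Backed by the enable_repair_based_node_ops config option.
    bool is_repair_based_node_ops_enabled() { return true; }

    void bootstrap_with_repair()    { std::puts("bootstrap via repair"); }   // resumable, latest copy
    void bootstrap_with_streaming() { std::puts("bootstrap via legacy streaming"); }

    void run_bootstrap() {
        if (is_repair_based_node_ops_enabled()) {
            bootstrap_with_repair();
        } else {
            bootstrap_with_streaming();
        }
    }

Decommission, removenode, rebuild and replace follow the same pattern through their *_with_repair counterparts.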
Fixes: #3003
Fixes: #4208
Tests: update_cluster_layout_tests.py + replace_address_test.py + manual test
"
* 'repair_for_node_ops' of https://github.com/asias/scylla:
docs: Add doc for repair_based_node_ops
storage_service: Enable node repair based ops for bootstrap
storage_service: Enable node repair based ops for decommission
storage_service: Enable node repair based ops for replace
storage_service: Enable node repair based ops for removenode
storage_service: Enable node repair based ops for rebuild
storage_service: Use the same tokens as previous bootstrap
storage_service: Add is_repair_based_node_ops_enabled helper
config: Add enable_repair_based_node_ops
repair: Add replace_with_repair
repair: Add rebuild_with_repair
repair: Add do_rebuild_replace_with_repair
repair: Add removenode_with_repair
repair: Add decommission_with_repair
repair: Add do_decommission_removenode_with_repair
repair: Add bootstrap_with_repair
repair: Introduce sync_data_using_repair
repair: Propagate exception in tracker::run
Changes the name of the storage_proxy::mutate_hint_from_scratch function
to a clearer one: send_hint_to_all_replicas.
Tests: unit(dev)
Due to a bug, the entire segment is written in one huge write of 32MB.
The idea was to split it into writes of 128KB, so fix that.
Fixes #5857
Message-Id: <20200220102939.30769-1-gleb@scylladb.com>
There may be other commitlog writes waiting for the zeroing to complete, so
not using the proper scheduling class causes a priority inversion.
Fixes #5858.
Message-Id: <20200220102939.30769-2-gleb@scylladb.com>
When dropping a table, the table and its views are dropped
in parallel. This is not a problem in itself, but we have
a mechanism to snapshot a deleted table before the actual
delete. When a secondary index is removed, the snapshot
process looks for its schema in order to create the schema
part of the snapshot, but if the main table is already gone
it will not find it.
This commit serializes the removal of views and the main table,
and removes the views prior to the tables.
See discussion on #5713
Tests:
Unit tests (dev)
dtest - A test that failed on "can't find schema" error
Fixes #5614
* eliran/serialize_table_views_deletion:
Materialized Views: serialize tables and views creation
Materialized Views: drop materialized views before tables
This change serializes the creation of tables and views. The
change's purpose is to avoid possible future races due to
a view searching for its base table's information while the
latter hasn't been created yet.
When dropping a table, the table and its views are dropped
in parallel. This is not a problem in itself, but we have
a mechanism to snapshot a deleted table before the actual
delete. When a secondary index is removed, the snapshot
process looks for its schema in order to create the schema
part of the snapshot, but if the main table is already gone
it will not find it.
This commit serializes the removal of views and the main table,
and removes the views prior to the tables.
See discussion on https://github.com/scylladb/scylla/pull/5713
Tests:
Unit tests (dev)
dtest - A test that failed on "can't find schema" error
Fixes #5614
Refs #817
Truncation is potentially long. It has its own timeout in the storage
proxy/rpc. This value should probably also be higher than the default
timeout.
Message-Id: <20200218135926.26522-1-calle@scylladb.com>
When replaying a hint whose destination node is no longer in the
cluster, it is sent with cl=ALL to all of its new replicas. Before
this patch, the MUTATION verb was used, which caused such hints to be
handled on the same connection and with the same priority as regular
writes. This can cause problems when a large number of hints is
orphaned and they are all scheduled to be sent at once. Such a situation
may happen when replacing a dead node - all nodes that accumulated hints
for the dead node will now send them with cl=ALL to their new replicas.
This patch changes the verb used to send such hints to HINT_MUTATION.
This verb is handled on a separate connection and with the streaming
scheduling group, which gives these hints similar priority to non-orphaned
hints.
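The gist of the change, as a small sketch (the enum and helper are illustrative):

    enum class verb { MUTATION, HINT_MUTATION };

    // Hints replayed to the new replicas of a node that left the ring used to
    // go out as ordinary MUTATIONs; sending them as HINT_MUTATION keeps the
    // burst on the hint/streaming path instead of the foreground write path.
    verb verb_for_redirected_hint() {
        return verb::HINT_MUTATION;   // before this patch: verb::MUTATION
    }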
Refs: #4712
Tests: unit(dev)
and replace all calls to dht::global_partitioner().get_token.
dht::get_token is better because it takes the schema and uses it
to obtain the partitioner instead of using a global partitioner.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
and replace all dht::global_partitioner().decorate_key calls
with dht::decorate_key.
It is an improvement because dht::decorate_key takes the schema
and uses it to obtain the partitioner instead of using the global
partitioner as before.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Take const schema& as a parameter of shard_of and
use it to obtain partitioner instead of calling
global_partitioner().
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The update generation path must track and apply all tombstones,
both from the existing base row (if read-before-write was needed)
and for the new row. One such path contained an error, because
it assumed that if the existing row is empty, then the update
can simply be generated from the new row. However, the lack of an
existing row can also be the result of a partition/range tombstone.
If that's the case, it needs to be applied, because it's entirely
possible that this partition tombstone also hides the new row.
Without taking the partition tombstone into account, creating
a future tombstone and then inserting an out-of-order write before it
in the base table can result in ghost rows in the view table.
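A much-simplified sketch of the fixed rule (the types are hypothetical):

    #include <cstdint>

    struct tombstone {
        std::int64_t timestamp = INT64_MIN;   // INT64_MIN means "no tombstone"
    };

    struct row { bool empty = true; };

    struct view_update {
        row new_row;
        tombstone base_partition_tombstone;   // must always be carried over
    };

    view_update generate_update(bool existing_row_is_empty, const row& incoming,
                                tombstone partition_tombstone) {
        view_update u;
        u.new_row = incoming;
        // The fix: carry the base partition/range tombstone over unconditionally.
        // Before, it was effectively skipped when the existing row was empty,
        // which let an out-of-order base write slip in under a "future"
        // tombstone and produce ghost view rows.
        u.base_partition_tombstone = partition_tombstone;
        (void)existing_row_is_empty;   // deliberately no longer consulted here
        return u;
    }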
This patch comes with a test which was proven to fail before the
changes.
Branches: 3.1, 3.2, 3.3
Fixes #5793
Tests: unit(dev)
Message-Id: <8d3b2abad31572668693ab585f37f4af5bb7577a.1581525398.git.sarna@scylladb.com>
All internal execution always uses the query text as the key in the
cache of internal prepared statements. There is no need
to publish an API for executing an internal prepared statement object.
The folded execute_internal() calls an internal prepare() and then an
internal execute();
execute_internal(cache=true) does exactly that.
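A condensed sketch of that folded flow (the types here are placeholders):

    #include <memory>
    #include <string>
    #include <unordered_map>

    struct prepared_statement { std::string text; };
    struct result_set {};

    class query_processor {
        std::unordered_map<std::string, std::shared_ptr<prepared_statement>> _internal_cache;

        std::shared_ptr<prepared_statement> prepare_internal(const std::string& text) {
            auto& slot = _internal_cache[text];   // the query text is the cache key
            if (!slot) {
                slot = std::make_shared<prepared_statement>(prepared_statement{text});
            }
            return slot;
        }

        result_set execute_prepared(const prepared_statement&) { return {}; }

    public:
        // Callers only ever see this: prepare (or hit the cache), then execute.
        result_set execute_internal(const std::string& query_text) {
            return execute_prepared(*prepare_internal(query_text));
        }
    };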
"
Lots of code needs storage_service just to get token_metadata from it.
This creates unwanted dependency loops and increases the use of the
global storage_service instance.
This set keeps the sharded<locator::token_metadata> on main's stack
and carries references to it where needed. This removes the dependency
on storage_service from:
- storage_proxy
- gossiper
- redis
- batchlog manager
and makes the database need it only for sstables_format (will be fixed
in one of the next sets).
Also, this set is a prerequisite for controlling the copying of
token_metadata instances (two occurrences were spotted in the bootstrap
code).
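The dependency-injection shape, in miniature (illustrative classes; the real code passes sharded<> references around):

    struct token_metadata {};

    class storage_proxy {
        const token_metadata& _tm;
    public:
        explicit storage_proxy(const token_metadata& tm) : _tm(tm) {}
    };

    class gossiper {
        const token_metadata& _tm;
    public:
        explicit gossiper(const token_metadata& tm) : _tm(tm) {}
    };

    int main() {
        token_metadata tm;   // in Scylla: sharded<locator::token_metadata> on main's stack
        storage_proxy proxy(tm);
        gossiper g(tm);
        (void)proxy;
        (void)g;
    }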
Tests: unit(dev), manual start-stop
"
* 'br-token-metadata-standalone-2' of https://github.com/xemul/scylla:
api: Keep and use reference on token_metadata
redis: Use proxy token_metadata
gossiper: Keep needed for failure_detection values on board
database: Use own token_metadata
batchlog: Use token_metadata from proxy
proxy: Use own token_metadata
gossiper: Use own token_metadata
tokens: Switch into standalone sharded instance
batchlog: Use in-config ring-delay
database: Have it in size_estimate_virtual_reader
storage_proxy: Pass token_metadata in some static helpers
storage_service: Move get_local_tokens wrapper
size_estimates_virtual_reader: Make get_local_ranges static
migration_manager: Refactor validation of new/updating ksm
storage_service: Tiny cleanup of excessive self-reference
Merged pull request https://github.com/scylladb/scylla/pull/5755 from
Avi Kivity:
This series removes some #include dependencies around cql3. It results in
a 30k-line (6.6%) reduction in the preprocessed size of database.i, mainly
due to the elimination of boost::regex (which was brought in, in turn, by
like_matcher). This should result in fewer and faster recompiles.
commits:
tracing: remove #include of modification_statement.hh from table_helper
cql3: selection: remove now-unneeded include of statement_restrictions.hh
cql3: deinline result_set_builder::restrictions_filter constructor
view_info: remove include of select_statement.hh
cql3: selection: remove unnecessary include of selector_factories
cql3: query_processor: reduce #includes