In previous patches, we gave up on an old (and broken) attempt to track
the timestamps of many unselected base-table columns through one row marker
in the view table, and replaced it with "virtual cells", one per unselected
cell.
The do_delete_old_entry() function still contains the old code which
maintained that row marker, and it is no longer needed. Not only is that
old code no longer needed, it also no longer did anything: because all
columns now appear in the view (as virtual columns), the code ignored them
when calculating the row marker.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180829131914.16042-1-nyh@scylladb.com>
"
When a view's partition key contains only columns from the base's partition
key (and not an additional one), the liveness (existence or disappearance)
of a view-table row is tied to the liveness of the base table row. And
that, in turn, depends not only on selected columns (base-table columns
SELECTed to also appear in the view) but also on unselected columns.
This means that we may need to keep a view row alive even without data,
just because some unselected column is alive in the base table. Before this
patch set we tried to build a single "row marker" in the view column which
tried to summarize the liveness information in all unselected columns.
But this proved unworkable, as explained in issue #3362 and as will be
demonstrated in unit tests at the end of this series.
Because we can't replace several unselected cells by one row marker, what
we do in this series is to add, for each of the unselected cells, a "virtual
cell" which contains the cell's liveness information (timestamp, deletion,
ttl) but not its value. For collections, we can't represent the entire
collection by one virtual cell, but rather need a collection of virtual
cells.
Fixes #3362
"
* 'virtual-cols-v3' of https://github.com/nyh/scylla:
Materialized Views: test that virtual columns are not visible
Materialized Views: unit test reproducing fixed issue #3362
Materialized Views: no need for elaborate row marker calculations
Materialized Views: add unselected columns as virtual columns
Materialized Views: fill virtual columns
Do not allow selecting a virtual column
schema: persist "view virtual" columns to a separate system table
schema: add "view virtual" flag to schema's column_definition
Add "empty" type name to CQL parser, but only for internal parsing
Memtable flushes for system and regular region groups run under the
memtable_scheduling_group, but the controller adjusts shares based on
the occupancy of the regular region group.
It can happen that the regular group is not under pressure, but the system
group is. In this case the controller will incorrectly assign low shares to the
memtable flush of system. This may result in high latency and low
throughput for writes in the system group.
I observed writes to the system keyspace timing out (on scylla-2.3-rc2)
in the dtest: limits_test.py:TestLimits.max_cells_test, which went
away after this.
Fixes #3717.
Message-Id: <1535016026-28006-1-git-send-email-tgrabiec@scylladb.com>
Now that we have separate virtual cells to represent unselected columns
in a materialized view, we no longer need the elaborate row-marker liveness
calculations which aimed (but failed) to do the same thing. So that code
can be removed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When a view's partition key contains only columns from the base's partition
key (and not an additional one), the liveness (existence or disappearance)
of a view-table row is tied to the liveness of the base table row - and
that depends not only on selected columns (base-table columns SELECTed to
also appear in the view) but also on unselected columns.
This means that we may need to keep a view row alive even without data,
just because some unselected column is alive in the base table. Before this
patch we tried to build a single "row marker" in the view column which
summarizes the liveness information in all unselected columns, but this
proved unworkable, as explained in issue #3362 and as will be demonstrated
in unit tests in a later patch.
Because we can't replace several unselected cells by one row marker, what
we do in this patch is to add, for each of the unselected cells, a "virtual
cell" which contains the cell's liveness information (timestamp, deletion,
ttl) but not its value. For collections, we can't represent the entire
collection by one virtual cell, but rather need a collection of virtual
cells.
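The shape of a virtual cell can be sketched as follows (an illustrative
Python model with hypothetical names; the actual implementation is the C++
code in this series, which stores these as real columns of "empty" type):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cell:
    """A regular base-table cell: a value plus liveness information."""
    value: bytes
    timestamp: int
    ttl: Optional[int] = None  # seconds; None means no TTL

@dataclass
class VirtualCell:
    """Carries only the liveness information of a base-table cell,
    which is all the view needs to decide whether its row is alive."""
    timestamp: int
    ttl: Optional[int] = None

def make_virtual(cell: Cell) -> VirtualCell:
    # Strip the value, keeping only what view-row liveness depends on.
    return VirtualCell(timestamp=cell.timestamp, ttl=cell.ttl)
```

A view row then stays alive as long as any selected cell or any such
virtual cell is live, without copying unselected values into the view.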
This patch just adds the virtual columns to the view schema. Code added in
the previous patch notices the virtual columns in the view's schema and
fills them with the appropriate content.
We may need to add virtual columns to a view when first created, but also
when an unselected column is added to the base table with "ALTER TABLE",
so both are supported in this patch.
Fixes #3362.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The add_cells_to_view() function usually adds selected cells from the base
table to the view mutation. For issue #3362, we sometimes want to also
add unselected cells as "virtual" cells - truncated versions of the
base-table cells just without the values.
This patch contains the code to fill the virtual columns' data using the
regular columns from the base table.
This patch does not yet actually *add* any virtual columns to the schema,
so until that is done (in the next patch), this patch will not yet cause
any behavior change. This is important for bisectability.
Refs #3362.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the previous patch, we added a "view virtual" flag on columns. In this
patch we add persistence for this flag, i.e., writing it to the on-disk
schema table and reading it back on startup. But the implementation is
not as simple as adding a flag:
In the on-disk system tables, we have a "columns" table listing all the
columns in the database and their types. Cqlsh's "DESCRIBE MATERIALIZED
VIEW" works by reading this "columns" table, and listing all of the
requested view's columns. Therefore, we cannot add "virtual columns" -
which are columns not added by the user and not intended to be seen -
to this list.
We therefore need to create in this patch a separate list for virtual
columns, in a new table "view_virtual_columns". This table is essentially
identical to the existing "columns" table, just separate. We need to write
each column to the appropriate table (columns with the view_virtual flag to
"view_virtual_columns", columns without it to the old "columns"), read
from both on startup, and remember to delete columns from both when a table
is dropped.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Even before this patch, Scylla supported the "empty" type (a column with
no content) but only internally - i.e., in code but not in CQL syntax.
The "empty" type was used in dense tables without regular columns, and a
special optimization in db::cql_type_parser::parse() allowed this type
name to be parsed when reading the schema tables, without allowing the
"empty" type to be used by users in CQL statements.
However, parse() only supported "empty" itself, and more complex types
like list<empty> were not recognized by parse(). In the following patches,
we plan to add virtual columns to materialized views, with types empty,
list<empty> or map<something, empty>. We need all these types to work, and
before this patch, they didn't. But we want all of these types to only work
internally - when Scylla's code creates these hidden columns; we do not
want to add the "empty" type to CQL's syntax.
This is what we do in this patch: The CQL parser's comparator_type rule
now has a parameter, "internal", used to differentiate internal calls
via db::cql_type_parser::parse() from calls from CQL query parsing.
If a user tries something like:
CREATE TABLE e (pk empty PRIMARY KEY);
they will get the error:
Invalid (reserved) user type name empty
Note that here, as usual, unknown types are treated as "user types",
and "empty" is not allowed as a user type name - we "reserve" it in case
one day in the future we will want to allow users a direct syntax to
create empty columns. We already have, following Cassandra, a bunch of
other names reserved from being user type names, including "byte",
"complex", and others (see _reserved_type_names()), and using "empty"
as a type name will result in a similar error message.
Just like all other type names, the name "empty" is not a reserved
keyword in other senses: a user can create a table or a column with
the name "empty", just like he can create one with the name "int".
Refs #3362.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When murmur3_ignore_msb_bits was introduced in 1.7, we set its default to zero
(to avoid resharding on upgrade) and set it to 12 in the scylla.yaml template
(to make sure we get the right value for new clusters).
Now, however, things have changed:
- clusters installed before 1.7 are a small minority
- they should have resharded long ago
- resharding is much better these days
- we have more migrations from Cassandra compared to old clusters
To allow clusters that migrated using their cassandra.yaml, and to clean up
the default scylla.yaml, make the default 12.
Users upgrading from pre-1.7 clusters will need to update their scylla.yaml,
or to reshard (which is a good idea anyway).
Fixes #3670.
Message-Id: <20180808063003.26046-1-avi@scylladb.com>
"
This series replaces infinite time-outs in internal distributed
(non-local) CQL queries with finite ones.
The implementation of tracing, which also performs internal queries,
already has finite time-outs, so it is unchanged.
Fixes#3603.
"
* 'jhk/finite_time_outs/v2' of https://github.com/hakuch/scylla:
Use finite time-outs for internal auth. queries
Use finite query time-outs for `system_distributed`
The move operation changes a node's token to a new token. It is
supported only when a node has one token. The legacy move operation was
useful in the early days, before vnodes were introduced, when a node had
only one token. I don't think it is useful anymore.
In the future, we might support adjusting the number of vnodes to rebalance
the token range each node owns.
Removing it simplifies the cluster operation logic and code.
Fixes #3475
Message-Id: <144d3bea4140eda550770b866ec30e961933401d.1533111227.git.asias@scylladb.com>
std::random_device() uses the relatively slow /dev/urandom, and we rarely if
ever intend to use it directly - we normally want to use it to seed a faster
random_engine (a pseudo-random number generator).
In many places in the code, we first created a random_device variable, and then
using it created a random_engine variable. However, this practice created the
risk of a programmer accidentally using the random_device object, instead of the
random_engine object, because both have the same API; this hurts performance.
This risk materialized in just two places in the code, utils/uuid.cc and
gms/gossiper.cc. A patch for uuid.cc was sent previously by Pawel and is
not included in this patch, and the fix for gossiper.{cc,hh} is included here.
To avoid risking the same mistake in the future, this patch switches across the
code to an idiom where the random_device object is not *named*, so cannot be
accidentally used. We use the following idiom:
std::default_random_engine _engine{std::random_device{}()};
Here std::random_device{}() creates the random device (/dev/urandom) and pulls
a random integer from it. It then uses this seed to create the random_engine
(the pseudo-random number generator). The std::random_device{} object is
temporary and unnamed, and cannot be unintentionally used directly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180726154958.4405-1-nyh@scylladb.com>
This series contains a couple of fixes to the bookkeeping of the view
build process, which could cause data to be left behind in the system
tables.
* git@github.com:duarten/scylla.git materialized-views/view-build-fixes/v1:
Duarte Nunes (3):
db/system_keyspace: Add function to remove view build status of a
shard
db/view: Don't have shard 0 clear other shards' status on drop
db/view: Restrict writes to the distributed system keyspace to shard 0
This series contains a couple of fixes to the adjusting of clustering
keys in the build_progress_virtual_reader, some of which could
potentially cause heap overflows when querying the legacy system table.
* git@github.com:duarten/scylla.git materialized-views/build-progress-virtual-reader-fixes/v1:
Duarte Nunes (3):
db/view/build_progress_virtual_reader: Use correct schema to adjust ck
db/view/build_progress_virtual_reader: Fix full ck detection
db/view/build_progress_virtual_reader: Also adjust end RT bound
Use abstract_type::to_string() to prettify partition key components.
Manually tested by setting --compaction-large-partition-warning-threshold-mb
to zero and inspecting the output for compound and non-compound partition
keys.
As an optimization, the virtual reader doesn't change the underlying
key if it is not full, and hence doesn't include the extra clustering
key. However, this detection is broken because it checked for 3
clustering columns, instead of 2.
This patch fixes that by obtaining the clustering key size from the
underlying schema instead of hardcoding the size.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The virtual reader adjusts clustering keys obtained from the
underlying, scylla-specific schema, and potentially sheds the extra
clustering key that's absent from the Cassandra-compatible schema.
This patch ensures we use the correct schema to iterate over the
key.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Writing to the distributed system keyspace should be confined to a
single shard of each host, namely shard 0. We were violating this
constraint by having all shards set the host status to "started". This
could be problematic when the build finishes quickly or there's a
concurrent view drop, such that a write done by shard 0 can have a
smaller timestamp than one done by some other shard.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Shard 0 can clear the in-progress build status of all shards when a
view finishes building, because we are guaranteed that all writes to the
system table have completed with earlier timestamps.
This is not the case when dropping a view. A drop can happen
concurrently with the build, in which case shard 0 may process the
notification before another shard receives it, and before that shard
writes to the system table.
Fix this by ensuring each shard clears its own status on drop.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a function that clears the view build in-progress
status for the current shard, similar to the existing one that clears
it across all shards.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We were considering the token ranges in the size_estimates system
table to be inclusive, which is incorrect and incompatible with
Cassandra.
While we ignore the inclusiveness of the partition_range bounds when
selecting sstables, we do take it into account in
estimated_keys_for_range(). We would thus select the correct sstables,
but could over-estimate the range size nonetheless.
Tests: virtual_reader_test(release)
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180709115919.5106-1-duarte@scylladb.com>
Require a timeout parameter for storage_proxy::mutate_begin() and
all its callers (all the way to thrift and cql modification_statement
and batch_statement).
This should fix spurious debug-mode test failures, where overcommit
and general debug slowness result in the default timeouts being
exceeded. Since the tests use infinite timeouts, they should not
time out any more.
Tests: unit (release), with an extra patch that aborts
when a non-infinite timeout is detected.
Message-Id: <20180707204424.17116-1-avi@scylladb.com>
Rebalance hints segments that need to be sent among all present shards.
Ensure that after rebalancing the difference between the number of segments
of any two shards is not greater than 1.
Try to minimize the number of "file rename" operations in order to achieve the needed result.
Note: "Resharding" is a particular case of rebalancing.
Tests: dtest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
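The balancing invariant can be sketched as follows (an illustrative Python
model with hypothetical names, not the actual C++ implementation; it
redistributes segments so no two shards differ by more than one, moving
only surplus segments):

```python
def rebalance(segments_per_shard):
    """Redistribute hint segments so that any two shards differ by at
    most one segment, moving (renaming) as few segments as possible.

    segments_per_shard: list of lists of segment names, one per shard,
    modified in place. Returns the number of segments moved."""
    n = len(segments_per_shard)
    total = sum(len(s) for s in segments_per_shard)
    base, extra = divmod(total, n)
    # Lower-numbered shards absorb the remainder, one extra segment each.
    targets = [base + (1 if i < extra else 0) for i in range(n)]
    surplus = []
    # First pass: take segments only from overloaded shards.
    for shard, target in zip(segments_per_shard, targets):
        while len(shard) > target:
            surplus.append(shard.pop())
    moves = len(surplus)
    # Second pass: hand them to underloaded shards.
    for shard, target in zip(segments_per_shard, targets):
        while len(shard) < target:
            shard.append(surplus.pop())
    return moves
```

Taking only the surplus above each shard's target is what keeps the number
of renames minimal; shards already at or below target keep their segments.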
Reserving 10% of space for hints managers makes sense if the device
is shared with other components (like /data or /commitlog).
But, if the hints directory is mounted on dedicated storage, it makes
sense to reserve much more - 90% was chosen as a sane limit.
Whether the storage is 'dedicated' is determined by a simple check of
whether the given hints directory is a mount point.
Fixes #3516
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
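The decision can be sketched like this (an illustrative Python sketch with
hypothetical names and the percentages from this commit; the real check is
done in Scylla's C++ hints code):

```python
import os

SHARED_RESERVE = 0.10     # hints share the disk with other components
DEDICATED_RESERVE = 0.90  # hints directory sits on its own mount point

def hints_space_fraction(hints_dir: str) -> float:
    """Pick the fraction of disk capacity hints may use, based on
    whether the hints directory is itself a mount point (i.e., the
    storage is dedicated to hints)."""
    return DEDICATED_RESERVE if os.path.ismount(hints_dir) else SHARED_RESERVE
```

A mount-point check is a cheap heuristic: it compares the directory's
device with its parent's, so no configuration flag is needed.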
Instead of having one static space limit for all directories,
space_watchdog now keeps a per-device limit, shared among
hints managers residing on the same disks.
References #3516
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
In order to make space_watchdog device-aware, a device_id field
is added to the hints manager. It is the equivalent of stat.st_dev
and it identifies the disk that contains the manager's root directory.
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
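The idea maps directly onto the standard stat interface; a minimal Python
sketch (hypothetical names; the implementation uses the equivalent C++
filesystem calls):

```python
import os

def get_device_id(path: str) -> int:
    """Identify the block device backing a directory, mirroring
    stat.st_dev. Two directories on the same disk return the same id,
    so their hints managers can share one space-watchdog limit."""
    return os.stat(path).st_dev
```

Grouping managers by this id is what lets the watchdog keep one shared
space limit per physical device instead of one per directory.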
In order to distinguish which directories reside on which devices,
a get_device_id function is added to the resource manager.
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Now that we have the controller, we would like to take min_threshold as
a hint. If there is nothing to compact, we can ignore that and start
compacting less than min_threshold SSTables so that the backlog keeps
reducing.
But there are cases in which we don't want min_threshold to be a hint
and we want to enforce it strictly. For instance, if write amplification
is more of a concern than space amplification.
This patch adds a YAML option that allows the user to tell us that. We will
default to false, meaning min_threshold is not strictly enforced.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
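The hint-versus-strict distinction can be sketched as follows (an
illustrative Python sketch with hypothetical names; the real decision lives
in the compaction strategy and controller code):

```python
def should_compact(candidates: int, min_threshold: int,
                   strict: bool, backlog: float) -> bool:
    """Decide whether to start a compaction.

    With strict enforcement, never compact fewer than min_threshold
    SSTables (bounding write amplification). Otherwise min_threshold is
    only a hint: when there is backlog to reduce and nothing better to
    do, compacting fewer SSTables is allowed."""
    if candidates < 2:
        return False  # nothing to merge
    if candidates >= min_threshold:
        return True
    return (not strict) and backlog > 0
```

Defaulting strict to false matches this patch: the controller may keep
reducing backlog below min_threshold unless the user opts into strictness.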
Previously max_shard_disk_space_size was unconditionally initialized
with the capacity of hints_directory. However, hints_directory may not
exist at all if hinted handoff is not enabled, which resulted in
Scylla failing to boot.
So, max_shard_disk_space_size is now initialized with the capacity
of hints_for_views directory, which is always present.
This commit also moves max_shard_disk_space_size to the .cc file
where it belongs - resource_manager.cc.
Tests: unit (release)
Message-Id: <9f7b86b6452af328c05c5c6c55bfad3382e12445.1528977363.git.sarna@scylladb.com>
When I came across db/legacy_schema_migrator.cc, I had no idea what it
did, and though I had obvious guesses (it somehow migrates old schemas,
right?) I didn't know what it really does. So after I figured it out,
I wrote this comment so the next person doesn't need to guess.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180605120225.25173-1-nyh@scylladb.com>
"
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.
Fixes #3483
"
* 'built-indexes-virtual-reader/v2' of github.com:duarten/scylla:
tests/virtual_reader_test: Add test for built indexes virtual reader
db/system_keyspace: Add virtual reader for IndexInfo table
db/system_keyspace: Explain that table_name is the keyspace in IndexInfo
index/secondary_index_manager: Expose index_table_name()
db/legacy_schema_migrator: Don't migrate indexes
"
As in #3423, ensuring token order on secondary index queries can be done
by adding an additional column to views that back secondary indexes.
This column is the first clustering column and contains the token value,
computed on updates.
This series also updates tests and comments referring to issue 3423.
Tests: unit (release, debug)
"
* 'order_by_token_in_si_5' of https://github.com/psarna/scylla:
cql3: update token order comments
index, tests: add token column to secondary index schema
view: add handling of a token column for secondary indexes
view: add is_index method
In order to ensure token order on secondary index queries, the
first clustering column of each view that backs a secondary index
is going to store a token computed from the base's partition keys.
After this commit, if the view contains a column that is not present
in the base schema, it will be filled with the computed token.
We would like the clients to be able to route work directly to the right
shards. To do that, they need to know the sharding algorithm and its
parameters.
The algorithm can be copied into the client, but the parameters need to
be exported somewhere. Let's use the local table for that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
---
v2: force msb to zero on non-murmur
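The kind of computation a shard-aware client performs with the exported
parameters can be sketched like this (an illustrative Python sketch of a
biased-token-to-shard mapping, assuming the murmur3 partitioner; treat the
exact formula as an assumption and defer to the server's C++ code):

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def shard_of(token: int, nr_shards: int, ignore_msb: int) -> int:
    """Map a signed 64-bit murmur3 token to a shard, given the two
    parameters a client would read from the local table: the number of
    shards and how many most-significant bits to ignore."""
    # Bias the signed token into an unsigned 64-bit value.
    biased = (token + (1 << 63)) & MASK64
    # Drop the ignored high bits so adjacent tokens spread across shards.
    biased = (biased << ignore_msb) & MASK64
    # Scale the 64-bit value into [0, nr_shards) via a 128-bit multiply.
    return (biased * nr_shards) >> 64
```

With these two numbers exported, a client can route each write to the
owning shard directly instead of letting any shard forward it.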
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.
Fixes #3483
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the same comment that exists in Apache Cassandra,
explaining that the table_name column in the IndexInfo system table
actually refers to the keyspace name. Don't be fooled.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Now that more than one instance of the hints manager can be present
at the same time, registering metrics is moved out of the constructor
to prevent 'registering metrics twice' errors.