Fixes#1484.
We drop tables as part of keyspace drop. Table drop starts with
creating a snapshot on all shards. All shards must use the same
snapshot timestamp which, among other things, is part of the snapshot
name. The timestamp is generated using supplied timestamp generating
function (joinpoint object). The joinpoint object will wait for all
shards to arrive and then generate and return the timestamp.
However, we drop tables in parallel, using the same joinpoint
instance. So joinpoint may be contacted by snapshotting shards of
tables A and B concurrently, generating timestamp t1 for some shards
of table A and some shards of table B. Later the remaining shards of
table A will get a different timestamp. As a result, different shards
may use different snapshot names for the same table. The snapshot
creation will never complete because the sealing fiber waits for all
shards to signal it, on the same name.
The fix is to give each table a separate joinpoint instance.
Message-Id: <1469117228-17879-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 5e8f0efc85)
This patch implements the merge_types() function,
allowing mutations to user defined types to be applied.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Seastar wrongly limits the number of concurrent submit_to()s to a single
remote shard. This can cause an ABBA deadlock:
fiberA fiberB (x127)
submit_to(0) # lock schema
<- returns
submit_to(0) # lock schema (waits)
submit_to(0) # do work (waits)
The fiberBs wait for fiberA, which in turn waits for a fiberB to return.
While the correct fix is to remote the client-side limit and replace it
with a server-side per-verb limit, we start with a simpler fix that
replaces the blocking lock call with a non-blocking call, removing the
deadlock.
Fixes#1088.
Message-Id: <1459095357-28950-1-git-send-email-avi@scylladb.com>
At the momment, the callbacks returns void, it is impossible to wait for
the callbacks to complete. Make the callbacks runs inside seastar
thread, so if we need to wait for the callback, we can make it call
foo_operation().get() in the callback. It is easier than making the
callbacks return future<>.
This patch makes sure that every time we need to create a new generation number -
the very first step in the creation of a new SSTable, the respective CF is already
initialized and populated. Failure to do so can lead to data being overwritten.
Extensive details about why this is important can be found
in Scylla's Github Issue #1014
Nothing should be writing to SSTables before we have the chance to populate the
existing SSTables and calculate what should the next generation number be.
However, if that happens, we want to protect against it in a way that does not
involve overwriting existing tables. This is one of the ways to do it: every
column family starts in an unwriteable state, and when it can finally be written
to, we mark it as writeable.
Note that this *cannot* be a part of add_column_family. That adds a column family
to a db in memory only, and if anybody is about to write to a CF, that was most
likely already called. We need to call this explicitly when we are sure we're ready
to issue disk operations safely.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently schema changes are only logged at coordinator node which
initiates the change. It would be helpful in post morten analysis to
also see when and how schema changes are resolved when applied on
other nodes.
Message-Id: <1456953095-1982-1-git-send-email-tgrabiec@scylladb.com>
Fixes#937
In fixing #884, truncation not truncating memtables properly,
time stamping in truncate was made shard-local. This however
breaks the snapshot logic, since for all shards in a truncate,
the sstables should snapshot to the same location.
This patch adds a required function argument to truncate (and
by extension drop_column_family) that produces a time stamp in
a "join" fashion (i.e. same on all shards), and utilizes the
joinpoint type in caller to do so.
Message-Id: <1456332856-23395-2-git-send-email-calle@scylladb.com>
Fixes#884
Time stamps for truncation must be generated after flush, either by
splitting the truncate into two (or more) for-each-shard operations,
or simply by doing time stamping per shard (this solution).
We generate TS on each shard after flushing, and then rely on the
actual stored value to be the highest time point generated.
This should however, from batch replay point of view, be functionally
equivalent. And not a problem.
Replicates https://issues.apache.org/jira/browse/CASSANDRA-7910 :
"Prepare a statement with a wildcard in the select clause.
2. Alter the table - add a column
3. execute the prepared statement
Expected result - get all the columns including the new column
Actual result - get the columns except the new column"
Currently the notify_*() method family broadcasts to all shards, so
schema merging code invokes them only on shard 0, to avoid doubling
notifications. We can simplify this by making the notify_*() methods
per-instance and thus shard-local.
We want the node's schema version to change whenever
table_schema_version of any table changes. The latter is calculated by
hashing mutations so we should also use mutation hash when calculating
schema digest.
The intent is to make data returned by queries always conform to a
single schema version, which is requested by the client. For CQL
queries, for example, we want to use the same schema which was used to
compile the query. The other node expects to receive data conforming
to the requested schema.
Interface on shard level accepts schema_ptr, across nodes we use
table_schema_version UUID. To transfer schema_ptr across shards, we
use global_schema_ptr.
Because schema is identified with UUID across nodes, requestors must
be prepared for being queried for the definition of the schema. They
must hold a live schema_ptr around the request. This guarantees that
schema_registry will always know about the requested version. This is
not an issue because for queries the requestor needs to hold on to the
schema anyway to be able to interpret the results. But care must be
taken to always use the same schema version for making the request and
parsing the results.
Schema requesting across nodes is currently stubbed (throws runtime
exception).
The version needs to change value not only on structural changes but
also temporal. This is needed for nodes to detect if the version they
see was already synchronized with or not even if it has the same
structure as the past versions. We also need to end up with the same
version on all nodes when schema changes are commuted.
For regular mutable schemas version will be calculated from underlying
mutations when schema is announced. For static schemas of system
keyspace it is calculated by hashing scylla version and column id,
because we don't have mutations at the time of building the schema.
We will be able to reuse the code in frozen_schema. We need to read
data in mutation form so that we can construct the correct
schema_table_version, and attach the mutations to schema_ptr.
It simplifies add_table_to_schema_mutation() interface.
The current code is also a bit confusing, partition_key is created
with the keyspaces() schema and used in mutations destined for the
columnfamilies() schema. It works, the types are the same, but looks a
bit scary.
We use boost::any to convert to and from database values (stored in
serlialized form) and native C++ values. boost::any captures information
about the data type (how to copy/move/delete etc.) and stores it inside
the boost::any instance. We later retrieve the real value using
boost::any_cast.
However, data_value (which has a boost::any member) already has type
information as a data_type instance. By teaching data_type intances about
the corresponding native type, we can elimiante the use of boost::any.
While boost::any is evil and eliminating it improves efficiency somewhat,
the real goal is growing native type support in data_type. We will use that
later to store native types in the cache, enabling O(log n) access to
collections, O(1) access to tuples, and more efficient large blob support.
It may happen that the user will migrate a table to Scylla which
compaction strategy isn't supported yet, such as Data tiered.
Let's handle that by falling back to size-tiered compaction
strategy and printing a warning message.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When building the in-memory schema for a column family, we were
ignoring compaction strategy class because of a bug in the
existing code. Example: suppose that you create a column family
with leveled compaction strategy. This option would be ignored
and the default strategy (size-tiered) would be used instead.
Found this problem while working on leveled compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
In Cassandra, when you create a new column family, a directory for it
immediately appears under the KS directory.
In the past, we have made a decision to delay that creation until the first
SSTable is created, which works well in general.
There is a problem, however, for backup restoration: the standard procedure to
call loadNewSSTables is to do that in an empty directory. But the directory
simply won't be there until we create the first SSTable: bummer!
In the current incarnation of the code in schema_tables.cc, there is already
some code that runs on CPU0 only. That is a perfect place for the directory
creation. So let's do it.
After this patch, a directory for the CF appears right after the CF creation.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Currently, we are calculating truncated_at during truncate() independently for
each shard. It will work if we're lucky, but it is fairly easy to trigger cases
in which each shard will end up with a slightly different time.
The main problem here, is that this time is used as the snapshot name when auto
snapshots are enabled. Previous to my last fixes, this would just generate two
separate directories in this case, which is wrong but not severe.
But after the fix, this means that both shards will wait for one another to
synchronize and this will hang the database.
Fix this by making sure that the truncation time is calculated before
invoke_on_all in all needed places.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Almost the whole file is (accidentally) indented four spaces to the
right for no reason. Fix that up because it's annoying as hell.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
When we query schema keyspaces after we have applied a delete mutation,
the dropped keyspace does not exist in the "after" result set. Fix the
merge_keyspaces() algorithm to take that into account.
Makes merge_keyspaces() really call to database::drop_keyspace() when a
keyspace is dropped.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>