The intent is to make data returned by queries always conform to a
single schema version, which is requested by the client. For CQL
queries, for example, we want to use the same schema that was used to
compile the query. The other node expects to receive data conforming
to the requested schema.
The shard-level interface accepts a schema_ptr; across nodes we use a
table_schema_version UUID. To transfer a schema_ptr across shards, we
use global_schema_ptr.
Because a schema is identified by UUID across nodes, requestors must
be prepared to be queried for the definition of that schema. They
must hold a live schema_ptr for the duration of the request, which
guarantees that the schema_registry will always know about the
requested version. This is not a burden, because for queries the
requestor needs to hold on to the schema anyway to be able to
interpret the results. But care must be taken to always use the same
schema version for making the request and parsing the results.
Schema requesting across nodes is currently stubbed (throws runtime
exception).
With 10 sstables/shard and 50 shards, we get ~10*50*50 = 25,000 log
messages about sstables being ignored. This is not reasonable.
Reduce the log level to debug, and move the message to database.cc:
at its original location, the containing function has nothing to do
with the message itself.
Reviewed-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Message-Id: <1452181687-7665-1-git-send-email-avi@scylladb.com>
We have an API that wraps open_file_dma which we use in some places, but in
many other places we call the reactor version directly.
This patch changes the latter to match the former, with the added benefit
of making it easier to change these interfaces later if needed.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <29296e4ec6f5e84361992028fe3f27adc569f139.1451950408.git.glauber@scylladb.com>
This exception was not caught as a std::exception by
report_failed_future's call to report_exception, because the
std::exception base class was not initialized.
Fixes #669.
Signed-off-by: Benoît Canet <benoit@scylladb.com>
max_purgeable was being incorrectly calculated because the code
that creates the vector of uncompacted sstables was wrong.
This value is used to determine whether or not a tombstone can
be purged.
Operator < is supposed to be used instead in the callback passed
as the third parameter to boost::set_difference.
This fix is a step towards closing issue #676.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Use steady_clock instead of high_resolution_clock where a monotonic
clock is required. high_resolution_clock is essentially a
system_clock (wall clock) and therefore must not be assumed
monotonic, since the wall clock may move backwards due to time/date
adjustments.
Fixes issue #638
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
From Paweł:
"This series fixes sstables::key_reader not respecting range inclusiveness
if the bounds were the keys that were present in the index summary.
Fixes #663."
When choosing the relevant range of buckets, it wasn't taken into
account whether the range bounds are inclusive or not. That may have
resulted in more buckets being read than necessary, which was a
condition not expected by the code responsible for looking for the
relevant keys inside the buckets.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
I am sure it's a compiler issue but I am not ready to give up and
upgrade just yet:
sstables/compaction.cc:307:55: error: converting to ‘std::unordered_map<int, long int>’ from initializer list would use explicit constructor ‘std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::unordered_map(std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::size_type, const hasher&, const key_equal&, const allocator_type&) [with _Key = int; _Tp = long int; _Hash = std::hash<int>; _Pred = std::equal_to<int>; _Alloc = std::allocator<std::pair<const int, long int> >; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::size_type = long unsigned int; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::hasher = std::hash<int>; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::key_equal = std::equal_to<int>; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::allocator_type = std::allocator<std::pair<const int, long int> >]’
stats->start_size, stats->end_size, {});
That's important for the compaction stats API, which will need the
stats data of each ongoing compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When a compaction job finishes, call a function to update the system
table COMPACTION_HISTORY. That's also needed for the compaction
history API.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Schemas using compact storage can have clustering keys with the trailing
components not set, effectively making them clustering key prefixes
instead of full clustering keys.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
In the case of non-compound dense tables, the column name is just the
value of the clustering key (which has only one component). The current
code just casts clustering_key to bytes_view, which works because there
is no additional metadata in single-element clustering keys.
However, that may change when the internal representation of the
clustering key changes, so explicitly extract the proper component.
This change will become necessary when clustering_key is replaced by
clustering_key_prefix.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Similar to Origin's off-heap memory accounting, memory_footprint is the
size of the queues multiplied by the structure size.
memory_footprint is used by the API to report the memory that is taken
by the summary.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Scylla changes:
sstable.cc: Remove file_exists() function which conflicts with seastar's
Amnon Heiman (2):
reactor: Add file_exists method
Add a wrapper for file_exists
Avi Kivity (2):
Merge "Introduce shared_future" from Tomasz
Merge "scripts: a few fixes in posix_net_conf.sh" from Vlad
Gleb Natapov (3):
rpc: not stop client in error state
avoid allocation in parallel_for_each if there is nothing to do
memory: fix size_to_idx calculation
Nadav Har'El (1):
test: fix use-after-free in timertest
Paweł Dziepak (1):
memory: use size instead of old_size to shrink memory block
Tomasz Grabiec (7):
file: Mark move constructor as noexcept
core: future: Add static asserts about type's noexcept guarantees
core: future: Drop now redundant move_noexcept flag
core: future_state: Make state getters non-destructive for non-rvalue-refs
core: future: Make get_available_state() noexcept
core: Introduce shared_future
Make json_return_type movable
Vlad Zolotarov (8):
scripts: posix_net_conf.sh: ban NIC IRQs from being moved by irqbalance
scripts: posix_net_conf.sh: exclude CPU0 siblings from RPS
scripts: posix_net_conf.sh: Configure XPS
scripts: posix_net_conf.sh: Add a new mode for MQ NICs
scripts: posix_net_conf.sh: increase some backlog sizes
core: to_sstring(): cleanup
core: to_sstring_strintf(): always use %g(or %lg) format for floating point values
core: prevent explicit calls for to_sstring_sprintf()
The add interface of the estimated histogram is confusing, as it is not
clear what units are used.
This patch removes the general add method and replaces it with add_nano,
which adds nanoseconds, and add, which takes a duration.
To be compatible with Origin, nanosecond values are translated to
microseconds.
Avi says:
"A small buffer size will hurt if we read a large file, but
a large buffer size won't hurt if we read a small file, since
we close it immediately."
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Let's move the code that prints that a compaction succeeded to after
the code that catches exceptions on either the read or write fiber.
Let's also get rid of done and use repeat instead in the read fiber.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, we don't even let the user know the filename that failed.
That information should be included in the message.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This assert (in the write fiber) would fail if the read fiber failed,
because the variable done would not be set to true.
Using assert here is very bad, because it prevents Scylla from
proceeding even though proceeding is possible.
To solve it, let's throw an exception if done is not true.
We already have code that waits for both the read and write fibers
and catches exceptions, if any.
Closes #523.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We use boost::any to convert to and from database values (stored in
serialized form) and native C++ values. boost::any captures information
about the data type (how to copy/move/delete etc.) and stores it inside
the boost::any instance. We later retrieve the real value using
boost::any_cast.
However, data_value (which has a boost::any member) already has type
information as a data_type instance. By teaching data_type instances
about the corresponding native type, we can eliminate the use of
boost::any.
While boost::any is evil and eliminating it improves efficiency somewhat,
the real goal is growing native type support in data_type. We will use that
later to store native types in the cache, enabling O(log n) access to
collections, O(1) access to tuples, and more efficient large blob support.
Now that #475 is solved and read_indexes() is guaranteed to return
disjoint sets of keys, the sstable key reader can be simplified:
only two key lookups are needed (the first and the last one), and
there is no need for range splitting.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
sstable level is set to zero by default, but it may be set to
a different value if a new sstable is the result of leveled
compaction. This is done outside write_components.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We were incorrectly setting s.header.min_index_interval to
BASE_SAMPLING_LEVEL, which luckily is the default value of the
min index interval. BASE_SAMPLING_LEVEL was also used as the
min index interval when checking whether the estimated number
of summary entries is greater than the limit.
To fix both problems, get the min index interval from the schema
and use that value to check the limit.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
read_indexes() will not work for a column family whose minimum
index interval is different from the sampling level, or whose
sampling level is lower than BASE_SAMPLING_LEVEL.
That's because the function was using the sampling level to determine
the interval between the indexes stored by the index summary.
Instead, a method from downsampling will be used to calculate the
effective interval based on both the minimum_index_interval and
sampling_level parameters.
Fixes issue #474.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"This patchset implements load_new_sstables, allowing one to move tables inside the
data directory of a CF, and then call "nodetool refresh" to start using them.
Keep in mind that for Cassandra, this is deemed an unsafe operation:
https://issues.apache.org/jira/browse/CASSANDRA-6245
It is still something we should not recommend - unless the CF is totally
empty and not yet used - but we can do a much better job on the safety front.
To guarantee that, the process works in four steps:
1) All writes to this specific column family are disabled. This is a horrible thing to
do, because dirty memory can grow much more than desired during this period. Throughout
this implementation, we will try to keep the time during which the writes are disabled
to its bare minimum.
While disabling the writes, each shard will tell us about the highest generation number
it has seen.
2) We will scan all tables that we haven't seen before. Those are any tables found in the
CF datadir that are higher than the highest generation number seen so far. We will link
them to new generation numbers that are sequential to the ones we have so far, and end up
with a new generation number that is returned to the next step.
3) The generation number computed in the previous step is now propagated to all CFs, which
guarantees that all further writes will pick generation numbers that won't conflict with
the existing tables. Right after doing that, the writes are resumed.
4) The tables we found in step 2 are passed on to each of the CFs. They can now load those
tables while operations to the CF proceed normally."
This will be used, for instance, when importing an SSTable.
We would like to force all new SSTables to sit at level 0 for
compaction purposes.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
In some situations (restoring a backup from load_new_sstables), we want to
change the SSTable generation number. This patch provides a procedure to
achieve that.
It does so by linking the old files to new ones, and then removing the old
ones.
The reason we link first instead of renaming is that we want to make
sure that, in case there is a crash in the middle, the old data is
still accessible.
If the crash happens after the link is done but before we start removing the
old files, that is fine: we will end up with duplicated data that will
disappear after the next compaction.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
That is the way to generate groups of files for the SSTables, so we must do it.
Because the links were mostly used by processes like snapshots and
backups, where an external tool would (hopefully) verify the results,
this was not that serious.
But we now plan to use links to bring things into the main directory,
so it must absolutely be done right.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
In some situations (restoring a snapshot, for instance) we may want a file
to get a different generation. This patch changes the code in create_links
slightly, so that it is able to link not only to a different location, but
also to files with a different name, possibly in the same location - which
is equivalent to a generation change.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This is done on behalf of load_new_sstables: we would like to know which
components are present in the file, without triggering a read of the
rest of the metadata.
As noted by Avi, using this directly can leave the SSTable in an
inconsistent state. We will have to fix it later, since this is not the
first offender.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Avoid using long for it; use a fixed-size type instead. Let's make it
signed rather than unsigned to avoid upsetting any code that we may
have converted.
Signed-off-by: Glauber Costa <glommer@scylladb.com>