There are various call-sites that explicitly check for EEXIST and
ENOENT:
$ git grep "std::error_code(E"
database.cc: if (e.code() != std::error_code(EEXIST, std::system_category())) {
database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) {
database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) {
database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) {
sstables/sstables.cc: if (e.code() == std::error_code(ENOENT, std::system_category())) {
sstables/sstables.cc: if (e.code() == std::error_code(ENOENT, std::system_category())) {
Commit 961e80a ("Be more conservative when deciding when to shut down
due to disk errors") turned these errors into a storage_io_exception
that is not expected by the callers, which causes 'nodetool snapshot'
functionality to break, for example.
Whitelist the two error codes to revert to the old behavior of
io_check().
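A rough sketch of the whitelisting idea (checked_io is a hypothetical stand-in for io_check, not the actual implementation):

```cpp
#include <cerrno>
#include <initializer_list>
#include <stdexcept>
#include <system_error>

// Run an I/O operation and translate unexpected system errors into a
// fatal condition, while letting whitelisted errno values (EEXIST,
// ENOENT) propagate unchanged so callers that expect them keep working.
template <typename Func>
auto checked_io(Func&& func) {
    try {
        return func();
    } catch (const std::system_error& e) {
        for (int ok : {EEXIST, ENOENT}) {
            if (e.code() == std::error_code(ok, std::system_category())) {
                throw; // expected by callers; rethrow as-is
            }
        }
        // Anything else is treated as a fatal storage I/O error here.
        throw std::runtime_error("Storage I/O error");
    }
}
```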
Message-Id: <1465454446-17954-1-git-send-email-penberg@scylladb.com>
Make the storage_io_exception error message less cryptic by
actually including the human-readable error message from
std::system_error...
Before:
nodetool: Scylla API server HTTP POST to URL '/storage_service/snapshots' failed: Storage io error errno: 2
After:
nodetool: Scylla API server HTTP POST to URL '/storage_service/snapshots' failed: Storage I/O error: 2: No such file or directory
We can improve this further by including the name of the file that the I/O
error happened on.
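A minimal sketch of how such a message could be composed (storage_io_error_message is a hypothetical helper, not the actual Scylla code):

```cpp
#include <string>
#include <system_error>

// Compose a readable storage I/O error message: include both the
// numeric errno value and the human-readable description supplied by
// the system error category.
std::string storage_io_error_message(int err) {
    std::error_code ec(err, std::system_category());
    return "Storage I/O error: " + std::to_string(ec.value()) + ": " + ec.message();
}
```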
Message-Id: <1465452061-15474-1-git-send-email-penberg@scylladb.com>
Several shards may share the same sstable - e.g., when restarting scylla
with a different number of shards, or when importing sstables from an
external source. Sharing an sstable is fine, but it can result in excessive
disk space use because the shared sstable cannot be deleted until all
the shards using it have finished compacting it. Normally, we have no idea
when the shards will decide to compact these sstables - e.g., with
size-tiered compaction, a large sstable will take a long time until we
decide to compact it. So what this patch does is to initiate compaction of
the shared sstables - on each shard using them - so that as soon as
possible after the restart, the original sstable is split into separate
sstables per shard, and the original sstable can be deleted. If several
sstables are shared, we serialize this compaction process so that each
shard only rewrites one sstable at a time. Regular compactions may happen
in parallel, but they will not be able to choose any of the shared
sstables because those are already marked as being compacted.
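The serialization described above can be sketched roughly as follows; all names here (shared_sstable_rewriter, marked_compacting, etc.) are hypothetical illustrations, not Scylla's actual API:

```cpp
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Shared sstables are queued and rewritten one at a time per shard.
// Marking each one "being compacted" up front keeps regular compactions
// from picking it while it waits in the queue.
struct shared_sstable_rewriter {
    std::queue<std::string> pending;            // shared sstables awaiting rewrite
    std::vector<std::string> marked_compacting; // hidden from regular compaction

    void enqueue(const std::string& name) {
        marked_compacting.push_back(name);      // mark immediately on discovery
        pending.push(name);
    }

    // Rewrite queued sstables serially: only one at a time on this shard.
    void run(const std::function<void(const std::string&)>& rewrite_one) {
        while (!pending.empty()) {
            rewrite_one(pending.front());       // split into per-shard sstables
            pending.pop();
        }
    }
};
```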
Commit 3f2286d0 increased the need for this patch, because since that
commit, if we don't delete the shared sstable, we also cannot delete
additional sstables which the different shards compacted with it. For one
scylla user, this resulted in so much excessive disk space use, that it
literally filled the whole disk.
After this patch, commit 3f2286d0 (and the discussion in issue #1318 on how
to improve it) is no longer necessary, because we will never compact a shared
sstable together with any other sstable - as explained above, the shared
sstables are marked as "being compacted", so regular compactions will
avoid them.
Fixes #1314.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1465406235-15378-1-git-send-email-nyh@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Previously, we were using a stat to decide if compaction should be
retried, but that's not efficient. The information is also lost
after the node is restarted.
After these changes, compaction will be retried until the strategy is
satisfied, i.e. there is nothing left to compact.
We will now be doing the following in a loop:
1) Get a compaction job from the compaction strategy.
2) If it cannot run, finish the loop.
3) Otherwise, compact this column family.
4) Go back to the start of the loop.
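The loop above can be sketched as follows; the function and type names here are hypothetical, chosen only to mirror the description:

```cpp
#include <functional>
#include <optional>
#include <vector>

using compaction_job = std::vector<int>; // stand-in for a set of sstables

// Keep asking the compaction strategy for a job and compacting until the
// strategy reports there is nothing left to do.
int compact_until_satisfied(
        const std::function<std::optional<compaction_job>()>& get_job,
        const std::function<void(const compaction_job&)>& compact) {
    int rounds = 0;
    for (;;) {
        auto job = get_job();     // 1) ask the strategy for a job
        if (!job) {
            break;                // 2) nothing to compact: strategy satisfied
        }
        compact(*job);            // 3) compact this column family
        ++rounds;                 // 4) and loop again
    }
    return rounds;
}
```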
By the way, the pending_compactions stat will be deprecated after this
commit. Previously, it was increased to indicate the need for
compaction and decreased when compaction finished. Now, we can
compact more than we asked for, so it could be decreased below zero.
Also, it's now the strategy that signals the need for compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <899df0d8d807f6b5d9bb8600d7c63b4e260cc282.1465398243.git.raphaelsc@scylladb.com>
Sometimes a metric previously reported from collectd is no longer
available. Previously, this caused scyllatop to log an exception to the
user - which in effect destroys the user experience and inhibits
monitoring other metrics. This patch makes scyllatop handle this
problem. It will display such metrics as 'not available', and exclude
them from sum and average computations.
Closes issue #1287.
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1465301178-27544-1-git-send-email-yoav@scylladb.com>
From Asias:
In f27e5d2a6 (messaging_service: Delay listening ms during boot up),
messaging_service startup is split into two stages. Adjust the api
registration code and fix up the messaging_service stop code.
This patch makes a few minor improvements in the parser:
- merge first and rest into 2-argument form of Word to define
identifier – should give some performance boost, simpler code
- replace Literal(keyword_string) with Keyword(keyword_string)
throughout - stricter parsing, avoids misinterpreting identifiers
as keywords
- replace expr.setResultsName("name") with expr("name") throughout –
this is a style change (no actual change in underlying parser
behavior), but I find this form easier to follow
- add calls to setName to make exceptions more readable
Message-Id: <005901d1bbd2$711f7bb0$535e7310$@austin.rr.com>
There are two problems:
1. _server_tls is not stopped
2. _server and _server_tls might not be created if
messaging_service::start_listen has not been called yet.
Since messaging_service is fully initialized in
storage_service::init_server, which calls
messaging_service::start_listen, we need to delay
the messaging_service api registration until after it.
The rate_moving_average is used by timed_rate_moving_average to return
its internal values.
If there are no timed events, the mean_rate is not properly initialized.
To solve this, the mean_rate is now initialized to 0 in the structure
definition.
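The fix amounts to a default member initializer; a minimal sketch (the field set shown is illustrative, not the exact structure from the source):

```cpp
// Give mean_rate a default member initializer so it is well-defined even
// when no timed events have been recorded yet.
struct rate_moving_average {
    double mean_rate = 0; // previously uninitialized when no timed events occurred
    unsigned long count = 0;
};
```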
Refs #1306
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1465231006-7081-1-git-send-email-amnon@scylladb.com>
This variable, if set to true, will activate
developer mode. It will be set by using the
-e option of docker run.
The xfs bind mount behavior and the cpuset behavior
will be set by using the relevant docker command
lines options and documented in the scylla/docker
howto.
Fixes: #1267
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1465213713-2537-1-git-send-email-benoit@scylladb.com>
Add support for defining a probability (a value in the [0,1] range)
for tracing the next CQL request.
Traces for requests that are chosen to be traced due to this feature
are not going to be flushed immediately.
Use the std::subtract_with_carry_engine random number engine (which
implements the "lagged Fibonacci" algorithm) for fast generation of
random integer values.
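For illustration, a sampler along these lines could look like the following sketch; trace_sampler and its members are hypothetical names, but std::ranlux24_base really is a std::subtract_with_carry_engine instantiation:

```cpp
#include <cstdint>
#include <random>

// Probability-based trace selection: a request is traced when a uniform
// draw in [0, 1) falls below the configured probability.
class trace_sampler {
    std::ranlux24_base _gen; // a std::subtract_with_carry_engine typedef
    double _probability;     // value in [0, 1]
public:
    explicit trace_sampler(double probability, std::uint32_t seed = 0)
        : _gen(seed), _probability(probability) {}

    bool should_trace() {
        std::uniform_real_distribution<double> dist(0.0, 1.0);
        return dist(_gen) < _probability;
    }
};
```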
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
A tracing session life cycle includes 3 stages:
1) Active: when new trace records are being added to this session.
2) Pending flush: when the session is over but not yet
flushed to the storage ("backend").
3) Flushing: when session's records are being flushed to the storage
and this process is not yet completed.
Sessions may accumulate in each of the stages above, and we should limit
the maximum number of sessions accumulated in each of them in order to
avoid an OOM situation.
Current in-tree implementation only limits the number of tracing sessions
accumulated in the first ("Active") stage.
Since currently every closing session is immediately flushed (as long
as "settraceprobability" is not implemented), the second stage never
accumulates tracing sessions.
The third stage is currently not controlled at all, and if, for instance,
we manage to push enough tracing sessions towards a slow storage backend,
they may accumulate there, consuming an uncontrolled amount of memory, and
may eventually consume all of it.
This patch fixes this unpleasant situation by applying the following strategy:
- Limit the total number of tracing sessions accumulated in all the stages
above, together, to a static value: 2 times the "flush threshold". The
factor of 2 is needed to allow new tracing sessions to accumulate in
stage 2 while sessions in stage 3 are still being processed.
- Forcefully flush sessions in stage 2 to the storage when their count
reaches the "flush threshold".
This ensures that there will be no more than (2 * "flush threshold")
sessions in total (in any stage) on each shard.
An advantage of this strategy is its simplicity - we only need a single
threshold to control all stages. If we find that we need finer-grained
control over each stage, we may add separate limits for each of them in
the future.
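A minimal sketch of this accounting, with hypothetical names (session_limits, may_create_session, should_force_flush) not taken from the source:

```cpp
#include <cstddef>

// A single flush threshold controls all three stages: new sessions are
// rejected when the total across stages reaches 2 * flush_threshold, and
// pending sessions are force-flushed once they alone reach flush_threshold.
struct session_limits {
    std::size_t flush_threshold;
    std::size_t active = 0, pending = 0, flushing = 0;

    std::size_t total() const { return active + pending + flushing; }

    bool may_create_session() const {
        return total() < 2 * flush_threshold;
    }

    bool should_force_flush() const {
        return pending >= flush_threshold;
    }
};
```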
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
* dist/ami/files/scylla-ami 72ae258...863cc45 (3):
> Move --cpuset/--smp parameter settings from scylla_sysconfig_setup to scylla_ami_setup
> convert scylla_install_ami to bash script
> 'sh -x -e' is not valid since all scripts converted to bash script, so remove them
A call to tracing::tracing::create_session() doesn't promise that a session
is created. Check that the session was actually created before trying to
use it.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Currently we only shut down on EIO. Expand this to shut down on any
system_error.
This may cause us to shut down prematurely due to a transient error,
but this is better than not shutting down due to a permanent error
(such as ENOSPC or EPERM). We may whitelist certain errors in the future
to improve the behavior.
Fixes #1311.
Message-Id: <1465136956-1352-1-git-send-email-avi@scylladb.com>
It was discussed that leveled strategy may not benefit from the parallel
compaction feature because almost all compaction jobs will have similar
sizes. It was also found that leveled strategy wasn't working correctly
with it because two overlapping sstables (targeting the same level)
could be created in parallel by two ongoing compactions.
Fixes #1293.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <60fe165d611c0283ca203c6d3aa2662ab091e363.1464883077.git.raphaelsc@scylladb.com>
From Duarte:
This patchset adds the range_tombstone_list data structure,
used to hold a set of disjoint range tombstones, and changes
the internal representation of row tombstones to use that
data structure.
Fixes #1155
[tgrabiec: Added compound_wrapper::make_empty(const schema&) overload
to fix compilation failure in tracing code]
This patch enables the RANGE_TOMBSTONES supported feature, meaning
that the node is capable of accepting row entry tombstones as range
tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch uses the composite_marker to add inclusiveness information
to the prefixes of a range tombstone.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Since Scylla now supports proper range tombstones, the code for
reading ranges from sstables and converting them to overlapping
tombstones is no longer necessary, and is, in fact, wasteful as
the internal representation converts overlapping tombstones back to
ranges.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch moves the difference between two mutation_partition's
row_tombstones inside the range_tombstone_list.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the type of the mutation partition's row_tombstones
to be a range_tombstone_list, so that they are now represented as a
set of disjoint ranges. All of its usages are updated accordingly.
Fixes #1155
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the range tombstones feature, which is not enabled
yet, to the storage_service, so that consumers can query for it.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the gms::feature destructor so it
checks whether the gossiper has been stopped before trying
to unregister the feature.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts the code from sstables/partition.cc which is used
to transform a set of range tombstones into a set of overlapping
scylladb tombstones.
The range_tombstone_merger will be used to send mutations to nodes not
yet updated to support the internal range tombstone representation.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This class is responsible for representing a set of range tombstones
as non-overlapping disjoint sets of range tombstones.
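The invariant can be illustrated with a toy version using integer bounds instead of clustering prefixes; the merge logic below is a simplification for illustration, not the actual range_tombstone_list code:

```cpp
#include <algorithm>
#include <iterator>
#include <map>

// Keep a set of closed ranges disjoint: any insertion that overlaps (or
// touches) existing ranges is merged with them.
struct disjoint_ranges {
    std::map<int, int> ranges; // start -> end, kept disjoint

    void insert(int start, int end) {
        auto it = ranges.lower_bound(start);
        // A range starting before `start` may still overlap it.
        if (it != ranges.begin()) {
            auto prev = std::prev(it);
            if (prev->second >= start) {
                it = prev;
            }
        }
        // Absorb every range that overlaps [start, end].
        while (it != ranges.end() && it->first <= end) {
            start = std::min(start, it->first);
            end = std::max(end, it->second);
            it = ranges.erase(it);
        }
        ranges.emplace(start, end);
    }
};
```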
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the range_tombstone class, composed of
a [start, end] pair of clustering_key_prefixes, the type
of inclusiveness of each bound, and a tombstone.
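The shape described above could be sketched like this; member names and types are placeholders (real bounds are clustering_key_prefixes, not strings):

```cpp
#include <string>

enum class bound_kind { inclusive, exclusive };

// Placeholder for the deletion marker carried by the range.
struct tombstone_info {
    long timestamp;
    long deletion_time;
};

// A range tombstone pairs a [start, end] clustering range, the
// inclusiveness of each bound, and the tombstone itself.
struct range_tombstone {
    std::string start;     // stand-in for a clustering_key_prefix
    bound_kind start_kind;
    std::string end;
    bound_kind end_kind;
    tombstone_info tomb;
};
```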
Signed-off-by: Duarte Nunes <duarte@scylladb.com>