Since reading mutations can consume a large amount of memory, which, moreover,
is not predicatable at the time the read is initiated, restrict the number
of reads to 100 per shard. This is more than enough to saturate the disk,
and hopefully enough to prevent allocation failures.
Restriction is applied in column_family::make_sstable_reader(), which is
called either on a cache miss or if the cache is disabled. This allows
cached reads to proceed without restriction, since their memory usage is
supposedly low.
Reads from the system keyspace use a separate semaphore, to prevent
user reads from blocking system reads. Perhaps we should select the
semaphore based on the source of the read rather than the keyspace,
but for now using the keyspace is sufficient.
If a CF does not have any sstables at all, we should treat it
as having a replay position of zero. However, since we also
must deal with potential re-sharding, we cannot just set
shard->uuid->zero initially, because we don't know what shards
existed.
Go through all CF:s post map-reduce, and for every shard where
a CF does not have an RP-mapping (no sstables found), set the
global min pos (for shard) to zero.
Fixes#1372
Message-Id: <1465991864-4211-1-git-send-email-calle@scylladb.com>
Try to emulate the origin behaviour for batch reply. They use an
explicit write handler, combinging
1.) Hinting to all known dead endpoints
2.) Sending to all persumed live, requiring ack from all
3.) Hinting to endpoint to which send failed.
We don't have hints, so try to work around by doing send with
cl=ALL, and if send fails (wholly or partially), retain the
batch in the log.
This is still slight behavioural difference, and we also risk
filling up the batch log in extreme cases. (Though probably not
in any real environment).
Refs #1222
Message-Id: <1466444170-23797-1-git-send-email-calle@scylladb.com>
Using an ordering mechanism better than rw-locks for write/flush
means we can wait for pending write in batch mode, and coalesce
data from more than one mutation into a chunk.
It also means we can wait for a specific read+flush pair (based on
file position).
Downside is that we will not do parallel writes in batch mode (unless
we run out of buffer), which might underutilize the disk bandwidth.
Upside is that running in batch mode (i.e. per-write consistency)
now has way better bandwidth, and also, at least with high mutation
rate, better average latency.
Message-Id: <1465990064-2258-1-git-send-email-calle@scylladb.com>
Config provides operators << >> for string_map which makes it impossible
to have generic stream operators for unordered_map. Fix it by making
string_map a separate type and not just an alias.
Message-Id: <20160602102642.GJ9939@scylladb.com>
compact_on_idle will lead users to thinking we're talking about sstable
compaction, not log-structured-allocator compaction.
Rename the variable to reduce the probability of confusion.
Message-Id: <1464261650-14136-1-git-send-email-avi@scylladb.com>
"This series introduces some additional metrics (mostly) in a storage_proxy and
a database level that are meant to create a better picture of how data flows
in the cluster.
First of all where possible counters of each category (e.g. total writes in the storage
proxy level) are split into the following categories:
- operations performed on a local Node
- operations performed on remote Nodes aggregated per DC
In a storage_proxy level there are the following metrics that have this "split"
nature (all on a sending side):
- total writes (attempts/errors)
- writes performed as a result of a Read Repair logic
- total data reads (attempts/completed/errors)
- total digest reads (attempts/completed/errors)
- total mutations data reads (attempts/completed/errors)
In a batchlog_manager:
- writes performed as a result of a batchlog replay logic
Thereby if for instance somebody wants to get an idea of how many writes
the current Node performs due to user requested mutations only he/she has
to take a counter of total writes and subtract the writes resulted by Read
Repairs and batchlog replays.
On a receiving side of a storage_proxy we add the two following counters:
- total number of received mutations
- total number of forwarded mutations (attempts/errors)
In order to get a better picture of what is going on on a local Node
we are adding two counters on a database level:
- total number of writes
- total number of reads
Comparing these to total writes/reads in a storage_proxy may give a good
idea if there is an excessive access to a local DB for example."
This patch ensures type_parser can handle user defined types. It also
prefixes user_type_impl::make_name() with
org.apache.cassandra.db.marshal.UserType.
Fixes#631
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements the merge_types() function,
allowing mutations to user defined types to be applied.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This is kind of sorting, so it belongs there, but it also fixes a bug in
storage_proxy::get_read_executor() that assumes filter_for_query() do
not change order of nodes in all_nodes when extra replica is chosen.
Otherwise if coordinator ip happens to be last in all_nodes then it will
be chosen as extra replica and will be quired twice.
Message-Id: <1460549369-29523-1-git-send-email-gleb@scylladb.com>
"There is a need to have an ability to detect whether a feature is
supported by entire cluster. The way to do it is to advertise feature
availability over gossip and then each node will be able to check if all
other nodes have a feature in question.
The idea is to have new application state SUPPORTED_FEATURES that will contain
set of strings, each string holding feature name.
This series adds API to do so.
The following patch on top of this series demostreates how to wait for features
during boot up. FEATURE1 and FEATURE2 are introduced. We use
wait_for_feature_on_all_node to wait for FEATURE1 and FEATURE2 successfully.
Since FEATURE3 is not supported, the wait will not succeed, the wait will timeout.
--- a/service/storage_service.cc
+++ b/service/storage_service.cc
@@ -95,7 +95,7 @@ sstring storage_service::get_config_supported_features() {
// Add features supported by this local node. When a new feature is
// introduced in scylla, update it here, e.g.,
// return sstring("FEATURE1,FEATURE2")
- return sstring("");
+ return sstring("FEATURE1,FEATURE2");
}
std::set<inet_address> get_seeds() {
@@ -212,6 +212,11 @@ void storage_service::prepare_to_join() {
// gossip snitch infos (local DC and rack)
gossip_snitch_info().get();
+ gossiper.wait_for_feature_on_all_node(std::set<sstring>{sstring("FEATURE1"), sstring("FEATURE2")}, std::chrono::seconds(30)).get();
+ logger.info("Wait for FEATURE1 and FEATURE2 done");
+ gossiper.wait_for_feature_on_all_node(std::set<sstring>{sstring("FEATURE3")}).get();
+ logger.info("Wait for FEATURE3 done");
+
We can query the supported_features:
cqlsh> SELECT supported_features from system.peers;
supported_features
--------------------
FEATURE1,FEATURE2
FEATURE1,FEATURE2
(2 rows)
cqlsh> SELECT supported_features from system.local;
supported_features
--------------------
FEATURE1,FEATURE2
(1 rows)"
"batchlog_manager is modified to allow the storage_service to initate a bachlog
replay operation.
Refs #1085.
Tested with tests/batchlog_manager_test and batch_test.py"
With big rows I see contention in XFS allocations which cause reactor
thread to sleep. Commitlog is a main offender, so enlarge extent to
commitlog segment size for big files (commitlog and sstable Data files).
Message-Id: <20160404110952.GP20957@scylladb.com>
Cassandra-derived tools (such as sstable2json) may write commitlog segments,
that Scylla cannot recognize. Since we now write them with a distinct name,
we can recognize the name and ignore these segments, as we know the data they
contain is not interesting.
Fixes#1112.
Message-Id: <1459356904-20699-1-git-send-email-avi@scylladb.com>
During decommission, the storage_service::unbootstrap() needs to
initiate a batchlog replay operation. To sync the replay operation
initiated by the timer in batchlog_manager and storage_service, a
semaphore is introduced. To simplify the semaphore locking, the
management code now always runs on shard zero, but the real work is
distruted to all shards.
commitlog's sync period is initialized as the batch period, and not as the
sync period itself as it should be.
I've found this by code inspection, but unless I am missing something
really fundamental, this seems to be completely wrong. It's been working
fine because in our defaults, I have checked that both variables default to
the same value. But it seems to me that as long as anyone would change one
of them, the behavior wouldn't be as expected.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2e7c565242fe5d4481a3ee8b0ba425ef14f5e42a.1459252783.git.glauber@scylladb.com>
Seastar wrongly limits the number of concurrent submit_to()s to a single
remote shard. This can cause an ABBA deadlock:
fiberA fiberB (x127)
submit_to(0) # lock schema
<- returns
submit_to(0) # lock schema (waits)
submit_to(0) # do work (waits)
The fiberBs wait for fiberA, which in turn waits for a fiberB to return.
While the correct fix is to remote the client-side limit and replace it
with a server-side per-verb limit, we start with a simpler fix that
replaces the blocking lock call with a non-blocking call, removing the
deadlock.
Fixes#1088.
Message-Id: <1459095357-28950-1-git-send-email-avi@scylladb.com>
At the momment, the callbacks returns void, it is impossible to wait for
the callbacks to complete. Make the callbacks runs inside seastar
thread, so if we need to wait for the callback, we can make it call
foo_operation().get() in the callback. It is easier than making the
callbacks return future<>.
This patch makes sure that every time we need to create a new generation number -
the very first step in the creation of a new SSTable, the respective CF is already
initialized and populated. Failure to do so can lead to data being overwritten.
Extensive details about why this is important can be found
in Scylla's Github Issue #1014
Nothing should be writing to SSTables before we have the chance to populate the
existing SSTables and calculate what should the next generation number be.
However, if that happens, we want to protect against it in a way that does not
involve overwriting existing tables. This is one of the ways to do it: every
column family starts in an unwriteable state, and when it can finally be written
to, we mark it as writeable.
Note that this *cannot* be a part of add_column_family. That adds a column family
to a db in memory only, and if anybody is about to write to a CF, that was most
likely already called. We need to call this explicitly when we are sure we're ready
to issue disk operations safely.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
During bootstrapping additional copies of data has to be made to ensure
that CL level is met (see CASSANDRA-833 for details). Our code does
that, but it does not take into account that bootstraping node can be
dead which may cause request to proceed even though there is no
enough live nodes for it to be completed. In such a case request neither
completes nor timeouts, so it appear to be stuck from CQL layer POV. The
patch fixes this by taking into account pending nodes while checking
that there are enough sufficient live nodes for operation to proceed.
Fixes#965
Message-Id: <20160303165250.GG2253@scylladb.com>
Currently schema changes are only logged at coordinator node which
initiates the change. It would be helpful in post morten analysis to
also see when and how schema changes are resolved when applied on
other nodes.
Message-Id: <1456953095-1982-1-git-send-email-tgrabiec@scylladb.com>