We saw the message twice for the same feature check. This is a bit
confusing.
INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {}
INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {}
INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {}
INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {}
This is because
ss._range_tombstones_feature = gms::feature(RANGE_TOMBSTONES_FEATURE);
ss._large_partitions_feature = gms::feature(LARGE_PARTITIONS_FEATURE);
The first message is printed when gms::feature(RANGE_TOMBSTONES_FEATURE)
is constructed. The second message is printed when the
ss._range_tombstones_feature is copy-constructed.
Add the helper function to enable the a feature and log the feature is
enabled.
When a feature is enabled, we see
INFO 2016-12-15 11:29:32,443 [shard 0] gossip - Feature LARGE_PARTITIONS is enabled
INFO 2016-12-15 11:29:32,443 [shard 0] gossip - Feature RANGE_TOMBSTONES is enabled
in the log.
"This series makes Scylla limit size of query results it produces in case they
grow unreasonably large. This is possible because CQL paging queries do not
guarantee that the returned page is going to have page_size rows and pages
smaller than tha *do not* indicate end of stream. Non-paged queries and Thrift
requests do not have such flexibility and they also get all the requested data
(though their memory usage is still accounted for and may limit paged queries).
There is a maximum result size (1 MB) and all results builders will stop after
reaching it. Moreover, there is a per-shard limitation on the amount of memory
used by all results combined (10%). To avoid tiny results a query has to
reserve (wait if necessary) 4 kB before starting executing, after that it can
consume more memory without any additional waiting provided it is below
individual and shard-local limits.
Enabling the cluster to return less rows than requested also means some changes
for the coordinator. Firstly, if it receives such short result from a replica
retrying it with a larger limit obviously makes no sense whatsoever. Instead,
in such cases the coordinator removes the clustering rows it has incomplate
information about and sends short result back to the client. Moreover, even
if no replica returned short response reconciliation may have made it so. In
this case, the coordinator do not necessairly need to retry the query as well.
Unfortunately, with the current implementation short responses ruin data
queries since they will cause a digest mismatch.
Three new metrics were added:
* database_bytes_total_result_memory -- total memory used by query results
* database_total_operations_short_data_queries -- data queries that were
limited by size, particulary bad as it basically forces coordinator to
retry them as mutation queries
* database_total_operations_short_mutation_queries -- mutation queries limited
by size"
* 'pdziepak/short-paged-reads/v4' of github.com:cloudius-systems/seastar-dev:
storage_proxy: clean up after primary_key introduction
cql3: allow short reads with paged queries
storage_proxy: handle intentional short reads
storage_proxy: make sure coordinator has complete data
storage_proxy: honour partition limit
storage_proxy: use cmd limits to determine that replica reached end
db: add metrics for short reads and memory used for results
data_query: limit result size
mutation_query: limit result size
db: create result_memory_accounters when starting query
query_builder: add partition_slice getter
reconcilable_result: keep result_memory_tracker object
mutation_compactor: honour stop_iteration from consumers
db: add result_memory_limiter
query: add result size limiter
reconcilable_result: properly propagate short_read flag
query_pagers: handle short reads properly
query: allow short reads
serializer_impl: add serializer for bool_class<Tag>
primary_key was introduced as a replacement for
std::pair<dht::decorated_key, std::optional<clustering_key>>. In order
to simplify patch introducing its fields were named 'first' and
'second'. This patch changes the names to something less useless,
removes old row_address alias and removes is_missing_rows() in favour of
primary_key::less_compare_clustering comparator.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
If the result is going to be too large the replica may decide to make it
shorter and coordinator should handle this properly (i.e. do not retry).
Moreover, coordinator could avoid some retries by setting the short_read
flag itself.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
got_incomplete_information() ensures that the coordinator has received
all required data from all replicas.
(see 77dbe3c12f "storage_proxy: fix
reconciliation with limits" for the examples when that may not be the
case).
However, this function is called only if reconciled result has at least
as much rows as the user asked for. This was correct when we had only
total row limit: if the result was shorter than that either all replicas
sent all data they have or the coordinator will retry anyway. However,
since then we got partition limit and per partition row limit and a
request may be limited by one of these while being still below the total
row limit.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
At the moment the coordinator does not care much for the partition
limit. In particular it doesn't check whether after reconciliation the
result still contains enough partitions.
This patch makes it honour the partition limit and increase it in the
retried queries if necessary.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Coordinator may retry a query with larger limits. However, code
determining whether replica has no more data always used the original
limits. This may cause a livelock.
For example, consider cluster having the following partitions (deletions
cover live cells):
node1:
pk=0, v=0
pk=1, v=1
node2
delete pk=0
delete pk=1
pk=2, v=2
pk=3, v=3
Now, if there is a query SELECT * FROM cf LIMIT 2 the first node is
going to send partitions 0 and 1 while second node is going to send 2
and 3 + tombstones for 0 and 1. The coordinator will decide that it
needs to retry the request with larger row limit since node1 may have
some information about partitions 2 and 3 that are newer than what node2
has sent.
However, when the second response arrives node1 will still sent only two
rows since it has no more data. Because the coordinator uses original
row limit it will not notice that this node reached the end and we are
going to get another retry without making any progress.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This pach ensures than when we start executing a query a minimum result
size is reserved from result_memory_limiter.
Moreover, range queries need a way of merging memory usage information
from different shards.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch introduces an infrastrucutre for limiting result size.
There is a shard-local limit which makes sure that all results combined
do not use more than 10% of the shard memory.
There is also an invidual limit which restricts a result to 4 MB.
In order
In order to avoid sending tiny results there is minimum guaranteed size
(4 kB), which the query needs to reserve before it starts producing the
result.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
reconcilable_result can be merged with another or transformed into
query::result. Make sure that short_read information is never lost.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Currently, the paging implementation assumes that the server retunrs
either as many rows as it was asked for all reached the end. Soon,
that's not going to be true so instead of making any assumptions about
the number of the rows returned use the new "short read" flag to
determine whether there is going to be more data.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When paging is used the cluster is allowed to return less rows than the
client asked for. However, if such possibility is used we need a way of
telling that to the coordinator and the paging implementation so that
they can differentiate between short reads caused by the replica running
out of data to sent and short reads caused by any other means.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The test assumed that mutations added to the commitlog are visible to
reads as soon as a new segment is opened. That's not true because
buffers are written back in the background, and new segment may be
active while the previous one is still being written or not yet
synced.
Fix the test so that it expectes that the number of mutations read
this way is <= the number of mutations read, and that after all
segments are synced, the number of mutations read is equal.
Message-Id: <1481630481-19395-1-git-send-email-tgrabiec@scylladb.com>
"The current criteria for memtable flush is not being respected. The
problem is demonstrated to happen when the dirty memory group is over
limit, and so is the system table extra allowance. In that situation,
both the normal region and the system table region will be under
pressure and try to flush.
More specifically, because the normal region inherits from the system
region, if the normal region is under pressure (over the soft limit
threshold), the system region will certainly be as well, even though it
has an extra allowance. This is because after virtual dirty, we start
blocking when we reach half the region, but memory itself can grow up to
100 % of the region. So the total amount of memory used will be
certainly bigger than the system pressure threshold, which is now 50 %
plus the allowance.
To fix that, this patch reworks the flush logic so that the regions are
not dependent on each other.
Fixes#1918"
* 'flush-criteria-v6' of github.com:glommer/scylla:
config: get rid of memtable_total_space
database: rework dirty memory hierarchy
system keyspace: write batchlog mutation in user memory
database: remove flush_token
database: abstract pressure condition notification
database: encapsulate semaphore_units into a flush_permit
database: remove friendship declaration
database: simplify flush_one
database: make memtable_list aware in cases it can't flush
In 7c873f0d (repair: Reduce unnecessary streaming traffic), we optimize
in cases when 1) all the remote nodes has the same checksum and 2) local node
has zero checksum.
In this patch, we make the optimization more generec and cover more cases.
1) With RF = 3, 3 nodes cluster, rm data on node3 then run repair on node2
Before:
INFO 2016-12-09 16:24:31,961 [shard 0] repair - Found differing range (-4091524285777924069, -4086237930244473115]
on nodes {127.0.0.3, 127.0.0.1}, in = {127.0.0.3, 127.0.0.1}, out = {127.0.0.3, 127.0.0.1}
INFO 2016-12-09 16:24:31,963 [shard 0] repair - Found differing range (-609511120964672970, -605253169726090861]
on nodes {127.0.0.1, 127.0.0.3}, in = {127.0.0.1, 127.0.0.3}, out = {127.0.0.1, 127.0.0.3}
INFO 2016-12-09 16:24:31,964 [shard 0] repair - Found differing range (-7655412157560911259, -7652234653747163387]
on nodes {127.0.0.3, 127.0.0.1}, in = {127.0.0.3, 127.0.0.1}, out = {127.0.0.3, 127.0.0.1}
INFO 2016-12-09 16:24:31,965 [shard 0] repair - Found differing range (-4133815130045531703, -4128528774512080749]
on nodes {127.0.0.3, 127.0.0.1}, in = {127.0.0.3, 127.0.0.1}, out = {127.0.0.3, 127.0.0.1}
INFO 2016-12-09 16:24:31,967 [shard 0] repair - Found differing range (-605253169726090861, -600995218487508751]
on nodes {127.0.0.1, 127.0.0.3}, in = {127.0.0.1, 127.0.0.3}, out = {127.0.0.1, 127.0.0.3}
INFO 2016-12-09 16:24:31,968 [shard 0] repair - Found differing range (438510347741343837, 441475345714861354]
on nodes {127.0.0.1, 127.0.0.3}, in = {127.0.0.1, 127.0.0.3}, out = {127.0.0.1, 127.0.0.3}
After:
INFO 2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-660606535827658284, -656348584589076175]
on nodes {127.0.0.1, 127.0.0.3}, in = {}, out = {127.0.0.3}
INFO 2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-4234255885181099833, -4228969529647648879]
on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3}
INFO 2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-4228969529647648879, -4223683174114197925]
on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3}
INFO 2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-4223683174114197925, -4218396818580746971]
on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3}
INFO 2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-7728494745277112315, -7725317241463364443]
on nodes {127.0.0.3, 127.0.0.1}, in = {}, out = {127.0.0.3}
INFO 2016-12-09 16:30:29,204 [shard 0] repair - Found differing range (-720217853167807818, -715959901929225709]
on nodes {127.0.0.1, 127.0.0.3}, in = {}, out = {127.0.0.3}
Before, we need to fetch data from both node 1 and node 3 and send data back to node 1 and node 3, i.e., 2 IN, 2 OUT
After, we only need to fetch data from node 3, i.e. 0 IN, 1 OUT
We saved 3X traffic, with higher RF, we can save even more.
2) With RF = 3, 3 nodes cluster, rm data on node3 then run repair on node3
Before:
INFO 2016-12-09 16:20:11,448 [shard 0] repair - Found differing range (-8533861887892628919, -8052600134279395253]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}
INFO 2016-12-09 16:20:11,465 [shard 0] repair - Found differing range (7190719703944308372, 7692358524564683543]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}
INFO 2016-12-09 16:20:11,486 [shard 0] repair - Found differing range (-3305328316052774469, -2671876682129336880]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}
INFO 2016-12-09 16:20:11,494 [shard 0] repair - Found differing range (-2190610927722759275, -1305178847032904465]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO 2016-12-09 16:20:11,518 [shard 0] repair - Found differing range (-4747032371925842389, -4070378863644120252]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO 2016-12-09 16:20:11,519 [shard 0] repair - Found differing range (-1137497074548854552, -592479316010344531]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}
After:
INFO 2016-12-09 16:29:22,433 [shard 0] repair - Found differing range (67885601051654285, 447405341661896387]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO 2016-12-09 16:29:22,454 [shard 0] repair - Found differing range (-2190610927722759275, -1305178847032904465]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO 2016-12-09 16:29:22,473 [shard 0] repair - Found differing range (2523396860109747637, 3083778975065200884]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO 2016-12-09 16:29:22,474 [shard 0] repair - Found differing range (-3305328316052774469, -2671876682129336880]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}
INFO 2016-12-09 16:29:22,487 [shard 0] repair - Found differing range (-4747032371925842389, -4070378863644120252]
on nodes {127.0.0.2, 127.0.0.1}, in = {127.0.0.2}, out = {}
INFO 2016-12-09 16:29:22,493 [shard 0] repair - Found differing range (-1137497074548854552, -592479316010344531]
on nodes {127.0.0.1, 127.0.0.2}, in = {127.0.0.1}, out = {}
This shows the new more generic methods covers the optimization we had before as well.
Issue #1918 describes a problem, in which we are generating smaller
memtables than we could, and therefore not respecting the flush
criteria.
That happens because group sizes (and limits) for pressure purposes, and
the the soft threshold is currently at 40 %. This causes system group's
soft threshold to be way below regular's virtual dirty limit and close
to regular group's soft threshold. The system group was very likely to
become under soft pressure when regular was because writes to regular
group are not yet throttled when they cross both soft thresholds.
This is a direct consequence of the linear hierarchy between the regions
and to guarantee that it won't happen we would have acqire the semaphore
of all ancestor regions when flushing from a child region. While that
works, it can lead to problems on its own, like priority inversion if
the regions have different priorities - like streaming and regular, and
groups lower in the hierarchy, like user, blocking explicit flushes
from their ancestors
To fix that, this patch reorganizes the dirty memory region groups so
that groups are now completely independent. As a disadvantage, when
streaming happen we will draw some memory from the cache, but we will
live with it for the time being.
Fixes#1918
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Batchlog is a potentially memory-intensive table whose workload is
driven by user needs, not system's. Move it to the user dirty memory
manager.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We had a flush_token structure in addition to the flush_permit because
we needed to keep a pointer to the dirty_memory_manager and apply
changes to the region group upon the region destruction. Since Tomek's
latest series, this is no longer needed and now this structure doesn't
have a place in the world anymore. Simplify the code by removing it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Done in a separate patch to reduce clutter in the main patch.
Soon we'll be testing for one more condition.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We will soon need to hold more than a semaphore_units<> object per
flush, potentially.
Preparation patch for that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
flush_one has to make sure that we're using the correct
dirty_memory_manager object, because we could be flushing from a region
group different than the one the flush request originated.
It's simpler to just assume flush_one will be dealing with the right
object, and use a different object instead of "this" when calling it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Some of our CFs can't be flushed. Those are the ones who are not marked
as having durable writes. We treat them just the same from the point of
view of the flush logic, but they provide a function that doesn't do
anything and just returns right away.
We already had troubles with that in the past, and that also poses a
problem for an upcoming patch reworking the flush memtable pick
criteria.
It's easier, simpler, and cleaner, to just make the memtable_list aware
it can't flush. Achieving that is also not very complicated: we just
need a special constructor that doesn't take a seal function and then we
make sure that it is initialized to an empty std::function
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently, we weren't completing a query as early as possible if it
reached the partition limit, we instead had to wait until reaching the
end of the specified partition ranges. This patches fixes that by
including a check to the partition limit in the termination condition.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161213114559.26438-1-duarte@scylladb.com>
column_mapping is not safe to access across shards, because data_type
is not safe to access. One of the manifestation of this is that
abstract_type::is_value_compatible_with() always fails if the two
types belong to different shards.
During replay, column_mapping lives on the replaying shard, and is
used by converting_mutation_partition_applier against the schema on
the target shard. Since types in the mapping will be considered
incompatible with types in the schema, all cells will be dropped.
Fix by using column_mapping in a safe way, by copying it to the target
shard if necessary. Each shard maintains its own cache of column
mappings.
Fixes#1924.
Message-Id: <1481310463-13868-1-git-send-email-tgrabiec@scylladb.com>
* seastar 0773e98...6fbd792 (2):
> tls: Only run our "verify" function in client session
> Merge "Clean the metric definition" from Amnon
Includes patch from Amnon adjusting the metrics registration due to seastar
API changes.
* seastar 0a74317...0773e98 (6):
> tls: Add support for client cetrificate verification & priority strings
> semaphore: add consume_units
> semaphore: add available_units()
> thread: check need_preempt for threads in a scheduling group as well
> tutorial: fix semaphore example, and text
> stop_iteration: add && and || operators