The type of the given entry may not be available, which manifests as an
ENOENT or ENOTDIR value set in errno by the fstat() call for this entry.
In this case engine().file_type() will return a disengaged
optional<directory_entry_type> value.
Return a future carrying a std::runtime_error exception in this case.
This prevents any further use of the disengaged optional value by the code
in the normal flow.
The exception is propagated to the caller, and it is the caller's
responsibility to handle it.
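A minimal sketch of the resulting check, assuming the reactor API of the
time, where engine().file_type() yields a future<optional<directory_entry_type>>
(the wrapper name is illustrative):

    #include "core/reactor.hh"

    future<directory_entry_type> checked_file_type(sstring name) {
        return engine().file_type(name).then([] (auto type_opt) {
            if (!type_opt) {
                // fstat() reported ENOENT/ENOTDIR: fail the future instead
                // of letting the normal flow touch a disengaged optional.
                return make_exception_future<directory_entry_type>(
                        std::runtime_error("file type is unavailable"));
            }
            return make_ready_future<directory_entry_type>(*type_opt);
        });
    }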
Fixes #2071
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"This series uses the newly added histogram and label support to add metrics to
the storage_proxy and to the column_family.
This would add latency and histogram and the missing metrics from column family."
* 'amnon/histogram_metrics' of github.com:cloudius-systems/seastar-dev:
  database: add metrics registration for the column family
storage_proxy: add read and write latency histogram
estimated_histogram: returns a metrics histogram
This patch adds metrics registration to the column_family.
Using labels, each table's metrics are labeled with its keyspace and column
family name.
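A minimal sketch of per-table registration, assuming the seastar::metrics
API; the metric names and the stats wired up here are illustrative, and
histograms register the same way:

    namespace sm = seastar::metrics;

    void column_family::set_metrics() {
        // _metrics is an sm::metric_groups member; the label instances
        // carry the keyspace and column family names on every series.
        auto ks = sm::label("ks")(_schema->ks_name());
        auto cf = sm::label("cf")(_schema->cf_name());
        _metrics.add_group("column_family", {
            sm::make_counter("total_writes", _stats.writes,
                    sm::description("Total number of writes"), {ks, cf}),
            sm::make_counter("total_reads", _stats.reads,
                    sm::description("Total number of reads"), {ks, cf}),
        });
    }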
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch changes the database apply path so that it also
generates the mutations for the column family's views and
sends them to the paired view replicas.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This adds the "VIEW" write type to the "write_type" enum.
To be honest, I don't understand why the "write_type" distinction
is important.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
By removing the default case in the switch statement over a write_type
variable, we ensure the compiler warns us about lack of exhaustiveness
in case we add a value to the enum but forget to change the
corresponding operator<<().
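A sketch of the resulting operator; the enumerator list mirrors the write
types named elsewhere in the series plus VIEW, though the exact set may
differ:

    #include <ostream>
    #include <cstdlib>

    namespace db {
    enum class write_type { SIMPLE, BATCH, UNLOGGED_BATCH, COUNTER, BATCH_LOG, CAS, VIEW };
    }

    std::ostream& operator<<(std::ostream& os, db::write_type type) {
        // No default case: with -Wswitch, adding an enumerator without
        // updating this switch produces a compiler warning.
        switch (type) {
        case db::write_type::SIMPLE: return os << "SIMPLE";
        case db::write_type::BATCH: return os << "BATCH";
        case db::write_type::UNLOGGED_BATCH: return os << "UNLOGGED_BATCH";
        case db::write_type::COUNTER: return os << "COUNTER";
        case db::write_type::BATCH_LOG: return os << "BATCH_LOG";
        case db::write_type::CAS: return os << "CAS";
        case db::write_type::VIEW: return os << "VIEW";
        }
        abort(); // unreachable
    }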
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the generate_view_updates() function to the
column_family class, which will use the view_update_builder to
generate updates to the column_family's materialized views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
To help calculate the view mutations from a base update, we store in
the view class the column that's part of the view's primary key but
not part of the base's, if such a column exists.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Instead of storing the view in the column_family's map of materialized
views, store a lw_shared_ptr so that the view can be removed while it
is being updated.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"Before, the logic for releasing writes blocked on dirty worked like this:
1) When region group size changes and it is not under pressure and there
are some requests blocked, then schedule request releasing task
2) request releasing task, if no pressure, runs one request and if there are
still blocked requests, schedules next request releasing task
If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The amount of scheduled
tasks is at most 1, there is a single releasing thread.
However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.
The group size can also change when memory is reclaimed from the groups (e.g.
when contains sparse segments). Compaction may start many request releasing
threads due to group size updates.
Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.
The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or run out of blocked
requests. It may take a while to reach pressure condition after memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.
I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.
Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."
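A sketch of the single releasing fiber (member names hypothetical; _relief
is a seastar::condition_variable): one loop per region group, woken when
pressure is lifted, draining blocked requests until pressure builds again.

    future<> region_group::release_blocked_requests() {
        return do_until([this] { return _shutdown_requested; }, [this] {
            // Sleep until pressure is lifted and there is work to do; the
            // condition variable replaces the old per-size-change tasks.
            return _relief.wait([this] {
                return _shutdown_requested
                        || (!under_pressure() && !_blocked_requests.empty());
            }).then([this] {
                while (!under_pressure() && !_blocked_requests.empty()) {
                    auto req = std::move(_blocked_requests.front());
                    _blocked_requests.pop_front();
                    req(); // execute the previously blocked allocation
                }
            });
        });
    }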
* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
tests: lsa: Add test for reclaimer starting and stopping
tests: lsa: Add request releasing stress test
lsa: Avoid avalanche releasing of requests
lsa: Move definitions to .cc
lsa: Simplify hard pressure notification management
lsa: Do not start or stop reclaiming on hard pressure
tests: lsa: Adjust to take into account that reclaimers are run synchronously
lsa: Document and annotate reclaimer notification callbacks
tests: lsa: Use with_timeout() in quiesce()
That's because a single shard is used to calculate the generation for new
sstables in the upload directory, and that results in that single shard
sharing all the resources with the other shards.
For refresh without the upload dir, this currently works fine because we
reshuffle the column family dir instead.
flush_upload_dir() is now a free function; it takes a distributed database
object and uses calculate_shard_from_sstable_generation() to decide which
shard will move each sstable, using its own generation namespace.
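Sketched shape of the change (scan details elided): the function is free,
takes the distributed database, and routes each sstable to the shard that
owns its generation namespace.

    static unsigned calculate_shard_from_sstable_generation(int64_t generation) {
        // Inverse of how new generations are allocated per shard.
        return unsigned(generation % smp::count);
    }

    future<> flush_upload_dir(distributed<database>& db, sstring ks, sstring cf) {
        // For each generation found in the upload dir (scan elided):
        //   auto shard = calculate_shard_from_sstable_generation(gen);
        //   db.invoke_on(shard, ...) moves the sstable on the shard that
        //   owns that generation namespace.
        return make_ready_future<>();
    }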
Fixes #2008.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b0cccf7bbb61416ff8718bac92fdca90cc5fb9c9.1484253232.git.raphaelsc@scylladb.com>
This patch ensures that when adding a shared sstable, we select only
one cpu to update that column family's stats. This is important so we
don't overestimate the on-disk size of sstables when resharding.
This fixes not only a temporary miscount of the current load, since shared
sstables are eventually re-written, but also a permanent miscount
of the total load.
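Roughly (function placement and stat names illustrative): only the shard
picked from the generation accounts the shared sstable.

    void column_family::add_sstable(lw_shared_ptr<sstables::sstable> sst) {
        if (engine().cpu_id() == sst->generation() % smp::count) {
            // Sole shard responsible for this sstable's stats update, so
            // the on-disk size isn't multiplied by smp::count.
            _stats.live_disk_space_used += sst->bytes_on_disk();
            _stats.total_disk_space_used += sst->bytes_on_disk();
        }
        // ... proceed to register the sstable for reads on this shard ...
    }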
Refs #1592
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170119144823.31041-1-duarte@scylladb.com>
After resharding, sstables may be owned by all shards, which
means that file descriptors and memory usage for metadata will
increase by a factor equal to number of shards. That can easily
lead to OOM.
SSTable components are immutable, so they can be stored in one
shard and shared with others that need it. We use the following
formula to decide which shard will open the sstable and share
it with the others: (generation % smp::count), which is the
inverse of how we calculate generation for new sstables.
So if no resharding is performed, everything is shard-local.
With this approach, resource usage due to loaded sstables will
be evenly distributed among shards.
For this approach to work, we now only populate keyspaces from
shard 0, which is now solely responsible for iterating through
column family dirs. In addition, most of the population functions
are now free and take the distributed database object as a parameter.
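A sketch of the sharing this enables (the loader helper is hypothetical):
the owner shard loads the immutable components once, and other shards hold
a foreign_ptr to that copy.

    future<foreign_ptr<lw_shared_ptr<sstables::sstable>>>
    open_shared_sstable(int64_t generation) {
        unsigned owner = generation % smp::count; // inverse of generation allocation
        return smp::submit_to(owner, [generation] {
            // load_sstable() is a stand-in for the owner-side open path.
            return make_foreign(load_sstable(generation));
        });
    }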
Fixes #1951.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"Reduce the size of mutation_partition by implementing intrusive set using
bi::rbtree_algorithms directly and using tree nodes optimized for size.
This will reduce the size of mutation_partition by:
24 bytes + <number of cql rows> * 8 bytes
This should have a positive impact on performance because mutation_partitions
are stored both in memtable and cache.
Fixes #742."
* 'haaawk/742' of github.com:cloudius-systems/seastar-dev:
intrusive_set: rename size() to calculate_size()
Make intrusive_set_external_comparator::_value_traits static
Implement intrusive set using rbtree_algorithms
mutation_partition: make apply_reversibly_intrusive_set nongeneric
mutation_partition: take schema in find_row and clustered_row
mutation_partition: Extract intrusive set logic to a class.
mutation_partition: Replace value_comp with key_comp calls
Before this patch, system table writes were not going to the commit log,
because database::add_column_family() disables commit log writes for the
table being added if _commitlog is not set at that time. Fix by
initializing the commit log before system tables are created.
Fixes #1986.
Fixes a recent regression in
batch_test.py:TestBatch.replay_after_schema_change_test after
scylla-jmx was updated to not flush system tables on nodetool flush.
This could cause system keyspace writes to be delayed for longer than
before under a heavy write workload. Refs #1926.
Message-Id: <1483618117-4535-1-git-send-email-tgrabiec@scylladb.com>
We may want to change the default individual result size limit in the
future. If it is provided by the coordinator rather than hardcoded in the
replicas, this can be done without causing data query digest mismatches
or wasteful mutation query results.
Digest reads differ from data reads in that they do not really
consume any memory. We still want them to stop in the same place that
data reads would, but the per-shard semaphore shouldn't be updated by
them.
For data queries it is very important that all replicas get limited in
the same place (this includes replicas returning only a digest). That's
why they shouldn't be affected by the per-shard result memory limit.
Moreover, we should make sure that the individual memory limits are the
same by making the coordinator provide the limit to the replicas, which
allows safely changing it in the future.
Mutation queries are not as sensitive, but it is still beneficial to make
sure that all replicas use the same individual limit.
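A sketch of the two points above (type and field names illustrative):

    // The limit travels with the read command, filled in once by the
    // coordinator, so every replica stops at exactly the same point.
    struct read_command {
        uint64_t max_result_size; // bytes; set by the coordinator
        // ... partition ranges, slice, row/partition limits ...
    };

    void account_result_memory(seastar::semaphore& shard_result_memory,
                               size_t bytes, bool digest_read) {
        // Digest reads stop where data reads would, but hold no result
        // memory, so they leave the per-shard semaphore untouched.
        if (!digest_read) {
            shard_result_memory.consume(bytes);
        }
    }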
Shared sstables will now be resharded in the same order to guarantee
that all shards owning an sstable will agree on its deletion at nearly
the same time, therefore reducing the disk space requirement.
That's done by picking which column family to reshard in UUID order,
and each individual column family will reshard its shared sstables
in generation order.
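Roughly (container and pointer types illustrative): a deterministic order
shared by all shards.

    // Pick tables in UUID order, and within each table reshard shared
    // sstables in generation order, so all shards finish with a given
    // sstable at about the same time.
    std::sort(tables.begin(), tables.end(), [] (auto& a, auto& b) {
        return a->schema()->id() < b->schema()->id();
    });
    std::sort(shared_sstables.begin(), shared_sstables.end(), [] (auto& a, auto& b) {
        return a->generation() < b->generation();
    });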
Fixes #1952.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>
Refs #1943.
* 'tgrabiec/optimize-bloom-filter' of github.com:cloudius-systems/seastar-dev:
db: Compute key hash once in partition_presence_checker
bloom_filter: Allow checking presence using pre-hashed key
db: Use incremental selector in partition_presence_checker
This patch extracts update_column_family from schema_tables into
database so it can be used when adding materialized views, in future
patches.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the drop_column_family() function to remove
a view schema from the list of views of its base table.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds code for parsing the views schema table upon init and
also ensures that when adding a view column family, we add it to
its base table's list of views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch moves some duplicate code into the
add_column_family_and_create_directory() function. It also saves some
superfluous keyspace lookups and readies the code to be used by
materialized views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds utility functions to keyspace_metadata to select only
the tables or only the views out of all the schemas.
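Sketch of the selectors (signatures illustrative, assuming _cf_meta_data
maps names to schema_ptr and schema::is_view() distinguishes views):

    std::vector<schema_ptr> keyspace_metadata::tables() const {
        std::vector<schema_ptr> ret;
        for (auto&& e : _cf_meta_data) {
            if (!e.second->is_view()) {
                ret.push_back(e.second);
            }
        }
        return ret;
    }

    std::vector<view_ptr> keyspace_metadata::views() const {
        std::vector<view_ptr> ret;
        for (auto&& e : _cf_meta_data) {
            if (e.second->is_view()) {
                ret.push_back(view_ptr(e.second));
            }
        }
        return ret;
    }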
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the view class, which will contain functions related
to populating a view, either from the base table's write path or from
the view building mechanism, which copies over already existing data in
the base table.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This reduces the number of sstables we need to check to only those
whose token range overlaps with the key. Reduces cache update
time. Especially effective with leveled compaction strategy.
Refs #1943.
Incremental selector works with an immutable sstable set, so cache
updates need to be serialized. Otherwise we could mispopulate due to
stale presence information.
The presence checker interface was changed to accept a decorated key in
order to gain easy access to the token, which is required by
the incremental selector.
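Roughly (API names approximate): the checker receives the decorated key,
narrows the candidates to sstables covering its token, and hashes the key
once for all of them.

    partition_presence_checker
    make_partition_presence_checker(schema_ptr s, lw_shared_ptr<sstables::sstable_set> ssts) {
        return [s, sel = make_lw_shared(ssts->make_incremental_selector())]
                (const dht::decorated_key& dk) {
            // Hash once, up front, instead of once per sstable.
            auto hk = sstables::sstable::make_hashed_key(
                    key::from_partition_key(*s, dk.key()));
            for (auto& sst : sel->select(dk.token()).sstables) {
                if (sst->filter_has_key(hk)) {
                    return partition_presence_checker_result::maybe_exists;
                }
            }
            return partition_presence_checker_result::definitely_doesnt_exist;
        };
    }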
"This patchset ensures the partition limit is enforced at
the storage_proxy level. To achieve this, we add the partition
count to query::result, and allow the result_merger to trim
excess partitions."
* 'enforce-partition-limit/v3' of https://github.com/duarten/scylla:
storage_proxy: Decrease limits when retrying command
storage_proxy: Don't fetch superfluous partitions
query::result: Add partition count
column_family: Use counters in query::result::builder
query_result_builder: Use the underlying counters
mutation_partition: Count partitions in query_compacted
mutation_partition: Remove tabs in query_compacted
query::result::builder: Add partition count
query_result_merger: Limit partitions
A case could be made that we should have counters for them no matter
what, since it can help us reason about the distribution of memory among
the groups. But with the hierarchy being broken in 1.5, it becomes even
more important. Now, by looking solely at dirty, we would have no idea
how much memory we are using in those groups.
After this patch, the dirty_memory_manager will register its metrics
for the 3 groups that we have, and the legacy names will be used to show
totals.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>
A naive approach was to create a set of readers for each range and pass
them all to a combining reader. This, however, performed badly when the
number of ranges was high.
The solution is to use a multi range reader, which uses only a single set
of readers and fast forwards from range to range when necessary. This
adds the requirement that the ranges passed to
make_streaming_reader() be sorted and disjoint.
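A sketch, assuming a reader that exposes fast_forward_to(range) and a
hypothetical consume_all() drain: one reader stack serves every range in
order, and the ranges must be sorted and disjoint.

    future<> stream_ranges(mutation_reader rd, std::vector<dht::partition_range> ranges) {
        return do_with(std::move(rd), std::move(ranges),
                [] (mutation_reader& rd, std::vector<dht::partition_range>& ranges) {
            return do_for_each(ranges, [&rd] (const dht::partition_range& pr) {
                return rd.fast_forward_to(pr).then([&rd] {
                    return consume_all(rd); // hypothetical: drain this range
                });
            });
        });
    }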
This patch changes column_family::query() to use the counters in the
builder to determine how many partitions and rows to ask for and also
to implement the stop condition. This saves a continuation to do the
bookkeeping, and allows us to remove data_query_result.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch ensures that when we start executing a query, a minimum result
size is reserved from the result_memory_limiter.
Moreover, range queries need a way of merging memory usage information
from different shards.
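A sketch with illustrative names: a per-shard semaphore backs the limiter,
each query reserves a minimum before building its result, and range
queries merge per-shard usage afterwards.

    #include "core/semaphore.hh"

    class result_memory_limiter {
        semaphore _memory;
    public:
        explicit result_memory_limiter(size_t max_result_memory)
                : _memory(max_result_memory) {}
        future<> reserve_minimum(size_t minimum_result_size) {
            return _memory.wait(minimum_result_size);
        }
        void release(size_t bytes) { _memory.signal(bytes); }
    };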
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
"The current criteria for memtable flush is not being respected. The
problem is demonstrated to happen when the dirty memory group is over
limit, and so is the system table extra allowance. In that situation,
both the normal region and the system table region will be under
pressure and try to flush.
More specifically, because the normal region inherits from the system
region, if the normal region is under pressure (over the soft limit
threshold), the system region will certainly be as well, even though it
has an extra allowance. This is because, with virtual dirty, we start
blocking when we reach half the region, but memory itself can grow up to
100% of the region. So the total amount of memory used will
certainly be bigger than the system pressure threshold, which is now 50%
plus the allowance.
To fix that, this patch reworks the flush logic so that the regions are
not dependent on each other.
Fixes#1918"
* 'flush-criteria-v6' of github.com:glommer/scylla:
config: get rid of memtable_total_space
database: rework dirty memory hierarchy
system keyspace: write batchlog mutation in user memory
database: remove flush_token
database: abstract pressure condition notification
database: encapsulate semaphore_units into a flush_permit
database: remove friendship declaration
database: simplify flush_one
database: make memtable_list aware in cases it can't flush