The name "column_family" is both awkward and obsolete. Rename to
the modern and accurate "table".
An alias is kept to avoid huge code churn.
To prevent a One Definition Rule violation, a preexisting "table"
type is moved to a new namespace row_cache_stress_test.
Tests: unit (release)
Message-Id: <20180624065238.26481-1-avi@scylladb.com>
Now that we have the controller, we would like to take min_threshold as
a hint. If there is nothing to compact, we can ignore that and start
compacting less than min_threshold SSTables so that the backlog keeps
reducing.
But there are cases in which we don't want min_threshold to be a hint
and we want to enforce it strictly. For instance, if write amplification
is more of a concern than space amplification.
This patch adds a YAML option that allows the user to tell us that. We will
default to false, meaning min_threshold is not strictly enforced.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
This is series is for nodetool getsstables.
This patch is based on:
8daaf9833a
With some minor adjustments because of the code change in sstables.
The idea is to allow searching for all the sstables that contains a
given key.
After this patch if there is a table t1 in keyspace k1 and it has a key
called aa.
curl -X GET "http://localhost:10000/column_family/sstables/by_key/k1%3At1?key=aa"
Will return the list of sstables file names that contains that key.
"
* 'amnon/sstable_for_key_v4' of github.com:scylladb/seastar-dev:
Add the API implementation to get_sstables_by_key
api: column_family.json make the get_sstables_for_key doc clearer
column_family: Add the get_sstables_by_partition_key method
sstable test: add has_partition_key test
sstable: Add has_partition_key method
keys_test: add a test for nodetool_style string
keys: Add from_nodetool_style_string factory method
The get_sstables_by_partition_key method used by the API to return a set of
sstables names that holds a given partition key.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
In 455d5a5 (streaming memtables: coalesce incoming writes), we
introduced the delayed flush to coalesce incoming streaming mutations
from different stream_plan.
However, most of the time there will be one stream plan at a time, the
next stream plan won't start until the previous one is finished. So, the
current coalescing does not really work.
The delayed flush adds 2s of dealy for each stream session. If we have lots
of table to stream, we will waste a lot of time.
We stream a keyspace in around 10 stream plans, i.e., 10% of ranges a
time. If we have 5000 tables, even if the tables are almost empty, the
delay will waste 5000 * 10 * 2 = 27 hours.
To stream a keyspace with 4 tables, each table has 1000 rows.
Before:
[shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
[shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 125.21 KiB/s
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 8.233 seconds
After:
[shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
[shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 4772.32 KiB/s
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 0.216 seconds
Fixes#3436
Message-Id: <cb2dde263782d2a2915ddfe678c74f9637ffd65b.1526979175.git.asias@scylladb.com>
This commit clears table's views before truncating it
in drop_column_family function. The only case when
views are not empty during drop is when they're backing secondary
indexes of a base table and they are all atomically dropped
in the same go as the base table itself.
This change will prevent trying to truncate views that were
already dropped, which used to result in no_such_column_family error.
References #3202
"
This series introduces materialized view statistics, as stated in issue #3385:
- updates pushed
- updates failed
- row lock stats
It also addresses issue #3416 by decoupling user write stats from view
update stats.
"
* 'materialized_view_metrics_9' of https://github.com/psarna/scylla:
view: adapt view_stats to act as write stats
storage_proxy: decouple write_stats from stats
db: add row locking metrics
view: add view metrics
This commit adds statistics to row_locker class. Metrics are
independendly counted for all lock types: row<->partition and
exclusive<->shared.
Metrics gathered:
- total acquisitions
- operations that wait on the lock
- histogram of the time spent on waiting on this type of lock
References #3385
References #3416
This commit introduces view statistics:
- updates pushed to local/remote replicas
- updates failed to be pushed to local/remote replicas
Metrics are kept on per-table basis, i.e. updates_pushed_remote
shows the number of total updates (mutations) pushed to all paired
mv replicas that this particular table has.
Every single update is taken into consideration, so if view update
requires removing a row from one view and adding a row to another,
it will be counted as 2 updates.
References #3385
References #3416
There was recently an attempt to add minimum shares to major compactions
which ended up being harder than it should be due to all the plumbing
necessary to call the compaction controller from inside the compaction
manager-- since it is currently a database object. We had this problem
again when trying to return fixed shares in case of an exception.
Taking a step back, all of those problems stem from the fact that the
compaction controller really shouldn't be a part of the database: as it
deals with compactions and its consequences it is a lot more natural to
have it inside the compaction manager to begin with.
Once we do that, all the aforementioned problems go away. So let's move
there where it belongs.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This ensures we respect the write timeout set by the client when
applying base writes, in case a writes takes too long to acquire the
row lock for the read-before-write phase of a materialized view
update.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180507132755.8751-1-duarte@scylladb.com>
This patch adds a simple and naive mechanism to ensure a base replica
doesn't overwhelm a potentially overloaded view replica by sending too
many concurrent view updates. We add a semaphore to limit to 100 the
number of outstanding view updates. We limit globally per shard, and
not per destination view replica. We also limit statically.
Refs #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-2-duarte@scylladb.com>
This commit makes database, sstables and tests aware
of which large_partition_handler they use.
Proper large_partition_handler is retrievable from config information
and is based on existing compaction_large_partition_warning_threshold_mb
entry. Right now CQL TABLE variant of large_partition_handler is used
in the database.
Tests use a NOP version of large_partition_handler, which does not
depend on CQL queries at all.
The populate_views() function takes a set of views to update, a
tokento select base table partitions, and the set of sstables to
query. This lays the foundation for a view building mechanism to exist,
which walks over a given base table, reads data token-by-token,
calculates view updates (in a simplified way, compared to the existing
functions that push view updates), and sends them to the paired view
replicas.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a mechanism to class column_family through which we
can synchronize with in-progress writes. This is useful for code that,
after some modification, needs to ensure that new writes will see it
before it can proceed.
In particular, this will be used by the view building code, which needs
to wait until the in-progress writes, which may have missed that there
is now a view, is observable to the view building code.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When we introduced the CPU scheduler, we have also introduced a group
for commitlog - but never used it. There is also doubtful value in
separating reads from writes, since they are often part of the same
workload.
To accomodate for that, let's rename the query group to "statement"
(query is not incorrect, just confusing), and move the write path,
currently ungrouped, inside it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Specifically for the reader-permit based eviction. This test lives in a
separate executable as it uses with_cql_test_env() and thus needs a
main() of it's own.
Cached queriers should not sit in the cache indefinitely otherwise
abandoned reads would cause excess and unncessary resource-usage. Attach
an expiry timer to each cache-entry which evicts it after the TTL
passes.
Use the querier_cache (represented by the passed-in
querier_cache_context) object to lookup saved queriers at the start of
the page and save them at the end of it if it is likely that there will
be more page requests.
In preparation to reader_concurrency_semaphore being added to the file.
The reader_resource_tracker is really only a helper class for
reader_concurrency_semaphore so the latter is better suited to provide
the name of the file.
"The motivation is that it's no longer needed after new resharding
algorithm that is the sole responsible for working with shared
sstables and regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.
The manager was needed for orchestrating deletion of shared sstable
across shards. It brings extra complexity that's not longer needed,
and it was also overloading shard 0, but the latter could have
been fixed.
Tests:
- unit: release mode
- dtest: resharding_test.py"
* 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla:
Remove SSTable's atomic deletion manager
Stop using SSTable's atomic deletion manager
database: split column_family::rebuild_sstable_list
The motivation is that resharding will not want the code that is
specific to regular compaction after atomic deletion is removed.
Resharding will eventually only need to replace old tables with
new ones, and it will be in charge of deletion of old tables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We have had so far an I/O controller, for compactions and memtables, and
a CPU controller, for memtables only -- since the scheduling was still
quota-based.
Now that the CPU scheduler is fully functional, it is time to do away
with the differences and integrate them both into one. We now have a
memtable controller and a compaction controller, and they control both
CPU and I/O.
In the future, we may want to control processes that don't do one of
them, like cache updates. If that ever happens, we'll try to make
controlling one of them optional. But for now, since the I/O and CPU
controllers for our main two processes would look exactly the same we
should integrate them.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Because execution stages defer and batch processing of the function
they run, they escape their fiber's context and therefore the
scheduling group.
Fix (for data_query) by initializing the execution_stage with the
query scheduling_group. To do that we have to move the execution
stage into the database object, so it has access to the scheduling
group during initialization.
Set up scheduling groups for streaming, compaction, memtable flush, query,
and commitlog.
The background writer scheduling group is retired; it is split into
the memtable flush and compaction groups.
Comments from Glauber:
This patch is based in a patch from Avi with the same subject, but the
differences are signficant enough so that I reset authorship. In
particular:
1) A bug/regression is fixed with the boundary calculations for the
memtable controller sampling function.
2) A leftover is removed, where after flushing a memtable we would
go back to the main group before going to the cache group again
3) As per Tomek's suggestion, now the submission of compactions
themselves are run in the compaction scheduling group. Having that
working is what changes this patch the most: we now store the
scheduling group in the compaction manager and let the compaction
manager itself enforce the scheduling group.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initializtion defers), we
create the scheduling_groups in main.cc and propagate them to users via
a new class database_config.
The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.
Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher ranges.
Introduce class result_options to carry result options through the
request pipeline, which at this point mean the result type and the
digest algorithm. This class allows us to encapsulate the concrete
digest algorithm to use.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Before this patch, our Materialized Views implementation can produce
incorrect results when given concurrent updates of the same base-table
row. Such concurrent updates may result, in certain cases, in two
different rows added to the view table, instead of just one with the latest
data. In this patch we we add locking which serializes the two conflicting
updates, and solves this problem. The locking for a single base-table
column_family is implemented by the row_locker class introduced in a
previous patch.
A long comment in the code of this patch explains in more detail why
this locking is needed, when, and what types of locks are needed: We
sometimes need to lock a single clustering row, sometimes an entire
partition, sometimes an exclusive lock and sometimes a shared lock.
Fixes#3168
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
After the new compaction controller code, the monitor has to be kept
alive until the sstable is added to the SSTable set.
This is correctly handled for all the writers, except the streaming big.
That flusher is a big confusing, as it builds an sstable list first and
only later adds the elements in the list to the sstable set. The
monitors are destroyed at the end of phase 1, so we will SIGSEGV later
when calling add_sstable().
The fix for this is to make sure the lifetime of the monitors are tied
to the lifetime of the sstables being handled big the big streaming
flush process.
Caught by dtests, update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test
Fixes#3131
Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test now passes.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180118202230.17107-1-glauber@scylladb.com>
Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
wan't no timeout there at all.
We already apply such configuration for requests waiting in the queued
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.
This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.
In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
the the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.
We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.
After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.
Fixes#2462
Implementation notes, and general comments about open discussion in 2462:
* Because we are not bypassing the timeout, just setting it high enough,
I consider the concerns about the batchlog moot: if we fail for any
other reason that will be propagated. Last case, because the timeout
is per-CF, we could do what we do for the dirty memory manager and
move the batchlog alone to use a different timeout setting.
* Storage proxy likes specifying its timeouts as a time_point, whereas
when we get low enough as to deal with the read_concurrency_config,
we are talking about deltas. So at some point we need to convert time_points
to durations. We do that in the database query functions.
v2:
- use per-request instead of per-table timeouts.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch enables passing a timeout to the restricted_mutation_reader
through the read path interface -- using fill_buffer and friends. This
will serve as a basis for having per-timeout requests.
The config structure still has a timeout, but that is so far only used
to actually pass the value to the query interface. Once that starts
coming from the storage proxy layer (next patch) we will remove.
The query callers are patched so that we pass the timeout down. We patch
the callers in database.cc, but leave the streaming ones alone. That can
be safely done because the default for the query path is now no_timeout,
and that is what the streaming code wants. So there is no need to
complicate the interface to allow for passing a timeout that we intend
to disable.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
At the moment, various different subsystems use their different
ideas of what a timeout_clock is. This makes it a bit harder to pass
timeouts between them because although most are actually a lowres_clock,
that is not guaranteed to be the case. As a matter of fact, the timeout
for restricted reads is expressed as nanoseconds, which is not a valid
duration in the lowres_clock.
As a first step towards fixing this, we'll consolidate all of the
existing timeout_clocks in one, now called db::timeout_clock. Other
things that tend to be expressed in terms of that clock--like the fact
that the maximum time_point means no timeout and a semaphore that
wait()s with that resolution are also moved to the common header.
In the upcoming patch we will fix the restricted reader timeouts to
be expressed in terms of the new timeout_clock.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The algorithm and principle of operation is the same as the CPU
controller. It is, however, always enabled and we will operate on
I/O shares.
I/O-bound workloads are expected to hit the maximum once virtual
dirty fills up and stay there while the load is steady.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Compactions can be a heavy disk user and the I/O scheduler can always
guarantee that it uses its fair share of disk.
Such fair share can, however, be a lot more than what compaction indeed
need. This patch draws on the controllers infrastructure to adjust the
I/O shares that the compaction class will get so that compaction
bandwidth is dynamically adjusted.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The control algorithm we are using for memtables have proven itself
quite successful. We will very likely use the same for other processes,
like compactions.
Make the code a bit more generic, so that a new controller has to only
set the desired parameters
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Passing the read monitor down to the sstable readers is tricky. The
point of interest - like compaction - are usually very far from the
interfaces that register the monitor, like read_rows. Between the two,
there is usually a mutation_reader, which is and ought to be totally
unaware of the read monitor: technically, a mutation_reader may not even
know it is backed by sstables.
The solution is to create a read_monitor_generator, that can be passed
from the upper layers, like compaction, to the layers that are actually
making the decision of which sstables to create readers for.
Note that we don't need an equivalent piece of infrastructure for
writes, because writes don't happen through hidden layers and have all
the information they need to initialize their monitors.
Signed-off-by: Glauber Costa <glauber@scylladb.com>