Commit Graph

632 Commits

Author SHA1 Message Date
Avi Kivity
cb549c767a database: rename column_family to table
The name "column_family" is both awkward and obsolete. Rename to
the modern and accurate "table".

An alias is kept to avoid huge code churn.

To prevent a One Definition Rule violation, a preexisting "table"
type is moved to a new namespace row_cache_stress_test.

Tests: unit (release)
Message-Id: <20180624065238.26481-1-avi@scylladb.com>
2018-06-24 14:54:46 +03:00
Glauber Costa
290d553c3a compaction_strategy: allow the user to tell us if min_threshold has to be strict
Now that we have the controller, we would like to take min_threshold as
a hint. If there is nothing to compact, we can ignore that and start
compacting less than min_threshold SSTables so that the backlog keeps
reducing.

But there are cases in which we don't want min_threshold to be a hint
and we want to enforce it strictly. For instance, if write amplification
is more of a concern than space amplification.

This patch adds a YAML option that allows the user to tell us that. We will
default to false, meaning min_threshold is not strictly enforced.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-15 13:42:43 -04:00
Gleb Natapov
f41575a156 Provide available memory size to database object during creation 2018-06-11 15:34:13 +03:00
Avi Kivity
2582f53b44 Merge "database and API: Add column_family::get_sstables_by_key" from Amnon
"
This is series is for nodetool getsstables.

This patch is based on:
8daaf9833a

With some minor adjustments because of the code change in sstables.

The idea is to allow searching for all the sstables that contains a
given key.

After this patch if there is a table t1 in keyspace k1 and it has a key
called aa.

curl -X GET "http://localhost:10000/column_family/sstables/by_key/k1%3At1?key=aa"

Will return the list of sstables file names that contains that key.
"

* 'amnon/sstable_for_key_v4' of github.com:scylladb/seastar-dev:
  Add the API implementation to get_sstables_by_key
  api: column_family.json make the get_sstables_for_key doc clearer
  column_family: Add the get_sstables_by_partition_key method
  sstable test: add has_partition_key test
  sstable: Add has_partition_key method
  keys_test: add a test for nodetool_style string
  keys: Add from_nodetool_style_string factory method
2018-06-10 16:53:56 +03:00
Amnon Heiman
acb0a738eb column_family: Add the get_sstables_by_partition_key method
The get_sstables_by_partition_key method used by the API to return a set of
sstables names that holds a given partition key.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-06-10 16:13:01 +03:00
Asias He
6496cdf0fb db: Get rid of the streaming memtable delayed flush
In 455d5a5 (streaming memtables: coalesce incoming writes), we
introduced the delayed flush to coalesce incoming streaming mutations
from different stream_plan.

However, most of the time there will be one stream plan at a time, the
next stream plan won't start until the previous one is finished. So, the
current coalescing does not really work.

The delayed flush adds 2s of dealy for each stream session. If we have lots
of table to stream, we will waste a lot of time.

We stream a keyspace in around 10 stream plans, i.e., 10% of ranges a
time. If we have 5000 tables, even if the tables are almost empty, the
delay will waste 5000 * 10 * 2 = 27 hours.

To stream a keyspace with 4 tables, each table has 1000 rows.

Before:

 [shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
 [shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 125.21 KiB/s
 [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 8.233 seconds

After:

 [shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
 [shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 4772.32 KiB/s
 [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 0.216 seconds

Fixes #3436

Message-Id: <cb2dde263782d2a2915ddfe678c74f9637ffd65b.1526979175.git.asias@scylladb.com>
2018-06-06 10:16:02 +03:00
Piotr Sarna
f8237dd664 database: do not truncate already removed views
This commit clears table's views before truncating it
in drop_column_family function. The only case when
views are not empty during drop is when they're backing secondary
indexes of a base table and they are all atomically dropped
in the same go as the base table itself.
This change will prevent trying to truncate views that were
already dropped, which used to result in no_such_column_family error.

References #3202
2018-05-22 21:10:51 +02:00
Duarte Nunes
a3bbd52e2e Merge 'Add materialized view metrics' from Piotr
"
This series introduces materialized view statistics, as stated in issue #3385:
 - updates pushed
 - updates failed
 - row lock stats

It also addresses issue #3416 by decoupling user write stats from view
update stats.
"

* 'materialized_view_metrics_9' of https://github.com/psarna/scylla:
  view: adapt view_stats to act as write stats
  storage_proxy: decouple write_stats from stats
  db: add row locking metrics
  view: add view metrics
2018-05-22 18:41:51 +01:00
Piotr Sarna
9246bb36bc db: add row locking metrics
This commit adds statistics to row_locker class. Metrics are
independendly counted for all lock types: row<->partition and
exclusive<->shared.

Metrics gathered:
 - total acquisitions
 - operations that wait on the lock
 - histogram of the time spent on waiting on this type of lock

References #3385
References #3416
2018-05-22 16:52:58 +02:00
Piotr Sarna
49bebcfa25 view: add view metrics
This commit introduces view statistics:
 - updates pushed to local/remote replicas
 - updates failed to be pushed to local/remote replicas

Metrics are kept on per-table basis, i.e. updates_pushed_remote
shows the number of total updates (mutations) pushed to all paired
mv replicas that this particular table has.
Every single update is taken into consideration, so if view update
requires removing a row from one view and adding a row to another,
it will be counted as 2 updates.

References #3385
References #3416
2018-05-22 16:52:58 +02:00
Glauber Costa
d758a416f8 backlog_controller: move compaction controller to the compaction manager
There was recently an attempt to add minimum shares to major compactions
which ended up being harder than it should be due to all the plumbing
necessary to call the compaction controller from inside the compaction
manager-- since it is currently a database object. We had this problem
again when trying to return fixed shares in case of an exception.

Taking a step back, all of those problems stem from the fact that the
compaction controller really shouldn't be a part of the database: as it
deals with compactions and its consequences it is a lot more natural to
have it inside the compaction manager to begin with.

Once we do that, all the aforementioned problems go away. So let's move
there where it belongs.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:24:19 -04:00
Duarte Nunes
c053275a48 db/view/row_locking: Add timeout when waiting for the lock
This ensures we respect the write timeout set by the client when
applying base writes, in case a writes takes too long to acquire the
row lock for the read-before-write phase of a materialized view
update.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180507132755.8751-1-duarte@scylladb.com>
2018-05-07 18:22:39 +01:00
Duarte Nunes
4b3562c3f5 db/view: Limit number of pending view updates
This patch adds a simple and naive mechanism to ensure a base replica
doesn't overwhelm a potentially overloaded view replica by sending too
many concurrent view updates. We add a semaphore to limit to 100 the
number of outstanding view updates. We limit globally per shard, and
not per destination view replica. We also limit statically.

Refs #2538

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180426134457.21290-2-duarte@scylladb.com>
2018-05-07 11:25:27 +03:00
Piotr Sarna
fe02c3d0e2 database, sstables, tests: add large_partition_handler
This commit makes database, sstables and tests aware
of which large_partition_handler they use.
Proper large_partition_handler is retrievable from config information
and is based on existing compaction_large_partition_warning_threshold_mb
entry. Right now CQL TABLE variant of large_partition_handler is used
in the database.

Tests use a NOP version of large_partition_handler, which does not
depend on CQL queries at all.
2018-05-04 14:38:13 +02:00
Duarte Nunes
f298f57137 column_family: Add function to populate views
The populate_views() function takes a set of views to update, a
tokento select base table partitions, and the set of sstables to
query. This lays the foundation for a view building mechanism to exist,
which walks over a given base table, reads data token-by-token,
calculates view updates (in a simplified way, compared to the existing
functions that push view updates), and sends them to the paired view
replicas.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
67dd3e6e5d column_family: Allow synchronizing with in-progress writes
This patch adds a mechanism to class column_family through which we
can synchronize with in-progress writes. This is useful for code that,
after some modification, needs to ensure that new writes will see it
before it can proceed.

In particular, this will be used by the view building code, which needs
to wait until the in-progress writes, which may have missed that there
is now a view, is observable to the view building code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Duarte Nunes
9b9ba525f7 database: Add get_views() function
Returns all the schemas that are views.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-27 01:20:10 +01:00
Glauber Costa
9188059427 database: group statements in their own scheduling group
When we introduced the CPU scheduler, we have also introduced a group
for commitlog - but never used it. There is also doubtful value in
separating reads from writes, since they are often part of the same
workload.

To accomodate for that, let's rename the query group to "statement"
(query is not incorrect, just confusing), and move the write path,
currently ungrouped, inside it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-20 16:58:36 -04:00
Botond Dénes
c0009750c3 Add unit test for resource based cache eviction
Specifically for the reader-permit based eviction. This test lives in a
separate executable as it uses with_cql_test_env() and thus needs a
main() of it's own.
2018-03-13 16:20:50 +02:00
Botond Dénes
d5bcadcfda Time-based cache eviction
Cached queriers should not sit in the cache indefinitely otherwise
abandoned reads would cause excess and unncessary resource-usage. Attach
an expiry timer to each cache-entry which evicts it after the TTL
passes.
2018-03-13 10:34:34 +02:00
Botond Dénes
ff808d9ce6 Save and restore queriers in mutation_query() and data_query()
Use the querier_cache (represented by the passed-in
querier_cache_context) object to lookup saved queriers at the start of
the page and save them at the end of it if it is likely that there will
be more page requests.
2018-03-13 10:34:34 +02:00
Botond Dénes
1259031af3 Use the reader_concurrency_semaphore to limit reader concurrency 2018-03-08 14:12:12 +02:00
Botond Dénes
d5bb8a47fc mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh
In preparation to reader_concurrency_semaphore being added to the file.
The reader_resource_tracker is really only a helper class for
reader_concurrency_semaphore so the latter is better suited to provide
the name of the file.
2018-03-08 10:29:16 +02:00
Raphael S. Carvalho
aa75684ee7 sstables: Warn when an extra-large partition is written
Based on https://issues.apache.org/jira/browse/CASSANDRA-9643

For compaction_large_partition_warning_threshold_mb option set to 1,
follow an example output:

WARN  2018-02-22 19:52:11,029 [shard 0] sstable - Writing large
row system/local:{key: pk{00056c6f63616c}, token:-7564491331177403445}
(1276758 bytes)

Fixes #2209.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180306175912.19259-1-raphaelsc@scylladb.com>
2018-03-07 15:49:46 +00:00
Duarte Nunes
76e6423910 database: Truncate views when truncating the base table
Fixes #3200

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180211124218.41373-1-duarte@scylladb.com>
2018-02-27 15:54:43 +02:00
Avi Kivity
432268f582 Merge "branch 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla" from Raphael
"The motivation is that it's no longer needed after new resharding
algorithm that is the sole responsible for working with shared
sstables and regular compaction will not work with those!
So resharding will schedule deletion of shared sstables once it's
certain that shards that own them have the new unshared sstables.
The manager was needed for orchestrating deletion of shared sstable
across shards. It brings extra complexity that's not longer needed,
and it was also overloading shard 0, but the latter could have
been fixed.

Tests:
- unit: release mode
- dtest: resharding_test.py"

* 'remove_atomic_deletion_manager_v2' of github.com:raphaelsc/scylla:
  Remove SSTable's atomic deletion manager
  Stop using SSTable's atomic deletion manager
  database: split column_family::rebuild_sstable_list
2018-02-08 19:10:16 +02:00
Duarte Nunes
456b678e0b database.hh: Fix data query stage argument type
Fixes a merge gone wrong.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180208163338.25238-1-duarte@scylladb.com>
2018-02-08 16:35:10 +00:00
Avi Kivity
404172652e Merge "Use xxHash for digest instead of MD5" from Duarte
"This series changes digest calculation to use a faster algorithm
(xxHash) and to also cache calculated cell hashes that can be kept in
memory to speed up subsequent digest requests.

The MD5 hash function has proved to be slow for large cell values:

size = 256; elapsed = 4us
size = 512; elapsed = 8us
size = 1024; elapsed = 14us
size = 2048; elapsed = 21us
size = 4096; elapsed = 33us
size = 8192; elapsed = 51us
size = 16384; elapsed = 86us
size = 32768; elapsed = 150us
size = 65536; elapsed = 278us
size = 131072; elapsed = 531us
size = 262144; elapsed = 1032us
size = 524288; elapsed = 2026us
size = 1048576; elapsed = 4004us
size = 2097152; elapsed = 7943us
size = 4194304; elapsed = 15800us
size = 8388608; elapsed = 31731us
size = 16777216; elapsed = 64681us
size = 33554432; elapsed = 130752us
size = 67108864; elapsed = 263154us

The xxHash is a non-cryptographic, 64bit (there's work in progress on
the 128 version) hash that can be used to replace MD5. It performs much
better:

size = 256; elapsed = 2us
size = 512; elapsed = 1us
size = 1024; elapsed = 1us
size = 2048; elapsed = 2us
size = 4096; elapsed = 2us
size = 8192; elapsed = 3us
size = 16384; elapsed = 5us
size = 32768; elapsed = 8us
size = 65536; elapsed = 14us
size = 131072; elapsed = 28us
size = 262144; elapsed = 59us
size = 524288; elapsed = 116us
size = 1048576; elapsed = 226us
size = 2097152; elapsed = 456us
size = 4194304; elapsed = 935us
size = 8388608; elapsed = 1848us
size = 16777216; elapsed = 4723us
size = 33554432; elapsed = 10507us
size = 67108864; elapsed = 21622us

Performance was tested using a 3 node cluster with 1 cpu and 8GB,
and with the following cassandra-stress loaders. Measurements are for
the read workload.

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=5000000 -schema 'replication(factor=3)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..5000000,5000000,500000)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 32699 [READ:32699]
partition rate            : 32699 [READ:32699]
row rate                  : 32699 [READ:32699]
latency mean              : 3.0 [READ:3.0]
latency median            : 3.0 [READ:3.0]
latency 95th percentile   : 3.9 [READ:3.9]
latency 99th percentile   : 4.5 [READ:4.5]
latency 99.9th percentile : 6.6 [READ:6.6]
latency max               : 24.0 [READ:24.0]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:05
END

md5:

Results:
op rate                   : 25241 [READ:25241]
partition rate            : 25241 [READ:25241]
row rate                  : 25241 [READ:25241]
latency mean              : 3.9 [READ:3.9]
latency median            : 3.9 [READ:3.9]
latency 95th percentile   : 5.1 [READ:5.1]
latency 99th percentile   : 5.8 [READ:5.8]
latency 99.9th percentile : 8.0 [READ:8.0]
latency max               : 24.8 [READ:24.8]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:06:36
END

This translates into a 21% improvoment for this workload.

Bigger cell values were also tested:

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=1000000 -schema 'replication(factor=3)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..1000000,500000,100000)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 19964 [READ:19964]
partition rate            : 19964 [READ:19964]
row rate                  : 19964 [READ:19964]
latency mean              : 4.9 [READ:4.9]
latency median            : 4.6 [READ:4.6]
latency 95th percentile   : 7.2 [READ:7.2]
latency 99th percentile   : 11.5 [READ:11.5]
latency 99.9th percentile : 13.6 [READ:13.6]
latency max               : 29.2 [READ:29.2]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:08:20
END

md5:

Results:
op rate                   : 12773 [READ:12773]
partition rate            : 12773 [READ:12773]
row rate                  : 12773 [READ:12773]
latency mean              : 7.7 [READ:7.7]
latency median            : 7.3 [READ:7.3]
latency 95th percentile   : 10.2 [READ:10.2]
latency 99th percentile   : 16.8 [READ:16.8]
latency 99.9th percentile : 19.2 [READ:19.2]
latency max               : 71.5 [READ:71.5]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:13:02
END

This translates into a 37% improvoment for this workload.

Fixes #2884

Tests: unit-tests (release), dtests (smp=2)

Note: dtests are kinda broken in master (> 30 failures), so take the
tests tag with a grain of himalayan salt."

* 'xxhash/v5' of https://github.com/duarten/scylla: (29 commits)
  tests/row_cache_test: Test hash caching
  tests/memtable_test: Test hash caching
  tests/mutation_test: Use xxHash instead of MD5 for some tests
  tests/mutation_test: Test xx_hasher alongside md5_hasher
  schema: Remove unneeded include
  service/storage_proxy: Enable hash caching
  service/storage_service: Add and use xxhash feature
  message/messaging_service: Specify algorithm when requesting digest
  storage_proxy: Extract decision about digest algorithm to use
  cache_flat_mutation_reader: Pre-calculate cell hash
  partition_snapshot_reader: Pre-calculate cell hash
  query::partition_slice: Add option to specify when digest is requested
  row: Use cached hash for hash calculation
  mutation_partition: Replace hash_row_slice with appending_hash
  mutation_partition: Allow caching cell hashes
  mutation_partition: Force vector_storage internal storage size
  test.py: Increase memory for row_cache_stress_test
  atomic_cell_hash: Add specialization for atomic_cell_or_collection
  query-result: Use digester instead of md5_hasher
  range_tombstone: Replace feed_hash() member function with appending_hash
  ...
2018-02-08 18:24:58 +02:00
Raphael S. Carvalho
b78881c0e9 database: split column_family::rebuild_sstable_list
The motivation is that resharding will not want the code that is
specific to regular compaction after atomic deletion is removed.
Resharding will eventually only need to replace old tables with
new ones, and it will be in charge of deletion of old tables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-02-07 22:18:18 -02:00
Glauber Costa
4272279bbb controllers: unify the I/O and CPU controllers
We have had so far an I/O controller, for compactions and memtables, and
a CPU controller, for memtables only -- since the scheduling was still
quota-based.

Now that the CPU scheduler is fully functional, it is time to do away
with the differences and integrate them both into one.  We now have a
memtable controller and a compaction controller, and they control both
CPU and I/O.

In the future, we may want to control processes that don't do one of
them, like cache updates. If that ever happens, we'll try to make
controlling one of them optional. But for now, since the I/O and CPU
controllers for our main two processes would look exactly the same we
should integrate them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:30 -05:00
Avi Kivity
ce94e6deb7 database: place data_query execution stage into scheduling_group
Because execution stages defer and batch processing of the function
they run, they escape their fiber's context and therefore the
scheduling group.

Fix (for data_query) by initializing the execution_stage with the
query scheduling_group. To do that we have to move the execution
stage into the database object, so it has access to the scheduling
group during initialization.
2018-02-07 17:19:29 -05:00
Glauber Costa
956af9f099 database, main: set up scheduling_groups for our main tasks
Set up scheduling groups for streaming, compaction, memtable flush, query,
and commitlog.

The background writer scheduling group is retired; it is split into
the memtable flush and compaction groups.

Comments from Glauber:

This patch is based in a patch from Avi with the same subject, but the
differences are signficant enough so that I reset authorship. In
particular:

1) A bug/regression is fixed with the boundary calculations for the
   memtable controller sampling function.
2) A leftover is removed, where after flushing a memtable we would
   go back to the main group before going to the cache group again
3) As per Tomek's suggestion, now the submission of compactions
   themselves are run in the compaction scheduling group. Having that
   working is what changes this patch the most: we now store the
   scheduling group in the compaction manager and let the compaction
   manager itself enforce the scheduling group.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Avi Kivity
641aaba12c database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initializtion defers), we
create the scheduling_groups in main.cc and propagate them to users via
a new class database_config.

The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.

Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher ranges.
2018-02-07 17:19:29 -05:00
Duarte Nunes
6b4b429883 query-result: Introduce class result_options
Introduce class result_options to carry result options through the
request pipeline, which at this point mean the result type and the
digest algorithm. This class allows us to encapsulate the concrete
digest algorithm to use.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Nadav Har'El
2ea1922a4d Materialized views: serialize read-modify-update of base table
Before this patch, our Materialized Views implementation can produce
incorrect results when given concurrent updates of the same base-table
row. Such concurrent updates may result, in certain cases, in two
different rows added to the view table, instead of just one with the latest
data. In this patch we we add locking which serializes the two conflicting
updates, and solves this problem. The locking for a single base-table
column_family is implemented by the row_locker class introduced in a
previous patch.

A long comment in the code of this patch explains in more detail why
this locking is needed, when, and what types of locks are needed: We
sometimes need to lock a single clustering row, sometimes an entire
partition, sometimes an exclusive lock and sometimes a shared lock.

Fixes #3168

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2018-01-30 16:21:43 +02:00
Botond Dénes
b7d902a9e9 database: remove unused concurrency config members
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b257c7e9d403c55aaec34fc48863c18f9c9ae11a.1517314398.git.bdenes@scylladb.com>
2018-01-30 14:21:25 +02:00
Glauber Costa
0c00667206 streaming big: keep write_monitor alive until the end of flush
After the new compaction controller code, the monitor has to be kept
alive until the sstable is added to the SSTable set.

This is correctly handled for all the writers, except the streaming big.
That flusher is a big confusing, as it builds an sstable list first and
only later adds the elements in the list to the sstable set. The
monitors are destroyed at the end of phase 1, so we will SIGSEGV later
when calling add_sstable().

The fix for this is to make sure the lifetime of the monitors are tied
to the lifetime of the sstables being handled big the big streaming
flush process.

Caught by dtests, update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test

Fixes #3131
Tests: update_cluster_layout_tests.py:TestUpdateClusterLayout.add_node_with_large_partition3_test now passes.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180118202230.17107-1-glauber@scylladb.com>
2018-01-21 14:09:43 +02:00
Piotr Jastrzebski
4c74b8c7e7 Migrate materalized views to flat_mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-18 07:32:35 +01:00
Glauber Costa
08a0c3714c allow request-specific read timeouts in storage proxy reads
Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
wan't no timeout there at all.

We already apply such configuration for requests waiting in the queued
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.

This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.

In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
the the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.

We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.

After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.

Fixes #2462

Implementation notes, and general comments about open discussion in 2462:

* Because we are not bypassing the timeout, just setting it high enough,
  I consider the concerns about the batchlog moot: if we fail for any
  other reason that will be propagated. Last case, because the timeout
  is per-CF, we could do what we do for the dirty memory manager and
  move the batchlog alone to use a different timeout setting.

* Storage proxy likes specifying its timeouts as a time_point, whereas
  when we get low enough as to deal with the read_concurrency_config,
  we are talking about deltas. So at some point we need to convert time_points
  to durations. We do that in the database query functions.

v2:
- use per-request instead of per-table timeouts.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-12 07:43:21 -05:00
Glauber Costa
3c9eeea4cf restricted_mutation_reader: don't pass timeouts through the config structure
This patch enables passing a timeout to the restricted_mutation_reader
through the read path interface -- using fill_buffer and friends. This
will serve as a basis for having per-timeout requests.

The config structure still has a timeout, but that is so far only used
to actually pass the value to the query interface. Once that starts
coming from the storage proxy layer (next patch) we will remove.

The query callers are patched so that we pass the timeout down. We patch
the callers in database.cc, but leave the streaming ones alone. That can
be safely done because the default for the query path is now no_timeout,
and that is what the streaming code wants. So there is no need to
complicate the interface to allow for passing a timeout that we intend
to disable.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-12 07:43:21 -05:00
Glauber Costa
80c4a211d8 consolidate timeout_clock
At the moment, various different subsystems use their different
ideas of what a timeout_clock is. This makes it a bit harder to pass
timeouts between them because although most are actually a lowres_clock,
that is not guaranteed to be the case. As a matter of fact, the timeout
for restricted reads is expressed as nanoseconds, which is not a valid
duration in the lowres_clock.

As a first step towards fixing this, we'll consolidate all of the
existing timeout_clocks in one, now called db::timeout_clock. Other
things that tend to be expressed in terms of that clock--like the fact
that the maximum time_point means no timeout and a semaphore that
wait()s with that resolution are also moved to the common header.

In the upcoming patch we will fix the restricted reader timeouts to
be expressed in terms of the new timeout_clock.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Glauber Costa
40c428dc19 database: delete unused function
no in-tree users.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Glauber Costa
4f1b875784 database: add a controller for I/O on memtable flushes.
The algorithm and principle of operation is the same as the CPU
controller. It is, however, always enabled and we will operate on
I/O shares.

I/O-bound workloads are expected to hit the maximum once virtual
dirty fills up and stay there while the load is steady.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:58:57 -05:00
Glauber Costa
244c564aac compaction: adjust shares for compactions
Compactions can be a heavy disk user and the I/O scheduler can always
guarantee that it uses its fair share of disk.

Such fair share can, however, be a lot more than what compaction indeed
need. This patch draws on the controllers infrastructure to adjust the
I/O shares that the compaction class will get so that compaction
bandwidth is dynamically adjusted.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:58:57 -05:00
Glauber Costa
1671d9c433 factor out some of the controller code
The control algorithm we are using for memtables have proven itself
quite successful. We will very likely use the same for other processes,
like compactions.

Make the code a bit more generic, so that a new controller has to only
set the desired parameters

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:56:54 -05:00
Glauber Costa
3bd6bceaf0 sstables: add read_monitor_generator
Passing the read monitor down to the sstable readers is tricky. The
point of interest - like compaction - are usually very far from the
interfaces that register the monitor, like read_rows. Between the two,
there is usually a mutation_reader, which is and ought to be totally
unaware of the read monitor: technically, a mutation_reader may not even
know it is backed by sstables.

The solution is to create a read_monitor_generator, that can be passed
from the upper layers, like compaction, to the layers that are actually
making the decision of which sstables to create readers for.

Note that we don't need an equivalent piece of infrastructure for
writes, because writes don't happen through hidden layers and have all
the information they need to initialize their monitors.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-02 18:43:07 -05:00
Paweł Dziepak
24026a0c7d db: fully convert incremental_reader_selector to flat readers 2017-12-13 12:01:03 +00:00
Paweł Dziepak
73b3d02cc0 db: make make_range_sstable_reader() return flat reader 2017-12-13 12:01:03 +00:00
Paweł Dziepak
8b3c3fc832 db: make column_family::make_reader() return flat reader 2017-12-13 12:01:03 +00:00
Paweł Dziepak
e12959616c db: make column_family::make_sstable_reader() return a flat reader 2017-12-13 12:01:03 +00:00