distributed_loader is a sizeable fraction of database.cc, so moving it
out reduces compile time and improves readability.
Message-Id: <20181230200926.15074-1-avi@scylladb.com>
Implementation of the nodetool toppartitions query, which samples the most frequent PKs in read/write
operations over a period of time.
Content:
- data_listener classes: mechanism that interfaces with mutation readers in database and table classes,
- toppartition_query and toppartition_data_listener classes to implement toppartition-specific query (this
interfaces with data_listeners and the REST api),
- REST api for toppartitions query.
Uses a top-k structure for handling stream summary statistics (based on the implementation in C*, see #2811).
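For reference, a minimal sketch of the Space-Saving style counting such a top-k / stream-summary structure performs (illustration only, not the actual interface used in the tree; the class and member names here are made up):

    // Illustration only: a toy Space-Saving top-k. When the table is full, the
    // smallest counter is evicted and the newcomer inherits its count, so heavy
    // hitters survive while rare keys churn through the bottom slot.
    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    class toy_top_k {
        size_t _k;
        std::map<std::string, size_t> _counts;
    public:
        explicit toy_top_k(size_t k) : _k(k) {}
        void append(const std::string& key) {
            auto it = _counts.find(key);
            if (it != _counts.end()) {
                ++it->second;
                return;
            }
            if (_counts.size() < _k) {
                _counts.emplace(key, 1);
                return;
            }
            auto min_it = std::min_element(_counts.begin(), _counts.end(),
                    [] (auto& a, auto& b) { return a.second < b.second; });
            size_t inherited = min_it->second;
            _counts.erase(min_it);
            _counts.emplace(key, inherited + 1);
        }
        std::vector<std::pair<std::string, size_t>> top(size_t n) const {
            std::vector<std::pair<std::string, size_t>> v(_counts.begin(), _counts.end());
            std::sort(v.begin(), v.end(), [] (auto& a, auto& b) { return a.second > b.second; });
            if (v.size() > n) {
                v.resize(n);
            }
            return v;
        }
    };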
What's still missing:
- JMX interface to nodetool (interface customization may be required),
- Querying #rows and #bytes (currently, only #partitions is supported).
Fixes #2811
* https://github.com/avikivity/scylla rafie_toppartitions_v7.1:
top_k: whitespace and minor fixes
top_k: map template arguments
top_k: std::list -> chunked_vector
top_k: support for appending top_k results
nodetool toppartitions: refactor table::config constructor
nodetool toppartitions: data listeners
nodetool toppartitions: add data_listeners to database/table
nodetool toppartitions: fully_qualified_cf_name
nodetool toppartitions: Toppartitions query implementation
nodetool toppartitions: Toppartitions query REST API
nodetool toppartitions: nodetool-toppartitions script
Add data_listeners member to database.
Add a data_listeners* to table::config, to be used by table methods to invoke listeners.
Install an on_read() listener in table::make_reader().
Install an on_write() listener in database::apply_in_memory().
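A minimal sketch of what such a listener hook could look like; only the on_read()/on_write() entry points come from the description above, the rest (class shapes, registration) is assumed for illustration:

    // Sketch only -- real signatures in data_listeners.hh may differ (e.g. the
    // real on_read() hooks into the mutation reader rather than being a plain callback).
    #include <vector>

    class data_listener {
    public:
        virtual ~data_listener() = default;
        virtual void on_read() = 0;   // invoked from table::make_reader()
        virtual void on_write() = 0;  // invoked from database::apply_in_memory()
    };

    class data_listeners {
        std::vector<data_listener*> _listeners;
    public:
        void install(data_listener* l) { _listeners.push_back(l); }
        void on_read() { for (auto* l : _listeners) { l->on_read(); } }
        void on_write() { for (auto* l : _listeners) { l->on_write(); } }
    };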
Tests: Unit (release)
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
Rather than forcing callers to go through get_config(), provide a
direct accessor. This reduces dependencies on config.hh, and will
allow separation of extensions from configuration.
Expose the base replica's current memory view update backlog, which is
defined in terms of units consumed from the semaphore.
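A sketch of the intended definition, assuming the backlog is reported as the fraction of a seastar::semaphore's units currently consumed (function and parameter names are illustrative):

    #include <seastar/core/semaphore.hh>
    #include <algorithm>
    #include <cstddef>

    // 0.0 means no view-update memory is in use on this shard, 1.0 means the
    // semaphore is fully consumed.
    float current_view_update_backlog(const seastar::semaphore& sem, size_t max_units) {
        size_t available = std::min(sem.current(), max_units);
        return float(max_units - available) / float(max_units);
    }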
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View building sends view updates synchronously, which has natural
backpressure. However, they:
1) contribute to the load on the view replicas, and
2) add memory pressure to the base replica.
They should thus count towards the current view update backlog, and
consume units from the view update concurrency semaphore.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We no longer wait on the semaphore and instead over-subscribe it, so
there's no reason to pass a timeout.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We arrive at an overloaded state when we fail to acquire semaphore
units in the base replica. This can mean clients are working in
interactive mode and we fail to throttle them, so we should
start shedding load. We want to avoid impacting base table
availability by running out of memory, so we could offload the memory
queue to disk by writing the view updates as hints without attempting
to send them. However, the disk is also a limited resource and in
extreme cases we won’t be able to write hints. A tension exists
between forgetting the view updates, thereby opening up a window for
inconsistencies between base and view, and failing the base replica
write. The latter can fail the whole user write, or, if the
coordinator was able to achieve CL, can instead cause inconsistencies
between base tables (we wouldn't want to store a hint, because if the
base replica is still overloaded, we would redo the whole dance).
Between the devil and the deep blue sea, we chose to forget view
updates. As a further simplification, we don't even write hints,
assuming that if clients can’t be throttled (as we'll attempt to do in
future patches), it will only be a matter of time before view updates
can’t be offloaded. We also start acquiring the semaphore units using
consume(), which is non-blocking, but allows for underflow of the
available semaphore units. This is okay, and we expect not to underflow
by much, as we stop generating new view updates.
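A sketch of the change in acquisition style (simplified; the surrounding accounting and the actual call sites are omitted):

    #include <seastar/core/semaphore.hh>
    #include <cstddef>

    // Before, we would co-operatively wait() for units, which can block and,
    // with enough concurrent waiters, deadlock. consume() never waits; it may
    // drive the available units negative, which is acceptable because an
    // overloaded shard stops generating new view updates.
    void reserve_view_update_memory(seastar::semaphore& view_update_sem, size_t update_size) {
        view_update_sem.consume(update_size);
    }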
Refs #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The semaphore currently limiting the amount of view updates a given
base replica emits aims to control the load that is imposed on the
cluster, to protect view replicas from being overloaded when there
are bursts of traffic (especially for degenerate cases like an index
with low selectivity).
100 is, however, an arbitrary number. It might allow too much load on
the view replicas, and it might also allow too much memory from the
base shard to be consumed. Conversely, it might allow for too few
updates to be queued in case of a burst, or to absorb updates while a
view replica becomes partitioned.
To deal with the load that is inflicted on the cluster, future patches
will ensure that the rate of base writes obeys the rate at which the
slowest view replica can consume the corresponding view updates.
To protect the current shard from using too much memory for this
queue, we will limit it to 10% of the shard's memory. The goal is to
both protect the shard from being overloaded and to allow it to
absorb bursts of writes resulting in large view mutations.
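A sketch of how the per-shard limit could be derived; the 10% figure comes from the text above, the helper name is an assumption:

    #include <seastar/core/memory.hh>
    #include <cstddef>

    // Cap the pending view-update queue at 10% of this shard's memory instead
    // of a fixed count of 100 in-flight updates.
    size_t max_memory_for_pending_view_updates() {
        return seastar::memory::stats().total_memory() / 10;
    }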
Refs #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
This series attempts to solve the regressions recently discovered in
performance of multi-partition range-scans. Namely that they:
* Flood the reader concurrency semaphore's queues, trampling other
reads.
* Behave very badly when too many of them are running concurrently
(thrashing).
* May deadlock if enough of them are running without a timeout.
The solution for these problems is to make inactive shard readers
evictable. This should address all three issues listed above, to varying
degrees:
* Shard readers will no longer cling to their permits for the entire
duration of the scan, which can be a long time.
* Will be less affected by infinite concurrency (more than the node can
handle), as each scan can now make progress by evicting inactive shard
readers belonging to other scans.
* Will not deadlock at all.
In addition to the above fix, this series also bundles two further
improvements:
* Add a mechanism to `reader_concurrency_semaphore` to be notified of
newly inserted evictables.
* General cleanups and fixes for `multishard_combining_reader` and
`foreign_reader`.
I can unbundle these mini-series and send them separately if the
maintainers so prefer, although considering that this series will have to
be backported to 3.0, I think the present form is better.
Fixes: #3835
"
* 'evictable-inactive-shard-readers/v7' of https://github.com/denesb/scylla: (27 commits)
tests/multishard_mutation_query_test: test stateless query too
tests/querier_cache: fail resource-based eviction test gracefully
tests/querier_cache: simplify resource-based eviction test
tests/mutation_reader_test: add test_multishard_combining_reader_next_partition
tests/mutation_reader_test: restore indentation
tests/mutation_reader_test: enrich pause-related multishard reader test
multishard_combining_reader: use pause-resume API
query::partition_slice: add clear_ranges() method
position_in_partition: add region() accessor
foreign_reader: add pause-resume API
tests/mutation_reader_test: implement the pause-resume API
query_mutations_on_all_shards(): implement pause-resume API
make_multishard_streaming_reader(): implement the pause-resume API
database: add accessors for user and streaming concurrency semaphores
reader_lifecycle_policy: extend with a pause-resume API
query_mutations_on_all_shards(): restore indentation
query_mutations_on_all_shards(): simplify the state-machine
multishard_combining_reader: use the reader lifecycle policy
multishard_combining_reader: add reader lifecycle policy
multishard_combining_reader: drop unnecessary `reader_promise` member
...
* seastar d59fcef...b924495 (2):
> build: Fix protobuf generation rules
> Merge "Restructure files" from Jesse
Includes fixup patch from Jesse:
"
Update Seastar `#include`s to reflect restructure
All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
In (almost) all SSTable write paths, we need to inform the monitor that
the write has failed as well. The monitor will remove the SSTable from
the controller's tracking at that point.
Except there is one place where we are not doing that: streaming of big
mutations. Streaming of big mutations is an interesting use case, in
which the write is done in 2 parts: if the writing of the SSTable fails right
away, then we do the correct thing.
But the SSTables are not committed at that point and the monitors are
still kept around with the SSTables until a later time, when they are
finally committed. Between those two points in time, it is possible that
the streaming code will detect a failure and manually call
fail_streaming_mutations(), which marks the SSTable for deletion. At
that point we should propagate that information to the monitor as well,
but we don't.
Fixes #3732 (hopefully)
Tests: unit (release)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181114213618.16789-1-glauber@scylladb.com>
This method can be used to check whether an sstable is staging,
i.e. that it shouldn't be compacted and will not be used
for generating view updates from other staging tables,
and to return the proper shared_sstable pointer if it is.
When generating view updates from a staging sstable, this sstable
should not be used in the process. Hence, a reader that skips a single
sstable is added.
After materialized view updates are generated, the sstable
should be moved from staging/ to a regular directory.
It's expected to be called from a seastar::async thread context.
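A sketch of that calling convention; the move operation itself is passed in as a callable because the real sstable API is not spelled out here:

    #include <seastar/core/future.hh>
    #include <seastar/core/thread.hh>
    #include <utility>

    // Inside seastar::async we run in a seastar thread, so blocking-style
    // .get() calls are allowed while the files are moved out of staging/.
    template <typename MoveFn>
    seastar::future<> move_sstable_out_of_staging(MoveFn move_fn) {
        return seastar::async([move_fn = std::move(move_fn)] () mutable {
            move_fn().get();
        });
    }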
The single-range overload, when used by
make_multishard_streaming_reader(), has to create a reader that is
forwardable. Otherwise the multishard streaming reader will not produce
any output as it cannot fast-forward its shard readers to the ranges
produced by the generator.
Also add a unit test that is based on the real-life purpose the
multishard streaming reader was designed for: serving partitions
from a shard according to a sharding configuration that is different
from the local one. This is also the scenario that found the bug in the
first place.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <bf799961bfd535882ede6a54cd6c4b6f92e4e1c1.1539235034.git.bdenes@scylladb.com>
This will be used by the `make_multishard_streaming_reader()` in the
next patch. This method will create a multishard combining reader which
needs its shard readers to take a single range, not a vector of ranges
like the existing overload.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cc6f2c9a8cf2c42696ff756ed6cb7949b95fe986.1538470782.git.bdenes@scylladb.com>
The mc SSTable format doesn't write ancestors to its metadata, so resharding
will not work with this new format because it relies on ancestors to
replace new unshared sstables with old shared ones.
The fix is to not rely on the ancestors metadata for this operation.
Fixes #3777.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180922211933.1987-1-raphaelsc@scylladb.com>
Add badness counters that allow tracking problems. The following
counters are added:
1) multishard_query_unpopped_fragments
2) multishard_query_unpopped_bytes
3) multishard_query_failed_reader_stops
4) multishard_query_failed_reader_saves
The first pair of counters observes the amount of work range scan queries
have to undo on each page. It is normal for these counters to be
non-zero; however, sudden spikes in their values can indicate problems.
This undoing of work is needed for stateful range-scans to work.
When stateful queries are enabled, the `multishard_combining_reader` is
dismantled and all unconsumed fragments in its and any of its
intermediate readers' buffers are pushed back into the originating shard
reader's buffer (via `unpop_mutation_fragment()`). This also includes
the `partition_start`, the `static_row` (if there is one) and all
extracted and active `range_tombstone` fragments. Together this can
amount to a substantial number of fragments.
(1) counts the number of fragments moved back, while (2) counts the
number of bytes. Monitoring size and quantity separately allows for
detecting edge cases like moving many small fragments or just a few huge
ones. The counters count the fragments/bytes moved back to readers
located on the shard they belong to.
The second pair of counters is added to detect any problems around
saving readers. Since the failure to save a reader will not fail the
read itself, it is necessary to add visibility to these failures by
other means.
(3) counts the number of times stopping a shard reader (waiting
on pending read-aheads and next-partitions) failed, while (4)
counts the number of times inserting the reader into the `querier_cache`
failed.
Unlike the first two counters, which will almost certainly never be
zero, these latter two counters should always be zero. Any other value
indicates problems in the respective shards/nodes.
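A sketch of how such counters are typically registered with seastar's metrics; the stats struct and the metric group name are assumptions, only the counter names come from the list above:

    #include <seastar/core/metrics.hh>
    #include <cstdint>

    struct multishard_query_stats {
        uint64_t unpopped_fragments = 0;
        uint64_t unpopped_bytes = 0;
        uint64_t failed_reader_stops = 0;
        uint64_t failed_reader_saves = 0;
    };

    void register_multishard_query_counters(seastar::metrics::metric_groups& metrics,
                                            multishard_query_stats& s) {
        namespace sm = seastar::metrics;
        metrics.add_group("database", {
            sm::make_counter("multishard_query_unpopped_fragments", s.unpopped_fragments,
                    sm::description("Fragments pushed back into shard reader buffers")),
            sm::make_counter("multishard_query_unpopped_bytes", s.unpopped_bytes,
                    sm::description("Bytes pushed back into shard reader buffers")),
            sm::make_counter("multishard_query_failed_reader_stops", s.failed_reader_stops,
                    sm::description("Failed attempts to stop a shard reader cleanly")),
            sm::make_counter("multishard_query_failed_reader_saves", s.failed_reader_saves,
                    sm::description("Failed attempts to save a shard reader into the querier cache")),
        });
    }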
This method allows for querying a range or ranges on all shards of the
node. Under the hood it uses the multishard_combining_reader for
executing the query.
It supports paging and stateful queries (saving and reusing the readers
between pages). All this is transparent to the client, who only needs to
supply the same query::read_command::query_uuid through the pages of the
query (and supply correct start positions on each page, matching the
stop position of the previous page).
Like the two preceding patches, convert the mutation apply stage
to an inheriting_concrete_execution_stage. This change has two
added benefits: we get rid of a thread_local, and we drop a
with_scheduling_group() inside an execution stage which just creates a bunch
of continuations and somewhat undoes the benefit of the execution stage.
Now (8c993e0728) that replica-side operations run under the correct
scheduling group, we can inherit the scheduling_group for _data_query_stage
from the caller. By itself this doesn't do much, but it will later allow us
to have multiple groups for statement executions.
"
While Cassandra supports multiple data directories, we have
historically supported just one. The one-directory model suits us
better because of the I/O Scheduler, and so far we have seen very few
requests -- if any -- to support this.
Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.
For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.
In this design:
- We scan all data directories for existing data.
- Resharding only happens within a particular data directory.
- Snapshot details are accumulated with data for all directories that
host snapshots for the tables we are examining.
- Snapshots are created with files in their own directories, but the
manifest file goes to the main directory. For this one, note that in
Cassandra the same thing happens, except that there is no "main"
directory; still, the manifest file is just in one of them.
- SSTables are flushed into the main directory.
- Compactions write data into the main directory.
Despite the restrictions, one example use of this is recovery. If
we have network-attached devices, for instance, we can quickly attach a
network device to an existing node and make the data immediately
available as it is compacted back to main storage.
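A sketch of the "first directory is the main one" convention described above (names are illustrative):

    #include <string>
    #include <vector>

    // All configured directories are scanned for existing data, but new
    // sstables -- flushes and compaction output -- go to the main, i.e. first,
    // directory.
    const std::string& main_data_directory(const std::vector<std::string>& data_file_directories) {
        return data_file_directories.front();
    }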
Tests: unit (release)
"
* 'multi-data-file-v2' of github.com:glommer/scylla:
database: change ident
database: support multiple data directories
database: allow resharing to specify a directory
database: support multiple directories in get_snapshot_details
database: move get_snapshot_info into a seastar::thread
snapshots: always create the snapshot directory
sstables: pass sstable dir with entry descriptor
database: make nodetool listsnapshots print correct information
sstables: correctly create descriptors for snapshots
Since we can write mutations to sstables directly in streaming, we need
to add those sstables to the system so they can be seen by queries.
We also need to update the cache so queries reflect the latest data.
This will be used to create an sstable for the streaming receiver to write the
mutations received from the network to an sstable file instead of writing to a
memtable.
While Cassandra supports multiple data directories, we have
historically supported just one. The one-directory model suits us
better because of the I/O Scheduler, and so far we have seen very few
requests -- if any -- to support this.
Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.
For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.
In this design:
- We scan all data directories for existing data.
- Resharding only happens within a particular data directory.
- Snapshot details are accumulated with data for all directories that
host snapshots for the tables we are examining.
- Snapshots are created with files in their own directories, but the
manifest file goes to the main directory. For this one, note that in
Cassandra the same thing happens, except that there is no "main"
directory; still, the manifest file is just in one of them.
- SSTables are flushed into the main directory.
- Compactions write data into the main directory.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
Partition snapshots go away when the last read using the snapshot is done.
Currently we will synchronously attempt to merge partition versions on this event.
If partitions are large, that may stall the reactor for a significant amount of time,
depending on the size of newer versions. Cache update on memtable flush can
create especially large versions.
The solution implemented in this series is to allow merging to be preemptable,
and continue in the background. Background merging is done by the mutation_cleaner
associated with the container (memtable, cache). There is a single merging process
per mutation_cleaner. The merging worker runs in a separate scheduling group,
introduced here, called "mem_compaction".
When the last user of a snapshot goes away, the snapshot is slid to the
oldest unreferenced version first so that the version is no longer reachable
from partition_entry::read(). The cleaner will then keep merging preceding
(newer) versions into it, until it merges a version which is referenced. The
merging is preemptable. If the initial merging is preempted, the snapshot is
enqueued into the cleaner, the worker is woken up, and merging continues
asynchronously.
When a memtable is merged with the cache, its cleaner is merged with the cache's cleaner,
so any outstanding background merges will be continued by the cache's cleaner
without disruption.
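A sketch of the preemption pattern; the row-merging helper below is hypothetical and stands in for the real mutation_partition merging code:

    #include <seastar/core/future.hh>
    #include <seastar/core/preempt.hh>

    // Hypothetical stand-in for the newer partition version being drained.
    struct pending_rows {
        int remaining = 0;
        bool empty() const { return remaining == 0; }
        void merge_one_row_into_older_version() { --remaining; }
    };

    // Merge rows until the scheduler wants the CPU back. Returning
    // stop_iteration::no means the mutation_cleaner resumes the merge later.
    seastar::stop_iteration merge_some(pending_rows& newer) {
        while (!newer.empty()) {
            newer.merge_one_row_into_older_version();
            if (seastar::need_preempt()) {
                return seastar::stop_iteration::no;
            }
        }
        return seastar::stop_iteration::yes;
    }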
This reduces scheduling latency spikes in tests/perf_row_cache_update
for the case of large partition with many rows. For -c1 -m1G I saw
them dropping from >23ms to 1-2ms. System-level benchmark using scylla-bench
shows a similar improvement.
"
* tag 'tgrabiec/merge-snapshots-gradually-v4' of github.com:tgrabiec/scylla:
tests: perf_row_cache_update: Test with an active reader surviving memtable flush
memtable, cache: Run mutation_cleaner worker in its own scheduling group
mutation_cleaner: Make merge() redirect old instance to the new one
mvcc: Use RAII to ensure that partition versions are merged
mvcc: Merge partition version versions gradually in the background
mutation_partition: Make merging preemtable
tests: mvcc: Use the standard maybe_merge_versions() to merge snapshots
The worker is responsible for merging MVCC snapshots, which is similar
to merging sstables, but in memory. The new scheduling group will
therefore be called "memory compaction".
We should run it in a separate scheduling group instead of
main/memtables, so that it doesn't disrupt writes and other system
activities. It's also nice for monitoring how much CPU time we spend
on this.
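A sketch of creating such a group at startup (the shares value is an assumption):

    #include <seastar/core/future.hh>
    #include <seastar/core/scheduling.hh>

    // A dedicated group lets in-memory compaction compete for CPU like any
    // other scheduling group and show up separately in per-group CPU metrics.
    seastar::future<seastar::scheduling_group> make_memory_compaction_group() {
        return seastar::create_scheduling_group("mem_compaction", 100 /* shares, assumed */);
    }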
The row cache tracker has numerous implicit dependencies on other objects
(e.g. LSA migrators for data held by mutation_cleaner). The fact that
both the cache tracker and some of those dependencies are thread-local
objects makes it hard to guarantee correct destruction order.
Let's deglobalise the cache tracker and put it in the database class.
The name "column_family" is both awkward and obsolete. Rename to
the modern and accurate "table".
An alias is kept to avoid huge code churn.
To prevent a One Definition Rule violation, a preexisting "table"
type is moved to a new namespace row_cache_stress_test.
Tests: unit (release)
Message-Id: <20180624065238.26481-1-avi@scylladb.com>
Now that we have the controller, we would like to take min_threshold as
a hint. If there is nothing to compact, we can ignore the hint and start
compacting fewer than min_threshold SSTables so that the backlog keeps
decreasing.
But there are cases in which we don't want min_threshold to be a hint
and we want to enforce it strictly. For instance, if write amplification
is more of a concern than space amplification.
This patch adds a YAML option that allows the user to tell us that. We will
default to false, meaning min_threshold is not strictly enforced.
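A sketch of how a compaction strategy might honor the new option; the parameter names are assumptions, the real decision lives in the strategy code:

    #include <cstddef>

    // With the controller in place, min_threshold is normally just a hint: if
    // the backlog says compact, we may pick fewer than min_threshold sstables.
    // When strict enforcement is requested, undersized buckets are skipped.
    bool may_compact_bucket(size_t bucket_size, size_t min_threshold, bool enforce_min_threshold) {
        if (bucket_size >= min_threshold) {
            return true;
        }
        return !enforce_min_threshold;
    }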
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
This series is for nodetool getsstables.
This patch is based on:
8daaf9833a
With some minor adjustments because of the code change in sstables.
The idea is to allow searching for all the sstables that contain a
given key.
After this patch, if there is a table t1 in keyspace k1 with a key
called aa, then
curl -X GET "http://localhost:10000/column_family/sstables/by_key/k1%3At1?key=aa"
will return the list of sstable file names that contain that key.
"
* 'amnon/sstable_for_key_v4' of github.com:scylladb/seastar-dev:
Add the API implementation to get_sstables_by_key
api: column_family.json make the get_sstables_for_key doc clearer
column_family: Add the get_sstables_by_partition_key method
sstable test: add has_partition_key test
sstable: Add has_partition_key method
keys_test: add a test for nodetool_style string
keys: Add from_nodetool_style_string factory method
The get_sstables_by_partition_key method is used by the API to return the set of
sstable names that hold a given partition key.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
In 455d5a5 (streaming memtables: coalesce incoming writes), we
introduced the delayed flush to coalesce incoming streaming mutations
from different stream_plan.
However, most of the time there is only one stream plan at a time; the
next stream plan won't start until the previous one has finished. So the
current coalescing does not really work.
The delayed flush adds 2s of delay for each stream session. If we have lots
of tables to stream, we will waste a lot of time.
We stream a keyspace in around 10 stream plans, i.e., 10% of the ranges at a
time. If we have 5000 tables, even if the tables are almost empty, the
delay will waste 5000 * 10 * 2 = 100,000 seconds, i.e., about 28 hours.
To stream a keyspace with 4 tables, each with 1000 rows:
Before:
[shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
[shard 0] stream_session - [Stream #944373d0-5d9c-11e8-9cdb-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 125.21 KiB/s
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 8.233 seconds
After:
[shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Executing streaming plan for Bootstrap-ks-index-0 with peers={127.0.0.1}, master
[shard 0] stream_session - [Stream #e00bf6a0-5d99-11e8-a7b8-000000000000] Streaming plan for Bootstrap-ks-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1030 KiB, 4772.32 KiB/s
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks succeeded, took 0.216 seconds
Fixes #3436
Message-Id: <cb2dde263782d2a2915ddfe678c74f9637ffd65b.1526979175.git.asias@scylladb.com>
This commit clears a table's views before truncating it
in the drop_column_family function. The only case when
views are not empty during drop is when they're backing secondary
indexes of a base table and they are all atomically dropped
in the same go as the base table itself.
This change will prevent trying to truncate views that were
already dropped, which used to result in a no_such_column_family error.
References #3202