scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-27 11:55:15 +00:00

Author	SHA1	Message	Date
Raphael S. Carvalho	34ed930aa4	sstables: fix lack of accuracy in disk usage report To report disk usage, scylla was only taking into account the size of the sstable data component. Other components such as index and filter may be relatively big too. Therefore, 'nodetool status' would report an inaccurate disk usage. That can be fixed by taking into account the size of all sstable components. Fixes #943. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <08453585223570006ac4d25fe5fb909ad6c140a5.1456762244.git.raphaelsc@scylladb.com>	2016-03-01 08:58:42 +02:00
Raphael S. Carvalho	fc4cbcde72	Revert "Revert "database: Fix use and assumptions about pending compations"" This reverts commit `a4d92750eb`. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <8a405e7c1daf94c4d70d8084f59ce7205d56fe52.1456415398.git.raphaelsc@scylladb.com>	2016-02-25 18:02:01 +02:00
Pekka Enberg	a4d92750eb	Revert "database: Fix use and assumptions about pending compations" This reverts commit `9586793c70`. It breaks sstable_test as follows: [penberg@nero scylla]$ build/release/tests/sstable_test --smp 1 Running 81 test cases... INFO [shard 0] compaction_manager - Asked to stop INFO [shard 0] compaction_manager - Stopped sstable_test: database.cc:878: future<> column_family::run_compaction(sstables::compaction_descriptor): Assertion `_stats.pending_compactions > 0' failed. unknown location(0): fatal error in "compaction_manager_test": signal: SIGABRT (application abort requested) tests/sstable_datafile_test.cc(1023): last checkpoint	2016-02-25 15:28:06 +02:00
Calle Wilund	9586793c70	database: Fix use and assumptions about pending compations Fixes #934 - faulty assert in discard_sstables run_with_compaction_disabled clears out a CF from compaction manager queue. discard_sstables wants to assert on this, but looks at the wrong counters. pending_compactions is an indicator of how much interested parties want a CF compacted (again and again). It should not be considered an indicator of compactions actually being done. This modifies the usage slightly so that: 1.) The counter is always incremented, even if compaction is disallowed. The counter's value at the end of run_with_compaction_disabled is then instead used as an indicator as to whether a compaction should be re-triggered. (If compactions finished, it will be zero) 2.) Document the use and purpose of the pending counter, and add method to re-add CF to compaction for r_w_c_d above. 3.) discard_sstables now asserts on the right things. Message-Id: <1456332824-23349-1-git-send-email-calle@scylladb.com>	2016-02-25 08:57:04 +02:00
Calle Wilund	590ec1674b	truncate: Require timestamp join-function to ensure equal values Fixes #937 In fixing #884, truncation not truncating memtables properly, time stamping in truncate was made shard-local. This however breaks the snapshot logic, since for all shards in a truncate, the sstables should snapshot to the same location. This patch adds a required function argument to truncate (and by extension drop_column_family) that produces a time stamp in a "join" fashion (i.e. same on all shards), and utilizes the joinpoint type in caller to do so. Message-Id: <1456332856-23395-2-git-send-email-calle@scylladb.com>	2016-02-24 18:59:31 +02:00
Raphael S. Carvalho	59bbe98c21	sstables: keep track of compacting sstables in compacton manager itself Avi says: "Something like unordered_set<unsigned long> is error prone, because ints tend to mix up (also, need to use a sized type, unsigned long varies among machines)." With that in mind, it's better if we keep track of compacting sstables in a unordered_set<shared_sstable>. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <249f0fd4cfcf786cf3c37a79978f7743d07f48ad.1455120811.git.raphaelsc@scylladb.com>	2016-02-15 18:35:43 +02:00
Calle Wilund	18203a4244	database::truncate/drop: Move time stamp generation to shard Fixes #884 Time stamps for truncation must be generated after flush, either by splitting the truncate into two (or more) for-each-shard operations, or simply by doing time stamping per shard (this solution). We generate TS on each shard after flushing, and then rely on the actual stored value to be the highest time point generated. This should however, from batch replay point of view, be functionally equivalent. And not a problem.	2016-02-09 15:45:37 +00:00
Gleb Natapov	a9e4afd8d2	Drop query-result.hh from database.hh It is not needed there but causes a lot of recompilation when changed. Message-Id: <1454496142-14537-3-git-send-email-gleb@scylladb.com>	2016-02-04 13:22:27 +02:00
Tomasz Grabiec	9fa62af96b	database: Move implementation to .cc Message-Id: <1453980679-27226-1-git-send-email-tgrabiec@scylladb.com>	2016-01-28 13:35:33 +02:00
Glauber Costa	f6cfb04d61	add a priority class to mutation readers SSTables already have a priority argument wired to their read path. However, most of our reads do not call that interface directly, but employ the services of a mutation reader instead. Some of those readers will be used to read through a mutation_source, and those have to patched as well. Right now, whenever we need to pass a class, we pass Seastar's default priority class. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-01-25 15:20:38 -05:00
Raphael S. Carvalho	2164aa8d5b	move compaction manager from /utils to /sstables Compaction manager was initially created at utils because it was more generic, and wasn't only intended for compaction. It was more like a task handler based on futures, but now it's only intended to manage compaction tasks, and thus should be moved elsewhere. /sstables is where compaction code is located. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-01-21 15:23:05 -02:00
Raphael S. Carvalho	a5c90194f5	db: add support to clean up a column family Cleanup is a procedure that will discard irrelevant keys from all sstables of a column family, thus saving disk space. Scylla will clean up a sstable by using compaction code, in which this sstable will be the only input used. Compaction manager was changed to become aware of cleanup, such that it will be able to schedule cleanup requests and also know how to handle them properly. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-01-12 03:53:04 -02:00
Raphael S. Carvalho	d44a5d1e94	compaction: filter out compacting sstables The implementation is about storing generation of compacting sstables in an unordered set per column family, so before strategy is called, compaction manager will filter out compacting sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-01-12 01:18:29 -02:00
Raphael S. Carvalho	9c13c1c738	compaction: move compaction execution from strategy to manager Currently, compaction strategy is responsible for both getting the sstables selected for compaction and running compaction. Moving the code that runs compaction from strategy to manager is a big improvement, which will also make it possible for the compaction manager to keep track of which sstables are being compacted at a given moment. This change will also be needed for cleanup and concurrent compaction on the same column family. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-01-12 00:04:27 -02:00
Raphael S. Carvalho	5c674091dc	db: move code that rebuilds sstable list to a function That code will be used by column family cleanup, so let's put that code into a function. This change also improves the code readability. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-01-11 19:51:04 -02:00
Raphael S. Carvalho	58189dd489	db: move generation calculation code to a function Code that calculates generation should be put in a function. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-01-11 19:51:02 -02:00
Tomasz Grabiec	8deb3f18d3	query_processor: Invalidate prepared statements when columns change Replicates https://issues.apache.org/jira/browse/CASSANDRA-7910 : "Prepare a statement with a wildcard in the select clause. 2. Alter the table - add a column 3. execute the prepared statement Expected result - get all the columns including the new column Actual result - get the columns except the new column"	2016-01-11 10:34:55 +01:00
Tomasz Grabiec	40858612e5	db: Make column_family::schema() return const& to avoid copy	2016-01-11 10:34:54 +01:00
Tomasz Grabiec	8164902c84	schema_tables: Change column_family schema on schema sync Notifications are not implemented yet.	2016-01-11 10:34:52 +01:00
Tomasz Grabiec	d81a46d7b5	column_family: Add schema setters There is one current schema for given column_family. Entries in memtables and cache can be at any of the previous schemas, but they're always upgraded to current schema on access.	2016-01-11 10:34:52 +01:00
Tomasz Grabiec	4e5a52d6fa	db: Make read interface schema version aware The intent is to make data returned by queries always conform to a single schema version, which is requested by the client. For CQL queries, for example, we want to use the same schema which was used to compile the query. The other node expects to receive data conforming to the requested schema. Interface on shard level accepts schema_ptr, across nodes we use table_schema_version UUID. To transfer schema_ptr across shards, we use global_schema_ptr. Because schema is identified with UUID across nodes, requestors must be prepared for being queried for the definition of the schema. They must hold a live schema_ptr around the request. This guarantees that schema_registry will always know about the requested version. This is not an issue because for queries the requestor needs to hold on to the schema anyway to be able to interpret the results. But care must be taken to always use the same schema version for making the request and parsing the results. Schema requesting across nodes is currently stubbed (throws runtime exception).	2016-01-11 10:34:52 +01:00
Tomasz Grabiec	036974e19b	Make mutation interfaces support multiple versions Schema is tracked in memtable and cache per-entry. Entries are upgraded lazily on access. Incoming mutations are upgraded to table's current schema on given shard. Mutating nodes need to keep schema_ptr alive in case schema version is requested by target node.	2016-01-11 10:34:51 +01:00
Vlad Zolotarov	07f8549683	database: filter out a manifest.json files Filter out manifest.json files when reading sstables during bootup and when loading new sstables ('nodetool refresh'). Fixes issue #529 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Message-Id: <1451911734-26511-3-git-send-email-vladz@cloudius-systems.com>	2016-01-07 15:56:02 +02:00
Vlad Zolotarov	d5920705b8	service::storage_service: move clear_snapshot() code to 'database' class service::storage_service::clear_snapshot() was built around _db.local() calls so it makes more sense to move its code into the 'database' class instead of calling _db.local().bla_bla() all the time. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-01-03 14:22:17 +02:00
Avi Kivity	827a4d0010	Merge "streaming: Invalidate cache upon receiving of stream" from Asias "When a node gains or regains responsibility for certain token ranges, streaming will be performed; upon receipt of the stream data, the row cache is invalidated for that range. Refs #484."	2015-12-28 10:24:46 +02:00
Avi Kivity	f3980f1fad	Merge seastar upstream * seastar 51154f7...8b2171e (9): > memcached: avoid a collision of an expiration with time_point(-1). > tutorial: minor spelling corrections etc. > tutorial: expand semaphores section > Merge "Use steady_clock where monotonic clock is required" from Vlad > Merge "TLS fixes + RPC adaption" from Calle > do_with() optimization > tutorial: explain limiting parallelism using semaphores > submit_io: change pending flushes criteria > apps: remove defunct apps/seastar Adjust code to use steady_clock instead of high_resolution_clock.	2015-12-27 14:40:20 +02:00
Asias He	c25393a3f6	database: Add non-const version of get_row_cache We need this to invalidate row cache of a column family.	2015-12-21 14:42:47 +08:00
Paweł Dziepak	25d255390e	database: add non-const getter for compaction_manager Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2015-12-17 14:06:41 +01:00
Raphael S. Carvalho	a26fb15d1a	db: add method to get compaction manager from cf Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2015-12-15 09:50:20 -02:00
Amnon Heiman	7e79d35f85	Estimated histogram: Clean the add interface The add interface of the estimated histogram is confusing as it is not clear what units are used. This patch removes the general add method and replaces it with an add_nano that adds nanoseconds or an add that takes a duration. To be compatible with origin, nanosecond values are translated to microseconds.	2015-12-01 15:28:06 +02:00
Asias He	aa2b11f21b	database: Move is_replacing and get_replace_address to database class So they can be used outside storage_service.	2015-11-30 09:15:42 +08:00
Glauber Costa	fa1ae45218	database: export collectd metrics about the state of memtable flushing When analyzing a recent performance issue, I found it helpful to keep track of the number of memtables that are currently in flight, as well as how much memory they are consuming in the system. Although those are memtable statistics, I am grouping them under the "cf_stats" structure: the column family being a central piece of the puzzle, it is reasonable to assume that a lot of metrics about it would be potentially welcome in the future. Note that we don't want to reuse the "stats" structure in the column family: for one, the fields do not always map precisely (pending flushes, for instance, only tracks explicit flushes), and also the stats structure is a lot more complex than we need. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-11-12 20:17:22 +02:00
Asias He	20ecb0bede	database: Introduce get_initial_tokens Get initial tokens specified by the initial_token in scylla.conf. E.g., --initial-token "-1112521204969569328,1117992399013959838" --initial-token "1117992399013959838" It can be multiple tokens split by comma.	2015-11-04 10:40:12 +08:00
Avi Kivity	f7087da054	Merge "GET methods for snapshots" from Glauber "The snapshots API needs to expose GET methods so people can query information on them. Now that taking snapshots is supported, this relatively simple series implements get_snapshot_details, a column family method, and wires that up through the storage_service."	2015-10-22 15:23:45 +03:00
Avi Kivity	5f3a46eabb	Merge "load_new_sstables" from Glauber "This patchset implements load_new_sstables, allowing one to move tables inside the data directory of a CF, and then call "nodetool refresh" to start using them. Keep in mind that for Cassandra, this is deemed an unsafe operation: https://issues.apache.org/jira/browse/CASSANDRA-6245 It is still for us something we should not recommend - unless the CF is totally empty and not yet used, but we can do a much better job in the safety front. To guarantee that, the process works in four steps: 1) All writes to this specific column family are disabled. This is a horrible thing to do, because dirty memory can grow much more than desired during this. Throughout this implementation, we will try to keep the time during which the writes are disabled to its bare minimum. While disabling the writes, each shard will tell us about the highest generation number it has seen. 2) We will scan all tables that we haven't seen before. Those are any tables found in the CF datadir, that are higher than the highest generation number seen so far. We will link them to new generation numbers that are sequential to the ones we have so far, and end up with a new generation number that is returned to the next step. 3) The generation number computed in the previous step is now propagated to all CFs, which guarantees that all further writes will pick generation numbers that won't conflict with the existing tables. Right after doing that, the writes are resumed. 4) The tables we found in step 2 are passed on to each of the CFs. They can now load those tables while operations to the CF proceed normally."	2015-10-22 13:42:24 +03:00
Amnon Heiman	c130381284	Adding live_scanned and tombstone scaned histogram to column family This series adds a histogram to the column family for live scanned and tombstone scanned. It exposes those histograms via the API instead of the stub implementation that currently exists. The implementation of the histogram update will be added in a different series. Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>	2015-10-22 11:13:28 +03:00
Glauber Costa	36cea4313e	column family: load new sstables CF-level code to load new SSTables. There isn't really a lot of complication here. We don't even need to repopulate the entire SSTable directory: by requiring that the external service that is coordinating this tell us explicitly about the new SSTables found in the scan process, we can just load them specifically and add them to the SSTable map. All new tables will start their lives as shared tables, and will be unshared if it is possible to do so: this all happens inside add_sstable and there isn't really anything special in this front. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 18:06:22 +02:00
Glauber Costa	61be9fb02d	reshuffle tables: mechanism to adjust new sstables' generation number Before loading new SSTables into the node, we need to make sure that their generation numbers are sequential (at least if we want to follow Cassandra's footsteps here). Note that this is unsafe by design. More information can be found at: https://issues.apache.org/jira/browse/CASSANDRA-6245 However, we can already do slightly better in two ways: Unlike Cassandra, this method takes as a parameter a generation number. We will not touch tables that are before that number at all. That number must be calculated from all shards as the highest generation number they have seen themselves. Calling load_new_sstables in the absence of new tables will therefore do nothing, and will be completely safe. It will also return the highest generation number found after the reshuffling process. New writers should start writing after that. Therefore, new tables that are created will have a generation number that is higher than any of this, and will therefore be safe. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 18:06:22 +02:00
Glauber Costa	1351c1cc13	database: mechanism to stop writing sstables During certain operations we need to stop writing SSTables. This is needed when we want to load new SSTables into the system. They will have to be scanned by all shards, agreed upon, and in most cases even renamed. Letting SSTables be written at that point makes it inherently racy - especially with the rename. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 18:06:22 +02:00
Glauber Costa	29e2ad7fd8	column family: commonize code to calculate the desired SSTable generation We will reuse this for load_new_sstables. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 18:02:43 +02:00
Glauber Costa	f3bad2032d	database: fix type for sstable generation. Avoid using long for it, and let's use a fixed size instead. Let's do signed instead of unsigned to avoid upsetting any code that we may have converted. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 18:01:20 +02:00
Tomasz Grabiec	764d913d84	Merge branch 'pdziepak/row-cache-range-query/v4' from seastar-dev.git From Pawel: This series enables row cache to serve range queries. In order to achieve that row cache needs to know whether there are some other partitions in the specified range that are not cached and need to be read from the sstables. That information is provided by key_readers, which work very similarly to mutation_readers, but return only the decorated key of partitions in range. In case of sstables key_readers is implemented to use partition index. Approach like this has the disadvantage of needing to access the disk even if all partitions in the range are cached. There are (at least) two ways of dealing with that problem: - cache partition index - that will also help in all other places where it is needed - add a flag to cache_entry which, when set, indicates that the immediate successor of the partition is also in the cache. Such flag would be set by mutation reader and cleared during eviction. It will also allow newly created mutations from memtable to be moved to cache provided that both their successors and predecessors are already there. The key_reader part of this patchset adds a lot of new code that probably won't be used in any other place, but the alternative would be to always interleave reads from cache with reads from sstables and that would be more heavy on partition index, which isn't cached. Fixes #185.	2015-10-21 15:26:45 +02:00
Glauber Costa	77513a40db	database: get_snapshot_details For each of the snapshots available, the api may query for some information: the total size on disk, and the "real" size. As far as I could understand, the real size is the size that is used by the SSTables themselves, while the total size includes also the metadata about the snapshot - like the manifest.json file. Details follow: In the original Cassandra code, total size is: long sizeOnDisk = FileUtils.folderSize(snapshot); folderSize recurses on directories, and adds file.length() on files. Again, my understanding is that file_size() would give us the same as the length() method for Java. The other value, real (or true) size is: long trueSize = getTrueAllocatedSizeIn(snapshot); getTrueAllocatedSizeIn seems to be a tree walker, whose visitor is an instance of TrueFilesSizeVisitor. What that visitor does is add up the size of the files within the tree that are "acceptable". An acceptable file is a file which: starts with the same prefix as we want (IOW, belongs to the same SSTable, we will just test that directly), and is not "alive". The alive list is just the list of all SSTables in the system that are used by the CFs. What this tries to do is to make sure that the trueSnapshotSize is just the extra space on disk used by the snapshot. Since the snapshots are links, then if a table goes away, it adds to this size. If it would be there anyway, it does not. We can do that in a lot simpler fashion: for each file, we will just look at the original CF directory, and see if we can find the file there. If we can't, then it counts towards the trueSize. Even for files that are deleted after compaction, that "eventually" works, and that simplifies the code tremendously given that we have to neither list all files in the system - as Cassandra does - nor go check other shards for liveness information - as we would have to do.
The scheme I am proposing may need some tweaks when we support multiple data directories, as the SSTables may not be directly below the snapshot level. Still, it would be trivial to inform the CF about their possible locations. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 13:48:44 +02:00
Paweł Dziepak	96a42a9c69	column_family: add sstables_as_key_source() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2015-10-20 20:27:53 +02:00
Glauber Costa	d236b01b48	snapshots: check existence of snapshots We go to the filesystem to check if the snapshot exists. This should make us robust against deletions of existing snapshots from the filesystem. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-20 15:58:26 +02:00
Glauber Costa	d3aef2c1a5	database: support clear snapshot This allows for us to delete an existing snapshot. It works at the column family level, and removing it from the list of keyspace snapshots needs to happen only when all CFs are processed. Therefore, that is provided as a separate operation. The filesystem code is a bit ugly: it can be made better by making our file lister more generic. First step would be to call it walker, not lister... For now, we'll use the fact that there are mostly two levels in the snapshot hierarchy to our advantage, and avoid a full recursion - using the same lambda for all calls would require us to provide a separate class to handle the state, that's part of making this generic. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-20 15:38:14 +02:00
Avi Kivity	2ccb5feabd	Merge "Support nodetool cfhistogram" "This series adds the missing estimated histogram to the column family and to the API so the nodetool cfhistogram would work."	2015-10-19 17:11:46 +03:00
Raphael S. Carvalho	35b75e9b67	adapt compaction procedure to support leveled strategy Adapt our compaction code to start writing a new sstable if the one being written reached its maximum size. Leveled strategy works with that concept. If a strategy other than leveled is being used, everything will work as before. Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>	2015-10-16 01:54:52 -03:00
Calle Wilund	012ab24469	column_family: Add flush queue object to act as ordering guarantee	2015-10-14 14:07:40 +02:00
Glauber Costa	b2fef14ada	do not calculate truncation time independently Currently, we are calculating truncated_at during truncate() independently for each shard. It will work if we're lucky, but it is fairly easy to trigger cases in which each shard will end up with a slightly different time. The main problem here, is that this time is used as the snapshot name when auto snapshots are enabled. Previous to my last fixes, this would just generate two separate directories in this case, which is wrong but not severe. But after the fix, this means that both shards will wait for one another to synchronize and this will hang the database. Fix this by making sure that the truncation time is calculated before invoke_on_all in all needed places. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-09 17:17:11 +03:00
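
The fix described in b2fef14ada can be illustrated with a small standalone sketch. It does not use Seastar or Scylla code: `shard_sim`, `snapshot_name_for`, `truncate_on_shard` and `truncate_all` are hypothetical names, and the plain loop merely stands in for `invoke_on_all`. The only point shown is that the timestamp is taken once by the caller and handed to every shard, so all shards derive the same snapshot name.

```cpp
#include <chrono>
#include <string>
#include <vector>

// Hypothetical stand-in for per-shard state; not a Scylla type.
struct shard_sim {
    std::string snapshot_name;
};

// Assumption for illustration: the snapshot name is derived from the
// truncation time in milliseconds since the epoch.
inline std::string snapshot_name_for(std::chrono::milliseconds truncated_at) {
    return std::to_string(truncated_at.count());
}

// Per-shard step: uses the timestamp it is given and never reads the clock.
inline void truncate_on_shard(shard_sim& shard, std::chrono::milliseconds truncated_at) {
    shard.snapshot_name = snapshot_name_for(truncated_at);
}

// Caller: reads the clock exactly once, then fans out to every shard.
// The loop plays the role that invoke_on_all plays in the commit message.
inline void truncate_all(std::vector<shard_sim>& shards) {
    const auto truncated_at = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::system_clock::now().time_since_epoch());
    for (auto& shard : shards) {
        truncate_on_shard(shard, truncated_at);
    }
}
```

Had each shard called the clock itself inside `truncate_on_shard`, two shards could end up with different names, which is the failure mode the commit message describes.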

324 Commits
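
The generation reshuffling described in 61be9fb02d and in step 2 of the load_new_sstables merge (5f3a46eabb) can be sketched as follows. This is a minimal model under stated assumptions, not Scylla's implementation: `reshuffle_result` and `reshuffle_generations` are hypothetical names, generations are modelled as plain `int64_t` values (a fixed-size signed type, as f3bad2032d suggests), and no files are actually linked or renamed.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical result type: which on-disk generations get which new number,
// and the highest generation in use after reshuffling.
struct reshuffle_result {
    std::map<int64_t, int64_t> renames; // old generation -> new generation
    int64_t highest_generation;
};

// Tables at or below max_seen (the highest generation any shard has seen)
// are left untouched. Newer tables are renumbered sequentially starting at
// max_seen + 1, in ascending order of their original generation.
inline reshuffle_result reshuffle_generations(std::vector<int64_t> on_disk, int64_t max_seen) {
    std::sort(on_disk.begin(), on_disk.end());
    reshuffle_result result{{}, max_seen};
    for (int64_t gen : on_disk) {
        if (gen <= max_seen) {
            continue; // already known to the shards, never touched
        }
        ++result.highest_generation;
        result.renames.emplace(gen, result.highest_generation);
    }
    return result;
}
```

With no new tables the function returns an empty rename map and `max_seen` unchanged, matching the commit's statement that the call then does nothing. New writers would pick generations above `highest_generation`.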