Replicates https://issues.apache.org/jira/browse/CASSANDRA-7910 :
"Prepare a statement with a wildcard in the select clause.
2. Alter the table - add a column
3. execute the prepared statement
Expected result - get all the columns including the new column
Actual result - get the columns except the new column"
There is one current schema for a given column_family. Entries in
memtables and the cache can be at any of the previous schemas, but
they're always upgraded to the current schema on access.
The intent is to make data returned by queries always conform to a
single schema version, which is requested by the client. For CQL
queries, for example, we want to use the same schema which was used to
compile the query. A remote node likewise expects to receive data
conforming to the requested schema.
The interface at the shard level accepts a schema_ptr; across nodes we
use the table_schema_version UUID. To transfer a schema_ptr across
shards, we use global_schema_ptr.
Because the schema is identified by UUID across nodes, requestors must
be prepared to be queried for the definition of that schema. They must
hold a live schema_ptr for the duration of the request. This guarantees
that the schema_registry will always know about the requested version.
This is not an issue, because for queries the requestor needs to hold
on to the schema anyway to be able to interpret the results. But care
must be taken to always use the same schema version for making the
request and parsing the results.
Schema requesting across nodes is currently stubbed (throws runtime
exception).
Schema is tracked per-entry in the memtable and cache. Entries are
upgraded lazily on access. Incoming mutations are upgraded to the
table's current schema on the given shard.
Mutating nodes need to keep a schema_ptr alive in case the schema
version is requested by the target node.
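The lazy per-entry upgrade described above can be sketched roughly as
follows (all names here - schema, entry, upgrade - are illustrative
stand-ins, not Scylla's actual types, and a plain int stands in for
the table_schema_version UUID):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Illustrative stand-in for a versioned table schema.
struct schema {
    int version;                       // stands in for table_schema_version UUID
    std::vector<std::string> columns;  // column names at this version
};

using schema_ptr = std::shared_ptr<const schema>;

// Illustrative stand-in for a memtable/cache entry.
struct entry {
    schema_ptr s;                    // schema this entry was written with
    std::vector<std::string> cells;  // one cell per column of `s`

    // Upgrade lazily: only when the entry is accessed under a newer schema.
    void upgrade(const schema_ptr& current) {
        if (s->version == current->version) {
            return; // already at the current schema
        }
        // Columns added by ALTER TABLE get empty/null cells.
        cells.resize(current->columns.size());
        s = current;
    }
};
```

With this shape, a prepared "SELECT *" compiled against the new schema
sees the added column on every entry it touches, which is exactly the
CASSANDRA-7910 behavior being replicated.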
With 10 sstables/shard and 50 shards, we get ~10*50*50 = 25,000 log
messages about sstables being ignored. That is not reasonable.
Reduce the log level to debug, and move the message to database.cc,
because at its original location, the containing function has nothing to
do with the message itself.
Reviewed-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Message-Id: <1452181687-7665-1-git-send-email-avi@scylladb.com>
We have an API that wraps open_file_dma which we use in some places, but in
many other places we call the reactor version directly.
This patch changes the latter to match the former, with the added
benefit of making future changes to these interfaces easier.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <29296e4ec6f5e84361992028fe3f27adc569f139.1451950408.git.glauber@scylladb.com>
When 'nodetool clearsnapshot' is given no parameters it should
remove all existing snapshots.
Fixes issue #639
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
service::storage_service::clear_snapshot() was built around _db.local()
calls, so it makes more sense to move its code into the 'database' class
instead of calling _db.local().bla_bla() all the time.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Actually check that a snapshot directory with a given tag
exists instead of just checking that a 'snapshot' directory
exists.
Fixes issue #689
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
The last two loops were incorrectly inside the first one. That's a
bug because a new sstable may be emplaced more than once in the
sstable list, which can cause several problems. mark_for_deletion
may also be called more than once for compacted sstables, however,
it is idempotent.
Found this issue while auditing the code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If a sstable doesn't belong to the current shard, mark_for_deletion
should still be called so that the deletion manager keeps working.
It doesn't mean that the sstable will be deleted right away, but that
the sstable is not relevant to the current shard and can thus be
deleted by the deletion manager in the future.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If there is no snapshot directory for the specific column family,
get_snapshot_details should return an empty map.
This patch checks that the directory exists before trying to iterate
over it.
Fixes #619
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The add interface of the estimated histogram is confusing, as it is
not clear what units are used.
This patch removes the general add method and replaces it with an
add_nano method that takes nanoseconds and an add method that takes a
duration.
To be compatible with origin, nanosecond values are translated to
microseconds.
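A minimal sketch of the resulting interface (the real
estimated_histogram does bucketing; here values are merely collected,
to show only the unit handling):

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Sketch of the two-method interface described above; bucketing omitted.
class estimated_histogram {
    std::vector<int64_t> _micros; // stored in microseconds, as in origin
public:
    // Explicitly named: the caller is passing nanoseconds.
    void add_nano(int64_t nanos) {
        _micros.push_back(nanos / 1000); // translate to microseconds
    }
    // Type-safe: any std::chrono duration is converted internally,
    // so the caller never has to think about units.
    template <typename Rep, typename Period>
    void add(std::chrono::duration<Rep, Period> d) {
        add_nano(std::chrono::duration_cast<std::chrono::nanoseconds>(d).count());
    }
    int64_t last() const { return _micros.back(); }
};
```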
The FIXMEs are no longer valid: we load schemas on bootstrap and don't
support hot-plugging of column families via the file system (nor does
Cassandra).
Handling of missing tables matches Cassandra 2.1: for applies we log
the error and continue, while queries propagate the error.
Currently, we only determine whether a sstable belongs to the current
shard after loading some of its components into memory. For example,
the filter may be considerably big, and its content is irrelevant to
deciding whether a sstable should be included in a given shard.
Start using the functions previously introduced to optimize the
sstable loading process. add_sstable no longer checks whether a
sstable is relevant to the current shard.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Boot may be slow because the function that loads sstables does so
serially instead of in parallel. In the callback supplied to
lister::scan_dir, let's push the future returned by probe_file (the
function that loads a sstable) into a vector of futures and wait
for all of them at the end.
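The pattern reads roughly like this (sketched here with std::async
rather than seastar futures, and probe_file is a stand-in for the real
sstable loader):

```cpp
#include <future>
#include <vector>

// Stand-in for loading one sstable; returns its generation number.
int probe_file(int generation) {
    return generation;
}

std::vector<int> load_all(const std::vector<int>& generations) {
    std::vector<std::future<int>> futures;
    for (int g : generations) {
        // Launch each load concurrently instead of awaiting it serially.
        futures.push_back(std::async(std::launch::async, probe_file, g));
    }
    std::vector<int> loaded;
    for (auto& f : futures) {
        loaded.push_back(f.get()); // wait for all of them at the end
    }
    return loaded;
}
```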
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When analyzing a recent performance issue, I found it helpful to keep
track of the number of memtables that are currently in flight, as well
as how much memory they are consuming in the system.
Although those are memtable statistics, I am grouping them under the
"cf_stats" structure: since the column family is a central piece of the
puzzle, it is reasonable to assume that a lot of metrics about it would
be welcome in the future.
Note that we don't want to reuse the "stats" structure in the column
family: for one, the fields don't always map precisely (pending
flushes, for instance, only tracks explicit flushes), and the stats
structure is also a lot more complex than we need.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Allows for having more than one clustering row range set, depending on
the PK queried (although right now this is limited to one - which
happens to be exactly the number multiplexing paging needs... What a
coincidence...)
Encapsulates the row_ranges member in a query function, and if needed holds
ranges outside the default one in an extra object.
Query result::builder::add_partition now fetches the correct row range for
the partition, and this is the range used in subsequent iteration.
Error handling in column_family::try_flush_memtable_to_sstable() is
misplaced. It happens after update_cache(), so writing the sstable may
have succeeded, but moving the memtable into the cache may have failed.
update_cache() destroys the memtable even if it fails, but the error
handler is not aware of that (it does not even distinguish whether the
error happened during sstable creation or while moving into the cache),
and when it tells the caller to retry, it retries with an already
destroyed memtable. Fix it by ignoring errors from moving to the cache.
This reverts commit fff37d15cd.
Says Tomek (and the comment in the code):
"update_cache() must be called before unlinking the memtable because cache + memtable at any time is supposed to be authoritative source of data for contained partitions. If there is a cache hit in cache, sstables won't be checked. If we unlink the memtable before cache is updated, it's possible that a query will miss data which was in that unlinked memtable, if it hits in the cache (with an old value)."
Error handling in column_family::try_flush_memtable_to_sstable() is
misplaced. It happens after update_cache(), so writing the sstable may
have succeeded, but moving the memtable into the cache may have failed.
update_cache() destroys the memtable even if it fails, but the error
handler is not aware of that (it does not even distinguish whether the
error happened during sstable creation or while moving into the cache),
and when it tells the caller to retry, it retries with an already
destroyed memtable. Fix it by ignoring errors from moving to the cache.
Get the initial tokens specified by initial_token in scylla.conf.
E.g.,
--initial-token "-1112521204969569328,1117992399013959838"
--initial-token "1117992399013959838"
Multiple tokens can be given, separated by commas.
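The comma splitting can be sketched as follows (the helper name is
illustrative, not Scylla's actual parser; real tokens would further be
validated and converted to the partitioner's token type):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split the initial_token config value on commas into token strings.
std::vector<std::string> parse_initial_tokens(const std::string& value) {
    std::vector<std::string> tokens;
    std::stringstream ss(value);
    std::string tok;
    while (std::getline(ss, tok, ',')) {
        if (!tok.empty()) { // skip empty segments from stray commas
            tokens.push_back(tok);
        }
    }
    return tokens;
}
```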
"The snapshots API need to expose GET methods so people can
query information on them. Now that taking snapshots is supported,
this relatively simple series implement get_snapshot_details, a
column family method, and wire that up through the storage_service."
"This patchset implements load_new_sstables, allowing one to move tables inside the
data directory of a CF, and then call "nodetool refresh" to start using them.
Keep in mind that for Cassandra, this is deemed an unsafe operation:
https://issues.apache.org/jira/browse/CASSANDRA-6245
It is still for us something we should not recommend - unless the CF is totally
empty and not yet used, but we can do a much better job in the safety front.
To guarantee that, the process works in four steps:
1) All writes to this specific column family are disabled. This is a horrible
thing to do, because dirty memory can grow much more than desired in the
meantime. Throughout this implementation, we will try to keep the time during
which writes are disabled to its bare minimum.
While disabling the writes, each shard will tell us about the highest generation number
it has seen.
2) We will scan for all tables that we haven't seen before. Those are any
tables found in the CF datadir whose generation is higher than the highest
generation number seen so far. We will link them to new generation numbers
that are sequential to the ones we have so far, and end up with a new
generation number that is returned to the next step.
3) The generation number computed in the previous step is now propagated to all CFs, which
guarantees that all further writes will pick generation numbers that won't conflict with
the existing tables. Right after doing that, the writes are resumed.
4) The tables we found in step 2 are passed on to each of the CFs. They can now load those
tables while operations to the CF proceed normally."
CF-level code to load new SSTables. There isn't really a lot of complication
here. We don't even need to repopulate the entire SSTable directory: by
requiring that the external service that is coordinating this tell us
explicitly about the new SSTables found in the scan process, we can just load
them specifically and add them to the SSTable map.
All new tables will start their lives as shared tables, and will be unshared
if it is possible to do so: this all happens inside add_sstable, and there
isn't really anything special on this front.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Before loading new SSTables into the node, we need to make sure that their
generation numbers are sequential (at least if we want to follow Cassandra's
footsteps here).
Note that this is unsafe by design. More information can be found at:
https://issues.apache.org/jira/browse/CASSANDRA-6245
However, we can already do slightly better in two ways:
Unlike Cassandra, this method takes a generation number as a parameter. We
will not touch tables at or below that number at all. That number must be
calculated across all shards as the highest generation number each has seen.
Calling load_new_sstables in the absence of new tables will therefore do
nothing, and will be completely safe.
It will also return the highest generation number found after the reshuffling
process. New writers should start writing after that. Therefore, new tables
that are created will have a generation number that is higher than any of this,
and will therefore be safe.
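A rough sketch of the renumbering under these assumptions (the function
name and signature are hypothetical, not the actual Scylla code):

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Remap generations of newly found tables to sequential generations
// right after `highest_seen`; tables at or below `highest_seen` are
// left untouched. Returns the new highest generation, after which new
// writers should start writing.
int64_t reshuffle(const std::vector<int64_t>& found_generations,
                  int64_t highest_seen,
                  std::map<int64_t, int64_t>& renames /* old -> new */) {
    int64_t next = highest_seen;
    for (int64_t gen : found_generations) {
        if (gen <= highest_seen) {
            continue; // not a new table; do not touch it
        }
        renames[gen] = ++next; // link to the next sequential generation
    }
    return next; // with no new tables, this is just highest_seen
}
```

With no new tables the rename map stays empty and the highest generation
is returned unchanged, which is what makes the no-op case completely safe.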
Signed-off-by: Glauber Costa <glommer@scylladb.com>
During certain operations we need to stop writing SSTables. This is needed when
we want to load new SSTables into the system. They will have to be scanned by all
shards, agreed upon, and in most cases even renamed. Letting SSTables be
written at that point makes it inherently racy - especially with the rename.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
From Pawel:
This series enables row cache to serve range queries. In order to achieve
that row cache needs to know whether there are some other partitions in
the specified range that are not cached and need to be read from the sstables.
That information is provided by key_readers, which work very similarly to
mutation_readers but return only the decorated keys of partitions in the
range. For sstables, the key_reader is implemented on top of the partition
index.
An approach like this has the disadvantage of needing to access the disk
even if all partitions in the range are cached. There are (at least) two
ways of dealing with that problem:
- cache the partition index - that will also help in all other places where
it is needed
- add a flag to cache_entry which, when set, indicates that the immediate
successor of the partition is also in the cache. Such flag would be set
by mutation reader and cleared during eviction. It will also allow
newly created mutations from memtable to be moved to cache provided that
both their successors and predecessors are already there.
The key_reader part of this patchset adds a lot of new code that probably
won't be used anywhere else, but the alternative would be to always
interleave reads from the cache with reads from the sstables, and that would
be heavier on the partition index, which isn't cached.
Fixes #185.
For each of the available snapshots, the API may be queried for some
information: the total size on disk, and the "real" size. As far as I could
understand, the real size is the size used by the SSTables themselves, while
the total size also includes the metadata about the snapshot - like the
manifest.json file.
Details follow:
In the original Cassandra code, total size is:
long sizeOnDisk = FileUtils.folderSize(snapshot);
folderSize recurses on directories and adds file.length() for files. Again,
my understanding is that our file_size() would give us the same as Java's
length() method.
The other value, real (or true) size is:
long trueSize = getTrueAllocatedSizeIn(snapshot);
getTrueAllocatedSizeIn seems to be a tree walker, whose visitor is an instance
of TrueFilesSizeVisitor. What that visitor does, is add up the size of the files
within the tree who are "acceptable".
An acceptable file is one which starts with the prefix we want (IOW, belongs
to the same SSTable; we will just test that directly) and is not "alive". The
alive list is just the list of all SSTables in the system that are used by
the CFs.
What this tries to do is make sure that the trueSnapshotSize is just the
extra space on disk used by the snapshot. Since the snapshot files are hard
links, a table that goes away adds to this size; if it would be there anyway,
it does not.
We can do that in a much simpler fashion: for each file, we will just look
at the original CF directory and see if we can find the file there. If we
can't, then it counts towards the trueSize. Even for files that are deleted
after compaction, that "eventually" works, and it simplifies the code
tremendously, given that we have to neither list all files in the system -
as Cassandra does - nor check other shards for liveness information - as we
would have to do.
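The simplified computation can be sketched as follows (illustrative
names; in the real code the file lists and sizes come from the
filesystem rather than being passed in):

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>

// A snapshot file counts toward the true size only if it is no longer
// present in the original CF directory - i.e. the snapshot's hard link
// is the last remaining reference, so it costs extra disk space.
int64_t true_snapshot_size(const std::map<std::string, int64_t>& snapshot_files,
                           const std::set<std::string>& cf_dir_files) {
    int64_t size = 0;
    for (const auto& [name, bytes] : snapshot_files) {
        if (cf_dir_files.count(name) == 0) {
            size += bytes; // file exists only in the snapshot
        }
    }
    return size;
}
```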
The scheme I am proposing may need some tweaks when we support multiple data
directories, as the SSTables may not be directly below the snapshot level.
Still, it would be trivial to inform the CF about their possible locations.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
"Fixes: #469
We occasionally generate memtables that are not empty, yet have no
high replay_position set. (The typical case is CL replay, but apparently
there are others.)
Moreover, we can do this repeatedly, and thus get caught in the flush
queue ordering restrictions.
Solve this by treating a flush without replay_position as a flush at the
highest running position, i.e. "last" in queue. Note that this will not
affect the actual flush operation, nor CL callbacks, only anyone waiting
for the operation(s) to complete.
To do this, the flush_queue had its restrictions eased, and some introspection
methods added."