Commit Graph

291 Commits

Author SHA1 Message Date
Avi Kivity
f7087da054 Merge "GET methods for snapshots" from Glauber
"The snapshots API needs to expose GET methods so people can
query information on them. Now that taking snapshots is supported,
this relatively simple series implements get_snapshot_details, a
column family method, and wires that up through the storage_service."
2015-10-22 15:23:45 +03:00
Avi Kivity
5f3a46eabb Merge "load_new_sstables" from Glauber
"This patchset implements load_new_sstables, allowing one to move tables into the
data directory of a CF and then call "nodetool refresh" to start using them.

Keep in mind that for Cassandra, this is deemed an unsafe operation:
https://issues.apache.org/jira/browse/CASSANDRA-6245

It is still something we should not recommend - unless the CF is totally
empty and not yet used - but we can do a much better job on the safety front.

To guarantee that, the process works in four steps:

1) All writes to this specific column family are disabled. This is a horrible thing to
   do, because dirty memory can grow much more than desired during this. Throughout
   this implementation, we will try to keep the time during which the writes are disabled
   to its bare minimum.

   While disabling the writes, each shard will tell us about the highest generation number
   it has seen.

2) We will scan all tables that we haven't seen before. Those are any tables found in the
   CF datadir whose generation is higher than the highest generation number seen so far. We will link
   them to new generation numbers that are sequential to the ones we have so far, and end up
   with a new generation number that is returned to the next step.

3) The generation number computed in the previous step is now propagated to all CFs, which
   guarantees that all further writes will pick generation numbers that won't conflict with
   the existing tables. Right after doing that, the writes are resumed.

4) The tables we found in step 2 are passed on to each of the CFs. They can now load those
   tables while operations to the CF proceed normally."
2015-10-22 13:42:24 +03:00
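The four-step procedure described in the merge message above can be sketched as follows. This is a simplified, single-threaded illustration; all names (`toy_cf`, `refresh`) are hypothetical and not Scylla's actual API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// A toy model of the four-step refresh: writes are paused, unknown
// sstables (generation > highest seen) are relinked to sequential
// generations, the new floor is published, and writes resume.
struct toy_cf {
    bool writes_enabled = true;
    int64_t highest_generation = 0;   // highest generation seen so far
    std::vector<int64_t> sstables;    // generations of loaded sstables
};

// Steps 1-3: pause writes, scan for generations above the known maximum,
// assign them fresh sequential generations, publish the new floor, and
// resume writes. Step 4 then loads the relinked tables while the CF serves.
inline int64_t refresh(toy_cf& cf, std::vector<int64_t> found_on_disk) {
    cf.writes_enabled = false;            // step 1: stop writes (kept brief)
    int64_t next = cf.highest_generation;
    std::vector<int64_t> relinked;
    std::sort(found_on_disk.begin(), found_on_disk.end());
    for (int64_t gen : found_on_disk) {
        if (gen > cf.highest_generation) {   // step 2: only unseen tables
            relinked.push_back(++next);      // link to a sequential generation
        }
    }
    cf.highest_generation = next;         // step 3: propagate the new floor...
    cf.writes_enabled = true;             // ...and resume writes
    for (int64_t gen : relinked) {        // step 4: load while serving normally
        cf.sstables.push_back(gen);
    }
    return next;
}
```

Note that calling `refresh` with no new tables is a no-op, matching the safety property described in the reshuffle commit below.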
Amnon Heiman
c130381284 Adding live_scanned and tombstone scanned histograms to column family
This series adds histograms to the column family for live scanned and
tombstone scanned.

It exposes those histograms via the API instead of the stub implementation
that currently exists.

The implementation update of the histogram will be added in a different
series.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-22 11:13:28 +03:00
Glauber Costa
36cea4313e column family: load new sstables
CF-level code to load new SSTables. There isn't really a lot of complication
here. We don't even need to repopulate the entire SSTable directory: by
requiring that the external service that is coordinating this tell us explicitly
about the new SSTables found in the scan process, we can just load them
specifically and add them to the SSTable map.

All new tables will start their lives as shared tables, and will be unshared
if it is possible to do so: this all happens inside add_sstable and there isn't
really anything special on this front.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
Glauber Costa
61be9fb02d reshuffle tables: mechanism to adjust new sstables' generation number
Before loading new SSTables into the node, we need to make sure that their
generation numbers are sequential (at least if we want to follow Cassandra's
footsteps here).

Note that this is unsafe by design. More information can be found at:
https://issues.apache.org/jira/browse/CASSANDRA-6245

However, we can already do slightly better in two ways:

Unlike Cassandra, this method takes as a parameter a generation number. We
will not touch tables that are before that number at all. That number must be
calculated from all shards as the highest generation number they have seen themselves.
Calling load_new_sstables in the absence of new tables will therefore do nothing,
and will be completely safe.

It will also return the highest generation number found after the reshuffling
process.  New writers should start writing after that. Therefore, new tables
that are created will have a generation number that is higher than any of these,
and will therefore be safe.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
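The parameter described above ("calculated from all shards as the highest generation number they have seen") reduces to a maximum across shards; a minimal sketch with hypothetical names, not the actual Scylla code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Each shard reports the highest sstable generation it has seen; the
// baseline passed to the reshuffle step is the maximum across shards,
// so tables at or below it are never touched.
inline int64_t baseline_generation(const std::vector<int64_t>& per_shard_highest) {
    int64_t base = 0;
    for (int64_t g : per_shard_highest) {
        base = std::max(base, g);
    }
    return base;
}
```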
Glauber Costa
1351c1cc13 database: mechanism to stop writing sstables
During certain operations we need to stop writing SSTables. This is needed when
we want to load new SSTables into the system. They will have to be scanned by all
shards, agreed upon, and in most cases even renamed. Letting SSTables be written
at that point makes it inherently racy - especially with the rename.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
Glauber Costa
29e2ad7fd8 column family: commonize code to calculate the desired SSTable generation
We will reuse this for load_new_sstables.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:02:43 +02:00
Glauber Costa
f3bad2032d database: fix type for sstable generation.
Avoid using long for it; use a fixed-size type instead. Make it signed
rather than unsigned to avoid upsetting any code that we may have converted.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:01:20 +02:00
Tomasz Grabiec
764d913d84 Merge branch 'pdziepak/row-cache-range-query/v4' from seastar-dev.git
From Pawel:

This series enables row cache to serve range queries. In order to achieve
that, row cache needs to know whether there are some other partitions in
the specified range that are not cached and need to be read from the sstables.
That information is provided by key_readers, which work very similarly to
mutation_readers, but return only the decorated keys of partitions in
range. In the case of sstables, key_reader is implemented using the partition
index.

An approach like this has the disadvantage of needing to access the disk
even if all partitions in the range are cached. There are (at least) two
ways of dealing with that problem:
 - cache the partition index - that will also help in all other places where it
   is needed
 - add a flag to cache_entry which, when set, indicates that the immediate
   successor of the partition is also in the cache. Such a flag would be set
   by the mutation reader and cleared during eviction. It would also allow
   newly created mutations from memtable to be moved to cache provided that
   both their successors and predecessors are already there.

The key_reader part of this patchset adds a lot of new code that probably
won't be used in any other place, but the alternative would be to always
interleave reads from cache with reads from sstables and that would be
more heavy on partition index, which isn't cached.

Fixes #185.
2015-10-21 15:26:45 +02:00
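The role of a key_reader, as described above, can be illustrated with a toy function that compares the keys in a range (as the partition index would yield them) against the cache, to find the partitions that still need a disk read. Names and types here are illustrative, not the actual Scylla interfaces:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy version of the idea behind key_reader: it yields only the keys of
// partitions in a range (modeled here as sorted strings). Comparing that
// stream against the cache tells us which partitions must be read from
// the sstables rather than served from cache.
inline std::vector<std::string> keys_missing_from_cache(
        const std::vector<std::string>& keys_in_range,   // from partition index
        const std::set<std::string>& cached_keys) {
    std::vector<std::string> missing;
    for (const auto& k : keys_in_range) {
        if (!cached_keys.count(k)) {
            missing.push_back(k);   // must be read from sstables
        }
    }
    return missing;
}
```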
Glauber Costa
77513a40db database: get_snapshot_details
For each of the snapshots available, the api may query for some information:
the total size on disk, and the "real" size. As far as I could understand, the
real size is the size that is used by the SSTables themselves, while the total
size also includes the metadata about the snapshot - like the manifest.json
file.

Details follow:

In the original Cassandra code, total size is:

    long sizeOnDisk = FileUtils.folderSize(snapshot);

folderSize recurses on directories, and adds file.length() on files. Again, my
understanding is that file_size() would give us the same as Java's length()
method.

The other value, real (or true) size is:

    long trueSize = getTrueAllocatedSizeIn(snapshot);

getTrueAllocatedSizeIn seems to be a tree walker, whose visitor is an instance
of TrueFilesSizeVisitor. What that visitor does is add up the sizes of the files
within the tree that are "acceptable".

An acceptable file is a file which:

starts with the prefix we want (IOW, belongs to the same SSTable; we
will just test that directly), and is not "alive". The alive list is just the
list of all SSTables in the system that are used by the CFs.

What this tries to do is to make sure that the trueSnapshotSize is just the
extra space on disk used by the snapshot. Since the snapshots are links,
if a table goes away, it adds to this size. If it would be there anyway, it does
not.

We can do that in a much simpler fashion: for each file, we will just look at
the original CF directory and see if we can find the file there. If we can't,
then it counts towards the trueSize. Even for files that are deleted after
compaction, that "eventually" works, and it simplifies the code tremendously,
given that we neither have to list all files in the system - as Cassandra
does - nor check other shards for liveness information - as we would have to
do.

The scheme I am proposing may need some tweaks when we support multiple data
directories, as the SSTables may not be directly below the snapshot level.
Still, it would be trivial to inform the CF about their possible locations.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 13:48:44 +02:00
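The proposed true-size computation can be sketched in miniature: directories are modeled as in-memory maps and sets, and a snapshot file counts toward the true size only if it is absent from the CF's live directory. This illustrates the idea, not the actual filesystem-walking code:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Simplified model of the proposed computation: because snapshots are
// hard links, a file shared with the live CF directory costs no extra
// space; only files that survive solely in the snapshot count.
inline uint64_t true_snapshot_size(
        const std::map<std::string, uint64_t>& snapshot_files,  // name -> size
        const std::set<std::string>& live_cf_files) {
    uint64_t true_size = 0;
    for (const auto& [name, size] : snapshot_files) {
        if (!live_cf_files.count(name)) {
            true_size += size;   // only the snapshot keeps this data alive
        }
    }
    return true_size;
}
```

The total size, by contrast, would simply sum every file in the snapshot directory.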
Paweł Dziepak
96a42a9c69 column_family: add sstables_as_key_source()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-10-20 20:27:53 +02:00
Glauber Costa
d236b01b48 snapshots: check existence of snapshots
We go to the filesystem to check if the snapshot exists. This should make us
robust against deletions of existing snapshots from the filesystem.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-20 15:58:26 +02:00
Glauber Costa
d3aef2c1a5 database: support clear snapshot
This allows us to delete an existing snapshot. It works at the column
family level, and removing it from the list of keyspace snapshots needs to
happen only when all CFs are processed. Therefore, that is provided as a
separate operation.

The filesystem code is a bit ugly: it can be made better by making our file
lister more generic. First step would be to call it walker, not lister...

For now, we'll use the fact that there are mostly two levels in the snapshot
hierarchy to our advantage, and avoid a full recursion - using the same lambda
for all calls would require us to provide a separate class to handle the state;
that's part of making this generic.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-20 15:38:14 +02:00
Avi Kivity
2ccb5feabd Merge "Support nodetool cfhistogram"
"This series adds the missing estimated histogram to the column family and to
the API so the nodetool cfhistogram would work."
2015-10-19 17:11:46 +03:00
Raphael S. Carvalho
35b75e9b67 adapt compaction procedure to support leveled strategy
Adapt our compaction code to start writing a new sstable if the
one being written has reached its maximum size. Leveled strategy works
with that concept. If a strategy other than leveled is being used,
everything will work as before.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-10-16 01:54:52 -03:00
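The rollover behavior described above can be modeled with a toy writer that starts a new output sstable once the current one reaches the strategy's maximum size; with an effectively unlimited maximum it degenerates to the old single-output behavior. `toy_compaction_writer` is a hypothetical name, not Scylla's actual class:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of the compaction writer change: data is appended to the
// current output sstable, and once appending would exceed the strategy's
// maximum size a new output table is started.
struct toy_compaction_writer {
    uint64_t max_sstable_size;
    std::vector<uint64_t> outputs{0};   // bytes written per output table

    void write(uint64_t bytes) {
        if (outputs.back() + bytes > max_sstable_size && outputs.back() > 0) {
            outputs.push_back(0);       // roll over to a new sstable
        }
        outputs.back() += bytes;
    }
};
```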
Calle Wilund
012ab24469 column_family: Add flush queue object to act as ordering guarantee 2015-10-14 14:07:40 +02:00
Glauber Costa
b2fef14ada do not calculate truncation time independently
Currently, we are calculating truncated_at during truncate() independently for
each shard. It will work if we're lucky, but it is fairly easy to trigger cases
in which each shard will end up with a slightly different time.

The main problem here is that this time is used as the snapshot name when auto
snapshots are enabled. Previous to my last fixes, this would just generate two
separate directories in this case, which is wrong but not severe.

But after the fix, this means that both shards will wait for one another to
synchronize and this will hang the database.

Fix this by making sure that the truncation time is calculated before
invoke_on_all in all needed places.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-09 17:17:11 +03:00
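The fix described above amounts to sampling the clock once and broadcasting the value, instead of letting each shard sample it independently. A toy model with an injectable clock (names are illustrative; the real code runs across shards via invoke_on_all):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Each "shard" records the truncation time it was handed.
struct toy_shard {
    int64_t truncated_at = 0;
};

// The bug comes from each shard calling the clock itself; the fix is to
// sample once, before fanning out, and pass the same value everywhere.
template <typename Clock>
inline void truncate_all(std::vector<toy_shard>& shards, Clock now) {
    const int64_t truncated_at = now();   // sampled ONCE, before the fan-out
    for (auto& s : shards) {
        s.truncated_at = truncated_at;    // every shard sees the same time
    }
}
```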
Amnon Heiman
6d90eebfb9 column family: Add estimated histogram impl
This patch adds the read and write latency estimated histogram support
and adds an estimated histogram of the number of sstables that were used in
a read.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-08 14:59:17 +03:00
Tomasz Grabiec
bc1d159c1b Merge branch 'penberg/cql-drop-table/v3' from seastar-dev.git
From Pekka:

This patch series implements support for CQL DROP TABLE. It uses the newly
added truncate infrastructure under the hood. After this series, the
test_table CQL test in dtest passes:

  [penberg@nero urchin-dtest]$ nosetests -v cql_tests.py:TestCQL.table_test
  table_test (cql_tests.TestCQL) ... ok

  ----------------------------------------------------------------------
  Ran 1 test in 23.841s

  OK
2015-10-06 13:39:25 +02:00
Pekka Enberg
afbb2f865d database: Add keyspace_metadata::remove_column_family() helper
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 11:28:55 +03:00
Pekka Enberg
0651ab6901 database: Futurize drop_column_family() function
Futurize drop_column_family() so that we can call truncate() from it.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 11:28:55 +03:00
Pekka Enberg
85ffaa5330 database: Add truncate() variant that does not look up CF by name
For drop_column_family(), we want to first remove the column_family from
lookup tables and truncate after that to avoid races. Introduce a
truncate() variant that takes keyspace and column_family references.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 11:28:54 +03:00
Glauber Costa
639ba2b99d incremental backups: move control to the CF level
Currently, we control incremental backups behavior from the storage service.
This creates some very concrete problems, since the storage service is not
always available and initialized.

The solution is to move it to the column family (and to the keyspace so we can
properly propagate the conf file value). When we change this from the api, we will
have to iterate over all of them, changing the value accordingly.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-05 13:16:11 +02:00
Glauber Costa
69d1358627 database: non const versions of get_keyspaces/column_families
We will need to change some properties of the keyspace / cf. We need an accessor
that is not marked as const.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-05 13:13:37 +02:00
Amnon Heiman
1f16765140 column family: setting the read and write latency histogram
This patch contains the following changes: in the definition of the read
and write latency histograms, it removes the mask value, so that the
default value will be used.

To support the gathering of the read latency histogram, the query method
cannot be const, as it modifies the histogram statistics.

The read statistic is sample based and should have no real impact on
performance; if there is an impact, we can always change it in the
future to a lower sampling rate.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-04 11:52:19 +03:00
Pekka Enberg
5e27d476d4 database: Improve exception error messages
When we convert exceptions into CQL server errors, type information is
not preserved. Therefore, improve exception error messages to make
debugging dtest failures, for example, slightly easier.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-10-01 11:23:46 +03:00
Calle Wilund
68b8d8f48c database: Implement "truncate" for column family
Including snapshotting.
2015-09-30 09:09:42 +02:00
Calle Wilund
56228fba24 column family: Add "snapshot" operation. 2015-09-30 09:09:42 +02:00
Calle Wilund
c141e15a4a column family: Add "run_with_compaction_disabled" helper
A'la origin. Could as well have been RAII.
2015-09-30 09:09:41 +02:00
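The commit note suggests an RAII formulation would have worked too; a sketch of that alternative, using a toy boolean flag in place of the real compaction manager:

```cpp
#include <cassert>

// RAII guard: compaction is disabled on construction and re-enabled on
// destruction, so it is restored even if the callback throws.
struct compaction_guard {
    bool& enabled;
    explicit compaction_guard(bool& e) : enabled(e) { enabled = false; }
    ~compaction_guard() { enabled = true; }
};

// Toy version of run_with_compaction_disabled built on the guard.
template <typename Func>
inline auto run_with_compaction_disabled(bool& compaction_enabled, Func f) {
    compaction_guard g(compaction_enabled);
    return f();
}
```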
Avi Kivity
d5cf0fb2b1 Add license notices 2015-09-20 10:43:39 +03:00
Amnon Heiman
089bd6a5bd column family: Expose the compaction strategy
This exposes the compaction strategy object.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-09-12 08:35:34 +03:00
Amnon Heiman
3af683e6f4 column family: add estimate read, write
This adds an estimated read and estimated write histogram to the column
family stats object.
2015-09-12 08:35:03 +03:00
Amnon Heiman
dd7638cfa9 Expose the dirty_memory_region_group in database and add occupancy to
column_family

This patch adds a getter for the dirty_memory_region_group in the
database object and adds an occupancy method to column family that
returns the total occupancy of all the memtables in the column family.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-09-10 00:22:08 +03:00
Avi Kivity
b96018411b Merge "Fix flush in the middle of scanning bug" from Tomasz
Fixes #309.

Conflicts:
	sstables/sstables.cc
2015-09-09 11:56:04 +03:00
Tomasz Grabiec
320ff132f8 sstables: Relax header dependencies 2015-09-09 10:07:43 +02:00
Gleb Natapov
df468504b6 schema_table: convert code to use distributed<storage_proxy> instead of storage_proxy&
All database code was converted to it when storage_proxy was made
distributed, but then new code was written to use storage_proxy& again.
Passing a distributed<> object is safer, since it can be passed between
shards safely. There was a patch to fix one such case yesterday; I found
one more while converting.
2015-09-09 10:19:30 +03:00
Tomasz Grabiec
c623fbe1f7 database: Keep sstable as lw_shared_ptr<> from the beginning
Allows us to save on indentation, and we need it as shared anyway later.
2015-09-08 10:19:19 +02:00
Calle Wilund
380649eb66 Database: Add commitlog flush handler to switch memtables to disk
Initiates flushing of CF:s to sstable on CL disk overflow (flush req)
2015-09-07 13:21:46 +02:00
Avi Kivity
349015a269 Merge "Fix migration manager logging" from Pekka
"Fix migration manager logging to output what origin does. Fixes #112."
2015-08-31 16:27:49 +03:00
Calle Wilund
987454d012 Database: Add "flush_all_memtables" 2015-08-31 14:29:50 +02:00
Pekka Enberg
03e0bcd8cb database: Add operator<< for keyspace_metadata
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 13:35:19 +03:00
Pekka Enberg
04a65ec06f database: Add keyspace_metadata::validate() helper
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 11:54:56 +03:00
Avi Kivity
012fd41fc0 db: hard dirty memory limit
Unlike cache, dirty memory cannot be evicted at will, so we must limit it.

This patch establishes a hard limit of 50% of all memory.  Above that,
new requests are not allowed to start.  This allows the system some time
to clean up memory.

Note that we will need more fine-grained bandwidth control than this;
the hard limit is the last line of defense against running out of reclaimable
memory.

Tested with a mixed read/write load; after reads start to dominate writes
(due to the proliferation of small sstables, and the inability of compaction
to keep up), dirty memory usage starts to climb until the hard stop prevents
it from climbing further and OOMing the server.
2015-08-28 14:47:17 +02:00
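The hard limit described above reduces to a simple admission check; a minimal sketch (the 50% constant comes from the commit text, the function name is hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Admission control for the hard dirty-memory limit: once dirty memory
// reaches half of all memory, new requests are not allowed to start,
// giving the system time to flush and clean up.
inline bool admit_request(uint64_t dirty_bytes, uint64_t total_bytes) {
    return dirty_bytes < total_bytes / 2;   // hard limit: 50% of all memory
}
```

As the commit notes, this is a last line of defense; finer-grained bandwidth control would throttle writers before the limit is hit.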
Avi Kivity
5f62f7a288 Revert "Merge "Commit log replay" from Calle"
Due to test breakage.

This reverts commit 43a4491043, reversing
changes made to 5dcf1ab71a.
2015-08-27 12:39:08 +03:00
Avi Kivity
0fff367230 Merge "test for compaction metadata's ancestors" from Raphael 2015-08-27 11:07:53 +03:00
Avi Kivity
4e3c9c5493 Merge "compaction manager fixes" from Raphael 2015-08-27 11:05:26 +03:00
Avi Kivity
43a4491043 Merge "Commit log replay" from Calle
"Initial implementation/transposition of commit log replay.

* Changes replay position to be shard aware
* Commit log segment ID:s now follow basically the same scheme as origin;
  max(previous ID, wall clock time in ms) + shard info (for us)
* SStables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they, and existing
  sstables are inspected for high water mark, and then replayed from
  those marks to amend mutations potentially lost in a crash
* Note that CPU count changes are "handled" only in so much as shard matching is
  per the _previous_ run's shards, not the current ones.

Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
  against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
  so watermark ID:s coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
  like origin. Partly because I am lazy, but also partly because our serial
  format differs, and we currently have no tools to do anything useful with it
* No replay filtering (Origin allows a system property to designate a filter
  file, detailing which keyspace/cf:s to replay). Partly because we have no
  system properties.

There is no unit test for the commit log replayer (yet).
Because I could not really come up with a good one given the test
infrastructure that exists (tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), kill -9 + restart.
This of course does not really fully validate whether the resulting DB is
100% valid compared to the one at kill -9 time, but at least it verified that replay
took place and mutations were applied.
(Note that origin also lacks validity testing)"
2015-08-27 10:53:36 +03:00
Amnon Heiman
b5ceef451e keyspace: Add get_non_system_keyspaces and expose the replication strategy
This patch adds the get_non_system_keyspaces method found in origin and
exposes the replication strategy via the get_replication_strategy method.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-08-25 19:39:13 +03:00
Calle Wilund
df8d7a8295 Database: Add "flush_all_memtables" 2015-08-25 09:41:56 +02:00
Avi Kivity
4390be3956 Rename 'negative_mutation_reader' to 'partition_presence_checker'
Suggested by Tomek.
2015-08-24 18:03:22 +03:00