scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-21 17:10:35 +00:00

Author	SHA1	Message	Date
Glauber Costa	7e3093709a	backlog: add level to write progress monitor For SSTables being written, we don't know their level yet. Add that information to the write monitor. New SSTables will always be at L0. Compacted SSTables will have their level determined by the compaction process. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-05-31 21:09:38 -04:00
Glauber Costa	b573a2ff61	backlog: keep track of maximum timestamp in write monitor For sealed SSTables we can get the maximum timestamp from the statistics component. But for partially written SSTables, the metadata is not yet available. One way to solve this would be to make the SSTable statistics available earlier. But we would end up with a maximum timestamp that potentially changes all the time as we write more cells. A better approach is to take note of the maximum timestamp in a memtable before we start to flush, and when the time comes for us to flush we will use the progress manager to inform the consumers about the maximum timestamp. For SSTables being compacted, we can't know for sure what the maximum timestamp is, as some entries could be TTLd already. But the maximum of all SSTables present in the compaction is a good enough estimate for this purpose. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-05-22 12:55:58 -04:00
Piotr Sarna	fe02c3d0e2	database, sstables, tests: add large_partition_handler This commit makes database, sstables and tests aware of which large_partition_handler they use. Proper large_partition_handler is retrievable from config information and is based on existing compaction_large_partition_warning_threshold_mb entry. Right now CQL TABLE variant of large_partition_handler is used in the database. Tests use a NOP version of large_partition_handler, which does not depend on CQL queries at all.	2018-05-04 14:38:13 +02:00
Vladimir Krivopalov	15ef4ca73c	Support for writing SSTables 3.0 ('mc') Data.db and Index.db files - rows only. This fix adds functionality for writing data in 'mc' format to Data.db file according to the SSTables 3.0 data format as described at https://github.com/scylladb/scylla/wiki/SSTables-3.0-Data-File-Format and Index.db file according to the specification at https://github.com/scylladb/scylla/wiki/SSTables-3.0-Index-File-Format The following cases are not supported yet: - writing counter cells - range tombstones In Index.db, end open markers are not written since range tombstones are not supported for data files yet. For #1969. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-04-26 14:34:20 -07:00
Raphael S. Carvalho	11940ca39e	sstables: Fix bloom filter size after resharding by properly estimating partition count We were feeding the total estimated partition count of an input shared sstable to the output unshared ones. So the sstable writer thinks, from the estimate, that each sstable created by resharding will have the same amount of data as the shared sstable it is being created from. That's a problem because the estimate is fed to bloom filter creation, which directly influences its size. So if we're resharding all sstables that belong to all shards, the disk usage taken by filter components will be multiplied by the number of shards. That becomes more of a problem with #3302. Partition count estimation for a shard S will now be done as follows: TE, the total estimated partition count for a shard S, is defined as TE = Sum(i = 0...N) { Ei / Si }, where i is an input sstable that belongs to shard S, Ei is the estimated partition count for sstable i, and Si is the total number of shards that own sstable i. Fixes #2672. Refs #3302. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180423151001.9995-1-raphaelsc@scylladb.com>	2018-04-23 18:11:20 +03:00
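The estimation formula in the commit above, TE = Sum { Ei / Si }, can be illustrated with a minimal Python sketch. This is not the Scylla implementation (which is C++); the function name and the tuple layout are hypothetical and exist only for illustration.

```python
from typing import Iterable, Tuple


def estimate_partitions_for_shard(inputs: Iterable[Tuple[int, int]]) -> float:
    """Return TE for one shard.

    Each element of `inputs` describes one input sstable owned by the shard as
    (Ei, Si): Ei is the sstable's estimated partition count and Si is the
    number of shards that own it. Both names follow the commit message; the
    tuple representation is an assumption of this sketch.
    """
    total = 0.0
    for estimated_partitions, owner_shards in inputs:
        if owner_shards <= 0:
            raise ValueError("an sstable must be owned by at least one shard")
        # Each owner is assumed to receive an equal share of the partitions.
        total += estimated_partitions / owner_shards
    return total


# One sstable shared by 4 shards and one shared by 2 shards.
print(estimate_partitions_for_shard([(1000, 4), (600, 2)]))
```

Dividing by the owner count is what keeps the bloom filter of each output sstable from being sized as if it held the whole shared input.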
Avi Kivity	03c22ad524	Merge "Support for Cassandra 2.2 (LA) SSTable formats" from Daniel " These patches add support for C* 2.2 file(name) format. Namely: * It forces Scylla to write files in la format. * Adds storage-service feature for them. * cf and ks are determined from directory, not from file-name (for 2.2 format). * Adds some other fixes to make dtest happy. * Unit tests work with la format or with both formats. " * 'danfiala/filename-format-2.2-v4' of https://github.com/hagrid-the-developer/scylla: tests/sstables: Tests use la format or iterate over both formats. tests/sstables: Helper functions support 2.2 format directory structure. stables: Use 2.2 (la) format as a default format to store sstables if it is enabled by feature-bits. storage_service: Support la sstable storage format as a feature. sstables: make_descriptor accepts sstable-directory, because it is necessary to determine cf and ks in 2.2 format. sstables: Throw more detail exception for unknown item in reverse_map. sstables/compaction: Suppress NaN in a report of a throughput.	2018-03-19 17:49:44 +02:00
Daniel Fiala	c5eca593fc	sstables/compaction: Suppress NaN in a report of a throughput. * It causes failures in dtest. Signed-off-by: Daniel Fiala <daniel@scylladb.com>	2018-03-18 05:46:32 +01:00
Raphael S. Carvalho	aa75684ee7	sstables: Warn when an extra-large partition is written Based on https://issues.apache.org/jira/browse/CASSANDRA-9643 With the compaction_large_partition_warning_threshold_mb option set to 1, here is an example output: WARN 2018-02-22 19:52:11,029 [shard 0] sstable - Writing large row system/local:{key: pk{00056c6f63616c}, token:-7564491331177403445} (1276758 bytes) Fixes #2209. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180306175912.19259-1-raphaelsc@scylladb.com>	2018-03-07 15:49:46 +00:00
Glauber Costa	956af9f099	database, main: set up scheduling_groups for our main tasks Set up scheduling groups for streaming, compaction, memtable flush, query, and commitlog. The background writer scheduling group is retired; it is split into the memtable flush and compaction groups. Comments from Glauber: This patch is based on a patch from Avi with the same subject, but the differences are significant enough that I reset authorship. In particular: 1) A bug/regression is fixed with the boundary calculations for the memtable controller sampling function. 2) A leftover is removed, where after flushing a memtable we would go back to the main group before going to the cache group again 3) As per Tomek's suggestion, the submission of compactions is now itself run in the compaction scheduling group. Having that working is what changes this patch the most: we now store the scheduling group in the compaction manager and let the compaction manager itself enforce the scheduling group. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-02-07 17:19:29 -05:00
Avi Kivity	641aaba12c	database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler thread_scheduling_groups are converted to plain scheduling_group. Due to differences in initialization (scheduling_group initialization defers), we create the scheduling_groups in main.cc and propagate them to users via a new class database_config. The sstable writer loses its thread_scheduling_group parameter and instead inherits scheduling from its caller. Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas, the flush controller was adjusted to return values within the higher ranges.	2018-02-07 17:19:29 -05:00
Duarte Nunes	cbbdfde979	sstables/compaction_backlog_tracker: Constify backlog() Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180111004914.25796-1-duarte@scylladb.com>	2018-01-11 13:20:57 +02:00
Glauber Costa	ca284174d0	infrastructure for backlog estimator for compaction work. This patch adds infrastructure in various points in the system to allow us to determine the amount of work present as backlog from compactions. What needs to be done can be explained in three major pieces: 1) Add hooks in the points where sstables are added or inserted to a column family (or more precisely, to a compaction_strategy object). 2) Add hooks in read and write monitors that allow a compaction backlog estimator (tracker) to become aware of bytes that are partially written and compacted away. 3) Add a per-column family class (compaction_backlog_tracker) that can be used to track work that is done and relevant to compactions (like the two above), and a compaction manager to provide a system-wide backlog based on the response of the individual trackers. The definition of how much backlog one has is strategy-specific. The Null strategy is easy, as it never really has any backlog, and so is the major strategy - since what really matters is the backlog of the underlying compaction strategy. Although backlogs are strategy-specific, they should be "compatible", in the sense that if a particular strategy has more work to do, it should yield a higher number than its counterparts. All the others are presented in this patch as unimplemented: they will always advertise a mild backlog that should yield a constant CPU-utilization if used alone. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:07 -05:00
Glauber Costa	d4109ebb80	compaction: control destruction of readers Compactions run from a seastar::thread, in run(). They will either fail or succeed, and from the point of view of ordering of destruction between the compaction object and its readers: - if compaction succeeds, we have no control over who gets destructed first since both objects will be going out of scope. - if it fails, we will forcibly destruct the compaction object, at which point the readers are still alive From the point of view of lifetime management, it would be nice to make sure that the compaction object outlives whichever other objects it needs during compaction. This nice to have will become paramount when we start adding read_monitors to the compaction object, which themselves have to outlive the readers. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-02 18:43:06 -05:00
Raphael S. Carvalho	eff62bc61e	compaction_manager: serialize compaction of same size tier for different cfs Currently, compaction manager will serialize compaction of same size tier (or weight) if they belong to the same column family. However, it fails to do so if the compaction jobs belong to different column families. That can lead to an ungodly amount of running compaction which gets worse the higher the number of shards and active column families. The problem is that it may affect overall system performance due to excessive resource usage. It's easy to trigger it during bootstrapping after loading node with new sstables or repairing, or if lots of cfs are being actively written. That being said, compaction jobs of same size tier are now serialized on a given shard, such that maximum number of compaction (system wise) is now: (SHARDS) * (SIZE TIERS) instead of: (SHARDS) * (COLUMN FAMILIES) * (SIZE TIERS) We'll work hard to release a size tier (weight) for a column family waiting on it as fast as possible, given that we wouldn't like to underutilize resources available for compaction. We want one starting after the other. Compaction for a column family that cannot run now because the size tier is taken, will be postponed. There's a worker that will be sleeping on a condition variable that will be signalled whenever a compaction completes. FIFO ordering is used on postponed list for fairness. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-17 17:42:48 -02:00
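The per-shard serialization described in the commit above can be sketched roughly as follows. This is a simplified, single-threaded Python model, not the compaction manager's actual code: the class and method names are hypothetical, and the condition variable and worker from the commit message are replaced by a direct call to `finish()`.

```python
from collections import deque


def max_compactions(shards: int, size_tiers: int) -> int:
    """Upper bound after the change: (SHARDS) * (SIZE TIERS)."""
    return shards * size_tiers


class WeightSerializer:
    """Hypothetical per-shard tracker: one running job per size tier (weight)."""

    def __init__(self):
        self._running = set()      # weights currently taken on this shard
        self._postponed = deque()  # (column_family, weight) waiting, FIFO

    def try_start(self, column_family: str, weight: int) -> bool:
        """Start the job if its weight is free, otherwise postpone it."""
        if weight in self._running:
            self._postponed.append((column_family, weight))
            return False
        self._running.add(weight)
        return True

    def finish(self, weight: int):
        """Release a weight and start the oldest postponed job that can run."""
        self._running.discard(weight)
        for i, (column_family, w) in enumerate(self._postponed):
            if w not in self._running:
                del self._postponed[i]
                self._running.add(w)
                return column_family
        return None


s = WeightSerializer()
s.try_start("cf1", 3)   # runs
s.try_start("cf2", 3)   # postponed: weight 3 is taken, even by another cf
print(s.finish(3))      # cf2 is started next
```

Because the set of running weights is shared across column families, the column family count drops out of the bound, which is the point of the change.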
Raphael S. Carvalho	49f3cfe746	sstables: improve compact_sstables() interface Motivation is that a new field in the descriptor will be forwarded to compaction procedure without extending parameter list even more. Also beautifies the interface, making it concise and easier to play with. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-17 17:22:19 -02:00
Paweł Dziepak	73b3d02cc0	db: make make_range_sstable_reader() return flat reader	2017-12-13 12:01:03 +00:00
Paweł Dziepak	e12959616c	db: make column_family::make_sstable_reader() return a flat reader	2017-12-13 12:01:03 +00:00
Avi Kivity	d934ca55a7	Merge "SSTable resharding fixes" from Raphael "Didn't affect any release. Regression introduced in `301358e`. Fixes #3041" * 'resharding_fix_v4' of github.com:raphaelsc/scylla: tests: add sstable resharding test to test.py tests: fix sstable resharding test sstables: Fix resharding by not filtering out mutation that belongs to other shard db: introduce make_range_sstable_reader rename make_range_sstable_reader to make_local_shard_sstable_reader db: extract sstable reader creation from incremental_reader_selector db: reuse make_range_sstable_reader in make_sstable_reader	2017-12-07 16:42:48 +02:00
Raphael S. Carvalho	bad21ba444	sstables: Fix resharding by not filtering out mutation that belongs to other shard After `301358e`, sstable resharding stopped working because shared sstables would use a filtering reader, which excludes mutations that belong to other shards. That completely breaks resharding, which relies on compaction of mutations that belong to different shards. The fix is about using the recently introduced non local shard reader. Fixes #3041. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-07 03:15:26 -02:00
Raphael S. Carvalho	d1b146baa6	rename make_range_sstable_reader to make_local_shard_sstable_reader Tomek says: "I think that the least surprising behavior for a function named like this is to read the sstables unfiltered (it just reads them), and the filtering should be indicated specially in the name or by accepting a parameter." Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-07 03:15:25 -02:00
Raphael S. Carvalho	809b30c4a2	sstables/compaction: do not actually compact fully expired sstables There's no need to actually compact a sstable which is fully expired and whose deletion will not resurrect older data. For that, a sstable will only be considered fully expired if it doesn't contain data newer than its overlapping counterparts. That way, there could be a false negative, but never a false positive. Currently, a fully expired sstable would unnecessarily waste read bandwidth of disk. This will help time series workloads a lot, in which data for a given time window is all deleted at once using TTL. Fixes #2620. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-06 19:52:33 -02:00
Raphael S. Carvalho	d2ab154f12	sstables: switch to const ref wherever possible Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-06 19:52:33 -02:00
Raphael S. Carvalho	d916c8cdad	sstables: use gc_clock::time_point for gc_before Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-06 19:52:33 -02:00
Raphael S. Carvalho	45c11865fa	sstables: change return value type of get_fully_expired_sstables unordered_set will allow us to quickly extract fully expired tables from a set of compacting sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-12-06 18:45:55 -02:00
Paweł Dziepak	b64dd21751	sstables: convert compaction to flat_mutation_reader	2017-11-23 18:14:31 +00:00
Duarte Nunes	baeec0935f	Replace query::full_slice with schema::full_slice() query::full_slice doesn't select any regular or static columns, which is at odds with the expectations of its users. This patch replaces it with the schema::full_slice() version. Refs #2885 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>	2017-10-17 11:25:53 +02:00
Botond Dénes	47e07b787e	restricted_mutation_reader: restrict based-on memory consumption Restrict readers based on their memory consumption, instead of the count of the top-level readers. To do this an interposer is installed at the input_stream level which tracks buffers emitted by the stream. This way we can have an accurate picture of the readers' actual memory consumption. New readers will consume 16k units from the semaphore up-front. This is to account for their own memory consumption, apart from the buffers they will allocate. Creating the reader will be deferred to when there are enough resources to create it. As before only new readers will be blocked on an exhausted semaphore, existing readers can continue to work.	2017-10-03 12:44:12 +03:00
Avi Kivity	78eae8bf48	Revert "Merge "Make restricting_mutation_reader more accurate" from Botond" This reverts commit `c6e5dcc556`, reversing changes made to `19b21a0ab2`. Fails to build, plus author has more changes.	2017-10-03 11:58:59 +03:00
Avi Kivity	c6e5dcc556	Merge "Make restricting_mutation_reader more accurate" from Botond "Currently restricting_mutation_reader restricts mutation_readers on a count basis. This is inaccurate on multiple levels. The reader might be a combined_mutation_reader, which might be composed of multiple individual readers, whose number might change during the lifetime of the reader. The memory consumption of the readers can vary and may change during the lifetime of the reader as well. To remedy this, make the restriction memory-consumption based. The restricting semaphore is now configured with the amount of memory (bytes) that its readers are allowed to consume in total. New readers consume 128k units up-front to account for read-ahead buffers, and then consume additional units for any buffer (returned from input_stream<>::read()) they keep around. Like before, readers already allowed to read will not be blocked, instead new readers will be blocked on their first read if all the units are consumed." Fixes #2692. * 'bdenes/restricting_mutation_reader-v4' of https://github.com/denesb/scylla: Update reader restriction related metrics Add restricted_reader_test unit test restricted_mutation_reader: restrict based-on memory consumption mutation_reader.hh: Move restricted_reader related code	2017-10-03 11:15:34 +03:00
Raphael S. Carvalho	e34c1db642	db: update compaction history outside the sstable write lock The reason to do that is because compaction can deadlock if refresh disables write which waits for compaction, and compaction in turn waits for dirty memory[1] that would be released by memtable write. Dirty memory manager for non-system cfs was being used for system cfs, which was useful for exposing this problem. [1]: when updating compaction history. Fixes #2769. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170918215238.9810-2-raphaelsc@scylladb.com>	2017-09-26 19:51:12 +02:00
Botond Dénes	33e97e7457	restricted_mutation_reader: restrict based-on memory consumption Restrict readers based on their memory consumption, instead of the count of the top-level readers. To do this an interposer is installed at the input_stream level which tracks buffers emitted by the stream. This way we can have an accurate picture of the readers' actual memory consumption. New readers will consume 16k units from the semaphore up-front. This is to account for their own memory consumption, apart from the buffers they will allocate. Creating the reader will be deferred to when there are enough resources to create it. As before only new readers will be blocked on an exhausted semaphore, existing readers can continue to work.	2017-09-20 11:14:35 +03:00
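The memory-based admission described in the commit above might be modelled as below. This is a minimal synchronous Python sketch under stated assumptions: the real code uses a Seastar semaphore and an input_stream interposer, while the class and method names here are hypothetical. Only the 16k up-front charge and the rule that existing readers are never blocked come from the commit message.

```python
class ReaderMemoryBudget:
    """Hypothetical stand-in for the restricting semaphore, counted in bytes."""

    UPFRONT_UNITS = 16 * 1024  # up-front charge per new reader, per the commit

    def __init__(self, total_bytes: int):
        self.available = total_bytes

    def try_admit(self) -> bool:
        """Admit a new reader only if its up-front charge fits in the budget."""
        if self.available < self.UPFRONT_UNITS:
            return False  # the real reader's creation would be deferred instead
        self.available -= self.UPFRONT_UNITS
        return True

    def consume_buffer(self, size: int) -> None:
        """Charge an admitted reader for a buffer; admitted readers never block,
        so the balance is allowed to go negative."""
        self.available -= size

    def release(self, size: int) -> None:
        """Return units when a buffer (or a reader) is destroyed."""
        self.available += size


budget = ReaderMemoryBudget(32 * 1024)
budget.try_admit()                 # first reader admitted
budget.consume_buffer(20 * 1024)   # its buffers overdraw the budget
print(budget.try_admit())          # a new reader has to wait
```

Letting the balance go negative is one simple way to express "only new readers are blocked"; whether the real semaphore is overdrawn in the same way is an assumption of this sketch.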
Avi Kivity	9b540eccb0	database: remove dependency on compaction.hh and compaction_manager.hh	2017-09-11 20:09:45 +03:00
Glauber Costa	db846326f8	compaction: remove dead code This code has no more users. Bury it. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20170908005305.29925-1-glauber@scylladb.com>	2017-09-08 08:17:15 +02:00
Botond Dénes	611774b1d9	Use the incremental reader for compaction As leveled compaction strategy stands to gain the most from incrementally opening sstables. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <292648d3fa4ea97376c0b4360754a20132194f63.1502822066.git.bdenes@scylladb.com>	2017-08-15 21:38:04 +03:00
Duarte Nunes	7fb6a74302	combined_mutation_reader: Drop exhausted readers if not in FF mode Exhausted readers can be fast forwarded, so we have to keep them around. However, if the current reader is not fast forwardable, then we can drop those readers and their buffers. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-08-14 14:37:27 +02:00
Botond Dénes	94fc550e68	sstable_set::incremental_selector: select() now returns a selection A selection contains - in addition to the list of sstables - a next_token which is a hint as to what is the next best token to call select() with. This should be the smallest token such that at the next call to select() the least number of new sstables will be returned, without skipping any.	2017-08-09 16:27:33 +03:00
Glauber Costa	4f01ec0910	restrict background writers to 50 % of CPU. In scylla, we have foreground processes, which are latency sensitive and need to be responded to as fast as possible in order to maintain good latency profiles, and background processes, which are less so. The most important background processes we have during normal write workload operations are memtable writes and sstable compactions. Those processes are quite CPU-intensive, and left unchecked will easily dominate the CPU. Lower values of task-quota usually help, as they will force those processes to preempt more, but aren't enough to guarantee good isolation. We have seen boxes with good NVMe storage having their throughput reduced to less than half of the original baseline in a short dive down for the duration of a compaction. In the long run, our goal is to leverage the CPU scheduler to make sure that those processes are balanced with respect to all the others. However, the current state of affairs is causing grievances at this very moment. Thankfully, those processes live in a seastar::thread, which ships with its own rudimentary bandwidth control mechanism: the scheduling group. The goal of this patch is to wrap background processes together in a scheduling group, and assign to such group 50 % of our CPU power; the remainder being left to foreground processes. While we pride ourselves in dynamically adjusting things to the workload, we won't be able to do this properly before the CPU scheduler lands - and let's face it, leaving background processes run wild is not adaptive either. Every workload would benefit most from a different value for such shares, but 50 % is as fair as it gets if we really need static partitioning in the mean time. As a defense against unforeseen consequences, we'll leave the actual value as an option, but will do our best to hide it - as this is not a tunable that we want to be part of a normal Scylla setup. 
The most convenient place for this tunable is still db::config, so we can easily pass it down to the database layer - but we will not document it in the yaml, and will clearly note in the help string that it is not supposed to be tuned. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-07-18 23:35:33 -04:00
Raphael S. Carvalho	4351e0a996	compaction: introduce new compaction type for reshard so now user can look at nodetool compactionstats and determine whether or not resharding is running, for example: $ ./bin/nodetool compactionstats pending tasks: 3 id compaction type keyspace table completed total unit progress <none> RESHARD system compaction_history 11 256 keys 4.30% <none> RESHARD system compaction_history 2 256 keys 0.78% <none> RESHARD system compaction_history 10 256 keys 3.91% <none> RESHARD system compaction_history 8 256 keys 3.12% <none> RESHARD system compaction_history 7 256 keys 2.73% Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170620175733.25882-1-raphaelsc@scylladb.com>	2017-06-22 14:48:38 +03:00
Raphael S. Carvalho	41137c7fb6	compaction: use sstable::bytes_on_disk for calculating start and end size Currently, start and end size of compaction are calculated using the uncompressed size of data component. bytes_on_disk() returns size used by all components. NOTE: start and end size are written to compaction history, so users who monitor it should be aware of this change. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170525212129.6758-1-raphaelsc@scylladb.com>	2017-05-28 11:33:24 +03:00
Avi Kivity	ebaeefa02b	Merge seastar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Raphael S. Carvalho	ddc1d80c28	compaction: remove dead function declaration Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170504013046.23522-2-raphaelsc@scylladb.com>	2017-05-04 11:48:51 +03:00
Raphael S. Carvalho	61229ab88c	compaction: fix type for cleanup After compaction revamp, compaction type set by cleanup at its ctor is being overwritten at compaction::setup(). Consequently, cleanup would not be stopped by 'nodetool stop cleanup' and cleanup would be listed as regular compaction in 'nodetool compactionstats'. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170504013046.23522-1-raphaelsc@scylladb.com>	2017-05-04 11:48:50 +03:00
Raphael S. Carvalho	3071b9052a	compaction: make cleanup_compaction inherit from regular_compaction Some fields that belong to regular and cleanup aren't needed for resharding_compaction, such as incremental selector (which is used for determining max purgeable timestamp for a given decorated key) Better move those fields to regular and make cleanup inherit from regular compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170428195611.9196-1-raphaelsc@scylladb.com>	2017-04-30 19:37:09 +03:00
Raphael S. Carvalho	687a4bb0c2	dtcs: do not compact fully expired sstable which ancestor is not deleted yet Currently, fully expired sstable[1] is unconditionally chosen for compaction by DTCS, but that may lead to a compaction loop under certain conditions. Let's consider that an almost expired sstable is compacted, and it's not deleted yet, and that the new sstable becomes expired before its ancestor is deleted. Because this new sstable is expired, it will be chosen by DTCS, but it will not be purged because 'compacted undeleted' sstables are taken into account by calculation of max purgeable timestamp and prevents expired data from being purged. The problem is that this sequence of events can keep happening forever as reported by issue #2260. NOTE: This problem was easier to reproduce before improvement on compaction of expired cells, because fully expired sstable was being converted into a sstable full of tombstones, which is also considered fully expired. Fixes #2260. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20170428233554.13744-1-raphaelsc@scylladb.com>	2017-04-30 19:35:46 +03:00
Duarte Nunes	4e693383f7	mutation_partion: Use row_tombstone This patch replaces the current row tombstone representation by a row_tombstone. The intent of the patch is thus to reify the idea of shadowable tombstones, which up until now we considered all materialized view row tombstones to be. We need to distinguish shadowable from non-shadowable row tombstones to support scenarios such as, when inserting to a table with a materialized view: 1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1 2. delete from base using timestamp 2 where p = 3 3. insert into base (p, v1) values (3, 1) using timestamp 3 These should yield a view row where v2 is definitely null, but with the current implementation, v2 will pop back with its value v2=3@TS=1, even though it's dead in the base row. This is because the row tombstone inserted at 2) is a shadowable one. This patch only addresses the memory representation of such row_tombstones. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-04-25 11:46:33 +02:00
Raphael S. Carvalho	0127309820	sstables: extend compaction for new resharding Extends compaction for new resharding algorithm. Not wired yet. New resharding will compact shared sstable(s) and create one sstable for each owner. It's up to the caller to open these new unshared sstables at their respective column families. This new approach will save a lot of bandwidth because we'll no longer read the entire shared sstable #smp::count times. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:08 -03:00
Raphael S. Carvalho	2a437ab427	compaction: rework compacting_sstable_writer to work with multiple writers compacting_sstable_writer only allowed one writer so far, but we will need multiple ones for resharding. It's done by moving writer management to compaction. finish_sstable_writer() is added for compaction impl to stop all writers, whereas stop_sstable_writer() will only stop current writer (needed when current sstable reaches max limit size for example). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:05 -03:00
Raphael S. Carvalho	a35a3a9647	compaction: prepare compacting_sstable_writer to work with writers No need for compacting_sstable_writer to store items that are available in compaction class. Also, that's a step towards supporting multiple writers for compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:03 -03:00
Raphael S. Carvalho	38ed83e2f7	sstables: rework compaction to make it easy to extend compact_sstables() supported both regular and cleanup compaction, but with lots of conditions that made it ugly and hard to extend. In the future, we want to introduce a new type of compaction for resharding that will create one sstable for every shard owning the sstable(s) given as input. That will be easier now. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-04-21 17:11:02 -03:00
Tomasz Grabiec	124dde30db	sstables: Extract writer parameters into config objects Also enables users to change the default promoted index block size.	2017-03-10 14:42:22 +01:00


140 Commits