scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 05:26:58 +00:00

Author	SHA1	Message	Date
Takuya ASADA	828b63f4fb	dist/redhat: manage .pyc as a part of package Since we don't install .pyc files on our package, python3 will generate .pyc file when we launch setup script first time. Then we will have unmanaged files under script directory, it will remain when Scylla package upgraded / removed. We need to compile .py when we generate relocatable package, add compiled .pyc files on .rpm/.deb packages. Fixes #4612 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <20190627053530.10406-1-syuu@scylladb.com>	2019-06-27 14:22:39 +03:00
Avi Kivity	dd76943125	Merge "Segregate data when streaming by timestamp for time window compaction strategy" from Botond " When writing streamed data into sstables, while using time window compaction strategy, we have to emit a new sstable for each time window. Otherwise we can end up with sstables, mixing data from wildly different windows, ruining the compaction strategy's ability to drop entire sstables when all data within is expired. This gets worse as these mixed sstables get compacted together with sstables that used to contain a single time window. This series provides a solution to this by segregating the data by its atoms' time-windows. This is done on the new RPC streaming and the new row-level repair, memtable-flush and compaction, ensuring that the segregation requirement is respected at all times. Fixes: #2687 " * 'segregate-data-into-sstables-by-time-window-streaming/v2.1' of ssh://github.com/denesb/scylla: streaming,repair: restore indentation repair: pass the data stream through the compaction strategy's interposer consumer streaming: pass the data stream through the compaction strategy's interposer consumer TWCS: implement add_interposer_consumer() compaction_strategy: add add_interposer_consumer() Add mutation_source_metadata tests: add unit test for timestamp_based_splitting_writer Add timestamp_based_splitting_writer Introduce mutation_writer namespace	2019-06-26 19:18:52 +03:00
Tomasz Grabiec	3e30a33e31	Merge "Introduce tests::random_schema" from Botond Most of our tests use overly simplistic schemas (`simple_schema`) or very specialized ones that focus on exercising a specific area of the tested code. This is fine in most places as not all code is schema dependent, however practice has shown that there can be nasty bugs hiding in dark corners that only appear with a schema that has a specific combination of types. This series introduces `tests::random_schema`, a utility class for generating random schemas and random data for them. An important goal is to make using random schemas in tests as simple and convenient as possible, therefore fostering the appearance of tests using random schemas. Random schema was developed to help testing code I'm currently working on, which segregates data by time-windows. As I wasn't confident in my ability to think of every possible combination of types that can break my code I came up with random-schema to help me find these corner cases. So far I consider it a success, it already found bugs in my code that I'm not sure I would have found if I had relied on specific schemas. It also found bugs in unrelated areas of the code which proves my point in the first paragraph. * https://github.com/denesb/scylla.git random_schema/v5: tests/data_model: approximate to the modeled data structures data_value: add ascii constructor tests/random-utils.hh: add stepped_int_distribution tests/random-utils.hh: get_int() add overloads that accept external rand engine tests/random-utils.hh: add get_real() tests: introduce random_schema	2019-06-26 18:10:20 +02:00
Botond Dénes	12b8405720	streaming,repair: restore indentation Deferred from the previous two patches.	2019-06-26 18:45:36 +03:00
Botond Dénes	e3f4692868	repair: pass the data stream through the compaction strategy's interposer consumer	2019-06-26 18:45:36 +03:00
Botond Dénes	9c2407573c	streaming: pass the data stream through the compaction strategy's interposer consumer	2019-06-26 18:45:36 +03:00
Botond Dénes	ee563928df	TWCS: implement add_interposer_consumer() Exploit the interposer customization point to inject a consumer that will segregate the mutation stream based on the contained atoms' timestamps, allowing the requirements of TWCS to be maintained every time sstables are written to disk. For the implementation, `timestamp_based_splitting_writer` is used, with a classifier that maps timestamps to windows.	2019-06-26 18:45:36 +03:00
Tomasz Grabiec	2d3e3640df	Merge "Collection: use utils::chunked_vector to store the cells" from Botond This is a band-aid patch that is supposed to fix the immediate problem of large collections causing large allocations. The proper fix is to use IMR but that will take time. In the meanwhile alleviate the pressure on the memory allocator by using a chunked storage collection (utils::chunked_vector) instead of std::vector. In the linked issue seastar::chunked_fifo was also proposed as the container to use, however chunked fifo is not traversable in reverse which disqualifies it from this role. Refs: #3602	2019-06-26 15:32:25 +02:00
Botond Dénes	a280dcfe4c	compaction_strategy: add add_interposer_consumer() This will be the customization point for compaction strategies, used to inject a specific interposer consumer that can manipulate the fragment stream so that it satisfies the requirements of the compaction strategy. For now the only candidate for injecting such an interposer is time-window compaction strategy, which needs to write sstables that only contain atoms belonging to the same time-window. By default no interposer is injected. Also add an accompanying customization point `adjust_partition_estimate()` which returns the estimated per-sstable partition-estimate that the interposer will produce.	2019-06-26 15:45:59 +03:00
Botond Dénes	3ce902a4be	Add mutation_source_metadata This struct contains metadata regarding a mutation_source. Currently it contains the min and max timestamp. This will be used later by compaction strategies to determine whether a given mutation stream has to be split or not.	2019-06-26 15:45:59 +03:00
Botond Dénes	25d7cbedc0	tests: add unit test for timestamp_based_splitting_writer	2019-06-26 15:45:59 +03:00
Botond Dénes	df29600eec	Add timestamp_based_splitting_writer This writer implements the core logic of time-window based data segregation. It splits the fragment stream provided by a reader, such that each atom (cell) in the stream will be written into a consumer based on the time-window its timestamp belongs to. The end result is that each consumer will only see fragments whose atoms all have timestamps belonging to the same time-window. When a mutation fragment has atoms belonging to different time-windows, it is split into as many fragments as needed so each has only atoms that belong to the same time-window.	2019-06-26 15:45:59 +03:00
Botond Dénes	2693f1838a	Introduce mutation_writer namespace Currently there is a single mutation_writer: `multishard_writer`, however in the next patch we are going to add another one. This is the right moment to move these into a common namespace (and folder), we have way too much stuff scattered already in the top-level namespace (and folder). Also rename `tests/multishard_writer_test.cc` to `tests/mutation_writer_test.cc`, this test-suite will be the home of all the different mutation writers' unit test cases.	2019-06-26 15:45:59 +03:00
Avi Kivity	adcc95dddc	Merge "sstable: mc: reader: Optimize multi-partition scans for data sets with small partitions" from Tomasz " Currently, parser and the consumer save its state and return the control to the caller, which then figures out that it needs to enter a new partition, and that it doesn't need to skip. We do it twice, after row end, and after row start. All this work could be avoided if the consumer installed by the reader adjusted its state and pushed the fragments on the spot. This patch achieves just that. This results in less CPU overhead. The ka/la reader is left still stopping after row end. Brings a 20% improvement in frag/s for a full scan in perf_fast_forward (Haswell, NVMe): perf_fast_forward -c1 -m1G --run-tests=small-partition-skips: Before: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu -> 1 0 0.952372 4 1000000 1050009 755 1050765 1046585 976.0 971 124256 1 0 0 0 0 0 0 0 99.7% After: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu -> 1 0 0.790178 4 1000000 1265538 1150 1266687 1263684 975.0 971 124256 2 0 0 0 0 0 0 0 99.6% Tests: unit (dev) " * 'sstable-optimize-partition-scans' of https://github.com/tgrabiec/scylla: sstable: mc: reader: Do not stop parsing across partitions sstables: reader: Move some parser state from sstable_mutation_reader to mp_row_consumer_reader sstables: reader: Simplify _single_partition_read checking sstables: reader: Update stats from on_next_partition() sstables: mutation_fragment_filter: Drop unnecessary calls to _walker.out_of_range() sstables: ka/la: reader make push_ready_fragments() safe to call many times sstables: mc: reader: Move out-of-range check out of push_ready_fragments() sstables: reader: Return void from push_ready_fragments() sstables: reader: Rename on_end_of_stream() to on_out_of_clustering_range() sstables: ka/la: reader: Make sure push_ready_fragments() does not miss to emit partition_end	2019-06-26 13:19:12 +03:00
Avi Kivity	06a9596491	tests: cql_test_env: disable commitlog O_DSYNC O_DSYNC causes commitlog to pre-allocate each commitlog segment by writing zeroes into it. In normal operation, this is amortized over the many times the segment will be reused. In tests, this is wasteful, but under the default workstation configuration with /tmp using tmpfs, no actual writes occur. However on a non-default configuration with /tmp mounted on a real disk, this causes huge disk I/O and eventually a crash (observed in schema_change_test). The crash is likely only caused indirectly, as the extra I/O (exacerbated by many tests running in parallel) causes timeouts. I reproduced this problem by running 15 copies of schema_change_test in parallel with /tmp mounted on a real filesystem. Without this change, I usually observe one or two of the copies crashing, with the change they complete (and much more quickly, too).	2019-06-26 12:15:53 +02:00
Asias He	f0f0beba2e	repair: Move the global tracker object into repair_service The tracker object was a static object in repair.cc. At the time we initialize it, we do not know the smp::count, so we have to initialize the _repairs object when it is used on the fly. void init_repair_info() { if (_repairs.size() != smp::count) { _repairs.resize(smp::count); } } This introduces a race if init_repair_info is called on different thread(shard). To fix, put the tracker object inside the newly introduced repair_service object which is created in main.cc. Fixes #4593 Message-Id: <b1adef1c0528354d2f92f8aaddc3c4bee5dc8a0a.1561537841.git.asias@scylladb.com>	2019-06-26 12:53:10 +03:00
Botond Dénes	572a738777	collection: use chunked_vector to store cells This is a quick fix to the immediate problem of large collections causing large allocations, triggering stalls or OOM. The proper fix is to use IMR for storing the cells, but that is a complex change that will require time, so let's not stall/OOM in the meanwhile.	2019-06-26 11:40:44 +03:00
Botond Dénes	c68ffc330e	types: don't copy collection_type_impl::mutation_view Just because it's a view doesn't mean it's cheap to copy.	2019-06-26 11:39:41 +03:00
Rafael Ávila de Espíndola	94d2194c77	dht: token: Simplify operator< While this is a strict weak ordering, it is not obvious and duplicates a bit of logic. This patch simplifies it by using tri_compare. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20190621204820.37874-1-espindola@scylladb.com>	2019-06-25 19:06:30 +03:00
Tomasz Grabiec	269e65a8db	Merge "Sync schema before repair" from Asias This series makes sure new schema is propagated to repair master and follower nodes before repair. Fixes #4575 * dev.git asias/repair_pull_schema_v2: migration_manager: Add sync_schema repair: Sync schema from follower nodes before repair	2019-06-25 19:05:29 +03:00
Amos Kong	f0cd589a75	dist: suppress the yaml load warning YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details. Fix it by using the new safe interface, yaml.safe_load(). Signed-off-by: Amos Kong <amos@scylladb.com> Cc: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <9b68601845117274573474ede0341cc81f80efa6.1561156205.git.amos@scylladb.com>	2019-06-25 19:05:29 +03:00
Avi Kivity	fc629bb14f	Merge "cql3: lift infinite bound check" from Benny & Piotr " If the database supports infinite bound range deletions, CQL layer will no longer throw an error indicating that both ranges need to be specified. Fixes #432 Update test_range_deletion_scenarios unit test accordingly. " * 'cql3-lift-infinite-bound-check' of https://github.com/bhalevy/scylla: cql3: lift infinite bound check if it's supported service: enable infinite bound range deletions with mc database: add flag for infinite bound range deletions	2019-06-25 19:05:29 +03:00
Nadav Har'El	a88c9ca5a5	Merge branch 'add_proper_aggregation_for_paged_indexing_2' of git://github.com/psarna/scylla into next Piotr Sarna says: Fixes #4540 This series adds proper handling of aggregation for paged indexed queries. Before this series returned results were presented to the user in per-page partial manner, while they should have been returned as a single aggregated value. Tests: unit(dev) Piotr Sarna (8): cql3: split execute_base_query implementation cql3: enable explicit copying of query_options cql3: add a query options constructor with explicit page size cql3: add proper aggregation to paged indexing cql3: make DEFAULT_COUNT_PAGE_SIZE constant public tests: add query_options to cquery_nofail tests: add indexing + paging + aggregation test case tests: add indexing+paging test case for clustering keys	2019-06-25 19:05:29 +03:00
Avi Kivity	7195f75fb2	Update seastar submodule * seastar ded50bd8a4...b629d5ef7a (9): > sharded: no_sharded_instance_exception: fix grammar > core,net: output_stream: remove redundant std::move() > perftune: make sure that ethtool -K has a chance of succeeding > net/dpdk: upgrade to dpdk-19.05 > perftune.py: Fix a few more places where we use deprecated pyudev.Device ones > reactor: provide an uptime function > rpc: add sink::flush() to streaming api > Use a table to document the various build modes > foreign_ptr: Fix compilation error due to unused variable	2019-06-25 19:05:29 +03:00
Avi Kivity	9d21341733	review-checklist.md: add common checks - code style - naming - micro-performance - concurrency - unit-testing - templates and type erasure - singletons	2019-06-25 19:05:29 +03:00
Piotr Sarna	efa7951ea5	main: stop view builder conditionally The view builder is started only if it's enabled in config, via the view_building=true variable. Unfortunately, stopping the builder was unconditional, which may result in failed assertions during shutdown. To remedy this, view building is stopped only if it was previously started. Fixes #4589	2019-06-25 19:05:29 +03:00
Asias He	bb5665331c	repair: Sync schema from follower nodes before repair Since commit "repair: Use the same schema version for repair master and followers", repair master and followers use the same schema version that master decides to use during the whole repair operation. If master has an older version of the schema, repair could ignore the data which makes use of the new schema, e.g., writes to new columns. To fix, always sync the schema agreement before repair. The master node pulls schema from followers and applies locally. The master then uses the "merged" schema. The followers use get_schema_for_write() to pull the "merged" schema. Fixes #4575 Backports: 3.1	2019-06-25 17:13:47 +08:00
Asias He	14c1a71860	migration_manager: Add sync_schema Makes sure this node knows about all schema changes known by "nodes" that were made prior to this call. Refs: #4575 Backports: 3.1	2019-06-25 17:13:47 +08:00
Botond Dénes	d00cb4916c	tests: introduce random_schema random_schema is a utility class that provides methods for generating random schemas as well as generating data (mutations) for them. The aim is to make using random schemas in tests as simple and convenient as is using `simple_schema`. For this reason the interface of `random_schema` follows closely that of `simple_schema` to the extent that it makes sense. An important difference is that `random_schema` relies on `data_model` to actually build mutations. So all its mutation-related operations work with `data_model::mutation_description` instead of actual `mutation` objects. Once the user has arrived at the desired mutation description they can generate an actual mutation via `data_model::mutation_description::build()`. In addition to the `random_schema` class, the `random_schema.hh` header exposes the generic utility classes for generating types and values that it internally uses. random_schema is fully deterministic. Using the same seed and the same set of operations is guaranteed to result in generating the same schema and data.	2019-06-25 12:01:33 +03:00
Botond Dénes	070d72ee23	tests/random-utils.hh: add get_real()	2019-06-25 12:01:33 +03:00
Botond Dénes	2d9f6c3b63	tests/random-utils.hh: get_int() add overloads that accept external rand engine	2019-06-25 12:01:33 +03:00
Botond Dénes	2a7710129e	tests/random-utils.hh: add stepped_int_distribution	2019-06-25 12:01:33 +03:00
Botond Dénes	a3f9932a2f	data_value: add ascii constructor To allow a `data_value` with `ascii_type` to be constructed.	2019-06-25 12:01:33 +03:00
Botond Dénes	1bd8b77770	tests/data_model: approximate to the modeled data structures Make the data modelling structures model their "real" counterparts more closely, allowing the user greater control on the produced data. The changes: * Add timestamp to atomic_value (which is now a struct, not just an alias to bytes). * Add tombstone to collection. * Add row_tombstone to row. * Add bound kinds and tombstone to range_tombstone. Great care was taken to preserve backward compatibility, to avoid unnecessary changes in existing code.	2019-06-25 12:01:33 +03:00
Piotr Sarna	add40d4e59	cql3: lift infinite bound check if it's supported If the database supports infinite bound range deletions, CQL layer will no longer throw an error indicating that both ranges need to be specified. [bhalevy] Update test_range_deletion_scenarios unit test accordingly. Fixes #432 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-06-24 15:58:34 +03:00
Piotr Sarna	c19fdc4c90	service: enable infinite bound range deletions with mc As soon as it's agreed that the cluster supports sstables in mc format, infinite bound range deletions in statements can be safely enabled. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-06-24 15:58:28 +03:00
Piotr Sarna	e77ef849af	database: add flag for infinite bound range deletions Database can only support infinite bound range deletions if sstable mc format is supported. As a first step to implement these checks, an appropriate flag is added to database.	2019-06-24 15:57:47 +03:00
Piotr Sarna	b668ee2b2d	tests: add indexing+paging test case for clustering keys Indexing a non-prefix part of the clustering key has a separate code path (see issue #3405), so it deserves a separate test case.	2019-06-24 14:51:17 +02:00
Piotr Sarna	3d9a37f28f	tests: add indexing + paging + aggregation test case Indexed queries used to erroneously return partial per-page results for aggregation queries. This test case used to reproduce the problem and now ensures that there would be no regressions. Refs #4540	2019-06-24 14:06:42 +02:00
Piotr Sarna	60cafcc39c	tests: add query_options to cquery_nofail The cquery_nofail utility is extended, so it can accept custom query options, just like execute_cql does.	2019-06-24 14:06:41 +02:00
Piotr Sarna	fe18638de3	cql3: make DEFAULT_COUNT_PAGE_SIZE constant public The constant will be later used in test scenarios.	2019-06-24 13:21:37 +02:00
Piotr Sarna	bb08af7e68	cql3: add proper aggregation to paged indexing Aggregated and paged filtering needs to aggregate the results from all pages in order to avoid returning partial per-page results. It's a little bit more complicated than regular aggregation, because each paging state needs to be translated between the base table and the underlying view. The routine keeps fetching pages from the underlying view, which are then used to fetch base rows, which go straight to the result set builder. Fixes #4540	2019-06-24 13:21:32 +02:00
Piotr Sarna	97d476b90f	cql3: add a query options constructor with explicit page size For internal use, there already exists a query_options constructor that copies data from another query_options with overwritten paging state. This commit adds an option to overwrite page size as well.	2019-06-24 13:21:32 +02:00
Piotr Sarna	fa89e220ef	cql3: enable explicit copying of query_options	2019-06-24 12:57:04 +02:00
Piotr Sarna	7a8b243ce4	cql3: split execute_base_query implementation In order to handle aggregation queries correctly, the function that returns base query results is split into two, so it's possible to access raw query results, before they're converted into end-user CQL message.	2019-06-24 12:57:03 +02:00
Benny Halevy	b1e78313fe	log_histogram: log_heap_options::bucket_of: avoid calling pow2_rank(0) pow2_rank is undefined for 0. bucket_of currently works around that by using a bitmask of 0. To allow asserting that count_{leading,trailing}_zeros are not called with 0, we want to avoid it at all call sites. Fixes #4153 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20190623162137.2401-1-bhalevy@scylladb.com>	2019-06-23 19:32:51 +03:00
Avi Kivity	779b378785	Merge "Fix partitioned_sstable_set by making it self sufficient" from Raphael & Benny " partitioned_sstable_set is not self sufficient because it relies on compatible_ring_position_view, which in turn relies on lifetime of sstable object. This leads to use-after-free. Fix this problem by introducing compatible_ring_position and using it in p__s__s. Fixes #4572. Test: unit (dev), compaction dtests (dev) " * 'projects/fix_partitioned_sstable_set/v4' of ssh://github.com/bhalevy/scylla: tests: Test partitioned sstable set's self-sufficiency sstables: Fix partitioned_sstable_set by making it self sufficient Introduce compatible_ring_position and compatible_ring_position_or_view	2019-06-23 17:17:18 +03:00
Raphael S. Carvalho	14fa7f6c02	tests: Test partitioned sstable set's self-sufficiency Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-06-23 16:29:13 +03:00
Raphael S. Carvalho	293557a34e	sstables: Fix partitioned_sstable_set by making it self sufficient Partitioned sstable set is not self sufficient, because it uses compatible_ring_position_view as key for interval map, which is constructed from a decorated key in sstable object. If sstable object is destroyed, like when compaction releases it early, partitioned set potentially no longer works because c__r__p__v would store information that is already freed, meaning its use implies use-after-free. Therefore, the problem happens when partitioned set tries to access the interval of its interval map and uses freed information from c__r__p__v. Fix is about using the newly introduced compatible_ring_position_or_view which can hold a ring_position, meaning that partitioned set is no longer dependent on lifetime of sstable object. Retire compatible_ring_position_view.hh as it is now unused. Fixes #4572. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-06-23 16:29:13 +03:00
Raphael S. Carvalho	9a83561700	Introduce compatible_ring_position and compatible_ring_position_or_view The motivation for supporting ring position is that containers using it can be self sufficient. The existing compatible_ring_position_view could lead to use after free when the ring position data, it was built from, is gone. The motivation for compatible_ring_position_or_view is to allow lookup on containers that don't support different key types using c__r__p, and also to avoid unnecessary copies. If the user is provided only with a ring_position_view, c__r__p__or_v could be built from it and used for lookups. Converting ring_position_view to ring_position is very bug prone because there could be information lost in the process. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-06-23 16:29:12 +03:00


18846 Commits
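
The time-window segregation described in the `timestamp_based_splitting_writer` commit (df29600eec) can be illustrated with a small sketch. This is not the Scylla implementation, which is C++ and operates on mutation fragments; it is a simplified Python model under stated assumptions. The names `window_of` and `split_by_window`, the `(key, {column: (timestamp, value)})` fragment shape, and the fixed-size window classifier are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def window_of(timestamp, window_size):
    """Hypothetical classifier: map a timestamp to the start of its window."""
    return timestamp - (timestamp % window_size)


def split_by_window(fragments, window_size):
    """Group cells by time window, splitting fragments that span windows.

    `fragments` is an iterable of (key, cells) pairs, where `cells` maps a
    column name to a (timestamp, value) tuple. This shape is an assumption
    made for the sketch, not Scylla's real data model.
    Returns {window_start: [(key, cells_in_that_window), ...]}.
    """
    buckets = defaultdict(list)
    for key, cells in fragments:
        # Partition this fragment's cells by the window of each timestamp.
        per_window = defaultdict(dict)
        for column, (timestamp, value) in cells.items():
            per_window[window_of(timestamp, window_size)][column] = (timestamp, value)
        # Emit one sub-fragment per window the original fragment touched.
        for window, sub_cells in per_window.items():
            buckets[window].append((key, sub_cells))
    return dict(buckets)
```

Each bucket plays the role of one consumer (one output sstable): every cell it receives has a timestamp in the same window, and a fragment whose cells span several windows ends up as several smaller fragments, one per window.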