scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-06 23:13:15 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	2fc144e1a8	tests: memtable_snapshot_source: Allow changing the schema	2019-10-03 22:03:29 +02:00
Tomasz Grabiec	22dde90dba	tests: simple_schema: Prepare for schema altering Currently, methods of simple_schema assume that table's schema doesn't change. Accessors like get_value() assume that rows were generated using simple_schema::_s. Because of that, the column_definition& for the "v" column is cached in the instance. That column_definition& cannot be used to access objects created with a different schema version. To allow using simple_schema after schema changes, column_definition& caching is now tagged with the table schema version of origin. Methods which access schema-dependent objects, like get_value(), are now accepting schema& corresponding to the objects. Also, it's now possible to tell simple_schema to use a different schema version in its generator methods.	2019-10-03 22:03:29 +02:00
Tomasz Grabiec	e6afc89735	row_cache: Record upgraded schema in memtable entries during update Cache update may defer in the middle of moving of partition entry from a flushed memtable to the cache. If the schema was changed since the entry was written, it upgrades the schema of the partition_entry first but doesn't update the schema_ptr in memtable_entry. The entry is removed from the memtable afterward. If a memtable reader encounters such an entry, it will try to upgrade it assuming it's still at the old schema. That is undefined behavior in general, which may include: - read failures due to bad_alloc, if fixed-size cells are interpreted as variable-sized cells, and we misinterpret a value for a huge size - wrong read results - node crash This doesn't result in a permanent corruption, restarting the node should help. It's the more likely to happen the more rows there are in a partition. It's unlikely to happen with single-row partitions. Introduced in `70c7277`. Fixes #5128.	2019-10-03 22:03:29 +02:00
Tomasz Grabiec	ea461a3884	memtable: Extract memtable_entry::upgrade_schema()	2019-10-03 22:03:29 +02:00
Tomasz Grabiec	90d6c0b9a2	row_cache, mvcc: Prevent locked snapshots from being evicted If the whole partition entry is evicted while being updated from the memtable, a subsequent read may populate the partition using the old version of data if it attempts to do it before cache update advances past that partition. Partial eviction is not affected because populating reads will notice that there is a newer snapshot corresponding to the updater. This can happen only in OOM situations where the whole cache gets evicted. Affects only tables with multi-row partitions, which are the only ones that can experience the update of partition entry being preempted. Introduced in `70c7277`. Fixes #5134.	2019-10-03 22:03:29 +02:00
Tomasz Grabiec	57a93513bd	row_cache: Make evict() not use invalidate_unwrapped() invalidate_unwrapped() calls cache_entry::evict(), which cannot be called concurrently with cache update. invalidate() serializes it properly by calling do_update(), but evict() doesn't. The purpose of evict() is to stress eviction in tests, which can happen concurrently with cache update. Switch it to use memory reclaimer, so that it's both correct and more realistic. evict() is used only in tests.	2019-10-03 22:03:28 +02:00
Tomasz Grabiec	c88a4e8f47	mvcc: Introduce partition_snapshot::touch()	2019-10-03 22:03:28 +02:00
Tomasz Grabiec	25e2f87a37	row_cache, mvcc: Do not upgrade schema of entries which are being updated When a read enters a partition entry in the cache, it first upgrades it to the current schema of the cache. The same happens when an entry is updated after a memtable flush. Upgrading the entry is currently performed by squashing all versions and replacing them with a single upgraded version. That has a side effect of detaching all snapshots from the partition entry. Partition entry update on memtable flush is writing into a snapshot. If that snapshot is detached by a schema upgrade, the entry will be missing writes from the memtable which fall into continuous ranges in that entry which have not yet been updated. This can happen only if the update of the entry is preempted and the schema was altered during that, and a read hit that partition before the update went past it. Affects only tables with multi-row partitions, which are the only ones that can experience the update of partition entry being preempted. The problem is fixed by locking updated entries and not upgrading schema of locked entries. cache_entry::read() is prepared for this, and will upgrade on-the-fly to the cache's schema. Fixes #5135	2019-10-03 22:03:28 +02:00
Tomasz Grabiec	0675088818	row_cache: Use the correct schema version to populate the partition entry The sstable reader which populates the partition entry in the cache is using the schema of the partition entry snapshot, which will be the schema of the cache at the time the partition was entered. If there was a schema change after the cache reader entered the partition but before it created the sstable reader, the cache populating reader will interpret sstable fragments using the wrong schema version. That is more likely if partitions have many rows, and the front of the partition is populated. With single-row partitions that's unlikely to happen. That is undefined behavior in general, which may include: - read failures due to bad_alloc, if fixed-size cells are interpreted as variable-sized cells, and we misinterpret a value for a huge size - wrong read results - node crash This doesn't result in a permanent corruption, restarting the node should help. Fixes #5127.	2019-10-03 22:03:28 +02:00
Tomasz Grabiec	10992a8846	delegating_reader: Optimize fill_buffer() Use move_buffer_content_to() which is faster than fill_buffer_from() because it doesn't involve popping and pushing the fragments across buffers. We save on size estimation costs.	2019-10-03 22:03:28 +02:00
Tomasz Grabiec	aad1307b14	row_cache, memtable: Use upgrade_schema()	2019-10-03 13:28:33 +02:00
Tomasz Grabiec	3177732b35	flat_mutation_reader: Introduce upgrade_schema()	2019-10-03 13:28:33 +02:00
Asias He	a9b95f5f01	repair: Fix tracker::start and tracker::done in case of error The operation after gate.enter() in tracker::start() can fail and throw, we should call gate.leave() in such case to avoid unbalanced enter and leave calls. tracker::done() has similar issue too. Fix it by removing the gate enter and leave logic in tracker start and done. A helper tracker::run() is introduced to take care of the gate and repair status. In addition, the error log is improved. It now logs exceptions on all shards in the summary. e.g., [shard 0] repair - repair id 1 failed: std::runtime_error ({shard 0: std::runtime_error (error0), shard 1: std::runtime_error (error1)}) Fixes #5074	2019-10-03 13:33:02 +03:00
Botond Dénes	00b432b61d	querier_cache: correctly account entries evicted on insertion in the population Currently, the population stat is not increased for entries that are evicted immediately on insert, however the code that does the eviction still decreases the population stat, leading to an imbalance and in some cases the underflow of the population stat. To fix, unconditionally increase the population stat upon inserting an entry, regardless of whether it is immediately evicted or not. Fixes: #5123 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20191001153215.82997-1-bdenes@scylladb.com>	2019-10-03 11:49:44 +03:00
Konstantin Osipov	e8c13efb41	lwt: move mutation hashers to mutation.hh Prepare mutation hashers for reuse in CAS implementation. Message-Id: <20190930202409.40561-2-kostja@scylladb.com>	2019-10-01 19:49:31 +02:00
Konstantin Osipov	6cde985946	lwt: remove code that no longer serves as a reference Remove ifdef'ed Java code, since LWT implementation is based on the current state of the origin. Message-Id: <20190930201022.40240-2-kostja@scylladb.com>	2019-10-01 19:46:15 +02:00
Konstantin Osipov	4d214b624b	lwt: ensure enum_set::of is constexpr. This allows using it to initialize const static members. Message-Id: <20190930200530.40063-2-kostja@scylladb.com>	2019-10-01 19:45:56 +02:00
Tomasz Grabiec	3b9bf9d448	Merge "storage_proxy: replace variadic futures with structs" from Avi Seastar variadic futures are deprecated, so replace with structs to avoid nasty deprecation warnings.	2019-10-01 19:32:55 +02:00
Avi Kivity	162730862d	storage_proxy: remove variadic future from query_partition_key_range_concurrent() Seastar variadic futures are deprecated, so replace with a nice struct.	2019-09-30 21:33:44 +03:00
Avi Kivity	968b34a2b4	storage_proxy: remove variadic future from digest_read_resolver Seastar variadic futures are deprecated, so replace with a nice struct.	2019-09-30 21:32:17 +03:00
Nadav Har'El	c9aae13fae	docs/alternator/getting-started.md: fix indentation in example code The example Python code had wrong indentation, and wouldn't actually work if naively copy-pasted. Noticed by Noam Hasson. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20190929091440.28042-1-nyh@scylladb.com>	2019-09-30 13:03:29 +03:00
Avi Kivity	c6b66d197b	Merge "Couple of preparatory patches for lwt" from Gleb " This is a collection of assorted patches that will be needed for LWT. Most of them are trivial, but one touches a lot of files, so it has a good chance of causing rebase headaches (I already had to rebase it on top of Alternator). Let's push them earlier instead of carrying them in the lwt branch. " * 'gleb/lwt-prepare-v2' of github.com:scylladb/seastar-dev: lwt: make _last_timestamp_micros static lwt: Add client_state::get_timestamp_for_paxos() function lwt: Pass client_state reference all the way to storage_proxy::query exceptions: Add a constructor for unavailable_exception that allows providing a custom message serializer: Add std::variant support lwt: Add missing functions to utils/UUID_gen.hh	2019-09-29 13:02:26 +03:00
Avi Kivity	9e990725d9	Merge "Simplify and explain from_varint_to_integer #5031 " from Rafael " This is the second version of the patch series. The previous one was just the second patch, this one adds more tests and another patch to make it easier to test that the new code has the same behavior as the old one. " * 'espindola/overflow-is-intentional' of https://github.com/espindola/scylla: types: Simplify and explain from_varint_to_integer Add more cast tests	2019-09-29 11:27:55 +03:00
Tomasz Grabiec	b0e0f29b06	db: read: Filter-out sstables using its first and last keys Affects single-partition reads only. Refs #5113 When executing a query on the replica we do several things in order to narrow down the sstable set we read from. For tables which use LeveledCompactionStrategy, we store sstables in an interval set and we select only sstables whose partition ranges overlap with the queried range. Other compaction strategies don't organize the sstables and will select all sstables at this stage. The reasoning behind this is that for non-LCS compaction strategies the sstables' ranges will typically overlap and using interval sets in this case would not be effective and would result in quadratic (in sstable count) memory consumption. The assumption for overlap does not hold if the sstables come from repair or streaming, which generates non-overlapping sstables. At a later stage, for single-partition queries, we use the sstables' bloom filter (kept in memory) to drop sstables which surely don't contain given partition. Then we proceed to sstable indexes to narrow down the data file range. Tables which don't use LCS will do unnecessary I/O to read index pages for single-partition reads if the partition is outside of the sstable's range and the bloom filter is ineffective (Refs #5112). This patch fixes the problem by consulting sstable's partition range in addition to the bloom filter, so that the non-overlapping sstables will be filtered out with certainty and not depend on bloom filter's efficiency. It's also faster to drop sstables based on the keys than the bloom filter. Tests: - unit (dev) - manual using cqlsh Reviewed-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20190927122505.21932-1-tgrabiec@scylladb.com>	2019-09-28 19:42:57 +03:00
Tomasz Grabiec	b93cc21a94	sstables: Fix partition key count estimation for a range The method sstable::estimated_keys_for_range() was severely under-estimating the number of partitions in an sstable for a given token range. The first reason is that it underestimated the number of sstable index pages covered by the range, by one. In extreme, if the requested range falls into a single index page, we will assume 0 pages, and report 1 partition. The reason is that we were using get_sample_indexes_for_range(), which returns entries with the keys falling into the range, not entries for pages which may contain the keys. A single page can have a lot of partitions though. By default, there is a 1:20000 ratio between summary entry size and the data file size covered by it. If partitions are small, that can be many hundreds of partitions. Another reason is that we underestimate the number of partitions in an index page. We multiply the number of pages by: (downsampling::BASE_SAMPLING_LEVEL * _components->summary.header.min_index_interval) / _components->summary.header.sampling_level Using defaults, that means multiplying by 128. In the cassandra-stress workload a single partition takes about 300 bytes in the data file and summary entry is 22 bytes. That means a single page covers 22 * 20'000 = 440'000 bytes of the data file, which contains about 1'466 partitions. So we underestimate by an order of magnitude. Underestimating the number of partitions will result in too small bloom filters being generated for the sstables which are the output of repair or streaming. This will make the bloom filters ineffective which results in reads selecting more sstables than necessary. The fix is to base the estimation on the number of index pages which may contain keys for the range, and multiply that by the average key count per index page. Fixes #5112. Refs #4994. The output of test_key_count_estimation: Before: count = 10000 est = 10112 est([-inf; +inf]) = 512 est([0; 0]) = 128 est([0; 63]) = 128 est([0; 255]) = 128 est([0; 511]) = 128 est([0; 1023]) = 128 est([0; 4095]) = 256 est([0; 9999]) = 512 est([5000; 5000]) = 1 est([5000; 5063]) = 1 est([5000; 5255]) = 1 est([5000; 5511]) = 1 est([5000; 6023]) = 128 est([5000; 9095]) = 256 est([5000; 9999]) = 256 est(non-overlapping to the left) = 1 est(non-overlapping to the right) = 1 After: count = 10000 est = 10112 est([-inf; +inf]) = 10112 est([0; 0]) = 2528 est([0; 63]) = 2528 est([0; 255]) = 2528 est([0; 511]) = 2528 est([0; 1023]) = 2528 est([0; 4095]) = 5056 est([0; 9999]) = 10112 est([5000; 5000]) = 2528 est([5000; 5063]) = 2528 est([5000; 5255]) = 2528 est([5000; 5511]) = 2528 est([5000; 6023]) = 5056 est([5000; 9095]) = 7584 est([5000; 9999]) = 7584 est(non-overlapping to the left) = 0 est(non-overlapping to the right) = 0 Tests: - unit (dev) Reviewed-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20190927141339.31315-1-tgrabiec@scylladb.com>	2019-09-28 19:36:43 +03:00
Piotr Sarna	10f90d0e25	types: remove deprecated comment The comment does not apply anymore, as this definition is no more in database.hh. Message-Id: <a0b6ff851e1e3bcb5fcd402fbf363be7af0219af.1569580556.git.sarna@scylladb.com>	2019-09-27 19:32:17 +02:00
Dejan Mircevski	9a89e0c5ec	dbuild: Update README on interactive mode `dbuild` was recently (`24c732057`) updated to run in interactive mode when given no arguments; we can now update the README to mention that. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2019-09-27 16:33:27 +02:00
Dejan Mircevski	f8638d8ae1	alternator: Add build byproducts to .gitignore Add .pytest_cache and expressions.tokens to the top-level .gitignore. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2019-09-27 16:18:45 +02:00
Dejan Mircevski	332ffa77ea	alternator: Actually use BEGINS_WITH in its tests For some reason, BEGINS_WITH tests used EQ as comparison operator. Tests: pytest test_expected.py Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2019-09-26 22:41:34 +03:00
Tomasz Grabiec	5b0e48f25b	Merge "toppartitions: don't transport schema_ptr across shards" from Avi When the toppartitions operation gathers results, it copies partition keys with their schema_ptr:s. When these schema_ptr:s are copied or destroyed, they can cause leaks or premature frees of the schema in its original shard since reference count operations are not atomic. Fix that by converting the schema_ptr to a global_schema_ptr during transportation. Fixes #5104 (direct bug) Fixes #5018 (schema prematurely freed, toppartitions previously executed on that node) Fixes #4973 (corrupted memory pool of the same size class as schema, toppartitions previously executed on that node) Tests: new test added that fails with the existing code in debug mode, manual toppartitions test	2019-09-26 17:09:54 +02:00
Avi Kivity	36b4d55b28	tests: add test for toppartitions cross-shard schema_ptr copy	2019-09-26 17:40:46 +03:00
Avi Kivity	670f398a8a	toppartitions: do not copy schema_ptr:s in item keys across shards Copying schema_ptrs across shards results in memory corruption since lw_shared_ptr does not use atomic operations for reference counts. Prevent that by converting schema_ptr:s to global_schema_ptr:s before shipping them across shards in the map operation, and converting them back to local schema_ptr:s in the reduce operation.	2019-09-26 17:26:40 +03:00
Avi Kivity	f015bd69b7	toppartitions: compare schemas using schema::id(), not pointer to schema This allows keys from different stages in the schema's life to compare equal. This is safe since the partition key cannot change, unlike the rest of the schema. More importantly, it will allow us to compare keys made local after a pass through global_schema_ptr, which does not guarantee that the schema_ptr conversion will be the same even when starting with the same global_schema_ptr.	2019-09-26 17:15:46 +03:00
Avi Kivity	ea4976a128	schema_registry: mark global_schema_ptr move constructor noexcept Throwing move constructors are a pain; so we should try to make them noexcept. Currently, global_schema_ptr's move constructor throws an exception if used illegally (moving from a different shard); this patch changes it to an assert, on the grounds that this error is impossible to recover from. The direct motivation for the patch is the desire to store objects containing a global_schema_ptr in a chunked_vector, to move lists of partition keys across shards for the toppartitions functionality. chunked_vector currently requires noexcept move constructors for its value_type.	2019-09-26 16:56:59 +03:00
Avi Kivity	ba64ec78cf	messaging_service: use rpc::tuple instead of variadic futures for rpc Since variadic future<> is deprecated, switch to rpc::tuple for multiple return values in rpc calls. This is more or less mechanical translation.	2019-09-26 12:09:31 +02:00
Tomasz Grabiec	9183e28f2c	Merge "Recreate dependent user types" from Rafael When a user type changes we were not recreating other user types that use it. This patch series fixes that and makes it clear which code is responsible for it. In the system.types table a user type refers to another by name. When a user type is modified, only its entry in the table is changed. At runtime a user type has a direct pointer to the types it uses. To handle the discrepancy we need to recreate any dependent types when an entry in system.types changes. Fixes #5049	2019-09-26 12:06:32 +02:00
Gleb Natapov	e0b303b432	lwt: make _last_timestamp_micros static If each client_state has its own copy of the variable two clients may generate timestamps that clash and needlessly create contention. Making the variable shared between all client_state on the same shard will make sure this will not happen to two clients on the same shard. It may still happen for two clients on two different shards or two different nodes.	2019-09-26 11:44:00 +03:00
Gleb Natapov	622d21f740	lwt: Add client_state::get_timestamp_for_paxos() function Paxos needs a unique timestamp that is greater than some other timestamp, so that the next round had more chances to succeed. Add a function that returns such a timestamp.	2019-09-26 11:44:00 +03:00
Gleb Natapov	e72a105b5e	lwt: Pass client_state reference all the way to storage_proxy::query client_state holds a state to generate monotonically increasing unique timestamp. Queries with a SERIAL consistency level need it to generate a paxos round.	2019-09-26 11:44:00 +03:00
Gleb Natapov	556f65e8a1	exceptions: Add a constructor for unavailable_exception that allows providing a custom message	2019-09-26 11:44:00 +03:00
Gleb Natapov	209414b4eb	serializer: Add std::variant support	2019-09-26 11:44:00 +03:00
Gleb Natapov	f9209e27d4	lwt: Add missing functions to utils/UUID_gen.hh Some lwt related code is missing in our UUID implementation. Add it.	2019-09-26 11:44:00 +03:00
Rafael Ávila de Espíndola	5af8b1e4a3	types: recreate dependent user types. In the system.types table a user type refers to another by name. When a user type is modified, only its entry in the table is changed. At runtime a user type has a direct pointer to the types it uses. To handle the discrepancy we need to recreate any dependent types when an entry in system.types changes. Fixes #5049 Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-09-25 15:41:45 -07:00
Rafael Ávila de Espíndola	4c3209c549	types: Don't include dependent user types in update. The way schema changes propagate is by editing the system tables and comparing the before and after state. When a user type A uses another user type B and we modify B, the representation of A in the system table doesn't change, so this code was not producing any changes on the diff that the receiving side uses. Deleting it makes it clear that it is the receiver's responsibility to handle dependent user types. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-09-25 15:41:45 -07:00
Rafael Ávila de Espíndola	34eddafdb0	types: Don't modify the type list in db::cql_type_parser::raw_builder With this patch db::cql_type_parser::raw_builder creates a local copy of the list of existing types and uses that internally. By doing that build() should have no observable behavior other than returning the new types. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-09-25 15:41:45 -07:00
Rafael Ávila de Espíndola	d6b2e3b23b	types: pass a reference to prepare_internal We were never passing a null pointer and never saving a copy of the lw_shared_ptr. Passing a reference is more flexible as not all callers are required to hold the user_types_metadata in a lw_shared_ptr. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-09-25 15:40:30 -07:00
Avi Kivity	03260dd910	Update seastar submodule * seastar b56a8c5045...c21a7557f9 (3): > net: socket::{set,get}_reuseaddr() should not be virtual > iotune: print verbose message in case of shutdown errors > iotune: close test file on shutdown Fixes #4946.	2019-09-25 16:08:32 +03:00
Tomasz Grabiec	06b9818e98	Merge "storage_proxy: tolerate view_update_write_response_handler id not found on shutdown" from Benny 1. Add assert in remove_response_handler to make crashes like in #5032 easier to understand. 2. Lookup the view_update_write_response_handler id before calling timeout_cb and tolerate it not found. Just log a warning if this happened. Fixes #5032	2019-09-25 14:49:42 +02:00
Avi Kivity	83bc59a89f	Merge "mvcc: Fix incorrect schema version being used to copy the mutation when applying (#5099 )" from Tomasz " Currently affects only counter tables. Introduced in `27014a2`. mutation_partition(s, mp) is incorrect because it uses s to interpret mp, while it should use mp_schema. We may hit this if the current node has a newer schema than the incoming mutation. This can happen during table schema altering when we receive the mutation from a node which hasn't processed the schema change yet. This is undefined behavior in general. If the alter was adding or removing columns, this may result in corruption of the write where values of one column are inserted into a different column. Fixes #5095. " * 'fix-schema-alter-counter-tables' of https://github.com/tgrabiec/scylla: mvcc: Fix incorrect schema version being used to copy the mutation when applying mutation_partition: Track and validate schema version in debug builds tests: Use the correct schema to access mutation_partition	2019-09-25 15:30:22 +03:00
Tomasz Grabiec	11440ff792	mvcc: Fix incorrect schema version being used to copy the mutation when applying Currently affects only counter tables. Introduced in `27014a2`. mutation_partition(s, mp) is incorrect, because it uses s to interpret mp, while it should use mp_schema. We may hit this if the current node has a newer schema than the incoming mutation. This can happen during alter when we receive the mutation from a node which hasn't processed the schema change yet. This is undefined behavior in general. If the alter was adding or removing columns, this may result in corruption of the write where values of one column are inserted into a different column. Fixes #5095.	2019-09-25 11:28:07 +02:00
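
The arithmetic in commit b93cc21a94 ("sstables: Fix partition key count estimation for a range") can be checked with a small sketch. This is not Scylla's code: the struct and function names below are hypothetical, and the inputs are the figures quoted in the commit message (22-byte summary entry, 1:20000 ratio, about 300 bytes per partition). The value 128 passed to the old per-page multiplier for each parameter is an assumption chosen only because it reproduces the "multiplying by 128" result stated in the message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a workload; field names are illustrative only.
struct workload {
    std::uint64_t summary_entry_bytes;         // 22 in the commit's example
    std::uint64_t data_bytes_per_summary_byte; // 20000, from the 1:20000 ratio
    std::uint64_t partition_bytes;             // about 300 in cassandra-stress
};

// Data file bytes covered by one index page (one summary entry).
inline std::uint64_t data_bytes_per_index_page(const workload& w) {
    return w.summary_entry_bytes * w.data_bytes_per_summary_byte;
}

// Approximate number of partitions stored under one index page.
inline std::uint64_t partitions_per_index_page(const workload& w) {
    assert(w.partition_bytes > 0);
    return data_bytes_per_index_page(w) / w.partition_bytes;
}

// The old per-page multiplier, written out from the formula in the message.
inline std::uint64_t old_keys_per_page(std::uint64_t base_sampling_level,
                                       std::uint64_t min_index_interval,
                                       std::uint64_t sampling_level) {
    assert(sampling_level > 0);
    return (base_sampling_level * min_index_interval) / sampling_level;
}

// The fixed approach as described: pages that may contain keys for the
// range, times the average key count per index page.
inline std::uint64_t estimate_keys(std::uint64_t pages_overlapping_range,
                                   std::uint64_t total_key_estimate,
                                   std::uint64_t total_index_pages) {
    if (total_index_pages == 0) {
        return 0;
    }
    return pages_overlapping_range * (total_key_estimate / total_index_pages);
}
```

With the commit's figures, one page covers 440000 bytes and holds roughly 1466 partitions, against an old multiplier of 128. The "After" test output is consistent with four index pages at 2528 keys each: one page gives 2528, two give 5056, three give 7584, and a non-overlapping range gives 0.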


19730 Commits
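
The accounting rule from commit 00b432b61d ("querier_cache: correctly account entries evicted on insertion in the population") can be illustrated with a toy cache. This is a minimal sketch under stated assumptions, not the querier_cache implementation: `tiny_cache`, its members, and the size-based eviction policy are all hypothetical. It shows only the invariant the commit describes: the population counter is increased on every insert, so the decrement done by eviction always has a matching increment, even when the new entry is evicted immediately.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical toy cache used only to illustrate balanced stat accounting.
class tiny_cache {
    std::deque<int> _entries;
    std::size_t _max_entries;
    std::uint64_t _population = 0; // unsigned, so an unmatched decrement would wrap

    // Eviction always decrements the population stat.
    void evict_oldest() {
        assert(!_entries.empty());
        _entries.pop_front();
        --_population;
    }

public:
    explicit tiny_cache(std::size_t max_entries) : _max_entries(max_entries) {}

    void insert(int value) {
        _entries.push_back(value);
        // Increase unconditionally, whether or not the entry survives the
        // eviction pass below. Skipping this for immediately evicted entries
        // is the imbalance the commit message describes.
        ++_population;
        while (_entries.size() > _max_entries) {
            evict_oldest();
        }
    }

    std::uint64_t population() const { return _population; }
    std::size_t size() const { return _entries.size(); }
};
```

A cache built with `max_entries` set to 0 evicts every entry on insertion; with the unconditional increment its population stays at 0 instead of underflowing.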