Commit Graph

2200 Commits

Avi Kivity
1fd701e709 test: cql-pytest: skip tests depending on timeuuid monotonicity
timeuuid is not monotonic when now() is called on different connections,
so when running tests that depend on that property, we get failures if
using the Scylla driver (which became standard in 729d0fe).

Skip the tests for now, until we figure out what to do. We probably
can't make now() globally monotonic, and there isn't much to gain
by making it monotonic only per connection, since clients are allowed
to switch connections (and even nodes) at will.

Ref #9300

Closes #9323

[avi: committing my own patch to unblock master]
2021-09-12 19:30:40 +03:00
Nadav Har'El
1d4474d543 test/alternator/run: don't run Scylla if "--aws" option
The test/alternator/run script runs Scylla and then runs pytest against
it. But when passing the "--aws" option, the intention is that these
tests be run against AWS DynamoDB, not a local Scylla, so there is no
point in starting Scylla at all - and this is what this patch does.

This doesn't really add a new feature - "test/alternator/run --aws"
will now be nothing more than "cd test/alternator; pytest --aws".
But it adds the convenience that you can run the same tests on Scylla
and AWS with exactly the same "run" command, just adding the "--aws"
option, and don't need to sometimes use "run" and sometimes "pytest".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210912133239.75463-1-nyh@scylladb.com>
2021-09-12 16:50:38 +03:00
Avi Kivity
c5f52f9d97 schema_tables: don't flush in tests
Flushing schema tables is important for crash recovery (without a flush,
we might have sstables using a new schema before the commitlog entry
noting the schema change has been replayed), but not important for tests
that do not test crash recovery. Avoiding those flushes reduces system,
user, and real time on tests running on a consumer-level SSD.

before:
real	8m51.347s
user	7m5.743s
sys	5m11.185s

after:
real	7m4.249s
user	5m14.085s
sys	2m11.197s

Note real time is higher than user+sys time divided by the number
of hardware threads, indicating that there is still idle time due
to the disk flushing, so more work is needed.

Closes #9319
2021-09-12 11:32:13 +03:00
Raphael S. Carvalho
acba3bd3c4 sstables: give a more descriptive name to compaction_options
The name compaction_options is confusing, as it overlaps in meaning
with compaction_descriptor. It is hard to tell the exact difference
between them without digging into the implementation.

compaction_options is intended to only carry options specific to
a given compaction type, like a mode for scrub, so let's rename
it to compaction_type_options to make it clearer for the
readers.

[avi: adjust for scrub changes]
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210908003934.152054-1-raphaelsc@scylladb.com>
2021-09-12 11:21:33 +03:00
Tomasz Grabiec
83113d8661 Merge "raft: new schema for storing raft snapshots" from Pavel Solodovnikov
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.

That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:

1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb w.r.t. future changes to the data layout and
   to the documentation.

Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.

Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.

Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.

Remove `snapshot_id` from the clustering key since a given server
can have only one snapshot installed at a time.

Note that the `raft::server_address` structure contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use an `ip_addr inet` field instead. We always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.

So, now the snapshots schema looks like this:

    CREATE TABLE raft_snapshots (
        group_id timeuuid,
        server_id uuid,
        snapshot_id uuid,
        idx int,
        term int,
        -- no `config` field here, moved to `raft_config` table
        PRIMARY KEY ((group_id, server_id))
    )

    CREATE TABLE raft_config (
         group_id timeuuid,
         my_server_id uuid,
         server_id uuid,
         disposition text, -- can be either 'CURRENT' or 'PREVIOUS'
         can_vote bool,
         ip_addr inet,
         PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
    );

This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.

Tests: unit(dev)

* manmanson/raft_snapshots_new_schema_v2:
  test: adjust `schema_change_test` to include new `system.raft_config` table
  raft: new schema for storing raft snapshots
  raft: pass server id to `raft_sys_table_storage` instance
2021-09-10 20:41:59 +02:00
Avi Kivity
7a798b44a2 cql3: expr: replace column_value_tuple by a composition of tuple_constructor and column_value
column_value_tuple overlaps both column_value and tuple_constructor
(in different respects) and can be replaced by a combination: a
tuple_constructor of column_value. The replacement is more expressive
(we can have a tuple of column_value and other expression types), though
the code (especially the grammar) does not allow it yet.

So remove column_value_tuple and replace it everywhere with
tuple_constructor. Visitors get the merged behavior of the existing
tuple_constructor and column_value_tuple, which is usually trivial
since tuple_constructor and column_value_tuple came from different
hierarchies (term::raw and relation), so usually one of the types
just calls on_internal_error().

The change results in awkward casts in two areas: WHERE clause
filtering (equal() and related), and clustering key range evaluation
(limits() and related). When equal() is replaced by recursive
evaluate(), the casts will go away (to be replaced by the evaluate()
visitor). Clustering key range extraction will remain limited
to tuples of column_value, so the prepare phase will have to vet
the expressions to ensure the casts don't fail (and use the
filtering path if they will).

Tests: unit (dev)

Closes #9274
2021-09-10 10:43:29 +02:00
Avi Kivity
219fdcd8da Merge 'tools: introduce scylla-sstable' from Botond Dénes
A tool which can be used to examine the content of sstable(s) and
execute various operations on them. The currently supported operations
are:
* dump - dumps the content of the sstable(s), similar to sstabledump;
* dump-index - dumps the content of the sstable index(es), similar to scylla-sstable-index;
* writetime-histogram - generates a histogram of all the timestamps in
  the sstable(s);
* custom - a hackable operation for the expert user (until scripting
  support is implemented);
* validate - validate the content of the sstable(s) with the mutation
  fragment stream validator, same as scrub in validate mode;

The sstables to-be-examined are passed as positional command line
arguments. Sstables will be processed by the selected operation
one-by-one (can be changed with `--merge`). Any number of sstables can
be passed but mind the open file limits. Pass the full path to the data
component of the sstables (*-Data.db). For now it is required that the
sstable is found at a valid data path:

    /path/to/datadir/{keyspace_name}/{table_name}-{table_id}/

The schema to read the sstables is read from a `schema.cql` file. This
should contain the keyspace and table definitions, as well as any UDTs
used.
Filtering the sstable(s) to process only certain partition(s) is supported
via the `--partition` and `--partitions-file` command line flags.
Partition keys are expected to be in the hexdump format used by scylla
(hex representation of the raw buffer).
Operations write their output to stdout, or file(s). The tool logs to
stderr, with a logger called `scylla-sstable-crawler`.

Examples:

    # dump the content of the sstable
    $ scylla-sstable-crawler --dump /path/to/md-123456-big-Data.db

    # dump the content of the two sstable(s) as a unified stream
    $ scylla-sstable-crawler --dump --merge /path/to/md-123456-big-Data.db /path/to/md-123457-big-Data.db

    # generate a joint histogram for the specified partition
    $ scylla-sstable-crawler --writetime-histogram --partition={{myhexpartitionkey}} /path/to/md-123456-big-Data.db

    # validate the specified sstables
    $ scylla-sstable-crawler --validate /path/to/md-123456-big-Data.db /path/to/md-123457-big-Data.db

Future plans:
* JSON output for dump.
* A simple way of generating `schema.cql` for any schema, other than copying it from snapshots, or copying from `cqlsh`. None of these generate a complete output.
* Relax sstable path checks, so sstables can be loaded from any path.
* Add scripting support (Lua), allowing custom operations to be written
  in a scripting language.

Refs: #9241

Closes #9271

* github.com:scylladb/scylla:
  tools: remove scylla-sstable-index
  tools: introduce scylla-sstable
  tools: extract finding selected operation (handler) into function
  tools: add schema_loader
  cql3: query_processor: add parse_statements()
  cql3: statements/create_type: expose create_type()
  cql3: statements/create_keyspace: add get_keyspace_metadata()
2021-09-09 19:24:06 +03:00
Avi Kivity
c1028de22a Merge 'Introduce native reversed format' from Botond Dénes
We define the native reverse format as a reversed mutation fragment
stream that is identical to one that would be emitted by a table with
the same schema but with reversed clustering order. The main difference
to the current format is how range tombstones are handled: instead of
looking at their start or end bound depending on the order, we always
use them as usual and the reversing reader swaps their bounds to
facilitate this. This allows us to treat reversed streams completely
transparently: just pass a reversed schema along with them and all the
reader, compacting and result building code is happily ignorant about
the fact that it is a reversed stream.

This series is the first step towards implementing efficient reverse
reads. It allows us to remove all the special casing we have in various
places for reverse reads and thus treating reverse streams transparently
in all the middle layers. The only layers that have to know about the
actual reversing are mutation sources proper. The plan is that when
reading in reverse we create a reversed schema in the top layer then
pass this down as the schema for the read. There are two layers that
will need to act on this reversed schema:
* The layer sitting on top of the first layer which still can't handle
  reversed streams, this layer will create a reversed reader to handle
  the transition.
* The mutation source proper: which will obtain the underlying schema
  and will emit the data in reverse order.

Once all the mutation sources are able to handle reverse reads, we can
get rid of the reverse reader entirely.

Refs: #1413

Tests: unit(dev)

TODO:
* v2
* more testing

Also on: https://github.com/denesb/scylla.git reverse-reads/v3

Changelog

v3:
* Drop the entire schema transformation mechanism;
* Drop reversing from `schema_builder()`;
* Don't keep any information about whether the schema is reversed or not
  in the schema itself, instead make reversing deterministic w.r.t.
  schema version, such that:
  `s.version() == s.make_reversed().make_reversed().version()`;
* Re-reverse range tombstones in `streaming_mutation_freezer`, so
  `reconcilable_results` sent to the coordinator during read repair
  still use the old reverse format;

v2:
* Add `data_type reversed(data_type)`;
* Add `bound_kind reverse_kind(bound_kind)`;
* Make new API safer to use:
    - `schema::underlying_type()`: return this when unengaged;
    - `schema::make_transformed()`: noop when applying the same
      transformation again;
* Generalize reversed into transformation. Add support to transferring
  to remote nodes and shards by way of making `schema_tables` aware of
  the transformation;
* Use reverse schema everywhere in reverse reader;

Closes #9184

* github.com:scylladb/scylla:
  range_tombstone_accumulator: drop _reversed flag
  test/boost/mutation_test: add test for mutation::consume() monotonicity
  test/boost/flat_mutation_reader_test: more reversed reader tests
  flat_mutation_reader: make_reversing_reader(): implement fast_forward_to(partition_range)
  flat_mutation_reader: make_reversing_reader(): take ownership of the reader
  test/lib/mutation_source_test: add consistent log to all methods
  mutation: introduce reverse()
  mutation_rebuilder: make it standalone
  mutation: make copy constructor compatible with mutation_opt
  treewide: switch to native reversed format for reverse reads
  mutation: consume(): add native reverse order
  mutation: consume(): don't include dummy rows
  query: add slice reversing functions
  partition_slice_builder: add range mutating methods
  partition_slice_builder: add constructor with slice
  query: specific_ranges: add non-const ranges accessor
  range_tombstone: add reverse()
  clustering_bounds_comparator: add reverse_kind()
  schema: introduce make_reversed()
  schema: add a transforming copy constructor
  utils: UUID_gen: introduce negate()
  types: add reversed(data_type)
  docs: design-notes: add reverse-reads.md
2021-09-09 15:50:22 +03:00
Botond Dénes
f02632aeb0 range_tombstone_accumulator: drop _reversed flag 2021-09-09 15:42:15 +03:00
Botond Dénes
f07805c3ef test/boost/mutation_test: add test for mutation::consume() monotonicity
In both forward and reverse modes.
2021-09-09 15:42:15 +03:00
Botond Dénes
3cc882f6a8 test/boost/flat_mutation_reader_test: more reversed reader tests
Check that the reverse reader emits a stream identical to that emitted
by a reader reading in native order from a table with reversed
clustering order.
2021-09-09 15:42:15 +03:00
Botond Dénes
350440b418 flat_mutation_reader: make_reversing_reader(): take ownership of the reader
Makes for much simpler client code.
2021-09-09 15:42:15 +03:00
Botond Dénes
c71a281e6b test/lib/mutation_source_test: add consistent log to all methods
Most test methods log their own name either via testlog.info() or
BOOST_TEST_MESSAGE() so failures can be more easily located. Not all do
however. This commit fixes this and also converts all those using
BOOST_TEST_MESSAGE() for this to testlog.info(), for consistency.
2021-09-09 15:42:15 +03:00
Botond Dénes
74a22a706b mutation_rebuilder: make it standalone
It no longer requires a wrapper object to become usable.
2021-09-09 15:42:15 +03:00
Botond Dénes
502a45ad58 treewide: switch to native reversed format for reverse reads
We define the native reverse format as a reversed mutation fragment
stream that is identical to one that would be emitted by a table with
the same schema but with reversed clustering order. The main difference
to the current format is how range tombstones are handled: instead of
looking at their start or end bound depending on the order, we always
use them as usual and the reversing reader swaps their bounds to
facilitate this. This allows us to treat reversed streams completely
transparently: just pass a reversed schema along with them and all the
reader, compacting and result building code is happily ignorant about
the fact that it is a reversed stream.
2021-09-09 15:42:15 +03:00
Botond Dénes
f200c8104a schema: introduce make_reversed()
`make_reversed()` creates a schema identical to the schema instance it is
called on, with clustering order reversed. To distinguish the reverse
schema from the original one, the node-id part of its version UUID is
bit-flipped. This ensures that reversing a schema twice will result in
the identical schema to the original one (although a different C++
object).

This reversed schema will be used in reversed reads, so intermediate
layers can be ignorant of the fact that the read happens in reverse.
2021-09-09 11:49:05 +03:00
Botond Dénes
65913f4cfa utils: UUID_gen: introduce negate() 2021-09-09 11:49:05 +03:00
Dejan Mircevski
6afdc6004c cql3/modification_statement: Replace empty-range check with null check
The empty-range check causes more bugs than it fixes.  Replace it with
an explicit check for =NULL (see #7852).

Fixes #9311.
Fixes #9290.

Tests: unit (dev), cql-pytest on Cassandra 4.0

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9314
2021-09-09 10:56:13 +03:00
Dejan Mircevski
58a9a24ff0 cql3: Allow indexed query to select static columns
We previously forbade selecting a static column when an index is
used.  But Cassandra allows it, so we should, too -- see #8869.

After removing the static-column check, the existing code gets the
correct result without any further changes (though it may read
multiple rows from the same partition).

Fixes #8869.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9307
2021-09-08 08:22:59 +02:00
Tomasz Grabiec
9a77a03ea1 Merge "Remove most uses of gms::get_gossiper(), gms::get_local_gossiper()" from Avi
In the quest to have explicit dependencies and the ability to run
multiple nodes in one process, remove some uses of get_gossiper() and
get_local_gossiper() and replace them with parameters passed from main()
or its equivalents.

Some uses still remain, mostly in snitch, but this series removes a
majority.

* https://github.com/avikivity/scylla.git gossiper-deglobal-1/v1
  alternator: remove uses of get_local_gossiper()
  storage_service: remove stray get_gossiper(), get_local_gossiper() calls
  migration_manager: remove use of get_gossiper() from passive_announce()
  storage_proxy: start_hints_manager(): don't require caller to provide gossiper
  migration_manager: remove uses of get_local_gossiper()
  storage_proxy: remove uses of get_local_gossiper()
  gossiper: remove get_local_gossiper() from some inline helpers
  gossiper: remove get_gossiper() from stop_gossiping()
  gossiper: remove uses of get_local_gossiper for its rpc server
  api: remove use of get_local_gossiper()
  gossiper: remove calls to global get_gossiper from within the gossiper itself
2021-09-07 20:02:30 +02:00
Avi Kivity
d8f7903f60 migration_manager: remove uses of get_local_gossiper()
Pass gossiper as a constructor parameter instead. cql_test_env
gains a use of get_gossiper() instead, but at least these uses
are concentrated in one place.
2021-09-07 20:08:11 +03:00
Avi Kivity
71081be99c storage_proxy: remove uses of get_local_gossiper()
Pass the gossiper as a constructor parameter instead.
2021-09-07 17:14:09 +03:00
Avi Kivity
aa68927873 gossiper: remove get_local_gossiper() from some inline helpers
Some state accessors called get_local_gossiper(); this is removed
and replaced with a parameter. Some callers (redis, alternators)
now have the gossiper passed as a parameter during initialization
so they can use the adjusted API.
2021-09-07 17:03:37 +03:00
Avi Kivity
9ce1af9fcb gossiper: remove get_gossiper() from stop_gossiping()
Have the callers pass it instead, and they all have a reference
already except for cql_test_env (which will be fixed later).

The checks for initialization it does are likely unnecessary, but
we'll only be able to prove it when get_gossiper() is completely
removed.
2021-09-07 16:20:04 +03:00
Botond Dénes
23a56beccc tools: add schema_loader
A utility which can load a schema from a schema.cql file. The file has
to contain all the "dependencies" of the table: keyspace, UDTs, etc.
This will be used by the scylla-sstable-crawler in the next patch.
2021-09-07 15:47:22 +03:00
Benny Halevy
b7eaa22ce6 abstract_replication_strategy: create_replication_strategy: drop keyspace name parameter
It is not used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210906133840.3307279-1-bhalevy@scylladb.com>
2021-09-06 16:51:21 +03:00
Avi Kivity
dfc135dbd1 Merge "Keep range_tombstone apart from list linkage" from Pavel E
"
There's a landmine buried in range_tombstone's move constructor.
Whoever tries to use it risks grabbing the tombstone out of the
containing list, thus leaking it or invalidating an iterator
pointing at it. There's a safer without_link move constructor
out there, but still.

To keep this place safe it's better to separate range_tombstone
from its linkage into any container. In particular, to keep the
range tombstones in a range_tombstone_list, this series introduces
an entry that keeps the tombstone _and_ the list hook (which is a
boost set hook).

The approach resembles the rows_entry::deletable_row pair.

tests: unit(dev, debug, patch from #9207)
fixes: #9243
"

* 'br-range-tombstone-vs-entry' of https://github.com/xemul/scylla:
  range_tombstone: Drop without-link constructor
  range_tombstone: Drop move_assign()
  range_tombstone: Move linkage into range_tombstone_entry
  range_tombstone_list: Prepare to use range_tombstone_entry
  range_tombstone, code: Add range_tombstone& getters
  range_tombstone_list: Factor out tombstone construction
  range_tombstone_list: Simplify (maybe) pop_front_and_lock()
  range_tombstone_list: De-templatize pop_as<>
  range_tombstone_list: Conceptualize erase_where()
  range_tombstone(_list): Mark some bits noexcept
  mutation: Use range_tombstone_list's iterators
  mutation_partition: Shorten memory usage calculation
  mutation_partition: Remove unused local variable
2021-09-05 17:26:13 +03:00
Dejan Mircevski
1fdaeca7d0 cql3: Reject updates with NULL key values
We were silently ignoring INSERTs with NULL values for primary-key
columns, which Cassandra rejects.  Fix it by rejecting any
modification_statement that would operate on empty partition or
clustering range.

This is the most direct fix, because range and slice are calculated in
one place for all modification statements.  It covers not only NULL
cases, but also impossible restrictions like c>0 AND c<0.
Unfortunately, Cassandra doesn't treat all modification statements
consistently, so this fix cannot fully match its behavior.  We err on
the side of tolerance, accepting some DELETE statements that Cassandra
rejects.  We add a TODO for rejecting such DELETEs later.

Fixes #7852.

Tests: unit (dev), cql-pytest against Cassandra 4.0

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9286
2021-09-05 10:23:28 +03:00
Pavel Emelyanov
d6af441eaa range_tombstone: Move linkage into range_tombstone_entry
Now it's time to remove the boost set's hook from the range_tombstone
and keep it wrapped in another class when the tombstone's location
is the range_tombstone_list.

Also the added previously .tombstone() getters and the _entry alias
can be removed -- all the code can work with the new class.

Two places in the code that made use of without_link{} move-constructor
are patched to get the range_tombstone part from the respective _entry
with the same result.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Pavel Emelyanov
5515f7187d range_tombstone, code: Add range_tombstone& getters
Currently all the code operates on the range_tombstone class,
and many of those places get the range tombstone in question
from the range_tombstone_list. Next patches will make that list
carry (and return) a new object called range_tombstone_entry,
so all the code that expects to see the former one there will
need to be patched to get the range_tombstone from the _entry one.

This patch prepares the ground for that by introducing the

    range_tombstone& tombstone() { return *this; }

getter on the range_tombstone itself and patching all future
users of the _entry to call .tombstone() right now.

Next patch will remove those getters together with adding the new
range_tombstone_entry object thus automatically converting all
the patched places into using the entry in a proper way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Nadav Har'El
b3f4a37a75 test/alternator: verify that nulls are valid inside string and bytes
The tests in this patch verify that null characters are valid characters
inside string and bytes (blob) attributes in Alternator. The tests
verify this for both key attributes and non-key attributes (since those
are serialized differently, it's important to check both cases).

The tests pass on both DynamoDB and Alternator - confirming that we
don't have a bug in this area.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210824163442.186881-1-nyh@scylladb.com>
2021-09-03 08:49:06 +02:00
Nadav Har'El
068c4283b7 test/cql-pytest: add tests for some undocumented cases of string types
This patch adds tests for two undocumented (as far as I can tell) corner
cases of CQL's string types:

1. The types "text" and "varchar" are not just similar - they are in
   fact exactly the same type.

2. All CQL string and blob types ("ascii", "text" or "varchar", "blob")
   allow the null character as a valid character inside them. They are
   *not* C strings that get terminated by the first null.

These tests pass on both Cassandra and Scylla, so did not expose any
bug, but having such tests is useful to understand these (so-far)
undocumented behaviors - so we can later document them.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210824225641.194146-1-nyh@scylladb.com>
2021-09-02 15:45:47 +03:00
Avi Kivity
403645f58c Merge "raft: miscellaneous fixes" from Gleb
* 'raft-misc-v3' of github.com:scylladb/scylla-dev:
  raft: rename snapshot into snapshot_descriptor
  raft: drop snapshot if its application failed
  raft: fix local snapshot detection
  raft: replication_test: store multiple snapshots in a state machine
  raft: do not wait for entry to become stable before replicating it
2021-09-02 11:25:06 +03:00
Liu Lan
a5c54867f8 alternator: Exclusive start key must lie within the segment
...when using Segment/TotalSegment option.

The requirement is not specified in DynamoDB documents, but found
in DynamoDB Local:

{"__type":"com.amazon.coral.validate#ValidationException",
"message":"Exclusive start key must lie within the segment"}

Fixes #9272

Signed-off-by: Liu Lan <liulan_yewu@cmss.chinamobile.com>

Closes #9270
2021-09-01 11:05:45 +03:00
Avi Kivity
8b59e3a0b1 Merge ' cql3: Demand ALLOW FILTERING for unlimited, sliced partitions ' from Dejan Mircevski
Return the pre-6773563d3 behavior of demanding ALLOW FILTERING when a partition slice is requested on a potentially unlimited number of partitions.  Put it behind a flag defaulting to "off" for now.

Fixes #7608; see comments there for justification.

Tests: unit (debug, dev), dtest (cql_additional_test, paging_test)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9126

* github.com:scylladb/scylla:
  cql3: Demand ALLOW FILTERING for unlimited, sliced partitions
  cql3: Track warnings in prepared_statement
  test: Use ALLOW FILTERING more strictly
  cql3: Add statement_restrictions::to_string
2021-08-31 18:05:26 +03:00
Dejan Mircevski
2f28f68e84 cql3: Demand ALLOW FILTERING for unlimited, sliced partitions
When a query requests a partition slice but doesn't limit the number
of partitions, require that it also says ALLOW FILTERING.  Although
do_filter() isn't invoked for such queries, the performance can still
be unexpectedly slow, and we want to signal that to the user by
demanding they explicitly say ALLOW FILTERING.

Because we now reject queries that worked fine before, existing
applications can break.  Therefore, the behavior is controlled by a
flag currently defaulting to off.  We will default to "on" in the next
Scylla version.

Fixes #7608; see comments there for justification.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-08-31 10:45:41 -04:00
Pavel Emelyanov
e26a6c1acc btree, test: Test exception safety and non-leakness of btree::clone_from
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-31 12:23:49 +03:00
Pavel Emelyanov
da38038222 btree, test: Test key copy constructor may throw
It calls the tree_test_key_base copy constructor which
is throwing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-31 12:23:49 +03:00
Gleb Natapov
ce40b01b07 raft: rename snapshot into snapshot_descriptor
The snapshot structure does not contain the snapshot itself but only
refers to it through its id. Rename it to snapshot_descriptor for clarity.
2021-08-29 12:53:03 +03:00
Gleb Natapov
80a392a444 raft: replication_test: store multiple snapshots in a state machine
State machine should be able to store more than one snapshot at a time
(one may be the currently used one and another is transferred from a
leader but not applied yet).
2021-08-29 12:53:03 +03:00
Gleb Natapov
5e1d589872 raft: do not wait for entry to become stable before replicating it
Since io_fiber persists entries before sending out messages, even
non-stable entries will become stable before being observed by other nodes.

This patch also moves generation of append messages into the
get_output() call because without that change we would lose batching,
since each advance of last_idx would generate a new append message.
2021-08-29 12:48:15 +03:00
Pavel Emelyanov
60a7ca62f2 storage_service: Drop .enable_all_features()
This method has nothing to do with storage service and
is only needed to move feature service options from one
method to another. This can be done by the only caller
of it.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210827133954.29535-1-xemul@scylladb.com>
2021-08-29 11:27:05 +03:00
Pavel Solodovnikov
b00443ab87 test: adjust schema_change_test to include new system.raft_config table
Check that the new table uses null sharder.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-08-27 09:30:17 +03:00
Pavel Solodovnikov
8d3c0ee9b6 raft: new schema for storing raft snapshots
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.

That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:

1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb w.r.t. future changes to the data layout and
   to the documentation.

Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.

Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.

Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.

Remove `snapshot_id` from the clustering key since a given server
can have only one snapshot installed at a time.

Note that the `raft::server_address` structure contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use an `ip_addr inet` field instead. We always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.

So, now the snapshots schema looks like this:

    CREATE TABLE raft_snapshots (
        group_id timeuuid,
        server_id uuid,
        snapshot_id uuid,
        idx int,
        term int,
        -- no `config` field here, moved to `raft_config` table
        PRIMARY KEY ((group_id, server_id))
    )

    CREATE TABLE raft_config (
         group_id timeuuid,
         my_server_id uuid,
         server_id uuid,
         disposition text, -- can be either 'CURRENT' or 'PREVIOUS'
         can_vote bool,
         ip_addr inet,
         PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
    );

This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-08-27 09:24:46 +03:00
Pavel Solodovnikov
0a8faee660 raft: pass server id to raft_sys_table_storage instance
Preparations for changing raft snapshots schema.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-08-27 09:24:20 +03:00
Pavel Solodovnikov
c0854a0f62 raft: create system tables only when raft experimental feature is set
Also introduce a tiny function to return raft-enabled db config
for cql testing.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210826091432.279532-1-pa.solodovnikov@scylladb.com>
2021-08-26 12:21:12 +03:00
Avi Kivity
acf8da2bce Merge "flat_mutation_reader: keep timeout in permit" from Benny
"
This series moves the timeout parameter, that is passed to most
f_m_r methods, into the reader_permit.  This eliminates
the need to pass the timeout around, as it's taken
from the permit when needed.

The permit timeout is updated in certain cases
when the permit/reader is paused and retrieved
later on for reuse.

Following are perf_simple_query results showing ~1%
reduction in insns/op and corresponding increase in tps.

$ build/release/test/perf/perf_simple_query -c 1 --operations-per-shard 1000000 --task-quota-ms 10

Before:
102500.38 tps ( 75.1 allocs/op,  12.1 tasks/op,   45620 insns/op)

After:
103957.53 tps ( 75.1 allocs/op,  12.1 tasks/op,   45372 insns/op)

Test: unit(dev)
DTest:
    repair_additional_test.py:RepairAdditionalTest.repair_abort_test (release)
    materialized_views_test.py:TestMaterializedViews.remove_node_during_mv_insert_3_nodes_test (release)
    materialized_views_test.py:InterruptBuildProcess.interrupt_build_process_with_resharding_half_to_max_test (release)
    migration_test.py:TTLWithMigrate.big_table_with_ttls_test (release)
"

* tag 'reader_permit-timeout-v6' of github.com:bhalevy/scylla:
  flat_mutation_reader: get rid of timeout parameter
  reader_concurrency_semaphore: use permit timeout for admission
  reader_concurrency_semaphore: adjust reactivated reader timeout
  multishard_mutation_query: create_reader: validate saved reader permit
  repair: row_level: read_mutation_fragment: set reader timeout
  flat_mutation_reader: maybe_timed_out: use permit timeout
  test: sstable_datafile_test: add sstable_reader_with_timeout
  reader_permit: add timeout member
2021-08-25 17:51:10 +03:00
Avi Kivity
993f824cfd Merge "raft: implement linearisable reads on a follower" from Gleb and Kostja
"
This series implements section 6.4 of the Raft PhD. It allows doing
linearisable reads on a follower, bypassing the raft log entirely. After
this series server::read_barrier can be executed on a follower as well as
on a leader, and after it completes the local user's state machine state
can be accessed directly.
"

* 'raft-read-v9' of github.com:scylladb/scylla-dev:
  raft: test: add read_barrier test to replication_test
  raft: test: add read_barrier tests to fsm_test
  raft: make read_barrier work on a follower as well as on a leader
  raft: add a function to wait for an index to be applied
  raft: (server) add a helper to wait through uncertainty period
  raft: make fsm::current_leader() public
  raft: add hasher for raft::internal::tagged_uint64
  serialize: add serialized for std::monostate
  raft: fix indentation in applier_fiber
2021-08-25 13:11:35 +03:00
Gleb Natapov
3ff6f76cef raft: test: add read_barrier test to replication_test 2021-08-25 08:57:13 +03:00
Gleb Natapov
ad2c2abcb8 raft: test: add read_barrier tests to fsm_test 2021-08-25 08:57:13 +03:00