scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 21:47:10 +00:00

Author	SHA1	Message	Date
Nadav Har'El	3de09042bb	CDC topology change support Merged pull request https://github.com/scylladb/scylla/pull/5485 by Kamil Braun: This series introduces the notion of CDC generations: sets of CDC streams used by the cluster to choose partition keys for CDC log writes. Each CDC generation begins operating at a specific time point, called the generation's timestamp (cdc_streams_timestamp in the code). It continues being used by all nodes in the cluster to generate log writes until superseded by a new generation. Generations are chosen so that CDC log writes are colocated with their corresponding base table writes, i.e. their partition keys (which are CDC stream identifiers picked from the generation operating at time of making the write) fall into the same vnode and shard as the corresponding base table write partition keys. Currently this is probabilistic and not 100% of log writes will be colocated - this will change in future commits, after per-table partitioners are implemented. CDC generations are a global property of the cluster -- they don't depend on any particular table's configuration. Therefore the old "CDC stream description tables", which were specific to each CDC-enabled table, were removed and replaced by a new, global description table inside the system_distributed keyspace. A new generation is introduced and supersedes the previous one whenever we insert new tokens into the token ring, which breaks the colocation property of the previous generation. The new generation is chosen to account for the new tokens and restore colocation. This happens when a new node joins the cluster. The joining node is responsible for creating and informing other nodes about the new CDC generation. It does that by serializing it and inserting into an internal distributed table ("CDC topology description table"). If it fails the insert, it fails the joining process. 
It then announces the generation to other nodes through gossip using the generation's timestamp, which is the partition key of the inserted distributed table entry. Nodes that learn about the new generation through gossip attempt to retrieve it from the distributed table. This might fail - for example, if the node is partitioned away from all replicas that hold this generation's table entry. In that case the node might stop accepting writes, since it knows that it should send log entries to a new generation of streams, but it doesn't know what the generation is. The node will keep trying to retrieve the data in the background until it succeeds or sees that it is no longer necessary (e.g., because yet another generation superseded this one). So we give up some availability to achieve safety. However, this solution is not completely safe (might break consistency properties): if a node learns about a new generation too late (if gossip doesn't reach this node in time), the node might send writes to the wrong (old) generation. In the future we will introduce a transaction-based approach where we will always make sure that all nodes receive the new generation before any of them starts using it (and if it's impossible e.g. due to a network partition, we will fail the bootstrap attempt). In practice, if the admin makes sure that the cluster works correctly before bootstrapping a new node, and a network partition doesn't start in the few seconds window where a new generation is announced, everything will work as it should. After the learning node retrieves the generation, it inserts it into an in-memory data structure called "CDC metadata". This structure is then used when performing writes to the CDC log -- given the timestamp of the written mutation, the data structure will return the CDC generation operating at this time point. 
CDC metadata might reject the query for two reasons: if the timestamp belongs to an earlier generation, which most probably doesn't have the colocation property anymore, or if it is picked too far into the future, where we don't know if the current generation won't be superseded by a different one (so we don't yet know the set of streams that this log write should be sent to). If the client uses server-generated timestamps, the query will never be rejected. Clients can also use client-generated timestamps, but they must make sure that their clocks are not too desynchronized with the database -- otherwise some or all of their writes to CDC-enabled tables will be rejected. In the case of a rolling upgrade, where we restart nodes that were previously running without CDC, we act a bit differently - there is no naturally selected joining node which must propose a new generation. We have to select such a node using other means. For this we use a bully approach: every node compares its host id with host ids of other nodes and if it finds that it has the greatest host id, it becomes responsible for creating the first generation. This change also fixes the way of choosing values of the "time" column of CDC log writes: the timeuuid is chosen in a way which preserves ordering of corresponding base table mutations (the timestamp of this timeuuid is equal to the base table mutation timestamp). Warning: if you were running a previous CDC version (without topology change support), make sure to disable CDC on all tables before performing the upgrade. This will drop the log data -- back it up if needed. TODO in future patchset: expire CDC generations. Currently, each inserted CDC generation will stay in the distributed tables forever (until manually removed by the administrator). When a generation is superseded, it should become "expired", and 24 hours after expiration, it should be removed. 
The distributed tables (cdc_topology_description and cdc_description) both have an "expired" column which can be used for this purpose. Unit tests: dev, debug, release dtests (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/907/	2020-02-04 10:20:29 +02:00
Kamil Braun	b130b76274	test: disable CDC flag by default When CDC flag is on, the node startup procedure takes a few seconds longer (we have to generate CDC streams). This is not necessary in non-CDC tests.	2020-02-03 10:57:31 +01:00
Kamil Braun	5fb5925fb4	test: add cdc::find_timestamp tests	2020-02-03 10:57:31 +01:00
Avi Kivity	541893e69a	Merge "Fix conversion of lua nil to cql null" from Rafael " The fix itself is fairly simple, but looking at the code I found that our code base was not cleanly distinguishing null and empty values and was treating null and missing values differently, but that distinction was dead since a null is represented as a dead cell. " * 'espindola/lua-fix-null-v6' of https://github.com/espindola/scylla: lua: Handle nil returns correctly types: Return bytes_opt from data_value::serialize query-result-set: Assert that we don't have null values types: Fix comparison of empty and null data_values Revert "tests: Handle null and not present values differently" query-result-set: Avoid a copy during construction types: Move operator== for data_value out-of-line	2020-02-02 15:43:24 +02:00
Avi Kivity	ec5b721db7	test: make eventually() more patient We use eventually() in tests to wait for eventually consistent data to become consistent. However, we see spurious failures indicating that we wait too little. Increasing the timeout has a negative side effect in that tests that fail will now take longer to do so. However, this negative side effect is negligible compared to false-positive failures, since they throw away large test efforts and sometimes require a person to investigate the problem, only to conclude it is a false positive. This patch therefore makes eventually() more patient, by a factor of 32. Fixes #4707. Message-Id: <20200130162745.45569-1-avi@scylladb.com>	2020-01-31 14:02:18 +01:00
Eliran Sinvani	971711a546	storage proxy: migrate to per scheduling group statistics This commit builds on top of the introduced per scheduling group statistics template and employs it to achieve per scheduling group statistics in storage_proxy. Some of the statistics also had meaning as global, per shard ones. Those are the ones used to decide whether to throttle the write request. This was handled by creating a global stats struct that will hold those stats and by changing the stat update to also include the global one. One point that complicated it is an already existing aggregation over the per shard stats that now became per scheduling group, per shard stats, converting the aggregation to a two-dimensional aggregation. One thing this commit doesn't handle is validating that an individual statistic didn't "cross a scheduling group boundary"; such validation is possible and can easily be added in the future. There is a subtlety to doing so: if an operation did cross into another scheduling group, two connected statistics can lose balance, for example written bytes and completed write transactions. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2020-01-30 15:01:44 +01:00
Rafael Ávila de Espíndola	bd93a0af52	types: Return bytes_opt from data_value::serialize Since a data_value can contain a null value, returning bytes from serialize() was losing information as it was mapping null to empty. This also introduces a serialize_nonnull that still returns bytes, but results in an internal error if called with a null value. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-01-29 14:04:59 -08:00
Rafael Ávila de Espíndola	9031294ea9	Revert "tests: Handle null and not present values differently" This reverts commit `2ebd1463b2`. The test introduced by that commit was wrong, and in fact depended on a bug in operator== for data_value. A followup patch fixes operator==, so this reverts the broken commit first. The reason it was broken was that it created a live cell with a null data_value. In reality, null values are represented with dead cells. For example, the sstable produced by CREATE TABLE my_table (key int PRIMARY KEY, v1 int, v2 int) with compression = {'sstable_compression': ''}; INSERT INTO my_table (key, v1, v2) VALUES (1, 42, null); Is 00 04 key_length 00 00 00 01 key 7f ff ff ff local_deletion_time 80 00 00 00 00 00 00 00 marked_for_delete_at 24 HAS_ALL_COLUMNS | HAS_TIMESTAMP 09 row_body_size 12 prev_unfiltered_size 00 delta_timestamp 08 USE_ROW_TIMESTAMP_MASK 00 00 00 2a value 0d USE_ROW_TIMESTAMP_MASK | HAS_EMPTY_VALUE_MASK | IS_DELETED_MASK 00 deletion time 01 END_OF_PARTITION Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-01-29 13:24:10 -08:00
Botond Dénes	dfc8b2fc45	treewide: replace reader_resource_tracker with reader_permit The former was never really more than a reader_permit with one additional method. Currently using it doesn't even save one from any includes. Now that readers will be using reader_permit we would have to pass down both to mutation_source. Instead get rid of reader_resource_tracker and just use reader_permit. Instead of making it a last and optional parameter that is easy to ignore, make it a first class parameter, right after schema, to signify that permits are now a prominent part of the reader API. This -- mostly mechanical -- patch essentially refactors mutation_source to ask for the reader_permit instead of reader_resource_tracker and updates all usage sites.	2020-01-28 08:13:16 +02:00
Tomasz Grabiec	36d90e637e	Merge "Relax migration manager dependencies" from Pavel Emelyanov The set makes dependencies between mm and other services cleaner, in particular, after the set: - the query processor no longer needs migration manager (which doesn't need query processor either) - the database no longer needs migration manager, thus the mutual dependency between these two is dropped, only migration manager -> database is left - the migration manager -> storage_service dependency is relaxed, one more patchset will be needed to remove it, thus dropping one more mutual dependency between them, only the storage_service -> migration manager will be left - the migration manager is stopped on drain, but several more services need it on stop, thus causing use after free problems, in particular there's a caught bug when view builder crashes when unregistering from notifier list on stop. Fixed. Tests: unit(dev) Fixes: #5404	2020-01-16 12:12:25 +01:00
Piotr Dulikowski	c383652061	gossip: allow for aborting on sleep This commit makes most sleeps in gossip.cc abortable. It is now possible to quickly shut down a node during startup, most notably during the phase while it waits for gossip to settle.	2020-01-16 12:05:50 +02:00
Rafael Ávila de Espíndola	2ebd1463b2	tests: Handle null and not present values differently Before this patch result_set_assertions was handling both null values and missing values in the same way. This patch changes the handling of missing values so that now checking for a null value is not the same as checking for a value not being present. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200114184116.75546-1-espindola@scylladb.com>	2020-01-16 12:05:50 +02:00
Pavel Emelyanov	5cf365d7e7	database: Explicitly pass migration_manager through init_non_system_keyspace This is the last place where database code needs the migration_manager instance to be alive, so now the mutual dependency between these two is gone, only the migration_manager needs the database, but not the vice-versa. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-01-15 14:29:21 +03:00
Pavel Emelyanov	9e4b41c32a	tests: Switch on migration notifier Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-01-15 14:29:21 +03:00
Pavel Emelyanov	d9edcb3f15	query_processor: Use migration_notifier This patch breaks one (probably harmless but still) dependency loop. The query_processor -> migration_manager -> storage_proxy -> tracing -> query_processor. The first link is not needed, as the query_processor needs the migration_manager purely to (un)subscribe on notifications. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-01-15 14:28:21 +03:00
Pavel Emelyanov	28f1250b8b	view_builder: Use migration notifier The migration manager itself is still needed on start to wait for schema agreement, but there's no longer the need for the life-time reference on it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-01-15 14:28:21 +03:00
Pavel Emelyanov	7cfab1de77	database: Switch on mnotifier from migration_manager Do not call for local migration manager instance to send notifications, call for the local migration notifier, it will always be alive. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-01-15 14:28:21 +03:00
Pavel Emelyanov	f45b23f088	storage_service: Keep migration_notifier The storage service will need this guy to initialize sub-services with. Also it registers itself with notifiers. That said, it's convenient to have the migration notifier on board. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-01-15 14:28:21 +03:00
Pavel Emelyanov	f240d5760c	migration_manager: Split notifier from main class The _listeners list on migration_manager class and the corresponding notify_xxx helpers have nothing to do with its instances, they are just transport for notification delivery. At the same time some services need the migration manager to be alive at their stop time to unregister from it, while the manager itself may need them for its needs. The proposal is to move the migration notifier into a completely separate sharded "service". This service doesn't need anything, so it's started first and stopped last. While it's not effectively a "migration" notifier, we inherited the name from Cassandra and renaming it will "scramble neurons in the old-timers' brains but will make it easier for newcomers" as Avi says. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-01-15 14:28:19 +03:00
Rafael Ávila de Espíndola	dca1bc480f	everywhere: Use serialized(foo) instead of data_value(foo).serialize() This is just a simple cleanup that reduces the size of another patch I am working on and is an independent improvement. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200114051739.370127-1-espindola@scylladb.com>	2020-01-14 12:17:12 +02:00
Rafael Ávila de Espíndola	88b5aadb05	tests: cql_test_env: wait for two futures starting internal services I noticed this while looking at the crashes next is currently experiencing. While I have no idea if this fixes the issue, it does avoid broken future warnings (for no_sharded_instance_exception) in a debug build. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200103201540.65324-1-espindola@scylladb.com>	2020-01-05 12:09:59 +02:00
Pavel Solodovnikov	5a15bed569	cql3: return `result_set` by cref in `cql3::result::result_set` Changes summary: * make `cql3::result_set` movable-only * change signature of `cql3::result::result_set` to return by cref * adjust available call sites to the aforementioned method to accept cref The motivation behind this change is eliminating a dangerous API, which can easily set a trap for developers who don't expect that result_set would be returned by value. There is no point in copying the `result_set` around, so make `cql3::result::result_set` cache `result_set` internally in a `unique_ptr` member variable and return a const reference so as to minimize unnecessary copies here and there. Tests: unit(debug) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20191220115100.21528-1-pa.solodovnikov@scylladb.com>	2019-12-21 16:56:42 +02:00
Nadav Har'El	8157f530f5	merge: CDC: handle schema changes Merged pull request https://github.com/scylladb/scylla/pull/5366 from Calle Wilund: Moves schema creation/alter/drop awareness to use new "before" callbacks from migration manager, and adds/modifies log and streams table as part of the base table modification. Makes schema changes semi-atomic per node. While this does not deal with updates coming in before a schema change has propagated across the cluster, it now falls into the same pit as when this happens without CDC. Added side effect is also that now schemas are transparent across all subsystems, not just cql. Patches: cdc_test: Add small test for altering base schema (add column) cdc: Handle schema changes via migration manager callbacks migration_manager: Invoke "before" callbacks for table operations migration_listener: Add empty base class and "before" callbacks for tables cql_test_env: Include cdc service in cql tests cdc: Add sharded service that does nothing. cdc: Move "options" to separate header to avoid too much header inclusion cdc: Remove some code from header	2019-12-17 23:04:36 +02:00
Konstantin Osipov	1c8736f998	tests: move all test source files to their new locations 1. Move tests to test (using singular seems to be a convention in the rest of the code base) 2. Move boost tests to test/boost, other (non-boost) unit tests to test/unit, tests which are expected to be run manually to test/manual. Update configure.py and test.py with new paths to tests.	2019-12-16 17:47:42 +03:00
Konstantin Osipov	2fca24e267	tests: move a few remaining headers Move sstable_test.hh, test_table.hh and cql_assertions.hh from tests/ to test/lib or test/boost and update dependent .cc files. Move tests/perf_sstable.hh to test/perf/perf_sstable.hh	2019-12-16 17:47:42 +03:00
Konstantin Osipov	b9bf1fbede	tests: move another set of headers to the new test layout Move another small subset of headers to test/ with the same goals: - preserve bisectability - make the revision history traceable after a move Update dependent files.	2019-12-16 17:47:42 +03:00
Konstantin Osipov	8047d24c48	tests: move .hh files and resources to new locations The plan is to move the unstructured content of tests/ directory into the following directories of test/: test/lib - shared header and source files for unit tests test/boost - boost unit tests test/unit - non-boost unit tests test/manual - tests intended to be run manually test/resource - binary test resources and configuration files In order to not break git bisect and preserve the file history, first move most of the header files and resources. Update paths to these files in .cc files, which are not moved.	2019-12-16 17:47:42 +03:00

27 Commits
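
The "CDC metadata" lookup described in commit 3de09042bb (pick the generation operating at a write's timestamp, and reject timestamps that belong to an earlier generation or lie too far in the future) can be sketched roughly as follows. This is a minimal illustration in Python, not Scylla's C++ implementation: the names `CdcMetadataSketch`, `RejectedWrite`, `max_future_skew`, and the explicit `now` argument are all hypothetical, and the exact rejection thresholds used by the real code are not stated in the commit message.

```python
import bisect


class RejectedWrite(Exception):
    """Raised when no safe generation can be chosen for a write timestamp."""


class CdcMetadataSketch:
    """Hypothetical in-memory map from generation timestamp to stream ids."""

    def __init__(self, max_future_skew):
        # Assumed tolerance for timestamps ahead of the local clock.
        self._max_future_skew = max_future_skew
        self._timestamps = []  # generation start timestamps, kept sorted
        self._streams = {}     # generation timestamp -> list of stream ids

    def insert(self, gen_timestamp, streams):
        """Record a generation learned from the distributed table."""
        if gen_timestamp in self._streams:
            return  # already known
        bisect.insort(self._timestamps, gen_timestamp)
        self._streams[gen_timestamp] = list(streams)

    def _operating_at(self, ts):
        """Timestamp of the latest generation that started at or before ts."""
        idx = bisect.bisect_right(self._timestamps, ts) - 1
        return self._timestamps[idx] if idx >= 0 else None

    def get_streams(self, write_ts, now):
        """Return the streams for a write, or raise RejectedWrite."""
        if write_ts > now + self._max_future_skew:
            # A newer generation might still be announced for that time.
            raise RejectedWrite("timestamp too far in the future")
        target = self._operating_at(write_ts)
        if target is None:
            raise RejectedWrite("no generation known for this timestamp")
        current = self._operating_at(now)
        if current is not None and target < current:
            # The older generation has most likely lost colocation.
            raise RejectedWrite("timestamp belongs to an earlier generation")
        return self._streams[target]
```

With generations starting at 100 and 200, a write stamped 150 is accepted while generation 100 is still current, but rejected once the local clock has moved past 200.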