scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-02 04:56:58 +00:00

Author	SHA1	Message	Date
Avi Kivity	cdecb21b78	Update seastar submodule * seastar 65980a9b30...30185fd901 (12): > sstring: resize: NulTerminate when downsizing > reactor: make open_flags::dsync respect --unsafe-bypass-fsync > json/json_elements: Use double quotes around element name > Revert "reactor: make open_flags::dsync respect --unsafe-bypass-fsync" > Merge "smp: reduce allocations in work_item::process" from Avi > task: optimize destruction by making destructor non-virtual > reactor: make open_flags::dsync respect --unsafe-bypass-fsync > Revert "sstring: resize: NulTerminate when downsizing" > sstring: resize: NulTerminate when downsizing > tests: Rename unix domain socket test for consistency > resource: downgrade cgroupsv2 message. > Merge "Simplify the stream/subscription implementation" from Rafael	2020-02-04 10:20:29 +02:00
Nadav Har'El	3de09042bb	CDC topology change support Merged pull request https://github.com/scylladb/scylla/pull/5485 by Kamil Braun: This series introduces the notion of CDC generations: sets of CDC streams used by the cluster to choose partition keys for CDC log writes. Each CDC generation begins operating at a specific time point, called the generation's timestamp (cdc_streams_timestamp in the code). It continues being used by all nodes in the cluster to generate log writes until superseded by a new generation. Generations are chosen so that CDC log writes are colocated with their corresponding base table writes, i.e. their partition keys (which are CDC stream identifiers picked from the generation operating at time of making the write) fall into the same vnode and shard as the corresponding base table write partition keys. Currently this is probabilistic and not 100% of log writes will be colocated - this will change in future commits, after per-table partitioners are implemented. CDC generations are a global property of the cluster -- they don't depend on any particular table's configuration. Therefore the old "CDC stream description tables", which were specific to each CDC-enabled table, were removed and replaced by a new, global description table inside the system_distributed keyspace. A new generation is introduced and supersedes the previous one whenever we insert new tokens into the token ring, which breaks the colocation property of the previous generation. The new generation is chosen to account for the new tokens and restore colocation. This happens when a new node joins the cluster. The joining node is responsible for creating and informing other nodes about the new CDC generation. It does that by serializing it and inserting into an internal distributed table ("CDC topology description table"). If it fails the insert, it fails the joining process. It then announces the generation to other nodes through gossip using the generation's timestamp, which is the partition key of the inserted distributed table entry. Nodes that learn about the new generation through gossip attempt to retrieve it from the distributed table. This might fail - for example, if the node is partitioned away from all replicas that hold this generation's table entry. In that case the node might stop accepting writes, since it knows that it should send log entries to a new generation of streams, but it doesn't know what the generation is. The node will keep trying to retrieve the data in the background until it succeeds or sees that it is no longer necessary (e.g., because yet another generation superseded this one). So we give up some availability to achieve safety. However, this solution is not completely safe (might break consistency properties): if a node learns about a new generation too late (if gossip doesn't reach this node in time), the node might send writes to the wrong (old) generation. In the future we will introduce a transaction-based approach where we will always make sure that all nodes receive the new generation before any of them starts using it (and if it's impossible e.g. due to a network partition, we will fail the bootstrap attempt). In practice, if the admin makes sure that the cluster works correctly before bootstrapping a new node, and a network partition doesn't start in the few seconds window where a new generation is announced, everything will work as it should. After the learning node retrieves the generation, it inserts it into an in-memory data structure called "CDC metadata". This structure is then used when performing writes to the CDC log -- given the timestamp of the written mutation, the data structure will return the CDC generation operating at this time point. CDC metadata might reject the query for two reasons: if the timestamp belongs to an earlier generation, which most probably doesn't have the colocation property anymore, or if it is picked too far away into the future, where we don't know if the current generation won't be superseded by a different one (so we don't yet know the set of streams that this log write should be sent to). If the client uses server-generated timestamps, the query will never be rejected. Clients can also use client-generated timestamps, but they must make sure that their clocks are not too desynchronized with the database -- otherwise some or all of their writes to CDC-enabled tables will be rejected. In the case of rolling upgrade, where we restart nodes that were previously running without CDC, we act a bit differently - there is no naturally selected joining node which must propose a new generation. We have to select such a node using other means. For this we use a bully approach: every node compares its host id with host ids of other nodes and if it finds that it has the greatest host id, it becomes responsible for creating the first generation. This change also fixes the way of choosing values of the "time" column of CDC log writes: the timeuuid is chosen in a way which preserves ordering of corresponding base table mutations (the timestamp of this timeuuid is equal to the base table mutation timestamp). Warning: if you were running a previous CDC version (without topology change support), make sure to disable CDC on all tables before performing the upgrade. This will drop the log data -- backup it if needed. TODO in future patchset: expire CDC generations. Currently, each inserted CDC generation will stay in the distributed tables forever (until manually removed by the administrator). When a generation is superseded, it should become "expired", and 24 hours after expiration, it should be removed. The distributed tables (cdc_topology_description and cdc_description) both have an "expired" column which can be used for this purpose. Unit tests: dev, debug, release dtests (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/907/	2020-02-04 10:20:29 +02:00
Gleb Natapov	2876482373	lwt: account for cases where LWT request were moved to another shard in statistics Now that we bounce lwt requests to the correct shard before calling into storage_proxy the cross shard op accounting does not account for bounced lwt statement. Fix that by increasing corresponding counter when returning a "bounce" reply. Message-Id: <20200203122011.GH26048@scylladb.com>	2020-02-04 10:20:28 +02:00
Nadav Har'El	37f2f6112e	cql3::util::maybe_quote: avoid stack overflow and fix quote doubling Merged patch series from Benny Halevy: The function was reimplemented to solve the following issues. The cutom implementation also improved its performance in close to 19% Using regex_match("[a-z][a-z0-9_]") may cause stack overflow on long input strings as found with the limits_test.py:TestLimits.max_key_length_test dtest. std::regex_replace does not replace in-place so no doubling of quotes was actually done. Add unit test that reproduces the crash without this fix and tests various string patterns for correctness. Note that defining the regex with std::regex::optimize still ended up with stack overflow. Fixes #5671 cql3::util::maybe_quote: avoid stack overflow and fix quote doubling * cql3::util::maybe_quote: further optimize quote doubling	2020-02-04 10:20:28 +02:00
Nadav Har'El	6e91f159fe	LWT: handle bounce_to_shard result for batch statements Merged patch series from Gleb Natapov: Batch statement can also execute LWT and hence need to handle bounce_to_shard result. * transport: handle bounce_to_shard for batch statement * transport: consolidate bounce_to_shard handling between all three verbs that handle it	2020-02-04 10:20:28 +02:00
Takuya ASADA	1446fe930b	dist/redhat: install specified version of scylla-conf on meta package (#5599 ) To install specified version of scylla-conf package, we need to add it on Requires. Fixes #5639	2020-02-04 10:20:28 +02:00
Benny Halevy	f45fabab73	gossiper: do_stop_gossiping: copy live endpoints vector It can be resized asynchronously by mark_dead. Fixes #5701 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20200203091344.229518-1-bhalevy@scylladb.com>	2020-02-04 10:20:28 +02:00
Avi Kivity	501b24cad3	test.py: use command line option in preference to environment variable when calling a test Command line options are printed out, so if a user cuts-and-pastes a command line they will get a run that is more similar to the one that the test executed. Message-Id: <20200202133209.209608-1-avi@scylladb.com>	2020-02-04 10:20:28 +02:00
Gleb Natapov	9c75a25e9f	transport: consolidate bounce_to_shard handling between all three verbs that handle it All three verbs that need to handle bounce_to_shard have almost identical process_() and process__on_shard() functions. Consolidate them into one to reuse the code.	2020-02-03 14:27:50 +02:00
Gleb Natapov	dd793098fa	transport: handle bounce_to_shard for batch statement Batch statement can also execute LWT and hence need to handle bounce_to_shard result. Fixes: #5644	2020-02-03 14:27:30 +02:00
Kamil Braun	4b3754ff94	docs: add documentation about CDC generations	2020-02-03 10:57:31 +01:00
Kamil Braun	b130b76274	test: disable CDC flag by default When CDC flag is on, the node startup procedure takes a few seconds longer (we have to generate CDC streams). This is not necessary in non-CDC tests.	2020-02-03 10:57:31 +01:00
Kamil Braun	0d41e2c1fe	test: add cdc::generate_timeuuid tests	2020-02-03 10:57:31 +01:00
Kamil Braun	5fb5925fb4	test: add cdc::find_timestamp tests	2020-02-03 10:57:31 +01:00
Kamil Braun	7cb6ac33f5	storage_service: check if we know other nodes' tokens when joining ring If we are a seed node (but not the only one) or we set auto_bootstrap=off, it might happen due to misconfiguration or a network partition that we don't know other nodes' tokens at the end of the join_token_ring function, when we go into the NORMAL status, finishing the joining process. CDC however requires that we know other nodes' tokens at this point: we need them to correctly create a new CDC generation. This commit adds a check which prevents the node from starting if that's not the case. If the check fails, the node first tries waiting a bit until it learns about the tokens or timeouts.	2020-02-03 10:57:28 +01:00
Avi Kivity	2816404f57	test.py: documented exit code value Document our chosen exit failure code value and its relationship to git bisect. Message-Id: <20200202134223.210578-1-avi@scylladb.com>	2020-02-03 00:58:58 +02:00
Avi Kivity	541893e69a	Merge "Fix conversion of lua nil to cql null" from Rafael " The fix itself is fairly simple, but looking at the code I found that our code base was not cleanly distinguishing null and empty values and was treating null and missing values differently, but that distinction was dead since a null is represented as a dead cell. " * 'espindola/lua-fix-null-v6' of https://github.com/espindola/scylla: lua: Handle nil returns correctly types: Return bytes_opt from data_value::serialize query-result-set: Assert that we don't have null values types: Fix comparison of empty and null data_values Revert "tests: Handle null and not present values differently" query-result-set: Avoid a copy during construction types: Move operator== for data_value out-of-line	2020-02-02 15:43:24 +02:00
Avi Kivity	c8890eb124	Merge "Simplify usage of stream subscriptions" from Rafael " In a few places, the only use we had for a subscription was calling done(). With this series we now call done() early and store the future<> instead. " * 'espindola/stream-cleanup' of https://github.com/espindola/scylla: sstable_test: Store a future<> instead of a subscription commitlog: Store a future instead of a subscription in db::commitlog::segment_manager::list_descriptors::helper lister: Store a future<> instead of a subscription	2020-02-02 14:49:00 +02:00
Rafael Ávila de Espíndola	5dfb658e77	build: Add two missing dependencies With this change we always rebuild seastar/libseastar_testing.a for the same reason we always rebuild seastar/libseastar.a: We have no idea what its dependencies are, we have to recurse to seastar to find out. The other missing dependency is that we have to rebuild build.ninja when seastar/CMakeLists.txt changes. A change in seastar/CMakeLists.txt can cause seastar.pc to change which can change the command lines used. That is incomplete as change other seastar files can have the same impact, but it is better than nothing. It is not sufficient to put a dependency in the seastar.pc file as that file will be modified when cmake is run and the scylla ninja process doesn't see the CMakeLists.txt to seastar.pc edge. Fixes: #5687 Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200201001126.458992-1-espindola@scylladb.com>	2020-02-01 21:08:26 +02:00
Pavel Emelyanov	4839ca8491	storage_service: Unregister from gossiper notifications ... at all This unregistration doesn't happen currently, but doesn't seem to cause any problems in general, as on stop gossiper is stopped and nothing from it hits the store_service. However (!) if an exception pops up between the storage_service is subscribed on gossiper and the drain_on_shutdown defer action is set up then we _may_ get into the following situation: - main's stuff gets unrolled back - gossiper is not stopped (drain_on_shutdown defer is not set up) - migration manager is stopped (with deferred action in main) - a nitification comes from gossiper -> storage_service::on_change might want to pull schema with the help of local migration manager -> assert(local_is_initialized) strikes Fix this by registering storage_service to gossiper a bit earlier (both are already initialized y that time) and setting up unregister defer right afterwards. Test: unit(dev), manual start-stop Bug: #5628 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20200130190343.25656-1-xemul@scylladb.com>	2020-01-31 14:02:18 +01:00
Avi Kivity	ec5b721db7	test: make eventually() more patient We use eventually() in tests to wait for eventually consistent data to become consistent. However, we see spurious failures indicating that we wait too little. Increasing the timeout has a negative side effect in that tests that fail will now take longer to do so. However, this negative side effect is negligible to false-positive failures, since they throw away large test efforts and sometimes require a person to investigate the problem, only to conclude it is a false positive. This patch therefore makes eventually() more patient, by a factor of 32. Fixes #4707. Message-Id: <20200130162745.45569-1-avi@scylladb.com>	2020-01-31 14:02:18 +01:00
Dejan Mircevski	6661ed7de4	cql3: Drop restrictions::values() method No-one seems to invoke this method. Instead, clients invoke restriction::values (note singular "restriction"). Most subclasses of restrictions also inherit from restriction, so values() still exists in their public interface. Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-01-31 13:05:51 +01:00
Avi Kivity	985e00efa6	Merge "Fix the serialization of negative varint values" from Rafael " Benny pointed out that we could avoid a branch inside a loop is the old serialization code. That got me looking at the logic and I found that it would also produce an unnecessary 0xff prefix for some negative numbers. This patch series fixes the serialization and optimizes it. It now does no extra copies for positives numbers and only one extra copy for negative numbers, which I think is optimal since cpp_int uses sign magnitude and we want the 2 complement representation. " * 'espindola/serialize_varint-improvements-v2' of https://github.com/espindola/scylla: types: Use a fancy iterator to avoid a temporary buffer types: Use export_bits to serialize cpp_int types: Avoid a branch in a loop types: Fix encoding of negative varint types: Replace "num.sign() < 0" with "num < 0"	2020-01-30 20:35:54 +02:00
Rafael Ávila de Espíndola	cc81ba3432	types: Use a fancy iterator to avoid a temporary buffer By using a fancy iterator we can avoid calling export_bits with a temporary buffer before copying the result to the output. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-01-30 10:26:39 -08:00
Rafael Ávila de Espíndola	7e67ce0bdb	types: Use export_bits to serialize cpp_int This avoid a copy when serializing positive numbers. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-01-30 10:26:39 -08:00
Rafael Ávila de Espíndola	27a67f1a2c	types: Avoid a branch in a loop Thanks to Benny for the suggestion. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-01-30 10:26:39 -08:00
Rafael Ávila de Espíndola	c89c90d07f	types: Fix encoding of negative varint We would sometimes produce an unnecessary extra 0xff prefix byte. The new encoding matches what cassandra does. This was both a efficiency and correctness issue, as using varint in a key could produce different tokens. Fixes #5656 Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-01-30 10:25:09 -08:00
Rafael Ávila de Espíndola	ed747122aa	types: Replace "num.sign() < 0" with "num < 0" Surprisingly, this produces better code with cpp_int. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-01-30 10:24:03 -08:00
Rafael Ávila de Espíndola	cc9495d4d3	sstable_test: Store a future<> instead of a subscription The only use we had for the subscription was calling done, may as well call it early and store the future<>. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-01-30 08:31:28 -08:00
Rafael Ávila de Espíndola	da984f1f33	commitlog: Store a future instead of a subscription in db::commitlog::segment_manager::list_descriptors::helper The only use we had for the subscription was calling done, may as well call it early and store the future<>. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-01-30 08:31:28 -08:00
Rafael Ávila de Espíndola	b88f6edee0	lister: Store a future<> instead of a subscription The only use we had for the subscription was calling done, may as well call it early and store the future<>. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-01-30 08:31:28 -08:00
Gleb Natapov	b08679e1d3	db/system_keyspace: use user memory limits for local.paxos table Treat writes to local.paxos as user memory, as the number of writes is dependent on the amount of user data written with LWT. Fixes #5682 Message-Id: <20200130150048.GW26048@scylladb.com>	2020-01-30 17:07:27 +02:00
Piotr Sarna	b783d40aaf	Merge 'Add per scheduling groups statistics' from Eliran This set implements support for per scheduling group statistics in storage proxy and tables view statistics (although tables view per scheduling group stats are not actively applied in this series). Having those statistics per scheduling group can help in finding operations that are performed outside their context, another advantage is that it lays the land for supporting per service level statistics for the workload prioritization enterprise feature. At some point there was a thought to add those stats per role but for now it is not feasible at the moment: 1. The number of roles/user is unbounded so it is dangerous to hold stats (in memory) for all of them. 2. We will need a proper design of how to deal with the hierarchical nature of roles in the stats. Besides these reasons and regardless, it is beneficial to look on resource related stats per scheduling group, looking at resources per user or role will not necessarily give insights since resources are divided per sg and not role, so it can lead to false conclusions if more than one role is attached to the same service level. Tests: unit tests (Dev, Debug) validating the stats with monitor * es/per_sg_stats/v6: storage proxy: migrate to per scheduling group statistics internalize storage proxy statistics metric registration	2020-01-30 15:02:33 +01:00
Eliran Sinvani	971711a546	storage proxy: migrate to per scheduling group statistics This commit builds on top of the introduced per scheduling group statistics template and employs it for achieving a per scheduling group statistics in storage_proxy. Some of the statistics also had meaning as a global - per shard one. Those are the ones for determining if to throttle the write request. This was handled by creating a global stats struct that will hold those stats and by changing the stat update to also include the global one. One point that complicated it is an already existing aggregation over the per shard stats that now became a per scheduling group per shard stats, converting the aggregation to a two-dimensional aggregation. One thing this commit doesn't handle is validating that an individual statistic didn't "cross a scheduling group boundary", such validation is possible but it can easily be added in the future. There is a subtlety to doing so since if the operation did cross to other scheduling group two connected statistics can lose balance for example written bytes and completed write transactions. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2020-01-30 15:01:44 +01:00
Eliran Sinvani	8cfc2aad57	internalize storage proxy statistics metric registration The storage proxy statistics structure did not contain a method for registering the statistics for metric groups, instead, each user had to register some of the metrics by itself. There is no real reason for separating the metrics registration from the statistics data. There is even less justification for doing this only for part of the stats as is the case for those statistics. This commit internalize the metrics registration in the storage_proxy stats structures. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2020-01-30 15:01:40 +01:00
Gleb Natapov	c138dfd33e	lwt: introduce LWT gossiper feature Do not allow lwt operation if LWT is not enabled by entire cluster. Message-Id: <20200130120912.GV26048@scylladb.com>	2020-01-30 15:12:56 +02:00
Benny Halevy	606db0d412	cql3::util::maybe_quote: further optimize quote doubling Avoid string copies when doubling quotes in the string by counting them when scanning the input string and reserving the required space when making the result std::string. This showed a performance improvement of ~1.8% when running the maybe_quote unit test in tight loop (w/ the shorter strings only) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-01-30 14:55:51 +02:00
Rafael Ávila de Espíndola	a16cb00719	configure: Don't use -Wno-error when building seastar This depends on the recent patches to avoid warnings in seastar. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200127210833.200410-1-espindola@scylladb.com>	2020-01-30 14:10:18 +02:00
Avi Kivity	09e2556541	Update seastar submodule * seastar 44cf127ee9...65980a9b30 (2): > io_tester: fix the fix for lack of file closing > cmake: Disable broken gcc warning -Warray-bounds	2020-01-30 14:10:18 +02:00
Avi Kivity	b01f0cab60	utils: add missing include for ssize_t gcc 10 tightened its C++ includes to no longer provide ssize_t, so we must get it from a C header instead. Message-Id: <20200129205912.21139-1-avi@scylladb.com>	2020-01-30 14:10:18 +02:00
Avi Kivity	adb64dc72f	treewide: tighten concepts syntax gcc 10 requires a semicolon after every compound requirement, as per the standard. Add missing semicolons where necessary. Message-Id: <20200129205805.20928-1-avi@scylladb.com>	2020-01-30 14:10:18 +02:00
Rafael Ávila de Espíndola	4b4efcf302	types: Remove collection_type_impl::serialize The rest of the serialize api has been devirtualized some time ago, but this auxiliary function stayed virtual. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200129203916.20460-1-espindola@scylladb.com>	2020-01-30 14:10:18 +02:00
Kamil Braun	bd42b10df1	cdc: rename cdc/cdc.{hh,cc} to cdc/log.{hh,cc} To increase modularity, making it easier to find what is where and maintain. The 'log' module (cdc/log.{hh,cc}) is responsible for updating CDC log tables when base table writes are performed. The 'generation' module (cdc/generation.{hh,cc}) handles stream generation changes in response to topology change events. cdc/metadata.{hh,cc} contains a helper class which holds the currently used generation of streams. It is used by both aforementioned modules: 'log' queries it, while 'generation' updates it.	2020-01-30 11:10:39 +01:00
Kamil Braun	1a56310687	locator: remove get_shard_count and get_ignore_msb_bits from snitch Snitch forms a class hierarchy which get_shard_count and get_ignore_msb_bits ignore (their returned values only depend on the gossiper's state). Besides, these functions just don't belong there. Snitch has nothing to do with shard_count or ignore_msb_bits.	2020-01-30 11:10:08 +01:00
Kamil Braun	e91af78cf5	cdc: update streams description table Inform CDC users about newly generated streams.	2020-01-30 11:10:08 +01:00
Kamil Braun	cbe510d1b8	cdc: use stream generations Change the CDC code to use the global CDC stream generations. The per-base-table CDC description table was removed. The code instead uses cdc::metadata which is updated on gossip events. The per-table description tables were replaced by a global description table to be used by clients when searching for streams.	2020-01-30 11:10:08 +01:00
Kamil Braun	8f4a2ba0b9	storage_service: learn about CDC stream generations. When a node learns that another node joins the cluster (or begins the joining process, i.e. bootstrap), it will read the CDC generation timestamp proposed by that node, use it to retrieve the generation from the distributed generations table, and save it in its local generation queue to be used for writing to the CDC log when its local clock crosses the generation's timestamp. The CDC generation is saved in the queue before tokens are saved in token_metadata. This is important so that when the node becomes a coordinator of a write, it will already have all the necessary information required to generate a corresponding CDC log mutation. After joining, nodes should keep gossiping their proposed stream generation timestamps forever, until they learn about a newer timestamp, in which case they'll start gossiping the new timestamp. There is one case where a node won't gossip such any generation timestamp: if it's upgrading from a non-CDC version. In this situation we make one of the nodes begin the first generation.	2020-01-30 11:10:08 +01:00
Kamil Braun	834c2ca997	cdc: add cdc::metadata class The class stores a queue of CDC generations to be used for choosing streams when writing to the CDC log. This data structure will be updated on some gossip events (when a new node joins the cluster and proposes a new generation of CDC streams).	2020-01-30 11:10:08 +01:00
Kamil Braun	86af2a63ec	clocks: add printing functions For debugging and logging.	2020-01-30 11:10:08 +01:00
Kamil Braun	34e4ce275d	storage_service: restore CDC streams timestamp when replacing a node When a node is replacing another node it will keep gossiping its CDC streams generation timestamp.	2020-01-30 11:10:08 +01:00

1 2 3 4 5 ...

20911 Commits