Commit Graph

201 Commits

Author SHA1 Message Date
Avi Kivity
3d345609d8 config: disable "mc" format sstables for new data
"md" format was introduced in 4.3, in 3530e80ce1, two years ago.
Disable the option to create new sstables with the "mc" format.

Closes #11265
2022-11-08 08:36:27 +02:00
Benny Halevy
40cd685371 storage_service: get_ignore_dead_nodes_for_replace: use tm.parse_host_id_and_endpoint
Allow specifying the dead node to ignore either as host_id
or ip address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-28 07:38:13 +03:00
Avi Kivity
20bad62562 Merge 'Detect and record large collections' from Benny Halevy
This series adds support for detecting collections that have too many items
and recording them in `system.large_cells`.

A configuration variable was added to db/config: `compaction_collection_items_count_warning_threshold` set by default to 10000.
Collections that have more items than this threshold will be warned about and will be recorded as a large cell in the `system.large_cells` table.  Documentation has been updated respectively.

A new column was added to system.large_cells: `collection_items`.
Similar to the `rows` column in system.large_partition, `collection_items` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't.  Note that the collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it cross the said threshold.

Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions is a smaller change overall, hence it was preferred.

Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it.
The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled.

In addition, unit tests were added in sstable_3_x_test for testing large cells detection by cell size, and large_collection detection by the number of items.

Closes #11449

Closes #11674

* github.com:scylladb/scylladb:
  sstables: mx/writer: optimize large data stats members order
  sstables: mx/writer: keep large data stats entry as members
  db: large_data_handler: dynamically update config thresholds
  utils/updateable_value: add transforming_value_updater
  db/large_data_handler: cql_table_large_data_handler: record large_collections
  db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler
  db/large_data_handler: cql_table_large_data_handler: move ctor out of line
  docs: large-rows-large-cells-tables: fix typos
  db/system_keyspace: add collection_elements column to system.large_cells
  gms/feature_service: add large_collection_detection cluster feature
  test: sstable_3_x_test: add test_sstable_too_many_collection_elements
  test: lib: simple_schema: add support for optional collection column
  test: lib: simple_schema: build schema in ctor body
  test: lib: simple_schema: cql: define s1 as static only if built this way
  db/large_data_handler: maybe_record_large_cells: consider collection_elements
  db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries
  sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells
  sstables: mx/writer: add large_data_type::elements_in_collection
  db/large_data_handler: get the collection_elements_count_threshold
  db/config: add compaction_collection_elements_count_warning_threshold
  test: sstable_3_x_test: add test_sstable_write_large_cell
  test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler
  test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells
  test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T
  test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random
2022-10-06 18:28:21 +03:00
Benny Halevy
2c4ff71d2b db: large_data_handler: dynamically update config thresholds
make the various large data thresholds live-updateable
and construct the observers and updaters in
cql_table_large_data_handler to dynamically update
the base large_data_handler class threshold members.

Fixes #11685

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-05 10:53:40 +03:00
Avi Kivity
37c6b46d26 dirty_memory_manager: re-term "virtual dirty" to "unspooled dirty"
The "virtual dirty" term is not very informative. "Virtual" means
"not real", but it doesn't say in which way it isn't real.

In this case, virtual dirty refers to real dirty memory, minus
the portion of memtables that has been written to disk (but not
yet sealed - in that case it would not be dirty in the first
place).

I chose to call "the portion of memtables that has been written
to disk" as "spooled memory". At least the unique term will cause
people to look it up and may be easier to remember. From that
we have "unspooled memory".

I plan to further change the accounting to account for spooled memory
rather than unspooled, as that is a more natural term, but that is left
for later.

The documentation, config item, and metrics are adjusted. The config
item is practically unused so it isn't worth keeping compatibility here.
2022-10-04 14:03:59 +03:00
Benny Halevy
167ec84eeb db/config: add compaction_collection_elements_count_warning_threshold
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:31:10 +03:00
TarasBor
1f4a93da78 Show warn message if tombstone_warn_threshold reached on querier.
When querier read page with tombstones more than `tombstone_warn_threshold` limit - warning message appeared in logs.
If `tombstone_warn_threshold:0` feature disabled.

Refs scylladb#11410
2022-09-22 16:42:31 +03:00
Michał Chojnowski
cdb3e71045 sstables: add a flag for disabling long-term index caching
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spacially) sparse.
We want to have a way to disable it for the affected workloads.

There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.

This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.

Consequences of this choice:

- The per-SSTable partition_index_cache is unused. Every index_reader has
  its own, and they die together. Independent reads can no longer reuse the
  work of other reads which hit the same index pages. This is not crucial,
  since partition accesses have no (natural) spatial locality. Note that
  the original reason for partition_index_cache -- the ability to share
  reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
  (uncached) input stream from the index file, and every
  bsearch_clustered_cursor has its own cached_file, which dies together with
  the cursor. Note that the cursor still can perform its binary search with
  caching. However, it won't be able to reuse the file pages read by
  index_reader. In particular, if the promoted index is small, and fits inside
  the same file page as its index_entry, that page will be re-read.
  It can also happen that index_reader will read the same index file page
  multiple times. When the summary is so dense that multiple index pages fit in
  one index file page, advancing the upper bound, which reads the next index
  page, will read the same index file page. Since summary:disk ratio is 1:2000,
  this is expected to happen for partitions with size greater than 2000
  partition keys.

Fixes #11202
2022-09-15 17:16:26 +03:00
Raphael S. Carvalho
0a8afe18ca cql: Reject create and alter table with DateTieredCompactionStrategy
It's been ~1 year (2bf47c902e) since we set restrict_dtcs
config option to WARN, meaning users have been warned about the
deprecation process of DTCS.

Let's set the config to TRUE, meaning that create and alter statements
specifying DTCS will be rejected at the CQL level.

Existing tables will still be supported. But the next step will
be about throwing DTCS code into the shadow realm, and after that,
Scylla will automatically fallback to STCS (or ICS) for users which
ignored the deprecation process.

Refs #8914.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11458
2022-09-15 11:46:18 +03:00
Nadav Har'El
8ece63c433 Merge 'Safemode - Introduce TimeWindowCompactionStrategy Guardrails'
This series introduces two configurable options when working with TWCS tables:

- `restrict_twcs_default_ttl` - a LiveUpdate-able tri_mode_restriction which defaults to WARN and will notify the user whenever a TWCS table is created without a `default_time_to_live` setting
- `twcs_max_window_count` - Which forbids the user from creating TWCS tables whose window count (buckets) are past a certain threshold. We default to 50, which should be enough for most use cases, and a setting of 0 effectively disables the check.

Refs: #6923
Fixes: #9029

Closes #11445

* github.com:scylladb/scylladb:
  tests: cql_query_test: add mixed tests for verifying TWCS guard rails
  tests: cql_query_test: add test for TWCS window size
  tests: cql_query_test: add test for TWCS tables with no TTL defined
  cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables
  cql: add max window restriction for TimeWindowCompactionStrategy
  time_window_compaction_strategy: reject invalid window_sizes
  cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr
2022-09-12 23:55:51 +03:00
Botond Dénes
5374f0edbf Merge 'Task manager' from Aleksandra Martyniuk
Task manager for observing and managing long-running, asynchronous tasks in Scylla
with the interface for the user. It will allow listing of tasks, getting detailed
task status and progression, waiting for their completion, and aborting them.
The task manager will be configured with a “task ttl” that determines how long
the task status is kept in memory after the task completes.

At first it will support repair and compaction tasks, and possibly more in the future.

Currently:
Sharded `task_manager` is started in `main.cc` where it is further passed
to `http_context` for the purpose of user interface.

Task manager's tasks are implemented in two two layers: the abstract
and the implementation one. The latter is a pure virtual class which needs
to be overriden by each module. Abstract layer provides the methods that
are shared by all modules and the access to module-specific methods.

Each module can access task manager, create and manage its tasks through
`task_manager::module` object. This way data specific to a module can be
separated from the other modules.

User can access task manager rest api interface to track asynchronous tasks.
The available options consist of:
- getting a list of modules
- getting a list of basic stats of all tasks in the requested module
- getting the detailed status of the requested task
- aborting the requested task
- waiting for the requested task to finish

To enable testing of the provided api, test specific task implementation and module
are provided. Their lifetime can be simulated with the standalone test api.
These components are compiled and the tests are run in all but release build modes.

Fixes: #9809

Closes #11216

* github.com:scylladb/scylladb:
  test: task manager api test
  task_manager: test api layer implementation
  task_manager: add test specific classes
  task_manager: test api layer
  task_manager: api layer implementation
  task_manager: api layer
  task_manager: keep task_manager reference in http_context
  start sharded task manager
  task_manager: create task manager object
2022-09-12 09:26:46 +03:00
Felipe Mendes
7fec4fcaa6 cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables
TimeWindowCompactionStrategy (TWCS) tables are known for being used explicitly for time-series workloads. In particular, most of the time users should specify a default_time_to_live during table creation to ensure data is expired such as in a sliding window. Failure to do so may create unbounded windows - which - depending on the compaction window chosen, may introduce severe latency and operational problems, due to unbounded window growth.

However, there may be some use cases which explicitly ingest data by using the `USING TTL` keyword, which effectively has the same effect. Therefore, we can not simply forbid table creations without a default_time_to_live explicitly set to any value other than 0.

The new restrict_twcs_without_default_ttl option has three values: "true", "false", and "warn":

We default to "warn", which will notify the user of the consequences when creating a TWCS table without a default_time_to_live value set. However, users are encouraged to switch it to "true", as - ideally - a default_time_to_live value should always be expected to prevent applications failing to ingest data against the database ommitting the `USING TTL` keyword.
2022-09-11 16:50:42 -03:00
Felipe Mendes
a3356e866b cql: add max window restriction for TimeWindowCompactionStrategy
The number of potential compaction windows (or buckets) is defined by the default_time_to_live / sstable_window_size ratio. Every now and then we end up in a situation on where users of TWCS end up underestimating their window buckets when using TWCS. Unfortunately, scenarios on which one employs a default_time_to_live setting of 1 year but a window size of 30 minutes are not rare enough.

Such configuration is known to only make harm to a workload: As more and more windows are created, the number of SSTables will grow in the same pace, and the situation will only get worse as the number of shards increase.

This commit introduces the twcs_max_window_count option, which defaults to 50, and will forbid the Creation or Alter of tables which get past this threshold. A value of 0 will explicitly skip this check.

Note: this option does not forbid the creation of tables with a default_time_to_live=0 as - even though not recommended - it is perfectly possible for a TWCS table with default TTL=0 to have a bound window, provided any ingestion statements make use of 'USING TTL' within the CQL statement, in addition to it.
2022-09-11 16:50:22 -03:00
Aleksandra Martyniuk
2439e55974 task_manager: create task manager object
Implementation of a task manager that allows tracking
and managing asynchronous tasks.

The tasks are represented by task_manager::task class providing
members common to all types of tasks. The methods that differ
among tasks of different module can be overriden in a class
inheriting from task_manager::task::impl class. Each task stores
its status containing parameters like id, sequence number, begin
and end time, state etc. After the task finishes, it is kept
in memory for configurable time or until it is unregistered.
Tasks need to be created with make_task method.

Each module is represented by task_manager::module type and should
have an access to task manager through task_manager::module methods.
That allows to easily separate and collectively manage data
belonging to each module.
2022-09-09 14:29:28 +02:00
Mikołaj Grzebieluch
5b1421cc33 db: config: add BROADCAST_TABLES feature flag
Add experimental flag 'broadcast-tables' for enabling BROADCAST_TABLES feature.
This feature requires raft group0, thus enabling it without RAFT will cause an error.
2022-09-05 11:11:08 +02:00
Avi Kivity
0dbcd13a0f config: change logging::settings constructor call to use designated initializer
Safer wrt reordering, and more readable too.

Closes #11382
2022-08-26 06:14:01 +03:00
Botond Dénes
33f0447ba0 db/config: add config item for query tombstone limit
This will be the value used to break pages, after processing the
specified amount of tombstones. The page will be cut even if empty.
We could maybe use the already existing tombstone_{warn,fail}_threshold
instead and use them as a soft/hard limit pair, like we did with page
sizes.
2022-08-09 10:00:40 +03:00
Benny Halevy
edd308c705 config: use ordered map for experimental features
So that the help string will be sorted lexicographically.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11178
2022-08-01 17:40:10 +03:00
Avi Kivity
1f21c1ecc8 Merge "Add IO throttling to streaming class" from Pavel E
"
Same thing was done for compaction class some time ago, now
it's time for streaming to keep repair-generated IO in bounds.
This set mostly resembles the one for compaction IO class with
the exception that boot-time reshard/reshape currently runs in
streaming class, but that's nod great if the class is throttled,
so the set also moves boot-time IO into default IO class.
"

* 'br-streaming-class-throttling-2' of https://github.com/xemul/scylla:
  distributed_loader: Populate keyspaces in default class
  streaming: Maintain class bandwidth
  streaming: Pass db::config& to manager constructor
  config: Add stream_io_throughput_mb_per_sec option
  sstables: Keep priority class on sstable_directory
2022-07-19 17:10:25 +03:00
Pavel Emelyanov
07460761fb Merge "Make compaction_static_shares and memtable_flush_static_shares live updateable" from Igor Ribeiro Barbosa Duarte (3):
Currently, after updating the static shares it's necessary
to restart the cluster. This patch series makes
compaction_static_shares and memtable_flush_static_shares
live updateable so that this restart isn't necessary anymore.

dtests: https://github.com/igorribeiroduarte/scylla-dtest/tree/test_liveupdate_compaction_static_shares
ci: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1412/

* https://github.com/igorribeiroduarte/scylla/tree/make_compaction_static_shares_live_updateable:
  memtable_flush: Make memtable_flush_static_shares liveupdateable
  compaction: Make compaction_static_shares liveupdateable
  backlog_controller: Unify backlog_controller constructors
2022-07-19 16:55:55 +03:00
Igor Ribeiro Barbosa Duarte
3b19bcf1a1 memtable_flush: Make memtable_flush_static_shares liveupdateable
This patch makes memtable_flush_static_shares liveupdateable
to avoid having to restart the cluster after updating
this config.

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-07-19 10:10:46 -03:00
Igor Ribeiro Barbosa Duarte
8dd0f4672d compaction: Make compaction_static_shares liveupdateable
This patch makes compaction_static_shares liveupdateable
to avoid having to restart the cluster after updating
this config.

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-07-19 10:10:46 -03:00
Pavel Emelyanov
85d32485d9 config: Mark compaction_throughput_mb_per_sec option as Used
Otherwise it's not shown in the --help output.
Should've been the part of 868c3be0

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220716085221.26634-1-xemul@scylladb.com>
2022-07-19 13:18:17 +03:00
Pavel Emelyanov
7d0110cd31 config: Add stream_io_throughput_mb_per_sec option
It's going to control the bandwidth for the streaming prio class.
For now it's jsut added but does't work for real

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-19 12:14:41 +03:00
Nadav Har'El
cc69177dcc config: fix printing of experimental feature list
Recently we noticed a regression where with certain versions of the fmt
library,

   SELECT value FROM system.config WHERE name = 'experimental_features'

returns string numbers, like "5", instead of feature names like "raft".

It turns out that the fmt library keep changing their overload resolution
order when there are several ways to print something. For enum_option<T> we
happen to have to conflicting ways to print it:
  1. We have an explicit operator<<.
  2. We have an *implicit* convertor to the type held by T.

We were hoping that the operator<< always wins. But in fmt 8.1, there is
special logic that if the type is convertable to an int, this is used
before operator<<()! For experimental_features_t, the type held in it was
an old-style enum, so it is indeed convertible to int.

The solution I used in this patch is to replace the old-style enum
in experimental_features_t by the newer and more recommended "enum class",
which does not have an implicit conversion to int.

I could have fixed it in other ways, but it wouldn't have been much
prettier. For example, dropping the implicit convertor would require
us to change a bunch of switch() statements over enum_option (and
not just experimental_features_t, but other types of enum_option).

Going forward, all uses of enum_option should use "enum class", not
"enum". tri_mode_restriction_t was already using an enum class, and
now so does experimental_features_t. I changed the examples in the
comments to also use "enum class" instead of enum.

This patch also adds to the existing experimental_features test a
check that the feature names are words that are not numbers.

Fixes #11003.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11004
2022-07-11 09:17:30 +02:00
Avi Kivity
3b20407f25 Merge 'db: Avoid memtable flush latency on schema merge' from Tomasz Grabiec
Currently, applying schema mutations involves flushing all schema
tables so that on restart commit log replay is performed on top of
latest schema (for correctness). The downside is that schema merge is
very sensitive to fdatasync latency. Flushing a single memtable
involves many syncs, and we flush several of them. It was observed to
take as long as 30 seconds on GCE disks under some conditions.

This patch changes the schema merge to rely on a separate commit log
to replay the mutations on restart. This way it doesn't have to wait
for memtables to be flushed. It has to wait for the commitlog to be
synced, but this cost is well amortized.

We put the mutations into a separate commit log so that schema can be
recovered before replaying user mutations. This is necessary because
regular writes have a dependency on schema version, and replaying on
top of latest schema satisfies all dependencies. Without this, we
could get loss of writes if we replay a write which depends on the
latest schema on top of old schema.

Also, if we have a separate commit log for schema we can delay schema
parsing for after the replay and avoid complexity of recognizing
schema transactions in the log and invoking the schema merge logic.

I reproduced bad behavior locally on my machine with a tired (high latency)
SSD disk, load driver remote. Under high load, I saw table alter (server-side part) taking
up to 10 seconds before. After the patch, it takes up to 200 ms (50:1 improvement).
Without load, it is 300ms vs 50ms.

Fixes #8272
Fixes #8309
Fixes #1459

Closes #10333

* github.com:scylladb/scylla:
  config: Introduce force_schema_commit_log option
  config: Introduce unsafe_ignore_truncation_record
  db: Avoid memtable flush latency on schema merge
  db: Allow splitting initiatlization of system tables
  db: Flush system.scylla_local on change
  migration_manager: Do not drop system.IndexInfo on keyspace drop
  Introduce SCHEMA_COMMITLOG cluster feature
  frozen_mutation: Introduce freeze/unfreeze helpers for vectors of mutations
  db/commitlog: Improve error messages in case of unknown column mapping
  db/commitlog: Fix error format string to print the version
  db: Introduce multi-table atomic apply()
2022-07-07 16:03:50 +03:00
Tomasz Grabiec
6622e3369a config: Introduce force_schema_commit_log option 2022-07-06 22:08:56 +02:00
Tomasz Grabiec
b8d20335a4 config: Introduce unsafe_ignore_truncation_record
The node now refuses to boot if schema tables were truncated.
This adds a config option to ignore truncation records as a
workaround if user truncated them manually.
2022-07-06 22:08:56 +02:00
Pavel Emelyanov
868c3be01f config: Tune the config option
The option is used, but is not implemented. If attaching implementation
to it right a once the compaction will slow down to 16MB/s on all nodes.
Make it zero (unbound) by default and mard live-updateable while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-30 09:55:52 +03:00
Pavel Emelyanov
3a753068be Merge "Make permissions cache live updateable and add an API for resetting authorization cache" from Igor Ribeiro Barbosa Duarte
Currently, for users who have permissions_cache configs set to very high
values (and thus can't wait for the configured times to pass) having to restart
the service every time they make a change related to permissions or
prepared_statements cache (e.g. Adding a user and changing their permissions)
can become pretty annoying.
This patch series make permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries live updateable so that restarting the
service is not necessary anymore for these cases.
It also adds an API for flushing the cache to make it easier for users who
don't want to modify their permissions_cache config.

branch: https://github.com/igorribeiroduarte/scylla/tree/make_permissions_cache_live_updateable
CI: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1005/
dtests: https://github.com/igorribeiroduarte/scylla-dtest/tree/test_permissions_cache

* https://github.com/igorribeiroduarte/scylla/make_permissions_cache_live_updateable:
  loading_cache_test: Test loading_cache::reset and loading_cache::update_config
  api: Add API for resetting authorization cache
  authorization_cache: Make permissions cache and authorized prepared statements cache live updateable
  auth_prep_statements_cache: Make aut_prep_statements_cache accept a config struct
  utils/loading_cache.hh: Add update_config method
  utils/loading_cache.hh: Rename permissions_cache_config to loading_cache_config and move it to loading_cache.hh
  utils/loading_cache.hh: Add reset method
2022-06-29 11:14:13 +03:00
Igor Ribeiro Barbosa Duarte
b9051c79bc authorization_cache: Make permissions cache and authorized prepared statements cache live updateable
Currently, for users who have permissions_cache configs set to very high
values (and thus can't wait for the configured times to pass) having to restart
the service every time they make a change related to permissions or
prepared_statements cache(e.g.: Adding a user) can become pretty annoying.
This patch make permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries live updateable so that restarting the
service is not necessary anymore for these cases.

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-06-28 19:58:06 -03:00
Piotr Dulikowski
761a037afb config: add add_per_partition_rate_limit_extension function for testing
...and use it in cql_test_env to enable the per_partition_rate_limit
extension for all tests that use it.
2022-06-22 20:16:49 +02:00
Benny Halevy
6677028212 sstables: mx/writer: auto-scale promoted index
Add column_index_auto_scale_threshold_in_kb to the configuration
(defaults to 10MB).

When the promoted index (serialized) size gets to this
threshold, it's halved by merging each two adjacent blocks
into one and doubling the desired_block_size.

Fixes #4217

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-24 13:32:35 +03:00
Michał Radwański
29e09a3292 db/config: command line arguments logger_stdout_timestamps and logger_ostream_type are no longer ignored
Closes #10452
2022-05-04 14:40:52 +03:00
Calle Wilund
7dd7760e8d commitlog: Make flush threshold a config parameter 2022-04-11 16:34:00 +00:00
Calle Wilund
d478896d46 commitlog: kill non-recycled segment management
It has been default for a while now. Makes no sense to not do it.
Even hints can use it (even if it makes no difference there)
2022-04-11 16:34:00 +00:00
Piotr Sarna
3272b4826f db: add keyspace-storage-options experimental feature
Specifying non-standard keyspace options is experimental, so it's
going to be protected by a configuration flag.
2022-04-08 09:17:01 +02:00
Nadav Har'El
49a8164fb7 alternator: add configurable scan period to TTL expiration
Before this patch, the experimental TTL (expiration time) feature in
Alternator scans tables for expiration in a tight loop - starting the
next scan one second after the previous one completed.

In this patch we introduce a new configuration option,
alternator_ttl_period_in_seconds, which determines how frequently
to start the scan. The default is 24 hours - meaning that the next
scan is started 24 hours after the previous one started.

The tests (test/alternator/run) change this configuration back to one
second, so that expiration tests finish as quickly as possible.

Please note that the scan is *not* slowed down to fill this 24 hours -
if it finishes in one hour, it will then sleep for 23 hours. Additional
work would be needed to slow down the scan to not finish too quickly.
One idea not yet implemented is to move the expiration service from
the "maintenance" scheduling group which it uses today to a new
scheduling group, and modifying the number of shares that this group
gets.

Another thing worth noting about the configurable period (which defaults
to 24 hours) is that when TTL is enabled on an Alternator table, it can
take that amount of time until its scan starts and items start expiring
from it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-02-25 07:26:11 +02:00
Michael Livshin
3bf1e137fc config: make the ME sstable format default
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
0b1447c702 add "sstable_format" config
Initialize it to "md" until ME format support is
complete (i.e. storing originating host id in sstable stats metadata
is implemented), so at present there is no observable change by
default.

Also declare "enable_sstables_md_format" unused -- the idea, going
forward, being that only "sstable_format" controls the written sstable
file format and that no more per-format enablement config options
shall be added.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Avi Kivity
13cf66d3ef Revert "schema_registry: Increase grace period for schema version cache"
This reverts commit 23da2b5879. It causes
the node to quickly run out of memory when many schema changes are made
within a small time window.

Fixes #10071.
2022-02-13 19:38:24 +02:00
Nadav Har'El
fef7934a2d config: fix some types in system.config virtual table
The system.config virtual tables prints each configuration variable of
type T based on the JSON printer specified in the config_type_for<T>
in db/config.cc.

For two variable types - experimental_features and tri_mode_restriction,
the specified converter was wrong: We used value_to_json<string> or
value_to_json<vector<string>> on something which was *not* a string.
Unfortunately, value_to_json silently casted the given objects into
strings, and the result was garbage: For example as noted in #10047,
for experimental_features instead of printing a list of features *names*,
e.g., "raft", we got a bizarre list of one-byte strings with each feature's
number (which isn't documented or even guaranteed to not change) as well
as carriage-return characters (!?).

So solution is a new printable_to_json<T> which works on a type T that
can be printed with operator<< - as in fact the above two types can -
and the type is converted into a string or vector of strings using this
operator<<, not a cast.

Also added a cql-pytest test for reading system.config and in particular
options of the above two types - checking that they contain sensible
strings and not "garbage" like before this patch.

Fixes #10047.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220209090421.298849-1-nyh@scylladb.com>
2022-02-10 09:10:24 +03:00
Tomasz Grabiec
23da2b5879 schema_registry: Increase grace period for schema version cache
If version is absent in cache, it will be fetched from the
coordinator. This is not expensive, but if the version is not known,
it must be also "synced". It means that the node will do a full schema
pull from the coordinator. This pull is expensive and can take seconds.

If the coordinator we pull from is at an old version, the pull will do
nothing and current node will soon forget the old version, initiating
another pull.

If some nodes stay at an old version for a long time for some reason,
this will make new coordinators initiate pulls frequently.

Increase the expiration period to 15 minutes to reduce the impact in
such scenarios.

Fixes #10042.

Message-Id: <20220207122317.674241-1-tgrabiec@scylladb.com>
2022-02-09 09:27:07 +02:00
Michał Sala
b439d6e710 db: config: add a flag to disable new parallelized aggregation algorithm
Just in case the new algorithm turns out to be buggy, add a flag to
fall-back to the old algorithm.
2022-02-01 21:26:25 +01:00
Pavel Emelyanov
a026b4ef49 config: Add option to disable config updates via CQL
The system.config table allows changing config parameters, but this
change doesn't survive restarts and is considered to be dangerous
(sometimes). Add an option to disable the table updates. The option
is LiveUpdate and can be set to false via CQL too (once).

fixes #9976

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220201121114.32503-1-xemul@scylladb.com>
2022-02-01 14:30:47 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Kamil Braun
e98711cfcb db: config: add a flag to disable new reversed reads algorithm
Just in case the new algorithm turns out to be buggy, or give a
performance regression, add a flag to fall-back to the old algorithm for
use in the field.
2022-01-12 18:59:19 +01:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
Avi Kivity
9e74556413 Merge 'Support reverse reads in the row cache natively' from Tomasz Grabiec
This change makes row cache support reverse reads natively so that reversing wrappers are not needed when reading from cache and thus the read can be executed efficiently, with similar cost as the forward-order read.

The database is serving reverse reads from cache by default after this. Before, it was bypassing cache by default after 703aed3277.

Refs: #1413

Tests:

  - unit [dev]
  - manual query with build/dev/scylla and cache tracing on

Closes #9454

* github.com:scylladb/scylla:
  tests: row_cache: Extend test_concurrent_reads_and_eviction to run reverse queries
  row_cache: partition_snapshot_row_cursor: Print more details about the current version vector
  row_cache: Improve trace-level logging
  config: Use cache for reversed reads by default
  config: Adjust reversed_reads_auto_bypass_cache description
  row_cache: Support reverse reads natively
  mvcc: partition_snapshot: Support slicing range tombstones in reverse
  test: flat_mutation_reader_assertions: Consume expected range tombstones before end_of_partition
  row_cache: Log produced range tombstones
  test: Make produces_range_tombstone() report ck_ranges
  tests: lib: random_mutation_generator: Extract make_random_range_tombstone()
  partition_snapshot_row_cursor: Support reverse iteration
  utils: immutable-collection: Make movable
  intrusive_btree: Make default-initialized iterator cast to false
2021-12-29 16:53:25 +02:00
Asias He
eba4a4fba4 repair: Allow ignoring dead nodes for replace operation
Consider

1) n1, n2, n3, n4, n5
2) n2 and n3 are both down
3) start n6 to replace n2
4) start n7 to replace n3

We want to replace the dead nodes n2 and n3 to fix the cluster to have 5
running nodes.

Replace operation in step 3 will fail because n3 is down.
We would see errors like below:

replace[25edeec0-57d4-11ec-be6b-7085c2409b2d]: Nodes={127.0.0.3} needed
for replace operation are down. It is highly recommended to fix the down
nodes and try again.

In the above example, currently, there is no way to replace any of the
dead nodes.

Users can either fix one of the dead nodes and run replace or run
removenode operation to remove one of the dead nodes then run replace
and run bootstrap to add another node.

Fixing dead nodes is always the best solution but it might not be
possible. Running removenode operation is not better than running
replace operation (with best effort by ignoring the other dead node) in
terms of data consistency. In addition, users have to run bootstrap
operation to add back the removed node. So, allowing replacing in such
case is a clear win.

This patch adds the --ignore-dead-nodes-for-replace option to allow run
replace operation with best effort mode. Please note, use this option
only if the dead nodes are completely broken and down, and there is no
way to fix the node and bring it back. This also means the user has to
make sure the ignored dead nodes specified are really down to avoid any
data consistency issue.

Fixes #9757

Closes #9758
2021-12-20 00:49:03 +02:00