Commit Graph

259 Commits

Author SHA1 Message Date
Pavel Emelyanov
6d0c8fb6e2 config: Add constexpr value for default murmur ignore bits
... and use in some places of sstable_compaction_test. This will allow
getting rid of global test_db_config thing later

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-01 13:38:15 +03:00
Nadav Har'El
2dedb5ea75 alternator: make Alternator TTL feature no longer "experimental"
Until now, the Alternator TTL feature was considered "experimental",
and had to be manually enabled on all nodes of the cluster to be usable.

This patch removes this requirement and in essence GAs this feature.

Even after this patch, Alternator TTL is still a "cluster feature",
i.e., for this feature to be usable every node in the cluster needs
to support it. If any of the nodes is old and does not yet support this
feature, the UpdateTimeToLive request will not be accepted, so although
the expiration-scanning threads may exist on the newer nodes, they will
not do anything because none of the tables can be marked as having
expiration enabled.

This patch does not contain documentation fixes - the documentation
still suggests that the Alternator TTL feature is experimental.
The documentation patch will come separately.

Fixes #12037

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12049
2022-11-24 17:21:39 +02:00
Avi Kivity
20bad62562 Merge 'Detect and record large collections' from Benny Halevy
This series adds support for detecting collections that have too many items
and recording them in `system.large_cells`.

A configuration variable was added to db/config: `compaction_collection_items_count_warning_threshold` set by default to 10000.
Collections that have more items than this threshold will be warned about and will be recorded as a large cell in the `system.large_cells` table.  Documentation has been updated respectively.

A new column was added to system.large_cells: `collection_items`.
Similar to the `rows` column in system.large_partition, `collection_items` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't.  Note that the collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it cross the said threshold.

Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions is a smaller change overall, hence it was preferred.

Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it.
The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled.

In addition, unit tests were added in sstable_3_x_test for testing large cells detection by cell size, and large_collection detection by the number of items.

Closes #11449

Closes #11674

* github.com:scylladb/scylladb:
  sstables: mx/writer: optimize large data stats members order
  sstables: mx/writer: keep large data stats entry as members
  db: large_data_handler: dynamically update config thresholds
  utils/updateable_value: add transforming_value_updater
  db/large_data_handler: cql_table_large_data_handler: record large_collections
  db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler
  db/large_data_handler: cql_table_large_data_handler: move ctor out of line
  docs: large-rows-large-cells-tables: fix typos
  db/system_keyspace: add collection_elements column to system.large_cells
  gms/feature_service: add large_collection_detection cluster feature
  test: sstable_3_x_test: add test_sstable_too_many_collection_elements
  test: lib: simple_schema: add support for optional collection column
  test: lib: simple_schema: build schema in ctor body
  test: lib: simple_schema: cql: define s1 as static only if built this way
  db/large_data_handler: maybe_record_large_cells: consider collection_elements
  db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries
  sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells
  sstables: mx/writer: add large_data_type::elements_in_collection
  db/large_data_handler: get the collection_elements_count_threshold
  db/config: add compaction_collection_elements_count_warning_threshold
  test: sstable_3_x_test: add test_sstable_write_large_cell
  test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler
  test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells
  test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T
  test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random
2022-10-06 18:28:21 +03:00
Avi Kivity
37c6b46d26 dirty_memory_manager: re-term "virtual dirty" to "unspooled dirty"
The "virtual dirty" term is not very informative. "Virtual" means
"not real", but it doesn't say in which way it isn't real.

In this case, virtual dirty refers to real dirty memory, minus
the portion of memtables that has been written to disk (but not
yet sealed - in that case it would not be dirty in the first
place).

I chose to call "the portion of memtables that has been written
to disk" as "spooled memory". At least the unique term will cause
people to look it up and may be easier to remember. From that
we have "unspooled memory".

I plan to further change the accounting to account for spooled memory
rather than unspooled, as that is a more natural term, but that is left
for later.

The documentation, config item, and metrics are adjusted. The config
item is practically unused so it isn't worth keeping compatibility here.
2022-10-04 14:03:59 +03:00
Benny Halevy
167ec84eeb db/config: add compaction_collection_elements_count_warning_threshold
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:31:10 +03:00
Piotr Sarna
481240b8b4 Merge 'Alternator: Run more TTL tests by default (and add a test for metrics)' from Nadav Har'El
We had quite a few tests for Alternator TTL in test/alternator, but most
of them did not run as part of the usual Jenkins test suite, because
they were considered "very slow" (and require a special "--runveryslow"
flag to run).

In this series we enable six tests which run quickly enough to run by
default, without an additional flag. We also make them even quicker -
the six tests now take around 2.5 seconds.

I also noticed that we don't have a test for the Alternator TTL metrics
- and added one.

Fixes #11374.
Refs https://github.com/scylladb/scylla-monitoring/issues/1783

Closes #11384

* github.com:scylladb/scylladb:
  test/alternator: insert test names into Scylla logs
  rest api: add a new /system/log operation
  alternator ttl: log warning if scan took too long.
  alternator,ttl: allow sub-second TTL scanning period, for tests
  test/alternator: skip fewer Alternator TTL tests
  test/alternator: test Alternator TTL metrics
2022-09-22 09:47:50 +02:00
Michał Chojnowski
cdb3e71045 sstables: add a flag for disabling long-term index caching
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spacially) sparse.
We want to have a way to disable it for the affected workloads.

There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.

This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.

Consequences of this choice:

- The per-SSTable partition_index_cache is unused. Every index_reader has
  its own, and they die together. Independent reads can no longer reuse the
  work of other reads which hit the same index pages. This is not crucial,
  since partition accesses have no (natural) spatial locality. Note that
  the original reason for partition_index_cache -- the ability to share
  reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
  (uncached) input stream from the index file, and every
  bsearch_clustered_cursor has its own cached_file, which dies together with
  the cursor. Note that the cursor still can perform its binary search with
  caching. However, it won't be able to reuse the file pages read by
  index_reader. In particular, if the promoted index is small, and fits inside
  the same file page as its index_entry, that page will be re-read.
  It can also happen that index_reader will read the same index file page
  multiple times. When the summary is so dense that multiple index pages fit in
  one index file page, advancing the upper bound, which reads the next index
  page, will read the same index file page. Since summary:disk ratio is 1:2000,
  this is expected to happen for partitions with size greater than 2000
  partition keys.

Fixes #11202
2022-09-15 17:16:26 +03:00
Nadav Har'El
8ece63c433 Merge 'Safemode - Introduce TimeWindowCompactionStrategy Guardrails'
This series introduces two configurable options when working with TWCS tables:

- `restrict_twcs_default_ttl` - a LiveUpdate-able tri_mode_restriction which defaults to WARN and will notify the user whenever a TWCS table is created without a `default_time_to_live` setting
- `twcs_max_window_count` - Which forbids the user from creating TWCS tables whose window count (buckets) are past a certain threshold. We default to 50, which should be enough for most use cases, and a setting of 0 effectively disables the check.

Refs: #6923
Fixes: #9029

Closes #11445

* github.com:scylladb/scylladb:
  tests: cql_query_test: add mixed tests for verifying TWCS guard rails
  tests: cql_query_test: add test for TWCS window size
  tests: cql_query_test: add test for TWCS tables with no TTL defined
  cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables
  cql: add max window restriction for TimeWindowCompactionStrategy
  time_window_compaction_strategy: reject invalid window_sizes
  cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr
2022-09-12 23:55:51 +03:00
Nadav Har'El
e7e9adc519 alternator,ttl: allow sub-second TTL scanning period, for tests
Alternator has the "alternator_ttl_period_in_seconds" parameter for
controlling how often the expiration thread looks for expired items to
delete. It is usually a very large number of seconds, but for tests
to finish quickly, we set it to 1 second.
With 1 second expiration latency, test/alternator/test_ttl.py took 5
seconds to run.

In this patch, we change the parameter to allow a  floating-point number
of seconds instead of just an integer. Then, this allows us to halve the
TTL period used by tests to 0.5 seconds, and as a result, the run time of
test_ttl.py halves to 2.5 seconds. I think this is fast enough for now.

I verified that even if I change the period to 0.1, there is no noticable
slowdown to other Alternator tests, so 0.5 is definitely safe.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-09-12 10:32:56 +03:00
Botond Dénes
5374f0edbf Merge 'Task manager' from Aleksandra Martyniuk
Task manager for observing and managing long-running, asynchronous tasks in Scylla
with the interface for the user. It will allow listing of tasks, getting detailed
task status and progression, waiting for their completion, and aborting them.
The task manager will be configured with a “task ttl” that determines how long
the task status is kept in memory after the task completes.

At first it will support repair and compaction tasks, and possibly more in the future.

Currently:
Sharded `task_manager` is started in `main.cc` where it is further passed
to `http_context` for the purpose of user interface.

Task manager's tasks are implemented in two two layers: the abstract
and the implementation one. The latter is a pure virtual class which needs
to be overriden by each module. Abstract layer provides the methods that
are shared by all modules and the access to module-specific methods.

Each module can access task manager, create and manage its tasks through
`task_manager::module` object. This way data specific to a module can be
separated from the other modules.

User can access task manager rest api interface to track asynchronous tasks.
The available options consist of:
- getting a list of modules
- getting a list of basic stats of all tasks in the requested module
- getting the detailed status of the requested task
- aborting the requested task
- waiting for the requested task to finish

To enable testing of the provided api, test specific task implementation and module
are provided. Their lifetime can be simulated with the standalone test api.
These components are compiled and the tests are run in all but release build modes.

Fixes: #9809

Closes #11216

* github.com:scylladb/scylladb:
  test: task manager api test
  task_manager: test api layer implementation
  task_manager: add test specific classes
  task_manager: test api layer
  task_manager: api layer implementation
  task_manager: api layer
  task_manager: keep task_manager reference in http_context
  start sharded task manager
  task_manager: create task manager object
2022-09-12 09:26:46 +03:00
Felipe Mendes
7fec4fcaa6 cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables
TimeWindowCompactionStrategy (TWCS) tables are known for being used explicitly for time-series workloads. In particular, most of the time users should specify a default_time_to_live during table creation to ensure data is expired such as in a sliding window. Failure to do so may create unbounded windows - which - depending on the compaction window chosen, may introduce severe latency and operational problems, due to unbounded window growth.

However, there may be some use cases which explicitly ingest data by using the `USING TTL` keyword, which effectively has the same effect. Therefore, we can not simply forbid table creations without a default_time_to_live explicitly set to any value other than 0.

The new restrict_twcs_without_default_ttl option has three values: "true", "false", and "warn":

We default to "warn", which will notify the user of the consequences when creating a TWCS table without a default_time_to_live value set. However, users are encouraged to switch it to "true", as - ideally - a default_time_to_live value should always be expected to prevent applications failing to ingest data against the database ommitting the `USING TTL` keyword.
2022-09-11 16:50:42 -03:00
Felipe Mendes
a3356e866b cql: add max window restriction for TimeWindowCompactionStrategy
The number of potential compaction windows (or buckets) is defined by the default_time_to_live / sstable_window_size ratio. Every now and then we end up in a situation on where users of TWCS end up underestimating their window buckets when using TWCS. Unfortunately, scenarios on which one employs a default_time_to_live setting of 1 year but a window size of 30 minutes are not rare enough.

Such configuration is known to only make harm to a workload: As more and more windows are created, the number of SSTables will grow in the same pace, and the situation will only get worse as the number of shards increase.

This commit introduces the twcs_max_window_count option, which defaults to 50, and will forbid the Creation or Alter of tables which get past this threshold. A value of 0 will explicitly skip this check.

Note: this option does not forbid the creation of tables with a default_time_to_live=0 as - even though not recommended - it is perfectly possible for a TWCS table with default TTL=0 to have a bound window, provided any ingestion statements make use of 'USING TTL' within the CQL statement, in addition to it.
2022-09-11 16:50:22 -03:00
Aleksandra Martyniuk
2439e55974 task_manager: create task manager object
Implementation of a task manager that allows tracking
and managing asynchronous tasks.

The tasks are represented by task_manager::task class providing
members common to all types of tasks. The methods that differ
among tasks of different module can be overriden in a class
inheriting from task_manager::task::impl class. Each task stores
its status containing parameters like id, sequence number, begin
and end time, state etc. After the task finishes, it is kept
in memory for configurable time or until it is unregistered.
Tasks need to be created with make_task method.

Each module is represented by task_manager::module type and should
have an access to task manager through task_manager::module methods.
That allows to easily separate and collectively manage data
belonging to each module.
2022-09-09 14:29:28 +02:00
Mikołaj Grzebieluch
5b1421cc33 db: config: add BROADCAST_TABLES feature flag
Add experimental flag 'broadcast-tables' for enabling BROADCAST_TABLES feature.
This feature requires raft group0, thus enabling it without RAFT will cause an error.
2022-09-05 11:11:08 +02:00
Benny Halevy
d295d8e280 everywhere: define locator::host_id as a strong tagged_uuid type
So it can be distinguished from other uuid-based
identifiers in the system.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11276
2022-08-12 06:01:44 +03:00
Botond Dénes
33f0447ba0 db/config: add config item for query tombstone limit
This will be the value used to break pages, after processing the
specified amount of tombstones. The page will be cut even if empty.
We could maybe use the already existing tombstone_{warn,fail}_threshold
instead and use them as a soft/hard limit pair, like we did with page
sizes.
2022-08-09 10:00:40 +03:00
Benny Halevy
edd308c705 config: use ordered map for experimental features
So that the help string will be sorted lexicographically.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11178
2022-08-01 17:40:10 +03:00
Pavel Emelyanov
7d0110cd31 config: Add stream_io_throughput_mb_per_sec option
It's going to control the bandwidth for the streaming prio class.
For now it's jsut added but does't work for real

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-19 12:14:41 +03:00
Nadav Har'El
cc69177dcc config: fix printing of experimental feature list
Recently we noticed a regression where with certain versions of the fmt
library,

   SELECT value FROM system.config WHERE name = 'experimental_features'

returns string numbers, like "5", instead of feature names like "raft".

It turns out that the fmt library keep changing their overload resolution
order when there are several ways to print something. For enum_option<T> we
happen to have to conflicting ways to print it:
  1. We have an explicit operator<<.
  2. We have an *implicit* convertor to the type held by T.

We were hoping that the operator<< always wins. But in fmt 8.1, there is
special logic that if the type is convertable to an int, this is used
before operator<<()! For experimental_features_t, the type held in it was
an old-style enum, so it is indeed convertible to int.

The solution I used in this patch is to replace the old-style enum
in experimental_features_t by the newer and more recommended "enum class",
which does not have an implicit conversion to int.

I could have fixed it in other ways, but it wouldn't have been much
prettier. For example, dropping the implicit convertor would require
us to change a bunch of switch() statements over enum_option (and
not just experimental_features_t, but other types of enum_option).

Going forward, all uses of enum_option should use "enum class", not
"enum". tri_mode_restriction_t was already using an enum class, and
now so does experimental_features_t. I changed the examples in the
comments to also use "enum class" instead of enum.

This patch also adds to the existing experimental_features test a
check that the feature names are words that are not numbers.

Fixes #11003.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11004
2022-07-11 09:17:30 +02:00
Tomasz Grabiec
6622e3369a config: Introduce force_schema_commit_log option 2022-07-06 22:08:56 +02:00
Tomasz Grabiec
b8d20335a4 config: Introduce unsafe_ignore_truncation_record
The node now refuses to boot if schema tables were truncated.
This adds a config option to ignore truncation records as a
workaround if user truncated them manually.
2022-07-06 22:08:56 +02:00
Avi Kivity
dab56b82fa Merge 'Per-partition rate limiting' from Piotr Dulikowski
Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode.

This PR introduces per-partition rate limiting feature. Now, users can choose to apply per-partition limits to their tables of choice using a schema extension:

```
ALTER TABLE ks.tbl
WITH per_partition_rate_limit = {
	'max_writes_per_second': 100,
	'max_reads_per_second': 200
};
```

Reads and writes which are detected to go over that quota are rejected to the client using a new RATE_LIMIT_ERROR CQL error code - existing error codes didn't really fit well with the rate limit error, so a new error code is added. This code is implemented as a part of a CQL protocol extension and returned to clients only if they requested the extension - if not, the existing CONFIG_ERROR will be used instead.

Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation into the rate limit early and return an error in case of exceeding the rate limit before sending any messages to other replicas at all.

The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here.

Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`:

- Write mode:
  ```
  8f690fdd47 (PR base):
  129644.11 tps ( 56.2 allocs/op,  13.2 tasks/op,   49785 insns/op)
  This PR:
  125564.01 tps ( 56.2 allocs/op,  13.2 tasks/op,   49825 insns/op)
  ```
- Read mode:
  ```
  8f690fdd47 (PR base):
  150026.63 tps ( 63.1 allocs/op,  12.1 tasks/op,   42806 insns/op)
  This PR:
  151043.00 tps ( 63.1 allocs/op,  12.1 tasks/op,   43075 insns/op)
  ```

Manual upgrade test:
- Start 3 nodes, 4 shards each, Scylla version 8f690fdd47
- Create a keyspace with scylla-bench, RF=3
- Start reading and writing with scylla-bench with CL=QUORUM
- Manually upgrade nodes one by one to the version from this PR
- Upgrade succeeded, apart from a small number of operations which failed when each node was being put down all reads/writes succeeded
- Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected

Fixes: #4703

Closes #9810

* github.com:scylladb/scylla:
  storage_proxy: metrics for per-partition rate limiting of reads
  storage_proxy: metrics for per-partition rate limiting of writes
  database: add stats for per partition rate limiting
  tests: add per_partition_rate_limit_test
  config: add add_per_partition_rate_limit_extension function for testing
  cf_prop_defs: guard per-partition rate limit with a feature
  query-request: add allow_limit flag
  storage_proxy: add allow rate limit flag to get_read_executor
  storage_proxy: resultize return type of get_read_executor
  storage_proxy: add per partition rate limit info to read RPC
  storage_proxy: add per partition rate limit info to query_result_local(_digest)
  storage_proxy: add allow rate limit flag to mutate/mutate_result
  storage_proxy: add allow rate limit flag to mutate_internal
  storage_proxy: add allow rate limit flag to mutate_begin
  storage_proxy: choose the right per partition rate limit info in write handler
  storage_proxy: resultize return types of write handler creation path
  storage_proxy: add per partition rate limit to mutation_holders
  storage_proxy: add per partition rate limit info to write RPC
  storage_proxy: add per partition rate limit info to mutate_locally
  database: apply per-partition rate limiting for reads/writes
  database: move and rename: classify_query -> classify_request
  schema: add per_partition_rate_limit schema extension
  db: add rate_limiter
  storage_proxy: propagate rate_limit_exception through read RPC
  gms: add TYPED_ERRORS_IN_READ_RPC cluster feature
  storage_proxy: pass rate_limit_exception through write RPC
  replica: add rate_limit_exception and a simple serialization framework
  docs: design doc for per-partition rate limiting
  transport: add rate_limit_error
2022-06-24 01:32:13 +03:00
Piotr Dulikowski
761a037afb config: add add_per_partition_rate_limit_extension function for testing
...and use it in cql_test_env to enable the per_partition_rate_limit
extension for all tests that use it.
2022-06-22 20:16:49 +02:00
Pavel Emelyanov
820be06ac1 hints: Remove snitch dependency
After previous patch hints manager class gets unused dependency on
snitch. While removing it it turns out that several unrelated places
get needed headers indirectly via host_filter.hh -> snitsh_base.hh
inclusion.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Benny Halevy
6677028212 sstables: mx/writer: auto-scale promoted index
Add column_index_auto_scale_threshold_in_kb to the configuration
(defaults to 10MB).

When the promoted index (serialized) size gets to this
threshold, it's halved by merging each two adjacent blocks
into one and doubling the desired_block_size.

Fixes #4217

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-24 13:32:35 +03:00
Calle Wilund
7dd7760e8d commitlog: Make flush threshold a config parameter 2022-04-11 16:34:00 +00:00
Calle Wilund
d478896d46 commitlog: kill non-recycled segment management
It has been default for a while now. Makes no sense to not do it.
Even hints can use it (even if it makes no difference there)
2022-04-11 16:34:00 +00:00
Piotr Sarna
3272b4826f db: add keyspace-storage-options experimental feature
Specifying non-standard keyspace options is experimental, so it's
going to be protected by a configuration flag.
2022-04-08 09:17:01 +02:00
Nadav Har'El
49a8164fb7 alternator: add configurable scan period to TTL expiration
Before this patch, the experimental TTL (expiration time) feature in
Alternator scans tables for expiration in a tight loop - starting the
next scan one second after the previous one completed.

In this patch we introduce a new configuration option,
alternator_ttl_period_in_seconds, which determines how frequently
to start the scan. The default is 24 hours - meaning that the next
scan is started 24 hours after the previous one started.

The tests (test/alternator/run) change this configuration back to one
second, so that expiration tests finish as quickly as possible.

Please note that the scan is *not* slowed down to fill this 24 hours -
if it finishes in one hour, it will then sleep for 23 hours. Additional
work would be needed to slow down the scan to not finish too quickly.
One idea not yet implemented is to move the expiration service from
the "maintenance" scheduling group which it uses today to a new
scheduling group, and modifying the number of shares that this group
gets.

Another thing worth noting about the configurable period (which defaults
to 24 hours) is that when TTL is enabled on an Alternator table, it can
take that amount of time until its scan starts and items start expiring
from it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-02-25 07:26:11 +02:00
Michael Livshin
3fef604075 sstables_manager: add get_local_host_id() method and support
Since ME sstable format includes originating host id in stats
metadata, local host id needs to be made available for writing and
validation.

Both Scylla server (where local host id comes from the `system.local`
table) and unit tests (where it is fabricated) must be accomodated.
Regardless of how the host id is obtained, it is stored in the db
config instance and accessed through `sstables_manager`.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
0b1447c702 add "sstable_format" config
Initialize it to "md" until ME format support is
complete (i.e. storing originating host id in sstable stats metadata
is implemented), so at present there is no observable change by
default.

Also declare "enable_sstables_md_format" unused -- the idea, going
forward, being that only "sstable_format" controls the written sstable
file format and that no more per-format enablement config options
shall be added.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michał Sala
b439d6e710 db: config: add a flag to disable new parallelized aggregation algorithm
Just in case the new algorithm turns out to be buggy, add a flag to
fall-back to the old algorithm.
2022-02-01 21:26:25 +01:00
Pavel Emelyanov
a026b4ef49 config: Add option to disable config updates via CQL
The system.config table allows changing config parameters, but this
change doesn't survive restarts and is considered to be dangerous
(sometimes). Add an option to disable the table updates. The option
is LiveUpdate and can be set to false via CQL too (once).

fixes #9976

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220201121114.32503-1-xemul@scylladb.com>
2022-02-01 14:30:47 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Kamil Braun
e98711cfcb db: config: add a flag to disable new reversed reads algorithm
Just in case the new algorithm turns out to be buggy, or give a
performance regression, add a flag to fall-back to the old algorithm for
use in the field.
2022-01-12 18:59:19 +01:00
Asias He
eba4a4fba4 repair: Allow ignoring dead nodes for replace operation
Consider

1) n1, n2, n3, n4, n5
2) n2 and n3 are both down
3) start n6 to replace n2
4) start n7 to replace n3

We want to replace the dead nodes n2 and n3 to fix the cluster to have 5
running nodes.

Replace operation in step 3 will fail because n3 is down.
We would see errors like below:

replace[25edeec0-57d4-11ec-be6b-7085c2409b2d]: Nodes={127.0.0.3} needed
for replace operation are down. It is highly recommended to fix the down
nodes and try again.

In the above example, currently, there is no way to replace any of the
dead nodes.

Users can either fix one of the dead nodes and run replace or run
removenode operation to remove one of the dead nodes then run replace
and run bootstrap to add another node.

Fixing dead nodes is always the best solution but it might not be
possible. Running removenode operation is not better than running
replace operation (with best effort by ignoring the other dead node) in
terms of data consistency. In addition, users have to run bootstrap
operation to add back the removed node. So, allowing replacing in such
case is a clear win.

This patch adds the --ignore-dead-nodes-for-replace option to allow run
replace operation with best effort mode. Please note, use this option
only if the dead nodes are completely broken and down, and there is no
way to fix the node and bring it back. This also means the user has to
make sure the ignored dead nodes specified are really down to avoid any
data consistency issue.

Fixes #9757

Closes #9758
2021-12-20 00:49:03 +02:00
Avi Kivity
f28552016f Update seastar submodule
* seastar f8a038a0a2...8d15e8e67a (21):
  > core/program_options: preserve defaultness of CLI arguments
  > log: Silence logger when logging
  > Include the core/loop.hh header inside when_all.hh header
  > http: Fix deprecated wrappers
  > foreign_ptr: Add concept
  > util: file: add read_entire_file
  > short_streams: move to util
  > Revert "Merge: file: util: add read_entire_file utilities"
  > foreign_ptr: declare destroy as a static method
  > Merge: file: util: add read_entire_file utilities
  > Merge "output_stream: handle close failure" from Benny
  > net: bring local_address() to seastar::connected_socket.
  > Merge "Allow programatically configuring seastar" from Botond
  > Merge 'core: clean up memory metric definitions' from John Spray
  > Add PopOS to debian list in install-dependencies.sh
  > Merge "make shared_mutex functions exception safe and noexcept" from Benny
  > on_internal_error: set_abort_on_internal_error: return current state
  > Implementation of iterator-range version of when_any
  > net: mark functions returning ethernet_address noexcept
  > net: ethernet_address: mark functions noexcept
  > shared_mutex: mark wake and unlock methods noexcept

Contains patch from Botond Dénes <bdenes@scylladb.com>:

db/config: configure logging based on app_template::seastar_options

Scylla has its own config file which supports configuring aspects of
logging, in addition to the built-in CLI logging options. When applying
this configuration, the CLI provided option values have priority over
the ones coming from the option file. To implement this scylla currently
reads CLI options belonging to seastar from the boost program options
variable map. The internal representation of CLI options however do not
constitute an API of seastar and are thus subject to change (even if
unlikely). This patch moves away from this practice and uses the new
shiny C++ api: `app_template::seastar_options` to obtain the current
logging options.
2021-12-08 14:21:11 +02:00
Pavel Emelyanov
d513034ca4 utils: Ability to set_value(sstring) for an option
There soon will appear an updateable system.config table that
will push sstrings into names_value-s. Prepare for this change
by adding the respective .set_value() call. Since the update
only works for LiveUpdate-able options, and inability to do it
can be propagated back to the caller make this method return
true/false whether the update took place or not.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-11 15:15:05 +03:00
Pavel Emelyanov
71ce7c6e87 db.config: Verbose address resolver helper
The helper works on named_value() and throws and exception containing
the option name for convenient error reporting.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-08 17:33:27 +03:00
Nadav Har'El
e4a6569258 config: experimental flag UNUSED_CDC shouldn't be distinct from UNUSED
When an experimental feature graduates from being experimental, we want
to continue allow the old "--experimental-features=..." option to work,
in case some user's configuration uses it - just do nothing. The way
we do it is to map in db::experimental_features_t::map() the feature's
name to the UNUSED value - this way the feature's name is accepted, but
doesn't change anything.

When the CDC feature graduated from being experimental, a new bit
UNUSED_CDC was introduced to do the same thing. This separate bit was
not actually necessary - if we ever check for UNUSED_CDC bit anywhere
in the code it means the flag isn't actually unused ;-) And we don't
check it.

So simplify the code by conflating UNUSED_CDC into UNUSED.
This will also make it easy to build from db::experimental_features_t::map()
a list of current experimental features - now it will simply be those that
do not map to UNUSED.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211013105107.123544-1-nyh@scylladb.com>
2021-10-20 17:54:17 +03:00
Tomasz Grabiec
e89b9799b8 Merge 'sstable mx reader: implement reverse single-partition reads' from Kamil Braun
Until now reversed queries were implemented inside
`querier::consume_page` (more precisely, inside the free function
`consume_page` used by `querier::consume_page`) by wrapping the
passed-in reader into `make_reversing_reader` and then consuming
fragments from the resulting reversed reader.

The first couple of commits change that by pushing the reversing down below
the `make_combined_reader` call in `table::query`. This allows
working on improving reversing for memtables independently from
reversing for sstables.

We then extend the `index_reader` with functions that allow
reading the promoted index in reverse.

We introduce `partition_reversing_data_source`, which wraps an sstable data
file and returns data buffers with contents of a single chosen partition
as if the rows were stored in reverse order.

We use the reversing source and the extended index reader in
`mx_sstable_mutation_reader` to implement efficient (at least in theory)
reversed single-partition reads.

The patchset disables cache for reversed reads. Fast-forwarding
is not supported in the mx reader for reversed queries at this point.

Details in commit messages. Read the commits in topological order
for best review experience.

Refs: #9134
(not saying "Fixes" because it's only for single-partition queries
without forwarding)

Closes #9281

* github.com:scylladb/scylla:
  table: add option to automatically bypass cache for reversed queries
  test: reverse sstable reader with random schema and random mutations
  sstables: mx: implement reversed single-partition reads
  sstables: mx: introduce partition_reversing_data_source
  sstables: index_reader: add support for iterating over clustering ranges in reverse
  clustering_key_filter: clustering_key_filter_ranges owning constructor
  flat_mutation_reader: mention reversed schema in make_reversing_reader docstring
  clustering_key_filter: document clustering_key_filter_ranges::get_ranges
2021-10-04 15:37:34 +02:00
Kamil Braun
703aed3277 table: add option to automatically bypass cache for reversed queries
Currently the new reversing sstable algorithms do not support fast
forwarding and the cache does not yet handle reversed results. This
forced us to disable the cache for reversed queries if we want to
guarantee bounded memory. We introduce an option that does this
automatically (without specifying `bypass cache` in the query) and turn
it on by default.

If the user decides that they prefer to keep the cache at the
cost of fetching entire partitions into memory (which may be viable
if their partitions are small) during reversed queries, the option can
be turned off. It is live-updateable.
2021-10-04 15:24:12 +02:00
Pavel Emelyanov
bbcf671276 config: Remove unused replacing options
The --replace-token and --replace-node were added some time ago, but
have never been used since then, just parsed and immediatelly aborted.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210930102222.16294-1-xemul@scylladb.com>
2021-09-30 14:56:04 +03:00
Nadav Har'El
4ffd8c1f2b alternator: stub TTL operations
This patch adds stubs for the UpdateTimeToLive and DescribeTimeToLive
operations to Alternator. These operations can enable, disable, or inquire
about, the chosen expiration-time attribute.

Currently, the information about the chosen attribute is only saved, with
no actual expiration of any items taking place.

Some of the tests for the TTL feature start to pass, so their xfail tag
is removed.

Because this this new feature is incomplete, it is not enabled unless
the "alternator-ttl" experimental feature is enabled. Moreover, for
these operations to be allowed, the entire cluster needs to support
this experimental feature, because all nodes need to participate in the
data expiration - if some old nodes don't support Alternator TTL, some
of the data they hold won't get expired... So we don't allow enabling
TTL until all the nodes in the cluster support this feature.

The implementation is in a new source file, alternator/ttl.cc. This
source file will continue to grow as we implement the expiration feature.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-09-19 21:05:21 +03:00
Avi Kivity
c5f52f9d97 schema_tables: don't flush in tests
Flushing schema tables is important for crash recovery (without a flush,
we might have sstables using a new schema before the commitlog entry
noting the schema change has been replayed), but not important for tests
that do not test crash recovery. Avoiding those flushes reduces system,
user, and real time on tests running on a consumer-level SSD.

before:
real	8m51.347s
user	7m5.743s
sys	5m11.185s

after:
real	7m4.249s
user	5m14.085s
sys	2m11.197s

Note real time is higher that user+sys time divided by the number
of hardware threads, indicating that there is still idle time due
to the disk flushing, so more work is needed.

Closes #9319
2021-09-12 11:32:13 +03:00
Avi Kivity
705f957425 Merge "Generalize TLS creds builder configuration" from Pavel E
"
There are 4 places out there that do the same steps parsing
"client_|server_encryption_options" and configuring the
seastar::tls::creds_builder with the values (messaging, redis,
alternator and transport).

Also to make redis and transport look slimmer main() cleans
the client_encryption_options by ... parsing it too.

This set introduces a (coroutinized) helper to configure the
creds_builder with map<string, string> and removes the options
beautification from main.

tests: unit(dev), dtest.internode_ssl_test(dev)
"

* 'br-generalize-tls-creds-builder-configuration' of https://github.com/xemul/scylla:
  code: Generalize tls::credentials_builder configuration
  transport, redis: Do not assume fixed encryption options
  messaging: Move encryption options parsing to ms
  main: Open-code internode encryption misconfig warning
  main, config: Move options parsing helpers
2021-09-01 14:19:19 +03:00
Avi Kivity
8b59e3a0b1 Merge ' cql3: Demand ALLOW FILTERING for unlimited, sliced partitions ' from Dejan Mircevski
Return the pre- 6773563d3 behavior of demanding ALLOW FILTERING when partition slice is requested but on potentially unlimited number of partitions.  Put it on a flag defaulting to "off" for now.

Fixes #7608; see comments there for justification.

Tests: unit (debug, dev), dtest (cql_additional_test, paging_test)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #9126

* github.com:scylladb/scylla:
  cql3: Demand ALLOW FILTERING for unlimited, sliced partitions
  cql3: Track warnings in prepared_statement
  test: Use ALLOW FILTERING more strictly
  cql3: Add statement_restrictions::to_string
2021-08-31 18:05:26 +03:00
Dejan Mircevski
2f28f68e84 cql3: Demand ALLOW FILTERING for unlimited, sliced partitions
When a query requests a partition slice but doesn't limit the number
of partitions, require that it also says ALLOW FILTERING.  Although
do_filter() isn't invoked for such queries, the performance can still
be unexpectedly slow, and we want to signal that to the user by
demanding they explicitly say ALLOW FILTERING.

Because we now reject queries that worked fine before, existing
applications can break.  Therefore, the behavior is controlled by a
flag currently defaulting to off.  We will default to "on" in the next
Scylla version.

Fixes #7608; see comments there for justification.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-08-31 10:45:41 -04:00
Pavel Solodovnikov
22794efc22 db: add experimental option for raft
Introduce `raft` experimental option.
Adjust the tests accordingly to accomodate the new option.

It's not enabled by default when providing
`--experimental=true` config option and should be
requested explicitly via `--experimental-options=raft`
config option.

Hide the code related to `raft_group_registry` behind
the switch. The service object is still constructed
but no initialization is performed (`init()` is not
called) if the flag is not set.

Later, other raft-related things, such as raft schema
changes, will also use this flag.

Also, don't introduce a corresponding gossiper feature
just yet, because again, it should be done after the
raft schema changes API contract is stabilized.

This will be done in a separate series, probably related to
implementing the feature itself.

Tests: unit(dev)

Ref #9239.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210823121956.167682-1-pa.solodovnikov@scylladb.com>
2021-08-23 17:45:58 +03:00
Pavel Emelyanov
e02b39ca3d code: Generalize tls::credentials_builder configuration
All the places in code that configure the mentioned creds builder
from client_|server_encryption_options now do it the same way.
This patch generalizes it all in the utils:: helper.

The alternator code "ignores" require_client_auth and truststore
keys, but it's easy to make the generalized helper be compatible.

Also make the new helper coroutinized from the beginning.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-20 18:05:41 +03:00