Commit Graph

323 Commits

Author SHA1 Message Date
Raphael S. Carvalho
707ade21f8 replica: Add async gate to compaction_group
replica::table has the same gate for gating async operations, and
even synchronize stop of table with in-flight writes that will
apply into memory.

compaction group gains the same gate, which will be used when
operations are confined to a single group. table's gate is kept
for table wide operations like query, truncate, etc.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-27 19:13:14 -03:00
Benny Halevy
2937552e5b table: init_storage: create upload and staging subdirs on all datadirs
We need to have a complete directory structure for each table
and each configured datadir.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-23 08:24:54 +03:00
Kamil Braun
4376854473 schema_tables: remove default value for reload in merge_schema
To avoid bugs like the one fixed in the previous commit.
2023-09-15 13:04:04 +02:00
Petr Gusev
ce0ee32d5a database.cc: make _uses_schema_commitlog optional
This field on the null shard is properly initialized
in maybe_init_schema_commitlog function, until then
we can't make decisions based on its value. This problem
can happen e.g. if add_column_family function is called
with readonly=false before maybe_init_schema_commitlog.
It will call commitlog_for to pass the commitlog to
mark_ready_for_writes and commitlog_for reads _uses_schema_commitlog.

In this commit we add protection against this case - we
trigger internal_error if _uses_schema_commitlog is read
before it is initialized.

maybe_init_schema_commitlog() was added to cql_test_env
to make boost tests work with the new invariant.
2023-09-13 23:17:20 +04:00
Petr Gusev
beb29f094b system_keyspace: drop load phases
We want to switch system.scylla_local table to the
schema commitlog, but load phases hamper here - schema
commitlog is initialized after phase1,
so a table which is using it should be moved to phase2,
but system.scylla_local contains features, and we need
them before  schema commitlog initialization for
SCHEMA_COMMITLOG feature.

In this commit we are taking a different approach to
loading system tables. First, we load them all in
one pass in 'readonly' mode. In this mode, the table
cannot be written to and has not yet been assigned
a commit log. To achieve this we've added _readonly bool field
to the table class, it's initialized to true in table's
constructor. In addition, we changed the table constructor
to always assign nullptr to commitlog, and we trigger
an internal error if table.commitlog() property is accessed
while the table is in readonly mode. Then, after
triggering on_system_tables_loaded notifications on
feature_service and sstable_format_selector, we call
system_keyspace::mark_writable and eventually
table::mark_ready_for_writes which selects the
proper commitlog and marks the table as writable.

In sstable_compaction_test we drop several
mark_ready_for_writes calls since they are redundant,
the table has already been made writable in
env.make_table_for_tests call.

The table::commitlog function either returns the current
commitlog or causes an error if the table is readonly. This
didn't work for virtual tables, since they never called
mark_ready_for_writes. In this commit we add this
call to initialize_virtual_tables.
2023-09-13 23:17:20 +04:00
Petr Gusev
47ffc66c7f database.hh: add_column_family: add readonly parameter
Previously, creating a table or view in
schema_tables.cc/merge_tables_and_views was a two-step process:
first adding a column family (add_column_family function) and
then marking it as ready for writes (mark_table_as_writable).
There is an yield between these stages, this means
someone could see a table or view for which the
mark_table_as_writable method had not yet been called,
and start writing to it.

This problem was demonstrated by materialised view dtests.
A view is created on all nodes. On some nodes it will be created
earlier than on others and the view rebuild process will start
writing data to that view on other nodes, where mark_table_as_writable
has not yet been called.

In this patch we solve this problem by adding a readonly parameter
to the add_column_family method. When loading tables from disk,
this flag is set to true and the mark_table_as_writable
is called only after all sstables have been loaded.
When creating a new table, this flag is set to false,
mark_table_as_writable is called from inside add_column_family
and the new table becomes visible already as writable.
2023-09-13 23:17:20 +04:00
Petr Gusev
b90011294d config.cc: drop db::config::host_id
In this refactoring commit we remove the db::config::host_id
field, as it's hacky and duplicates token_metadata::get_my_id.

Some tests want specific host_id, we add it to cql_test_config
and use in cql_test_env.

We can't pass host_id to sstables_manager by value since it's
initialized in database constructor and host_id is not loaded yet.
We also prefer not to make a dependency on shared_token_metadata
since in this case we would have to create artificial
shared_token_metadata in many tools and tests where sstables_manager
is used. So we pass a function that returns host_id to
sstables_manager constructor.
2023-09-13 23:00:15 +04:00
Tomasz Grabiec
c27d212f4b api, storage_service: Recalculate table digests on relocal_schema api call
Currently, the API call recalculates only per-node schema version. To
workaround issues like #4485 we want to recalculate per-table
digests. One way to do that is to restart the node, but that's slow
and has impact on availability.

Use like this:

  curl -X POST http://127.0.0.1:10000/storage_service/relocal_schema

Fixes #15380

Closes #15381
2023-09-13 18:27:57 +03:00
Dawid Medrek
c7fe5d7f94 utils/lister: Limit the API of scan_dir() to fs::path
Right now, the function allows for passing the path to a file as a seastar::sstring,
which is then converted to std::filesystem::path -- implicitly to the caller.
However, the function performs I/O, and there is no reason to accept any other type
than std::filesystem::path, especially because the conversion is straightforward.
Callers can perform it on their own.

This commit introduces the more constrained API.

Closes #15266
2023-09-05 20:50:42 +03:00
Michał Chojnowski
50b429f255 config: add index_cache_fraction
Adds a configurable upper limit to memory usage by index caches.
See the source code comments added in this patch for more details.

This patch shouldn't change visible behaviour, because the limit is set to 1.0
by default, so it is never triggerred. We will change the default in a future
patch.
2023-09-01 22:34:23 +02:00
Eliran Sinvani
eb368f9f6e internal_keyspace extention: enhance the semantics also to flushes
commit 7c8c020 introduced a new type of a keyspace, an internal keyspace
It defined the semantics for this internal keyspace, this keyspace is
somewhat a hybrid between system and user keyspace.

Here we extend the semantics to include also flushes, meaning that
flushes will be done using the system dirty_mamory_manager. This is
in order to allow inter dependencies between internal tables and user
tables and prevent deadlocks.

One example of such a deadlock is our `replicated_key_provider`
encryption on the enterprise version. The deadlock occur because in some
circumstances, an encrypted user table flush is dependant upon the
`encrypted_keys` table being flushed but since the requests are
serialized, we get a deadlock.

Tests: unit tests dev + debug
The deadlock dtest reproducer:
encryption_at_rest_test.py::TestEncryptionAtRest::test_reboot

Fixes #14529

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #14547
2023-08-21 18:17:05 +03:00
Raphael S. Carvalho
b578d6643f Kill scylla option to configure number of compaction groups
The option was introduced to bootstrap the project. It's still
useful for testing, but that translates into maintaining an
additional option and code that will not be really used
outside of testing. A possible option is to later map the
option in boost tests to initial_tablets, which may yield
the same effect for testing.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-08-16 18:23:53 -03:00
Patryk Jędrzejczak
e7077da12d replica: reduce the size limit of the schema commitlog
The size of the schema commitlog is incorrectly set to 10 TB. To
avoid wasting space, we reduce it to 2 * schema commitlog segment
size.

Closes #14946
2023-08-14 20:41:15 +02:00
Pavel Emelyanov
fa93ac9bfd database: Add wasm::manager& dependency
The dependency is needed by db::schema_tables to get wasm manager for
its needs. This patch prepares the ground. Now the wasm::manager is
shared between replica::database and cql3::query_processor

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-04 19:47:50 +03:00
Botond Dénes
00a62866ac Merge 'Make database::add_column_family exception safe.' from Aleksandra Martyniuk
If some state update in database::add_column_family throws,
info about a column family would be inconsistent.

Undo already performed operations in database::add_column_family
when one throws.

Fixes: #14666.

Closes #14672

* github.com:scylladb/scylladb:
  replica: undo the changes if something fails
  replica: start table earlier in database::add_column_family
2023-08-04 10:58:17 +03:00
Aleksandra Martyniuk
1e9b2972ea replica: undo the changes if something fails
If a step of adding a table fails, previous steps are undone.
2023-08-03 17:37:31 +02:00
Aleksandra Martyniuk
9f68566038 replica: start table earlier in database::add_column_family
In database::add_column_family table::start() is called before
a table is registered in different structures.
2023-08-02 16:35:34 +02:00
Amnon Heiman
d10a3dd19a config: add enable_node_table_metrics flag
By default, per-table-per-shard metrics reporting is turned off, and the
aggregated version of the metrics (per-table-per-node) will be turned
on.

There could be a situation where a user with an excessive number of
tables would suffer from performance issues, both from the network and
the metrics collection server.

This patch adds a config option, enable_node_table_metrics, which allows
users to turn off per-table metrics reporting altogether.

For example, when running Scylla with the command line argument
'--enable-node-aggregated-table_metrics 0' per-table metrics will not be reported.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2023-08-02 10:20:18 +03:00
Botond Dénes
4a02865ea1 Merge 'Prevent invalidation of iterators over database::_column_families' from Aleksandra Martyniuk
Maps related to column families in database are extracted
to a column_families_data class. Access to them is possible only
through methods. All methods which may preempt hold rwlock
in relevant mode, so that the iterators can't become invalid.

Fixes: #13290

Closes #13349

* github.com:scylladb/scylladb:
  replica: make tables_metadata's attributes private
  replica: add methods to get a filtered copy of tables map
  replica: add methods to check if given table exists
  replica: add methods to get table or table id
  replica: api: return table_id instead of const table_id&
  replica: iterate safely over tables related maps
  replica: pass tables_metadata to phased_barrier_top_10_counts
  replica: add methods to safely add and remove table
  replica: wrap column families related maps into tables_metadata
  replica: futurize database::add_column_family and database::remove
2023-07-31 15:31:59 +03:00
Botond Dénes
72043a6335 Merge 'Avoid using qctx in schema_tables' column-mapping queries' from Pavel Emelyanov
There are three methods in system_keyspace namespace that run queries over `system.scylla_table_schema_history` table. For that they use qctx which's not nice.

Fortunately, all the callers already have the system_keyspace& local variable or argument they can pass to those methods. Since the accessed table belongs to system keyspace, the latter declares the querying methods as "friends" to let them get private `query_processor& _qp` member

Closes #14876

* github.com:scylladb/scylladb:
  schema_tables: Extract query_processor from system_keyspace for querying
  schema_tables: Add system_keyspace& argument to ..._column_mapping() calls
  migration_manager: Add system_keyspace argument to get_schema_mapping()
2023-07-31 15:00:59 +03:00
Pavel Emelyanov
cf4d4d7e9b schema_tables: Add system_keyspace& argument to ..._column_mapping() calls
The callers all have local sys_ks argument:

- merge_tables_and_views()
- service::get_column_mapping()
- database::parse_system_tables()

And a test that can get it from cql_test_env.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-28 15:55:13 +03:00
Botond Dénes
b599f15b26 replica: make_[multishard_]streaming_reader(): make compaction_time mandatory
Now that all users have opted in unconditionally, there is no point in
keeping this optional. Make it mandatory to make sure there are no
opt-out by mistake.
The global override via enable_compacting_data_for_streaming_and_repair
config item still remains, allowing compaction to be force turned-off.
2023-07-27 04:57:52 -04:00
Botond Dénes
2f8d77e97b replica/table: add optional compacting to make_multishard_streaming_reader()
Doing to make_multishard_streaming_reader() what the previous commit did
to make_streaming_reader(). In fact, the new compaction_time parameter
is simply forwarded to the make_streaming_reader() on the shard readers.

Call sites are updated, but none opt in just yet.
2023-07-27 03:22:11 -04:00
Botond Dénes
42b0dd5558 replica/table: add optional compacting to make_streaming_reader()
Opt-in is possible by passing an engaged `compaction_time`
(gc_clock::time_point) to the method. When this new parameter is
disengaged, no compaction happens.
Note that there is a global override, via the
enable_compacting_data_for_streaming_and_repair config item, which can
force-disable this compaction.
Compaction done on the output of the streaming reader does *not*
garbage-collect tombstones!

All call-sites are adjusted (the new parameter is not defaulted), but
none opt in yet. This will be done in separate commit per user.
2023-07-27 03:22:11 -04:00
Avi Kivity
ff1f461a42 Merge 'Introduce tablet load balancer' from Tomasz Grabiec
After this series, tablet replication can handle the scenario of bootstrapping new nodes. The ownership is distributed indirectly by the means of a load-balancer which moves tablets around in the background. See docs/dev/topology-over-raft.md for details.

The implementation is by no means meant to be perfect, especially in terms of performance, and will be improved incrementally.

The load balancer will be also kicked by schema changes, so that allocation/deallocation done during table creation/drop will be rebalanced.

Tablet data is streamed using existing `range_streamer`, which is the infrastructure for "the old streaming". This will be later replaced by sstable transfer once integration of tablets with compaction groups is finished. Also, cleanup is not wired yet, also blocked by compaction group integration.

Closes #14601

* github.com:scylladb/scylladb:
  tests: test_tablets: Add test for bootstraping a node
  storage_service: topology_coordinator: Implement tablet migration state machine
  tablets: Introduce tablet_mutation_builder
  service: tablet_allocator: Introduce tablet load balancer
  tablets: Introduce tablet_map::for_each_tablet()
  topology: Introduce get_node()
  token_metadata: Add non-const getter of tablet_metadata
  storage_service: Notify topology state machine after applying schema change
  storage_service: Implement stream_tablet RPC
  tablets: Introduce global_tablet_id
  stream_transfer_task, multishard_writer: Work with table sharder
  tablets: Turn tablet_id into a struct
  db: Do not create per-keyspace erm for tablet-based tables
  tablets: effective_replication_map: Take transition stage into account when computing replicas
  tablets: Store "stage" in transition info
  doc: Document tablet migration state machine and load balancer
  locator: erm: Make get_endpoints_for_reading() always return read replicas
  storage_service: topology_coordinator: Sleep on failure between retries
  storage_service: topology_coordinator: Simplify coordinator loop
  main: Require experimental raft to enable tablets
2023-07-26 12:30:29 +03:00
Botond Dénes
ad2ddffb22 Merge 'Remove qctx from system_keyspace::save_truncation_record()' from Pavel Emelyanov
The method is called by db::truncate_table_on_all_shards(), its call-chain, in turn, starts from

- proxy::remote::handle_truncate()
- schema_tables::merge_schema()
- legacy_schema_migrator
- tests

All of the above are easy to get system_keyspace reference from. This, in turn, allows making the method non-static and use query_processor reference from system_keyspace object in stead of global qctx

Closes #14778

* github.com:scylladb/scylladb:
  system_keyspace: Make save_truncation_record() non-static
  code: Pass sharded<db::system_keyspace>& to database::truncate()
  db: Add sharded<system_keyspace>& to legacy_schema_migrator
2023-07-26 08:48:49 +03:00
Tomasz Grabiec
c2b18ae483 db: Do not create per-keyspace erm for tablet-based tables
This erm is not updated when replicating token metadata in
storage_service::replicate_to_all_cores() so will pin token metadata
version and prevent token metadata barrier from finishing.

It is not necessary to have per-keyspace erm for tablet-based tables,
so just don't create it.
2023-07-25 21:08:51 +02:00
Aleksandra Martyniuk
c5cad803b3 replica: add methods to get a filtered copy of tables map 2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
ff26b2ba3f replica: add methods to check if given table exists 2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
6796721c3d replica: add methods to get table or table id 2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
e072a2341d replica: api: return table_id instead of const table_id&
Return table_id instead of const table_id& from database::find_uuid
as copying table_id does not cause much overhead and simplifies
methods signature.
2023-07-25 17:13:24 +02:00
Aleksandra Martyniuk
cdbfa0b2f5 replica: iterate safely over tables related maps
Loops over _column_families and _ks_cf_to_uuid which may preempt
are protected by reader mode of rwlock so that iterators won't
get invalid.
2023-07-25 17:13:04 +02:00
Aleksandra Martyniuk
a21d3357c3 replica: pass tables_metadata to phased_barrier_top_10_counts 2023-07-25 16:13:00 +02:00
Aleksandra Martyniuk
8842bd87c3 replica: add methods to safely add and remove table 2023-07-25 16:13:00 +02:00
Aleksandra Martyniuk
52afd9d42d replica: wrap column families related maps into tables_metadata
As a preparation for ensuring access safety for column families
related maps, add tables_metadata, access to members of which
would be protected by rwlock.
2023-07-25 16:13:00 +02:00
Aleksandra Martyniuk
395ce87eff replica: futurize database::add_column_family and database::remove
As a preparation for further changes, database::add_column_family
and database::remove return future<>.
2023-07-25 16:13:00 +02:00
Botond Dénes
a8feb7428d Merge 'semaphore mismatch: don't throw an error if both semaphores belong to user' from Michał Jadwiszczak
If semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning, log a `querier_cache_scheduling_group_mismatches` stat and drop cached reader instead of throwing an error.

Until now, semaphore mismatch was only checked in multi-partition queries.  The PR pushes the check to `querier_cache` and perform it on all `lookup_*_querier` methods.

The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.

This patch doesn't solve a problem with mismatched semaphores because of changes in service levels/scheduling groups but only mitigate it.

Refers: https://github.com/scylladb/scylla-enterprise/issues/3182
Refers: https://github.com/scylladb/scylla-enterprise/issues/3050
Closes: #14770

Closes #14736

* github.com:scylladb/scylladb:
  querier_cache: add stats of scheduling group mismatches
  querier_cache: check semaphore mismatch during querier lookup
  querier_cache: add reference to `replica::database::is_user_semaphore()`
  replica:database: add method to determine if semaphore is user one
2023-07-24 14:13:09 +03:00
Kamil Braun
e6099c4685 Merge 'config: set schema_commitlog_segment_size_in_mb to 128 ' from Patryk Jędrzejczak
Fixes #14668

In #14668, we have decided to introduce a new `scylla.yaml` variable for the schema commitlog segment size and set it to 128MB. The reason is that segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. This `schema_commitlog_segment_size_in_mb variable` variable is now added to `scylla.yaml` and `db/config`.

Additionally,  we do not derive the commitlog sync period for schema commitlog anymore because schema commitlog runs in batch mode, so it doesn't need this parameter. It has also been discussed in #14668.

Closes #14704

* github.com:scylladb/scylladb:
  replica: do not derive the commitlog sync period for schema commitlog
  config: set schema_commitlog_segment_size_in_mb to 128
  config: add schema_commitlog_segment_size_in_mb variable
2023-07-24 10:23:34 +02:00
Michał Jadwiszczak
246728cbbb querier_cache: add stats of scheduling group mismatches
Add stats to count dropped queriers because of scheduling group
mismatch.
2023-07-21 19:05:55 +02:00
Michał Jadwiszczak
a5fc53aa11 querier_cache: check semaphore mismatch during querier lookup
Previously semaphore mismatch was checked only in multi-partition
queries and if happened, an internal error was thrown.

This commit pushed the check down to `querier_cache`, so each
`lookup_*_querier` method will check for the mismatch.

What's more, if semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning and drop cached reader instead of
throwing an error.

The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.
2023-07-21 19:05:50 +02:00
Michał Jadwiszczak
e5c965b280 querier_cache: add reference to replica::database::is_user_semaphore() 2023-07-21 18:58:57 +02:00
Pavel Emelyanov
db1c6e2255 system_keyspace: Make save_truncation_record() non-static
... and stop using qctx

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-21 13:12:50 +03:00
Pavel Emelyanov
eaeffcdb81 code: Pass sharded<db::system_keyspace>& to database::truncate()
The arguments goes via the db::(drop|truncate)_table_on_all_shards()
pair of calls that start from

- storage_proxy::remote: has its sys.ks reference already
- schema_tables::merge_schema: has sys.ks argument already
- legacy_schema_migrator: the reference was added by previous patch
- tests: run in cql_test_env with sys.ks on board

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-21 13:11:59 +03:00
Michał Jadwiszczak
d7a3aa2698 replica:database: add method to determine if semaphore is user one
Add method to compare semaphore with system ones (streaming, compaction,
system read) to be able if the semaphore belongs to a user.
2023-07-20 10:24:21 +02:00
Patryk Jędrzejczak
ee1c240f2a replica: do not derive the commitlog sync period for schema commitlog
We don't want to apply the value of the commitlog_sync_period_in_ms
variable to schema commitlog. Schema commitlog runs in batch mode,
so it doesn't need this parameter.
2023-07-19 14:16:50 +02:00
Patryk Jędrzejczak
5b167a4ad7 config: add schema_commitlog_segment_size_in_mb variable
In #14668, we have decided to introduce a new scylla.yaml variable
for the schema commitlog segment size. The segment size puts a limit
on the mutation size that can be written at once, and some schema
mutation writes are much larger than average, as shown in #13864.
Therefore, increasing the schema commitlog segment size is sometimes
necessary.
2023-07-19 14:16:41 +02:00
Pavel Emelyanov
312184c0c7 keys: Move exploded_clustering_prefix's operator<< to keys.cc
Now it sits in replicate/database.cc, but the latter is overloaded with
code, worth keeping less, all the more so the ..._prefix itself lives in
the keys.hh header.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14748
2023-07-19 11:57:27 +03:00
Petr Gusev
f05ab33ee7 database.hh: make_multishard_streaming_reader with range parameter
We add an overload of make_multishard_streaming_reader
which reads all the data in the given range. We will use it later
in row level repair if --smp is different on the
nodes and the number of partitions is small.
2023-07-04 13:30:37 +03:00
Petr Gusev
614a1b3770 database.cc: extract streaming_reader_lifecycle_policy
We are going to use it later in a new
make_multishard_streaming_reader overload.
In this commit we just move it outside
into the anonymous namespace, no other code changes
were made.
2023-07-04 13:30:37 +03:00
Pavel Emelyanov
0d4c981423 database: Remove unused proxy arg from update_keyspace_on_all_shards()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-07-03 14:19:54 +03:00