There will appear the future-less method which better deserves
the get_ prefix, so give the existing method the load_ one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
compaction_info must only contain info data to be exported to the
outside world, whereas compaction_data will contain data for
controlling compaction behavior and stats which change as
compaction progresses.
This separation makes the interface clearer, also allowing for
future improvements like removing direct references to table
in compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The api::set_server_config() depends on sharded database to start, but
really doesn't need it -- it needs only the db::config object which's
available earlier.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In anticipation of making system_keyspace a class instead of a
namespace, rename any member that is currently forward-declared,
since one can't forward-declare a class member. Each member
is taken out of the system_keyspace namespace and gains a
system_keyspace prefix. Aliases are added to reduce code churn.
The result isn't lovely, but can be adjusted later.
the name compaction_options is confusing as it overlaps in meaning
with compaction_descriptor. hard to reason what are the exact
difference between them, without digging into the implementation.
compaction_options is intended to only carry options specific to
a give compaction type, like a mode for scrub, so let's rename
it to compaction_type_options to make it clearer for the
readers.
[avi: adjust for scrub changes]
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210908003934.152054-1-raphaelsc@scylladb.com>
Some state accessors called get_local_gossiper(); this is removed
and replaced with a parameter. Some callers (redis, alternators)
now have the gossiper passed as a parameter during initialization
so they can use the adjusted API.
Pass down gossiper from main, converting it to a shard-local instance
in calls to register_api() (which is the point that broadcasts the
endpoint registration across shards).
This helps remove gossiper as a global variable.
Use a forward declaration of cql3::expr::oper_t to reduce the
number of translation units depending on expression.hh.
Before:
$ find build/dev -name '*.d' | xargs cat | grep -c expression.hh
272
After:
$ find build/dev -name '*.d' | xargs cat | grep -c expression.hh
154
Some translation units adjust their includes to restore access
to required headers.
Closes#9229
"
Validation compaction -- although I still maintain that it is a good
descriptive name -- was an unfortunate choice for the underlying
functionality because Origin has burned the name already as it uses it
for a compaction type used during repair. This opens the door for
confusion for users coming from Cassandra who will associate Validation
compaction with the purpose it is used for in Origin.
Additionally, since Origin's validation compaction was not user
initiated, it didn't have a corresponding `nodetool` command to start
it. Adding such a command would create an operational difference between
us and Origin.
To avoid all this we fold validation compaction into scrub compaction,
under a new "validation" mode. I decided against using the also
suggested `--dry-mode` flag as I feel that a new mode is a more natural
choice, we don't have to define how it interacts with all the other
modes, unlike with a `--dry-mode` flag.
Fixes: #7736
Tests: unit(dev), manual(REST API)
"
* 'scrub-validation-mode/v2' of https://github.com/denesb/scylla:
compaction/compaction_descriptor: add comment to Validation compaction type
compaction/compaction_descriptor: compaction_options: remove validate
api: storage_service: validate_keyspace -> scrub_keyspace (validate mode)
compaction/compaction_manager: hide perform_sstable_validation()
compaction: validation compaction -> scrub compaction (validate mode)
compaction/compaction_descriptor: compaction_options: add options() accessor
compaction/compaction_descriptor: compaction_options::scrub::mode: add validate
Adds HTTP endpoints for manipulating hint sync points:
- /hinted_handoff/sync_point (POST) - creates a new sync point for
hints towards nodes listed in the `target_hosts` parameter
- /hinted_handoff/sync_point (GET) - checks the status of the sync
point. If a non-zero `timeout` parameter is given, it waits until the
sync point is reached or the timeout expires.
Registration of the currently unused hinted handoff endpoints is moved
out from the set_server_done function. They are now explicitly
registered in main.cc by calling api::set_hinted_handoff and also
uninitialized by calling api::unset_hinted_handoff.
Setting/unsetting HTTP API separately will allow to pass a reference to
the sync_point_service without polluting the set_server_done function.
We are folding validation compaction into scrub (at least on the
interface level), so remove the validation entry point accordingly and
have users go through `perform_sstable_scrub()` instead.
The reference in question is already there, handlers that need
storage service can capture it and use. These handlers are not
yet stopped, but neither is the storage service itself, so the
potentially dangling reference is not being set up here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both set_server_storage_service and set_server_storage_proxy set up
API handlers that need storage service to work. Now they all call for
global storage service instance, but it's better if they receive one
from main. This patch carries the sharded storage service reference
down to handlers setting function, next patch will make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Partition count is of a type size_t but we use std::plus<int>
to reduce values of partition count in various column families.
This patch changes the argument of std::plus to the right type.
Using std::plus<int> for size_t compiles but does not work as expected.
For example plus<int>(2147483648LL, 1LL) = -2147483647 while the code
would probably want 2147483649.
Fixes#9090
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes#9074
In an upcoming commit I will add "system.describe_ring" table which uses
endpoint's inet address as a part of CK and, therefore, needs to keep them
sorted with `inet_addr_type::less`.
This warning prevents using std::move() where it can hurt
- on an unnamed temporary or a named automatic variable being
returned from a function. In both cases the value could be
constructed directly in its final destination, but std::move()
prevents it.
Fix the handful of cases (all trivial), and enable the warning.
Closes#8992
Since compaction is layered on top of sstables, let's move all compaction code
into a new top-level directory.
This change will give me extra motivation to remove all layer violations, like
sstable calling compaction-specific code, and compaction entanglement with
other components like table and storage service.
Next steps:
- remove all layer violations
- move compaction code in sstables namespace into a new one for compaction.
- move compaction unit tests into its own file
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>
Cassandra 3.0 deprecated the 'sstable_compression' attribute and added
'class' as a replacement. Follow by supporting both.
The SSTABLE_COMPRESSION variable is renamed to SSTABLE_COMPRESSION_DEPRECATED
to detect all uses and prevent future misuse.
To prevent old-version nodes from seeing the new name, the
compression_parameters class preserves the key name when it is
constructed from an options map, and emits the same key name when
asked to generate an options map.
Existing unit tests are modified to use the new name, and a test
is added to ensure the old name is still supported.
Fixes#8948.
Closes#8949
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
[(-pp | --print-port)] [(-pw <password> | --password <password>)]
[(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
[(-u <username> | --username <username>)] snapshot
[(-cf <table> | --column-family <table> | --table <table>)]
[(-kc <kclist> | --kc.list <kclist>)]
[(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]
-sf / –skip-flush Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```
But is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the api level.
This patch adds support for the option in advance
from the api service level down via snapshot_ctl
to the table class and snapshot implementation.
In addition, a corresponding unit test was added to verify
that taking a snapshot with `skip_flush` does not flush the memtable
(at the table::snapshot level).
Refs #8725Closes#8726
* github.com:scylladb/scylla:
test: database_test: add snapshot_skip_flush_works
api: storage_service/snapshots: support skip-flush option
snapshot: support skip_flush option
table: snapshot: add skip_flush option
api: storage_service/snapshots: add sf (skip_flush) option
Eliminate not used includes and replace some more includes
with forward declarations where appropriate.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
[(-pp | --print-port)] [(-pw <password> | --password <password>)]
[(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
[(-u <username> | --username <username>)] snapshot
[(-cf <table> | --column-family <table> | --table <table>)]
[(-kc <kclist> | --kc.list <kclist>)]
[(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]
-sf / –skip-flush Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```
But is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the api level.
This patch wires the skip_flush option support to the
REST API.
Fixes#8725
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Note: I tried adding the option and calling it "skip_flush"
but I couldn't make it work with scylla-jmx, hence it's
called by the abbreviated name - "sf".
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The reset_local_schema call needs proxy and feature service to do its
job. Right now the features are retrived from global storage service,
but they are present on the proxy as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Instead of getting database from global storage service it's simpler
and better to grab it from the http context at hands.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Clang warns when "return std::move(x)" is needed to elide a copy,
but the call to std::move() is missing. We disabled the warning during
the migration to clang. This patch re-enables the warning and fixes
the places it points out, usually by adding std::move() and in one
place by converting the returned variable from a reference to a local,
so normal copy elision can take place.
Closes#8739
"
The patch set is an assorted collection of header cleanups, e.g:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places
A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).
The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).
Before:
Command being timed: "ninja dev-build"
User time (seconds): 28262.47
System time (seconds): 824.85
Percent of CPU this job got: 3979%
Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2129888
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1402838
Minor (reclaiming a frame) page faults: 124265412
Voluntary context switches: 1879279
Involuntary context switches: 1159999
Swaps: 0
File system inputs: 0
File system outputs: 11806272
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
After:
Command being timed: "ninja dev-build"
User time (seconds): 26270.81
System time (seconds): 767.01
Percent of CPU this job got: 3905%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2117608
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1400189
Minor (reclaiming a frame) page faults: 117570335
Voluntary context switches: 1870631
Involuntary context switches: 1154535
Swaps: 0
File system inputs: 0
File system outputs: 11777280
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The observed improvement is about 5% of total wall clock time
for `dev-build` target.
Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"
* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
transport: remove extraneous `qos/service_level_controller` includes from headers
treewide: remove evidently unneded storage_proxy includes from some places
service_level_controller: remove extraneous `service/storage_service.hh` include
sstables/writer: remove extraneous `service/storage_service.hh` include
treewide: remove extraneous database.hh includes from headers
treewide: reduce boost headers usage in scylla header files
cql3: remove extraneous includes from some headers
cql3: various forward declaration cleanups
utils: add missing <limits> header in `extremum_tracking.hh`
"
There are many global stuff in repair -- a bunch of pointers to
sharded services, tracker, map of metas (maybe more). This set
removes the first group, all those services had become main-local
recently. Along the way a call to global storage proxy is dropped.
To get there the repair_service is turned into a "classical"
sharded<> service, gets all the needed dependencies by references
from main and spreads them internally where needed. Tracker and other
stuff is left global, but tracker is now the candidate for merging
with the now sharded repair_service, since it emulates the sharded
concept internally.
Overall the change is
- make repair_service sharded and put all dependencies on it at start
- have sharded<repair_service> in API and storage service
- carry the service reference down to repair_info and repair_meta
constructions to give them the depedencies
- use needed services in _info and _meta methods
tests: unit(dev), dtest.repair(dev)
"
* 'br-repair-service' of https://github.com/xemul/scylla: (29 commits)
repair: Drop most of globals from repair
repair: Use local references in messaging handler checks
repair: Use local references in create_writer()
repair: Construct repair_meta with local references
repair: Keep more stuff on repair_info
repair: Kill bunch of global usages from insert_repair_meta
repair: Pass repair service down to meta insertion
repair: Keep local migration manager on repair_info
repair: Move unused db captures
repair: Remove unused ms captures
repair: Construct repair_info with service
repair: Loop over repair sharded container
repair: Make sync_data_using_repair a method
repair: Use repair from storage service
repair: Keep repair on storage service
repair: Make do_repair_start a method
repair: Pass repair_service through the API until do_repair_start
repair: Fix indentation after previous patch
repair: Split sync_data_using_repair
repair: Turn repair_range a repair_info method
...
"
The current scrub compaction has a serious drawback, while it is
very effective at removing any corruptions it recognizes, it is very
heavy-handed in its way of repairing such corruptions: it simply drops
all data that is suspected to be corrupt. While this *is* the safest way
to cleanse data, it might not be the best way from the point of view of
a user who doesn't want to loose data, even at the risk of retaining
some business-logic level corruption. Mind you, no database-level scrub
can ever fully repair data from the business-logic point of view, they
can only do so on the database-level. So in certain cases it might be
desirable to have a less heavy-handed approach of cleansing the data,
that tries as hard as it can to not loose any data.
This series introduces a new scrub mode, with the goal of addressing
this use-case: when the user doesn't want to loose any data. The new
mode is called "segregate" and it works by segregating its input into
multiple outputs such that each output contains a valid stream. This
approach can fix any out-of-order data, be that on the partition or
fragment level. Out-of-order partitions are simply written into a
separate output. Out of order fragments are handled by injecting a
partition-end/partition-start pair right before them, so that they are
now in a separate (duplicate) partition, that will just be written into
a separate output, just like a regular out-of-order partition.
The reason this series is posted as an RFC is that although I consider
the code stable and tested, there are some questions related to the UX.
* First and foremost every scrub that does more than just discard data
that is suspected to be corrupt (but even these a certain degree) have
to consider the possibility that they are rehabilitating corruptions,
leaving them in the system without a warning, in the sense that the
user won't see any more problems due to low-level corruptions and
hence might think everything is alright, while data is still corrupt
from the business logic point of view. It is very hard to draw a line
between what should and shouldn't scrub do, yet there is a demand from
users for scrub that can restore data without loosing any of it. Note
that anybody executing such a scrub is already in a bad shape, even if
they can read their data (they often can't) it is already corrupt,
scrub is not making anything worse here.
* This series converts the previous `skip_corrupted` boolean into an
enum, which now selects the scrub mode. This means that
`skip_corrupted` cannot be combined with segregate to throw out what
the former can't fix. This was chosen for simplicity, a bunch of
flags, all interacting with each other is very hard to see through in
my opinion, a linear mode selector is much more so.
* The new segregate mode goes all-in, by trying to fix even
fragment-level disorder. Maybe it should only do it on the partition
level, or maybe this should be made configurable, allowing the user to
select what to happen with those data that cannot be fixed.
Tests: unit(dev), unit(sstable_datafile_test:debug)
"
* 'sstable-scrub-segregate-by-partition/v1' of https://github.com/denesb/scylla:
test: boost/sstable_datafile_test: add tests for segregate mode scrub
api: storage_service/keyspace_scrub: expose new segregate mode
sstables: compaction/scrub: add segregate mode
mutation_fragment_stream_validator: add reset methods
mutation_writer: add segregate_by_partition
api: /storage_service/keyspace_scrub: add scrub mode param
sstables: compaction/scrub: replace skip_corrupted with mode enum
sstables: compaction/scrub: prevent infinite loop when last partition end is missing
tests: boost/sstable_datafile_test: use the same permit for all fragments in scrub tests
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
Add direct support to the newly added scrub mode enum. Instead of the
legacy `skip_corrupted` flag, one can now select the desired mode from
the mode enum. `skip_corrupted` is still supported for backwards
compatibility but it is ignored when the mode enum is set.
* seastar 48376c76a...72e3baed9 (3):
> file: Add RFW_NOWAIT detection case for AuFS
> sharded: provide type info on no sharded instance exception
> iotune: Estimate accuarcy of measurement
Added missing include "database.hh" to api/lsa.cc since seastar::sharded<>
now needs full type information.
Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing to specify a list of families.
We provide three ways for filtering in the query parameter "name_list":
1. A specific column family to include in the form "ks:cf"
2. A keyspace, telling the server to include all column families in it.
Specified by omitting the cf name, i.e. "ks:"
3. All column families, which is represented by an empty list
The list can include any amount of one or both of the 1. and 2. option.
Fixes#4520Closes#7864
* 'preparatory_work_for_compound_set' of github.com:raphaelsc/scylla:
sstable_set: move all() implementation into sstable_set_impl
sstable_set: preparatory work to change sstable_set::all() api
sstables: remove bag_sstable_set
users of sstable_set::all() rely on the set itself keeping a reference
to the returned list, so user can iterate through the list assuming
that it is alive all the way through.
this will change in the future though, because there will be a
compound set impl which will have to merge the all() of multiple
managed sets, and the result is a temporary value.
so even range-based loops on all() have to keep a ref to the returned
list, to avoid the list from being prematurely destroyed.
so the following code
for (auto& sst : *sstable_set.all()) { ...}
becomes
for (auto sstables = sstable_set.all(); auto& sst : *sstables) { ... }
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>