Incremental selection may not work properly for LCS and ICS due to an
use-after-free bug in partitioned set which came into existence after
compound set was introduced.
The use-after-free happens because partitioned set wasn't taking into
account that the next position can become the current position in the
next iteration, which will be used by all selectors managed by
compound set. So if next position is freed, when it were being used
as current position, subsequent selectors would find the current
position freed, making them produce incorrect results.
Fix this by moving ownership of next pos from incremental_selector_impl
to incremental_selector, which makes it more robust as the latter knows
better when the selection is done with the next pos. incremental_selector
will still return ring_position_view to avoid copies.
Fixes#8802.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210611130957.156712-1-raphaelsc@scylladb.com>
In #8772, an assert validating first token <= last token
failed in leveled_manifest::overlapping.
It is unclear how we got to that state, so add validation
in sstable::set_first_and_last_keys() that the to-be-set
first and last keys are well ordered.
Otherwise, throw malformed_sstable_exception.
set_first_and_last_keys is called both on the write path
from the sstable writer before the sstable is sealed,
and on the open/load path via update_info_for_opened_data().
This series also fixes issues with unit tests with
regards to first/last keys so they won't fail the
validation.
Refs #8772
Test: unit(dev)
DTest: next-gating(dev), materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)
* tag 'validate-first-and-last-keys-ordering-v1':
sstable: validate first and last keys ordering
test: lib: reusable_sst: save unexpected errors
test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard
test: sstable_test: define primary key in schema for compressed sstable
* scylla-dev/raft-group-0-part-1-rebase:
raft: (service) pass Raft service into storage_service
raft: (service) add comments for boot steps
raft: add ordering for raft::server_address based on id
raft: (internal) simplify construction of tagged_id
raft: (internal) tagged_id minor improvements
Raft group 0 initialization and configuration changes
should be integrated with Scylla cluster assembly,
happening when starting the storage service and joining
the cluster. Prepare for this.
Since Raft service depends on query processor, and query
processor depends on storage service, to break a dependency
loop split Raft initialization into two steps: starting
an under-constructed instance of "sharded" Raft service,
accepting an under-constructed instance of "sharded"
query_processor, and then passed into storage service start
function, and then the local state of Raft groups from system
tables once query processor starts.
Consistently abbreviate raft_services instance raft_svcs, as
is the convention at Scylla.
Update the tests.
Introduce a syntax helper tagged_id::create_random_id(),
used to create a new Raft server or group id.
Provide a default ordering for tagged ids, for use
in Raft leader discovery, which selects the smallest
id for leader.
Indexed select statements fetch primary key information from
their internal materialized views and then use it to query
the base table. Unfortunately, the current mechanism for retrieving
base table rows makes it easy to overwhelm the replicas with unbounded
concurrency - the number of concurrent ops is increased exponentially
until a short read is encountered, but it's not enough to cap the
concurrency - if data is fetched row-by-row, then short reads usually
don't occur and as a result it's easy to see concurrency of 1M or
higher. In order to avoid overloading the replicas, the concurrency
of indexed queries is now capped at 4096 and additionally throttled
if enough results are already fetched. For paged queries it means that
the query returns as soon as 1MB of data is ready, and for unpaged ones
the concurrency will no longer be doubled as soon as the previous
iteration fetched 1MB of results.
The fixed 4096 value can be subject to debate, its reasoning is as follows:
for 2KiB rows, so moderately large but not huge, they result in
fetching 10MB of data, which is the granularity used by replicas.
For 200B rows, which is rather small, the result would still be
around 1MB.
At the same time, 4096 separate tasks also means 4096 allocations,
so increasing the number also strains the allocator.
Fixes#8799
Tests: unit(release),
manual: observing metrics of modified index_paging_test
Closes#8814
* github.com:scylladb/scylla:
cql3: limit the transitional result size for indexed queries
cql3: return indexed pages after 1MB worth of data
cql3: limit the concurrency of indexed statements
Unpaged indexed queries already have a concurrency limit of 4096,
but now the concurrency is further limited by previous number of bytes
fetched. Once this number reached 1MB, the concurrency will not be
increased in consecutive queries to avoid overload.
Eliminate not used includes and replace some more includes
with forward declarations where appropriate.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Fixes#8773
When refactored for cdc, properties -> extensions merge
was modified so it did not handle _removal_ (i.e. an
extension function returning null -> no entry in new map).
This causes certain enterprise extensions to not be able
to disable themselves.
Fixed by filtering existing extensions by property keywords.
Unit test added.
Closes#8774
Also drop a single violation in transport/server.cc. This helps
prevent dead code from piling up.
Three functions in row_cache_test that are not used in debug mode
are moved near their user, and under the same ifdef, to avoid triggering
the error.
Closes#8767
Currently the test is using "first_key", "last_key"
literals for the first and last keys and expects them
to sort properly with the murmur3 partitioner.
Also it does that for all generated sstables
which is less interesting for reshape.
Use token_generation_for_current_shard to
generate random, properly ordered keys.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Otherwise, the primary_key will be considered as composite,
as its length does not equal 1. That hampers token caluclation
when decorating the dirst and last keys in the summary file.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch is not backward compatible with its original,
but it's considered fine, since the original workload types were not
yet part of any release.
The changes include:
- instead of using 'unspecified' for declaring that there's no workload
type for a particular service level, NULL is used for that purpose;
NULL is the standard way of representing lack of data
- introducing a delete marker, which accompanies NULL and makes it
possible to distinguish between wanting to forcibly reset a workload
type to unspecified and not wanting to change the previous value
- updating the tests accordingly
These changes come in as a single patch, because they're intertwined
with each other and the tests for workload types are already in place;
an attempt to split them proved to be more complicated than it's worth.
Tests: unit(release)
Closes#8763
Compaction manager can start tons of compaction of fully expired sstable in
parallel, which may consume a significant amount of resources.
This problem is caused by weight being released too early in compaction, after
data is all compacted but before table is called to update its state, like
replacing sstables and so on.
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
releases all the resources.
This happens because, with expired sstable, compaction will release weight
faster than it will update table state, as there's nothing to be compacted.
With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can be easily worse depending on the amount of sstables
with fully expired data, across all tables. This high parallelism can happen
only with a couple of tables, if there are many time windows with expired data,
as they can be compacted in parallel.
Prior to 55a8b6e3c9, weight was released earlier in compaction, before
last sstable was sealed, but right now, there's no need to release weight
earlier. Weight can be released in a much simpler way, after the compaction is
actually done. So such compactions will be serialized from now on.
Fixes#8710.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>
[avi: drop now unneeded storage_service_for_tests]
user_defined_function_test fails sporadically in debug mode
due to lua timeout. Raise the timeout to avoid the failure, but
not so much that the test that expects timout becomes too slow.
Fixes#8746.
Closes#8747
This is another boring patch.
One of schema constructors has been deprecated for many years now but
was used in several places anyway. Usage of this constructor could
lead to data corruption when using MX sstables because this constructor
does not set schema version. MX reading/writing code depends on schema
version.
This patch replaces all the places the deprecated constructor is used
with schema_builder equivalent. The schema_builder sets the schema
version correctly.
Fixes#8507
Test: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4beabc8c942ebf2c1f9b09cfab7668777ce5b384.1622357125.git.piotr@scylladb.com>
Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged.
---
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
(according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
system_distributed.cdc_generation_descriptions, using the
generation's timestamp as the partition key (we call this table
the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
application state.
The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.
Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.
The commit introduces a new table for broadcasting generation data with
the following properties:
- it uses a better schema that stores the data in multiple rows, each
of manageable size
- it resides in a new keyspace that uses EverywhereStrategy so the
data will be written to every node in the cluster that has a token in
the token ring
- the data will be written using CL=ALL and read using CL=ONE; thanks
to this, restarting node won't have to communicate with other nodes
to retrieve the data of the last known generation. Note that writing
with CL=ALL does not reduce availability: creating a new generation
*requires* all nodes to be available anyway, because they must learn
about the generation before their clocks go past the generation's
timestamp; if they don't, partitions won't be mapped to stream IDs
consistently across the cluster
- the partition key is no longer the generation's timestamp. Because it
was that way in the old internal table, it forced the algorithm to
choose the timestamp *before* the generation data was inserted into
the table. What if the inserting took a long time? It increased the
chance that nodes would learn about the generation too late (after
their clocks moved past its timestamp). With the new schema we will
first insert the generation data using a randomly generated UUID as
the partition key, *then* choose the timestamp, then gossip both the
timestamp and the UUID.
Observe that after a node learns about a generation broadcasted using
this new method through gossip it will retrieve its data very quickly
since it's one of the replicas and it can use CL=ONE as it was
written using CL=ALL.
The generation's timestamp and the UUID mentioned in the last point form
a "generation identifier" for this new generation. For passing these new
identifiers around, we introduce the cdc::generation_id_v2 type.
Fixes#7961.
---
For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order.
dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/
unit tests (dev) passed locally
Closes#8643
* github.com:scylladb/scylla:
docs: update cdc.md with info about the new internal table
sys_dist_ks: don't create old CDC generations table on service initialization
sys_dist_ks: rename all_tables() to ensured_tables()
cdc: when creating new generations, use format v2 if possible
main: pass feature_service to cdc::generation_service
gms: introduce CDC_GENERATIONS_V2 feature
cdc: introduce retrieve_generation_data
test: cdc: include new generations table in permissions test
sys_dist_ks: increase timeout for create_cdc_desc
sys_dist_ks: new table for exchanging CDC generations
tree-wide: introduce cdc::generation_id_v2
This draft extends and obsoletes #8123 by introducing a way of determining the workload type from service level parameters, and then using this context to qualify requests for shedding.
The rough idea is that when the admission queue in the CQL server is hit, it might make more sense to start shedding surplus requests instead of accumulating them on the semaphore. The assumption that interactive workloads are more interested in the success rate of as many requests as possible, and hanging on a semaphore reduces the chances for a request to succeed. Thus, it may make sense to shed some requests to reduce the load on this coordinator and let the existing requests to finish.
It's a draft, because I only performed local guided tests. #8123 was followed by some experiments on a multinode cluster which I want to rerun first.
Closes#8680
* github.com:scylladb/scylla:
test: add a case for conflicting workload types
cql-pytest: add basic tests for service level workload types
docs: describe workload types for service levels
sys_dist_ks: fix redundant parsing in get_service_level
sys_dist_ks: make get_service_level exception-safe
transport: start shedding requests during potential overload
client_state: hook workload type from service levels
cql3: add listing service level workload type
cql3: add persisting service level workload type
qos: add workload_type service level parameter
The purpose of the class in question is to start sharded storage
service to make its global instance alive. I don't know when exactly
it happened but no code that instantiates this wrapper really needs
the global storage service.
Ref: #2795
tests: unit(dev), perf_sstable(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210526170454.15795-1-xemul@scylladb.com>
This warning triggers when a range for ("for (auto x : range)") causes
non-trivial copies, prompting the developer to replace with a capture
by reference. A few minor violations in the test suite are corrected.
Closes#8699
Fixes#8270
If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk _footprint_ that is over limit causing new segment allocation to stall.
We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
Also fix edge case (for tests), when we have too few segment to have an active one (i.e. need flush everything).
Closes#8695
* github.com:scylladb/scylla:
commitlog_test: Add test case for usage/disk size threshold mismatch
commitlog: Flush all segments if we only have one.
commitlog: Always force flush if segment allocation is waiting
commitlog: Include segment wasted (slack) size in footprint check
commitlog: Adjust (lower) usage threshold
The old table won't be created in clusters that are bootstrapped after
this commit. It will stay in clusters that were upgraded from a version
before this commit.
Note that a fully upgraded cluster doesn't automatically create a new
generation in the new format. Even if the last generation was created
before the upgrade, the cluster will keep using it.
A new generation will be created in the new format when either:
1. a new node bootstraps (in the new version),
2. or the user runs checkAndRepairCdcStreams, which has a new check: if
the current generation uses the old format, the command will decide
that repair is needed, even if the generation is completely fine
otherwise (also in the new version).
During upgrade, while the CDC_GENERATIONS_V2 feature is still not
enabled, the user may still bootstrap a node in the old version of
Scylla or run checkAndRepairCdcStreams on a not-yet-upgraded node. In
that case a new generation will be created in the old format,
using the old table definitions.
test_phased_barrier_reassignment has a timeout to prevent the test from
hanging on failure, but it occastionally triggers in debug mode since
the timeout is quite low (1ms). Increase the timeout to prevent false
positives. Since the timeout only expires if the test fails, it will
have no impact on execution time.
Ref #8613Closes#8692
Off-strategy compaction on a table using STCS is slow because of
the needless write amplification of 2. That's because STCS reshape
isn't taking advantage of the fact that sstables produced by
a repair-based operation are disjoint. So the ~256 input sstables
were compacted (in batches of 32) into larger sstables, which in
turn were compacted into even larger ones. That write amp is very
significant on large data sets, making the whole operation 2x
slower.
Fixes#8449.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210524213426.196407-1-raphaelsc@scylladb.com>
"
The patch set is an assorted collection of header cleanups, e.g:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places
A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).
The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).
Before:
Command being timed: "ninja dev-build"
User time (seconds): 28262.47
System time (seconds): 824.85
Percent of CPU this job got: 3979%
Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2129888
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1402838
Minor (reclaiming a frame) page faults: 124265412
Voluntary context switches: 1879279
Involuntary context switches: 1159999
Swaps: 0
File system inputs: 0
File system outputs: 11806272
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
After:
Command being timed: "ninja dev-build"
User time (seconds): 26270.81
System time (seconds): 767.01
Percent of CPU this job got: 3905%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2117608
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1400189
Minor (reclaiming a frame) page faults: 117570335
Voluntary context switches: 1870631
Involuntary context switches: 1154535
Swaps: 0
File system inputs: 0
File system outputs: 11777280
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The observed improvement is about 5% of total wall clock time
for `dev-build` target.
Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"
* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
transport: remove extraneous `qos/service_level_controller` includes from headers
treewide: remove evidently unneded storage_proxy includes from some places
service_level_controller: remove extraneous `service/storage_service.hh` include
sstables/writer: remove extraneous `service/storage_service.hh` include
treewide: remove extraneous database.hh includes from headers
treewide: reduce boost headers usage in scylla header files
cql3: remove extraneous includes from some headers
cql3: various forward declaration cleanups
utils: add missing <limits> header in `extremum_tracking.hh`
"
There are many global stuff in repair -- a bunch of pointers to
sharded services, tracker, map of metas (maybe more). This set
removes the first group, all those services had become main-local
recently. Along the way a call to global storage proxy is dropped.
To get there the repair_service is turned into a "classical"
sharded<> service, gets all the needed dependencies by references
from main and spreads them internally where needed. Tracker and other
stuff is left global, but tracker is now the candidate for merging
with the now sharded repair_service, since it emulates the sharded
concept internally.
Overall the change is
- make repair_service sharded and put all dependencies on it at start
- have sharded<repair_service> in API and storage service
- carry the service reference down to repair_info and repair_meta
constructions to give them the depedencies
- use needed services in _info and _meta methods
tests: unit(dev), dtest.repair(dev)
"
* 'br-repair-service' of https://github.com/xemul/scylla: (29 commits)
repair: Drop most of globals from repair
repair: Use local references in messaging handler checks
repair: Use local references in create_writer()
repair: Construct repair_meta with local references
repair: Keep more stuff on repair_info
repair: Kill bunch of global usages from insert_repair_meta
repair: Pass repair service down to meta insertion
repair: Keep local migration manager on repair_info
repair: Move unused db captures
repair: Remove unused ms captures
repair: Construct repair_info with service
repair: Loop over repair sharded container
repair: Make sync_data_using_repair a method
repair: Use repair from storage service
repair: Keep repair on storage service
repair: Make do_repair_start a method
repair: Pass repair_service through the API until do_repair_start
repair: Fix indentation after previous patch
repair: Split sync_data_using_repair
repair: Turn repair_range a repair_info method
...
Following Nadav's advice, instead of ignoring the test
in sanitize/debug modes, the allocator simply has a special path
of failing sufficiently large allocation requests.
With that, a problem with the address sanitizer is bypassed
and other debug mode sanitizers can inspect and check
if there are no more problems related to wrapping the original
rapidjson allocator.
Closes#8539
Todays alloc() accepts migrate-fn, size and alignment. All the callers
don't really need to provide anything special for the migrate-fn and
are just happy with default alignof() for alignment. The simplification
is in providing alloc() that only accepts size arg and does the rest
itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently `cql_test_env` runs its `func` in the default (main) group and
also leaves all scheduling groups in `dbcfg` default initialized to the
same scheduling group. This results in every part of the system,
normally isolated from each other, running in the same (default)
scheduling group. Not a big problem on its own, as we are talking about
tests, but this creates an artificial difference between the test and
the real environment, which is ever more pronounced since certain query
parameters are selected based on the current scheduling group.
To bring cql test env just that little bit closer to the real thing,
this patch creates all the scheduling groups main does (well almost) and
configures `dbcfg` with them.
Creating and destroying the scheduling group on each setup-teardown of
cql test env breaks some internal seastar components which don't like
seeing the same scheduling group with the same name but different id. So
create the scheduling groups once on first access and keep them around
until the test executable is running.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-2-bdenes@scylladb.com>
Currently `with_cql_test_env()` is equivalent to
`with_cql_test_env_thread()`, which resulted in many tests using the
former while really needing the latter and getting away with it. This
equivalence is incidental and will go away soon, so make sure all cql
test env using tests that expect to be run in a thread use the
appropriate variant.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-1-bdenes@scylladb.com>
"
The current scrub compaction has a serious drawback, while it is
very effective at removing any corruptions it recognizes, it is very
heavy-handed in its way of repairing such corruptions: it simply drops
all data that is suspected to be corrupt. While this *is* the safest way
to cleanse data, it might not be the best way from the point of view of
a user who doesn't want to loose data, even at the risk of retaining
some business-logic level corruption. Mind you, no database-level scrub
can ever fully repair data from the business-logic point of view, they
can only do so on the database-level. So in certain cases it might be
desirable to have a less heavy-handed approach of cleansing the data,
that tries as hard as it can to not loose any data.
This series introduces a new scrub mode, with the goal of addressing
this use-case: when the user doesn't want to loose any data. The new
mode is called "segregate" and it works by segregating its input into
multiple outputs such that each output contains a valid stream. This
approach can fix any out-of-order data, be that on the partition or
fragment level. Out-of-order partitions are simply written into a
separate output. Out of order fragments are handled by injecting a
partition-end/partition-start pair right before them, so that they are
now in a separate (duplicate) partition, that will just be written into
a separate output, just like a regular out-of-order partition.
The reason this series is posted as an RFC is that although I consider
the code stable and tested, there are some questions related to the UX.
* First and foremost every scrub that does more than just discard data
that is suspected to be corrupt (but even these a certain degree) have
to consider the possibility that they are rehabilitating corruptions,
leaving them in the system without a warning, in the sense that the
user won't see any more problems due to low-level corruptions and
hence might think everything is alright, while data is still corrupt
from the business logic point of view. It is very hard to draw a line
between what should and shouldn't scrub do, yet there is a demand from
users for scrub that can restore data without loosing any of it. Note
that anybody executing such a scrub is already in a bad shape, even if
they can read their data (they often can't) it is already corrupt,
scrub is not making anything worse here.
* This series converts the previous `skip_corrupted` boolean into an
enum, which now selects the scrub mode. This means that
`skip_corrupted` cannot be combined with segregate to throw out what
the former can't fix. This was chosen for simplicity, a bunch of
flags, all interacting with each other is very hard to see through in
my opinion, a linear mode selector is much more so.
* The new segregate mode goes all-in, by trying to fix even
fragment-level disorder. Maybe it should only do it on the partition
level, or maybe this should be made configurable, allowing the user to
select what to happen with those data that cannot be fixed.
Tests: unit(dev), unit(sstable_datafile_test:debug)
"
* 'sstable-scrub-segregate-by-partition/v1' of https://github.com/denesb/scylla:
test: boost/sstable_datafile_test: add tests for segregate mode scrub
api: storage_service/keyspace_scrub: expose new segregate mode
sstables: compaction/scrub: add segregate mode
mutation_fragment_stream_validator: add reset methods
mutation_writer: add segregate_by_partition
api: /storage_service/keyspace_scrub: add scrub mode param
sstables: compaction/scrub: replace skip_corrupted with mode enum
sstables: compaction/scrub: prevent infinite loop when last partition end is missing
tests: boost/sstable_datafile_test: use the same permit for all fragments in scrub tests
Storage service calls a bunch of do_something_with_repair() methods. All
of them need the local repair_service and the only way to get it is by
keeping it on storage service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Uses the infrastructure for testing mutation_sources, but only a
subset of it which does not do fast forwarding (since virtual_table
does not support it).
utils::phased_barrier holds a `lw_shared_ptr<gate>` that is
typically `enter()`ed in `phased_barrier::start()`,
and left when the operation is destroyed in `~operation`.
Currently, the operation move-assign implementation is the
default one that just moves the lw_shared gate ptr from the
other operation into this one, without calling `_gate->leave()` first.
This change first destroys *this when move-assigned (if not self)
to call _gate->leave() if engaged, before reassigning the
gate with the other operation::_gate.
A unit test that reproduces the issue before this change
and passes with the fix was added to serialized_action_test.
Fixes#8613
Test: unit(dev), serialized_action_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210510120703.1520328-1-bhalevy@scylladb.com>
"
The current printout is has multiple problems:
* It is segregated by state, each having its own sorting criteria;
* Number of permits and count resources is collapsed in to a single
column, not clear which is the one printed.
* Number of available/initial units of the semaphore are not printed;
This series solves all this problems:
* It merges all states into a single table, sorted by memory
consumption, in descending order.
* It separates number of permits and count resources into separate
columns.
* Prints a summary of the semaphore units.
* Provides a cap on the maximum amount of printable lines, to not blow
up the logs.
The goal of all this is to make it easy to find the culprit a semaphore
problem: easily spot the big memory consumers, then unpack the name
column to determine which table and code path is responsible.
This brings the printout close to the recently `scylla reads`
scylla-gdb.py command, providing a uniform report format across the two
tools.
Example report:
INFO 2021-05-07 09:52:16,806 [shard 0] testlog - With max-lines=4: Semaphore reader_concurrency_semaphore_dump_reader_diganostics with 8/2147483647 count and 263599186/9223372036854775807 memory resources: user request, dumping permit diagnostics:
permits count memory table/description/state
7 2 77M ks.tbl1/op1/active
6 3 59M ks.tbl1/op0/active
4 0 36M ks.tbl1/op2/active
3 1 36M ks.tbl0/op2/active
11 2 43M permits omitted for brevity
31 8 251M total
"
* 'reader-concurrency-semaphore-dump-improvement/v1' of https://github.com/denesb/scylla:
test: reader_concurrency_test: add reader_concurrency_semaphore_dump_reader_diganostics
reader_concurrency_semaphore: dump_reader_diagnostics(): print more information in the header
reader_concurrency_semaphore: dump_reader_diagnostics(): cap number of printed lines
reader_concurrency_semaphore: dump_reader_diagnostics(): sort lines in descending order
reader_concurrency_semaphore: dump_reader_diagnostics(): merge all states into a single table
reader_concurrency_semaphore: dump_reader_diagnostics(): separate number of permits and count resources
In commit 3e39985c7a we added the Cassandra-compatible system table
system."IndexInfo" (note the capitalized table name) which lists built
indexes. Because we already had a table of built materialized views, and
indexes are implemented as materialized views, the index list was
implemented as a virtual table based on the view list.
However, the *name* of each materialized view listed in the list of
views looks like something_index, with the suffix "_index", while the
name of the table we need to print is "something". We forgot to do this
transformation in the virtual table - and this is what this patch does.
This bug can confuse applications which use this system table to wait for
an index to be built. Several tests translated from Cassandra's unit
tests, in cassandra_tests/validation/entities/secondary_index_test.py fail
in wait_for_index() because of this incompatibility, and pass after this
patch.
This patch also changes the unit test that enshrined the previous, wrong,
behavior, to test for the correct behavior. This problem is typical of
C++ unit tests which cannot be run against Cassandra.
Fixes#8600
Unfortunately, although this patch fixes "typical" applications (including
all tests which I tried) - applications which read from IndexInfo in a
"typical" method to look for a specific index being ready, the
implementation is technically NOT correct: The problem is that index
names are not sorted in the right order, because they are sorted with
the "_index" prefix.
To give an example, the index names "a" should be listed before "a1", but
the view names "a1_index" comes before "a_index" (because in ASCII, 1
comes before underscore). I can't think of any way to fix this bug
without completely reimplementing IndexInfo in a different way - probably
based on a temporary memtable (which is fine as this is not a
performance-critical operation). We'll need to do this rewrite eventually,
and I'll open a new issue.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210509140113.1084497-1-nyh@scylladb.com>