Commit Graph

1005 Commits

Author SHA1 Message Date
Raphael S. Carvalho
846f0bd16e sstables: Fix incremental selection with compound sstable set
Incremental selection may not work properly for LCS and ICS due to an
use-after-free bug in partitioned set which came into existence after
compound set was introduced.

The use-after-free happens because partitioned set wasn't taking into
account that the next position can become the current position in the
next iteration, which will be used by all selectors managed by
compound set. So if next position is freed, when it were being used
as current position, subsequent selectors would find the current
position freed, making them produce incorrect results.

Fix this by moving ownership of next pos from incremental_selector_impl
to incremental_selector, which makes it more robust as the latter knows
better when the selection is done with the next pos. incremental_selector
will still return ring_position_view to avoid copies.

Fixes #8802.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210611130957.156712-1-raphaelsc@scylladb.com>
2021-06-13 16:45:07 +03:00
Tomasz Grabiec
419ee84d86 Merge "sstable: validate first and last keys ordering" from Benny
In #8772, an assert validating first token <= last token
failed in leveled_manifest::overlapping.

It is unclear how we got to that state, so add validation
in sstable::set_first_and_last_keys() that the to-be-set
first and last keys are well ordered.
Otherwise, throw malformed_sstable_exception.

set_first_and_last_keys is called both on the write path
from the sstable writer before the sstable is sealed,
and on the open/load path via update_info_for_opened_data().

This series also fixes issues with unit tests with
regards to first/last keys so they won't fail the
validation.

Refs #8772

Test: unit(dev)
DTest: next-gating(dev), materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)

* tag 'validate-first-and-last-keys-ordering-v1':
  sstable: validate first and last keys ordering
  test: lib: reusable_sst: save unexpected errors
  test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard
  test: sstable_test: define primary key in schema for compressed sstable
2021-06-09 14:43:02 +02:00
Tomasz Grabiec
ce7a404f17 Merge "Cleanups/refactoring for Raft Group 0" from Kostja
* scylla-dev/raft-group-0-part-1-rebase:
  raft: (service) pass Raft service into storage_service
  raft: (service) add comments for boot steps
  raft: add ordering for raft::server_address based on id
  raft: (internal) simplify construction of tagged_id
  raft: (internal) tagged_id minor improvements
2021-06-09 10:48:05 +02:00
Konstantin Osipov
267a8e99ad raft: (service) pass Raft service into storage_service
Raft group 0 initialization and configuration changes
should be integrated with Scylla cluster assembly,
happening when starting the storage service and joining
the cluster. Prepare for this.

Since Raft service depends on query processor, and query
processor depends on storage service, to break a dependency
loop split Raft initialization into two steps: starting
an under-constructed instance of "sharded" Raft service,
accepting an under-constructed instance of "sharded"
query_processor, and then passed into storage service start
function, and then the local state of Raft groups from system
tables once query processor starts.

Consistently abbreviate raft_services instance raft_svcs, as
is the convention at Scylla.

Update the tests.
2021-06-08 14:52:32 +03:00
Konstantin Osipov
c9a23e9b8a raft: (internal) tagged_id minor improvements
Introduce a syntax helper tagged_id::create_random_id(),
used to create a new Raft server or group id.

Provide a default ordering for tagged ids, for use
in Raft leader discovery, which selects the smallest
id for leader.
2021-06-08 14:52:32 +03:00
Avi Kivity
3e3003fcc1 Merge 'cql3: limit the concurrency of indexed statements' from Piotr Sarna
Indexed select statements fetch primary key information from
their internal materialized views and then use it to query
the base table. Unfortunately, the current mechanism for retrieving
base table rows makes it easy to overwhelm the replicas with unbounded
concurrency - the number of concurrent ops is increased exponentially
until a short read is encountered, but it's not enough to cap the
concurrency - if data is fetched row-by-row, then short reads usually
don't occur and as a result it's easy to see concurrency of 1M or
higher. In order to avoid overloading the replicas, the concurrency
of indexed queries is now capped at 4096 and additionally throttled
if enough results are already fetched. For paged queries it means that
the query returns as soon as 1MB of data is ready, and for unpaged ones
the concurrency will no longer be doubled as soon as the previous
iteration fetched 1MB of results.

The fixed 4096 value can be subject to debate, its reasoning is as follows:
for 2KiB rows, so moderately large but not huge, they result in
fetching 10MB of data, which is the granularity used by replicas.
For 200B rows, which is rather small, the result would still be
around 1MB.
At the same time, 4096 separate tasks also means 4096 allocations,
so increasing the number also strains the allocator.

Fixes #8799

Tests: unit(release),
       manual: observing metrics of modified index_paging_test

Closes #8814

* github.com:scylladb/scylla:
  cql3: limit the transitional result size for indexed queries
  cql3: return indexed pages after 1MB worth of data
  cql3: limit the concurrency of indexed statements
2021-06-07 18:00:51 +03:00
Piotr Sarna
df0d44486a cql3: limit the transitional result size for indexed queries
Unpaged indexed queries already have a concurrency limit of 4096,
but now the concurrency is further limited by previous number of bytes
fetched. Once this number reached 1MB, the concurrency will not be
increased in consecutive queries to avoid overload.
2021-06-07 16:29:18 +02:00
Pavel Solodovnikov
76bea23174 treewide: reduce header interdependencies
Use forward declarations wherever possible.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>

Closes #8813
2021-06-07 15:58:35 +03:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Pavel Solodovnikov
e0749d6264 treewide: some random header cleanups
Eliminate not used includes and replace some more includes
with forward declarations where appropriate.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-06 19:18:49 +03:00
Benny Halevy
3f9bad0f0a test: compound_test: use tests::random
For reproducibility.

Test: compound_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210602061910.286893-2-bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Benny Halevy
40e032ff8b test: compound_test: use to seastar test framework
Prepare for using tests::random instead of std::rand
for reproducibility.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210602061910.286893-1-bhalevy@scylladb.com>
2021-06-06 09:21:23 +03:00
Calle Wilund
3b55ef36d1 cf_prop_defs: Fix extensions merge to handle removal
Fixes #8773

When refactored for cdc, properties -> extensions merge
was modified so it did not handle _removal_ (i.e. an
extension function returning null -> no entry in new map).

This causes certain enterprise extensions to not be able
to disable themselves.

Fixed by filtering existing extensions by property keywords.
Unit test added.

Closes #8774
2021-06-06 09:21:23 +03:00
Avi Kivity
100d6f4094 build: enable -Wunused-function
Also drop a single violation in transport/server.cc. This helps
prevent dead code from piling up.

Three functions in row_cache_test that are not used in debug mode
are moved near their user, and under the same ifdef, to avoid triggering
the error.

Closes #8767
2021-06-06 09:21:23 +03:00
Benny Halevy
9452b99b40 test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard
Currently the test is using "first_key", "last_key"
literals for the first and last keys and expects them
to sort properly with the murmur3 partitioner.
Also it does that for all generated sstables
which is less interesting for reshape.

Use token_generation_for_current_shard to
generate random, properly ordered keys.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-02 12:25:29 +03:00
Benny Halevy
d5405dade7 test: sstable_test: define primary key in schema for compressed sstable
Otherwise, the primary_key will be considered as composite,
as its length does not equal 1.  That hampers token caluclation
when decorating the dirst and last keys in the summary file.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-02 12:25:29 +03:00
Piotr Sarna
389a0a52c9 treewide: revamp workload type for service levels
This patch is not backward compatible with its original,
but it's considered fine, since the original workload types were not
yet part of any release.
The changes include:
 - instead of using 'unspecified' for declaring that there's no workload
   type for a particular service level, NULL is used for that purpose;
   NULL is the standard way of representing lack of data
 - introducing a delete marker, which accompanies NULL and makes it
   possible to distinguish between wanting to forcibly reset a workload
   type to unspecified and not wanting to change the previous value
 - updating the tests accordingly

These changes come in as a single patch, because they're intertwined
with each other and the tests for workload types are already in place;
an attempt to split them proved to be more complicated than it's worth.

Tests: unit(release)

Closes #8763
2021-05-31 18:18:33 +03:00
Raphael S. Carvalho
a7cdd846da compaction: Prevent tons of compaction of fully expired sstable from happening in parallel
Compaction manager can start tons of compaction of fully expired sstable in
parallel, which may consume a significant amount of resources.
This problem is caused by weight being released too early in compaction, after
data is all compacted but before table is called to update its state, like
replacing sstables and so on.
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
releases all the resources.

This happens because, with expired sstable, compaction will release weight
faster than it will update table state, as there's nothing to be compacted.

With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can be easily worse depending on the amount of sstables
with fully expired data, across all tables. This high parallelism can happen
only with a couple of tables, if there are many time windows with expired data,
as they can be compacted in parallel.

Prior to 55a8b6e3c9, weight was released earlier in compaction, before
last sstable was sealed, but right now, there's no need to release weight
earlier. Weight can be released in a much simpler way, after the compaction is
actually done. So such compactions will be serialized from now on.

Fixes #8710.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>

[avi: drop now unneeded storage_service_for_tests]
2021-05-30 23:22:51 +03:00
Avi Kivity
791412b046 test: user_defined_function_test: raise Lua timeout
user_defined_function_test fails sporadically in debug mode
due to lua timeout. Raise the timeout to avoid the failure, but
not so much that the test that expects timout becomes too slow.

Fixes #8746.

Closes #8747
2021-05-30 13:10:57 +03:00
Piotr Jastrzebski
76d7c761d1 schema: Stop using deprecated constructor
This is another boring patch.

One of schema constructors has been deprecated for many years now but
was used in several places anyway. Usage of this constructor could
lead to data corruption when using MX sstables because this constructor
does not set schema version. MX reading/writing code depends on schema
version.

This patch replaces all the places the deprecated constructor is used
with schema_builder equivalent. The schema_builder sets the schema
version correctly.

Fixes #8507

Test: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <4beabc8c942ebf2c1f9b09cfab7668777ce5b384.1622357125.git.piotr@scylladb.com>
2021-05-30 11:58:27 +03:00
Avi Kivity
d3e5b37059 Revert "Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund"
This reverts commit e9c940dbbc, reversing
changes made to 6144656b25. Since it was
merged commitlog_test consistently times out in debug mode.
2021-05-27 21:16:26 +03:00
Michał Chojnowski
5e9f741bb4 repair: remove range_split.hh
Dead code since 80ebedd242.

Closes #8698
2021-05-27 17:21:37 +03:00
Avi Kivity
5f8484897b Merge 'cdc: use a new internal table for exchanging generations' from Kamil Braun
Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged.

---
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
   irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
   (according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
   system_distributed.cdc_generation_descriptions, using the
   generation's timestamp as the partition key (we call this table
   the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
   application state.

The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.

Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.

The commit introduces a new table for broadcasting generation data with
the following properties:
-  it uses a better schema that stores the data in multiple rows, each
   of manageable size
-  it resides in a new keyspace that uses EverywhereStrategy so the
   data will be written to every node in the cluster that has a token in
   the token ring
-  the data will be written using CL=ALL and read using CL=ONE; thanks
   to this, restarting node won't have to communicate with other nodes
   to retrieve the data of the last known generation. Note that writing
   with CL=ALL does not reduce availability: creating a new generation
   *requires* all nodes to be available anyway, because they must learn
   about the generation before their clocks go past the generation's
   timestamp; if they don't, partitions won't be mapped to stream IDs
   consistently across the cluster
-  the partition key is no longer the generation's timestamp. Because it
   was that way in the old internal table, it forced the algorithm to
   choose the timestamp *before* the generation data was inserted into
   the table. What if the inserting took a long time? It increased the
   chance that nodes would learn about the generation too late (after
   their clocks moved past its timestamp). With the new schema we will
   first insert the generation data using a randomly generated UUID as
   the partition key, *then* choose the timestamp, then gossip both the
   timestamp and the UUID.
   Observe that after a node learns about a generation broadcasted using
   this new method through gossip it will retrieve its data very quickly
   since it's one of the replicas and it can use CL=ONE as it was
   written using CL=ALL.

The generation's timestamp and the UUID mentioned in the last point form
a "generation identifier" for this new generation. For passing these new
identifiers around, we introduce the cdc::generation_id_v2 type.

Fixes #7961.

---

For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order.

dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/
unit tests (dev) passed locally

Closes #8643

* github.com:scylladb/scylla:
  docs: update cdc.md with info about the new internal table
  sys_dist_ks: don't create old CDC generations table on service initialization
  sys_dist_ks: rename all_tables() to ensured_tables()
  cdc: when creating new generations, use format v2 if possible
  main: pass feature_service to cdc::generation_service
  gms: introduce CDC_GENERATIONS_V2 feature
  cdc: introduce retrieve_generation_data
  test: cdc: include new generations table in permissions test
  sys_dist_ks: increase timeout for create_cdc_desc
  sys_dist_ks: new table for exchanging CDC generations
  tree-wide: introduce cdc::generation_id_v2
2021-05-27 17:13:44 +03:00
Avi Kivity
e8e4456ec7 Merge 'Introduce per-service-level workload types and their first use-case - shedding in interactive workloads' from Piotr Sarna
This draft extends and obsoletes #8123 by introducing a way of determining the workload type from service level parameters, and then using this context to qualify requests for shedding.

The rough idea is that when the admission queue in the CQL server is hit, it might make more sense to start shedding surplus requests instead of accumulating them on the semaphore. The assumption that interactive workloads are more interested in the success rate of as many requests as possible, and hanging on a semaphore reduces the chances for a request to succeed. Thus, it may make sense to shed some requests to reduce the load on this coordinator and let the existing requests to finish.

It's a draft, because I only performed local guided tests. #8123 was followed by some experiments on a multinode cluster which I want to rerun first.

Closes #8680

* github.com:scylladb/scylla:
  test: add a case for conflicting workload types
  cql-pytest: add basic tests for service level workload types
  docs: describe workload types for service levels
  sys_dist_ks: fix redundant parsing in get_service_level
  sys_dist_ks: make get_service_level exception-safe
  transport: start shedding requests during potential overload
  client_state: hook workload type from service levels
  cql3: add listing service level workload type
  cql3: add persisting service level workload type
  qos: add workload_type service level parameter
2021-05-27 17:01:56 +03:00
Piotr Sarna
99f356d764 test: add a case for conflicting workload types
The test case verifies that if several workload types are effective
for a single role, the conflict resolution is well defined.
2021-05-27 14:31:36 +02:00
Pavel Emelyanov
d2442a1bb3 tests: Ditch storage_service_for_tests
The purpose of the class in question is to start sharded storage
service to make its global instance alive. I don't know when exactly
it happened but no code that instantiates this wrapper really needs
the global storage service.

Ref: #2795
tests: unit(dev), perf_sstable(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210526170454.15795-1-xemul@scylladb.com>
2021-05-27 14:39:13 +03:00
Piotr Sarna
762e2f48f2 cql3: add listing service level workload type
The workload type information is now presented in the output
of LIST SERVICE LEVEL and LIST ALL SERVICE LEVELS statements.
2021-05-27 13:02:22 +02:00
Avi Kivity
e2e723cc4c build: enable -Wrange-loop-construct warning
This warning triggers when a range for ("for (auto x : range)") causes
non-trivial copies, prompting the developer to replace with a capture
by reference. A few minor violations in the test suite are corrected.

Closes #8699
2021-05-26 10:32:56 +03:00
Avi Kivity
e9c940dbbc Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund
Fixes #8270

If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk _footprint_ that is over limit causing new segment allocation to stall.

We need to take a few things into account:
1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider start flushing once we've used up space enough to be in the last available segment, so a new one is hopefully available by the time we hit the limit.

Also fix edge case (for tests), when we have too few segment to have an active one (i.e. need flush everything).

Closes #8695

* github.com:scylladb/scylla:
  commitlog_test: Add test case for usage/disk size threshold mismatch
  commitlog: Flush all segments if we only have one.
  commitlog: Always force flush if segment allocation is waiting
  commitlog: Include segment wasted (slack) size in footprint check
  commitlog: Adjust (lower) usage threshold
2021-05-25 18:34:29 +03:00
Kamil Braun
c948573398 sys_dist_ks: don't create old CDC generations table on service initialization
The old table won't be created in clusters that are bootstrapped after
this commit. It will stay in clusters that were upgraded from a version
before this commit.

Note that a fully upgraded cluster doesn't automatically create a new
generation in the new format. Even if the last generation was created
before the upgrade, the cluster will keep using it.
A new generation will be created in the new format when either:
1. a new node bootstraps (in the new version),
2. or the user runs checkAndRepairCdcStreams, which has a new check: if
   the current generation uses the old format, the command will decide
   that repair is needed, even if the generation is completely fine
   otherwise (also in the new version).

During upgrade, while the CDC_GENERATIONS_V2 feature is still not
enabled, the user may still bootstrap a node in the old version of
Scylla or run checkAndRepairCdcStreams on a not-yet-upgraded node. In
that case a new generation will be created in the old format,
using the old table definitions.
2021-05-25 16:07:23 +02:00
Kamil Braun
4d3870b24b main: pass feature_service to cdc::generation_service 2021-05-25 16:07:23 +02:00
Kamil Braun
f25e77c202 test: cdc: include new generations table in permissions test 2021-05-25 16:07:23 +02:00
Calle Wilund
a96433c684 commitlog_test: Add test case for usage/disk size threshold mismatch
Refs #8270

Tries to simulate case where we mismatch segments usage with actual
disk footprint and fail to flush enough to allow segment recycling
2021-05-25 12:43:12 +00:00
Avi Kivity
e391e4a398 test: serialized_action_test: prevent false-positive timeout in test_phased_barrier_reassignment
test_phased_barrier_reassignment has a timeout to prevent the test from
hanging on failure, but it occastionally triggers in debug mode since
the timeout is quite low (1ms). Increase the timeout to prevent false
positives. Since the timeout only expires if the test fails, it will
have no impact on execution time.

Ref #8613

Closes #8692
2021-05-25 11:20:18 +02:00
Raphael S. Carvalho
ee39eb9042 sstables: Fix slow off-strategy compaction on STCS tables
Off-strategy compaction on a table using STCS is slow because of
the needless write amplification of 2. That's because STCS reshape
isn't taking advantage of the fact that sstables produced by
a repair-based operation are disjoint. So the ~256 input sstables
were compacted (in batches of 32) into larger sstables, which in
turn were compacted into even larger ones. That write amp is very
significant on large data sets, making the whole operation 2x
slower.

Fixes #8449.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210524213426.196407-1-raphaelsc@scylladb.com>
2021-05-25 11:24:42 +03:00
Avi Kivity
50f3bbc359 Merge "treewide: various header cleanups" from Pavel S
"
The patch set is an assorted collection of header cleanups, e.g:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places

A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).

The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).

Before:

	Command being timed: "ninja dev-build"
	User time (seconds): 28262.47
	System time (seconds): 824.85
	Percent of CPU this job got: 3979%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2129888
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1402838
	Minor (reclaiming a frame) page faults: 124265412
	Voluntary context switches: 1879279
	Involuntary context switches: 1159999
	Swaps: 0
	File system inputs: 0
	File system outputs: 11806272
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

After:

	Command being timed: "ninja dev-build"
	User time (seconds): 26270.81
	System time (seconds): 767.01
	Percent of CPU this job got: 3905%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2117608
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1400189
	Minor (reclaiming a frame) page faults: 117570335
	Voluntary context switches: 1870631
	Involuntary context switches: 1154535
	Swaps: 0
	File system inputs: 0
	File system outputs: 11777280
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

The observed improvement is about 5% of total wall clock time
for `dev-build` target.

Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"

* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
  transport: remove extraneous `qos/service_level_controller` includes from headers
  treewide: remove evidently unneded storage_proxy includes from some places
  service_level_controller: remove extraneous `service/storage_service.hh` include
  sstables/writer: remove extraneous `service/storage_service.hh` include
  treewide: remove extraneous database.hh includes from headers
  treewide: reduce boost headers usage in scylla header files
  cql3: remove extraneous includes from some headers
  cql3: various forward declaration cleanups
  utils: add missing <limits> header in `extremum_tracking.hh`
2021-05-24 14:24:20 +03:00
Avi Kivity
7e5a0b6fd0 test: drop unused fields
Drop unused fields in various tests and test libraries.
2021-05-21 21:04:49 +03:00
Avi Kivity
30034371e7 Merge "Remove most of global pointers from repair" from Pavel
"
There are many global stuff in repair -- a bunch of pointers to
sharded services, tracker, map of metas (maybe more). This set
removes the first group, all those services had become main-local
recently. Along the way a call to global storage proxy is dropped.

To get there the repair_service is turned into a "classical"
sharded<> service, gets all the needed dependencies by references
from main and spreads them internally where needed. Tracker and other
stuff is left global, but tracker is now the candidate for merging
with the now sharded repair_service, since it emulates the sharded
concept internally.

Overall the change is

- make repair_service sharded and put all dependencies on it at start
- have sharded<repair_service> in API and storage service
- carry the service reference down to repair_info and repair_meta
  constructions to give them the depedencies
- use needed services in _info and _meta methods

tests: unit(dev), dtest.repair(dev)
"

* 'br-repair-service' of https://github.com/xemul/scylla: (29 commits)
  repair: Drop most of globals from repair
  repair: Use local references in messaging handler checks
  repair: Use local references in create_writer()
  repair: Construct repair_meta with local references
  repair: Keep more stuff on repair_info
  repair: Kill bunch of global usages from insert_repair_meta
  repair: Pass repair service down to meta insertion
  repair: Keep local migration manager on repair_info
  repair: Move unused db captures
  repair: Remove unused ms captures
  repair: Construct repair_info with service
  repair: Loop over repair sharded container
  repair: Make sync_data_using_repair a method
  repair: Use repair from storage service
  repair: Keep repair on storage service
  repair: Make do_repair_start a method
  repair: Pass repair_service through the API until do_repair_start
  repair: Fix indentation after previous patch
  repair: Split sync_data_using_repair
  repair: Turn repair_range a repair_info method
  ...
2021-05-20 10:57:48 +03:00
Pavel Solodovnikov
fff7ef1fc2 treewide: reduce boost headers usage in scylla header files
`dev-headers` target is also ensured to build successfully.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 01:33:18 +03:00
Piotr Sarna
223a59c09c test: make rjson allocator test working in sanitize mode
Following Nadav's advice, instead of ignoring the test
in sanitize/debug modes, the allocator simply has a special path
of failing sufficiently large allocation requests.
With that, a problem with the address sanitizer is bypassed
and other debug mode sanitizers can inspect and check
if there are no more problems related to wrapping the original
rapidjson allocator.

Closes #8539
2021-05-20 00:42:47 +03:00
Pavel Emelyanov
28f01aadc9 allocation_strategy, code: Simplify alloc()
Todays alloc() accepts migrate-fn, size and alignment. All the callers
don't really need to provide anything special for the migrate-fn and
are just happy with default alignof() for alignment. The simplification
is in providing alloc() that only accepts size arg and does the rest
itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-19 09:23:49 +03:00
Botond Dénes
82bff1bcc6 test: cql_test_env: use proper scheduling groups
Currently `cql_test_env` runs its `func` in the default (main) group and
also leaves all scheduling groups in `dbcfg` default initialized to the
same scheduling group. This results in every part of the system,
normally isolated from each other, running in the same (default)
scheduling group. Not a big problem on its own, as we are talking about
tests, but this creates an artificial difference between the test and
the real environment, which is ever more pronounced since certain query
parameters are selected based on the current scheduling group.
To bring cql test env just that little bit closer to the real thing,
this patch creates all the scheduling groups main does (well almost) and
configures `dbcfg` with them.
Creating and destroying the scheduling group on each setup-teardown of
cql test env breaks some internal seastar components which don't like
seeing the same scheduling group with the same name but different id. So
create the scheduling groups once on first access and keep them around
until the test executable is running.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-2-bdenes@scylladb.com>
2021-05-18 13:44:54 +03:00
Botond Dénes
300ee974f7 test: use with_cql_test_env_thread where needed
Currently `with_cql_test_env()` is equivalent to
`with_cql_test_env_thread()`, which resulted in many tests using the
former while really needing the latter and getting away with it. This
equivalence is incidental and will go away soon, so make sure all cql
test env using tests that expect to be run in a thread use the
appropriate variant.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141614.128213-1-bdenes@scylladb.com>
2021-05-18 13:44:52 +03:00
Avi Kivity
6db826475d Merge "Introduce segregate scrub mode" from Botond
"
The current scrub compaction has a serious drawback, while it is
very effective at removing any corruptions it recognizes, it is very
heavy-handed in its way of repairing such corruptions: it simply drops
all data that is suspected to be corrupt. While this *is* the safest way
to cleanse data, it might not be the best way from the point of view of
a user who doesn't want to loose data, even at the risk of retaining
some business-logic level corruption. Mind you, no database-level scrub
can ever fully repair data from the business-logic point of view, they
can only do so on the database-level. So in certain cases it might be
desirable to have a less heavy-handed approach of cleansing the data,
that tries as hard as it can to not loose any data.

This series introduces a new scrub mode, with the goal of addressing
this use-case: when the user doesn't want to loose any data. The new
mode is called "segregate" and it works by segregating its input into
multiple outputs such that each output contains a valid stream. This
approach can fix any out-of-order data, be that on the partition or
fragment level. Out-of-order partitions are simply written into a
separate output. Out of order fragments are handled by injecting a
partition-end/partition-start pair right before them, so that they are
now in a separate (duplicate) partition, that will just be written into
a separate output, just like a regular out-of-order partition.

The reason this series is posted as an RFC is that although I consider
the code stable and tested, there are some questions related to the UX.
* First and foremost every scrub that does more than just discard data
  that is suspected to be corrupt (but even these a certain degree) have
  to consider the possibility that they are rehabilitating corruptions,
  leaving them in the system without a warning, in the sense that the
  user won't see any more problems due to low-level corruptions and
  hence might think everything is alright, while data is still corrupt
  from the business logic point of view. It is very hard to draw a line
  between what should and shouldn't scrub do, yet there is a demand from
  users for scrub that can restore data without loosing any of it. Note
  that anybody executing such a scrub is already in a bad shape, even if
  they can read their data (they often can't) it is already corrupt,
  scrub is not making anything worse here.
* This series converts the previous `skip_corrupted` boolean into an
  enum, which now selects the scrub mode. This means that
  `skip_corrupted` cannot be combined with segregate to throw out what
  the former can't fix. This was chosen for simplicity, a bunch of
  flags, all interacting with each other is very hard to see through in
  my opinion, a linear mode selector is much more so.
* The new segregate mode goes all-in, by trying to fix even
  fragment-level disorder. Maybe it should only do it on the partition
  level, or maybe this should be made configurable, allowing the user to
  select what to happen with those data that cannot be fixed.

Tests: unit(dev), unit(sstable_datafile_test:debug)
"

* 'sstable-scrub-segregate-by-partition/v1' of https://github.com/denesb/scylla:
  test: boost/sstable_datafile_test: add tests for segregate mode scrub
  api: storage_service/keyspace_scrub: expose new segregate mode
  sstables: compaction/scrub: add segregate mode
  mutation_fragment_stream_validator: add reset methods
  mutation_writer: add segregate_by_partition
  api: /storage_service/keyspace_scrub: add scrub mode param
  sstables: compaction/scrub: replace skip_corrupted with mode enum
  sstables: compaction/scrub: prevent infinite loop when last partition end is missing
  tests: boost/sstable_datafile_test: use the same permit for all fragments in scrub tests
2021-05-18 13:43:01 +03:00
Pavel Emelyanov
23e8e60ec0 repair: Keep repair on storage service
Storage service calls a bunch of do_something_with_repair() methods. All
of them need the local repair_service and the only way to get it is by
keeping it on storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-14 18:44:02 +03:00
Piotr Wojtczak
e6254acfd3 boost/tests: Add virtual_table_test for basic infrastructure 2021-05-12 17:05:35 +02:00
Piotr Wojtczak
8825ae128d boost/tests: Test memtable_filling_virtual_table as mutation_source
Uses the infrastructure for testing mutation_sources, but only a
subset of it which does not do fast forwarding (since virtual_table
does not support it).
2021-05-12 17:05:35 +02:00
Benny Halevy
9ba960a388 utils: phased_barrier::operation do not leak gate entry when reassigned
utils::phased_barrier holds a `lw_shared_ptr<gate>` that is
typically `enter()`ed in `phased_barrier::start()`,
and left when the operation is destroyed in `~operation`.

Currently, the operation move-assign implementation is the
default one that just moves the lw_shared gate ptr from the
other operation into this one, without calling `_gate->leave()` first.

This change first destroys *this when move-assigned (if not self)
to call _gate->leave() if engaged, before reassigning the
gate with the other operation::_gate.

A unit test that reproduces the issue before this change
and passes with the fix was added to serialized_action_test.

Fixes #8613

Test: unit(dev), serialized_action_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210510120703.1520328-1-bhalevy@scylladb.com>
2021-05-11 18:39:10 +03:00
Avi Kivity
1d8234f52d Merge "reader_concurrency_semaphore: improve diagnostics printout" from Botond
"
The current printout is has multiple problems:
* It is segregated by state, each having its own sorting criteria;
* Number of permits and count resources is collapsed in to a single
  column, not clear which is the one printed.
* Number of available/initial units of the semaphore are not printed;

This series solves all this problems:
* It merges all states into a single table, sorted by memory
  consumption, in descending order.
* It separates number of permits and count resources into separate
  columns.
* Prints a summary of the semaphore units.
* Provides a cap on the maximum amount of printable lines, to not blow
  up the logs.

The goal of all this is to make it easy to find the culprit a semaphore
problem: easily spot the big memory consumers, then unpack the name
column to determine which table and code path is responsible.
This brings the printout close to the recently `scylla reads`
scylla-gdb.py command, providing a uniform report format across the two
tools.
Example report:
INFO  2021-05-07 09:52:16,806 [shard 0] testlog - With max-lines=4: Semaphore reader_concurrency_semaphore_dump_reader_diganostics with 8/2147483647 count and 263599186/9223372036854775807 memory resources: user request, dumping permit diagnostics:
permits count   memory  table/description/state
7       2       77M     ks.tbl1/op1/active
6       3       59M     ks.tbl1/op0/active
4       0       36M     ks.tbl1/op2/active
3       1       36M     ks.tbl0/op2/active
11      2       43M     permits omitted for brevity

31      8       251M    total
"

* 'reader-concurrency-semaphore-dump-improvement/v1' of https://github.com/denesb/scylla:
  test: reader_concurrency_test: add reader_concurrency_semaphore_dump_reader_diganostics
  reader_concurrency_semaphore: dump_reader_diagnostics(): print more information in the header
  reader_concurrency_semaphore: dump_reader_diagnostics(): cap number of printed lines
  reader_concurrency_semaphore: dump_reader_diagnostics(): sort lines in descending order
  reader_concurrency_semaphore: dump_reader_diagnostics(): merge all states into a single table
  reader_concurrency_semaphore: dump_reader_diagnostics(): separate number of permits and count resources
2021-05-11 18:39:10 +03:00
Nadav Har'El
af485f5226 secondary index: fix index name in IndexInfo system table
In commit 3e39985c7a we added the Cassandra-compatible system table
system."IndexInfo" (note the capitalized table name) which lists built
indexes. Because we already had a table of built materialized views, and
indexes are implemented as materialized views, the index list was
implemented as a virtual table based on the view list.

However, the *name* of each materialized view listed in the list of
views looks like something_index, with the suffix "_index", while the
name of the table we need to print is "something". We forgot to do this
transformation in the virtual table - and this is what this patch does.

This bug can confuse applications which use this system table to wait for
an index to be built. Several tests translated from Cassandra's unit
tests, in cassandra_tests/validation/entities/secondary_index_test.py fail
in wait_for_index() because of this incompatibility, and pass after this
patch.

This patch also changes the unit test that enshrined the previous, wrong,
behavior, to test for the correct behavior. This problem is typical of
C++ unit tests which cannot be run against Cassandra.

Fixes #8600

Unfortunately, although this patch fixes "typical" applications (including
all tests which I tried) - applications which read from IndexInfo in a
"typical" method to look for a specific index being ready, the
implementation is technically NOT correct: The problem is that index
names are not sorted in the right order, because they are sorted with
the "_index" prefix.
To give an example, the index names "a" should be listed before "a1", but
the view names "a1_index" comes before "a_index" (because in ASCII, 1
comes before underscore). I can't think of any way to fix this bug
without completely reimplementing IndexInfo in a different way - probably
based on a temporary memtable (which is fine as this is not a
performance-critical operation). We'll need to do this rewrite eventually,
and I'll open a new issue.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210509140113.1084497-1-nyh@scylladb.com>
2021-05-11 18:39:10 +03:00