"This series ensures the always write correct cell names to promoted
index cell blocks, taking into account the eoc of range tombstones.
Fixes#2333"
* 'pi-cell-name/v1' of github.com:duarten/scylla:
tests/sstable_mutation_test: Test promoted index blocks are monotonic
sstables: Consider eoc when flushing pi block
sstables: Extract out converting bound_kind to eoc
(cherry picked from commit db7329b1cb)
Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.
As a result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).
This patch has two levels:
1. In the lower level, sstable::data_consume_rows(), which reads all
partitions in a given disk byte range, now gets another byte position,
"last_end". That can be the range's end, the end of the file, or anything
in between the two. It opens the disk stream until last_end, which means
(a) we will never read ahead beyond last_end, and (b) fast_forward_to() is
not allowed beyond last_end.
2. In the upper level, we add to the various layers of sstable readers,
mutation readers, etc., a boolean flag mutation_reader::forwarding, which
says whether fast_forward_to() is allowed on the stream of mutations to
move the stream to a different partition range.
Note that this flag is separate from the existing boolean flag
streamed_mutation::forwarding - that one talks about skipping inside a
single partition, while the flag we are adding is about switching the
partition range being read. Most of the functions that previously
accepted streamed_mutation::forwarding now accept *also* the option
mutation_reader::forwarding. The exceptions are functions which are known
to read only a single partition and do not support fast_forward_to() to a
different partition range.
We note that if mutation_reader::forwarding::no is requested, and
fast_forward_to() is forbidden, there is no point in reading anything
beyond the range's end, so data_consume_rows() is called with last_end as
the range's end. But if forwarding::yes is requested, we use the end of the
file as last_end, exactly like the code before this patch did.
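The choice of last_end described above can be sketched roughly as follows.
This is a minimal illustration, not the actual reader code: the enum and the
helpers choose_last_end() and can_fast_forward_to() are hypothetical names
introduced here, and byte positions are modeled as plain integers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the mutation_reader::forwarding flag.
enum class forwarding { no, yes };

// Returns the byte position up to which the disk stream would be opened.
// range_end is the end of the requested byte range, file_size the end of
// the data file.
inline uint64_t choose_last_end(forwarding fwd, uint64_t range_end, uint64_t file_size) {
    // If the reader may later be fast-forwarded to another partition range,
    // the stream has to stay usable up to the end of the file.
    if (fwd == forwarding::yes) {
        return file_size;
    }
    // Otherwise nothing beyond the requested range will ever be read.
    return range_end < file_size ? range_end : file_size;
}

// fast_forward_to() is only allowed for targets that do not exceed last_end.
inline bool can_fast_forward_to(uint64_t target_end, uint64_t last_end) {
    return target_end <= last_end;
}
```

With forwarding::no the stream ends at the range's end, so read-ahead cannot
run past it; with forwarding::yes the behavior matches the code before this
patch.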
Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.
In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an optimization will not
help if the overall range is always very large, in which case avoiding
over-reading at its end will not noticeably improve performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170718110643.8667-1-nyh@scylladb.com>
Test should
a.) Wait for the flush semaphore
b.) Only compare segment sets between start and end, not start,
end and in between. I.e. the test sort of assumed we started
with < 2 (or so) segments. Not always the case (timing)
Message-Id: <1496828317-14375-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 0c598e5645)
"This series switches repair to use more stream plans to stream the mismatched
sub ranges and use a range generator to produce sub ranges.
Tests show that repair no longer uses a huge amount of memory with a large data set.
In addition, the log now has a progress report showing how many ranges have been processed.
Jun 06 14:18:22 [shard 0] repair - Repair 512 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942]
Jun 06 14:19:55 [shard 0] repair - Repair 513 out of 529 ranges, id=1, keyspace=myks, cf=mytable, range=(8526136029525195375, 8549482295083869942]
Fixes #2430."
* tag 'asias/fix-repair-2430-branch-master-v1' of github.com:cloudius-systems/seastar-dev:
repair: Remove unused sub_ranges_max
repair: Reduce parallelism in repair_ranges
repair: Tweak the log a bit
repair: Use more stream_plan
repair: iterator over subranges instead of list
(cherry picked from commit 419ad9d6cb)
When compacting a fully expired sstable, we're not allowing that sstable
to be purged because an expired cell is *unconditionally* converted into a
dead cell. Why not check instead whether the expired cell can be purged, using
gc_before and the max purgeable timestamp?
Currently, we need two compactions to get rid of a fully expired sstable
whose cells could have been purged all along.
Look at this sstable with an expired cell:
{
"partition" : {
"key" : [ "2" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 120,
"liveness_info" : { "tstamp" : "2017-04-09T17:07:12.702597Z",
"ttl" : 20, "expires_at" : "2017-04-09T17:07:32Z", "expired" : true },
"cells" : [
{ "name" : "country", "value" : "1" },
]
Now this sstable's data after the first compaction:
[shard 0] compaction - Compacted 1 sstables to [...]. 120 bytes to 79
(~65% of original) in 229ms = 0.000328997MB/s.
{
...
"rows" : [
{
"type" : "row",
"position" : 79,
"cells" : [
{ "name" : "country", "deletion_info" :
{ "local_delete_time" : "2017-04-09T17:07:12Z" },
"tstamp" : "2017-04-09T17:07:12.702597Z"
},
]
Now another compaction will actually get rid of the data:
compaction - Compacted 1 sstables to []. 79 bytes to 0 (~0% of original)
in 1ms = 0MB/s. ~2 total partitions merged to 0
NOTE:
It's a waste of time to wait for a second compaction because the expired
cell could have been purged at the first compaction, since it satisfied
gc_before and the max purgeable timestamp.
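The decision argued for here can be sketched as below. This is only an
illustration under assumptions: the struct, the enum and compact_cell() are
hypothetical names, times are plain integers, and the deletion time of the
would-be dead cell is taken as expiry minus TTL, which is what the dump above
suggests (expires_at 17:07:32, ttl 20, local_delete_time 17:07:12).

```cpp
#include <cassert>
#include <cstdint>

// Simplified model of an expiring cell; field names are illustrative.
struct expiring_cell {
    int64_t timestamp;  // write timestamp
    int64_t expiry;     // wall-clock second at which the cell expires
    int64_t ttl;        // time-to-live in seconds
};

enum class compaction_action { keep, convert_to_dead_cell, purge };

inline compaction_action compact_cell(const expiring_cell& c, int64_t now,
                                      int64_t gc_before,
                                      int64_t max_purgeable_timestamp) {
    if (c.expiry > now) {
        return compaction_action::keep;  // still live
    }
    // Deletion time the resulting dead cell would carry (assumption:
    // expiry minus TTL, as in the dump above).
    const int64_t deletion_time = c.expiry - c.ttl;
    // Purge right away when both conditions hold, instead of writing a
    // dead cell that only a later compaction could drop.
    const bool purgeable = deletion_time < gc_before
                        && c.timestamp < max_purgeable_timestamp;
    return purgeable ? compaction_action::purge
                     : compaction_action::convert_to_dead_cell;
}
```

Only when one of the two conditions fails does the cell still need to be
written out as a dead cell.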
Fixes #2249, #2253
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170413001049.9663-1-raphaelsc@scylladb.com>
(cherry picked from commit a6f8f4fe24)
It can't make the leap from dht::ring_position to
stdx::optional<range_bound<dht::ring_position>> for some reason.
(cherry picked from commit ba31619594)
From http://github.com/avikivity/scylla exponential-sharder/v3.
The sharder, which takes a range of tokens and splits it among shards, is
slow with large shard count and the default
murmur3_partitioner_ignore_msb_bits.
This patchset fixes excessive iteration in the sstable sharding metadata writer and
nonsingular range scans.
Without this patchset, sealing a memtable takes > 60 ms on a 48-shard
system. With the patchset, it drops below the latency tracker threshold I
used (5 ms).
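To get a feeling for why the shard count and the ignored MSB bits matter, here
is a rough sketch. It assumes that a token is mapped to a shard by dropping the
ignored most-significant bits and scaling the rest to the shard count; the
functions shard_of() and contiguous_subranges() are illustrative names, not the
partitioner's API, and the value 12 for the ignored bits is used only as an
example. The code relies on the GCC/Clang unsigned __int128 extension.

```cpp
#include <cassert>
#include <cstdint>

// Assumed mapping from an unsigned 64-bit token to a shard: drop the ignored
// most-significant bits, then scale the remainder to [0, shards).
// ignore_msb_bits must be less than 64.
inline unsigned shard_of(uint64_t token, unsigned shards, unsigned ignore_msb_bits) {
    token <<= ignore_msb_bits;
    return static_cast<unsigned>(
        (static_cast<unsigned __int128>(token) * shards) >> 64);
}

// Number of shard-contiguous sub-ranges the whole token ring splits into
// under this mapping: the shard pattern repeats 2^ignore_msb_bits times.
inline uint64_t contiguous_subranges(unsigned shards, unsigned ignore_msb_bits) {
    return static_cast<uint64_t>(shards) << ignore_msb_bits;
}
```

Under these assumptions, 48 shards with 12 ignored bits split the ring into
48 * 4096 = 196608 sub-ranges, so any code that walks them one by one does a
lot of work per range.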
Fixes #2392.
(cherry picked from commit 84648f73ef)
Currently, a fully expired sstable[1] is unconditionally chosen for compaction
by DTCS, but that may lead to a compaction loop under certain conditions.
Let's consider that an almost expired sstable is compacted, and it's not
deleted yet, and that the new sstable becomes expired before its ancestor is
deleted.
Because this new sstable is expired, it will be chosen by DTCS, but it will
not be purged because 'compacted undeleted' sstables are taken into account
by the calculation of the max purgeable timestamp, which prevents expired data
from being purged. The problem is that this sequence of events can keep
happening forever, as reported in issue #2260.
NOTE: This problem was easier to reproduce before the improvement to compaction
of expired cells, because a fully expired sstable was being converted into an
sstable full of tombstones, which is also considered fully expired.
Fixes #2260.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170428233554.13744-1-raphaelsc@scylladb.com>
(cherry picked from commit 687a4bb0c2)
"Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.
I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:
lsa-timing - Reclamation cycle took 12.934 us.
lsa-timing - Reclamation cycle took 47.771 us.
lsa-timing - Reclamation cycle took 125.946 us.
lsa-timing - Reclamation cycle took 144356 us.
lsa-timing - Reclamation cycle took 655.765 us.
lsa-timing - Reclamation cycle took 693.418 us.
lsa-timing - Reclamation cycle took 509.869 us.
lsa-timing - Reclamation cycle took 1139.15 us.
The 144ms pause is when large eviction is necessary.
Statistics for reclamation pauses for a read workload over
larger-than-memory data set:
Before:
avg = 865.796362
stdev = 10253.498038
min = 93.891000
max = 264078.000000
sum = 574022.988000
samples = 663
After:
avg = 513.685650
stdev = 275.270157
min = 212.286000
max = 1089.670000
sum = 340573.586000
samples = 663
Refs #1634."
* tag 'tgrabiec/lsa-reduce-reclaim-latency-v3' of github.com:cloudius-systems/seastar-dev:
lsa: Reduce reclamation latency
tests: Add test for log_histogram
log_histogram: Allow non-power-of-two minimum values
lsa: Use regular compaction threshold in on-idle compaction
tests: row_cache_test: Induce update failure more reliably
lsa: Add getter for region's eviction function
(cherry picked from commit fccbf2c51f)
[avi: adjustments for 1.7's heap vs. master's log_histogram]
Streaming generates lots of small sstables with large token ranges,
which triggers O(N^2) space usage in the interval map.
Level 0 sstables will now be stored in a structure that has O(N)
space complexity and which will be included for every read.
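The idea can be sketched as follows. This is a simplified model, not the
sstable set implementation: sstable_model and sstable_set_model are
hypothetical names, tokens are plain integers, and a linear scan over a vector
stands in for the interval map used for the higher levels.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct sstable_model {
    int id;
    int level;
    int64_t first_token;
    int64_t last_token;
};

class sstable_set_model {
    std::vector<sstable_model> _level0;   // O(N) space, consulted on every read
    std::vector<sstable_model> _leveled;  // stand-in for the interval map
public:
    void insert(const sstable_model& sst) {
        (sst.level == 0 ? _level0 : _leveled).push_back(sst);
    }

    // Returns ids of sstables that a read of `token` has to consult.
    std::vector<int> select(int64_t token) const {
        std::vector<int> ids;
        // Level 0 sstables are always included, whatever range they cover.
        for (const auto& sst : _level0) {
            ids.push_back(sst.id);
        }
        // Higher levels are filtered by token range.
        for (const auto& sst : _leveled) {
            if (sst.first_token <= token && token <= sst.last_token) {
                ids.push_back(sst.id);
            }
        }
        return ids;
    }
};
```

The trade-off is that every read consults all level 0 sstables, in exchange
for not storing many wide, overlapping ranges in the interval map.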
Fixes #2287.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417185509.6633-1-raphaelsc@scylladb.com>
(cherry picked from commit 11b74050a1)
"The test allocates objects in batches (allocation is always under a reclaim
lock) of ~3MiB and assumes that it will always succeed because if we cross the
low water mark for free memory (20MiB) in seastar, reclamation will be
performed between the batches, asynchronously.
Unfortunately that's prevented by can_allocate_more_memory(), which fails
segment allocation when we're below the low water mark. LSA currently doesn't
allow allocating below the low water mark.
The solution which is employed across the code base is to use allocating_section,
so use it here as well.
Exposed by recent consistent failures on branch-1.7."
* 'tgrabiec/fix-lsa-async-eviction-test' of github.com:cloudius-systems/seastar-dev:
tests: lsa_async_eviction_test: Allocate objects under allocating section
lsa: Allow adjusting reserves in allocating_section
(cherry picked from commit 434a4fee28)
"This series contains some fixes and a unit test for the logic responsible
for locking counter cells."
* 'pdziepak/cell-locking-fixes/v1' of github.com:cloudius-systems/seastar-dev:
tests: add test for counter cell locker
cell_locking: fix schema upgrades
cell_locker: make locker non-movable
cell_locking: allow to be included by anyone
(cherry picked from commit b8c4b35b57)
"This series makes sure that schemas containing both counter and non-counter
regular or static columns are not allowed."
* 'pdziepak/disallow-mixed-schemas/v1' of github.com:cloudius-systems/seastar-dev:
schema: verify that there are no both counter and non-counter columns
test/mutation_source: specify whether to generate counter mutations
tests/canonical_mutation: don't try to upgrade incompatible schemas
(cherry picked from commit 9e4ae0763d)
"Before, the logic for releasing writes blocked on dirty worked like this:
1) When region group size changes and it is not under pressure and there
are some requests blocked, then schedule request releasing task
2) request releasing task, if no pressure, runs one request and if there are
still blocked requests, schedules next request releasing task
If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The amount of scheduled
tasks is at most 1, there is a single releasing thread.
However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.
The group size can also change when memory is reclaimed from the groups (e.g.
when they contain sparse segments). Compaction may start many request releasing
threads due to group size updates.
Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.
The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or run out of blocked
requests. It may take a while to reach pressure condition after memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.
I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.
Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."
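The "at most one releaser" rule described in the cover letter above can be
modeled with a small synchronous sketch. This is not the LSA code: the class
releaser_model and its methods are hypothetical, and a re-entrancy guard stands
in for the single releasing fiber per region group.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

class releaser_model {
    std::deque<std::function<void()>> _blocked;
    bool _under_pressure = false;
    bool _releaser_running = false;
    unsigned _releaser_starts = 0;

    void wake_releaser() {
        if (_releaser_running) {
            return;  // a releaser is already draining the queue
        }
        _releaser_running = true;
        ++_releaser_starts;
        // Execute blocked requests until pressure occurs or none are left.
        while (!_under_pressure && !_blocked.empty()) {
            auto req = std::move(_blocked.front());
            _blocked.pop_front();
            req();  // may itself change the group size
        }
        _releaser_running = false;
    }
public:
    void block(std::function<void()> req) { _blocked.push_back(std::move(req)); }

    void set_pressure(bool p) {
        _under_pressure = p;
        if (!p) {
            wake_releaser();
        }
    }

    // Called whenever the group size changes (by a request or by reclaim).
    void on_size_change() {
        if (!_under_pressure) {
            wake_releaser();
        }
    }

    unsigned releaser_starts() const { return _releaser_starts; }
    std::size_t blocked() const { return _blocked.size(); }
};
```

Even if every released request changes the group size, the guard keeps the
number of active releasers at one, instead of adding one more task per size
change.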
* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
tests: lsa: Add test for reclaimer starting and stopping
tests: lsa: Add request releasing stress test
lsa: Avoid avalanche releasing of requests
lsa: Move definitions to .cc
lsa: Simplify hard pressure notification management
lsa: Do not start or stop reclaiming on hard pressure
tests: lsa: Adjust to take into account that reclaimers are run synchronously
lsa: Document and annotate reclaimer notification callbacks
tests: lsa: Use with_timeout() in quiesce()
(cherry picked from commit 7a00dd6985)
Currently the test does not wait for cache update
to finish before carrying on with the checks.
This makes the test nondeterministic and simply wrong
because checks expect update to be finished.
This patch changes the test to wait for update to finish.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2a99bba24b1628466d3495332b48ef3ccdb43c26.1485862389.git.piotr@scylladb.com>
Currently, the code is using bytes_opt and bytes_view_opt to represent
CQL values, which can hold a value or null. In preparation for
supporting a third state, unset value introduced in CQL v4, introduce
new raw_value and raw_value_view types and use them instead.
The new types are based on boost::variant<> and are capable of holding
null, unset values, and blobs that represent a value.
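A three-state value of this kind can be sketched as below. Note that this
sketch uses std::variant from C++17 rather than boost::variant, std::string as
a stand-in for the blob type, and the hypothetical class name raw_value_model;
it only illustrates the shape of the type, not its real interface.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

struct null_value {};
struct unset_value {};
using bytes = std::string;  // stand-in for the blob type

class raw_value_model {
    using storage = std::variant<null_value, unset_value, bytes>;
    storage _data;
    explicit raw_value_model(storage d) : _data(std::move(d)) {}
public:
    static raw_value_model make_null() { return raw_value_model(storage(null_value{})); }
    static raw_value_model make_unset() { return raw_value_model(storage(unset_value{})); }
    static raw_value_model make_value(bytes b) { return raw_value_model(storage(std::move(b))); }

    bool is_null() const { return std::holds_alternative<null_value>(_data); }
    bool is_unset() const { return std::holds_alternative<unset_value>(_data); }
    bool is_value() const { return std::holds_alternative<bytes>(_data); }

    // Throws std::bad_variant_access when no blob is held.
    const bytes& value() const { return std::get<bytes>(_data); }
};
```

Unlike an optional, this keeps "null", "unset" and "empty blob" as three
distinct states.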
This reverts commit d61002cc33.
Introduced a regression in row_cache_alloc_stress.
The problem is that reclaim_from_evictable() evicts way too much after
the refactor due to the stop condition not taking into account how
much data was evicted so far and only looking at occupancy of the
minimal segment. This may lead to eviction of the whole region.
Add a boolean to short circuit the read path on empty range
hoping for some speedup.
Tested in a read/write workload with c-s using:
cl=QUORUM duration=1m -mode native cql3 -rate threads=700 -node localhost
Will do some additional benchmarks.
Fixes #1056
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20170118194451.16836-1-benoit@scylladb.com>
Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.
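The difference between the two stop conditions can be illustrated with a small
model. The types and function names below are hypothetical, segments are
reduced to a used/size pair, and the sketch only compares the two predicates;
it does not model eviction itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct segment_model {
    std::size_t used;
    std::size_t size;
};

inline double occupancy(const segment_model& s) {
    return static_cast<double>(s.used) / static_cast<double>(s.size);
}

// Occupancy of the whole region: total used bytes over total segment bytes.
inline double region_occupancy(const std::vector<segment_model>& segs) {
    std::size_t used = 0, total = 0;
    for (const auto& s : segs) {
        used += s.used;
        total += s.size;
    }
    return total ? static_cast<double>(used) / static_cast<double>(total) : 0.0;
}

// Occupancy of the sparsest (least occupied) segment.
inline double sparsest_occupancy(const std::vector<segment_model>& segs) {
    if (segs.empty()) {
        return 0.0;
    }
    return occupancy(*std::min_element(segs.begin(), segs.end(),
        [](const segment_model& a, const segment_model& b) {
            return occupancy(a) < occupancy(b);
        }));
}

// Old condition: evict until the whole region is below the threshold.
inline bool done_region_wide(const std::vector<segment_model>& segs, double threshold) {
    return region_occupancy(segs) < threshold;
}

// New condition: evict only until the sparsest segment is below the threshold.
inline bool done_sparsest_segment(const std::vector<segment_model>& segs, double threshold) {
    return sparsest_occupancy(segs) < threshold;
}
```

For a region with two full segments and one at 80%, the region-wide condition
still demands more eviction at a 0.85 threshold, while the per-segment
condition is already satisfied.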
I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:
lsa-timing - Reclamation cycle took 12.934 us.
lsa-timing - Reclamation cycle took 47.771 us.
lsa-timing - Reclamation cycle took 125.946 us.
lsa-timing - Reclamation cycle took 144356 us.
lsa-timing - Reclamation cycle took 655.765 us.
lsa-timing - Reclamation cycle took 693.418 us.
lsa-timing - Reclamation cycle took 509.869 us.
lsa-timing - Reclamation cycle took 1139.15 us.
The 144ms pause is when large eviction is necessary.
The change improves worst case latency. Reclamation time statistics
over 30 second period after cache fills up, in microseconds:
Before:
avg = 1524.283148
stdev = 11021.021118
min = 12.934000
max = 144356.000000
sum = 257603.852000
samples = 169
After:
avg = 1317.362414
stdev = 1913.542802
min = 263.935000
max = 19244.600000
sum = 175209.201000
samples = 133
Refs #1634.
Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>
Since ce083308a1
"random_mutation_generator: Generate RTs by default" random mutation
generator produces range tombstones. However, so far the tests were run
with all features disabled (because of incomplete initialization of all
services) which meant that RANGE_TOMBSTONE feature was not enabled and
the code couldn't handle range tombstones that weren't just prefixes.
This patch solves the problem by forcing all features to be enabled when
tests are run.
Message-Id: <20170116103324.22956-1-pdziepak@scylladb.com>
This patch changes the random_mutation_generator so it generates range
tombstones by default. This fixes a bug where reversibly applying
range tombstones wasn't being tested.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170110164822.28747-1-duarte@scylladb.com>
"Intended to reduce memory usage when resharding by sharing sstable
components among shards. File descriptors are also shared from now
on, meaning that a much smaller number of file descriptors will be
used during resharding.
Fixes #1951."
branch 'excessive_memory_usage_v4' of github.com:raphaelsc/scylla
* 'excessive_memory_usage_v4' of github.com:raphaelsc/scylla:
db: avoid excessive memory usage during resharding
checked_file_impl: add support to dup
sstables: group sstable components that can be shared among shards
sstables: rename sstable member
After resharding, sstables may be owned by all shards, which
means that file descriptors and memory usage for metadata will
increase by a factor equal to number of shards. That can easily
lead to OOM.
SSTable components are immutable, so they can be stored in one
shard and shared with others that need it. We use the following
formula to decide which shard will open the sstable and share
it with the others: (generation % smp::count), which is the
inverse of how we calculate generation for new sstables.
So if no resharding is performed, everything is shard-local.
With this approach, resource usage due to loaded sstables will
be evenly distributed among shards.
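The formula can be illustrated as follows. The allocation scheme shown for new
sstables is an assumption made for this sketch (generations created on a shard
are congruent to the shard id modulo the shard count), and both function names
are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed allocation scheme for new sstables: each shard hands out
// generations congruent to its own id modulo the shard count.
inline int64_t generation_for_new_sstable(int64_t per_shard_counter,
                                          unsigned shard_count,
                                          unsigned shard_id) {
    return per_shard_counter * shard_count + shard_id;
}

// Shard that opens the sstable and shares its components with the others.
inline unsigned owner_shard(int64_t generation, unsigned shard_count) {
    return static_cast<unsigned>(generation % shard_count);
}
```

With an unchanged shard count, owner_shard() returns the shard that created
the sstable, so everything stays shard-local; after the shard count changes,
the modulo still assigns each generation to exactly one opening shard.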
For this approach to work, we now only populate keyspaces from
shard 0. It is now solely responsible for iterating through
column family dirs. In addition, most of the population functions
are now free functions and take the distributed database object as a parameter.
Fixes #1951.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"This patch series adds support for CQL 3.3.1. The changes to CQL are listed
here:
https://github.com/apache/cassandra/blob/cassandra-2.2/doc/cql3/CQL.textile#changes
The following CQL features are already supported by Scylla:
- TRUNCATE TABLE alias
- Double-dollar string literals
- Aggregate functions: MIN, MAX, SUM, and AVG
This series adds the following CQL features:
- New data types: tinyint, smallint, date, and time
- CQL binary protocol v4 (required by the new data types)
- Advertise Cassandra 2.2.8 version from Scylla so that drivers correctly
detect the presence of CQL 3.3.1
The following CQL features are not supported by Scylla:
- Role-based access control (issue #1941)
- JSON data type
- User-defined functions (UDFs)
- User-defined aggregates (UDAs)
The following CQL binary protocol v4 changes are not implemented by this
series:
- Read_failure and Write_failure error codes are not implemented.
These error codes are not used by the smart drivers, but as they are
propagated to application code, we eventually need to wire them up
to our storage proxy implementation.
- Function_failure error code is only used by user-defined functions
and the fromJson function, which are not implemented by Scylla.
Fixes #1284."
* 'penberg/cql-3.3.1/v5' of github.com:cloudius-systems/seastar-dev:
version: Bump Cassandra version to 2.2.8
db/schema_tables: Add schema_functions and schema_aggregates tables
tests/type_tests: TIME type test cases
tests/cql_query_test: TIME type test cases
cql3: TIME data type support
tests/type_tests: DATE type test cases
tests/cql_query_test: DATE type test cases
cql3: DATE type support
date.h: 64-bit year and days representation
licenses: Add utils/date.h license
utils/date.h: Import date and time library sources
tests/type_tests: TINYINT and SMALLINT type test cases
tests/cql_query_test: TINYINT and SMALLINT type test cases
cql3: TINYINT and SMALLINT data type support
types: Fix integer_type_impl::parse_int() for bytes