The code uses incorrect output stream in case only digest is requested
and thus getting incorrect data size. Failing to correctly account
for static row size while calculating digest may cause digest mismatch
between digest and data query.
Fixes#3753.
Message-Id: <20180905131219.GD2326@scylladb.com>
(cherry picked from commit 98092353df)
When the list of values in the IN list of a single column contains
duplicates, multiple executors are activated since the assumption
is that each value in the IN list corresponds to a different partition.
this results in the same row appearing in the result number times
corresponding to the duplication of the partition value.
Added queries for the in restriction unitest and fixed with a bad result check.
Fixes#2837
Tests: Queries as in the usecase from the GitHub issue in both forms ,
prepared and plain (using python driver),Unitest.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <ad88b7218fa55466be7bc4303dc50326a3d59733.1534322238.git.eliransin@scylladb.com>
(cherry picked from commit d734d316a6)
Fixes a regression introduced in
9e88b60ef5, which broke the lookup for
prefetched values of lists when a clustering key is specified.
This is the code that was removed from some list operations:
std::experimental::optional<clustering_key> row_key;
if (!column.is_static()) {
row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
}
...
auto&& existing_list = params.get_prefetched_list(m.key().view(), row_key, column);
Put it back, in the form of common code in the update_parameters class.
Fixes#3703
* https://github.com/duarten/scylla cql-list-fixes/v1:
tests/cql_query_test: Test multi-cell static list updates with ckeys
cql3/lists: Fix multi-cell static list updates in the presence of ckeys
keys: Add factory for an empty clustering_key_prefix_view
(cherry picked from commit 6937cc2d1c)
_value_views is the authoritative data structure for the
client-specified values. Indeed, the ctor called
transport::request::read_options() leaves _values completely empty.
In query_options::prepare() we were, however, using _values to
associated values to the client-specified column names, and not
_value_views. Fix this by using _value_views instead.
As for the reasons we didn't see this bug earlier, I assume it's
because very few drivers set the 0x04 query options flag, which means
column names are omitted. This is the right thing to do since most
drivers have enough information to correctly position the values.
Fixes#3688
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814234605.14775-1-duarte@scylladb.com>
(cherry picked from commit a4355fe7e7)
In previous versions of Fedora, the `crypt_r` function returned
`nullptr` when a requested hashing algorithm was not supported.
This is consistent with the documentation of the function in its man
page.
As of Fedora 28, the function's behavior changes so that the encrypted
text is not `nullptr` on error, but instead the string "*0".
The info pages for `crypt_r` clarify somewhat (and contradict the man
pages):
Some implementations return `NULL` on failure, and others return an
_invalid_ hashed passphrase, which will begin with a `*` and will
not be the same as SALT.
Because of this change of behavior, users running Scylla on a Fedora 28
machine which was upgraded from a previous release would not be able to
authenticate: an unsupported hashing algorithm would be selected,
producing encrypted text that did not match the entry in the table.
With this change, unsupported algorithms are correctly detected and
users should be able to continue to authenticate themselves.
Fixes#3637.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <bcd708f3ec195870fa2b0d147c8910fb63db7e0e.1533322594.git.jhaberku@scylladb.com>
(cherry picked from commit fce10f2c6e)
The calculation consists of several parts with preemption point between
them, so a table can be added while calculation is ongoing. Do not
assume that table exists in intermediate data structure.
Fixes#3636
Message-Id: <20180801093147.GD23569@scylladb.com>
(cherry picked from commit 44a6afad8c)
"
The problem happens under the following circumstances:
- we have a partially populated partition in cache, with a gap in the middle
- a read with no clustering restrictions trying to populate that gap
- eviction of the entry for the lower bound of the gap concurrent with population
The population may incorrectly mark the range before the gap as continuous.
This may result in temporary loss of writes in that clustering range. The
problem heals by clearing cache.
Caught by row_cache_test::test_concurrent_reads_and_eviction, which has been
failing sporadically.
The problem is in ensure_population_lower_bound(), which returns true if
current clustering range covers all rows, which means that the populator has a
right to set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts since before all
clustering rows. Otherwise, we're populating since _last_row and should
consult it.
Fixes#3608.
"
* 'tgrabiec/fix-violation-of-continuity-on-concurrent-read-and-eviction' of github.com:tgrabiec/scylla:
row_cache: Fix violation of continuity on concurrent eviction and population
position_in_partition: Introduce is_before_all_clustered_rows()
(cherry picked from commit 31151cadd4)
"
Changes made:
- switched the test to use do_with_cql_env_thread due to lack of SEASTAR_TEST_CASE_THREAD macro
- imported make_local_key() from master, needed for the database_test to pass
"
* tag 'tgrabiec/disable-min-max-sstable-filtering-v1-branch-2.1' of github.com:tgrabiec/scylla:
Merge "Disable sstable filtering based on min/max clustering key components" from Tomasz
tests: simple_schema: Generate local keys form make_pkeys()
tests: Import make_local_key() from master
"
With DateTiered and TimeWindow, there is a read optimization enabled
which excludes sstables based on overlap with recorded min/max values
of clustering key components. The problem is that it doesn't take into
account partition tombstones and static rows, which should still be
returned by the reader even if there is no overlap in the query's
clustering range. A read which returns no clustering rows can
mispopulate cache, which will appear as partition deletion or writes
to the static row being lost. Until node restart or eviction of the
partition entry.
There is also a bad interaction between cache population on read and
that optimization. When the clustering range of the query doesn't
overlap with any sstable, the reader will return no partition markers
for the read, which leads cache populator to assume there is no
partition in sstables and it will cache an empty partition. This will
cause later reads of that partition to miss prior writes to that
partition until it is evicted from cache or node is restarted.
Disable until a more elaborate fix is implemented.
Fixes#3552Fixes#3553
"
* tag 'tgrabiec/disable-min-max-sstable-filtering-v1' of github.com:tgrabiec/scylla:
tests: Add test for slicing a mutation source with date tiered compaction strategy
tests: Check that database conforms to mutation source
database: Disable sstable filtering based on min/max clustering key components
(cherry picked from commit e1efda8b0c)
On some build environment we may want to limit number of parallel jobs since
ninja-build runs ncpus jobs by default, it may too many since g++ eats very
huge memory.
So support --jobs <njobs> just like on rpm build script.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180425205439.30053-1-syuu@scylladb.com>
(cherry picked from commit 782ebcece4)
There is a bug in incremental_selector for partitioned_sstable_set, so
until it is found, stop using it.
This degrades scan performance of Leveled Compaction Strategy tables.
Fixes#3513. (as a workaround)
Introduced: 2.1
Message-Id: <20180613131547.19084-1-avi@scylladb.com>
(cherry picked from commit aeffbb6732)
ec2_snitch::gossiper_starting() calls for the base class (default) method
that sets _gossip_started to TRUE and thereby prevents to following
reconnectable_snitch_helper registration.
Fixes#3454
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1528208520-28046-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit 2dde372ae6)
Fixes argument misquoting at $SRPM_OPTS expansion for the mock commands
and makes the --jobs argument work as supposed.
Signed-off-by: Mika Eloranta <mel@aiven.io>
Message-Id: <20180113212904.85907-1-mel@aiven.io>
(cherry picked from commit 7266446227)
We are currently moving the pointer we acquired to the segment inside
the lambda in which we'll handle the cycle.
The problem is, we also use that same pointer inside the exception
handler. If an exception happens we'll access it and we'll crash.
Probably #3440.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180518125820.10726-1-glauber@scylladb.com>
(cherry picked from commit 596a525950)
This parameter is not available on recent Red Hat kernels or on
non-Red Hat kernels (it was removed on 3.10.0-772.el7,
RHBZ 1455932). The presence of the parameter on kernels that don't
support it cause the module load to fail, with the result that the
storage is not available.
Fix by removing the parameter. For someone running an older Red Hat
kernel the effect will be that discard is disabled, but they can fix
that by updating the kernel. For someone running a newer kernel, the
effect will be that they can access their data.
Fixes#3437.
Message-Id: <20180516134913.6540-1-avi@scylladb.com>
(cherry picked from commit 3b8118d4e5)
Dropping a user type requires that all tables using that type also be
dropped. However, a type may appear to be dropped at the same time as
a table, for instance due to the order in which a node receives schema
notifications, or when dropping a keyspace.
When dropping a table, if we build a schema in a shard through a
global_schema_pointer, then we'll check for the existence of any user
type the schema employs. We thus need to ensure types are only dropped
after tables, similarly to how it's done for keyspaces.
Fixes#3068
Tests: unit-tests (release)
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180129114137.85149-1-duarte@scylladb.com>
(cherry picked from commit 1e3fae5bef)
When 'always_set_home' is specified on /etc/sudoers pbuilder won't read
.pbuilderrc from current user home directory, and we don't have a way to change
the behavor from sudo command parameter.
So let's use ~root/.pbuilderrc and switch to HOME=/root when sudo executed,
this can work both environment which does specified always_set_home and doesn't
specified.
Fixes#3366
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1523926024-3937-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit ace44784e8)
There is a race between cql connection closure and notifier
registration. If a connection is closed before notification registration
is complete stale pointer to the connection will remain in notification
list since attempt to unregister the connection will happen to early.
The fix is to move notifier unregisteration after connection's gate
is closed which will ensure that there is no outstanding registration
request. But this means that now a connection with closed gate can be in
notifier list, so with_gate() may throw and abort a notifier loop. Fix
that by replacing with_gate() by call to is_closed();
Fixes: #3355
Tests: unit(release)
Message-Id: <20180412134744.GB22593@scylladb.com>
(cherry picked from commit 1a9aaece3e)
Empty partition keys are not supported on normal tables - they cannot
be inserted or queried (surprisingly, the rules for composite
partition keys are different: all components are then allowed to be
empty). However, the (non-composite) partition key of a view could end
up being empty if that column is: a base table regular column, a
base table clustering key column, or a base table partition key column,
part of a composite key.
Fixes#3262
Refs CASSANDRA-14345
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180403122244.10626-1-duarte@scylladb.com>
(cherry picked from commit ec8960df45)
start node 1 2 3
shutdown node2
shutdown node1 and node3
start node1 and node3
nodetool removenode node2
clean up all scylla data on node2
bootstrap node2 as a new node
I saw node2 could not bootstrap stuck at waiting for schema information to compelte for ever:
On node1, node3
[shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704
On node2
[shard 0] storage_service - JOINING: waiting for schema information to complete
This is becasue in nodetool removenode operation, the generation of node1 was increased from 0 to 2.
gossiper::advertise_removing () calls eps.get_heart_beat_state().force_newer_generation_unsafe();
gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe();
Each force_newer_generation_unsafe increases the generation by 1.
Here is an example,
Before nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
"addrs": "127.0.0.2",
"generation": 0,
"is_alive": false,
"update_time": 1521778757334,
"version": 0
},
```
After nodetool revmoenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
"addrs": "127.0.0.2",
"application_state": [
{
"application_state": 0,
"value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246",
"version": 214
},
{
"application_state": 6,
"value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a",
"version": 153
}
],
"generation": 2,
"is_alive": false,
"update_time": 1521779276246,
"version": 0
},
```
In gossiper::apply_state_locally, we have this check:
```
if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
// assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}",ep, local_generation, remote_generation);
}
```
to skip the gossip update.
To fix, we relax generation max difference check to allow the generation
of a removed node.
After this patch, the removed node bootstraps successfully.
Tests: dtest:update_cluster_layout_tests.py
Fixes#3331
Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>
(cherry picked from commit f539e993d3)
The partition tombstone is not part of a mutation_fragment in the old
streamed_mutation, so it was not scattered correctly by fragment_scatterer.
This causes test failures if the mutations to be scattered have a partition
tombstone.
Fix by calling consume(tombstone) directly. This isn't nice, but the code
is dead anyway.
"
This fixes an abort in an sstable reader when querying a partition with no
clustering ranges (happens on counter table mutation with no live rows) which
also doesn't have any static columns. In such case, the
sstable_mutation_reader will setup the data_consume_context such that it only
covers the static row of the partition, knowing that there is no need to read
any clustered rows. See partition.cc::advance_to_upper_bound(). Later when
the reader is done with the range for the static row, it will try to skip to
the first clustering range (missing in this case). If clustering_ranges_walker
tells us to skip to after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to attempt to skip past the
original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine because we're still at the
same data file position.
Fixes#3304.
"
* 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Test reads with no clustering ranges and no static columns
tests: simple_schema: Allow creating schema with no static column
clustering_ranges_walker: Stop after static row in case no clustering ranges
(cherry picked from commit 054854839a)
This change is backported from 092f2e659c.
Previously, the sharded permissions cache was only accessible to the
implementation of `auth::service` in `auth/service.cc`. The intention
was that invoking `auth::service::get_permissions` on shard `k` would
query the cache on shard `k`, which would in turn depend on
`auth::service` on shard k to check for superuser status.
The problem is in `auth::service::start`.
`seastar::sharded<auth::permissions_cache>::start` is invoked with
`*this` of shard 0, causing all instances of the cache to reference the
same object.
I wasn't able to locally reproduce errors or crashes due to this bug
when I compiled a release build of Scylla. However, running a debug
build meant that the glorious `seastar::debug_shared_ptr_counter_type`
quickly saved the day with its checks that `seastar::shared_ptr` isn't
being misused.
To eliminate this problem, we move ownership of a single instance of
`auth::permissions_cache` to a single instance of `auth::service`. When
`auth::service` is sharded, so is the permissions cache.
I verified interactively that no assertions failed in debug mode with
this change.
Fixes#3296.
Tests: unit (debug, release)
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <280a889f551180db1c00d8a80eddf85b2ff0ac60.1521696176.git.jhaberku@scylladb.com>
In gossiper::handle_major_state_change() we set the endpoint_state for
a particular endpoint and replicate the changes to other cores.
This is totally unsynchronized with the execution of
gossiper::evict_from_membership(), which can happen concurrently, and
can remove the very same endpoint from the map (in all cores).
Replicating the changes to other cores in handle_major_state_change()
can interleave with replicating the changes to other cores in
evict_from_membership(), and result in an undefined final state.
Another issue happened in debug mode dtests, where a fiber executes
handle_major_state_change(), calls into the subscribers, of which
storage_service is one, and ultimately lands on
storage_service::update_peer_info(), which iterates over the
endpoint's application state with deferring points in between (to
update a system table). gossiper::evict_from_membership() was executed
concurrently by another fiber, which freed the state the first one is
iterating over.
Fixes#3299.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180318123211.3366-1-duarte@scylladb.com>
(cherry picked from commit 810db425a5)
If there are a lot of ranges, e.g., num_tokens=2048, 10 ranges per
stream plan will cause tons of stream plan to be created to stream data,
each having very few data. This cause each stream plan has low transfer
bandwidth, so that the total time to complete the streaming increases.
It makes more sense to send a percentage of the total ranges per stream
plan than a fixed ranges.
Here is an example to stream a keyspace with 513 ranges in
total, 10 ranges v.s. 10% ranges:
Before:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 51
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 107 seconds
After:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 10
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 22 seconds
Message-Id: <a890b84fbac0f3c3cc4021e30dbf4cdf135b93ea.1520992228.git.asias@scylladb.com>
(cherry picked from commit 9b5585ebd5)
This reverts commit f792c78c96.
With the "Use range_streamer everywhere" (7217b7ab36) series,
all the user of streaming now do streaming with relative small ranges
and can retry streaming at higher level.
Reduce the time-to-recover from 5 hours to 10 minutes per stream session.
Even if the 10 minutes idle detection might cause higher false positive,
it is fine, since we can retry the "small" stream session anyway. In the
long term, we should replace the whole idle detection logic with
whenever the stream initiator goes away, the stream slave goes away.
Message-Id: <75f308baf25a520d42d884c7ef36f1aecb8a64b0.1520992219.git.asias@scylladb.com>
(cherry picked from commit ad7b132188)
"
Refs #2692Fixes#3246
The current restricting algorithm [1] restricts the active-reader queue
based on the memory consumption of the existing active readers. When
this memory consumption is above the limit new readers are not admitted.
The inactive reader queue on the other hand has a fixed length.
This caused performance regressions on two workloads:
* read-only: since the inactive-reader queue length is severly limited
(compared to the previous situation) reads will timeout at loads
comfortably handled before.
* mixed: since the memory consumption happens only at admission time
(already created active readers are not limited) memory consumption
growed significantly causing problems when compactions kicked in.
The solution is to reintroduce the old limit of 100 active concurrent
user-reads while still keeping the memory-based limit as well. For
workloads that don't consume a lot of memory or on large boxes with lots
of memory the count-based limit will be reached which is reverting to the
old well-known behaviour. For memory-hungry workloads or on small boxes
with little memory the memory based-limit will kick in sooner avoiding
memory overconsumption.
[1] introduced by bdbbfe9390
"
* 'restricted-reader-dual-limit/v3-backport-2.1' of https://github.com/denesb/scylla:
Modify unit tests so that they test the dual-limits
Use the reader_concurrency_semaphore to limit reader concurrency
Add reader_concurrency_semaphore
Add reader_resource_tracker param to mutation_source
mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh
This semaphore implements the new dual, count and memory based active
reader limiting. As purely memory-based limiting proved to cause
problems on big boxes admitting a large number of readers (more than any
disk could handle) the previous count-based limit is reintroduced in
addition to the existing memory-based limit.
When creating new readers first the count-based limit is checked. If
that clears the memory limit is checked before admitting the reader.
reader_conccurency_semaphore wraps the two semaphores that implement
these limits and enforces the correct order of limit checking.
This class also completely replaces the restricted_reader_config struct,
it encapsulates all data and related functinality of the latter, making
client code simpler.
Soon, reader_resource_tracker will only be constructible after the
reader has been admitted. This means that the resource tracker cannot be
preconstructed and just captured by the lambda stored in the mutation
source and instead has to be passed in along the other parameters.
In preparation to reader_concurrency_semaphore being added to the file.
The reader_resource_tracker is really only a helper class for
reader_concurrency_semaphore so the latter is better suited to provide
the name of the file.
This patch takes a modified version of the Ubuntu 14.04 housekeeping
service script and uses it in Docker to validate the current version.
To disable the version validation, pass the --disable-version-check flag
when running the container.
Message-Id: <20180220161231.1630-1-amnon@scylladb.com>
(cherry picked from commit edcfab3262)
Since we splited scylla-housekeeping service to two different services for systemd, we don't share same service name between systemd and upstart anymore.
So handle it independently for each distribution, try to install
/etc/init/scylla-housekeeping.conf on Ubuntu 14.04.
Fixes#3239
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1519852659-10688-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 101e909483)
Debian and ubuntu list files come in two variations.
The housekeeping should support both.
This patch change the regexp that match the os in the repository file.
After the introduction of the second list variation, the os name can be in the middle of the path not only at the end.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180227092543.19538-1-amnon@scylladb.com>
(cherry picked from commit 57d46c6959)
swap_tree() doesn't change the color of the header, and becasue header
was not initialized, it is undefined (can be both red or black). One
problem this causes is that algo::is_header() expects the header to be
always red. It is used by unlink(), which for trees which have a black
header would infinite-loop.
The fix is to initialize the header.
Fixes#3242.
Message-Id: <1519815091-13111-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 30635510a2)
The test inserts some values with a TTL of 1 second and then
reads them back expecting them not to be expired yet. That may not
always be the case if the machine is slow and we are running in the
debug mode. Increasising the TTLs by x100 should help avoid these
false positives.
Message-Id: <20180219133816.17452-1-pdziepak@scylladb.com>
(cherry picked from commit d97eebe82d)
Commit 6ccd317 introduced a bug in partition_entry::evict() where a
partition entry may be partially evicted if there are non-evictable
snapshots in it. Partially evicting some of the versions may violate
consistency of a snapshot which includes evicted versions. For one,
continuity flags are interpreted realtive to the merged view, not
within a version, so evicting from some of the versions may mark
reanges as continuous when before they were discontinuous. Also, range
tombtsones of the snapshot are taken from all versions, so we can't
partially evict some of them without marking all affected ranges as
discontinuous.
The fix is to revert back to full eviciton, and avoid moving
non-evictable snapshots to cache. When moving whole partition entry to
cache, we first create a neutral empty partition entry and then merge
the memtable entry into it just like we would if the entry already
existed.
Fixes#3215.
Tests: unit (release)
Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit b0b57b8143)
The labels in database active_reads metrics where not define correctly.
Label should be created so it will be possible to select based on their
value.
The current implementation define a label "class" with three instances:
user, streaming, system.
Fixes: #2770
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180123125206.23660-1-amnon@scylladb.com>
(cherry picked from commit a0a1961b6d)
"When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.
Solution is to record evictability of partition version references (snapshots)
and avoiding eviction from non-evictable snapshots.
Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry. Found during code review.
Introduced in ca8e3c4, so affects 2.1+
Fixes#3186.
Tests: unit (release)"
* 'tgrabiec/do-not-evict-memtable-snapshots' of github.com:tgrabiec/scylla:
tests: mvcc: Add test for eviction with non-evictable snapshots
mutation_partition: Define + operator on tombstones
tests: mvcc: Check that partition is fully discontinuous after eviction
tests: row_cache: Add test for memtable readers surviving flush and eviction
memtable: Make printable
mvcc: Take partition_entry by const ref in operator<<()
mvcc: Do not evict from non-evictable snapshots
mvcc: Drop unnecessary assignment to partition_snapshot::_version
tests: Use partition_entry::make_evictable() where appropriate
mvcc: Encapsulate construction of evictable entries
(cherry picked from commit 6ccd317c38)
Race condition was introduced by commit 028c7a0888, which introduces chunk offset
compression, because a reading state is kept in the compress structure which is
supposed to be immutable and can be shared among shards owning the same sstable.
So it may happen that shard A updates state while shard B relies on information
previously set which leads to incorrect decompression, which in turn leads to
read misbehaving.
We could serialize access to at() which would only lead to contention issues for
shared sstables, but that can be avoided by moving state out of compress structure
which is expected to be immutable after sstable is loaded and feeded to shards that
own it. Sequential accessor (wraps state and reference to segmented_offset) is
added to prevent at() and push_back() interfaces from being polluted.
Tests: release mode.
Fixes#3148.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180205192432.23405-1-raphaelsc@scylladb.com>
(cherry picked from commit 09f4ee808f)
These patches deal with the remaining exception safety issues in the
memtable partition range readers. That includes moving the assignment
to iterator_reader::_last outside of allocating section to avoid
problems caused by exception-unsafe assignment operator. Memory
accotuning code is also moved out of the retryable context to improve
the code robustness and avoid potential problems in the future.
Fixes#3172.
* https://github.com/pdziepak/scylla.git memtable-range-read-exception-safety-2.1/v1:
memtable: do not update iterator_reader::_last in alloc section
memtable: do not change accounting state in alloc section
tests/memtable: add more reader exception safety tests
These patches change the memtable reader implementation (in particular
partition_snapshot_reader) so that the existing exception safety
paroblems are fixed, but also in a way that, hopefully, would make it
easier to reason about the error handling and avoid future bugs in that
area.
The main difficulty related to exception safety is that when an
exception is thrown out of an allocating section that code is run again
with increased memory reserved. If the retryable code has side effects
it is very easy to get incorrect behaviour.
In addition to that, entering an allocating section is not exactly cheap
which encourages doing so rarely and having large sections.
The approach taken by this series is to, first, make entering allocating
sections cheaper and then reducing the amount of logic that runs inside
of them to a minimum.
This means that instead of entering a section once per a call to
flat_mutation_reader::fill_buffer() the allocation section is entered
once for each emitted row. The only state modified from within the
section are cached iterators to the current row, which are dropped on
retry. Hopefully, this would make the reader code easier to reason
about.
The optimisations to the allocating sections and managed_bytes
linearised context has successfully eliminated any penalty caused by
much more fine grained allocating sections.
Fixes#3123.
Fixes#3133.
Tests: unit-tests (release)
BEFORE
test iterations median mad min max
memtable.one_partition_one_row 1155362 869.139ns 0.282ns 868.465ns 873.253ns
memtable.one_partition_many_rows 127252 7.871us 15.252ns 7.851us 7.886us
memtable.many_partitions_one_row 58715 17.109us 2.765ns 17.013us 17.112us
memtable.many_partitions_many_rows 4839 206.717us 212.385ns 206.505us 207.448us
AFTER
test iterations median mad min max
memtable.one_partition_one_row 1194453 839.223ns 0.503ns 834.952ns 842.841ns
memtable.one_partition_many_rows 133785 7.477us 4.492ns 7.473us 7.507us
memtable.many_partitions_one_row 60267 16.680us 18.027ns 16.592us 16.700us
memtable.many_partitions_many_rows 4975 201.048us 144.929ns 200.822us 201.699us
./before_sq ./after_sq diff
read 337373.86 353694.24 4.8%
write 388759.99 394135.78 1.4%
* https://github.com/pdziepak/scylla.git memtable-exception-safety-2.1/v1:
flat_mutation_reader: add allocation point in push_mutation_fragment
linearization_context: remove non-trivial operations from fast path
lsa: split alloc section into reserving and reclamation-disabled parts
lsa: optimise disabling reclamation and invalidation counter
mutation_fragment: allow creating clustering row in place
paratition_snapshot_reader: minimise amount of retryable code
memtable: drop memtable_entry::read()
tests/memtable: add test for reader exception safety
Retryable code that has side effects is a recipe for bugs. This patch
reworkds the snapshot reader so that the amount of logic run with
reclamation disabled is minimal and has a very limited side effects.
Moving clustering_row is expensive due to amount of data stored
internally. Adding a mutation_fragment constructor that builds a
clustering_row in-place saves some of that moving.
Most of the lsa gory details are hidden in utils/logalloc.cc. That
includes the actual implementation of a lsa region: region_impl.
However, there is code in the hot path that often accesses the
_reclaiming_enabled member as well as its base class
allocation_strategy.
In order to optimise those accesses another class is introduced:
basic_region_impl that inherits from allocation_strategy and is a base
of region_impl. It is defined in utils/logalloc.hh so that it is
publicly visible and its member functions are inlineable from anywhere
in the code. This class is supposed to be as small as possible, but
contain all members and functions that are accessed from the fast path
and should be inlined.
Allocating sections reserves certain amount of memory, then disables
reclamation and attempts to perform given operation. If that fails due
to std::bad_alloc the reserve is increased and the operation is retried.
Reserving memory is expensive while just disabling reclamation isn't.
Moreover, the code that runs inside the section needs to be safely
retryable. This means that we want the amount of logic running with
reclamation disabled as small as possible, even if it means entering and
leaving the section multiple times.
In order to reduce the performance penalty of such solution the memory
reserving and reclamation disabling parts of the allocating sections are
separated.
Since linearization_context is thread_local every time it is accessed
the compiler needs to emit code that checks if it was already
constructed and does so if it wasn't. Moreover, upon leaving the context
from the outermost scope the map needs to be cleared.
All these operations impose some performance overhead and aren't really
necessary if no buffers were linearised (the expected case). This patch
rearranges the code so that lineatization_context is trivially
constructible and the map is cleared only if it was modified.
Exception safety tests inject a failure at every allocation and verify
whether the error is handled properly.
push_mutation_fragment() adds a mutation fragment to a circular_buffer,
in theory any call to that function can result in a memory allocation,
but in practice that depends on the implementation details. In order to
improve the effectiveness of the exception safety tests this patch adds
an explicit allocation point in push_mutation_fragment().
Fixes#3096
The credentials processing for transitional auth was broken
in ba6a41d, "auth: Switch to sharded service which effectively removed
the "virtualization" of underlying auth in the SASL challenge.
As a quick workaround, add the permissive exception handling to
sasl object as well.
Message-Id: <20180103102724.1083-1-calle@scylladb.com>
(cherry picked from commit 35b9ec868a)
* seastar 8d254a1...af1b789 (3):
> tls_test: Fix echo test not setting server trust store
> tls: Do not restrict re-handshake to client
> tls: Actually verify client certificate if requested
Fixes#3072
The reason sstable key estimation is inaccurate is that it doesn't account that
index sampling is now dynamic.
The estimation is done as follow:
uint64_t get_estimated_key_count() const {
return ((uint64_t)_components->summary.header.size_at_full_sampling + 1) *
_components->summary.header.min_index_interval;
}
The biggest problem is that _components->summary.header.min_index_interval isn't
actually the minimum interval, but instead the default interval value set in the
schema.
So the estimation gets worse the larger the average partition, because the larger
the average partition the lower the index sampling interval.
One of the problems is that estimation has a big influence on bloom filter size,
and so for large partitions we were generating bigger filters than we had to.
From now on, size at full sampling is calculated as if sampling were static
(which was the case until commit 8726ee937d which introduced size-based
sampling), using minimum index as a strict sampling interval.
Tests: units (release)
Fixes#3113.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180122233612.11147-1-raphaelsc@scylladb.com>
(cherry picked from commit 2c181b69c9)
"_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would result
in free memory in such zone to appear as being leaked due to corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.
The fix is to always collect such zones.
Fixes#3129
Refs #3119
Refs #3120"
* 'tgrabiec/fix-free_segments_in_zones-leak' of github.com:scylladb/seastar-dev:
tests: lsa: Test _free_segments_in_zones is kept correct on reclaim
lsa: Expose max_zone_segments for tests
lsa: Expose tracker::non_lsa_used_space()
lsa: Fix memory leak on zone reclaim
(cherry picked from commit 4ad212dc01)
The call chain is:
storage_service::on_change() -> storage_service::handle_state_removing()
-> storage_service::restore_replica_count() -> streamer->stream_async()
Listeners run as part of gossip message processing, which is serialized.
This means we won't be processing any gossip messages until streaming
completes.
In fact, there is no need to wait for restore_replica_count to complete
which can take a long time, since when it completes, this node will send
notification to tell the removal_coordinator that the restore process is
finished on this node. This node will be removed from _replicating_nodes
on the removal_coordinator.
Tested with update_cluster_layout_tests.py
Fixes#2886
Message-Id: <8b4fe637dfea6c56167ddde3ca86fefb8438ce96.1516088237.git.asias@scylladb.com>
(cherry picked from commit 5107b6ad16)
Commit 2d5fb9d109 (gms/gossiper: Replicate changes incrementally to
other shards) changes the way we replicate _token_metadata and
endpoint_state_map. Before they are replicated at the same time, after
they are not any more. This causes a shard in NORMAL status can still be
with a empty _token_metadata.
We saw errors:
[shard 12] token_metadata - sorted_tokens is empty in first_token_index!
during CorruptThenRepairNemesis.
Fix by setting the gossip status to NORMAL after replication of
_token_metadata, so that once a node is in NORMAL, we can do repair. The
commit 69c81bcc87 (repair: Do not allow repair until node is in NORMAL
status) prevents the early repair operation by checking if a node is in
NORMAL status.
Fixes#3121
Message-Id: <af6a223733d2e11351f1fa35f59eacfa7d65dd30.1516065564.git.asias@scylladb.com>
(cherry picked from commit 3c8ed255ac)
Problem introduced in 375ed938b4
Also remove redefinition of schema in dummy incremental selector
which is supposed to use the one in base class instead.
Following tests are fixed:
./build/release/tests/mutation_reader_test
./build/release/tests/sstable_test -- -c1
./build/release/tests/row_cache_test
./build/release/tests/cache_flat_mutation_reader_test
./build/release/tests/row_cache_stress_test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180111153831.17462-1-raphaelsc@scylladb.com>
On Debian 9, 'pbuilder create' fails because of lack of GPG key for
3rdparty repo, so we need --allow-untrusted on 'pbuilder create' and
'pbuilder update'.
Also, apt-key adv --fetch-keys does not works correctly on it, but we can use
"curl <URL> | apt-key add -" as workaround.
Fixes#3088
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513797714-18067-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit b68ee98310)
We provided "boost1.63" package for Debian 8 since we couldn't build
"scylla-boost163" package witch is available on Ubuntu14/16, but I fixed the
problem and now we have it for Debian 8 too, so switch to it.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1514220163-25985-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 51013f561d)
The legacy mutation_reader/streamed_mutation design allowed very easily
to skip the partition merging logic if there was only one underlying
reader that has emitted it.
That optimisation was lost after conversion to flat mutation readers
which has impacted the performance. This patch mostly recovers it by
bypassing most of mutation_reader_merger logic if there is only a single
active reader for a given partition.
The performance regression was introduced in
8731c1bc66 "Flatten the implementation of
combined_mutation_reader".
perf_simple_query -c4 read results (medians of 60):
original regression
before 8731c1 after 8731c1 diff
read 326241.02 300244.09 -8.0%
this patch
before after diff
read 313882.59 325148.05 3.6%
Message-Id: <20180103121019.764-1-pdziepak@scylladb.com>
(cherry picked from commit b4a4c04bab)
The uninitialized session has no peer associated with it yet. There is
no point sending the failed message when abort the session. Sending the
failed message in this case will send to a peer with uninitialized
dst_cpu_id which will casue the receiver to pass a bogus shard id to
smp::submit_to which cases segfault.
In addition, to be safe, initialize the dst_cpu_id to zero. So that
uninitialized session will send message to shard zero instead of random
bogus shard id.
Fixes the segfault issue found by
repair_additional_test.py:RepairAdditionalTest.repair_abort_test
Fixes#3115
Message-Id: <9f0f7b44c7d6d8f5c60d6293ab2435dadc3496a9.1515380325.git.asias@scylladb.com>
(cherry picked from commit 774307b3a7)
Using Materialized Views, if the base table has static columns,
and the update in base table mutates static and non static rows,
the streamed_mutation is stopped before process non static row.
The patch avoids stopping the stream_mutation and adds a test case.
Message-Id: <20171220173434.25091-1-tavares.george@gmail.com>
(cherry picked from commit ceecd542cd)
After 611774b, we're blind again on which sstable caused a compaction
to fail, leaving us with cryptic message as follow:
compaction_manager - compaction failed: std::runtime_error (compressed
chunk failed checksum)
After this change, now both read failure in compaction or regular read
will report the guilty sstable, see:
compaction_manager - compaction failed: std::runtime_error (SSTable reader
found an exception when reading sstable ./data/.../keyspace1-standard1
ka-1-Data.db : std::runtime_error(compressed chunk failed checksum))
Fixes#3006.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180102230752.14701-1-raphaelsc@scylladb.com>
(cherry picked from commit 4610e994e1)
In version 1.8.3 of JsonCpp shipped with Fedora 27, old FastWriter and
Reader classes from JsonCpp have been deprecated in favour of
newer/better ones: CharReaderBuilder/CharReader and
StreamWriterBuilder/StreamWriter.
This fix uses the new classes where available or resorts to old ones for
older versions of the library.
Fixes#2989
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
(cherry picked from commit 76775ddf26)
'"The issue is triggered by compaction of sstables of level higher than 0.
The problem happens when interval map of partitioned sstable set stores
intervals such as follow:
[-9223362900961284625 : -3695961740249769322 ]
(-3695961740249769322 : -3695961103022958562 ]
When selector is called for first interval above, the exclusive lower
bound of the second interval is returned as next token, but the
inclusivess info is not returned.
So reader_selector was returning that there *were* new readers when
the current token was -3695961740249769322 because it was stored in
selector position field as inclusive, but it's actually exclusive.
This false positive was leading to infinite recursion in combined
reader because sstable set's incremental selector itself knew that
there were actually *no* new readers, and therefore *no* progress
could be made."
Fixes #2908.'
* 'high_level_compaction_infinite_recursion_fix_v4' of github.com:raphaelsc/scylla:
tests: test for infinite recursion bug when doing high-level compaction
Fix potential infinite recursion when combining mutations for leveled compaction
dht: make it easier to create ring_position_view from token
dht: introduce is_min/max for ring_position
(cherry picked from commit 375ed938b4)
Migrate all the places that used memtable::make_reader to use
memtable::make_flat_reader and remove memtable::make_reader.
* seastar-dev.git haaawk/remove_memtable_make_reader_v2_rebased:
Remove memtable::make_reader
Stop using memtable::make_reader in row_cache_stress_test
Stop using memtable::make_reader in row_cache_test
Stop using memtable::make_reader in mutation_test
Stop using memtable::make_reader in streamed_mutation_test
Stop using memtable::make_reader in memtable_snapshot_source.hh
Stop using memtable::make_reader in memtable::apply
Add consume_partitions(flat_mutation_reader& reader, Consumer consumer)
Add default parameter values in make_combined_reader
Migrate test_virtual_dirty_accounting_on_flush to flat reader
Migrate test_adding_a_column_during_reading_doesnt_affect_read_result
Simplify flat_reader_assertions& produces(const mutation& m)
Migrate test_partition_version_consistency_after_lsa_compaction_happens
flat_mutation_reader: Allow setting buffer capacity
Add next_mutation() to flat_mutation_reader_assertions
cf::for_all_partitions::iteration_state: don't store schema_ptr
read_mutation_from_flat_mutation_reader: don't take schema_ptr
Migrate test_fast_forward_to_after_memtable_is_flushed to flat reader
(cherry picked from commit b0a56a91c2)
'char' and int8_t ('unsigned char') are different types. 'bytes' base type
is int8_t - use the correct type for casting.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
(cherry picked from commit 22ca5d2596)
The following patches contain fixes for skipping to the next parititon
in multi_range_reader and completelty dissable support for fast
forwarding inside a single partition, which is not needed and would only
add unnecessary complexity.
* https://github.com/pdziepak/scylla.git fix-multi_range_reader/v1:
flat_multi_range_mutation_reader: disallow
streamed_mutation::forwarding
flat_multi_range_mutation_reader: clear buffer on next_partition()
tests/flat_multi_range_mutation_reader: test skipping to next
partition
(cherry picked from commit 71cc63dfa6)
In the case there are large number of column families, the sender will
send all the column families in parallel. We allow 20% of shard memory
for streaming on the receiver, so each column family will have 1/N, N is
the number of in-flight column families, memory for memtable. Large N
causes a lot of small sstables to be generated.
It is possible there are multiple senders to a single receiver, e.g.,
when a new node joins the cluster, the maximum in-flight column families
is number of peer node. The column families are sent in the order of
cf_id. It is not guaranteed that all peers has the same speed so they
are sending the same cf_id at the same time, though. We still have
chance some of the peers are sending the same cf_id.
Fixes#3065
Message-Id: <46961463c2a5e4f1faff232294dc485ac4f1a04e.1513159678.git.asias@scylladb.com>
(cherry picked from commit a9dab60b6c)
end_bound was not updated in one of the cases in which end and
end_kind was changed, as a result later merging decision using
end_bound were incorrect. end_bound was using the new key, but the old
end_kind.
Fixes#3083.
Message-Id: <1513772083-5257-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit dfe48bbbc7)
"4b9a34a85425d1279b471b2ff0b0f2462328929c "Merge sstable_data_source
into sstable_mutation_reader" has introduced unintentional changes, some
of them causing excessive read amplification during empty range reads.
The following patches restore the previous behaviour."
* tag 'fix-read-amplification/v1' of https://github.com/pdziepak/scylla:
sstables: set _read_enabled to false if possible
sstables: set _single_partition_read for single parititon reads
(cherry picked from commit 772d1f47d7)
Don't enforce the outgoing connections from the 'listen_address'
interface only.
If 'local_address' is given to connect() it will enforce it to use a
particular interface to connect from, even if the destination address
should be accessed from a different interface. If we don't specify the
'local_address' the source interface will be chosen according to the
routing configuration.
Fixes#3066
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1513372688-21595-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit be6f8be9cb)
"Fixes #3057."
* 'summary_recreation_fixes_v2' of github.com:raphaelsc/scylla:
tests: sstable summary recreation sanity test
sstables: make loading of sstable without summary to work again
sstables: fix summary generation with dynamic index sampling
(cherry picked from commit 11de20fc33)
Currently scylla-housekeeping-daily.service/-restart.service hardcoded
"--repo-files '/etc/yum.repos.d/scylla*.repo'" to specify CentOS .repo file,
but we use same .service for Ubuntu/Debian.
It doesn't work correctly, we need to specify .list file for Debian variants.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1513385159-15736-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit c2e87f4677)
There are some users used original default cluster_name 'Test Cluster',
they will fail to start the node for cluster_name change if they use
new scylla.yaml.
'ScyllaDB Cluster' isn't more beautiful than 'Test Cluster', reset back
to original old to avoid problem for users.
Fixes#3060
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <8c9dab8a64d0f4ab3a5d6910b87af696c60e5076.1513072453.git.amos@scylladb.com>
(cherry picked from commit b07de93636)
"While aa8c2cbc16 'Merge "Migrate sstables
to flat_mutation_reader" from Piotr' has converted the low-level sstable
reader to the new flat_mutation_reader interface there were still
multiple readers related to sstables that required converting,
including:
- restricted reader
- filtering reader
- single partition sstable reader
This series completes their conversion to the flat stream interface."
* tag 'flat_mutation_reader-sstable-readers/v2' of https://github.com/pdziepak/scylla:
db: convert single_key_sstalbe_reader to flat streams
db: fully convert incremental_reader_selector to flat readers
db: make make_range_sstable_reader() return flat reader
db: make column_family::make_reader() return flat reader
db: make column_family::make_sstable_reader() return a flat reader
filtering_reader: switch to flat mutation fragment streams
filtering_reader: pass a const dht::decorated_key& to the callback
mutation_reader: drop make_restricted_reader()
db: use make_restricted_flat_reader
mutation_reader: convert restricted reader to flat streams
(cherry picked from commit 6cb3b29168)
We have had an issue recently where failed SSTable writes left the
generated SSTables dangling in a potentially invalid state. If the write
had, for instance, started and generated tmp TOCs but not finished,
those files would be left for dead.
We had fixed this in commit b7e1575ad4,
but streaming memtables still have the same isse.
Note that we can't fix this in the common function
write_memtable_to_sstable because different flushers have different
retry policies.
Fixes#3062
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171213011741.8156-1-glauber@scylladb.com>
(cherry picked from commit 1aabbc75ab)
"Fixes cache reader to not skip over data in some cases involving overlapping
range tombstones in different partition versions and discontinuous cache.
Introduced in 2.0
Fixes #3053."
* tag 'tgrabiec/fix-range-tombstone-slicing-v2' of github.com:scylladb/seastar-dev:
tests: row_cache: Add reproducer for issue #3053
tests: mvcc: Add test for partition_snapshot::range_tombstones()
mvcc: Optimize partition_snapshot::range_tombstones() for single version case
mvcc: Fix partition_snapshot::range_tombstones()
tests: random_mutation_generator: Do not emit dummy entries at clustering row positions
(cherry picked from commit 051cbbc9af)
"Didn't affect any release. Regression introduced in 301358e.
Fixes#3041"
* 'resharding_fix_v4' of github.com:raphaelsc/scylla:
tests: add sstable resharding test to test.py
tests: fix sstable resharding test
sstables: Fix resharding by not filtering out mutation that belongs to other shard
db: introduce make_range_sstable_reader
rename make_range_sstable_reader to make_local_shard_sstable_reader
db: extract sstable reader creation from incremental_reader_selector
db: reuse make_range_sstable_reader in make_sstable_reader
(cherry picked from commit d934ca55a7)
When fast forwarding is enabled and all readers positioned inside the
current partition return EOS, return EOS from the combined-reader
too. Instead of skipping to the next partition if there are idle readers
(positioned at some later partition) available. This will cause rows to
be skipped in some cases.
The fix is to distinguish EOS'd readers that are only halted (waiting
for a fast-forward) from thoose really out of data. To achieve this we
track the last fragment-kind the reader emitted. If that was a
partition-end then the reader is out of data, otherwise it might emit
more fragments after a fast-forward. Without this additional information
it is impossible to determine why a reader reached EOS and the code
later may make the wrong decision about whether the combined-reader as
a whole is at EOS or not.
Also when fast-forwarding between partition-ranges or calling
next_partition() we set the last fragment-kind of forwarded readers
because they should emit a partition-start, otherwise they are out of
data.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <6f0b21b1ec62e1197de6b46510d5508cdb4a6977.1512569218.git.bdenes@scylladb.com>
(cherry picked from commit 9661769313)
"Convert combined_mutation_reader into a flat_mutation_reader impl. For
now - in the name of incremental progress - all consumers are updated to
use the combined reader through the
mutation_reader_from_flat_mutation_reader adaptor. The combined reader also
uses all it's sub mutation_readers through the
flat_mutation_reader_from_mutation_reader adaptor."
* 'bdenes/flatten-combined-reader-v8' of https://github.com/denesb/scylla:
Add unit tests for the combined reader - selector interactions
Add flat_mutation_reader overload of make_combined_reader
Flatten the implementation of combined_mutation_reader
Add mutation_fragment_merger
mutation_fragment::apply(): handle partition start and end too
Add non-const overload of partition_start::partition_tombstone()
Make combined_mutation_reader a flat_mutation_reader
Move the mutation merging logic to combined_mutation_reader
Remove the unnecessary indirection of mutation_reader_merger::next()
Move the implementation of combined_mutation_reader into mutation_reader_merger
Remove unused mutation_and_reader::less_compare and operator<
(cherry picked from commit 046991b0b7)
"This series makes sstable tests use flat stream interface. The main
motivation is to allow eventual removal of mutation_reader and
streamed_mutation and ensuring that the conversion between the
interfaces doesn't hide any bugs that would be otherwise found."
* tag 'flat_mutation_reader-sstable-tests/v1' of https://github.com/pdziepak/scylla:
sstables: drop read_range_rows()
tests/mutation_reader: stop using read_range_rows()
incremental_reader_selector: do not use read_range_rows()
tests/sstable: stop using read_range_rows()
sstables: drop read_row()
tests/sstables: use read_row_flat() instead of read_row()
database: use read_row_flat() instead of read_row()
tests/sstable_mutation_test: get flat_mutation_readers from mutation sources
tests/sstables: make sstable_reader return flat_mutation_reader
sstable: drop read_row() overload accepting sstable::key
tests/sstable: stop using read_row() with sstable::key
tests/flat_mutation_reader_assertions: add has_monotonic_positions()
tests/flat_mutation_reader_assertions: add produces(Range)
tests/flat_mutation_reader_assertions: add produces(mutation)
tests/flat_mutation_reader_assertions: add produces(dht::decorated_key)
tests/flat_mutation_reader_assertions: add produces(mutation_fragment::kind)
tests/flat_mutation_reader_assertions: fix fast forwarding
(cherry picked from commit 601a03dda7)
Since pbuilder chroot environment does not install CA certificates by default,
accessing https://download.opensuse.org will cause certificate verification
error.
So we need to install it before installing 3rdparty repo GPG key.
Also, checking existance of gpgkeys_curl is not needed, since it's always
not installed since we are running the script in clean chroot environment.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512517001-27524-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 8f02967a3b)
We are getting package build error on dh_auto_install which is invoked by
pybuild.
But since we handle all installation on debian/scylla-server.install, we can
simply skip running dh_auto_install.
Fixes#3036
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1512065117-15708-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 25bc18b8ff)
The universal reference was introduced so we could bind an rvalue to
the argument, but it would have sufficed to make the argument a const
reference. This is also more consistent with the function's other
overload.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171129132758.19654-1-duarte@scylladb.com>
(cherry picked from commit cda3ddd146)
Fix two issues with serializing non-compound range tombstones as
compound: convert a non-compound clustering element to compound and
actually advertise the issue to other nodes.
* git@github.com:duarten/scylla.git rt-compact-fixes/v1:
compound_compact: Allow rvalues in size()
sstables/sstables: Convert non-compound clustering element to compound
tests/sstable_mutation_test: Verify we can write/read non-correct RTs
service/storage_service: Export non-compound RT feature
(cherry picked from commit e9cce59b85)
after 7f8b62bc0b, its move operator and ctor broke. That potentially
leads to error because data_consume_context dtor moves sstable ref
to continuation when waiting for in-flight reads from input stream.
Otherwise, sstable can be destroyed meanwhile and file descriptor
would be invalid, leading to EBADF.
Fixes#3020.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171129014917.11841-1-raphaelsc@scylladb.com>
(cherry picked from commit f699cf17ae)
This series converts memtable flush reader to the new flat mutation
readers. Just like the scanning reader, flush reader concatenates
multiple partition snapshot readers in order to provide a stream
of all partitions in the memtable.
* https://github.com/pdziepak/scylla.git flat_mutation_reader-memtable-flush/v1
tests/flat_mutation_reader_assertion: add produces_partition()
memtable: make make_flush_reader() return flat_mutation_reader
flat_mutation_reader: add optimised flat_mutation_reader_opt
memtable: switch flush reader implementation to flat streams
tests/memtable: add test for flush reader
(cherry picked from commit 04106b4c96)
These tests now require having the storage service initialize, which
is needed to decide whether correct non-compound range tombstones
should be emitted or not.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171126152921.5199-1-duarte@scylladb.com>
(cherry picked from commit 922f095f22)
The following patches convert sstable writers to use flat mutation
readers instead of the legacy mutation_reader interface.
Writers were already using flat consumer interface and used
consume_flattened_in_thread(), so most of the work was limited to
providing an appropriate equivalent for flat mutation readers.
* https://github.com/pdziepak/scylla.git flat_mutation_reader-sstable-write/v1:
flat_mutation_reader: move consumer_adapter out of consume()
flat_mutation_reader: introduce consume_in_thread()
tests/flat_mutation_reader: test consume_in_thread()
sstables: switch write_components() to flat_mutation_reader
streamed_mutation: drop streamed_mutation_returning()
sstables: convert compaction to flat_mutation_reader
mutation_reader: drop consume_flattened_in_thread()
(cherry picked from commit 596ebaed1f)
This series mainly fixes issues with the serialization of promoted
index entries for non-compound schemas and with the serialization of
range tombstones, also for non-compound schemas.
We lift the correct cell name writing code into its own function,
and direct all users to it. We also ensure backward compatibility with
incorrectly generated promoted indexes and range tombstones.
Fixes#2995Fixes#2986Fixes#2979Fixes#2992Fixes#2993
* git@github.com:duarten/scylla.git promoted-index-serialization/v3:
sstables/sstables: Unify column name writers
sstables/sstables: Don't write index entry for a missing row maker
sstables/sstables: Reuse write_range_tombstone() for row tombstones
sstables/sstables: Lift index writing for row tombstones
sstables/sstables: Leverage index code upon range tombstone consume
sstables/sstables: Move out tombstone check in write_range_tombstone()
sstables/sstables: A schema with static columns is always compound
sstables/sstables: Lift column name writing logic
sstables/sstables: Use schema-aware write_column_name() for
collections
sstables/sstables: Use schema-aware write_column_name() for row marker
sstables/sstables: Use schema-aware write_column_name() for static row
sstables/sstables: Writing promoted index entry leverages
column_name_writer
sstables/sstables: Add supported feature list to sstables
sstables/sstables: Don't use incorrectly serialized promoted index
cql3/single_column_primary_key_restrictions: Implement is_inclusive()
cql3/delete_statement: Constrain range deletions for non-compound
schemas
tests/cql_query_test: Verify range deletion constraints
sstables/sstables: Correctly deserialize range tombstones
service/storage_service: Add feature for correct non-compound RTs
tests/sstable_*: Start the storage service for some cases
sstables/sstable_writer: Prepare to control range tombstone
serialization
sstables/sstables: Correctly serialize range tombstones
tests/sstable_assertions: Fix monotonicity check for promoted indexes
tests/sstable_assertions: Assert a promoted index is empty
tests/sstable_mutation_test: Verify promoted index serializes
correctly
tests/sstable_mutation_test: Verify promoted index repeats tombstones
tests/sstable_mutation_test: Ensure range tombstone serializes
correctly
tests/sstable_datafile_test: Add test for incorrect promoted index
tests/sstable_datafile_test: Verify reading of incorrect range
tombstones
sstables/sstable: Rename schema-oblivious write_column_name() function
sstables/sstables: No promoted index without clustering keys
tests/sstable_mutation_test: Verify promoted index is not generated
sstables/sstables: Optimize column name writing and indexing
compound_compat: Don't assume compoundness
(cherry picked from commit bd1efbc25c)
TTL of 1 second may cause the cell to expire right after we write it,
if the second component of current time changes right after it. Use
larger ttl to avoid spurious faliures due to this.
Message-Id: <1511463392-1451-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 35e404b1a2)
Prometheus histograms have 3 embedded metrics: count, buckets, and sum.
Currently we fill up count and buckets but sum is left at 0. This is
particularly bad, since according to the prometheus documentation, the
best way to calculate histogram averages is to write:
rate(metric_sum[5m]) / rate(metric_count[5m])
One way of keeping track of the sum is adding the value we sampled,
every time we sample. However, the interface for the estimated histogram
has a method that allows to add a metric while allowing to adjust the
count for missing metrics (add_nano())
That makes acumulating a sum inaccurate--as we will have no values for
the points that were added. To overcome that, when we call add_nano(),
we pretend we are introducing new_count - _count metrics, all with the
same value.
Long term, doing away with sampling may help us provide more accurate
results.
After this patch, we are able to correctly calculate latency averages
through the data exported in prometheus.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171122144558.7575-1-glauber@scylladb.com>
(cherry picked from commit 6c4e8049a0)
* seastar-dev.git haaawk/flat_reader_remove_read_rows:
sstable_mutation_test: use read_rows_flat instead of read_rows
perf_sstable: use read_rows_flat instead of read_rows
Remove sstable::read_rows
(cherry picked from commit e9ffe36d65)
Introduce sstable::read_row_flat and sstable::read_range_rows_flat methods
and use them in sstable::as_mutation_source.
* https://github.com/scylladb/seastar-dev/tree/haaawk/flat_reader_sstables_v3:
Introduce conversion from flat_mutation_reader to streamed_mutation
Add sstables::read_rows_flat and sstables::read_range_rows_flat
Turn sstable_mutation_reader into a flat_mutation_reader
sstable: add getter for filter_tracker
Move mp_row_consumer methods implementations to the bottom
Remove unused sstable_mutation_reader constructor
Replace "sm" with "partition" in get_next_sm and on_sm_finished
Move advance_to_upper_bound above sstable_mutation_reader
Store sstable_mutation_reader pointer in mp_row_consumer
Stop using streamed_mutation in consumer and reader
Stop using streamed_mutation in sstable_data_source
Delete sstable_streamed_mutation
Introduce sstable::read_row_flat
Migrate sstable::as_mutation_source to flat_mutation_reader
Remove single_partition_reader_adaptor
Merge data_consume_context::impl into data_consume_context
Create data_consume_context_opt.
Merge on_partition_finished into mark_partition_finished
Check _partition_finished instead of _current_partition_key
Merge sstable_data_source into sstable_mutation_reader
Remove sstable_data_source
Remove get_next_partition and partition_header
(cherry picked from commit aa8c2cbc16)
Since we want to support cross building, we shouldn't hardcode GPG file path,
even these files provided on recent version of mock.
This fixes build error on some older build environment such as CentOS-7.2.
Fixes#3002
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1511277722-22917-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit c1b97d11ea)
We get following warning from antlr3 header when we compile Scylla with gcc-7.2:
/opt/scylladb/include/antlr3bitset.inl: In member function 'antlr3::BitsetList<AllocatorType>::BitsetType* antlr3::BitsetList<AllocatorType>::bitsetLoad() [with ImplTraits = antlr3::TraitsBase<antlr3::CustomTraitsBase>]':
/opt/scylladb/include/antlr3bitset.inl:54:2: error: nonnull argument 'this' compared to NULL [-Werror=nonnull-compare]
To make it compilable we need to specify '-Wno-nonnull-compare' on cflags.
Message-Id: <1510952411-20722-2-git-send-email-syuu@scylladb.com>
(cherry picked from commit f26cde582f)
Switch Debian 3rdparty packages to our OBS repo
(https://build.opensuse.org/project/subprojects/home:scylladb).
We don't use 3rdparty packages on dist/debian/dep, so dropped them.
Also we switch Debian to gcc-7.2/boost-1.63 on same time.
Due to packaging issues following packages doesn't renamed our 3rdparty
package naming rule for now:
- gcc-7: renamed as 'xxx-scylla72', instead of scylla-xxx-72.
- boost1.63: doesn't renamed, also doesn't changed prefix to /opt/scylladb
Message-Id: <1510952411-20722-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit ab9d7cdc65)
The exception handling code inspects server state, which could be
destroyed before the handle_exception() task runs since it runs after
exiting the gate. Move the exception handling inside the gate and
avoid scheduling another accept if the server has been stopped.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171116122921.21273-1-duarte@scylladb.com>
(cherry picked from commit 34a0b85982)
These patches convert queries (data, mutation and counter) to flat
mutation readers. All of them already use consume_flattened() to
consume a flat stream of data, so the only major missing thing
was adding support for reversed partitions to
flat_mutation_reader::consume().
* pdziepak flat_mutation_reader-queries/v3-rebased:
flat_mutation_reader: keep reference to decorated key valid
flat_muation_reader: support consuming reversed partitions
tests/flat_mutation_reader: add test for
flat_mutation_reader::consume()
mutation_partition: convert queries to flat_mutation_readers
tests/row_cache_stress_test: do not use consume_flattened()
mutation_reader: drop consume_flattened()
streamed_mutation: drop reverse_streamed_mutation()
(cherry picked from commit 6969a235f3)
For a time range tombstone that was already removed from a tree
is owned by a raw pointer. This doesn't end well if creation of
a mutation fragment or a call to push_mutation_fragment() throw.
Message-Id: <20171121105749.16559-1-pdziepak@scylladb.com>
(cherry picked from commit 1b936876b7)
This series reworks handling of range tombstones in reversed queries
so that they are applied to correct rows. Additionally, the concept
of flipped range tombstones is removed, since it only made it harder
to reason about the code.
Fixes#2982.
* https://github.com/pdziepak/scylla fix-reverse-query-range-tombstone/v2:
streamed_mutation: fix reversing range tombstones
range_tombstone: drop flip()
tests/cql_query_test: test range tombstones and reverse queries
tests/range_tombstone_list: add test for range_tombstone_accumulator
(cherry picked from commit cec5b0a5b8)
Currently flat_mutation_reader_from_mutation_reader()'s
converting_reader will throw std::runtime_error if fast_forward_to() is
called when its internal streamed_mutation_opt is disengaged. This can
create problems if this reader is a sub-reader of a combined reader as the
latter has no way to determine the source of a sub-reader EOS. A reader
can be in EOS either because it reached the end of the current
position_range or because it doesn't have any more data.
To avoid this, instead of throwing we just silently ignore the fact that
the streamed_mutation_opt is disengaged and set _end_of_stream to true
which is still correct.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <83d309b225950bdbbd931f1c5e7fb91c9929ba1c.1511180262.git.bdenes@scylladb.com>
(cherry picked from commit 8065dca4a1)
Don't std::move() the "query" string inside the parallel_for_each() lambda.
parallel_for_each is going to invoke the given callback object for each element of the range
and as a result the first call of lambda that std::move()s the "query" is going to destroy it for
all other calls.
Fixes#2998
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1511225744-1159-1-git-send-email-vladz@scylladb.com>
(cherry picked from commit 941aa20252)
_current_pi_idx was not reset from advance_to_next_partition(), which
is used when we skip to the next partition before fully consuming
it. As a result, if we try to skip to a clustering position which is
before the index block used by the last skip in the previous
partition, we would not skip assuming that the new position is in the
current block. This may result in more data being read from the
sstable than necessary.
Fixes#2984
Message-Id: <1510915793-20159-1-git-send-email-tgrabiec@scylladb.com>
* seastar 11ad0b1...78cd87f (3):
> Merge "http: Use output stream for files" from Amnon
> tutorial: a section about when_all() and when_all_succeed()
> Merge "Power8 related changes (what's left of them)" from Vlad
As can be seen in one of the traces in #2958, the copy constructor
of index_entry is called in response to std::vector<index_entry>::push_back(index_entry&&).
This is wasteful. Fix by providing the full suite of constructors/assignment operators.
Message-Id: <20171116121608.5580-1-avi@scylladb.com>
"This patch series addresses #2929. The objective is to eliminate global
state from the implementation and use of all access-control functionlity.
I've made every effort to make these patches logically independent and
incremental, but the final patch is big: this was necessary because
eliminating the global instances themselves is an atomic change."
* 'jhk/non_global_auth/v2' of https://github.com/hakuch/scylla:
auth: Switch to sharded service
tracing/trace_keyspace_helper: Use internal `client_state`
auth: Make the QP an explicit dependency
auth: Unify Java class name attributes
auth: Make life-time control more consistent
auth: Move metadata constants
auth: Don't expose internal constant
auth: Extract `permissions_cache`
utils/loading_cache: Include necessary dependency
auth: Fix static constant initialization
auth: Extract `delayed_tasks` from `auth.cc`
This change appears quite large, but is logically fairly simple.
Previously, the `auth` module was structured around global state in a
number of ways:
- There existed global instances for the authenticator and the
authorizer, which were accessed pervasively throughout the system
through `auth::authenticator::get()` and `auth::authorizer::get()`,
respectively. These instances needed to be initialized before they
could be used with `auth::authenticator::setup(sstring type_name)`
and `auth::authorizer::setup(sstring type_name)`.
- The implementation of the `auth::auth` functions and the authenticator
and authorizer depended on resources accessed globally through
`cql3::get_local_query_processor()` and
`service::get_local_migration_manager()`.
- CQL statements would check for access and manage users through static
functions in `auth::auth`. These functions would access the global
authenticator and authorizer instances and depended on the necessary
systems being started before they were used.
This change eliminates global state from all of these.
The specific changes are:
- Move out `allow_all_authenticator` and `allow_all_authorizer` into
their own files so that they're constructed like any other
authenticator or authorizer.
- Delete `auth.hh` and `auth.cc`. Constants and helper functions useful
for implementing functionality in the `auth` module have moved to
`common.hh`.
- Remove silent global dependency in
`auth::authenticated_user::is_super()` on the auth* service in favour
of a new function `auth::is_super_user()` with an explicit auth*
service argument.
- Remove global authenticator and authorizer instances, as well as the
`setup()` functions.
- Expose dependency on the auth* service in
`auth::authorizer::authorize()` and `auth::authorizer::list()`, which
is necessary to check for superuser status.
- Add an explicit `service::migration_manager` argument to the
authenticators and authorizers so they can announce metadata tables.
- The permissions cache now requires an auth* service reference instead
of just an authorizer since authorizing also requires this.
- The permissions cache configuration can now easily be created from the
DB configuration.
- Move the static functions in `auth::auth` to the new `auth::service`.
Where possible, previously static resources like the `delayed_tasks`
are now members.
- Validating `cql3::user_options` requires an authenticator, which was
previously accessed globally.
- Instances of the auth* service are accessed through `external`
instances of `client_state` instead of globally. This includes several
CQL statements including `alter_user_statement`,
`create_user_statement`, `drop_user_statement`, `grant_statement`,
`list_permissions_statement`, `permissions_altering_statement`, and
`revoke_statement`. For `internal` `client_state`, this is `nullptr`.
- Since the `cql_server` is responsible for instantiating connections
and each connection gets a new `client_state`, the `cql_server` is
instantiated with a reference to the auth* service.
- Similarly, the Thrift server is now also instantiated with a reference
to the auth* service.
- Since the storage service is responsible for instantiating and
starting the sharded servers, it is instantiated with the sharded
auth* service which it threads through. All relevant factory functions
have been updated.
- The storage service is still responsible for starting the auth*
service it has been provided, and shutting it down.
- The `cql_test_env` is now instantiated with an instance of the auth*
service, and can be accessed through a member function.
- All unit tests have been updated and pass.
Fixes#2929.
Rather than have all uses of the QP in auth reference global variables,
we supply a QP reference to both the authenticator and authorizer on
construction.
The caller still references a global variable when constructing the
instances, but fixing this problem is a much larger task that is out of
scope of this change.
This change is motivated partly be aesthetics, but more significantly
due to the future work to refactor `auth` into a sharded service. Since
doing so will require writing `auth::auth` from scratch, these
constants (and other common functionality) need a new home.
Using "Meyer's singletons" eliminate the problem of static constant
initialization order because static variables inside functions are
initialized only the first time control flow passes over their
declaration.
Fixes#2966.
This simple task scheduler is used by the auth module to delay metadata
creation until the system is settled.
Extracting it out allows the `auth` module to be refactored into a
sharded service and for other components of `auth` to make use of it.
Fixes#2965.
This patchset prepares sstables read path for flat_mutation_reader.
It cuts some dependencies between classes and replaces
sstables::mutation_reader with ::mutation_reader. This will make it
possible to gradually convert the code to flat_mutation_reader because
we have converters between flat_mutation_reader and ::mutation_reader.
* seastar-dev.git haaawk/flat_reader_prepare_sstables_rebased
Reduce dependencies from mp_row_consumer to sstable_streamed_mutation
Replace sstables::mutation_reader with ::mutation_reader
Remove range_reader_adaptor
Remove sstable_range_wrapping_reader
The wrapper is no longer needed because
read_range_rows returns ::mutation_reader instead of
sstables::mutation_reader and the reader returned from
it keeps the pointer to shared_sstable that was used to
create the reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The wrapper is no longer needed because
read_range_rows returns ::mutation_reader instead of
sstables::mutation_reader and the reader returned from
it keeps the pointer to shared_sstable that was used to
create the reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This will make migration to flat_mutation_reader much
easier and sstables::mutation_reader is going away with
this migration anyway.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Before this patch mp_row_consumer was using sstable_streamed_mutation
in two ways:
1. Populate sstable_streamed_mutation's buffer with mutation_fragments
2. Advance sstable_streamed_mutation's sstable_data_source to new position.
We can easily reduce those dependencies only to the first one.
This will reduce the coupling between those classes and simplify
the flow of execution.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"Otherwise, such strategies couldn't behave as expected when it needs to do STCS."
* 'respecting_stcs_options_v2' of github.com:raphaelsc/scylla:
tests: enable twcs test that relied on size-tiered properties
twcs: respect stcs options by forwarding them to stcs method
lcs: forward stcs options to respect them
stcs: make most_interesting_bucket respect size-tiered options
stcs: make most_interesting_bucket respect thresholds
compaction: make size_tiered_most_interesting_bucket static method of stcs class
stcs: introduce new ctor
stcs: make header self contained
stcs: inline function definition so as not to break one definition rule
Broken in f570e41d18.
Not replicating this may cause coordinator to treat a node which is
down as alive, or vice verse.
Fixes regression in dtest:
consistency_test.py:TestAvailability.test_simple_strategy
which was expected to get "unavailable" exception but it was getting a
timeout.
Message-Id: <1510666967-1288-1-git-send-email-tgrabiec@scylladb.com>
Make sure loading_cache::stop() is always called where appropriate:
regardless whether the test failed or there was an exception during the test.
Otherwise a false-alarm use-after-free error may occur.
Fixes#2955
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1510625736-3109-1-git-send-email-vladz@scylladb.com>
"Fixes #2944."
* tag 'tgrabiec/cache-exception-safety-fixes-v2' of github.com:scylladb/seastar-dev:
tests: row_cache: Add test for exception safety of multi-partition scans
tests: row_cache: Add test for exception safety of single-partition reads
tests: mutation_source_tests: Always print the seed
tests: Disable alloc failure injection in test assertions
tests: Avoid needless copies
row_cache: Fix exception safety of cache_entry::read()
row_cache: scanning_and_populating_reader: Fix exception unsafety causing read to skip data
row_cache: partition_range_cursor: Extract valid() and advance_to() from refresh()
cache_streamed_mutation: Add trace-level logging to cache_streamed_mutation
mvcc: Lift noexcept off partition_snapshot_row_weakref assignment/constructors
cache_streamed_mutation: Make advancing to the next range exception-safe
cache_streamed_mutation: Make add_clustering_row_to_buffer() exception-safe
cache_streamed_mutation: Make drain_tombstones() exception-safe
cache_streamed_mutation: Return void from start_reading_from_underlying()
cache_streamed_mutation: Document invariants related to exception-safety
streamed_mutation: Add reserve_one()
lsa: Guarantee invalidated references on allocating section retry
mvcc: partition_snapshot_row_cursor: Mark allocation points
BOOST_TEST_MESSAGE() is not logged by default, and for some tests we
don't want to enable that because it's too noisy. But we need to know
the seed to reproduce a failure, so we better to always print it.
If assignment to _lower_bound in the "_secondary_in_progress = true;"
case in do_read_from_primary() throws due to allocation failure, the
update section will be retried and we will take the not_moved path,
skipping the range which was discontinuous and was supposed to be read
from underlying.
Fix by redoing lookup using _lower_bound in case the section is
retried. When we retry, _primary.valid() will be false. We need to
ensure now that _lower_bound is always valid.
Fixes#2944.
Changing _ck_ranges_curr and _lower_bound should be atomic, either
both fail or both succeed. Currently it could happen that if
position_in_partition::for_range_start() fails, _ck_ranges_curr would
be advanced but _lower_bound not.
We need to maintain the following invariants:
(1) no fragment with position >= _lower_bound was pushed yet
(2) If _lower_bound > mf.position(), mf was emitted
Before this patch (1) could be violated if drain_tombstones() failed
in the middle. (2) could be violated if push_mutation_fragment()
failed.
There is existing code (e.g. use of partition_snapshot_row_cursor in
cache_streamed_mutation) which assumes that references will be
invalidated when bad_alloc is thrown from allocating_section. That is
currently the case because on retry we will attempt memory reclamation
which will invalidate references either through compaction or
eviction. Make this guarantee explicit.
thrift/server.cc:237:6: required from here
thrift/server.cc:236:9: error: cannot call member function ‘void thrift_server::maybe_retry_accept(int, bool, std::__exception_ptr::exception_ptr)’ without object
maybe_retry_accept(which, keepalive, std::move(ex));
gcc version: gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171113184537.10472-1-raphaelsc@scylladb.com>
"The following patches convert streaming and repair code to the new
flat mutation reader interface. In particular this involves changing::
- fragment_and_freeze() -- a consumer that fragments and freezes mutations
- checksum computation for repair which until now was using two-level
mutation_reader/streamed_mutation interface
- multi_range_reader -- a mutation reader that automatically fast
forwards to between given partiton ranges"
* tag 'flat_mutation_reader-streaming/v2' of https://github.com/pdziepak/scylla: (24 commits)
mutation_reader: drop multi_range_reader
db: convert make_streaming_reader() to flat_mutation_reader
tests/flat_mutation_reader: add test for multi range reader
tests/flat_mutation_reader_assertions: add fast_forward_to()
tests/simple_schema: add to_ring_positions() helper
flat_mutation_reader: convert flat_multi_range_mutation_reader
flat_mutation_reader: add partition_range_forwarding
flat_mutation_reader: make pop_mutation_fragment() public
flat_mutation_reader: copy multi_range_mutation_reader
streamed_mutation: drop mutation_hasher
tests/flat_mutation_reader: add test for partition checksum
repair: convert partition_checksum::compute_streamed() to flat streams
repair: make partition_hasher consume flat mutation streams
mutation_hasher: copy mutation_hasher to repair.cc
partition_start: make partition_tombstone() const
partition_checksum: introduce compute() for flat_mutation_reader
db: drop single-range make_streaming_reader()
fragment_and_freeze: drop streamed_mutation overload
stream_transfer_task: switch to flat_mutation_reader
tests/flat_mutation_reader: add test for fragment_and_freeze
...
flat_mutation_reader::partition_range_forwarding and
mutation_reader::forwarding are aliases of the same type. The change was
necessary in order to make mutation_reader::forwarding available in
flat_mutation_reader.hh even though it is included by mutation_reader.hh
flat_mutation_reader public interface already exposes low leve
is_buffer_empty() and is_buffer_full() adding pop_mutation_fragment()
will make implementation of intermediate readers more straightforward.
There is a user of fragment_and_freeze() (streaming) that will need
to be able to break the loop Right now, it does that between
streamed_mutation, but that won't be possible after we switch to flat
readers.
partition_entry::read() calls open_version() under standard allocator,
but it may allocate a new partition version if a snapshot already
exists which was created in an earlier phase. Versions are supposed to
be allocated using region's allocator, they will be freed using
region's allocator. LSA will delegate free() to the standard allocator
correctly in this case, but it will subtract from its
_non_lsa_occupancy, assuming the allocation was done through it. This
will corrupt occupancy() for cache region.
Fixes#2948.
Message-Id: <1510229584-14398-1-git-send-email-tgrabiec@scylladb.com>
"Ensure stop() waits for the accept loop to complete to avoid crashes
during shutdown."
* 'thrift-server-stop/v4' of https://github.com/duarten/scylla:
thrift/server: Restore code format
thrift/server: Stopping the server waits for connection shutdown
thrift/server: Abort listeners on stop()
thrift/server: Avoid manual memory management
thrift/server: Add move ctor for connection
thrift/server: Extract retry logic
thrift/server: Retry with backoff for some error types
thrift/server: Retry accept in case of error
This patch ensures the future returned from stop() resolves only when
all connections and listeners are no longer in use.
Fixes#2657Fixes#2942
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In case of errors like ECONNABORTED, we want to retry accepting
connections. Right now we immediately retry the accept, but in
subsequent patches we introduce a backoff for other types of errors.
We also consider fatal errors like EBADFD, which should not trigger a
retry.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Currently, we use type type of the column as the accumulator when we
average it. This can easily overflow, e.g. (2^31-1)+(3) = overflow.
Fix by using __int128 for the accumulator. It's not standard, but
it's way more efficient and simpler than the alternatives.
Inspired by CASSANDRA-12417, but much simpler due to the availability
of __int128.
Message-Id: <20171112173529.30764-1-avi@scylladb.com>
"This patch adds support for varint and decimal to aggregate functions.
Some other types (like byte or smallint) weren't supported and they are
supported by C*. So their aggregate functions were added as well.
To allow aggregate functions for big_decimal, following methods were added
to big_decimal type:
* Division by int64_t that preservers number of decimal digits.
* Operator += .
* Comparison operators.
Fixes #2842."
* 'danfiala/scylla-2842-send-002' of https://github.com/hagrid-the-developer/scylla:
tests: Add tests for aggregate functions.
tests: Add tests for big_decimal type.
cql3/functions: Add aggregate functions for big_decimal.
utils/big_decimal: Added necessary operators and methods for aggregate functions.
cql3/functions: Add aggregate functions for types for which it is trivial.
Since we started accounting virtual dirty memory we no longer have a cap
on real dirty memory. In most situations that is not needed, since real
dirty will just be at most twice as much as virtual dirty (current
flushing memtable plus new memtable).
However, due to things like cache updates and component flushing we can
end up having a lot of memtables that are virtually freed but not yet
fully released, leading real dirty memory to explode using all the box'
memory.
This patch adds a cap on real dirty memory as well. Because of the
hierarchical nature of region_group, if the parent blocks due to memory
depletion, so will the child (virtual dirty region group).
After that is done, we need to make sure that dirty memory is not seen
as freed until the cache update is done. Until a particular partition is
moved to the cache it is not evictable. As a result we can OOM the
system if we have a lot of pending cache updates as the writes will not
be throttled and memory won't be made available.
This patch pins the memory used by the region as real dirty before the
cache update starts, and unpins it when it is over. In the mean time it
gradually releases memory of the partitions that are being moved to
cache.
I have verified in a couple of workloads that the amount of memory
accounted through this is the same amount of memory accounted through
the memtable flush procedure.
Fixes#1942
* git@github.com:glommer/scylla.git glommer/update-cache-v4:
row_cache: modernize use of seastar threads
mutation_partition: estimate size of partition
memtable: factor out calculation of memtable_entry memory size
memtable: add a method to export memtable's dirty memory manager
dirty_memory_manager: block if we hit the real dirty limit
dirty_memory_manager: add functions to manipulate real dirty
partition: add method to calculate memory size of a partition
row cache: pin real dirty during cache updates.
This patch fixes 'DROP INDEX' CQL statement to also drop the underlying
index view automatically so that we don't leave unused materialized
views behind.
Message-Id: <1510303421-15945-1-git-send-email-penberg@scylladb.com>
Right now, once a region is moved to the cache is no longer visible to
the dirty memory system. Not as real dirty nor virtual dirty.
The problem is that until a particular partition is moved to the cache
it is not evictable. As a result we can OOM the system if we have a lot
of pending cache updates as the writes will not be throttled and memory
won't be made available.
This patch pins the memory used by the region as real dirty before the
cache update starts, and unpins it when it is over. In the mean time it
gradually releases memory of the partitions that are being moved to
cache.
I have verified in a couple of workloads that the amount of memory
accounted through this is the same amount of memory accounted through
the memtable flush procedure.
Fixes#1942
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Once that is added, also add a method to a memtable entry to calculate
the entire size of a memtable entry. Right now we only have one method
to calculate the size minus rows.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are times in which we want to add and remove real dirty memory
without impacting virtual dirty. One such example is the cache update
process, where real dirty is the limiting factor.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Since we started accounting virtual dirty memory we no longer have a cap
on real dirty memory. In most situations that is not needed, since real
dirty will just be at most twice as much as virtual dirty (current
flushing memtable plus new memtable).
However, due to things like cache updates and component flushing we can
end up having a lot of memtables that are virtually freed but not yet
fully released, leading real dirty memory to explode using all the box'
memory.
This patch adds a cap on real dirty memory as well. Because of the
hierarchical nature of region_group, if the parent blocks due to memory
depletion, so will the child (virtual dirty region group).
A next step is to add a controller that will increase the priority of
the tasks involving in releasing real dirty memory if we get dangerously
close to the threshold.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The total size is the sum of two components. Add a method that
does that sum so this code gets easier to reuse.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the memtable flusher, we account for the size of a partition as we
read them. However, there are other points in the architecture where we
would like to calculate the size of a partition in a point in which we
are not reading it. One such example is the cache update process.
This patch enhances the mutation_partition adding a method that returns
the total size for this partition.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
For a while now we have an async() function, that simplifies the code by not
needing to issue an explicit join. This patch converts the row cache to use
async() as well, which most of our code already does. Doing so will make
it easier to make changes to update_cache.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"Implement flat_mutation_reader::consume and add tests for it.
For that implement flat_mutation_reader_from_mutations and
read_mutation_from_flat_mutation_reader."
* 'haaawk/flat_reader_consume_v3' of github.com:scylladb/seastar-dev:
Add tests for flat_mutation_reader::consume
Add tests for flat_mutation_reader utils
Introduce read_mutation_from_flat_mutation_reader
Make mutation_rebuilder streamed_mutation independent
flat_mutation_reader_from_mutation: support multiple mutations
Introduce flat_mutation_reader::consume
Move FlattenedConsumer concept to flat_mutation_reader.hh
"This patchset introduces memtable::make_flat_reader that returns
flat_mutation_reader and converts internal memtable readers into
flat_mutation_readers.
It also introduces some utility methods like make_forwardable and
make_partition_snapshot_flat_reader."
* 'haaawk/flat_reader_memtable_v4' of github.com:scylladb/seastar-dev:
Turn scanning_reader into flat_mutation_reader
Change memtable_entry::read to return flat_mutation_reader
Make iterator_reader independent from mutation_reader
Introduce make_partition_snapshot_flat_reader
Prepare partition_snapshot_flat_reader
Introduce flat_mutation_reader_from_mutation
Prepare flat_mutation_reader_from_mutation
Introduce make_forwardable
Prepare make_forwardable for flat_mutation_reader
Introduce empty_flat_reader
memtable: Introduce make_flat_reader
mutation_rebuilder will be used not only with streamed_mutations
but also with flat_mutation_readers so it's better for it to be
independent from streamed_mutation.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Rename flat_mutation_reader_from_mutation to
flat_mutation_reader_from_mutations.
Make it work with std::vector<mutation> instead of a single
mutation.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This concept will be used both in flat_mutation_reader.hh
and mutation_reader.hh. mutation_reader.hh includes
flat_mutation_reader.hh so we have to move the concept to
make it accessible in both files.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Make it possible for a mutation_source to be created both for sources
that use old mutation_reader and new flat_mutation_reader.
Add tests for flat_mutation_reader::next_partition to run_mutation_source_tests.
* seastar-dev.git 'dev/haaawk/flat_reader_mutation_source_v3':
Remove mutation_reader.hh dependency from flat_mutation_reader.hh
Prepare mutation_source for more than one implementation
Add flat reader mutation source implementation
Add mutation_source::make_flat_mutation_reader
Use mutation_source::make_flat_mutation_reader in tests
Add flat_mutation_reader_assertions
Add test for flat_mutation_reader::next_partition
This commit creates a copy of partition_snapshot_reader
and names it partition_snapshot_flat_reader.
This new class will be turned into a flat_mutation_reader
in the next commit.
The purpose of this commit is to make it easier to review the next
commit.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This is a utility method that will be handy in conversion
from mutation_reader to flat_mutation_reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This commit copies streamed_mutation_from_mutation
from streamed_mutation to flat_mutation_reader
and renames it to streamed_mutation_from_mutation_copy.
This copy will be used as a base for
flat_mutation_reader_from_mutation.
The purpose of this commit is to make it easier to review the next
commit.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It will add the ability to fast_forward_to on position_range
to flat_mutation_reader that does not have this ability.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This commit copies make_forwardable from streamed_mutation
to flat_mutation_reader and renames it to make_forwardable_copy.
This copy will be used as a base for make_forwardable implementation
for flat_mutation_reader.
The purpose of this commit is to make it easier to review the next
commit.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This method creates a flat_mutation_reader
instead of mutation_reader. All users will be gradually
converted to the new interface. make_reader is implemented
using make_flat_reader and will be removed once all users
are migrated.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This will be used as an intermediate state of migration
from mutation_reader to flat_mutation_reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
There will be a second implementation that will be used by
sources that are converted to flat_mutation_reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It's not needed and causes cyclic dependency when we need
flat_mutation_reader in mutation_source.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We don't support non-PK restrictions correctly as explained in commit
3c90607 ("tests/cql_query_test: Fix view creation in
test_duration_restrictions()") and Apache Cassandra doesn't support them
for MVs either. Change some test cases to not rely on them.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171107165138.3176-1-duarte@scylladb.com>
f44131226a introduced a regression where for some verbs we would
return partitions in their natural sort order, but since thrift
partition ranges can wrap-around, what we need to preserve is query
order.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171103201118.18175-1-duarte@scylladb.com>
Fixes#2938.
* 'tgrabiec/fix-range-tombstone-list-exception-safety-v1' of github.com:scylladb/seastar-dev:
tests: range_tombstone_list: Add test for exception safety of apply()
tests: Introduce range_tombstone_list assertions
cache: Make range tombstone merging exception-safe
range_tombstone_list: Introduce apply_monotonically()
range_tombstone_list: Make reverter::erase() exception-safe
range_tombstone_list: Fix memory leaks in case of bad_alloc
mutation_partition: Fix abort in case range tombstone copying fails
managed_bytes: Declare copy constructor as allocation point
Integrate with allocation failure injection framework
We don't support non-PK restrictions correctly as explained in commit
3c90607 ("tests/cql_query_test: Fix view creation in
test_duration_restrictions()") and Apache Cassandra doesn't support them
for MVs either. Disable the tests, but don't remove them because they
will be resurrected once CASSANDRA-13832 is fixed.
Message-Id: <1510052422-3478-1-git-send-email-penberg@scylladb.com>
range_tombstone_list::apply() has no exception safety guarantees about
the logical state. The target mutation_partition in cache should be
assumed to be left in unspecified state. In particular, some of the
preexisting overlapping tombstones may be removed and not reinserted,
so the cache would be missing some of the range tombstone information
in case the whole allocating section fails.
Use apply_monotonically() which provides the needed guarantees.
Fixes#2938.
erase_undo_op() constructor takes ownership of *it, and destroys it
when it goes out of scope. If emplace_back() fails, *it would be
destroyed before being removed from its container (_dst._tombstones).
Fix by making sure _ops.emplace_back() won't fail.
If exception is thrown from _row_tombstones.apply(), _rows will be
left uncleared. This will trigger assertion in bi::set_member_hook
destructor, which assrts that the hook is not linked.
Always clear _rows.
Because of the small size optimization, not all copies will call the
allocator, so allocation failure injection may miss this site if the
value is not large enough. Make the testing more effective by marking
this place explicitly as an allocation point.
* seastar d71922c...8040cab (4):
> util: Introduce support for allocation failure injection
> Adding dpdk-port-index as a command line option with default value of 0
> core/sharded: Introduce invoke_on_others()
> noncopyable_function: improve support for capturing mutable lambdas
"Fixes #2933
Fixes regressions introduced by config restructuring.
Allows "base" config to handle errors by warning, while
other uses can opt otherwise."
[pdziepak: resolved merge conflict]
* 'calle/cfgfix' of github.com:scylladb/seastar-dev:
config_test: Use error handler (ignore errors) + add error test
config: Resurrect command line aliases that where lost
main: Use error handler for config parse
config_file: Add optional "error_handler" to yaml parse functions
"This patch series adds support for secondary index queries using the
backing index view that's created when CREATE INDEX statement is
executed.
Example:
-- Create keyspace and table:
CREATE KEYSPACE ks WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 1};
CREATE TABLE ks.users (
userid uuid,
name text,
email text,
country text,
PRIMARY KEY (userid)
);
-- Create secondary indexes:
CREATE INDEX ON ks.users (email);
CREATE INDEX ON ks.users (country);
-- Insert some data:
INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Bondie Easseby', 'beassebyv@house.gov', 'France');
INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Demetri Curror', 'dcurrorw@techcrunch.com', 'France');
INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Langston Paulisch', 'lpaulischm@reverbnation.com', 'United States');
INSERT INTO ks.users (userid, name, email, country) VALUES (uuid(), 'Channa Devote', 'cdevote14@marriott.com', 'Denmark');
-- Query on the secondary-index backed non-primary keys:
SELECT * FROM ks.users WHERE email = 'beassebyv@house.gov';
userid | country | email | name
--------+---------+-------+------
022238c8-5213-44b5-959e-4e3e1b032f85 | France | beassebyv@house.gov | Bondie Easseby
(1 rows)
SELECT * FROM ks.users WHERE country = 'France';
userid | country | email | name
--------------------------------------+---------+-------------------------+----------------
2152d85a-61f6-4eab-af4d-e7e7d0872319 | France | beassebyv@house.gov | Bondie Easseby
59fddb6d-bfc9-4636-a9a0-85383fd815ee | France | dcurrorw@techcrunch.com | Demetri Curror
Known imitations:
- Only regular column indexes return results. Indexing primary key
components like clustering keys return empty result set because of
index view query partition key serialization issues that will be fixed
in subsequent patches.
- Secondary index queries are not paginated, which can cause problems
for queries that return a large number of rows.
- Multiple restrictions don't work correctly if one of them is backed by
a secondary-index.
- Only one secondary-indexed restriction per query is supported -- other
restrictions are ignored.
- Compound partition keys are not supported.
- ALLOW FILTERING on non-primary key columns does not work correctly
without secondary index (see issue #2200)."
* 'penberg/cql-2i-queries/v2' of github.com:penberg/scylla:
tests/cql_query_test: Add test case for secondary index queries
cql3: Secondary-index backed select statements
index: Fix index view schema when primary key component is indexed
tests/cql_query_test: Fix view creation in test_duration_restrictions()
cql3/restrictions: Add statement_restrictions::index_restrictions() helper
index: Implement index::supports_expression() for EQ operator
cql3: Make operator_type class non-copyable
index: Fix index::supports_expression() operator parameter type
cql3: Implement statement_restriction index validation
Before the patch we appended and queried at the front. Insert at the
front instead, so that writes and reads overlap. Stresses eviction and
population more.
Message-Id: <1506369562-14892-1-git-send-email-tgrabiec@scylladb.com>
"We currently can't insert row entries at any position_in_partition,
but only at full keys and after all keys. If a query range has bounds
such that we have to insert a dummy entry at non-representable position
then information about range continuity will not be fully populated.
In particular, single-row queries of a row which is not present in sstables
will miss when repeated again.
The series fixes the problem by marking the whole query range as continuous
by inserting dummy entries at boundaries when necessary.
Refs #2579."
* tag 'tgrabiec/cache-range-continuity-v2' of github.com:scylladb/seastar-dev:
tests: row_cache: Add test for population of single rows
tests: Add test for population of continuity
tests: mutation_reader_assertions: Introduce produces_compacted()
mutation: Introduce apply(mutation_fragment)
cache: Document invariants of cache_streamed_mutation::_lower_bound
cache_streamed_mutation: Special-case population for singular ranges
query: Introduce is_single_row()
cache_streamed_mutation: Increment mispopulation counter when can't populate due to eviction
cache_streamed_mutation: Override continuity of older versions when populating
cache_streamed_mutation: Mark whole query range as continuous
tests: cache_streamed_mutation: Allow creating expected_row at any position_in_partition
cache_streamed_mutation: Populate continuity when range adjacent to non-latest version rows
cache_streamed_mutation: Avoid lookup in maybe_add_to_cache() in more cases
row_cache: Make read_context::key() valid before reading from underlying starts
mutation_partition: Allow creating rows_entry at any clustered position_in_partition
position_in_partition: Do not use -2 and +2 weights
clustering_ranges_walker: Make contains() drop range tombstones adjacent to query range
mutation_partition: Remove delegating_compare()
mvcc: Print iterators in operator<< for partition_snapshot_row_cursor
mvcc: Introduce partition_snapshot_row_weakref
mvcc: Make the null state of partition_snapshot::change_mark explicit
mvcc: Add partition_snapshot::region() getter
mvcc: Add partition_snapshot::schema() getter
position_in_partition: Introduce before_key()
position_in_partition: Introduce min()
position_in_partition: Introduce for_static_row()
While not being a real unit tests memory_footprint can be a quite useful
tool and running it among other tests will ensure that we will notice
when it gets broken.
Message-Id: <20171102160233.6756-2-pdziepak@scylladb.com>
When created cache registers several metrics, since attempts to create
an already existing metrics result in an exception being thrown it is no
longer possible to have two cache instances at the same time. This is
exactly what happens in memory_footprint: one (useless) cache object is
created through a call to do_with_cql_env() and, then, memory_footprint
explicitly creates another one (not a useless one).
The tests itself doesn't really need a full cql environment and the only
reason it was added is so that storage_service is initialised and various
code paths can query for the available cluster features. This can be
done in a much lightweight way using storage_service_for_tests.
Fixes memory_footprint failure (until next time we decide there is
nothing wrong with globals).
Message-Id: <20171102160233.6756-1-pdziepak@scylladb.com>
$basearch isn't parsed as expected, the finaly baseurl is wrong.
We only have x86_64 arch in external 3rdparty repository, and
the conf file is only for x86_64, so it's fine to use hardcode
x86_64.
The problem was introduced by commit
b5e83ebd94 ("dist/redhat: switch 3rdparty
packages to external build service").
Fixes#2930
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <708f46a7c36623e86fee278462c80db1eff3b820.1509700430.git.amos@scylladb.com>
This patch adds support for secondary-index backed select statements.
Current select_statement class is split into two separate classes:
primary_key_select_statement that retains regular query behavior and
indexed_table_select_statement that introduces the new secondary-index
backed query logic. One of the two behaviors is selected at query
preparation time to minimize overhead for non-indexed queries.
This fixes index view schema to exclude indexed column when a primary
key component like clustering key is indexed. This fixes a server crash
when CREATE INDEX statement is executed on a clustering key column.
The materialized view created in test_duration_restriction() restricts
on a non-PK column. Since Scylla's ALLOW FILTERING and secondary index
validation path is broken, once we start to do secondary index queries,
query processor thinks there's a secondary index backing that non-PK
column and fails because it's unable to find such column.
Fix up the view to only trigger the duration type validation error we're
interested in here.
Fixes the case of continuity not being populated when the row which is
the upper bound of the population range belongs to a non-latest
version. In such case we wouldn't mark the range as continuous,
because we can't modify rows of non-latest versions. To fix this,
create an empty entry in latest version which will just override the
continuity flag of the old entry.
Before this patch only ranges between returned row fragments were
marked as continuous. In the extreme case, there could be no such
fragments, in which case next read would miss as well. To avoid this,
mark whole query range as continuous by inserting dummy entries when
necessary.
Refs #2579.
Current code will not mark the range as continuous if the previous
entry does not come in the latest version. Fix that by switching
to partition_snapshot_row_pointer, which is capable of checking
in older versions as necessary.
Also, we avoid the key comparison if we know that the iterator
is still valid.
So that we can call cache_streamed_mutation::can_populate() before
we start reading from underlying. Will be needed in upcoming changes
which insert dummy entries when falling back to underlying.
::weight() is using those values for excl_end and excl_start in order
to be able to represent non-overlapping ranges. In their model the end
bound is inclusive. We don't need this, since position_range has end
bound exclusive.
This change makes that:
position_in_partition::after_key(y)
== position_in_partition::for_range_end(clutering_range::make({x}, {y})
This patch changes the way the multiget_{slice, count} verbs return
their results, by ensuring a queried key that produced no results is
still present in the returned map, associated with an empty list.
This is not required by the thrift interface, and it is a performance
step back, but matches the behavior of Apache Cassandra.
Said behavior is relied upon by projects like JanusGraph, whose
integration with Scylla motivated this patch.
Fixes#2900
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171019161104.22797-2-duarte@scylladb.com>
Most common operations, like multiget_count and multiget_slice, return
maps. So, instead of keeping a vector internally in column_visitor
that we later transform into a map, keep a map that we transform into
a vector for the uncommon operations.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20171019161104.22797-1-duarte@scylladb.com>
"This series cleans-up the query processor header and source file,
including deleting dead Java code.
There are no functional or interface changes.
I've run all unit tests and observed no failures."
* 'jhk/clean_up_qp/v2' of github.com:hakuch/scylla:
cql3/query_processor: Fix formatting
cql3/query_processor: Organize headers
cql3/query_processor.hh: Consolidate `public` and `private` sections
cql3/query_processor: Remove dead Java code
* seastar 8babd1f...d71922c (11):
> configure.py: add -Wno-sign-compare to compile Boost.Test with gcc-7
> log: Print nested exceptions
> reactor: do not account non idle activity for total idle time calculation
> execution_stage: defer execution less aggressively
> Fix -Wreturn-type warnings
> cpu scheduler: make _reciprocal_shares_times_2_32 wider to avoid overflow problems
> noncopyable_function add bool operator
> execution_stage: make make_execution_stage return a named type
> memory: support overriding the default allocator page size
> memory: fix crash during startup with large page_size
> core: io_destroy is missing when destructing reactor, which causes io_context leak
We are failing to build .deb package on pbuilder due to lack of build time
dependencies so we need add those packages on Build-Depends, also we need to
follow Debian packaging style for the package contains python scripts.
Fixes#2918
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1508457215-11552-1-git-send-email-syuu@scylladb.com>
Switch to g++-7/boost-1.63 for Ubuntu 14.04/16.04 that newly provided via
our 3rdparty PPA.
To make Scylla compilable with boost-1.63/g++-7, we need to disable following
warnings:
- misleading-indentation
- overflow
- noexcept-type
Compile error message:
https://gist.github.com/syuu1228/96acc640c56c3316df5ce6911d60beea
Seastar also has similar problem, it needs to disable 'sign-compare', detail
is in a patch for Seastar.
This update also fixes current Ubuntu 14.04/16.04 compilation error problem,
since errors were come from too old g++/boost.
Fixes#2902Fixes#2903
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1508457509-13122-1-git-send-email-syuu@scylladb.com>
"When reading a single row it is possible that the read will be satisfied
by just reading from one of the data source candidates. To exploit this
an optimization is employed which sorts data source candidates by their
timestamp and reads mutations from the most recent to the oldest. When
all needed cells are present and their earliest timestamp is still
later than the latest one of the remaining data source the read can be
terminated early.
However this optimization also has the possibility to backfire as the
data sources are read sequentially, so if all of them has to be read
eventually then we will end up worse then without it.
Thus the optimization can be disabled up-front or enabled to only run
until its efficiency degrades below a certain threshold.
Also counters are added to column-families to make it possible to
observe how well it performs.
Benchmarking
Benchmarking was done with disabled cache and at a constant op rate of
4k (1/3 of the max op rate on my box), against 3 sstables containing the
same 10000 rows.
1) Optimization turned off (all sstables read paralelly)
latency mean : 1.3 [simple:1.3]
latency median : 1.0 [simple:1.0]
latency 95th percentile : 2.4 [simple:2.4]
latency 99th percentile : 2.9 [simple:2.9]
latency 99.9th percentile : 8.0 [simple:8.0]
latency max : 13.5 [simple:13.5]
2) Optimization turned on, best case (1 of 3 sstables read)
latency mean : 0.6 [simple:0.6]
latency median : 0.6 [simple:0.6]
latency 95th percentile : 1.0 [simple:1.0]
latency 99th percentile : 1.2 [simple:1.2]
latency 99.9th percentile : 4.4 [simple:4.4]
latency max : 13.4 [simple:13.4]
3) Optimization turned on, best case, IN query (1 of 3 sstables read)
latency mean : 0.7 [simple_in:0.7]
latency median : 0.6 [simple_in:0.6]
latency 95th percentile : 1.1 [simple_in:1.1]
latency 99th percentile : 1.4 [simple_in:1.4]
latency 99.9th percentile : 5.4 [simple_in:5.4]
latency max : 16.8 [simple_in:16.8]
4) Optimization turned on, worst case (3 of 3 sstables read sequentally)
latency mean : 2.8 [simple:2.8]
latency median : 2.3 [simple:2.3]
latency 95th percentile : 5.4 [simple:5.4]
latency 99th percentile : 6.5 [simple:6.5]
latency 99.9th percentile : 13.5 [simple:13.5]
latency max : 19.2 [simple:19.2]
5) Optimization turned on, mid case (2 of 3 sstables read sequentally)
latency mean : 1.4 [simple:1.4]
latency median : 1.1 [simple:1.1]
latency 95th percentile : 2.7 [simple:2.7]
latency 99th percentile : 3.2 [simple:3.2]
latency 99.9th percentile : 7.7 [simple:7.7]
latency max : 15.1 [simple:15.1]"
Ref #324
* 'bdenes/optimize_single_row_read_v6' of github.com:denesb/scylla:
Add unit tests for single_key_sstable_reader
Add counters for the single-key reader optimization
Add single_key_parallel_scan_threshold option
single_key_sstable_reader: optimize single-row queries
single_key_sstable_reader: move reading code into it's own method
Add selects_only_full_rows() and selects_only_full_rows_with_atomic_columns()
Add two counters, one to determine how many of the reads fall into the
optimization, and a second one to determine it's effectiveness.
The first one is single_key_reader_optimization_hit_rate. It contains
the rate of reads that the optimization applies to out of all the reads
that go into the single_key_sstable_reader.
The second one, single_key_reader_optimization_extra_read_proportion is
a histogram of the efficiency of the optimization. It contains the
proportion of extra data-sources read. It's a number between 0 and 1,
where 0 is the best case (only one data-source was read) and 1 is the
worst case (all data-sources were read eventually). This is the same
number that is used for the threshold option (see previous patch).
Each of the histogram's buckets cover a chunk of 0.1 from the [0, 1]
range.
Note that single_key_parallel_scan_threshold effectively provides an
upper bound for the proportion as the optimization is turned off as soon
as it goes above that number.
The counters are disabled if single_key_parallel_scan_threshold is set
to 0 disabling the optimization entirely.
This option regulates when exactly the single-key optimization is
considered ineffective and turned off.
The threshold is the proportion of the extra data source candidates that
can be read before the optimization is considered ineffective and
disabled. The proportion is calculated as follows:
(read_data_sources - 1) / (total_data_sources - 1)
We substract 1 from the read_data_sources and total_data_sources to
effectively measure the rate of *extra* data sources we read. This
makes sure that the proportion is meaningful even if e.g. we have only
have a total of 2 data-sources and we read only 1 (best case).
Whenever this number goes above the threshold the optimization is
disabled. The threshold is number between 0 and 1, 0 forces the
optimization off and 1 forces it on. Increase the treshold to favor
throughput over latency for single-row reads, decrease the treshold to
improve latency at the expense of throughput.
If the threshold is > 0 (it's not force disabled) and the optimization
is disabled due to a read crossing the threshold, we will issue
"probing" reads (every 100th read) to determine if the optimization is
worth re-enabling. Probing reads are allowed to run through the
optimization path and if they go below the threshold the optimization is
re-enabled.
For single-row queries that only query atomic cells one can put a lower
bound on the timestamps which may affect the query results and thus rule
out entire data sources. This allows the query to read only those
sstables that actually contribute to the result.
To do this we incrementally move through the sstables overlapping with
the query range, checking after each read mutation whether we already
have a value for all required cells and whether the lower-bound of their
timestamps is higher than the upper-bound of the timestamps of all the
remaining data-sources. When this condition is met we terminate the
read.
"The main problem fixed is slow processing of application state changes.
This may lead to a bootstrapping node not having up to date view on the
ring, and serve incorrect data.
Fixes #2855."
* tag 'tgrabiec/gossip-performance-v3' of github.com:scylladb/seastar-dev:
gms/gossiper: Remove periodic replication of endpoint state map
gossiper: Check for features in the change listener
gms/gossiper: Replicate changes incrementally to other shards
gms/gossiper: Document validity of endpoint_state properties
storage_service: Update token_metadata after changing endpoint_state
gms/gossiper: Process endpoints in parallel
gms/gossiper: Serialize state changes and notifications for given node
utils/loading_shared_values: Allow Loader to return non-future result
gms/gossiper: Encapsulate lookup of endpoint_state
storage_service: Batch token metadata and endpoint state replication
utils/serialized_action: Introduce trigger_later()
gossiper: Add and improve logging
gms/gossiper: Don't fire change listeners when there is no change
gms/gossiper: Allow parallel apply_state_locally()
gms/gossiper: Avoid copies in endpoint_state::add_application_state()
gms/failure_detector: Ignore short update intervals
"Extracts the yaml/boost-po aspects of the "self-describing" db::config
into an abstract type.
db::config is then reimplemented in said type, removing some of the
slightly cumbersome entanglement with seastar opts (log).
Adds a main hook for additional configuration files (options + file)"
* 'calle/config' of github.com:scylladb/seastar-dev:
main/init: Add registerable configuration objects
db::config: Re-implement on utils/config_file.
utils::config_file: Abstract out config file to external type
storage_service depends on endpoint states to be replicated to all
shards before token metadata is replicated. Currently this is taken
care of by storage_service::replicate_to_all_cores(), invoked from
storage_service's change listener. It copies whole endpoint state map,
which is expensive in large clusters. It's more efficient to replicate
only incremental changes, and only once, rather than for each
application state.
There is a requirement that whatever is present in token_metadata,
should also be present in endpoint_state. Because of that, we should update
endpoint_state first (set_gossip_tokens).
Apache Cassandra switched to this order as well in commit
b39d984f7bd682c7638415d65dcc4ac9bcb74e5f.
Makes state application faster due to increased parallelism.
Refs #2855.
Bootrap of 11th node, ignoring apply_state_locally() which complete instantly:
Before:
DEBUG 2017-10-06 15:24:04,213 [shard 0] gossip - apply_state_locally() took 1230 ms
DEBUG 2017-10-06 15:24:04,223 [shard 0] gossip - apply_state_locally() took 1421 ms
DEBUG 2017-10-06 15:24:04,225 [shard 0] gossip - apply_state_locally() took 607 ms
DEBUG 2017-10-06 15:24:04,288 [shard 0] gossip - apply_state_locally() took 488 ms
DEBUG 2017-10-06 15:24:04,408 [shard 0] gossip - apply_state_locally() took 1425 ms
After:
DEBUG 2017-10-06 16:24:13,130 [shard 0] gossip - apply_state_locally() took 814 ms
It's possible that a change listener for a later state will run before
change listener for the previous state completes, in which case
node's state can be corruped. For example, the previous change listener
may override sysytem.peers with an old value.
This patch fixes the problem by serializing state changes and
listeners for each node.
The implementation uses loading_shared_values so that the lock remains
alive as long as there is anyone holding it. Using endpoint_state_map
for that doesn't seem appropraite, because entries can be removed from
it while listeners are still running. There is code in the gossiper
which anticipates that entry may be gone across deferring points in
some places.
apply_new_states() always fires change listeners for received values,
even if we already processed the state earlier. Some change listeners
are heavy-weight, e.g. storage_service::handle_state_normal(). We
should avoid calling them more than necessary.
Make sure that we always run the change listeners by putting them in a
defer() block. Otherwise, if exception is thrown in the middle of state
application, change listeners would not be run. Later we would not
detect the change for states which were already applied, and not run
the change listers.
Fixes#2867
It is serialized since e428d06f40. This causes regression in
performance of application state propagation due to reduced
parallelism.
Processing states for each node has high latency due to memtable
flushes triggered by update_tokens() and commitlog syncs done by
system.peers updates, if commitlog sync mode is set to "batch". We
have high internal concurrency for these, so increasing parallelism
significantly reduces time to process all states.
Fixes#2855.
Failure detector decides that a node is down if it hasn't received a change of
its heartbeat for longer than ~11 times the average of past intervals between
updates.
If there are multiple incoming ACKs containing information about the
same node, we may detect and report a change for each of them. This
will cause failure_detector to establish that the average report
period is in milliseconds. After the update storm is over, it will
claim the node failure very soon, because report period will now be a
large multiple of the average.
Fix by not counting short updates into the calculation of average
arrival time.
Fixes#2861.
Handling all the boost::commandline + YAML stuff.
This patch only provides an external version of these functions,
it does not modify the db::config object. That is for a follow-up
patch.
"Histograms are a native prometheus type, and there are many functions
available that operate on them. There is extensive documentation about
them at https://prometheus.io/docs/practices/histograms/
One example is the function histogram_quantile(), that can extract
useful quantiles from the histograms. Currently, those functions don't
work well.
The reasons are twofold:
1) We are only exporting 16 metrics, starting from 1usec. That means
that the highest latency we can differentiate is 4ms. After that,
everything falls into the same bin.
2) The format that prometheus expects is that each bin will contain
the total number of points seen *up until that bin*, while we
currently export the total number of points that falls between bins.
IOW, it is a cummulative histogram.
About point two, granted it is a bit hidden in their website, but it is
there. The following phrase about a caveat make it clear:
"Note that we divide the sum of both buckets. The reason is that the
histogram buckets are cumulative. The le="0.3" bucket is also contained
in the le="1.2" bucket; dividing it by 2 corrects for that."
It is also not needed to accumulate things that fall over the last bin:
the _count component of the histogram will already account for that."
Acked-by: Amnon Heiman <amnon@scylladb.com>
Acked-by: Gleb Natapov <gleb@scylladb.com>
* 'prometheus-histograms' of github.com:glommer/scylla:
storage_proxy: change reporting of estimated histograms
estimated_histogram: bring histogram closer to what prometheus expects.
We moved to new dependency package names like
antlr3-c++-dev to scylla-antlr35-c++-dev when we moved to ppa on Ubuntu, but
Debian still uses old dist/debian/dep packages.
So keep using old style package names.
Fixes#2831
Message-Id: <1508245175-2184-1-git-send-email-syuu@scylladb.com>
query::full_slice doesn't select any regular or static columns, which
is at odds with the expectations of its users. This patch replaces it
with the schema::full_slice() version.
Refs #2885
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>
This patch introduces schema::full_slice(), which returns a
partition_slice selecting the full clustering range, as well as all
static and regular columns. No options aside from the default are
set in that partition_slice.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-1-git-send-email-duarte@scylladb.com>
Fixes: 2898
Typo error in gensalt(). Only returned selected hash method, not
the random salt bytes. Does not prevent the hash function from
operating, but strength is ever so reduced.
Message-Id: <20171016130505.25593-2-calle@scylladb.com>
"This changeset is the first step to flatten mutation_reader.
Then it introduces new mutation_fragment types for partition header and end of partition.
Using those a new flat_mutation_reader is defined.
Finally it introduces converters between new flat_mutation_reader and
old mutation_reader."
* 'haaawk/flattened_mutation_reader_v12' of github.com:scylladb/seastar-dev:
Add tests for flat_mutation_reader
Introduce conversion from flat_mutation_reader to mutation_reader
Introduce conversion from mutation_reader to flat_mutation_reader
Introduce flat_mutation_reader
Extract FlattenedConsumer concept using GCC6_CONCEPT
Introduce partition_end mutation_fragment
Introduce a position for end of partition
Introduce partition_start mutation_fragment
Introduce FragmentConsumer
Introduce a position for partition start
streamed_mutation: Extract concepts using GCC6_CONCEPT macro
The command scans random set of objects in a small pool (or, optionally
only objects of a certain size) for vptrs and builds a histogram, so that
most often used vptrs can be easily found. The command is useful to find
"memory leaks" caused by creating of too many tasks of a certain type
which is usually a result of unlimited parallelism somewhere.
Message-Id: <20171015081634.GB21092@scylladb.com>
Those tests run mutation source test for all sources
using conversion to and from flat_mutation_reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This reader operates on mutation_fragments instead of
streamed_mutations.
Each partition starts with a partition_header fragment
and ends with end_of_partition fragment.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This fixes a regression introduced in 27a3b4bca9 (master only).
partition_range_cursor assumes that as long as references are valid,
_end is valid as well. But if new entries were inserted before _end,
it may not, if the new entries fall after the query range. This may
result in reads returning partitions from outside the query range.
Message-Id: <1507815478-20269-1-git-send-email-tgrabiec@scylladb.com>
"Based on the functions get_endpoint_state_for_endpoint_ptr(),
get_application_state_ptr() and
endpoint_state::get_application_state_ptr(), this series
cleanups miscelaneous functions related to the gossiper.
It not only removes duplicated code, but also omits many copies.
All pointer usages have been audited for safety."
Acked-by: Asias He <asias@scylladb.com>
Acked-by: Tomasz Grabiec <tgrabiec@scylladb.com>
* 'gossiper-cleanup/v2' of github.com:duarten/scylla: (27 commits)
gms/endpoint_state: Remove get_application_state()
service/storage_service: Avoid copies in prepare_replacement_info()
service/storage_service: Cleanup get_application_state_value()
service/storage_service: Cleanup handle_state_removing()
service/storage_service: Cleanup get_rpc_address()
locator/reconnectable_snitch_helper: Avoid versioned_value copies
locator/production_snitch_base: Cleanup get_endpoint_info()
service/migration_manager: Avoid copies in is_ready_for_bootstrap()
service/migration_manager: Cleanup has_compatible_schema_tables_version()
service/migration_manager: Fix usages of get_application_state()
cache_hit_rate: Avoid copies in get_hit_rate()
gms/endpoint_state: Avoid copies in is_shutdown()
service/load_broadcaster: Avoid copy in on_join()
gms/gossiper: Cleanup get_supported_features()
gms/gossiper: Cleanup get_gossip_status()
gms/gossiper: Cleanup seen_any_seed()
gms/gossiper: Cleanup get_host_id()
gms/gossiper: Removed dead uses_vnodes() function
gms/gossiper: Cleanup uses_host_id()
gms/gossiper: Add get_application_state_ptr()
...
We were taking a reference to a temporary value in different places.
Fix them by using get_application_state_ptr(), which also avoids a copy.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the get_application_state_ptr() function, which
allows access to a versioned_value of a particular endpoint.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Now that we have get_endpoint_state_for_endpoint_ptr(), which does not
return a copy and allows mutating the actual state, we can use it
instead of repeating the lookup code.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Make it use get_endpoint_state_for_endpoint_ptr(), check if gossiper is
enabled, mark it as const, and have some callers use it instead of open
coding the logic.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Have convict() use get_endpoint_state_for_endpoint_ptr(), simplify
logging, and also protect expensive operations by checking the log
level.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
For every finished compaction, we were calculating shards for all
existing tables. With ignore_msb set to 0, it's probably not a big
deal, but if ignore_msb is like 12 and LCS is used (meaning thousands
of tables possibly), the operation may stall the reactor for a
considerable amount of time. That's fixed by caching shards.
Fixes#2875.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171011053424.22308-1-raphaelsc@scylladb.com>
"gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive. This
series introduces a function that returns a pointer and avoids
the copy.
Fixes#764"
* 'endpoint-state/v2' of https://github.com/duarten/scylla:
gossiper: Avoid endpoint_state copies
endpoint_state: const-qualify functions
storage_service: Remove duplicate endpoint state check
This type of mutation_fragment will be used in new mutation_reader
to signal the end of the current partition.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This position will be used for mutation fragment
that represents the end of partition.
This position sorts after all other mutation fragments.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This type of mutation_fragment will be used in new mutation_reader
to signal the beginning of the next partition.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive.
This patch adds a similar function which returns a pointer instead,
and changes the call sites where using the pointer-returning variant
is deemed safe (the pointer neither escapes the function, nor crosses
any defer point).
Fixes#764
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Seastar small pools can now fall back to smaller spans. Adjust
the 'scylla memory' command accordingly.
Message-Id: <20171005123935.13503-1-avi@scylladb.com>
* seastar c62bbf9...8babd1f (9):
> Enhanced support for Travis CI: build with and without DPDK support, use varioius compilers (GCC 5/6/7)
> backtrace: Allow whitespace after the backtrace addresses
> test.py: fix typo in noncopyable_function_test
> utils: introduce noncopyable_function
> Revert "utils: introduce noncopyable_function"
> utils: introduce noncopyable_function
> Add seastar-addr2line helper script to decode backtraces
> execution_stage: pass scheduling_group to constructor
> reactor: preempt tasks when a signal is received
"This series implements CAST AS functions in scylla.
It allows to use expressions of the form CAST(x AS type) in select statements.
Primary motivation for this functions came from aggregate functions, because
function avg(.) gives rounded results for interger columns. Now it is possible
to convert such column to float/double and obtain floating point results:
SELECT ... avg(cast(x as double)), ...
Fixes #2280."
* 'danfiala/2280-patch-series-v2' of https://github.com/hagrid-the-developer/scylla:
tests: Add test for CAST AS functions.
cql3: Add support for CAST AS functions to ANTLR grammar.
cql3/selectable: Add selectable::with_cast for CAST AS functions.
cql3/functions: Add support for CAST AS functions.
types:: Add support for CAST AS functions.
types: Moved code that implements conversion of types' values to string.
"This patch series adds backing materialized view for secondary indices.
When a new index is created with the 'CREATE INDEX' statement, a backing
materialized view is created automatically.
For example, assuming the following table:
CREATE TABLE ks1.users (
userid uuid,
email text,
PRIMARY KEY (userid)
);
When the following index is created:
CREATE INDEX user_email ON ks1.users (email);
The following materialized view is also created:
cqlsh> DESCRIBE ks1.users;
<snip>
CREATE MATERIALIZED VIEW ks1.user_email_index AS
SELECT email, userid
FROM ks1.users
WHERE email IS NOT NULL
PRIMARY KEY (email, userid)
WITH CLUSTERING ORDER BY (userid ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
AND comment = ''
AND compaction = {'class': 'SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CQL queries will use the backing materialized view as part of queries on
indexed columns to fetch the primary keys."
* 'penberg/cql-2i-backing-view/v3' of github.com:scylladb/seastar-dev:
schema_tables: Create backing view for indices
database: Kill obsolete secondary index manager stub
cql3: Wire up secondary index manager
cql3/restrictions: Add term_slice::is_supported_by() function
index: Add secondary_index_manager::create_view_for_index()
index: Add target_parser::parse() helper
cql3/statements: Add index_target::from_sstring() helper
index: Add secondary_index_manager::get_dependent_indices()
index: Add secondary_index_manager::reload()
index: Add secondary_index_manager::list_indexes()
index: Add index class
index: Pass column_family to secondary_index_manager constructor
database: Make secondary index manager per-column family
This patch wires calls to secondary index manager reload() in
merge_tables_and_views() and changes make_update_indices_mutations() to
also create mutations for the backing materialized view. After this
patch, "CREATE INDEX" CQL statement also creates a materialized view.
We are currently collapsing the histograms in 16 points, exponentially
increasing in value, starting from 1.
While reducing the number of points is a worthy goal, the current
configuration caps us at 4ms. Our latencies tend to be higher than this.
Starting from 1 is also a bit of an exhaggeration: rarely are our
latencies in that range. This patch changes reporting so that we
report 20 points starting from 32.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Histograms are a native prometheus type, and there are many functions
available that operate on them. There is extensive documentation about
them at https://prometheus.io/docs/practices/histograms/
One example is the function histogram_quantile(), that can extract
useful quantiles from the histograms. Currently, those functions don't
work well.
The reasons are twofold:
1) We are only exporting 16 metrics, starting from 1usec. That means
that the highest latency we can differentiate is 4ms. After that,
everything falls into the same bin.
2) The format that prometheus expects is that each bin will contain
the total number of points seen *up until that bin*, while we
currently export the total number of points that falls between bins.
IOW, it is a cummulative histogram.
About point two, granted it is a bit hidden in their website, but it is
there. The following phrase about a caveat make it clear:
"Note that we divide the sum of both buckets. The reason is that the
histogram buckets are cumulative. The le="0.3" bucket is also contained
in the le="1.2" bucket; dividing it by 2 corrects for that."
It is also not needed to accumulate things that fall over the last bin:
the _count component of the histogram will already account for that.
This patch changes the histogram format to be more in line with what
prometheus expect.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The call to std::ref() is not namespace-qualified, and so can conflict
with seastar::ref().
Fix by naming std::ref() explicitly.
Message-Id: <20171004155250.4960-1-avi@scylladb.com>
"Makes authorizer/authenticator actually pluggable (by class name)
and adds a "Transitional" type for both, conforming to the DSE
definition of the types.
The idea is to allow a rolling upgrade of a cluster to
authentication op by first making all clients provide credentials
(ignored by non-auth), then node by node enable auth with
transitional handlers, then ensure user DB is populated and
distributed, and finally rollingly enable strict auth for
each node. Pfew."
Fixes#2836
* auth: Transitional auth wrappers
auth: Make authenticator/authorizer use actual name based lookup
Similar to DSE objects with similar name. Basically ignores
all authentication/authorization except "superuser" login. All others
sessions are treated as anonymous.
Note: like DSE counterparts, a client session must still _use_
authentication to be able to connect, even though the actual content of
the auth is mostly ignored.
Allows registry to give back, for example, shared_ptr etc instead of
solely unique_ptr. If a registry is defined with seastar/std
shared/lw_shared/unique_ptr as "BaseType", the type will assume
this is the intended result type.
"Currently restricting_mutation_reader restricts mutation_readears on a
count basis. This is inaccurate on multiple levels. The reader might be
a combined_mutation_reader, which might be composed of multiple
individual readers, whose number might change during the lifetime of the
reader. The memory consumption of the readers can vary and may change
during the lifetime of the reader as well.
To remedy this, make the restriction memory-consumption based. The
restricting semaphore is now configured with the amound of memory
(bytes) that its readers are allowed to consume in total. New readers
consume 128k units up-front to account for read-ahead buffers, and then
consume additional units for any buffer (returned
from input_stream<>::read()) they keep around.
Like before, readers already allowed to read will not be blocked,
instead new readers will be blocked on their first read if all the units
all consumed.
Fixes #2692."
* 'bdenes/restricting_mutation_reader-v5' of https://github.com/denesb/scylla:
Update reader restriction related metrics
Add restricted_reader_test unit test
restricted_mutation_reader: restrict based-on memory consumption
mutation_reader.hh: Move restricted_reader related code
"
The original motivation for the "utils: introduce a loading_shared_values" series was a hinted handoff work where
I needed an on-demand asynchronously loading key-value container (a replica address to a commitlog instance map).
It turned out that we already have the classes that do almost what I needed:
- utils::loading_cache
- sstables::shared_index_lists
Therefore it made sense to find a common ground, unify this functionality and reuse the code both in the classes above and in the
new hinted handoff code.
This series introduces the utils::loading_shared_values that generalizes the sstables::shared_index_lists
API on top of bi::unordered_set with the rehashing logic from the utils::loading_cache triggered by an addition
of an entry to the set (PATCH1).
Then it reworks the sstables::shared_index_lists and utils::loading_cache on top of the new class (PATCH2 and PATCH3).
PATCH4 optimizes the loading_cache for the long timer period use case.
But then we have discovered that we have another "customer" for the loading_cache. Apparently our prepared statements cache
had a birth flaw - it was unlimited in size - unless the corresponding keyspace and/or table are modified/dropped the entries
are never evicted. We clearly need to limit its size and it would also make sense to evict the cache entries that haven't been
used long enough.
This seems like a perfect match for a utils::loading_cache except for prepared statements don't need to be reloaded after
they are created.
Patches starting from PATCH5 are dealing with adding the utils::loading_cache the missing functionality (like making the "reloading"
conditional and adding the synchronous methods like find(key)) and then transitioning the CQL and Thrift prepared statements
caches to utils::loading_cache.
This also fixes #2474."
* 'evict_unused_prepared-v5' of https://github.com/vladzcloudius/scylla:
tests: loading_cache_test: initial commit
cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache
cql3: prepared statements cache on top of loading_cache
utils::loading_cache: make the size limitation more strict
utils::loading_cache: added static_asserts for checking the callbacks signatures
utils::loading_cache: add a bunch of standard synchronous methods
utils::loading_cache: add the ability to create a cache that would not reload the values
utils::loading_cache: add the ability to work with not-copy-constructable values
utils::loading_cache: add EntrySize template parameter
utils::loading_cache: rework on top of utils::loading_shared_values
sstables::shared_index_list: use utils::loading_shared_values
utils: introduce loading_shared_values
Update description of existing reader count metrics, add memory
consumption metrics. Use labels to distinguish between system, user and
streaming reads related metrics.
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emmited by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account their own memory-consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
"Currently restricting_mutation_reader restricts mutation_readears on a
count basis. This is inaccurate on multiple levels. The reader might be
a combined_mutation_reader, which might be composed of multiple
individual readers, whose number might change during the lifetime of the
reader. The memory consumption of the readers can vary and may change
during the lifetime of the reader as well.
To remedy this, make the restriction memory-consumption based. The
restricting semaphore is now configured with the amound of memory
(bytes) that its readers are allowed to consume in total. New readers
consume 128k units up-front to account for read-ahead buffers, and then
consume additional units for any buffer (returned
from input_stream<>::read()) they keep around.
Like before, readers already allowed to read will not be blocked,
instead new readers will be blocked on their first read if all the units
all consumed."
Fixes#2692.
* 'bdenes/restricting_mutation_reader-v4' of https://github.com/denesb/scylla:
Update reader restriction related metrics
Add restricted_reader_test unit test
restricted_mutation_reader: restrict based-on memory consumption
mutation_reader.hh: Move restricted_reader related code
* seastar 899fc4e...c62bbf9 (6):
> Merge "CPU Scheduler for seastar" from Avi
> reactor: set SCHED_FIFO policy for timer thread
> future: mark future::wait() as noexcept
> shared_promise: Make get_shared_future() const-qualified
> Remove pessimizing and redundant std::move()-s reported by Clang-tidy utility
> Work around GCC 5 bug: scylladb/seastar#338, scylladb/seastar#339
Fixes#2770.
Fixes#2819.
* seastar 92fdce2...899fc4e (14):
> scollectd: increment the metadata iterator with the values
> Enable Travis CI builds for Seastar.
> tests: Fix httpd test compilation error caused by unconditionally explicit tuple constructor in GCC5: scylladb/seastar#326
> core::shared_future: add available() and failed() methods
> rpc: make sure that _write_buf stream is always properly closed
> log: Fail on attempt to register logger with the same name twice
> Merge "Make backtraces useful on ASLR-enabled machines as well" from Botond
> reactor: add option to bypass fsync
> future-util: modernize do_until() implementation
> future-util: fix do_until() API to not have forwarding references
> input_stream: add rvalue variant of input_stream::consume()
> logger: remove extra spaces after timestamp
> tutorial: lifetime management
> Fix broken link for fsqual failure message
We don't pull schema during rolling upgrade, that is until
schema_tables_v3 feature is enabled on all nodes.
Because features are enabled from gossiper timer, there is a race
between feature enablement and processing of endpoint states which may
trigger schema pull. It can happen that we first try to pull, but
only later enable the feature. In that case the schema pull will not
happen until the next schema change.
The fix is to ensure that pulls abandoned due to feature not being enabled
will be retried when it is enabled.
Fixes sporadic failure in dtest:
repair_additional_test.py:RepairAdditionalTest.repair_schema_test
Message-Id: <1506428715-8182-2-git-send-email-tgrabiec@scylladb.com>
Dirty memory manager for non-system column families was being used
when applying mutations to system cfs.
That previously lead to deadlock when updating history. Basically,
write disable waits on compaction, and compaction waits on a write
that would release dirty memory for updating compaction history.
Only using the correct dirty manager wouldn't solve this problem
if write is disabled for system cf, but the problem is completely
solved in addition to previous change which updates history
outside the sstable lock.
Refs #2769.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918215238.9810-3-raphaelsc@scylladb.com>
The reason to do that is because compaction can deadlock if refresh
disables write which waits for compaction, and compaction in turn
waits for dirty memory[1] that would be released by memtable write.
Dirty memory manager for non-system cfs was being used for system cfs,
which was useful for exposing this problem.
[1]: when updating compaction history.
Fixes#2769.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918215238.9810-2-raphaelsc@scylladb.com>
"Fixes the problem of concurrent populations of clustering row ranges
leading to some readers skipping over some of the rows.
Spotted during code review.
Fixes #2834."
* tag 'tgrabiec/fix-cache-reader-skipping-rows-v2' of github.com:scylladb/seastar-dev:
tests: mvcc: Add test for partition_snapshot_row_cursor
tests: row_cache: Add test for concurrent population
tests: row_cache: Make populate_range() accept partition_range
tests: Add simple_schema::make_ckey_range()
cache_streamed_mutation: Add missing _next_row.maybe_refresh() call
mvcc: partition_snapshot_row_cursor: Fix cursor skipping over rows added after its position
mvcc: partition_snapshot_row_cursor: Rename up_to_date() to iterators_valid()
mvcc: Keep track of all iterators in partition_snapshot_row_cursor
mvcc: Make partition_snapshot_row_cursor printable
Make get_natural_endpoints return local address iff token metadata
is not yet setup (since that is the one address we already know of).
If a request has a consistency level requiring more endpoints, it
will still fail, but for calls with, for example, CL=ONE, at startup
we will succeed, and more or less act like local strategy. Yet,
further down the line, have data distributed as desired.
Acked-by: Gleb Natapov <gleb@scylladb.com>
Message-Id: <20170926113512.15707-1-calle@scylladb.com>
We were checking if the cursor is up_to_date(), but this is not enough
to guarantee that the cursor is valid, merely that its iterators are
valid. The cursor may be invalidated even if its iterators are valid
if there was an insertion after cursor's position.
Fixes#2834.
The cursor maintains a heap of iterators in all versions. If rows were
inserted before the latest version's iterator, cursor would not see
them. Fix by redoing the lookup for iterators not in the current row
in maybe_refresh().
Refs #2834.
Will be needed when updating the iterator for latest version. Before
this change, such iterator could be neither in _current_row nor in
_heap.
Besides that, this will allow user to always access the iterator of
latest version, which enables some optimizations in the future of
avoiding unnecessary lookups. get_iterator_in_latest_version() is now
always valid.
Since commit 8378fe190, we disable schema sync in a mixed cluster.
The detection is done using gossiper features. We need to make sure
the features are registerred, and thus can be enabled, before the
bootstrapping of a non-seed node happens. Otherwise the bootstrap will
hang waiting on schema sync which will not happen.
Message-Id: <1505893837-27876-2-git-send-email-tgrabiec@scylladb.com>
This position will be used for mutation fragment
that represents the start of a partition.
This position sorts before static row.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
It makes it easier to actually use those concepts.
Lambdas passed to mutation_fragment::visit have to declare
return type otherwise compiler fails with:
internal compiler error: Segmentation fault
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The gossiper checks if features should be enabled from its timer
callback when it detects that endpoint_state_map changed, that is
different than shadow_endpoint_state_map.
shadow_endpoint_state_map is also assigned from endpoint_state_map in
storage_service::replicate_tm_and_ep_map(), called from
storage_service::on_change()
Call gossiper:maybe_enable_features() in replicate_tm_and_ep_map so
that we won't miss gossip feature update.
Fixes#2824
* git@github.com:scylladb/seastar-dev asias/gossip_miss_feature_update_v1:
gossip: Move the _features_condvar signal code to
maybe_enable_features
gossip: Make maybe_enable_features public
storage_service: Check gossip feature update in
replicate_tm_and_ep_map
This is another place we can update endpoint_state_map in addition to
gossiper::run().
Call the gossiper:maybe_enable_features() so that we won't miss gossip
feature update.
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emmited by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account their own memory-consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
It skipped one sub-range in each of the 10 range batch, and
tried to access the range vector using end() iterator.
Fixes sporadic failures of
update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_node_1_test.
Message-Id: <1505848902-16734-1-git-send-email-tgrabiec@scylladb.com>
"On bad_alloc the section is retried. If the exception happened inside
fast_forward_to() on the underlying reader, that call will be
retried. However, the reader should not be used after exception is
thrown, since it is in unspecified state. Also, calling
fast_forward_to() with cache region locked increases the chances of it
failing to allocate.
We shouldn't call fast_forward_to() with the cache region locked.
Fixes #2791."
* 'tgrabiec/dont-ffwd-in-alloc-section' of github.com:scylladb/seastar-dev:
cache_streamed_mutation: De-futurize cursor movement
cache_streamed_mutation: Call fast_forward_to() outside allocating section
cache_streamed_mutation: Switch from flags to explicit state machine
If we read a partition from a single sstable (a fairly common case),
we can bypass mutation_merger and just return the input.
Message-Id: <20170918181418.14021-1-avi@scylladb.com>
gossiper::apply_state_locally() calls handle_major_state_change() for
each endpoint, in a seastar thread, which calls mark_alive() for new
nodes, which calls ms().send_gossip_echo(id).get(). So it synchronously
waits for each node to respond before it moves on to the next entry. As
a result it may take a while before whole state is processed.
Apache (tm) Cassandra (tm) sends echos in the background.
In a large cluster, we see at the time the joining node starts
streaming, it hasn't managed to apply all the endpoint_state for peer
nodes, so the joining node does not know some of the nodes yet, which
results in the joining node ingores to stream from some of the existing
nodes.
Fixes#2787Fixes#2797
Message-Id: <3760da2bef1a83f1b6a27702a67ca4170e74b92c.1505719669.git.asias@scylladb.com>
When incremental_reader_selector is used for compaction, it will
first call incremental selector of partitioned sstable set with
minimum token that will result in first interval being skipped,
which means not everything being compacted. The interval is
skipped because iterator is incorrectly advanced when token
lies before it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918021446.15920-1-raphaelsc@scylladb.com>
The current sequential scan can take a long time on a small or empty table
with a large (nr_nodes * nr_vnodes) count, and can time out. Switching to
exponential scan reduces the time.
Fixes#1230.
Message-Id: <20170912173803.8277-1-avi@scylladb.com>
Forward-declare untyped_result_set and untyped_result_set_row, and remove
the include from query_processor.hh.
Message-Id: <20170916170859.27612-3-avi@scylladb.com>
"sstables.hh is already too big, and it is soon to become bigger with the
inclusion of the read_monitor, to pair it with the write_monitor.
It's a good opportunity for us to reduce sstables.hh dependencies by
moving the write monitor to its own reader. One obvious caller is
already changed so we don't need to include sstables.hh anymore."
* 'progress-monitor' of https://github.com/glommer/scylla:
sstables: do not include sstables.hh from memtable glue
sstables: move write_monitor to its own header
log_histogram is not really a histogram, it is a heap-like container.
Rename to log_heap in case we do want a log_histogram one day.
Message-Id: <20170916172137.30941-1-avi@scylladb.com>
This patch change the implementation of storage_service
repair_async_status to throw an exception, this way a 400 return code
will be returned.
Fixes#2794
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170917080533.6612-1-amnon@scylladb.com>
- Transition the prepared statements caches for both CQL and Trhift to the cql3::prepared_statements_cache class.
- Add the corresponding metrics to the query_processor:
- Evictions count.
- Current entries count.
- Current memory footprint.
Fixes#2474
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This is a template class that implements caching of prepared statements for a given ID type:
- Each cache instance is given 1/256 of the total shard memory. If the new entry is going to overflow
this memory limit - the less recently used entries are going to be evicted so that the new entry could
be added.
- The memory consumption of a single prepared statement is defined by a cql3::prepared_cache_entry_size
functor class that returns a number of bytes for a given prepared statement (currently returns 10000
bytes for any statement).
- The cache entry is going to be evicted if not used for 60 minutes or more.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Ensure that the size of the cache is never bigger than the "max_size".
Before this patch the size of the cache could have been indefinitely bigger than
the requested value during the refresh time period which is clearly an undesirable
behaviour.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Sometimes we don't want the cached values to be periodically reloaded.
This patch adds the ability to control this using a ReloadEnabled template parameter.
In case the reloading is not needed the "loading" function is not given to the constructor
but rather to the get_ptr(key, loader) method (currently it's the only method that is used, we may add
the corresponding get(key, loader) method in the future when needed).
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Current get(...) interface restricts the cache to work only with copy-constructable
values (it returns future<Tp>).
To make it able to work with non-copyable value we need to introduce an interface that would
return something like a reference to the cached value (like regular containers do).
We can't return future<Tp&> since the caller would have to ensure somehow that the underlying
value is still alive. The much more safe and easy-to-use way would be to return a shared_ptr-like
pointer to that value.
"Luckily" to us we value we actually store in a cache is already wrapped into the lw_shared_ptr
and we may simply return an object that impersonates itself as a smart_pointer<Tp> value while
it keeps a "reference" to an object stored in the cache.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Allow a variable entry size parameter.
Provide an EntrySize functor that would return a size for a
specific entry.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Get rid of the "proprietary" solution for asynchronous values on-demand loading.
Use utils::loading_shared_values instead.
We would still need to maintain intrusive set and list for efficient shrink and invalidate
operations but their entry is not going to contain the actual key and value anymore
but rather a loading_shared_values::entry_ptr which is essentially a shared pointer to a key-value
pair value.
In general, we added another level of dereferencing in order to get the key value but since
we use the bi::store_hash<true> in the hook and the bi::compare_hash<true> in the bi::unordered_set
this should not translate into an additional set lookup latency.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Since utils::loading_shared_values API is based on the original shared_index_list
this change is mostly a drop-in replacement of the corresponding parts.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This class implements an key-value container that is populated
using the provided asynchronous callback.
The value is loaded when there are active references to the value for the given key.
Container ensures that only one entry is loaded per key at any given time.
The returned value is a lw_shared_ptr to the actual value.
The value for a specific key is immediately evicted when there are no
more references to it.
The container is based on the boost::intrusive::unordered_set and is rehashed (grown) if needed
every time a new value is added (asynchronously loaded).
The container has a rehash() method that would grow or shrink the container as needed
in order to get the load factor into the [0.25, 0.75] range.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
There is no need to include the whole sstables.hh file in
memtable-sstable.hh anymore. All we need is the shared_sstable
definition and the progress monitor.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Soon I am about to introduce a read monitor, and pairing infrastructure
to manage it. Having it all living in sstables.hh force to include it
everytime, even in places that don't really need it.
Move to its own header.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
On bad_alloc the section is retried. If the exception happened inside
fast_forward_to() on the underlying reader, that call will be
retried. However, the reader should not be used after exception is
thrown, since it is in unspecified state. Also, calling
fast_forward_to() with cache region locked increases the chances of it
failing to allocate.
We shouldn't call fast_forward_to() with the cache region locked.
Fixes#2791.
We're in one state at a time, so it's better to express it as a single
variable rather than N independent flags.
In preparation before adding more states.
Right now we pass a permit to the memtable writer and that permit is
used insite write_memtable_to_sstable to compose a write_monitor.
We would like to extend the write_monitor to include other things, that
right now are not available as parameters to write_memtable_to_sstable -
and which are possibly too specialized to be.
The solution for that is to pass the write_monitor instead of the permit
to the writer. Conceptually, that also makes sense because the
write_monitor is something the sstable writer is aware of. Permits, on
the other hand, are a database concept that is alien to the sstable
writer.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170915032836.21154-1-glauber@scylladb.com>
"When there are at least 2 nodes upgraded to 2.0, and the two exchanged schema
for some reason, reads or writes which involve both 1.7 and 2.0 nodes may
start to fail with the following error logged:
storage_proxy - Exception when communicating with 127.0.0.3: Failed to load schema version 58fc9b89-74ab-37ca-8640-8b38a1204f8d
The situation should heal after whole cluster is upgraded.
Table schema versions are calculated by 2.0 nodes differently than 1.7 nodes
due to change in the schema tables format. Mismatch is meant to be avoided by
having 2.0 nodes calculate the old digest on schema migration during upgrade,
and use that version until next time the table is altered. It is thus not
allowed to alter tables during the rolling upgrade.
Two 2.0 nodes may exchange schema, if they detect through gossip that their
schema versions don't match. They may not match temporarily during boot, until
the upgraded node completes the bootstrap and propagates its new schema
through gossip. One source of such temporary mismatch is construction of new
tracing tables, which didn't exist on 1.7. Such schema pull will result in a
schema merge, which cause all tables to be altered and their schema version to
be recalculated. The new schema will not match the one used by 1.7 nodes,
causing reads and writes to fail, because schema requesting won't work during
rolling upgrade from 1.7 to 2.0.
The main fix employed here is to hold schema pulls, even among 2.0 nodes,
until rolling upgrade is complete."
* 'tgrabiec/fix-schema-mismatch' of github.com:scylladb/seastar-dev:
tests: schema_change_test: Add test_merging_does_not_alter_tables_which_didnt_change test case
tests: cql_test_env: Enable all features in tests
schema_tables: Make make_scylla_tables_mutation() visible
migration_manager: Disable pulls during rolling upgrade from 1.7
storage_service: Introduce SCHEMA_TABLES_V3 feature
schema_tables: Don't alter tables which differ only in version
schema_mutations: Use mutation_opt instead of stdx::optional<mutation>
If there is a schema pull during rolling upgrade among a two 2.0 nodes,
then schema merge will delete the persisted schema version. When the node
loads that table again, e.g. on restart, it will generate a version
which is different than the one which 1.7 nodes use. This will
cause reads and writes to fail.
To avoid this, disable pulls until all nodes are upgraded.
Fixes#2802.
We apply deletion of scylla_tables.version to the incoming schema
mutations so that table schema version is recalculated after merge.
The mutations which we read from local schema tables may not have it
deleted in which case all tables would be considered as differing on
the presence of the version field. Avoid this by deleting the field
from old mutations as well.
After the service started, a state of the service may become
"failed", "active" or "activating".
But our script does not accept scylla-ami-setup.service become "failed" state,
in result the script shows up wrong message.
So we handle these three types of state correctly.
Fixes#2759
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1504589079-1986-1-git-send-email-syuu@scylladb.com>
evict() doesn't guarantee that the whole partition is discontinuous.
In particular, partition tombstone cannot be marked as discontinuous.
The parts which are still continuous must be updated.
Broken after c78047fa5b.
Message-Id: <1505375684-28574-1-git-send-email-tgrabiec@scylladb.com>
Collect coordinator side read statistic per CF and use them in percentile
speculative read executor. Getting percentile from estimated_histogram
object is rather expensive, so cache it and recalculate only once per
second (or if requested percentile changes).
Fixes#2757
Message-Id: <20170911131752.27369-3-gleb@scylladb.com>
Currently overflow values are stored in incorrect bucket (last one
instead of special "overflow" one) and percentile() function throws
if there is overflow value. The patch fixes the code to store overflow
value in corespondent bucket and makes percentile() to take it into
account instead of throwing.
Message-Id: <20170911131752.27369-2-gleb@scylladb.com>
"This series fixes the problem of active reads causing OOM due to the fact that
partition snapshots they hold are not evictable. In particular, a single scan
of a partition larger than memory will bad_alloc due to itself.
After this, when partition entry is evicted from cache, data in all the snapshots
is also evicted. We still don't have row-level eviction, but this series lays some
grounds for it by making cache readers prepared for the possibility of rows
being evicted.
Fixes#2775.
Fixes #2730."
* tag 'tgrabiec/snapshot-evicition-in-cache-v1' of github.com:scylladb/seastar-dev:
tests: Add test for partition_entry::evict()
mutation_partition: Introduce range continuity checking methods
mutation_partition: Enable rows_entry::compare() on position_in_partition_views
tests: Extract mvcc tests to separate file
tests: row_cache: Add evicition tests
tests: simple_schema: Add new_tombstone() helper
tests: streamed_mutation_assertions: Introduce produces(mutation&)
streamed_mutation: Allow setting buffer capacity
row_cache: Evict partition snapshots
mvcc: Introduce partition_entry::evict()
row_cache: Handle eviction in partition reader
tests: row_cache_test: Don't assume mvcc snapshots are not evictable
row_cache: Reuse allocation_strategy::invalidate_references()
row_cache: Don't invalidate references on insertion
lsa: Move reclaim counter concept to allocation_strategy level
mvcc: Ensure partition_snapshot always destroys versions using proper allocator
mvcc: Encapsulate reference stability check in partition_snapshot
mvcc: Store LSA region reference in partition_snapshot
If snapshots are not evicted, they may pin unbouned amount of memory
for a long time in cache, which may lead to OOM. Evict snapshots
together with the entry.
Fixes#2775.
Fixes#2730.
The test was not updating the underlying mutation source but still
expecting to get the right data after calling invalidate(). If
snapshots are evictable, that's not guaranteed. Apply to underlying as
well, so data is read from underlying if necessary.
modification_count is currently only used to detect invalidation of
references, intended to be incremented on erasure.
Insertion into intrusive set doesn't invalidate references, so no
need to increment the counter.
partition_snapshot is managed by lw_shared_ptr. Currently it is
assumed that before it dies, maybe_merge_versions() is called on it,
which destroyes it in the right allocator context. It's not very
safe. This patch improves safety by using the right allocator in
snapshot's destructor.
This fixes a failure in view_schema_test, which starts many instances
of single_node_cql_env. cancel_atomic_deletions() causes later
deletions to fail, which causes some of the test cases to fail.
Message-Id: <1505311250-3118-2-git-send-email-tgrabiec@scylladb.com>
Fixes a hang on shutdown with --smp 2 in perf_fast_forward. The hang
is in sstables::await_background_jobs_on_all_shards(), which is
waiting on sstable deletions. Not all shards agree to delete certain
sstables, because e.g. not all shards decide to compact them
yet. Cancel those deletes after database is stopped on all shards,
like we do in main.cc
Fixes#2796.
Message-Id: <1505292239-26032-1-git-send-email-tgrabiec@scylladb.com>
Finds live objects on seastar heap of current shard which contain
given value. Prints results in 'scylla ptr' format.
Example:
(gdb) scylla find 0x600005321900
thread 1, small (size <= 512), live (0x6000000f3800 +48)
thread 1, small (size <= 56), live (0x6000008a1230 +32)
Message-Id: <1505284614-19577-1-git-send-email-tgrabiec@scylladb.com>
compaction_strategy.hh throws an exception, but it doesn't add the
exception header. It is working in-tree because of inclusion order,
but it broke one of my yet-out-of-tree changes.
In any case, it is best to add the headers we will need to the files,
and that is what this patch does.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170912233326.26114-1-glauber@scylladb.com>
filter.cc has just two smallish functions, which are part of the sstable
class. Move them to sstables.cc where the rest of the class members are defined.
Message-Id: <20170912080541.7836-1-avi@scylladb.com>
The test used apply() variant which assumed that it was invoked in a
seastar thread, which is no longer the case after commit d22fdf4. Fix
by copying outisde cache update, and use non-deferring apply() variant
for cache update.
Message-Id: <1505200142-3650-1-git-send-email-tgrabiec@scylladb.com>
This patchset reduces includes of sstables.hh, reducing compile time
by both reducing the amount of code compiled, and the amount of
needless recompiles caused by false dependencies. It does so by
replacing lw_shared_ptr<sstable>, which requires a complete class,
with a new custom type shared_sstable, which allows an incomplete
sstable class definition.
* https://github.com/avikivity/scylla deps2/v2.1
database: change truncate() to flush while compaction is disabled
database: make run_with_compaction_disabled() a non-template
database: add indirection to compaction_manager instance
database: remove dependency on compaction.hh and compaction_manager.hh
size_estimates_virtual_reader.hh: add missing include
system_keyspace: add missing include
main: add missing include
storage_service: add missing include
repair: add missing include
compaction.hh: add missig include and forward declaration
compaction_manager: add missing include
shared_index_lists.hh: add missing include
perf_fast_forward: add missing include
sstable_mutation_test: add missing include
sstables: extract version and format enum into a separate header file
database.hh: add missing forward declaration for
foreign_sstable_open_info
cql_test_env: add forward declaration
database: make column_family::disable_sstable_write() out-of-line
sstables: introduce make_sstable() as a shortcut for
make_lw_shared<sstable>
treewide: use shared_sstable, make_sstable in place of
lw_shared_ptr<sstable>
sstables: use support for lw_shared_ptr with incomplete type for
shared_sstable
sstables: reduce dependencies
streaming: remove unneeded includes
When table is created, it doesn't contain any data, so we can mark the whole
data range as continuous in cache. This way reads will immediately hit, and
flushes will populate. If sstables are later attached, the attaching process
is supposed to invalidate affected ranges (and it does).
Fixes#2536.
Message-Id: <1505200269-4031-1-git-send-email-tgrabiec@scylladb.com>
Use the lw_shared_ptr deleter support to define shared_sstable without
pulling the definition of class sstable, reducing compile time and
dependencies if only shared_sstable is needed.
The parser generator somehow confuses the use-after-scope sanitizer, causing it
to use large amounts of stack space. Disable that sanitizer on that file.
Message-Id: <20170905110628.18047-1-avi@scylladb.com>
In preparation to make run_with_compaction_disabled() a non-template,
we want to remove any non-copyable captures (so the function can be
an std::function, which requires copyability). Move the flush within
the compaction disabled region. This changes the behavior, but it shouldn't
matter.
"These patches make Scylla refuse to load counter sstables that may
contain unsupported counter shards. They are recognised by the lack of
the Scylla component.
Fixes #2766."
* tag 'reject-non-scylla-counter-sstables/v1' of https://github.com/pdziepak/scylla:
db: reject non-Scylla counter sstables in flush_upload_dir
db: disallow loading non-Scylla counter sstables
sstable: add has_scylla_component()
A keyspace can be deleted while write is ongoing, so the object cannot
be used after defer point. The keyspace reference is only used to check
how many replies a write operation should wait for and this can be
precalculated during write handler creation.
Fixes#2777
Message-Id: <20170911084436.GG24167@scylladb.com>
"This series tries to improve the bootstrap of a node in a large cluster by
improving how gossip applies the gossip node state. In #2404, the joining node
failed to bootstrap, because it did not see the seed node when
storage_service::bootstrap ran. After this series, we apply the whole gossip
state contained in the gossip ack/ack2 message before applying the next one,
and we apply the state of the seed node earlier than non-seed node so we can
have the seed node's state faster. We also add some randomness to the order of
applying gossip node state to prevent some of the nodes' state are always
applied earlier than the others.
This series improves apply_state_locally for large cluster:
- Tune the order of applying endpoint_state
- Serialize apply_state_locally
- Avoid copying of the gossip state map
Fixes#2404"
* tag 'asias/gossip_issue_2404_v2' of github.com:scylladb/seastar-dev:
gossip: Avoid copying with apply_state_locally
gossip: Serialize apply_state_locally
gossip: Tune the order of applying endpoint_state in apply_state_locally
gossip: Introduce is_seed helper
gossip: Pass const endpoint_state& in notify_failure_detector
gossip: Pass reference in notify_failure_detector
Move the std::map<inet_address, endpoint_state> map from the gossip
ack/ack2 message directly and move it around in apply_state_locally to
avoid copying the map.
apply_state_locally will be called when gossip ack/ack2
message is received. It will use the std::map<inet_address,
endpoint_state>& map to update the endpoint state.
However, we can receive multiple such gossip ack/ack2 messages from
multiple peer nodes in parallel. Currently, we process them in parallel.
It is better to apply all the states from one node then move to apply
all the states from another node than interleaving. Because it is more
important to have the state of the whole cluster than to have a bit
newer state from another peer (if it is newer), especially when the node
boots up and runs its first round of gossip exchange.
After this patch, we apply the whole gossip state contained in the
gossip ack/ack2 message before applying the next one.
We currently always apply the endpoint_state in the order of the
endpoint ip address. This is not good because some of the endpoint's
state is always applied earlier than the others.
In large cluster, the number of endpoints can be large, it takes time to
apply all of them. To make it more fair, we apply the endpoint_state
randomly.
Apply the seed node's state earlier because in bootstrap, we will check
if we have seen the seed node in storage_service::bootstrap. In #2404,
the bootstrap failed because, the joining node hasn't apply the seed
node's state when storage_service::bootstrap runs.
* seastar 85ca12d...31b925d (19):
> net/byteorder: fix 64 bit ntohq and htonq on big endian machines
> core, util: fix compilation on non-x86 processors
> core/memory: Fix SIGSEGV in small_pool::add_more_objects()
> log: remove debug leftovers
> Merge "TLS state machine fixes" from Calle
> logger: allow adjusting the timestamp style for stdout logs
> thread: make thread_context::s_main portable
> core: add seastar::cache_line_size constant
> Add detach() to input_stream and output_stream
> Install dependencies for Arch Linux.
> tls: Guard non-established sockets in sesrefs + more explicit close + states
> tls: Make vec_push fully exception safe
> basic_sstring: resize uses sstring
> Merge "Add and correct unit tests" from Jesse
> tcp: enforce 1-byte maximum segment invariant with zero window
> tcp: verify 1-byte maximum segment invariant during send with zero window
> memory: reduce small_pool vulnerability to fragmentation further
> Prometheus: avoid merging all metrics family
> net: Fix possible NULL pointer dereference.
validate_request_failure() assumed that the future returned by execute_cql()
is always ready, which doesn't have to be the case, and caused aborts
in debug mode build.
Message-Id: <1504701342-13300-1-git-send-email-tgrabiec@scylladb.com>
Scylla already refuses to load counter sstables that do not have Scylla
component. However, if this happens because of 'nodetool refresh'
command the existing protection will trigger after sstables have been
moved to the data directory. This is too later, so an additional check
is added when the upload directory is scanned.
has_scylla_component() is going to be used to verify that an sstable has
been generated by a recent version of Scylla. This would make it
possible to reject sstables that may be unsafe to load (e.g. sstables
containing legacy counter shards).
"This series implements the missing API to terminate all repairs.
For example:
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.1:10000/storage_service/force_terminate_repair"
With the new stream_plan::abort() api we can now abort the stream
session assocaited with the repair as well.
On top of this, we can support termination of single repair job instead all
jobs.
Fixes#2105"
* tag 'asisas/repair_abort_v4' of github.com:scylladb/seastar-dev:
repair: Support termination of repair jobs
repair: Track repair_info
repair: Intorduce repair id to repair_info map
api: Add force_terminate_repair API
streaming: Add abort to stream_plan
streaming: Add abort_all_stream_sessions for stream_coordinator
streaming: Introduce streaming::abort()
streaming: Make stream_manager and coordinator message debug level
streaming: Check if _stream_result is valid
streaming: Log peer address in on_error
streaming: Introduce received_failed_complete_message
"Scylla 1.7.4 and older use incorrect ordering of counter shards, this
was fixed in 0d87f3dd7d ("utils::UUID:
operator< should behave as comparison of hex strings/bytes"). However,
that patch was not backported to 1.7 branch until very recently. This
means that versions 1.7.4 and older emit counter shards in an incorrect
order and expect them to be so. This is particularly bad when dealing
with imported correct sstables in which case some shards may become
duplicated.
The solution implemented in this patch is to allow any order of counter
shards and automaticly merge all duplicates. The code is written in a
way so that the correct ordering is expected in the fast path in order
not to excessively punish unaffected deployments.
A new feature flag CORRECT_COUNTER_ORDER is introduced to allow seamless
upgrade from 1.7.4 to later Scylla versions. If that feature is not
available Scylla still writes sstables and sends on-wire counters using
the old ordering so that it can be correctly understood by 1.7.4, once
the flag becomes available Scylla switches to the correct order.
Fixes #2752."
* tag 'fix-upgrade-with-counters/v2' of https://github.com/pdziepak/scylla:
tests/counter: verify counter_id ordering
counter: check that utils::UUID uses int64_t
mutation_partition_serializer: use old counter ordering if necessary
mutation_partition_view: do not expect counter shards to be sorted
sstables: write counter shards in the order expected by the cluster
tests/sstables: add storage_service_for_tests to counter write test
tests/sstables: add test for reading wrong-order counter cells
sstables: do not expect counter shards to be sorted
storage_service: introduce CORRECT_COUNTER_ORDER feature
tests/counter: test 1.7.4 compatible shard ordering
counters: add helper for retrieving shards in 1.7.4 order
tests/counter: add tests for 1.7.4 counter shard order
counters: add counter id comparator compatible with Scylla 1.7.4
tests/counter: verify order of counter shards
tests/counter: add test for sorting and deduplicating shards
counters: add function for sorting and deduplicating counter cells
counters: add counter_id::operator>
Until the cluster is fully upgraded from a version that uses the
incorrect counter shard ordering it is essential to keep using it lest
the old nodes corrupt the data upon receiving mutations with a counter
shard ordering they do not expect.
If the feature signaling that we have switched to the correct ordering
of counter shards is not enabled it means that the user still can do a
rollback to a version that expects wrong ordering. In order to avoid any
disasters when that happens write sstables using the 1.7.4 order until
we know for sure that it is no longer needed.
Scylla 1.7.4 used incorrect ordering of counter shards. In order to fix
this problem a new feature is introduced that will be used to determine
when nodes with that bug fixed can start sending counter shard in the
correct order.
Due to a bug in an implementation of UUID less compare some Scylla
versions sort counter shards in an incorrect order. Moreover, when
dealing with imported correct data the inconsistencies in ordering
caused some counter shards to become duplicated.
* 'tgrabiec/make-row-cache-update-exception-safe' of github.com:scylladb/seastar-dev:
row_cache: Improve safety of cache updates
row_cache: Extract invalidate_sync()
memtable: Mark mark_flushed() as noexcept
database: Add non-throwing try_trigger_compaction()
database: Make add_sstable() have strong exception guarantees
row_cache: Don't require presence checker to be supplied externally
database: Supply presence checker in sstable snapshots
mutation_source: Introduce mutation_source::make_partition_presence_checker()
mutation_reader: Move definitions up in the header
mutation_reader: Use constructor delegation to reduce code duplication
row_cache: Make populate() preserve continuity
row_cache: Allow marking as fully continuous on construction
database: Add missing serialization of sstable set udpate and cache invalidation
Cache imposes requirements on how updates to the on-disk mutation source
are made:
1) each change to the on-disk muation source must be followed
by cache synchronization reflecting that change
2) The two must be serialized with other synchronizations
3) must have strong failure guarantees (atomicity)
Because of that, sstable list update and cache synchronization must be
done under a lock, and cache synchronization cannot fail to synchronize.
Normally cache synchronization achieves no-failure thing by wiping the
cache (which is noexcept) in case failure is detect. There are some
setup steps hoever which cannot be skipped, e.g. taking a lock
followed by switching cache to use the new snapshot. That truly cannot
fail. The lock inside cache synchronizers is redundant, since the
user needs to take it anyway around the combined operation.
In order to make ensuring strong exception guarantees easier, and
making the cache interface easier to use correctly, this patch moves
the control of the combined update into the cache. This is done by
having cache::update() et al accept a callback (external_updater)
which is supposed to perform modiciation of the underlying mutation
source when invoked.
This is in-line with the layering. Cache is layered on top of the
on-disk mutation source (it wraps it) and reading has to go through
cache. After the patch, modification also goes through cache. This way
more of cache's requirements can be confined to its implementation.
The failure semantics of update() and other synchronizers needed to
change due to strong exception guaratnees. Now if it fails, it means
the update was not performed, neither to the cache nor to the
underlying mutation source.
The database::_cache_update_sem goes away, serialization is done
internally by the cache.
The external_updater needs to have strong exception guarantees. This
requirement is not new. It is however currently violated in some
places. This patch marks those callbacks as noexcept and leaves a
FIXME. Those should be fixed, but that's not in the scope of this
patch. Aborting is still better than corrupting the state.
Fixes#2754.
Also fixes the following test failure:
tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed
which started to trigger after commit 318423d50b. Thread stack
allocation may fail, in which case we did not do the necessary
invalidation.
Every mutation source can have a presence checker. By default all
answer "maybe contains".
Having this on mutation_source level will be useful for simplifying
cache update flow. The cache can ask the right snapshot for a presence
checker rather than relying on database to know when and how to make
the right one which preserves all invariants.
This will be especially useful once all updates of the underlying
mutation source of cache (e.g. sstable list) will have to go through
cache for safety reasons.
Commit e3ad676433 missed a few places.
It is required to serialize sstable list update and cache synchronization
in order to preserve partition update isolation.
Fixes#2746.
* dist/ami/files/scylla-ami b41e5eb...5ffa449 (3):
> amzn-main.repo: stick to Amazon Linux 2017.03 kernel (4.9.x)
> Prevent dependency error on 'yum update'
> scylla_create_devices: don't raise error when no disks found
This was part of "add gate for generic async operations to column family" but
somehow didn't make it into the final patch.
Add the missing piece.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170830164205.4497-1-glauber@scylladb.com>
"optional interposer that will check integrity of writes to sstable components.
The option name is enable_sstable_data_integrity_check, it's disabled by default
and can be enabled via config file.
It will provide enough details that will help to find the root of the issue.
if disk failed for example, we would've something like the following reported:
ERROR 2017-08-17 09:18:11,577 [shard 0] sstable - integrity check failed for ./data/data/system_schema/aggregates-924c55872e3a345bb10c12f37c1ba895/system_schema-aggregates-ka-111-Scylla.db, stage: read after write verification, write: 4096 bytes to offset 0, reason: data read from underlying storage isn't the same as written, mismatch at byte 0:
data written sample: 10000001000000010000001a00000001
date read sample: 00000000000000000000000000000000"
* 'integrity_check_interposer_v3' of github.com:raphaelsc/scylla:
sstables: optionally check integrity of sstable component writes
sstables: remove unneeded new_sstable_component_file variant
db/config: add sstable_data_integrity_check option
sstables: introduce file interposer for integrity check
If config file's sstable_data_integrity_check option is enabled,
new integrity check interposer will be used in addition to the
existing one. Performance is expected to drop because of all
the integrity checks for every write. This new interposer will
provide detailed info when integrity check fails.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
optional interposer that will check integrity when writing to
sstable components. It will provide enough details that will
help to find the root of the issue, which may come from lower
level layers.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This patch implements the missing API to terminate all repairs.
For example:
$ curl -X POST --header "Accept: application/json"
"http://127.0.0.2:10000/storage_service/force_terminate_repair"
With the new stream_plan::abort() api we can now abort the stream
session assocaited with the repair as well.
Fixes#2105
The maps are stored in a vector. The vector has smp::count elements, each
element will be accessed by only one shard.
The add_repair_info, remove_repair_info and get_repair_info helpers
are added.
The api /storage_service/force_terminate is supposed to be
/storage_service/force_terminate_repair.
scylla-jmx uses /storage_service/force_terminate api.
So instead of renaming it, it is better to add a new name for it.
When we abort a session, it is possible that:
node 1 abort the session by user request
node 1 send the complete_message to node 2
node 2 abort the session upon receive of the complete_message
node 1 sends one more stream message to node 2 and the stream_manager
for the session can not be found.
It is fine for node 2 to not able to find the stream_manager, make the
log on node 2 less verbose to confuse user less.
It is the handler for the failed complete message. Add a flag to
remember if we received a such message from peer, if so, do not send
back the failed complete message back to the peer when running
close_session with failed status.
This reverts commit b56ba02335.
After commit 8fa35d6ddf (messaging_service: Get rid of timeout and retry
logic for streaming verb), streaming verb in rpc does not check if a
node is in gossip memebership since all the retry logic is removed.
Remove the extra wait before removing the joining node from gossip
membership.
Message-Id: <a416a735bb8aad533bbee190e3324e6b16799415.1504063598.git.asias@scylladb.com>
Thread stack allocation may fail, in which case we did not do the
necessary invalidation. Fix by hoisting the scope of the cleanup function.
Also fixes the following test failure:
tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed
which started to trigger after commit 318423d50b.
Message-Id: <1504023113-30374-2-git-send-email-tgrabiec@scylladb.com>
With the "Use range_streamer everywhere" (7217b7ab36) seires, all
the user of streaming now do streaming with relative small ranges and
can retry streaming at higher level.
There are problems with timeout and retry at RPC verb level in streaming:
1) Timeout can be false negative.
2) We can not cancel the send operations which are already called. When
user aborts the streaming, the retry logic keeps running for a long
time.
This patch removes all the timeout and retry logic for streaming verbs.
After this, the timeout is the job of TCP, the retry is the job of the
upper layer.
Message-Id: <df20303c1fa728dcfdf06430417cf2bd7a843b00.1503994267.git.asias@scylladb.com>
- Removed text from Report's "PURPOSE" section, which was referring to the "MANUAL CHECK LIST" (not needed anymore).
- Removed curl command (no longer using the api_address), instead using scylla --version
- Added -v flag in iptables command, for more verbosity
- Added support to for OEL (Oracle Enterprise Linux) - minor fix
- Some text changes - minor
- OEL support indentation fix + collecting all files under /etc/scylla
- Added line seperation under cp output message
Signed-off-by: Tomer Sandler <tomer@scylladb.com>
Message-Id: <20170828131429.4212-1-tomer@scylladb.com>
Large deques require contiguous storage, which may not be available (or may
be expensive to obtain). Switch to new custom container instead, which allocates
less contiguous storage.
Allocation problems were observed with the summary and compression info. While
there is work to reduce compression info contiguous space use, this solves
all std::deque problems (and should not conflict with that work).
Fixes#2708
* tag '2708/v6' of https://github.com/avikivity/scylla:
sstables: switch std::deque to chunked_vector
tests: add test for chunked_vector
utils: add a new container type chunked_vector
"Fixes #2734."
* 'tgrabiec/make-sstable-reader-work-with-empty-range-set' of github.com:scylladb/seastar-dev:
tests: Introduce clustering_ranges_walker_test
tests: simple_schema: Add missing include
sstables: reader: Make clustering_ranges_walker work with empty range set
clustering_ranges_walker: Make adjacency more accurate
Such queries can be issued by counter updates which involve only
static row.
Causes failure in test_query_only_static_row invoked from
sstable_mutation_test. See commit 6572f38, which fixed the problem in
cache reader.
Fixes#2734.
Current check considered some adjacent range tombstones as overlapping
with the ranges. Making this more accurate will become more important
after we will rely on putting p_i_p::after_all_clustered_rows() in
_current_start in out-of-range state.
Previously _current_bucket_segment_index was used differently depending on
whether update_position_trackers() is used in a random or sequential
access. In the former case was used as the absolute index of the segment
(independent of the buckets) and in the latter as the relative index of
the segment within its bucket. This caused problems when there was a
switch between random and sequential access, meaning one could get different
results for an at() call depending on what was the previous at() call.
Fix this by consistently using _current_bucket_segment_index as - like its
name suggest - the bucket relative segment index.
Ref #1946.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7f68ac1d32c80e8dea6dfa11be02acaa961bce2a.1503924927.git.bdenes@scylladb.com>
* 'deps1/v1' of https://github.com/avikivity/scylla:
types.hh: extract marshal_exception from types.hh into a new file
utils: remove dependency on types.hh
locator: add missing include "log.hh"
supervisor: remove dependency on init.hh
tracing: add missing include "log.hh"
gms: remove unneeded #include "types.hh"
* 'tgrabiec/fix-fast-forwarding' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Add more tests for fast forwarding across partitions
sstables: Fix abort in mutation reader for certain skip pattern
sstables: Fix reader returning partition past the query range in some cases
sstables: Introduce data_consume_context::eof()
The problem happens for the following sequence of events:
1) reader stops in the middle of some partition before it
skips to another partition range
2) reader is fast forwarded to a partition range which has no data in
the sstable. There are some partitions between the previous
partition range and the one we skip to
3) the reader is asked for next partition
The problem was that mutation_reader::fast_forward_to() was putting
the reader in _read_enabled == false state in step 2, but
data_consume_context was not fast forwarded to the range. When in step
3 we were asked for the next partition, we attempted to skip using
index (because of 1). The result of the skip was some position which
is outside of the current range of data_consume_context, which causes
it to abort. To fix, add a check for _read_enabled before we try to
skip.
If index was used to skip to the next partition (because the current
partition wasn't consumed in full) and reader's partition range ends
before the data file ends, we did not detect that we're out of range
before returning a streamed_mutation. Fix by checking _context.eof()
before doing that.
Refs #2733.
For better or worse, marshal_exception is used from utils/, and it's not good
to have utils/ depend on types.hh. Extract marshal_exception to make it possible
to remove the dependency.
* seastar 0083ee8...85ca12d (1):
> Merge "Run-time logging configuration" from Jesse
Includes patch from Jesse:
"Switch to Seastar for logging option handling
In addition to updating the abstraction layer for Seastar logging in `log.hh`,
the configuration system (`db/config.{hh,cc}`) has been updated in two ways:
- The string-map type for Boost.program_options is now defined in Seastar.
- A configuration value can be marked as `UsedFromSeastar`. This is like `Used`,
except the option is expected to be defined in the Boost.Program_options
description for Seastar. If the option is not defined in Seastar, or it is
defined with a different type, then a run-time exception is thrown early in
Scylla's initialization. This is necessary because logging options which are
now defined in Seastar were previously defined in Scylla and support for these
options in the YAML file cannot be dropped. In order to be able to verify that
options marked `UsedFromSeastar` are actually defined in Seastar, the
interface for adding options to `db::config` has changed from taking a
`boost::program_options::options_description_easy_init` (which is handle into
a `boost::program_options::options_description` which only allows adding
options) to taking a `boost::program_options::options_description`
directly (which also allows querying existing options).
Scylla also fully defers to Seastar's support for run-time logging
configuration."
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <ef26cffb91bef1ae95d508187a6dd861a6c4fc84.1503344007.git.jhaberku@scylladb.com>
"We have recently found two problems with the drop_column_family code
that needs addressing. The first is that exceptions in truncate() may
lead to stop() being skipped, which can cause Scylla to crash.
The other is that a truncate() issued before drop_column_family may get
the chance to execute only after the column family is already dropped
and also crash (That is issue 2726).
The second problem is the classic problem of asynchronous execution on
an object that may terminate, which we have been traditionally solving
with a gate. We add a gate to the column family that will be closed
during CF stop(), and we will require all asychronous operations to
enter it.
The immediate fix is for truncate(), where we have seen a real, concrete
problem. But it would be good to audit other code paths to make sure
that they are sane.
The most obvious ones, flush, compaction and sstable deletion are
already sane, since they are waited on explicitly during stop()."
Fixes#2726.
* 'issue-2726-v2-master' of github.com:glommer/scylla:
database: add gate for generic async operations to column family
database: make sure that column family is always stopped when dropped
We currently use std::deque<> for when we need large random-access containers,
but deque<> requires nr_items * sizeof(T) / 64 bytes of contiguous memory, which can
exceed our 256k fragmentation unit with large sstables. The new
container, which is a cross between deque and vector, has much lower
limitations.
Like deque, we allocate chunks of contiguous items, but they are
128k in size instead of 512. The last chunk can be smaller to avoid
allocating 128k for a really small vector.
run_with_compaction_disabled(), which is called by truncate, has a
pretty large defer point in remove(). When the code gets to finally
execute, we can't guarantee that the column family will still be alive.
That is true in particular if we issued a drop table command following
truncate: by the time truncate gets to resume, the CF will be gone.
Before the column family is dropped, it will always call its stop()
method, which means we have an opportunity to do some waiting there. We
already wait for flushes and current compactions to end.
Traditionally, we have been solving similar problems by adding a gate
that will catch asynchronous operations and making sure that potentially
asynchronous operations will enter the gate before executing. Let's do
the same thing here. We will close() the gate during stop().
Fixes#2726
Signed-off-by: Glauber Costa <glauber@scylladb.com>
truncate can throw exceptions. If it does, cf->stop() will never be
called because it is contained in a .then clause instead of finally.
One of the things that truncate does - in a finally block of its own -
is initiate a final compaction. If it returns an exception nobody will
wait for that compaction to finish (since cf->stop() is the one doing
that) and we'll crash.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"Preserve the networking configuration mode during the upgrade by generating the /etc/scylla.d/perftune.yaml
file and using it."
Fixes#2725.
* 'dist_respect_cpuset_conf-v3' of https://github.com/vladzcloudius/scylla:
scylla_prepare: respect the cpuset.conf when configuring the networking
scylla_cpuset_setup: rm perftune.yaml
scylla_cpuset_setup: add a missing "include" of scylla_lib.sh
Choose the networking configuration mode according to the current contents of /etc/scylla.d/cpuset.conf.
If it doesn't exist - use the default mode.
If it exists - use the mode that has been used for generation of the CPU set.
Store the configuration into the /etc/scylla.d/perftune.yaml
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
scylla_setup resets our configuration and perftune.yaml is a part of it.
perftune.yaml is generated based on the contents of cpuset.conf therefore we should reset
these together.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The scylla_cpuset_setup uses a verify_args() function that is defined in the scylla_lib.sh.
Fixes#2716
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
* seastar e96881a...b9f1eb7 (9):
> httpd: indentation patch
> httpd: handle exception when shutting down
> stall-detector: Allow backtrace throttling to be configured
> stall-detector: Fix messages about suppresssion not appearing
> scripts: posix_net_conf.sh: allow passing a perftune.py configuration file as a parameter
> scripts: perftune.py: add the possibility to pass the parameters in a configuration file and print the YAML file with the current configuration
> scripts: perftune.py: actually use the number of Rx queues when comparing to the number of CPU threads
> core: make current_backtrace() noexcept
> memory: add large allocation detector stubs for default allocator
Fixes#2723.
* tag 'asias/repair_issue_2723_v1' of github.com:cloudius-systems/seastar-dev:
repair: Do not allow repair until node is in NORMAL status
gossip: Add is_normal helper
The following backtrace was reported by user when running repair and keeping restarting the node at the same time.
#0 0x00007eff077281d7 in raise () from /lib64/libc.so.6
#1 0x00007eff07729a08 in abort () from /lib64/libc.so.6
#2 0x00007eff07721146 in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007eff077211f2 in __assert_fail () from /lib64/libc.so.6
#4 0x00000000010ef2c2 in locator::token_metadata::first_token_index (this=0x641000214e98, start=...) at locator/token_metadata.cc:133
#5 0x00000000010ef2d9 in locator::token_metadata::first_token (this=0x641000214e98, start=...) at locator/token_metadata.cc:143
#6 0x00000000010e329d in locator::abstract_replication_strategy::get_natural_endpoints (this=0x641000494000, search_token=...)
at locator/abstract_replication_strategy.cc:66
#7 0x0000000001481186 in get_neighbors (hosts=std::vector of length 0, capacity 0, data_centers=std::vector of length 0, capacity 0,
range=<error reading variable: access outside bounds of object referenced via synthetic pointer>, ksname=..., db=...) at repair/repair.cc:196
#8 repair_range<nonwrapping_range<dht::token> > (range=..., ri=...) at repair/repair.cc:781
#9 <lambda(auto:99&)>::<lambda(auto:100&&)>::<lambda(auto:101&)>::<lambda()>::operator() (__closure=0x7efec07f7460) at repair/repair.cc:1005
#10 futurize<future<bool_class<stop_iteration_tag> > >::apply<repair_ranges(repair_info)::<lambda(auto:99&)>::
It is reproduced with
1) while true; do curl -X POST --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/ks3"; done
2) start node 127.0.0.1, stop node 127.0.0.1 in a loop
The problem is, during boot up, the token_metadata is not replicated to all shards until
the node goes into NORMAL status.
To fix, check until node is in NORMAL status before allowing repair.
Fixes#2723
The number of keysapce and column family metrics reported is
proportional to the number of shards times the number of keysapce/column
families.
This can cause a performance issue both on the reporting system and on
the collecting system.
This patch adds a configuration flag (set to false by default) to enable
or disable those metrics.
Fixes#2701
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170821113843.1036-1-amnon@scylladb.com>
"Overly large metadata can hog memory which especially hurts in setups
with bad disk/memory ratio. To ease the pain compress the in-memory
compression-info.
The compression is implemented based on Avi's idea which is to group n
offsets together into segments, where each segment stores a base
absolute offset into the file, the other offsets in the segments being
relative offsets (and thus of reduced size). Also offsets are allocated
only just enough bits to store their maximum value. The offsets are thus
packed in a buffer like so:
arrrarrrarrr...
where n is 4, a is an absolute offset and r are offsets relative to a.
This of course means that stored offsets will not be aligned, not even
on a byte boundary, but the size reduction pretty convincing.
In addition, segments are stored in buckets, where each bucket has its
own base offset. In addition, segments in a buckets are optimized to
address as large of a chunk of the data as possible for a given chunk
size."
Ref #1946.
* 'bdenes/compress-compression-v3' of https://github.com/denesb/scylla:
Add unit test for compress::offsets
Optimise the storage of compression chunk offsets
Add script to precompute segmented compression parameters
To reduce the memory footprint of compression-info, n offsets are
grouped together into segments, where each segment stores a base
absolute offset into the file, the other offsets in the segments being
relative offsets (and thus of reduced size). Also offsets are
allocated only just enough bits to store their maximum value. The
offsets are thus packed in a buffer like so:
arrrarrrarrr...
where n is 4, a is an absolute offset and r are offsets relative to a.
The optimal value of n can be calculated for a given file_size (f) and
chunk_size (c), by finding the minima of the following function:
f(n) = (f/c)/n * (log2(f) + (n - 1)*log2((n-1)*(c + 64)))
This is done in an empirical way, using a script (see below).
Furthermore segments are stored in buckets, where each bucket has its
own base offset. Each bucket therefore can address an equal chunk of the
file and furthermore each segment in a bucket can address an equal
sub-chunk of this area.
The value of a given offset i is thus:
bucket_base_offset_for(i) + segment_base_offset_for(i) + offset(i)
To account for the bucketed storage we calculate a local_f, which is
optimized so that a bucketful of segmented offsets can address the
largest possible chunk of f. As value of this local_f only depends on
the bucket_size (b) and c the value of n can be made independent of f
and therefore only depend on one dynamic value, c. This makes life much
simpler as we don't need to know the size of the file up-front, we can
just append buckets to the storage on demand, while the required storage
is still less than a third [1] of the original storage requirements
(std::deque<uint64>).
The table with the minima(f(n)) for different f and c values is
pre-computed by gen_segmented_compress_params.py and
stored in sstables/segmented_compress_params.hh. This script also
creates a table with the best values of local_f for the given
bucket_size. At runtime we only select the best params based on c.
[1] This was calculated for c=4K and b=4K
Some (most?) users don't read logs or release notes, so they won't notice
that the ByteOrdered and Random partitioners were deprecated in 2.0. Make
them notice by refusing to start with a deprecated partitioner, unless a
switch is explicitly enabled.
Message-Id: <20170820073424.8331-1-avi@scylladb.com>
Limiting summary entry generation to at most one summary entry
per 64k of index data can lead to large index pages, with thousands
of index entries per summary entry. These are slow to parse, and there
is no real gain from the limit, since we already enforce a size
limit on the summary.
Remove the limit and allow summary entry generation based solely on
spanned data size.
Fixes#2711.
Message-Id: <20170819184255.14181-1-avi@scylladb.com>
split_range_to_single_shard() returns a vector of size 4096, with
each element (a partition_range) of size 100. The total of 400k can
cause defragmentation if memory is fragmented.
Fix by using a deque.
Fixes#2707.
Message-Id: <20170819141017.28287-1-avi@scylladb.com>
The script generates sstables/segmented_compress_params.hh which
contains a list with the optimal number of grouped offsets for
different data and chunk sizes as well as a list with the best
nominal data sizes for different chunk sizes, given a bucket size.
Data sizes are in the range of [2**4,2**50] and chunks in the
range of [2**4, 2**30]. Data sizes that are not used with the current
bucket_size are ommited.
See next commit for details of how the calculated values are used.
Large allocations can require cache evictions to be satisfied, and can
therefore induce long latencies. Enable the seastar large allocation
warning so we can hunt them down and fix them.
Message-Id: <20170819135212.25230-1-avi@scylladb.com>
Two reasons for this change:
1) every compaction should be multiplexed to manager which in turn
will make decision when to schedule. improvements on it will
immediately benefit every existing compaction type.
2) active tasks metric will now track ongoing reshard jobs.
Fixes#2671.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170817224334.6402-1-raphaelsc@scylladb.com>
"Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned.
This series optimizes it, in case it is needed, and also changes the
result message to include the partition and row counts, avoiding the
calculation altogether."
* 'calculate-counts/v3' of github.com:duarten/scylla:
query-result: Send row and partition count over the wire
query::result: Optimize calculate_counts()
These two admin related packages will be packaged under the "app-admin"
category and not the "dev-db" one.
This fixes the detection path of the packages for scylla_setup.
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20170817094756.21550-1-ultrabug@gentoo.org>
Scylla's configure.py contains stuff we copied from Seastar's
configure.py, but is no longer used. Let's get rid of some of it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170813150842.12603-1-nyh@scylladb.com>
When user mistakenly forgot to pass parameter for a flag, our scripts misparses
next flag as the parameter.
ex) Correct usage is '--ntp-domain <domain> --setup-nic', but passed
'--ntp-domain --setup-nic'.
Result of that, next flag will ignore by scripts.
To prevent such behavior, reject any parameter that start with '--'.
Fixes#2609
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170815114751.6223-1-syuu@scylladb.com>
When compacting a partition for querying we would read an extra row,
to include any tombstones between that one and the previous row.
This is no longer needed since we have a general mechanism to detect
short reads in the storage_proxy.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811103031.22866-1-duarte@scylladb.com>
incremental_reader_selector assumes the partition_range it receives has a lower
bound, but it was seen in mutation_test that this is not so.
Fix by checking whether the bound exists or not.
Message-Id: <20170815095852.14149-1-avi@scylladb.com>
"Exhausted readers belonging to a combined_mutation_reader can be fast
forwarded, so we have to keep them around. However, if the reader is
not fast forwardable, then we can drop the contained readers and their
buffers."
* 'ff-reader/v2' of github.com:duarten/scylla:
combined_mutation_reader: Drop exhausted readers if not in FF mode
combined_mutation_reader: Remove superfluous mutation_readers list
memtable_snapshot_source: Created readers should be fast forwardable
Exhausted readers can be fast forwarded, so we have to keep them
around. However, if the current reader is not fast forwardable, then
we can drop those readers and their buffers.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Now that range queries go through the normal digest path, we rely on
query::result::calculate_counts() to count the amount of partitions
and rows returned. This patch makes it a bit faster.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
* seastar 7a49ae5...edb73ab (11):
> scripts: perftune.py: change the network module mode auto selection heuristic
> net/tls: explicitly ignore ready future during shutdown
> Use python2 explicitly as an interpreter for Python v2 scripts
> peering_sharded_service: prevent over-run the container
> Add link to documentation to the README.md
> Add guidelines for contributing to Seastar
> sharded: fix move constructor for peering_sharded_service services
> Provide a convenient way to lazy-convert to string the values of pointers
> tutorial: overhaul semaphores section
> simple-stream: Make fragmented::write_substream return simple if possible
> simple-stream: Make simple/fragmented memory output stream top level
quantity prevents index_reader from reading all index entries of a summary
entry that span more than min_index_interval entries. That can happen after
introduction of size-based sampling, and consequently, sstable will not be
able to return a key which logical position in summary entry is beyond
min_index_interval. It's ok to not use quantity because index_reader will
read all indexes until either next summary entry or end of file is reached.
Fixes test_sstable_conforms_to_mutation_source
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170812045821.25269-1-raphaelsc@scylladb.com>
"Fixes #1842."
* 'size_based_sampling_v3' of github.com:raphaelsc/scylla:
tests: test summary entry spanning more keys than min interval
db/config: introduce sstable_summary_ratio option
sstables: introduce size-based sampling for sstable summary
sstables: make components_writer::offset const qualified and uint64_t
sstables: make writer::offset const qualified and uint64_t
"This patch series adds support for the `duration` type in CQL, which
was added to Cassandra in 3.10.
As part of this work, it was necessary also to add support for the
`vint` and `unsigned vint` types to the native protocol implementation,
which are part of v5 of the specification.
To test interactively, it is necessary to use cqlsh distributed with
Cassandra, as the version we distribute does not yet support the
duration type."
* 'jhk/duration_protocol/v5' of https://github.com/hakuch/scylla:
Support `duration` CQL native type
CQL native protocol: Add support for `vint` serialization
duration_test.cc: Add test for printing zero duration
duration.cc: Remove nop `const` qualifier on return type
Change `const` qualifier declaration order for `duration`
duration.cc: Simplify range checking
Rename `duration` to `cql_duration`
Now that we don't go directly to reconciliation for range queries, the
result isn't required to have the row and partition counts calculated
(we no longer transform a reconciled_result to a query::result).
Furthermore, this line was causing a lot of dtests to fail on account
of them not expecting an error line in the logs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170810225351.12610-1-duarte@scylladb.com>
Currently, a summary entry is added after min_index_interval index
entries were written. Not taking into account size of index entries
becomes a problem with large partitions which may create big index
entries due to promoted indexes. Read performance is affected as a
consequence because index entries spanned by summary are all read
from disk to serve request.
What we wanna do is to also add a summary entry after index reaches
a boundary. To deal with oversampling, we want to write 1 byte to
summary for every 2000 bytes written to data file (this will be
eventually made into an option in the config file).
Both conditions must be met to avoid under or oversampling.
That way, the amount of data needed from index file to satify the
request is drastically reduced.
Fixes#1842.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
`duration` is a new native type that was introduced in Cassandra 3.10 [1].
Support for parsing and the internal representation of the type was added in
8fa47b74e8.
Important note: The version of cqlsh distributed with Scylla does not have
support for durations included (it was added to Cassandra in [2]). To test this
change, you can use cqlsh distributed with Cassandra.
Duration types are useful when working with time-series tables, because they can
be used to manipulate date-time values in relative terms.
Two interesting applications are:
- Aggregation by time intervals [3]:
`SELECT * FROM my_table GROUP BY floor(time, 3h)`
- Querying on changes in date-times:
`SELECT ... WHERE last_heartbeat_time < now() - 3h`
(Note: neither of these is currently supported, though columns with duration
values are.)
Internally, durations are represented as three signed counters: one for months,
for days, and for nanoseconds. Each of these counters is serialized using a
variable-length encoding which is described in version 5 of the CQL native
protocol specification.
The representation of a duration as three counters means that a semantic
ordering on durations doesn't exist: Is `1mo` greater than `1mo1d`? We cannot
know, because some months have more days than others. Durations can only have a
concrete absolute value when they are "attached" to absolute date-time
references. For example, `2015-04-31 at 12:00:00 + 1mo`.
That duration values are not comparable presents some difficulties for the
implementation, because most CQL types are. Like in Cassandra's implementation
[2], I adopted a similar strategy to the way restrictions on the `counter` type
are checked. A type "references" a duration if it is either a duration or it
contains a duration (like a `tuple<..., duration, ...>`, or a UDT with a
duration member).
The following restrictions apply on durations. Note that some of these contexts
are either experimental features (materialized views), or not currently
supported at run-time (though support exists in the parser and code, so it is
prudent to add the restrictions now):
- Durations cannot appear in any part of a primary key, either for tables or
materialized views.
- Durations cannot be directly used as the element type of a `set`, nor can they
be used as the key type of a `map`. Because internal ordering on durations is
based on a byte-level comparison, this property of Cassandra was intended to
help avoid user confusion around ordering of collection elements.
- Secondary indexes on durations are not supported.
- "Slice" relations (<=, <, >=, >) are not supported on durations with `WHERE`
restrictions (like `SELECT ... WHERE span <= 3d`). Multi-column restrictions
only work with clustering columns, which cannot be `duration` due to the
first rule.
- "Slice" relations are not supported on durations with query conditions (like
`UPDATE my_table ... IF span > 5us`).
Backwards incompatibility note:
As described in the documentation [4], duration literals take one of two
forms: either ISO 8601 formats (there are three), or a "standard" format. The ISO
8601 formats start with "P" (like "P5W"). Therefore, identifiers that have this
form are no longer supported.
Fixes#2240.
[1] https://issues.apache.org/jira/browse/CASSANDRA-11873
[2] bfd57d13b7
[3] https://issues.apache.org/jira/browse/CASSANDRA-11871
[4] http://cassandra.apache.org/doc/latest/cql/types.html#working-with-durations
Version 5 of the native protocol for CQL [1] adds the `vint` and `unsigned vint`
types.
An unsigned integer encoded as a `vint` has a variable size based on the
magnitude of the value. The first byte indicates the total number of bytes.
For signed integers, a "zig-zag" encoding scheme ensures that small negative
values are encoded as short-length `vint`s (0 -> 0, -1 -> 1, 1 -> 2, 2 -> 3, -2
-> 4, etc).
[1] https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v5.spec
"sstables will sometimes have narrow/disjont ranges (e.g. LCS L1+).
This can be exploited when reading from a range of sstables by opening
sstables on-demand thus saving memory, processing and potentially I/O.
To achieve this combined_mutation_reader is refactored such that the
reader selection logic is moved-out into a reader_selector class.
combined_mutation_reader now takes a reader_selector instance in its
constructor and asks it for new readers for the current ring position
on every call to operator()().
At the moment two specializations of reader_selector are provided:
* list_reader_selector which implements the current logic, that is using
a provided mutation_reader list, and
* incremental_reader_selector which implements the on-demand opening
logic discussed above.
Fixes#1935"
* 'bdenes/optimize_combined_reader-v6' of https://github.com/denesb/scylla:
Add combined_mutation_reader_test unit test
Remove range_sstable_reader
Add incremental_reader_selector
Add reader_selector to combined_mutation_reader
sstable_set::incremental_selector: select() now returns a selection
incremental_reader_selector is a specialization of reader_selector for
the case when sstables have narrow and/or disjoint token ranges. To
exploit this it creates new readers on-demand when their sstable's
token range intersects with the current ring position.
combined_mutation_reader now accepts as a constructor argument a
reader_selector instance whoose task is to create new readers on
each call to operator()() if needed and possible.
This way it is possible to control how readers are created through
different specializations of reader_selector.
The previous logic is refactored into list_reader_selector which
is using a pre-provided mutation_reader list and forwards all of them to
combined_mutation_reader at once.
We are moving to aptly to release .deb packages, that requires debian repository
structure changes.
After the change, we will share 'pool' directory between distributions.
However, our .deb package name on specific release is exactly same between
distributions, so we have file name confliction.
To avoid the problem, we need to append distribution name on package version.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1502312935-22348-1-git-send-email-syuu@scylladb.com>
A seletion contains - in addition to the list of sstables - a next_token
which is a hint as to what is the next best token to call select() with.
This should be the smallest token such that at the next call to
select() the least number of new sstables will be returned, without
skipping any.
"With this series, all the following cluster operations:
- bootstrap
- rebuild
- decommission
- removenode
will use the same code to do the streaming.
The range_streamer is now extended to support both fetch from and push
to peer node. Another big change is now the range_streamer will stream
less ranges at a time, so less data, per stream_plan and range_streamer
will remember which ranges are failed to stream and can retry later.
The retry policy is very simple at the moment it retries at most 5 times
and sleep 1 minutes, 1.5^2 minutes, 1.5^3 minutes ....
Later, we can introduce api for user to decide when to stop retrying and
the retry interval.
The benefits:
- All the cluster operation shares the same code to stream
- We can know the operation progress, e.g., we can know total number of
ranges need to be streamed and number of ranges finished in
bootstrap, decommission and etc.
- All the cluster operation can survive peer node down during the
operation which usually takes long time to complete, e.g., when adding
a new node, currently if any of the existing node which streams data to
the new node had issue sending data to the new node, the whole bootstrap
process will fail. After this patch, we can fix the problematic node
and restart it, the joining node will retry streaming from the node
again.
- We can fail streaming early and timeout early and retry less because
all the operations use stream can survive failure of a single
stream_plan. It is not that important for now to have to make a single
stream_plan successful. Note, another user of streaming, repair, is now
using small stream_plan as well and can rerun the repair for the
failed ranges too.
This is one step closer to supporting the resumable add/remove node
opeartions."
* tag 'asias/use_range_streamer_everywhere_v4' of github.com:cloudius-systems/seastar-dev:
storage_service: Use the new range_streamer interface for removenode
storage_service: Use the new range_streamer interface for decommission
storage_service: Use the new range_streamer interface for rebuild
storage_service: Use the new range_streamer interface for bootstrap
dht: Extend range_streamer interface
We experienced 'Constructing RAID volume...' takes too much time on some AMIs,
this is because setup script stuck at 'yum -y install mdadm xfsprogs'.
We don't have to install these packages on AMI startup time, we should
preinstall them on AMI creating time.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1502192796-21040-1-git-send-email-syuu@scylladb.com>
* Configuration for cluster_name is commented-out in config file.
* Default value set to empty string and if not rewritten by user then
warning is printed and value is reset to "ScyllaDB Cluster".
Fixes#2648.
Message-Id: <20170808113322.9313-1-daniel@scylladb.com>
ScyllaDB loves python & python loves ScyllaDB.
It would benefit the project to start enforcing some code guidelines
and basic QA with a linter along a PEP8 respect thanks to flake8.
This patch adds a tox config to at least start with an assessment
of the work to be done on all .py files in the code base.
To reduce its noise, tests on long lines (> 80char) are ignored
for now.
Signed-off-by: Ultrabug <ultrabug@gentoo.org>
Message-Id: <20170726134242.8927-1-ultrabug@gentoo.org>
index's file output stream uses write behind but it's not closed
when sstable write fails and that may lead to crash.
It happened before for data file (which is obviously easier to
reproduce for it) and was fixed by 0977f4fdf8.
Fixes#2673.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170807171146.10243-1-raphaelsc@scylladb.com>
After this patch and the following patches to use the new
range_streamder interface, all the following cluster operations:
- bootstrap
- rebuild
- decommission
- removenode
will use the same code to do the streaming.
The range_streamer is now extended to support both fetch from and push
to peer node. Another big change is now the range_streamer will stream
less ranges at a time, so less data, per stream_plan and range_streamer
will remember which ranges are failed to stream and can retry later.
The retry policy is very simple at the moment it retries at most 5 times
and sleep 1 minutes, 1.5^2 minutes, 1.5^3 minutes ....
Later, we can introduce api for user to decide when to stop retrying and
the retry interval.
The benefits:
- All the cluster operation shares the same code to stream
- We can know the operation progress, e.g., we can know total number of
ranges need to be streamed and number of ranges finished in
bootstrap, decommission and etc.
- All the cluster operation can survive peer node down during the
operation which usually takes long time to complete, e.g., when adding
a new node, currently if any of the existing node which streams data to
the new node had issue sending data to the new node, the whole bootstrap
process will fail. After this patch, we can fix the problematic node
and restart it, the joining node will retry streaming from the node
again.
- We can fail streaming early and timeout early and retry less because
all the operations use stream can survive failure of a single
stream_plan. It is not that important for now to have to make a single
stream_plan successful. Note, another user of streaming, repair, is now
using small stream_plan as well and can rerun the repair for the
failed ranges too.
This is one step closer to supporting the resumable add/remove node
opeartions.
* seastar f14d2a3...7a49ae5 (8):
> sharded: improve support for cooperating sharded<> services
> sharded: support for peer services
> semaphore: add a version of with_semaphore that takes a duration timeout
> scripts: perftune.py: fix the CPU mask generation for more than 64 CPUs
> Revert "future-utils: make when_all() (vector variant) exception safe"
> Revert "future-utils: fix gross compilation errors in when_all()"
> future-utils: fix gross compilation errors in when_all()
> future-utils: make when_all() (vector variant) exception safe
Includes change to batchlog_manager constructor to adapt it to
seastar::sharded::start() change.
This reverts commit 98757069a5. We have the
failure detector which will detect an unresponsive node and fail the RPC.
Adding a timeout can just introduce false positives.
In commit f38e4ff3f, we have separated streaming reads from normal reads
for the purpose of determining the maximum number of reads going on.
However, we'll now be totally unaware of how many reads will be
happening on behalf of streaming and that can be important information
when debugging issues.
This patch adds this metric so we don't fly blind.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1501909973-32519-1-git-send-email-glauber@scylladb.com>
We have problem to run fstrim with nomerges=2, so we need to change
the parameter to 1 during fstrim execution.
To do this, this fix changes follow things:
- revert dropping scylla_fstrim on Ubuntu 16.04/CentOS
- disable distribution provided fstrim script
- enable scylla_fstrim on all distributions
- introduce --set-nomerges on scylla-blocktune
- scylla_fstrim call scylla-blocktune by following order:
- 'scylla-blocktune --set-nomerges 1'
- 'fstrim' for each devices
- 'scylla-blocktune --set-nomerges 2'
Fixes#2649
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1501531393-21109-1-git-send-email-syuu@scylladb.com>
Streaming reads and normal reads share a semaphore, so if a bunch of
streaming reads use all available slots, no normal reads can proceed.
Fix by assigning streaming reads their own semaphore; they will compete
with normal reads once issued, and the I/O scheduler will determine the
winner.
Fixes#2663.
Message-Id: <20170802153107.939-1-avi@scylladb.com>
If we fail a streaming read due queue overload, we will fail the entire repair.
Remove the limit for streaming, and trust the caller (repair) to have bounded
concurrency.
Fixes#2659.
Message-Id: <20170802143448.28311-1-avi@scylladb.com>
"Currently scanning reads go to reconciliation stage directly which
requires asking for mutation data from all peers. This series makes
it to try matching digests first like a single partition read."
Fixes#2666.
* 'gleb/digest_scan' of github.com:cloudius-systems/seastar-dev:
storage_proxy: make range_slice_read_executor go through digest matching state
storage_proxy: add capability to read data/digest for non singular ranges
storage_proxy: remove redundant parameter from never_speculating_read_executor constructor
Currently scanning reads go to reconciliation stage directly which
requires asking for mutation data from all peers. This patch makes
it to try matching digests first like a single partition read.
The change requires internode protocol changes since currently it is not
possible to ask for multi partition data/digest over RPC. It means that
the capability has to be guarded by new gossip feature flag which the
patch also adds.
Using --hostname to give the container a meaningful name is a good
practice, and make the monitoring dashboard easier to understand
Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170803081027.6675-1-tzach@scylladb.com>
never_speculating_read_executor always waits for all targets so
block_for parameter is always equal to targets.size(). No need to
to pass it explicitly.
"This series adds an option to use paging in internal query and use that for the
get compaction history function.
Internal paging will be done explicitly, to use paging, you first create a
state object (that contains the query as well) and use that state to get the
first page, the result will contain both the query result and a new state that
can be used to get the next page.
Fixes#2366"
* 'amnon/paged_compaction_history_v5' of github.com:cloudius-systems/seastar-dev:
system_keyspace: Use paging for get compaction history
Add paging for internal queries
query_options: Allows creating query_options from query_options
"This series tries to fix possible repair stuck."
Fixes#2660, #2661, #2662.
* tag 'asias/repair_stuck_v2.1' of github.com:cloudius-systems/seastar-dev:
repair: Make send_repair_checksum_range timeout
repair: Singal parallelism_semaphore in case of error
repair: Fix repair_tracker done
It specifies the maximum gossip shadow round time. It can be used to
reduce the gossip feature check time during node boot up.
For instance, when the first node in the cluster, which listed both
itself and other node as seed in the yaml config, boots up, it will try
to talk to other seed nodes which are not started yet. The gossip shadow
round will be used to fetch the feature info of the cluster. Since there
is no other seed node in the cluster, the shadow round will fail. User
can reduce the default shadow_round_ms option to reduce the boot time.
Fixes#2615
Message-Id: <10916ce9059f3c7f1a1fb465919ae57de3b67d59.1500540297.git.asias@scylladb.com>
The timer is armed inside the section guarded by the _timer_reads_gate
therefore it has to be canceled after the gate is closed.
Otherwise we may end up with the armed timer after stop() method has
returned a ready future.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501603059-32515-1-git-send-email-vladz@scylladb.com>
"This series ensures the always write correct cell names to promoted
index cell blocks, taking into account the eoc of range tombstones.
Fixes#2333"
* 'pi-cell-name/v1' of github.com:duarten/scylla:
tests/sstable_mutation_test: Test promoted index blocks are monotonic
sstables: Consider eoc when flushing pi block
sstables: Extract out converting bound_kind to eoc
It was meant to be run in the foreground since it is waited upon during
stop(), but as it is now from the stop() perspective it is completed
after first connection is accepted.
Fixes#2652
Message-Id: <20170801125558.GS20001@scylladb.com>
Queries with query page size equal or smaller than
zero are unpaged queries.
Count these kind of queries and make them a metrics
since they can ruin the performance of the system.
Message-Id: <20170731130004.25807-2-benoit@scylladb.com>
"This series reduce that effect in two ways:
1. Remove the latency counters from the system keyspaces
2. Reduce the histogram size by limiting the maximum number of buckets and
stop the last bucket."
Fixes#2650.
* 'amnon/remove_cf_latency_v2' of github.com:cloudius-systems/seastar-dev:
database: remove latency from the system table
estimated histogram: return a smaller histogram
Since the discovery of std::exchange(x, {}) move_and_clear has become
obsolete. Beside, the name was wrong, it did not clear the vector but
recreated it meaning that any allocated memory wasn't reused (not that
it mattered in the existing usages).
Message-Id: <20170731123549.10887-1-pdziepak@scylladb.com>
"These patches optimise combined_mutation_reader for cases where the majority
of mutation_readers is disjoint.
perf_fast_forward:
Results are medians of 3 of fragments/s as reported by perf_fast_forward.
Command:
perf_fast_forward -c1 --enable-cache
small: small-partition-skips (read=1, skip=0)
large: large-partition-skips (read=1, skip=0)
before after diff
small 195753 238196 +22%
large 1244325 1359096 +9%
perf_simple_query:
Results are medians of 10 of reads/s as reported by perf_simple_query.
Command:
perf_simple_query -c1
before 98651.40
after 104554.85
diff +6%"
* tag 'avoid-merge_mutations/v1' of https://github.com/pdziepak/scylla:
combined_mutation_reader: avoid unnecessary merge_mutations()
combined_mutation_reader: do not pop mutation with different key
"This series ensure that when we retry a memtable flush, we re-acquire the
flush permit that was previously released. It also ensures we don't hold
the sstable read lock for the duration of the sleep leading to the retry.
To achieve that cleanly we refactor the way the permit lifecycle is managed
by employing a RAII-based approach.
We also improve the latency of writes blocked on virtual dirty by releasing
the flush permit before fsyncing the sstables. There are additional avenues
for performance improvements on top of this one."
* 'memtable-flush-additional-fixes/v4' of github.com:duarten/scylla:
column_family: Re-acquire flush permit in case of error
column_family: Don't hold sstable read lock when retrying flush
sstables: Release the flush permit before fsyncing
sstables: Introduce write_monitor
database: Extract out dirty_memory_manager
dirty_memory_manager: Refactor flush permit lifetime management
dirty_memory_manager: Invert permit acquisition order
memtable_list: Register different seal functions for each behaviour
Merging mutations is quite an expensive operation. The creation of
streamed mutation merger involves several allocations (mostly coming
from various std::vector) and then all mutation_fragments need to go
through a heap.
All this is completely unnecessary if there is only one mutation, so
let's skip a call to merge_mutations() in such cases. This also means
that we can reuse memory allocated by _current vector if merge is not
required.
Originally, the loop insidecombined_mutation_reader::next() so that it
was popping mutation from the heap and when it encountered one with a
different decorated key it was pushed back and the ones accumulated so
far merged and emitted. In other words, every time the reader progressed
to the next mutation it did needless pop and push operations on the
heap.
This patch rearranges the code so that the key of the next mutation is
compared before it is popped from the heap.
If we fail to flush an sstable, after creating the flush_reader, then
we will have released the flush permit when we retry the flush. Ensure
that when retrying, we re-acquire the flush permit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This allows a queued flush to start while we fsync the current
sstable, which helps reduce the overall time new writes are blocked on
dirty memory.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The write_monitor provides callbacks to inform an observer of the
state of the ongoing sstable write.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch refactors how the flush permit lifetime is managed,
dropping the current hash table in favour of a RAII approach.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
For an upcoming fix it is required to invert the permit acquisition
order: first we acquire the background work permit and then the single
flush permit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Instead of passing a flush_behaviour to the seal function, use two
different functions for each of the behaviours.
This will be important in the forthcoming patches, which will require
the signatures of those functions to differ.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
serialized_action_tests depends on the fact that first part of the
serialized_action is executed at cetrtain points (in which it reads a
global variable that is later updated by the main thread).
This worked well in the release mode before ready continuations were
inlined and run immediately, but not in the debug mode since inlining
was not happening and the main seastar::thread was missing some yield
points.
Message-Id: <20170731103013.26542-1-pdziepak@scylladb.com>
1. assert() is not constexpr.
2. can't use static_assert(), because the contructor may be called in a non-constexpr
environment; moved to log_histogram
3. pow2_rank() uses count_leading_zeros() which is not constexpr; split
into constexpr and non-constexpr versions
4. duplicated number_of_buckets() because bucket_of() can't be constexpr due to pow2_rank
Message-Id: <20170726105444.32698-1-avi@scylladb.com>
"This series ensure that when we retry a memtable flush, we re-acquire the
flush permit that was previously released. It also ensures we don't hold
the sstable read lock for the duration of the sleep leading to the retry.
To achieve that cleanly we refactor the way the permit lifecycle is managed
by employing a RAII-based approach.
We also improve the latency of writes blocked on virtual dirty by releasing
the flush permit before fsyncing the sstables. There are additional avenues
for performance improvements on top of this one."
* 'memtable-flush-additional-fixes/v3' of github.com:duarten/scylla:
column_family: Re-acquire flush permit in case of error
column_family: Don't hold sstable read lock when retrying flush
sstables: Release the flush permit before fsyncing
sstables: Introduce write_monitor
database: Extract out dirty_memory_manager
dirty_memory_manager: Refactor flush permit lifetime management
dirty_memory_manager: Invert permit acquisition order
memtable_list: Register different seal functions for each behaviour
main: Don't catch polymorphic exceptions by value
* seastar a14d667...54e940f (8):
> Merge "Prometheus to use output stream" from Amnon
> http_test: Fix an http output stream test
> build: harden try_compile_and_link output temporary file
> configure: disable exception scalability hack on debug build
> build: don't perform test compiles to /dev/null
> Provide workaround for non scaleable c++ exception runtime
> Merge "Add output stream to http message reply" from Amnon
> configure.py: use user provided compiler flags when checking for features
Commitlog replay fails when upgrade from <1.7.3 to 2.0, we need to refuse
updating package if current scylla < 1.7.3 && commitlog remains.
Note: We have the problem on scylla-server package, but to prevent
scylla-conf package upgrade, %pretrans should be define on scylla-conf.
Fixes#2551
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1501187555-4629-1-git-send-email-syuu@scylladb.com>
"This patch series backports scalability fix for _Unwind_Find_FDE and modifies
out CentOS package to use our libgcc and libstdc++ which are needed to make
use of the fix instead of locally installed ones."
Ref #2646 (fixes on RHEL 7 and related only)
* 'gleb/exception-gcc-fix-v2' of github.com:cloudius-systems/seastar-dev:
dist/redhat: Make scylla rpm depend on scylla-libgcc and scylla-libstdc++ and use it instead of locally installed one
dist/redhat: Backport scalability fix of _Unwind_Find_FDE to out gcc
If we fail to flush an sstable, after creating the flush_reader, then
we will have released the flush permit when we retry the flush. Ensure
that when retrying, we re-acquire the flush permit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This allows a queued flush to start while we fsync the current
sstable, which helps reduce the overall time new writes are blocked on
dirty memory.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The write_monitor provides callbacks to inform an observer of the
state of the ongoing sstable write.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch refactors how the flush permit lifetime is managed,
dropping the current hash table in favour of a RAII approach.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
For an upcoming fix it is required to invert the permit acquisition
order: first we acquire the background work permit and then the single
flush permit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Instead of passing a flush_behaviour to the seal function, use two
different functions for each of the behaviours.
This will be important in the forthcoming patches, which will require
the signatures of those functions to differ.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
If schema merging completes at lower rate than incoming pull requests,
then merge processes will accumulate and needlessly request and hold schema mutations.
In rare cases, when there are constant schema changes, they may even
overflow memory. This was seen in dtest:
concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test
Allowing only one active and one queued pull request per remote
endpoint is enough.
* tag 'tgrabiec/dont-accumulate-schema-pulls-v2' of github.com:scylladb/seastar-dev:
migration_manager: Log schema pulls
migration_manager: Prevent pull requests from accumulating
utils: Introduce serialized_action
If schema merging completes at lower rate than incoming pull requests,
then merge processes will accumulate and needlessly request and hold schema mutations.
In rare cases, when there are constant schema changes, they may even
overflow memory. This was seen in dtest:
concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.create_lots_of_schema_churn_test
Allowing only one active and one queued pull request per remote
endpoint is enough.
When flushing a promoted index block using a range tombstone cell name
as a bound, use the right eoc value instead of always writing
composite::eoc::none.
Fixes#2333
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
consume_mutation_fragments_until() allows consuming mutation fragments
until a specified condition happens. This patch reorganises its
implementation so that we avoid situations when fill_buffer() is called
with stop condition being true.
Message-Id: <20170727122218.7703-1-pdziepak@scylladb.com>
When we install scylla metapackage with version (ex: scylla-1.7.1),
it just always install newest scylla-server/-jmx/-tools on the repo,
instead of installing specified version of packages.
To install same version packages with the metapackage, limited dependencies to
current package version.
Fixes#2642
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170726193321.7399-1-syuu@scylladb.com>
Commitlog replay fails when upgrade from <1.7.3 to 2.0, we need to refuse
updating package if current scylla < 1.7.3 && commitlog remains.
Note: We have the problem on scylla-server package, but to prevent
scylla-conf package upgrade, %pretrans should be define on scylla-conf.
Fixes#2551
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20170727110730.613-1-syuu@scylladb.com>
database::make_sstable_reader() creates a reader which will need to
obtain a semaphore permit when invoked. Therefore, each read may
create at most one such reader in order to be guaranteed to make
progress. If the reader tries to create another reader, that may
deadlock (or for non-system tables, timeout), if enough number of such
readers tries to do the same thing at the same time.
Avoid the problem by dropping previous reader before creating a new
one.
Refs #2644.
Message-Id: <1501152454-4866-1-git-send-email-tgrabiec@scylladb.com>
Initialize the system_auth and system_traces keyspaces and their tables after
the Node joins the token ring because as a part of system_auth initialization
there are going to be issues SELECT and possible INSERT CQL statements.
This patch effectively reverts the d3b8b67 patch and brings the initialization order
to how it was before that patch.
Fixes#2273
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1500417217-16677-1-git-send-email-vladz@scylladb.com>
This patch remove the latency histograms from the system table, it also
extend the already existing exclusion to all system keyspaces.
It also uses the new get_histogram API to set a minimal bucket size to
100 microseconds.
The current histogram contains 91 buckets, this is a very high
resolution with a high upper limit.
To reduce traffic passed, between scylla and the prometheus, this patch
generate a smaller histogram.
It limit the number of buckets (16 by default), set a lower limit to the
lowest bucket, and uses 2 as the bucket coeficient.
Highest empty buckets will not be reported.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
estimated histogram
These patches contain some minor fixes for performance regression reported
by perf_fast_forward after partial cache was merged. The solution is still
far from perfect, there is one case that still has 30% degradation, but
there is some improvement so there is no reason to hold these changes back.
Refs #2582.
Some numbers:
before - before cache changes were merged
(555621b537)
cache - at the commit that introduced the partial cache
(9b21a9bfb6)
after - recent master + this series
(based on e988121dbb)
Differences are shown relative to "before".
Testing effectiveness of caching of large partition, single-key slicing reads:
Large partitions, range [0, 500000], populating cache
before cache after
1636840 1013688 1234606
-38% -25%
Large partitions, range [0, 500000], reading from cache
before cache after
2012615 3076812 3035423
+53% +51%
Testing scanning small partitions with skips.
reading small partitions (skip 0)
before cache after
227060 165261 200639
-27% -11%
skipping small partitions (skip 1)
before cache after
29813 27312 38210
-8% +28%
Testing slicing small partitions:
slicing small partitions (offset 0, read 4096)
before cache after
195282 149695 180497
-23% -8%
* https://github.com/pdziepak/scylla.git perf_fast_forward-regression/v3:
sstables: make sure that fill_buffer() actually fills buffer
mutation_merger: improve handling of non-deferring fill_buffer()s
partition_snapshot_row_cursor: avoid apply() in single-version cases
sstables: introduce decorated_key_view
ring_position_comparator: accept sstables::decorated_key_view
sstable: keep a pre-computed token in summary_entry
sstables: cache token in index entries
index_reader: advance_and_check_if_present() use index_comparator
ring_position_comparator: drop unused overloads
cache_streamed_mutation: avoid moving clustering_row
streamed_mutation: introduce consume_mutation_fragments_until()
cache_streamed_mutation: use consumer based read_context reader
rows_entry: make position() inlineable
mutation_fragment: make destructor always_inline
keys: introduce compound_wrapper::from_exploded_view()
sstables: avoid copying key components
compound_compat: explode: reserve some elements in a vector
cache: short-circut static row logic if there are no static columns
cache: use equality comparators instead of tri_compare
sstables: avoid indirect calls to abstract_type::is_multi_cell()
loading_cache invokes a timer that may issue asynchronous operations
(queries) that would end with writing into the internal fields.
We have to ensure that these operations are over before we can destroy
the loading_cache object.
Fixes#2624
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501096256-10949-1-git-send-email-vladz@scylladb.com>
We need to consider the _live_endpoints size. The nr_live_nodes should
not be larger than _live_endpoints size, otherwise the loop to collect
the live node can run forever.
It is a regression introduced in commit 437899909d
(gossip: Talk to more live nodes in each gossip round).
Fixes#2637
Message-Id: <863ec3890647038ae1dfcffc73dde0163e29db20.1501026478.git.asias@scylladb.com>
Equality comparator may be much cheaper than the fully fledged
trichotomic comparator, especially if the component types are byte order
equal but not byte order comparable.
When we are exploding a compound key we know already that there is more
than one component, but we have no easy way of determining how many of
them are going to be there. Let's reserve space for a few elements so
that we avoid an excessive number of reallocations in case of
medium-sized keys.
mutation_fragment destructor was already made inline-friendly by moving
most of the logic to a separate function. However, the compiler still is
quite reluctant to inline it in certain cases, so let's give it a
stronger hint.
consume_mutation_fragments_until() is a consumer based interface that
avoids indirect calls and continuation overhead present in the naive
streamed_mutation::operator() approach.
clustering_row can stores quite a lot of data internally which makes its
move constructor not exactly cheap.
If possible it is better to move mutation_fragment around as it keeps
everything externally. This also avoids some cases when clustering row
would be extracted from mutation_fragment only to be made to create
another mutation_fragment later.
When a sstable reader is fast forwarded some index entries may be read
(and compared) multiple times. This patch makes sure that once a token
is computed we keep it around and reuse if the entry is accessed again.
Each sstable index lookup involves a binary search in the summary and
each time a partition key of summary entry is compared with anything its
token needs to be calculated.
Since we keep summary in the memory all the time it is better to also
keep the tokens around.
ring_position_comparator has overloads for comparing ring_positions as
well as sstables::key_view. In the case of the latter it needs to
compute the token of the key. However, the sstable layer could cache
some tokens so let's allow the comparator callers to provide it
directly.
It is possible that a call to fill_buffer() will return an immediately
ready future. This patch avoids uncontrolled recursion in case when all
merged streamed mutation do not defer ini fill_buffer() and also
optimises for non-deferring case by avoiding some of the logic.
streamed_mutation::impl::fill_buffer() is supposed to either push
mutation fragments to the buffer or set EOS flag. However, it was
possible that mp_row_consumer would return proceed::no if a skip was
needed without satisfying any of these conditions.
"This patch series addresses some feedback from the preliminary
HACKING.md, adds some new content, and updates the README file with
some quick-start information."
* 'jhk/better_hacking/v3' of github.com:hakuch/scylla:
README.md: Add quick-start section and defer to `HACKING.md`
HACKING.md: `CMakeLists.txt` for analysis works for other IDEs too
HACKING.md: Add details and examples for unit tests
HACKING.md: Add section for project dependencies
HACKING.md: Describe releases and tags
HACKING.md: Re-work "building" section, including memory needs
HACKING.md: Update ccache recommendations
HACKING.md: Update "Contributing" URL
If a node is notified of a schema change where the schema's dropped
columns have changes, that node will miss the changes to the dropped
columns. A scenario where this can happen is where a column c is
dropped, then added as a different typed, and then dropped again, with
a node n having seen the first drop and being notified of the
subsequent add and drop.
Fixes#2616
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170725170622.4380-1-duarte@scylladb.com>
Counters write path on leader is completely different than on any other
replica (non-leaders share write path between counters and regular
columns). This patch makes sure that counter writes performed on leader
are added to appropriate metrics.
Message-Id: <20170725153346.31238-1-pdziepak@scylladb.com>
* git@github.com:raphaelsc/scylla.git row_cache_fixes:
db: atomically synchronize cache with changes to the snapshot
db: refresh row cache's underlying data source after compaction
Empty clustering key range is perfectly valid and signifies that the
reader is not interested in anything but the static row. Let's not
make it mean anything else.
Message-Id: <20170725131220.17467-2-pdziepak@scylladb.com>
cache_streamed_mutation assumed that at least one clustering range was
specified. That was wrong since the readers are allowed to query just
for a static row (e.g. counter update that modifies only static
columns).
Fixes#2604.
Message-Id: <20170725131220.17467-1-pdziepak@scylladb.com>
To support both version and patch release, the version server now returns
a patchversion parameter that include the latest minor version's patch
release.
The housekeeping should return a separate message if the current
minor version is not with the latest patch release, and a message if the version was
changed.
For example, if a user is using version 1.6.1 it should get a warning
that he need to update if 1.6.2 is available and in addition a warning it
should upgrade if version 1.7 is out.
Examples:
$ scylla-housekeeping version --version 1.6.2
Your current Scylla release is 1.6.2, while the latest patch release is 1.6.4, and the latest minor release is 1.7.2 (recommended)
$ scylla-housekeeping version --version 1.7.1
You current Scylla release is 1.7.1 while the latest patch release is 1.7.2 is available, update for the latest bug fixes
$ scylla-housekeeping version --version 1.7.1
You current Scylla release is 1.7.1 while the latest patch release is 1.7.2, update for the latest bug fixes and improvements
Fixes#1972
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Acked-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170725095455.6450-1-amnon@scylladb.com>
Underlying data source in row cache holds a reference to sstable set
prior to compaction which isn't released until a memtable flush, which
means file descriptors of deleted sstables remains opened, wasting
disk space.
The fix is to refresh underlying data source in row cache.
Fixes#2570.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
updates to cache and snapshot (i.e. sstable set) aren't synchronized, so
it may happen that cache update for memtable flush will use wrong snapshot
version, and that violates cache invariant of each partition entry only
reflecting one snapshot.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Boost 1.55 accidentally removed support for "range for" on
recursive_directory_iterator (previous and latter versions do
support it). Use old-style iteration instead.
Message-Id: <20170724080128.8824-1-avi@scylladb.com>
The 'free' attribute is not updated for all pages belonging to a large
object, so we can't use it to determine if the page is allocated or
not. More reliable way is to check if it belongs to any free span.
Message-Id: <1500648094-20039-1-git-send-email-tgrabiec@scylladb.com>
This is in order to avoid frequent misses which have a relatively high
cost. A miss means we need to fetch schema definition from another
node and in case of writes do a schema merge.
If the schema is kept alive only by the incoming request, then it
will be forgotten immediately when the request is done, and the next
request using the same schema version will miss again.
Refs #2608.
Message-Id: <1500632447-10104-1-git-send-email-tgrabiec@scylladb.com>
"Fixes issues uncovered in longevity test (#2608).
Main problem is that due to time drift scylla_tables.version column
may not get deleted on all nodes doing the schema merge, which will
make some nodes come up with different table schema version than others.
The inconsistency will not heal because scylla_tables doesn't
take part in the schema sync. This is fixed by the last patch.
This will cause nodes to constantly try to sync the schema, which under
some conditions triggers #2617."
* tag 'tgrabiec/fix-table-schema-version-inconsistency-v1' of github.com:scylladb/seastar-dev:
schema_tables: Add scylla_tables to ALL
schema: Make schema_mutations equality consistent with digest
schema_tables: Extract compact_for_schema_digest()
schema_tables: Always drop scylla_tables::version
global_schema_ptr ensures that schema object is replicated to other
cores on access. It was replicating the "synced" state as well, but
only when the shard didn't know about the schema. It could happen that
the other shard has the entry, but it's not yet synced, in which case
we would fail to replicate the "synced" state. This will result in
exception from mutate(), which rejects attempts to mutate using an
unsynced schema.
The fix is to always replicate the "synced" state. If the entry is
syncing, we will preemptively mark it as synced earlier. The syncing
code is already prepared for this.
Refs #2617.
Message-Id: <1500555224-15825-1-git-send-email-tgrabiec@scylladb.com>
there could be a lot of compactions when querying for compaction
history.
This patch changes the query to use paging. It would collect all results
when returning to the caller.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Usually, internal queries are used for short queries. Sometimes though,
like in the case of get compaction history, there could be a large
amount of results. Without paging it will overload the system.
This patch adds the ability to use paging internally.
Using paging will be done explicitely, all the relevant information
would be store in an internal_query_state, that would hold both the
paging state but also the query so consecutive calls can be made.
To use paging use the query method with a function.
The function gets beside a statement and its parameters a function that
will be used for each of the returned rows.
For example if qp is a query_processor:
qp.query("SELECT * from system.compaction_history", [] (const cql3::untyped_result_set::row& row) {
....
// do something with row
...
return stop_iteration::no; // keep on reading
});
Will run the function on each of the compaction history table rows.
To stop the iteration, the function can return stop_iteration::yes.
* https://github.com/pdziepak/scylla.git perf_fast_forward_improvements/v1:
perf_fast_forward: move global state to global scope
perf_fast_forward: move tests groups to separate functions
perf_fast_forward: allow running only selected test groups
perf_fast_forward: use consumer interface for reading
streamed_mutation
So that scylla_tables takes part in the digest and in mutations sent
as part of schema sync. Otherwise inconsistencies in scylla_tables
will not heal.
Refs #2608.
Digest only looks like live values, ignoring deletion
information. Equality should be consistent with that, so that schemas
considered equal do not trigger the alter path unnecessarily.
It can happen that due to time drift between nodes, the incoming
"version" cell will have higher timestamp than api::new_timestamp().
In such case the column would not be dropped and would cause version
mismatch between nodes.
Ensure it's always covered by using max of current time and cell's
timestamp.
Refs #2608.
By default build_deb.sh destroys all previous build image to make sure we don't
have environment dependent issue, but it's takes time to build distribution root
image (.tgz in pbuilder) from scratch.
--no-clean option is for skipping create .tgz stage, use previously built image,
to make build time shorter.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1500542094-12946-1-git-send-email-syuu@scylladb.com>
Using streamed_mutation::operator() is undesirable as it introduces an
indirect call and a continuation overhead for each emitted mutation
fragment. Consumer interface is the preferred method of reading streamed
mutations.
All test perf_fast_forward test cases currently live in the main
function. This patch moves the state they rely on to a global scope
so that it will be easier to extract these tests to individual
functions.
"This patchset restricts background writers - such as compactions,
streaming flushes and memtable flushes to a maximum amount of CPU usage
through a seastar::thread_scheduling_group.
The said maximum is recommended to be set 50 % - it is default
disabled, but can be adjusted through a configuration option until we
are able to auto-tune this.
The second patch in this series provides a preview on how such auto-tune
would look like. By implementing a simple controller we automatically
adjust the quota for the memtable writer processes, so that the rate at
which bytes come in is equal to the rates at which bytes are flushed.
Tail latencies are greatly reduced by this series, and heavy spikes that
previously appeared on CPU-bound workloads are no more."
* 'memtable-controller-v5' of https://github.com/glommer/scylla:
simple controller for memtable/streaming writer shares.
restrict background writers to 50 % of CPU.
Some places remained where code looked directly at
system_keyspace::NAME to determine iff a ks is
considered special/system/protected. Including
schema digest calculation.
Export "is_system_keyspace" and use accordingly.
Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>
"Fixes schema layout incompatibility in a mixed 1.7 and 2.0 cluster (#2555)
by reverting back to using the old layout in memory and thus also
in across-node requests. We still use the new v3 layout in schema
tables (needed by drivers and external tools). Translations happen
when converting to/from schema mutations."
* tag 'tgrabiec/use-v2-schema-layout-in-memory-v2' of github.com:scylladb/seastar-dev:
schema: Revert back to the 1.7 layout of static compact tables in memory
schema: Use v3 column layout when converting to/from schema mutations
schema: Encapsulate column layout translations in the v3_columns class
"This series improves the streaming error handling so that when one side of the
streaming failed, it will propagate the error to the other side and the peer
will close the failed session accordingly. This removes the unnecessary wait and
timeout time for the peer to discover the failed session and fail eventually.
Fix it by:
- Use the complete message to notify peer node local session is failed
- Listen on shutdown gossip callback so that we can detect the peer is shutdown
can close the session with the peer
Fixes#1743"
* tag 'asias/streaming/error_handling_v2' of github.com:cloudius-systems/seastar-dev:
streaming: Listen on shutdown gossip callback
gms: Add is_shutdown helper for endpoint_state class
streaming: Send complete message with failed flag when session is failed
streaming: Handle failed flag in complete message
streaming: Do not fail the session when failed to send complete message
streaming: Introduce send_failed_complete_message
streaming: Do not send complete message when session is successful
streaming: Introduce the failed parameter for complete message
streaming: Remove unused session_failed function
streaming: Less verbose in logging
streaming: Better stats
Due to the asynchronous nature of view update propagation, results
might still be absent from views when we query them. To be able to
deterministically assert on view rows, this patch retries a query a
bounded number of times until it succeeds.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718212646.2958-1-duarte@scylladb.com>
We are using C* 3.x compatible layout in schema tables but want to
keep using the 1.7 layout in memory for compatibility during rolling
upgrade. This patch switches the schema and schema_builder classes
back to the old layout. Translation of layout happens when converting
to/from schema mutations.
Notable changes:
1) Includes a revert of commit 6260f31e08
"thrift: Update CQL mapping of static CFs".
2) Brings back the "default_validation_class" schema attribute. In v3
it can be dervied from column definitions, but in v2 it can't, so
we have to store it.
3) legacy_schema_migrator and schema_builder don't have to do
conversions to v3, this is now handled by the v3_columns
class. schema_builder works with the same layout as schema, that
is v2.
4) Includes a revert of commit 66991a7ccb
"v3 schema test fixes"
Fixes#2555.
"Time window strategy was introduced to address several limitations of
date tiered strategy. In addition, its options are much easier to reason
about, basically just window size and window unit.
TWCS will work to keep only one sstable in each window. So the only real
optimization needed is to align partition key to the window.
Size tiered strategy is used to reduce write amplification when compacting
the incoming window.
For more details: https://issues.apache.org/jira/browse/CASSANDRA-9666
Fixes #1432."
* 'twcs_v2' of github.com:raphaelsc/scylla:
tests: add tests for time window compaction strategy
compaction: wire up time window compaction strategy
compaction/twcs: override default values with options in schema
sstables: implement time window compaction strategy
sstables: import TimeWindowCompactionStrategy.java
This patch provides a rather trivial implementation of hash() for
collection types.
It is needed for view building, where we hold mutations in a map
indexed by partition keys (and frozen collection types can be part of
the key).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170718192107.13746-1-duarte@scylladb.com>
This patch introduces a simple controller that will adjust memtables CPU
shares, trying to keep it around the soft limit: if we start going below
it means we're too fast (unless we are idle) and shares are adjusted
downwards. If we start going above it means we're too fast and shares
are adjusted upwards.
I have tested this extensively in a single-CPU setup with various
CPU-bound workloads while tracking virtual dirty and the results are
good, with virtual dirty fluctuating only slightly, somewhere within the
desired range.
Exceptions to this are:
1) when the load is very light - the idle system goes faster, and that's
ok
2) when the load is very high - as foreground requests dominate we can't
flush fast enough and hit the hard limit. However, in such scenarios
the memtable shares do hit its maximum, and the results are no worse
than they are right now and this will only be fixed by CPU-limiting the
actual requests.
This feature can be disabled with a config option - that is scheduled to
go away as we acquire more confidence in this. When the feature is
disabled, all background writers (streaming, compaction, memtables) will
share the same scheduling group, with static quotas.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In scylla, we have foreground processes, which are latency sensitive and
need to be responded to as fast as possible in order to maintain good
latency profiles, and background process, which are less so.
The most important background processes we have during normal write
workload operations are memtable writes and sstable compactions. Those
processes are quite CPU-intensive, and left unchecked will easily
dominate the CPU. Lower values of task-quota usually help, as it will
force those processes to preempt more, but aren't enough to guarantee
good isolation. We have seen boxes with good NVMe storage having their
throughput reduced to less than half of the original baseline in a short
dive down for the duration of a compaction.
In the long run, our goal is to leverage the CPU scheduler to make sure
that those processes are balanced with respect to all the others.
However, the current state of affairs is causing grievances as this very
moment. Thankfully, those processes live in a seastar::thread, that
ships with its own rudimentary bandwidth control mechanism: the
scheduling group.
The goal of this patch is to wrap background processes together in a
scheduling group, and assign to such group 50 % of our CPU power; the
remainder being left to foreground processes.
While we pride ourselves in dynamically adjusting things to the
workload, we won't be able to do this properly before the CPU scheduler
lands - and let's face it, leaving background processes run wild is not
adaptative either. Every workload would benefit most from a different
value for such shares, but 50 % is as fair as it gets if we really need
static partitining in the mean time.
As a defense against unforeseen consequences, we'll leave the actual
value as an option, but will do our best to hide it - as this is not a
tunable that we want to be part of a normal Scylla setup. The most
convenient place for this tunable is still db::config, so we can easily
pass it down to the database layer - but we will not document it in the
yaml, and will clearly note in the help string that it is not supposed
to be tuned.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When a node shutdown itself, it will send a shutdown status to peer
nodes. When peer nodes receives the shtudown status update, they are
supposed to close all the sessions with that node becasue the node is
shutdown, no need to wait and timeout, then fail the session.
This change can speed up the closing of sessions.
Currently, send_complete_message is not used. We will use it shortly in
case the local session is failed. Send a complete message with failed
flag to notify peer node that the session is failed so that peer can
close the session. This can speed up the closing of failed session.
Also rename it to send_failed_complete_message.
it will be later converted to C++. Imported from latest scylla-
tools-java repository. Checked that it doesn't lack anything.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
A user reported scylla-server.service does not able to run on their cloud instance, because of hugeadm.
(hugeadm says the kernel does not support huge pages.)
We don't need it for posix mode, so move it in dpdk mode.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1500367219-8728-1-git-send-email-syuu@scylladb.com>
Instead of retrying, just drop mutations that raced with a truncate.
* git@github.com:duarten/scylla.git truncate-reorder/v1:
database: Rename replay_position_reordered_exception
database: Drop mutations that raced with truncate
The complete_message is not needed and the handler of this rpc message
does nothing but returns a ready future. The patch to remove it did not
make into the Scylla 1.0 release so it was left there.
Use this flag to notify the peer that the session is failed so that the
peer can close the failed session more quickly.
The flag is used as a rpc::optional so it is compatible use old
version of the verb.
It is useful for larger cluster with larger gossip message latency. By
default the fd_max_interval_ms is 2 seconds which means the
failure_detector will ignore any gossip message update interval larger
than 2 seconds. However, in larger cluster, the gossip message udpate
interval can be larger than 2 seconds.
Fixes#2603.
Message-Id: <49b387955fbf439e49f22e109723d3a19d11a1b9.1500278434.git.asias@scylladb.com>
branch 'tgrabiec/schema-migration-fixes' of github.com:scylladb/seastar-dev:
schema: Use proper name comparator
legacy_schema_migrator: Properly migrate non-UTF8 named columns
schema_tables: Store column_name in text form
legacy_schema_migrator: Migrate columns like Cassandra
schema_builder: Add factory method for default_names
legacy_schema_migrator: Simplify logic
thrift: Don't set regular_column_name_type
schema: Use proper column name type for static columns
schema: Fix column_name_type() for static compact tables
schema: Introduce clustering_column_at()
thrift: Reuse cell_comparator::to_sstring() for obtaining comparator type
partition_slice_builder: Use proper column's type instead of regular_column_name_type()
This replaces column_definition::name_comparator, which incorrectly
assumes that names are always utf8, with name_compare moved from
schema::rebuild() and unifies usages.
Currently migrator assumed all columns are utf8-named, which
doesn't have to be the case for static compact tables.
Refs #2597.
Due to #2573, we can assume that Scylla wasn't used with non-utf8
column names, and that old names are always in textual form.
This fixes generation of synthetic columns for static compact tables.
Current code always generates synthetic clustering column with utf8
type and synthetic regular column with bytes type (in schema_builder).
That's fine when creating a new CQL table, but not when migrating
existing tables created via thrift API.
Fixes#2584.
This also migrates empty compact value columns like Cassandra
does. Such columns are present in compact tables without regular
columns, e.g.:
create table test (k int, ck int, primary key (k, ck)) with compact storage;
They should be migrated to a synthetic regular column with
empty_type type and a non-empty name.
The expression "is_dense.value_or(true)" is always true inside the if,
so drop it.
This allows us to drop temporary calulated_is_dense.
We can also get rid of one of the if branches by extracting
builder.set_is_dense() outside.
After f5dae826ce, static columns not
always have utf8 column names. For static compact tables it's
determined by the cell name comparator type, which is equal to the
type of the synthetic clustering column.
Caused various errors with static thrift tables with non-utf8
comparator.
Reduces view_schema_test runtime to 5 seconds, from 53 seconds on an NVMe disk
with write-back cache, and forever on a spinning disk.
Message-Id: <20170716081653.10018-1-avi@scylladb.com>
We will be creating links to those sstable's files, and those don't work
if the data directory and the test sstable are on different devices.
Copying the files to the same directory fixes the problem.
Message-Id: <20170716090405.14307-1-avi@scylladb.com>
Rename replay_position_reordered_exception to
mutation_reordered_with_truncate_exception for more precision, since
this is the only situation where this exception can be thrown.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We don't ensure mutations are applied in memory following the order of their
replay positions. A memtable can thus be flushed with replay position rp,
with the new one being at replay position rp', where rp' < rp. This breaks
an intrinsic assumption in the code, which this series addresses.
Fixes#2074
branch memtable-flush/v3 of git@github.com:duarten/scylla.git:
commitlog: Always flush latest memtable
column_family: More precise count of switched memtables
column_family: Fix typo in pending_tasks metric name
column_family: More precise count of pending flushes
dirty_memory_manager: Remove unnecessary check from flush_one()
column_family: Don't rely on flush_queue to guarantee flushes finished
column_family: Don't bother closing the flush_queue on stop()
column_family: Stop using flush_queue
column_family: Remove outdated comment about the flush_queue
memtable: Stop tracking the highest flushed rp
Clang is happy to create a vector<data_value> from a {}, a {1, 2}, but not a {1}.
No doubt it is correct, but sheesh.
Make the data_value explicit to humor it.
Message-Id: <20170713074315.9857-1-avi@scylladb.com>
In storage_proxy we arrange the mutations sent by the replicas in a
vector of vectors, such that each row corresponds to a partition key
and each column contains the mutation, possibly empty, as sent by a
particular replica.
There is reconciliation-related code that assumes that all the
mutations sent by a particular replica can be found in a single
column, but that isn't guaranteed by the way we initially arrange the
mutations.
This patch fixes this and enforces the expected order.
Fixes#2531Fixes#2593
Signed-off-by: Gleb Natapov <gleb@scylladb.com>
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170713162014.15343-1-duarte@scylladb.com>
Since we no longer enforce that mutations are applied in memory
ordered by their replay_positions, the way the highest_flush_rp is
being tracked is no longer correct.
The invariant it was used to maintain no longer exists, so we can get
rid of it together with the assertion on the highest_flush_rp on
flush().
Fixes#2074
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Since commitlog ordering requirements have been relaxed, we now keep
the set of replay_positions seen by a memtable in a set, which we then
use to clean up relevant segments in the commitlog. This means that
the guarantees provided by the flush_queue are no longer necessary.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When stopping a column family we issue a flush(), for which we wait.
Since writes are supposed to have stopped coming in, and also new
flush requests, there's no need to call and wait for the flush_queue
to be closed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We now don't ensure mutations are applied in memory following the
order of their replay positions, so we can't rely on the replay
position to order memtable flushes. So, use a phased_barrier() to
ensure that calling flush() returns a future that completes when all
flushes up to that point have finished.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We don't need to check whether a memtable is empty in flush_one(), as
that must be checked later, during the actual sealing.
The condition itself is rare and is checked already after the potentially
contented semaphore has been acquired.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch ensures we update the count of pending flushes in the same
place as we update the stats across column families, which is more
correct since it only accounts for actual flushes and not those of
empty memtables or that have been coalesced together.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The memtable_switch_count metric is supposed to count the number of
times a flush has resulted in the memtable being switched out, but we
were incrementing the count regardless of whether we tried to flush an
empty memtable or two or more flushes were coalesced into one. This
patch fixes this by moving the metric to where the memtable is
actually switched.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We now don't ensure mutations are applied in memory following the
order of their replay positions, so we can't rely on the replay
position to order memtable flushes. When flushing commit log segments,
ensure we flush the latest memtable.
Refs #2074
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"This series aims to fix the "serving invalid (old) values" issue in the
loading_cache (issue #2590) by arming the timer with a period that equals
min(expire, refresh).
We are still trying to optimize the main case where 'expire' is
significantly longer than 'refresh' period.
We don't want to add any additional logic in the fast path and this
series gives the immediate solution for the issue above while not adding
any additional CPU cycle to the fast path."
* 'loading_cache_short_expired-v2' of https://github.com/vladzcloudius/scylla:
utils::loading_cache: arm the timer with a period equal to min(_expire, _update)
utils::loading_cache: make a timer use a loading_cache_clock_type clock as a source
Make the descriptions of permissions_validity_in_ms, permissions_update_interval_in_ms
and permissions_cache_max_entries more readable and more related to what they really
do.
Mention the none-zero value requirement for the permissions_update_interval_in_ms and
the permissions_cache_max_entries when the permissions cache is enabled.
Adjust the parameters description in the scylla.yaml too.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1499957053-31792-1-git-send-email-vladz@scylladb.com>
Arm the timer with a period that is not greater than either the permissions_validity_in_ms
or the permissions_update_interval_in_ms in order to ensure that we are not stuck with
the values older than permissions_validity_in_ms.
Fixes#2590
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Current algorithm was marking tables with regular columns not named
"value" as not dense, which doesn't have to be the case. It can be
either way.
It should be enough to look at clustering components. If there is a
clustering key, then table is dense if and only if all comparator
components belong to the clustering key.
If there is no clustering key, then if there are any regular columns
we're sure it's not dense.
Fixes#2587.
Message-Id: <1499877777-7083-1-git-send-email-tgrabiec@scylladb.com>
Cassandra 3.10 added the `duration` type [1], intended to manipulate date-time
values with offsets (for example, `now() - 2y3h`).
The full implementation of the `duration` type in Scylla requires support
for version 5 of the binary protocol, which is not yet available.
In the meantime, this patch patch adds the implementation of the underlying type
for the eventual `duration` type. Included is also the ported test suite from
the reference implementation and additional tests.
Related to #2240.
[1] https://issues.apache.org/jira/browse/CASSANDRA-11873
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <b1e481da103efee82106bf31f261c5a1f4f8d9ca.1499885803.git.jhaberku@scylladb.com>
query_options object cannot be changed after it was created. For
internal uses, like internal query paging, it is needed to create a new
object based on some of the data from an existing one with a new paging
state.
This patch adds a constructor from a unique_ptr and paging state.
using unique_ptr behave similar to move modify constructor.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Since base tables no longer look for their views, we need to parse
base tables first so that when we add a view we can fetch and connect
it to its base table.
When announcing view table mutations to other nodes we always include
the base table mutations, so there's no need to expect a view being
added before its base table.
Found out while testing view building.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170712172115.2960-1-duarte@scylladb.com>
"Currently new nodes calculate digests based on v3 schema mutations,
which are very different from v2 mutations. As a result they will
use schemas with different table_schema_version that the old nodes.
The old nodes will not recognize the version and will try to request
its definition. That will fail, because old nodes don't understand
v3 schema mutations.
To fix this problem, let's preserve the digests during migration,
so that they're the same on new and old nodes. This will allow
requests to proceed as usual.
This does not solve the problem of schema being changed during
the rolling upgrade. This is not allowed, as it would bring the
same problem back.
Fixes #2549."
* tag 'tgrabiec/use-consistent-schema-table-digests-v2' of github.com:cloudius-systems/seastar-dev:
tests: Add test for concurrent column addition
legacy_schema_migrator: Set digest to one compatible with the old nodes
schema_tables: Persist table_schema_version
schema_tables: Introduce system_schema.scylla_tables
schema_tables: Simplify read_table_mutations()
schema_tables: Resurrect v2 read_table_mutations()
system_keyspace: Forward-declare legacy schemas
legacy_schema_migrator: Take storage_proxy as dependency
Use name of the existing preceeding column with restriction
(last_column) instead of assuming that the column right after the
current column already has restrictions.
This will yield an error message that is different from that of
Cassandra, albeit still a correct one.
Fixes#2421
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40335768a2c8bd6c911b881c27e9ea55745c442e.1499781685.git.bdenes@scylladb.com>
DowngradingConsistencyRetryPolicy uses live replicas count from
Unavailable exception to adjust CL for retry, but when there are pending
nodes CL is increased internally by a coordinator and that may prevent
retried query from succeeding. Adjust live replica count in case of
pending node presence so that retried query will be able to proceed.
Fixes#2535
Message-Id: <20170710085238.GY2324@scylladb.com>
"most of changes are to improve maintainability of the strategy but
the ones that are introduced by the following patches:
lcs: do not check if level 0 can be promoted twice
lcs: remove quadratic behavior from L0 compaction
lcs: partially sort candidates that will be trimmed
lcs: only demote sstable from level higher than target one"
* 'lcs_improvements_2' of github.com:raphaelsc/scylla:
lcs: only demote sstable from level higher than target one
lcs: improve indentation for get_overlapping_starved_sstables
lcs: improve indentation for get_compaction_candidates
lcs: partially sort candidates that will be trimmed
lcs: remove quadratic behavior from L0 compaction
lcs: introduce private interface
lcs: make some member functions static
lcs: make some functions const qualified
lcs: remove add method
lcs: extract code for higher levels compaction from get_candidates_for
lcs: simplify code to get candidates for higher levels
lcs: extract round-robin heuristic for even distribution of keys into function
lcs: update outdated comments for level 0 compaction
lcs: improve worth_promoting_L0_candidates interface
lcs: do not check if level 0 can be promoted twice
lcs: extract code for level 0 compaction from get_candidates_for
CQL reply may contain metadata that describes columns present in the
response including the information about their type.
However, Scylla incorrectly reports counter types as bigint. The
serialised format of counters and bigint is exactly the same, which
could explain why the problem hasn't been noticed earlier but it is a
bug nevertheless.
Fixes#2569.
Message-Id: <20170711130520.27603-1-pdziepak@scylladb.com>
Calculate and set digest using v2 mutations so that digests are the
same before and after migration. This is neeed so that no schema
definition exchange is required during rolling upgrade.
Fixes#2549.
When migrating schema tables from v2 to v3, mutations underlying
table schema will change, and so will their digest. However, we want
the digest to be the same on new nodes as on the old nodes, because
schema exchange is not possible between the two nodes, so they
must to request schema definitions from each other.
The solution is to make the digest persistable, so that it sticks to
given table schema, surviving both migration and node restarts. On
migration from v2, the digest will be calculated from v2 mutations, so
it will be the same on new and old nodes.
It will be used to store Scylla spcific table metadata. We cannot
store it in the standard "tables" table for compatibility reasons -
Cassandra will fail to read schema if it encounteres columns it is not
expecting.
if we are compacting level 1 into level 2, we only want to demote
a sstable from level 3 or higher.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
L0 compaction triggers quadratic behavior when many newly created
sstables are needed for promotion due to their size being relatively
low to max sstable size parameter. So until L0 is worth promoting,
the strategy will compact every new sstable with all the existing
ones in L0. To fix it, let's do STCS on level 0 until it becomes
worth promoting.
Fixes#2432.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
get rid of unneeded loop for dealing with suspect sstables and
std::advance because vector allows random access.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
some comments are no longer relevant, especially the ones that
talk about dealing with busy sstables due to parallel compaction,
which isn't done by us for lcs.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
can_promote flag will be used to carry info about whether or not
level 0 can promoted. That will avoid a single iteration for higher
levels too which can contain tens of thousands of sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
I will split code for higher levels compaction into functions first
before putting it into its own function too.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The default of 2ms is somewhat arbitrary. Now that we have a lot more
mileage deploying Scylla applications in production it does sound not
only arbitrary, but high.
In particular, it is really hard to achieve 1ms latencies in the face of
CPU-heavy workloads with it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1499354495-27173-1-git-send-email-glauber@scylladb.com>
Otherwise we may deadlock, as explained in commit 5e8f0efc8:
Table drop starts with creating a snapshot on all shards. All shards
must use the same snapshot timestamp which, among other things, is
part of the snapshot name. The timestamp is generated using supplied
timestamp generating function (joinpoint object). The joinpoint object
will wait for all shards to arrive and then generate and return the
timestamp.
However, we drop tables in parallel, using the same joinpoint
instance. So joinpoint may be contacted by snapshotting shards of
tables A and B concurrently, generating timestamp t1 for some shards
of table A and some shards of table B. Later the remaining shards of
table A will get a different timestamp. As a result, different shards
may use different snapshot names for the same table. The snapshot
creation will never complete because the sealing fiber waits for all
shards to signal it, on the same name.
Message-Id: <1499762663-21967-1-git-send-email-tgrabiec@scylladb.com>
"most of changes are to improve maintainability of the strategy but
the ones that are introduced by the following patches:
lcs: do not check if level 0 can be promoted twice
lcs: remove quadratic behavior from L0 compaction
lcs: partially sort candidates that will be trimmed
lcs: only demote sstable from level higher than target one"
* 'lcs_improvements' of github.com:raphaelsc/scylla: (21 commits)
lcs: only demote sstable from level higher than target one
lcs: improve indentation for get_overlapping_starved_sstables
lcs: improve indentation for get_compaction_candidates
lcs: partially sort candidates that will be trimmed
lcs: remove quadratic behavior from L0 compaction
lcs: introduce private interface
lcs: make some member functions static
lcs: make some functions const qualified
lcs: remove add method
lcs: extract code for higher levels compaction from get_candidates_for
lcs: simplify code to get candidates for higher levels
lcs: extract round-robin heuristic for even distribution of keys into function
lcs: update outdated comments for level 0 compaction
lcs: improve worth_promoting_L0_candidates interface
lcs: do not check if level 0 can be promoted twice
lcs: extract code for level 0 compaction from get_candidates_for
dist/offline_installer: add --skip-setup option to offline installer
dist/offline_installer/debian: install python-minimal package before installing scylla deps
migration_manager: Give empty response to schema pulls from incompatible nodes
migration_manager: Don't pull schema from incompatible nodes
...
if we are compacting level 1 into level 2, we only want to demote
a sstable from level 3 or higher.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
L0 compaction triggers quadratic behavior when many newly created
sstables are needed for promotion due to their size being relatively
low to max sstable size parameter. So until L0 is worth promoting,
the strategy will compact every new sstable with all the existing
ones in L0. To fix it, let's do STCS on level 0 until it becomes
worth promoting.
Fixes#2432.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
get rid of unneeded loop for dealing with suspect sstables and
std::advance because vector allows random access.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
some comments are no longer relevant, especially the ones that
talk about dealing with busy sstables due to parallel compaction,
which isn't done by us for lcs.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
can_promote flag will be used to carry info about whether or not
level 0 can promoted. That will avoid a single iteration for higher
levels too which can contain tens of thousands of sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
I will split code for higher levels compaction into functions first
before putting it into its own function too.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The old nodes which are still using v2 schema tables will fail to
apply our response, with error messages complaining about not being
able to locate schema of certain versions (new schema tables). This
change inhibits such errors by responding with an empty mutation list.
Currently it results in scary error messages in logs about not being
able to find schema of given version. It's benign, but may scare
users. It the future incompatibilities could result in more subtle
errors. Better to inhibit it completely.
"Old and new nodes will advertise different schema version because
of different format of schema tables. This will result in attempts
to sync the schema by each of the node.
Currently this will result in scary error messages in logs about
sync failing due to not being able to find schema of given version.
It's benign, but may scare users. It the future incompatibilities
could result in more subtle errors. Better to inhibit it completely."
* 'tgrabiec/fix-schema-pull-errors-during-upgrade' of github.com:cloudius-systems/seastar-dev:
migration_manager: Give empty response to schema pulls from incompatible nodes
migration_manager: Don't pull schema from incompatible nodes
service: Advertise schema tables format version through gossip
So that they are not left on disk even though we did a clean shutdown.
First part of the fix is to ensure that closed segments are recognized
as not allocating (_closed flag). Not doing this prevents them from
being collected by discard_unused_segments(). Second part is to
actually call discard_unused_segments() on shutdown after all segments
were shut down, so that those whose position are cleared can be
removed.
Fixes#2550.
Message-Id: <1499358825-17855-1-git-send-email-tgrabiec@scylladb.com>
The old nodes which are still using v2 schema tables will fail to
apply our response, with error messages complaining about not being
able to locate schema of certain versions (new schema tables). This
change inhibits such errors by responding with an empty mutation list.
Currently it results in scary error messages in logs about not being
able to find schema of given version. It's benign, but may scare
users. It the future incompatibilities could result in more subtle
errors. Better to inhibit it completely.
This change adds the start of what will hopefully be a continually evolving and
improving document for helping developers and contributors to get started with
Scylla development.
The first part of the document is general advice and information that is broadly
applicable.
The second part is an opinionated example of a particular work-flow and set of
tools. This is intended to serve as a starting point and inspire contributors to
develop their own work-flow.
The section on branching is marked "TODO" for now, and will be addressed by a
subsequent change.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <470a542a92aff20d6205fb94b3fb26168735ae6f.1499319310.git.jhaberku@scylladb.com>
Instead of copying and moving the bound, pass it by reference so the
transformer can decide whether it wants to copy or not. The only
caller so far doesn't want a copy and takes the value by reference,
which would be capturing a temporary value. Caught by the
view_schema_test with gcc7.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170705210255.29669-1-duarte@scylladb.com>
The state machines generated by antlr allocate many local variables per function.
In release mode, the stack space occupied by the variables is reused, but in debug
build, it is not, due to Address Sanitizer setting -fstack-reuse=none. This causes
a single function to take above 100k of stack space.
Fix by hacking the generated code to use just one variable.
Fixes#2546
Message-Id: <20170704135824.13225-1-avi@scylladb.com>
* tag 'tgrabiec/row-cache-metrics-v2' of github.com:cloudius-systems/seastar-dev:
row_cache: Switch _stats.hits/misses to row granularity
row_cache: Rename num_entries() to partitions() for clarity
row_cache: Track mispopulations also at row level
row_cache: Track row insertions
row_cache: Track row hits and misses
row_cache: Make mispopulation counter also apply for continuity information
row_cache: Add partition_ prefix to current counters
misc_services: Switch to using reads_with[_no]_misses counters
row_cache: Add metrics for operations on underlying reader
row_cache: Add reader-related metrics
row_cache: Remove dead code
"This series introduces selective_token_range_sharder and uses it in repair to
generate dht::token_range belongs to a specific shard."
* tag 'asias/repair-selective_token_range_sharder-v3' of github.com:cloudius-systems/seastar-dev:
repair: Use selective_token_range_sharder
tests: Add test_selective_token_range_sharder
dht: Add selective_token_range_sharder
With this change, we ask all the shard to handle the ranges provided by
user and we use selective_token_range_sharder to split the ranges and
ignore the ranges do not belong to the current shard.
The test put a wrapping range into a non-wrapping range variable.
This was harmless at the time this test was written, but newer code
may not be as forgiving so better use a non-wrapping range as intended.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170704103128.29689-1-nyh@scylladb.com>
`r` is moved-from, and later captured in a different lambda. The compiler may
choose to move and perform the other capture later, resulting in a use-after-free.
Fix by copying `r` instead of moving it.
Discovered by sstable_test in debug mode.
Message-Id: <20170702082546.20570-1-avi@scylladb.com>
Currently, lcs will choose, for tombstone compaction, sstable with
the lowest ratio from the ones which ratio is at least above threshold
(0.2 by default).
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170703185633.6644-1-raphaelsc@scylladb.com>
* 'lcs_improvements_part_2' of github.com:raphaelsc/scylla:
lcs: Match estimated tasks arithmetic to score in LCS
lcs: prevent leveled_compaction_strategy.hh from being included more than once
lcs: use vector instead for storing a level of sstables
compaction: keep only one variant of size_tiered_most_interesting_bucket
lcs: get rid of unused code in leveled_manifest
Contains fix for CASSANDRA-8904.
Added TARGET_SCORE to get rid of magic number for target score which
is now used more than once.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
list is no longer needed because lcs no longer moves a sstable breaking
invariant at its level to level 0. Now lcs incrementally restores invariant
by compacting together first set of overlapping tables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
two variants of size_tiered_most_interesting_bucket existed to avoid copy,
but subsequent work will make lcs use vector for each level of sstables,
so let's only keep one variant.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Repair today has a semaphore limiting the number of ongoing checksum
comparisons running in parallel (on one shard) to 100. We needed this
number to be fairly high, because a "checksum comparison" can involve
high latency operations - namely, sending an RPC request to another node
in a remote DC and waiting for it to calculate a checksum there, and while
waiting for a response we need to proceed calculating checksums in parallel.
But as a consequence, in the current code, we can end up with as many as
100 fibers all at the same stage of reading partitions to checksum from
sstables. This requires tons of memory, to hold at least 128K of buffer
(even more with read-ahead) for each of these fibers, plus partition data
for each. But doing 100 reads in parallel is pointless - one (or very few)
should be enough.
So this patch adds another semaphore to limit the number of checksum
*calculations* (including the read and checksum calculation) on each shard
to just 2. There may still be 100 ongoing checksum *comparisons*, in
other stages of the comparisons (sending the checksum requests to other
and waiting for them to return), but only 2 will ever be in the stage of
reading from disk and checksumming them.
The limit of 2 checksum calculations (per shard) applies on the repair
slave, not just to the master: The slave may receive many checksum
requests in parallel, but will only actually work on 2 at a time.
Because the parallelism=100 now rate-limits operations which use very little
memory, in the future we can safely increase it even more, to support
situations where the disk is very fast but the link between nodes has
very high latency.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170703151329.25716-1-nyh@scylladb.com>
when do_for_each is in its last iteration and with_semaphore defers
because there's an ongoing cleanup, sstable object will be used after
freed because it was taken by ref and the container it lives in was
destroyed prematurely.
Let's fix it with a do_with, also making code nicer.
Fixes#2537.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170630035324.19881-1-raphaelsc@scylladb.com>
"compaction_strategy.cc keeps the full implementation of size tiered,
major, and null strategies, and partial implementation of leveled
and date tiered strategies. It's a mess. In the future, we will also
need space for time window strategy. The file is hard to read and
maintain.
My goal here is to improve maintainability of the strategies by
putting each of them into its own header.
NOTE: No semantic change is introduced here."
* 'improve_compaction_strategy_maintainability' of github.com:raphaelsc/scylla:
compaction_strategy: move dtcs to its existing header
compaction_strategy: move lcs implementation to its own header
compaction_strategy: move stcs implementation to its own header
compaction_strategy: move compaction_strategy_impl to its own header
Configuring cpufreq service on VMs/IaaS causes an error because it doesn't supported cpufreq.
To prevent causing error, skip whole configuration when the driver not loaded.
Fixes#2051
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1498809504-27029-1-git-send-email-syuu@scylladb.com>
A comment states that we want the file to be old enough, but sets
a timestamp of max(), which is in the future. This may have passed
because the conversion from numeric_limits<time_t>::max() to
db_clock::time_point is not well defined (their dynamic range is
different), so truncation may have converted the large number to a
low one.
Message-Id: <20170702082903.20879-1-avi@scylladb.com>
Error messages incorrectly used the debug representation of the receiver,
rather than the text representation of the operation itself.
Fixes#113.
Message-Id: <20170701101325.3163-1-avi@scylladb.com>
Boot should not continue until a future returned by
wait_for_gossip_to_settle() is resolved. Commit 991ec4a16 mistakenly
broke that, so restore it back. Also fix calls for supervisor::notify()
to be in the right places.
Message-Id: <20170702082355.GQ14563@scylladb.com>
This reverts commit db5bf363d0. Causes
errors of the sort
Exiting on unhandled exception: exceptions::invalid_request_exception
(Keyspace 'system_traces' does not exist)
Previously, lexing and parsing errors were aggregated while CQL queries were
evaluated. Afterwards, the first collected error (if present) would be thrown as
an exception.
The problem was that when parsing and lexing errors were aggregated this way,
the parser would continue even in spite of errors like "no viable alternative".
Semantic actions attached to grammar rules would still execute, though with
variables that had not yet been initialized. This would crash Scylla.
This change modifies the error-handling strategy of CQL parsing. Rather than
aggregate errors, we throw an exception on the first error we encounter. This
ensures that grammar actions never execute unless there is a precise match.
One possible issue with this approach is that the generated C++ code from the
ANTLR grammar may not be exception-safe. I compiled Scylla in debug-mode with
ASan support and executed several erroneous CQL queries with `cqlsh`. No memory
leaks were reported.
Fixes#2466.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <db1f650a2bbb615b506d9015486eece45375a440.1498836703.git.jhaberku@scylladb.com>
compaction_strategy.cc keeps the full implementation of size tiered,
major, and null strategies, and partial implementation of leveled
and date tiered strategies. It's a mess. In the future, we will also
need space for time window strategy. The file is hard to read and
maintain.
My goal here is to eventually improve maintainability of the
strategies by putting each of them into its own header.
This is the first step towards that goal.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"This feature is intended to make compaction more efficient at getting rid of
droppable tombstone and expired data wasting disk space. So far, people have
been dealing with it manually through major compaction.
With strategies other than date tiered, large sstables will be left untouched
for a long time even though it's all expired. Date tiered suffers from it when
mixing data with different TTL because it only includes for compaction sstable
that is fully expired.
sstables keeps as metadata a histogram which allows us to easily estimate
droppable data ratio from gc_before. sstables which droppable data ratio is
above 20% (default value for tombstone_threshold option) will be considered
candidates for the operation.
Like in C*, we will only do tombstone removal compaction when there's nothing
to compact in standard way. It would be interesting to trigger it too when
disk usage is above a given threshold, but I decided to leave this for later.
Fixes #2306."
* 'tombstone_removal_compaction_v4' of github.com:raphaelsc/scylla:
tests: more testing for tombstone compaction options
tests: basic tombstone compaction test for date tiered
compaction/dtcs: add support for tombstone compaction
tests: basic test of tombstone compaction with lcs
compaction/lcs: add support for tombstone compaction
tests: basic tombstone compaction test for size tiered
compaction/stcs: add support for tombstone compaction
tests: add test for estimation of droppable tombstone ratio
sstables: introduce function to estimate droppable tombstone ratio
compaction_manager: periodically submit cfs for compaction
streaming_histogram: fix coding style
tests: add streaming_histogram_test
streaming_histogram: implement sum
tests: add test for sstable with bad tombstone histogram
sstables: discard bad streaming histogram for future use
tests: add sstable tombstone histogram test
streaming_histogram: fix update
streaming_histogram: move it to utils
streaming_histogram: do not limit it to be used by sstables
sstables: update tombstone_histogram for cells with expiration time
Unlike other strategies, dtcs has tombstone compaction disabled by
default due to:
- deletion shouldn't be used with DTCS; rather data is deleted through TTL.
- with time series workloads, it's usually better to wait for whole sstable
to be expired rather than compacting a single sstable when it's more than
20% (default value) expired.
See CASSANDRA-9234 for more details.
For tombstone compaction, unworthy sstables are filtered out and the oldest
one is chosen because it's the one less likely to shadow data and it's also
relatively big.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
LCS will choose its candidate by starting from highest level and
getting sstable which has highest droppable tombstone ratio.
Unlike STCS which needs to choose oldest sstable from biggest tier,
LCS can choose the one with highest d__t__r because sstables in
a given level don't overlap.
Sstable picked up for tombstone removal compaction won't be demoted
or promoted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Larger sstables are hard to find sstable peers and therefore are
left uncompacted for a long time. Expired data and tombstones which
can be purged will waste disk space meanwhile.
sstable tracks droppable tombstone from which ratio can be calculated.
If ratio is greater than threshold (0.2 by default), sstable will
be eligible for compaction. Oldest sstables from biggest tiers are
preferrable because droppable data in them are more likely to satisfy
the conditions for purge, like not shadowing data in another sstable.
Subsequent patches will add support in leveled and date tiered
strategies.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Function used to estimate ratio of droppable tombstone.
A tombstone is considered droppable for cells expired before
gc_before and regular tombstones older than gc_before.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is useful for a column family which isn't generating new content
and will have lots of expired data later on that can be purged.
Compaction submission is NO-OP if there's nothing to do, so I think
it's reasonable to do it at an interval of 1 hour.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This function is used to estimate number of points in interval
[-inf,b]. It will be useful for estimating droppable tombstone
ratio in a given sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Find bad histogram which had incorrect elements merged due to use of
unordered map. The keys will be unordered. Histogram which size is
less than max allowed will be correct because no entries needed to be
merged, so we can avoid discarding those.
This is important because histogram for tombstone will be used to
estimate droppable tombstone ratio. If it's incorrectly high for many
of existing sstables, we will needlessly compact lots of them.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This bug was introduced when converting java code. Return value
of map::erase() was used as if it were the value of the removed
entry, but it's actually the number of removed entries.
update() also relies on ordered keys, so map is used instead
by histogram.
In addition, histograms will be written in sorted order (like C*
does) such that we can detect bad histograms, using disk_array.
disk_array is also used from now on to read histograms.
The conversion from array to map is fine because histograms for
sstables are limited to 100 elements.
Coming patch will detect bad histograms (generated only by us)
and discard them, because we can't rely on their information.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
As a preperation for the http stream support, creation of empty reply
should be avoided.
This patch removes a line that cannot be reached but causes the compiler
to complain.
It has no effect aside of removing the reply creation.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170628130202.8132-1-amnon@scylladb.com>
If cache is missing given key, but the range is marked as continuous,
it means sstables don't have that entry and we can insert it without
asking the presence checker (bloom filter based). The latter is more
expensive and gives false positives. So this improves update
performance and hit ratio.
Another positive effect is that we don't have to clear continuity now.
Fixes#1999.
Message-Id: <1498643043-21117-1-git-send-email-tgrabiec@scylladb.com>
Region comparator, used by the two, calls region_impl::min_occupancy(),
which calls log_histogram::largest(). The latter is O(N) in terms of
the number of segments, and is supposed to be used only in tests.
We should call one_of_largest() instead, which is O(1).
This caused compact_on_idle() to take more CPU as the number of
segments grew (even when there was nothing to compact). Eviction
would see the same kind of slow down as well.
Introduced in 11b5076b3c.
Message-Id: <1498641973-20054-1-git-send-email-tgrabiec@scylladb.com>
It's been linked with various performance issues, either by causing
them or making them worse. One example is #1634, and also recently
I have investigated continuous performance degradation that was also
linked to defrag on idle activity.
Until we can figure out how to reduce its impact, we should disable it.
Signed-off-by: Glauber Costa <glauber@glauber.scylladb>
Message-Id: <20170627201109.10775-1-glauber@scylladb.com>
streaming histogram will later be placed in /utils, so we want
it to use std::unordered_map<> instead of disk_hash<>.
That also requires implementing serialization/deserialization
functions for it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That tombstone_histogram is used to determine droppable data ratio
for a sstable, and unlike C*, we were only updating it for
tombstones. We need to update it with expiration time of cells too,
if any. Creation time (expiration - ttl) cannot be used because if
ttl > gc_grace_seconds, the resulting sstable could be considered
worth dropping by tomstone compaction before any data is actually
expired.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This document is intended to help developers and contributors to Scylla get started. The first part consists of general guidelines that make no assumptions about a development environment or tooling. The second part describes a particular environment and work-flow for exemplary purposes.
## Overview
This section covers some high-level information about the Scylla source code and work-flow.
### Getting the source code
Scylla uses [Git submodules](https://git-scm.com/book/en/v2/Git-Tools-Submodules) to manage its dependency on Seastar and other tools. Be sure that all submodules are correctly initialized when cloning the project:
```bash
$ git clone https://github.com/scylladb/scylla
$ cd scylla
$ git submodule update --init --recursive
```
### Dependencies
Scylla depends on the system package manager for its development dependencies.
Running `./install_dependencies.sh` (as root) installs the appropriate packages based on your Linux distribution.
### Build system
**Note**: Compiling Scylla requires, conservatively, 2 GB of memory per native thread, and up to 3 GB per native thread while linking.
Scylla is built with [Ninja](https://ninja-build.org/), a low-level rule-based system. A Python script, `configure.py`, generates a Ninja file (`build.ninja`) based on configuration options.
To build for the first time:
```bash
$ ./configure.py
$ ninja-build
```
Afterwards, it is sufficient to just execute Ninja.
The full suite of options for project configuration is available via
```bash
$ ./configure.py --help
```
The most important options are:
-`--mode={release,debug,all}`: Debug mode enables [AddressSanitizer](https://github.com/google/sanitizers/wiki/AddressSanitizer) and allows for debugging with tools like GDB. Debugging builds are generally slower and generate much larger object files than release builds.
-`--{enable,disable}-dpdk`: [DPDK](http://dpdk.org/) is a set of libraries and drivers for fast packet processing. During development, it's not necessary to enable support even if it is supported by your platform.
Source files and build targets are tracked manually in `configure.py`, so the script needs to be updated when new files or targets are added or removed.
To save time -- for instance, to avoid compiling all unit tests -- you can also specify specific targets to Ninja. For example,
Unit tests live in the `/tests` directory. Like with application source files, test sources and executables are specified manually in `configure.py` and need to be updated when changes are made.
A test target can be any executable. A non-zero return code indicates test failure.
Most tests in the Scylla repository are built using the [Boost.Test](http://www.boost.org/doc/libs/1_64_0/libs/test/doc/html/index.html) library. Utilities for writing tests with Seastar futures are also included.
Run all tests through the test execution wrapper with
```bash
$ ./test.py --mode={debug,release}
```
The `--name` argument can be specified to run a particular test.
Alternatively, you can execute the test executable directly. For example,
```bash
$ build/release/tests/row_cache_test -- -c1 -m1G
```
The `-c1 -m1G` arguments limit this Seastar-based test to a single system thread and 1 GB of memory.
### Preparing patches
All changes to Scylla are submitted as patches to the public mailing list. Once a patch is approved by one of the maintainers of the project, it is committed to the maintainers' copy of the repository at https://github.com/scylladb/scylla.
Detailed instructions for formatting patches for the mailing list and advice on preparing good patches are available at the [ScyllaDB website](http://docs.scylladb.com/contribute/).
### Running Scylla
Once Scylla has been compiled, executing the (`debug` or `release`) target will start a running instance in the foreground:
```bash
$ build/release/scylla
```
The `scylla` executable requires a configuration file, `scylla.yaml`. By default, this is read from `$SCYLLA_HOME/conf/scylla.yaml`. A good starting point for development is located in the repository at `/conf/scylla.yaml`.
For development, a directory at `$HOME/scylla` can be used for all Scylla-related files:
The `scylla.yaml` file in the repository by default writes all database data to `/var/lib/scylla`, which likely requires root access. Change the `data_file_directories` and `commitlog_directory` fields as appropriate.
Scylla has a number of requirements for the file-system and operating system to operate ideally and at peak performance. However, during development, these requirements can be relaxed with the `--developer-mode` flag.
Additionally, when running on under-powered platforms like portable laptops, the `--overprovisined` flag is useful.
Multiple release branches are maintained on the Git repository at https://github.com/scylladb/scylla. Release 1.5, for instance, is tracked on the `branch-1.5` branch.
Similarly, tags are used to pin-point precise release versions, including hot-fix versions like 1.5.4. These are named `scylla-1.5.4`, for example.
Most development happens on the `master` branch. Release branches are cut from `master` based on time and/or features. When a patch against `master` fixes a serious issue like a node crash or data loss, it is backported to a particular release branch with `git cherry-pick` by the project maintainers.
## Example: development on Fedora 25
This section describes one possible work-flow for developing Scylla on a Fedora 25 system. It is presented as an example to help you to develop a work-flow and tools that you are comfortable with.
### Preface
This guide will be written from the perspective of a fictitious developer, Taylor Smith.
### Git work-flow
Having two Git remotes is useful:
- A public clone of Seastar (`"public"`)
- A private clone of Seastar (`"private"`) for in-progress work or work that is not yet ready to share
The first step to contributing a change to Scylla is to create a local branch dedicated to it. For example, a feature that fixes a bug in the CQL statement for creating tables could be called `ts/cql_create_table_error/v1`. The branch name is prefaced by the developer's initials and has a suffix indicating that this is the first version. The version suffix is useful when branches are shared publicly and changes are requested on the mailing list. Having a branch for each version of the patch (or patch set) shared publicly makes it easier to reference and compare the history of a change.
Setting the upstream branch of your development branch to `master` is a useful way to track your changes. You can do this with
As a patch set is developed, you can periodically push the branch to the private remote to back-up work.
Once the patch set is ready to be reviewed, push the branch to the public remote and prepare an email to the `scylladb-dev` mailing list. Including a link to the branch on your public remote allows for reviewers to quickly test and explore your changes.
### Development environment and source code navigation
Scylla includes a [CMake](https://cmake.org/) file, `CMakeLists.txt`, for use only with development environments (not for building) so that they can properly analyze the source code.
[CLion](https://www.jetbrains.com/clion/) is a commercial IDE offers reasonably good source code navigation and advice for code hygiene, though its C++ parser sometimes makes errors and flags false issues.
Other good options that directly parse CMake files are [KDevelop](https://www.kdevelop.org/) and [QtCreator](https://wiki.qt.io/Qt_Creator).
To use the `CMakeLists.txt` file with these programs, define the `FOR_IDE` CMake variable or shell environmental variable.
[Eclipse](https://eclipse.org/cdt/) is another open-source option. It doesn't natively work with CMake projects, and its C++ parser has many similar issues as CLion.
### Distributed compilation: `distcc` and `ccache`
Scylla's compilations times can be long. Two tools help somewhat:
- [ccache](https://ccache.samba.org/) caches compiled object files on disk and re-uses them when possible
- [distcc](https://github.com/distcc/distcc) distributes compilation jobs to remote machines
A reasonably-powered laptop acts as the coordinator for compilation. A second, more powerful, machine acts as a passive compilation server.
Having a direct wired connection between the machines ensures that object files can be transmitted quickly and limits the overhead of remote compilation.
The coordinator has been assigned the static IP address `10.0.0.1` and the passive compilation machine has been assigned `10.0.0.2`.
On Fedora, installing the `ccache` package places symbolic links for `gcc` and `g++` in the `PATH`. This allows normal compilation to transparently invoke `ccache` for compilation and cache object files on the local file-system.
Next, set `CCACHE_PREFIX` so that `ccache` is responsible for invoking `distcc` as necessary:
```bash
exportCCACHE_PREFIX="distcc"
```
On each host, edit `/etc/sysconfig/distccd` to include the allowed coordinators and the total number of jobs that the machine should accept.
This example is for the laptop, which has 2 physical cores (4 logical cores with hyper-threading):
`10.0.0.2` has 8 physical cores (16 logical cores) and 64 GB of memory.
As a rule-of-thumb, the number of jobs that a machine should be specified to support should be equal to the number of its native threads.
Restart the `distccd` service on all machines.
On the coordinator machine, edit `$HOME/.distcc/hosts` with the available hosts for compilation. Order of the hosts indicates preference.
```
10.0.0.2/16 localhost/2
```
In this example, `10.0.0.2` will be sent up to 16 jobs and the local machine will be sent up to 2. Allowing for two extra threads on the host machine for coordination, we run compilation with `16 + 2 + 2 = 20` jobs in total: `ninja-build -j20`.
When a compilation is in progress, the status of jobs on all remote machines can be visualized in the terminal with `distccmon-text` or graphically as a GTK application with `distccmon-gnome`.
One thing to keep in mind is that linking object files happens on the coordinating machine, which can be a bottleneck. See the next section speeding up this process.
### Using the `gold` linker
Linking Scylla can be slow. The gold linker can replace GNU ld and often speeds the linking process. On Fedora, you can switch the system linker using
```bash
$ sudo alternatives --config ld
```
### Testing changes in Seastar with Scylla
Sometimes Scylla development is closely tied with a feature being developed in Seastar. It can be useful to compile Scylla with a particular check-out of Seastar.
One way to do this it to create a local remote for the Seastar submodule in the Scylla repository:
throwexceptions::invalid_request_exception(sprint("Too many markers(?). %d markers exceed the allowed maximum of %d",bound_terms,std::numeric_limits<uint16_t>::max()));
throwexceptions::invalid_request_exception(
sprint("Too many markers(?). %d markers exceed the allowed maximum of %d",
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.