"
This series fix hang in multishard_writer when error happens. It contains
- multishard_writer: Abort the queue attached to consumers when producer fails
- repair: Fix hang when the writer is dead
Fixes#6241
Refs: #6248
"
* asias-stream_fix_multishard_writer_hang:
repair: Fix hang when the writer is dead
mutation_writer_test: Add test_multishard_writer_producer_aborts
multishard_writer: Abort the queue attached to consumers when producer fails
(cherry picked from commit 8925e00e96)
If no keyspace is specified when taking snapshot, there will be a segfault
because keynames is unconditionally dereferenced. Let's return an error
because a keyspace must be specified when column families are specified.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200427195634.99940-1-raphaelsc@scylladb.com>
(cherry picked from commit 02e046608f)
Fixes#6336.
When multiple key columns (clustering or partition) are passed to
the schema constructor, all having the same column id, the expectation
is that these columns will retain the order in which they were passed to
`schema_builder::with_column()`. Currently however this is not
guaranteed as the schema constructor sort key columns by column id with
`std::sort()`, which doesn't guarantee that equally comparing elements
retain their order. This can be an issue for indexes, the schemas of
which are built independently on each node. If there is any room for
variance between for the key column order, this can result in different
nodes having incompatible schemas for the same index.
The fix is to use `std::stable_sort()` which guarantees that the order
of equally comparing elements won't change.
This is a suspected cause of #5856, although we don't have hard proof.
Fixes: #5856
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
[avi: upgraded "Refs" to "Fixes", since we saw that std::sort() becomes
unstable at 17 elements, and the failing schema had a
clustering key with 23 elements]
Message-Id: <20200417121848.1456817-1-bdenes@scylladb.com>
(cherry picked from commit a4aa753f0f)
Some legacy `mc` SSTables (created in Scylla 3.0) may contain incorrect
serialization headers, which don't wrap frozen UDTs nested inside collections
with the FrozenType<...> tag. When reading such SSTable,
Scylla would detect a mismatch between the schema saved in schema
tables (which correctly wraps UDTs in the FrozenType<...> tag) and the schema
from the serialization header (which doesn't have these tags).
SSTables created in Scylla versions 3.1 and above, in particular in
Scylla versions that contain this commit, create correct serialization
headers (which wrap UDTs in the FrozenType<...> tag).
This commit does two things:
1. for all SSTables created after this commit, include a new feature
flag, CorrectUDTsInCollections, presence of which implies that frozen
UDTs inside collections have the FrozenType<...> tag.
2. when reading a Scylla SSTable without the feature flag, we assume that UDTs
nested inside collections are always frozen, even if they don't have
the tag. This assumption is safe to be made, because at the time of
this commit, Scylla does not allow non-frozen (multi-cell) types inside
collections or UDTs, and because of point 1 above.
There is one edge case not covered: if we don't know whether the SSTable
comes from Scylla or from C*. In that case we won't make the assumption
described in 2. Therefore, if we get a mismatch between schema and
serialization headers of a table which we couldn't confirm to come from
Scylla, we will still reject the table. If any user encounters such an
issue (unlikely), we will have to use another solution, e.g. using a
separate tool to rewrite the SSTable.
Fixes#6130.
[avi: adjusted sstable file paths]
(cherry picked from commit 3d811e2f95)
Consider 3 nodes in the cluster, n1, n2, n3 with gossip generation
number g1, g2, g3.
n1, n2, n3 running scylla version with commit
0a52ecb6df (gossip: Fix max generation
drift measure)
One year later, user wants the upgrade n1,n2,n3 to a new version
when n3 does a rolling restart with a new version, n3 will use a
generation number g3'. Because g3' - g2 > MAX_GENERATION_DIFFERENCE and
g3' - g1 > MAX_GENERATION_DIFFERENCE, so g1 and g2 will reject n3's
gossip update and mark g3 as down.
Such unnecessary marking of node down can cause availability issues.
For example:
DC1: n1, n2
DC2: n3, n4
When n3 and n4 restart, n1 and n2 will mark n3 and n4 as down, which
causes the whole DC2 to be unavailable.
To fix, we can start the node with a gossip generation within
MAX_GENERATION_DIFFERENCE difference for the new node.
Once all the nodes run the version with commit
0a52ecb6df, the option is no logger
needed.
Fixes#5164
(cherry picked from commit 743b529c2b)
User reported an issue that after a node restart, the restarted node
is marked as DOWN by other nodes in the cluster while the node is up
and running normally.
Consier the following:
- n1, n2, n3 in the cluster
- n3 shutdown itself
- n3 send shutdown verb to n1 and n2
- n1 and n2 set n3 in SHUTDOWN status and force the heartbeat version to
INT_MAX
- n3 restarts
- n3 sends gossip shadow rounds to n1 and n2, in
storage_service::prepare_to_join,
- n3 receives response from n1, in gossiper::handle_ack_msg, since
_enabled = false and _in_shadow_round == false, n3 will apply the
application state in fiber1, filber 1 finishes faster filber 2, it
sets _in_shadow_round = false
- n3 receives response from n2, in gossiper::handle_ack_msg, since
_enabled = false and _in_shadow_round == false, n3 will apply the
application state in fiber2, filber 2 yields
- n3 finishes the shadow round and continues
- n3 resets gossip endpoint_state_map with
gossiper.reset_endpoint_state_map()
- n3 resumes fiber 2, apply application state about n3 into
endpoint_state_map, at this point endpoint_state_map contains
information including n3 itself from n2.
- n3 calls gossiper.start_gossiping(generation_number, app_states, ...)
with new generation number generated correctly in
storage_service::prepare_to_join, but in
maybe_initialize_local_state(generation_nbr), it will not set new
generation and heartbeat if the endpoint_state_map contains itself
- n3 continues with the old generation and heartbeat learned in fiber 2
- n3 continues the gossip loop, in gossiper::run,
hbs.update_heart_beat() the heartbeat is set to the number starting
from 0.
- n1 and n2 will not get update from n3 because they use the same
generation number but n1 and n2 has larger heartbeat version
- n1 and n2 will mark n3 as down even if n3 is alive.
To fix, always use the the new generation number.
Fixes: #5800
Backports: 3.0 3.1 3.2
(cherry picked from commit 62774ff882)
Multiple requests can use the same client_state simultaneously, so it is
not safe to use it as a container for a tracing state which is per request.
Currently next request may overwrite tracing state for previous one
causing, in a best case, wrong trace to be taken or crash if overwritten
pointer is freed prematurely.
Fixes#6014
(cherry picked from commit 866c04dd64)
Message-Id: <20200324144003.GA20781@scylladb.com>
When qualifying columns to be fetched for filtering, we also check
if the target column is not used as an index - in which case there's
no need of fetching it. However, the check was incorrectly assuming
that any restriction is eligible for indexing, while it's currently
only true for EQ. The fix makes a more specific check and contains
many dynamic casts, but these will hopefully we gone once our
long planned "restrictions rewrite" is done.
This commit comes with a test.
Fixes#5708
Tests: unit(dev)
(cherry picked from commit 767ff59418)
SimpleStrategy creates a list of endpoints by iterating over the set of
all configured endpoints for the given token, until we reach keyspace
replication factor.
There is a trivial coding bug when we first add at least one endpoint
to the list, and then compare list size and replication factor.
If RF=0 this never yields true.
Fix by moving the RF check before at least one endpoint is added to the
list.
Cassandra never had this bug since it uses a less fancy while()
loop.
Fixes#5962
Message-Id: <20200306193729.130266-1-kostja@scylladb.com>
(cherry picked from commit ac6f64a885)
Reclaim consults the _regions vector, so we don't want it moving around while
allocating more capacity. For that we take the reclaim lock. However, that
can cause a false-positive OOM during startup:
1. all memory is allocated to LSA as part of priming (2baa16b371)
2. the _regions vector is resized from 64k to 128k, requiring a segment
to be freed (plenty are free)
3. but reclaiming_lock is taken, so we cannot reclaim anything.
To fix, resize the _regions vector outside the lock.
Fixes#6003.
Message-Id: <20200311091217.1112081-1-avi@scylladb.com>
(cherry picked from commit c020b4e5e2)
By default, `/usr/lib/rpm/find-debuginfo.sh` will temper with
the binary's build-id when stripping its debug info as it is passed
the `--build-id-seed <version>.<release>` option.
To prevent that we need to set the following macros as follows:
unset `_unique_build_ids`
set `_no_recompute_build_ids` to 1
Fixes#5881
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 25a763a187)
It seems like *.service is conflicting on install time because the file
installed twice, both debian/*.service and debian/scylla-server.install.
We don't need to use *.install, so we can just drop the line.
Fixes#5640
(cherry picked from commit 29285b28e2)
There may be other commitlog writes waiting for zeroing to complete, so
not using proper scheduling class causes priority inversion.
Fixes#5858.
Message-Id: <20200220102939.30769-2-gleb@scylladb.com>
(cherry picked from commit 6a78cc9e31)
This patch fixes a bug that appears because of an incorrect interaction
between counters and hinted handoff.
When a counter is updated on the leader, it sends mutations to other
replicas that contain all counter shards from the leader. If consistency
level is achieved but some replicas are unavailable, a hint with
mutation containing counter shards is stored.
When a hint's destination node is no longer its replica, it is attempted
to be sent to all its current replicas. Previously,
storage_proxy::mutate was used for that purpose. It was incorrect
because that function treats mutations for counter tables as mutations
containing only a delta (by how much to increase/decrease the counter).
These two types of mutations have different serialization format, so in
this case a "shards" mutation is reinterpreted as "delta" mutation,
which can cause data corruption to occur.
This patch backports `storage_proxy::mutate_hint_from_scratch`
function, which bypasses special handling of counter mutations and
treats them as regular mutations - which is the correct behavior for
"shards" mutations.
Refs #5833.
Backports: 3.1, 3.2, 3.3
Tests: unit(dev)
(cherry picked from commit ec513acc49)
"
Throw an error in case we hit an invalid time UUID
rather than hitting an assert.
Fixes#5552
(Ref #5588 that was dequeued and fixed here)
Test: UUID_test, cql_query_test(debug)
"
* 'validate-time-uuid' of https://github.com/bhalevy/scylla:
cql3: abstract_function_selector: provide assignment_testable_source_context
test: cql_query_test: add time uuid validation tests
cql3: time_uuid_fcts: validate timestamp arg
cql3: make_max_timeuuid_fct: delete outdated FIXME comment
cql3: time_uuid_fcts: validate time UUID
test: UUID_test: add tests for time uuid
utils: UUID: create_time assert nanos_since validity
utils/UUID_gen: make_nanos_since
utils: UUID: assert UUID.is_timestamp
(cherry picked from commit 3343baf159)
Conflicts:
cql3/functions/time_uuid_fcts.hh
tests/cql_query_test.cc
The update generation path must track and apply all tombstones,
both from the existing base row (if read-before-write was needed)
and for the new row. One such path contained an error, because
it assumed that if the existing row is empty, then the update
can be simply generated from the new row. However, lack of the
existing row can also be the result of a partition/range tombstone.
If that's the case, it needs to be applied, because it's entirely
possible that this partition row also hides the new row.
Without taking the partition tombstone into account, creating
a future tombstone and inserting an out-of-order write before it
in the base table can result in ghost rows in the view table.
This patch comes with a test which was proven to fail before the
changes.
Branches 3.1,3.2,3.3
Fixes#5793
Tests: unit(dev)
Message-Id: <8d3b2abad31572668693ab585f37f4af5bb7577a.1581525398.git.sarna@scylladb.com>
(cherry picked from commit e93c54e837)
Avoid following UBSAN error:
repair/row_level.cc:2141:7: runtime error: load of value 240, which is not a valid value for type 'bool'
Fixes#5531
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 474ffb6e54)
CQL transport code relies on an exception's C++ type to create correct
reply, but in lwt we converted some mutation_timeout exceptions to more
generic request_timeout while forwarding them which broke the protocol.
Do not drop type information.
Fixes#5598.
Message-Id: <20200115180313.GQ9084@scylladb.com>
(cherry picked from commit 51281bc8ad)
Docker restricts the number of processes in a container to some
limit it calculates. This limit turns out to be too low on large
machines, since we run multiple links in parallel, and each link
runs many threads.
Remove the limit by specifying --pids-limit -1. Since dbuild is
meant to provide a build environment, not a security barrier,
this is okay (the container is still restricted by host limits).
I checked that --pids-limit is supported by old versions of
docker and by podman.
Fixes#5651.
Message-Id: <20200127090807.3528561-1-avi@scylladb.com>
(cherry picked from commit 897320f6ab)
This patch affects the LWT queries with IF conditions of the
following form: `IF col in :value`, i.e. if the parameter
marker is used.
When executing a prepared query with a bound value
of `(None,)` (tuple with null, example for Python driver), it is
serialized not as NULL but as "empty" value (serialization
format differs in each case).
Therefore, Scylla deserializes the parameters in the request as
empty `data_value` instances, which are, in turn, translated
to non-empty `bytes_opt` with empty byte-string value later.
Account for this case too in the CAS condition evaluation code.
Example of a problem this patch aims to fix:
Suppose we have a table `tbl` with a boolean field `test` and
INSERT a row with NULL value for the `test` column.
Then the following update query fails to apply due to the
error in IF condition evaluation code (assume `v=(null)`):
`UPDATE tbl SET test=false WHERE key=0 IF test IN :v`
returns false in `[applied]` column, but is expected to succeed.
Tests: unit(debug, dev), dtest(prepared stmt LWT tests at https://github.com/scylladb/scylla-dtest/pull/1286)
Fixes: #5710
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200205102039.35851-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit bcc4647552)
The table::flush_streaming_mutations is used in the days when streaming
data goes to memtable. After switching to the new streaming, data goes
to sstables directly in streaming, so the sstables generated in
table::flush_streaming_mutations will be empty.
It is unnecessary to invalidate the cache if no sstables are added. To
avoid unnecessary cache invalidating which pokes hole in the cache, skip
calling _cache.invalidate() if the sstables is empty.
The steps are:
- STREAM_MUTATION_DONE verb is sent when streaming is done with old or
new streaming
- table::flush_streaming_mutations is called in the verb handler
- cache is invalidated for the streaming ranges
In summary, this patch will avoid a lot of cache invalidation for
streaming.
Backports: 3.0 3.1 3.2
Fixes: #5769
(cherry picked from commit 5e9925b9f0)
This assert, added by 060e3f8 is supposed to make sure the invariant of
the append() is respected, in order to prevent building an invalid row.
The assert however proved to be too harsh, as it converts any bug
causing out-of-order clustering rows into cluster unavailability.
Downgrade it to on_internal_error(). This will still prevent corrupt
data from spreading in the cluster, without the unavailability caused by
the assert.
Fixes: #5786
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200211083829.915031-1-bdenes@scylladb.com>
(cherry picked from commit 3164456108)
Since dpkg does not re-install conffiles when it removed by user,
currently we are missing dependencies.conf and sysconfdir.conf on rollback.
To prevent this, we need to stop running
'rm -rf /etc/systemd/system/scylla-server.service.d/' on 'remove'.
Fixes#5734
(cherry picked from commit 43097854a5)
awk returns float value on Debian, it causes postinst script failure
since we compare it as integer value.
Replaced with sed + bash.
Fixes#5569
(cherry picked from commit 5627888b7c)
Treat writes to local.paxos as user memory, as the number of writes is
dependent on the amount of user data written with LWT.
Fixes#5682
Message-Id: <20200130150048.GW26048@scylladb.com>
(cherry picked from commit b08679e1d3)
We would sometimes produce an unnecessary extra 0xff prefix byte.
The new encoding matches what cassandra does.
This was both a efficiency and correctness issue, as using varint in a
key could produce different tokens.
Fixes#5656
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
(cherry picked from commit c89c90d07f)
We use eventually() in tests to wait for eventually consistent data
to become consistent. However, we see spurious failures indicating
that we wait too little.
Increasing the timeout has a negative side effect in that tests that
fail will now take longer to do so. However, this negative side effect
is negligible to false-positive failures, since they throw away large
test efforts and sometimes require a person to investigate the problem,
only to conclude it is a false positive.
This patch therefore makes eventually() more patient, by a factor of
32.
Fixes#4707.
Message-Id: <20200130162745.45569-1-avi@scylladb.com>
(cherry picked from commit ec5b721db7)
Scylla 3.2 doesn't support UDF, so do not accept UDF as a valid option
to experimental_features.
Fixes#5645.
No fix is needed on master, which does support UDF.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
We need to add '~' to handle rcX version correctly on Debian variants
(merged at ae33e9f), but when we moved to relocated package we mistakenly
dropped the code, so add the code again.
Fixes#5641
(cherry picked from commit dd81fd3454)
A mistake in handling legacy checks for special 'idx_token' column
resulted in not recognizing materialized views backing secondary
indexes properly. The mistake is really a typo, but with bad
consequences - instead of checking the view schema for being an index,
we asked for the base schema, which is definitely not an index of
itself.
Branches 3.1,3.2 (asap)
Fixes#5621Fixes#4744
(cherry picked from commit 9b379e3d63)
Consider this:
1) Write partition_start of p1
2) Write clustering_row of p1
3) Write partition_end of p1
4) Repair is stopped due to error before writing partition_start of p2
5) Repair calls repair_row_level_stop() to tear down which calls
wait_for_writer_done(). A duplicate partition_end is written.
To fix, track the partition_start and partition_end written, avoid
unpaired writes.
Backports: 3.1 and 3.2
Fixes: #5527
(cherry picked from commit 401854dbaf)
The query option always_return_static_content was added for lightweight
transations in commits e0b31dd273 (infrastructure) and 65b86d155e
(actual use). However, the flag was added unconditionally to
update_parameters::options. This caused it to be set for list
read-modify-write operations, not just for lightweight transactions.
This is a little wasteful, and worse, it breaks compatibility as old
nodes do not understand the always_return_static_content flag and
complain when they see it.
To fix, remove the always_return_static_content from
update_parameters::options and only set it from compare-and-swap
operations that are used to implement lightweight transactions.
Fixes#5593.
Reviewed-by: Gleb Natapov <gleb@scylladb.com>
Message-Id: <20200114135133.2338238-1-avi@scylladb.com>
(cherry picked from commit 6c84dd0045)
Merged pull request https://github.com/scylladb/scylla/pull/5538 from
Avi Kivity and Piotr Jastrzębski.
This series prepares CDC for rolling upgrade. This consists of
reducing the footprint of cdc, when disabled, on the schema, adding
a cluster feature, and redacting the cdc column when transferring
it to other nodes. The latter is needed because we'll want to backport
this to 3.2, which doesn't have canonical_mutations yet.
Fixes#5191.
(cherry picked from commit f0d8dd4094)
This is part of original commit 52b48b415c
("Test that schema digests with UDFs don't change"). It is needed to
test tables with CDC enabled.
Ref #5191.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Since we merged /usr/lib/scylla with /opt/scylladb, we removed
/usr/lib/scylla and replace it with the symlink point to /opt/scylladb.
However, RPM does not support replacing a directory with a symlink,
we are doing some dirty hack using RPM scriptlet, but it causes
multiple issues on upgrade/downgrade.
(See: https://docs.fedoraproject.org/en-US/packaging-guidelines/Directory_Replacement/)
To minimize Scylla upgrading/downgrade issues on user side, it's better
to keep /usr/lib/scylla directory.
Instead of creating single symlink /usr/lib/scylla -> /opt/scylladb,
we can create symlinks for each setup scripts like
/usr/lib/scylla/<script> -> /opt/scylladb/scripts/<script>.
Fixes#5522Fixes#4585Fixes#4611
(cherry picked from commit 263385cb4b)
Similar to trace_state keep shared_ptr<tracing> _local_tracing_ptr
in one_session_records when constructed so it can be used
during shutdown.
Fixes#5243
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 7aef39e400)
On aarch64, asan detected a use-after-move. It doesn't happen on x86_64,
likely due to different argument evaluation order.
Fix by evaluating full_slice before moving the schema.
Note: I used "auto&&" and "std::move()" even though full_slice()
returns a reference. I think this is safer in case full_slice()
changes, and works just as well with a reference.
Fixes#5419.
(cherry picked from commit 85822c7786)
Suppose we have a multi-dc setup (e.g. 9 nodes distributed across
3 datacenters: [dc1, dc2, dc3] -> [3, 3, 3]).
When a query that uses LWT is executed with LOCAL_SERIAL consistency
level, the `storage_proxy::get_paxos_participants` function
incorrectly calculates the number of required participants to serve
the query.
In the example above it's calculated to be 5 (i.e. the number of
nodes needed for a regular QUORUM) instead of 2 (for LOCAL_SERIAL,
which is equivalent to LOCAL_QUORUM cl in this case).
This behavior results in an exception being thrown when executing
the following query with LOCAL_SERIAL cl:
INSERT INTO users (userid, firstname, lastname, age) VALUES (0, 'first0', 'last0', 30) IF NOT EXISTS
Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level for cl LOCAL_SERIAL. Requires 5, alive 3" info={'required_replicas': 5, 'alive_replicas': 3, 'consistency': 'LOCAL_SERIAL'}
Tests: unit(dev), dtest(consistency_test.py)
Fixes#5477.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20191216151732.64230-1-pa.solodovnikov@scylladb.com>
(cherry picked from commit c451f6d82a)
In commit b463d7039c (repair: Introduce
get_combined_row_hash_response), working_row_buf_nr is returned in
REPAIR_GET_COMBINED_ROW_HASH in addition to the combined hash. It is
scheduled to be part of 3.1 release. However it is not backported to 3.1
by accident.
In order to be compatible between 3.1 and 3.2 repair. We need to drop
the working_row_buf_nr in 3.2 release.
Fixes: #5490
Backports: 3.2
Tests: Run repair in a mixed 3.1 and 3.2 cluster
(cherry picked from commit 7322b749e0)
The LIKE operator requires filtering, so needs_filtering() must check
is_LIKE(). This already happens for partition columns, but it was
overlooked for clustering columns in the initial implementation of
LIKE.
Fixes#5400.
Tests: unit(dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
(cherry picked from commit 27b8b6fe9d)
"
Add --experimental-features -- a vector of features to unlock. Make corresponding changes in the YAML parser.
Fixes#5338
"
* 'vecexper' of https://github.com/dekimir/scylla:
config: Add `experimental_features` option
utils: Add enum_option
(cherry picked from commit 63474a3380)
This commit makes sure that single-partition readers for
read-before-write do not have fast-forwarding enabled,
as it may lead to huge read amplification. The observed case was:
1. Creating an index.
CREATE INDEX index1 ON myks2.standard1 ("C1");
2. Running cassandra-stress in order to generate view updates.
cassandra-stress write no-warmup n=1000000 cl=ONE -schema \
'replication(factor=2) compaction(strategy=LeveledCompactionStrategy)' \
keyspace=myks2 -pop seq=4000000..8000000 -rate threads=100 -errors
skip-read-validation -node 127.0.0.1;
Without disabling fast-forwarding, single-partition readers
were turned into scanning readers in cache, which resulted
in reading 36GB (sic!) on a workload which generates less
than 1GB of view updates. After applying the fix, the number
dropped down to less than 1GB, as expected.
Refs #5409Fixes#4615Fixes#5418
(cherry picked from commit 79c3a508f4)
Fixes#5211
In 79935df959 replay apply-call was
changed from one with no continuation to one with. But the frozen
mutation arg was still just lambda local.
Change to use do_with for this case as well.
Message-Id: <20191203162606.1664-1-calle@scylladb.com>
(cherry picked from commit 56a5e0a251)
In get_full_row_hashes_with_rpc_stream and
repair_get_row_diff_with_rpc_stream_process_op which were introduced in
the "Repair switch to rpc stream" series, rx_hashes_nr metrics are not
updated correctly.
In the test we have 3 nodes and run repair on node3, we makes sure the
following metrics are correct.
assertEqual(node1_metrics['scylla_repair_tx_hashes_nr'] + node2_metrics['scylla_repair_tx_hashes_nr'],
node3_metrics['scylla_repair_rx_hashes_nr'])
assertEqual(node1_metrics['scylla_repair_rx_hashes_nr'] + node2_metrics['scylla_repair_rx_hashes_nr'],
node3_metrics['scylla_repair_tx_hashes_nr'])
assertEqual(node1_metrics['scylla_repair_tx_row_nr'] + node2_metrics['scylla_repair_tx_row_nr'],
node3_metrics['scylla_repair_rx_row_nr'])
assertEqual(node1_metrics['scylla_repair_rx_row_nr'] + node2_metrics['scylla_repair_rx_row_nr'],
node3_metrics['scylla_repair_tx_row_nr'])
assertEqual(node1_metrics['scylla_repair_tx_row_bytes'] + node2_metrics['scylla_repair_tx_row_bytes'],
node3_metrics['scylla_repair_rx_row_bytes'])
assertEqual(node1_metrics['scylla_repair_rx_row_bytes'] + node2_metrics['scylla_repair_rx_row_bytes'],
node3_metrics['scylla_repair_tx_row_bytes'])
Tests: repair_additional_test.py:RepairAdditionalTest.repair_almost_synced_3nodes_test
Fixes: #5339
Backports: 3.2
(cherry picked from commit 6ec602ff2c)
By default rpm uses dwz to merge the debug info from various
binaries. Unfortunately, it looks like addr2line has not been updated
to handle this:
// This works
$ addr2line -e build/release/scylla 0x1234567
$ dwz -m build/release/common.debug build/release/scylla.debug build/release/iotune.debug
// now this fails
$ addr2line -e build/release/scylla 0x1234567
I think the issue is
https://sourceware.org/bugzilla/show_bug.cgi?id=23652Fixes#5289
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20191123015734.89331-1-espindola@scylladb.com>
(cherry picked from commit 8599f8205b)
Since 90d6c0b, cache will abort when trying to detach partition
entries while they're updated. This should never happen. It can happen
though, when the update fails on bad_alloc, because the cleanup guard
invalidates the cache before it releases partition snapshots (held by
"update" coroutine).
Fix by destroying the coroutine first.
Fixes#5327.
Tests:
- row_cache_test (dev)
Message-Id: <1574360259-10132-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit e3d025d014)
Currently, we overwrite the same XML output file for each test repeat
cycle. This can cause invalid XML to be generated if the XML contents
don't match exactly for every iteration.
Fix the problem by appending the test repeat cycle in the XML filename
as follows:
$ ./test.py --repeat 3 --name vint_serialization_test --mode dev --jenkins jenkins_test
$ ls -1 *.xml
jenkins_test.release.vint_serialization_test.0.boost.xml
jenkins_test.release.vint_serialization_test.1.boost.xml
jenkins_test.release.vint_serialization_test.2.boost.xml
Fixes#5303.
Message-Id: <20191119092048.16419-1-penberg@scylladb.com>
(cherry picked from commit 505f2c1008)
Merged patch set by Piotr Dulikowski:
This change corrects condition on which a row was considered expired by its
TTL.
The logic that decides when a row becomes expired was inconsistent with the
logic that decides if a single cell is expired. A single cell becomes expired
when expiry_timestamp <= now, while a row became expired when
expiry_timestamp < now (notice the strict inequality). For rows inserted
with TTL, this caused non-key cells to expire (change their values to null)
one second before the row disappeared. Now, row expiry logic uses non-strict
inequality.
Fixes#4263,
Fixes#5290.
Tests:
unit(dev)
python test described in issue #5290
(cherry picked from commit 9b9609c65b)
The goal of this patch is to fix issue #5280, a rather serious Alternator
bug, where Scylla fails to restart when an Alternator table has secondary
indexes (LSI or GSI).
Traditionally, Cassandra allows table names to contain only alphanumeric
characters and underscores. However, most of our internal implementation
doesn't actually have this restriction. So Alternator uses the characters
':' and '!' in the table names to mark global and local secondary indexes,
respectively. And this actually works. Or almost...
This patch fixes a problem of listing, during boot, the sstables stored
for tables with such non-traditional names. The sstable listing code
needlessly assumes that the *directory* name, i.e., the CF names, matches
the "\w+" regular expression. When an sstable is found in a directory not
matching such regular expression, the boot fails. But there is no real
reason to require such a strict regular expression. So this patch relaxes
this requirement, and allows Scylla to boot with Alternator's GSI and LSI
tables and their names which include the ":" and "!" characters, and in
fact any other name allowed as a directory name.
Fixes#5280.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191114153811.17386-1-nyh@scylladb.com>
(cherry picked from commit 2fb2eb27a2)
CQL tracing would only report file I/O involving one sstable, even if
multiple sstables were read from during the query.
Steps to reproduce:
create a table with NullCompactionStrategy
insert row, flush memtables
insert row, flush memtables
restart Scylla
tracing on
select * from table
The trace would only report DMA reads from one of the two sstables.
Kudos to @denesb for catching this.
Related issue: #4908
(cherry picked from commit a67e887dea)
Serialize provided partition_key in such a way that the serialized value
will hash to the same token as the original key. This way when system.paxos
table is updated the update is shard local.
Message-Id: <20191114135449.GU10922@scylladb.com>
"
When using INSERT JSON with frozen collection/UDT columns, if the columns were left unspecified or set to null, the statement would create an empty non-null value for these columns instead of using null values as it should have. For example:
cqlsh:b> create table t (k text primary key, l frozen<list<int>>, m frozen<map<int, int>>, s frozen<set<int>>, u frozen<ut>);
cqlsh:b> insert into t JSON '{"k": "insert_json"}';
cqlsh:b> select * from t;
k | l | m | s | u
-------------------+------+------+------+------
insert_json | [] | {} | {} |
This PR fixes this.
Resolves#5246 and closes#5270.
"
* 'frozen-json' of https://github.com/kbr-/scylla:
tests: add null/unset frozen collection/UDT INSERT JSON test
cql3: correctly handle frozen null/unset collection/UDT columns in INSERT JSON
cql3: decouple execute from term binding in user_type::setter
* seastar 75e189c6ba...6f0ef32514 (6):
> Merge "Add named semaphores" from Piotr
> parallel_for_each_state: pass rvalue reference to add_future
> future: Pass rvalue to uninitialized_wrapper::uninitialized_set.
> dependencies: Add libfmt-dev to debian
> log: Fix logger behavior when logging both to stdout and syslog.
> README.md: list Scylla among the projects using Seastar
If CONSISTENCY is set to SERIAL or LOCAL SERIAL, all write requests must
fail according to Cassandra's documentation. However, batched writes
bypass this check. Fix this.
This patch resurrects Cassandra's code validating a consistency level
for CAS requests. Basically, it makes CAS requests use a special
function instead of validate_for_write to make error messages more
coherent.
Note, we don't need to resurrect requireNetworkTopologyStrategy as
EACH_QUORUM should work just fine for both CAS and non-CAS writes.
Looks like it is just an artefact of a rebase in the Cassandra
repository.
The dependencies are provided by the frozen toolchain. If a dependency
is missing, we must update the toolchain rather than rely on build-time
installation, which is not reproducible (as different package versions
are available at different times).
Luckily "dnf install" does not update an already-installed package. Had
that been a case, none of our builds would have been reproducible, since
packages would be updated to the latest version as of the build time rather
than the version selected by the frozen toolchain.
So, to prevent missing packages in the frozen toolchain translating to
an unreproducible build, remove the support for installing dependencies
from reloc/build_reloc.sh. We still parse the --nodeps option in case some
script uses it.
Fixes#5222.
Tests: reloc/build_reloc.sh.
MV backpressure code frees mutation for delayed client replies earlier
to save memory. The commit 2d7c026d6e that
introduced the logic claimed to do it only when all replies are received,
but this is not the case. Fix the code to free only when all replies
are received for real.
Fixes#5242
Message-Id: <20191113142117.GA14484@scylladb.com>
Resharding is responsible for the scheduling the deletion of sstables
resharded, but it was not refreshing the cache of the shards those
sstables belong to, which means cache was incorrectly holding reference
to them even after they were deleted. The consequence is sstables
deleted by resharding not having their disk space freed until cache
is refreshed by a subsequent procedure that triggers it.
Fixes#5261.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20191107193550.7860-1-raphaelsc@scylladb.com>
* 'cql-trivial-cleanup' of ssh://github.com/scylladb/scylla-dev:
cql: rename modification_statement::_sets_a_collection to _selects_a_collection
cql: rename _column_conditions to _regular_conditions
cql: remove unnecessary optional around prefetch_data
"
Use a fixed-size, rather than a dynamically growing
bitset for column mask. This avoids unnecessary memory
reallocation in the most common case.
"
* 'column_set' of ssh://github.com/scylladb/scylla-dev:
schema: pre-allocate the bitset of column_set
schema: introduce schema::all_columns_count()
schema: rename column_mask to column_set
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
- memtable_partition_writes - number of write operations performed
on partitions in memtables,
- memtable_partition_hits - number of write operations performed
on partitions that previously existed in a memtable,
- memtable_row_writes - number of row write operations performed
in memtables,
- memtable_row_hits - number of row write operations that ovewrote
rows previously present in a memtable.
Tests: unit(release)
Merged patch series from Dejan Mircevski. Implements the "LT" and "GT"
operators of the Expected update option (i.e., conditional updates),
and enables the pre-existing tests for them.
Since it contains a precise set of columns, it's more
accurate to call it a set, not a mask. Besides, the name
column_mask is already used for column options on storage
level.
This is merely to avoid confusion: we use _sets prefix to indicate that
there are operations over static/regular columns (_sets_static_columns,
_sets_regular_columns), but _sets_a_collection is set for both operations
and conditions. So let's rename it to _selects_a_collection and add some
comments.
It's weird that modification_statement has _static_conditions for
conditions on static columns and _column_conditions for conditions on
regular columns, as if conditions on static columns are not column
conditions. Let's rename _column_conditions to _regular_conditions to
avoid confusion.
Before this commit, an empty non-null value was created for
frozen collection/UDT columns when an INSERT JSON statement was executed
with the value left unspecified or set to null.
This was incompatible with Cassandra which inserted a null (dead cell).
Fixes#5270.
--pkg option on install.sh is introduced for .deb packaging since it requires
different install directory for each subpackage.
But we actually able to use "debian/tmp" for shared install directory,
then we can specify file owner of the package using .install files.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191030203142.31743-1-syuu@scylladb.com>
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
- memtable_partition_writes - number of write operations performed
on partitions in memtables,
- memtable_partition_hits - number of write operations performed
on partitions that previously existed in a memtable,
- memtable_row_writes - number of row write operations performed
in memtables,
- memtable_row_hits - number of row write operations that ovewrote
rows previously present in a memtable.
Tests: unit(release)
This change adds a SCYLLA_REPO_URL argument to Dockerfile, which defines
the RPM repository used to install Scylla from.
When building a new Docker image, users can specify the argument by
passing the --build-arg SCYLLA_REPO_URL=<url> option to the docker build
command. If the argument is not specified, the same RPM repository is
used as before, retaining the old default behavior.
We intend to use this in release engineering infrastructure to specify
RPM repositories for nightly builds of release branches (for example,
3.1.x), which are currently only using the stable RPMs.
Code for check_LT(), check_GT(), etc. will be nearly identical, so
factor it out into a single function that takes a comparator object.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
In 1ca9dc5d47, it was established that the correct way to
base64-decode a JSON value is via string_view, rather than directly
from GetString().
This patch adds a base64_decode(rjson::value) overload, which
automatically uses the correct procedure. It saves typing, ensures
correctness (fixing one incorrect call found), and will come in handy
for future EXPECTED comparisons.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
unwrap_number() is now a public function in serialization.hh instead
of a static function visible only in executor.cc.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Merged patch series from Piotr Sarna:
An otherwise empty partition can still have a valid static column.
Filtering didn't take that fact into account and only filtered
full-fledged rows, which may result in non-matching rows being returned
to the client.
Fixes#5248
"type" label is already in use for the counter type ("derive", "gauge",
etc). Using the same label for "cas" / "non-cas" overwrites it. Let's
instead call the new label "conditional" and use "yes" / "no" for its
value, as suggested by Kostja.
Message-Id: <3082b16e4d6797f064d58da95fb4e50b59ab795c.1572451480.git.vdavydov@scylladb.com>
"
In case when a single reader contributes a stream of fragments and keeps winning over other readers, mutation_reader_merger will enter gallop mode, in which it is assumed that the reader will keep winning over other readers. Currently, a reader needs to contribute 3 fragments to enter that mode.
In gallop mode, fragments returned by the galloping reader will be compared with the best fragment from _fragment_heap. If it wins, the fragment is directly returned. Otherwise, gallop mode ends and merging performed as in general case, which involves heap operations.
In current implementation, when the end of partition is encountered while in gallop mode, the gallop mode is ended unconditionally.
A microbenchmark was added in order to test performance of the galloping reader optimization. A combining reader that merges results from four other readers is created. Each sub-reader provides a range of 32 clustering rows that is disjoint from others. All sub-readers return rows from the same partition. An improvement can be observed after introducing the galloping reader optimization.
As for other benchmarks from the "combined" group, results are pretty close to the old ones. The only one that seems to have suffered slightly is combined.many_overlapping.
Median times from a single run of perf_mutation_readers.combined: (1s run duration, 5 runs per benchmark, release mode)
test name before after improvement
one_row 49.070ns 48.287ns 1.60%
single_active 61.574us 61.235us 0.55%
many_overlapping 488.193us 514.977us -5.49%
disjoint_interleaved 57.462us 57.111us 0.61%
disjoint_ranges 56.545us 56.006us 0.95%
overlapping_partitions_disjoint_rows 127.039us 80.849us 36.36%
Same results, normalized per mutation fragment:
test name before after improvement
one_row 16.36ns 16.10ns 1.60%
single_active 109.46ns 108.86ns 0.55%
many_overlapping 216.97ns 228.88ns -5.49%
disjoint_interleaved 102.15ns 101.53ns 0.61%
disjoint_ranges 100.52ns 99.57ns 0.95%
overlapping_partitions_disjoint_rows 246.38ns 156.80ns 36.36%
Tested on AMD Ryzen Threadripper 2950X @ 3.5GHz.
Tests: unit(release)
Fixes#3593.
"
* '3593-combined_reader-gallop-mode' of https://github.com/piodul/scylla:
mutation_reader: gallop mode microbenchmark
mutation_reader: combined reader gallop tests
mutation_reader: gallop mode for combined reader
mutation_reader: refactor prepare_next
An otherwise empty partition can still have a valid static column.
Filtering didn't take that fact into account and only filtered
full-fledged rows, which may result in non-matching rows being returned
to the client.
Fixes#5248
Update previous results dictionary using the update_metrics method.
It calls metric_source.query_list to get a list of results (similar to discover()) then for each line in the response it updates results dictionary.
New results may be appeneded depending on the do_append parameter (True by default).
Previously, with prometheous, each metric.update called query_list resulting in O(n^2) when all metric were updated, like in the scylla_top dtest - causing test timeout when testing debug build.
(E.g. dtest-debug/216/testReport/scyllatop_test/TestScyllaTop/default_start_test/)
This patch adds "type" label to the following CQL metrics:
inserts
updates
deletes
batches
statements_in_batches
The label is set to "cas" for conditional statements and "non-cas" for
unconditional statements.
Note, for a batch to be accounted as CAS, it is enough to have just one
conditional statement. In this case all statements within the batch are
accounted as CAS as well.
This microbenchmark tests performance of the galloping reader
optimization. A combining reader that merges results from four other
readers is created. Each sub-reader provides a range of 32 clustering
rows that is disjoint from others. All sub-readers return rows from
the same partition. An improvement can be observed after introducing the
galloping reader optimization.
As for other benchmarks from the "combined" group, results are pretty
close to the old ones. The only one that seems to have suffered slightly
is combined.many_overlapping.
Median times from a single run of perf_mutation_readers.combined:
(1s run duration, 5 runs per benchmark, release mode)
test name before after improvement
one_row 49.070ns 48.287ns 1.60%
single_active 61.574us 61.235us 0.55%
many_overlapping 488.193us 514.977us -5.49%
disjoint_interleaved 57.462us 57.111us 0.61%
disjoint_ranges 56.545us 56.006us 0.95%
overlapping_partitions_disjoint_rows 127.039us 80.849us 36.36%
Same results, normalized per mutation fragment:
test name before after improvement
one_row 16.36ns 16.10ns 1.60%
single_active 109.46ns 108.86ns 0.55%
many_overlapping 216.97ns 228.88ns -5.49%
disjoint_interleaved 102.15ns 101.53ns 0.61%
disjoint_ranges 100.52ns 99.57ns 0.95%
overlapping_partitions_disjoint_rows 246.38ns 156.80ns 36.36%
Tested on AMD Ryzen Threadripper 2950X @ 3.5GHz.
In case when a single reader contributes a stream of fragments
and keeps winning over other readers, mutation_reader_merger will
enter gallop mode, in which it is assumed that the reader will keep
winning over other readers. Currently, a reader needs to contribute
3 fragments to enter that mode.
In gallop mode, fragments returned by the galloping reader will be
compared with the best fragment from _fragment_heap. If it wins, the
fragment is directly returned. Otherwise, gallop mode ends and
merging performed as in general case, which involves heap operations.
In current implementation, when the end of partition is encountered
while in gallop mode, the gallop mode is ended unconditionally.
Fixes#3593.
Move out logic responsible for adding readers at partition boundary
into `maybe_add_readers_at_partition_boundary`, and advancing one reader
into `prepare_one`. This will allow to reuse this logic outside
`prepare_next`.
Since seastar::streams are based on future/promise, variadic streams
suffer the same fate as variadic futures - deprecation and eventual
removal.
This patch therefore replaces a variadic stream in commitlog::read_log_file()
with a non-variadic stream, via a helper struct.
Tests: unit (dev)
Recently, scylla memory started to go beyond just providing raw stats
about the occupancy of the various memory pools, to additionally also
provide an overview of the "usual suspects" that cause memory pressure.
As part of this, recently 46341bd63f
added a section of the coordinator stats. This patch continues this
trend and adds a replica section, with the "usual suspects":
* read concurrency semaphores
* execution stages
* read/write operations
Example:
Replica:
Read Concurrency Semaphores:
user sstable reads: 0/100, remaining mem: 84347453 B, queued: 0
streaming sstable reads: 0/ 10, remaining mem: 84347453 B, queued: 0
system sstable reads: 0/ 10, remaining mem: 84347453 B, queued: 0
Execution Stages:
data query stage:
03 "service_level_sg_0" 4967
Total 4967
mutation query stage:
Total 0
apply stage:
03 "service_level_sg_0" 12608
06 "statement" 3509
Total 16117
Tables - Ongoing Operations:
pending writes phaser (top 10):
2 ks.table1
2 Total (all)
pending reads phaser (top 10):
3380 ks.table2
898 ks.table1
410 ks.table3
262 ks.table4
17 ks.table8
2 system_auth.roles
4969 Total (all)
pending streams phaser (top 10):
0 Total (all)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191029164817.99865-1-bdenes@scylladb.com>
This patch adds the following per table stats:
cas_prepare_latency
cas_propose_latency
cas_commit_latency
They are equivalent to CasPropose, CasPrepare, CasCommit metrics exposed
by Cassandra.
This patch implements accounting of Cassandra's metrics related to
lightweight transactions, namely:
cas_read_latency transactional read latency (histogram)
cas_write_latency transactional write latency (histogram)
cas_read_timeouts number of transactional read timeouts
cas_write_timeouts number of transactional write timeouts
cas_read_unavailable number of transactional read
unavailable errors
cas_write_unavailable number of transactional write
unavailable errors
cas_read_unfinished_commit number of transaction commit attempts
that occurred on read
cas_write_unfinished_commit number of transaction commit attempts
that occurred on write
cas_write_condition_not_met number of transaction preconditions
that did not match current values
cas_read_contention how many contended reads were
encountered (histogram)
cas_write_contention how many contended writes were
encountered (histogram)
Pass contention by reference to begin_and_repair_paxos(), where it is
incremented on every sleep. Rationale: we want to account the total
number of times query() / cas() had to sleep, either directly or within
begin_and_repair_paxos(), no matter if the function failed or succeeded.
Even though every Scylla version has its own scylla-gdb.py, because we
don't backport any fixes or improvements, practically we end up always
using master's version when debugging older versions of Scylla too. This
is made harder by the fact that both Scylla's and its dependencies'
(most notably that of libstdc++ and boost) code is constantly changing
between releases, requiring edits to scylla-gdb.py to make it usable
with past releases.
This patch attempts to make it easier to use scylla-gdb.py with past
releases, more specifically Scylla 3.0. This is achieved by wrapping
problematic lines in a `try: except:` and putting the backward
compatible version in the `except:` clause. These lines have comments
with the version they provide support for, so they can be removed when
said version is not supported anymore.
I did not attempt to provide full coverage, I only fixed up problems
that surfaced when using my favourite commands with 3.0.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191029155737.94456-1-bdenes@scylladb.com>
The loop that collects the result of the checksum calculations and logs
any errors. The error logging includes `checksums[0]` which corresponds
to the checksum calculation on the local node. This violates the
assumption of the code following the loop, which assumes that the future
of `checksums[0]` is intact after the loop terminates. However this is
only true when the checksum calculation is successful and is false when
it fails, as in this case the loop extracts the error and logs it. When
the code after the loop checks again whether said calculation failed, it
will get a false negative and will go ahead and attempt to extract the
value, triggering an assert failure.
Fix by making sure that even in the case of failed checksum calculation,
the result of `checksum[0]` is extracted only once.
Fixes: #5238
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191029151709.90986-1-bdenes@scylladb.com>
* seastar 2963970f6b...75e189c6ba (7):
> posix-stack: Do auto-resolve of ipv6 scope iff not set for link-local dests
> README.md: Add redpanda and smf to 'Projects using Seastar'
> unix_domain_test: don't assume that at temporary_buffer is null terminated
> socket_address: Use offsetof instead of null pointer
> README: add projects using seastar section to readme
> Adjustments for glibc 2.30 and hwloc 2.0
> Mark future::failed() as const
We may want to change paxos tables format and change internode protocol,
so hide lwt behind experimental flag for now.
Message-Id: <20191029102725.GM2866@scylladb.com>
Currently end of stream validation is done in the destructor,
but the validator may be destructed prematurely, e.g. on
exception, as seen in https://github.com/scylladb/scylla/issues/5215
This patch adds a on_end_of_stream() method explicitly called by
consume_pausable_in_thread. Also, the respective concepts for
ParitionFilter, MutationFragmentFilter and a new on for the
on_end_of_stream method were unified as FlattenedConsumerFilter.
Refs #5215
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 506ff40bd447f00158c24859819d4bb06436c996)
There are a few issues at the CQL layer, because of which the result of
a CAS request execution may differ between Scylla and Cassandra. Mostly,
it happens when static columns are involved. The goal of this patch set
is to fix these issues, thus making Scylla's implementation of CAS yield
the same results as Cassandra's.
Merged patch series by Calle Wilund, with a few fixes by Piotr Jastrzębski:
Adds delta and pre-image data column writes for the atomic columns in a
cdc-enabled table.
Note that in this patch set it is still unconditional. Adding option support
comes in next set.
Uses code more or less derived from alternator to select pre-image, using
raw query interface. So should be fairly low overhead to query generation.
Pre-image and delta mutations are mixed in with the actual modification
mutations to generate the full cdc log (sans post-image).
Even if no rows match clustering key restrictions of a conditional
statement with static columns conditions, we still must include the
static column value into the CAS failure result set. For example,
the following conditional DELETE statement
create table t(k int, c int, s int static, v int, primary key(k, c));
insert into t(k, s) values(1, 1);
delete v from t where k=1 and c=1 if v=1 and s=1;
must return
[applied=False, v=null, s=1]
not just
[applied=False, v=null, s=null]
To fix that, set partition_slice::option::always_return_static_content
for querying rows used for checking conditions so that we have the
static row in update_parameters::prefetch_data even if no regular row
matches clustering column restrictions. Plus modify cas_request::
applies_to() so that it sets is_in_cas_result_set flag for the static
row in case there are static column conditions, but the result set
happens to be empty.
As pointed out by Tomek, there's another reason to set partition_slice::
option::always_return_static_content apart from building a correct
result set on CAS failure. There could be a batch with two statements,
one with clustering key restrictions which select no row, and another
statement with only static column conditions. If we didn't enable this
flag, we wouldn't get a static row even if it exists, and static column
conditions would evaluate as if the static row didn't exist, for
example, the following batch
create table t(k int, c int, s int static, primary key(k, c));
insert into t(k, s) values(1, 1);
begin batch
insert into t(k, c) values(1, 1) if not exists
update t set s = 2 where k = 1 if s = 1
apply batch;
would fail although it clearly must succeed.
A SELECT statement that has clustering key restrictions isn't supposed
to return static content if no regular rows matches the restrictions,
see #589. However, for the CAS statement we do need to return static
content on failure so this patch adds a flag that allows the caller to
override this behavior.
Apart from conditional statements, there may be other reading statements
in a batch, e.g. manipulating lists. We must not include rows fetched
for them into the CAS result set. For instance, the following CAS batch:
create table t(p int, c int, i int, l list<int>, primary key(p, c));
insert into t(p, c, i) values(1, 1, 1)
insert into t(p, c, i, l) values(1, 1, 1, [1, 2, 3])
begin batch
update t set i=3 where p=1 and c=1 if i=2
update t set l=l-[2] where p=1 and c=2
apply batch;
is supposed to return
[applied] | p | c | i
----------+---+---+---
False | 1 | 1 | 1
not
[applied] | p | c | i
----------+---+---+---
False | 1 | 1 | 1
False | 1 | 2 | 1
To filter out such collateral rows from the result set, let's mark rows
checked by conditional statements with a special flag.
If a CQL statement only updates static columns, i.e. has no clustering
key restrictions, we still fetch a regular row so that we can check it
against EXISTS condition. In this case we must be especially careful: we
can't simply pass the row to modification_statement::applies_to, because
it may turn out that the row has no static columns set, i.e. there's no
in fact static row in the partition. So we filter out such rows without
static columns right in cas_request::applies_to before passing them
further to modification_statement::applies_to.
Example:
create table t(p int, c int, s int static, primary key(p, c));
insert into t(p, c) values(1, 1);
insert into t(p, s) values(1, 1) if not exists;
The conditional statement must succeed in this case.
In case a CQL statement has only static columns conditions, we must
ignore clustering key restrictions.
Example:
create table t(p int, c int, s int static, v int, primary key(p, c));
insert into t(p, s) values(1, 1);
update t set v=1 where p=1 and c=1 if s=1;
This conditional statement must successfully insert row (p=1, c=1, v=1)
into the table even though there's no regular row with p=1 and c=1 in
the table before it's executed, because the statement condition only
applies to the static column s, which exists and matches.
If a modification statement doesn't have a clustering column restriction
while the table has static columns, then EXISTS condition just needs to
check if there's a static row in the partition, i.e. it doesn't need to
select any regular rows. Let's treat such EXIST condition like a static
column condition so that we can ignore its clustering key range while
checking CAS conditions.
This will allow us to add helper methods and store extra info in each
row. For example, we can add a method for checking if a row has static
columns. Also, to build CAS result set, we need to differentiate rows
fetched to check conditions from those fetched for reading operations.
Using struct as row container will allow us to store this information in
each prefetched row.
Currently, we set _sets_regular_columns/_sets_static_columns flags when
adding regular/static conditions to modification_statement. We use them
in applies_only_to_static_columns() function that returns true iff
_sets_static_columns is set and _sets_regular_columns is clear. We
assume that if this function returns true then the statement only deals
with static columns and so must not have clustering key restrictions.
Usually, that's true, but there's one exception: DELETE FROM ...
statement that deletes whole rows. Technically, this statement doesn't
have any column operations, i.e. _sets_regular_columns flag is clear.
So if such a statement happens to have a static condition, we will
assume that it only applies to static columns and mistakenly raise an
error.
Example:
create table t(k int, c int, s int static, v int, primary key(k, c));
delete from t where k=1 and c=1 if s=1;
To fix this, let's not set the above mentioned flags when adding
conditions and instead check if _column_conditions array is empty in
applies_only_to_static_columns().
modification_statement::process_where_clause() assumes that both
operations and conditions has been added to the statement when it's
called: it uses this information to raise an error in case the statement
restrictions are incompatible with operations or conditions. Currently,
operations are set before this function is called, but not conditions.
This results in "Invalid restrictions on clustering columns since
the {} statement modifies only static columns" error while trying to
execute the following statements:
create table t(k int, c int, s int static, v int, primary key(k, c));
delete s from t where k=1 and c=1 if v=1;
update t set s=1 where k=1 and c=1 if v=1;
Fix this by always initializing conditions before processing WHERE
clause.
Print a histogram of the number of async work items in the shard's
outgoing smp queues.
Example:
(gdb) scylla smp-queues
10747 17 -> 3 ++++++++++++++++++++++++++++++++++++++++
721 17 -> 19 ++
247 17 -> 20 +
233 17 -> 10 +
210 17 -> 14 +
205 17 -> 4 +
204 17 -> 5 +
198 17 -> 16 +
197 17 -> 6 +
189 17 -> 11 +
181 17 -> 1 +
179 17 -> 13 +
176 17 -> 2 +
173 17 -> 0 +
163 17 -> 8 +
1 17 -> 9 +
Useful for identifying the target shard, when `scylla task_histogram`
indicates a high number of async work items.
To produce the histogram the command goes over all virtual objects in
memory and identifies the source and target queues of each
`seastar::smp_message_queue::async_work_item` object. Practically the
source queue will always be that of the current shard. As this scales
with the number of virtual objects in memory, it can take some time to
run. An alternative implementation would be to instead read the actual
smp queues, but the code of that is scary so I went for the simpler and
more reliable solution.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191028132456.37796-1-bdenes@scylladb.com>
This patch set introduces light-weight transactions support to
ScyllaDB. It is a subset of the full series, which adds
basic LWT support and which has been reviewed thus far.
"
mutation_test/test_udt_mutations kept failing on my machine and I tracked it down to the 3rd patch in this series (use int64_t constants for long_type). While at it, this series also fixes a comment and the end iterator in BOOST_REQUIRE(std::all_of(...))
mutation_test: test_udt_mutations: fixup udt comment
mutation_test: test_udt_mutations: fix end iterator in call to std::all_of
mutation_test: test_udt_mutations: use int64_t constants for long_type
Test: mutation_test(dev, debug)
"
* 'test_udt_mutations-fixes' of https://github.com/bhalevy/scylla:
mutation_test: test_udt_mutations: use int64_t constants for long_type
mutation_test: test_udt_mutations: fix end iterator in call to std::all_of
mutation_test: test_udt_mutations: fixup udt comment
Based on a mutation, creates a pre-image select operation.
Note, this uses raw proxy query to shortcut parsing etc,
instead of trying to cache by generated query. Hypothesis is that
this is essentially faster.
The routine assumes all rows in a mutation touch same static/regular
columns. If this is not always true it will need additional
calculations.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Support single-statement conditional updates and as well as batches.
This patch almost fully rewrites column_condition.cc, implementing
is_satisfied_by().
Most of the remaining complications in column_condition implementation
come from the need to properly handle frozen and multi-cell
collection in predicates - up until now it was not possible
to compare entire collection values between each other. This is further
complicated since multi-cell lists and sets are returned as maps.
We can no longer assume that the columns fetched by prefetch operation
are non-frozen collections. IF EXISTS/IF NOT EXISTS condition
fetches all columns, besides, a column may be needed to check other
condition.
When fetching the old row for LWT or to apply updates on list/columns,
we now calculate precisely the list of columns to fetch.
The primary key columns are also included in CAS batch result set,
and are thus also prefetched (the user needs them to figure out which
statements failed to apply).
The patch is cross-checked for compatibility with cassandra-3.11.4-1545-g86812fa502
but does deviate from the origin in handling of conditions on static
row cells. This is addressed in future series.
Each column_condition and raw::column_condition construction case had a
static method wrapping its constructor, simply supplying some defaults.
This neither improves clarity nor maintainability.
cql_statement_opt_metadata is an interim node
in cql (prepared) statement hierarchy parenting
modification_statement and batch_statement. If there
is IF condition in such statements, they return a result set,
and thus have a result set metadata.
The metadata itself is filled in a subsequent patch.
Add checks for conditional modification statement limitations:
- WHERE clustering_key IN (list) IF condition is not supported
since a conditions is evaluated for a single row/cell, so
allowing multiple rows to match the WHERE clause would create
ambiguity,
- the same is true for conditional range deletions.
- ensure all clustering restrictions are eq for conditional delete
We must not allow statements like
create table t(p int, c int, v int, primary key (p, c));
delete from t where p=1 and c>0 if v=1;
because there may be more than one statement in a partition satisfying
WHERE clause, in which case it's unclear which of them should satisfy
IF condition: all or just one.
Raising an error on such a statement is consistent with Cassandra's
behavior.
Introduce service::cas_request abstract base class
which can be used to parameterize Paxos logic.
Implement storage_proxy::cas() - compare and swap - the storage proxy
entry point for lightweight transactions.
Currently the code that manipulates mutations during write need to
check what kind of mutations are those and (sometimes) choose different
code paths. This patch encapsulates the differences in virtual
functions of mutation_holder object, so that high level code will not
concern itself with the details. The functions that are added:
apply_locally(), apply_remotely() and store_hint().
This patch adds all functionality needed for Paxos protocol. The
implementation does not strictly adhere to Paxos paper since the original
paper allows setting a value only once, while for LWT we need to be able
to make another Paxos round after "learn" phase completes, which requires
things like repair to be introduced.
Paxos protocol has three stages: prepare, accept, learn. This patch adds
rpc verb for each of those stages. To be term compatible with Cassandra
the patch calls those stages: prepare, propose, commit.
Paxos protocol relies on replicas having a state that persists over
crashes/restarts. This patch defines such state and stores it in the
database itself in the paxos table to make it persistent.
The stored state is:
in_progress_ballot - promised ballot
proposal - accepted value
proposal_ballot - the ballot of the accepted value
most_recent_commit - most recently learned value
most_recent_commit_at - the ballot of the most recently learned value
This patch add two data structures that will be used by paxos. First
one is "proposal" which contains a ballot and a mutation representing
a value paxos protocol is trying to set. Second one is
"prepare_response" which is a value returned by paxos prepare stage.
It contains currently accepted value (if any) and most recently
learned value (again if any). The later is used to "repair" replicas
that missed previous "learn" message.
Otherwise they are decomposed and serialized as 4-byte int32.
For example, on my machine cell[1] looked like this:
{0002, atomic_cell{0000000310600000;ts=0;expiry=-1,ttl=0}}
and it failed cells_equal against:
{0002, atomic_cell{0000000300000000;ts=0;expiry=-1,ttl=0}}
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
server::set_routes() was setting the value of server::_callbacks.
This led to a race condition, as set_routes() is invoked on every
shard simultaneously. It is also unnecessary, since _callbacks can be
initialized in the constructor.
Fixes#5220.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
Introduce the traced_file class which wraps a file, adding CQL trace messages before and after every operation that returns a future.
Use this file to trace reads from SSTable data and index files.
Fixes#4908.
"
* 'traced_file' of https://github.com/kbr-/scylla:
sstables: report sstable index file I/O in CQL tracing
sstables: report sstable data file I/O in CQL tracing
tracing: add traced_file class
"
This change allows creating tables with non-frozen UDT columns. Such columns can then have single fields modified or deleted.
I had to do some refactoring first. Please read the initial commit messages, they are pretty descriptive of what happened (read the commits in the order they are listed on my branch: https://github.com/kbr-/scylla/commits/udt, starting from kbr-@8eee36e, in order to understand them). I also wrote a bunch of documentation in the code.
Fixes#2201.
"
* 'udt' of https://github.com/kbr-/scylla: (64 commits)
tests: too many UDT fields check test
collection_mutation: add a FIXME.
tests: add a non-frozen UDT materialized view test
tests: add a UDT mutation test.
tests: add a non-frozen UDT "JSON INSERT" test.
tests: add a non-frozen UDT to for_each_schema_change.
tests: more non-frozen UDT tests.
tests: move some UDT tests from cql_query_test.cc to new file.
types: handle trailing nulls in tuples/UDTs better.
cql3: enable deleting single fields of non-frozen UDTs.
cql3: enable setting single fields of a non-frozen UDT.
cql3: enable non-frozen UDTs.
cql3: introduce user_types::marker.
cql3: generalize function_call::make_terminal to UDTs.
cql3: generalize insert_prepared_json_statement::execute_set_value to UDTs.
cql3: use a dedicated setter operation for inserting user types.
cql3: introduce user_types::value.
types: introduce to_bytes_opt_vec function.
cql3: make user_types::delayed_value::bind_internal return vector<bytes_opt>.
cql3: make cql3_type::raw_ut::to_string distinguish frozenness.
...
The health check is performed simply by issuing a GET request
to the alternator port - it returns the following status 200
response when the server is healthy:
$ curl -i localhost:8000
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 23
Server: Seastar httpd
Date: 21 Oct 2019 12:55:33 GMT
healthy: localhost:8000
This commit comes with a test.
Fixes#5050
Message-Id: <3050b3819661ee19640c78372e655470c1e1089c.1571921618.git.sarna@scylladb.com>
We could use iterators over cells instead of a vector of cells
in collection_mutation(_view)_description. Then some use cases could
provide iterators that construct the cells "on the fly".
Comparing user types after adding new fields was bugged.
In the following scenario:
create type ut (a int);
create table cf (a int primary key, b frozen<ut>);
insert into cf (a, b) values (0, (0));
alter type ut add b int;
select * from cf where b = {a:0,b:null};
the row with a = 0 should be returned, even though the value stored in the database is shorter
(by one null) than the value given by the user. Until now it wouldn't
have.
Add a cluster feature for non-frozen UDTs.
If the cluster supports non-frozen UDTs, do not return an error
message when trying to create a table with a non-frozen user type.
cql3::user_types::marker is a dedicated cql3::abstract_marker for user
type placeholders in prepared CQL queries. When bound, it returns a
user_types::value.
Previously it returned vector<cql3::raw_value>, even though we don't use
unset values when setting a UDT value (fields that are not provided
become nulls. Thats how C* does it).
This simplifies future implementation of user_types::{value, setter}.
is_value_compatible_with_internal and update_user_type were generalized
to the non-frozen case.
For now, all user_type_impls in the code are non-multi-cell (frozen).
This will be changed in future commits.
These functions are used to translate field indices, which are used to
identify fields inside UDTs, from/to a serialized representation to be
stored inside sstables and mutations.
They do it in a way that is compatible with C*.
The purpose of collection_type_impl::to_value was to serialize a
collection for sending over CQL. The corresponding function in origin
is called serializeForNativeProtocol, but the name is a bit lengthy,
so I settled for serialize_for_cql.
The method now became a free-standing function, using the visit
function to perform a dispatch on the collection type instead
of a virtual call. This also makes it easier to generalize it to UDTs
in future commits.
Remove the old serialize_for_native_protocol with a FIXME: implement
inside. It was already implemented (to_value), just called differently.
remove dead methods: enforce_limit and serialized_values. The
corresponding methods in C* are auxiliary methods used inside
serializeForNativeProtocol. In our case, the entire algorithm
is wholly written in serialize_for_cql.
`collection_type_impl::serialize_mutation_form`
became `collection_mutation(_view)_description::serialize`.
Previously callers had to cast their data_type down to collection_type
to use serialize_mutation_form. Now it's done inside `serialize`.
In the future `serialize` will be generalized to handle UDTs.
`collection_type_impl::deserialize_mutation_form`
became a free standing function `deserialize_collection_mutation`
with similiar benefits. Actually, noone needs to call this function
manually because of the next paragraph.
A common pattern consisting of linearizing data inside a `collection_mutation_view`
followed by calling `deserialize_mutation_form` has been abstracted out
as a `with_deserialized` method inside collection_mutation_view.
serialize_mutation_form_only_live was removed,
because it hadn't been used anywhere.
collection_type_impl::mutation became collection_mutation_description.
collection_type_impl::mutation_view became collection_mutation_view_description.
These classes now reside inside collection_mutation.hh.
Additional documentation has been written for these classes.
Related function implementations were moved to collection_mutation.cc.
This makes it easier to generalize these classes to non-frozen UDTs in future commits.
The new names (together with documentation) better describe their purpose.
The classes 'collection_mutation' and 'collection_mutation_view'
were moved to a separate header, collection_mutation.hh.
Implementations of functions that operate on these classes,
including some methods of collection_type_impl, were moved
to a separate compilation unit, collection_mutation.cc.
This makes it easier to modify these structures in future commits
in order to generalize them for non-frozen User Defined Types.
Some additional documentation has been written for collection_mutation.
Merged patch series from Piotr Sarna:
This series couples system_auth.roles with authorization routines
in alternator. The `salted_hash` field, which is every user's hashed
password, is used as a secret key for the signature generation
in alternator.
This series also adds related expiration verifications for alternator
signatures.
It also comes with more test cases and docs updates.
Tests: alternator(local, remote), manual
Piotr Sarna (11):
alternator: add extracting key from system_auth.roles
alternator: futurize verify_signature function
alternator: move the api handler to a separate function
alternator: use keys from system_auth.roles for authorization
alternator: add key cache to authorization
alternator-test: add a wrong password test
alternator: verify that the signature has not expired
alternator: add additional datestamp verification
alternator-test: add tests for expired signatures
docs: update alternator entry for authorization
alternator-test: add authorization to README
alternator-test/conftest.py | 2 +-
alternator-test/test_authorization.py | 44 ++++++++-
alternator-test/test_describe_endpoints.py | 2 +-
alternator/auth.hh | 15 ++-
alternator/server.hh | 10 +-
alternator/auth.cc | 62 +++++++++++-
alternator/server.cc | 106 ++++++++++++---------
alternator-test/README.md | 28 ++++++
docs/alternator/alternator.md | 7 +-
9 files changed, 221 insertions(+), 55 deletions(-)
ommit 93270dd changed gc_clock to be 64-bit, to fix the Y2038
problem. While 64-bit tombstone::deletion_time is serialized in a
compatible way, TTLs (gc_clock::duration) were not.
This patchset reverts TTL serialization to the 32-bit serialization
format, and also allows opting-in to the 64-bit format in case a
cluster was installed with the broken code. Only Scylla 3.1.0 is
vulnerable.
Fixes#4855
Tests: unit (dev)
From Shlomi:
4 node cluster Node A, B, C, D (Node A: seed)
cassandra-stress write n=10000000 -pop seq=1..10000000 -node <seed-node>
cassandra-stress read duration=10h -pop seq=1..10000000 -node <seed-node>
while read is progressing
Node D: nodetool decommission
Node A: nodetool status node - wait for UL
Node A: nodetool cleanup (while decommission progresses)
I get the error on c-s once decommission ends
java.io.IOException: Operation x0 on key(s) [383633374d31504b5030]: Data returned was not validated
The problem is when a node gets new ranges, e.g, the bootstrapping node, the
existing nodes after a node is removed or decommissioned, nodetool cleanup will
remove data within the new ranges which the node just gets from other nodes.
To fix, we should reject the nodetool cleanup when there is pending ranges on that node.
Note, rejecting nodetool cleanup is not a full protection because new ranges
can be assigned to the node while cleanup is still in progress. However, it is
a good start to reject until we have full protection solution.
Refs: #5045
Scylla 3.1.0 broke the serialization format for TTLs. Later versions
corrected it, but if a cluster was originally installed as 3.1.0,
it will use the broken serialization forever. This configuration option
allows upgrades from 3.1.0 to succeed, by enabling the broken format
even for later versions.
Scylla 3.1.0 inadvertently changed the serialization format of TTLs
(internally represented as gc_clock::duration) from 32-bit to 64-bit,
as part of preparation for Y2038 (which comes earlier for TTLed cells).
This breaks mutations transported in a mixed cluster.
To fix this, we revert back to the 32-bit format, unless we're in a 3.1.0-
heritage cluster, in which case we use the 64-bit format. Overflow of
a TTL is not a concern, since TTLs are capped to 20 years by the TTL layer.
An assertion is added to verify this.
This patch only defines a variable to indicate we're in
a 3.1.0 heritage cluster, but a way to set it is left to
a later patch.
* seastar 6bcb17c964...2963970f6b (4):
> Merge "IPv6 scope support and network interface impl" from Calle
> noncopyable_function: do not copy uninitialized data
> Merge "Move smp and smp queue out of reactor" from Asias
> Consolidate posix socket implementations
The README paragraph informs about turning on authorization with:
alternator-enforce-authorization: true
and has a short note on how to set up the secret key for tests.
The first test case ensures that expired signatures are not accepted,
while the second one checks that signatures with dates that reach out
too far into the future are also refused.
The authorization signature contains both a full obligatory date header
and a shortened datestamp - an additional verification step ensures that
the shortened stamp matches the full date.
AWS signatures have a 15min expiration policy. For compatibility,
the same policy is applied for alternator requests. The policy also
ensures that signatures expanding more than 15 minutes into the future
are treated as unsafe and thus not accepted.
The additional test case submits a request as a user that is expected
to exist (in the local setup), but the provided password is incorrect.
It also updates test_wrong_key_access so it uses an empty string
for trying to authenticate as an inexistent user - in order to cover
more corner cases.
In order to avoid fetching keys from system_auth.roles system table
on every request, a cache layer is introduced. And in order not to
reinvent the wheel, the existing implementation of loading_cache
with max size 1024 and a 1 minute timeout is used.
Instead of having a hardcoded secret key, the server now verifies
an actual key extracted from system_auth.roles system table.
This commit comes with a test update - instead of 'whatever':'whatever',
the credentials used for a local run are 'alternator':'secret_pass',
which matches the initial contents of system_auth.roles table,
which acts as a key store.
Fixes#5046
The lambda used for handling the api request has grown a little bit
too large, so it's moved to a separate method. Along with it,
the callbacks are now remembered inside the class itself.
The verify_signature utility will later be coupled with Scylla
authorization. In order to prepare for that, it is first transformed
into a function that returns future<>, and it also becomes a member
of class server. The reason it becoming a member function is that
it will make it easier to implement a server-local key cache.
As a first step towards coupling alternator authorization with Scylla
authorization, a helper function for extracting the key (salted_hash)
belonging to the user is added.
From Shlomi:
4 node cluster Node A, B, C, D (Node A: seed)
cassandra-stress write n=10000000 -pop seq=1..10000000 -node <seed-node>
cassandra-stress read duration=10h -pop seq=1..10000000 -node <seed-node>
while read is progressing
Node D: nodetool decommission
Node A: nodetool status node - wait for UL
Node A: nodetool cleanup (while decommission progresses)
I get the error on c-s once decommission ends
java.io.IOException: Operation x0 on key(s) [383633374d31504b5030]: Data returned was not validated
The problem is when a node gets new ranges, e.g, the bootstrapping node, the
existing nodes after a node is removed or decommissioned, nodetool cleanup will
remove data within the new ranges which the node just gets from other nodes.
To fix, we should reject the nodetool cleanup when there is pending ranges on that node.
Note, rejecting nodetool cleanup is not a full protection because new ranges
can be assigned to the node while cleanup is still in progress. However, it is
a good start to reject until we have full protection solution.
Refs: #5045
Argument evaluation order is UB, so it's not guaranteed that
c->make_garbage_collected_sstable_writer() is called before
compaction is moved to run().
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20191023052647.9066-1-raphaelsc@scylladb.com>
Make it possible to compare multi-cell lists and sets serialized
as maps with literal values and serialize them to network using
a standard format (vector of values).
This is a pre-requisite patch for column condition evaluation
in light-weight transactions.
Merged patch series from Botond Dénes:
This series extends the existing docs/debugging.md with a detailed guide
on how to debug Scylla coredumps. The intended target audience is
developers who are debugging their first core, hence the level of
details (hopefully enough). That said this should be just as useful for
seasoned debuggers just quickly looking up some snippet they can't
remember exactly. A Throubleshooting chapter is also added in this
series for commonly-met problems.
I decided to create this guide after myself having struggled for more
than a day on just opening(!) a coredump that was produced on Ubuntu.
As my main source, I used the How-to-debug-a-coredump page from the
internal wiki which contains many useful information on debugging
coredumps, however I found it to be missing some crucial information, as
well being very terse, thus being primarily useful for experienced
debuggers who can fill in the blanks. The reason I'm not extending said
wiki page is that I think this information should not be hidden in some
internal wiki page. Also, docs/debugging.md now seems to be a much
better base for such a document. This document was started as a
comprehensive debugging manual for beginners (but not just).
You will notice that the information on how to debug cores from
CentOS/Redhat are quite sparse. This is because I have no experience
with such cores, so for now the respective chapters are just stubs. I
intend to complete them in the future after having gained the necessary
experience and knowledge, however those being in possession of said
knowledge are more then welcome to send a patch. :)
Botond Dénes (4):
docs/debugging.md: demote 'Starting GDB' and 'Using GDB'
docs/debugging.md: fix formatting issues
docs/debugging.md: add 'Debugging coredumps' subchapter
docs/debugging.md: add 'Throubleshooting' subchapter
docs/debugging.md | 240 +++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 228 insertions(+), 12 deletions(-)
Add lua as a dependency in preparation for UDF. This is the first
patch since it has to go in before to allow for a frozen toolchain
update.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
[avi: update frozen toolchain image]
Message-Id: <20191018231442.11864-2-espindola@scylladb.com>
Multi-cell lists and maps may be stored in different formats: as sorted
vectors of pairs of values, when retreived from storage, or as sorted
vectors of values, when created from parser literals or supplied as
parameter values.
Implement a specialized compare for use when receiver and paramter
representation don't match.
Add helpers.
The problem is that backlog tracker is not being updated properly after
incremental compaction.
When replacing sstables earlier, we tell backlog tracker that we're done
with exhausted sstables[1], but we *don't* tell it about the new, sealed
sstables created that will replace the exhausted ones.
[1]: exhausted sstable is one that can be replaced earlier by compaction.
We need to notify backlog tracker about every sstable replacement which
was triggered by incremental compaction.
Otherwise, backlog for a table that enables incremental compaction will
be lower than it actually should. That's because new sstables being
tracked as partial decrease the backlog, whereas the exhausted ones
increase it.
The formula for a table's backlog is basically:
backlog(sstable set + compacting(1) - partial(2))
(1) compacting includes all compaction's input sstables, but the
exhausted ones are removed from it (correct behavior).
(2) partial includes all compaction's output sstables, but the ones
that replaced the exhausted sstables aren't removed from it (incorrect
behavior).
This problem is fixed by making backlog track *fully* aware of the early
replacement, not only the exhausted sstables, but also the new sstables
that replaced the exhausted ones. The new sstables need to be moved
inside the tracker from partial state to the regular one.
Fixes#5157.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20191016002838.23811-1-raphaelsc@scylladb.com>
Rather than passing a pointer to a cql_stats member corresponding to
the statement type, pass a reference to a cql_stats object and use
statement_type, which is already stored in modification_statement, for
determining which counter to increment. This will allow us to account
conditional statements, which will have a separate set of counters,
right in modification_statement::execute() - all we'll need to do is
add the new counters and bump them in case execute_with_condition is
called.
While we are at it, remove extra inclusions from statement_type.hh so as
not to introduce any extra dependencies for cql_stats.hh users.
Message-Id: <20191022092258.GC21588@esperanza>
Merged patch series from Avi Kivity:
The static row can be rare: many tables don't have them, and tables
that do will often have mutations without them (if the static row
is rarely updated, it may be present in the cache and in readers,
but absent in memtable mutations). However, it always consumes ~100
bytes of memory, even if it not present, due to row's overhead.
Change it to be optional by allocating it as an external object rather
than inlined into mutation_partition. This adds overhead when the
static row is present (17 bytes for the reference, back reference,
and lsa allocator overhead).
perf_simple_query appears to marginally (2%) faster. Footprint is
reduced by ~9% for a cache entry, 12% in memtables. More details are
provided in the patch commitlog.
Tests: unit (debug)
Avi Kivity (4):
managed_ref: add get() accessor
managed_ref: add external_memory_usage()
mutation_partition: introduce lazy_row
mutation_partition: make static_row optional to reduce memory
footprint
cell_locking.hh | 2 +-
converting_mutation_partition_applier.hh | 4 +-
mutation_partition.hh | 284 ++++++++++++++++++++++-
partition_builder.hh | 4 +-
utils/managed_ref.hh | 12 +
flat_mutation_reader.cc | 2 +-
memtable.cc | 2 +-
mutation_partition.cc | 45 +++-
mutation_partition_serializer.cc | 2 +-
partition_version.cc | 4 +-
tests/multishard_mutation_query_test.cc | 2 +-
tests/mutation_source_test.cc | 2 +-
tests/mutation_test.cc | 12 +-
tests/sstable_mutation_test.cc | 10 +-
14 files changed, 355 insertions(+), 32 deletions(-)
"
The node startup code (in particular the functions storage_service::prepare_to_join and storage_service::join_token_ring) is complicated and hard to understand.
This patch set aims to simplify it at least a bit by removing some dead code, moving code around so it's easier to understand and adding some comments that explain what the code does.
I did it to help me prepare for implementing generation and gossiping of CDC streams.
"
* 'bootstrap-refactors' of https://github.com/kbr-/scylla:
storage_service: more comments in join_token_ring
db: remove system_keyspace::update_local_tokens
db: improve documentation for update_tokens and get_saved_tokens in system_keyspace
storage_service: remove storage_service::_is_bootstrap_mode.
storage_service: simplify storage_service::bootstrap method
storage_service: fix typo in handle_state_moving
storage_service: remove unnecessary use of stringstream
storage_service: remove redundant call to update_tokens during join_token_ring
storage_service: remove storage_service::set_tokens method.
storage_service: remove is_survey_mode
storage_service::handle_state_normal: tokens_to_update* -> owned_tokens
storage_service::handle_state_normal: remove local_tokens_to_remove
db::system_keyspace::update_tokens: take tokens by const ref
db::system_keyspace::prepare_tokens: make static, take tokens by const ref
token_metadata::update_normal_tokens: take tokens by const ref
Assume n1 and n2 in a cluster with generation number g1, g2. The
cluster runs for more than 1 year (MAX_GENERATION_DIFFERENCE). When n1
reboots with generation g1' which is time based, n2 will see
g1' > g2 + MAX_GENERATION_DIFFERENCE and reject n1's gossip update.
To fix, check the generation drift with generation value this node would
get if this node were restarted.
This is a backport of CASSANDRA-10969.
Fixes#5164
The flag did nothing. It was used in one place to check if there's a
bug, but it can easily by proven by reading the code that the check
would never pass.
The storage_service::bootstrap method took a parameter: tokens to
bootstrap with. However, this method is only called in one place
(join_token_ring) with only one parameter: _bootstrap_tokens. It doesn't
make sense to call this method anywhere else with any other parameter.
This commit also adds a comment explaining what the method does and
moves it into the private section of storage_service.
When a non-seed node was bootstrapping, system_keyspace::update_tokens
was called twice: first right after the tokens were generated (or
received if we were replacing a different node) in the call to
`bootstrap`, and then later in join_token_ring. The second call was
redundant.
The join_token_ring call was also redundant if we were not bootstrapping
and had tokens saved previously (e.g. when restarting). In that case we
would have read them from LOCAL and then save the same tokens again.
This commit removes the redundant call and inserts calls to
update_tokens where they are necessary, when new tokens are generated.
The aim is to make the code easier to understand.
It also adds a comment which explains why the tokens don't need to be
generated in one of the cases.
After commit 36ccf72f3c, this method
was used only in one place.
Its name did not make it obvious what it does and when is it safe to call it.
This commit pulls out the code from set_tokens to the point where it was
called (join_token_ring). The code is only possible to understand in
context.
This code was also saving the tokens to the LOCAL table before
retrieving them from this table again. There is no point in doing that:
1. there are no races, since when join_token_ring is running, it is the
only function which can call system_keyspace::update_tokens (which saves them to the
LOCAL table). There can be no multiple instances of join_token_ring.
2. Even if there was a race, this wouldn't fix anything. The tokens we
retrieve from LOCAL by calling get_local_tokens().get0() could already
be different in the LOCAL table when the get0() returns.
Replace the two variables:
tokens_to_update_in_metadata
tokens_to_update_in_system_keyspace
which were exactly the same, with one variable owned_tokens.
The new name describes what the variable IS instead what's it used for.
Add a comment to clarify what "owned" means: those are the tokens the
node chose and any collision was resolved positively for this node.
Move the variable definition further down in the code, where it's
actually needed.
Merged patch series from Piotr Sarna:
Calculating the select statement for given view_info structure
used to work fine, but once local indexes were introduced, a subtle
bug appeared: the legacy token column does not exist in local indexes
and a valid clustering key column was omitted instead.
That results in potentially incorrect partition slices being used later
in read-before-write.
There's a long term plan for removing select_statement from
view info altogether, but nonetheless the bug needs to be fixed first.
Branch: master, 3.1
Tests: unit(dev) + manual confirmation that a correct legacy column is picked
Merge a patch series from Piotr Jastrzębski (haaawk):
This PR introduces CDC in it's minimal version.
It is possible now to create a table with CDC enabled or to enable/disable
CDC on existing table. There is a management of CDC log and description
related to enabling/disabling CDC for a table.
For now only primary key of the changed data is logged.
To be able to co-locate cdc streams with related base table partitions it
was needed to propagate the information about the number of shards per node.
This was node through gossip.
There is an assumption that all the nodes use the same value for
sharding_ignore_msb_bits. If it does not hold we would have to gossip
sharding_ignore_msb_bits around together with the number of shards.
Fixes#4986.
Tests: unit(dev, release, debug)
Currently, the function that generates the graph edges (and vertices)
with a breadth-first traversal of the object graph accidentally uses the
object that is the starting point of the graph as the "to" part of each
edge. This results in the graph having each of its edges point to the
starting point, as if all objects in it referenced said object directly.
Fix by using the object of the currently examined object.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191018113019.95093-1-bdenes@scylladb.com>
To the 'Debuggin Scylla with GDB` chapter. The '### Debugging
relocatable binaries built with the toolchain' subchapter is demoted to
be just a section in this new subchapter. It is also renamed to
'Relocatable binaries'.
This subchapter intends to be a complete guide on how to debug coredumps
from how to obtain the correct version of all the binaries all the way
to how to correctly open the core with GDB.
* seastar e888b1df...6bcb17c9 (4):
> iotune: don't crash in sequential read test if hitting EOF
> Remove FindBoost.cmake from install files
> Merge "Move reactor backend out of reactor" from Asias
> fair_queue: Add fair_queue.cc
This patch wrapps announce_migration logic into a lambda
that will be used both when cdc is used and when it's not.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
At the moment, this test only checks that table
creation and alteration sets cdc_options property
on a table correctly.
Future patches will extend this test to cover more
CDC aspects.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We would like to share with other nodes
the value of ignore_msb_bits property used by the node.
This is needed because CDC will operate on
streams of changes. Each shard on each node
will have its own stream that will be identified
by a stream_id. Stream_id will be selected in
such a way that using stream_id as partition key
will locate partition identified by stream_id on
a node and shard that the stream belongs to.
To be able to generate such stream_id we need
to know ignore_msb_bits property value for each node.
IMPORTANT NOTE: At this point CDC does not support
topology changes. It will work only on a stable cluster.
Support for topology modifications will be added in
later steps.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We would like to share with other nodes
the number of shards available at the node.
This is needed because CDC will operate on
streams of changes. Each shard on each node
will have its own stream that will be identified
by a stream_id. Stream_id will be selected in
such a way that using stream_id as partition key
will locate partition identified by stream_id on
a node and shard that the stream belongs to.
To be able to generate such stream_id we need
to know how many shards are on each node.
IMPORTANT NOTE: At this point CDC does not support
topology changes. It will work only on a stable cluster.
Support for topology modifications will be added in
later steps.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Refactor modification_statement to enable lightweight
transaction implementation.
This patch set re-arranges logic of
modification_statement::get_mutations() and uses
a column mask of identify the columns to prefetch.
It also pre-computes a few modification statement properties
at prepare, assuming the prepared statement is invalidated if
the underlying schema changes.
They are used more extensively with introduction of lightweight
transactions, and pre-computing makes it easier to reason about
complexity of the scenarios where they are involved.
Pre-compute column mask of columns to prefetch when preparing
a modification statement and use it to build partition_slice
object for read command. Fetch only the required columns.
Ligthweight transactions build up on this by using adding
columns used in conditions and in cas result set to the column
maks of columns to read. Batch statements unite all column
masks to build a single relation for all rows modified by
conditional statements of a batch.
Refactor get_mutations() so that the read command and
apply_updates() functions can be used in lightweight transactions.
Move read_command creation to an own method, as well as apply_updates().
Rewrite get_mutations() using the new API.
Avoid unnecessary shared pointers.
Introduce a column definition ordinal_id and use it in boosted
update_parameters::prefetch_data as a column index of a full row.
Lightweight transactions prefetch data and return a result set.
Make sure update_parameters::prefetch_data can serve as a
single representation of prefetched list cells as well as
condition cells and as a CAS result set.
I have a lot of plans for column_definition::ordinal_id, it
simplifies a lot of operations with columns and will also be
used for building a bitset of columns used in a query
or in multiple queries of a batch.
In modification_statement/batch_statement, we need to prefetch data to
1) apply list operations
2) evaluate CAS conditions
3) return CAS result set.
Boost update_parameters::prefetch_data to serve as a single result set
for all of the above. In case of a batch, store multiple rows for
multiple clustering keys involved in the batch.
Use an ordered set for columns and rows to make sure 3) CAS result set
is returned to the client in an ordered manner.
Deserialize the primary key and add it to result set rows since
it is returned to the client as part of CAS result set.
Index columns using ordinal_id - this allows having a single
set for all columns and makes columns easy to look up.
Remove an extra memcpy to build view objects when looking
up a cell by primary key, use partition_key/clustering_key
objects for lookup.
Get rid of an unnecessary optional around
update_parameters::prefetch_data.
update_parameters won't own prefetch_data in the future anyway,
since prefetch_data can be shared among multiple modification
statements of a batch, each statement having its own options
and hence its own update_parameters instance.
Move prefetch_data_builder class from modification_statement.cc
to update_parameters.cc.
We're going to share the same builder to build a result set
for condition evaluation and to apply updates of batch statements, so we
need to share it.
No other changes.
Make sure every column in the schema, be it a column of partition
key, clustering key, static or regular one, has a unique ordinal
identifier.
This makes it easy to compute the set of columns used in a query,
as well as index row cells.
Allow to get column definition in schema by ordinal id.
"
Incremental compaction code to release exhausted sstables was inefficient because
it was basically preventing any release from ever happening. So a new solution is
implemented to make incremental compaction approach actually efficient while
being cautious about not introducing data resurrection. This solution consists of
storing GC'able tombstones in a temporary sstable and keeping it till the end of
compaction. Overhead is avoided by not enabling it to strategies that don't work
with runs composed of multiple fragments.
Fixes#4531.
tests: unit, longevity 1TB for incremental compaction
"
* 'fix_incremental_compaction_efficiency/v6' of https://github.com/raphaelsc/scylla:
tests: Check that partition is not resurrected on compaction failure
tests: Add sstable compaction test for gc-only mutation compactor consumer
sstables: Fix Incremental Compaction Efficiency
Introduce `scylla generate_object_graph`, a command which generates a
visual object graph, where vertices are objects and edges are
references. The graph starts from the object specified by the user. The
graph allows visual inspection of the object graph and hopefully allows
the user to identify the object in question.
Add the `--resolve` flag to `scylla find`. When specified, `scylla find`
will attempt to resolve the first pointer in the found objects as a vtable
pointer. If successful the pointer as well as the resolved symbol will
be added to the listing.
In the listing of `scylla fiber` also print the starting task (as the
first item).
This mini-series contains assorted improvements that I found very useful
while debugging OOM crashes in the past weeks:
* A wrapper for `std::list`.
* A wrapper for `std::variant`.
* Making `scylla find` usable from python code.
* Improvements to `scylla sstables` and `scylla task_histogram`
commands.
* The `$downcast_vptr()` convenience function.
* The `$dereference_lw_shared_ptr()` convenience function.
Convenience functions in gdb are similar to commands, with some key
differences:
* They have a defined argument list.
* They can return values.
* They can be part of any gdb expression in which functions are allowed.
This makes them very useful for doing operations on values then
returning them so that the developer can use it the gdb shell.
The static row can be rare: many tables don't have them, and tables
that do will often have mutations without them (if the static row
is rarely updated, it may be present in the cache and in readers,
but absent in memtable mutations). However, it always consumes ~100
bytes of memory, even if it not present, due to row's overhead.
Change it to be optional by using lazy_row instead of row. Some call
sites treewide were adjusted to deal with the extra indirection.
perf_simple_query appears to improve by 2%, from 163krps to 165 krps,
though it's hard to be sure due to noisy measurements.
memory_footprint comparisons (before/after):
mutation footprint: mutation footprint:
- in cache: 1096 - in cache: 992
- in memtable: 854 - in memtable: 750
- in sstable: 351 - in sstable: 351
- frozen: 540 - frozen: 540
- canonical: 827 - canonical: 827
- query result: 342 - query result: 342
sizeof(cache_entry) = 112 sizeof(cache_entry) = 112
-- sizeof(decorated_key) = 36 -- sizeof(decorated_key) = 36
-- sizeof(cache_link_type) = 32 -- sizeof(cache_link_type) = 32
-- sizeof(mutation_partition) = 200 -- sizeof(mutation_partition) = 96
-- -- sizeof(_static_row) = 112 -- -- sizeof(_static_row) = 8
-- -- sizeof(_rows) = 24 -- -- sizeof(_rows) = 24
-- -- sizeof(_row_tombstones) = 40 -- -- sizeof(_row_tombstones) = 40
sizeof(rows_entry) = 232 sizeof(rows_entry) = 232
sizeof(lru_link_type) = 16 sizeof(lru_link_type) = 16
sizeof(deletable_row) = 168 sizeof(deletable_row) = 168
sizeof(row) = 112 sizeof(row) = 112
sizeof(atomic_cell_or_collection) = 8 sizeof(atomic_cell_or_collection) = 8
Tests: unit (dev)
lazy_row adds indirection to the row class, in order to reduce storage requirements
when the row is not present. The intent is to use it for the static row, which is
not present in many schemas, and is often not present in writes even in schemas that
have a static row.
Indirection is done using managed_ref, which is lsa-compatible.
lazy_row implements most of row's methods, and a few more:
- get(), get_existing(), and maybe_create(): bypass the abstraction and the
underlying row
- some methods that accept a row parameter also have an overload with a lazy_row
parameter
"Delete README-DPDK.md, move IDL.md to docs/ and fix
docs/review-checklist.md to point to scylla's coding style document,
instead of seastar's."
* 'documentation-cleanup/v3' of https://github.com/denesb/scylla:
docs/review-checklist.md: point to scylla's coding-style.md instead of seastar's
docs: mv coding-style.md docs/
rm README-DPDK.md
docs: mv IDL.md docs/
"Delete README-DPDK.md, move IDL.md to docs/ and fix
docs/review-checklist.md to point to scylla's coding style document,
instead of seastar's."
* 'documentation-cleanup/v3' of https://github.com/denesb/scylla:
docs/review-checklist.md: point to scylla's coding-style.md instead of seastar's
docs: mv coding-style.md docs/
rm README-DPDK.md
docs: mv IDL.md docs/
Allow returning fewer random clustering keys than requested since
the schema may limit the total number we can generate, for example,
if there is only one boolean clustering column.
Fixes#5161
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Calculating the select statement for given view_info structure
used to work fine, but once local indexes were introduced, a subtle
bug appeared: the legacy token column does not exist in local indexes
and a valid clustering key column was omitted instead.
That results in potentially incorrect partition slices being used later
in read-before-write.
There's a long term plan for removing select_statement from
view info altogether, but nonetheless the bug needs to be fixed first.
When investigating OOM:s a prominent pattern is a size class that is
exploded, using up most of the available memory alone. If one is lucky,
the objects causing the OOM are instances of some virtual class, making
their identification easy. Other times the objects are referenced by
instances of some virtual class, allowing their identification with some
work. However there are cases where neither these objects nor their
direct referrers are instances of virtual classes. This is the case
`scylla generate_object_graph` intends to help.
scylla generate_object_graph, like its name suggests generates the
object graph of the requested object. The object graph is a directed
graph, where vertices are objects and edges are references between them,
going from referrers to the referee. The vertices contain information,
like the address of the object, its size, whether it is a live or not
and if applies, the address and symbol name of its vtable. The edges
contain the list of offsets the referrer has references at. The
generated graph is an image, which allows the visual inspection of the
object graph, allowing the developer to notice patterns and hopefully
identify the problematic objects.
The graph is generated with the help of `graphwiz`. The command
generates `.dot` files which can be converted to images with the help of
the `dot` utility. The command can do this if the output file is one of
the supported image formats (e.g. `png`), otherwise only the `.dot` file
is generated, leaving the actual image generation to the user.
Add `--resolve` flag, which will make the command attempt to resolve the
first pointer of the found objects as a vtable pointer. If this is
successful the vtable pointer as well as the symbol name will be added
to the listing. This in particular makes backtracing continuation chains
a breeze, as the continuation object the searched one depends on can be
found at glance in the resulting listing (instead of having to manually
probe each item).
The arguments of `scylla find` are now parsed via `argparse`. While at
it, support for all the size classes supported by the underlying `find`
command were added, in addition to `w` and `g`. However the syntax of
specifying the size class to use has been changed, it now has to be
specified with the `-s|--size` command line argument, instead of passing
`-w` or `-g`.
Or in other words, the task that is the argument of the search. Example:
(gdb) scylla fiber 0x60001a305910
Starting task: (task*) 0x000060001a305910 0x0000000004aa5260 vtable for seastar::continuation<...> + 16
#0 (task*) 0x0000600016217c80 0x0000000004aa5288 vtable for seastar::continuation<...> + 16
#1 (task*) 0x000060000ac42940 0x0000000004aa2aa0 vtable for seastar::continuation<...> + 16
#2 (task*) 0x0000600023f59a50 0x0000000004ac1b30 vtable for seastar::continuation<...> + 16
This code is currently duplicated in `find_vptrs()` and
`scylla_task_histogram`. Refactor it out into a function.
The code is also improved in two ways:
* Make the search stricter, ensuring (hopefully) that indeed the
executable's text section is found, not that of the first object in
the `gdb file` listing.
* Throw an exception in the case when the search fails.
We don't want to add shared sstables to table's backlog tracker because:
1) table's backlog tracker has only an influence on regular compaction
2) shared sstables are never regular compacted, they're worked by
resharding which has its own backlog tracker.
Such sstables belong to more than one shard, meaning that currently
they're added to backlog tracker of all shards that own them.
But the thing is that such sstables ends up being resharded in shard
that may be completely random. So increasing backlog of all shards
such sstables belong to, won't lead to faster resharding. Also, table's
backlog tracker is supposed to deal only with regular compaction.
Accounting for shared sstables in table's tracker may lead to incorrect
speed up of regular compactions because the controller is not aware
that some relevant part of the backlog is due to pending resharding.
The fix is about ignoring sstables that will be resharded and let
table's backlog tracker account only for sstables that can be worked on
by regular compaction, and rely on resharding controlling itself
with its own tracker.
NOTE: this doesn't fix the resharding controlling issue completely,
as described in #4952. We'll still need to throttle regular compaction
on behalf of resharding. So subsequent work may be about:
- move resharding to its own priority class, perhaps streaming.
- make a resharding's backlog tracker accounts for sstables in all of
its pending jobs, not only the ongoing ones (currently limited to 1 by shard).
- limit compaction shares when resharding is in progress.
THIS only fixes the issue in which controller for regular compaction
shouldn't account sstables completely exclusive to resharding.
Fixes#5077.
Refs #4952.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190924022109.17400-1-raphaelsc@scylladb.com>
Incremental compaction efficiency depends on the reference of sstables
compacted being all released because the file descriptors of sstable
components are only closed once the sstable object is destructed.
Incremental compaction is not working for major compaction because a reference
to released sstables are being kept in the compaction manager, which prevents
their disk usage from being released. So the space amplification would be
the same as with a non-incremental approach, i.e. needs twice the amount of
used disk space for the table(s). With this issue fixed, the database now
becomes very major compaction friendly, the space requirement becoming very
low, a constant which is roughly number of fragments being currently compacted
multiplied by fragment size (1GB by default), for each table involved.
Fixes#5140.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20191003211927.24153-1-raphaelsc@scylladb.com>
Make sure gc'able-tombstone-only sstable is properly generated with data that
comes from regular compaction's input sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compaction prevents data resurrection from happening by checking that there's
no way a data shadowed by a GC'able tombstone will survive alone, after
a failure for example.
Consider the following scenario:
We have two runs A and B, each divided to 5 fragments, A1..A5, B1..B5.
They have the following token ranges:
A: A1=[0, 3] A2=[4, 7] A3=[8, 11] A4=[12, 15] A5=[16,18]
B is the same as A's ranges, offset by 1:
B: B1=[1,4] B2=[5,8] B3=[9,12] B4=[13,16] B5=[17,19]
Let's say we are finished flushing output until position 10 in the compaction.
We are currently working on A3 and B3, so obviously those cannot be deleted.
Because B2 overlaps with A3, we cannot delete B2 either.
Otherwise, B2 could have a GC'able tombstone that shadows data in A3, and after
B2 is gone, dead data in A3 could be resurrected *on failure*.
Now, A2 overlaps with B2 which we couldn't delete yet, so we can't delete A2.
Now A2 overlaps with B1 so we can't delete B1. And B1 overlaps with A1 so
we can't delete A1. So we can't delete any fragment.
The problem with this approach is obvious, fragments can potentially not be
released due to data dependency, so incremental compaction efficiency is
severely reduced.
To fix it, let's not purge GC'able tombstones right away in the mutation
compactor step. Instead, let's have compaction writing them to a separate
sstable run that would be deleted in the end of compaction.
By making sure that tombstone information from all compacting sstables is not
lost, we no longer need to have incremental compaction imposing lots of
restriction on which fragments could be released. Now, any sstable which data
is safe in a new sstable can be released right away. In addition, incremental
compaction will only take place if compaction procedure is working with one
multi-fragment sstable run at least.
Fixes#4531.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
* seastar 1f68be436f...e888b1df9c (8):
> sharded: Make map work with mapper that returns a future
> cmake: Remove FindBoost.cmake
> Reduce noncopyable_function instruction cache footprint
> doc: add Loops section to the tutorial
> Merge "Move file related code out of reactor" from Asias
> Merge "Move the io_queue code out of reactor" from Asias
> cmake: expose seastar_perf_testing lib
> future: class doc: explain why discarding a future is bad
- main.cc now includes new file io_queue.hh
- perf tests now include seastar perf utilities via user, not
system, includes since those are not exported
Merged patch set from Piotr Sarna:
Refs #5046
This commit adds handling "Authorization:" header in incoming requests.
The signature sent in the authorization is recomputed server-side
and compared with what the client sent. In case of a mismatch,
UnrecognizedClientException is returned.
The signature computation is based on boto3 Python implementation
and uses gnutls to compute HMAC hashes.
This series is rebased on a previous HTTPS series in order to ease
merging these two. As such, it depends on the HTTPS series being
merged first.
Tests: alternator(local, remote)
The series also comes with a simple authorization test and a docs update.
Piotr Sarna (6):
alternator: migrate split() function to string_view
alternator: add computing the auth signature
config: add alternator_enforce_authorization entry
alternator: add verifying the auth signature
alternator-test: add a basic authorization test case
docs: update alternator authorization entry
alternator-test/test_authorization.py | 34 ++++++++
configure.py | 1 +
alternator/{server.hh => auth.hh} | 22 ++---
alternator/server.hh | 3 +-
db/config.hh | 1 +
alternator/auth.cc | 88 ++++++++++++++++++++
alternator/server.cc | 112 +++++++++++++++++++++++---
db/config.cc | 1 +
main.cc | 2 +-
docs/alternator/alternator.md | 7 +-
10 files changed, 241 insertions(+), 30 deletions(-)
create mode 100644 alternator-test/test_authorization.py
copy alternator/{server.hh => auth.hh} (58%)
create mode 100644 alternator/auth.cc
Before this change, when populating non-system keyspaces, each data
directory was scanned and for each entry (keyspace directory),
a keyspace was populated. This was done in a serial fashion - populating
of one keyspace was not started until the previous one was done.
Loading keyspaces in such fashion can introduce unnecessary waiting
in case of a large number of keyspaces in one data directory. Population
process is I/O intensive and barely uses CPU.
This change enables parallel loading of keyspaces per data directory.
Populating the next keyspace does not wait for the previous one.
A benchmark was performed measuring startup time, with the following
setup:
- 1 data directory,
- 200 keyspaces,
- 2 tables in each keyspace, with the following schema:
CREATE TABLE tbl (a int, b int, c int, PRIMARY KEY(a, b))
WITH CLUSTERING ORDER BY (b DESC),
- 1024 rows in each table, with values (i, 2*i, 3*i) for i in 0..1023,
- ran on 6-core virtual machine running on i7-8750H CPU,
- compiled in dev mode,
- parameters: --smp 6 --max-io-requests 4 --developer-mode=yes
--datadir $DIR --commitlog-directory $DIR
--hints-directory $DIR --view-hints-directory $DIR
The benchmark tested:
- boot time, by comparing timestamp of the first message in log,
and timestamp of the following message:
"init - Scylla version ... initialization completed."
- keyspace population time, by comparing timestamps of messages:
"init - loading non-system sstables"
and
"init - starting view update generator"
The benchmark was run 5 times for sequential and parallel version,
with the following results:
- sequential: boot 31.620s, keyspace population 6.051s
- parallel: boot 29.966s, keyspace population 4.360s
Keyspace population time decreased by ~27.95%, and overall boot time
by about ~5.23%.
Tests: unit(release)
Fixes#2007
The signature sent in the "Authorization:" header is now verified
by computing the signature server-side with a matching secret key
and confirming that the signatures match.
Currently the secret key is hardcoded to be "whatever" in order
to work with current tests, but it should be replaced
by a proper key store.
Refs #5046
The config entry will be used to turn authorization for alternator
requests on and off. The default is currently off, since the key store
is not implemented yet.
A function for computing the auth signature from user requests
is added, along with helper functions. The implementation
is based on gnutls's HMAC.
Refs #5046
The implementation of string split was based on sstring type for
simplicity, but it turns out that more generic std::string_view
will be beneficial later to avoid unneeded string copying.
Unfortunately boost::split does not cooperate well with string views,
so a simple manual implementation is provided instead.
Schema changes can have big effects on performance, typically it should
be a rare event.
It is usefull to monitor how frequently the schema changed.
This patch adds a counter that increases each time a schema changed.
After this patch the metrics would look like:
scylla_database_schema_changed{shard="0",type="derive"} 2
Fixes#4785
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
We can use the reader::peek() to check if the reader contains any data.
If not, do not open the rpc stream connection. It helps to reduce the
port usage.
Refs: #4943
Both in a single-statement transaction and in a batch
we expect that serial consistency is provided. Move the
check to query_options class and make it available for
reuse.
Keep get_serial_consistency() around for use in
transport/server.cc.
Message-Id: <20191006154532.54856-2-kostja@scylladb.com>
When another node is reported to be down, view updates queued
for it are cancelled, but some of them may already be initiated.
Right now, cancelling such a write resulted in an exception,
but on conceptual level it's not really an exception, since
this behaviour is expected.
Previous version of this patch was based on introducing a special
exception type that was later handled specially, but it's not clear
if it's a good direction. Instead, this patch simply makes this
path non-exceptional, as was originally done by Nadav in the first
version of the series that introduced handling unstarted write
cancellations. Additionally, a message containing the information
that a write is cancelled is logged with debug level.
README.md has 3 fixes applied:
- s/alternator_tls_port/alternator_https_port
- conf directory is mentioned more explicitly
- it now correctly states that the self-signed certificate
warning *is* explicitly ignored in tests
Message-Id: <e5767f7dbea260852fc2fa9b613e1bebf490cc78.1570444085.git.sarna@scylladb.com>
"
Fixes#5134, Eviction concurrent with preempted partition entry update after
memtable flush may allow stale data to be populated into cache.
Fixes#5135, Cache reads may miss some writes if schema alter followed by a
read happened concurrently with preempted partition entry update.
Fixes#5127, Cache populating read concurrent with schema alter may use the
wrong schema version to interpret sstable data.
Fixes#5128, Reads of multi-row partitions concurrent with memtable flush may
fail or cause a node crash after schema alter.
"
* tag 'fix-cache-issues-with-schema-alter-and-eviction-v2' of github.com:tgrabiec/scylla:
tests: row_cache: Introduce test_alter_then_preempted_update_then_memtable_read
tests: row_cache_stress_test: Verify all entries are evictable at the end
tests: row_cache_stress_test: Exercise single-partition reads
tests: row_cache_stress_test: Add periodic schema alters
tests: memtable_snapshot_source: Allow changing the schema
tests: simple_schema: Prepare for schema altering
row_cache: Record upgraded schema in memtable entries during update
memtable: Extract memtable_entry::upgrade_schema()
row_cache, mvcc: Prevent locked snapshots from being evicted
row_cache: Make evict() not use invalidate_unwrapped()
mvcc: Introduce partition_snapshot::touch()
row_cache, mvcc: Do not upgrade schema of entries which are being updated
row_cache: Use the correct schema version to populate the partition entry
delegating_reader: Optimize fill_buffer()
row_cache, memtable: Use upgrade_schema()
flat_mutation_reader: Introduce upgrade_schema()
Merged patch series from Piotr Sarna:
This series adds HTTPS support for Alternator.
The series comes with --https option added to alternator-test, which makes
the test harness run all the tests with HTTPS instead of HTTP. All the tests
pass, albeit with security warnings that a self-signed x509 certificate was
used and it should not be trusted.
Fixes#5042
Refs scylladb/seastar#685
Patches:
docs: update alternator entry on HTTPS
alternator-test: suppress the "Unverified HTTPS request" warning
alternator-test: add HTTPS info to README.md
alternator-test: add HTTPS to test_describe_endpoints
alternator-test: add --https parameter
alternator: add HTTPS support
config: add alternator HTTPS port
* seastar c21a7557f9...1f68be436f (6):
> scheduling: Add per scheduling group data support
> build: Include dpdk as a single object in libseastar.a
> sharded: fix foreign_ptr's move assignment
> build: Fix DPDK libraries linking in pkg-config file
> http server: https using tls support
> Make output_stream blurb Doxygen
The BEGINS_WITH condition in conditional updates (via Expected) requires
that the given operand be either a string or a binary. Any other operand
should result in a validation exception - not a failed condition as we
generate now.
This patch fixes the test for this case so it will succeed against
Amazon DynamoDB (before this patch it fails - this failure was masked by
a typo before commit 332ffa77ea). The patch
then fixes our code to handle this case correctly.
Note that BEGINS_WITH handling of wrong types is now asymmetrical: A bad
type in the operand is now handled differently from a bad type in the
attribute's value. We add another check to the test to verify that this
is the case.
Fixes#5141
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191006080553.4135-1-nyh@scylladb.com>
When debugging one constantly has to inspect object for which only
a "virtual pointer" is available, that is a pointer that points to a
common parent class or interface.
Finding the concrete type and downcasting the pointer is easy enough but
why do it manually when it is possible to automate it trivially?
$downcast_vptr() returns any virtual pointer given to it, casted to the
actual concrete object.
Exlample:
(gdb) p $1
$2 = (flat_mutation_reader::impl *) 0x60b03363b900
(gdb) p $downcast_vptr(0x60b03363b900)
$3 = (combined_mutation_reader *) 0x60b03363b900
# The return value can also be dereferenced on the spot.
(gdb) p *$downcast_vptr($1)
$4 = {<flat_mutation_reader::impl> = {_vptr.impl = 0x46a3ea8 <vtable
for combined_mutation_reader+16>, _buffer = {_impl = {<std::al...
Dereferencing an `seastar::lw_shared_ptr` is another tedious manual
task. The stored pointer (`_p`) has to be casted to the right subclass
of `lw_shared_ptr_counter_base`, which involves inspecting the code,
then make writing a cast expression that gdb is willing to parse. This
is something machines are so much better at doing.
`$dereference_lw_shared_ptr` returns a pointer to the actual pointed-to
object, given an instance of `seastar::lw_shared_ptr`.
Example:
(gdb) p $1._read_context
$2 = {_p = 0x60b00b068600}
(gdb) p $dereference_lw_shared_ptr($1._read_context)
$3 = {<seastar::enable_lw_shared_from_this<cache::read_context>>
= {<seastar::lw_shared_ptr_counter_base> = {_count = 1}, ...
Make all the parameters of the sampling tweakable via command line
arguments. I strived to keep full backward compatibility, but due to the
limitations of `argparse` there is one "breaking" change. The optional
positional size argument is now a non-positional argument as `argparse`
doesn't support optional positional arguments.
Added documentation for both the command itself as well as for all the
arguments.
make_single_key_reader() currently doesn't actually create
single-partition readers because it doesn't set
mutation_reader::forwarding::no when it creates individual
readers. The readers will default to mutation_reader::forwarding::yes
and actually create scanning readers in preparation for
fast-forwarding across partitions.
Fix by passing mutation_reader::forwarding::no.
Currently, methods of simple_schema assume that table's schema doesn't
change. Accessors like get_value() assume that rows were generated
using simple_schema::_s. Because if that, the column_definition& for
the "v" column is cached in the instance. That column_definiion&
cannot be used to access objects created with a different schema
version. To allow using simple_schema after schema changes,
column_definition& caching is now tagged with the table schema version
of origin. Methods which access schema-dependent objects, like
get_value(), are now accepting schema& corresponding to the objects.
Also, it's now possible to tell simple_schema to use a different
schema version in its generator methods.
Cache update may defer in the middle of moving of partition entry
from a flushed memtable to the cache. If the schema was changed since
the entry was written, it upgrades the schema of the partition_entry
first but doesn't update the schema_ptr in memtable_entry. The entry
is removed from the memtable afterward. If a memtable reader
encounters such an entry, it will try to upgrade it assuming it's
still at the old schema.
That is undefined behavior in general, which may include:
- read failures due to bad_alloc, if fixed-size cells are interpreted
as variable-sized cells, and we misinterpret a value for a huge
size
- wrong read results
- node crash
This doesn't result in a permanent corruption, restarting the node
should help.
It's the more likely to happen the more rows there are in a
partition. It's unlikely to happen with single-row partitions.
Introduced in 70c7277.
Fixes#5128.
If the whole partition entry is evicted while being updated from the
memtable, a subsequent read may populate the partition using the old
version of data if it attempts to do it before cache update advances
past that partition. Partial eviction is not affected because
populating reads will notice that there is a newer snapshot
corresponding to the updater.
This can happen only in OOM situations where the whole cache gets evicted.
Affects only tables with multi-row partitions, which are the only ones
that can experience the update of partition entry being preempted.
Introduced in 70c7277.
Fixes#5134.
invalidate_unwrapped() calls cache_entry::evict(), which cannot be
called concurrently with cache update. invalidate() serializes it
properly by calling do_update(), but evict() doesn't. The purpose of
evict() is to stress eviction in tests, which can happen concurrently
with cache update. Switch it to use memory reclaimer, so that it's
both correct and more realistic.
evict() is used only in tests.
When a read enters a partition entry in the cache, it first upgrades
it to the current schema of the cache. The same happens when an entry
is updated after a memtable flush. Upgrading the entry is currently
performed by squashing all versions and replacing them with a single
upgraded version. That has a side effect of detaching all snapshots
from the partition entry. Partition entry update on memtable flush is
writing into a snapshot. If that snapshot is detached by a schema
upgrade, the entry will be missing writes from the memtable which fall
into continuous ranges in that entry which have not yet been updated.
This can happen only if the update of the entry is preempted and the
schema was altered during that, and a read hit that partition before
the update went past it.
Affects only tables with multi-row partitions, which are the only ones
that can experience the update of partition entry being preempted.
The problem is fixed by locking updated entries and not upgrading
schema of locked entries. cache_entry::read() is prepared for this,
and will upgrade on-the-fly to the cache's schema.
Fixes#5135
The sstable reader which populates the partition entry in the cache is
using the schema of the partition entry snapshot, which will be the
schema of the cache at the time the partition was entered. If there
was a schema change after the cache reader entered the partition but
before it created the sstable reader, the cache populating reader will
interpret sstable fragments using the wrong schema version. That is
more likely if partitions have many rows, and the front of the
partition is populated. With single-row partitions that's unlikely to
happen.
That is undefined behavior in general, which may include:
- read failures due to bad_alloc, if fixed-size cells are
interpreted as variable-sized cells, and we misinterpret
a value for a huge size
- wrong read results
- node crash
This doesn't result in a permanent corruption, restarting the node
should help.
Fixes#5127.
Use move_buffer_content_to() which is faster than fill_buffer_from()
because it doesn't involve popping and pushing the fragments across
buffers. We save on size estimation costs.
Running with --https and a self-signed certificate results in a flood
of expected warnings, that the connection is not to be trusted.
These warnings are silenced, as users runing a local test with --https
usually use self-signed certificates.
The test_describe_endpoints test spawns another client connection
to the cluster, so it needs to be HTTPS-aware in order to work properly
with --https parameter.
Running with --https parameter will result in sending the requests
via HTTPS instead of HTTP. By default, port 8043 is used for a local
cluster. Before running pytest --https, make sure that Scylla
was properly configured to initialize a HTTPS alternator server
by providing the alternator_tls_port parameter.
The HTTPS-based connection runs with verification disabled,
otherwise it would not work with self-signed certificates,
which are useful for tests.
By providing a server based on a TLS socket, it's now possible
to serve HTTPS requests in alternator. The HTTPS server is enabled
by setting its port in scylla.yaml: alternator_tls_port=XXXX.
Alternator TLS relies on the existing TLS configuration,
which is provided by certificate, keyfile, truststore, priority_string
options.
Fixes#5042
The test test_update_expression_function_nesting() fails because DynamoDB
don't allow an expression like list_append(list_append(:val1, :val2), :val3)
but Alternator doesn't check for this (and supports this expression).
The "xfail" message was outdated, suggesting that the test fails because
the "SET" expression isn't supported - but it is. So replace the message
by a more accurate one.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190915104708.30471-1-nyh@scylladb.com>
Merged patch set from Dejan Mircevski implementing some of the
missing operators for Expected: NE, IN, NULL and NOT_NULL.
Patches:
alternator: Factor out Expected operand checks
alternator: Implement NOT_NULL operator in Expected
alternator: Implement NULL operator in Expected
alternator: Fix expected_1_null testcase
alternator: Implement IN operator in Expected
alternator: Implement NE operator in Expected
alternator: Factor out common code in Expected
Frozen empty lists/map/sets are not equal to null value,
whil multi-cell empty lists/map/sets are equal to null values.
Return a NULL value for an empty multi-cell set or list
if we know the receiver is not frozen - this makes it
easy to compare the parameter with the receiver.
Add a test case for inserting an empty list or set
- the result is indistinguishable from NULL value.
Message-Id: <20191003092157.92294-2-kostja@scylladb.com>
"
Fix races that may lead to use-after-free events and file system level exceptions
during shutdown and drain.
The root cause of use-after-free events in question is that space_watchdog blocks on
end_point_hints_manager::file_update_mutex() and we need to make sure this mutex is alive as long as
it's accessed even if the corresponding end_point_hints_manager instance
is destroyed in the context of manager::drain_for().
File system exceptions may occur when space_watchdog attempts to scan a
directory while it's being deleted from the drain_for() context.
In case of such an exception new hints generation is going to be blocked
- including for materialized views, till the next space_watchdog round (in 1s).
Issues that are fixed are #4685 and #4836.
Tested as follows:
1) Patched the code in order to trigger the race with (a lot) higher
probability and running slightly modified hinted handoff replace
dtest with a debug binary for 100 times. Side effect of this
testing was discovering of #4836.
2) Using the same patch as above tested that there are no crashes and
nodes survive stop/start sequences (they were not without this series)
in the context of all hinted handoff dtests. Ran the whole set of
tests with dev binary for 10 times.
"
* 'hinted_handoff_race_between_drain_for_and_space_watchdog_no_global_lock-v2' of https://github.com/vladzcloudius/scylla:
hinted handoff: fix a race on a directory removal between space_watchdog and drain_for()
hinted handoff: make taking file_update_mutex safe
db::hints::manager::drain_for(): fix alignment
db::hints::manager: serialize calls to drain_for()
db::hints: cosmetics: identation and missing method qualifier
The operation after gate.enter() in tracker::start() can fail and throw,
we should call gate.leave() in such case to avoid unbalanced enter and
leave calls. tracker::done() has similar issue too.
Fix it by removing the gate enter and leave logic in tracker start and
done. A helper tracker::run() is introduced to take care of the gate and
repair status.
In addition, the error log is improved. It now logs exceptions on all
shards in the summary. e.g.,
[shard 0] repair - repair id 1 failed: std::runtime_error
({shard 0: std::runtime_error (error0), shard 1: std::runtime_error (error1)})
Fixes#5074
Currently, the population stat is not increased for entries that are
evicted immediately on insert, however the code that does the eviction
still decreases the population stat, leading to an imbalance and in some
cases the underflow of the population stat. To fix, unconditionally
increase the population stat upon inserting an entry, regardless of
whether it is immediately evicted or not.
Fixes: #5123
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191001153215.82997-1-bdenes@scylladb.com>
Put all AttributeValuelist size verification under
verify_operand_count(), rather than have some cases invoke
verify_operand_count() while others verify it in check_*() functions.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Add check_IN() and a switch case that invokes it. Reactivate IN
tests. Add a testcase for non-scalar attribute values.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Recognize "NE" as a new operator type, add check_NE() function, invoke
it in verify_expected_one(), and reactivate NE tests.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Operand-count verification will be repeated a lot as more operators
are implemented, so factor it out into verify_operand_count().
Also move `got` null checks to check_* functions, which reduces
duplication at call sites.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
While a managed_ref emulates a reference more closely than it does
a pointer, it is still nullable, so add a get() (similar to
unique_ptr::get()) that can be nullptr if the reference is null.
The immediate use will be mutation_partition::_static_row, which
is often empty and takes up about 10% of a cache entry.
The example Python code had wrong indentation, and wouldn't actually
work if naively copy-pasted. Noticed by Noam Hasson.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190929091440.28042-1-nyh@scylladb.com>
"
This is a collection of assorted patches that will be needed for LWT.
Most of them are trivial, but one touches a lot of files, so have a
good chance to cause rebase headache (I already had to rebase it on
top of Alternator). Lets push them earlier instead of carrying them in
the lwt branch.
"
* 'gleb/lwt-prepare-v2' of github.com:scylladb/seastar-dev:
lwt: make _last_timestamp_micros static
lwt: Add client_state::get_timestamp_for_paxos() function
lwt: Pass client_state reference all the way to storage_proxy::query
exceptions: Add a constructor for unavailable_exception that allows providing a custom message
serializer: Add std::variant support
lwt: Add missing functions to utils/UUID_gen.hh
"
This is the second version of the patch series. The previous one was just the second patch, this one adds more tests an another patch to make it easier to test that the new code has the same behavior as the old one.
"
* 'espindola/overflow-is-intentional' of https://github.com/espindola/scylla:
types: Simplify and explain from_varint_to_integer
Add more cast tests
Affects single-partition reads only.
Refs #5113
When executing a query on the replica we do several things in order to
narrow down the sstable set we read from.
For tables which use LeveledCompactionStrategy, we store sstables in
an interval set and we select only sstables whose partition ranges
overlap with the queried range. Other compaction strategies don't
organize the sstables and will select all sstables at this stage. The
reasoning behind this is that for non-LCS compaction strategies the
sstables' ranges will typically overlap and using interval sets in
this case would not be effective and would result in quadratic (in
sstable count) memory consumption.
The assumption for overlap does not hold if the sstables come from
repair or streaming, which generates non-overlapping sstables.
At a later stage, for single-partition queries, we use the sstables'
bloom filter (kept in memory) to drop sstables which surely don't
contain given partition. Then we proceed to sstable indexes to narrow
down the data file range.
Tables which don't use LCS will do unnecessary I/O to read index pages
for single-partition reads if the partition is outside of the
sstable's range and the bloom filter is ineffective (Refs #5112).
This patch fixes the problem by consulting sstable's partition range
in addition to the bloom filter, so that the non-overlapping sstables
will be filtered out with certainty and not depend on bloom filter's
efficiency.
It's also faster to drop sstables based on the keys than the bloom
filter.
Tests:
- unit (dev)
- manual using cqlsh
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190927122505.21932-1-tgrabiec@scylladb.com>
The method sstable::estimated_keys_for_range() was severely
under-estimating the number of partitions in an sstable for a given
token range.
The first reason is that it underestimated the number of sstable index
pages covered by the range, by one. In extreme, if the requested range
falls into a single index page, we will assume 0 pages, and report 1
partition. The reason is that we were using
get_sample_indexes_for_range(), which returns entries with the keys
falling into the range, not entries for pages which may contain the
keys.
A single page can have a lot of partitions though. By default, there
is a 1:20000 ratio between summary entry size and the data file size
covered by it. If partitions are small, that can be many hundreds of
partitions.
Another reason is that we underestimate the number of partitions in an
index page. We multiply the number of pages by:
(downsampling::BASE_SAMPLING_LEVEL * _components->summary.header.min_index_interval)
/ _components->summary.header.sampling_level
Using defaults, that means multiplying by 128. In the cassandra-stress
workload a single partition takes about 300 bytes in the data file and
summary entry is 22 bytes. That means a single page covers 22 * 20'000
= 440'000 bytes of the data file, which contains about 1'466
partitions. So we underestimate by an order of magnitude.
Underestimating the number of partitions will result in too small
bloom filters being generated for the sstables which are the output of
repair or streaming. This will make the bloom filters ineffective
which results in reads selecting more sstables than necessary.
The fix is to base the estimation on the number of index pages which
may contain keys for the range, and multiply that by the average key
count per index page.
Fixes#5112.
Refs #4994.
The output of test_key_count_estimation:
Before:
count = 10000
est = 10112
est([-inf; +inf]) = 512
est([0; 0]) = 128
est([0; 63]) = 128
est([0; 255]) = 128
est([0; 511]) = 128
est([0; 1023]) = 128
est([0; 4095]) = 256
est([0; 9999]) = 512
est([5000; 5000]) = 1
est([5000; 5063]) = 1
est([5000; 5255]) = 1
est([5000; 5511]) = 1
est([5000; 6023]) = 128
est([5000; 9095]) = 256
est([5000; 9999]) = 256
est(non-overlapping to the left) = 1
est(non-overlapping to the right) = 1
After:
count = 10000
est = 10112
est([-inf; +inf]) = 10112
est([0; 0]) = 2528
est([0; 63]) = 2528
est([0; 255]) = 2528
est([0; 511]) = 2528
est([0; 1023]) = 2528
est([0; 4095]) = 5056
est([0; 9999]) = 10112
est([5000; 5000]) = 2528
est([5000; 5063]) = 2528
est([5000; 5255]) = 2528
est([5000; 5511]) = 2528
est([5000; 6023]) = 5056
est([5000; 9095]) = 7584
est([5000; 9999]) = 7584
est(non-overlapping to the left) = 0
est(non-overlapping to the right) = 0
Tests:
- unit (dev)
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190927141339.31315-1-tgrabiec@scylladb.com>
`dbuild` was recently (24c732057) updated to run in interactive mode
when given no arguments; we can now update the README to mention that.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
When the toppartitions operation gathers results, it copies partition
keys with their schema_ptr:s. When these schema_ptr:s are copies
or destroyed, they can cause leaks or premature frees of the schema
in its original shard since reference count operations in are not atomic.
Fix that by converting the schema_ptr to a global_schema_ptr during
transportation.
Fixes#5104 (direct bug)
Fixes#5018 (schema prematurely freed, toppartitions previously executed on that node)
Fixes#4973 (corrupted memory pool of the same size class as schema, toppartitions previously executed on that node)
Tests: new test added that fails with the existing code in debug mode,
manual toppartitions test
Copying schema_ptrs across shards results in memory corruption since
lw_shared_ptr does not use atomic operations for reference counts.
Prevent that by converting schema_ptr:s to global_schema_ptr:s before
shipping them across shards in the map operation, and converting them
back to local schema_ptr:s in the reduce operation.
This allows keys from different stages in the schema's like to compare equal.
This is safe since the partition key cannot change, unlike the rest of the schema.
More importantly, it will allow us to compare keys made local after a pass through
global_schema_ptr, which does not guarantee that the schema_ptr conversion will be
the same even when starting with the same global_schema_ptr.
Throwing move constructors are a a pain; so we should try to make
them noexcept. Currently, global_schema_ptr's move constructor
throws an exception if used illegaly (moving from a different shard);
this patch changes it to an assert, on the grounds that this error
is impossible to recover from.
The direct motivation for the patch is the desire to store objects
containing a global_schema_ptr in a chunked_vector, to move lists
of partition keys across shards for the topppartitions functionality.
chunked_vector currently requires noexcept move constructors for its
value_type.
When a user type changes we were not recreating other uses types that
use it. This patch series fixes that and makes it clear which code is
responsible for it.
In the system.types table a user type refers to another by name. When
a user type is modified, only its entry in the table is changed.
At runtime a user type has direct pointer to the types it uses. To
handle the discrepancy we need to recreate any dependent types when a
entry in system.types changes.
Fixes#5049
If each client_state has its own copy of the variable two clients may
generate timestamps that clash and needlessly create contention. Making
the variable shared between all client_state on the same shard will make
sure this will not happen to two clients on the same shard. It may still
happen for two client on two different shards or two different nodes.
Paxos needs a unique timestamp that is greater than some other
timestamp, so that the next round had more chances to succeed.
Add a function that returns such a timestamp.
client_state holds a state to generate monotonically increasing unique
timestamp. Queries with a SERIAL consistency level need it to generate
a paxos round.
In the system.types table a user type refers to another by name. When
a user type is modified, only its entry in the table is changed.
At runtime a user type has direct pointer to the types it uses. To
handle the discrepancy we need to recreate any dependent types when a
entry in system.types changes.
Fixes#5049
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The way schema changes propagate is by editing the system tables and
comparing the before and after state.
When a user type A uses another user type B and we modify B, the
representation of A in the system table doesn't change, so this code
was not producing any changes on the diff that the receiving side
uses.
Deleting it makes it clear that it is the receiver's responsibility to
handle dependent user types.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this patch db::cql_type_parser::raw_builder creates a local copy
of the list of existing types and uses that internally. By doing that
build() should have no observable behavior other than returning the
new types.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We were never passing a null pointer and never saving a copy of the
lw_shared_ptr. Passing a reference is more flexible as not all callers
are required to hold the user_types_metadata in a lw_shared_ptr.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
* seastar b56a8c5045...c21a7557f9 (3):
> net: socket::{set,get}_reuseaddr() should not be virtual
> iotune: print verbose message in case of shutdown errors
> iotune: close test file on shutdown
Fixes#4946.
1. Add assert in remove_response_handler to make crashes like in #5032 easier to understand.
2. Lookup the view_update_write_response_handler id before calling timeout_cb and tolerate it not found.
Just log a warning if this happened.
Fixes#5032
"
Currently affects only counter tables.
Introduced in 27014a2.
mutation_partition(s, mp) is incorrect because it uses s to interpret
mp, while it should use mp_schema.
We may hit this if the current node has a newer schema than the
incoming mutation. This can happen during table schema altering when we receive the
mutation from a node which hasn't processed the schema change yet.
This is undefined behavior in general. If the alter was adding or
removing columns, this may result in corruption of the write where
values of one column are inserted into a different column.
Fixes#5095.
"
* 'fix-schema-alter-counter-tables' of https://github.com/tgrabiec/scylla:
mvcc: Fix incorrect schema verison being used to copy the mutation when applying
mutation_partition: Track and validate schema version in debug builds
tests: Use the correct schema to access mutation_partition
Currently affects only counter tables.
Introduced in 27014a2.
mutation_partition(s, mp) is incorrect, because it uses s to interpret
mp, while it should use mp_schema.
We may hit this if the current node has a newer schema than the
incoming mutation. This can happen during alter when we receive the
mutation from a node which hasn't processed the schema change yet.
This is undefined behavior in general. If the alter was adding or
removing columns, this may result in corruption of the write where
values of one column are inserted into a different column.
Fixes#5095.
This patch makes mutation_partition validate the invariant that it's
supposed to be accessed only with the schema version which it conforms
to.
Refs #5095
* seastar e51a1a8ed9...b56a8c5045 (3):
> net: add support for UNIX-domain sockets
> future: Warn on promise::set_exception with no corresponding future or task
> Merge "Handle exceptions in repeat_until_value and misc cleanups" from Rafael
Handle a race where a write handler is removed from _response_handlers
but not yet from _view_update_handlers_list.
Fixes#5032
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor remove_response_handler_entry out of remove_response_handler,
to be called on a valid iterator found by _response_handlers.find(id).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Help identify cases like seen in #5032 where the handler id
wasn't found from the on_down -> timeout_cb path.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
compaction_manager::perform_sstable_upgrade() fails when it feeds
compaction mechanism with shared sstables. Shared sstables should
be ignored when performing upgrade and so wait for reshard to pick
them up in parallel. Whenever a shared sstable is brought up either
on restart or via refresh, reshard procedure kicks in.
Reshard picks the highest supported format so the upgrade for
shared sstable will naturally take place.
Fixes#5056.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190925042414.4330-1-raphaelsc@scylladb.com>
- Update the outdated comments in do_stop_gossiping. It was
storage_service not storage_proxy that used the lock. More
importantly, storage_service does not use it any more.
- Drop the unused timer_callback_lock and timer_callback_unlock API
- Use with_semaphore to make sure the semaphore usage is balanced.
- Add log in gossiper::do_stop_gossiping when it tries to take the
semaphore to help debug hang during the shutdown.
Refs: #4891
Refs: #4971
A documentation file that is intended to be a place for anything
debugging related: getting started tutorial, tips and tricks and
advanced guides.
For now it contains a short introductions, some selected links to
more in-depth documentation and some trips and tricks that I could think
off the top of my head.
One of those tricks describes how to load cores obtained from
relocatable packages inside the `dbuild` container. I originally
intended to add that to `tools/toolchain/README.md` but was convinced
that `docs/debugging.md` would be a better place for this.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190924133110.15069-1-bdenes@scylladb.com>
Recently we have seen a case where the population stat of the cache was
corrupt, either due to misaccounting or some more serious corruption.
When debugging something like that it would have been useful to know how
many items have been inserted to the cache. I also believe that such a
counter could be useful generally as well.
Refs: #4918
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190924083429.43038-1-bdenes@scylladb.com>
"
We observed an abort on bad_alloc which was not caused by real OOM,
but could be explained by cache region being locked from a different
shard, which is not allowed, concurrently with memory reclamation.
It's impossible now to prove this, or, if that was indeed the case, to
determine which code path was attempting such lock. This patch adds an
assert which would catch such incorrect locking at the attempt.
Refs #4978
Tests:
- unit (dev, release, debug)
"
* 'assert-no-xshard-lsa-locking' of https://github.com/tgrabiec/scylla:
lsa: Assert no cross-shard region locking
tests: Make managed_vector_test a seastar test
* seastar 2a526bb120...e51a1a8ed9 (2):
> rpc: introduce rpc::tuple as a way to move away from variadic future
> shared_future: don't warn on broken futures
Make it easier for the IDE to resolve references to the seastar
namespace. In any case include files should be stand-alone and not
depend on previously included files.
The build directory is meaningless, since it is typically some
directory in a continuous integration server. That means someone
debugging the relocatable package needs to issue the gdb command
'set substitute-path' with the correct arguments, or they lose
source debugging. Doing so in the relocatable package build saves
this step.
The default build is not modified, since a typical local build
benefits from having the paths hardcoded, as the debugger will
find the sources automatically.
We observed an abort on bad_alloc which was not caused by real OOM,
but could be explained by cache region being locked from a different
shard, which is not allowed, concurrently with memory reclamation.
It's impossible now to prove this, or, if that was indeed the case, to
determine which code path was attempting such lock. This patch adds an
assert which would catch such incorrect locking at the attempt.
Refs #4978
LCS demotes a SSTable from a given level when it thinks that level is inactive.
Inactive level means N rounds (compaction attempt) without any activity in it,
in other words, no SSTable has been promoted to it.
The problem happens because the metadata that tracks inactiveness of each level
can be incorrectly updated when there's an ongoing compaction. LCS has parallel
compaction disabled. So if a table finds itself running a long operation like
cleanup that blocks minor compaction, LCS could incorrectly think that many
levels need demotion, and by the time cleanup finishes, some demotions would
incorrectly take place.
This problem is fixed by only updating the counter that tracks inactiveness
when compaction completes, so it's not incorrectly updated when there's an
ongoing compaction for the table.
Fixes#4919.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190917235708.8131-1-raphaelsc@scylladb.com>
A recent fix to #3767 limited the amount of ranges that
can return from query_ranges_to_vnodes_generator. This with
the combination of a large amount of token ranges can lead to
an infinite recursion. The algorithm multiplies by factor of
2 (actualy a shift left by one) the amount of requested
tokens in each recursion iteration. As long as the requested
number of ranges is greater than 0, the recursion is implicit,
and each call is scheduled separately since the call is inside
a continuation of a map reduce.
But if the amount of iterations is large enough (~32) the
counter for requested ranges zeros out and from that moment on
two things will happen:
1. The counter will remain 0 forever (0*2 == 0)
2. The map reduce future will be immediately available and this
will result in the continuation being invoked immediately.
The latter causes the recursive call to be a "regular" recursive call
thus, through the stack and not the task queue of the scheduler, and
the former causes this recursion to be infinite.
The combination creates a stack that keeps growing and eventually
overflows resulting in undefined behavior (due to memory overrun).
This patch prevent the problem from happening, it limits the growth of
the concurrency counter beyond twice the last amount of tokens returned
by the query_ranges_to_vnodes_generator.And also makes sure it is not
get stuck at zero.
Testing: * Unit test in dev mode.
* Modified add 50 dtest that reproduce the problem
Fixes#4944
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190922072838.14957-1-eliransin@scylladb.com>
Before this patch, if the _gate is closed, with_gate throws and
forward_to is not executed. When the promise<> p is destroyed it marks
its _task as a broken promise.
What happens next depends on the branch.
On master, we warn when the shared_future is destroyed, so this patch
changes the warning from a broken_promise to a gate closed.
On 3.1, we warn when the promises in shared_future::_peers are
destroyed since they no longer have a future attached: The future that
was attached was the "auto f" just before the with_gate call, and it
is destroyed when with_gate throws. The net result is that this patch
fixes the warning in 3.1.
I will send a patch to seastar to make the warning on master more
consistent with the warning in 3.1.
Fixes#4394
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190917211915.117252-1-espindola@scylladb.com>
Scylla currently crashes if we run manual operations like nodetool
compact with the controller disabled. While we neither like nor
recommend running with the controller disabled, due to some corner cases
in the controller algorithm we are not yet at the point in which we can
deprecate this and are sometimes forced to disable it.
The reason for the crash is that manual operations will invoke
_backlog_of_shares, which returns what is the backlog needed to
create a certain number of shares. That scan the existing control
points, but when we run without the controller there are no control
points and we crash.
Backlog doesn't matter if the controller is disabled, and the return
value of this function will be immaterial in this case. So to avoid the
crash, we return something right away if the controller is disabled.
Fixes#5016
Signed-off-by: Glauber Costa <glauber@scylladb.com>
gdb searches for libthread_db.so using its canonical name of libthread_db.so.1 rather
than the file name of libthread_db-1.0.so, so use that name to store the file in the
archive.
Fixes#4996.
* seastar b3fb4aaab3...84d8e9fe9b (8):
> Use aio fsync if available
> Merge "fix some tcp connection bugs and add reuseaddr option to a client socket" from Gleb
> lz4: use LZ4_decompress_safe
> reactor: document seastar::remove_file()
> core/file.hh: remove redundant std::move()
> core/{file,sstring}: do not add `const` to return value
> http/api_docs: always call parent constructor
> Add input_stream blurb
Currently, if updating bookkeeping operations for view building fails,
we log the error message and continue. However, during shutdown,
some errors are more likely to happen due to existing issues
like #4384. To differentiate actual errors from semi-expected
errors during shutdown, the latter are now logged with a warning
level instead of error.
Fixes#4954
Shutdown routines are usually implemented via the deferred_action
mechanism, which runs a function in its destructor. We thus expect
the function to be noexcept, but unfortunately it's not always
the case. Throwing in the destructor results in terminating the program
anyway, but before we do that, the exception can be logged so it's
easier to investigate and pinpoint the issue.
Example output before the patch:
INFO 2019-09-10 12:49:05,858 [shard 0] view - Stopping view builder
terminate called without an active exception
Aborting on shard 0.
Backtrace:
0x000000000184a9ad
(...)
Example output after the patch:
INFO 2019-09-10 12:49:05,858 [shard 0] view - Stopping view builder
ERROR 2019-09-10 12:49:05,858 [shard 0] init - Unexpected error on shutdown: std::runtime_error (Hello there!)
terminate called without an active exception
Aborting on shard 0.
Backtrace:
0x000000000184a9ad
(...)
This simplifies the implementation of from_varint_to_integer and
avoids using the fact that a static_cast from cpp_int to uint64_t
seems to just keep the low 64 bits.
The boost release notes
(https://www.boost.org/users/history/version_1_67_0.html) implies that
the conversion function should return the maximum value a uint64_t can
hold if the original value is too large.
The idea of using a & with ~0 is a suggestion from the boost release
notes.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Update current results dictionary using the Metric.discover method.
New results are added and missing results are marked as absent.
(Both full metrics or specific keys)
Previously, with prometheous, each metric.update called query_list
resulting in O(n^2) when all metric were updated, like in the scylla_top
dtest - causing test timeout when testing debug build.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Commit log replay was bypassing memtable space back-pressure, and if
replay was faster than memtable flush, it could lead to OOM.
The fix is to call database::apply_in_memory() instead of
table::apply(). The former blocks when memtable space is full.
Fixes#4982.
Tests:
- unit (release)
- manual, replay with memtable flush failin and without failing
Message-Id: <1568381952-26256-1-git-send-email-tgrabiec@scylladb.com>
If the user supplies the 'replication_factor' to the 'NetworkTopologyStrategy' class,
it will expand into a replication factor for each existing DC for their convenience.
Resolves#4210.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
This reverts commit 7f64a6ec4b.
Fixes#5011
The reverted commit exposes #3760 for all schemas, not only those
which have UDTs.
The problem is that table schema deserialization now requires keyspace
to be present. If the replica hasn't received schema changes which
introduce the keyspace yet, the write will fail.
Mention on the top-level README.md that Scylla by default is compatible
with Cassandra, but also has experimental support for DynamoDB's API.
Provide links to alternator/alternator.md and alternator/getting-started.md
with more information about this feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190911080913.10141-1-nyh@scylladb.com>
"
In this patch set, written by Piotr Sarna and myself, we add Alternator - a new
Scylla feature adding compatibility with the API of Amazon DynamoDB(TM).
DynamoDB's API uses JSON-encoded requests and responses which are sent over
an HTTP or HTTPS transport. It is described in detail on Amazon's site:
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/
Our goal is that any application written to use Amazon DynamoDB could
be run, unmodified, against Scylla with Alternator enabled. However, at this
stage the Alternator implementation is incomplete, and some of DynamoDB's
API features are not yet supported. The extent of Alternator's compatibility
with DynamoDB is described in the document docs/alternator/alternator.md
included in this patch set. The same document also describes Alternator's
design (and also points to a longer design document).
By default, Scylla continues to listen only to Cassandra API requests and not
DynamoDB API requests. To enable DynamoDB-API compatibility, you must set
the alternator-port configuration option (via command line or YAML) to the port on
which you wish to listen for DynamoDB API requests. For more information, see
docs/alternator/alternator.md. The document docs/alternator/getting-started.md
also contains some examples of how to get started with Alternator.
"
* 'alternator' of https://github.com/nyh/scylla: (272 commits)
Added comments about DAX, monitoring and more
alternator: fix usage of client_state
alternator-test: complete test_expected.py for rest of comparison operators
alternator-test: reproduce bug in Expected with EQ of set value
alternator: implement the Expected request parameter
alternator: add returning PAY_PER_REQUEST billing mode
alternator: update docs/alternator.md on GSI/LSI situation
Alternator: Add getting started document for alternator
move alternator.md to its own directory
alternator-test: add xfail test for GSI with 2 regular columns
alternator/executor.cc: Latencies should use steady_clock
alternator-test: fix LSI tests
alternator-test: fix test_describe_endpoints.py for AWS run
alternator-test: test_describe_endpoints.py without configuring AWS
alternator: run local tests without configuring AWS
alternator-test: add LSI tests
alternator-test: bump create table time limit to 200s
alternator: add basic LSI support
alternator: rename reserved column name "attrs"
alternator: migrate make_map_element_restriction to string view
...
This patch adds tests for all the missing comparion operators in the
Expected parameter (the old-style parameter for conditional operations).
All these new tests are now xfailing on Alternator (and succeeding on
DynamoDB), because these operators are not yet implemented in Alternator
(we only implemented EQ and BEGINS_WITH, so far - the rest are easy but
need to be implemented).
The test_expected.py is now hopefully comprehensive, covering the entire
feature set of the "Expected" parameter and all its various cases and
subcases.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190910092208.23461-1-nyh@scylladb.com>
Our implementation of the "EQ" operator in Expected (conditional
operation) just compares the JSON represntation of the values.
This is almost always correct, but unfortunately incorrect for
sets - where we can have two equal sets despite having a
different order.
This patch just adds an (xfailing) test for this bug.
The bug itself can be fixed in the future in one of several ways
including changing the implementation of EQ, or changing the
serialization of sets so they'll always be sorted in the same
way.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190909125147.16484-1-nyh@scylladb.com>
In this patch we implement the Expected parameter for the UpdateItem,
PutItem and DeleteItem operations. This parameter allows a conditional
update - i.e., do an update only if the existing value of the item
matches some condition.
This is the older form of conditional updates, but is still used by many
applications, including Amazon's Tic-Tac-Toe demo.
As usual, we do not yet provide isolation guarantees for read-modify-write
operations - the item is simply read before the modification, and there is
no protection against concurrent operation. This will of course need to be
addressed in the future.
The Expected parameter has a relatively large number of variations, and most
of them are supported by this code, except that currenly only two comparison
operators are supported (EQ and BEGINS_WITH) out of the 13 listed in the
documentation. The rest will be implemented later.
This patch also includes comprehensive tests for the Expected feature.
These tests are almost exhaustive, except for one missing part (labled FIXME) -
among the 13 comparison operations, the tests only check the EQ and BEGINS_WITH
operators. We'll later need to add checks to the rest of them as well.
As usual, all the tests pass on Amazon DynamoDB, and after this patch all
of them succeed on Alternator too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190905125558.29133-1-nyh@scylladb.com>
In order for Spark jobs to work correctly, a hardcoded PAY_PER_REQUEST
billing mode entry is returned when describing a table with
a DescribeTable request.
Also, one test case in test_describe_table.py is no longer marked XFAIL.
Message-Id: <a4e6d02788d8be48b389045e6ff8c1628240197c.1567688894.git.sarna@scylladb.com>
This patch adds a getting started document for alternator,
it explains how to start up a cluster that has an alternator
API port open and how to test that it works using either an
application or some simple and minimal python scripts.
The goal of the document is to get a user to have an up and
running docker based cluster with alternator support in the
shortest time possible.
As part of trying to make alternator more accessible
to users, we expect more documents to be created so
it seems like a good idea to give all of the alternator
docs their own directory.
When updating the second regular base column that is also a view
key, the code in Scylla will assume it only needs to update an entry
instead of replacing an old one. This leads to inconsitencies
exposed in the test case.
Message-Id: <5dfeb9f61f986daa6e480e9da4c7aabb5a09a4ec.1567599461.git.sarna@scylladb.com>
LSI tests are amended, so they no longer needlessly XPASS:
* two xpassing tests are no longer marked XFAIL
* there's an additional test for partial projection
that succeeds on DynamoDB and does not work fine yet in alternator
Message-Id: <0418186cb6c8a91de84837ffef9ac0947ea4e3d3.1567585915.git.sarna@scylladb.com>
The previous patch fixed test_describe_endpoints.py for a local run
without an AWS configuration. But when running with "--aws", we do
need to use that AWS configuration, and this patch fixes this case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Even when running against a local Alternator, Boto3 wants to know the
region name, and AWS credentials, even though they aren't actually needed.
For a local run, we can supply garbage values for these settings, to
allow a user who never configured AWS to run tests locally.
Running against "--aws" will, of course, still require the user to
configure AWS.
The previous patch already fixed this for most tests, this patch fixes the
same issue in test_describe_endpoints.py, which had a separate copy of the
problematic code.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Even when running against a local Alternator, Boto3 wants to know the
region name, and AWS credentials, even though they aren't actually needed.
For a local run, we can supply garbage values for these settings, to
allow a user who never configured AWS to run tests locally.
Running against "--aws" will, of course, still require the user to
configure AWS.
Also modified the README to be clearer, and more focused on the local
runs.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190708121420.7485-1-nyh@scylladb.com>
Unfortunately the previous 100s limit proved to be not enough
for creating tables with both local and global indexes attached
to them. Empirically 200s was chosen as a safe default,
as the longest test oscillated around 100s with the deviation of 10s.
With this patch, LocalSecondaryIndexes can be added to a table
during its creation. The implementation is heavily shared
with GlobalSecondaryIndexes and as such suffers from the same TODOs:
projections, describing more details in DescribeTable, etc.
We currently reserve the column name "attrs" for a map of attributes,
so the user is not allowed to use this name as a name of a key.
We plan to lift this reservation in a future patch, but until we do,
let's at least choose a more obscure name to forbid - in this patch ":attrs".
It is even less likely that a user will want to use this specific name
as a column name.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190903133508.2033-1-nyh@scylladb.com>
Currently, we reserve the name ATTRS_COLUMN_NAME ("attrs") - the user
cannot use it as a key column name (key of the base table or GSI or LSI)
because we use this name for the attribute map we add to the schema.
Currently, if the user does attempt to create such a key column, the
result is undefined (sometimes corrupt sstables, sometimes outright crashes).
This patches fixes it to become a clean error, saying that this column name is
currently reserved.
The test test_create_table_special_column_name now cleanly fails, instead
of crashing Scylla, so it is converted from "skip" to "xfail".
Eventually we need to solve this issue completely (e.g., in rare cases
rename columns to allow us to reserve a name like ATTRS_COLUMN_NAME,
or alternatively, instead of using a fixed name ATTRS_COLUMN_NAME pick a
different one different from the key column names). But until we do,
better fail with a clear error instead of a crash.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190901102832.7452-1-nyh@scylladb.com>
The file initially consists of a very simple case that succeeds
with `--aws` and expectedly fails without it, because the expression
is not implemented yet.
This adds a "alternator-address" and "alternator-port" configuration
options to the Docker image, so people can enable Alternator with
"docker run" with:
docker run --name some-scylla -d <image> --alternator-port=8080
Message-Id: <20190902110920.19269-1-penberg@scylladb.com>
When an unsupported expression parameter is encountered -
KeyConditionExpression, ConditionExpression or FilterExpression
are such - alternator will return an error instead of ignoring
the parameter.
This patch make two chagnes to the alternator stats:
1. It add estimated_histogram for the get, put, update and delete
operation
2. It changes the metrics naming, so the operation will be a label, it
will be easier to handle, perform operation and display in this way.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The test_gsi_3, involving creating a GSI with two key columns which weren't
previously a base key, now passes, so drop the "xfail" marker.
We still have problems with such materialized views, but not in the simple
scenario tested by test_gsi_3.
Later we should create a new test for the scenario which still fails, if
any.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Creating an underlying materialized view with 2 regular base columns
is risky in Scylla, as second's column liveness will not be correctly
taken into account when ensuring view row liveness.
Still, in case specific conditions are met:
* the regular base column value is always present in the base row
* no TTLs are involved
then the materialized view will behave as expected.
Creating a GSI with 2 base regular columns issues a warning,
as it should be performed with care.
Message-Id: <5ce8642c1576529d43ea05e5c4bab64d122df829.1567159633.git.sarna@scylladb.com>
It is important that BillingMode should default to PROVISIONED, as it
does on DynamoDB. This allows old clients, which don't specify
BillingMode at all, to specify ProvisionedThroughput as allowed with
PROVISIONED.
Also added a test case for this case (where BillingMode is absent).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190829193027.7982-1-nyh@scylladb.com>
When querying on a missing index, DynamoDB returns different errors in
case the entire table is missing (ResourceNotFoundException) or the table
exists and just the index is missing (ValidationException). We didn't
make this distinction, and always returned ValidationException, but this
confuses clients that expect ResourceNotFoundException - e.g., Amazon's
Tic-Tac-Toe demo.
This patch adds a test for the first case (the completely missing table) -
we already had a test for the second case - and returns the correct
error codes. As usual the test passes against DynamoDB as well as Alternator,
ensure they behave the same.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190829174113.5558-1-nyh@scylladb.com>
We needlessly split the trace-level log message for the request to two
messages - one containing just the operation's name, and one with the
parameters. Moreover we printed them in the opposite order (parameters
first, then the operation). So this patch combines them into one log
message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190829165341.3600-1-nyh@scylladb.com>
Alternator puts in the Scylla table a column called "attrs" for all the
non-key attributes. If the user happens to choose the same name, "attrs",
for one of the key columns, the result of writing two different columns
with the same name is a mess and corrupt sstables.
This test reproduces this bug (and works against DynamoDB of course).
Because the test doesn't cleanly fail, but rather leaves Scylla in a bad
state from which it can't fully recover, the test is marked as "skip"
until we fix this bug.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190828135644.23248-1-nyh@scylladb.com>
Updating key columns is not allowed in UpdateItem requests,
but the series introducing GSI support for regular columns
also introduced redundant duplicates checks of this kind.
This condition is already checked in resolve_update_path helper function
and existing test_update_expression_cannot_modify_key test makes sure that
the condition is checked.
Message-Id: <00f83ab631f93b263003fb09cd7b055bee1565cd.1567086111.git.sarna@scylladb.com>
The test test_update_expression_cannot_modify_key() verifies that an
update expression cannot modify one of the key columns. The existing
test only tried the SET and REMOVE actions - this patch makes the
test more complete by also testing the ADD and DELETE actions.
This patch also makes the expected exception more picky - we now
expect that the exception message contains the word "key" (as it,
indeed, does on both DynamoDB and Alternator). If we get any other
exception, there may be a problem.
The test passed before this patch, and passes now as well - it's just
stricter now.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190829135650.30928-1-nyh@scylladb.com>
The code previously used clustering_key::from_singular() to compute
a clustering key value. It works fine, but has two issues:
1. involves one redundant deserialization stage compared to
from_single_value
2. does not work with compound clustering keys, which can appear
when using indexes
With more GSI features implemented, tests with XPASS status are promoted
to being enabled.
One test case (test_gsi_describe) is partially done as DescribeTable
now contains index names, but we could try providing more attributes
(e.g. IndexSizeBytes and ItemCount from the test case), so the test
is left in the XFAIL state.
The DescribeTable request now contains the list of index names
as well. None of the attributes of the list are marked as 'required'
in the documentation, so currently the implementation provides
index names only.
In order to be able to create a Global Secondary Index over a regular
column, this column is upgraded from being a map entry to being a full
member of the schema. As such, it's possible to use this column
definition in the underlying materialized view's key.
In order to prepare alternator for adding regular columns to schema,
i.e. in order to create a materialized view over them,
the code is changed so that updating no longer assumes that only keys
are included in the table schema.
Since in the future we may want to have more regular columns
in alternator tables' schemas, the code is changed accordingly,
so all regular columns will be fetched instead of just the attribute
map.
If no regular column attributes are passed to PutItem, the attr
collector serializes an empty collection mutation nonetheless
and sends it. It's redundant, so instead, if the attr colector
is empty, the collection does not get serialized and sent to replicas.
Keeping an instance of client_state is a convenient way of being able
to use tracing for alternator. It's also currently used in paging,
so adding a client state to executor removes the need of keeping
a dummy value.
String views used in JSON serialization should use not only the pointer
returned by rapidjson, but also the string length, as it may contain
\0 characters.
Additionally, one unnecessary copy is elided.
Add a link to a longer document (currently, around 40 pages) about
DynamoDB's features and how we implemented or may implement them in
Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190825121201.31747-2-nyh@scylladb.com>
If a user tries to create a table with a unsupported feature -
a local secondary index, a used-defined encryption key or supporting
streams (CDC), let's refuse the table creation, so the application
doesn't continue thinking this feature is available to it.
The "Tags" feature is also not supported, but it is more harmless
(it is used mostly for accounting purposes) so we do not fail the
table creation because of it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190818125528.9091-1-nyh@scylladb.com>
In CQL, before a user can create a table, they must create a keyspace to
contain this table and, among other things, specify this keyspace's RF.
But in the DynamoDB API, there is no "create keyspace" operation - the
user just creates a table, and there is no way, and no opportunity,
to specify the requested RF. Presumably, Amazon always uses the same
RF for all tables, most likely 3, although this is not officially
documented anywhere.
The existing code creates the keyspace during Scylla boot, with RF=1.
This RF=1 always works, and is a good choice for a one-node test run,
but was a really bad choice for a real cluster with multiple nodes, so
this patch fixes this choice:
With this patch, the keyspace creation is delayed - it doesn't happen
when the first node of the cluster boots, but only when the user creates
the first table. Presumably, at that time, the cluster is already up,
so at that point we can make the obvious choice automatically: a one-node
cluster will get RF=1, a >=3 node cluster will get RF=3. The choice of
RF is logged - and the choice of RF=1 is considered a warning.
Note that with this patch, keyspace creation is still automatic as it
was before. The user may manually create the keyspace via CQL, to
override this automatic choice. In the future we may also add additional
keyspace configuration options via configuration flags or new REST
requests, and the keyspace management code will also likely change
as we start to support clusters with multiple regions and global
tables. But for now, I think the automatic method is easiest for
users who want to test-drive Alternator without reading lengthy
instructions on how to set up the keyspace.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190820180610.5341-1-nyh@scylladb.com>
We allow BillingMode to be set to either PAY_PER_REQUEST (the default)
or PROVISIONED, although neither mode is fully implemented: In the former
case the payment isn't accounted, and in the latter case the throughput
limits are not enforced.
But other settings for BillingMode are now refused, and we add a new test
to verify that.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190818122919.8431-1-nyh@scylladb.com>
The alternator tests want to exercise many of the DynamoDB API features,
so they need a recent enough version of the client libraries, boto3
and botocore. In particular, only in botocore 1.12.54, released a year
ago, was support for BillingMode added - and we rely on this to create
pay-per-request tables for our tests.
Instead of letting the user run with an old version of this library and
get dozens of mysterious errors, in this patch we add a test to conftest.py
which cleanly aborts the test if the libraries aren't new enough, and
recommends a "pip" command to upgrade these libraries.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190819121831.26101-1-nyh@scylladb.com>
The DescribeTable operation was currently implemented to return the
minimal information that libraries and applications usually need from
it, namely verifying that some table exists. However, this operation
is actually supposed to return a lot more information fields (e.g.,
the size of the table, its creation date, and more) which we currently
don't return.
This patch adds a new test file, test_describe_table.py, testing all
these additional attributes that DescribeTable is supposed to return.
Several of the tests are marked xfail (expected to fail) because we
did not implement these attributes yet.
The test is exhaustive except for attributes that have to do with four
major features which will be tested together with these features: GSI,
LSI, streams (CDC), and backup/restore.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190816132546.2764-1-nyh@scylladb.com>
Currently Alternator starts all Scylla requests (including both reads
and writes) without any timeout set. Because of bugs and/or network
problems, Requests can theoretically hang and waste Scylla request for
hours, long after the client has given up on them and closed their
connection.
The DynamoDB protocol doesn't let a user specify which timeout to use,
so we should just use something "reasonable", in this patch 10 seconds.
Remember that all DynamoDB read and write requests are small (even scans
just scan a small piece), so 10 seconds should be above and beyond
anything we actually expect to see in practice.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190812105132.18651-1-nyh@scylladb.com>
So far we had the "--alternator-port" option allowing to configure the port
on which the Alternator server listens on, but the server always listened
to any address. It is important to also be able to configure the listen
address - it is useful in tests running several instances of Scylla on
the same machine, and useful in multi-homed machines with several interfaces.
So this patch adds the "--alternator-address" option, defaulting to 0.0.0.0
(to listen on all interfaces). It works like the many other "--*-address"
options that Scylla already has.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190808204641.28648-1-nyh@scylladb.com>
It turns out that recent rjson patches introduced some buggy
tabs instead of spaces due to bad IDE configuration. The indentation
is restored to spaces.
Until now, filtering in alternator was possible only for non-key
column equality relations. This commit adds support for equality
relations for key columns.
Alternator allows passing hash and sort key restrictions
as filters - it is, however, better to incorporate these restrictions
directly into partition and clustering ranges, if possible.
It's also necessary, as optimizations inside restrictions_filter
assume that it will not be fed unneeded rows - e.g. if filtering
is not needed on partition key restrictions, they will not be checked.
Currently the only utility function for getting key bytes
from JSON was to parse a document with the following format:
"key_column_name" : { "key_column_type" : VALUE }.
However, it's also useful to parse only the inner document, i.e.:
{ "key_column_type" : VALUE }.
Three metrics related to filtering are added to alternator:
- total rows read during filtering operations
- rows read and matched by filtering
- rows read and dropped by filtering
Some underlying operations (e.g. paging) make use of cql_stats
structure from CQL3. As such, cql_stats structure is added
to alternator stats in order to gather and use these statistics.
Read-before-write stat counters were already introduced, but the metrics
needs to be added to a metric group as well in order to be available
for users.
This patch adds partial support for GSI (Global Secondary Index) in
Alternator, implemented using a materialized view in Scylla.
This initial version only supports the specific cases of the index indexing
a column which was already part of the base table's key - e.g., indexing
what used to be a sort key (clustering key) in the base table. Indexing
of non-key attributes (which today live in a map) is not yet supported in
this version.
Creation of a table with GSIs is supported, and so is deleting the table.
UpdateTable which adds a GSI to an existing table is not yet supported.
Query and Scan operations on the index are supported.
DescribeTable does not yet list the GSIs as it should.
Seven previously-failing tests now pass, so their "xfail" tag is removed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190808090256.12374-1-nyh@scylladb.com>
The rapidjson library needs to be used with caution in order to
provide maximum performance and avoid undefined behavior.
Comments added to rjson.hh describe provided methods and potential
pitfalls to avoid.
Message-Id: <ba94eda81c8dd2f772e1d336b36cae62d39ed7e1.1565270214.git.sarna@scylladb.com>
With libjsoncpp we were forced to work around the problem of
non-noexcept constructors by using an intermediate unique pointer.
Objects provided by rapidjson have correct noexcept specifiers,
so the workaround can be dropped.
Profiling alternator implied that JSON parsing takes up a fair amount
of CPU, and as such should be optimized. libjsoncpp is a standard
library for handling JSON objects, but it also proves slower than
rapidjson, which is hereby used instead.
The results indicated that libjsoncpp used roughly 30% of CPU
for a single-shard alternator instance under stress, while rapidjson
dropped that usage to 18% without optimizations.
Future optimizations should include eliding object copying, string copying
and perhaps experimenting with different JSON allocators.
Migrating from libjsoncpp to rapidjson proved to be beneficial
for parsing performance. As a first step, a set of helper functions
is provided to ease the migration process.
error.hh file implicitly assumed that seastar:: namespace is available
when it's included, which is not always the case. To remedy that,
seastar::httpd namespace is used explicitly.
Our CreateTable handler assumed that the function
migration_manager::announce_new_column_family()
returns a failed future if the table already exists. But in some of
our code branches, this is not the case - the function itself throws
instead of returning a failed future. The solution is to use
seastar::futurize_apply() to handle both possibilities (direct exception
or future holding an exception).
This fixes a failure of the test_table.py::test_create_table_already_exists
test case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This adds a new document, docs/alternator.md, about Alternator.
The scope of this document should be expanded in the future. We begin
here by introducing Alternator and its current compatibility level with
Amazon DynamoDB, but it should later grow to explain the design of Alternator
and how it maps the DynamoDB data model onto Scylla's.
Whether this document should remain a short high-level overview, or a long
and detailed design document, remains an open question.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190805085340.17543-1-nyh@scylladb.com>
The function attrs_type() return a supposedly singleton, but because
it is a seastar::shared_ptr we can't use the same one for multiple
threads, and need to use a separate one per thread.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190804163933.13772-1-nyh@scylladb.com>
The CQL type singletons like utf8_type et al. are separate for separate
shards and cannot be used across shards. So whatever hash tables we use
to find them, also needs to be per-shard. If we fail to do this, we
get errors running the debug build with multiple shards.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190804165904.14204-1-nyh@scylladb.com>
Expand the GSI test suite. The most important new test is
test_gsi_key_not_in_index(), where the index's key includes just one of
the base table's key columns, but not a second one. In this case, the
Scylla implementation will nevertheless need to add the second key column
to the view (as a clustering key), even though it isn't considered a key
column by the DynamoDB API.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190718085606.7763-1-nyh@scylladb.com>
Our ListTables implementation uses get_column_families(), which lists both
base tables and materialized views. We will use materialized views to
implement DynamoDB's secondary indexes, and those should not be listed in
the results of ListTables.
The patch also includes a test for this.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190717133103.26321-2-nyh@scylladb.com>
The list_tables() utility function was used only in test_table.py
but I want to use it elsewhere too (in GSI test) so let's move it
to util.py.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190717133103.26321-1-nyh@scylladb.com>
As in case of set_diff, an exception message in set_sum should include
the user-provided request (ADD) rather than our internal helper function
set_sum.
Although we do not support GSI yet, until now we silently ignored
CreateTable's GSI parameter, and the user wouldn't know the table
wasn't created as intended.
In this patch, GSI is still unsupported, but now CreateTable will
fail with an error message that GSI is not supported.
We need to change some of the tests which test the error path, and
expect an error - but should not consider a table creation error
as the expected error.
After this patch, test_gsi.py still fails all the tests on
Alternator, but much more quickly :-)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190711161420.18547-1-nyh@scylladb.com>
The test case for adding two sets with common values is added.
This case is a stub, because boto3 transforms the result into a Python
set, which removes duplicates on its own. A proper TODO is left
in order to migrate this case to a lower-level API and check
the returned JSON directly for lack of duplicates.
The Query operation's conditions can be used to search for a particular
hash key or both hash and sort keys - but not any other combinations.
We previously forgot to verify most errors, so in this patch we add
missing verifications - and tests to confirm we fail the query when
DynamoDB does.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190711132720.17248-1-nyh@scylladb.com>
Add more tests for GSI - tests that DescribeTable describes the GSI,
and test the case of more than one GSI for a base table.
Unfortunately, creating an empty table with two GSIs routinely takes
on DynamoDB more than a full minute (!), so because we now have a
test with two GSIs, I had to increase the timeout in create_test_table().
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190711112911.14703-1-nyh@scylladb.com>
The holds_path() utility function is actually used to check if a value
needs read before write, so its name is changed to more fitting
check_needs_read_before_write.
Alternator currently keeps an item's attributes inside a map, and we
had a serious bug in the way we build mutations for this map:
We didn't know there was a requirement to build this mutation sorted by
the attribute's name. When we neglect to do this sorting, this confuses
Scylla's merging algorithms, which assume collection cells are thus
sorted, and the result can be duplicate cells in a collection, and the
visible effect is a mutation that seems to be ignored - because both
old and new values exist in the collection.
So this patch includes a new helper class, "attribute_collector", which
helps collect attribute updates (put and del) and extract them in correctly
sorted order. This helper class also eliminates some duplication of
arcane code to create collection cells or deletions of collection cells.
This patch includes a simple test that previously failed, and one xfail
test that failed just because of this bug (this was the test that exposed
this bug). Both tests now succeed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190709160858.6316-1-nyh@scylladb.com>
This patch adds what is hopefully an exhaustive test suite for the
global secondary indexing (GSI) feature, and all its various
complications and corner cases of how GSIs can be created, deleted,
named, written, read, and more (the tests are heavily documented to
explain what they are testing).
All these tests pass on DynamoDB, and fail on Alternator, so they are
marked "xfail". As we develop the GSI feature in Alternator piece by
piece, we should make these tests start to pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190708160145.13865-1-nyh@scylladb.com>
This adds another test for BatchWriteItem: That if one of the operations is
invalid - e.g., has a wrong key type - the entire batch is rejected, and not
none of its operations are done - even the valid ones.
The test succeeds, because we already handle this case correctly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190707134610.30613-1-nyh@scylladb.com>
Test an operation like SET #one = #two, where the RHS has a reference
to a name, rather than the name itself. Also verify that DynamoDB
gives an error if ExpressionAttributeNames includes names not needed
by neither left or right hand side of such assignments.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190708133311.11843-1-nyh@scylladb.com>
In order to serve update requests that depend on read-before-write,
a proper helper function which fetches the existing item with a given
key from the database is added.
This read-before-write mechanism is not considered safe, because it
provides no linearizability guarantees and offers no synchronization
protection. As such, it should be consider a placeholder that works
fine on a single machine and/or no concurrent access to the same key.
The calculate_value utility function is going to need more context
in order to resolve paths present in the right-hand side of update_item
operators: update_info and schema.
Historically, resolving a path checked for key columns, which are not
allowed to be on the left-hand side of the assignment. However, path
resolving will now also be used for right-hand side, where it should
be allowed to use the key value.
In order to implement read-before-write in the future, calculate_value
now accepts an additional parameter: previous_item. If read-before-write
was performed, previous_item will contain an item for the given key
which already exists in the database at the time of the update.
This patch moves the create_test_table() utility function, which creates
a test table with a unique name, from the fixtures (conftest.py) to
util.py. This will allow reusing this function in tests which need to
create tables but not through the existing fixtures. In particular
we will need to do this for GSI (global secondary index) tests
in the next patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190708104438.5830-1-nyh@scylladb.com>
The tests we had for BatchWriteItem's refusal to accept duplicate keys
only used test_table_s, with just a hash key. This patch adds tests
for test_table, i.e., a table with both hash and sort keys - to check
that we check duplicates in that case correctly as well.
Moreover, the expanded tests also verify that although identical
keys are not allowed, keys with just one component (hash or sort key)
the same but the other not the same - are fine.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190705191737.22235-1-nyh@scylladb.com>
Even when running against a local Alternator, Boto3 wants to know the
region name, and AWS credentials, even though they aren't actually needed.
For a local run, we can supply garbage values for these settings, to
allow a user who never configured AWS to run tests locally.
Running against "--aws" will, of course, still require the user to
configure AWS.
Also modified the README to be clearer, and more focused on the local
runs.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190708121420.7485-1-nyh@scylladb.com>
For "--aws" tests, use the default region chosen by the user in the
AWS configuration (~/.aws/config or environment variable), instead of
hard-coding "us-east-1".
Patch by Pekka Enberg.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190708105852.6313-1-nyh@scylladb.com>
Calculating value represented as 'v1 + v2' or 'v1 - v2' was previously
implemented with a double type, which offers limited precision.
From now on, these computations are based on big_decimal, which
allows returning values without losing precision.
This patch depends on 'add big_decimal arithmetic operators' series.
Message-Id: <f741017fe3d3287fa70618068bdc753bfc903e74.1562318971.git.sarna@scylladb.com>
Move some common utility functions to a common file "util.py"
instead of repeating them in many test files.
The utility functions include random_string(), random_bytes(),
full_scan(), full_query(), and multiset() (the more general
version, which also supports freezing nested dicts).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190705081013.1796-1-nyh@scylladb.com>
The idiomatic way to use an std::variant depending the type holds is to use
std::visit. This modern API makes it unnecessary to write many boiler-plate
functions to test and cast the type of the variant, and makes it impossible
to forget one of the options. So in this patch we throw out the old ways,
and welcome the new.
Thanks to Piotr Sarna for the idea.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190704205625.20300-1-nyh@scylladb.com>
This patch adds to Alternator an implementation of the BatchGetItem
operation, which allows to start a number of GetItem requests in parallel
in a single request.
The implementation is almost complete - the only missing feature is the
ability to ask only for non-top-level attributes in ProjectionExpression.
Everything else should work, and this patch also includes tests which,
as usual, pass on DynamoDB and now also on Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Amazingly, it appears we never tested booting Alternator a second time :-)
Our initialization code creates a new keyspace, and was supposed to ignore
the error if this keyspace already existed - but we thought the error will
come as an exceptional future, which it didn't - it came as a thrown
exception. So we need to change handle_exception() to a try/catch.
With this patch, I can kill Alternator and it will correctly start again.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Operations which take a key as parameter, namely GetItem, UpdateItem,
DeleteItem and BatchWriteItem's DeleteRequest, already fail if the given
key is missing one of the nessary key attributes, or has the wrong types
for them. But they should also fail if the given key has spurious
attributes beyond those actually needed in a key.
So this patch adds this check, and tests to confirm that we do these checks
correctly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The PutItem operation, and also the PutRequest of BatchWriteItem, are
supposed to completely replace the item - not to merge the new value with
the previous value. We implemented this wrongly - we just wrote the new
item forgetting a tombstone to remove the old item.
So this patch fixes these operations, and adds tests which confirm the
fix (as usual, these tests pass on DynamoDB, failed on Alternator before
this patch, and pass after the patch).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add support for the DeleteItem operation, which deletes an item.
The basic deletion operation is supported. Still not supported are:
1. Parameters to conditionally delete (ConditionalExpression or Expected)
2. Parameters to return pre-delete content
3. ReturnItemCollectionMetrics (statistics relevant for tables with LSI)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In BatchWriteItem, we currently only support the PutRequest operation.
If a user tries to use DeleteRequest (which we don't support yet), he
will get a bizarre error. Let's test the request type more carefully,
and print a better error message. This will also be the place where
eventually we'll actually implement the DeleteRequest.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds more comprehensive tests for the BatchWriteItem operation,
in a new file batch_test.py. The one test we already had for it was also
moved from test_item.py here.
Some of the test still xfail for two reasons:
1. Support for the DeleteRequest operation of BatchWriteItem is missing.
2. Tests that forbid duplicate keys in the same request are missing.
As usual, all tests succeed on DynamoDB, and hopefully (I tried...)
cover all the BatchWriteItem features.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB has two similar parameters - AttributesToGet and
ProjectionExpression - which are supported by the GetItem, Scan and
Query operations. Until now we supported only the older AttributesToGet,
and this patch adds support to the newer ProjectionExpression.
Besides having a different syntax, the main difference between
AttributesToGet and ProjectionExpression is that the latter also
allows fetching only a specific nested attribute, e.g., a.b[3].c.
We do not support this feature yet, although it would not be
hard to add it: With our current data representation, it means
fetching the top-level attribute 'a', whose value is a JSON, and then
post-filtering it to take out only the '.b[3].c'. We'll do that
later.
This patch also adds more test cases to test_projection_expression.py.
All tests except three which check the nested attributes now pass,
and those three xfail (they succeed on DynamoDB, and fail as expected
on Alternator), reminding us what still needs to be done.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Our GetItem, Query and Scan implementations support the AttributesToGet
parameter to fetch only a subset of the attributes, but we don't yet
support the more elaborate ProjectionExpression parameter, which is
similar but has a different syntax and also allows to specify nested
document paths.
This patch adds existive testing of all the ProjectionExpression features.
All these tests pass against DynamoDB, but fail against the current
Alternator so they are marked "xfail". These tests will be helpful for
developing the ProjectionExpression feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The AttributesToGet parameter - saying which attributes to fetch for each
item - is already supported in the GetItem, Query and Scan operations.
However, we only had a test for it for it for Scan. This patch adds
similar tests also for the GetItem and Query operations.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Yet another test for overwriting a top-level attribute which contains
a nested document - here, overwriting it by just a string.
This test passes. In the current implementation we don't yet support
updates to specific attribute paths (e.g. a.b[3].c) but we do support
well writing and over-writing top-level attributes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch implements the last (finally!) syntactic feature of the
UpdateExpression - the ability to do SET a=val1+val2 (where, as
before, each of the values can be a reference to a value, an
attribute path, or a function call).
The implementation is not perfect: It adds the values as double-precision
numbers, which can lose precision. So the patch adds a new test which
checks that the precision isn't lost - a test that currently fails
(xfail) on Alternator, but passes on DynamoDB. The pre-existing test
for adding small integer now passes on Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the previous patch we added function-call support in the UpdateExpression
parser. In this patch we add support for one such function - list_append().
This function takes two values, confirms they are lists, and concatenates
them. After this patch only one function remains unimplemented:
if_not_exists().
We also split the test we already had for list_append() into two tests:
One uses only value references (":val") and passes after this patch.
The second test also uses references to other attributes and will only
work after we start supporting read-modify-write.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Until this patch, in update expressions like "SET a = :val", we only
allowed the right-hand-side of the assignment to be a reference to a
value stored in the request - like ":val" in the above example.
But DynamoDB also allows the value to be an attribute path (e.g.,
"a.b[3].c", and can also be a function of a bunch of other values.
This patch adds supports for parsing all these value types.
This patch only adds the correct parsing of these additional types of
values, but they are still not supported: reading existing attributes
(i.e., read-modify-write operations) is still not supported, and
none of the two functions which UpdateExpression needs to support
are supported yet. Nevertheless, the parsing is now correct, and the
the "unknown_function" test starts to pass.
Note that DynamoDB allows the right-hand side of an assignment to be
not only a single value, but also value+value and value-value. This
possibility is not yet supported by the parser and will be added
later.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test cases verify that equality-based filtering on non-key
attributes works fine. It also contains test stubs for key filtering
and non-equality attribute filtering.
Filled test table used to have identical non-key attributes for all
rows. These values are now diversified in order to allow writing
filtering test cases.
Filtering is currently only implemented for the equality operator
on non-key attributes.
Next steps (TODO) involve:
1. Implementing filtering for key restrictions
2. Implementing non-key attribute filtering for operators other than EQ.
It, in turn, may involve introducing 'map value restrictions' notion
to Scylla, since now it only allows equality restrictions on map
values (alternator attributes are currently kept in a CQL map).
3. Implementing FilterExpression in addition to deprecated QueryFilter
Before this patch, we read either an attribute name like "name" or
a reference to one "#name", as one type of token - NAME.
However, while attribute paths indeed can use either one, in some other
contexts - such as a function name - only "name" is allowed, so we
need to distinguish between two types of tokens: NAME and NAMEREF.
While separating those, I noticed that we incorrectly allowed a "#"
followed by *zero* alphanumeric characters to be considered a NAMEREF,
which it shouldn't. In other words, NAMEREF should have ALNUM+, not ALNUM*.
Same for VALREF, which can't be just a ":" with nothing after it.
So this patch fixes these mistakes, and adds tests for them.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB complains, and fails an update, if the update contains in
ExpressionAttributeNames or ExpressionAttributeValues names which aren't
used by the expression.
Let's do the same, although sadly this means more work to track which
of the references we've seen and which we haven't.
This patch makes two previously xfail (expected fail) tests become
successful tests on Alternator (they always succeeded against DynamoDB).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The existing tests in test_update_expression.py thoroughly tested the
UpdateExpression features which we currently support. But tests for
features which Alternator *doesn't* yet support were partial.
In this patch, we add a large number of new tests to
test_update_expression.py aiming to cover ALL the features of
UpdateExpression, regardless of whether we already support it in
Alternator or not. Every single feature and esoteric edge-case I could
discover is covered in these tests - and as far as I know these tests
now cover the *entire* UpdateExpression feature. All the tests succeed
on DynamoDB, and confirm our understanding of what DynamoDB actually does
on all these cases.
After this patch, test_update_expression.py is a whopper, with 752 lines of
code and 37 separate test functions. 23 out of these 37 tests are still
"xfail" - they succeed on DynamoDB but fail on Alternator, because of
several features we are still missing. Those missing features include
direct updates of nested attributes, read-modify-write updates (e.g.,
"SET a=b" or "SET a=a+1"), functions (e.g., "SET a = list_append(a, :val)"),
the ADD and DELETE operations on sets, and various other small missing
pieces.
The benefit of this whopper test is two-fold: First, it will allow us
to test our implementation as we continue to fill it (i.e., "test-
driven development"). Second, all these tested edge cases basically
"reverse engineer" how DynamoDB's expression parser is supposed to work,
and we will need this knowledge to implement the still-missing features of
UpdateExpression.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds an extensive array of tests for UpdateItem's UpdateExpression
support, which was introduced in the previous patch.
The tests include verification of various edge cases of the parser, support
for ":value" and "#name" references, functioning SET and REMOVE operations,
combinations of multiple such operations, and much more.
As usual, all these tests were ran and succeed on DynamoDB, as well as on
Alternator - to confirm Alternator behaves the same as DynamoDB.
There are two tests marked "xfail" (expected to fail), because Alternator
still doesn't support the attribute copy syntax (e.g., "SET a = b",
doing a read-before-write).
There are some additional areas which we don't support - such as the DELETE
and ADD operations or SET with functions - but those areas aren't yet test
in these tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
For the UpdateItem operation, so far we supported updates via the
AttributeUpdates parameter, specifying which attributes to set or remove
and how. But this parameter is considered deprecated, and DynamoDB supports
a more elaborate way to modify attributes, via an "UpdateExpression".
In the previous patch we added a function to parse such an UpdateExpression,
and in this patch we use the result of this parsing to actually perform
the required updates.
UpdateExpression is only partially supported after this patch. The basic
"SET" and "REMOVE" operations are supported, but various other cases aren't
fully supported and will be fixed in followup patches. The following
patch will add extensive tests to confirm exactly what works correctly
with the new UpdateExpression support.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The DynamoDB protocol is based on JSON, and most DynamoDB requests describe
the operation and its parameters via JSON objects such as maps and lists.
However, in some types of requests an "expression" is passed as a single
string, and we need to parse this string. These cases include:
1. Attribute paths, such as "a[3].b.c", are used in projection
expressions as well as inside other expressions described below.
2. Condition expressions, such as "(NOT (a=b OR c=d)) AND e=f",
used in conditional updates, filters, and other places.
3. Update expressions, such as "SET #a.b = :x, c = :y DELETE d"
This patch introduces the framework to parse these expressions, and
an implementation of parsing update expressions. These update expressions
will be used in the UpdateItem operation in the next patch.
All these expression syntaxes are very simple: Most of them could be
parsed as regular expressions, or at most a simple hand-written lexical
analyzer and recursive-descent parser. Nevertheless, we decided to specify
these parsers in the same ANTLR3 language already used in the Scylla
project for parsing CQL, hopefully making these parsers easier to reason
about, and easier to change if needed - and reducing the amount of boiler-
plate code.
The parsing of update expressions is most complete except that in SET
actions, only the "path = value" form is supported and not yet forms
forms such as "path1 = path2" (which does read-before-write) or
"path1 = path1 + value" or "path = function(...)".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We need to write more tests for various case of handling
nested documents and nested attributes. Let's collect them
all in the same test file.
This patch mostly moves existing code, but also adds one
small test, test_nested_document_attribute_write, which
just writes a nested document and reads it back (it's
mostly covered by the existing test_put_and_get_attribute_types,
but is specifically about a nested document).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We usually run Alternator tests against the local Alternator - testing
against AWS DynamoDB is rarer, and usually just done when writing the
test. So let's make "pytest" without parameters default to testing locally.
To test against AWS, use "pytest --aws" explicitly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Attributes for reads (GetItem, Query, Scan, ...) and writes (PutItem,
UpdateItem, ...) are now serialized and deserialized in binary form
instead of raw JSON, provided that their type is S, B, BOOL or N.
Optimized serialization for the rest of the types will be introduced
as follow-ups.
Message-Id: <6aa9979d5db22ac42be0a835f8ed2931dae208c1.1559646761.git.sarna@scylladb.com>
Attributes used to be written into the database in raw JSON format,
which is far from optimal. This patch introduces more robust
serializationi routines for simple alternator types: S, B, BOOL, N.
Serialization uses the first byte to encode attribute type
and follows with serializing data in binary form.
More complex types (sets, lists, etc.) are currently still
serialized in raw JSON and will be optimized in follow-up patches.
Message-Id: <10955606455bbe9165affb8ac8fba4d9e7c3705f.1559646761.git.sarna@scylladb.com>
For some unknown reason we put the list of alternator source files
in configure.py inside the "api" list. Let's move it into a separate
list.
We could have just put it in the scylla_core list, but that would cause
frequent and annoying patch conflicts when people add alternator source
files and Scylla core source files concurrently.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
So far for UpdateItem we only supported the old-style AttributeUpdates
parameter, not the newer UpdateExpression. This patch begins the path
to supporting UpdateExpression. First, trying to use *both* parameters
should result in an error, and this patch does this (and tests this).
Second, passing neither parameters is allowed, and should result in
an *empty* item being created.
Finally, since today we do not yet support UpdateExpression, this patch
will cause UpdateItem to fail if UpdateExpression is used, instead of
silently being ignored as we did so far.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds two simple tests for nested documents, which pass:
test_nested_document_attribute_overwrite() tests what happens when
we UpdateItem a top-level attribute to a dictionary. We already tested
this works on an empty item in a previous test, but now we check what
happens when the attribute already existed, and already was a dictionary,
and now we update it to a new dictionary. In the test attribute a was
{b:3, c:4} and now we update it to {c:5}. The test verifies that the new
dictionary completely replaces the old one - the two are not merged.
The new value of the attribute is just {c:5}, *not* {b:3, c:5}.
The second test verifies that the AttributeUpdates parameter of
UpdateItem cannot be used to update a just a nested attributes.
Any dots in the attribute name are considered an actual dot - not
part of a path of attribute names.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Comparing two lists of items without regard for order is not trivial.
For this reason some tests in test_query.py only compare arrays of sort
keys, and those tests are fine.
But other tests used a trick of converting a list of items into a
of set_of_frozen_elements() and compare this sets. This trick is almost
correct, but it can miss cases where items repeat.
So in this patch, we replace the set_of_frozen_elements() approach by
a similar one using a multiset (set with repetitions) instead of a set.
A multiset in Python is "collections.Counter". This is the same approach
we started to also used in test_scan.py in a recent patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Remove the incomplete and unused function to convert DynamoDB type names
to ScyllaDB type objects:
DynamoDB has a different set of types relevant for keys and for attributes.
We already have a separate function, parse_key_type(), for parsing key
types, and for attributes - we don't currently parse the type names at
all (we just save them as JSON strings), so the function we removed here
wasn't used, and was in fact #if'ed out. It was never completed, and it now
started to decay (the type for numbers is wrong), so we're better off
completely removing it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch implements a fully working number type for keys, and now
Alternator fully and correctly supports every key type - strings, byte
arrays, and numbers.
The patch also adds a test which verifies that Scylla correctly sorts
number sort keys, and also correctly retrieves them to the full precision
guaranteed by DynamoDB (38 decimal digits).
The implementation uses Scylla's "decimal" type, which supports arbitrary
precision decimal floating point, and in particular supports the precision
specified by DynamoDB. However, "decimal" is actually over-qualified for
this use, so might not be optimal for the more specific requirements of
DynamoDB. So a FIXME is left to optimize this case in the future.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Comparing two lists of items without regard for order is not trivial.
test_scan.py currently has two ways of doing this, both unsatisfactory:
1. We convert each list to a set via set_of_frozen_elements(), and compare
the sets. But this comparison can miss cases where items repeat.
2. We use sorted() on the list. This doesn't work on Python 3 because
it removed the ability to compare (with "<") dictionaries.
So in this patch, we replace both by a new approach, similar to the first
one except we use a multiset (set with repetitions) instead of a set.
A multiset in Python is "collections.Counter".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Creating and deleting tables is the slowest part of our tests,
so we should lower the number of tables our tests create.
We had a "test_2_tables" fixture as a way to create two
tables, but since our tests already create other tables
for testing different key types, it's faster to reuse those
tables - instead of creating two more unused tables.
On my system, a "pytest --local", running all 38 tests
locally, drops from 25 seconds to 20 seconds.
As a bonus, we also have one fewer fixture ;-)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
to 1024 bytes, and the entire item to 400 KB which therefore also
limits the size of one attribute. This test checks that we can
reach up to these limits, with binary keys and attributes.
The test does *not* check what happens once we exceed these
limits. In such a case, DynamoDB throws an error (I checked that
manually) but Alternator currently simply succeeds. If in the
future we decide to add artificial limits to Alternator as well,
we should add such tests as well.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"len" is an unfortunate choice for a variable name, in case one
day the implementation may want to call the built-in "len" function.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We already have a test for *string* sort-key ordering of items returned
by the Scan operation, and this test adds a similar test for the Query
operation. We verify that items are retrieved in the desired sorted
order (sorted by the aptly-named sort key) and not in creation order
or any other wrong order.
But beyond just checking that Query works as expected (it should,
given it uses the same machinary as Scan), the nice thing about this
test is that it doesn't create a new table - it uses a shared table
and creates one random partition inside it. This makes this test
faster and easier to write (no need for a new fixture), and most
importantly - easily allows us to write similar tests for other
key types.
So this patch also tests the correct ordering of *binary* sort keys.
It helped exposed bugs in previous versions of the binary key implementation.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Simple tests for item operations (PutItem, GetItem) with binary key instead
of string for the hash and sort keys. We need to be able to store such
keys, and then retrieve them correctly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Until now we only supported string for key columns (hash or sort key).
This patch adds support for the bytes type (a.k.a binary or blob) as well.
The last missing type to be supported in keys is the number type.
Note that in JSON, bytes values are represented with base64 encoding,
so we need to decode them before storing the decoded value, and re-encode
when the user retrieves the value. The decoding is important not just
for saving storage space (the encoding is 4/3 the size of the decoded)
but also for correct *sorting* of the binary keys.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The DynamoDB API uses base64 encoding to encode binary blobs as JSON
strings. So we need functions to do these conversions.
This code was "inspired" by https://github.com/ReneNyffenegger/cpp-base64
but doesn't actually copy code from it.
I didn't write any specific unit tests for this code, but it will be
exercised and tested in a following patch which tests Alternator's use
of these functions.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
BEGINS_WITH behaves in a special way when a key postfix
consists of <255> bytes. The initial test does not use that
and instead checks UTF-8 characters, but once bytes type
is implemented for keys, it should also test specifically for
corner cases, like strings that consist of <255> byte only.
Message-Id: <fe10d7addc1c9d095f7a06f908701bb2990ce6fe.1558603189.git.sarna@scylladb.com>
BEGINS_WITH statement increments a string in order to compute
the upper bound for a clustering range of a query.
Unfortunately, previous implementation was not correct,
as it appended a <0> byte if the last character was <255>,
instead of incrementing a last-but-one character.
If the string contains <255> bytes only, the upper bound
of the returned upper bound is infinite.
Message-Id: <3a569f08f61fca66cc4f5d9e09a7188f6daad578.1558524028.git.sarna@scylladb.com>
We had several places in the code that need to parse the
ConsistentRead flag in the request. Let's add a function
that does this, and while at it, checks for more error
cases and also returns LOCAL_QUORUM and LOCAL_ONE instead
of QUORUM and ONE.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
As Shlomi suggested in the past, it is more likely that when we
eventually support global tables, we will use LOCAL_QUORUM,
not QUORUM. So let's switch to that now.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
So far, all of the tests in test_item.py (for PutItem, GetItem, UpdateItem),
were arbitrarily done on a test table with both hash key and sort key
(both with string type). While this covers most of the code paths, we still
need to verify that the case where there is *not* a sort key, also works
fine. E.g., maybe we have a bug where a missing clustering key is handled
incorrectly or an error is incorrectly reported in that case?
But in this patch we add tests for the hash-key-only case, and see that
it already works correctly. No bug :-)
We add a new fixture test_table_s for creating a test table with just
a single string key. Later we'll probably add more of these test tables
for additional key types.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Another type of key type error can be to forget part of the key
(the hash or sort key). Let's test that too (it already works correctly,
no need to patch the code).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When a table has a hash key or sort key of a certain type (this can
be string, bytes, or number), one cannot try to choose an item using
values of different types.
We previously did not handle this case gracefully, and PutItem handled
it particularly bad - writing malformed data to the sstable and basically
hanging Scylla. In this patch we fix the pk_from_json() and ck_from_json()
functions to verify the expected type, and fail gracefully if the user
sent the wrong type.
This patch also adds tests for these failures, for the GetItem, PutItem,
and UpdateItem operations.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
According to the documentation, trying to GetItem a non-existant item
should result in an empty response - NOT a response with an empty "Item"
map as we do before this patch.
This patch fixes this case, and adds a test case for it. As usual,
we verify that the test case also works on Amazon DynamoDB, to verify
DynamoDB really behaves the way we thik it does.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
If an empty item (i.e., no attributes except the key) is created, or an item
becomes empty (by deleting its existing attributes), the empty item must be
maintained - it cannot just disappear. To do this in Scylla, we must add a
row marker - otherwise an empty attribute map is not enough to keep the
row alive.
This patch includes 4 test cases for all the various ways an empty item can be
created empty or non-empty item be emptied, and verifies that the empty item
can be correctly retrieved (as usual, to verify that our expectation of
"correctness" is indeed correct, we run the same tests against DynamoDB).
All these 4 tests failed before this patch, and now succeed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
These lines of codes were superfluous and their result unused: the
make_item_mutation() function finds the pk and ck on its own.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
his patch adds a statistics framework to Alternator: Executor has (for
each shard) a _stats object which contains counters for various events,
and also is in charge of making these counters visible via Scylla's regular
metrics API (http://localhost:9180/metrics).
This patch includes a counter for each of DynamoDB's operation types,
and we increase the ones we support when handled. We also added counters
for total operations and unsupported operations (operation types we don't
yet handle). In the future we can easily add many more counters: Define
the counter in stats.hh, export it in stats.cc, and increment it in
where relevant in executor.cc (or server.cc).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Ask to retrieve only an attribute name which *none* of the items have.
The result should be a silly list of empty items, and indeed it is.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Use full_scan() in another test instead of open-coding the scan.
There are two more tests that could have used full_scan(), but
since they seem to be specifically adding more assertions or
using a different API ("paginators"), I decided to leave them
as-is. But new tests should use full_scan().
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This is a short, but extensive, test to the AttributesToGet parameter
to Scan, allowing to select for output only some of the attributes.
The AttributesToGet feature has several non-obvious features. Firstly,
it doesn't require that any key attributes be selected. So since each
item may have different non-key attributes, some scanned items may
be missing some of the selected columns, and some of the items may
even be missing *all* the selected columns - in which case DynamoDB
returns an empty item (and doesn't entirely skip this item). This
test covers all these cases, and it adds yet another item to the
'filled_test_table' fixture, one which has different attributes,
so we can see these issues.
As usual, this test passes in both DynamoDB and Alternator, to
assure we correspond to the *right* behavior, not just what we
think is right.
This test actually exposed a bug in the way our code returned
empty items (items which had none of the selected columns),
a bug which was fixed by the previous patch.
Instead of having yet another copy of table-scanning code, this
patch adds a utility function full_scan(), to scan an entire
table (with optional extra parameters for the scan) and return
the result as an array. We should simply existing tests in
test_scan.py by using this new function.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
* tag 'dbuild-image-help-usage-v1' of github.com:bhalevy/scylla:
dbuild: add usage
dbuild: add help option
dbuild: list available images when no image arg is given
dbuild: add --image option
When a Scan selects only certain attributes, and none of the key
attributes are selected, for some of the scanned items *nothing*
will remain to be output, but still Dynamo outputs an empty item
in this case. Our code had a bug where after each item we "moved"
the object leaving behind a null object, not an empty map, so a
completely empty item wasn't output as an empty map as expected,
and resulted in boto3 failing to parse the response.
This simple one-line patch fixes the bug, by resetting the item
to an empty map after moving it out.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Instead of blindly returning "localhost:8000" in response to
DescribeEndpoints and for sure causing us problems in the future,
the right thing to do is to return the same domain name which the
user originally used to get to us, be it "localhost:8000" or
"some.domain.name:1234". But how can we know what this domain name
was? Easy - this is why HTTP 1.1 added a mandatory "Host:" header,
and the DynamoDB driver I tested (boto3) adds it as expected,
indeed with the expected value of "localhost:8000" on my local setup.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Although different partitions are returned by a Scan in (seemingly)
random order, items in a single partition need to be returned sorted
by their sort key. This adds a test to verify this.
This patch adds to the filled_test_table fixture, which until now
had just one item in each partition, another partition (with the key
"long") with 164 additional items. The test_scan_sort_order_string
test then scans this table, and verifies that the items are really
returned in sorted order.
The sort order is, of course, string order. So we have the first
item with sort key "1", then "10", then "100", then "101", "102",
etc. When we implement numeric keys we'll need to add a version
of this test which uses a numeric clustering key and verifies the
sort order is numeric.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Because of a typo, we incorrectly set the table's sort key as a second
partition key column instead of a clustering key column. This has bad
but subtle consequences - such as that the items are *not* sorted
according to the sort key. So in this patch we fix the typo.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DescribeEndpoints is not a very important API (and by default, clients
don't use it) but I wanted to understand how DynamoDB responds to it,
and what better way than to write a test :-)
And then, if we already have a test, let's implement this request in
Scylla as well. This is a silly implementation, which always returns
"localhost:8000". In the future, this will need to be configurable -
we're not supposed here to return *this* server's IP address, but rather
a domain name which can be used to get to all servers.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
"
Currently, GDB scripts locate sstables by scanning the heap for
bag_sstable_set containers. That has disadvatanges:
- not all containers are considered
- it's extremely slow on large heaps
- fragile, new containers can be added, and we won't even know
This series fixes all above by adding a per-shard sstable tracker
which tracks sstable objects in a linked-list.
"
* 'sstable-tracker' of github.com:tgrabiec/scylla:
gdb: Use sstable tracker to get the list of sstables
gdb: Make intrusive_list recognize member_hook links
sstables: Track whether sstable was already open or not
sstables: Track all instances of sstable objects
sstables: Make sstable object not movable
sstables: Move constructor out of line
Most of the request types need to a TableName parameter, specifying the
name of the table they operate on. There's a lot of boilerplate code
required to get this table name and verify that it is valid (the parameter
exists, is a string, passes DynamoDB's naming rules, and the table
actually exists), which resulted in a lot of code duplication - and
in some cases missing checks.
So this patch introduces two utility functions, get_table_name()
and get_table(), to fetch a table name or the schema of an existing
table, from the request, with all necessary validation. If validation
fails, the appropriate api_error() is thrown so the user gets the
right error message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Remove unused random-string code from conftest.py, and also add a
TODO comment how we should speed up filled_test_table fixture by
using a batch write - when that becomes available in Alternator.
(right now this fixture takes almost 4 seconds to prepare on a local
Alternator, and a whopping 3 minutes (!) to prepare on DynamoDB).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test test_put_and_get_attribute_types needlessly named all the
different attributes and their variables, causing a lot of repetition
and chance for mistakes when adding additional attributes to the test.
In this rewrite, we only have a list of items, and automatically build
attributes with them as values (using sequential names for the attributes)
and check we read back the same item (Python's dict equality operator
checks the equality recursively, as expected).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Although we planned to initially support only string types, it turns out
for the attributes (*not* the key), we actually support all types already,
including all scalar types (string, number, bool, binary and null) and
more complex types (list, nested document, and sets).
This adds a tests which PutItem's these types and verifies that we can
retrieve them.
Note that this test deals with top-level attributes only. There is no
attempt to modify only a nested attribute (and with the current code,
it wouldn't work).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In our tests, we cannot really assume that ListTables should returns *only*
the tables we created for the test, or even that a page size of 100 will
be enough to list our 3 pages. The issue is that on a shared DynamoDB, or
in hypothetical cases where multiple tests are run in parallel, or previous
tests had catestrophic errors and failed to clean up, we have no idea how
many unrelated tables there are in the system. There may be hundreds of
them. So every ListTables test will need to use paging.
So in this re-implementation, we begin with a list_tables() utility function
which calls ListTables multiple times to fetch all tables, and return the
resulting list (we assume this list isn't so huge it becomes unreasonable
to hold it in memory). We then use this utility function to fetch the table
list with various page sizes, and check that the test tables we created are
listed in the resulting list.
There's no longer a separate test for "all" tables (really was a page of 100
tables) and smaller pages (1,2,3,4) - we now have just one test that does the
page sizes 1,2,3,4, 50 and 100.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch cleans up some comments and reorganizes some functions in
conftest.py, where the test_table fixture was defined. The goal is to
later add additional types of test tables with different schemas (e.g.,
just a partition key, different key types, etc.) without too much
code duplication.
This patch doesn't change anything functional in the tests, and they
still pass ("pytest --local" runs all tests against the local Alternator).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The ck_from_json() utility function is easier to use if it handles
the no-clustering-key case as the callers need them too, instead of
requiring them to handle the no-clustering-key case separately.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
So far we supported UpdateItem only with PUT operations - this patch
adds support for DELETE operations, to delete specific attributes from
an item.
Only the case of a missing value is support. DynamoDB also provides
the ability to pass the old value, and only perform the deletion if
the value and/or its type is still up-to-date - but we don't support
this yet and fail such request if it is attempted.
This patch also includes a test for this case in alternator-test/
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add initial tests for UpdateItem. Only the features currently supported
by our code (only string attributes, only "PUT" action) are tested.
As usual, this test (like all others) was tested to pass on both DynamoDB
and Alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add an initial UpdateItem implementation. As PutItem and GetItem we
are still limited to string attributes. This initial implementation
of UpdateItem implements only the "PUT" action (not "DELETE" and
certainly not "ADD") and not any of the more advanced options.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All operation-generated error messages should have the 400 HTTP error
code. It's a real nag to have to type it every time. So make it the
default.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Without special options, PutItem should return nothing (an empty
JSON result). Previously we had trouble doing this, because instead
of return an empty JSON result, we converted an empty string into
JSON :-) So the existing code had an ugly workaround which worked,
sort of, for the Python driver but not for the Java driver.
The correct fix, in this patch, is to invent a new type json_string
which is a string *already* in JSON and doesn't need further conversion,
so we can use it to return the empty result. PutItem now works from
YCSB's Java driver.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Although we would like to allow table names up to 222 bytes, this is not
currently possible because Scylla tacks additional 33 bytes to create
a directory name, and directory names are limited to 255 bytes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The supported key types are just S(tring), B(lob), or N(umber).
Other types are valid for attributes, but not for keys, and should
not be accepted. And wrong types used should result in the appropriate
user-visible error.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
To be correct, CreateTable's input parsing need to work in reverse from
what it did: First, the key columns are listed in KeySchema, and then
each of these (and potetially more, e.g., from indexes) need to appear
AttributeDefinitions.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Without any arguments, PutItem should return no data at all. But somehow,
for reasons I don't understand, the boto3 driver gets confused from an
empty JSON thinking it isn't JSON at all. If we return a structure with
an empty "attributes" fields, boto3 is happy.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add an initial implementation of Delete table, enough for making the
pytest --local test_table.py::test_create_and_delete_table
Pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When given an unknown operation (we didn't implement yet many of them...)
we should throw the appropriate api_error, not some random exception.
This allows the client to understand the operation is not supported
and stop retrying - instead of retrying thinking this was a weird
internal error.
For example the test
pytest --local test_table.py::test_create_and_delete_table
Now fails immediately, saying Unsupported operation DeleteTable.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The structure's name in DescribeTable's output is supposed to be called
"Table", not "TableDescription". Putting in the wrong place caused the
driver's table creation waiters to fail.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
validate table name in CreateTable, and if it doesn't fit DynamoDB's
requirement, return the appropriate error as drivers expect.
With this patch, test_table.py::test_create_table_unsupported_names
now passes (albeit with a one minute pause - this a bug with keep-alive
support...).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Check the expected error message to contain just ValidationException
instead of an overly specific text message from DynamoDB, so we aren't
so constraint in our own messages' wording.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Dynamo allows tables names up to 255 characters, but when this is tested on
Alternator, the results are disasterous: mkdir with such a long directory
name fails, Scylla considers this an unrecoverable "I/O error", and exits
the server.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Start to use "test fixtures" defined in conftest.py: The connection to
the DynamoDB API, and also temporary tables, can be reused between multiple
tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This initial implementation is enough to pass a test of getting a
failure for a non-existant table -
test_table.py::test_describe_table_non_existent_table
and to recognize an existing table. But it's still missing a lot
of fields for an existing table (among others, the schema).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Exceptions from the handlers need to be output in a certain way - as
a JSON with specific fields - as DynamoDB drivers expect them to be.
If a handler throws an alternator::api_error with these specific fields,
they are output, but any other exception is converted into the same
format as an "Internal Error".
After this patch, executor code can throw an alternator::api_error and
the client will receive this error in the right format.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
DynamoDB error messages are returned in JSON format and expect specific
information: Some HTTP error code (often but not always 400), a string
error "type" and a user-readable message. Code that wants to return
user-visible exceptions should use this type, and in the next patch we
will translate it to the appropriate JSON string.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The "Timestamp" type returned for CreationDateTime can be one of several
things but if it is a number, it is supposed to be the time in *seconds*
since the epoch - not in milliseconds. Returning milliseconds as we
wrongly did causes boto3 (AWS's Python driver) to throw a parse exception
on this response.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Until now, we always opened the Alternator port along with Scylla's
regular ports (CQL etc.). This should really be made optional.
With this patch, by default Alternator does NOT start and does not
open a port. Run Scylla with --alternator-port=8000 to open an Alternator
API port on port 8000, as was the default until now. It's also possible
to set this in scylla.yaml.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The interface works on port 8000 by default and provides
the most basic alternator operations - it's an incomplete
set without validation, meant to allow testing as early as possible.
Some sstable objects correspond to sstables which are being written
and are not sealed yet. Such sstables don't have all the fields
filled-in. Tools which calculate statistics (like GDB scripts) need to
distinguish such sstables.
There is no reason to keep parts of the the Scylla Metadata component in memory
after it is read, parsed, and its information fed into the SSTable.
We have seen systems in which the Scylla metadata component is one
of the heaviest memory users, more than the Summary and Filter.
In particular, we use the token metadata, which is the largest part of the
Scylla component, to calculate a single integer -> the shards that are
responsible for this SSTable. Once we do that, we never use it again
Tests: unit (release/debug), + manual scylla write load + reshard.
Fixes#4951
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Introduce mutation_fragment_stream_validator class and use it as a
Filter to flat_mutation_reader::consume_in_thread from
sstable::write_components to validate partition region and optionally
clustering key monotonicity.
Fixes#4803
key monotonicity validation requires an overhead to store the last key and also to compare
therefore provide an option to enable/disable it (disabled by default).
Refs #4804
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Storing and comparing keys is expensive.
Add a flag to enable/disable this feature (disabled by default).
Without the flag, only the partition region monotonicity is
validated, allowing repeated clustering rows, regardless of
clustering keys.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The respective constructor is explicit.
Define this assignment operator to be used by flat_mutation_reader
mutation_fragment_stream_validator filter so that it can use
mutation_fragment::position() verbatim and keep its internal
state as position_in_partition.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Recently we started to use more the concept of metric labels - several
metrics which share the same name, but differ in the value of some label
such a "group" (for different scheduling groups).
This patch documents this feature in docs/metrics.md, gives the example of
scheduling groups, and explains a couple more relevant Promethueus syntax
tricks.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190909113803.15383-1-nyh@scylladb.com>
* seastar cb7026c16f...b3fb4aaab3 (10):
> Revert "scheduling groups: Adding per scheduling group data support"
> scheduling groups: Adding per scheduling group data support
> rpc: check that two servers are not created with the same streaming id
> future: really ignore exceptions in ignore_ready_future
> iostream: Constify eof() function
> apply.hh: add missing #include for size_t
> scheduling_group_demo: add explicit yields since future::get() no longer does
> Fix buffer size used when calling accept4()
> future-util: reduce allocations and continuations in parallel_for_each
> rpc: lz4_decompressor: Add a static constexpr variable decleration for Cpp14 compatibility
Previously, the gate could get
closed too early, which would result in shutting down the server
before it had an opportunity to respond to the client.
Refs #4818
"
The release notes for boost 1.67.0 includes:
Breaking Change: When converting a multiprecision integer to a narrower type, if the value is too large (or negative) to fit in the smaller type, then the result is either the maximum (or minimum) value of the target
Since we just moved out of boost 1.66, we have to update our code.
This fixes issue #4960
"
* 'espindola/fix-4960' of https://github.com/espindola/scylla:
types: fix varint to integer conversion
types: extract a from_varint_to_integer from make_castas_fctn_from_decimal_to_integer
types: fix decimal to integer conversion
types: extract helper for converting a decimal to a cppint
types: rename and detemplate make_castas_fctn_from_decimal_to_integer
"
With this patch series one has to be explicit to create a date_type_impl and now there is only the one documented difference between date_type_impl and timestamp_type_impl.
"
* 'espindola/simplify-date-type' of https://github.com/espindola/scylla:
types: Reduce duplication around date_type_impl
types: Don't use date_type_native_type when we want a timestamp
types: Remove timestamp_native_type
types: Don't specialize data_type_for for db_clock::time_point
types: Make it harder to create date_type
According to the comments, the only different between date_type_impl
and timestamp_type_impl is the comparison function.
This patch makes that explicit by merging all code paths except:
* The warning when converting between the two
* The compare function
The date_type_impl type can still be user visible via very old
sstables or via the thrift protocol. It is not clear if we still need
to support either, but with this patch it is easy to do so.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
In these cases it is pretty clear that the original code wanted to
create a timestamp_type data_value but was creating a date_type one
because of the old defaults.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Now that we know that anything expecting a date_type has been
converted to date_type_native_type, switch to using
db_clock::time_point when we want a timestamp_type.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This also moves every user to date_type_native_type. A followup patch
will convert to timestamp_type when appropriate.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
date_type was replaced with timestamp_type, but it was very easy to
create a date_type instead of a timestamp_type by accident.
This patch changes the code so that a date_type is no longer
implicitly used when constructing a data_value. All existing code that
was depending on this is converted to explicitly using
date_type_native_type. A followup patch will convert to timestamp_type
when appropriate.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Commit 7e3805ed3d removed the load balancing code from cql
server, but it did not remove most of the craft that load balancing
introduced. The most of the complexity (and probably the main reason the
code never worked properly) is around service::client_state class which
is copied before been passed to the request processor (because in the past
the processing could have happened on another shard) and then merged back
into the "master copy" because a request processing may have changed it.
This commit remove all this copying. The client_request is passed as a
reference all the way to the lowest layer that needs it and it copy
construction is removed to make sure nobody copies it by mistake.
tests: dev, default c-s load of 3 node cluster
Message-Id: <20190906083050.GA21796@scylladb.com>
"
This avoids a double dispatch on _kind and also removes a few shared_ptr copies.
The extra work was a small regression from the recent types refactoring.
"
* 'espindola/optimize_type_find' of https://github.com/espindola/scylla:
types: optimize type find implementation
types: Avoid shared_ptr copies
Currently when an error happens during the receive and distribute phase
it is swallowed and we just return a -1 status to the remote. We only
log errors that happen during responding with the status. This means
that when streaming fails, we only know that something went wrong, but
the node on which the failure happened doesn't log anything.
Fix by also logging errors happening in the receive and distribute
phase. Also mention the phase in which the error happened in both error
log messages.
Refs: #4901
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190903115735.49915-1-bdenes@scylladb.com>
The previous code was using the boost::multiprecision::cpp_int to
integer conversion, but that doesn't have the same semantics an cql
for signed numbers.
This fixes the dtest cql_cast_test.py:CQLCastTest.cast_varint_test.
Fixes#4960
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The previous code was using the boost::multiprecision::cpp_rational to
integer conversion, but that doesn't have the same semantics an cql.
This patch avoids creating a cpp_rational in the first place and works
just with integers.
This fixes the dtest cql_cast_test.py:CQLCastTest.cast_decimal_test.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
on_down() iterates over _view_update_handlers_list, but it yields during iteration,
and while it yields, elements in that list can be removed, resulting in a
use-after-free.
Prevent this by registering iterators that can be potentially invalidated, and
any time we remove an element from the list, check whether we're removing an element
that is being pointed to by a live iterator. If that is the case, advance the iterator
so that it points at a valid element (or at the end of the list).
Fixes#4912.
Tests: unit (dev)
Currently the background schema sync (push/pull) uses frozen mutation to
send the schema mutations over the wire to the remote node. For this to
work correctly, both nodes have to have the exact same schema for the
system schema tables, as attempting to unpack the frozen mutation with
the wrong schema leads to undefined behaviour.
To avoid this and to ensure syncing schema between nodes with different
schema table schema versions is defined we migrate the background
schema sync to use canonical mutations for the transfer of the schema
mutations. Canonical mutations are immune to this problem, as they
support deserializing with any version of the schema, older or newer
one.
The foreground schema sync mechanisms -- the on-demand schema pulls on
reads and writes -- already use canonical mutations to transmit the
schema mutations.
It is important to note that due to this change, column-level
incompatibilities between the schema mutations and the schema used to
deserialize them will be hidden. This is undesired and should be fixed
in a follow-up (#4956). Table level incompatibilities are detected and
schema mutations containing such mutations will be rejected just like before.
This patch adds canonical mutation support to the two background schema
sync verbs:
* `DEFINITIONS_UPDATE` (schema push)
* `MIGRATION_REQUEST` (schema pull)
Both verbs still support the old frozen mutation schema transfer, albeit
that path is now much less efficient. After all nodes are upgraded, the
pull verb can effectively avoid sending frozen mutations altogether,
completely migrating to canonical mutations. Unfortunately this was not
possible for the push verb, so that one now has an overhead as it needs
to send both the frozen and canonical mutations.
Fixes: #4273
The previous code was not exception safe and would eventually cause a
file to be destroyed without being closed, causing an assert failure.
Unfortunately it doesn't seem to be possible to test this without
error injection, since using an invalid directory fails before this
code is executed.
Fixes#4948
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190904002314.79591-1-espindola@scylladb.com>
The verbs are:
* DEFINITIONS_UPDATE (push)
* MIGRATION_REQUEST (pull)
Support was added in a backward-compatible way. The push verb, sends
both the old frozen mutation parameter, and the new optional canonical
mutation parameter. It is expected that new nodes will use the latter,
while old nodes will fall-back to the former. The pull verb has a new
optional `options` parameter, which for now contains a single flag:
`remote_supports_canonical_mutation_retval`. This flag, if set, means
that the remote node supports the new canonical mutation return value,
thus the old frozen mutations return value can be left empty.
In preparation to the schema push/pull migrating to use canonical
mutations, convert the method producing the schema mutations to return a
vector of canonical mutations. The only user, MIGRATION_REQUEST verb,
converts the canonical mutations back to frozen mutations. This is very
inefficient, but this path will only be used in mixed clusters. After
all nodes are upgraded the verb will be sending the canonical mutations
directly instead.
This turns find into a template so there is only one switch over the
kind of each type in the search.
To evaluate the change in code size sizes, I added [[noinline]] to
find and obtained the following results.
The release columns for release in the before case have an extra column
because the functions are sufficiently complex to trigger gcc to split
them in hot + cold.
before:
dev release (hot + cold split)
find 0x35f = 863 0x3d5 + 0x112 = 1255
references_duration 0x62 + 0x22 + 0x8 = 140 0x55 + 0x1f + 0x2a + 0x8 = 166
references_user_type 0x6b + 0x26 + 0x111 = 418 0x65 + 0x1f + 0x32 + 0x11b = 465
after:
dev release
find 0xd6 + 0x1b4 = 650 0xd2 + 0x1f5 = 711
references_duration 0x13 = 19 0x13 = 19
references_user_type 0x1a = 26 0x21 = 33
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
They are somewhat expensive (in code size at least) and not needed
everywhere.
Inside the getter the variables are 'const data_type&', so we can
return that. Everything still works when a copy is needed, but in code
that just wants to check a property we avoid the copy.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
During CQL request processing, a gate is used to ensure that
the connection is not shut down until all ongoing requests
are done. However, the gate might have been left too early
if the database was not ready to respond immediately - which
could result in trying to respond to an already closed connection
later. This issue is solved by postponing leaving the gate
until the continuation chain that handles the request is finished.
Refs #4808
* 'cleanup_sstables' of https://github.com/asias/scylla:
sstables: Move leveled_compaction_strategy implementation to source file
sstables: Include dht/i_partitioner.hh for dht::partition_range
Since nonroot mode requires to run everything on non-privileged user,
most of setup scripts does not able to use nonroot mode.
We only provide following functions on nonroot mode:
- EC2 check
- IO setup
- Node exporter installer
- Dev mode setup
Rest of functions will be skipped on scylla_setup.
To implement nonroot mode on setup scripts, scylla_util provides
utility functions to abstract difference of directory structure between normal
installation and nonroot mode.
Since systemd unit can override parameters using drop-in unit, we don't need
mustache template for them.
Also, drop --disttype and --target options on install.sh since it does not
required anymore, introduce --sysconfdir instead for non-redhat distributions.
Since ac9b115, we switched to install.sh on Debian so we don't rely on .deb
specific packaging scripts anymore.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Merged patch series by Amnon Heiman:
This patch fixes a bug that a map is held on the stack and then is used
by a future.
Instead, the map is now moved to the relevant lambda function.
Fixes#4824
Stopping services which occurs in a destructor of deferred_action
should not throw, or it will end the program with
terminate(). View builder breaks a semaphore during its shutdown,
which results in propagating a broken_semaphore exception,
which in turn results in throwing an exception during stop().get().
In order to fix that issue, semaphore exceptions are explicitly
ignored, since they're expected to appear during shutdown.
Fixes#4875
To prevent termination with SIGILL, tighten the instruction set
support checks. First, check for CLMUL too. Second, add a check in
scylla_prepare to catch the problem early.
Fixes#4921.
Scylla requires the CLMUL and SSE 4.2 instruction sets and will fail without them.
There is a check in main(), but that happens after the code is running and it may
already be too late. Add a check in scylla_prepare which runs before the main
executable.
"
It is well known that seastar applications, like Scylla, do not play
well with external processes: CPU usage from external processes may
confuse the I/O and CPU schedulers and create stalls.
We have also recently seen that memory usage from other application's
anonymous and page cache memory can bring the system to OOM.
Linux has a very good infrastructure for resource control contributed by
amazingly bright engineers in the form of cgroup controllers. This
infrastructure is exposed by SystemD in the form of slices: a
hierarchical structure to which controllers can be attached.
In true systemd way, the hierarchy is implicit in the filenames of the
slice files. a "-" symbol defines the hierarchy, so the files that this
patch presents, scylla-server and scylla-helper, essentially create a
"scylla" cgroup at the top level with "server" and "helper" children.
Later we mark the Services needed to run scylla as belonging to one
or the other through the Slice= directive.
Scylla DBAs can benefit from this setup by using the systemd-run
utility to fire ad-hoc commands.
Let's say for example that someone wants to hypothetically run a backup
and transfer files to an external object store like S3, making sure that
the amount of page cache used won't create swap pressure leading to
database timeouts.
One can then run something like:
sudo systemd-run --uid=id -u scylla --gid=id -g scylla -t --slice=scylla-helper.slice /path/to/my/magical_backup_tool
(or even better, the backup tool can itself be a systemd timer)
"
* 'slices' of https://github.com/glommer/scylla:
systemd: put scylla processes in systemd slices.
move postinst steps to an external script
"
The warning for discarded futures will only become useful, once we can
silence all present warnings and flip the flag to make it become error.
Then it will start being useful in finding new, accidental discarding of
futures.
This series silences all remaining warnings in the Scylla codebase. For
those cases where it was obvious that the future is discarded on
purpose, the author taking all necessary precaution (handling exception)
the warning was simply silenced by casting the future to void and
adding a relevant comment. Where the discarding seems to have been done
in error, I have fixed the code to not discard it. To the rest of the
sites I added a FIXME to fix the discarding.
"
* 'resolve-discarded-future-warnings/v4.2' of https://github.com/denesb/scylla:
treewide: silence discarded future warnings for questionable discards
treewide: silence discarded future warnings for legit discards
tests: silence discarded future warnings
tests/cql_query_test.cc: convert some tests to thread
This patches silences the remaining discarded future warnings, those
where it cannot be determined with reasonable confidence that this was
indeed the actual intent of the author, or that the discarding of the
future could lead to problems. For all those places a FIXME is added,
with the intent that these will be soon followed-up with an actual fix.
I deliberately haven't fixed any of these, even if the fix seems
trivial. It is too easy to overlook a bad fix mixed in with so many
mechanical changes.
This patch silences those future discard warnings where it is clear that
discarding the future was actually the intent of the original author,
*and* they did the necessary precautions (handling errors). The patch
also adds some trivial error handling (logging the error) in some
places, which were lacking this, but otherwise look ok. No functional
changes.
Some tests are currently discarding futures unjustifiably, however
adding code to wait on these futures is quite inconvenient due to the
continuation style code of these tests. Convert them to run in a seastar
thread to make the fix easier.
Introduced in c96ee98.
We call update_schema_version() after features are enabled and we
recalculate the schema version. This method is not updating gossip
though. The node will still use it's database::version() to decide on
syncing, so it will not sync and stay inconsistent in gossip until the
next schema change.
We should call updatE_schema_version_and_announce() instead so that
the gossip state is also updated.
There is no actual schema inconsistency, but the joining node will
think there is and will wait indefinitely. Making a random schema
change would unbock it.
Fixes#4647.
Message-Id: <1566825684-18000-1-git-send-email-tgrabiec@scylladb.com>
* seastar afc5bbf511...20bfd61955 (18):
> reactor: closing file used to check if direct_io is supported
> future: set_coroutine(): s/state()/_state/
> tests/perf/perf_test.hh: suppress discarded future warning
> tests: rpc: fix memory leak in timeout wraparound tests
> Revert "future-util: reduce allocations and continuations in parallel_for_each"
> reactor: fix rename_priority_class() build failure in C++14 mode
> future: mark future_state_base::failed() as unlikely
> future-util: reduce allocations and continuations in parallel_for_each
> future-utils: generalize when_all_estimate_vector_capacity()
> output_stream: Add comment on sequentiality
> docs/tutorial.md: minor cleanups in first section
> core: fix a race in execution stages (Fixes#4856, fixes#4766)
> semaphore: use semaphore's clock type in with_semaphore()/get_units()
> future: fix doxygen documentation for promise<>
> sharded: fixed detecting stop method when building with clang
> reactor: fixed locking error in rename_priority_class
> Assert that append_challenged_posix_file_impl are closed.
> rpc: correctly handle huge timeouts
Merged patch series from Amnon Heiman amnon@scylladb.com
This Patch adds an implementation of the get built index API and remove a
FIXME.
The API returns a list of secondary indexes belongs to a column family
and have already been fully built.
Example:
CREATE KEYSPACE scylla_demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
CREATE TABLE scylla_demo.mytableID ( uid uuid, text text, time timeuuid, PRIMARY KEY (uid, time) );
CREATE index on scylla_demo.mytableID (time);
$ curl -X GET 'http://localhost:10000/column_family/built_indexes/scylla_demo%3Amytableid'
["mytableid_time_idx"]
The sum_ratio struct is a helper struct that is used when calculating
ratio over multiple shards.
Originally it was created thinking that it may need to use future, in
practice it was never used and the future was ignore.
This patch remove the future from the implementation and reduce an
unhandle future warning from the compilation.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This Patch adds an implementation of the get build index API and remove a
FIXME.
The API returns the list of the built secondary indexes belongs to a column family.
Example:
CREATE KEYSPACE scylla_demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
CREATE TABLE scylla_demo.mytableID ( uid uuid, text text, time timeuuid, PRIMARY KEY (uid, time) );
CREATE index on scylla_demo.mytableID (time);
$ curl -X GET 'http://localhost:10000/column_family/built_indexes/scylla_demo%3Amytableid'
["mytableid_time_idx"]
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When a role is created through the `create role` statement, the
'is_superuser' and 'can_login' columns are set to false by default.
Likewise, `list roles`, `alter roles` and `* roles` operations
expect to find a boolean when reading the same columns.
This is not the case, though, when a user directly inserts to
`system_auth.roles` and doesn't set those columns. Even though
manually creating roles is not a desired day-to-day operation,
it is an insert just like any other and it should work.
`* roles` operations, on the other hand, are not prepared for
this deviations. If a user manually creates a role and doesn't
set boolean values to those columns, `* roles` will return all
sorts of errors. This happens because `* roles` is explicitly
expecting a boolean and casting for it.
This patch makes `* roles` more friendly by considering the
boolean variable `false` - inside `* roles` context - if the
actual value is `null`; it won't change the `null` value.
Fixes#4280
Signed-off-by: Juliana Oliveira <juliana@scylladb.com>
Message-Id: <20190816032617.61680-1-juliana@scylladb.com>
The scylla_blocktune.py has a FIXME for btrfs from 2016, which is no
longer relevant for Scylla deployments, as Red Hat dropped support for
the file system in 2017.
Message-Id: <20190823114013.31112-1-penberg@scylladb.com>
The priority class the shard reader was created with was hardcoded to be
`service::get_local_sstable_query_read_priority()`. At the time this
code was written, priority classes could not be passed to other shards,
so this method, receiving its priority class parameters from another
shard, could not use it. This is now fixed, so we can just use whatever
the caller wants us to use.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190823115111.68711-1-bdenes@scylladb.com>
Cartesian products (generated by IN restrictions) can grow very large,
even for short queries. This can overwhelm server resources.
Add limit checking for cartesian products, and configuration items for
users that are not satisfied with the default of 100 records fetched.
Fixes#4752.
Tests: unit (dev), manual test with SIGHUP.
Cartesian products (via IN restrictions) make it easy to generate huge
primary key sets with simple queries, overflowing server resources. Limit
them in the coordinator and report an exception instead of trying to
execute a query that would consume all of our memory.
A unit test is added.
We need a way to configure the cql interpreter and runtime. So far we relied
on accessing the configuration class via various backdoors, but that causes
its own problems around initialization order and testability. To avoid that,
this patch adds an empty cql_config class and propagates it from main.cc
(and from tests) to the cql interpreter via the query_options class, which is
already passed everywhere.
Later patches will fill it with contents.
This was broken since the type refactoring. It was checking the static
type, which is always abstract_type. Unfortunately we only had dtests
for this.
This can probably be optimized to avoid the double switch over kind,
but it is probably better to do the simple fix first.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190821155354.47704-1-espindola@scylladb.com>
Currently we create a regex from the LIKE pattern for every row
considered during filtering, even though the pattern is always the
same. This is wasteful, especially since we require costly
optimization in the regex compiler. Fix it by reusing the regex
whenever the pattern is unchanged since the last call.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
The loop over view update handlers used a guard in order to ensure
that the object is not prematurely destroyed (thus invalidating
the iterator), but the guard itself was not in the right scope.
Fixed by replacinga 'for' loop with a 'while' loop, which moves
the iterator incrementation inside the scope in which it's still
guarded and valid.
Fixes#4866
Currently, seastar is built in seastar/build/{mode}. This means we have two build
directories: build/{mode} and seastar/build/{mode}.
This patch changes that to have only a single build directory (build/{mode}). It
does that by calling Seastar's cmake directly instead of through Seastar's
./configure.py. However, to support dpdk, if that is enabled it calls cmake
through Seastar's ./cooking.sh (similar to what Seastar's ./configure.py does).
All ./configure.py flags are translated to cmake variables, in the same way that
Seastar does.
Contains fix from Rafael to pass the flags for the correct mode.
This clarifies that "rows" are clustering rows and that there is no
information about individual collection elements.
The patch also documents some properties common to all these tables.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190820171204.48739-1-espindola@scylladb.com>
The endpoint directories scanned by space_watchdog may get deleted
by the manager::drain_for().
If a deleted directory is given to a lister::scan_dir() this will end up
in an exception and as a result a space_watchdog will skip this round
and hinted handoff is going to be disabled (for all agents including MVs)
for the whole space_watchdog round.
Let's make sure this doesn't happen by serializing the scanning and deletion
using end_point_hints_manager::file_update_mutex.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
end_point_hints_manager::file_update_mutex is taken by space_watchdog
but while space_watchdog is waiting for it the corresponding
end_point_hints_manager instance may get destroyed by manager::drain_for()
or by manager::stop().
This will end up in a use-after-free event.
Let's change the end_point_hints_manager's API in a way that would prevent
such an unsafe locking:
- Introduce the with_file_update_mutex().
- Make end_point_hints_manager::file_update_mutex() method private.
Fixes#4685Fixes#4836
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
If drain_for() is running together with itself: one instance for the local
node and one for some other node, erasing of elements from the _ep_managers
map may lead to a use-after-free event.
Let's serialize drain_for() calls with a semaphore.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Our current relocation works by invoking the dynamic linker with the
executable as an argument. This confuses gdb since the kernel records
the dynamic linker as the executable, not the real executable.
Switch to install-time relocation with patchelf: when installing the
executable and libraries, all paths are known, and we can update the
path to the dynamic loader and to the dynamic libraries.
Since patchelf itself is dynamically linked, we have to relocate it
dynamically (with the old method of invoking it via the dynamic linker).
This is okay since it's a one-time operation and since we don't expect
to debug core dumps of patchelf crashes.
We lose the ability to run scylla directly from the uninstalled
tarball, but since the nonroot installer is already moving in the
direction of requiring install.sh, that is not a great loss, and
certainly the ability to debug is more important.
dh_strip barfs on some binaries which were treated with patchelf,
so exclude them from dh_strip. This doesn't lose any functionality,
since these binaries didn't have debug information to begin with
(they are already-stripped Fedora executables).
Fixes#4673.
Non-full prefix keys are currently not handled correctly as all keys
are treated as if they were full prefixes, and therefore they represent
a point in the key space. Non-full prefixes however represent a
sub-range of the key space and therefore require null extending before
they can be treated as a point.
As a quick reminder, `key` is used to trim the clustering ranges such
that they only cover positions >= then key. Thus,
`trim_clustering_row_ranges_to()` does the equivalent of intersecting
each range with (key, inf). When `key` is a prefix, this would exclude
all positions that are prefixed by key as well, which is not desired.
Fixes: #4839
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190819134950.33406-1-bdenes@scylladb.com>
"
Follow-up to #4610, where a review comment asked for test coverage on all types. Existing tests cover all the types admissible in LIKE, while this PR adds coverage for all inadmissible types.
Tests: unit (dev)
"
* 'like-nonstring' of https://github.com/dekimir/scylla:
cql_query_test: Add LIKE tests for all types
cql_query_test: Remove LIKE-nonstring-pattern case
cql_query_test: Move a testcase elsewhere in file
In b197924, we changed the shutdown process not to rely on the global
reactor-defined exit, but instead added a local variable to hold the
shutdown state. However, we did not propagate that state everywhere,
and now streaming processes are not able to abort.
Fix that by enhancing stop_signal with a sharded<abort_source> member
that can be propagated to services. Propagate it to storage_service
and thence to boot_strapper and range_streamer so that streaming
processes can be aborted.
Fixes#4674Fixes#4501
Tests: unit (dev), manual bootstrap test
"
Streamed view updates parasitized on writing io priority, which is
reserved for user writes - it's now properly bound to streaming
write priority.
Verified manually by checking appropriate io metrics: scylla_io_queue_total_bytes{class="streaming_write" ...} vs scylla_io_queue_total_bytes{class="query" ...}
Tests: unit(dev)
"
* 'assign_proper_io_priority_to_streaming_view_updates' of https://github.com/psarna/scylla:
db,view: wrap view update generation in stream scheduling group
database: assign proper io priority for streaming view updates
Our current relocation works by invoking the dynamic linker with the
executable as an argument. This confuses gdb since the kernel records
the dynamic linker as the executable, not the real executable.
Switch to install-time relocation with patchelf: when installing the
executable and libraries, all paths are known, and we can update the
path to the dynamic loader and to the dynamic libraries.
Since patchelf itself is dynamically linked, we have to relocate it
dynamically (with the old method of invoking it via the dynamic linker).
This is okay since it's a one-time operation and since we don't expect
to debug core dumps of patchelf crashes.
We lose the ability to run scylla directly from the uninstalled
tarball, but since the nonroot installer is already moving in the
direction of requiring install.sh, that is not a great loss, and
certainly the ability to debug is more important.
dh_strip barfs on some binaries which were treated with patchelf,
so exclude them from dh_strip. This doesn't lose any functionality,
since these binaries didn't have debug information to begin with
(they are already-stripped Fedora executables).
Fixes#4673.
It is well known that seastar applications, like Scylla, do not play
well with external processes: CPU usage from external processes may
confuse the I/O and CPU schedulers and create stalls.
We have also recently seen that memory usage from other application's
anonymous and page cache memory can bring the system to OOM.
Linux has a very good infrastructure for resource control contributed by
amazingly bright engineers in the form of cgroup controllers. This
infrastructure is exposed by SystemD in the form of slices: a
hierarchical structure to which controllers can be attached.
In true systemd way, the hierarchy is implicit in the filenames of the
slice files. a "-" symbol defines the hierarchy, so the files that this
patch presents, scylla-server and scylla-helper, essentially create a
"scylla" cgroup at the top level with "server" and "helper" children.
Later we mark the Services needed to run scylla as belonging to one
or the other through the Slice= directive.
Scylla DBAs can benefit from this setup by using the systemd-run
utility to fire ad-hoc commands.
Let's say for example that someone wants to hypothetically run a backup
and transfer files to an external object store like S3, making sure that
the amount of page cache used won't create swap pressure leading to
database timeouts.
One can then run something like:
```
sudo systemd-run --uid=`id -u scylla` --gid=`id -g scylla` -t --slice=scylla-helper.slice /path/to/my/magical_backup_tool
```
(or even better, the backup tool can itself be a systemd timer)
Changes from last version:
- No longer use the CPUQuota
- Minor typo fixes
- postinstall fixup for small machines
Benchmark results:
==================
Test: read from disk, with 100% disk util using a single i3.xlarge (4 vCPUs).
We have to fill the cache as we read, so this should stress CPU, memory and
disk I/O.
cassandra-stress command:
```
cassandra-stress read no-warmup duration=5m -rate threads=20 -node 10.2.209.188 -pop dist=uniform\(1..150000000\)
```
Baseline results:
```
Results:
Op rate : 13,830 op/s [READ: 13,830 op/s]
Partition rate : 13,830 pk/s [READ: 13,830 pk/s]
Row rate : 13,830 row/s [READ: 13,830 row/s]
Latency mean : 1.4 ms [READ: 1.4 ms]
Latency median : 1.4 ms [READ: 1.4 ms]
Latency 95th percentile : 2.4 ms [READ: 2.4 ms]
Latency 99th percentile : 2.8 ms [READ: 2.8 ms]
Latency 99.9th percentile : 3.4 ms [READ: 3.4 ms]
Latency max : 12.0 ms [READ: 12.0 ms]
Total partitions : 4,149,130 [READ: 4,149,130]
Total errors : 0 [READ: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:05:00
```
Question 1:
===========
Does putting scylla in a special slice affect its performance ?
Results with Scylla running in a slice:
```
Results:
Op rate : 13,811 op/s [READ: 13,811 op/s]
Partition rate : 13,811 pk/s [READ: 13,811 pk/s]
Row rate : 13,811 row/s [READ: 13,811 row/s]
Latency mean : 1.4 ms [READ: 1.4 ms]
Latency median : 1.4 ms [READ: 1.4 ms]
Latency 95th percentile : 2.2 ms [READ: 2.2 ms]
Latency 99th percentile : 2.6 ms [READ: 2.6 ms]
Latency 99.9th percentile : 3.3 ms [READ: 3.3 ms]
Latency max : 23.2 ms [READ: 23.2 ms]
Total partitions : 4,151,409 [READ: 4,151,409]
Total errors : 0 [READ: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:05:00
```
*Conclusion* : No significant change
Question 2:
===========
What happens when there is a CPU hog running in the same server as scylla?
CPU hog:
```
taskset -c 0 /bin/sh -c "while true; do true; done" &
taskset -c 1 /bin/sh -c "while true; do true; done" &
taskset -c 2 /bin/sh -c "while true; do true; done" &
taskset -c 3 /bin/sh -c "while true; do true; done" &
sleep 330
```
Scenario 1: CPU hog runs freely:
```
Results:
Op rate : 2,939 op/s [READ: 2,939 op/s]
Partition rate : 2,939 pk/s [READ: 2,939 pk/s]
Row rate : 2,939 row/s [READ: 2,939 row/s]
Latency mean : 6.8 ms [READ: 6.8 ms]
Latency median : 5.3 ms [READ: 5.3 ms]
Latency 95th percentile : 11.0 ms [READ: 11.0 ms]
Latency 99th percentile : 14.9 ms [READ: 14.9 ms]
Latency 99.9th percentile : 17.1 ms [READ: 17.1 ms]
Latency max : 26.3 ms [READ: 26.3 ms]
Total partitions : 884,460 [READ: 884,460]
Total errors : 0 [READ: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:05:00
```
Scenario 2: CPU hog runs inside scylla-helper slice
```
Results:
Op rate : 13,527 op/s [READ: 13,527 op/s]
Partition rate : 13,527 pk/s [READ: 13,527 pk/s]
Row rate : 13,527 row/s [READ: 13,527 row/s]
Latency mean : 1.5 ms [READ: 1.5 ms]
Latency median : 1.4 ms [READ: 1.4 ms]
Latency 95th percentile : 2.4 ms [READ: 2.4 ms]
Latency 99th percentile : 2.9 ms [READ: 2.9 ms]
Latency 99.9th percentile : 3.8 ms [READ: 3.8 ms]
Latency max : 18.7 ms [READ: 18.7 ms]
Total partitions : 4,069,934 [READ: 4,069,934]
Total errors : 0 [READ: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:05:00
```
*Conclusion*: With systemd slice we can keep the performance very close to
baseline
Question 3:
===========
What happens when there is a CPU hog running in the same server as scylla?
I/O hog: (Data in the cluster is 2x size of memory)
```
while true; do
find /var/lib/scylla/data -type f -exec grep glauber {} +
done
```
Scenario 1: I/O hog runs freely:
```
Results:
Op rate : 7,680 op/s [READ: 7,680 op/s]
Partition rate : 7,680 pk/s [READ: 7,680 pk/s]
Row rate : 7,680 row/s [READ: 7,680 row/s]
Latency mean : 2.6 ms [READ: 2.6 ms]
Latency median : 1.3 ms [READ: 1.3 ms]
Latency 95th percentile : 7.8 ms [READ: 7.8 ms]
Latency 99th percentile : 10.9 ms [READ: 10.9 ms]
Latency 99.9th percentile : 16.9 ms [READ: 16.9 ms]
Latency max : 40.8 ms [READ: 40.8 ms]
Total partitions : 2,306,723 [READ: 2,306,723]
Total errors : 0 [READ: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:05:00
```
Scenario 2: I/O hog runs in the scylla-helper systemd slice:
```
Results:
Op rate : 13,277 op/s [READ: 13,277 op/s]
Partition rate : 13,277 pk/s [READ: 13,277 pk/s]
Row rate : 13,277 row/s [READ: 13,277 row/s]
Latency mean : 1.5 ms [READ: 1.5 ms]
Latency median : 1.4 ms [READ: 1.4 ms]
Latency 95th percentile : 2.4 ms [READ: 2.4 ms]
Latency 99th percentile : 2.9 ms [READ: 2.9 ms]
Latency 99.9th percentile : 3.5 ms [READ: 3.5 ms]
Latency max : 183.4 ms [READ: 183.4 ms]
Total partitions : 3,984,080 [READ: 3,984,080]
Total errors : 0 [READ: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:05:00
```
*Conclusion*: With systemd slice we can keep the performance very close to
baseline
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Propagate the abort_source from main() into boot_strapper and range_stream and
check for aborts at strategic points. This includes aborting running stream_plans
and aborting sleeps between retries.
Fixes#4674
In order to propagate stop signals, expose them as sharded<abort_source>. This
allows propagating the signal to all shards, and integrating it with
sleep_abortable().
Because sharded<abort_source>::stop() will block, we'll now require stop_signal
to run in a thread (which is already the case).
"
This is hopefully the last large refactoring on the way of UDF.
In UDF we have to convert internal types to Lua and back. Currently
almost all our types and hidden in types.cc and expose functionality
via virtual functions. While it should be possible to add a
convert_{to|from}_lua virtual functions, that seems like a bad design.
In compilers, the type definition is normally public and different
passes know how to reason about each type. The alias analysis knows
about int and floats, not the other way around.
This patch series is inspired by both the LLVM RTTI
(https://www.llvm.org/docs/HowToSetUpLLVMStyleRTTI.html) and
std::variant.
The series makes the types public, adds a visit function and converts
the various virtual methods to just use visit. As a small example of
why this is useful, it then moves a bit of cql3 and json specific
logic out of types.cc and types.hh. In a similar way, the UDF code
will be able to used visit to convert objects to Lua.
In comparison with the previous versions, this series doesn't require the intermediate step of converting void* to data_value& in a few member functions.
This version also has fewer double dispatches I a am fairly confident has all the tools for avoiding all double dispatches.
"
* 'simplify-types-v3' of https://github.com/espindola/scylla: (80 commits)
types: Move abstract_type visit to a header
types: Move uuid_type_impl to a header
types: Move inet_addr_type_impl to a header
types: Move varint_type_impl to a header
types: Move timeuuid_type_impl to a header
types: Move date_type_impl to a header
types: Move bytes_type_impl to a header
types: Move utf8_type_impl to a header
types: Move ascii_type_impl to a header
types: Move string_type_impl to a header
types: Move time_type_impl to a header
types: Move simple_date_type_impl to a header
types: Move timestamp_type_impl to a header
types: Move duration_type_impl to a header
types: Move decimal_type_impl to a header
types: Move floating point types to a header
types: Move boolean_type_impl to a header
types: Move integer types to a header
types: Move integer_type_impl to a header
types: Move simple_type_impl to a header
...
This testcase was previously commented out, pending a fix that cannot
be made. Currently it is impossible to validate the marker-value type
at filtering time. The value is entered into the options object under
its presumed type of string, regardless of what it was made from.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Somehow this test case sits in the middle of LIKE-operator tests:
test_alter_type_on_compact_storage_with_no_regular_columns_does_not_crash
Move it so LIKE test cases are contiguous.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
There are systemd-related steps done in both rpm and deb builds.
Move that to a script so we avoid duplication.
The tests are so far a bit specific to the distributions, so it
needs to be adapted a bit.
Also note that this also fixes a bug with rpm as a side-effect:
rpm does not call daemon-reload after potentially changing the
systemd files (it is only implied during postun operations, that
happen during uninstall). daemon-reload was called explicitly for
debian packages, and now it is called for both.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The following logic was duplicated:
* For all types, if value is null, the result is zero.
* For non collection types, if the native object is empty, the result
is zero.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We defined less for some types and compare for others. There is no
type for which compare is substantially more expensive, so define it
for all types and implement less with compare.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The type walking is similar to what the find function does, but
refactoring it doesn't seem worth it if these are the only two uses.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this patch the logic for walking all nested types is moved to a
helper function. It also fixes reversed_type_impl not being handled in
references_duration.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The api is inspired by on std::variant.
This bridges the runtime type of a abstract_type object to a compile
time overload resolution. For example, it is possible to have a single
lambda to visit a string_type_impl, but it corresponds to two leaf
types (ascii and utf8).
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The type hierarchy is closed, so we can give each leaf an enum value.
This will be used to implement a visitor pattern and reduce code
duplication.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The corresponding code is correct, but I noticed no tests would fail
if it was broken while refactoring it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These only use public member functions from data_value, so there is no
reason for not making them public too.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
There are two reasons for the move. First is that cql server lifetime
is shorter than storage_proxy one and the later stores references to
the semaphore in each service_permit it holds. Second - we want thrift
(and in the future other user APIs) to share the same admission control
memory pool.
Fixes#4844
Message-Id: <20190814142614.GT17984@scylladb.com>
We in fact do stop cql server in storage_service::drain_on_shutdown()
which is called in main.cc during shutdown.
Message-Id: <20190814085027.GP17984@scylladb.com>
Since we do accept pull requests (in a long-running experiment), the
pull request template suggesting not to use them is inaccurate, and
many requesters forget to remove the boilerplace.
Remove the outdate template.
"Commit e3f7fe4 added file owner validation to prevent Scylla from
crashing when it tries to touch a file it doesn't own. However, under
docker, we cannot expect to pass this check since user IDs are from
different namespaces: the process runs in a container namespace, but the
data files usually come from a mounted volume, and so their uids are
from the host namespace.
So we need to relax the check. We do this by reverting b1226fb, which
causes Scylla to run as euid 0 in docker, and by special-casing euid 0
in the ownership verification step.
Fixes #4823."
* 'docker-euid-0' of git://github.com/avikivity/scylla:
main: relax file ownership checks if running under euid 0
Revert "dist/docker/redhat: change user of scylla services to 'scylla'"
During startup, we check that the data files are owned by our euid.
But in a container environment, this is impossible to enforce because
uid/username mappings are different between the host and the container,
and the data files are likely to be mounted from the host.
To allow for such environments, relax the checks if euid=0. This
both matches what happens in a container (programs run as root) and
the kernel access checks (euid 0 can do anything).
We can reconsider this when container uid mapping is better developed.
Fixes#4823.
Fixes#4536.
This reverts commit b1226fb15a. When the
data volume is mounted from the host (as is usual in container
deployments), we can't expect that the files will be owned by the
in-container scylla user. So that commit didn't really fix#4536.
A follow-up patch will relax the check so it passes in a container
environment.
This adds a '--configure-flags FLAGS' command line option, which
overrides the flags passed to scylla.git 'configure.py' script. We need
this for flexibility of custom builds in Jenkins pipelines, for example.
Message-Id: <20190813095428.13590-1-penberg@scylladb.com>
Make the reader recreation logic more robust, by moving away from
deciding which fragments have to be dropped based on a bunch of
special cases, instead replacing this with a general logic which just
drops all already seen fragments (based on their position). Special
handling is added for the case when the last position is a range
tombstone with a non full prefix starting position. Reproducer unit
tests are added for both cases.
Refs #4695Fixes#4733
We should wait for a future returned from add_local_application_state() to
resolve before issuing new calculation, otherwise two
add_local_application_state() may run simultaneously for the same state.
Fixes#4838.
Message-Id: <20190812082158.GE17984@scylladb.com>
Add `single_fragment_buffer` test variable. When set, the shard readers
are created with a max buffer size of 1, effectively causing them to
read a single fragment at a time. This, when combined with
`evict_readers=true` will stress the recreate reader logic to the max.
Be more tolerant with different but equivalent representation of range
deletions. When expecting a range tombstone, keep reading range
tombstones while these can be merged with the cumulative range
tombstone, resulting from the merging of the previous range tombstones.
This results in tests tolerating range tombstones that are split into
several, potentially overlapping range tombstones, representing the
same underlying deletion.
This is tailored to the multishard_combining_reader, to make sure it
doesn't loos rows following a range tombstone with a prefix starting
position (whose prefix their keys fall into).
Currently the remote reader uses the last seen fragment's position to
calculate the position the reader should continue from when the reader
is recreated after having been evicted. Recently it was discovered that
this logic breaks down badly when this last position is a non-full
clustering prefix (a range tombstone start bound). In this case, if only
the last position is available, there is no good way of computing the
starting position. Starting after this position will potentially miss
any rows that fall into the prefix (the current behaviour). Starting
from before it will cause all range tombstones with said prefix to be
re-emitted, causing other problems. A better solution is to exploit the
fact that sometimes we also know what the next fragment is.
These "some" times are the exact times that are problematic with the
current approach -- when the last fragment is a range tombstone.
Exploiting this extra knowledge allows for a much better way for
calculating the starting position: instead of maintaining the last
position, we maintain the next position, which is always safe to start
from. This is not always possible, but in many cases we can know for
sure what the next position is, for example if the last position was a
static row we can be sure the next position is the first clustering
position (or partition end). In the few cases where we cannot calculate
the next position we fall back to the previous logic and start from
*after* the last positions. The good news is that in these remaining
cases (the last fragment is a clustering row) it is safe to do so.
This patch also does some refactoring of the remote-reader internals,
all fill-buffer related logic is grouped together in a single
`fill_buffer()` method.
Allow expressing `pos` in term of a `position_in_partition_view`, which
allows finer control of the exact position, allowing specifying position
before, at or after a certain key.
The previous overload is kept for backward compatibility, invoking the
new overload behind the curtains.
"
Refs #4726
Implement the api portion of a "describe sstables" command.
Adds rest types for collecting both fixed and dynamic attributes, some grouped. Allows extensions to add attributes as well. (Hint hint)
"
* 'sstabledesc' of https://github.com/elcallio/scylla:
api/storage_service: Add "sstable_info" command
sstables/compress: Make compressor pointer accessible from compression info
sstables.hh: Add attribute description API to file extension
sstables.hh: Add compression component accessor
sstables.hh: Make "has_component" public
Match SQL's LIKE in allowing an empty pattern, which matches only
an empty text field.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
This simplifies the debug implementation and it now should work with
scylla-gdb.py.
It is not clear what, if anything, is lost by not using random
ids. They were never being reused in the debug implementation anyway.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190618144755.31212-1-espindola@scylladb.com>
This patch fixes a bug that a map is held on the stack and then is used
by a future.
Instead, the map is now wrapped with do_with.
Fixes#4824
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
After commit 8a0c4d5 (Merge "Repair switch to rpc stream" from Asias),
we increased the row buffer size for repair from 512KiB to 32MiB per
repair instance. We allow repairing 16 ranges (16 repair instance) in
parallel per repair request. So, a node can consume 16 * 32MiB = 512MiB
per user requested repair. In addition, the repair master node can hold
data from all the repair followers, so the memory usage on repair master
can be larger than 512MiB. We need to provide a way to limit the memory
usage.
In this patch, we limit the total memory used by repair to 10% of the
shard memory. The ranges that can be repaired in parallel is:
max_repair_ranges_in_parallel = max_repair_memory / max_repair_memory_per_range.
For example, if each shard has 4096MiB of memory, then we will have
max_repair_ranges_in_parallel = 4096MiB / 32MiB = 12.
Fixes#4675
"
Current admission control takes a permit when cql requests starts and
releases it when reply is sent, but some requests may leave background
work behind after that point (some because there is genuine background
work to do like complete a write or do a read repair, and some because
a read/write may stuck in a queue longer than the request's timeout), so
after Scylla replies with a timeout some resources are still occupied.
The series fixes this by passing the permit down to storage_proxy where
it is held until all background work is completed.
Fixes#4768
"
* 'gleb/admission-v3' of github.com:scylladb/seastar-dev:
transport: add a metric to follow memory available for service permit.
storage_proxy: store a permit in a read executor
storage_proxy: store a permit in a write response handler
Pass service permit to storage_proxy
transport: introduce service_permit class and use it instead of semaphore_units
transport: hold admission a permit until a reply is sent
transport: remove cql server load balancer
A read executor exists until read operation completes in its entirety
so storing a permit there guaranties that it will be freed only after
no background work left for the request on this server.
A write response handler exists until write operation completes in its
entirety so storing a permit there guaranties that it will be freed only
after no background work left for the request on this server.
Current cql transport code acquire a permit before processing a query and
release it when the query gets a reply, but some quires leave work behind.
If the work is allowed to accumulate without any limit a server may
eventually run out of memory. To prevent that the permit system should
account for the background work as well. The patch is a first step in
this direction. It passes a permit down to storage proxy where it will
be later hold by background work.
Manager trims sstables off to allow compaction jobs to proceed in parallel
according to their weights. The problem is that trimming procedure is not
sstable run aware, so it could incorrectly remove only a subset of a sstable
run, leading to partial sstable run compaction.
Compaction of a sstable run could lead to inneficiency because the run structure
would be messed up, affecting all amplification factors, and the same generation
could even end up being compacted twice.
This is fixed by making the trim procedure respect the sstable runs.
Fixes#4773.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190730042023.11351-1-raphaelsc@scylladb.com>
Current code release admission permit to soon. If new requests are
admitted faster than client read replies back reply queue can grow to
be very big. The patch moves service permit release until after a reply
is sent.
Merged patch series from Avi Kivity:
In rare but valid cases (reconciling many tombstones, paging disabled),
a reconciled_result can grow large. This triggers large allocation
warnings. Switch to chunked_vector to avoid the large allocation.
In passing, fix chunked_vector's begin()/end() const correctness, and
add the reverse iterator function family which is needed by the conversion.
Fixes#4780.
Tests: unit (dev)
Commit Summary
utils: chunked_vector: make begin()/end() const correct
utils::chunked_vector: add rbegin() and related iterators
reconcilable_result: use chunked_vector to hold partitions
"
Add listen/rpc "prefer_ipv6" options to DNS lookup of bind addresses for API/rpc/prometheus etc .
Fixes#4751
Adds using a preferred address family to dns name lookups related to
listen address and rpc address, adhering to the respective "prefer" options.
API, prometheus and broadcast address are all considered to be covered by
the "listen_interface_prefer_ipv6" option.
Note: scylla does not yet support actual interface binding, but these
options should apply equally to address name parameters.
Setting a "prefer_ipv6" option automtially enables ipv6 dns family query.
"
* 'calle/ipv6' of https://github.com/elcallio/scylla:
init: Use the "prefer_ipv6" options available for rpc/listen address/interface
inet_address: Add optional "preferred type" to lookup
config: Add rpc_interface_prefer_ipv6 parameter
config: Add listen_interface_perfer_ipv6 parameter
config.cc: Fix enable_ipv6_dns_lookup actual param name
The FIXME was added in the very first commit ("utils: Convert
utils/FBUtilities.java") that introduced the fb_utilities class as a
stub. However, we have long implemented the parts that we actually use,
so drop the FIXME as obsolete. In addition, drop the remaining
uncommented Java code as unused and also obsolete.
Message-Id: <20190808182758.1155-1-penberg@scylladb.com>
The Maven build tool ("mvn"), which is used by scylla-jmx and
scylla-tools-java, stores dependencies in a local repository stored at
$HOME/.m2. Make sure it's accessible to dbuild.
Message-Id: <20190808140216.26141-1-penberg@scylladb.com>
ignore_partial_runs() brings confusion because i__p__r() equal to true
doesn't mean filter out partial runs from compaction. It actually means
not caring about compaction of a partial run.
The logic was wrong because any compaction strategy that chooses not to ignore
partial sstable run[1] would have any fragment composing it incorrectly
becoming a candidate for compaction.
This problem could make compaction include only a subset of fragments composing
the partial run or even make the same fragment be compacted twice due to
parallel compaction.
[1]: partial sstable run is a sstable that is still being generated by
compaction and as a result cannot be selected as candidate whatsoever.
Fix is about making sure partial sstable run has none of its fragments
selected for compaction. And also renaming i__p__r.
Fixes#4729.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190807022814.12567-1-raphaelsc@scylladb.com>
This documents the steps needed to build Scylla's Linux packages with
the relocatable package infrastructure we use today.
Message-Id: <20190807134017.4275-1-penberg@scylladb.com>
Running "dbuild" without a build command fails as follows:
$ ./tools/toolchain/dbuild
Error: This command has to be run under the root user.
Israel Fruchter discovered that the default command of our Docker image is this:
"Cmd": [
"bash",
"-c",
"dnf -y install python3-cassandra-driver && dnf clean all"
]
Let's make "/bin/bash" the default command instead, which will make
"dbuild" with no build command to return to the host shell.
Message-Id: <20190807133955.4202-1-penberg@scylladb.com>
The build_reloc.sh script passes "--with=scylla" and "--with=iotune" to
the configure.py script. This is redundant as the
"scylla-package.tar.gz" target of ninja already limits itself to them.
Removing the "--with" options allows building unit tests after a
relocatable package has been built without having to rebuild anything.
Message-Id: <20190807130505.30089-1-penberg@scylladb.com>
Add a '--mode <mode>' command line option to the 'build_reloc.sh' script
so that we can create relocatable packages for debug builds.
The '--mode' command line option defaults to 'release' so existing users
are unaffected.
Message-Id: <20190807120759.32634-1-penberg@scylladb.com>
Avoid including the lengthy stream_session.hh in messaging_service.
More importantly, fix the build because currently messaging_service.cc
and messaging_service.hh does not include stream_mutation_fragments_cmd.
I am not sure why it builds on my machine. Spotted this when backporting
the "streaming: Send error code from the sender to receiver" to 3.0
branch.
Refs: #4789
The stream close() guarantees the data sent will be flushed. No need to
call the stream flush() since the stream is not reused.
Follow up fix for commit bac987e32a (streaming: Send error code from
the sender to receiver).
Refs #4789
* seastar d199d27681...a1cf07858b (1):
> Merge 'Do not return a variadic future form server_socket::accept()' from Avi
Seastar configure.py now has --api-level=1, to keep us one the old variadic future
server_socket::accept() API.
In case of error on the sender side, the sender does not propagate the
error to the receiver. The sender will close the stream. As a result,
the receiver will get nullopt from the source in
get_next_mutation_fragment and pass mutation_fragment_opt with no value
to the generating_reader. In turn, the generating_reader generates end
of stream. However, the last element that the generating_reader has
generated can be any type of mutation_fragment. This makes the sstable
that consumes the generating_reader violates the mutation_fragment
stream rule.
To fix, we need to propagate the error. However RPC streaming does not
support propagate the error in the framework. User has to send an error
code explicitly.
Fixes: #4789
Fixes#4751
Adds using a preferred address family to dns name lookups related to
listen address and rpc address, adhering to the respective "prefer" options.
API, prometheus and broadcast address are all considered to be covered by
the "listen_interface_prefer_ipv6" option.
Note: scylla does not yet support actual interface binding, but these
options should apply equally to address name parameters.
Setting a "prefer_ipv6" option automtially enables ipv6 dns family query.
Assembles information and attributes of sstables in one or more
column families.
v2:
* Use (not really legal) nested "type" in json
* Rename "table" param to "cf" for consistency
* Some comments on data sizes
* Stream result to avoid huge string allocations on final json
"
This adds the option to compress sstables using the Zstandard algorithm
(https://facebook.github.io/zstd/).
To use, pass 'sstable_compression': 'org.apache.cassandra.io.compress.ZstdCompressor'
to the 'compression' argument when creating a table.
You can also specify a 'compression_level' (default is 3). See Zstd documentation for the available
compression levels.
Resolves#2613.
This PR also fixes a bug in sstables/compress.cc, where chunk length in bytes
was passed to the compressor as chunk length in kilobytes. Fortunately,
none of the compressors implemented until now used this parameter.
Example usage (assuming there exists a keyspace 'a'):
create table a.a (a text primary key, b int) with compression = {'sstable_compression': 'org.apache.cassandra.io.compress.ZstdCompressor', 'compression_level': 1, 'chunk_length_in_kb': '64'};
Notes:
1. The code uses an external dependency: https://github.com/facebook/zstd. Since I'm using "experimental" features of the library (using my own allocated memory to store the compression/decompression contexts), according to the library's documentation we need to link it statically (https://github.com/facebook/zstd/blob/dev/lib/zstd.h#L63). I added a git submodule.
2. The compressor performs some dynamic allocations. Depending on the specified chunk length and/or compression level the allocations might be big and seastar throws warnings. But with reasonable chunk length sizes it should be OK.
3. It doesn't yet provide an option to train it with dictionaries, but that should be easy to add in another commit.
"
* 'zstd' of https://github.com/kbr-/scylla:
Configure: rename seastar_pool to submodule_pool, add more submodules to the pool
Add unit tests for Zstd compression
Enable tests that use compressed sstable files
Add ZStandard compression
Fix the value of the chunk length parameter passed to compressors
The fragment reader calls `fast_forward_to()` from its constructor to
discard fragments that fall outside the query range. Mmove the
the fast-forward code in to an internal void returning method, and call
that from both the constructor and `fast_forward_to()`, to avoid a
warning on a discarded future<>.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190801133942.10744-1-bdenes@scylladb.com>
The files in tests/sstables/3.x/compressed/ were not used in the tests.
This commit:
- renames the directory to tests/sstables/3.x/lz4/,
- adds analogous directories and files for other compressors,
- adds tests using these files,
- does some minor refactoring.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
This adds the option to compress sstables using the Zstandard algorithm
(https://facebook.github.io/zstd/).
To use, pass 'sstable_compression': 'org.apache.cassandra.io.compress.ZstdCompressor'
to the 'compression' argument when creating a table.
You can also specify a 'compression_level'. See Zstd documentation for the available
compression levels.
Resolves#2613.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
This commit also fixes a bug in sstables/compress.cc, where chunk length in bytes
was passed to the compressor as chunk length in kilobytes. Fortunately,
none of the compressors implemented until now used this parameter.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
"
Not emitting partition_end for a partition is incorrect. SStable
writer assumes that it is emitted. If it's not, the sstable will not
be written correctly. The partition index entry for the last partition
will be left partially written, which will result in errors during
reads. Also, statistics and sstable key ranges will not include the
last partition.
It's better to catch this problem at the time of writing, and not
generate bad sstables.
Another way of handling this would be to implicitly generate a
partition_end, but I don't think that we should do this. We cannot
trust the mutation stream when invariants are violated, we don't know
if this was really the last partition which was supposed to be
written. So it's safer to fail the write.
Enabled for both mc and la/ka.
Passing --abort-on-internal-error on the command line will switch to
aborting instead of throwing an exception.
The reason we don't abort by default is that it may bring the whole
cluster down and cause unavailability, while it may not be necessary
to do so. It's safer to fail just the affected operation,
e.g. repair. However, failing the operation with an exception leaves
little information for debugging the root cause. So the idea is that the
user would enable aborts on only one of the nodes in the cluster to
get a core dump and not bring the whole cluster down.
"
* 'catch-unclosed-partition-sstable-write' of https://github.com/tgrabiec/scylla:
sstables: writer: Validate that partition is closed when the input mutation stream ends
config, exceptions: Add helper for handling internal errors
utils: config_file: Introduce named_value::observe()
When a node is restarted, there is a race between gossip starts (other
nodes will mark this node up again and send requests) and the tokens are
replicated to other shards. Here is an example:
- n1, n2
- n2 is down, n1 think n2 is down
- n2 starts again, n2 starts gossip service, n1 thinks n2 is up and sends
reads/writes to n2, but n2 hasn't replicated the token_metadata to all
the shards.
- n2 complains:
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
storage_proxy - Failed to apply mutation from $ip#4: std::runtime_error
(sorted_tokens is empty in first_token_index!)
The code path looks like below:
0 stoarge_service::init_server
1 prepare_to_join()
2 add gossip application state of NET_VERSION, SCHEMA and so on.
3 _gossiper.start_gossiping().get()
4 join_token_ring()
5 _token_metadata.update_normal_tokens(tokens, get_broadcast_address());
6 replicate_to_all_cores().get()
7 storage_service::set_gossip_tokens() which adds the gossip application state of TOKENS and STATUS
The race talked above is at line 3 and line 6.
To fix, we can replicate the token_metadata early after it is filled
with the tokens read from system table before gossip starts. So that
when other nodes think this restarting node is up, the tokens are
already replicated to all the shards.
In addition, this patch also fixes the issue that other nodes might see
a node miss the TOKENS and STATUS application state in gossip if that
node failed in the middle of a restarting process, i.e., it is killed
after line 3 and before line 7. As a result we could not replace the
node.
Tests: update_cluster_layout_tests.py
Fixes: #4709Fixes: #4723
"
Altough 733c68cb1 made sure to synchronize the query time used for
compaction happening in the mutation_source_test suite and that
happening in the `flat_mutation_assertions` class, there remained
another hidden compaction that potentially could use a different
timestamp and hence produce false positive test failures. This was
hastily fixed by cea3338e3, by just increasing the TTL of cells, thus
avoiding possible differences in compaction output. This mini-series is
the proper fix to this problem. It passes a query time to the populate
function, allowing the users of the mutation source test suite to
forward it to any compaction they might be doing on the data. The quick
fix is reverted in favor of the proper fix.
Refs: #4747
"
* 'mutation_source_tests_proper_ttl_fix/v1' of https://github.com/denesb/scylla:
Revert "tests/mutation_source_tests: generate_mutation_sets() use larger ttl"
tests/sstable_mutation_test: test_sstable_conforms_to_mutation_source: use query_time
tests/mutation_source_test: add populate_fn overload with query_time
Not emitting partition_end for a partition is incorrect. Sstable
writer assumes that it is emitted. If it's not, the sstable will not
be written correctly. The partition index entry for the last partition
will be left partially written, which will may result in errors during
reads. Also, statistics and sstable key ranges will not include the
last partition.
It's better to catch this problem at the time of writing, and not
generate bad sstables.
Another way of handling this would be to implicitly generate a
partition_end, but I don't think that we should do this. We cannot
trust the mutation stream when invariants are violated, we don't know
if this was really the last partition which was supposed to be
written. So it's safer to fail the write.
Enabled for both mc and la/ka.
The handler is intended to be called when internal invariants are
violated and the operation cannot safely continue. The handler either
throws (default) or aborts, depending on configuration option.
Passing --abort-on-internal-error on the command line will switch to
aborting.
The reason we don't abort by default is that it may bring the whole
cluster down and cause unavailability, while it may not be necessary
to do so. It's safer to fail just the affected operation,
e.g. repair. However, failing the operation with an exception leaves
little information for debugging the root cause. So the idea is that the
user would enable aborts on only one of the nodes in the cluster to
get a core dump and not bring the whole cluster down.
Usually, a reconcilable_result holds very few partitions (1 is common),
since the page size is limited by 1MB. But if we have paging disabled or
if we are reconciling a range full of tombstones, we may see many more.
This can cause large allocations.
Change to chunked_vector to prevent those large allocations, as they
can be quite expensive.
Fixes#4780.
begin() of a const vector should return a const_iterator, to avoid
giving the caller the ability to mutate it.
This slipped through since iterator's constructor does a const_cast.
Noticed by code inspection.
Use the query_time passed in to the populate function and forward it to
the sstable constructor, so that the compaction happening during sstable
write uses the same query time that any compaction done by the mutation
source test suit does.
So tests that do compaction can pass the query_time they used for it to
clients that do some compaction themselves, making sure all compactions
happen with the same query time, avoiding false positive test failures.
Don't let perftune.py print false alarm error message when we calculate
a compute CPU set for tuning modes.
This may happen when we calculate a CPU set for non-MQ tuning modes on
small systems on which these modes are forbidden because they would
result in a zero CPU set, e.g. sq_split on a system with a single
physical core.
We are going to utilize a newly introduced --get-cpu-mask-quiet execution
mode introduced to the seastar/script/perftune.py by the
"perftune.py: introduce --get-cpu-mask-quiet" series which would return
a zero CPU set if that's what it turns out to be instead of exiting with
an error what --get-cpu-mask would do in such a case.
The rest of scylla_util.py logic is going to handle a zero CPU set
returned by get_mode_cpuset() correctly.
Fixes#4211Fixes#4443
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190731212901.9510-1-vladz@scylladb.com>
"
A fix for #4338 "storage_proxy add a counter for cql requests
that arrived to a non replica"
Such requests should be tracked since forwarding them to a correct
replica can create a lot network noise and incur significant performance
penalty.
The current metrics are considered insufficient after introduction
of heat-weighted load balancing.
"
Fixes#4388.
* 'gh-4338' of https://github.com/kostja/scylla:
metrics: introduce a metric for non-local reads
metrics: account writes forwarded by a coordinator in an own metric.
"
Fixes#4663.
Fixes#4718.
"
* 'make_cleanup_run_aware_v3' of https://github.com/raphaelsc/scylla:
tests/sstable_datafile_test: Check cleaned sstable is generated with expected run id
table: Make SSTable cleanup run aware
compaction: introduce constants for compaction descriptor
compaction: Make it possible to config the identifier of the output sstable run
table: do not rely on undefined behavior in cleanup_sstables
* seastar 3f88e9068b...d90834443c (12):
> Print warning when somaxconn lower than backlog parameter used for listen()
> Merge "perftune.py: introduce --get-cpu-mask-quiet" from Vlad
> seastar-json2code: Handle "$ref"-usage for nested object types properly
> Make future [[nodiscard]]
> Allow pass listen_options to http_server::listen
> Handle EPOLLHUP and EPOLLERR from epoll explicitly
> reactor: fix false positives in the stall detector due to large task queue
> Merge "Small asan related improvements" from Rafael
> thread: reduce allocations during context switch
> thread: remove deprecated thread_scheduling_group and its unit test
> reactor: make _polls to be non atomic
> reactor: remove unused _tasks_processed variable
On previous commit ac9b115a8f, install.sh requires to specify single package using --pkg, there is no way to select all.
It should be select all packages when running install.sh without --pkg.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190731013245.5857-1-syuu@scylladb.com>
get_rpc_client assumes the messaging_service is not stopped. We should check
is_stopping() before we call get_rpc_client.
We do such check in existing code, e.g., send_message and friends. Do
the same check in the newly introduced
make_sink_and_source_for_stream_mutation_fragments() and friends for row
level repair.
Fixes: #4767
"
This series is a batch of first small steps towards devirtualizing CQL restrictions:
- one artificial parent class in the hierarchy is removed: abstract_restriction
- the following functions are devirtualized:
* is_EQ()
* is_IN()
* is_slice()
* is_contains()
* is_LIKE()
* is_on_token()
* is_multi_column()
Future steps can involve the following:
- introducing a std::variant of restriction targets: it's either a column
or a vector of columns
- introducing a std::variant of restriction values: it's one of:
{term, term_slice, std::vector<term>, abstract_marker}
The steps above will allow devirtualizing most of the remaining virtual functions
in favor of std::visit. They will also reduce the number of subclasses,
e.g. what's currently `token_restriction::IN_with_values` can be just an instance
of `restriction`, knowing that it's on a token, having a target of std::vector<column>
and a value of std::vector<term>.
Tests: unit(dev), dtest: cql_tests, cql_additional_tests
"
* 'refactor_restrictions_2' of https://github.com/psarna/scylla:
cql3: devirtualize is_on_token()
cql3: devirtualize is_multi_column()
cql3: devirtualize is_EQ, is_IN, is_contains, is_slice, is_LIKE
tests: add enum_set adding case
cql3: allow adding enum_sets
cql3: remove abstract_restriction class
(At least) on Ubuntu 19 'thrift -version' prints the expected
string but its exit status is non-zero:
$ thrift -version
Thrift version 0.9.1
$ echo $?
1
We don't really care about the exit status but rather about the printed
version string. If there is going to be some problem with the command,
e.g. it's missing, the printed string is not going to be as expected
anyway - let's verify that explicitly by checking the format of the
returned string in that case.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190722211729.24225-1-vladz@scylladb.com>
Currently all cells generated by this method uses a ttl of 1. This
causes test flakyness as tests often compact the input and output
mutations to weed out artificial differences between them. If this
compaction is not done with the exact same query time, then some cells
will be expired in one compaction but not in the other.
733c68cb1 attempted to solve this by passing the same query time to
`flat_mutation_reader_assertions::produce_compacted()` as well as
`mutation_partition::compact_for_query()` when compacting the input
mutation. However a hidden compaction spot remained: the ka/la sstable
writer also does some compaction, and for this it uses the time point
passed to the `sstable` constructor, which defaults to
`gc_clock::now()`. This leads to false positive failures in
`sstable_mutation_test.cc`.
At this point I don't know what the original intent was behind this low
`ttl` value. To solve the immediate problem of the tests failing, I
increased it. If it turns out that this `ttl` value has a good reason,
we can do a more involved fix, of making sure all sstables written also
get the same query time as that used for the compaction.
Fixes: #4747
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190731081522.22915-1-bdenes@scylladb.com>
All restrictions inherit from `abstract_restriction` class,
which has only one parent class: `restriction`. To simplify the
inheritance tree, `restriction` and `abstract_restriction`
are merged into one class named `restriction`.
`produces_compacted()` is usually used in tandem of another
compaction done on the expected output (`m` param). This is usually done
so that even though the reader works with an uncompacted stream, when
checking the checking of the result will not fail due to insignificant
changes to the data, e.g. expired collection cells dropped while merging
two collections. Currently, the two compactions, the one inside
`produce_compacted()` and the one done by the caller uses two separate
calls to `gc_clock::now()` to obtain the query time. This can lead to
off-by-one errors in the two query times and subsequently artificial
differences between the two compacted mutations, ultimately failing the
test due to a false-positive.
To prevent this allow callers to pass in a query time, the same they
used to compact the input mutation (`m`).
This solves another source of flakyness in unit tests using the mutation
source test suite.
Refs: #4695Fixes: #4747
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190726144032.3411-1-bdenes@scylladb.com>
If we had an error while reading, then we would have failed to close
the reader, which in turn can cause memory corruption. Make the
closing more robust by using then_wrapped (that doesn't skip on
exception) and log the error for analysis.
Fixes#4761.
* seastar c1be3c912f...3f88e9068b (3):
> reactor: improve handling of connect storms
> json: Make date formatter use RFC8601/RFC3339 format
> reactor: fix deadlock of stall detector vs dlopen
Fixes#4759.
Currently, install.sh just used for building .rpm, we have similar build script
under dist/debian, sometimes it become inconsistent with install.sh.
Since most of package build process are same, we should share install.sh on both
.rpm and .deb package build process.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190725123207.2326-1-syuu@scylladb.com>
Across all calls to `make_fragments_with_non_monotonic_positions()`, to
prevent off-by one errors between the separately generated test input
and expected output. This problem was already supposed to be fixed by
5f22771ea8 but for some reason that only
used the same deletion time *inside* a single call, which will still
fall short in some cases.
This should hopefully fix this problem for good.
Refs: #4695
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190724073240.125975-1-bdenes@scylladb.com>
If the user specifies an output file name using "--xunit=<filename>",
test.py will write the test results of non-boost tests to the file in the XUnit XML format.
Every boost test creates its own results file already.
Resolves#4680.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
Explicity pass the port number of the AWS metadata server API
when creating a corresponding socket.
This patch fixes the regression introduced by 4ef940169f.
Fixes#4719
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
compaction: allow collecting purged data
Allow the compaction initiator to pass an additional consumer that will
consume any data that is purged during the compaction process. This
allows the separate retention of these dead cells and tombstone until
some long-running process like compaction safely finishes. If the
process fails or is interrupted the purged data can be used to prevent
data resurrection.
This patch was developed to serve as the basis for a solution to #4531
but it is not a complete solution in and on itself.
This series is a continuation of the patch: "[PATCH v1 1/3] Introduce
Garbage Collected Consumer to Mutation Compactor" by Raphael S.
Carvalho <raphaelsc@scylladb.com>.
Refs: #4531
* https://github.com/denesb/scylla.git compaction_collect_purged_data/v8:
Introduce compaction_garbage_collector interface
collection_type_impl::mutation: compact_and_expire() add collector
parameter
row: add garbage_collector
row_marker: de-inline compact_and_expire()
row_marker: add garbage_collector
Introduce Garbage Collected Consumer to Mutation Compactor
tests: mutation_writer_test.cc/generate_mutations() ->
random_schema.hh/generate_random_mutations()
tests/random_schema: generate_random_mutations(): remove `engine`
parameter
tests/random_schema: add assert to make_clustering_key()
tests/random_schema: generate_random_mutations(): allow customizing
generated data
tests: random_schema: futurize generate_random_mutations()
random_schema: generate_random_mutations(): restore indentation
data_model: extend ttl and expiry support
tests/random_schema: generate_random_mutations(): generate partition
tombstone
random_schema: add ttl and expiry support
tests/random: add get_bool() overload with random engine param
random_schema: generate_random_mutations(): ensure partitions are
unique
tests: add unit tests for the data stream split in compaction
"
After switching to rpc stream interface, we increased the row buffer
size. Code works on the buffer that do not yield can stall the reactor.
This series fixes the issue by futurizing or running the code in thread
and yield.
Fixes: #4642
"
* 'repair_switch_to_rpc_stream_fix_stall' of https://github.com/asias/scylla:
repair: Enable rpc stream in row level repair
repair: Wrap with foreign_ptr to avoid cross cpu free
repair: Futurize get_repair_rows_size and row_buf_size
repair: Avoid calling get_repair_rows_size in get_sync_boundary
repair: Futurize row_buf_csum
repair: Yield inside get_set_diff
repair: Use get_repair_rows_size helper in get_sync_boundary
repair: Avoid stall in do_estimate_partitions_on_local_shard
remove get_row_diff
repair: Futurize get_row_diff to avoid stall
repair: Fix possible stall in request_row_hashes
repair: Allow default construct for repair_row
repair: Remove apply_rows
repair: Run get_row_diff_with_rpc_stream in a thread
repair: Run get_row_diff_and_update_peer_row_hash_sets inside a thread
repair: Run get_row_diff inside a thread
repair: Add apply_rows_on_master_in_thread
repair: Add apply_rows_on_follower
repair: Futurize working_row_hashes
repair: Remove get_full_row_hashes helper
Data listener reads are implemented as flat_mutation_readers, which
take a reference to the listener and then execute asynchronously.
The listener can be removed between the time when the reference is
taken and actual execution, resulting in a dangling pointer
dereference.
Fix by using a weak_ptr to avoid writing to a destroyed object. Note that writes
don't need protection because they execute atomically.
Fixes#4661.
Tests: unit (dev)
"
Issue #4558 describes a situation in which failure to execute clearsnapshots
will hard crash the node. The problem is that clearsnapshots will internally use
lister::rmdir, which in turn has two in-tree users: clearing snapshots and clearing
temporary directories during sstable creation. The way it is currently
coded, it wraps the io functions in io_check, which means that failures
to remove the directory will crash the database.
We recently saw how benign failures crashed a database during
clearsnapshot: we had snapshot creation running in parallel, adding more
files to the directory that wasn't empty by the time of deletion. I
have also seen very often users add files to existing directories by
accident, which is another possibility to trigger that.
This patch removes the io_check from lister, and moves it to the caller
in which we want to be more strict. We still want to be strict about
the creation of temporary directories, since users shouldn't be touching
that in any way.
Also while working on that, I realized we have no tests for snapshots of
any kind in tree, so let's write some
"
* 'snapshots' of https://github.com/glommer/scylla:
tests: add tests for snapshots.
lister: don't crash the node on failure to remove snapshot
We were using segment::_closed to decide whether _file was already
closed. Unfortunately they are not exactly the same thing. As far as
I understand it, segments can be closed and reused without actually
closing the file.
Found with a seastar patch that asserts on destroying an open
append_challenged_posix_file_impl.
Fixes#4745.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190721171332.7995-1-espindola@scylladb.com>
Compilation fails with fmt release 5.3.0 when we print a bytes_view
using "{}" formatter.
Compiler's complain is: "error: static assertion failed: mismatch
between char-types of context and argument"
Fix this by explicitly using to_hex() converter.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190716221231.22605-3-vladz@scylladb.com>
Merged patch series by Piotr Sarna:
This series introduces the concept of "computed" column, which represents
values not provided directly by the user, but computed on the fly -
possibly using other column values. It will be used in the future to
implement map value indexing, collection indexing, etc. Right now the only
use is the token column for secondary indexes - which is a column computed
from the base partition key value.
After this series, another one that depends on it and adds map value
indexing will be pushed.
Tests: unit(dev)
Piotr Sarna (14):
schema: add computed info to column definition
schema: add implementation of computing token column
schema: allow marking columns as computed in schema builder
service: add computed columns feature
view: check for computed columns in view
view: remove unused token_for function
database: add fixing previous secondary index schemas
tests: disable computed columns feature in schema change test
tests: add schema change test regeneration comment
db: add system_schema.computed_columns
docs: init system_schema_keyspace.md with column computations
tests: generate new test case for schema change + computed cols
index: mark token column as 'computed' when creating mv
tests: add checking computed columns in SI
column_computation.hh | 63 ++++++++
db/schema_features.hh | 4 +-
db/schema_tables.hh | 4 +
idl/frozen_schema.idl.hh | 1 +
schema.hh | 40 +++++
schema_builder.hh | 4 +-
schema_mutations.hh | 18 ++-
service/storage_service.hh | 8 +
view_info.hh | 2 -
database.cc | 6 +-
db/schema_tables.cc | 146 ++++++++++++++++--
db/view/view.cc | 46 +++---
index/secondary_index_manager.cc | 2 +-
schema.cc | 58 ++++++-
schema_mutations.cc | 14 +-
service/storage_service.cc | 5 +
tests/schema_change_test.cc | 63 ++++++--
tests/secondary_index_test.cc | 28 ++++
docs/system_schema_keyspace.md | 40 +++++
plus about 200 new test sstable files
The original "test_schema_digest_does_not_change" test case ensures
that schema digests will match for older nodes that do not support
all the features yet (including computed columns).
The additional case uses sstables generated after computed columns
are allowed, in order to make sure that the digest computed
including computed columns does not change spuriously as well.
Schema change test might need regenerating every time a system table
is added. In order to save future developer's time on debugging this
test, a short description of that requirement is added.
In order to make sure that old schema digest is not recomputed
and can be verified - computed columns feature is initially disabled
in schema_change_test.
The reason for that is as follows: running CQL test env assumes that
we are running the newest cluster with all features enabled. However,
the mere existence of some features might influence digest calculation.
So, in order for the existing test to work correctly, it should have
exactly the same set of cluster supported features as it had during
its creation. It used to be "all features", but now it's "all features
except computed columns". One can think of that as running a cluster
with some nodes not yet knowing what computed columns are, so they
are not taken into account when computing digests.
Additionally, a separate test case that takes computed column digest
into account will be generated and added in this series.
If a schema was created before computed columns were implemented,
its token column may not have been marked as computed.
To remedy this, if no computed column is found, the schema
will be recreated.
The code will work correctly even without this patch in order to support
upgrading from legacy versions, but it's still important: it transforms
token columns from the legacy format to new computed format, which will
eventually (after a few release cycles) allow dropping the support for
legacy format altogether.
Computed columns feature should be checked before creating
index schemas the new way - by adding computed column names
to system_schema.computed_columns.
Some columns may represent not user-provided values, but ones computed
from other columns. Currently an example is token column used in secondary
indexes to provide proper ordering. In order to avoid hardcoding special
cases in execution stage, optional additional information for computed
columns is stored in column definition.
streaming_reader_lifecycle_policy::create_reader() was ignoring the
partition_slice passed to it and always creating the reader for the
full slice.
That's wrong because create_reader() is called when recreating a
reader after it's evicted. If the reader stopped in the middle of
partition we need to start from that point. Otherwise, fragments in
the mutation stream will appear duplicated or out of ordre, violating
assumptions of the consumers.
This was observed to result in repair writing incorrect sstables with
duplicated clustering rows, which results in
malformed_sstable_exception on read from those sstables.
Fixes#4659.
In v2:
- Added an overload without partition_slice to avoid changing existing users which never slice
Tests:
- unit (dev)
- manual (3 node ccm + repair)
Backport: 3.1
Reviewd-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1563451506-8871-1-git-send-email-tgrabiec@scylladb.com>
Given a list of ranges to stream, stream_transfer_task will create an
reader with the ranges and create a rpc stream connection on all the shards.
When user provides ranges to repair with -st -et options, e.g.,
using scylla-manger, such ranges can belong to only one shard, repair
will pass such ranges to streaming.
As a result, only one shard will have data to send while the rpc stream
connections are created on all the shards, which can cause the kernel
run out of ports in some systems.
To mitigate the problem, do not open the connection if the ranges do not
belong to the shard at all.
Refs: #4708
"
Fix another source of flakyness in mutation_reader_test. This one is caused by storage_service_for_tests lacking a config::broadcast_to_all_shards() call, triggering an invalid memory access (or SEGFAULT) when run on more than one shards.
Refs: #4695
"
* 'fix_storage_service_for_tests' of https://github.com/denesb/scylla:
tests: storage_service_for_tests: broadcast config to all shards
tests: move storage_service_for_tests impl to test_services.cc
The function announce_column_family_drop() drops (deletes) a base table
and all the materialized-views used for its secondary indexes, but not
other materialized views - if there are any, the operation refuses to
continue. This is exactly what CQL's "DROP TABLE" needs, because it is
not allowed to drop a table before manually dropping its views.
But there is no inherent reason why it we can't support an operation
to delete a table and *all* its views - not just those related to indexes.
This patch adds such an option to announce_column_family_drop().
This option is not used by the existing CQL layer, but can be used
by other code automating operations programatically without CQL.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190716150559.11806-1-nyh@scylladb.com>
Currently scylla-server.service uses DefaultTimeoutStopSec = 90, if Scylla
does not able to clean-shutdown in 90sec we may have data corruption on the node.
Since we already set TimeoutStartSec = 900, we can use TimeoutSec to set both
TimeoutStartSec and TimeoutStopSec to 900.
See #4700
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190717095416.10652-1-syuu@scylladb.com>
All statement objects which derive from cf_statement, including
drop_index_statement, have a column_family() returning the name of the
column family involved in this statement. For most statement this is
known at the time of construction, because it is part of the statement,
but for "DROP INDEX", the user doesn't specify the table's name - just
the index name. So we need to override column_family() to find the
table name.
The existing implementation assert()ed that we can always find such
a table, but this is not true - for example, in a DROP INDEX with
"IF EXISTS", it is perfectly fine for no such table to exist. In this
case we don't want a crash, and not even an except - it's fine that
we just return an empty table name.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190716180104.15985-1-nyh@scylladb.com>
While inspecting the snapshot code, I realized that we don't have any
tests for it. So I decided to add some.
Unfortunately I couldn't come up with a test of clearsnapshot reliably
failing to remove the directory: relying on create snapshot +
clearsnapshot is racy (not always happen), and other tricks that can be
used to reproduce this -- like creating a root-owned file inside the
snapshots directory -- is environment-dependent, and a bit ugly for unit
tests. Dtests would probably be a better place for that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
lister::rmdir has two in-tree users: clearing snapshots and clearing
temporary directories during sstable creation. The way it is currently
coded, it wraps the io functions in io_check, which means that failures
to remove the directory will crash the database.
We recently saw how benign failures crashed a database during
clearsnapshot: we had snapshot creation running in parallel, adding more
files to the directory that wasn't empty by the time of deletion. I
have also seen very often users add files to existing directories by
accident, which is another possibility to trigger that.
This patch removes the io_check from lister, and moves it to the caller
in which we want to be more strict. We still want to be strict about
the creation of temporary directories, since users shouldn't be touching
that in any way.
Fixes#4558
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
disable_sstable_write needs to acquire _sstable_deletion_sem to properly synchronize
with background deletions done by on_compaction_completion to ensure no sstables will
be created or deleted during reshuffle_sstables after
storage_service::load_new_sstables disables sstable writes.
Fixes#4622
Test: unit(dev), nodetool_additional_test.py migration_test.py
"
* 'scylla-4622-fix-disable-sstable-write' of https://github.com/bhalevy/scylla:
table: document _sstables_lock/_sstable_deletion_sem locking order
table: disable_sstable_write: acquire _sstable_deletion_sem
table: uninline enable_sstable_write
table: reshuffle_sstables: add log message
If a node is a seed node, it can not be started with
replace-address-first-boot or the replace-address flag.
The issue is that as a seed node it will generate new tokens instead of
replacing the existing one the user expect it to replaec when supplying
the flags.
This patch will throw a bad_configuration_error exception
in this case.
Fixes#3889
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Fixes#4717
Bug in ipv6 support series caused inet_address serialization
to include an additional "size" parameter in the address chunk.
Message-Id: <20190716134254.20708-1-calle@scylladb.com>
Due to recent changes to the config subsystem, configuration has to be
broadcast to all shards if one wishes to use it on them. The
`storage_service_for_tests` has a `sharded<gms::gossiper>` member, which
reads config values on initialization on each shard, causing a crash as
the configuration was initialized only on shard 0. Add a call to
`config::broadcast_to_all_shards()` to ensure all shards have access to
valid config values.
The cleanup procedure will move any sstable out of its sstable run
because sstables are cleaned up individually and they end up receiving
a new run identifier, meaning a table may potentially end up with a
new sstable run for each of the sstables cleaned.
SStable cleanup needs to be run aware, so that the run structure is
not messed up after the operation is done. Given that only one fragment
or other, composing a sstable run, may need cleanup, it's better to keep
them in their original sstable run.
Fixes#4663.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Make it easier for users, and also avoid duplicating knowledge
about descriptor defaults across the codebase.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
row::append_cell() has a precondition that the new cell column id needs
to be larger than that of any other already existing cell. If this
precondition is violated the row will end up in an invalid state. This
patch adds assertion to make sure we fail early in such cases.
Currently the
test_multishard_combining_reader_non_strictly_monotonic_positions is
flaky. The test is somewhat unconventional, in that it doesn't use the
same instance of data as the input to the test and as it's expected
output, instead it invokes the method which generates this data
(`make_fragments_with_non_monotonic_positions()`) twice, first to
generate the input, and a secondly to generate the expected output. This
means that the test is prone to any deviation in the data generated by
said method. One such deviation, discovered recently, is that the method
doesn't explicitly specify the deletion time of the generated range
tombstones. This results in this deletion time sometimes differing
between the test input and the expected output. Solve by explicitly
passing the same deletion time to all created range tombstones.
Refs: #4695
Fixes a segfault when querying for an empty keyspace.
Also, fixes an infinite loop on smp > 1. Queries to
system.size_estimates table which are not single-partition queries
caused Scylla to go into an infinite loop inside
multishard_combining_reader::fill_buffer. This happened because
multishard_combinind_reader assumes that shards return rows belonging
to separate partitions, which was not the case for
size_estimates_mutation_reader.
Fixes#4689.
Start n1, n2
Create ks with rf = 2
Run repair on n2
Stop n2 in the middle of repair
n1 will notice n2 is DOWN, gossip handler will remove repair instance
with n2 which calls remove_repair_meta().
Inside remove_repair_meta(), we have:
```
1 return parallel_for_each(*repair_metas, [repair_metas] (auto& rm) {
2 return rm->stop();
3 }).then([repair_metas, from] {
4 rlogger.debug("Removed all repair_meta for single node {}", from);
5 });
```
Since 3.1, we start 16 repair instances in parallel which will create 16
readers.The reader semaphore is 10.
At line 2, it calls
```
6 future<> stop() {
7 auto gate_future = _gate.close();
8 auto writer_future = _repair_writer.wait_for_writer_done();
9 return when_all_succeed(std::move(gate_future), std::move(writer_future));
10 }
```
The gate protects the reader to read data from disk:
```
11 with_gate(_gate, [] {
12 read_rows_from_disk
13 return _repair_reader.read_mutation_fragment() --> calls reader() to read data
14 })
```
So line 7 won't return until all the 16 readers return from the call of
reader().
The problem is, the reader won't release the reader semaphore until the
reader is destroyed!
So, even if 10 out of the 16 readers have finished reading, they won't
release the semaphore. As a result, the stop() hangs forever.
To fix in short term, we can delete the reader, aka, drop the the
repair_meta object once it is stopped.
Refs: #4693
Duplicate partitions can appear as a result of the same partition key
generated more than once. For now we simply remove any duplicates. This
means that in some circumstances there will be less partitions generated
than asked.
When generating data, the user can now also generate ttls and
expiry for all generated atoms. This happens in a controlled way, via a
generator functor, very similar to how the timestamps are generated.
This functor is also used by `random_schema` to generate `deletion_time`
for all tombstones, so the user now has full control of when all of the
atoms can be GC'd.
Use an internally create instance of random engine. Passing a readily
seeded engine from the outside is pointless now that we have a mechanism
to seed entire test suites with a command line algorithm: the internal
engine is seeded from tests::random, so the seed of the test suite
determines the internal seed as well.
Update the sole user of this method (mutation_writer_test.cc) to not
generate local seeds anymore.
Introduce consumer in mutation compactor that will only consume
data that is purged away from regular consumer. The goal is to
allow compaction implementation to do whatever it wants with
the garbage collected data, like saving it for preventing
data resurrection from ever happening, like described in
issue #4531.
noop_compacted_fragments_consumer is made available for users that
don't need this capability.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The new collector parameter is a pointer to a
`compaction_garbage_collector` implementation. This collector is passed
the row_marker when it expired and would be discarded.
The collector param is optional and defaults to nullptr.
The new collector parameter is a pointer to a
`compaction_garbage_collector` implementation. This collector is passed
all atoms that are expired and can would be discarded. The body of
`compact_and_expire()` was changed so that it checks cells' tombstone
coverage before it checks their expiry, so that cells that are both
covered by a tombstone and also expired are not passed to the collector.
The collector is forwarded to
`collection_type_impl::mutation::compact_and_expire()` as well.
The collector param is optional and defaults to nullptr
The new collector parameter is a pointer to a
`compaction_garbage_collector` implementation. This collector is passed
all atoms that are expired and would be discarded. The body of
`compact_and_expire()` was changed so that it checks cells' tombstone
coverage before it checks their expiry, so that cells that are both
covered by a tombstone and also expired are not passed to the collector.
The collector param is optional and defaults to nullptr. To accommodate
the collector, which needs to know the column id, a new `column_id`
parameter was added as well.
Fixes#4713
Modifying config files to use sharded storage misses the fact
that extensions are allowed to add non-member config fields to
the main configuration, typically from "extra" config_file
objects.
Unless those "extra" files are broadcast when main file broadcast,
the values will not be readable from other shards.
This patch propagates the broadcast to all other config files
whose entries are in the top level object. This ensures we
always keep data up to date on config reload.
Message-Id: <20190715135851.19948-1-calle@scylladb.com>
configure.py currently takes some time to write build.ninja. If the user
interrupts (e.g., control-C) configure.py, it can leave behind a partial
or even empty build.ninja file. This is most frustrating when the user
didn't explicitly run "configure.py", but rather just ran "ninja" and
ninja decided to run configure.py, and after interrupting it the user
cannot run "ninja" again because build.ninja is gone. Another result of
losing build.ninja is that the user now needs to remember which parameters
to run "configure.py", because the old ones stored in build.ninja were lost.
The solution in this patch is simple: We write the new build.ninja contents
into a temporary file, not directly into build.ninja. Then, only when the
entire file has been succesfully written, do we rename the temporary file
to its intended name - build.ninja.
Fixes#4706
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190715122129.16033-1-nyh@scylladb.com>
This interface can be used to implement a garbage collector that
collects atoms that are purged due to expiry during compaction.
The intended usage is collecting purged atoms for safekeeping until the
compaction process finishes safely, to be dropped only at the end when
the compaction is known to have finished successfully.
When scylla is started for the first time with PasswordAuthenticator
enabled, it can be that a record of the default superuser
will be created in the table with the can_login and is_superuser
set to null. It happens because the module in charge of creating
the row is the role manger and the module in charge of setting the
default password salted hash value is the password authenticator.
Those two modules are started together, it the case when the
password authenticator finish the initialization first, in the
period until the role manager completes it initialization, the row
contains those null columns and any loging attempt in this period
will cause a memory access violation since those columns are not
expected to ever be null. This patch removes the race by starting
the password authenticator and autorizer only after the role manger
finished its initialization.
Tests:
1. Unit tests (release)
2. Auth and cqlsh auth related dtests.
Fixes#4226
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190714124839.8392-1-eliransin@scylladb.com>
In scylla-debuginfo package, we have /usr/lib/debug/opt/scylladb/libreloc/libthread_db-1.0.so-666.development-0.20190711.73a1978fb.el7.x86_64.debug
but we actually does not have libthread_db.so.1 in /opt/scylladb/libreloc
since it's not available on ldd result with scylla binary.
To debug thread, we need to add the library in a relocatable package manually.
Fixes#4673
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190711111058.7454-1-syuu@scylladb.com>
The resource manager is used to manage common resources between
various hints managers. In-flight hints used to be one of the shared
resources, but it proves to cause starvation, when one manager eats
the whole limit - which may be especially painful if the background
materialized views hints manager starves the regular hints manager,
which can in turn start failing user writes because of admission control.
This patch makes the limit per-manager again,
which effectively reverts the limit to its original behavior.
Fixes#4483
Message-Id: <8498768e8bccbfa238e6a021f51ec0fa0bf3f7f9.1559649491.git.sarna@scylladb.com>
We were missing calls to underlying_type in a few locations and so the
insert would think the given literal was invalid and the select would
refuse to fetch a UDT field.
Fixes#4672
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190708200516.59841-1-espindola@scylladb.com>
Queries to system.size_estimates table which are not single parition queries
caused Scylla to go into an infinite loop inside multishard_combining_reader::fill_buffer.
This happened because multishard_combinind_reader assumes that shards return rows belonging
to separate partitions, which was not the case for size_estimates_mutation_reader.
This commit fixes the issue and closes#4689.
Move the implementation of size_estimates_mutation_reader
to a separate compilation unit to speed up compilation times
and increase readability.
Refactor tests to use seastar::thread.
`disable_sstable_write` needs to acquire `_sstable_deletion_sem`
to properly synchronize with background deletions done by
`on_compaction_completion` to ensure no sstables will be created
or deleted during `reshuffle_sstables` after
`storage_service::load_new_sstables` disables sstable writes.
Fixes#4622
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add the row_level_diff_detect_algorithm::send_full_set_rpc_stream as
supported algo. If both repair master and followers support it, the
master will use the rpc stream interface, otherwise use the old rpc verb
interface.
The moved set_diff and rows will be freed on the target cpu instead of the
source cpu, which will cause a lot of cross-cpu frees.
To fix, wrap them in foreign_ptr.
This fixes a possible cause of #4614.
From the backtrace in that issue, it looks like a file is being closed
twice. The first point in the backtrace where that seems likely is in
the MC writer.
My first idea was to add a writer::close and make it the responsibility
of the code using the writer to call it. That way we would move work
out of the destructor.
That is a bit hard since the writer is destroyed from
flat_mutation_reader::impl::~consumer_adapter and that would need to
get a close function too.
This patch instead just fixes an exception safety issue. If
_index_writer->close() throws, _index_writer is still valid and
~writer will try to close it again.
If the exception was thrown after _completed.set_value(), that would
explain the assert about _completed.set_value() being called twice.
With this patch the path outside of the destructor now moves the
writer to a local variable before trying to close it.
Fixes#4614
Message-Id: <20190710171747.27337-1-espindola@scylladb.com>
In debug mode the LSA needs objects to be 8-byte aligned in order to
maximise coverage from the AddressSanitizer.
Usually `close_active()` creates a dummy objects that covers the end of
the segment being closed. However, it the last real objects ends in the
last eight bytes of the segment then that dummy won't be created because
of the alignment requirements. This broke exit conditions on loops
trying to read all objects in the segment and caused them to attempt to
dereference address at the end of the segment. This patch fixes that.
Fixes#4653.
"
If the user creates a keyspace with the 'SimpleStrategy' replication class
in a multi-datacenter environment, they will receive a warning in the CQL shell
and in the server logs.
Resolves#4481 and #4651.
"
* 'multidc' of https://github.com/kbr-/scylla:
Warn user about using SimpleStrategy with Multi DC deployment
Add warning support to the CQL binary protocol implementation
"
Fixes#2027
Modifies inet address type in scylla to use seastar::net::inet_address,
and removes explicit use of ipv4_addr in various network code in favour
of socket_address. Thus capable of resolving and binding to ipv6.
Adds config option to enable/disable ipv6 (default enabled), so
upgrading cluster can continue to work while running mixed version
nodes (since gossip message address serialization becomes different).
"
* 'calle/ipv6' of https://github.com/elcallio/scylla:
test-serialization: Add small roundtrip test for inet address (v4 + v6)
inet_address/init: Make ipv6 default enabled
db::config: Add enable ipv6 switch (default off)
gms::inet_address: Make serialization ipv6 aware
Remove usage of inet_address::raw_addr()
Replace use of "ipv4_addr" with socket_address
inet_address: Add optional family to lookup
gms::inet_address: Change inet_address to wrap actual seastar::net::inet_address
types: Add ipv6_address support
Post commit b3adabda2d
(Reduce logalloc differences between debug and release)
logalloc_test's memory footprint has grown, in particular
in test_region_groups, and it triggers the oom killer on
our test automation machines.
This patch scales down this test case so it requires less memory.
Fixes#4669
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This adds a '--repeat N' command line option to test.py, which can be
used to execute the tests N times. This is useful for finding flakey
tests, for example.
Message-Id: <20190710092115.15960-1-penberg@scylladb.com>
Currently the handling of partition tombstones is broken in multiple
ways:
* The partition-tombstone is lost when the bucket is calculated for its
timestamp (due to a misplaced `std::exchange()`).
* When the `partition_start` fragment (containing the partition
tombstone) is actually written to the bucket we emit another
`partition_start` fragment before it because the bucket has not seen
that partition before and we fail to notice that we are actually writing
the partition header.
This bug was allowed to fly under the radar because the unit test was
accidentally not creating partition tombstones in the generated data
(due to a mistake). It was discovered while working on unit tests for
another test and fixing the data generation function to actually
generate partition tombstones.
This patch fixes both problems in the handling of partition tombstones
but it doesn't yet fixes the test. That is deferred until the patch
series which uncovered this bug is merged to avoid merge conflicts.
The other series mentioned here is: [PATCH v6 00/15] compaction: allow
collecting purged data
Fixes: #4683
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190710092427.122623-1-bdenes@scylladb.com>
The get_value method returns a pointer to the value that is used by the
value_to_json method.
The assumption is that the void pointer points to the actual value.
Fixes#4678
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Since commit bb56653 (repair: Sync schema from follower nodes before
repair), the behaviour of handling down node during repair has been
changed. That is, if a repair follower is down, it will fail to sync
schema with it and the repair of the range will be skipped. This means
a range can not be repaired unless all the nodes for the replicas are up.
To fix, we filter out the nodes that is down and mark the repair is
partial and repair with the nodes that are still up.
Tests: repair_additional_test:RepairAdditionalTest.repair_with_down_nodes_2b_test
Fixes: #4616
Backports: 3.1
Message-Id: <621572af40335cf5ad222c149345281e669f7116.1562568434.git.asias@scylladb.com>
A read which arrived to a non-replica and had to be forwarded to a
replica by the coordinator is accounted in an own metric,
reads_coordinator_outside_replica_set.
Most often such read is produced by a driver which is unaware of
token distribution on the ring.
If a read was forwarded to another replica due to heat weighted
load balancing or query preference set by the user, it's not accounted
in the metric.
In case of a multi-partition read (a query using IN statement,
e.g. x in (1, 2, 3)), if any of the keys is read from a
non-local node the read is accounted as a non-local.
The rationale behind it is that if the user tries to be careful and send
IN queries only to the same vnode, they are rewarded with the counter
staying at zero, while if they send multi-partition IN queries without
any precautions, they will see the metric go up which gives them a
starting point for investigating performance problems.
Closes#4338
Add a metric to account writes which arrived to a non-replica and
had to be forwarded by a coordinator to a replica.
The name of the added metric is 'writes_coordinator_outside_replica_set'.
Do not account forwarded read repair writes, since they are already
accounted by a reads_coordinator_outside_replica_set metric, added in a
subsequent patch.
In scope of #4338.
Because inet_address was initially hardcoded to
ipv4, its wire format is not very forward compatible.
Since we potentially need to communicate with older version nodes, we
manually define the new serial format for inet_address to be:
ipv4: 4 bytes address
ipv6: 4 bytes marker 0xffffffff (invalid address)
16 bytes data -> address
The std::find_if and std::copy can stall if the number of rows are big.
Introduce a helper move_row_buf_to_working_row_buf to move the rows
that yields to avoid stall.
Simliar to commit 9785754e0d
lister::guarantee_type needs to check the entry's type,
not the symlink it may point to.
Fixes#4606
The nodetool_refresh_with_wrong_upload_modes_test dtest creates a broken
symlink and following it fails, as it should, with the default follow_symlink::yes
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190626110734.4558-1-bhalevy@scylladb.com>
"
This miniseries expands big_decimal interface with convenience operators
(-=, +, -), provides test cases for it and makes one of the constructors
explicit.
Tests: unit(dev)
"
* 'expand_big_decimal_interface' of https://github.com/psarna/scylla:
utils: make string-based big decimal constructor explicit
tests: add more operators to big decimal tests
utils: add operators to big_decimal
"
Implement LIKE parsing, intermediate representation, and query processing. Add tests
for this implementation (leaving the LIKE functionality tests in
tests/like_matcher_test.cc).
Refs #4477.
"
* 'finish-like' of https://github.com/dekimir/scylla:
cql3: Add LIKE operator to CQL grammar
cql3: Ensure LIKE filtering for partition columns
cql3: Add LIKE restriction
cql3: Add LIKE relation
If the user creates a keyspace with the 'SimpleStrategy' replication class
in a multi-datacenter environment, they will receive a warning in the CQL shell
and in the server logs.
Resolves#4481.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
Partition columns are implicitly filtered whenever possible, avoiding
expensive post-processing. But there are exceptions, eg, when
partition key is only partially restricted, or for CONTAINS
expressions. Here we add LIKE to this list of exceptions.
Also fix compute_bounds() to punt on LIKE restrictions, which cannot
be translated into meaningful bounds.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
This restriction leverages like_matcher to perform filtering.
Make single_column_relation::new_LIKE_restriction() return this new
restriction.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Add a new type of relation with operator LIKE. Handle it in
relation::to_restriction by introducing a new virtual method for it.
The temporary implementation of this method returns null; that will be
replaced in a subsequent patch.
Add abstract_type::is_string() to recognize string columns and
disallow LIKE operator on non-string columns.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Command line arguments are parsed twice in Scylla: once in main and once in Seastar's app_template::run.
The first parse is there to check if the "--version" flag is present --- in this case the version is printed
and the program exists. The second parsing is correct; however, most of the arguments were improperly treated
as positional arguments during the first parsing (e.g., "--network host" would treat "host" as a positional argument).
This happened because the arguments weren't known to the command line parser.
This commit fixes the issue by moving the parsing code until after the arguments are registered.
Resolves#4141.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
"
The put_row_diff, get_row_dif and get_full_row_hashes verbs are switched
to use rpc stream instead of rpc verb. They are the verbs that could
send big rpc messages. The rpc stream sink and source are created per
repair follower for each of the above 3 verbs. The sink and source are
shared for multiple requests during the entire repair operation for a
given range, so there is no overhead to setup rpc stream.
The row buffer is now increased to 32MiB from 256KiB, giving better
bandwidth in high latency links. The downside of bigger row buffer is
reduced possibility that all the rows inside a row buffer are identical.
This causes more full hashes to be exchanged. To address this issue, the
plan is to add better set reconciliation algorithm in addition to the
current send full hashes.
I compared rebuild using regular stream plan with repair using rpc
stream. With 2 nodes, 1 smp, 8M rows, delete all data on one of the
node before repair or rebuild.
repair using seastar rpc verb
Time to complete: 82.17s
rebuild using regular streaming which uses seastar rpc stream
Time to complete: 63.87s
repair using seastar rpc stream
Time to complete: 68.48s
For 1) and 3), the improvement is 16.6% (repair using rpc verb v.s. repair using rpc stream)
For 2) and 3), the difference is 7.2% (repair v.s. stream)
The result is promising for the future repair-based bootstrap/replace node operations.
NOTE: We do not actually enable rpc stream in row level repair for now. We
will enable it after we fix the the stall issues caused by handling
bigger row buffers.
Fixes#4581
"
* 'repair_switch_to_rpc_stream_v9' of https://github.com/asias/scylla: (45 commits)
docs: Add RPC stream doc for row level repair
repair: Mark some of the helper functions static
repair: Increase max row buf size
repair: Hook rpc stream version of verbs in row level repair
repair: Add use_rpc_stream to repair_meta
repair: Add is_rpc_stream_supported
repair: Add needs_all_rows flag to put_row_diff
repair: Optimize get_row_diff
repair: Register repair_get_full_row_hashes_with_rpc_strea
repair: Register repair_put_row_diff_with_rpc_stream
repair: Register repair_get_row_diff_with_rpc_stream
repair: Add repair_get_full_row_hashes_with_rpc_stream_handler
repair: Add repair_put_row_diff_with_rpc_stream_handler
repair: Add repair_get_row_diff_with_rpc_stream_handler
repair: Add repair_get_full_row_hashes_with_rpc_stream_process_op
repair: Add repair_put_row_diff_with_rpc_stream_process_op
repair: Add repair_get_row_diff_with_rpc_stream_process_op
repair: Add put_row_diff_with_rpc_stream
repair: Add put_row_diff_sink_op
repair: Add put_row_diff_source_op
...
If the cluster supports row level repair with rpc stream interface, we
can use bigger row buf size to have better repair bandwidth in high
latency links.
So we can avoid copy _working_row_buf in get_row_diff on master node if
there is only one follower node and all repair rows are needed by
follower node.
Move _working_row_buf instead of copy if it is follower node or
it is master node with only one follow. In these cases, the
_working_row_buf will not be used after this function, so we can move
it.
Fixes#4640
Iterating extensions in commitlog.cc should mimic that in sstables.cc,
i.e. a simple future-chain. Should also use same order for read and
write open, as we should preserve transformation stack order.
Message-Id: <20190702150028.18042-1-calle@scylladb.com>
On recent changes install.sh mistakenly copies dist/common/scripts to
/opt/scylladb/scripts/scripts, it should be /opt/scylladb/scripts.
Same on /opt/scylladb/scyllatop as well.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190702120030.13729-1-syuu@scylladb.com>
Use sink_source_for_repair to define sink_source_for_put_row_diff
with sink = repair_row_on_wire_with_cmd, source = repair_stream_cmd
for REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM rpc stream verb.
Use sink_source_for_repair to define sink_source_for_get_row_diff with
sink = repair_hash_with_cmd, source = repair_row_on_wire_with_cmd for
REPAIR_GET_ROW_DIFF_WITH_RPC_STREAM rpc stream verb.
Use the sink_source_for_repair to define
sink_source_for_get_full_row_hashes with sink = repair_stream_cmd,
source = repair_hash_with_cmd for
REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM rpc stream verb.
- REPAIR_GET_ROW_DIFF_WITH_RPC_STREAM
Get repair rows from follower nodes
- REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM
Put repair rows to follower nodes
- REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM:
Get full hashes from follower nodes
If someone opens a connection to port 9042 and sends some random bytes,
there is a 1 in 64 probability we'll recognize it as a valid frame
(since we only check the version byte, allowing versions 1-4) and we'll
try to read frame.length bytes for the body. If this value is very large,
we'll run out of memory very quickly.
Fix this by checking for reasonable body size (100kB). The initial message
must be a STARTUP, whose body is a [string map] of options, of which just
three are recognized. 100kB is plenty for future expansion.
Note that this does not replace true security on listening ports and
only serves to protect against mistakes, not attacks. An attacker can
easily exhaust server memory by opening many connections and trickle-feeding
them small amounts of data so they appear alive.
We can't use the config item native_transport_max_frame_size_in_mb,
because that can be legitimately large (and the default is atrocious,
256MB).
Fixes#4366.
This patchset allows changing the configuration at runtime, The user
triggers this by editing the configuration file normally, then
signalling the database with SIGHUP (as is traditional).
The implementation is somewhat complicated due the need to store
non-atomic mutable state per-shard and to synchronize the values in
all shards. This is somewhat similar to Seastar's sharded<>, but that
cannot be used since the configuration is read before Seastar is
initialized (due to the need to read command-line options).
Tests: unit (dev, debug), manual test with extra prints (dev)
Ref #2689Fixes#2517.
* seastar b629d5ef7a...a5b9f77d52 (6):
> perftune.py: add comment explaining why we don't log errors when binding NVMe IRQs for all but i3.nonmetal machines
> sharded: do a two phase shutdown for sharded services
> chunked_fifo: add iterator
> perftune.py: fix the i3 metal detection pattern
> core/memory: remove translation api
> reactor: file_type: offer option to not follow symbolic links
Change the type from bool to updateable_value<bool> throughout the dependency
chain and mark it as live updateable.
In theory we should also observe the value and trigger compaction if it changes,
but I don't think it is worthwhile.
Once we start using updateable_value<>, we must make it refer
to the updateable_value_source<> on the same shard, and to do
that we need to call broadcast_to_all_shards() first (this
creates the per-shard copy).
With dynamically updateable configuration, tracking the source of a value
is more important, since we'll accept or reject updates depending on the source.
Fix the source of client_encryption_options, which we RMW, by preserving the original
source.
Replace the per-shard value we store with an updateable_value_source, which
allows updating it dynamically and allows users to track changes.
The broadcast_to_all_shards() function is augmented to apply modifications
when called on a live system.
Since some of our values are not atomic (strings) and the administrative
information needed to track references to values is also not atomic, we will
need to store them per-shard. To do that we add a vector of per-shard data
to config_file, where each element is itself a vector of configuration items.
Since we need to operate generically on items (copying them from shard to shard)
we store them in a type-erased form.
Only mutable state is stored per-shard.
The updateable_value and updateable_value_source classes allow broadcasting
configuration changes across the application. The updateable_value_source class
represents a value that can be updated, and updateable_value tracks its source
and reflects changes. A typical use replaces "uint64_t config_item" with
"updateable_value<uint64_t> config_item", and from now on changes to the source
will be reflected in config_item. For more complicated uses, which must run some
callback when configuration changes, you can also call
config_item.observe(callback) to be actively notified of changes.
Dynamically updateable configuration requires checking whether configuration items
changed or not, so we can skip firing notifiers for the common case where nothing
changed.
This patch adds a comparison operator for seed_provider_type, which was missing it.
Currently, we allow adjusting configuration via
cfg.whatever() = 5;
by returning a mutable reference from cfg.whatever(). Soon, however, this operation
will have side effects (updating all references to the config item, and triggering
notifiers). While this can be done with a proxy, it is too tricky.
Switch to an ordinary setter interface:
cfg.whatever.set(5);
Because boost::program_options no longer gets a reference to the value to be written
to, we have to move the update to a notifier, and the value_ex() function has to
be adjusted to infer whether it was called with a vector type after it is
called, not before.
config_file and db::config are soon not going to be copyable. The reason is that
in order to support live updating, we'll need per-shard copies of each value,
and per-shard tracking of references to values. While these can be copied, it
will be an asycnronous operation and thus cannot be done from a copy constructor.
So to prepare for these changes, replace all copies of db::config by references
and delete config_file's copy constructor.
Some existing references had to be made const in order to adapt the const-ness
of db::config now being propagated (rather than being terminated by a non-const
copy).
Copying the config object breaks the link between the original and the copied
object, so updates to config items will not be visible. To allow updates, don't
copy any more, and instead keep a pointer.
The pointer won't work will once config is updateable, since the same object is
shared across multiple shard, but that can be addressed later.
Currently, database::_cfg is a copy of the global configuration. But this means
that we have multiple master copies of the configuration, which makes updating
the configuration harder. In order to eliminate the copy we have to eliminate the
database default constructor, which creates a config object, so that all
remaining constructors can receive config by reference and retain that reference.
We want to eliminate the default database constructor (to be explained in
the next patch), so eliminate its only use in gossip_test, using the regular
constructor instead.
With this unused descriptors and objects should always be poisoned.
* https://github.com/espindola/scylla/ align-descriptors-so-that-they-are-poisoned-v4:
Convert macros to inline functions
More precise poisoning in logalloc
On current scylla.spec, shell glob pattern "scylla_*setup" does not correctly
expanded, it mistakenly created a symlink named "/usr/sbin/scylla_*setup".
We need to expand them, need to create symlinks for each setup scripts.
Fixes#4605
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190627053530.10406-2-syuu@scylladb.com>
Since we don't install .pyc files on our package, python3 will generate .pyc
file when we launch setup script first time.
Then we will have unmanaged files under script directory, it will remain when
Scylla package upgraded / removed.
We need to compile *.py when we generate relocatable package, add compiled .pyc
files on .rpm/.deb packages.
Fixes#4612
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190627053530.10406-1-syuu@scylladb.com>
This change aligns descriptors and values to 8 bytes so that poisoning
a descriptor or value doesn't interfere with other descriptors and
values.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
When writing streamed data into sstables, while using time window
compaction strategy, we have to emit a new sstable for each time window.
Otherwise we can end up with sstables, mixing data from wildly different
windows, ruining the compaction strategy's ability to drop entire
sstables when all data within is expired. This gets worse as these mixed
sstables get compacted together with sstables that used to contain a
single time window.
This series provides a solution to this by segregating the data by its
atom's the time-windows. This is done on the new RPC streaming and the
new row-level, repair, memtable-flush and compaction, ensuring that the
segregation requirement is respected at all times.
Fixes: #2687
"
* 'segregate-data-into-sstables-by-time-window-streaming/v2.1' of ssh://github.com/denesb/scylla:
streaming,repair: restore indentation
repair: pass the data stream through the compaction strategy's interposer consumer
streaming: pass the data stream through the compaction strategy's interposer consumer
TWCS: implement add_interposer_consumer()
compaction_strategy: add add_interposer_consumer()
Add mutation_source_metadata
tests: add unit test for timestamp_based_splitting_writer
Add timestamp_based_splitting_writer
Introduce mutation_writer namespace
Most of our tests use overly simplistic schemas (`simple_schema`) or
very specialized ones that focus on exercising a specific area of the
tested code. This is fine in most places as not all code is schema
dependent, however practice has showed that there can be nasty bugs
hiding in dark corners that only appear with a schema that has a
specific combination of types.
This series introduces `tests::random_schema` a utility class for
generating random schemas and random data for them. An important goal is
to make using random schemas in tests as simple and convenient as
possible, therefore fostering the appearance of tests using random
schemas.
Random schema was developed to help testing code I'm currently working
on, which segregates data by time-windows. As I wasn't confident in my
ability to think of every possible combination of types that can break
my code I came up with random-schema to help me finding these corner
cases. So far I consider it a success, it already found bugs in my code
that I'm not sure I would have found if I had relied on specific
schemas. It also found bugs in unrelated areas of the code which proves
my point in the first paragraph.
* https://github.com/denesb/scylla.git random_schema/v5:
tests/data_model: approximate to the modeled data structures
data_value: add ascii constructor
tests/random-utils.hh: add stepped_int_distribution
tests/random-utils.hh: get_int() add overloads that accept external
rand engine
tests/random-utils.hh: add get_real()
tests: introduce random_schema
Exploit the interposer customization point to inject a consumer that will
segregate the mutation stream based on the contained atoms' timestamps,
allowing the requirements of TWCS to be mantained every time sstables
are written to disk.
For the implementation, `timestamp_based_splitting_writer` is used,
with a classifier that maps timestamps to windows.
This is a band-aid patch that is supposed to fix the immediate problem
of large collections causing large allocations. The proper fix is to
use IMR but that will take time. In the meanwhile alleviate the
pressure on the memory allocator by using a chunked storage collection
(utils::chunked_vector) instead of std::vector. In the linked issue
seastar::chunked_fifo was also proposed as the container to use,
however chunked fifo is not traversable in reverse which disqualifies
it from this role.
Refs: #3602
This will be the customization point for compaction strategies, used to
inject a specific interposer consumer that can manipulate the fragment
stream so that it satisfies the requirements of the compaction strategy.
For now the only candidate for injecting such an interposer is
time-window compaction strategy, which needs to write sstables that
only contains atoms belonging to the same time-window. By default no
interposer is injected.
Also add an accompanying customization point
`adjust_partition_estimate()` which returns the estimated per-sstable
partition-estimate that the interposer will produce.
This struct contains metadata regarding to a mutation_source. Currently
it contains the min and max timestamp. This will be used later by
compaction strategies to determine whether a given mutation stream has
to be split or not.
This writer implements the core logic of time-window based data
segregation. It splits the fragment stream provided by a reader, such
that each atom (cell) in the stream will be written into a consumer
based on the time-window its timestamp belongs to. The end result is
that each consumer will only see fragments, whoose atoms all have
timestamps belonging to the same time-window.
When a mutation fragment has atoms belonging to different time-windows,
it is split into as many fragments as needed so each has only atoms
that belong to the same time-window.
Currently there is a single mutation_writer: `multishard_writer`,
however in the next path we are going to add another one. This is the
right moment to move these into a common namespace (and folder), we
have way too much stuff scattered already in the top-level namespace
(and folder).
Also rename `tests/multishard_writer_test.cc` to
`tests/mutation_writer_test.cc`, this test-suite will be the home of all
the different mutation writer's unit test cases.
"
Currently, parser and the consumer save its state and return the
control to the caller, which then figures out that it needs to enter a
new partition, and that it doesn't need to skip. We do it twice, after
row end, and after row start. All this work could be avoided if the
consumer installed by the reader adjusted its state and pushed the
fragments on the spot. This patch achieves just that.
This results in less CPU overhead.
The ka/la reader is left still stopping after row end.
Brings a 20% improvement in frag/s for a full scan in perf_fast_forward (Haswell, NVMe):
perf_fast_forward -c1 -m1G --run-tests=small-partition-skips:
Before:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> 1 0 0.952372 4 1000000 1050009 755 1050765 1046585 976.0 971 124256 1 0 0 0 0 0 0 0 99.7%
After:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> 1 0 0.790178 4 1000000 1265538 1150 1266687 1263684 975.0 971 124256 2 0 0 0 0 0 0 0 99.6%
Tests: unit (dev)
"
* 'sstable-optimize-partition-scans' of https://github.com/tgrabiec/scylla:
sstable: mc: reader: Do not stop parsing across partitions
sstables: reader: Move some parser state from sstable_mutation_reader to mp_row_consumer_reader
sstables: reader: Simplify _single_partition_read checking
sstables: reader: Update stats from on_next_partition()
sstables: mutation_fragment_filter: Drop unnecessary calls to _walker.out_of_range()
sstables: ka/la: reader make push_ready_fragments() safe to call many times
sstables: mc: reader: Move out-of-range check out of push_ready_fragments()
sstables: reader: Return void from push_ready_fragments()
sstables: reader: Rename on_end_of_stream() to on_out_of_clustering_range()
sstables: ka/la: reader: Make sure push_ready_fragments() does not miss to emit partition_end
O_DSYNC causes commitlog to pre-allocate each commitlog segment by writing
zeroes into it. In normal operation, this is amortized over the many
times the segment will be reused. In tests, this is wasteful, but under
the default workstation configuration with /tmp using tmpfs, no actual
writes occur.
However on a non-default configuration with /tmp mounted on a real disk,
this causes huge disk I/O and eventually a crash (observed in
schema_change_test). The crash is likely only caused indirectly, as the
extra I/O (exacerbated by many tests running in parallel) xcauses timeouts.
I reproduced this problem by running 15 copies of schema_change_test in
parallel with /tmp mounted on a real filesystem. Without this change, I
usually observe one or two of the copies crashing, with the change they
complete (and much more quickly, too).
The tracker object was a static object in repair.cc. At the time we initialize
it, we do not know the smp::count, so we have to initialize the _repairs
object when it is used on the fly.
void init_repair_info() {
if (_repairs.size() != smp::count) {
_repairs.resize(smp::count);
}
}
This introduces a race if init_repair_info is called on different
thread(shard).
To fix, put the tracker object inside the newly introduced
repair_service object which is created in main.cc.
Fixes#4593
Message-Id: <b1adef1c0528354d2f92f8aaddc3c4bee5dc8a0a.1561537841.git.asias@scylladb.com>
This is quick fix to the immediate problem of large collections causing
large allocations, triggering stalls or OOM. The proper fix is to
use IMR for storing the cells, but that is a complex change that will
require time, so let's not stall/OOM in the meanwhile.
This series makes sure new schema is propagated to repair master and
follower nodes before repair.
Fixes#4575
* dev.git asias/repair_pull_schema_v2:
migration_manager: Add sync_schema
repair: Sync schema from follower nodes before repair
"
If the database supports infinite bound range deletions,
CQL layer will no longer throw an error indicating that both ranges
need to be specified.
Fixes#432
Update test_range_deletion_scenarios unit test accordingly.
"
* 'cql3-lift-infinite-bound-check' of https://github.com/bhalevy/scylla:
cql3: lift infinite bound check if it's supported
service: enable infinite bound range deletions with mc
database: add flag for infinite bound range deletions
Piotr Sarna says:
Fixes#4540
This series adds proper handling of aggregation for paged indexed queries.
Before this series returned results were presented to the user in per-page
partial manner, while they should have been returned as a single aggregated
value.
Tests: unit(dev)
Piotr Sarna (8):
cql3: split execute_base_query implementation
cql3: enable explicit copying of query_options
cql3: add a query options constructor with explicit page size
cql3: add proper aggregation to paged indexing
cql3: make DEFAULT_COUNT_PAGE_SIZE constant public
tests: add query_options to cquery_nofail
tests: add indexing + paging + aggregation test case
tests: add indexing+paging test case for clustering keys
* seastar ded50bd8a4...b629d5ef7a (9):
> sharded: no_sharded_instance_exception: fix grammar
> core,net: output_stream: remove redundant std::move()
> perftune: make sure that ethtool -K has a chance of succeeding
> net/dpdk: upgrade to dpdk-19.05
> perftune.py: Fix a few more places where we use deprecated pyudev.Device ones
> reactor: provide an uptime function
> rpc: add sink::flush() to streaming api
> Use a table to document the various build modes
> foreign_ptr: Fix compilation error due to unused variable
The view builder is started only if it's enabled in config,
via the view_building=true variable. Unfortunately, stopping
the builder was unconditional, which may result in failed
assertions during shutdown. To remedy this, view building
is stopped only if it was previously started.
Fixes#4589
Since commit "repair: Use the same schema version for repair master and
followers", repair master and followers uses the same schema version
that master decides to use during the whole repair operation. If master
has older version of schema, repair could ignore the data which makes use
of the new schema, e.g., writes to new columns.
To fix, always sync the schema agreement before repair.
The master node pulls schema from followers and applies locally. The
master then uses the "merged" schema. The followers use
get_schema_for_write() to pull the "merged" schema.
Fixes#4575
Backports: 3.1
random_schema is a utility class that provides methods for generating
random schemas as well as generating data (mutations) for them. The aim
is to make using random schemas in tests as simple and convenient as
is using `simple_schema`. For this reason the interface of
`random_schema` follows closely that of `simple_schema` to the extent
that it makes sense. An important difference is that `random_schema`
relies on `data_model` to actually build mutations. So all its
mutation-related operations work with `data_model::mutation_descrition`
instead of actual `mutation` objects. Once the user arrived at the
desired mutation description they can generate an actual mutation via
`data_model::mutation_description::build()`.
In addition to the `random_schema` class, the `random_schema.hh` header
exposes the generic utility classes for generating types and values
that it internally uses.
random_schema is fully deterministic. Using the same seed and the same
set of operations is guaranteed to result in generating the same schema
and data.
Make the the data modelling structures model their "real" counterparts
more closely, allowing the user greater control on the produced data.
The changes:
* Add timestamp to atomic_value (which is now a struct, not just an
alias to bytes).
* Add tombstone to collection.
* Add row_tombstone to row.
* Add bound kinds and tombstone to range_tombstone.
Great care was taken to preserve backward compatibility, to avoid
unnecessary changes in existing code.
If the database supports infinite bound range deletions,
CQL layer will no longer throw an error indicating that both ranges
need to be specified.
[bhalevy] Update test_range_deletion_scenarios unit test accordingly.
Fixes#432
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As soon as it's agreed that the cluster supports sstables in mc format,
infinite bound range deletions in statements can be safely enabled.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Database can only support infinite bound range deletions if sstable mc
format is supported. As a first step to implement these checks,
an appropriate flag is added to database.
Indexed queries used to erroneously return partial per-page results
for aggregation queries. This test case used to reproduce the problem
and now ensures that there would be no regressions.
Refs #4540
Aggregated and paged filtering needs to aggregate the results
from all pages in order to avoid returning partial per-page
results. It's a little bit more complicated than regular aggregation,
because each paging state needs to be translated between the base
table and the underlying view. The routine keeps fetching pages
from the underlying view, which are then used to fetch base rows,
which go straight to the result set builder.
Fixes#4540
For internal use, there already exists a query_options constructor
that copies data from another query_options with overwritten paging
state. This commit adds an option to overwrite page size as well.
In order to handle aggregation queries correctly, the function that
returns base query results is split into two, so it's possible to
access raw query results, before they're converted into end-user
CQL message.
pow2_rank is undefined for 0.
bucket_of currently works around that by using a bitmask of 0.
To allow asserting that count_{leading,trailing}_zeros are not
called with 0, we want to avoid it at all call sites.
Fixes#4153
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190623162137.2401-1-bhalevy@scylladb.com>
"
partitioned_sstable_set is not self sufficient because it relies on
compatible_ring_position_view, which in turn relies on lifetime of
sstable object. This leads to use-after-free. Fix this problem by
introducing compatible_ring_position and using it in p__s__s.
Fixes#4572.
Test: unit (dev), compaction dtests (dev)
"
* 'projects/fix_partitioned_sstable_set/v4' of ssh://github.com/bhalevy/scylla:
tests: Test partitioned sstable set's self-sufficiency
sstables: Fix partitioned_sstable_set by making it self sufficient
Introduce compatible_ring_position and compatible_ring_position_or_view
Partitioned sstable set is not self sufficient, because it uses compatible_ring_position_view
as key for interval map, which is constructed from a decorated key in sstable object.
If sstable object is destroyed, like when compaction releases it early, partitioned set
potentially no longer works because c__r__p__v would store information that is already freed,
meaning its use implies use-after-free.
Therefore, the problem happens when partitioned set tries to access the interval of its
interval map and uses freed information from c__r__p__v.
Fix is about using the newly introduced compatible_ring_position_or_view which can hold a
ring_position, meaning that partitioned set is no longer dependent on lifetime of sstable
object.
Retire compatible_ring_position_view.hh as it is now unused.
Fixes#4572.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The motivation for supporting ring position is that containers using
it can be self sufficient. The existing compatible_ring_position_view
could lead to use after free when the ring position data, it was built
from, is gone.
The motivation for compatible_ring_position_or_view is to allow lookup
on containers that don't support different key types using c__r__p,
and also to avoid unnecessary copies.
If the user is provided only with a ring_position_view, c__r__p__or_v
could be built from it and used for lookups.
Converting ring_position_view to ring_position is very bug prone because
there could be information lost in the process.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently to_string takes raw bytes. This means that to print a
data_value it has to first be serialized to be passed to to_string,
which will then deserializes it.
This patch adds a virtual to_string_impl that takes a data_value and
implements a now non virtual to_sting on top of it.
I don't expect this to have a performance impact. It mostly documents
how to access a data_value without converting it to bytes.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190620183449.64779-3-espindola@scylladb.com>
Recently, in merge commit 2718c90448,
we added the ability to cancel pending view-update requests when we detect
that the target node went down. This is important for view updates because
these have a very long timeout (5 minutes), and we wanted to make this
timeout even longer.
However, the implementation caused a race: Between *creating* the update's
request handler (create_write_response_handler()) and actually starting
the request with this handler (mutate_begin()), there is a preemption point
and we may end up deleting the request handler before starting the request.
So mutate_begin() must gracefully handle the case of a missing request
handler, and not crash with a segmentation fault as it did before this patch.
Eventually the lifetime management of request handlers could be refactored
to avoid this delicate fix (which requires more comments to explain than
code), or even better, it would be more correct to cancel individual writes
when a node goes down, not drop the entire handler (see issue #4523).
However, for now, let's not do such invasive changes and just fix bug that
we set out to fix.
Fixes#4386.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190620123949.22123-1-nyh@scylladb.com>
The repair_rows in row_list are sorted. It is only possible for the
current repair_row to share the same partition key with the last
repair_row inserted into repair_row_on_wire. So, no need to search from
the beginning of the repair_rows_on_wire to avoid quadratic complexity.
To fix, look at the last item in repair_rows_on_wire.
Fixes#4580
Message-Id: <08a8bfe90d1a6cf16b67c210151245879418c042.1561001271.git.asias@scylladb.com>
Tests without custom flags were already being run with -m2G. Tests
with custom flags have to manually specify it, but some were missing
it. This could cause tests to fail with std::bad_alloc when two
concurrent tests tried to allocate all the memory.
This patch adds -m2G to all tests that were missing it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190620002921.101481-1-espindola@scylladb.com>
"
Fixes#4569
This series fixes the infinite paging for indexed queries issue.
Before this fix, paging indexes tended to end up in an infinite loop
of returning pages with 0 results, but has_more_pages flag set to true,
which confused the drivers.
Tests: unit(dev)
Branches: 3.0, 3.1
"
* 'fix_infinite_paging_for_indexed_queries' of https://github.com/psarna/scylla:
tests: add test case for finishing index paging
cql3: fix infinite paging for indexed queries
We still has "{{^jessie}}" tag on scylla-server systemd unit file to
skip using AmbientCapabilities on Debian 8, but it does not able to work
anymore since we moved to single binary .deb package for all debian variants,
we must share same systemd unit file across all Debian variants.
To do so we need to have separated file on /etc/systemd to define
AmbientCapabilities, create the file while running postinst script only
if distribution is not Debian 8, just like we do in .rpm.
See #3344
See #3486
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190619064224.23035-1-syuu@scylladb.com>
Thrift 0.11 changed to generate c++ code with
std::shared_ptr instead of boost::shared_ptr.
- https://issues.apache.org/jira/browse/THRIFT-2221
This was forcing scylla to stick with older versions
of thrift.
Fixes issue #3097.
thrift: add type aliases to build with old and new versions.
update to using namespace =
Since we cannot use dh --with=systemd because we don't want to
automatically enabling systemd units, manage them by our setup scripts,
we have to do 'systemctl daemon-reload' manually.
(On dh --with=systemd, systemd helper automatically provides such
scirpts)
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190618000210.28972-1-syuu@scylladb.com>
Running tests in debug mode takes 25:22.08 in my machine. Using
sanitize instead takes that down to 10:46.39.
The mode is opt in, in that it must be explicitly selected with
"configure.py --mode=sanitize" or "ninja sanitize". It must also be
explicitly passed to test.py.
Unfortunately building with asan, optimizations and debug info is
very slow and there is nothing like -gline-tables-only in gcc.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190617170007.44117-1-espindola@scylladb.com>
Currently, parser and the consumer save its state and return the
control to the caller, which then figures out that it needs to enter a
new partition, and that it doesn't need to skip. We do it twice, after
row end, and after row start. All this work could be avoided if the
consumer installed by the reader adjusted its state and pushed the
fragments on the spot. This patch achieves just that.
This results in less CPU overhead.
The ka/la reader is left still stopping after row end.
Brings a 20% improvement in frag/s for a full scan in perf_fast_forward (Haswell, NVMe):
perf_fast_forward -c1 -m1G --run-tests=small-partition-skips:
Before:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> 1 0 0.952372 4 1000000 1050009 755 1050765 1046585 976.0 971 124256 1 0 0 0 0 0 0 0 99.7%
After:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> 1 0 0.790178 4 1000000 1265538 1150 1266687 1263684 975.0 971 124256 2 0 0 0 0 0 0 0 99.6%
The old code was making advance_to_next_partition() behave
incorrectly when _single_partition_read, which was compensated by a
check in read_partition().
Cleaner to exit early.
Not a bug fix, just makes the implementation more robust against changes.
Before this patch this might have resulted in partition_end being
pushed many times.
Currently, calling push_ready_fragments() with _mf_filter disengaged
or with _mf_filter->out_of_range() causes it to call
_reader->on_out_of_clustering_range(), which emits the partition_end
fragment. It's incorrect to emit this fragment twice, or zero times,
so correctness depends on the fact that push_ready_fragments() is
called exactly once when transitioning between partitions.
This is proved to be tricky to ensure, especially after partition_end
starts to be emitted in a different path as well. Ensuring that
push_ready_fragments() is *NOT* called after partition_end is emitted
from consume_partition_end() becomes tricky.
After having to fix this problem many times after unrelated changes to
the flow, I decide that it's better to refactor.
This change moves the call of on_out_of_clustering_range() out of
push_ready_fragments(), making the latter safe to call any number of
times.
The _mf_filter->out_of_range() check is moved to sites which update
the filter.
It's also good because it gets rid of conditionals.
Currently, if there is a fragment in _ready and _out_of_range was set
after row end was consumer, push_ready_fragments() would return
without emitting partition_end.
This is problematic once we make consume_row_start() emit
partiton_start directly, because we will want to assume that all
fragments for the previous partition are emitted by then. If they're
not, then we'd emit partition_start before partition_end for the
previous partition. The fix is to make sure that
push_ready_fragments() emits everything.
Indexed queries need to translate between view table paging state
and base table paging state, in order to be able to page the results
correctly. One of the stages of this translation is overwriting
the paging state obtained from the base query, in order to return
view paging state to the user, so it can be used for fetching next
pages. Unfortunately, in the original implementation the paging
state was overwritten only if more pages were available,
while if 'remaining' pages were equal to 0, nothing was done.
This is not enough, because the paging state of the base query
needs to be overwritten unconditionally - otherwise a guard paging state
value of 'remaining == 0' is returned back to the client along with
'has_more_pages = true', which will result in an infinite loop.
This patch correctly overwrites the base paging state unconditionally.
Fixes#4569
This patch set fixes repair nodes using different schema version and
optimizes the hashing thanks to the fact now all nodes uses same schema
version.
Fixes: #4549
* seastar-dev.git asias/repair_use_same_schema.v3:
repair: Use the same schema version for repair master and followers
repair: Hash column kind and id instead of column name and type name
It is guaranteed repair nodes use the same schema. It is faster to hash
column kind and id.
Changing the hashing of mutation fragment causes incompatibility with
mixed clusters. Let's backport to the 3.1 release, which includes row
level repair for the first time and is not released yet.
Refs: #4549
Backports: 3.1
Before this patch, repair master and followers use their own schema
version at the point repair starts independently. The schemas can be
different due to schema change. Repair uses the schema to serialize
mutation_fragment and deserialize the mutation_fragment received from
peer nodes. Using different schema version to serialize and deserialize
cause undefined behaviour.
To fix, we use the schema the repair master decides for all the repair
nodes involved.
On top of this patch, we could do another step to make sure all nodes
has the latest schema. But let's do it in a separate patch.
Fixes: #4549
Backports: 3.1
The intention is just to document what is currently done. If someone
wants to propose changes, that can be done after the current practices
have been documented.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190524135109.29436-1-espindola@scylladb.com>
The code that decides whether a query should used indexing was buggy - a partition key index might have influenced the decision even if the whole partition key was passed in the query (which effectively means that indexing it is not necessary).
Fixes#4539
Closes https://github.com/scylladb/scylla/pull/4544
Merged from branch 'fix_deciding_whether_a_query_uses_indexing' of git://github.com/psarna/scylla
tests: add case for partition key index and filtering
cql3: fix deciding if a query uses indexing
Currently NIC selection prompt on scylla_setup just proceed setup when
user just pressed Enter key on the prompt.
The prompt should ask NIC name again until user input correct NIC name.
Fixes#4517
Message-Id: <20190617124925.11559-1-syuu@scylladb.com>
"
These patches fix remaining issues with gcc9 build, that involve a gcc9 bug, a gcc9 bug, and a stricter warning.
Tests: unit(debug, dev, release).
"
* 'fix-gcc9-build' of https://github.com/pdziepak/scylla:
dht/ring_position: silence complaints about uninitialised _token_bound
xx_hasher: disable -Warray-bounds
api/column_family: work around gcc9 bug in seastar::future<std::any>
Currently, calling unfreeze() using the wrong version of the schema
results in undefined behavior. That can cause hard-to-debug
problems. Better to throw in such cases.
Refs #4549.
Tests:
- unit (dev)
Message-Id: <1560459022-23786-1-git-send-email-tgrabiec@scylladb.com>
In release mode gcc9 has a false positive warning about out of bound
access in xxhash implementation:
./xxHash/xxhash.c:799:27: error: array subscript -3 is outside array bounds of ‘long unsigned int [1]’ [-Werror=array-bounds]
This is solved by disabling -Warray-bounds in the xxhash code.
There is a gcc9 bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90415
that makes it impossible to pass std::any through a seastar::future<T>.
Fortunately, there is only one user of seastar::future<std::any> in
Scylla and it is not performance-critical. This patch avoids the gcc9
bug by using seastar::future<std::unique_ptr<std::any>>.
We saw a node crashing today with nodetool clearsnapshot being called.
After investigation, the reason is that nodetool clearsnapshot ws called
at the same time a new snapshot was created with the same tag. nodetool
clearsnapshot can't delete all files in the directory, because new files
had by then been created in that directory, and crashes on I/O error.
There are, many problems with allowing those operations to proceed in
parallel. Even if we fix the code not to crash and return an error on
directory non-empty, the moment they do any amount of work in parallel
the result of the operation becomes undefined. Some files in the
snapshot may have been deleted by clear, for example, and a user may
then not be able to properly restore from the backup if this snapshot
was used to generate a backup.
Moreover, although we could lock at the granularity of a keyspace or
column family, I think we should use a big hammer here and lock the
entire snapshot creation/deletion to avoid surprises (for example, if a
user requests creation of a snapshot for all keyspaces, and another
process requests clear of a single keyspace)
Fixes#4554
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190614174438.9002-1-glauber@scylladb.com>
In practice, we always want to use the same sanitizer flags with
seastar and scylla. Seastar was already marking its sanitizer flags
public, so what was missing was exporting the link flags via pkgconfig
and dropping the duplicates from scylla.
I am doing this after wasting some time editing the wrong file.
This depends on the seastar patch to export the sanitizer flags in
pkgconfig.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We used to use /opt/scylladb just for Scylla build toolchain and
dependency libraries, not for Scylla main package.
But since we merged relocatable package, Scylla main binary and
dependency libraries are all located under /opt/scylladb, only
setup scripts remained on /usr/lib/scylla.
It strange to keep using both /usr/lib/<app name> and /opt/<app name>,
we should merge them into single place.
Message-Id: <20190614011038.17827-1-syuu@scylladb.com>
Before this patch mc sstables writer was ignoring
empty cellpaths. This is a wrong behaviour because
it is possible to have empty key in a map. In such case,
our writer creats a wrong sstable that we can't read back.
This is becaus a complex cell expects cellpath for each
simple cell it has. When writer ignores empty cellpath
it writes nothing and instead it should write a length
of zero to the file so that we know there's an empty cellpath.
Fixes#4533
Tests: unit(release)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <46242906c691a56a915ca5994b36baf87ee633b7.1560532790.git.piotr@scylladb.com>
Consider
master: row(pk=1, ck=1, col=10)
follower1: row(pk=1, ck=1, col=20)
follower2: row(pk=1, ck=1, col=30)
When repair runs, master fetches row(pk=1, ck=1, col=20) and row(pk=1,
ck=1, col=30) from follower1 and follower2.
Then repair master sends row(pk=1, ck=1, col=10) and row(pk=1, ck=1,
col=30) to follower1, follower1 will write the row with the same
pk=1, ck=1 twice, which violates uniqueness constraints.
To fix, we apply the row with same pk and ck into the previous row.
We only needs this on repair follower because the rows can come from
multiple nodes. While on repair master, we have a sstable writer per
follower, so the rows feed into sstable writer can come from only a
single node.
Tests: repair_additional_test.py:RepairAdditionalTest.repair_same_row_diff_value_3nodes_test
Fixes: #4510
Message-Id: <cb4fbba1e10fb0018116ffe5649c0870cda34575.1560405722.git.asias@scylladb.com>
On repair follower node, only decorated_key_with_hash and the
mutation_fragment inside repair_row are used in apply_rows() to apply
the rows to disk. Allow repair_row to initialize partially and throw if
the uninitialized member is accessed to be safe.
Message-Id: <b4e5cc050c11b1bafcf997076a3e32f20d059045.1560405722.git.asias@scylladb.com>
To provide test reproducibility use the seastar local_random_engine.
To reproduce a run, use the --random-seed command line option
with the seed printed accordingly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190613114307.31038-1-bhalevy@scylladb.com>
When a column is not present in the select clause, but used for
filtering, it usually needs to be fetched from replicas.
Sometimes it can be avoided, e.g. if primary key columns form a valid
prefix - then, they will be optimized out before filtering itself.
However, clustering key prefix can only be qualified for this
optimization if the whole partition key is restricted - otherwise
the clustering columns still need to be present for filtering.
This commit also fixes tests in cql_query_test suite, because they now
expect more values - columns fetched for filtering will be present as
well (only internally, the clients receive only data they asked for).
Fixes#4541
Message-Id: <f08ebae5562d570ece2bb7ee6c84e647345dfe48.1560410018.git.sarna@scylladb.com>
Relocation of python scripts mentions scylla-server in paths explicitly.
It should use {{product}} instead. The current build is failing when
{{product}} is different than scylla-server
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190613012518.28784-1-glauber@scylladb.com>
On branch-3.1 / master, we are getting following error:
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/data: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/hints: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/commitlog: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/view_hints: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
It seems like owner verification of data directory fails because
scylla-server process is running in root but data directory owned by
scylla, so we should run services as scylla user.
Fixes#4536
Message-Id: <20190611113142.23599-1-syuu@scylladb.com>
To avoid 'Bad permmisons' error when user changed default umask, we need
to verify system umask is acceptable for scylla-server.
Fixes#4157
Message-Id: <20190612130343.6043-1-syuu@scylladb.com>
* seastar 253d6cb...ded50bd (14):
> Only export sanitizer flags if used
> perftune.py: use pyudev.Devices methods instead of deprecated pyudev.Device ones
> Add a Sanitize build mode
> Merge "perftune.py : new tuning modes" from Vlad
> reactor: clarify how submit_to() destroys the function object
> Export the sanitizer flags via pkgconfig
> smp: Delete unprocessed work items
> iotune: fixed finding mountpoint infinite loop
> net: Fix dereferencing moved object
> Always enable the exception scalability hack
> Merge "Simple cleanups in future.hh" from Rafael
> tests: introduce testing::local_random_engine
> core/deleter: Fix abort when append() is called twice with a shared deleter
> rpc stream: do not crash if a stream is used after eos
Currently, REPAIR_GET_COMBINED_ROW_HASH RPC verb returns only the
repair_hash object. In the future, we will use set reconciliation
algorithm to decode the full row hashes in working row buf. It is useful
to return the number of rows inside working row buf in addition to the
combined row hashes to make sure the decode is successful.
It is also better to use a wrapper class for the verb response so we can
extend the return values later more easily with IDL.
Fixes#4526
Message-Id: <93be47920b523f07179ee17e418760015a142990.1559771344.git.asias@scylladb.com>
With this patch, when using asan, we poison segment memory that has
been allocated from the system but should not be accessible to user
code.
Should help with debugging user after free bugs.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190607140313.5988-1-espindola@scylladb.com>
The code that decides whether a query should used indexing
was buggy - a partition key index might have influenced the decision
even if the whole partition key was passed in the query (which
effectively means that indexing it is not necessary).
Fixes#4539
Introduced in 513d01d53e
The script is trying to determine the branch to shallow clone
when an rpm is missing and has to be built.
This functionality in the current implementation assumes it is being run inside
a git repository, but that must not be the case if the script is triggered after
local rpms were placed on the local directory.
This happens when putting all necessary rpm files in: dist/ami/files
And then running: dist/ami/build_ami.sh --localrpm
The dist/ami/ and dist/ami/files are the only ones required for this action so
querying the git repository in that situation makes no sense.
Fixes#4535
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190611112455.13862-1-bhalevy@scylladb.com>
Build progress virtual reader uses Scylla-specific
scylla_views_builds_in_progress table in order to represent legacy
views_builds_in_progress rows. The Scylla-specific table contains
additional cpu_id clustering key part, which is trimmed before
returning it to the user. That may cause duplicated clustering row
fragments to be emitted by the reader, which may cause undefined
behaviour in consumers. The solution is to keep track of previous
clustering keys for each partition and drop fragments that would cause
duplication. That way if any shard is still building a view, its
progress will be returned, and if many shards are still building, the
returned value will indicate the progress of a single arbitrary shard.
Fixes#4524
Tests:
unit(dev) + custom monotonicity checks from tgrabiec@scylladb.com
Build progress virtual reader uses Scylla-specific
scylla_views_builds_in_progress table in order to represent
legacy views_builds_in_progress rows. The Scylla-specific table contains
additional cpu_id clustering key part, which is trimmed before returning
it to the user. That may cause duplicated clustering row fragments to be
emitted by the reader, which may cause undefined behaviour in consumers.
The solution is to keep track of previous clustering keys for each
partition and drop fragments that would cause duplication. That way if
any shard is still building a view, its progress will be returned,
and if many shards are still building, the returned value will indicate
the progress of a single arbitrary shard.
Fixes#4524
Tests:
unit(dev) + custom monotonicity checks from <tgrabiec@scylladb.com>
All Scylla code is written with "using namespace seastar", i.e., no
"seastar::" prefix for Seastar symbols. Document this in the coding style.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190610203948.18075-1-nyh@scylladb.com>
Fixes#4525
req_param uses boost::lexical cast to convert text->var.
However, lexical_cast does not handle textual booleans,
thus param=true causes not only wrong values, but
exceptions.
Message-Id: <20190610140511.15478-1-calle@scylladb.com>
Currently, each shard protects itself by not reading from rpc and the native
transport if in-flight requests consume too much memory for that shard. However,
if all shards then forward their requests to some other shard, then that shard
can easily run out of memory since its load can be multiplied by the number of
shards that send it requests.
To protect against this, use the new Seastar smp_service_group infrastructure.
We create three groups: read, write, and write ack (the latter is needed to
avoid ABBA deadlocks is shard A exhausts all its resources sending writes to shard B,
and shard B simulateously does the same; neither will be able to send
acknowledgements, so if the writes are throttled, they will never be unthrottled
until a timeout occurs).
Range scans are not addressed by this patch since they are handled by
multishard_mutation_query, which has its own complex cross-shard communication
scheme, but it be a similar solution.
Ref #1105 (missing range scan protection)
Tests: unit (dev)
Message-Id: <20190512142243.17795-1-avi@scylladb.com>
If a port value passed as a string this makes the cluster.connect() to
fail with Python3.4.
Let's fix this by explicitly declaring a 'port' argument as 'int'.
Fixes#4527
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190606133321.28225-1-vladz@scylladb.com>
We want to use the same branch on the other repos build-ami needs
as the one we're building for. Automatically find the current branch
using the `git branch` command.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190606133648.15877-1-bhalevy@scylladb.com>
This patch adds a warning option to the user for situations where
rows count may get bigger than initially designed. Through the
warning, users can be aware of possible data modeling problems.
The threshold is initially set to '100,000'.
Tests: unit (dev)
Message-Id: <20190528075612.GA24671@shenzou.localdomain>
A lot of code in scylla is only reachable if SEASTAR_DEFAULT_ALLOCATOR
is not defined. In particular, refill_emergency_reserve in the default
allocator case is empty, but in the seastar allocator case it compacts
segments.
I am trying to debug a crash that seems to involve memory corruption
around the lsa allocator, and being able to use a debug build for that
would be awesome.
This patch reduces the differences between the two cases by having a
common segment_pool that defers only a few operations to different
segment_store implementations.
Tests: unit (debug, dev)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190606020937.118205-1-espindola@scylladb.com>
Prometheus needs to remember which "instance" (node) each measurement
came from. But it doesn't actually need Scylla to tell it the instance
name - it knows which node it got each measurement from.
After Seastar commit 79281ef287
which fixed Seastar issue https://github.com/scylladb/seastar/issues/477,
the "instance" label on measurements no longer comes from Scylla but rather
is added by Prometheus. This patch corrects the documentation to explain the
current situation, instead of incorrectly saying that Scylla adds the
"instance" label itself.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190602074629.14336-1-nyh@scylladb.com>
The relocatable Python is built from Fedora packages. Unfortunately TLS
certificates are in a different location on Debian variants, which
causes "node_exporter_install" to fail as follows:
Traceback (most recent call last):
File "/usr/lib/scylla/libexec/node_exporter_install", line 58, in <module>
data = curl('https://github.com/prometheus/node_exporter/releases/download/v{version}/node_exporter-{version}.linux-amd64.tar.gz'.format(version=VERSION), byte=True)
File "/usr/lib/scylla/scylla_util.py", line 40, in curl
with urllib.request.urlopen(req) as res:
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 543, in _open
'_open', req)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 1360, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 1319, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>
Unable to retrieve version information
node exporter setup failed.
Fix the problem by overriding the SSL_CERT_FILE environment variable to
point to the correct location of the TLS bundle.
Message-Id: <20190604175434.24534-1-penberg@scylladb.com>
The to_json_string utility implementation was based on const references
instead of views, which can be a source of unnecessary memory copying.
This patch migrates all to_json_string to use bytes_view and leaves
the const reference version as a thin wrapper.
Message-Id: <2bf9f1951b862f8e8a2211cb4e83852e7ac70c67.1559654014.git.sarna@scylladb.com>
"
Technically queue_reader already exists, however so far it was a
private utility in `multishard_writer.cc`. This mini-series makes it
public and generally useful. The interface is made safer and simpler and
the implementation is improved so it doesn't have two separate buffers.
Also, unit tests are added.
Tests: mutation_reader_test:debug/test_queue_reader, multishard_writer_test:debug
"
* 'queue_reader/v2' of https://github.com/denesb/scylla:
queue_reader: use the reader's buffer as the queue
Make queue_reader public
The queue reader currently uses two buffers, a `_queue` that the
producer pushes fragments into and its internal `_buffer` where these
fragments eventually end up being served to the consumer from.
This double buffering is not necessary. Change the reader to allow the
producer to push fragments directly into the internal `_buffer`. This
complicates the code a little bit, as the producer logic of
`seastar::queue` has to be folded into the queue reader. On the other
hand this introduces proper memory consumption management, as well as
reduces the amount of consumed memory and eliminates the possibility of
outside code mangling with the queue. Another big advantage of the
change is that there is now an explicit way to communicate the EOS
condition, no need to push a disengaged `mutation_fragment_opt`.
The producer of the queue reader now pushes the fragments into the
reader via an opaque `queue_reader_handle` object, which has the
producer methods of `seastar::queue`.
Existing users of queue readers are refactored to use the new interface.
Since the code is more complex now, unit tests are added as well.
We have a script in tree that fixes the schema for distributed system
tables, like tracing, should they change their schema. We use it all the
time but unfortunately it is not distributed with the scylla package,
which makes it using it harder (we want to do this in the server, but
consistent updates will take a while).
One of the problems with the script today that makes distributing it
harder is that it uses the python3 cassandra driver, that we don't want
to have as a server dependency. But now with the relocatable packages in
place there is no reaso not to just add it.
[avi: adjust tools/toolchain/image to point to a new image with
python3-cassandra-driver]
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190603162447.24215-1-glauber@scylladb.com>
Unlike CentOS, Debian variants has python3 package on official repository,
so we don't have to use relocatable python3 on these distributions.
However, official python3 version is different on each distribution, we may
have issue because of that.
Also, our scripts and packaging implementation are becoming presuppose
existence of relocatable python3, it is causing issue on Debian
variants.
Switching to relocatable python3 on Debian variants avoid these issues,
it will easier to manage Scylla python3 environments accross multiple
distributions.
Fixes#4495
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190531112707.20082-1-syuu@scylladb.com>
Building on Ubuntu 18 or 19 following the current build instructions
doesn't work. Add information about a few pitfalls. Switch README.md
to recommending dbuild and move the details to HACKING.md.
Message-Id: <20190520152738.GA15198@atlas>
Unlike CentOS, Debian variants has python3 package on official repository,
so we don't have to use relocatable python3 on these distributions.
However, official python3 version is different on each distribution, we may
have issue because of that.
Also, our scripts and packaging implementation are becoming presuppose
existence of relocatable python3, it is causing issue on Debian
variants.
Switching to relocatable python3 on Debian variants avoid these issues,
it will easier to manage Scylla python3 environments accross multiple
distributions.
Fixes#4495
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190526105138.677-1-syuu@scylladb.com>
Apparently we are having some issues running iotune in the i3en instances,
as the values not always make sense. We believe it is something that XFS
is doing, and running fio directly on the device (no filesystem) provides
more meaningful results (more according to AWS published expected values).
For now, let's use fio instead. In this patch I have ran fio for our 4
dimensions in each of the three types of disks (large, xlarge, 3xlarge).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190524111454.27956-1-glauber@scylladb.com>
"
Before this patchset empty counters were incorrectly persisted for
MC format. No value was written to disk for them. The correct way
is to still write a header that informs the counter is empty.
We also need to make sure that reading wrongly persisted empty
counters works because customers may have sstables with wrongly
persisted empty counters.
Fixes#4363
"
* 'haaawk/4363/v3' of github.com:scylladb/seastar-dev:
sstables: add test for empty counters
docs: add CorrectEmptyCounters to sstable-scylla-format
sstables: Add a feature for empty counters in Scylla.db.
sstables: Write header for empty counters
sstables: Remove unused variables in make_counter_cell
sstables: Handle empty counter value in read path
"
One local utility function in cql_query_test.cc duplicates an existing
exception_predicate member. Another can be generalized for wider use
in the future. This patch accomplishes both, retiring a to-do item.
Tests: unit (dev)
"
* 'use-utils-predicate-in-cql_test' of https://github.com/dekimir/scylla:
tests/cql: Replace equery() with cquery_nofail()
tests: Add cquery_nofail() utility
tests: Drop redundant function
Most tests await the result of cql_test_env::execute_cql(). Most
would also benefit from reporting errors with top-level location
included.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
* seastar 3f7a5e1...5cb1234 (5):
> build: Help Seastar to find Boost on Fedora 30
> Merge 'Reinstate nowait aio support' from Avi
> Fix documentation link in README.md
> sharded: add variants to invoke_on() that accept an smp_service_group
> improve error message on AIO setup failure
Due to a bug in an sstable writer, empty counters
were stored without a header.
Correct way of storing empty counter is to still write
a header that indicates the emptiness.
Next patch in this series fixes the write path
but we have to make sure that we handle incorrectly
serialized counters in the read path becuase there
may exist sstables with counters stored without header.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Users may want to know which version of packages are used for the AMI,
it's good to have it on AMI tags and description.
To do this, we need to download .rpm from specified .repo, extract
version information from .rpm.
Fixes#4499
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190520123924.14060-2-syuu@scylladb.com>
Refs #3929
Optionally enables O_DSYNC mode for segment files, and when
enabled ignores actual flushing and just barriers any ongoing
writes.
Iff using O_DSYNC mode, we will not only truncate the file
to max size, but also do an actual initial write of zero:s
to it, since XFS (intended target) has observably less good
behaviour on non-physical file blocks. Once written (and maybe
recycled) we should have rather satisfying throughput on writes.
Note that the O_DSYNC behaviour is hidden behind a default
disabled option. While user should probably seldom worry about
this, we should add some sort of logic i main/init that unless
specified by user, evaluates the commitlog disk and sets this
to true if it is using XFS and looks ok. This is because using
O_DSYNC on things like EXT4 etc has quite horrible performance.
All above statements about performance and O_DSYNC behaviour
are based on a sampling of benchmark results (modified fsqual)
on a statistically non-ssignificant selection of disks. However,
at least there the observed behaviour is a rather large
difference between ::fallocate:ed disk area vs. actually written
using O_DSYNC on XFS, and O_DSYNC on EXT4.
Note also that measurements on O_DSYNC vs. no O_DSYNC does not
take into account the wall-clock time of doing manual disk flush.
This is intentionally ignored, since in the commitlog case, at
least using periodic mode, flushes are relatively rare.
Message-Id: <20190520120331.10229-1-calle@scylladb.com>
"
Add a fallback-mode that can be used when the `scylla ptr` cannot be
used, either because the application is not built with the seastar
allocator, or due to bugs. The fallback mode relies on a more primitive
method for determining how much memory to scan looking for task pointers
inside the task object. This mode, being more primitive, is less prone
to errors, but is more wasteful and less precise.
"
* 'scylla-fiber-fallback-mode/v2' of https://github.com/denesb/scylla:
scylla-gdb.py: scylla_fiber: add fallback mode
scylla-gdb.py: scylla_ptr: add is_seastar_allocator_used()
scylla-gdb.py: pointer_metadata: allow constructing from non-seastar pointers
scylla-gdb.py: scylla_fiber: fix misaligned text in docstring
We compile Seastar unconditionally so that changes to Seastar files are
reflected in Scylla when it's built.
We don't need to unconditionally build `iotune` in the same way.
`iotune` is still listed as a build artifact, so it will be built if
`ninja` is invoked without a particular target.
However, building a specific target (like `ninja build/dev/scylla`) will
not build `iotune`.
Fixes#4165
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <9fb96a281580a8743e04d5dd11398be53960cb58.1558100815.git.jhaberku@scylladb.com>
"
This patchset fixes reactor stalls caused by cache invalidation not being preemptible.
This becomes a problem when there is a lot of partitions in cache inside the invalidated range.
This affects high-level operations like nodetool refresh, table
truncation, repair and streaming.
Fixes#2683
The improvement on stalls was measured using tests/perf_row_cache_update:
Before:
Small partitions, no overwrites:
invalidation: 339.420624 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 339.422144 [ms]}
Small partition with a few rows:
invalidation: 191.855331 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 191.856816 [ms]}
Large partition, lots of small rows:
invalidation: 0.959328 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 0.961453 [ms]}
After:
Small partitions, no overwrites:
invalidation: 400.505554 [ms], preemption: {count: 843, 99%: 0.545791 [ms], max: 0.502340 [ms]}
Small partition with a few rows:
invalidation: 306.352600 [ms], preemption: {count: 644, 99%: 0.545791 [ms], max: 0.506464 [ms]}
Large partition, lots of small rows:
invalidation: 0.963660 [ms], preemption: {count: 2, 99%: 0.009887 [ms], max: 0.963264 [ms]}
The maximum scheduling latency went down form 339 ms to 0.5 ms (task quota).
Tests:
- unit (dev)
"
* tag 'cache-preemptible-invalidation-v2' of github.com:tgrabiec/scylla:
row_cache: Make invalidate() preemptible
row_cache: Switch _prev_snapshot_pos to be a ring_position_ext
dht: Introduce ring_position_ext
dht: ring_position_view: Take key by const pointer
tests: perf_row_cache_update: Rename 'stall' to 'preemption' to avoid confusion
tests: perf_row_cache_update: Report stalls around invalidation
To make product name templatization works correctly, we cannot use
"debian/scylla-server" as package contents directory path,
need to use template like "debian/{{product}}-server" instead.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190517121946.18248-1-syuu@scylladb.com>
The calculation is done in a non preemptable loop over all tables, so if
numbers of tables is very large it may take a while since we also build
a string for gossiper state. Make the loop preemtable and also make
the string calculation more efficient by preallocating memory for it.
Message-Id: <20190516132748.6469-3-gleb@scylladb.com>
invoke_on_all() copies provided function for each shard it is executed
on, so by moving stats map into the capture we copy it for each shard
too. Avoid it by putting it into the top level object which is already
captured by reference.
Message-Id: <20190516132748.6469-2-gleb@scylladb.com>
All `table` instances currently unconditionally allocate a cell locker
for counter cells, though not all need one. Since the lockers occupy
quite a bit of memory (as reported in #4441), it's wasteful to
allocate them when unneeded.
Fixes#4441.
Tests: unit (dev, debug)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Message-Id: <20190515190910.87931-1-dejan@scylladb.com>
"
This series contains loosely related generic cleanup patches that the
timestamp based data segregation series depends on. Most of the patches
have to do with making headers self-sustainable, that is compilable on
their own. This was needed to be able to ensure that the new headers
introduced or touched by that series are self-sustainable too.
This series also introduces `schema_fwd.hh` which contains a forward
declaration of `schema` and `schema_ptr` classes. No effort was made to
find and replace all existing ad-hoc schema forward declarations in the
source tree.
"
* 'pre-timestamp-based-data-segregation-cleanup/v1' of https://github.com/denesb/scylla:
encoding_stats.hh: add missing include
sstables/time_window_compaction_strategy.hh: make self-sufficient
sstables/size_tiered_compaction_strategy.hh: make self-sufficient
sstables/compaction_strategy_impl.hh: make header self-sufficient
compaction_strategy.hh: use schema_fwd.hh
db/extensions.hh: use schema_fwd.hh
Add schema_fwd.hh
Makes opening all sstable components go through same file open
routine, optionally applying extensions to each (except TOC which
is special).
Also ensures we read Scylla metadata before other non-TOC
components, as we might need this for extensions (hint hint).
Message-Id: <20190513201821.14417-1-calle@scylladb.com>
The current implementation of the `scylla fiber` command relies on the
`scylla ptr` command to provide metadata on pointers, more
specifically the boundaries of the region the object they point to
occupies. However, in debug mode, seastar is using the standard allocator
and thus the `scylla ptr` command doesn't work.
To work around this, provide a fallback mode for debug builds. This mode
assumes pointers point to the start of objetcts and scans a
configurable region of memory. While less exact than the variant relying
on `scylla ptr` it still works reasonably well.
The size of the to-be-scanned memory region can be set using the
`--scanned-region-size` command line argument. This defaults to 512.
Additionally, add a flag (`--force-fallback-mode`) to force using the
fallback mode. This is useful if `scylla ptr` is not working for any
reason.
Determines whether the application is using the seastar allocator or
not. This is done by attempting to resolve the
`seastar::memory::cpu_mem` symbol. To avoid the expensive symbol lookup
the result is cached. This means that loading a new inferior will
possibly return the wrong value. The cache can be flushed by re-sourcing
the `scylla-gdb.py` script.
"
Although CQL allows SELECT statements with both simple and aggregate
selectors, Scylla disallows them. This patch removes that restriction
and ensures that mixed simple/aggregate selection works as specified
both with and without GROUP BY.
Tests: unit (dev)
"
* 'aggregate-and-simple-select-together' of https://github.com/dekimir/scylla:
cql: Fix mixed selection with GROUP BY
cql: Allow mixing of aggregate and simple selectors
GROUP BY is currently supported by simple_selection, the class used
when all selectors are simple. But when selectors are mixed, we use
selection_with_processing, which does not yet support GROUP BY. This
patch fixes that.
It also adapts one testcase in filtering_test to the new behavior of
simple_selector. The test currently expects the last value seen, but
simple_selector now outputs the first value seen.
(More details: the WHERE clause implicitly selects the columns it
references, and unit tests are forced to provide expected values for
these columns. The user-visible result is unchanged in the test;
users never see the WHERE column values due to filtering in
cql::transport, outside unit tests.)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Scylla currently rejects SELECT statements with both simple and
aggregate selectors, but Cassandra allows them. This patch brings
parity to Scylla.
Fixes#4447.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
Commit d0f9e00 changed the representation of the gc_clock::duration
from int32_t to int64_t.
Mutation hashing uses appending_hash<gc_clock::time_point>, which by
default feeds duration::count() into the hasher. duration::rep changed
from int32_t to int64_t, which changes the value of the hash.
This affects schema digest and query digests, resulting in mismatches
between nodes during a rolling upgrade.
Fixes#4460.
Refs #4485.
"
* tag 'fix-gc_clock-digest-v2.1' of github.com:tgrabiec/scylla:
tests: Add test which verifies that schema digest stays the same
tests: Add sstables for the schema digest test
schema_tables, storage_service: Make schema digest insensitive to expired tombstones in empty partition
db/schema_tables: Move feed_hash_for_schema_digest() to .cc file
hashing: Introduce type-erased interface for the hasher
hashing: Introduce C++ concept for the hasher
hashers: Rename hasher to cryptopp_hasher
gc_clock: Fix hashing to be backwards-compatible
The _make_config_values macro reduces duplication (both the item name
and the types need to be available as C++ identifiers and as runtime
strings), but is hard to work with. The macro is huge and editors
don't handle it well, errors aren't identified at the correct
location, and since the macro doesn't have types, it's hard to
refactor.
This series replaces the macro with ordinary C++ code. Some repetition is
introduced, but IMO the result is easier to maintain than the macro. As a
bonus the bulk of the code is moved away from the header file.
Tests: unit (dev), manual testing of the config REST API
* https://github.com/avikivity/scylla config-no-macro/v2
config: make the named_value type name available without requiring
_make_config_values
config: remove value_status from named_value template parameter list
config: add named_value::value_as_json()
api: config: stop using _make_config_values
config: auto-add named_values into config_file
config: add allowed_values parameter to named_value constructor
config: convert _make_config_values to individual named_value member
declarations and initializers
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
That's not a problem during normal cluster operation because the
tombstones will expire at all nodes at the same time, and schema
digest, although changes, will change to the same value on all nodes
at about the same time.
This fix changes digest calculation to not feed any digest for
partitions which are empty after compaction.
The digest returned by schema_mutations::digest() is left unchanged by
this patch. It affects the table schema version calculation. It's not
changed because the version is calculated on boot, where we don't yet
know all the cluster features. It's possible to fix this but it's more
complicated, so this patch defers that.
Refs #4485.
Asd
The motivation is to allow hiding the definition of functions
accepting a hasher. For one, this reduces (re)complication times,
because we can put the definition in .cc
"
scylla_setup is currently broken for OEL. This happens because the
OS detection code checks for RHEL and Fedora. CentOS returns itself
as RHEL, but OEL does not.
"
* 'unbreakable' of github.com:glommer/scylla:
scylla_setup: be nicer about unrecognized OS
scylla_util: recognize OEL as part of the RHEL family
Right now if the user tries to execute this in an unrecognized OS, the
following will be thrown:
Traceback (most recent call last):
File "/usr/lib/scylla/libexec/scylla_setup", line 214, in <module>
do_verify_package('scylla-enterprise-jmx')
File "/usr/lib/scylla/libexec/scylla_setup", line 73, in do_verify_package
if res != 0:
UnboundLocalError: local variable 'res' referenced before assignment
It would be a lot nicer to exit gracefully and print a messge saying what
is going on. This was caught when running on OEL, which the previous patch
fixed. Still, there are other unknown OS out there the users may try to run
on.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Oracle Linux is a RHEL-like distribution and we support it just fine, but our
new incarnation of scylla_setup is failing to recognize it.
os-release for OEL is a bit different. It doesn't have an ID_LIKE string, and
only shows an ID string, which is set to 'ol'. So let's recognize this.
Fixes: #4493
Branches: 3.1
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This change inserts preemption points between removal of partitions.
The main complication is in maintaining consitency in the face of
concurrent population or eviction. We use the same mechanism which is
used by memtable updates. _prev_snapshot_pos is the ring position
which partitions the ring into the part which is already updated in
cache and the one which is yet to be updated. That position should be
set accordingly on preemption.
In case of invalidation, updating means removing all entries in the
range and marking the range as discontinuous. When resuming
invalidation of a range we continue from _prev_snapshot_pos as the
lower bound.
This affects high-level operations like nodetool refresh, table
truncation, repair and streaming.
Fixes#2683
The improvement on stalls was measured using tests/perf_row_cache_update:
Before
Small partitions, no overwrites:
invalidation: 339.420624 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 339.422144 [ms]}
Small partition with a few rows:
invalidation: 191.855331 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 191.856816 [ms]}
Large partition, lots of small rows:
invalidation: 0.959328 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 0.961453 [ms]}
After:
Small partitions, no overwrites:
invalidation: 400.505554 [ms], preemption: {count: 843, 99%: 0.545791 [ms], max: 0.502340 [ms]}
Small partition with a few rows:
invalidation: 306.352600 [ms], preemption: {count: 644, 99%: 0.545791 [ms], max: 0.506464 [ms]}
Large partition, lots of small rows:
invalidation: 0.963660 [ms], preemption: {count: 2, 99%: 0.009887 [ms], max: 0.963264 [ms]}
The maximum scheduling latency went down form 339 ms to 0.5 ms (task quota).
dht::ring_position cannot represent all ring_position_view instances,
in particular those obtained from
dht::ring_position_view::for_range_start(). To allow using the latter,
switch to views.
It's an owning version of ring_position_view.
Note that ring_position has a narrower domain than the
ring_position_view for historical reasons, so we cannot use that.
* seastar f73690e...3f7a5e1 (7):
> Revert "Make sure all allocations/deallocations are properly byte aligned"
> http: fix request content for POST requests
> doc: discourage generic lambdas and unconstrained templates
> smp: add smp_service_group for smp::submit_to() resource control
> Revert "smp: add smp_service_group for smp::submit_to() resource control"
> smp: add smp_service_group for smp::submit_to() resource control
> Make sure all allocations/deallocations are properly byte aligned
Commit d0f9e00 changed the representation of the gc_clock::duration
from int32_t to int64_t.
Mutation hashing uses appending_hash<gc_clock::time_point>, which by
default feeds duration::count() into the hasher. duration::rep changed
from int32_t to int64_t, which changes the value of the hash.
This affects schema digest and query digests, resulting in mismatches
between nodes during a rolling upgrade.
Fixes#4460.
(cherry picked from commit 549d0eb2f3)
"
This series contains fixes for GCC9 build, mostly corrections needed
after changes in libstdc++. With this series and a workaround for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90415 (not included)
Scylla builds and passes unit tests with GCC9 (tested on Fedora 30,
development mode only).
Tests: unit(dev with gcc8 and gcc9).
"
* tag 'gcc9-fixes/v1' of https://github.com/pdziepak/scylla:
tests/imr: add missing noexcept
counters: bytes_view::pointer is not const pointer
imr/fundamental: use bytes_view::const_pointer for const pointer
In libstdc++ for gcc9 std::basic_string_view::pointer isn't const any
more. As a result the compiler is complaining about reinterpret_cast
casting away const. The solution is to use std::conditional<> to choose
between const pointer for counter view and non-const pointer for mutable
counter view.
In libstdc++ shipped with gcc9 std::basic_string_view::pointer is no
longer constant, which is causing the compiler to complain about
dropping const in reinterpret_cast. The solution is to use
std::basic_string_view::const_pointer.
"
Fix the broken logic that is meant to prevent sending hints when node is
in a DOWN NORMAL state.
"
* 'hinted_handoff_stop_sending_to_down_node-v2' of https://github.com/vladzcloudius/scylla:
hints_manager: rename the state::ep_state_is_not_normal enum value
hinted handoff: fix the logic that detects that the destination node is in DN state
hinted_handoff: sender::can_send(): optimize gossiper::is_alive(ep) check
hinted handoff: end_point_hints_manager::sender: use _gossiper instead of _shard_manager.local_gossiper()
types.cc: fix the compilation with fmt v5.3.0
Before ede1d248af, running "tools/toolchain/dbuild -it -- bash" was
a nice way to play in the toolchain environment, for example to start
a debugger. But that commit caused containers to run in detached mode,
which is incompatible with interactive mode.
To restore the old behavior, detect that the user wants interactive mode,
and run the container in non-detached mode instead. Add the --rm flag
so the container is removed after execution (as it was before ede1d248af).
Message-Id: <20190506175942.27361-1-avi@scylladb.com>
In 0ea6df, duration_test was made to link against all tests/*.o files.
This isn't necessary, as it only needs tests/exception_utils.o. This
patch narrows down duration_test's dependences to only
exception_utils.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Message-Id: <20190508211630.108228-1-dejan@scylladb.com>
Add new test cases:
- disallow creating a non-FULL index on frozen collections
- disallow repeated creation of a FULL index on frozen collections
- disallow FULL indexes on non-frozen collections
- disallow referencing frozen-map entries in the WHERE clause
Also add error-message expectations to existing test cases.
Fixes#3654.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Message-Id: <20190509025806.124499-1-dejan@scylladb.com>
Since other build_*.sh are for running inside extracted relocatable
package, they have SCYLLA-PRODUCT-FILE on top of the directory,
but build_ami.sh is not running in such condition, we need to run
SCYLLA-VERSION-GEN first, then refer to build/SCYLLA-PRODUCT-FILE.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190509110621.27468-1-syuu@scylladb.com>
Rename this state value to better reflect the reality:
state::ep_state_is_not_normal -> state::ep_state_left_the_ring
The manager gets to this state when the destination Node has left the ring.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
When node is in a DN state its gossiper state may be NORMAL, SHUTDOWN
or "" depending on the use case.
In addition to that if node has been removed from the ring its state is
also going to be removed from the gossiper_state map.
Let's consider the above when deciding if node is in the DN state.
Fixes#4461
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
AWS just released their new instances, the i3en instances. The instance
is verified already to work well with scylla, the only adjustments that
we need is advertise that we support it, and pre-fill the disk
information according to the performance numbers obtained by running the
instance.
Fixes#4486
Branches: 3.1
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190508170831.6003-1-glauber@scylladb.com>
"
Cassandra has supported GROUP BY in SELECT statements since 2016
(v3.10), while ScyllaDB currently treats it as a syntax error. To
achieve parity with Cassandra in this important bit of functionality,
this patch adds full support for GROUP BY, from parsing to validation
to implementation to testing.
"
* 'groupby-implPP' of https://github.com/dekimir/scylla:
Implement grouping in selection processing
Propagate GROUP BY indices to result_set_builder
Process GROUP BY columns into select_statement
Parse GROUP BY clause, store column identifiers
Make result_set_builder obey its _group_by_cell_indices by recognizing
group boundaries and resetting the selectors.
Also make simple_selectors work correctly when grouping.
Fixes#2206.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Ensure that the indices recorded in select_statement are passed to
result_set_builder when one is created for processing the cell values.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Extend the grammar file with GROUP BY, collect the column identifiers,
and store them in raw::select_statement.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Whenever the iotune_args array uses "--smp", it needs cpudata.smp()
which returns an integer instead of a string. So when iotune_args is
passed to subprocess.check_call(), it actually throws "TypeError:
expected str, bytes or os.PathLike object, not int" but
"%s did not pass validation tests, it may not be on XFS..." is shown as
the exception.
Even though the user inputs correct arguments, it might still throw an
error and confuse the user that he/she has not passed the right
arguments.
One simple fix is to use str(cpudata.smp()) instead of cpudata.smp().
Signed-off-by: JP-Reddy <guthijp.reddy@gmail.com>
Message-Id: <20190406070118.48477-1-guthijp.reddy@gmail.com>
"
gcc 9 complains a lot about pessimizing moves, narrowing conversions, and
has tighter deduction rules, plus other nice warnings. Fix problems found
by it, and make some non-problems compile without warnings.
"
* tag 'gcc9/v1' of https://github.com/avikivity/scylla:
types: fix pessimizing moves
thrift: fix pessimizing moves
tests: fix pessimizing moves
tests: cql_query_test: silence narrowing conversion warning
test: cql_auth_syntax_test: fix ambiguity due to parser uninitialized<T>
table: fix potentially wrong schema when reading from zero sstables
storage_proxy: fix pessimizing moves
memtable: fix pessimizing moves
IDL: silence narrowing conversion in bool serializer
compaction: fix pessimizing moves
cache: fix pessimizing moves
locator: fix pessimizing moves
database: fix pessimizing moves
cql: fix pessimizing moves
cql parser: fix conversion from uninitalized<T> to optional<T> with gcc 9
We use the schema during creation of the mutation_source rather than
during the query itself. Likely they're the same, and since no rows
are returned from a zero-sstable query, harmless. But gcc 9 complains.
Fix by using the query's schema.
bool serializers are now aliases to int8_t serializers, but gcc 9
complains about narrowing conversions, due to the path int8_t -> int -> bool.
A bad narrowing conversion here cannot happen in practice, but massage
the code a little to silence it.
We use uninitialized<T> (wrapping an optional<T>) to adjust to the
parser's way of laying out the code, but this fails with gcc 9
(presumably for the correct reasons) when converting from
uninitialized<T> back to optional<T>. Add a conversion operator
to make it build.
Many tests verify exception messages. Currently, they do so via
verbose lambdas or inner functions that hide test-failure locations.
This patch adds utilities for quick creation of message-checking tests
and replaces existing ad-hoc methods with these new utilities.
Tests: unit (dev)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Message-Id: <20190506210006.124645-1-dejan@scylladb.com>
"
Commit d0f9e00 changed the representation of the gc_clock::duration
from int32_t to int64_t.
Mutation hashing uses appending_hash<gc_clock::time_point>, which by
default feeds duration::count() into the hasher. duration::rep changed
from int32_t to int64_t, which changes the value of the hash.
This affects schema digest and query digests, resulting in mismatches
between nodes during a rolling upgrade.
Fixes#4460.
Branches: 3.1
"
* tag 'fix-gc_clock-digest-v1' of github.com:tgrabiec/scylla:
tests: Add test which verifies that schema digest stays the same
tests: Add sstables for the schema digest test
gc_clock: Fix hashing to be backwards-compatible
Commit d0f9e00 changed the representation of the gc_clock::duration
from int32_t to int64_t.
Mutation hashing uses appending_hash<gc_clock::time_point>, which by
default feeds duration::count() into the hasher. duration::rep changed
from int32_t to int64_t, which changes the value of the hash.
This affects schema digest and query digests, resulting in mismatches
between nodes during a rolling upgrade.
Fixes#4460.
* seastar 4cdccae...f73690e (16):
> sstring: silence technically correct but unhelpful warning in sstring move ctor
> cmake: add a seastar_supports_flag function
> future: Fix build with libc++'s non-trivially-constructible std::tuple<>
> Revert "Make sure all allocations are properly bytes aligned"
> Merge "future: simplify future_state management" from Rafael
> Make sure all allocations are properly bytes aligned
> util/log: use correct clock type
> core/reactor: don't assume system_clock::duration is in nanoseconds
> Merge "Optimize the future_state move constructor" from Rafael
> rpc: don't use boost/variant.hpp directly
> core/memory: Omit [[gnu::leaf]] attribute on clang
> Fix build with std::filesystem
> Merge "Fix clang build and tests" from Rafael
> cmake: Move ) out of quotes
> Merge "Fix some bugs found by (or perhaps in) gcc 9" by Avi
> Deduplicate Seastar dependencies management in CMake scripts
scylla-housekeeping always wants to run in the installation to check if
we are running the latest version. This happens regardless of whether or
not we said yes or no to the housekeeping scylla_setup question - as
that question only deals with whether or not we want to do this through
a timer.
It is fine to try to run scylla-housekeeping, as long as we time it out.
The current code doesn't.
The naive solution is to add a timeout parameter to urllib.request.open.
However, that timeout is not respected and in my tests I saw real
timeouts up to four times higher the timeout we set. For a reasonable 5s
timeout, this mean a 20s real timeout which can lead to a very bad user
experience. This seems to be a known problem with this module according
to a quick Google search.
This patch then takes a slightly more complex solution and uses
multiprocess to enforce a well-defined user-visible timeout.
Fixes#3980
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190506122335.5707-1-glauber@scylladb.com>
batchlog make copies of topology and endpoint array in batchlog endpoint
choosing code. There is a remark that at least endpoint copy is
deliberate because Cassandra code has it. We do not have to follow. Our
endpoint calculation code is atomic, so we can use a reference.
Message-Id: <20190506115815.GK21208@scylladb.com>
After incremental compaction, new sstables may have already replaced old
sstables at any point. Meaning that a new sstable is in-use by table and
a old sstable is already deleted when compaction itself is UNFINISHED.
Therefore, we should *NEVER* delete a new sstable unconditionally for an
interrupted compaction, or data loss could happen.
To fix it, we'll only delete new sstables that didn't replace anything
in the table, meaning they are unused.
Found the problem while auditting the code.
Fixes#4479.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190506134723.16639-1-raphaelsc@scylladb.com>
Mutations carry their schema, so use that instead of bring in a global schema,
which may change as features are added.
Message-Id: <20190505132542.6472-1-avi@scylladb.com>
perf_fast_forward is used to detect performance regressions. The two
main metrics used for this are fargments per second and the number of
the IO operations. The former is a median of a several runs, but the
latter is just the actual number of asynchronous IO operations performed
in the run that happened to be picked as a median frag/s-wise. There's
no always a direct correlation between frag/s and aio and the latter can
vary which makes the latter hard to compare.
In order to make this easier a new metric was introduced: "average aio"
which reports the average number of asynchronous IO operations performed
in a run. This should produce much more stable results and therefore
make the comparison more meaningful.
Message-Id: <20190430134401.19238-1-pdziepak@scylladb.com>
std::regex_match of the leading path may run out of stack
with long paths in debug build.
Using rfind instead to lookup the last '/' in in pathname
and skip it if found.
Fixes#4464
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190505144133.4333-1-bhalevy@scylladb.com>
The setup script asks the user whether or not housekeeping should
be called, and in the first time the script is executed this decision
is respected.
However if the script is invoked again, that decision is not respected.
This is because the check has the form:
if (housekeeping_cfg_file_exists) {
version_check = ask_user();
}
if (version_check) { do_version_check() } else { dont_do_it() }
When it should have the form:
if (housekeeping_cfg_file_exists) {
version_check = ask_user();
if (version_check) { do_version_check() } else { dont_do_it() }
}
(Thanks python)
This is problematic in systems that are not connected to the internet, since
housekeeping will fail to run and crash the setup script.
Fixes#4462
Branches: master, branch-3.1
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190502034211.18435-1-glauber@scylladb.com>
Newer versions of RHEL ship the os-release file with newlines in the
end, which our script was not prepared to handle. As such, scylla_setup
would fail.
This patch makes our OS detection robust against that.
Fixes#4473
Branches: master, branch-3.1
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190502152224.31307-1-glauber@scylladb.com>
"
One of the fixes is for incorrect recognition of memory pages as belonging
or not belonging to small allocation pools in some cases.
Also, compensates for https://github.com/scylladb/seastar/issues/608 in "scylla memory",
which improves accurracy of the small allocation pool report.
Fixes "scylla task_histogram" to not look into pages which do not belong to live
small allocation pool spans.
Fixes#4367Fixes#4368
"
* tag 'gdb-fix-span-qualification-v2' of github.com:tgrabiec/scylla:
gdb: Print size of large allocations in 'scylla ptr'
gdb: Fix 'scylla ptr' for free pages
gdb: Set is_live and offset for large allocations properly in 'scylla ptr'
gdb: Fix 'scylla ptr' misqualifying pointers
gdb: Make 'scylla memory' show unused memory in small pools
gdb: Fix small pool memory usage reporting in 'scylla memory'
gdb: Switch 'scylla memory' to use the span_checker to find large spans
gdb: Switch task_histogram to use the span_checker
gdb: Introduce span_checker
endpoint_filter() function assumes that each bucket of
std::unordered_multimap contains elements with the same key only, so
its size can be used to know how many elements with a particular key
are there. But this is not the case, elements with multiple keys may
share a bucket. Fix it by counting keys in other way.
Fixes#3229
Message-Id: <20190501133127.GE21208@scylladb.com>
The existing unit test test_secondary_index_contains_virtual_columns
reproduced a bug (issue #4144) with indexing of primary-key columns,
but we only actually tested clustering columns. In issue #4471 there
was a question whether we may still have a bug when indexing of
*partition-key* columns. This patch adds a test that verifies that
we don't, and this case works well too.
Refs #4144
Refs #4471
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190501113500.25900-1-nyh@scylladb.com>
Until this patch, dropping columns from a table was completely forbidden
if this table has any materialized views or secondary indexes. However,
this is excessively harsh, and not compatible with Cassandra which does
allow dropping columns from a base table which has a secondary index on
*other* columns. This incompatibility was raised in the following
Stackoverflow question:
https://stackoverflow.com/questions/55757273/error-while-dropping-column-from-a-table-with-secondary-index-scylladb/55776490
In this patch, we allow dropping a base table column if none of its
materialized views *needs* this column. Columns selected by a view
(as regular or key columns) are needed by it, of course, but when
virtual columns are used (namely, there is a view with same key columns
as the base), *all* columns are needed by the view, so unfortunately none
of the columns may be dropped.
After this patch, when a base-table column cannot be dropped because one
of the materialized views needs it, the error message will look like:
exceptions::invalid_request_exception: Cannot drop column a from base
table ks.cf: a materialized view cf_a_idx_index needs this column.
This patch also includes extensive testing for the cases where dropping
columns are now allowed, and not allowed. The secondary-index tests are
especially interesting, because they demonstrate that now usually (when
a non-key column is being indexed) dropping columns will be allowed,
which is what originally bothered the Stackoverflow user.
Fixes#4448.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190429214805.2972-1-nyh@scylladb.com>
There are several places were IN restrictions are not currently supported,
especially in queries involving a secondary index. However, when the IN
restriction has just a single value, it is nothing more than an equality
restriction and can be converted into one and be supported. So this patch
does exactly this.
Note that Cassandra does this conversion since August 2016, and therefore
supports the special case of single-value IN even where general IN is not
supported. So it's important for Cassandra compatibility that we do this
conversion too.
This patch also includes a test with two queries involving a secondary
index that were previously disallowed because of the "IN" on the primary
key or the indexed column - and are now allowed when the IN restriction
has just a single value. A third query tested is not related to secondary
indexes, but confirms we don't break multi-column single-value IN queries.
Fixes#4455.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190428160317.23328-1-nyh@scylladb.com>
The shard readers of the multishard reader assumed that the positions in
the data stream are strictly monotonous. This assumption is invalid.
Range tombstones can have positions that they can share with other range
tombstones and/or a clustering row. The effect of this false assumption
was that when the shard reader was evicted such that the last seen
fragment was a range tombstone, when recreated it would skip any unseen
fragments that have the same position as that of the last seen range
tombstone.
Fixes: #4418
Branches: master, 3.0, 2019.1
Tests: unit(dev)
* https://github.com/denesb/scylla.git
multishard_reader_handle_non_strictly_monotonous_positions/v4:
multishard_combining_reader: shard_reader::remote_reader extract
fill-buffer logic into do_fill_buffer()
mutlishard_combining_reader: reorder
shard_reader::remote_reader::do_fill_buffer() code
position_in_partition_view: add region() accessor
multishard_combining_reader: fix handling of non-strictly monotonous
positions
flat_mutation_reader: add flat_mutation_reader_from_mutations()
overload with range and slice
flat_mutation_reader: add make_flat_mutation_reader_from_fragments()
overload with range and slice
tests: add unit test for multishard reader correctly handling
non-strictly monotonous positions
Currently null and missing values are treated differently. Missing
values throw no_such_column. Null values return nullptr, std::nullopt
or throw null_column_value.
The api is a bit confusing since a function returning a std::optional
either returns std::nullopt or throws depending on why there is no
value.
With this patch series only get_nonnull throws and there is only one
exception type.
* https://github.com/espindola/scylla.git espindola/merge-null-and-missing-v2:
query-result-set: merge handling of null and missing values
Remove result_set_row::has
Return a reference from get_nonnull
No reason to copy if we don't have to. Now that get_nonnull doesn't
copy, replace a raw used of get_data_value with it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Now that the various get methods return nullptr or std::nullopt on
missing values, we don't need to do double lookups.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Nothing seems to differentiate a missing and a null value. This patch
then merges the two exception types and now the only method that
throws is get_nonnull. The other methods return nullptr or
std::nullopt as appropriate.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
After 7c87405, schema sync includes system_schema.view_virtual_columns in the
schema digest. Old nodes don't know about this table and will not include it
in the digest calculation. As a result, there will be schema disagreement
until the whole cluster is upgraded.
Also, the order in which tables were hashed changed in 7c87405, which
causes digests to differ in some schemas.
Fixes#4457.
"
* tag 'fix-disagreement-during-upgrade-v2' of github.com:tgrabiec/scylla:
db/schema_tables: Include view_virtual_columns in the digest only when all nodes do
storage_service: Introduce the VIEW_VIRTUAL_COLUMNS cluster feature
db/schema_tables: Hash schema tables in the same order as on 3.0
db/schema_tables: Remove table name caching from all_tables()
treewide: Propagate schema_features to db::schema::all_tables()
enum_set: Introduce full()
service/storage_service: Introduce cluster_schema_features()
schema: Introduce schema_features
schema_tables: Propagate storage_service& to merge_schema()
gms/feature: Introduce a more convenient when_enabled()
gms/feature: Mark all when_enabled() overloads as const
Currently, we use --sig-proxy to forward signals to the container. However, this
requires the container's co-operation, which usually doesn't exist. For example,
docker run --sig-proxy fedora:29 bash -c "sleep 5"
Does not respond to ctrl-C.
This is a problem for continuous integration. If a build is aborted, Jenkins will
first attempt to gracefully terminate the processes (SIGINT/SIGTERM) and then give
up and use SIGKILL. If the graceful termination doesn't work, we end up with an
orphan container running on the node, which can then consume enough memory and CPU
to harm the following jobs.
To fix this, trap signals and handle them by killing the container. Also trap
shell exit, and even kill the container unconditionally, since if Jenkins happens
to kill the "docker wait" process the regular paths will not be taken.
We lose a lot by running the container asynchronously with the dbuild shell
script, so we need to add it back:
- log display: via the "docker logs" command
- auto-removal of the container: add a "docker rm -f" command on signal
or normal exit
Message-Id: <20190424130112.794-1-avi@scylladb.com>
To be able to support this new overload, the reader is made
partition-range aware. It will now correctly only return fragments that
fall into the partition-range it was created with. For completeness'
sake and to be able to test it, also implement
`fast_forward_to(const dht::partition_range)`. Slicing is done by
filtering out non-overlapping fragments from the initial list of
fragments. Also add a unit test that runs it through the mutation_source
test suite.
After 7c87405, schema sync includes system_schema.view_virtual_columns
in the schema digest. Old nodes don't know about this table and will
not include it in the digest calculation. As a result, there will be
schema disagreement until the whole cluster is upgraded.
Fix this by taking the new table into account only when the whole
cluster is upgraded.
The table should not be used for anything before this happens. This is
not currently enforced, but should be.
Fixes#4457.
Needed for determining if all nodes in the cluster are aware of the
new schema table. Only when all nodes are aware of it we can take it
into account when calculating schema digest, otherwise there would be
permanent schema disagreement in during rolling upgrade.
The commit 7c87405 also indirectly changed the order of schema tables
during hash calculation (index table should be taken after all other
tables). This shows up when there is an index created and any of {user
defined type, function, or aggregate}.
Refs #4457.
It can be invoked with a lambda without the ceremony of creating a
class deriving from gms::feature::listener.
The reutrned registration object controls listener's scope.
There was nothing keeping the verify lambda alive after the return. It
worked most of the time since the only state kept by the lambda was
a pointer to cql_test_env.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190426203823.15562-1-espindola@scylladb.com>
* seastar e84d2647c...4cdccae53 (4):
> Merge "future: Move some code out of line" from Rafael
> tests: socket_test: Add missing virtual and override
> build: Don't pass -Wno-maybe-uninitialized to clang
> Merge "expose file_permssions for creating files and dirs in API" from Benny
To be able to run the mutation-source test suite with this reader. In
the next patch, this reader will be used in testing another reader, so
it is important to make sure it works correctly first.
The shard readers under a multishard reader are paused after every
operation executed on them. When paused they can be evicted at any time.
When this happens, they will be re-created lazily on the next
operation, with a start position such that they continue reading from
where the evicted reader left off. This start position is determined
from the last fragment seen by the previous reader. When this position
is clustering position, the reader will be recreated such that it reads
the clustering range (from the half-read partition): (last-ckey, +inf).
This can cause problems if the last fragment seen by the evicted reader
was a range-tombstone. Range tombstones can share the same clustering
position with other range tombstones and potentially one clustering row.
This means that when the reader is recreated, it will start from the
next clustering position, ignoring any unread fragments that share the
same position as the last seen range tombstone.
To fix, ensure that on each fill-buffer call, the buffer contains all
fragments for the last position. To this end, when the last fragment in
the buffer is a range tombstone (with pos x), we continue reading until
we see a fragment with a position y that is greater. This way it is
ensured that we have seen all fragments for pos x and it is safe to
resume the read, starting from after position x.
gossiper::is_alive() has a lot of not needed checks (e.g. is_me(ep)) that
are irrelevant for HH use case and we may safely skip them.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Compilation fails with fmt release 5.3.0 when we print a bytes_view
using "{}" formatter.
Compiler's complain is: "error: static assertion failed: mismatch
between char-types of context and argument"
Fix this by explicitly using to_hex() converter.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
In order to avoid schema disagreements during upgrades (which may lead
to deadlocks), system distributed keyspace initialization is moved
right before starting the bootstrapping process, after the schema
agreement checks already succeeded.
Fixes#3976
Message-Id: <932e642659df1d00a2953df988f939a81275774a.1556204185.git.sarna@scylladb.com>
To make it available, we'll need to make it optional the usage of level metadata,
used to deal with interval map's fragmentation issue when level 0 falls behind,
and also introduce a interface for compaction strategies to implement
make_sstable_set() that instantiate partitioned sstable set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190424232948.668-1-raphaelsc@scylladb.com>
Instead of app-template::run_deprecated() and at_exit() hooks, use
app_template::run() and RAII (via defer()) to stop services. This makes it
easier to add services that do support shutdown correctly.
Ref #2737
Message-Id: <20190420175733.29454-1-avi@scylladb.com>
While causing some duplication (names are explicitly instead of implicitly
stringified, and names are repeated in the member declaration and initializer),
it is overall more maintainable than the huge macro. It is easier to overload
named_value constructors when you can get error reporting on the line where the error
occurs, for example.
Now that named_value::value_as_json() exists, make use of it to report the
current value of a configuration variable via the REST API, instead of
_make_config_values().
Currently, the REST API does its own conversion of named_value into json.
This requires it to use the _make_config_values macro to perform iteration
of all config items, since it needs to preserve the concrete type of the item
while iterating, so it can select the correct json conversion.
Since we want to remove that macro, we need to provide a different way to
convert a config item to json. So this patch adds a value_as_json().
To hide json_return_value from the rest of the system, we extend config_type
with a conversion function to handle the details. This usually calls
the json_return_type constructor directly, but when it doesn't have default
translation, it interposes a conversion into a type that json recognizes.
I didn't bother maintaining the existing type names, since they're C++
names which don't make sense for the UI.
The value_status is only needed at run-time, and removing it from the
template parameter list reduces type proliferation (which leads to code
bloat) and simplifies the code.
I want to remove the _make_config_values macro, but it is needed now in
api/config.cc to make the type names available. So as a first step, copy the
type names to config_src. Further changes can extract it from there.
Because we want to add more type infomation in following patches, place the type
name in a new config_type object, instead of allocating a string_view in
config_src.
compact_and_evict gets memory_to_release in bytes while
reclamation step is in segments.
Broken in f092decd90.
It doesn't make much difference with the current default step of 1
segment since we cannot reclaim less than that, so shouldn't cause
problems in practice.
Message-Id: <1556013920-29676-1-git-send-email-tgrabiec@scylladb.com>
Currently stop returns ready future immediately. This is not a problem
since calculation loop holds a shared pointer to the local service, so
it will not be destroyed until calculation completes and global database
object db, that also used by the calculation, is never destroyed. But the
later is just a workaround for a shutdown sequence that cannot handle
it and will be changed one day. Make cache hitrate calculation service
ready for it.
Message-Id: <20190422113538.GR21208@scylladb.com>
non_system_filter lambda is defined static which means it is initialized
only once, so the 'this' that is will capture will belong to a shard
where the function runs first. During service destruction the function
may run on different shard and access already other's shard service that
may be already freed.
Fixed#4425
Message-Id: <20190421152139.GN21208@scylladb.com>
This patch add the node_exporter to the docker image.
It install it create and run a service with it.
After this patch node_exporter will run and will be part of scylla
Docker image.
Fixes#4300
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20190421130643.6837-1-amnon@scylladb.com>
To prepare for a seastar change that adds an optional file_permissions
parameter to touch_directory and recursive_touch_directory.
This change messes up the call to io_check since the compiler can't
derive the Func&& argument. Therefore, use a lambda function instead
to wrap the call to {recursive_,}touch_directory.
Ref #4395
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190421085502.24729-1-bhalevy@scylladb.com>
When we start the LSA reclamation it can be that
segment_pool::_free_segments is 0 under some conditions and
segment_pool::_current_emergency_reserve_goal is set to 1. The
reclamation step is 1 segment, and compact_and_evict_locked() frees 1
segment back into the segment_pool. However,
segment_pool::reclaim_segments() doesn't free anything to the standard
allocator because the condition _free_segments >
_current_emergency_reserve_goal is false. As a result,
tracker::impl::reclaim() returns 0 as the amount of released memory,
tracker::reclaim() returns
memory::reclaiming_result::reclaimed_nothing and the seastar allocator
thinks it's a real OOM and throws std::bad_alloc.
The fix is to change compact_and_evict() to make sure that reserves
are met, by releasing more if they're not met at entry.
This change also allows us to drop the variant of allocate_segment()
which accepts the reclamation step as a means to refill reserves
faster. This is now not needed, because compact_and_evict() will look
at the reserve deficit to increase the amount of memory to reclaim.
Fixes#4445
Message-Id: <1555671713-16530-1-git-send-email-tgrabiec@scylladb.com>
* seastar eb03ba5cd...e84d2647c (14):
> Fix hardcoded python paths in shebang line
> Disable -Wmaybe-uninitialized everywhere
> app_template: allow opting out of automatic SIGINT/SIGTERM handling
> build: Restore DPDK machine inference from cflags
> http: capture request content for POST requests
> Merge "Simplify future_state and promise" from Rafael
> temporary_buffer: fix memleak on fast path
> perftune.py: allow explicitly giving a CPU mask to be used for binding IRQs
> perftune.py: fix the sanity check for args.tune
> perftune.py: identify fast-path hardware queues IRQs of Mellanox NICs
> memory: malloc_allocator should be always available
> Merge "Using custom allocator in the posix network stack" from Elazar
> memory: Tell reclaimers how much should be reclaimed
> net/ipv4_addr: add std::hash & operator== overloads
To prevent running entrypoint script in another python3 package like
python36 in EPEL, move /opt/scylladb/python3/bin to top of $PATH.
It won't happen on this container image, but may occurs when user tries to
extend the image.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190417165806.12212-1-syuu@scylladb.com>
perftune.py executes hwloc-calc, the command is now provided as
relocatable binary, placed under /opt/scylladb/bin.
So we need to add the directory to PATH when calling
subprocess.check_output(), but our utility function already do that,
switch to it.
Fixes#4443
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190418124345.24973-1-syuu@scylladb.com>
When we add product name customization, we mistakenly defined the
parameter on each package build script.
Number of script is increasing since we recently added relocatable
python3 package, we should merge it in single place.
Also we should save the parameter on relocatable package, just like
version-release parameters.
So move the definition to SCYLLA-VERSION-GEN, save it to
build/SCYLLA-PRODUCT-FILE then archive it to relocatable package.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190417163335.10191-1-syuu@scylladb.com>
Seastar now supports two RPC compression algorithm: the original LZ4 one
and LZ4_FRAGMENTED. The latter uses lz4 stream interface which allows it
to process large messages without fully linearising them. Since, RPC
requests used by Scylla often contain user-provided data that
potentially could be very large, LZ4_FRAGMENTED is a better choice for
the default compression algorithm.
Message-Id: <20190417144318.27701-1-pdziepak@scylladb.com>
Since we moved relocatable .rpm now Scylla able to run on Amazon Linux
2.
However, is_redhat_variant() on scylla_util.py does not works on Amazon
Linux 2, since it does not have /etc/redhat-release.
So we need to switch to /etc/os-release, use ID_LIKE to detect Redhat
variants/Debian variants.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190417115634.9635-1-syuu@scylladb.com>
Switch to relocatable python3 instead of EPEL's python3 on docker-entrypoint.py.
Also drop uneeded dependencies, since we switched to relocatable scylla
image.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190417111024.6604-1-syuu@scylladb.com>
"
Previously we weren't validating elements of collections so it
was possible to add non-UTF-8 string to a column with type
list<text>.
Tests: unit(release)
Fixes#4009
"
* 'haaawk/4009/v5' of github.com:scylladb/seastar-dev:
types: Test correct map validation
types: Test correct in clause validation
types: Test correct tuple validation
types: Test correct set validation
types: Test correct list validation
types: Add test_tuple_elements_validation
types: Add test_in_clause_validation
types: Add test_map_elements_validation
types: Add test_set_elements_validation
types: Add test_list_elements_validation
types: Validate input when tuples
types: Validate input when parsing a set
types: Validate input when parsing a map
types: Validate input when parsing a list
types: Implement validation for tuple
types: Implement validation for set
types: Implement validation for map
types: Implement validation for list
types: Add cql_serialization_format parameter to validate
1. All nodes in the cluster have to support MC_SSTABLE_FEATURE
2. When a node observes that whole cluster supports MC_SSTABLE_FEATURE
then it should start using MC format.
3. Once all shards start to use MC then a node should broadcast that
unbounded range tombstones are now supported by the cluster.
4. Once whole cluster supports unbounded range tombstones we can
start accepting them on CQL level.
tests: unit(release)
Fixes#4205Fixes#4113
* seastar-dev.git dev/haaawk/enable_mc/v11:
system_keyspace: Add scylla_local
system_keyspace: add accessors for SCYLLA_LOCAL
storage_service: add _sstables_format field
feature: add when_enabled callbacks
system_keyspace: add storage_service param to setup
Add sstable format helper methods
Register feature listeners in storage_service
Add service::read_sstables_format
Use read_sstables_format in main.cc
Use _sstables_format to determine current format
Add _unbounded_range_tombstones_feature
Update supported features on format change
Currently, we use --sig-proxy to forward signals to the container. However, this
requires the container's co-operation, which usually doesn't exist. For example,
docker run --sig-proxy fedora:29 bash -c "sleep 5"
Does not respond to ctrl-C.
This is a problem for continuous integration. If a build is aborted, Jenkins will
first attempt to gracefully terminate the processes (SIGINT/SIGTERM) and then give
up and use SIGKILL. If the graceful termination doesn't work, we end up with an
orphan container running on the node, which can then consume enough memory and CPU
to harm the following jobs.
To fix this, trap signals and handle them by killing the container. Also trap
shell exit, and even kill the container unconditionally, since if Jenkins happens
to kill the "docker wait" process the regular paths will not be taken.
Message-Id: <20190415084040.12352-1-avi@scylladb.com>
This series addresses two issues in the hinted handoff that should
complete fixing the infamous #4231.
In particular the second patch removes the requirement to manually
delete hints files after upgrading to 3.0.4.
Tested with manual unit testing.
* https://github.com/vladzcloudius/scylla.git hinted_handoff_drop_broken_segments-v3:
hinted handoff: disable "reuse_segments"
commitlog: introduce a segment_error
hinted handoff: discard corrupted segments
* seastar 6f73675...eb03ba5 (11):
> tests: tests C++14 dialect in continuous integration
> rpc/compressor/lz4: fix std:variant related compiler errors
> tests: futures_test: allow project to compile with C++14
> Merge "io_queue: make io_priority_class namespace global" from Benny
> future::then_wrapped: use std::terminate instead of abort
> reactor: make metric about task quota violations less sensitive
> Merge "Add LZ4_FRAGMENTED compressor for RPC" from Paweł
> Fix build issues with Clang 7
> Merge "file_stat follow_symlink option and related fixes" from Benny
> doc/tutorial.md: reword mention of seastar::thread premption on get()
> tests: semaphore_test: relax timeouts
Fixes#4272.
Resharding wasn't preserving the sstable run structure, which depends
on all fragments sharing the same run identifier. So let's make
resharding run aware, meaning that a run will be created for each
shard involved.
tests: release mode.
Fixes#4428.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190415193556.16435-1-raphaelsc@scylladb.com>
This change drops the hit count from the name of the node, because it
prevents coalescing of nodes which are shared parents for paths with
different counts. This lack of coalescing makes the flamegraph a lot
less useful.
Message-Id: <1555348576-26382-1-git-send-email-tgrabiec@scylladb.com>
All schema changes made to the node locally are serialized on a
semaphore which lives on shard 0. For historical reasons, they don't
queue but rather try to take the lock without blocking and retry on
failure with a random delay from the range [0, 100 us]. Contenders
which do not originate on shard 0 will have an extra disadvantage as
each lock attempt will be longer by the across-shard round trip
latency. If there is constant contention on shard 0, contenders
originating from other shards may keep loosing to take the lock.
Schema merge executed on behalf of a DDL statement may originate on
any shard. Same for the schema merge which is coming from a push
notification. Schema merge executed as part of the background schema
pull will originate on shard 0 only, where the application state
change listeners run. So if there are constant schema pulls, DDL
statements may take a long time to get through.
The fix is to serialize merge requests fairly, by using the blocking
semaphore::wait(), which is fair.
We don't have to back-off any more, since submit_to() no longer has a
global concurrency limit.
Fixes#4436.
Message-Id: <1555349915-27703-1-git-send-email-tgrabiec@scylladb.com>
Currently, each instanciation of `random_mutation_generator::impl` will
generate a new random seed for itself. Altough these are printed,
mapping back all the printed seeds to the exact source location where it
has to be substituted in is non-trivial. This makes reproducing random
test failures very hard. To solve this problem, use
`tests::random::get_int()` to produce the random seed of the
`random_mutation_generator::impl` instances. This way the seed of all
the mutation generator will be derived from a single "master" seed that
is easily replaced after a test failure, hopefully also leading to
easily reproducible random test failures.
I checked that after substituting in a previously generated master
random seed, all derived seeds were exactly the same.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <0471415938fc27485975ef9213d37d94bff20fd5.1555329062.git.bdenes@scylladb.com>
"
These are patches I wrote while working on UDF/UDA, but IMHO they are
independent improvements and are ready for review.
Tests: unit (debug) dtest (release)
I checked that all tests in
nosetests -v user_types_test.py sstabledump_test.py cqlsh_tests/cqlsh_tests.py
now pass.
"
* 'espindola/udf-uda-refactoring-v3' of https://github.com/espindola/scylla:
Refactor user type merging
cql_type_parser::raw_builder: Allow building types incrementally
cql3: delete dead code
Include missing header
return a const reference from return_type
delete unused var
Add a test on nested user types.
auto_bootstrap: false provide negligible gains for new clusters and
it is extremely dangerous everywhere else. We have seen a couple of
times in which users, confused by this, added this flag by mistake
and added nodes with it. While they were pleased by the extremely fast
times to add nodes, they were later displeased to find their data
missing.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190414012028.20767-1-glauber@scylladb.com>
This requires introduction of storage_service::get_known_features
and using it with check_knows_remote_features.
Otherwise a node joining the existing cluster won't be able to
join because it does not support unbounded range tombstones yet.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We have to run this script in python2, since we dropped EPEL from
dependencies, and the script is installer for rpms so we cannot use
relocatable python3 for it.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190411151858.2292-1-syuu@scylladb.com>
Although curl is widely available, there is no reason to depend on it.
There are mainly two users, as indicated by grep:
1) scylla-housekeeping
2) scripts within the AMI
3) docker image
The AMI has its own RPM and it already depends on curl. While we could
get rid of the curl dependency there too, we can do that later. Docker
is its own thing and it only needs it at build time anyway.
For the main scylla repo, this patch changes scylla-housekeeping so as
not to depend on the curl binary and use urllib directly instead. We can
then remove curl from our dependency list.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190411125642.9754-1-glauber@scylladb.com>
date_type_impl::less() invokes `compare_unsigned()` to compare the
underlying raw byte values. `compared_unsigned()` is a tri comparator,
however `date_type_impl::less()` implicitely converted the returned
value to bool. In effect, `date_type_impl::less()` would *always* return
`true` when the two compared values were not equal.
Found while working on a unit test which empoly a randomly generated
schema to test a component.
Fixes#4419.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <8a17c81bad586b3772bf3d1d1dae0e3dc3524e2d.1554907100.git.bdenes@scylladb.com>
Currently the partition header will always be reported as different when
comparing two mutations. This is because they are prepended with the
"expected: " and "... but got: " texts. This generates unnecessary
noise. Inject a new line between the prefix and the partition-header
proper. This way the partition header will only show up in the diff when
there is an actual difference. The "expected: " and "... but got: "
phrases are still shown as different on the top of the diff but this is
fine as one can immediately see that they are not part of the data and
additionaly they help the reader in determining which part of the diff
is the expected one and which is the actual one.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <29e0f413d248048d7db032224a3fd4180bf1b319.1554909144.git.bdenes@scylladb.com>
The problem happens after a schema change because we fail to properly
remove ongoing compaction, which stopped being tracked, from list that
is used to calculate backlog, so it may happen that a compaction read
monitor (ceases to exist after compaction ends) is used after freed.
Fixes#4410.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190409024936.23775-1-raphaelsc@scylladb.com>
If we discover that a current segment is corrupted there is nothing we
can do about it.
This patch does the following:
1) Drops the corrupted segment and moves to the next one.
2) Logs such events as ERRORs.
3) Introduces a new metrics that accounts such event.
Fixes#4364
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Introduce a common base class for all errors that indicate that the current
segment has "issues".
This allows a laconic "catch" clause for all such errors.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Hinted handoff doesn't utilize this feature (which was developed with a
commitlog in mind).
Since it's enabled by default we need to explicitly disable it.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Fixes runtime error which happens because the setter is expected to
take an argument, but our definition doesn't take one. We're not
really expecting the setter to be called with False, so don't use setter
semantics.
It can be that page::pool is != nullptr and page::offset_in_span is 0
for a page which is inside a large allocation span (live or
dead). This may lead to misqualification of a pointer as belonging to
a small allocation pool.
Only the first page of a span contains reliable information. This
patch changes the code to use the span_checker, which knows the real
boundaries of spans and exposes reliable information via the span
object.
Fixes#4368
It can be that page::pool is != nullptr and page::offset_in_span is 0
for a page which is inside a large allocation span (live or
dead). This may lead to misqualification of that span as belonging to
a small allocation pool and interpreting its contents as if it
contained small objects.
Only the first page of a span contains reliable information. This
patch changes the code to use the span_checker, which knows the real
boundaries of spans and exposes reliable information via the span
object.
Another problem was that the command scanned dead spans as well.
This is no longer the case after this patch.
I've seen this command report thousands of no longer live sstable
writers and various continuations because of those problems.
Fixes#4367
The comparison of tables before and after mutation is now done by a
generic diff_rows function. The same function will be used for user
defined functions and user defined aggregates.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Before this patch raw_builder would always start with an empty list of
user types. This means that every time a type is added to a keyspace,
every type in that keyspace needs to be recreated.
With this patch we pass a keyspace_metadata instead of just the
keyspace name and can construct new user types on top of previous
ones.
This will be used in the followup patch, where only new types are
created.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
abstract_function.hh uses function, which is defined in function.hh,
so it should include it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We define data_type as
using data_type = shared_ptr<const abstract_type>;
Since it is a shared_ptr, it cannot be copied into another thread
since that would create a race condition incrementing the reference
counter.
In particular, before this patch it is not legal to call
return_type from another thread.
With this patch read only access from another thread is possible.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
There were many calls to read_keyspace_mutation. One in each function
that prepares a mutation for some other schema change.
With this patch they are all moved to a single location.
Tests: unit (dev, debug)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190328024440.26201-1-espindola@scylladb.com>
Fedora28 python magic used to return a x-sharedlib mime type for .so files.
Fedora29 changed that to x-pie-executable, so the libraries are no longer
relocated.
Let's be more permissive and relocate everything that starts with application/.
Fixes#4396
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190404140929.7119-1-glauber@scylladb.com>
This series fixes row level repair shutdown related issues we saw with
dtests, e.g., use after free of the repair meta object, fail to stop a
table during shutdown.
Fixes: #4044Fixes: #4314Fixes: #4333Fixes: #4380
Tests: repair_additional_test.py:RepairAdditionalTest.repair_abort_test
repair_additional_test.py:RepairAdditionalTest.repair_kill_2_test
* sestar-dev.git asias/repair.fix.shutdown.v1:
repair: Wait for pending repair_meta operation before removing it
repair: Check shutdown in row level repair
repair: Remove repair meta when node is dead
repair: Remove all row level repair during shtudown
* seastar 63d8607...6f73675 (5):
> Merge "seastar-addr2line: improve the context of backtraces" from Botond
> log: fix std::system_error ostream operator to print full error message
> Revert "threads: yield on get if we had run for too long."
> core/queue: Document concurrency constraints
> core/memory: Make small pools use the full span size
Fixes#4407.
Fixes#4316.
"
Calculation of IO properties is slightly wrong for i3.metal, because we get
the number of disks wrong. The reason for that is our check for ephemeral nvme
disks, that pre-date the time in which root devices were exposed as nvme devices
(nitro and metal instances).
"
toolchain updated with python3-psutil
* 'ec2fixes' of github.com:glommer/scylla:
scylla_util.py: do not include root disks in ephemeral list
scylla-python3: include the psutil module
fix typo in scylla_ec2_check
They are of type db::system_distributed_keyspace and
db::view::view_update_generator.
n1 is in normal status
n2 boots up and _sys_dist_ks or _view_update_generator are not
initialized
n1 runs stream, n2 is the follower.
n2 uses the _sys_dist_ks or _view_update_generator
"Assertion `local_is_initialized()' failed" is observed
Fixes#4360
Message-Id: <4ae13e1640ac8707a9ba0503a2744f6faf89ecf4.1554330030.git.asias@scylladb.com>
"
With these changes we avoid a std::vector<data_value> copy, which is
nice in itself, but also makes it possible to call get_list from other
shards.
"
* 'espindola/result-set-v3' of https://github.com/espindola/scylla:
Avoid copying a std::vector in get_list
query-result-set: add and use a get_ptr method
They are of type db::system_distributed_keyspace and db::view::view_update_generator.
n1 is in normal status
n2 boots up and _sys_dist_ks or _view_update_generator are not initialized
n1 runs repair, n2 is the follower.
n2 uses the _sys_dist_ks or _view_update_generator
"Assertion `local_is_initialized()' failed" is observed
Fixes#4360
Message-Id: <6616c21078c47137a99ba71baf82594ba709597c.1553742487.git.asias@scylladb.com>
For now this is just an optimization. But it also avoids copying
data_type, which will allow this be used across shards.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This moves a copy up the call stack and makes it possible to avoid it
completely by passing a reference type to get_nonnull.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
When --abort-on-lsa-bad-alloc is enabled we want to abort whenever
we think we can be out of memory.
We covered failures due to bad_alloc thrown from inside of the
allocation section, but did not cover failures from reservations done
at the beginning of with_reserve(). Fix by moving the trap into
reserve().
Message-Id: <1553258915-27929-1-git-send-email-tgrabiec@scylladb.com>
Nitro instances (and metal ones) put their root device in nvme (as a
protocol. it is still EBS). Our algorithm so far has relied on parsing
the nvme devices to figure out which ones are ephemeral but it will
break for those instances.
Out of our supported instances so far, the i3.metal is the only one
in which this breaks.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Using a new python3 module has never been that easy! So we'll
unapologetically use psutil and don't even worry about whether or not
CentOS supports it (it doesn't)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
* seastar 5572de7...63d8607 (6):
> test: verify that negative sleep time doesn't cause infinite sleep
> httpd: Change address handling to use socket_address
> dns: Change "unspecififed" address search type to retrive first avail
> Allow when_all and when_all_succeed to take function arguments
> when_all: abort if memory allocation fails
> inet_address: Add missing constructor impl.
We saw dtest failed to stop a node like:
```
ERROR: repair_one_missing_row_test (repair_additional_test.RepairAdditionalTest)
----------------------------------------------------------------------
Traceback (most recent
[2019.1.3.node1.repair.zip](https://github.com/scylladb/scylla/files/2723244/2019.1.3.node1.repair.zip)
call last):
File "/home/asias/src/cloudius-systems/scylla-dtest/repair_additional_test.py", line 2521, in repair_one_missing_row_test
return RepairAdditionalBase._repair_one_missing_row_test(self)
File "/home/asias/src/cloudius-systems/scylla-dtest/repair_additional_test.py", line 1842, in _repair_one_missing_row_test
self.check_rows_on_node(node2, nr_rows)
File "/home/asias/src/cloudius-systems/scylla-dtest/repair_additional_test.py", line 34, in check_rows_on_node
node.stop(wait_other_notice=True)
File "/home/asias/src/cloudius-systems/scylla-ccm/ccmlib/scylla_node.py", line 496, in stop
raise NodeError("Problem stopping node %s" % self.name)
NodeError: Problem stopping node node1
```
The problem is:
1) repair_meat is created
repair_meta -> repair_writer::create_writer() -> t.stream_in_progress()
repari_meta -> repair_reader::repair_reader -> cf.read_in_progress()
2) repair_meta is stored in _repair_metas map.
3) Shtudown repair, repair_meta is not removed from the _repair_metas map
4) Shutdown database which wait for the utils::phased_barrier.
To fix, we should stop and remove all the repair_meata from the _repair_metas map.
Tests: 30 successful runs of the repair_kill_2_test
Fixes: #4044
Repair follower nodes will create repair meta object when repair master
node starts a repair. Normally, the repair meta object is removed when
repair master finishes the repair and sends the verb
REPAIR_ROW_LEVEL_STOP to all the followers to remove the repair meta
object. In case of repair master was killed suddenly, no one will remove
the repair meta object.
To prevent keeping this repair meta object forever, we should remove
such objects when gossip detects a node is dead with the gossip
listener.
Fixes: #4380
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
We remove repair_meta object in remove_repair_meta up receiving of stop
row level repair rpc verb. It is possible there is an pending operation
of repair_meta. To avoid use after free, we should not remove the
repair_meta object until all the pending operations are done.
Use a gate to protect it.
Fixes: #4333Fixes: #4314
Tests: 50 succesful run of repair_additional_test.py:RepairAdditionalTest.repair_kill_2_test
ignore_ready_future in load_new_ss_tables broke
migration_test:TestMigration_with_*.migrate_sstable_with_counter_test_expect_fail dtests.
The java.io.NotSerializableException in nodetool was caused by exceptions that
were too long.
This fix prints the problematic file names onto the node system log
and includes the casue in the resulting exception so to provide the user
with information about the nature of the error.
Fixes#4375
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190331154006.12808-1-bhalevy@scylladb.com>
"
To make offline installer easier we need to minimize dependencies as
possible.
Python dependencies are already dropped by adding relocatable python3 by
Glauber, now it's time to drop rest of command line tools which used by
scylla setup tools.
(even scripts are converted to python3, it still executes some external
commands, so these commands should be distributed with offline installer)
Note that some of CLI tools haven't added such as NTP and RAID stuff,
since these tools have daemons, not just CLI.
To use such stuff in offline mode, users have to install them manually.
But both NTP setup and RAID setup are optional, users still can run Scylla w/o
them.
"
Toolchain updated to docker.io/scylladb/scylla-toolchain:fedora-29-20190401
for changes in install-dependencies.sh; also updates to gnutls 3.6.7 security
release.
* 'reloc_clitools_v5' of https://github.com/syuu1228/scylla:
reloc: add relocatable CLI tools for scylla setup scripts
dist/redhat: drop systemd-libs from dependency
dist/redhat: drop file from dependency since it seems unused
dist/redhat: drop pciutils from dependency since it only used in DPDK mode
Truncate would make disk usage stat go wild because it isn't updated
when sstables are removed in table::discard_sstables(). Let's update
the stat after sstables are removed from the sstable set.
Fixes#3624.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190328154918.25404-1-raphaelsc@scylladb.com>
Since we don't use DPDK mode by default, and the mode is not officially
supported, drop pciutils from package dependency.
Users who want to use DPDK mode they neeed to install the package
manually.
* seastar 05efbce...5572de7 (5):
> posix_file_impl::list_directory: do not ignore symbolic link file type
> prometheus: yield explicitly after each metric is processed
> thread: add maybe_yield function
> metrics: add vector overload of add_group()
> memory: tone down message for memory allocator
Currently scylla-python3 package name is hardcorded, need to support
package name renaming just like on other scylla packages.
This is required to release enterprise version.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190329003941.12289-1-syuu@scylladb.com>
"
File must be either owned by the process uid
or have both read and write access to it,
so it could be (hard) linked when sysctl
fs.protected_hardlinks is enabled.
Fixes#3117
"
* 'projects/valid_owner_and_mode/v3-rebased' of https://github.com/bhalevy/scylla:
storage_service: handle load_new_sstables exception
init: validate file ownership and mode.
treewide: use std::filesystem
Files and directories must be owned by the process uid.
Files must have read access and directories must have
read, write, and execute access.
Refs #3117
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than {std::experimental,boost,seastar::compat}::filesystem
On Sat, 2019-03-23 at 01:44 +0200, Avi Kivity wrote:
> The intent for seastar::compat was to allow the application to choose
> the C++ dialect and have seastar follow, rather than have seastar choose
> the types and have the application follow (as in your patch).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When generating view updates for base mutations when no pre-existing
data exists, we were forgetting to apply the tracked tombstones.
Fixes#4321
Tests: unit(dev)
* https://github.com/duarten/scylla materialized-views/4321/v1.1:
db/view: Apply tracked tombstones for new updates
tests/view_schema_test: Add reproducer for #4321
When generating view updates for base mutations when no pre-existing
data exists, we were forgetting to apply the tracked tombstones.
Fixes#4321
Tests: unit(dev)
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
Variable length integers are used are used extensively by SSTables mc
format. The current deserialisation routine is quite naive in a way that
it reads each byte separately. Since, those vints usually appear inside
much larger buffers, we optimise for such cases, read 8-bytes at once
and then mask out the unneeded parts (as well as fix their order because
big-endian).
Tests: unit(dev).
perf_vint (average time per element when deserializing 1000 vints):
before:
vint.deserialize 69442000 14.400ns 0.000ns 14.399ns 14.400ns
after:
vint.deserialize 241502000 4.140ns 0.000ns 4.140ns 4.140ns
perf_fast_forward (data on /tmp):
large-partition-single-key-slice on dataset large-part-ds1:
before:
range time (s) iterations frags frag/s mad f/s max f/s min f/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> [0, 1] 0.000278 8792 2 7190 119 7367 1960 3 104 2 0 0 1 1 0 0 1 100.0%
-> [1, 100) 0.000344 96 99 288100 4335 307689 193809 2 108 2 0 0 1 1 0 0 1 100.0%
-> (100, 200] 0.000339 13254 100 295263 2824 301734 222725 2 108 2 0 0 1 1 0 0 1 100.0%
after:
range time (s) iterations frags frag/s mad f/s max f/s min f/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> [0, 1] 0.000236 10001 2 8461 59 8718 2261 3 104 2 0 0 1 1 0 0 1 100.0%
-> [1, 100) 0.000285 89 99 347500 2441 355826 215745 2 108 2 0 0 1 1 0 0 1 100.0%
-> (100, 200] 0.000293 14369 100 341302 1512 350123 222049 2 108 2 0 0 1 1 0 0 1 100.0%
"
* tag 'optimise-vint/v2' of https://github.com/pdziepak/scylla:
sstable: pass full length of buffer to vint deserialiser
vint: optimise deserialisation routine
vint: drop deserialize_type structure
tests/vint: reduce test dependencies
tests/perf: add performance test for vint serialisation
"
This series introduce a rudimentary sstables manager
that will be used for making and deleting sstables, and tracking
of thereof.
The motivation for having a sstables manager is detailed in
https://github.com/scylladb/scylla/issues/4149.
The gist of it is that we need a proper way to manage the life
cycle of sstables to solve potential races between compaction
and various consumers of sstables, so they don't get deleted by
compaction while being used.
In addition, we plan to add global statistics methods like returning
the total capacity used by all sstables.
This patchset changes the way class sstable gets the large_data_handler.
Rather than passing it separately for writing the sstable and when deleting
sstables, we provide the large_data_handler when the sstable object is
constructed and then use it when needed.
Refs #4149
"
* 'projects/sstables_manager/v3' of https://github.com/bhalevy/scylla:
sstables: provide large_data_handler to constructor
sstables_manager: default_sstable_buffer_size need not be a function
sstables: introduce sstables_manager
sstables: move shareable_components def to its own header
tests: use global nop_lp_handler in test_services
sstables: compress.hh: add missing include
sstables: reorder entry_descriptor constructor params
sstables: entry_descriptor: get rid of unused ctor
sstables: make load_shared_components a method of sstable
sstables: remove default params from sstable constructor
database: add table::make_sstable helper
distributed_loader: pass column_family to load_sstables_with_open_info
distributed_loader: no need for forward declaration of load_sstables_with_open_info
distributed_loader: reshard: use default params for make_sstable
The goal of the sstables manager is to track and manage sstables life-cycle.
There is a sstable manager instance per database and it is passed to each column-family
(and test environment) on construction.
All sstables created, loaded, and deleted pass through the sstables manager.
The manager will make sure consumers of sstables are in sync so that sstables
will not be deleted while in use.
Refs #4149
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The goal is to construct sstables only via make_sstables
that will be moved to class sstables_manager in a later patch.
Defining the default values in both interfaces is unneeded
and may to lead to them going out of sync.
Therefore, have only make_sstables provide the default parameter values.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In most cases we make a sstable based on the table schema
and soon - large_data_handler.
Encapsulate that in a make_sstable method.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
Most of the binaries we link in a debug build are linked with -s, so
the only impact is build/debug/scylla, which grows by 583 MiB when
using --compress-exec-debuginfo=0.
On the other hand, not having to recompress all the debug info from
all the used object files is a pretty big win when debugging an issue.
For example, linking build/debug/scylla goes from
56.01s user 15.86s system 220% cpu 32.592 total
to
27.39s user 19.51s system 991% cpu 4.731 total
Note how the cpu time is "only" 2x better, but given that compressing
debug info is a long serial task, the wall time is 6.8x better.
Tests: unit (debug)
"
* 'espindola/dont-compress-debug-v5' of https://github.com/espindola/scylla:
configure: Add a --compress-exec-debuginfo option
configure: Move some flags from cxx_ld_flags to cxxflags
configure: rename per mode opt to cxx_ld_flags
configure: remove per mode libs
configure: remove sanitize_libs and merge sanitize into opt
configure: split a ld_flags_{mode} out of cxxflags_{mode}
Function messaging_service::get_rpc_client() suppose to either return
existing client or create one and return it. The function is suppose to
be atomic, so after checking that requested client does not exist it is
safe to assume emplace() will succeed. But we saw bugs that made the
function to not be atomic. Lets add an assert that will help to catch
such bugs easier if they will happen in the future.
Message-Id: <20190326115741.GX26144@scylladb.com>
"
Fixes#4348
v2 changes:
* added a unit test
This miniseries fixes decimal/varint serialization - it did not update
output iterator in all cases, which may lead to overwriting decimal data
if any other value follows them directly in the same buffer (e.g. in a tuple).
It also comes with a reproducing unit test covering both decimals and varints.
Tests: unit (dev)
dtest: json_test.FromJsonUpdateTests.complex_data_types_test
json_test.FromJsonInsertTests.complex_data_types_test
json_test.ToJsonSelectTests.complex_data_types_test
"
* 'fix_varint_serialization_2' of https://github.com/psarna/scylla:
tests: add test for unpacking decimals
types: fix varint and decimal serialization
Varint and decimal types serialization did not update the output
iterator after generating a value, which may lead to corrupted
sstables - variable-length integers were properly serialized,
but if anything followed them directly in the buffer (e.g. in a tuple),
their value will be overwritten.
Fixes#4348
Tests: unit (dev)
dtest: json_test.FromJsonUpdateTests.complex_data_types_test
json_test.FromJsonInsertTests.complex_data_types_test
json_test.ToJsonSelectTests.complex_data_types_test
Note that dtests still do not succeed 100% due to formatting differences
in compared results (e.g. 1.0e+07 vs 1.0E7, but it's no longer a query
correctness issue.
I noticed a test failure with
Mutation inequality is not symmetric for ...
And the difference between the two mutations was that one atomic_cell
was live and the other wasn't.
Looking at the code I found a few cases where the comparison was not
symmetrical. This patch fixes them.
This patch will not fix the test, as it will now fail with a
"Mutations differ" error, but that is probably an independent issue.
Ref #3975.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190325194647.54950-1-espindola@scylladb.com>
When we call perftune.py in order to get a particular mode's cpu set
(e.g. mode=sq_split) it may fail and print an error message to stderr because
there are too few CPUs for a particular configuration mode (e.g. when
there are only 2 CPUs and the mode is sq_split).
We already treat these situations correctly however we let the
corresponding perftune.py error message get out into the syslog.
This is definitely confusing, stressful and annoying.
Let's not let these messages out.
Fixes#4211
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190325220018.22824-1-vladz@scylladb.com>
Compilation fails with fmt release 5.3.0 when we print a bytes_view
using "{}" formatter.
Compiler's complain is: "error: static assertion failed: mismatch between char-types of context and argument"
Resolve this by explicitly using the operator<<() across the whole
operator<<(std::ostream& os, const result_message::rows& msg) function.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190325203628.5902-1-vladz@scylladb.com>
has_scylla_component is always false before loading the sstable.
Also, return exception future rather than throwing.
Hit with the following dtests:
counter_tests.TestCounters.upgrade_test
counter_tests.TestCountersOnMultipleNodes.counter_consistency_node_*_test
resharding_test.ReshardingTest_nodes?_with_*CompactionStrategy.resharding_counter_test
update_cluster_layout_tests.TestUpdateClusterLayout.increment_decrement_counters_in_threads_nodes_restarted_test
Fixes#4306
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190326084151.18848-1-bhalevy@scylladb.com>
"
This series removes the usage of the static gossiper object in init.cc
and storage_service.
Follow up series will remove more in other components. This is the
effort to clean up the component dependencies and have better shutdown
procedure.
Tests: tests/gossip_test, tests/cql_query_test, tests/sstable_mutation_test, dtests.
"
* tag 'asias/storage_service_gossiper_dep_v5' of github.com:cloudius-systems/seastar-dev:
storage_service: Do not use the global gms::get_local_gossiper()
storage_service: Pass gossiper object to storage_service
gms: Remove i_failure_detector.hh
gossip: Get rid of the gms::get_local_failure_detector static object
dht: Do not use failure_detector::is_alive in failure_detector_source_filter
tests: Fix stop snitch in gossip_test.cc
gossiper: Do not use value_factory from storage_service object
gossiper: Use cfg options from _cfg instead of get_local_storage_service
gossiper: Pass db::config object to gossiper class
init: Pass gossiper object to init_ms_fd_gossiper
* seastar 33baf62...caa98f8 (8):
> Merge "Add file_accessible and file_stat methods" from Benny
> future::then: use std::terminate instead of abort
> build: Allow cooked dependencies with configure.py
> tests: Show a test's output when it fails
> posix_file_impl: Bypass flush() call iff opened with O_DSYNC
> posix_file_impl: Propagate and keep open_flags
> open_flags: Add O_DSYNC value
> build: Forward variables to CMake correctly
"
Both cql3_type and abstract_type are normally used inside
shared_ptr. This creates a problem when an abstract_type needs to refer
to a cql3_type as that creates a cycle.
To avoid warnings from asan, we were using a std::unordered_map to
store one of the edges of the cycle. This avoids the warning, but
wastes even more memory.
Even before this series cql3_type was a fairly light weight
structure. This patch pushes in that direction and now cql3_type is a
struct with a single member variable, a data_type.
This avoids the reference cycle and is easier to understand IMHO.
The one corner case is varchar. In the old system cql3_type::varchar
and cql3_type::text don't compare equal, but they both map to the same
data_type.
In the new system they would compare equal, so we avoid the confusion
by just removing the cql3_type::varchar variable.
Tests: unit (dev)
"
* 'espindola/merge-cq3-type-and-type-v3' of https://github.com/espindola/scylla:
Turn cql3_type into a trivial wrapper over data_type
Delete cql3_type::varchar
Simplify db::cql_type_parser::parse
Add a test for the varchar column representation
The crash observed in issue #4335 happens because
delete_large_data_entries is passed a deleted name.
Normally we don't get a crash, but a garbage name and we fail to
delete entries from system.large_*.
Adding a test for the fix found another issue that the second patch
is this series fixes.
Tests: unit (dev)
Fixes#4335.
* https://github.com/espindola/scylla guthub/fix-use-after-free-v4:
large_data_handler: Fix a use after destruction
large_data_handler: Make a variable non static
Allow large_data_handler to be stopped twice
Allow table to be stopped twice
Test that large data entries are deleted
"
Validate the to-be-loaded sstables in the open_sstable phase and handle any exceptions before calling cf.get_row_cache().invalidate.
Currently if exception is thrown from distributed_loader::open_sstable cf._sstables_opened_but_not_loaded may be left partially populated.
Fixes#4306
Tests: unit (dev)
- next-gating dtests (dev)
- migration_test:TestMigration_with_2_1_x.migrate_sstable_with_counter_test
migration_test:TestMigration_with_2_1_x.migrate_sstable_with_counter_test_expect_fail
- with bypassing exception in distributed_loader::flush_upload_dir
to trigger the exception in table::open_sstable
"
* 'issues/4306/v3' of https://github.com/bhalevy/scylla:
table: move sstable counters validation from load_sstable to open_sstable
distributed_loader::load_new_sstables: handle exceptions in open_sstable
"
Aligned way to build relocatable rpm with existing relocatable packages.
"
* 'relocatable-python3-fix-v3' of https://github.com/syuu1228/scylla:
reloc: allow specify rpmbuild dir
reloc/python3: archive package version number on build_reloc.sh
reloc/python3: archive rpm build script in the relocatable package, build rpm using the script
relloc/python3: fix PyYAML package name
reloc: rename python3 relocatable package filename to align same style with other packages
reloc: move relocatable python build scripts to reloc/python3 and dist/redhat/python3
When a view replica becomes unavailable, updates to it are stored as
hints at the paired based replica. This on-disk queue of pending view
updates grows as long as there are view updated and the view replica
remains unavailable. Currently, we take that relative queue size into
account when calculating the delay for new base writes, in the context
of the backpressure algorithm for materialized views.
However, the way we're calculating that on-disk backlog is wrong,
since we calculate it per-device and then feed it to all the hints
managers for that device. This means that normal hints will show up as
backlog for the view hints manager, which in turn introduces delays.
This can make the view backpressure mechanism kick-in even if the
cluster uses no materialized views.
There's yet another way in which considering the view hints backlog is
wrong: a view replica that is unavailable for some period of time can
cause the backlog to grow to a point where all base writes are applied
the maximum delay of 1 second. This turns a single-node failure into
cluster unavailability.
The fix to both issues is to simply not take this on-disk backlog into
account for the backpressure algorithm.
Fixes#4351Fixes#4352
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190321170418.25953-1-duarte@scylladb.com>
Since we archive rpm/deb build script on relocatable package and build
rpm/deb using the script, so align python relocatable package too.
Also added SCYLLA-RELOCATABLE-FILE, SCYLLA-RELEASE-FILE and SCYLLA-VERSION-FILE
since these files are required for relocatable package.
Debugging continuations is challenging. There is no support from gdb for
finding out which continuation was this continuation called from, nor
what other continuations are attached to it. GDB's `bt` command is of
limited use, at best a handful of continuations will appear in the
backtrace, those that were ready. This series attempts to fill part of
this void and provides a command that answers the latter question: what
continuations are attached to this one?
`scylla fiber` allows for walking a continuation chain, printing each
continuation. It is supposed to be the seastar equivalent of `bt`.
The continuation chain is walked starting from an arbitrary task,
specified by the user. The command will print all continuations attached
to the specified task.
This series also contains some loosely related cleanup of existing
commands and code in `scylla-gdb.py`.
* https://github.com/denesb/scylla.git scylla-fiber-gdb-command/v4:
scylla-gdb.py: fix static_vector
scylla-gdb.py: std_unique_ptr: add get() method
scylla-gdb.py: fix existing documentation
scylla-gdb.py: fix tasks and task-stats commands
scylla-gdb.py: resolve(): add cache parameter
scylla-gdb.py: scylla_ptr: move actual logic into analyze()
scylla-gdb.py: scylla_ptr: make analyze() usable for outside code
scylla-gdb.py: scylla_ptr: accept any valid gdb expression as input
scylla-gdb.py: add scylla fiber command
Store the failure_detector object inside gossiper object.
- No more the global object sharded<failure_detector>
- No need to initialize sharded<failure_detector> manually which
simplifies the code in tests/cql_test_env.cc and init.cc.
Switch failure_detector_source_filter to use get_local_gossiper::is_alive
directly since we are going to remove the static
gms::get_local_failure_detector object soon.
Pass the nodes that are down to the filter direclty, to avoid the
range_streamer to depends on gossiper at all.
This area is hard to test since we only issue deletes during
compaction and we wait for deletes only during shutdown.
That is probably worth it, seeing that two independent bugs would have
been found by this test.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The default is the old behavior, but it is now possible to configure
with --compress-exec-debuginfo=0 to get faster links but larger
binaries.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
It is the same name used in the build.ninja file.
A followup patch will add cxxflags and move compiler only flags there.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These are flags we want to pass to both compilation and linking. There
is nothing special about the fact that they are sanitizer related.
With {sanitize} being passed to the link, we don't need {sanitize_libs}.
We do need to make sure -fno-sanitize=vptr is the last one in the
command line. Before we were implicitly getting it from seastar, but
it is bad practice to get some sanitizer flags from seastar but not
others.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
This series adds moving sstables uploaded via `nodetool refresh` to
staging/ directory if they require generating view updates from them.
Previous behavior (leaving these sstables in upload/ directory until
view updates are generated) might have caused sstables with
conflicting names to be mistakenly overwritten by the user.
Fixes#4047
Tests: unit (dev)
dtest: backup_restore_tests.py + backup_restore_tests.py modified with
having materialized view definitions
"
* 'use_staging_directory_for_uploaded_sstables_awaiting_view_updates' of https://github.com/psarna/scylla:
sstables: simplify requires_view_building
loader: move uploaded view pending sstables to staging
Current code captures a reference to rpc::client in a continuation, but
there is no guaranty that the reference will be valid when continuation runs.
Capture shared pointer to rpc::client instead.
Fixes#4350.
Message-Id: <20190314135538.GC21521@scylladb.com>
When we're populating a partition range and the population range ends
with a partition key (not a token) which is present in sstables and
there was a concurrent memtable flush, we would abort on the following
assert in cache::autoupdating_underlying_reader:
utils::phased_barrier::phase_type creation_phase() const {
assert(_reader);
return _reader_creation_phase;
}
That's because autoupdating_underlying_reader::move_to_next_partition()
clears the _reader field when it tries to recreate a reader but it finds
the new range to be empty:
if (!_reader || _reader_creation_phase != phase) {
if (_last_key) {
auto cmp = dht::ring_position_comparator(*_cache._schema);
auto&& new_range = _range.split_after(*_last_key, cmp);
if (!new_range) {
_reader = {};
return make_ready_future<mutation_fragment_opt>();
}
Fix by not asserting on _reader. creation_phase() will now be
meaningful even after we clear the _reader. The meaning of
creation_phase() is now "the phase in which the reader was last
created or 0", which makes it valid in more cases than before.
If the reader was never created we will return 0, which is smaller
than any phase returned by cache::phase_of(), since cache starts from
phase 1. This shouldn't affect current behavior, since we'd abort() if
called for this case, it just makes the value more appropriate for the
new semantics.
Tests:
- unit.row_cache_test (debug)
Fixes#4236
Message-Id: <1553107389-16214-1-git-send-email-tgrabiec@scylladb.com>
In commit 71bf757b2c, we call
wait_for_gossip_to_settle() which takes some time to complete in
storage_service::prepare_to_join().
In tests/cql_query_test calls init_server with do_bind == false which in
turn calls storage_service::prepare_to_join(). Since in the test, there
is only one node, there is no point to wait for gossip to settle.
To make the cql_query_test fast again, do not call
wait_for_gossip_to_settle if do_bind is false.
Before this patch, cql_query_test takes forever to complete.
After it takes 10s.
Tests: tests/cql_query_test
Message-Id: <3ae509e0a011ae30eef3f383c6a107e194e0e243.1553147332.git.asias@scylladb.com>
"
This series adds support for local indexing, i.e. when the index table
resides on the same partition as base data.
It addresses the performance issue of having an indexed query
that also specifies a partition key - index will be queried
locally.
"
* 'add_local_indexing_11' of https://github.com/psarna/scylla: (30 commits)
tests: add cases for local index prefix optimization
tests: add create/drop local index test case
tests: add non-standard names cases to local index tests
tests: add multi pk case for local index tests
tests: add test for malformed local index definitions
tests: add local index paging test
tests: add local indexing test
cql3: add CREATE INDEX syntax for local indexes
cql3: use serialization function to create index target string
index: add serialization function for index targets
index: use proper local index target when adding index
index: add parsing target column name from local index targets
db: add checking for local index in schema tables
index: add checking if serialized target implies local index
index: enable parsing multi-key targets
index: move target parser code to .cc file
json: add non-throwing overload for to_json_value
cql3: add checking for local indexes in has_supporting_index()
cql3: move finding index restrictions to prepare stage
cql3: add picking an index by score
...
When a materialized view was created, the verification code artificially
forbade creating a view without a clustering key column. However, there
is no real reason to forbid this. In the trivial case, the original base
table might not have had a clustering key, and the view might want to use
the exact same key. In a more complex case, a view may want to have all the
primary key columns as *partition* key columns, and that should be fine.
The patch also includes a regression test, which failed before this patch,
and succeeds with it (we test that we can create materialized views in both
aforementioned scenarios, and these materialized views work as expected).
Duarte raised the opinion that the "trivial" case of a view table with
a key identical to that of the base should be disallowed. However, this
should be done, if at all (I think it shouldn't), in a follow-up patch,
which will implement the non-triviality requirement consistently (e.g.,
require view primary key to be different from base's, regardless of
the existance or non-existance of clustering columns).
Fixes#4340.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190320122925.10108-1-nyh@scylladb.com>
When we are tracing requests, we would like to know everything that
happened to a query that can contribute to it having increased
latencies.
We insert some of those latencies explicitly due to throttling, but we
do not log that into tracing.
In the case of storage proxy, we do have a log message at trace level
but that is rarely used: trace messages are too heavy of a hammer, there
is no way to specify specific queries, etc.
The correct place for that is CQL tracing. This patch moves that message
to CQL tracing. We also add a matching tracepoint assuring us that no
delay happened if that's the case.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190320163350.15075-1-glauber@scylladb.com>
"
Static compact tables are tables with compact storage and no
clustering columns.
Before this patch, Scylla was writing rows of static compact tables as
clustered rows instead of as static rows. That's because in our in-memory
model such tables have regular rows and no static row. In Cassandra's
schema (since 3.x), those tables have columns which are marked as
static and there are no regular columns.
This worked fine as long as Scylla was writing and reading those
sstables. But when importing sstables from Cassandra, our reader was
skipping the static row, since it's not present in our schema, and
returning no rows as a result. Also, Cassandra, and Scylla tools,
would have problems reading those sstables.
Fix this by writing rows for such tables the same way as Cassandra
does. In order to support rolling downgrade, we do that only when all
nodes are upgraded.
Fixes#4139.
Tests:
- unit (dev)
"
* tag 'static-compact-mc-fix-v3.1' of github.com:tgrabiec/scylla:
tests: sstables: Test reading of static compact sstable generated by Cassandra
tests: sstables: Add test for writing and reading of static compact tables
sstables: mc: Write static compact tables the same way as Cassandra
sstable: mc: writer: Set _static_row_written inside write_static_row()
sstables: Add sstable::features()
sstables: mc: writer: Prepare write_static_row() for working with any column_kind
storage_service: Introduce the CORRECT_STATIC_COMPACT feature flag
sstables: mc: writer: Build indexed_columns together with serialization_header
sstables: mc: writer: De-optimize make_serialization_header()
sstable: mc: writer: Move attaching of mc-specific components out of generic code
Both cql3_type and abstract_type are normally used inside
shared_ptr. This creates a problem when an abstract_type needs to refer
to a cql3_type as that creates a cycle.
To avoid warnings from asan, we were using a std::unordered_map to
store one of the edges of the cycle. This avoids the warning, but
wastes even more memory.
Even before this patch cql3_type was a fairly light weight
structure. This patch pushes in that direction and now cql3_type is a
struct with a single member variable, a data_type.
This avoids the reference cycle and is easier to understand IMHO.
Tests: unit (dev)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
varchar is just an alias for text. Handle that conversion directly in
the parser and delete the cql3_type::varchar variable.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Since its first version, db::cql_type_parser::parse had special cases
for native and user defined types.
Those are not necessary, as the general parser has no problem handling
them.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The value computed is not static since
f254664fe6, but unfortunately that was
missed in that commit.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The path leading to the issue was:
The sstable name is allocated and passed to maybe_delete_large_data_entries by reference
auto name = sst->get_filename();
return large_data_handler.maybe_delete_large_data_entries(*sst->get_schema(), name, sst->data_size());
A future is created with a reference to it
large_partitions = with_sem([&s, &filename, this] {
return delete_large_data_entries(s, filename, db::system_keyspace::LARGE_PARTITIONS);
});
The semaphore blocks.
The filename is destroyed.
delete_large_data_entries is called with a destroyed filename.
The reason this did not reproduce trivially in a debug build was that
the sstable itself was in the stack and the destructed value was read
as an internal value, and so asan had nothing to complain about.
Unfortunately we also had no tests that the entry in
system.large_rows was actually deleted.
This patch passes the name by value. It might create up to 3 copies of
it. If that is too inefficient it can probably be avoided with a
do_with in maybe_delete_large_data_entries.
Fixes#4335
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Flags that we want to pass to gcc during compilation and linking are
in cxx_ld_flags_{mode}.
With this patch, we no longer pass
-I. -I build/{mode}/gen
to the link, which should have no impact.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The repair reader depends on the table object being alive, while it is
reading. However, for local reads, there was no synchronization between
the lifecycle of the repair reader and that of the table. In some cases
this can result in use-after-free. Solve by using the table's existing
mechanism for lifecycle extension: `read_in_progress()`.
For the non-local reader, when the local node's shard configuration is
different from the remote one's, this problem is already solved, as the
multishard streaming reader already pins table objects on the used
shards. This creates an inconsistency that might be suprising (in a bad
way). One reader takes care of pinning needed resources while the other
one doesn't. I was thorn on how to reconcile this, and decided to go
with the simplest solution, explicitely pinning the table for local
reads, that is conserve the inconsistency. It was suggested that this
inconsitency is remedied by building resource pinning into the local
reader as well [1] but there is opposition to this [2]. Adding a wrapper
reader which does just the resource pinning seems excessive, both in
code and runtime overhead.
Spotted while investigating repair-related crashes which occured during
interrupted repairs.
Fixes: #4342
[1] https://github.com/scylladb/scylla/issues/4342#issuecomment-474271050
[2] https://github.com/scylladb/scylla/issues/4342#issuecomment-474331657
Tests: none, this is a trivial fix for a not-yet-seen-in-the-wild bug.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <8e84ece8343468960d4e161467ecd9bb10870c27.1553072505.git.bdenes@scylladb.com>
When loading tables uploaded via `nodetool refresh`, they used to be
left in upload/ directory if view updates would need to be generated
from them. Since view update generation is asynchronous, sstables
left in the directory could erroneously get overwritten by the user,
who decides to upload another batch of sstables and some of the names
collided.
To remedy this, uploaded sstables that need view updates are moved
to staging/ directory with a unique generation number, where they
await view update generation.
Fixes#4047
Mounting /sys/fs/cgroup inside the image causes docker cgroup to not
be mounted internally. Therefore, hosts cannot limit resources on
Scylla. This patch removes the cgroup volume mount, allowing folders
under /sys/fs/cgroup to be created inside docker.
Message-Id: <20190320122053.GA20256@shenzou.localdomain>
db::schema_tables::ALL and db::schema_tables::all_tables() are both supposed
to list the same schema tables - the former is the list of their names, and
the latter is the list of their schemas. This code duplication makes it easy
to forget to update one of them, and indeed recently the new
"view_virtual_columns" was added to all_tables() but not to ALL.
What this patch does is to make ALL a function instead of constant vector.
The newly named all_table_names() function uses all_tables() so the list
of schema tables only appears once.
So that nobody worries about the performance impact, all_table_names()
caches the list in a per-thread vector that is only prepared once per thread.
Because after this patch all_table_names() has the "view_virtual_columns"
that was previously missing, this patch also fixes#4339, which was about
virtual columns in materialized views not being propagated to other nodes.
Unfortunately, to test the fix for #4339 we need a test with multiple
nodes, so we cannot test it here in a unit test, and will instead use
the dtest framework, in a separate patch.
Fixes#4339
Branches: 3.0
Tests: all unit tests (release and debug mode), new dtest for #4339. The unit test mutation_reader_test failed in debug mode but not in release mode, but this probably has nothing to do with this patch (?).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190320063437.32731-1-nyh@scylladb.com>
The materialized views flow control mechanism works by adding a certain
delay to each client request, designed to slow down the client to the
rate at we can complete the background view work. Until now we could observe
this mechanism only indirectly, in whether or not it succeeded to keep the
view backlog bounded; But we had no way to directly observe the delay that
we decided to add. In fact, we had a bug where this delay was constantly
zero, and we didn't even notice :-)
So in this patch we add a new metric,
scylla_storage_proxy_coordinator_last_mv_flow_control_delay
The metric is a floating point number, in units of seconds.
This metric is somewhat peculiar that it always contains the *last* delay
used for some request - unlike other metrics it doesn't measure the "current"
value of something. Moreover, it can jump wildly because there is no
guarantee that each request's delay will be identical (in particular,
different requests may involve different base replicas which have different
view backlogs, so decide on different delays). In the future we may want
to supplement this metric with some sort of delay histogram. But even
this simple metric is already useful to debug certain scenarios and
understand if the materialized-views flow control is working or not.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190227133630.26328-1-nyh@scylladb.com>
Scylla built using the frozen toolchain needs to be debugged
on a system with matching libraries. It's easiest if it's also done on the same image.
Install gdb in the image so that it's always out there when we need it.
Fixes#4329
Message-Id: <1553072393-9145-1-git-send-email-tgrabiec@scylladb.com>
In order to create a local index, the syntax used is:
CREATE INDEX t ON ((p1, p2, p3), v);
where (p1, p2, p3) are partition key columns (all of them),
and v is the indexed column.
With global indexes, target column name is always the same as the string
kept in 'options[target]' field. It's not the case for local indexes,
and so a proper extracting function is used to get the value.
When (re)creating a local index, the target string needs to be used
to parse out the actual indexed column:
"(base_pk_part1,base_pk_part2,base_pk_part3),actual_indexed_column".
This column is later used to deterine if an index should be applied
to a SELECT statement.
With local indexes it's not sufficient to check if a single
restriction is supported by an index in order to decide
that in can be used, because local indexes can be leveraged
only when full partition key is properly restricted.
(It also serves as a great example why restrictions code
would greatly benefit from a facelift! :) )
Index restrictions that match a given index were recomputed
during execution stage, which is redundant and prone to errors.
Now, used index restrictions are cached in a prepare statement.
Instead of choosing the first index that we find (in column def order),
the index with highest score is picked. Currently local indexes
score higher than global ones if restrictions allow local indexing
to be applied.
When computing paging state for local indexes, the partition
and clustering keys are different than with global ones:
- partition key is the same as base's
- clustering key starts with the indexed column
It already accepts several arguments that can be extracted from 'this',
and more will be added in the future.
New parameters include lambdas prepared during prepare stage
that define how to extract partition/clustering key ranges depending
on which index is used, so keeping it a static function will result
in unbounded number of parameters with complex types, which will
in turn make the function header almost illegible for a reader.
Hence, read_posting_list becomes a member function with easy access
to any data prepared during prepare stage.
Instead of having just one column definition, index target is now
a variant of either single column definition or a vector of them.
The vector is expected to be used when part of a target definition
is enclosed in parentheses:
$ CREATE INDEX ON t((p),v);
or
$ CREATE INDEX ON t((p1,p2), v);
etc.
This feature will allow providing (possibly composite) base partition key
to CREATE INDEX statement, which will result in creating a local index.
When the index is local, its partition key in underlying materialized
view is the the same as base's, and the indexed column is a first
clustering key. This implementation ensures that view and base rows
will reside on the same partition, while querying the indexed column
will be possible by putting it as a first clustering key part.
Our guidelines dictate that each header is self-sufficient, i.e.
after including it into an empty .cc file, the .cc file can be compiled
without having to include any other header file.
Currently we don't have any tool to check that a header is self
sufficient. This patch aims to remedy that by adding a target to check
each header, as well as a target to check all the headers.
For each header a target is generated that does the equivalent of
including the header into an empty .cc file, then compiling the
resulting .cc file.This targetis called {header_name}.o, so for
given the header `myheader.hh` this will be `build/dev/myheader.hh.o`
(if the dev build-mode is used).
Also a target, `checkheaders` is added which validates all headers in
the project. This currently fails as we have many headers that are not
self-sufficient.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <fdf550dc71203417252f1d8144e7a540eec074a1.1552636812.git.bdenes@scylladb.com>
The scylla fiber command traverses a continuation chain, given an
arbitrary task pointer.
Example (cropped for brevity):
(gdb) scylla fiber this
#0 (task*) 0x0000600000550360 0x000000000468ac40 vtable for seastar...
#1 (task*) 0x0000600000550300 0x00000000046c3778 vtable for seastar...
#2 (task*) 0x00006000018af600 0x00000000046c37a0 vtable for seastar...
#3 (task*) 0x00006000005502a0 0x00000000046c37f0 vtable for seastar...
#4 (task*) 0x0000600001a65e10 0x00000000046c6b10 vtable for seastar...
scylla fiber can be passed any expression that evaluates to a task
pointer. C++ variables, raw adresses and GDB variables (e.g. $1) all
work.
The command works by scanning the task object for pointers. If a pointer
is found it is dereferenced. If successful it checks that the pointer
dereferences to a vtable, the class for which is a known task.
If this succeeds the found task is saved, the scan then recursively
proceeds to scan the newly found task until a task with no further
attached continuations is found.
Instead of a formatted message, intended for humans, return a
`pointer_metadata` object, suitable for being using by code. The
formatting of the pointer metadata into the human readable message is
now done by the `pointer_metadata.__str__()` method, on the call site.
Also make `analyze()` a class method, making it possible for being
called without having to create a `scylla_ptr` command instance,
possibly confusing GDB.
Allow callers to prevent the resolved name from being saved. Useful when
one is just probing addresses but doesn't want to flood the cache with
useless symbols.
These two commands are broken for some time, roughly since the CPU
scheduler was merged. Fix them and move the task queue parsing code into
a common method, which now is used by both commands.
Some commands are documented, but not in the python way. Refactor these
commands so they use the standard python way for self documenting. In
addition to being more "python", this makes these documentation strings
discoverable by GDB so they appear in the `help scylla` output.
Add a `get()` method that retrieves the wrapped pointer without
dereferencing it. All existing methods are refactored to use this new
method to obtain the pointer instead of directly accessing the members.
This way only a single method has to be fixed if the object
implementation changes.
Tomek and I recently had a discussion about whether or not a commitlog
replay would be safe after we dropped or truncated a table that is not
flushed (durable, but auto_snapshots being false).
While we agreed that would be the safe, we both agreed we would feel
better with a unit test covering that.
This patch adds such a test (btw, it passes)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190318223811.6862-1-glauber@scylladb.com>
On some environment dh_strip fails at libreloc/ld.so, so it's better to
skip too just like libprotobuf.so.15.
error message is:
dh_strip -Xlibprotobuf.so.15 --dbg-package=scylla-server-dbg
strip:debian/scylla-server/opt/scylladb/libreloc/ld.so[.gnu.build.attributes]: corrupt GNU build attribute note: bad description size: Bad value
dh_strip: strip --remove-section=.comment --remove-section=.note --strip-unneeded debian/scylla-server/opt/scylladb/libreloc/ld.so returned exit code 1
0
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190319005153.26506-1-syuu@scylladb.com>
n1, n2, n3 in the cluster,
shutdown n1, n2, n3
start n1, n2
start n3, we saw features are enabled using the system table while n1 and n2 are already up and running in the cluster.
INFO 2019-02-27 09:24:41,023 [shard 0] gossip - Feature check passed. Local node 127.0.0.3 features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}, Remote common_features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
INFO 2019-02-27 09:24:41,025 [shard 0] storage_service - Starting up server gossip
INFO 2019-02-27 09:24:41,063 [shard 0] gossip - Node 127.0.0.1 does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
INFO 2019-02-27 09:24:41,063 [shard 0] gossip - Node 127.0.0.2 does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
The problem is we enable the features too early in the start up process.
We should enable features after gossip is settled.
Fixes#4289
Message-Id: <04f2edb25457806bd9e8450dfdcccc9f466ae832.1551406991.git.asias@scylladb.com>
This brings the version check up-to-date with README.md and HACKING.md,
which were updated by commit fa2b03 ("Replace std::experimental types
with C++17 std version.") to say that minimum GCC 8.1.1 is required.
Tests: manually run configure.py with various `--compiler` values.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Message-Id: <20190318130543.24982-1-dejan@scylladb.com>
Static compact tables are tables with compact storage and no
clustering columns.
Before this patch, Scylla was writing rows of static compact tables as
clustered rows instead of static rows. That's because in our in-memory
model such tables have regular rows and no static row. In Cassandra's
schema (since 3.x), those tables have columns which are marked as
static and there are no regular columns.
This worked fine as long as Scylla was writing and reading those
sstables. But when importing sstables from Cassandra, our reader was
skipping the static row, since it's not present in the schema, and
returning no rows as a result. Also, Cassandra, and Scylla tools,
would have problems reading those sstables.
Fix this by writing rows for such tables the same way as Cassandra
does. In order to support rolling downgrade, we do that only when all
nodes are upgraded.
Fixes#4139.
When enabled on all nodes, sstable writers will start to produce
correct MC-format sstables for compact storage tables by writing rows
into the static row (like C*) rather than into the regular row.
We only do that when all nodes are upgraded to support rolling
downgrade. After all nodes are upgraded, regular rolling downgrade will
not be possible.
Refs #4139
The set of columns in both must match, so it's better to build them
together. Later the for choosing columns will become more
complicated, and this patch will allow for avoiding duplication.
So that it's easier to make it use schema_v3 conditionally in later
patches. It's not on the hot path, so it shouldn't matter that we
don't reserve the vectors.
Three nodes in the cluster node1, node2, node3
Shutdown the whole cluster
Start node1
Start node2, node2 sees empty remote common_features.
gossip - Feature check passed. Local node 127.0.0.2 features =
{CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
Remote common_features = {}
The problem is node3 hasn't started yet, node1 sees node3 has empty
features. In get_supported_features(), an empty common features will be
returned if an empty features of a node is seen. To fix, we should
fallback to use the features saved in system table.
Start node3, node3 sees empty remote common_features.
gossip - Feature check passed. Local node 127.0.0.3 features =
{CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
Remote common_features = {}
The problem is node3 hasn't inserted its own features into gossip
endpoint_state_map. get_supported_features() returns the common features
of all nodes in endpoint_state_map. To fix, we should fallback to use
the features stored in the system table for such node in this case.
Fixes#4225Fixes#4341
* dev asias/fix_check_knows_remote_features.upstream.v4.1:
gossiper: Remove unused register_feature and unregister_feature
gossiper: Remove unused wait_for_feature_on_all_node and
wait_for_feature_on_node
gossiper: Log feature is enabled only if the feature is not enabled
previously
gossiper: Fix empty remote common_features in
check_knows_remote_features
Three nodes in the cluster node1, node2, node3
Shutdown the whole cluster
Start node1
Start node2, node2 sees empty remote common_features.
gossip - Feature check passed. Local node 127.0.0.2 features =
{CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
Remote common_features = {}
The problem is node3 hasn't started yet, node1 sees node3 has empty
features. In get_supported_features(), an empty common features will be
returned if an empty features of a node is seen. To fix, we should
fallback to use the features saved in system table.
Start node3, node3 sees empty remote common_features.
gossip - Feature check passed. Local node 127.0.0.3 features =
{CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
Remote common_features = {}
The problem is node3 hasn't inserted its own features into gossip
endpoint_state_map. get_supported_features() returns the common features
of all nodes in endpoint_state_map. To fix, we should fallback to use
the features stored in the system table for such node in this case.
Fixes#4225
We saw the log "Feature FOO is enabled" more than once like below. It is
better to log it only when the feature is not enabled previously.
gossip - InetAddress 127.0.0.1 is now UP, status = NORMAL
gossip - Feature CORRECT_COUNTER_ORDER is enabled
gossip - Feature CORRECT_NON_COMPOUND_RANGE_TOMBSTONES is enabled
gossip - Feature COUNTERS is enabled
gossip - Feature DIGEST_MULTIPARTITION_READ is enabled
gossip - Feature INDEXES is enabled
gossip - Feature LARGE_PARTITIONS is enabled
gossip - Feature LA_SSTABLE_FORMAT is enabled
gossip - Feature MATERIALIZED_VIEWS is enabled
gossip - Feature MC_SSTABLE_FORMAT is enabled
gossip - Feature RANGE_TOMBSTONES is enabled
gossip - Feature ROLES is enabled
gossip - Feature ROW_LEVEL_REPAIR is enabled
gossip - Feature SCHEMA_TABLES_V3 is enabled
gossip - Feature STREAM_WITH_RPC_STREAM is enabled
gossip - Feature TRUNCATION_TABLE is enabled
gossip - Feature WRITE_FAILURE_REPLY is enabled
gossip - Feature XXHASH is enabled
gossip - Feature CORRECT_COUNTER_ORDER is enabled
gossip - Feature CORRECT_NON_COMPOUND_RANGE_TOMBSTONES is enabled
gossip - Feature COUNTERS is enabled
gossip - Feature DIGEST_MULTIPARTITION_READ is enabled
gossip - Feature INDEXES is enabled
gossip - Feature LARGE_PARTITIONS is enabled
gossip - Feature LA_SSTABLE_FORMAT is enabled
gossip - Feature MATERIALIZED_VIEWS is enabled
gossip - Feature MC_SSTABLE_FORMAT is enabled
gossip - Feature RANGE_TOMBSTONES is enabled
gossip - Feature ROLES is enabled
gossip - Feature ROW_LEVEL_REPAIR is enabled
gossip - Feature SCHEMA_TABLES_V3 is enabled
gossip - Feature STREAM_WITH_RPC_STREAM is enabled
gossip - Feature TRUNCATION_TABLE is enabled
gossip - Feature WRITE_FAILURE_REPLY is enabled
gossip - Feature XXHASH is enabled
gossip - InetAddress 127.0.0.2 is now UP, status = NORMAL
It works accidentally but it just expanded by bash to use mached files
in current directory, not correctly recognized by tar.
Need to use full file name instead.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190312172243.5482-2-syuu@scylladb.com>
We get following link error when running reloc/build_reloc.sh in dbuild,
need to enable DPDK on Seastar:
g++: error: /usr/lib64/librte_cfgfile.so: No such file or directory
g++: error: /usr/lib64/librte_cmdline.so: No such file or directory
g++: error: /usr/lib64/librte_ethdev.so: No such file or directory
g++: error: /usr/lib64/librte_hash.so: No such file or directory
g++: error: /usr/lib64/librte_kvargs.so: No such file or directory
g++: error: /usr/lib64/librte_mbuf.so: No such file or directory
g++: error: /usr/lib64/librte_eal.so: No such file or directory
g++: error: /usr/lib64/librte_mempool.so: No such file or directory
g++: error: /usr/lib64/librte_mempool_ring.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_bnxt.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_e1000.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_ena.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_enic.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_fm10k.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_qede.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_i40e.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_ixgbe.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_nfp.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_ring.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_sfc_efx.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_vmxnet3_uio.so: No such file or directory
g++: error: /usr/lib64/librte_ring.so: No such file or directory
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190312172243.5482-1-syuu@scylladb.com>
Since we removed dist/common/bin/scyllatop we are getting a build error
on .deb package build (1bb65a0888).
To fix the error we need to create a symlink for /usr/bin/scyllatop.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190316162105.28855-1-syuu@scylladb.com>
"
Refuse to accept SSTables that were created with partitioner
different than the one used by the Scylla server.
Fixes#4331
"
* 'haaawk/4331/v4' of github.com:scylladb/seastar-dev:
sstables: Add test for sstable::validate_partitioner
sstables: Add sstable::validate_partitioner and use it
Make sure the exception is thrown when Scylla
tries to load an SSTable created with non-compatible partitioner.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Scylla server can't read sstables that were created
with different partitioner than the one being used by Scylla.
We should make sure that Scylla identifies such mismatch
and refuses to use such SSTables.
We can use partitioner information stored in validation metadata
(Statistics.db file) for each SSTable and compare it against
partitioner used by Scylla.
Fixes#4331
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
A previous version of the patch that introduced these calls had no
limit on how far behind the large data recording could get, and
maybe_record_large_cells returned null.
The final version switched to a semaphore, but unfortunately these
calls were not updated.
Tests: unit (dev)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190314195856.66387-1-espindola@scylladb.com>
vint deserialiser can be more performant if it is allowed to do an
overread (i.e. read more memory than the value it is deserialising).
In case of sstable reads those vints are going to be usually in a middle
of a much larger buffer so lets pass the whole length of the buffer and
enable this optimisation.
At the moment, vint deserialisation is using a naive approach, reading
each byte separately. In practice, vints are going to most often appears
inside larger buffers. That means we can read 8-bytes at a time end then
figure out unneded parts and mask them out. This way we avoid a loop and
do less memory loads which are much more expensive than arithmetic
operations (even if they hit the cache).
Deserialisation function returns a structure containing both the value
and its length in the input buffer. In the vast majority of the cases
the caller will already know the length and having this structure will
make it harder for the compiler to emit good code, especially if the
function is not inlined.
In practice I've seen the structure causing register pressure problems
that lead to spilling variables to memory.
In order to allow yielding when handling endpoint lifecycle changes,
notifiers now run in threaded context.
Implementations which used this assumption before are supplemented
with assertions that they indeed run in seastar::async mode.
Fixes#4317
Message-Id: <45bbaf2d25dac314e4f322a91350705fad8b81ed.1552567666.git.sarna@scylladb.com>
* seastar e640314...463d24e (3):
> Merge 'Handle IOV_MAX limit in posix_file_impl' from Paweł
> core: remove unneeded 'exceptional future ignored' report
> tests/perf: support multiple iterations in a single test run
Issue #4234 asks for a large collection detector. Discussing the issue
Benny pointed out that it is probably better to have a generic large
cell detector as it makes a natural progression on what we already
warn on (large partitions and large rows).
This patch series implements that. It is on top of
shutdown-order-patches-v7 which is currently on next.
With the charges to use a semaphore this patch series might be getting
a bit big. Let me know if I should split it.
* https://github.com/espindola/scylla espindola/large-cells-on-top-of-shutdown-v5:
db: refactor large data deletion code
db: Rename (maybe_)?update_large_partitions
db: refactor a try_record helper
large_data_handler: assert it is not used after stop()
db: don't use _stopped directly
sstables: delete dead error handling code.
large_data_handler: Remove const from a few functions
large_data_handler: propagate a future out of stop()
large_data_handler: Run large data recording in parallel
Create a system.large_cells table
db: Record large cells
Add a test for large cells
This is analogous to the system.large_rows table, but holds individual
cells, so it also needs the column name.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this changes the futures returned by large_data_handler will not
normally wait for entries to be written to system.large_rows or
system.large_partitions.
We use a semaphore to bound how behind system.large_* table updates
can get.
This should avoid delaying sstables writes in the common case, which
is more relevant once we warn of large cells since the the default
threshold will be just 1MB.
Note that there is no ordering between the various maybe_record_* and
maybe_delete_large_data_entries requests. This means that we can end
up with a stale entry that is only removed once the TTL expires.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These will use a member semaphore variable in a followup patch, so they
cannot be const.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
maybe_delete_large_data_entries handles exceptions internally, so the
code this patch deletes would never run.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This should have been changed in the patch
db: stop the commit log after the tables during shutdown
But unfortunately I missed it then.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We had almost identical error handling for large_partitions and
large_rows. Refactor in preparation for large_cells.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This renames it to record_large_partitions, which matches
record_large_rows. It also changes the signature to be closer to
record_large_rows.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The code for deleting entries from system.large_partitions was almost
a duplicate from the code for deleting entries from system.large_rows.
This patch unifies the two, which also improves the error message when
we fail to delete entries from system.large_partitions.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
There is no guarantee that rpc streaming makes progress in some time
period. Remove the keep alive timer in streaming to avoid killing the
session when the rpc streaming is just slow.
The keep alive timer is used to close the session in the following case:
n2 (the rpc streaming sender) streams to n1 (the rpc streaming receiver)
kill -9 n2
We need this because we do not kill the session when gossip think a node
is down, because we think the node down might only be temporary
and it is a waste to drop the previous work that has done especially
when the stream session takes long time.
Since in range_streamer, we do not stream all data in a single stream
session, we stream 10% of the data per time, and we have retry logic.
I think it is fine to kill a stream session when gossip thinks a node is
down. This patch changes to close all stream session with the node that
gossip think it is down.
Message-Id: <bdbb9486a533eee25fcaf4a23a946629ba946537.1551773823.git.asias@scylladb.com>
"
This series allows canceling view update requests when a node is
discovered DOWN. View updates are sent in the background with long
timeout (5 minutes), and in case we discover that the node is
unavailable, there's no point in waiting that long for the request
to finish. What's more, waiting for these requests occurs on shutdown,
which may result in waiting 5 minutes until Scylla properly shuts down,
which is bad for both users and dtests.
This series implements storage_proxy as a lifecycle subscriber,
so it can react to membership changes. It also keeps track of all
"interruptible" writes per endpoint, so once a node is detected as DOWN,
an artificial timeout can be triggered for all aforementioned write
requests.
Fixes#3826Fixes#3966Fixes#4028
"
* 'write_hints_for_view_updates_on_shutdown_4' of https://github.com/psarna/scylla:
service: remove unused stop_hints_manager
storage_proxy: add drain_on_shutdown implementation
main: register storage proxy as lifecycle subscriber
storage_proxy: add endpoint_lifecycle_subscriber interface
storage_proxy: register view update handlers for view write type
storage_proxy: add intrusive list of view write handlers
storage_proxy: add view_update_write_response_handler
Complex timestamp tests were ported from dtest and contained a potential
race - rows were updated with TTL 1 and then checked if the row exists
in both base and view replicas in an eventually() loop.
During this loop however, TTL of 1 second might have already passed
and the row could have been deleted from base.
This patch changes the mentioned TTL to 30 seconds, making the tests
extremely unlikely to be flaky.
Message-Id: <6b43fe31850babeaa43465eb771c0af45ee6e80d.1552041571.git.sarna@scylladb.com>
This series contains several improvements to perf_fast_forward that
either address some of the problems seen in the automated runs or help
understanding the results.
The main problem was that test small-partition-slicing had a preparation
stage disproportionally long compared to the actual testing phase. While
the fragments per second results wasn't affected by that, it restricted
the number of iterations of the test that we were able to run, and the
test which single iterations is short (and more prone to noise) was
executed only four times. This was solved by sharing the preparation
stage with all iterations, thus enabling the test to be run many times
and improving the stability of the results.
Another, improvement is the ability to dump all test results and process
them producing histograms. This allows us to see how the distribution of
particular statistics looks like and if there are some complications.
Refs #4278.
* https://github.com/pdziepak/scylla.git more-perf_fast_forward/v1:
tests/perf_fast_forward: print number of iterations of each test
tests/perf_fast_forward: reuse keys in small partition slicing test
tests/perf_fast_forward: extract json result file writing logic
tests/perf_fast_forward: add an option to dump all results
tests/perf_fast_forward: add script for analysing full results
When storage proxy is shutting down, all interruptible writes
can be timed out in order not to wait for them. Instead, the mechanism
will fall back to storing hints and/or not progressing with view
building.
In order to be able to iterate over view update write response handlers,
an intrusive list of them is added to storage proxy. This way
iteration can be easily yielded without invalidating operators and all
logic is moved to slow path.
View update write response handler inherits from a regular write
response handler, but it's also possible to link it intrusively
in order to be able to induce timeouts on them later.
perf_fast_forward with flag --dump-all-results reports the results of
every test iteration that was executed. This patch introduces a python
script that can analyse those results (in json format) and present them
in a more human-friendly way.
For now, the only option is to plot histograms of selected statistics.
perf_fast_forward runs each test case multiple times and reports a
summary of those results (median, min, max, and median absolute
deviation).
While very convenient the summary may hide some important information
(e.g. the distribution of the results). This patch adds an option to
report results of every single executed iteration.
"
Fixes#4245
Breaks up "perform_cleanup" in parameterized "rewrite_sstables"
and implements upgrade + scrub in terms of this.
Both run as a "regular" compaction, but ignore the normal criteria
for compaction and select obsolete/all tables.
We also ensure all previous compactions are done so we can guarantee
all tables are rewritten post invocation of command.
"
* 'calle/upgrade_sstables' of github.com:scylladb/seastar-dev:
api::storage_service: Implement "scrub"
api/storage_service: Implement "upgradesstables"
api::storage_service: Add keyspace + tables helper
compaction_manager: Add perform_sstable_scrub
compaction_manager: Add perform_sstable_upgrade
compaction_manager: break out rewrite_sstables from cleanup
table: parameterize cleanup_sstables
"
Currently any large partitions found during shutdown are not
recorded. The reason is that the database commit log is already off,
so there is nowhere to record it to.
One possible solution is to have an independent system database. With
that the regular db is shutdown first and writes can continue to the
system db.
That is a pretty big change. It would also not allow us to record
large partitions in any system tables.
This patch series instead tries to stop the commit log later. With
that any large partitions are recorded to the log and moved to a
sstable on the next startup.
"
* 'espindola/shutdown-order-patches-v7' of https://github.com/espindola/scylla:
db: stop the commit log after the tables during shutdown
db: stop the compaction manager earlier
db: Add a stop_database helper
db: Don't record large partitions in system tables
Fixes#4245
Implemented as a compation barrier (forcing previous compactions to
finish) + parameterized "cleanup", with sstable list based on
parameters.
Truncating a table is very slow if the system is under pressure. Because
in that case we mostly just want to get rid of the existing data, it
shouldn't take this long. The problem happens because truncate has to
wait for memtable flushes to end, twice. This is regardless of whether
or not the table being truncated has any data.
1. The first time is when we call truncate itself:
if auto_snapshot is enabled, we will flush the contents of this table
first and we are expected to be slow. However, even if auto_snapshot is
disabled we will still do it -- which is a bug -- if the table is marked
as durable. We should just not flush in this case and it is a silly bug.
1. The second time is when we call cf->stop(). Stopping a table will
wait for a flush to finish. At this point, regardless of which path
(Durable or non-durable) we took in the previous step we will have no
more data in the table. However, calling `flush()` still need to acquire
a flush_permit, which means we will wait for whichever memtable is
flushing at that very moment to end.
If the system is under pressure and a memtable flush will take many
seconds, so will truncate. Even if auto_snapshots are enabled, we
shouldn't have to flush twice. The first flush should already put is in
a state in which the next one is immediate (maybe holding on to the
permit, maybe destroying the memtable_list already at that point ->
since no other memtables should be created).
If auto_snapshots are not enabled, the whole thing should just be
instantaneous.
This patchset fixes that by removing the flush need when !auto_snapshot,
and special casing the flush of an empty table.
Fixes#4294
* git@github.com:glommer/scylla.git slowtruncate-v2:
database: immediately flush tables with no memtables.
truncate: do not flush memtables if auto_snapshot is false.
This commit rewrites the logic of table creation at startup of the auth
mechanism to be race proof. This is done by simply ignoring the
already_exists exception as done in system_distributed_keyspace.
The old creation logic, tested for existance of the column family and
right after called announce_new_column_family with the newly
created table schema. The problem was that it does not prevent
a race since the announcement itself is a fiber and the created table
can still be gossiped from another node, causing the announce
function to throw an already_exists exception that in turn crashes
scylla.
Message-Id: <20190306075016.28131-1-eliransin@scylladb.com>
This allows for system.large_partitions to be updated if a large
partition is found while writing the last sstables.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We want to finish all large data logging in stop_system, so stopping
the compaction manager should be the first thing stop_system does.
The make_ready_future<>() will be removed in a followup patch.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
According to the cql definitions, if no ORDER BY clause is present,
records should be returned ordered by the clustering keys. Since the
backend returns the ranges according to their order of appearance
in the request, the bounds should be sorted before sending it to the
backend. This kind of sorting is needed in queries that generates more
than one bound to be read, examples to such queris are:
1. a SELECT query with an IN clause.
2. a SELECT query on a mixed order tupple of columns (see #2050).
The assumption this commit makes is the correctness of the bounds
list, that is, the bounds are non overlapping. If this wasn't true, multiple
occurences of the same reccord could have returned for certain queries.
Tests:
1. Unit tests release
2. All dtest that requires #2050 and #2029Fixes#2029
"
Fixes#3708
This series adds JSON serialization and deserialization procedures
to tuples and user defined types.
Tests: unit (dev)
"
* 'add_tuple_and_udt_json_support_2' of https://github.com/psarna/scylla:
tests: add test cases for JSON and UDT
types: add JSON support to UDT
tests: add JSON tuple tests
types: add JSON support for tuples
Right now we flush memtables if the table is durable (which in practice
it almost always is).
We are truncating, so we don't want the data. We should only flush if
auto_snapshot is true.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
If a table has no data, it may still take a long time to flush. This is
because before we even try to flush, we need go acquire a permit and
that can take a while if there is a long running flush already queued.
We can special case the situation in which there is no data in any of
the memtables owned by table and return immediately.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
* seastar ab54765...e640314 (10):
> net: enable IP_BIND_ADDRESS_NO_PORT before binding a socket during connection
> core: show address in error message for posix_listen failures
> fmt: remove submodule
> tests: fix loopback socket close() to not fail when the peer's side is already closed
> Merge "Add suffixes to target names" from Jesse
> temporary_buffer: improve documentation for alignment param requirements
> docs: Fix dependencies for split tutorial target
> deleter: prevent early memory free caused by deleter append.
> doc/tutorial.md: introduce memory allocation foreign_ptr
> Fix CLI help message (network & DPDK options)
Toolchain and configure.py updated for fmt submodule removal.
fuzzy_test performs some checks that are expected to fail and whoose
failure does not influence the outcome of the test. For this it uses the
`BOOT_WARN_*` family of macros. These will just log a warning when their
predicate fails. This can however confuse someone looking at the logs
trying to determine the cause of a failure. Since these checks are
performed primarly to provide an aid in debugging failures, replace them
with a conditional debug-level log message.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f550a9d9ab1b5b4aeb4f81860cbd3d924fc86898.1551792035.git.bdenes@scylladb.com>
The `test_abandoned_read` verifies that an abandoned read does a proper
cleanup. One of the things checked is that after the querier TTL
expires, the saved queriers are cleaned-up. This check however had a
very tight timing. The TTL was 2s and the test waited 2s before it did
the check, which is wrapped in an `eventually_true()` (max +1s).
The TTL timer scans the queriers with a period of TTL/2 so a querier
can live 1.5*TTL time. This means that the 2s + 1s wait time is just on
the limit and with some bad luck (and a slow machine) it can fail.
Reduce the TTL in this test to 1s to relax the dependence on timing.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <ed0d45b5a07960b83b391d289cade9b9f60c7785.1551787638.git.bdenes@scylladb.com>
This change ammends on the functionality of the result generation,
it changes the behaviour to return the expected results vector sorted
in the expected order of appearance in the result set. Then the
result set is validated for both, content and also order.
Tests: unit tests (Release)
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Whenever a query with an IN clause on clustering keys is executed,
assuming only one partition, the rows are ordered according to the
clustering keys. This commit adds the order validation to the content
validation whenever possible (which means removing the
ignore order part).
Tests: unit tests (Release)
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
According to the cql definitions, if no ORDER BY clause is present,
records should be returned ordered by the clustering keys. Since the
backend returns the ranges according to their order of appearance
in the request, the bounds should be sorted before sending it to the
backend. This kind of sorting is needed in queries that generates more
than one bound to be read, examples to such queris are:
1. a SELECT query with an IN clause.
2. a SELECT query on a mixed order tupple of columns (see #2050).
The assumption this commit makes is the correctness of the bounds
list, that is, the bounds are non overlapping. If this wasn't true, multiple
occurences of the same reccord could have returned for certain queries.
Tests:
1. Unit tests release
2. All dtest that requires #2050 and #2029Fixes#2029
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
"
Futurize split_range_to_single_shard to fix reactor stall.
Fixes: #3846
"
* tag 'asias/split_range_to_single_shard/v4' of github.com:scylladb/seastar-dev:
partitioner: Futurize split_range_to_single_shard
tests: Use SEASTAR_THREAD_TEST_CASE for partitioner_test.cc
table::load_sstable: fix missing arg in old format counters exception
Properly catch and log the exception in load_new_sstables.
Abort when the exception is caught to keep current behavior.
Seen with migration_test:TestMigration_with_2_1_x.migrate_sstable_with_counter_test
without enable_dangerous_direct_import_of_cassandra_counters.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190301091235.2914-1-bhalevy@scylladb.com>
"
This fixes#3988.
We already have a system.large_partitions, but only a warning for
large rows. These patches close the gap by also recording large rows
into a new system.large_rows.
"
* 'espindola/large-row-add-table-v6' of https://github.com/espindola/scylla:
Add a testcase for large rows
Populate system.large_rows.
Create a system.large_rows table
Extract a key_to_str helper
Don't call record_large_rows if stopped
Add a delete_large_rows_entries method to large_data_handler
db::large_data_handler::(maybe_)?record_large_rows: Return future<> instead of void
Rename maybe_delete_large_partitions_entry
Rename log_large_row to record_large_rows
Rename maybe_log_large_row to maybe_record_large_rows
"
This series contains minor improvements to commitlog log messages that
have helped investigating #4231, but are not specific to that bug.
"
* tag 'improve-commitlog-logs/v1' of https://github.com/pdziepak/scylla:
commitlog: use consistent chunk offsets in logs
commitlog: provide more information in logs
commitlog: remove unnecessary comment
Logs in commitlog writer use offset in the file of the chunk header to
identify chunks. However, the replayer is using offset after the header
for the same purpose. This causes unnecessary confusion suggesting that
the replayer is reading at the wrong position.
This patch changes the replayer so that it reports chunk header offsets.
This commits adds some more information to the logs. Motivated, by
experiences with investigating #4231.
* size of each write
* position of each write
* log message for final write
"
This series fixes a problem in the commitlog cycle() function that
confused in-memory and on-disk size of chunks it wrote to disk. The
former was used to decide how much data needs to be actually written,
and the latter was used to compute the offset of the next chunk. If two
chunk writes happened concurrently one the one positioned earlier in
the file could corrupt the header of the next one.
Fixes#4231.
Tests: unit(dev), dtest(commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup,test_commitlog_replay_with_alter_table)
"
* tag 'fix-commitlog-cycle/v1' of https://github.com/pdziepak/scylla:
commitlog: write the correct buffer size
utils/fragmented_temporary_buffer_view: add remove suffix
Introduced in 2a437ab427.
regular_compaction::select_sstable_writer() creates the sstable writer
when the first partition is consumed from the combined mutation
fragment stream. It gets the schema directly from the table
object. That may be a different schema than the one used by the
readers if there was a concurrent schema alter duringthat small time
window. As a result, the writing consumer attached to readers will
interpret fragments using the wrong version of the schema.
One effect of this is storing values of some columns under a different
column.
This patch replaces all column_family::schema() accesses with accesses
to the _schema memeber which is obtained once per compaction and is
the same schema which readers use.
Fixes#4304.
Tests:
- manual tests with hard-coded schema change injection to reproduce the bug
- build/dev/scylla boot
- tests/sstable_mutation_test
Message-Id: <1551698056-23386-1-git-send-email-tgrabiec@scylladb.com>
Commitlog files contain multiple chunks. Each chunk starts as a single
(possibly, fragmented buffer). The size of that buffer in memory may be
larger than the size in the file.
cycle() was incorrectly using the in-memory size to write the whole
buffer to the file. That sometimes caused data corruption, since a
smaller on-file size was used to compute the offset of the next chunk
and there could be multiple chunk writes happening at the same time.
This patch solves the issue by ensuring that only the actual on-file
size of the chunk is written.
This patch adds fragmented_temporary_buffer_view::remove_suffix(). It is
also necessary to adjust remove_prefix() since now the total size of all
fragments may be larger than the size of the view if both those
operations are performed.
"
This series heavily refactors `auth_test` in anticipation of
the last patch, which fixes a bug and which should be backported.
Branches: branch-3.0, branch-2.3
"
Fixes#4284
* 'jhk/check_can_login/v2' of https://github.com/hakuch/scylla:
auth: Reject logins from disallowed roles
tests: Restrict the scope of a variable
tests: Simplify boolean assertions in `auth_test`
tests: Abstract out repeated assertion checking
tests: Do not use the `auth` namespace
tests: Validate authentication correctly
tests: Ensure test roles are created and dropped
tests: Use `static` variables in `auth_test`
tests: Remove non-useful test
4 nodes in the cluster
n1, n2 in dc1
n3, n4 in dc2
dc1 RF=2, dc2 RF=2.
If we run
nodetool repair -hosts 127.0.0.1,127.0.03 -dc "dc1,dc2" multi
on n1.
The -hosts option will be ignored and only the -dc option
will be used to choose which hosts to repair. In this case, n1 to n4
will be repaired.
If user wants to select specific hosts to repair with, there is no need
to specify the -dc option. Use the -hosts option is enough.
Reject the combination and not to surprise the user.
In https://issues.apache.org/jira/browse/CASSANDRA-9876, the same logic
is introduced as well.
Refs #3836
Message-Id: <e95ac1099f98dd53bb9d6534316005ea3577e639.1551406529.git.asias@scylladb.com>
Scylla Manager communicates through SSH, so this patch adds SSH server
to Scylla's docker image in order for it to be configurable by Scylla
Manager.
Message-Id: <20190301161428.GA12148@shenzou.localdomain>
"
This series aims to fix inconsistencies in recent view update generation series (435447998).
First of all, it checks view row marker liveness instead of that of a base row marker
when deciding if optimizations can be applied or not.
Secondly, tests based on creating mutations directly are removed. Instead:
- dtest case which detected inconsistencies in previous series is ported to be a unit test
- the above case is also expanded to cover views with regular base column in their key
- additional test for TTL and timestamps is added and it's based on CQL
Tests: unit (dev)
dtest: materialized_views_test.TestMaterializedViews.test_no_base_column_in_view_pk_complex_timestamp_without_flush
Fixes: #4271
"
* 'fix_virtual_columns_liveness_checks_in_update_optimization_5' of https://github.com/psarna/scylla:
tests: add view update optimization case for TTL
database: add view_stats getter
tests: port complex timestamp view test from dtest
db,view: fix virtual columns liveness checks
tests: remove update generating test case
There are additional validation steps that the server executes in
addition to simply invoking the authenticator, so we adapt the tests to
also perform that validation.
We also eliminate lots of code duplication.
Since the role manager and authenticator work in tandem, the test cases
should use the wrapper for `auth::service` to create and drop users
instead of just doing it through the authenticator.
Password handling is verified in its own test suite, and this test not
only makes a number of assumptions about implementation details, but
also tries to verify a hashing scheme (bcrypt) which is not supported on
most Linux distributions.
These defines are global, so they can be in the mode-agnostic cxxflags
rather than the mode-specific cxxflags_{mode}.
Message-Id: <20190228081247.20116-1-avi@scylladb.com>
This test was useful in discovering corner cases for TTLs of virtual
columns, so it's ported to unit test suite from dtest.
The test is also extended with a mirrored case for base regular column
that *is* included in view pk.
When looking for optimization paths, columns selected in a view
are checked against multiple conditions - unfortunately virtual
columns were erroneously skipped from that check, which resulted
in ignoring their TTLs. That can lead to overoptimizing
and not including vital liveness info into view rows,
which can then result in row disappearing too early.
This test case should have been based on CQL instead of creating
artificial update scenarios. It also contains invalid cases
regarding base and view row marker, so it's removed here
and replaced with CQL-based test in this same series.
gnutls requires a configuration file, and the configuration file must match
the one used by the library. Since we ship our own version of the library with
the relocatable package, we must also ship the configuration file.
Luckily, it is possible to override the location of the configuration file via
an environment variable, so all we need to do is to copy the file to the archive
and provide the environment variable in the thunk that adjusts the library path.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190227110529.14146-1-avi@scylladb.com>
Currently, we only allocate memory for concurrent unit test runs. This can cause
CPU overcommit when running test.py on machines with a log of memory but few cores.
This overcommit can cause timeouts in tests that are time-sensitive (bad practice,
but can happen) and makes the desktop sluggish.
Improve by allocating at least one logical core per running test.
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190227132516.22147-1-avi@scylladb.com>
Collect /etc/redhat-release as well as os-release from relevant
hosts. The problem with os-release is that it doesn't contain the
minor version of the EL OS family. Since this is only present in
Red Hat distributions and derivatives, it will not be collected
in Debian derivatives.
Another approach is to use lsb_release -a but it will not provide
anything more useful than os-release on Debian and lsb needs to be
installed on EL derivatives first.
Fixes#4093
Message-Id: <20190225204727.20805-4-dyasny@scylladb.com>
Hostname -i produces a garbled output on new systems with ipv6
enabled, better to use the clean hostname instead, for the file
names.
Message-Id: <20190225204727.20805-3-dyasny@scylladb.com>
The script relies on hostname -i for host address, which can be
wrong in some systems. This patch checks for where the defined
CQL_PORT is listening, and uses the correct IP address instead.
Message-Id: <20190225204727.20805-2-dyasny@scylladb.com>
"
This series restructures the SASL code that was previously internal
to the `password_authenticator` so that it can be used in other contexts.
"
* 'jhk/restructure_sasl/v1' of https://github.com/hakuch/scylla:
auth: Rename SASL challenge class for "PLAIN"
auth: Make a ctor `explicit`
auth: Move `sasl_challenge` to its own file
auth: Decouple SASL code from its parent class
"
This series adds a fuzzer-type unit test for range scans, which
generates a semi-random dataset and executes semi-random range scans
against it, validating the result.
This test aims to cover a wide range of corner cases with the help of
randomness. Data and queries against it are generated in such a way that
various corner cases and their combinations are likely to be covered.
The infrastructure under range-scans have gone under massive changes in
the last year, growing in complexity and scope. The correctness of range
scans is critical for the correct functioning of any Scylla cluster, and
while the current unit tests served well in detecting any major problems
(mostly while developing), they are too simplistic and can only be
relied on to check the correctness of the basic functionality. This test
aims to extend coverage drastically, testing cases that the author of
the range-scan code or that of the existing unit tests didn't even think
exists, by relying on some randomness.
Fixes: #3954 (deprecates really)
"
* 'more-extensive-range-scan-unit-tests/v2' of https://github.com/denesb/scylla:
tests/multishard_mutation_query_test: add fuzzy test
tests/multishard_mutation_query_test: refactor read_all_partitions_with_paged_scan()
tests/test_table: add advanced `create_test_table()` overload
tests/test_table: make `create_test_table()` customizable
query: add trim_clustering_row_ranges_to()
tests/test_table: add keyspace and table name params
tests/test_table: s/create_test_cf/create_test_table/
tests: move create_test_cf() to tests/test_table.{hh,cc}
tests/multishard_mutation_query_test: drop many partition test
tests/multishard_mutation_query_test: drop range tombstone test
"
This miniseries hides virtual columns's writetime and ttl
from the user.
Tests: unit (dev)
Fixes#4288
"
* 'hide_virtual_columns_writetime_and_ttl_2' of https://github.com/psarna/scylla:
tests: add test for hiding virtual columns from WRITETIME
cql3: hide virtual columns from WRITETIME() and TTL()
schema: add column_definition::is_hidden_from_cql
Virtual columns should not be visible to the user,
so they are now hidden not only from directly selecting them,
but also via WRITETIME() and TTL() keywords.
Fixes#4288
test_distributed_loader_with_pending_delete issues a dma write, but violates
the unwritten contract to temporary_buffer::aligned(), which requires that
size be a multiple of alignment. As a result the test fails spuriously.
Instead of playing with the alignment, rewrite that snippet to use the
easier-to-use make_file_output_stream().
Introduced in 1ba88b709f.
Branches: master.
Message-Id: <20190226181850.3074-1-avi@scylladb.com>
It now records large rows when they are first written to an sstable
and removes them when the sstable is deleted.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This is analogous to the system.large_partitions table, but holds
individual rows, so it also needs the clustering key of the large
rows.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The implementations large_data_handler should only be called if
large_data_handler hasn't been stopped yet.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These functions will record into tables in a followup patch, so they
will need to return a future.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
From the log it looks like these checks were added in 2014 because of
a broken clang.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this patch HACKING suggest using just ./configure.py and passing
the mode to ninja. It also expands on the characteristics of each mode
and mentions the dev mode.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190208020444.19145-1-espindola@scylladb.com>
The materialized-views flow control carefully calculates an amount of
microseconds to delay a client to slow it down to the desired rate -
but then a typo (std::min instead of std::max) causes this delay to
be zeroed, which in effect completely nullifies the flow control
algorithm.
Before this fix, experiments suggested that view flow control was
not having any effect and view backlog not bounded at all. After this
fix, we can see the flow control having its desired effect, and the
view backlog converging.
Fixes#4143.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190226161452.498-1-nyh@scylladb.com>
" from Asias
Some of the function names are not updated after we change the rpc verb
names. Rename them to make them consistent with the rpc verb names.
* seastar-dev.git asias/row_level_repair_rename_consistent_with_rpc_verb/v1:
repair: Rename request_sync_boundary to get_sync_boundary
repair: Rename request_full_row_hashes to get_full_row_hashes
repair: Rename request_combined_row_hash to get_combined_row_hash
repair: Rename request_row_diff to get_row_diff
repair: Rename send_row_diff to put_row_diff
repair: Update function name in docs/row_level_repair.md
The shard-aware drivers can cause a huge amount of connections to be created
when there are tens of thousands of clients. While normally the shard-aware
drivers are beneficial, in those cases they can consume too much memory.
Provide an option to disable shard awareness from the server (it is likely to
be easier to do this on the server than to reprovision those thousands of
clients).
Tests: manual test with wireshark.
Message-Id: <20190223173331.24424-1-avi@scylladb.com>
In resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test, we saw:
4 nodes in the tests
n1, n2, n3, n4 are started
n1 is stopped
n1 is changed to use different shard config
n1 is restarted ( 2019-01-27 04:56:00,377 )
The backtrace happened on n2 right fater n1 restarts:
0 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature STREAM_WITH_RPC_STREAM is enabled
1 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature WRITE_FAILURE_REPLY is enabled
2 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature XXHASH is enabled
3 WARN 2019-01-27 04:56:05,177 [shard 0] gossip - Fail to send EchoMessage to 127.0.58.1: seastar::rpc::closed_error (connection is closed)
4 INFO 2019-01-27 04:56:05,205 [shard 0] gossip - InetAddress 127.0.58.1 is now UP, status =
5 Segmentation fault on shard 0.
6 Backtrace:
7 0x00000000041c0782
8 0x00000000040d9a8c
9 0x00000000040d9d35
10 0x00000000040d9d83
11 /lib64/libpthread.so.0+0x00000000000121af
12 0x0000000001a8ac0e
13 0x00000000040ba39e
14 0x00000000040ba561
15 0x000000000418c247
16 0x0000000004265437
17 0x000000000054766e
18 /lib64/libc.so.6+0x0000000000020f29
19 0x00000000005b17d9
The theory is: migration_manager::maybe_schedule_schema_pull is scheduled, at this time
n1 has SCHEMA application_state, when n1 restarts, n2 gets new application
state from n1 which does not have SCHEMA yet, when migration_manager::maybe_schedule
wakes up from the 60 sleep, n1 has non-empty endpoint_state but empty
application_state for SCHEMA. We dereference the nullptr
application_state and abort.
In commit da80f27f44, we fixed the problem by
checking the pointer before dereference.
To prevent this to happen in the first place, we'd better to add
application_state::SCHEMA when gossip starts. This way, peer nodes
always see the application_state::SCHEMA when a node restarts.
Tests: resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test
Fixes#4148Fixes#4258
Split the update_schema_version_and_announce() into
update_schema_version() and announce_schema_version(). This is going to
be used in storage_service::prepare_to_join() where we want to first
update the schema version, start gossip, announce the schema version.
It is sometimes usefull for force reinstallation of the node_exporter,
for example during upgrade or if something is wrong with the current
installation.
This patch adds a --force command line option.
If the --force is given to the node_expoerter_install, it will reinstall
node_exporter to the latest version, regardless if it was already
installed.
The symbolic link in /usr/bin/node_exporter will be set to the installed
version, so if there are other installed version, they will remain.
Examples:
$ sudo ./dist/common/scripts/node_exporter_install
node_exporter already installed, you can use `--force` to force reinstallation
$ sudo ./dist/common/scripts/node_exporter_install --force
node_exporter already installed, reinstalling
Fixes#4201
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20190225151120.21919-1-amnon@scylladb.com>
Let's add a PRODUCT variable, similar to build_rpm.sh, for example, so
that we can override package names for enterprise AMIs.
Message-Id: <20190225063319.19516-1-penberg@scylladb.com>
"
After adcb3ec20c ("row_cache: read is not
single-partition if inter-partition forwarding is enabled") we have
noticed a regression in the results of some perf_fast_forward tests.
This was caused by those tests not disabling partition-level
fast-forwarding even though it was not needed and the commit in question
fixed an incorrect optimisation in such cases.
However, after solving that issue it has also become apparent that
mutation_reader_merger performs worse when the fast-forwarding is
disabled. This was attributed to logic responsible for dropping readers
as soon as they have reached the end of stream (which cannot be done if
fast-forwarding is enabled). This problem was mitigated with avoiding a
scan of the list and removing readers in small batches.
Fixes#4246.
Fixes#4254.
Tests: unit(dev)
"
* tag 'perf_fast_forward-fix-regression/v1' of https://github.com/pdziepak/scylla:
mutation_reader_merger: drop unneded readers in small batches
mutation_reader_merger: track readers by iterators and not pointers
tests/perf_fast_forward: disable partition-level fast-forwarding if not needed
* seastar 2313dec...ab54765 (10):
> Fix C++-17-only uses of static_assert() with a single parameter.
> README.md: fix out-of-date explanation of C++ dialect
> net: fix tcp load balancer accounting leak while moving socket to other shard
> Revert "deleter: prevent early memory free caused by deleter append."
> deleter: prevent early memory free caused by deleter append.
> Solve seastar.unit.thread failure in debug mode
> Fix iovec-based read_dma: use make_readv_iocb instead of make_read_iocb
> build: Fix the required version of `fmt`
> app_template: fix use after move in app constructor
> build: Rename CMake variable for private flags
Fixes#4269.
* 'jhk/define_debug/v1' of https://github.com/hakuch/scylla:
build: Remove the `DEBUG_SHARED_PTR` pp variable
build: Prefer the Seastar version of a pp variable
Scylla currently prints a welcome message when it starts, with the
Scylla version, but this is not printed to the regular log so in some
cases (e.g., Jenkins runs) we do not see it in the log. So let's add
a regular INFO-level log message with the same information.
Also, Scylla currently doesn't print any specific log message when it
normally completes its shutdown. In some cases, users may end up
wondering whether Scylla hung in the middle of the shutdown, or in
fact exited normally. Refs #4238. So in this patch we add a "shutdown
complete" message as the very last message in a successfull shutdown.
We print Scylla's version also in the shutdown message, which may be
useful to see in the logs when shutting down one version of Scylla
and starting a different version.
Finally, we also add a log message when initialization is complete,
which may also be useful to understand whether Scylla hung during
initialization.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190217140659.19512-1-nyh@scylladb.com>
It was observed that destroying readers as soon as they are not needed
negatively affects performance of relatively small reads. We don't want
to keep them alive for too long either, since they may own a lot of
memory, but deferring the destruction slightly and removing them in
batches of 4 seems to solve the problem for the small reads.
mutation_reader_merger uses a std::list of mutation_reader to keep them
alive while the rest of the logic operates on non-owning pointers.
This means that when it is a time to drop some of the readers that are
no longer needed, the merger needs to scan the list looking for them.
That's not ideal.
The solution is to make the logic use iterators to elements in that
list, which allows for O(1) removal of an unneeded reader. Iterators to
list are just pointers to the node and are not invalidated by unrelated
additions and removals.
Several of the test cases in perf_fast_forward do not need
partition-level fast-forwarding. However, since the defaults are used to
construct most of the readers the fast-forwarding is enabled regardless.
This showed an apparent regression in the perf_fast_forward results
after adcb3ec20c ("row_cache: read is not
single-partition if inter-partition forwarding is enabled") which
disabled an optimisation that was invalid when partition-level
fast-forwarind was requested.
This patch ensures that all single-partition reads that do not need
partition-level fast-forwarding keep it disabled.
"
Currently we keep the entries in a circular_buffer, which uses
a contiguous storage. For large partitions with many promoted index
entries this can cause OOM and sstable compaction failure.
A similar problem exists for the offset vector built
in write_promoted_index().
This change solves the problem by serializing promoted index entries
and the offset vector on the fly directly into a bytes_ostream, which
uses fragmented storage.
The serialization of the first entry is deferred, so that
serialization is avoided if there will be less than 2
entries. Promoted index is not added for such partitions.
There still remains a problem that large-enough promoted index can cause OOM.
Refs #4217
Tests:
- unit (release)
- scylla-bench write
Branches: 3.0
"
* tag 'fix-large-alloc-for-promoted-index-v3' of github.com:tgrabiec/scylla:
sstables: mc: writer: Avoid large allocations for maintaining promoted index
sstables: mc: writer: Avoid double-serialization of the promoted index
"
The delete_atomically function is required to delete a set of sstables
atomically. I.e. Either delete all or none of them. Deleting only
some sstables in the set might result in data resurrection in case
sstable A holding tombstone that cover mutation in sstable B, is deleted,
while sstable B remains.
This patchset introduces a log file holding a list of SSTable TOC files
to delete for recovering a partial delete_atomically operation.
A new subdirectory is create in the sstables dir called `pending_delete`
holding in-flight logs.
The logs are created with a temporary name (using a .tmp suffix)
and renamed to the final .log name once ready. This indicates
the commit point for the operation.
When populating the column family, all files in the pending_delete
sub-directory are examined. Temporary log files are just removed,
and committed log files are read, replayed, and deleted.
Fixes#4082
Tests: unit (dev), database_test (debug)
"
* 'projects/delete_atomically_recovery/v5' of https://github.com/bhalevy/scylla:
tests: database_test: add test_distributed_loader_with_pending_delete
distributed_loader: replay and cleanup pending_delete log files
distributed_loader: populated_column_family: separate temp sst dirs cleanup phase
docs: add sstables-directory-structure.md
sstables: commit sstables to delete_atomically into a pending_delete log file
sstables: delete_atomically: delete sstables in a thread
sstables: component_basename: reuse with sstring component
sstables: introduce component_basename
database: maybe_delete_large_partitions_entry: do not access sstable and do not mask exceptions
sstables: add delete_sstable_and_maybe_large_data_entries
sstables: call remove_by_toc_name in dtor if marked_for_deletion
Scan the table's pending_delete sub-directory if it exists.
Remove any temporary pending_delete log files to roll back the respective
delete_atomically operation.
Replay completed pending_delete log files to roll forward the respective
delete_atomically operation, and finally delete the log files.
Cleanup of temporary sstable directories and pending_delete
sstables are done in a preliminary scan phase when populating the column family
so that we won't attempt to load the to-be-deleted sstables.
Fixes#4082
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In preparation for replaying pending_delete log files,
we would like to first remove any temporary sst dirs
and later handle pending_delete log files, and only
then populate the column family.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To facilitate recovery of a delete_atomically operation that crashed mid
way, add a replayable log file holding the committed sstables to delete.
It will be used by populate_column_family to replay the atomic deletion.
1. Write the toc names of sstables to be deleted into a temporary file.
2. Once flushed and closed, rename the temp log file into the final name
and flush the pending_delete directory.
3. delete the sstables.
4. Remove the pending_delete log file
and flush the pending_delete directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
component_basename returns just the basename for the component filename
without the leading sstdir path.
To be used for delete_atomically's pending_delete log file.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
1. We would like to be able to call maybe_delete_large_partitions_entry
from the sstable destructor path in the future so the sstable might go away
while the large data entries are being deleted.
2. We would like the caller to handle any exception on this path,
especially in the prepatation part, before calling delete_large_partitions_entry().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be called by delete_atomically,
rather that passing a vector to delete_sstables.
This way, no need to build `sstables_to_delete_atomically` vector
To be replaced in the future with a sstable method once we
provide the large_data_handler upon construction.
Handle exceptions from remove_by_toc_name or maybe_delete_large_partitions_entry
by merely logging an error. There is nothing else we can do at this point.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to call delete_sstables which works on a list of sstable
(by toc name).
Also, add FIXME comment about not calling
large_data_handler.maybe_delete_large_partitions_entry
on this path.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
checksummed_file_writer does not override allocate_buffer(), so it inherits
data_source_impl's default allocate_buffer, which does not care about alignment.
The buffer is then passed to the real file_data_sink_impl, and thence to the file
itself, which cannot complete the write since it is not properly aligned.
This doesn't fail in release mode, since the Seastar allocator will supply a
properly aligned buffer even if not asked to do so. The ASAN allocator usually
does supply an aligned buffer, but not always, which causes the test to fail.
Fix by forwarding the allocate_buffer() function to the underlying data_source.
Fixes#4262.
Branches: branch-3.0
Message-Id: <20190221184115.6695-1-avi@scylladb.com>
Limits are stored as uint32_t everywhere, but in some places
int32_t was used, which created inconsistencies when comparing
the value to std::numeric_limits<Type>::max().
In order to solve inconsistencies, the types are unified to uint32_t,
and instead of explicitly calling numeric limit max,
an already existing constant value query::max_rows is utilized.
Fixes#4253
Message-Id: <4234712ff61a0391821acaba63455a34844e489b.1550683120.git.sarna@scylladb.com>
We've seen schema application failing with marshal_exception
here. That's not enough information to figure out what is the
problem. Knowing which table and column is affected would make
diagnosis much easier in certain cases.
This patch wraps errors in query::deserialization_error with more
information.
Example output:
query::deserialization_error (failed on column system_schema.tables#bloom_filter_fp_chance \
(version: c179c1d7-9503-3f66-a5b3-70e72af3392a, id: 0, index: 0, type: org.apache.cassandra.db.marshal.DoubleType):\
seastar::internal::backtraced<marshal_exception> (marshaling error: read_simple - not enough bytes (expected 8, got 3)
Message-Id: <20190221113219.13018-1-tgrabiec@scylladb.com>
This patch removes the log message about "compaction_manager - Asked to stop"
at the very end of Scylla runs. This log message is confusing because it
only has the "asked to stop" part, without finally a "stopped", and may
lead a user to incorrectly fear that the shutdown hung - when it in fact
finished just fine.
The database object holds a compaction_manager and stop()s it when the
database is stop()ed - and that is the very last thing our shutdown does.
However, much earlier, as the *first* shutdown operation (i.e., the last
at_exit() in main.cc), we already stop() the compaction manager.
The second stop() call does nothing, but unfortunately prints the log
message just before checking if it has anything to stop. So this patch
just moves the log message to after the check.
Fixes#4238.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190217142657.19963-1-nyh@scylladb.com>
"
Fixes#4256
This miniseries fixes a problem with inserting NULL values through
INSERT JSON interface.
Tests: unit (dev)
"
* 'fix_insert_json_with_null' of https://github.com/psarna/scylla:
tests: add test for INSERT JSON with null values
cql3: add missing value erasing to json parser
Can be useful in diagnosing problems with application of schema
mutations.
do_merge_schema() is called on every change of schema of the local
node.
create_table_from_mutations() is called on schema merge when a table
was altered or created using mutations read from local schema tables
after applying the change, or when loading schema on boot.
Message-Id: <20190221093929.8929-2-tgrabiec@scylladb.com>
"
cryptopp's config.h has the following pragma:
#pragma GCC diagnostic ignored "-Wunused-function"
It is not wrapped in a push/pop. Because of that, including cryptopp
headers disables that warning on scylla code too.
This patch series introduces a single .cc file that has to include
cryptopp headers.
"
* 'avoid-cryptopp-v3' of https://github.com/espindola/scylla:
Avoid including cryptopp headers
Delete dead code
cryptopp's config.h has the following pragma:
#pragma GCC diagnostic ignored "-Wunused-function"
It is not wrapped in a push/pop. Because of that, including cryptopp
headers disables that warning on scylla code too.
The issue has been reported as
https://github.com/weidai11/cryptopp/issues/793
To work around it, this patch uses a pimpl to have a single .cc file
that has to include cryptopp headers.
While at it, it also reduces the differences and code duplication
between the md5 and sha1 hashers.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This code would have be to refactored by the next patch. Since it is
commented out, just delete it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
This series addresses the issue of redundant view updates,
generated for columns that were not selected for given materialized view.
Cases covered (quote:)
* If a base row has a live row marker, then we can avoid generating
view updates if only unselected columns change;
* If a base row has no live row marker, then we can avoid generating
view updates if unselected columns are updated, unless they are newly
created, deleted, or they have a TTL.
Additionally, this series includes caching selected columns and is_index information
to avoid unnecessary CPU cycles spent on recomputing these two.
Fixes#3819
"
* 'send_less_view_updates_if_not_necessary_4' of https://github.com/psarna/scylla:
tests: add cases for view update generation optimizations
view: minimize generated view updates for unselected columns
view: cache is_index for view pointer
index: make non-pointer overload of is_index function
index: avoid copying when checking for is_index
In some cases generating view updates for columns that were not
selected in CREATE VIEW statement is redundant - it is the case
when the update will not influence row liveness in anyway.
Currently, these cases are optimized out:
- row marker is live and only unselected columns were updated;
- row marked is not live and only unselected columns were updated,
and in the process nothing was created or deleted and there was
no TTL involved;
It's detrimental to keep querying index manager whether a view
is backing a secondary index every time, so this value is cached
at construct time.
At the same time, this value is not simply passed to view_info
when being created in secondary index manager, in order to
decouple materialized view logic from secondary indexes as much as
possible (the sole existence of is_index() is bad enough).
allocate_segment() can fail even though we're not out of memory, when
it's invoked inside an allocating section with the cache region
locked. That section may later succeed after retried after memory
reclamation.
We should ignore bad_alloc thrown inside allocating section body and
fail only when the whole section fails.
Fixes#2924
Message-Id: <1550597493-22500-1-git-send-email-tgrabiec@scylladb.com>
"
Fixes#3574
This series adds missing multi-column restrictions filtering to CQL.
The underlying infrastructure already allows checking multi-column
restrictions in a reasonable way, so this series consists of mostly
adding simple interfaces and parameters.
Also, unit test cases for multi-column restrictions are provided.
Tests: unit (dev)
"
* 'add_multi_column_restrictions_filtering_3' of https://github.com/psarna/scylla:
tests: add multi-column filtering tests
cql3: add multi-column restrictions filtering
cql3: add specified is_satisfied_by to multi-column restriction
cql3: rewrite raw loop in is_satisfied_by to boost::any_of
cql3: fix is_satisfied_by for multi-column restrictions
cql3: add missing include to multi-column restriction
Multi-column restrictions need only schema, clustering key and query
options in order to decide if they are satisfied, so an overloaded
function that takes reduced number of parameters is added.
Multi-column restriction should be satisfied by the value
if any of the ranges contains it, not all of them.
Example: SELECT * FROM t WHERE (a,b) IN ((1,2),(1,3))
will operate on two singular ranges: [(1,2),(1,2)] and [(1,3),(1,3)].
It's sufficient for a value to be inside any of these two in order
to satisfy the restriction.
"
As part of implementing sstables manager and fixing issue related
to updating large_data_handler on all delete paths, we want to funnel
all sstable creations, loading, and deletions through a manager.
The patchset lays out test infrastructure to funnel these opeations
through class sstables::test_env.
In the process, it cleans up many numerous call sites in the existing
unit tests that evolved over time.
Refs #4198
Refs #4149
Tests: unit (dev)
"
* 'projects/test_env/v3' of https://github.com/bhalevy/scylla:
tests: introduce sstables::test_env
tests: perf_sstable: rename test_env
tests: sstable_datafile_test: use useable_sst
tests: sstable_test: add write_and_validate_sst helper
tests: sstable_test: add test_using_reusable_sst helper
tests: sstable_test: use reusable_sst where possible
tests: sstable_test: add test_using_working_sst helper
tests: sstable_3_x_test: make_test_sstable
tests: run_sstable_resharding_test: use default parameters to make_sstable
tests: sstables::test::make_test_sstable: reorder params
tests: test_setup: do_with_test_directory is unused
tests: move sstable_resharding_strategy_tests to sstable_reharding_test
tests: move create_token_from_key helpers to test_services
tests: move column_family_for_tests to test_services
dht: move declaration of default_partitioner from sstable_datafile_test to i_partitioner.hh
Allow the --mode argument to ./configure.py and ./test.py to be repeated. This
is to allow contiuous integration to configure only debug and release, leaving dev
to developers.
Message-Id: <20190214162736.16443-1-avi@scylladb.com>
Currently, we keep the entries in a circular_buffer, which uses
a contiguous storage. For large partitions with many promoted index
entries this can cause OOM and sstable compaction failure.
A similar problem exists for the offset vector built
in write_promoted_index().
This change solves the problem by serializing promoted index entries
and the offset vector on the fly directly into a bytes_ostream, which
uses fragmented storage.
The serialization of the first entry is deferred, so that
serialization is avoided if there will be less than 2
entries. Promoted index is not added for such partitions.
There still remains a problem that large-enough promoted index can cause OOM.
Refs #4217
"
This series introduces PER PARTITION LIMIT to CQL.
Protocol and storage is already capable of applying per-partition limits,
so for nonpaged queries the changes are superficial - a variable is parsed
and passed down.
For paged queries and filtering the situation is a little bit more complicated
due to corner cases: results for one partition can be split over 2 or more pages,
filtering may drop rows, etc. To solve these, another variable is added to paging
state - the number of rows already returned from last served partition.
Note that "last" partition may be stretched over any number of pages, not just the
last one, which is a case especially when considering filtering.
As a result, per-partition-limiting queries are not eligible for page generator
optimization, because they may need to have their results locally filtered
for extraneous rows (e.g. when the next page asks for per-partition limit 5,
but we already received 4 rows from the last partition, so need just 1 more
from last partition key, but 5 from all next ones).
Tests: unit (dev)
Fixes#2202
"
* 'add_per_partition_limit_3' of https://github.com/psarna/scylla:
tests: remove superficial ignore_order from filtering tests
tests: add filtering with per partition key limit test
tests: publish extract_paging_state and count_rows_fetched
tests: fix order of parameters in with_rows_ignore_order
cql3,grammar: add PER PARTITION LIMIT
idl,service: add persistent last partition row count
cql3: prevent page generator usage for per-partition limit
cql3: add checking for previous partition count to filtering
pager: add adjusting per-partition row limit
cql3: obey per partition limit for filtering
cql3: clean up unneeded limit variables
cql3: obey per partition limit for select statement
cql3: add get_per_partition_limit
cql3: add per_partition_limit to CQL statement
Python 3.6 is the first version to accept bytes to the json.loads(),
which causes the following error on older Python 3 versions:
Traceback (most recent call last):
File "/usr/lib/scylla/scylla-housekeeping", line 175, in <module>
args.func(args)
File "/usr/lib/scylla/scylla-housekeeping", line 121, in check_version
raise e
File "/usr/lib/scylla/scylla-housekeeping", line 116, in check_version
versions = get_json_from_url(version_url + params)
File "/usr/lib/scylla/scylla-housekeeping", line 55, in get_json_from_url
return json.loads(data)
File "/usr/lib64/python3.4/json/__init__.py", line 312, in loads
s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
To support those older Python versions, convert the bytes read to utf8
strings before calling the json.loads().
Fixes#4239
Branches: master, 3.0
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20190218112312.24455-1-amnon@scylladb.com>
When reporting a failure, expected rows were mixed up with received
rows. Also, the message assumed it received more rows, but it can
as well be less, so now it reports a "different number" of rows.
In order to process paged queries with per-partition limits properly,
paging state needs to keep additional information: what was the row
count of last partition returned in previous run.
That's necessary because the end of previous page and the beginning
of current one might consist of rows with the same partition key
and we need to be able to trim the results to the number indicated
by per-partition limit.
Paged queries that induce per-partition limits cannot use
page generator optimization, as sometimes the results need
to be filtered for extraneous rows on page breaks.
Filtering now needs to take into account per partition limits as well,
and for that it's essential to be able to compare partition keys
and decide which rows should be dropped - if previous page(s) contained
rows with the same partition key, these need to be taken into
consideration too.
For filtering pagers, per partition limit should be set
to page size every time a query is executed, because some rows
may potentially get dropped from results.
Part of the code is already implemented (counters and hinted-handoff).
Part of the code will probably never be (triggers). And the rest is
the code that estimates number of rows per range to determine query
parallelism, but we implemented exponential growth algorithms instead.
Message-Id: <20190214112226.GE19055@scylladb.com>
"
get_restricted_ranges() is inefficient since it calculates all
vnodes that cover a requested key ranges in advance, but callers often
use only the first one. Replace the function with generator interface
that generates requested number of vnodes on demand.
"
* 'gleb/query_ranges_to_vnodes_generator' of github.com:scylladb/seastar-dev:
storage_proxy: limit amount of precaclulated ranges by query_ranges_to_vnodes_generator
storage_proxy: remove old get_restricted_ranges() interface
cql3/statements/select_statement: convert index query interface to new query_ranges_to_vnodes_generator interface
tests: convert storage_proxy test to new query_ranges_to_vnodes_generator interface
storage_proxy: convert range query path to new query_ranges_to_vnodes_generator interface
storage_proxy: introduce new query_ranges_to_vnode_generator interface
Give the constant 1024*1024 introduced in an earlier commit a name,
"batch_memory_max", and move it from view.cc to view_builder.hh.
It now resides next to the pre-existing constant that controlled how
many rows were read in each build step, "batch_size".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190217100222.15673-1-nyh@scylladb.com>
* seastar 11546d4...2313dec (6):
> Deprecate thread_scheduling_group in favor of scheduling_group
> Merge "Fixes for Doxygen documentation" from Jesse
> future: optionally type-erase future::then() and future::then_wrapped
> build: Allow deprecated declarations internally
> rpc: fix insertion of server connections into server's container
> rpc: split BOOST_REQUIRE with long conditions into multiple
read_exactly(), when given a stream that does not contain the amount of data
requested, will loop endlessly, allocating more and more memory as it does, until
it fails with an exception (at which point it will release the memory).
Fix by returning an empty result, like input_stream::read_exactly() (which it
replaces). Add a test case that fails without a fix.
Affected callers are the native transport, commitlog replay, and internal
deserialization.
Fixes#4233.
Branches: master, branch-3.0
Tests: unit(dev)
Message-Id: <20190216150825.14841-1-avi@scylladb.com>
When yum-utils already installed on Fedora, 'yum install dnf-utils' causes
conflict, will fail.
We should show description message instead of just causing dnf error
mesage.
Fixes#4215
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190215221103.2379-1-syuu@scylladb.com>
When bootstrapping, a node should to wait to have a schema agreement
with its peers, before it can join the ring. This is to ensure it can
immediately accept writes. Failing to reach schema agreement before
joining is not fatal, as the node can pull unknown schemas on writes
on-demand. However, if such a schema contains references to UDFs, the
node will reject writes using it, due to #3760.
To ensure that schema agreement is reached before joining the ring,
`storage_service::join_token_ring()` has to checks. First it checks that
at least one peer was connected previously. For this it compares
`database::get_version()` with `database::empty_version`. The (implied)
assumption is that this will become something other than
`database::empty_version` only after having connected (and pulled
schemas from) at least one peer. This assumption doesn't hold anymore,
as we now set the version earlier in the boot process.
The second check verifies that we have the same schema version as all
known, live peers. This check assumes (since 3e415e2) that we have
already "met" all (or at least some) of our peers and if there is just
one known node (us) it concludes that this is a single-node cluster,
which automatically has schema agreement.
It's easy to see how these two checks will fail. The first fails to
ensure that we have met our peers, and the second wrongfully concludes
that we are a one-node cluster, and hence have schema agreement.
To fix this, modify the first check. Instead of relying on the presence
of a non-empty database version, supposedly implying that we already
talked to our peers, explicitely make sure that we have really talked to
*at least* one other node, before proceeding to the second check, which
will now do the correct thing, actually checking the schema versions.
Fixes: #4196
Branches: 3.0, 2.3
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40b95b18e09c787e31ba6c5519fb64d68b4ca32e.1550228389.git.bdenes@scylladb.com>
The included testcase used to crash because during database::stop() we
would try to update system.large_partition.
There doesn't seem to be an order we can stop the existing services in
cql_test_env that makes this possible.
This patch then adds another step when shutting down a database: first
stop updating system.large_partition.
This means that during shutdown any memtable flush, compaction or
sstable deletion will not be reflected in system.large_partition. This
is hopefully not too bad since the data in the table is TTLed.
This seems to impact only tests, since main.cc calls _exit directly.
Tests: unit (release,debug)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190213194851.117692-1-espindola@scylladb.com>
-Og is advertised as debug-friendly optimization, both in compile time
and debug experience. It also cuts sstable_mutation_test run time in half:
Changing -O0 to -Og
Before:
real 16m49.441s
user 16m34.641s
sys 0m10.490s
After:
real 8m38.696s
user 8m26.073s
sys 0m10.575s
Message-Id: <20190214205521.19341-1-avi@scylladb.com>
In preparation for providing a default large_data_handler in
a test-standard way.
buffer_size parameter reordered and now has a default value
same as make_sstable()'s.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For fixing issue #3362 we added in materialized views, in some cases,
"virtual columns" for columns which were not selected into the view.
Although these columns nominally exist in the view's schema, they must
not be visible to the user, and in commit
3f3a76aa8f we prevented a user from being
able to SELECT these columns.
In this patch we also prevent the user from being able to use these
column names (which shouldn't exist in the view) in WHERE restrictions.
Fixes#4216
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190212162014.18778-1-nyh@scylladb.com>
The bulk materialized-view building processes (when adding a materialized
view to a table with existing data) currently reads the base table in
batches of 128 (view_builder::batch_size) rows. This is clearly better
than reading entire partitions (which may be huge), but still, 128 rows
may grow pretty large when we have rows with large strings or blobs,
and there is no real reason to buffer 128 rows when they are large.
Instead, when the rows we read so far exceed some size threshold (in this
patch, 1MB), we can operate on them immediately instead of waiting for
128.
As a side-effect, this patch also solves another bug: At worst case, all
the base rows of one batch may be written into one output view partition,
in one mutation. But there is a hard limit on the size of one mutation
(commitlog_segment_size_in_mb, by default 32MB), so we cannot allow the
batch size to exceed this limit. By not batching further after 1MB,
we avoid reaching this limit when individual rows do not reach it but
128 of them did.
Fixes#4213.
This patch also includes a unit test reproducing #4213, and demonstrating
that it is now solved.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190214093424.7172-1-nyh@scylladb.com>
Fixes#4222
Iff an extension creation callback returns null (not exception)
we treat this as "I'm not needed" and simply ignore it.
Message-Id: <20190213124311.23238-1-calle@scylladb.com>
The way the `pkg-config` executable works on Fedora and Ubuntu is
different, since on Fedora `pkg-config` is provided by the `pkgconf`
project.
In the build directory of Seastar, `seastar.pc` and `seastar-testing.pc`
are generated. `seastar` is a requirement of `seastar-testing`.
When pkg-config is invoked like this:
pkg-config --libs build/release/seastar-testing.pc
the version of `pkg-config` on Fedora resolves the reference to
`seastar` in `Requires` to the `seastar.pc` in the same directory.
However, the version of `pkg-config` on Ubuntu 18.04 does not:
Package seastar was not found in the pkg-config search path.
Perhaps you should add the directory containing `seastar.pc'
to the PKG_CONFIG_PATH environment variable
Package 'seastar', required by '/seastar-testing', not found
To address the divergent behavior, we set the `PKG_CONFIG_PATH` variable
to point to the directory containing `seastar.pc`. With this change, I
was able to configure Scylla on both Fedora 29 and Ubuntu 18.04.
Fixes#4218
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <d7164bde2790708425ac6761154d517404818ecd.1550002959.git.jhaberku@scylladb.com>
"
Fixes#4083
Instead of sharded collection in system.local, use a
dedicated system table (system.truncated) to store
truncation positions. Makes query/update easier
and easier on the query memory.
The code also migrates any existing truncation
positions on startup and clears the old data.
"
* 'calle/truncation' of github.com:scylladb/seastar-dev:
truncation_migration_test: Add rudimentary test
system_keyspace: Add waitable for trunc. migration
cql_test_env: Add separate config w. feature disable
cql_test_env: Add truncation migration to init
cql_assertions: Add null/non-null tests
storage_service: Add features disabling for tests
Add system.truncated documentation in docs
commitlog_replay: Use dedicated table for truncation
storage_service: Add "truncation_table" feature
Fixes#4083
Instead of sharded collection in system.local, use a
dedicated system table (system.truncated) to store
truncation positions. Makes query/update easier
and easier on the query memory.
The code also migrates any existing truncation
positions on startup and clears the old data.
* seastar 428f4ac...11546d4 (9):
> reactor: Fix an infinite loop caused the by high resolution timer not being monitored
> build: Add back `SEASTAR_SHUFFLE_TASK_QUEUE`
> build: Unify dependency versions
> future-util: optimize parallel_for_each() with single element
> core/sharded.hh: fix doxygen for "Multicore" group
> build: switch from travis-ci to circleci
> perftune.py: fix irqbalance tuning on Ubuntu 18
> build: Make the use of sanitizers transitive
> net: ipv6: fix ipv6 detection and tests by binding to loopback
"
Recently, there has been a series of incidents of the multishard
combining reader deadlocking, when the concurrency of reads were
severely restricted and there was no timeout for the read.
Several fixes have been merged (414b14a6b, 21b4b2b9a, ee193f1ab,
170fa382f) but eliminating all occurrences of deadlocks proved to be a
whack-a-mole game. After the last bug report I have decided that instead
of trying to plug new wholes as we find them, I'll try to make wholes
impossible to appear in the first place. To translate this into the
multishard reader, instead of sprinkling new `reader.pause()` calls all
over the place in the multishard reader to solve the newly found
deadlocks, make the pausing of readers fully automatic on the shard
reader level. Readers are now always kept in a paused state, except when
actually used. This eliminates the entire class of deadlock bugs.
This patch-set also aims at simplifying the multishard reader code, as
well as the code of the existing `lifecycle_policy` implementations.
This effort resulted in:
* mutation_reader.cc: no change in SLOC, although it now also contains
logic that used to be duplicated in every `lifecycle_policy`
implementation;
* multishard_mutation_query.cc: 150 SLOC removed;
* database.cc: 30 SLOC removed;
Also the code is now (hopefully) simpler, safer and has a clearer
structure.
Fixes#4050 (main issue)
Fixes#3970Fixes#3998 (deprecates really)
"
* 'simplify-and-fix-multishard-reader/v3.1' of https://github.com/denesb/scylla:
query_mutations_on_all_shards(): make states light-weight
query_mutations_on_all_shards(): get rid of read_context::paused_reader
query_mutations_on_all_shards(): merge the dismantling and ready_to_save states into saving state
query_mutations_on_all_shards(): pause looked-up readers
query_mutation_on_all_shards(): remove unecessary indirection
shard_reader: auto pause readers after being used
reader_concurrency_semaphore::inactive_read_handle: fix handle semantics
shard_reader: make reader creation sync
shard_reader: use semaphore directly to pause-resume
shard_reader: recreate_reader(): fix empty range case
foreign_reader: rip out the now unused private API
shard_reader: move away from foreign_reader
multishard_combining_reader: make shard_reader a shared pointer
multishard_combining_reader: move the shard reader definition out
multishard_combining_reader: disentangle shard_reader
Previously the different states a reader can be in were all separate
structs, and were joined together by a variant. When this was designed
this made sense as states were numerous and quite different. By this
point however the number of states has been reduced to 4, with 3 of them
being almost the same. Thus it makes sense to merge these states into
single struct and keep track of the current state with an enum field.
This can theoretically increase the chances of mistakes, but in practice
I expect the opposite, due to the simpler (and less) code. Also, all the
important checks that verify that a reader is in the state expected by
the code are all left in place.
A byproduct of this change is that the amount of cross-shard writes is
greatly reduced. Whereas previously the whole state object had to be
rewritten on state change, now a single enum value has to be updated.
Cross shard reads are reduced as well to the read of a few foreign
pointers, all state-related data is now kept on the shard where the
associated reader lives.
These two states are now the same, with the artificial distinction that
all readers are promoted to readey_to_save state after the compaction
state and the combined buffer is dismantled. From a practical
perspective this distinction is meaningless so merge the two states into
a single `saving` state.
On the beginning of each page, all saved readers from the previous pages
(if any) are looked up, so they can be reused. Some of these saved
readers can end up not being used at all for the current page, in which
case they will needlessly sit on their permit for the duration of
filling the page. Avoid this by immediately pausing all looked-up
readers. This also allows a nice unifying of the reader saving logic, as
now *all* readers will be in a paused state when `save_reader()` is
called. Previously, looked-up, but not used readers were an exception to
this, requiring extra logic to handle both cases. This logic can now be
removed.
Previously it was the responsibility of the layer above (multishard
combining reader) to pause readers, which happened via an explicit
`pause()` call. This proved to be a very bad design as we kept finding
spots where the multishard reader should have paused the reader to avoid
potential deadlocks (due to starved reader concurrency semaphores), but
didn't.
This commit moves the responsibility of pausing the reader into the
shard reader. The reader is now kept in a paused state, except when it
is actually used (a `fill_buffer()` or `fast_forward_to()` call is
executing). This is fully transparent to the layer above.
As a side note, the shard reader now also hides when the reader is
created. This also used to be the responsibility of the multishard
reader, and although it caused no problems so far, it can be considered
a leak of internal details. The shard reader now automatically creates
the remote reader on the first time it is attempted to be used.
The code has been reorganized, such that there is now a clear separation
of responsibilities. The multishard combining reader handles the
combining of the output of the shard readers, as well as issuing
read-aheads. The shard reader handles read-ahead and creating the
remote reader when needed, as well as transferring the results of remote
reads to the "home" shard. The remote reader
(`shard_reader::remote_reader`, new in this patch) handles
pausing-resuming as well as recreating the reader after it was evicted.
Layers don't access each other's internals (like they used to).
After this commit, the reader passed to `destroy_reader()` will always
be in paused state.
Reader creation happens through the `reader_lifecycle_policy` interface,
which offers a `create_reader()` method. This method accepts a shard
parameter (among others) and returns a future. Its implementation is
expected to go to the specified shard and then return with the created
reader. The method is expected to be called from the shard where the
shard reader (and consequently the multishard reader) lives. This API,
while reasonable enough, has a serious flaw. It doesn't make batching
possible. For example, if the shard reader issues a call to the remote
shard to fill the remote reader's buffer, but finds that it was evicted
while paused, it has to come back to the local shard just to issue the
recreate call. This makes the code both convoluted and slow.
Change the reader creation API to be synchronous, that is, callable from
the shard where the reader has to be created, allowing for simple call
sites and batching.
This change requires that implementations of the lifecycle policy update
any per-reader data-structure they have from the remote shard. This is
not a problem however, as these data-structures are usually partitioned,
such that they can be accessed safely from a remote shard.
Another, very pleasant, consequence of this change is that now all
methods of the lifecycle interface are sync and thus calls to them
cannot overlap anymore.
This patch also removes the
`test_multishard_combining_reader_destroyed_with_pending_create_reader`
unit test, which is not useful anymore.
For now just emulate the old interface inside shard reader. We will
overhaul the shard reader after some further changes to minimize
noise.
The shard reader relies on the `reader_lifecycle_policy` for pausing and
resuming the remote reader. The lifecycle policy's API was designed to
be as general as possible, allowing for any implementation of
pause/resume. However, in practice, we have a single implementation of
pause/resume: registering/unregistering the reader with the relevant
`reader_concurrency_semaphore`, and we don't expect any new
implementations to appear in the future.
Thus, the generic API of the lifecycle policy, is needlessly abstract
making its implementations needlessly complex. We can instead make this
very concrete and have the lifecycle policy just return the relevant
semaphore, removing the need for every implementor of the lifecycle
policy interface to have a duplicate implementation of the very same
logic.
For now just emulate the old interface inside shard reader. We will
overhaul the shard reader after some further changes to minimize noise.
If the shard reader is created for a singular range (has a single
partition), and then it is evicted after reaching EOS, when recreated we
would have to create a reader that reads an empty range, since the only
partition the range has was already read. Since it is not possible to
create a reader with an empty range, we just didn't recreate the reader
in this case. This is incorrect however, as the code might still attempt
to read from this reader, if only due to a bug, and would trigger a
crash. The correct fix is to create an empty reader that will
immediately be at EOS.
Drop all the glue code, needed in the past so the shard reader can be
implemented on top of foreign reader. As the shard reader moved away
from foreign reader, this glue code is not needed anymore.
In the past, shard reader wrapped a foreign reader instance, adding
functionality required by the multishard reader on top. This has worked
well to a certain degree, but after the addition of pause-resume of
shard reader, the cooperation with foreign reader became more-and-more a
struggle. It has now gotten to a point, where it feels like shard reader
is fighting foreign reader as much as it reuses it. This manifested
itself in the ever growing amount of glue code, and hacks baked into
foreign reader (which is supposed to be of general use), specific to
the usage in the multishard reader.
It is time we don't force this code-reuse anymore and instead implement
all the required functionality in shard reader directly.
Some members of shard reader have to be accessed even after it is
destroyed. This is required by background work that might still be
pending when the reader is destroyed. This was solved by creating a
special `state` struct, which contained all the members of the shard
readers that had to be accessed even after it was destroyed. This state
struct was managed through a shared pointer, that each continuation that
was expected to outlive the reader, held a copy of. This however created
a minefield, where each line of the code had to be carefully audited to
access only fields that will be guaranteed to remain valid.
Fix this mess by making the whole class a shared pointer, with
`enable_shared_from_this`. Now each continuation just has to make sure
to keep `this` alive and code can now access all members freely (well,
almost).
Shard reader started its life as a very thin layer above foreign reader,
with just some convenience methods added. As usual, by now it has grown
into a hairy monster, its class definition out-growing even that of the
multishard reader itself. It is time shard reader is moved into the
top-level scope, improving the readability of both classes.
Currently shard reader has a reference to the owning multishard reader
and it freely accesses its members. This resulted in a mess, where it's
not clear what exactly shard reader depends on. Disentangle this mess,
by making the shard reader self-sufficient, passing all it depends on
into its constructor.
All tests that involve writing to a base table and then reading from the
view table must use the eventually() function to account for the fact that
the view update is asynchronous, and may be visible only some time after
writing the base table. Forgetting an eventually() can cause the test
to become flaky and sometimes fail because the expected data is not *yet*
in the view. Botond noticed these failures in practice in two subtests
(test_partition_key_filtering_with_slice and
test_clustering_key_in_restrictions).
This patch fixes both tests, and I also reviewed the entire source file
view_schem_test.cc and found additional places missing an eventually()
(and also places that unnecessarily used eventually() to read from the
base table), and fixed those as well.
Fixes#4212
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190212121140.14679-1-nyh@scylladb.com>
There is no value if having a default value for encoding_stats parameter
of write_components(). If anything it weakens the tests by encouraging
not using the real encoding stats which is not what the actual sstable
write path in Scylla does.
This patch removes the default value and makes most of the tests provide
real encoding statistics. The ones that do not are those that have no
easy way of obtaining those (and those stats are not that important for
the test itself) or there is a reason for not using those
(sstable_3_x_test::test_sstable_write_large_row uses row size thresholds
based on size with default-constructed encoding_stats).
Message-Id: <20190212124356.14878-1-pdziepak@scylladb.com>
Refs #4085
Changes commitlog descriptor to both accept "Recycled-Commitlog..."
file names, and preserve said name in the descriptor.
This ensures we pick up the not-yet-used recycled segments left
from a crash for replay. The replay in turn will simply ignore
the recycled files, and post actual replay they will be deleted
as needed.
Message-Id: <20190129123311.16050-1-calle@scylladb.com>
create-relocatable-package.py currently (refs #4194) builds a compressed
tar file, but does so using a painfully slow Python implementation of gzip,
which is a problem considering the huge size (around 2 gigabytes) of Scylla's
executable. On my machine, running it for a release build of Scylla takes a
whopping 6 minutes.
Just replacing the Python compression with a pipe to an external "gzip"
process speeds up the run to just 2 minutes. But gzip is still not optimal,
using only one thread even when on a many-core machine. If we switch to
"pigz", a parallel implementation of "gzip", all cores are used and on
my machine the compression speeds up to just 23 seconds - that's 15
times faster than before this patch.
So this patch has create-relocatable-package.py use an external pigz process.
"pigz" is now required on the build system (if you want to create packages),
so is added to install-dependencies.sh.
[avi: update toolchain]
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190212090333.3970-1-nyh@scylladb.com>
In commit ec66dd6562, in non-interactive
runs of scylla_setup all options were unintentionally set to "false",
regardless of the options passed on the scylla_setup command line. This
can lead to all sorts of wrong behaviors, and in particular one test
setup assumed it was enabling the Scylla service (which was previously
the default) but after this commit, it no longer did.
This patch restores the previous behavior: Non-interactive invocations
of scylla_setup adhere to the defaults and the command-line options,
rather than blindly choosing "false".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190211214105.32613-1-nyh@scylladb.com>
Do not recalculate too much ranges in advance, it requires large
allocation and usually means that a consumer of the interface is going
to do to much work in parallel.
Fixes: #3767
In case salted_hash was NULL, we'd access uninitialized memory when dereferencing
the optional in get_as<>().
Protect against that by using get_opt() and failing authentication if we see a NULL.
Fixes#4168.
Tests: unit (release)
Branches: 3.0, 2.3
Message-Id: <20190211173820.8053-1-avi@scylladb.com>
"Fuzzy test" executes semi-random range-scans against semi-random data.
By doing so we hope to achieve a coverage of edge cases that would
be very hard to achieve by "conventional" unit tests.
Fuzzy test generates a table with a population of partitions that are
a combinations of all of:
* Size of static row: none, tiny, small and large;
* Number of clustering rows: none, few, several, and lots;
* Size of clustering rows: tiny, small and large;
* Number of range deletions: few, several and lots;
* Number of rows covered by a range deletion: few, several;
As well as a partition with extreme large static row, extreme number of
rows and rows of extreme size.
To avoid writing an excess amount of data, the size limit of pages is
reduced to 1KB (from the default 1MB) and the row count limit of pages
is reduced to 1000 (from the default of 10000).
The test then executes range-scans against this population. For each
range scan, a random partition range is generated, that is guaranteed to
contain at least one partition (to avoid executing mostly empty scans),
as well as a random partition-slice (row ranges). The data returned by
the query is then thoroughly validated against the population
description returned by the `create_test_table()` function.
As this test has a large degree of randomness to it, covering a
quasi-infinite input-space, it can (theoretically) fail at any time.
As such I took great care in making such failures deterministically
reproducible, based on a single random seed, which is logged to the
output in case of a failure, together with instructions on how to repeat
the particular run. The test also uses extensive logging to aid
investigations. For logging, seastar's logging mechanism is used, as
`BOOST_TEST_MESSAGE` produces unintelligible output when running with
-c > 1. Log messages are carefully tagged, so that the test produces the
least amount of noise by default, while being very explicit about what's
happening when ran with `debug` or especially `trace` log levels.
The existing `read_all_partitions_with_paged_scan()` implementation was
tailored to the existing, simplistic test cases. Refactor it so that it
can be used in much more complex test cases:
* Allow specifying the page's `max_size`.
* Allow specifying the query range.
* Allow specifying the partition slice's ck ranges.
* Fix minor bugs in the paging logic.
To avoid churn, a backward-compatible overload is added, that retains
the old parameter set.
This overload provides a middle ground between the very generic, but
hard-to-use "expert version" and to very restrictive and simplistic
"beginner version". It allows the user to declaratively describe the
to-be-generated population in terms of bunch
`std::uniform_int_distribution` objects (e.g. number of rows, size of
rows, etc.).
This allows for generating a random population in a controlled way, with
a minimum amount of boiler-plate code on the user side.
Allow the user to specify the population of the table in a generic and
flexible way. This patch essentially rewrites the `create_test_table()`
implementation from scratch, so that it populates the table using the
partition generator passed in by the user. Backward compatibility is
kept, by providing a `create_test_table()` overload that is identical to
the previous API. This overload is now implemented on top of the generic
overload.
get_restricted_ranges() function gets query provided key ranges
and divides them on vnode boundaries. It iterates over all ranges and
calculates all vnodes, but all its users are usually interested in only
one vnode since most likely it will be enough to populate a page. If it
will be not enough they will ask for more. This patch introduces new
interface instead of the function that allows to generate vnode ranges
on demand instead of precalculating all of them.
Right now Cassandra SSTables with counters cannot be imported into
Scylla. The reason for that is that Cassandra changed their counter
representation in their 2.1 version and kept transparently supporting
both representations. We do not support their old representation, nor
there is a sane way to figure out by looking at the data which one is in
use.
For safety, we had made the decision long ago to not import any
tables with counters: if a counter was generated in older Cassandra, we
would misrepresent them.
In this patch, I propose we offer a non-default way to import SSTables
with counters: we can gate it with a flag, and trust that the user knows
what they are doing when flipping it (at their own peril). Cassandra 2.1
is by now pretty old. many users can safely say they've never used
anything older.
While there are tools like sstableloader that can be used to import
those counters, there are often situations in which directly importing
SSTables is either better, faster, or worse: the only option left. I
argue that having a flag that allow us to import them when we are sure
it is safe is better than having no option at all.
With this patch I was able to successfully import Cassandra tables with
counters that were generated in Cassandra 2.1, reshard and compact their
SSTables, and read the data back to get the same values in Scylla as in
Cassandra.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190210154028.12472-1-glauber@scylladb.com>
This is convenient to test scylla directly by invoking build/dev/scylla.
This needs to be done under docker because the shared objects scylla
looks for may not exist in the host system.
During quick development we may not want to go through the trouble of
packaging relocatable scylla every time to test changes.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190209021033.8400-1-glauber@scylladb.com>
"
The code reading counter cells form sstables verifies that there are no
unsupported local or remote shards. The latter are detected by checking
if all shards are present in the counter cell header (only remote shards
do not have entries there). However, the logic responsible for doing
that was incorrectly computing the total number of counter shards in a
cell if the header was larger than a single counter shard. This resulted
in incorrect complaints that remote shards are present.
Fixes#4206
Tests: unit(release)
"
* tag 'counter-header-fix/v1' of https://github.com/pdziepak/scylla:
tests/sstables: test counter cell header with large number of shards
sstables/counters: fix remote counter shard detection
The logic responsible for reading counters from sstables was getting
confused by large headers. The size of the header depends directly on
the number of shards. This tests checks that we can handle cells with
large number of counter shards properly.
Each counter cell has a header with an entry for each local and global
shards. The detection of remote shards is done by checking if there are
any counter shards that do not have an entry in the header. This is done
by computing the number of counter shards in a cell and comparing it to
the number of header entries. However, the computation was wrong and
included the size taken by the header itself. As a result, if the header
was as big or larger than a single counter shard Scylla incorrectly
complained about remote shards.
The interpreter as it is right now has a bug: I incorrectly assumed that
all the shared libraries that python dynamically links would be in
lib-dynload. That is not true, and at least some of them are in
site-packages.
With that, we were loading system libraries for some shared objects.
The approach taken to fix this is to just check if we're seeing a shared
library and relocate everything we see: we will end up relocating the
ones in lib64 too, but that not only should be okay, it is probably even
more fool-proof.
While doing that I noticed that I had forgotten to incorporate one of
previous feedback from Avi (that we're leaving temporary files behind).
So I'm fixing that as well.
[avi: update toolchain]
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190208115501.7234-1-glauber@scylladb.com>
I was playing with the python3 interpreter trying to get pip to work,
just to see how far we can go. We don't really need pip, but I figured
it would be a good stress test to make sure that the process is working
and robust.
And it didn't really work, because although pip will correctly install
things into $relocatable_root/local/lib, sys.path will still refer to a
hardcoded /usr/local. While this should not affect Scylla, since we
expect to have all our modules in out path anyway -- and that path is
searched before /usr/local, it is still dangerous to make an absolute
reference like this.
Unfortunately, /usr/local/ it is included unconditionally by site.py,
which is executed when the interpreter is started and there is no
environment variable I found to change that (the help string refers to
PYTHONNOUSERSITE, but I found no mention of that in site.py whatsoever)
There is a way to tell site.py not to bother to add user sites, by
passing the -s flag, which this patch does.
Aside from doing that, we also enhance PYTHONPATH to include a reference
to ./local/{lib,lib64}/python<version>/site-packages.
After applying this patch, I was able to build an interpreter containing
only python3-pip and python3-setuptools, and build the relocatable
environment from there.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190206052104.25927-1-glauber@scylladb.com>
This algorithm was already duplicated in two places
(service/pager/query_pagers.cc and mutation_reader.cc). Soon it will be
used in a third place. Instead of triplicating, move it into a function
that everybody can use.
In the next patches `create_test_cf()` will be made much more powerful
and as such generally useful. Move it into its own files so other tests
can start using it as well.
tmpdir is a helper class representing a temporary directory.
Unfortunately, it suffers for some problems such as lack of proper
encapsulation and weak typing. This has caused bugs in the past when the
user code accidentally modified the member variable with the path to the
directory.
This patch modernises tmpdir and updates its users. The path is stored
in a std::filesystem::path and available read-only to the class users.
mkdtemp and boost are replaced by standard solution.
The users are update to use path more (when it didn't involve too many
changes to their code) and stop using lw_shared_ptr to store the tmpdir
when it wasn't necessary.
tmpdir intentionally doesn't provide any helpers for getting the path as
a string in order to discourage weak types.
Message-Id: <20190207145727.491-1-pdziepak@scylladb.com>
* seastar c3be06d...428f4ac (13):
> build: make the "dist" test respect the build type
> Merge 'Add support for docker --cpuset-cpus' from Juliana
> Merge "Add support for Coroutines TS" from Paweł
> Merge "Modernize dependency management" from Avi
> future: propagate broken_promise exception to abandoned continuations
> net/inet_address: avoid clang Wmissing-braces
> build: Default to the "Release" type if unspecified
> rpc: log an exception that may happen while processing an RPC message
> Add a --split-dwarf option to configure.py
> build: Fix the `StdFilesystem` module
> Compress debug info by default
> Add an option for building with split dwarf
> Dockerfile: install stow
Default constructed extremum_tracker has uninitialised _default_value
which basically makes it never correct to do that. Since this class is a
mechanism and not a value it doesn't really need to be a regular type,
so let's drop the default constructor.
Message-Id: <20190207162430.7460-1-pdziepak@scylladb.com>
This series contains several fixes and improvements as well as new tests
for sstable code dealing with statistics.
* https://github.com/pdziepak/scylla.git sstable-stats-fixes/v1-rebased:
sstables: compaction: don't access moved-from vector of sstables
memtable: move encoding_stats_collector implementation out of header
sstables: seal_statistics(): pass encoding_stats by constant reference
sstables/mc/writer: don't assume all schema columns are present
tests/sstable3: improvements to file compare
tests: extract mutation data model
tests/data_model: add support for expiring atomic cells
tests/data_model: allow specifying timestamp for row markers
tests/memtable: test column tracking for encoding stats
sstables: use correct source of statistics in
get_encoding_stats_for_compaction()
utils/extremum_tracking: preserve "not-set" status on merge
sstables/metadata_collector: move the default values to the global
tracker
tests/sstables: test for reading serialisation header
tests/sstables: pass encoding stats to write_components()
tests/sstable: test merging encoding_stats
Fixes#4202.
By default write_components() uses a safe default for encoding_stats
which indicates that all columns are present. This may hide so bugs, so
let's pass the real thing in the tests that this may matter.
column_stats is a per-partition tracker, while metadata_collector is the
global one. The statistics gathered by column_stats are merged into the
metadata_collector. In order to ensure that we get proper default values
in case no value of particular kind (e.g. no TTLs) was seen they need to
be set on the global tracker, not the per-partition one.
extremum_tracker allows choosing a default value that's going to be used
only if no "real" values were provided. Since it is never compared with
the actual input values it can be anything. For instance, if the minimum
tracker default value is 0 and there was one update with the value 1 the
detected minimum is going to be 1 (the default is ignored).
However, this doesn't work when the trackers are merged since that
process always leaves the destination tracker in the "set" state
regardless whether any of the merged trakcers has ever seen any value.
This is fixed by this patch, by properly preserving _is_set state on
merge.
sstable class is responsible for much more things that it should. In
particular, it takes care of both writing and reading sstables. The
problem that it causes is that it is very easy to confuse those two.
This is what has happened in get_encoding_stats_for_compaction().
Originally, it was using _c_stats as a source of the statistics, which
is used only during the write and per-partition. Needless to say, the
returned encoding_stats were bogus.
The correct source of those statistics is get_stats_metadata().
This patch introduces some improvement to file comparison:
- exception flags are set so that any error triggers an exceptions and
guarantees that they are not silently ignored
- std::ios_base::binary flag is passed to open()
- istreambuf_iterator is used instead of istream_iterator. It is better
suited for comparing binary data.
The interface tmpdir::path isn't properly encapsulated and its users can
modify the path even though they really shouldn't. This can happen
accidentally, in cql_test_env a reference to tmpdir::path was created
and later assigned to in one of the code paths. This caused tmpdir
destructor to remove wrong directory at program exit.
This patch solves the problem by avoiding referencing tmpdir::path, a
copy is perfectly acceptable considering that this is tests-only code.
Message-Id: <20190206173046.26801-1-pdziepak@scylladb.com>
rm -rf build/* was to start rpm building on clean state, but it also delete
scylla built binaries so it was not good idea.
Instead of rm -rf build/*, we can check file existance on cloned
directory, if it seems good we can reuse it.
Also we need to run git pull on each package repo since it may not
included latest commit.
Fixes#4189
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190206101755.2056-1-syuu@scylladb.com>
"
Fixes#4193Fixes#3795
This series enables handling IN restrictions for regular columns,
which is needed by both filtering and indexing mechanisms.
Tests: unit (release)
"
* 'allow_non_key_in_restrictions' of https://github.com/psarna/scylla:
tests: add filtering with IN restriction test
cql3: remove unused can_have_only_one_value function
cql3: allow non-key IN restrictions
We still allow the delete of rows from system.large_partition to run
in parallel with the sstable deletion, but now we return a future that
waits for both.
Tests: unit (release)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190205001526.68774-1-espindola@scylladb.com>
Fixes#4010
Unless user sets this explicitly, we should try explicitly avoid
deprecated protocol versions. While gnutls should do this for
connections initiated thusly, clients such as drivers etc might
use obsolete versions.
Message-Id: <20190107131513.30197-1-calle@scylladb.com>
"
get_compaction_history can return a lot of records which will add up to a
big http reply.
This series makes sure it will not create large allocations when
returning the results.
It adds an api to the query_processor to use paged queries with a
consumer function that returns a future, this way we can use the http
stream after each record.
This implementation will prevent large allocations and stalls.
Fixes#4152
"
* 'amnon/compaction_history_stream_v7' of github.com:scylladb/seastar-dev:
tests/query_processor_test: add query_with_consumer_test
system_keyspace, api: stream get_compaction_history
query_processor: query and for_each_cql_result with future
"
Virtual columns are MV-specific columns that contribute to the
liveness of view rows. However, we were not adding those columns when
creating an index's underlying MV, causing indexes to miss base rows.
Fixes#4144
Branches: master, branch-3.0
"
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
* 'sec-index/virtual-columns/v1' of https://github.com/duarten/scylla:
tests/secondary_index_test: Add reproducer for #4144
index/secondary_index_manager: Add virtual columns to MV
"
We would like to deploy Scylla in constrained environments where
internet access is not permitted. In those environments it is not
possible to acquire the dependencies of Scylla from external repos and
the packages have to be sent alongside with its dependencies.
In older distributions, like CentOS7 there isn't a python3 interpreter
available. And while we can package one from EPEL this tends to break in
practice when installing the software in older patchlevels (for
instance, installing into RHEL7.3 when the latest is RHEL7.5).
The reason for that, as we saw in practice, is that EPEL may
not respect RHEL patchlevels and have the python interpreter depending
on newer versions of some system libraries.
virtualenv can be used to create isolated python enviornments, but it is
not designed for full isolation and I hit at least two roadblocks in
practice:
1) It doesn't copy the files, linking some instead. There is an
--always-copy option but it is broken (for years) in some
distributions.
2) Even when the above works, it still doesn't copy some files, relying
on the system files instead (one sad example was the subprocess
module that was just kept in the system and not moved to the
virtualenv)
This patch solves that problem by creating a python3 environment in a
directory with the modules that Scylla uses, and no other else. It is
essentially doing what vitualenv should do but doesn't. Once this
environment is assembled the binaries are then made relocatable the same
way the Scylla binary is.
One difference (for now) between the Scylla binary relocation process
and ours is that we steer away from LD_LIBRARY_PATH: the environment
variable is inherited by any child process steming from the caller,
which means that we are unable to use the subprocess module to call
system binaries like mkfs (which our scripts do a lot). Instead, we rely
on RUNPATH to tell the binary where to search for its libraries.
Once we generate an archive with the python3 interpreter, we then
package it as an rpm with bare any dependencies. The dependencies listed
are:
$ rpm -qpR scylla-relocatable-python3-3.6.7-1.el7.x86_64.rpm
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PartialHardlinkSets) <= 4.0.4-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rpmlib(PayloadIsXz) <= 5.2-1
And the total size of that rpm, with all modules scylla needs is 20MB.
The Scylla rpm now have a way more modest dependency list:
$ rpm -qpR scylla-server-666.development-0.20190121.80b7c7953.el7.x86_64.rpm | sort | uniq
/bin/sh
curl
file
hwloc
kernel >= 3.10.0-514
mdadm
pciutils
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rpmlib(PayloadIsXz) <= 5.2-1
scylla-conf
scylla-relocatable-python3 <== our python3 package.
systemd-libs
util-linux
xfsprogs
I have tested this end to end by generating RPMs from our master branch,
then installing them in a clean CentOS7.3 installation without even
using yum, just rpm -Uhv <package_list>
Then I called scylla_setup to make sure all python scripts were working
and started Scylla successfully.
"
* 'scylla-python3-v5' of github.com:glommer/scylla:
Create a relocatable python3 interpreter
spec file: fix python3 dependency list.
fixup scripts before installing them to their final location
automatically relocate python scripts
make scyllatop relocatable
use relative paths for installing scylla and iotune binaries
This patch adds a unit test for querying with a consumer function.
query with consumer uses paging, the tests covers the scenarios where
the number of rows bellow and above the page size, it also test the
option to stop in the middle of reading.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
get_compaciton_history can return big chunk of data.
To prevent large memory allocation, the get_compaction_history now read
each compaction_history record and use the http stream to send it.
Fixes#4152
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
query and for_each_cql_result accept a function that reads a row and
return a stop_iterator.
This implementation of those functions gets a function that returns a
future stop_iterator allowing preemption between calls.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
We would like to deploy Scylla in constrained environments where
internet access is not permitted. In those environments it is not
possible to acquire the dependencies of Scylla from external repos and
the packages have to be sent alongside with its dependencies.
In older distributions, like CentOS7 there isn't a python3 interpreter
available. And while we can package one from EPEL this tends to break in
practice when installing the software in older patchlevels (for
instance, installing into RHEL7.3 when the latest is RHEL7.5).
The reason for that, as we saw in practice, is that EPEL may
not respect RHEL patchlevels and have the python interpreter depending
on newer versions of some system libraries.
virtualenv can be used to create isolated python enviornments, but it is
not designed for full isolation and I hit at least two roadblocks in
practice:
1) It doesn't copy the files, linking some instead. There is an
--always-copy option but it is broken (for years) in some
distributions.
2) Even when the above works, it still doesn't copy some files, relying
on the system files instead (one sad example was the subprocess
module that was just kept in the system and not moved to the
virtualenv)
This patch solves that problem by creating a python3 environment in a
directory with the modules that Scylla uses, and no other else. It is
essentially doing what vitualenv should do but doesn't. Once this
environment is assembled the binaries are then made relocatable the same
way the Scylla binary is.
One difference (for now) between the Scylla binary relocation process
and ours is that we steer away from LD_LIBRARY_PATH: the environment
variable is inherited by any child process steming from the caller,
which means that we are unable to use the subprocess module to call
system binaries like mkfs (which our scripts do a lot). Instead, we rely
on RUNPATH to tell the binary where to search for its libraries.
In terms of the python interpreter, PYTHONPATH does not need to be set
for this to work as the python interpreter will include the lib
directory in its PYTHONPATH. To confirm this, we executed the following
code:
bin/python3 -c "import sys; print('\n'.join(sys.path))"
with the interpreter unpacked to both /home/centos/glaubertmp/test/ and
/tmp. It yields respectively:
/home/centos/glaubertmp/test/lib64/python36.zip
/home/centos/glaubertmp/test/lib64/python3.6
/home/centos/glaubertmp/test/lib64/python3.6/lib-dynload
/home/centos/glaubertmp/test/lib64/python3.6/site-packages
and
/tmp/python/lib64/python36.zip
/tmp/python/lib64/python3.6
/tmp/python/lib64/python3.6/lib-dynload
/tmp/python/lib64/python3.6/site-packages
This was tested by moving the .tar.gz generated on my Fedora28 laptop to
a CentOS machine without python3 installed. I could then invoke
./scylla_python_env/python3 and use the interpreter to call 'ls' through
the subprocess module.
I have also tested that we can successfully import all the modules we listed
for installation and that we can read a sample yaml file (since PyYAML depends
on the system's libyaml, we know that this works)
Time to build:
real 0m15.935s
user 0m15.198s
sys 0m0.382s
Final archive size (uncompressed): 81MB
Final archive sie (compressed) : 25MB
Signed-off-by: Glauber Costa <glauber@scylladb.com>
--
v3:
- rewrite in python3
- do not use temporary directories, add directly to the archive. Only the python binary
have to be materialized
- Use --cacheonly for repoquery, and also repoquery --list in a second step to grab the file list
v2:
- do not use yum, resolve dependencies from installed packages instead
- move to scripts as Avi wants this not only for old offline CentOS
The dependency list as it was did not reflect the fact that scyllatop is
now written in python3.
Some packages, like urwid, should use the python3 version. CentOS
doesn't really have an urwid package for python3, not even in EPEL. So
this officially marks the point in which we can't build packages that
will install in CentOS7 anyway.
Luckily, we will soon be providing our own python3 interpreter. But for
now, as a first step, simplify the dependency list by removing the
CentOS/Fedora conditional and listing the full python3 list
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Before installing python files to their final location in install.sh,
replace them with a thunk so that they can work with our python3
interpreter. The way the thunk works, they will also work without our
python3 interpreter so unconditionally fixing them up is always safe.
I opt in this patch for fixing up just at install time to simplify
developer's life, who won't have to worry about this at all.
Note about the rpm .spec file: since we are relying on specific format
for the shebangs, we shouldn't let rpmbuild mess with them. Therefore,
we need to disable a global variable that controls that behavior (by
definition, Fedora rpmbuild will rewrite all shebangs to /usr/bin/python3)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Given a python script at $DIR/script.py, this copies the script to
$DIR/libexec/script.py.bin, fixes its shebang to use /usr/bin/env instead
of an absolute path for the interpreter and replaces the original script
with a thunk that calls into that script.
PYTHONPATH is adjusted so that the original directory containing the script
can also serve as a source of modules, as would be originally intended.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now the binary we distribute with scyllatop calls into
/usr/lib/scylla/scyllatop/scyllatop.py unconditionally. Calling that is
all that this binary does.
This poses a problem to our relocatable process, since we don't want
to be referring to absolute paths (And moreover, that is calling python
whereas it should be calling python3)
The scyllatop.py files includes a python3 shebang and is executable.
Therefore, it is best to just create a link to that file and execute it
directly
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The build times I got with a clean ccache were:
ninja dev 10806.89s user 678.29s system 2805% cpu 6:49.33 total
ninja release 28906.37s user 1094.53s system 2378% cpu 21:01.27 total
ninja debug 18611.17s user 1405.66s system 2310% cpu 14:26.52 total
With this version -gz is not passed to seastar's configure. It should
probably be seastar's configure responsibility to do that and I will
send a separate patch to do it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190204162112.7471-1-espindola@scylladb.com>
"uuid" was ref:ed in a continuation. Works 99.9% of the time because
the continuation is not actually delayed (and assuming we begin the
checks with non-truncated (system) cf:s it works).
But if we do delay continuation, the resulting cf map will be
borked.
Fixes#4187.
Message-Id: <20190204141831.3387-1-calle@scylladb.com>
The multishard reader has to combine the output of all shards into a
single fragment stream. To do that, each time a `partition_start` is
read it has to check if there is another partition, from another shard,
that has to be emitted before this partition. Currently for this it
uses the partitioner. At every partition start fragment it checks if the
token falls into the current shard sub-range. The shard sub-range is the
continuous range of tokens, where each token belongs to the same shard.
If the partition doesn't belong to the current shard sub-range the
multishard reader assumes the following shard sub-range of the next shard
will have data and move over to it. This assumption will however only
stand on very dense tables, and will fail miserably on less dense
tables, resulting in the multishard reader effectively iterating over
the shard sub-ranges (4096 in the worst case), only to find data in just
a few of them. This resulted in high user-perceived latency when
scanning a sparse table.
This patch replaces this algorithm with one based on a shard heap. The
shards are now organized into a min-heap, by the next token they have
data for. When a partition start fragment is read from the current
shard, its token is compared to the smallest token in the shard heap. If
smaller, we continue to read from the current shard. Otherwise we move
to the shard with the smallest token. When constructing the reader, or
after fast-forwarding we don't know what first token each reader will
produce. To avoid reading in a partition from each reader, we assume
each reader will produce the first token from the first shard sub-range
that overlaps with the query range. This algorithm performs much better
on sparse tables, while also being slightly better on dense tables.
I did only a very rough measurement using CQL tracing. I populated a
table with four rows on a 64 shards machine, then scanned the entire
table.
Time to scan the table (microseconds):
before 27'846
after 5'248
Fixes: #4125
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d559f887b650ab8caa79ad4d45fa2b7adc39462d.1548846019.git.bdenes@scylladb.com>
"
This is a first step in fixing #3988.
"
* 'espindola/large-row-warn-only-v4' of https://github.com/espindola/scylla:
Rename large_partition_handler
Print a warning if a row is too large
Remove defaut parameter value
Rename _threshold_bytes to _partition_threshold_bytes
keys: add schema-aware printing for clustering_key_prefix
Three error messages were supposed to include a column name, but a "{}"
was missing in the format so the given column name didn't actually appear
in the error message. So this patch adds the missing {}'s.
Fixes#4183.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190203112100.13031-1-nyh@scylladb.com>
Adds new columns to the "Page spans" table named "large [B]" and
"[spans]", which shows how much memory is allocated in spans of given
size. Excludes spans used by small pools.
Useful in determining what is the size of large allocations which
consume the memory.
Example output:
Page spans:
index size [B] free [B] large [B] [spans]
0 4096 4096 4096 1
1 8192 32768 0 0
2 16384 16384 0 0
3 32768 98304 2785280 85
4 65536 65536 1900544 29
5 131072 524288 471597056 3598
...
31 8796093022208 0 0 0
Large allocations: 484675584 [B]
Message-Id: <1548956406-7601-1-git-send-email-tgrabiec@scylladb.com>
Commit 976324bbb8 changed to use
get_application_state_ptr to get a pointer of the application_state. It
may return nullptr that is dereferenced unconditionally.
In resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test, we saw:
4 nodes in the tests
n1, n2, n3, n4 are started
n1 is stopped
n1 is changed to use different shard config
n1 is restarted ( 2019-01-27 04:56:00,377 )
The backtrace happened on n2 right fater n1 restarts:
0 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature STREAM_WITH_RPC_STREAM is enabled
1 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature WRITE_FAILURE_REPLY is enabled
2 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature XXHASH is enabled
3 WARN 2019-01-27 04:56:05,177 [shard 0] gossip - Fail to send EchoMessage to 127.0.58.1: seastar::rpc::closed_error (connection is closed)
4 INFO 2019-01-27 04:56:05,205 [shard 0] gossip - InetAddress 127.0.58.1 is now UP, status =
5 Segmentation fault on shard 0.
6 Backtrace:
7 0x00000000041c0782
8 0x00000000040d9a8c
9 0x00000000040d9d35
10 0x00000000040d9d83
11 /lib64/libpthread.so.0+0x00000000000121af
12 0x0000000001a8ac0e
13 0x00000000040ba39e
14 0x00000000040ba561
15 0x000000000418c247
16 0x0000000004265437
17 0x000000000054766e
18 /lib64/libc.so.6+0x0000000000020f29
19 0x00000000005b17d9
We do not know when this backtrace happened, but according to log from n3 an n4:
INFO 2019-01-27 04:56:22,154 [shard 0] gossip - InetAddress 127.0.58.2 is now DOWN, status = NORMAL
INFO 2019-01-27 04:56:21,594 [shard 0] gossip - InetAddress 127.0.58.2 is now DOWN, status = NORMAL
We can be sure the backtrace on n2 happened before 04:56:21 - 19 seconds (the
delay the gossip notice a peer is down), so the abort time is around 04:56:0X.
The migration_manager::maybe_schedule_schema_pull that triggers the backtrace
must be scheduled before n1 is restarted, because it dereference
application_state pointer after it sleeps 60 seconds, so the time
maybe_schedule_schema_pull is called is around 04:55:0X which is before n1 is
restarted.
So my theory is: migration_manager::maybe_schedule_schema_pull is scheduled, at this time
n1 has SCHEMA application_state, when n1 restarts, n2 gets new application
state from n1 which does not have SCHEMA yet, when migration_manager::maybe_schedule
wakes up from the 60 sleep, n1 has non-empty endpoint_state but empty
application_state for SCHEMA. We dereference the nullptr
application_state and abort.
Fixes: #4148
Tests: resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test
Message-Id: <9ef33277483ae193a49c5f441486ee6e045d766b.1548896554.git.asias@scylladb.com>
When a repair failed, we saw logs like:
repair - Checksum of range (8235770168569320790, 8235957818553794560] on
127.0.0.1 failed: std::bad_alloc (std::bad_alloc)
It is hard to tell which keyspace and table has failed.
To fix, log the keyspace and table name. It is useful to know when debugging.
Fixes#4166
Message-Id: <8424d314125b88bf5378ea02a703b0f82c2daeda.1548818669.git.asias@scylladb.com>
Stop calling .remove_suffix on empty string_view.
ck_bview can be empty because this function can be
called for a half open range tombstone.
It is impossible to write such range tombstones to LA/KA SSTables
so we should throw a proper exception instead of allowing
an undefined behaviour.
Refs #4113
Tests: unit(release)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c3738916953e4b10812aed95e645c739b4c29462.1548777086.git.piotr@scylladb.com>
All of our python scripts are there and they are all installed
automatically into /usr/lib/scylla. By keeping scylla-housekeeping
separately we are just complicating our build process.
This would be just a minor annoyance but this broke the new relocatable
process for python3 that I am trying to put together because I forgot to
add the new location as a source for the scripts.
Therefore, I propose we start being more diligent with this and keeping
all scripts together for the future.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190123191732.32126-2-glauber@scylladb.com>
When a file in the `seastar` directory changes, we want to minimize the
amount of Scylla artifacts that are re-built while ensuring that all
changes in Seastar are reflected in Scylla correctly.
For compiling object files, we change Seastar to be an "order only"
dependency so that changes to Seastar don't trigger unnecessary builds.
For linking, we add an "implicit" dependency on Seastar so that Scylla
is re-linked when Seastar changes.
With these changes, modifying a Seastar header file will trigger the
recompilation of the affected Scylla object files, and modifying a
Seastar source file will trigger linking only.
Fixes#4171
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <0ab43d79ce0d41348238465d1819d4c937ac6414.1548906335.git.jhaberku@scylladb.com>
Fully expired sstable is not added to compacting set, meaning it's not actually
compacted, but it's kept in a list of sstables which incremental compaction
uses to check if any sstable can be replaced.
Incremental compaction was unconditionally removing expired sstable from compacting
set, which led to segfault because end iterator was given.
The fix is about changing sstable_set::erase() behavior to follow standard one
for erase functions which will works if the target element is not present.
Fixes#4085.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190130163100.5824-1-raphaelsc@scylladb.com>
"
This series enhances perf_simple_query error reporting by adding an
option of producing a json file containing the results. The format of
that file is very similar to the results produces by perf_fast_forward
in order to ease integration with any tools that may want to interpret
them.
In addition to that perf_simple_query now prints to the standard output
median, median absolute deviation, minimum and maximum of the partial
results, so that there is no need for external scripts to compute those
values.
"
* tag 'perf_simple_query-json/v1' of https://github.com/pdziepak/scylla:
perf_simple_query: produce json results
perf_simple_query: calculate and print statistics
perf: time_parallel: return results of each iteration
perf_simple_query: take advantage of threads in main()
"
Recently it was discovered that the memtable reader
(partition_snapshot_reader to be more precise) can violate mutation
fragment monotonicity, by remitting range tombstones when those overlap
with more than one ck range of the partition slice.
This was fixed by 7049cd9, however after that fix was merged a much
simpler fix was proposed by Tomek, one that doesn't involve nearly as
much changes to the partition snapshot reader and hences poses less risk
of breaking it.
This mini-series reverts the previous fix, then applies the new, simpler
one.
Refs: #4104
"
* 'partition-snapshot-reader-simpler-fix/v2' of https://github.com/denesb/scylla:
partition_snapshot_reader: don't re-emit range tombstones overlapping multiple ck ranges
Revert "partition_snapshot_reader: don't re-emit range tombstones overlapping multiple ck ranges"
Committer: Avi Kivity <avi@scylladb.com>
Branch: next
Switch to the the CMake-ified Seastar
This change allows Scylla to be compiled against the `master` branch of
Seastar.
The necessary changes:
- Add `-Wno-error` to prevent a Seastar warning from terminating the
build
- The new Seastar build system generates the pkg-config files (for
example, `seastar.pc`) at configure time, so we don't need to invoke
Ninja to generate them
- The `-march` argument is no longer inherited from Seastar (correctly),
so it needs to be provided independently
- Define `SEASTAR_TESTING_MAIN` so that the definition of an entry
point is included for all unit test compilation units
- Independently link Scylla against Seastar's compiled copy of fmt in
its build directory
- All test files use the (now public) Seastar testing headers
- Add some missing Seastar headers to source files
[avi: regenerate frozen toolchain, adjust seastar submoule]
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <02141f2e1ecff5cbcd56b32768356c3bf62750c4.1548820547.git.jhaberku@scylladb.com>
When entering a new ck range (of the partition-slice), the partition
snapshot reader will apply to its range tombstones stream all the
tombstones that are relevant to the new ck range. When the partition has
range tombstones that overlap with multiple ck ranges, these will be
applied to the range tombstone stream when entering any of the ck ranges
they overlap with. This will result in the violation of the monotonicity
of the mutation fragments emitted by the reader, as these range
tombstones will be re-emitted on each ck range, if the ck range has at
least one clustering row they apply to.
For example, given the following partition:
rt{[1,10]}, cr{1}, cr{2}, cr{3}...
And a partition-slice with the following ck ranges:
[1,2], [3, 4]
The reader will emit the following fragment stream:
rt{[1,10]}, cr{1}, cr{2}, rt{[1,10]}, cr{3}, ...
Note how the range tombstone is emitted twice. In addition to violating
the monotonicity guarantee, this can also result in an explosion of the
number of emitted range tombstones.
Fix by trimming range tombstones to the start of the current ck range,
thus ensuring that they will not violate mutation fragment monotonicity
guarantees.
Refs: #4104
This is a much simpler fix for the above issue, than the already
committed one (7049cd937A). The latter is reverted by the previous
patch and this patch applies the simpler fix.
docs/metrics.md so far explained just the REST API for retrieving current
metrics from a single Scylla node. In this patch, I add basic explanations
on how to use the Prometheus and Grafana tools included in the
"scylla-grafana-monitoring" project.
It is true that technically, what is being explained here doesn't come
with the Scylla project and requires the separate scylla-grafana-monitoring
to be installed as well. Nevertheless, most Scylla developers will need this
knowledge eventually and suprisingly it appears it was never documented
anywhere accessible to newbie developers, and I think metrics.md is the
right place to introduce it.
In fact, I myself wasn't aware until today that Prometheus actually had
its own Web UI on port 9090, and that it is probably more useful for
developers than Grafana is.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reviewed-by: Botond Denes <bdenes@scylladb.com>
Message-Id: <20190129114214.17786-1-nyh@scylladb.com>
"
An error in validating CONTAINS restrictions against collections caused
only the first restriction to be taken into account due to returning
prematurely.
This miniseries provides a fix for that as well as a matching test case.
Tests: unit (release)
Fixes#4161
"
* 'fix_multiple_contains_for_one_column' of https://github.com/psarna/scylla:
tests: enable CONTAINS tests for filtering
cql3: remove premature return from is_satisfied_by
cql3: restore indentation
Tests for filtering with CONTAINS restrictions were not enabled,
so they are now. Also, another case for having two CONTAINS restrictions
for a single column is added.
Refs #4161
Function which checked whether a CONTAINS restriction is satisfied
by a collection erroneously returned prematurely after checking
just the first restriction - which works fine for the usual case,
but fails if there are multiple CONTAINS restrictions present
for a column.
Fixes#4161
The value is already passed by cql_table_large_partition_handler, so
the default was just for nop_large_partition_handler.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
For reporting large rows we have to be able to print clustering keys
in addition to partition keys.
Refs #3988.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Use the ":z" suffix to tell Docker to relabel file objets on shared
volumes. Fixes accessing filesystem via dbuild when SELinux is enabled.
Message-Id: <20190128160557.2066-1-penberg@scylladb.com>
"
Cleanup of temporary sstable directories in distributed_loader::populate_column_family
is completely broken and non tested. This code path was never executed since
populate_column_family doesn't currently list subdirectories at all.
This patchset fixes this code path and scans subdirectories in populate_column_family.
Also, a unit test is added for testing the cleanup of incomplete (unsealed) sstables.
Fixes: #4129
"
* 'projects/sst-temp-dir-cleanup/v3' of https://github.com/bhalevy/scylla:
tests: add test_distributed_loader_with_incomplete_sstables
tests: single_node_cql_env::do_with: use the provided data_file_directories path if available
tests: single_node_cql_env::_data_dir is not used
distributed_loader: populate_column_family should scan directories too
sstables: fix is_temp_dir
distributed_loader: populate_column_family: ignore directories other than sstable::is_temp_dir
distributed_loader: remove temporary sstable directories only on shard 0
distributed_loader: push future returned by rmdir into futures vector
"
This series prevents view building to fall back to storing hints.
Instead, it will try to send hints to an endpoint as if it has
consistency level ONE, and in case of failure retry the whole
building step. Then, view building will never be marked as finished
prematurely (because of pending hints), which will help avoid
creating inconsistencies when decommissioning a node from the cluster.
Tests:
unit (release)
dtest (materialized_views_test.py.*)
Fixes#3857Fixes#4039
"
* 'do_not_mark_view_as_built_with_hints_7' of https://github.com/psarna/scylla:
db,view: add updating view_building_paused statistics
database: add view_building_paused metrics
table: make populate_views not allow hints
db,view: add allow_hints parameter to mutate_MV
storage_proxy: add allow_hints parameter to send_to_endpoint
View building uses populate_views to generate and send view updates.
This procedure will now not allow hints to be used to acknowledge
the write. Instead, the whole building step will be retried on failure.
Fixes#3857Fixes#4039
Despite the name, this option also controls if a warning is issued
during memtable writes.
Warning during memtable writes is useful but the option name also
exists in cassandra, so probably the best we can do is update the
description.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190125020821.72815-1-espindola@scylladb.com>
We need to follow changes of rpm package build procedure on
-jmx/-tools/-ami packages, since it have been changed when we merged
relocatable pacakge.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190127204436.13959-1-syuu@scylladb.com>
On Scylla 3rdparty tools, we add /opt/scylladb/lib to LD_LIBRARY_PATH.
We use same directory for relocatable binaries, including libc.so.6.
Once we install both scylla-env package and relocatable version of scylla-server package, the loader tries to load libc from /opt/scylladb/lib then entire distribution become unusable.
We may able to use Obsoletes or Conflict tag on .rpm/.deb to avoid
install new Scylla package with scylla-env, but it's better & safer not to share
same directory for different purpose.
Fixes#3943
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190128023757.25676-1-syuu@scylladb.com>
Make it easier to work interactively by not reporting surprising times.
There are also reports that dtest fails with incorrect timezones, but those
are probably bugs in dtest.
Message-Id: <20190127134754.1428-1-avi@scylladb.com>
Virtual columns are MV-specific columns that contribute to the
liveness of view rows. However, we were not adding those columns when
creating an index's underlying MV, causing indexes to miss base rows.
Fixes#4144
Branches: master, branch-3.0
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Test removal of sstables with temporary TOC file,
with and without temporary sstable directory.
Temporary sstable directories may be empty or still have
leftover components in them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
1. fs::canonical required that the path will exist.
and there is no need for fs::canonical here.
2. fs::path::extension will return the leading dot.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
populate_column_family currently lists only regular files. ignoring all directories.
A later patch in this series allows it to list also directories so to cleanup
the temporary sstable directories, yet valid sub-directories, like staging|upload|snapshots,
may still exist and need to be ignored.
Other kinds of handling, like validating recgnized sub-directories and halting on
unrecognized sub-directories are possible, yet out of scope for this patch(set).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Similar to calling remove_sstable_with_temp_toc later on in
populate_column_family(), we need only one thread to do the
cleanup work and the existing convention is that it's shard 0.
Since lister::rmdir is checking remove_file of all entries
(recursively) and the dir itself, doing that concurrently would fail.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
Use int64_t in data::cell for expiry / deletion time.
Extend time_overflow unit tests in cql_query_test to use
select statements with and without bypass cache to access deeper
into the system.
Refs #3353
"
* 'projects/gc_clock_64_fixes/v1' of https://github.com/bhalevy/scylla:
tests: extend time_overflow unit tests
data::cell: use int64_t for expiry and deletion time
From day1, scylla_setup can be run either iteractively or through
command line parameters. Still, one of the requests we are asked the
most from users is whether we can provide them with a version of
scylla_setup that they can call from their scripts.
This probably happens because once you call a script interactively,
it may not be totally obvious that a different mode is available.
Even when we do tell users about that possibility, the request number
two is then "which flags do I pass?"
The solution I am proposing is to just tell users the answers to those
qestions at the end of an interactive session. After this patch, we
print the following message to the console:
ScyllaDB setup finished.
scylla_setup accepts command line arguments as well! For easily provisioning in a similar environmen than this, type:
scylla_setup --no-raid-setup --nic eth0 --no-kernel-check \
--no-verify-package --no-enable-service --no-ntp-setup \
--no-node-exporter --no-fstrim-setup
Also, to avoid the time-consuming I/O tuning you can add --no-io-setup and copy the contents of /etc/scylla.d/io*
Only do that if you are moving the files into machines with the exact same hardware
Notes on the implementation: it is unfortunate for these purposes that
all our options are negated. Most conditionals are branching on true
conditions, so although I could write this:
args.no_option = not interactive_ask_service(...)
if not args.no_option:
...
I opted in this patch to write:
option = interactive_ask_service(...)
args.no_option = not option
if option:
...
There is an extra line and we have to update args separately, but it
makes it less hard to get confused in the conditional with the double
negation. Let me know if there are disagreements here.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190124153832.21140-1-glauber@scylladb.com>
"
I've recently had to work around types.hh/types.cc files and had
very unpleasent experience with incremental build on every change
to types.hh. It took ~30 min on my machine which is almost as much
as the clean build.
I looked around and it turns out that types.hh contains the whole
hierarchy of the types. On the same time, many places access the
types only through abstract_type which is the root of the
hierarchy.
This patchset extracts user_type_impl, tuple_type_impl,
map_type_impl, set_type_impl, list_type_impl and
collection_type_impl from types.hh and places each of them
in a separate header.
The result of this is that change in user_type_impl causes now
incremental build of ~6 min instead of ~30 min.
Change to tuple_type_impl causes incremental build of ~7.5 min
instead of ~30 min and change to map_type_impl triggers incremental
build that takes ~20 min instead of ~30 min.
Tests: unit(release)
"
* 'haaawk/types_build_speedup_2/rfc/2' of github.com:scylladb/seastar-dev:
Stop including types/list.hh in cql3/tuples.hh
Stop including types/set.hh into cql3/sets.hh
Move collection_type_impl out of types.hh to types/collection.hh
Move set_type_impl out of types.hh to types/set.hh
Move list_type_impl out of types.hh to types/list.hh
Move map_type_impl out of types.hh to types/map.hh
Move tuple_type_impl from types.hh to types/tuple.hh
Decouple database.hh from types/user.hh
Allow to use shared_ptr with incomplete type other than sstable
Move user_type_impl out of types.hh to types/user.hh
This commit declares shared_ptr<user_types_metadata> in
database.hh were user_types_metadata is an incomplete type so
it requires
"Allow to use shared_ptr with incomplete type other than sstable"
to compile correctly.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
When seastar/core/shared_ptr_incomplete.hh is included in a header
then it causes problems with all declarations of shared_ptr<T> with
incomplete type T that end up in the same compilation unit.
The problem happens when we have a compilation unit that includes
two headers a.hh and b.hh such that a.hh includes
seastar/core/shared_ptr_incomplete.hh and b.hh declares
shared_ptr<T> with incomplete type T. On the same time this
compilation unit does not use declared shared_ptr<T> so it should
compile and work but it does not because shared_ptr_incomplete.hh
is included and it forces instantiation of:
template <typename T>
T*
lw_shared_ptr_accessors<T,
void_t<decltype(lw_shared_ptr_deleter<T>{})>>::to_value(lw_shared_ptr_counter_base*
counter) {
return static_cast<T*>(counter);
}
for each declared shared_ptr<T> with incomplete type T. Even the once
that are never used.
Following commit "Decouple database.hh from types/user.hh"
moves user_types_metadata type out of database.hh and instead
declares shared_ptr<user_types_metadata> in database.hh where
user_types_metadata is incomplete. Without this commit
the compilation of the following one fails with:
In file included from ./sstables/sstables.hh:34,
from ./db/size_estimates_virtual_reader.hh:38,
from db/system_keyspace.cc:77:
seastar/include/seastar/core/shared_ptr_incomplete.hh: In
instantiation of ‘static T*
seastar::internal::lw_shared_ptr_accessors<T,
seastar::internal::void_t<decltype
(seastar::lw_shared_ptr_deleter<T>{})>
>::to_value(seastar::lw_shared_ptr_counter_base*) [with T =
user_types_metadata]’:
seastar/include/seastar/core/shared_ptr.hh:243:51: required from
‘static void seastar::internal::lw_shared_ptr_accessors<T,
seastar::internal::void_t<decltype
(seastar::lw_shared_ptr_deleter<T>{})>
>::dispose(seastar::lw_shared_ptr_counter_base*) [with T =
user_types_metadata]’
seastar/include/seastar/core/shared_ptr.hh:300:31: required from
‘seastar::lw_shared_ptr<T>::~lw_shared_ptr() [with T =
user_types_metadata]’
./database.hh:1004:7: required from ‘static void
seastar::internal::lw_shared_ptr_accessors_no_esft<T>::dispose(seastar::lw_shared_ptr_counter_base*)
[with T = keyspace_metadata]’
seastar/include/seastar/core/shared_ptr.hh:300:31: required from
‘seastar::lw_shared_ptr<T>::~lw_shared_ptr() [with T =
keyspace_metadata]’
./db/size_estimates_virtual_reader.hh:233:67: required from here
seastar/include/seastar/core/shared_ptr_incomplete.hh:38:12: error:
invalid static_cast from type ‘seastar::lw_shared_ptr_counter_base*’
to type ‘user_types_metadata*’
return static_cast<T*>(counter);
^~~~~~~~~~~~~~~~~~~~~~~~
[131/415] CXX build/release/distributed_loader.o
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Currently nop_large_partition_handler is only used in tests, but it
can also be used avoid self-reporting.
Tests: unit(Release)
I also tested starting scylla with
--compaction-large-partition-warning-threshold-mb=0.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190123205059.39573-1-espindola@scylladb.com>
"
This series is a first small step towards rewriting
CQL restrictions layer. Primary key restrictions used to be
a template that accepts either partition_key or clustering_key,
but the implementation is already based on virtual inheritance,
so in multiple cases these templates need specializations.
Refs #3815
"
* 'detemplatize_primary_key_restrictions_2' of https://github.com/psarna/scylla:
cql3: alias single_column_primary_key_restrictions
cql3: remove KeyType template from statement_restrictions
cql3: remove template from primary_key_restrictions
cql3: remove forwarding_primary_key_restrictions
In preparation for detemplatizing this class, it's aliased with
single_column_partition_key restrictions and
single_column_clustering_key_restrictions accordingly.
Partition key restrictions and clustering key restrictions
currently require virtual function specializations and have
lots of distinct code, so there's no value in having
primary_key_restrictions<KeyType> template.
libdeflate's build places some object files in the source directory, which is
shared between the debug and release build. If the same object file (for the two
modes) is written concurrently, or if one more reads it while the other writes it,
it will be corrupted.
Fix by not building the executables at all. They aren't needed, and we already
placed the libraries' objects in the build directory (which is unshared). We only
need the libraries anyway.
Fixes#4130.
Branches: master, branch-3.0
Message-Id: <20190123145435.19049-1-avi@scylladb.com>
Commit fd422c954e aimed to fix
issue #3803. In that issue, if a query SELECTed only certain columns but
did filtering (ALLOW FILTERING) over other unselected columns, the filtering
didn't work. The fix involved adding the columns being filtered to the set
of columns we read from disk, so they can be filtered.
But that commit included an optimization: If you have clustering keys
c1 and c2, and the query asks for a specific partition key and c1 < 3 and
c2 > 3, the "c1 < 3" part does NOT need to be filtered because it is already
done as a slice (a contiguous read from disk). The committed code erroneously
concluded that both c1 and c2 don't need to be filtered, which was wrong
(c2 *does* need to be read and filtered).
In this patch, we fix this optimization. Previously, we used the "prefix
length", which in the above example was 2 (both c1 and c2 were filtered)
but we need a new and more elaborate function,
num_prefix_columns_that_need_not_be_filtered(), to determine we can only
skip filtering of 1 (c1) and cannot skip the second.
Fixes#4121. This patch also adds a unit test to confirm this.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190123131212.6269-1-nyh@scylladb.com>
If docker sees the Dockerfile hasn't changed it may reuse an old image, not
caring that context files and dependent images have in fact changed. This can
happen for us if install-dependencies.sh or the base Fedora image changed.
To make sure we always get a correct image, add --no-cache to the build command.
Message-Id: <20190122185042.23131-1-avi@scylladb.com>
Done in a separate step so we can update the toolchain first.
dnf-utils is used to bring us repoquery, which we will use to derive the
list of files in the python packages.
patchelf is needed so we can add a DT_RUNPATH section to the interpreter
binary.
the python modules, as well as the python3 interpreter are taken from
the current RPM spec file.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
[avi: regenerate frozen toolchain image]
Message-Id: <20190123011751.14440-1-glauber@scylladb.com>
"
This series prepares for the integration of the `master` branch of
Seastar back into Scylla.
A number of changes to the existing build are necessary to integrate
Seastar correctly, and these are detailed in the individual change
messages.
I tested with and without DPDK, in release and debug mode.
The actual switch is a separate patch.
"
* 'jhk/seastar_cmake/v4' of https://github.com/hakuch/scylla:
build: Fix link order for DPDK
tests: Split out `sstable_datafile_test`
build: Remove unnecessary inclusion
tests: Fix use-after-free errors in static vars
build: Remove Seastar internals
build: Only use Seastar flags from pkg-config
build: Query Seastar flags using pkg-config
build: Change parameters for `pkg_config` function
"
Cache cf mappings when breaking in the middle of a segment sending so
that the sender has them the next time it wants to send this segment
for where it left off before.
Also add the "discard" metric so that we can track hints that are being
discarded in the send flow.
"
Fixes#4122
* 'hinted_handoff_cache_cf_mappings-v1' of https://github.com/vladzcloudius/scylla:
hinted handoff: cache column family mappings for segments that were not sent out in full
hinted handoff: add a "discarded" metric
Each `*_test.cc` file must be compiled separately so that there is only
one definition of `main`.
This change correctly defines an independent `sstable_datafile_test`
from `sstable_datafile_test.cc` and adds that test to the existing
suite.
We don't need to re-specify Seastar internals in Scylla's build, since
everything private to Seastar is managed via pkg-config.
We can eliminate all references to ragel and generated ragel header
files from Seastar.
We can also simplify the dependence on generated Seastar header files by
ensuring that all object files depend on Seastar being built first.
Some Seastar-specific flags were manually specified as Ninja rules, but
we want to rely exclusively on Seastar for its necessary flags.
The pkg-config file generated by the latest version of Seastar is
correct and allows us to do this, but the version generated by Scylla's
current check-out of Seastar does not. Therefore, we have to manually
adjust the pkg-config results temporarily until we update Seastar.
Previously, we manually parsed the pkg-config file. We now used
pkg-config itself to get the correct build flags.
This means that we will get the correct behavior for variable expansion,
and fields like `Requires`, `Requires.private`, and `Libs.private`.
Previously, these fields were ignored.
We will try to send a particular segment later (in 1s) from the place
where we left off if it wasn't sent out in full before. However we may miss
some of column family mappings when we get back to sending this file and
start sending from some entry in the middle of it (where we left off)
if we didn't save column family mappings we cached while reading this segment
from its begining.
This happens because commitlog doesn't save a column family information
in every entry but rather once for each uniq column family (version) per
"cycle" (see commitlog::segment description for more info).
Therefore we have to assume that a particular column family mapping
appears once in the whole segment (worst case). And therefore, when we
decide to resume sending a segment we need to keep the column family
mappings we accumulated so far and drop them only after we are done with
this particular segment (sent it out in full).
Fixes#4122
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Account the amount of hints that were discarded in the send path.
This may happen for instance due to a schema change or because a hint
being to old.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"
wrap around on 2038-01-19 03:14:07 UTC. Such dates are valid deletion
times starting 2018-01-19 with the 20 years long maximum ttl.
This patchset extends gc_clock::duration::rep to int64_t and adds
respective unit tests for the max_ttl cases.
Fixes#3353
Tests: unit (release)
"
* 'projects/gc_clock_64/v2' of https://github.com/bhalevy/scylla:
tests: cql_query_test add test_time_overflow
gc_clock: make 64 bit
sstables: mc: use int64_t for local_deletion_time and ttl
sstables: add capped_tombstone_deletion_time stats counter
sstables: mc: cap partition tombstone local_deletion_time to max
sstables: add capped_local_deletion_time stats counter
sstables: mc: metadata collector: cap local_deletion_time at max
sstables: mc: use proper gc_clock types for local_deletion_time and ttl
db: get default_time_to_live as int32_t rather than gc_clock::rep
sstables: safely convert ttl and local_deletion_time to int32_t
sstables: mc: move liveness_info initialization to members
sstables: mc: move parsing of liveness_info deltas to data_consume_rows_context_m
sstables: mc: define expired_liveness_ttl as signed int32_t
sstables: mc: change write_delta_deletion_time to receive tombstone rather than deletion_time
sstables: mc: use gc_clock types for writing delta ttl and local_deletion_time
Commit 019a2e3a27 marked some arguments as required, which improved
the usability of scylla_setup.
The problem is that when we call scylla_setup in interactive mode,
no argument should be required. After the aforementioned commit
scylla_setup will either complain that the required arguments were
not passed if zero arguments are present, or skip interactive mode
if one of the mandatory ones is present.
This patch fixes that by checking whether or not we were invoked with
no command line arguments and lifting the requirements for mandatory
arguments in that case.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190122003621.11156-1-glauber@scylladb.com>
deletion_time struct as int32_t deletion_time that cannot hold long
time values. Cap local_deletion_time to max_local_deletion_time and
log a warning about that,
This corresponds to Cassandra's MAX_DELETION_TIME.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
max local_deletion_time_tracker in stats is int32_t so just track the limit
of (max int32_t - 1) if time_point is greater than the limit.
This corresponds to Cassandra's MAX_DELETION_TIME.
Refs #3353
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
mc format only writes delta local_deletion_time of tombstones.
Conventional deletion_time is written only for the partition header.
Restructure the code to pass a tombstone to write_delta_deletion_time
rather than struct deletion_time to prepare for using 64-bit deletion times.
The tombstone uses gc_clock::time_point while struct
deletion_time is limited to int32_t local_deletion_time.
Note that for "live" tombstones we encode <api::missing_timestamp,
no_deletion_time> as was previously evaluated by to_deletion_time().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When the reclaim request was satisfied from the pool there's no need
to call compact_and_evict_locked(). This allows us to avoid calling
boost::range::make_heap(), which is a tiny performance difference, as
well as some confusing log messages.
Message-Id: <1548091941-8534-1-git-send-email-tgrabiec@scylladb.com>
We can invoke pkg-config with multiple options, and we specify the
package name first since this is the "target" of the pkg-config query.
Supporting multiple options is necessary for querying Seastar's
pkg-config file with `--static`, which we anticipate in a future change.
The system won't work properly if IOTune is not run. While it is fair
to skip this step because it takes long-- indeed, it is common to provision
io.conf manually to be able to skip this step, first time users don't know
this and can have the impression that this is just a totally optional step.
Except the node won't boot up without it.
As a user nicely put recently in our mailing list:
"...in this case, it would be even simpler to forbid answering "no"
to this not-so-optional step :)"
We should not forbid saying no to IOTune, but we should warn the user
about the consequences of doing so.
Fixes#4120
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190121144506.17121-1-glauber@scylladb.com>
Recently we had a bug (#4096) due to a component
(`multishard_mutation_query()`) assuming that all reads used the
semaphore obtainable via `database::user_read_concurrency_sem()`.
This problem revealed that it is plain wrong to allow access to the
shard-global semaphores residing in the database object. Instead all
code wishing to access the relevant semaphore for some read, should do
so via the relevant `table` object, thus guaranteeing that it will get
the correct semaphore, configured for that table.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4f3a6780eb3240822db34aba7c1ba0a675a96592.1547734212.git.bdenes@scylladb.com>
Scylla starts doing IO much earlier that it starts cql/thrift servers.
The IO may cause an error that will try stop all servers, but since they
are still not running it will do nothing, but servers will be started
later. Fix it by checking that the node is not isolated before starting
servers.
Message-Id: <20190110152830.GE3172@scylladb.com>
Many area of the code are splattered with unneeded templates. This patchset replaces
some of them, where the template parameter is a function object, with an std::function
or noncopyable_function (with a preference towards the latter; but it is not always
possible). As the template is compiled for each instantiation (if the function
object is a lambda) while a function is compiled only once, there are significant
savings in compile time and bloat.
text data bss dec hex filename
85160690 42120 284910 85487720 5187068 scylla.before
84824762 42120 284910 85151792 5135030 scylla.after
* https://github.com/avikivity/scylla detemplate/v2:
api/commitlog: de-template acquire_cl_metric()
database: de-template do_parse_schema_tables
database: merge for_all_partitions and for_all_partitions_slow
hints: de-template scan_for_hints_dirs()
schema_tables: partially de-template make_map_mutation()
distributed_loader: de-template
tests: commitlog_test: de-template
tests: cql_auth_query_test: de-template
test: de-template eventually() and eventually_true()
tests: flush_queue_test: de-template
hint_test: de-template
tests: mutation_fragment_test: de-template
test: mutation_test: de-template
"
This miniseries fixes the behaviour of distributed loader,
which now unconditionally mutates new sstables found in /upload
dir to LCS level 0 first, and only after that proceeds with
either queueing them for update generation or moving them
to data directory.
"
* 'restore_always_mutating_sstables_level_0' of https://github.com/psarna/scylla:
distributed_loader: restore indentation
distributed_loader: restore always mutating to level 0
Fix runtime error: signed integer overflow
introduced by 2dc3776407
Delta-encoded values may wrap around if the encoded value is
less than the base value. This could happen in two places:
In the mc-format serialization header itself, where the base values are implicit
Cassandra epoch time, and in the sstables data files, where the base values
are taken from the encoding_stats (later written to the serialization_header).
In these cases, when the calculation is done using signed integer/long we may see
"runtime error: signed integer overflow" messages in debug mode
(with -fsanitize=undefined / -fsanitize=signed-integer-overflow).
Overflow here is expected and harmless since we do not gurantee that
neither the base values in the serialization header are greater than
or equal to Cassandra's epoch now that the delta-encoded values are
always greater than or equal to the respective base values in
the serialization header.
To prevent these warnings, the subtraction/addition should be done with unsigned
(two's complement) arithmetic and the result converted to the signed type.
Note that to keep the code simple where possible, when also rely on implicit
conversion of signed integers to unsigned when either one of added value is unsigned
and the other is signed.
Fixes: #4098
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190120142950.15776-1-bhalevy@scylladb.com>
The internal test_propagation template is instantiated many times. Replace
with an oridinary function to reduce bloat. Call sites adjusted to have a
uniform signature.
distributed_loader has several large templates that can be converted to normal
function with the help of noncopyable_function<>, reducing code bloat.
One of the lambdas used as an actual argument was adjusted, because the de-templated
callee only accepts functions returning a future, while the original accepted both
functions returning a future and functions returning void (similar to future::then).
This long slow-path function is called four times, so de-templating it is an
easy win. We use std::function instead of noncopyable_function because the
function is copied within the parallel_for_each callback. The original code
uses a move, which is incorrect, but did not fail because moving the lambdas
that were used as the actual arguments is equivalent to a copy.
Otherwise read test results for subsequent datasets will override each other.
Also, rename population test case to not include dataset name, which
is now redundant.
Message-Id: <1547822942-9690-1-git-send-email-tgrabiec@scylladb.com>
When entering a new ck range (of the partition-slice), the partition
snapshot reader will apply to its range tombstones stream all the
tombstones that are relevant to the new ck range. When the partition has
range tombstones that overlap with multiple ck ranges, these will be
applied to the range tombstone stream when entering any of the ck ranges
they overlap with. This will result in the violation of the monotonicity
of the mutation fragments emitted by the reader, as these range
tombstones will be re-emitted on each ck range, if the ck range has at
least one clustering row they apply to.
For example, given the following partition:
rt{[1,10]}, cr{1}, cr{2}, cr{3}...
And a partition-slice with the following ck ranges:
[1,2], [3, 4]
The reader will emit the following fragment stream:
rt{[1,10]}, cr{1}, cr{2}, rt{[1,10]}, cr{3}, ...
Note how the range tombstone is emitted twice. In addition to violating
the monotonicity guarantee, this can also result in an explosion of the
number of emitted range tombstones.
Fix by applying only those range tombstones to the range tombstone
stream, that have a position strictly greater than that of the last
emitted clustering row (or range tombstone), when entering a new ck
range.
Fixes: #4104
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <e047af76df75972acb3c32c7ef9bb5d65d804c82.1547916701.git.bdenes@scylladb.com>
At the moment are inefficiencies in how
collection_type_impl::mutation::compact_and_expire( handles tombstones.
If there is a higher-level tombstone that covers the collection one
(including cases where there is no collection tombstone) it will be
applied to the collection tombstone and present in the compaction
output. This also means that the collection tombstone is never dropped
if fully covered by a higher-level one.
This patch fixes both those problems. After the compaction the
collection tombstone is either unchanged or removed if covered by a
higher-level one.
Fixes#4092.
Message-Id: <20190118174244.15880-1-pdziepak@scylladb.com>
Use std::function instead of a template parameter. Likely doesn't gain
anyting, because the template was always instantiated with the same type
(the result of std::bind() with the same signatures), but still good practice.
std::function was used instead of noncopyable_function because
sharded::map_reduce0() copies the input function.
"
Use input sstables stats metadata to re-calculate encoding_stats.
Fixes#3971.
"
* 'projects/compaction-encoding-stats/v3' of https://github.com/bhalevy/scylla:
compaction: mc: re-calculate encoding_stats based on column stats
memtable: extract encoding_stats_collector base class to encoding_stats header file
The compilation fails on -Warray-bounds, even though the branch is never taken:
inlined from ‘managed_bytes::managed_bytes(bytes_view)’ at ./utils/managed_bytes.hh:195:22,
inlined from ‘managed_bytes::managed_bytes(const bytes&)’ at ./utils/managed_bytes.hh:162:77,
inlined from ‘dht::token dht::bytes_to_token(bytes)’ at dht/random_partitioner.cc:68:57,
inlined from ‘dht::token dht::random_partitioner::get_token(bytes)’ at dht/random_partitioner.cc:85:39:
/usr/include/c++/8/bits/stl_algobase.h:368:23: error: ‘void* __builtin_memmove(void*, const void*, long unsigned int)’ offset 16 from the object at ‘<anonymous>’ is out of the bounds of referenced subobject ‘managed_bytes::small_blob::data’ with type ‘signed char [15]’ at offset 0 [-Werror=array-bounds]
__builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Work around by disabling the diagnostic locally.
Message-Id: <1547205350-30225-1-git-send-email-tgrabiec@scylladb.com>
Many area of the code are splattered with unneeded templates. This patchset replaces
some of them, where the template parameter is a function object, with an std::function
or noncopyable_function (with a preference towards the latter; but it is not always
possible). As the template is compiled for each instantiation (if the function
object is a lambda) while a function is compiled only once, there are significant
savings in compile time and bloat.
text data bss dec hex filename
85160690 42120 284910 85487720 5187068 scylla.before
84824762 42120 284910 85151792 5135030 scylla.after
* https://github.com/avikivity/scylla detemplate/v1:
api/commitlog: de-template acquire_cl_metric()
database: de-template do_parse_schema_tables
database: merge for_all_partitions and for_all_partitions_slow
hints: de-template scan_for_hints_dirs()
schema_tables: partially de-template make_map_mutation()
distributed_loader: de-template
tests: commitlog_test: de-template
tests: cql_auth_query_test: de-template
test: de-template eventually() and eventually_true()
tests: flush_queue_test: de-template
hint_test: de-template
tests: mutation_fragment_test: de-template
test: mutation_test: de-template
When introducing view update generation path for sstables
in /upload directory, mutating these sstables was moved
to regular path only. It was wrong, because sstables that
need view updates generated from them may still need
to be downgraded to LCS level 0, so they won't disrupt
LCS assumptions after being loaded.
Reported-by: Nadav Har'El <nyh@scylladb.com>
The internal test_propagation template is instantiated many times. Replace
with an oridinary function to reduce bloat. Call sites adjusted to have a
uniform signature.
Use noncopyable_function instead of a template parameter. Likely doesn't gain
anyting, because the template was always instantiated with the same type
(the result of std::bind() with the same signatures), but still good practice.
The multishard mutation query used the semaphore obtained from
`database::user_read_concurrency_sem()` to pause-resume shard readers.
This presented a problem when `multishard_mutation_query()` was reading
from system tables. In this case the readers themselves would obtain
their permits from the system read concurrency semaphore. Since the
pausing of shard readers used the user read semaphore, pausing failed to
fulfill its objective of alleviating pressure on the semaphore the reads
obtained their permits from. In some cases this lead to a deadlock
during system reads.
To ensure the correct semaphore is used for pausing-resuming readers,
obtain the semaphore from the `table` object. To avoid looking up the
table on every pause or resume call, cache the semaphores when readers
are created.
Fixes: #4096
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <c784a3cd525ce29642d7216fbe92638fa7884e88.1547729119.git.bdenes@scylladb.com>
d2dbbba139 converted scyllatop's interperter to Python 3, but neglected to do
the actual conversion. This patch does so, by running 2to3 over allfiles and adding
an additional bytes->string decode step in prometheus.py. Superfluous 2to3 changes
to print() calls were removed.
Message-Id: <20190117124121.7409-1-avi@scylladb.com>
"
Before this series the limit was applied per page instead
of globally, which might have resulted in returning too many
rows.
To fix that:
1. restrictions filter now has a 'remaining' parameter
in order to stop accepting rows after enough of them
have already been accepted
2. pager passes its row limit to restrictions filter,
so no more rows than necessary will be served to the client
3. results no longer need to be trimmed on select_statement
level
Tests: unit (release)
"
* 'fix_filtering_limit_with_paging_3' of https://github.com/psarna/scylla:
tests: add filtering+limit+paging test case
tests: allow null paging state in filtering tests
cql3: fix filtering with LIMIT with regard to paging
Previously the utility to extract paging state asserted
that the state exists, but in future tests it would be useful
to be able to call this function even if it would return null.
Previously the limit was erroneously applied per page
instead of being accumulated, which might have caused returning
too many rows. As of now, LIMIT is handled properly inside
restrictions filter.
Fixes#4100
When compacting several sstables, get and merge their encoding_stats
for encoding the result.
Introduce sstable::get_encoding_stats_for_compaction to return encoding_stats
based on the sstable's column stats.
Use encoding_stats_collector to keep track of the minimum encoding_stats
values of all input sstables.
Fixes#3971
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Number of rows sent and received
- tx_row_nr
- rx_row_nr
Bytes of rows sent and received
- tx_row_bytes
- rx_row_bytes
Number of row hashes sent and received
- tx_hashes_nr
- rx_hashes_nr
Number of rows read from disk
- row_from_disk_nr
Bytes of rows read from disk
- row_from_disk_bytes
Message-Id: <d1ee6b8ae8370857fe45f88b6c13087ea217d381.1547603905.git.asias@scylladb.com>
"
This series adds generating view updates from sstables added through
/upload directory if their tables have accompanying materialized views.
Said sstables are left in /upload directory until updates are generated
from them and are treated just like staging sstables from /staging dir.
If there are no views for a given tables, sstables are simply moved
from /upload dir to datadir without any changes.
Tests: unit (release)
"
* 'add_handling_staging_sstables_to_upload_dir_5' of https://github.com/psarna/scylla:
all: rename view_update_from_staging_generator
distributed_loader: fix indentation
service: add generating view updates from uploaded sstables
init: pass view update generator to storage service
sstables: treat sstables in upload dir as needing view build
sstables,table: rename is_staging to requires_view_building
distributed_loader: use proper directory for opening SSTable
db,view: make throttling optional for view_update_generator
"
This series addresses the problem mentioned in issue 4032, which is a race
between creating a view and streaming sstables to a node. Before this patch
the following scenario is possible:
- sstable X arrives from a streaming session
- we decide that view updates won't be generated from an sstable X
by the view builder
- new view is created for the table that owns sstable X
- view builder doesn't generate updates from sstable X, even though the table
has accompanying views - which is an inconsistency
This race is fixed by making the view builder wait for all ongoing streams,
just like it does for reads and writes. It's implemented with a phaser.
Tests:
unit (release)
dtest(not merged yet: materialized_views_test.TestMaterializedViews.stream_from_repair_during_build_process_test)
"
* 'add_stream_phasing_2' of https://github.com/psarna/scylla:
repair: add stream phasing to row level repair
streaming: add phasing incoming streams
multishard_writer: add phaser operation parameter
view: wait for stream sessions to finish before view building
table: wait for pending streams on table::stop
database: add pending streams phaser
SSTables loaded to the system via /upload dir may sometimes be needed
to generate view updates from them (if their table has accompanying
views).
Fixes#4047
In some cases, sstables put in the upload dir should have view updates
generated from them. In order to avoid moving them across directories
(which then involves handling failure paths), upload dir will also be
treated as a valid directory where staging sstables reside.
Regular sstables that are not needed for view updates will be
immediately moved from upload/ dir as before.
Previous implementation assumes that each SSTable resides directly
in table::datadir directory, while what should actually be used
is directory path from SSTable descriptor.
This patch prevents a regression when adding staging sstables support
for upload/ dir.
Currently registering new view updates is throttled by a semaphore,
which makes sense during stream sessions in order to avoid overloading
the queue. Still, registration also occurs during initialization,
where it makes little sense to wait on a semaphore, since view update
generator might not have started at all yet.
"
Cleanup various cases related to updating of metatdata stats and encoding stats
updating in preparation for 64-bit gc_clock (#3353).
Fixes#4026Fixes#4033Fixes#4035Fixes#4041
Refs #3353
"
* 'projects/encoding-stats-fixes/v6' of https://github.com/bhalevy/scylla:
sstables: remove duplicated code in data_consume_rows_context CELL_VALUE_BYTES
sstables: mc: use api::timestamp_type in write_liveness_info
sstables: mc: sstable_write encoding_stats are const
mp_row_consumer_k_l::consume_deleted_cell rename ttl param to local_deletion_time
memtable: don't use encoding_stats epochs as default
memtable: mc: udpate min_ttl encoding stats for dead row marker
memtable: mc: add comment regarding updating encoding stats of collection tombstones
sstables: metadata_collector: add update tombstone stats
sstables: assert that delete_time is not live when updating stats
sstables: move update_deletion_time_stats to metadata collector
sstables: metadata_collector: introduce update_local_deletion_time_and_tombstone_histogram
sstables: mc: write_liveness_info and write_collection should update tombstone_histogram
sstables: update_local_deletion_time for row marker deletion_time and expiration
Presence checker is constructed and destroyed in the standard
allocator context, but the presence check was invoked in the LSA
context. If the presence checker allocates and caches some managed
objects, there will be alloc-dealloc mismatch.
That is the case with LeveledCompactionStrategy, which uses
incremental_selector.
Fix by invoking the presence check in the standard allocator context.
Fixes#4063.
Message-Id: <1547547700-16599-1-git-send-email-tgrabiec@scylladb.com>
In order to allow other services to wait for incoming streams
to finish, row level repair uses stream phasing when creating
new sstables from incoming data.
Fixes scylladb#4032
During streaming, there's a race between streamed sstables
and view creation, which might result in some tables not being
used to generate view updates, even though they should.
That happens when the decision about view update path for a table
is done before view creation, but after already receiving some sstables
via streaming. These will not be used in view building even though
they should.
Hence, a phaser is used to make the view builder wait for all ongoing
stream sessions for a table to finish before proceeding with build steps.
Refs #4032
"
This mini-series adds counters for the inactive reads registered in the
reader concurrency semaphore.
"
* 'reader-concurrency-semaphore-counters/v6' of https://github.com/denesb/scylla:
tests/querier_cache: use stats to get the no. of inactive reads
reader_concurrency_semaphore: add counters for inactive reads
The existence of LZ4_compress_default is a property of the lz4
library, not seastar.
With this patch scylla does its own configure check instead of
depending on the one done by seastar.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190114013737.5395-1-espindola@scylladb.com>
"
Currently the logic is scattered between types.*, cql3_types.* and
sstables/mc/writer.cc.
This patchset places all the logic in types.* and makes sure we
correctly add "frozen<...>" and "FrozenType(...)" to the names of
tuples and UDTs.
Fixes#4087
Tests: unit(release)
"
* 'haaawk/4087_v1' of github.com:scylladb/seastar-dev:
Add comment explaining tuple type name creation
Add "FrozenType(...)" to UDT name only when it's frozen
Move "FrozenType(...)" addition to UDT name to user_type_impl
Add "frozen<...>" to tuple CQL name only when it's frozen
Move "frozen<...>" addition to tuple CQL name to tuple_type_impl
Merge make_cql3_tuple_type into tuple_type_impl::as_cql3_type
Add "frozen<...>" to UDT CQL name only when it's frozen
Move "frozen<...>" addition to UDT CQL name to user_type_impl
Why default to an artificial minimum when you can do better
with zero effort? Track the actual minima in the memtable instead.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Update min ttl with expired_liveness_ttl (although it's value of max int32
is not expected to affect the minimum).
Fixes#4041
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When the row flag has_complex_deletion is set, some collection columns may have
deletion tombstones and some may not. we don't strictly need to update stats
will not affect the encoding_stats anyway.
Fixes#4035
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that continuous_data_consumer::position() is meaningful (since
36dd660), we can use our position in the stream to calculate offsets
instead of duplicating state machine in offset calculations.
The value of position() - data.size() always holds the current offset
in the stream.
Message-Id: <1547219234-21182-1-git-send-email-tgrabiec@scylladb.com>
Put package names one per line. This makes it easier to review changes,
and to backport changes to this file. No content changes.
Message-Id: <20190112091024.21878-1-avi@scylladb.com>
Do not allow write access to the sstable list via this accessor. Luckily
there are no violations, and now we enforce it.
Message-Id: <20190111151049.16953-1-avi@scylladb.com>
To keep format compatibiliti we never wrap tuple type name
into "org.apache.cassandra.db.marshal.FrozenType(...)".
Even when the tuple is frozen.
This patch adds a comment in tuple_type_impl::make_name that
explains the situation.
For more details see #4087
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
At the moment Scylla supports only frozen UDTs but
the code should be able to handle non-frozen UDTs as well.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
At the moment Scylla supports only frozen tuples but
the code should be able to handle non-frozen tuples as well.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
At the moment Scylla supports only frozen UDTs but
the code should be able to handle non-frozen UDTs as well.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Race condition takes place when one of the sstables selected by snapshot
is deleted by compaction. Snapshot fails because it tries to link a
sstable that was previously unlinked by compaction's sstable deletion.
Fixes#4051.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110194048.26051-1-raphaelsc@scylladb.com>
sstable_file_io_extensions() return an array of pointers to extensions,
but do_for_each() may defer and the array will be destroyed. The match
keeps it alive until do_for_each completes.
Message-Id: <20190110125656.GC3172@scylladb.com>
make_sstable_reader needs to deal with single-key and scanning reads, and
with restricting and non-restricting (in terms of read concurrency) readers.
Right now it does this combinatorically - there are separate cases for
restricting single-key reads, non-restricting single-key reads, restricing
scans, and non-restricting scans.
This makes further changes more complicated, so separate the two concepts.
The patch splits the code into two stages; the first selects between a single-key
and a scan, and the second selects between a restricting and non-restricting read.
This slightly pessimizes non-restricting reads (a mutation_source is created and
immediately destroyed), but that's not the common case.
Tests: unit(release)
Message-Id: <20190109175804.9352-1-avi@scylladb.com>
compare_row_marker_for_merge compares deletion_time also for row markers
that have missing timestamps. This happened to succeed due to implicit
initialization to 0. However, we prefer the initialization to be explicit
and allow calling row_marker::deletion_time() in all states.
Fixes#4068
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190110102949.17896-1-bhalevy@scylladb.com>
Serialization header stores column types for all
columns in sstable. If any of them is a UDT then it
has to be wrapped into
"org.apache.cassandra.db.marshal.FrozenType(...)".
This patch adds a test case to verify that.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Serialization header stores type names of all
columns in a table. Including partition key columns,
clustering key columns, static columns and regular columns.
If one of those types is a user defined type then we need to
wrap its name into
"org.apache.cassandra.db.marshal.FrozenType(...)".
Fixes#4073
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This renames some variables and functions to make it clear that they
refer to partitions and not rows.
Old versions of sstablemetadata used to refer to a row histogram, but
current versions now mention a partition histogram instead.
This patch doesn't change the exposed API names.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181229223311.4184-2-espindola@scylladb.com>
Both build_rpm.sh/build_deb.sh are failing at beginning of the script
when relocatable package does not exist, need to prevent it and show
user friendly message.
Fixes#4071
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190109094353.16690-1-syuu@scylladb.com>
Compaction manager holds reference to all cleaning sstables till the very
end, and that becomes a problem because disk space of cleaned sstables
cannot be reclaimed due to respective file descriptors opened.
Fixes#3735.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181221000941.15024-1-raphaelsc@scylladb.com>
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.
Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.
Scylla now requires GCC 8 to compile.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
test_fast_forwarding_across_partitions_to_empty_range uses an uninitialized
string to populate an sstable, but this can be invalid utf-8 so that sstable
cannot be sstabledumped.
Make it valid by using make_random_string().
Fixes#4040.
Message-Id: <20190107193240.14409-1-avi@scylladb.com>
In c++17 there are standard ways of requesting aligned memory, so
seastar doesn't need to provide one.
This patch is in preparation for removing with_alignment from seastar.
Tests: unit (debug)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190107191019.22295-1-espindola@scylladb.com>
The current timeout is way too small for debug builds. Currently
jenkins runs avoid the problem by increasing the timeout by 100x. This
patch increases it by 10x, with seems to be sufficient to run the
tests in most desktop machines.
Message-Id: <20190107191413.22531-1-espindola@scylladb.com>
Non-privileged user may not belongs to "wheel" group, for example Debian
variants uses "sudo" group instead of "wheel".
To make sudo able to work on all environment we should allow sudo for
"ALL" instead of "wheel".
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190107173410.23140-1-syuu@scylladb.com>
When building something other than Scylla (like scylla-tools-java or scylla-jmx)
it is convenient to run it from some other directory. To do that, allow running
dbuild from any directory (so we locate tools/toolchain/image relative to the
dbuild script rather than use a fixed path) and mount the current directory
since it's likely the user will want to access files there.
Message-Id: <20190107165824.25164-1-avi@scylladb.com>
Start a new document with an overview of isolation in Scylla, i.e.,
scheduling groups, I/O priority classes, controllers, etc.
As all documents in docs/, this is a document for developers (not users!)
who need to understand how isolation between different pieces of Scylla
(e.g., queries, compaction, repair, etc.) works, which scheduling groups
and I/O classes we have and why, etc.
The document is still very partial and includes a lot of TODOs on
places where the explanation needs to be expanded. In particular it
needs an accurate explanation (and not just a name) of what kind of
work is done under each of the groups and classes, and an explanation
of how we set up RPC to use which scheduling groups for the code it
executes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190103183232.21348-1-nyh@scylladb.com>
Now that we added stats for the inactive reads, the tests don't need
the `reader_concurrency_semaphore::inactive_reads()` method, instead
they can rely on the stats to check the number of inactive reads.
While we keep ordinary hints in a directory parallel to the data directory,
we decided to keep the materialized view hints in a subdirectory of the data
directory, named "view_pending_updates". But during boot, we expect all
subdirectories of data/ to be keyspace names, and when we notice this one,
we print a warning:
WARN: database - Skipping undefined keyspace: view_pending_updates
This spurious warning annoyed users. But moreover, we could have bigger
problems if the user actually tries to create a keyspace with that name.
So in this patch, we move the view hints to a separate top-level directory,
which defaults to /var/lib/scylla/view_hints, but as usual can be configured.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190107142257.16342-1-nyh@scylladb.com>
"Add features that are useful for continuous integration pipelines (and
also ordinary developers):
- sudo support, with and without a tty, as our packaging scripts require it
- install ccache package to allow reducing incremental build times
- dependencies needed to build scylla-jmx and scylla-tools-java"
* tag 'toolchain-ci/v1' of https://github.com/avikivity/scylla:
tools: toolchain: update image for ant, maven, ccache, sudo
tools: toolchain: dbuild: pass-through supplementary groups
tools: toolchain: defeat PAM
tools: toolchain: improve sudo support
tools: toolchain: break long line in dbuild
tools: toolchain: prepare sudoers file
tools: toolchain: install ccache
install-dependencies.sh: add maven and ant
"This patchset reduces inclusions of database.hh, particularly in header
files. It reduces the number of objects depending on database.hh from 166
to 116.
Tests: unit(release), playing a little with tracing"
* tag 'database.hh/v1' of https://github.com/avikivity/scylla:
streaming: stream_session: remove include of db/view/view_update_from_staging_generator.hh
sstables: writer.hh: add some forward declarations
table_helper: remove database.hh include
table_helper: de-inline insert() and setup_keyspace()
table_helper: de-template setup_keyspace()
table_helper: simplify template body of table_helper::insert()
schema_tables: remove #include of database.hh
cql_type_parser: remove dependency on user_types_metadata
thrift: add missing include of sleep.hh
cql3: ks_prop_defs: remove #include "database.hh"
To be able to verify the golden version with sstabledump.
These files were generated by running sstable_3_x_test and keeping its
generated output files.
Refs #4043
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190103112511.23488-2-bhalevy@scylladb.com>
Various small improvements to docs/logging.md:
1. Describe the options to log to stdout or syslog and their defaults.
2. Mention the possibility of using nodetool instead of REST API.
3. Additional small tweaks to formatting.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190106111851.26700-1-nyh@scylladb.com>
Add a new document about logging in Scylla, and how to change the log levels
when running Scylla and during the run.
It needs more developer-oriented information (e.g., how to create new logger
subsystems in the code) but I think it's a good start.
Some of the text is based on Glauber's writeup for the Scylla website on
changing log levels at runtime.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190106103606.26032-1-nyh@scylladb.com>
This header, which is easily replaced with a forward declaration,
introduces a dependency on database.hh everywhere. Remove it and scatter
includes of database.hh in source files that really need it.
Move most of the body into a non-template overload to reduce dependencies
in the header (and template bloat). The function is not on any fast path,
and noncopyable_function will likely not even allocate anything.
A default parameter of type T (or lw_shared_ptr<T>) requires that T be
defined. Remove the depndency by redefining the default parameter
as an overload, for T = user_types_metadata.
The workload in #3844 has these characteristics:
- very small data set size (a few gigabytes per shard)
- large working set size (all the data, enough for high cache miss rate)
- high overwrite rate (so a compaction results in 12X data reduction)
As a result, the compaction backlog controller assigns very few shares to
compaction (low data set size -> low backlog), so compaction proceeds very slowly.
Meanwhile, we have tons of cache misses, and each cache miss needs to read from a
large number of sstables (since compaction isn't progressing). The end result is
a high read amplification, and in this test, timeouts.
While we could declare that the scenario is very artificial, there are other
real-world scenarios that could trigger it. Consider a 100% write load
(population phase) followed by 100% read. Towards the end of the last compaction,
the backlog will drop more and more until compaction slows to a crawl, and until
it completes, all the data (for that compaction) will have to be read from its
input sstables, resulting in read amplification.
We should probably have read amplification affect the backlog, but for now the
simpler solution is to increase the minimum shares to 50 so that compaction
always makes forward progress. This will result in higher-than-needed compaction
bandwidth in some low write rate scenarios so we will see fluctuations in request
rate (what the controller was designed to avoid), but these fluctioations will be
limited to 5%.
Since the base class backlog_controller has a fixed (0, 0) point, remove it
and add it to derived classes (setting it to (0, 50) for compaction).
Fixes#3844 (or at least improves it).
Message-Id: <20181231162710.29410-1-avi@scylladb.com>
Prevent PAM from enforcing security and preventing sudo from working. This is
done by replacing the default configuration (designed for workstations) to
one that uses pam_permit for everything.
Add tools needed to build scylla-jmx and scylla-tools-java. While
not requirements of this repository, it's nicer if a single setup
can be used to build and run everything.
We also install pystache as it's used by packaging scripts.
The current implementation breaks the invariant that
_size_bytes = reduce(_fragments, &temporary_buffer::size)
In particular, this breaks algorithms that check the individual
segment size.
Correctly implement remove_suffix() by destroying superfluous
temporary_buffer's and by trimming the last one, if needed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190103133523.34937-1-duarte@scylladb.com>
Simplify the fix for memory based eviction, introduced by 918d255 so
there is no need to massage the counters.
Also add a check to `test_memory_based_cache_eviction` which checks for
the bug fixed. While at it also add a check to
`test_time_based_cache_eviction` for the fix to time based eviction
(e5a0ea3).
Tests: tests/querier_cache:debug
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <c89e2788a88c2a701a2c39f377328e77ac01e3ef.1546515465.git.bdenes@scylladb.com>
"
This series adds staging SSTables support to row level repair.
It was introduced for streaming sessions before, but since row level
repair doesn't leverage sessions at all, it's added separately.
Tests:
unit (release)
dtest (repair_additional_test.py:RepairAdditionalTest,
excluding repair_abort_test, which fails for me locally on master)
"
* 'add_staging_sstables_generation_to_row_level_repair_2' of https://github.com/psarna/scylla:
repair: add staging sstables support to row level repair
main,repair: add params to row level repair init
streaming,view: move view update checks to separate file
When a node is bootstrapping, it will receive data from other nodes
via streaming, including materialized views. Regardless whether these
views are built on other nodes or not, building them on newly
bootstrapped nodes has no effect - updates were either already streamed
completely (if view building have finished) or will be propagated
via view building, if the process is still ongoing.
So, marking all views as 'built' for the bootstrapped node prevents it
from spawning superfluous view building processes.
Fixes#4001
Message-Id: <fd53692c38d944122d1b1013fdb0aedf517fa409.1546498861.git.sarna@scylladb.com>
Currently queriers evicted due to their TTL expiring are not
unregistered from the `reader_concurrency_semaphore`. This can cause a
use-after-free when the semaphore tries to evict the same querier at
some later point in time, as the querier entry it has a pointer to is
now invalid.
Fix by unregistering the querier from the semaphore before destroying
the entry.
Refs: #4018
Refs: #4031
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4adfd09f5af8a12d73c29d59407a789324cd3d01.1546504034.git.bdenes@scylladb.com>
Checking if view update path should be used for sstables
is going to be reused in row level repair code,
so relevant functions are moved to a separate header.
In insert_querier(), we may evict older queriers to make room for the new one.
However, we forgot to unregister the evicted queriers from
reader_concurrency_semaphore. As a result, when reader_concurrency_semaphore
eventually wanted to evict something, it saw an inactive_read_handle that was
not connected to a querier_cache::entry, and crashed on use-after-free.
Fix by evicting through the inactive_read_handle associated with the querier
to be evicted. This removes traces of the querier from both
reader_concurrency_semaphore and querier_cache. We also have to massage the
statistics since querier_inactive_read::evict() updates different counters.
Fixes#4018.
Tests: unit(release)
Reviewed-by: Botond Denes <bdenes@scylladb.com>
Message-Id: <20190102175023.26093-1-avi@scylladb.com>
The check in consume_range_tombstone was too late. Before getting to
it we would fail an assert calling to_bound_kind.
This moves the check earlier and adds a testcase.
Tests: unit (release)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This moves the predicate functions to the start of the file, renames
is_in_bound_kind to is_bound_kind for consistency with to_bound_kind
and defines all predicates in a similar fashion.
It also uses the predicates to reduce code duplication.
Tests: unit (release)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
It is useful to adjust the command line when running the docker image,
for example to attach a data volume or a ccache directory. Add e mechanism
to do that.
Message-Id: <20181228163306.19439-1-avi@scylladb.com>
"
Instead of allocating a contiguous temporary_buffer when reading
mutations from the commitlog - or hint - replaying, use fragemnted
buffers instead.
Refs #4020
"
* 'commitlog/fragmented-read/v1' of https://github.com/duarten/scylla:
db/commitlog: Use fragmented buffers to read entries
db/commitlog: Implement skip in terms of input buffer skipping
tests/fragmented_temporary_buffer_test: Add unit test for remove_suffix()
utils/fragmented_temporary_buffer: Add remove_suffix
tests/fragmented_temporary_buffer_test: Add unit test for skip()
utils/fragmented_temporary_buffer: Allow skipping in the input stream
Static class_registries hinder librarification by requiring linking with
all object files (instead of a library from which objects are linked on
demand) and reduce readability by hiding dependencies and by their
horrible syntax. Hide them behind a non-static, non-template tracing
backend registry.
Message-Id: <20181229121000.7885-1-avi@scylladb.com>
This simplifies the code and allows to get rid of the overload of
advance() taking a temporary_buffer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
distributed_loader is a sizeable fraction of database.cc, so moving it
out reduces compile time and improves readability.
Message-Id: <20181230200926.15074-1-avi@scylladb.com>
lz4 1.8.3 was released with a fix for data corruption during compression. While
the release notes indicate we aren't vulnerable, be cautious and update anyway.
Message-Id: <20181230144716.7238-1-avi@scylladb.com>
"This series contains a couple of fixes to the
view_update_from_staging_generator, the object responsible for
generating view updates from sstables written through streaming.
Fixes#4021"
* 'materialized-views/staging-generator-fixes/v2' of https://github.com/duarten/scylla:
db/view/view_update_from_staging_generator: Break semaphore on stop()
db/view/view_update_from_staging_generator: Restore formatting
db/view/view_update_from_staging_generator: Avoid creating more than one fiber
If view_update_from_staging_generator::maybe_generate_view_updates()
is called before view_update_from_staging_generator::start(), as can
happen in main.cc, then we can potentially create more than one fiber,
which leads to corrupted state and conflicting operations.
To avoid this, use just one fiber and be explicit about notifying it
that more work is needed, by leveraging a condition-variable.
Fixes#4021
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
A sharded<database> is not very useful for accessing data since data is
usually distributed across many nodes, while a sharded<database>
contains only a single node's view. So it is really only used for
accessing replicated metadata, not data. As such only the local shard
is accessed.
Use that to simplify query_processor a little by replacing sharded<database>
with a plain database.
We can probably be more ambitious and make all accesses, data and metadata,
go through storage_proxy, but this is a start.
"
* tag 'qp-unshard-database/v1' of https://github.com/avikivity/scylla:
query_processor: replace sharded<database> with the local shard
commitlog_replayer: don't use query_processor
client_state: change set_keyspace() to accept a single database shard
legacy_schema_migrator: initialize with database reference
system_keyspace is an implementation detail for most of its users, not
part of the interface, as it's only used to store internal data. Therefore,
including it in a header file causes unneeded dependencies.
This patch removes a dependency between views and system_keyspace.hh
by moving view_name and view_build_progress into a separate header file,
and using forward declarations where possible. This allows us to
remove an inclusion of system_keyspace.hh from a header file (the last
one), so that further changes to system_keyspace.hh will cause fewer
recompilations.
Message-Id: <20181228215736.11493-1-avi@scylladb.com>
query_processor uses storage_proxy to access data, and the local
database object to access replicated metadata. While it seems strange
that the database object is not used to access data, it is logical
when you consider that a sharded<database> only contain's this node's
data, not the cluster data.
Take advantage of this to replace sharded<database> with a single database
shard.
During normal writes, query processing happens before commitlog, so
logically commitlog replaying the commitlog shouldn't need it. And in
fact the dependency on query_processor can be eliminated, all it needs
is the local node's database.
Provide legacy_schema_migrator with a sharded<database> so it doesn't need
to use the one from query_processor. We want to replace query_processor's
sharded<database> with just a local database reference in order to simplify
it, and this is standing in the way.
When streaming, sstables for which we need to generate view updates
are placed in a special staging directory. However, we only need to do
this for tables that actually have views.
Refs #4021
Message-Id: <20181227215412.5632-1-duarte@scylladb.com>
The version of boost in Fedora 29 has a use-after-free bug that is only
exposed when ./test.py is run with the --jenkins flag. To patch it,
use a fixed version from the copr repository scylladb/toolchain.
Message-Id: <20181228150419.29623-1-avi@scylladb.com>
Implementation of nodetool toppartiotion query, which samples most frequest PKs in read/write
operation over a period of time.
Content:
- data_listener classes: mechanism that interfaces with mutation readers in database and table classes,
- toppartition_query and toppartition_data_listener classes to implement toppartition-specific query (this
interfaces with data_listeners and the REST api),
- REST api for toppartitions query.
Uses Top-k structure for handling stream summary statistics (based on implementation in C*, see #2811).
What's still missing:
- JMX interface to nodetool (interface customization may be required),
- Querying #rows and #bytes (currently, only #partitions is supported).
Fixes#2811
* https://github.com/avikivity/scylla rafie_toppartitions_v7.1:
top_k: whitespace and minor fixes
top_k: map template arguments
top_k: std::list -> chunked_vector
top_k: support for appending top_k results
nodetool toppartitions: refactor table::config constructor
nodetool toppartitions: data listeners
nodetool toppartitions: add data_listeners to database/table
nodetool toppartitions: fully_qualified_cf_name
nodetool toppartitions: Toppartitions query implementation
nodetool toppartitions: Toppartitions query REST API
nodetool toppartitions: nodetool-toppartitions script
A Python script mimicking the nodetool toppartitions utility, utilizing Scylla REST API.
Examples:
$ ./nodetool-toppartitions --help
usage: nodetool-toppartitions [-h] [-k LIST_SIZE] [-s CAPACITY]
keyspace table duration
Samples database reads and writes and reports the most active partitions in a
specified table
positional arguments:
keyspace Name of keyspace
table Name of column family
duration Query duration in milliseconds
optional arguments:
-h, --help show this help message and exit
-k LIST_SIZE The number of the top partitions to list (default: 10)
-s CAPACITY The capacity of stream summary (default: 256)
$ ./nodetool-toppartitions ks test1 10000
READ
Partition Count
30 2
20 2
10 2
WRITE
Partition Count
30 1
20 1
10 1
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
toppartitions_query installs toppartitions_data_listener-s on all database shards, waits for
the designated period, uninstalls shards and collects top-k read/write partition keys.
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
Add data_listeners member to database.
Adds data_listeners* to table::config, to be used by table methods to invoke listeners.
Install on_read() listener in table::make_reader().
Install on_write() listener in database::apply_in_memory().
Tests: Unit (release)
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
Mechanism that interfaces with mutation readers in database and table classes, to
allow tracking most frequent partition keys in read and write operation.
Basic design is specified in #2811.
Tracking top #rows and #bytes will be supported in the future.
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
Replaced std::list with chunked_vector. Because chunked_vector requires
a noexcept move constructor from its value type, change the bad_boy type
in the unit test not to throw in the move constructor.
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
"
partition_snapshots created in the memtable will keep a reference to
the memtable (as region*) and to memtable::_cleaner. As long as the
reader is alive, the memtable will be kept alive by
partition_snapshot_flat_reader::_container_guard. But after that
nothing prevents it from being destroyed. The snapshot can outlive the
read if mutation_cleaner::merge_and_destroy() defers its destruction
for later. When the read ends after memtable was flushed, the snapshot
will be queued in the cache's cleaner, but internally will reference
memtable's region and cleaner. This will result in a use-after-free
when the snapshot resumes destruction.
The fix is to update snapshots's region and cleaner references at the
time of queueing to point to the cache's region and cleaner.
When memtable is destroyed without being moved to cache there is no
problem because the snapshot would be queued into memtable's cleaner,
which will be drained on destruction from all snapshots.
Introduced in f3da043 (in >= 3.0-rc1)
Fixes#4030.
Tests:
- mvcc_test (debug)
"
* tag 'fix-snapshot-merging-use-after-free-v1.1' of github.com:tgrabiec/scylla:
tests: mvcc: Add test_snapshot_merging_after_container_is_destroyed
tests: mvcc: Introduce mvcc_container::migrate()
tests: mvcc: Make mvcc_partition move-constructible
tests: mvcc: Introduce mvcc_container::make_not_evictable()
tests: mvcc: Allow constructing mvcc_container without a cache_tracker
mutation_cleaner: Migrate partition_snapshots when queueing for background cleanup
mvcc: partition_snapshot: Introduce migrate()
mutation_cleaner: impl: Store a back-reference to the owning mutation_cleaner
Some test cases will need many containers to simulate memtable ->
cache transitions, but there can be only one cache_tracker per shard
due to metrics. Allow constructing a conatiner without a cache_tracker
(and thus non-evictable).
partition_snapshots created in the memtable will keep a reference to
the memtable (as region*) and to memtable::_cleaner. As long as the
reader is alive the memtable will be kept alive by
partition_snapshot_flat_reader::_container_guard. But after that,
nothing prevents it from being destroyed. The snapshot can outlive the
read if mutation_cleaner::merge_and_destroy() defers its destruction
for later. When the read ends after memtable was flushed, the snapshot
will be queued in the cache's cleaner, but internally will reference
memtable's region and cleaner. This will result in a use-after-free
when the snapshot resumses destruction.
The fix is to update snapshots's region and cleaner references at the
time of queueing to point to the cache's region and cleaner.
When memtable is destroyed without being moved to cache there is no
problem, because the snapshot would be queued into memtable's cleaner,
which will be drained on destruction from all snapshots.
Introduced in f3da043.
Fixes#4030.
Snapshots which outlive the memtable will need to have their
_region and _cleaner references updated.
The snapshot can be destroyed after the memtable when it is queud in
the mutation_cleaner.
rpc::source cannot be abandoned until EOS is reached, but current code
does not obey it if error code is received, it throws exception instead that
aborts the reading loop. Fix it by moving exception throwing out of the
loop.
Fixes: #4025
Message-Id: <20181227135051.GC29458@scylladb.com>
We saw failure in dtest concurrent_schema_changes_test.py:
TestConcurrentSchemaChanges.changes_while_node_down_test test.
======================================================================
ERROR: changes_while_node_down_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 432, in changes_while_node_down_test
self.make_schema_changes(session, namespace='ns2')
File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 86, in make_schema_changes
session.execute('USE ks_%s' % namespace)
File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state).result()
File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
raise self._final_exception
ConnectionShutdown: Connection to 127.0.0.1 is closed
The test:
session = self.patient_cql_connection(node2)
self.prepare_for_changes(session, namespace='ns2')
node1.stop()
self.make_schema_changes(session, namespace='ns2') --> ConnectionShutdown exception throws
The problem is that, after receiving the DOWN event, the python
Cassandra driver will call Cluster:on_down which checks if this client
has any connections to the node being shutdown. If there is any
connections, the Cluster:on_down handler will exit early, so the session
to the node being shutdown will not be removed.
If we shutdown the cql server first, the connection count will be zero
and the session will be removed.
Fixes: #4013
Message-Id: <7388f679a7b09ada10afe7e783d7868a58aac6ec.1545634941.git.asias@scylladb.com>
When creating a sstable from which to generate view updates, we held
on to a table reference across defer points. In case there's a
concurrent schema drop, the table object might be destroyed and we
will incur in a use-after-free. Solve this by holding on to a shared
pointer and pinning the table object.
Refs #4021
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181227105921.3601-1-duarte@scylladb.com>
On at least one system, using the container's /tmp as provided by docker
results in spurious EINVALs during aio:
INFO 2018-12-27 09:54:08,997 [shard 0] gossip - Feature ROW_LEVEL_REPAIR is enabled
unknown location(0): fatal error: in "test_write_many_range_tombstones": storage_io_error: Storage I/O error: 22: Invalid argument
seastar/tests/test-utils.cc(40): last checkpoint
The setup is overlayfs over xfs.
To avoid this problem, pass through the host's /tmp to the container.
Using --tmpfs would be better, but it's not possible to guess a good size
as the amount of temporary space needed depends on build concurrency.
Message-Id: <20181227101345.11794-1-avi@scylladb.com>
Image fedora-29-20181219 was broken due to the followin chain of events:
- we install gnutls, which currently is at version 3.6.5
- gnutls 3.6.5 introduced a dependency on nettle 3.4.1
- the gnutls rpm does not include a version requirement on nettle,
so an already-installed nettle will not be upgraded when gnutls is
installed
- the fedora:29 image which we use as a baseline has nettle installed
- docker does not pull the latest tag in FROM statements during
"docker build"
- my build machine already had a fedora:29 image, with nettle 3.4
installed (the repository's image has 3.4.1, but docker doesn't
automatically pull if an image with the required tag exists)
As a result, the image ended up hacing gnutls 3.6.5 and nettle 3.4, which
are incompatible.
To fix, update all packages after installation to attempt to have a self
consistent package set even if dependencies are not perfect, and regenerate
the image.
Message-Id: <20181226135711.24074-1-avi@scylladb.com>
The '-t' flag to 'docker run' passes the tty from the caller environment
to the container, which is nice for interactive jobs, but fails if there
is no tty, such as in a continuous integration environment.
Given that, the '-i' flag doesn't make sense either as there isn't any
input to pass.
Remove both, and replace with --sig-proxy=true which allows SIGTERM to
terminate the container instead of leaving it alive. This reduces the
chances of the build stopping but leaving random containers around.
Message-Id: <20181222105837.22547-1-avi@scylladb.com>
The function pending_collection is only called when
cdef->is_multi_cell() is true, so the throw is dead.
This patch converts it to an assert.
Message-Id: <20181207022119.38387-1-espindola@scylladb.com>
"
=== How the the partition level repair works
- The repair master decides which ranges to work on.
- The repair master splits the ranges to sub ranges which contains around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the 100 partitions.
- If the checksum matches, the data in this sub range is synced.
- If the checksum mismatches, repair master fetches the data from all the peers
and sends back the merged data to peers.
=== Major problems with partition level repair
- A mismatch of a single row in any of the 100 partitions causes 100
partitions to be transferred. A single partition can be very large. Not to
mention the size of 100 partitions.
- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
same data twice
=== Row level repair
Row level checksum and synchronization: detect row level mismatch and transfer
only the mismatch
=== How the row level repair works
- To solve the problem of reading data twice
Read the data only once for both checksum and synchronization between nodes.
We work on a small range which contains only a few mega bytes of rows,
We read all the rows within the small range into memory. Find the
mismatch and send the mismatch rows between peers.
We need to find a sync boundary among the nodes which contains only N bytes of
rows.
- To solve the problem of sending unnecessary data.
We need to find the mismatched rows between nodes and only send the delta.
The problem is called set reconciliation problem which is a common problem in
distributed systems.
For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = { row2, row3}
Node3 has set3 = {row1, row2, row4}
To repair:
Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
=== How to implement repair with set reconciliation
- Step A: Negotiate sync boundary
class repair_sync_boundary {
dht::decorated_key pk;
position_in_partition position
}
Reads rows from disk into row buffers until the size is larger than N
bytes. Return the repair_sync_boundary of the last mutation_fragment we
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary.
- Step B: Get missing rows from peer nodes so that repair master contains all the rows
Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
data is synced, goto Step A. If not, request the full hashes from peers.
At this point, the repair master knows exactly what rows are missing. Request the
missing rows from peer nodes.
Now, local node contains all the rows.
- Step C: Send missing rows to the peer nodes
Since local node also knows what peer nodes own, it sends the missing rows to
the peer nodes.
=== How the RPC API looks like
- repair_range_start()
Step A:
- request_sync_boundary()
Step B:
- request_combined_row_hashes()
- reqeust_full_row_hashes()
- request_row_diff()
Step C:
- send_row_diff()
- repair_range_stop()
=== Performance evaluation
We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows to each of the 3 nodes. Each node has 241 GiB of data.
We tested 3 cases below.
1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows.
Time to repair:
old = 87 min
new = 70 min (rebuild took 50 minutes)
improvement = 19.54%
2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
old = 43 min
new = 24 min
improvement = 44.18%
3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.
Time to repair:
old: 211 min
new: 44 min
improvement: 79.15%
Bytes sent on wire for repair:
old: tx= 162 GiB, rx = 90 GiB
new: tx= 1.15 GiB, tx = 0.57 GiB
improvement: tx = 99.29%, rx = 99.36%
It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.
In this test case, repair master needs to receives 2 million rows and
sends 4 million rows. Here are the details: Each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
Repair master sends 1 million rows from repair master and 1 million rows
received from repair slave 1 to repair slave 2. Repair master sends
sends 1 million rows from repair master and 1 million rows received from
repair slave 2 to repair slave 1.
In the result, we saw the rows on wire were as expected.
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
Fixes: #3033
Tests: dtests/repair_additional_test.py
"
* 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits)
repair: Enable row level repair
repair: Add row_level_repair
repair: Add docs for row level repair
repair: Add repair_init_messaging_service_handler
repair: Add repair_meta
repair: Add repair_writer
repair: Add repair_reader
repair: Add repair_row
repair: Add fragment_hasher
repair: Add decorated_key_with_hash
repair: Add get_random_seed
repair: Add get_common_diff_detect_algorithm
repair: Add shard_config
repair: Add suportted_diff_detect_algorithms
repair: Add repair_stats to repair_info
repair: Introduce repair_stats
flat_mutation_reader: Add make_generating_reader
storage_service: Introduce ROW_LEVEL_REPAIR feature
messaging_service: Add RPC verbs for row level repair
repair: Export the repair logger
...
We had issue to build offline installer on RHEL because of repository
difference.
This fix enables to build offline installer both on CentOS and RHEL.
Also it introduces --releasever <ver>, to build offline installer for
specific minor version of CentOS and RHEL.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181212032129.29515-1-syuu@scylladb.com>
Sometimes one wants to just compile all the source files in the
projects, because for example one just moved around code or files and
there is no need to link and run anything, just check that everything
still compiles.
Since linking takes up a considerable amount of time it is worthwhile to
have a specific target that caters for such needs.
This patch introduces a ${mode}-objects target for each mode (e.g.
release-objects) that only runs the compilation step for each source
file but does not link anything.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <eaad329bf22dfaa3deff43344f3e65916e2c8aaf.1545045775.git.bdenes@scylladb.com>
In one test the types in the schema don't match the types in the
statistics file. In another a column is missing.
The patch also updates the exceptions to have more human readable
messages.
Tests: unit (release)
Part of issue #3960.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181219233046.74229-1-espindola@scylladb.com>
Validate ascii string by ORing all bytes and check if 7-th bit is 0.
Compared with original std::any_of(), which checks ascii string byte
by byte, this new approach validates input in 8 bytes and two
independent streams. Performance is much higher for normal cases,
though slightly slower when string is very short. See table below.
Speed(MB/s) of ascii string validation
+---------------+-------------+---------+
| String length | std::any_of | u64 x 2 |
+---------------+-------------+---------+
| 9 bytes | 1691 | 1635 |
+---------------+-------------+---------+
| 31 bytes | 2923 | 3181 |
+---------------+-------------+---------+
| 129 bytes | 3377 | 15110 |
+---------------+-------------+---------+
| 1039 bytes | 3357 | 31815 |
+---------------+-------------+---------+
| 16385 bytes | 3448 | 47983 |
+---------------+-------------+---------+
| 1048576 bytes | 3394 | 31391 |
+---------------+-------------+---------+
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Message-Id: <1544669646-31881-1-git-send-email-yibo.cai@arm.com>
"
Some files use db/config.hh just to access extensions. Reduce dependencies
on this global and volatile file by providing another path to access extensions.
Tests: unit(release)
"
* tag 'unconfig-2/v1' of https://github.com/avikivity/scylla:
hints: reduce dependencies on db/config.hh
commitlog: reduce dependencies on db/config.hh
cql3: reduce dependencies on db/config.hh
database: provide accessor to db::extensions
Rather than forcing callers to go through get_config(), provide a
direct accessor. This reduces dependencies on config.hh, and will
allow separation of extensions from configuration.
When the next pending fragments are after the start of the new range,
we know there is no need to skip.
Caught by perf_fast_forward --datasets large-part-ds3 \
--run-tests=large-partition-slicing
Refs #3984
Message-Id: <1545308006-16389-1-git-send-email-tgrabiec@scylladb.com>
Currently if something throws while streaming in mutation sending loop
sink is not closed. Also when close() is running the code does not hold
onto sink object. close() is async, so sink should be kept alive until
it completes. The patch uses do_with() to hold onto sink while close is
running and run close() on error path too.
Fixes#4004.
Message-Id: <20181220155931.GL3075@scylladb.com>
Mutation readers allow fast-forwarding the ranges from which the data is
being read. The main user of this feature is cache which, when reading
from the underlying reader, may want to skip some data it already has.
Unsurprisingly, this adds more complexity to the implementation of the
readers and more edge cases the developers need to take care of.
While most of the readers were at least to some extent checked in this
area those test usually were quite isolated (e.g. one test doing
inter-partition fast-forwarding, another doing intra-partition
fast-forwarding) and as a consequence didn't cover many corner cases.
This patch adds a generic test for fast-forwarding and slicing that
covers more complicated scenarios when those operations are combined.
Needless to say that did uncover some problems, but fortunately none of
them is user-visible.
Fixes#3963.
Fixes#3997.
Tests: unit(release, debug)
* https://github.com/pdziepak/scylla.git test-fast-forwarding/v4.1:
tests/flat_mutation_reader_assertions: accumulate received tombstones
tests/flat_mutation_reader_assertions: add more test messages
tests/flat_mutation_reader_assertions: relax has_monotonic_positions()
check
tests/mutation_readers: do not ignore streamed_mutation::forwarding
Revert "mutation_source_test: add option to skip intra-partition
fast-forwarding tests"
memtable: it is not a single partition read if partition
fast-forwaring is enabled
sstables: add more tracing in mp_row_consumer_m
row_cache: use make_forwardable() to implement
streamed_mutation::forwarding
row_cache: read is not single-partition if inter-partition forwarding
is enabled
row_cache: drop support for streamed_mutation::forwarding::yes
entirely
sstables/mp_row_consumer: position_range end bound is exclusive
mutation_fragment_filter: handle streamed_mutation::forwarding::yes
properly
tests/mutation_reader: reduce sleeping time
tests/memtable: fix partition_range use-after-free
tests/mutation: fix partition range use-after-free
flat_mutation_reader_from_mutations: add overload that accepts a slice
and partition range
flat_mutation_reader_from_mutations: fix empty range case
flat_mutation_reader_from_mutations: destroy all remaining mutations
tests/mutation_source: drop dropped column handling test
tests/mutation_source: add test for complex fast_forwarding and
slicing
While we already had tests that verified inter- and intra-partition
fast-forwarding as well as slicing, they had quite limited scope and
didn't combine those operations. The new test is meant to extensively
test these cases.
Schema changes are now covered by for_each_schema_change() function.
Having some additional tests in run_mutation_source_tests() is
problematic when it is used to test intermediate mutation readers
because schema changes may be irrelevant for them, which makes the test
a waste of time (might be a problem in debug mode) and requires those
intermediate reader to use more complex underlying reader that supports
schema changes (again, problem in a very slow debug mode).
If the reader is fast-forwarded to another partition range mutation_ may
be left with some partial mutations. Make sure that those are properly
destroyed.
It is a very bad taste to sleep anywhere in the code. The test should be
fixed to explicitly test various orderings between concurrent
operations, but before that happens let's at least readuce how much
those sleeps slow it down by changing it from milliseconds to
microseconds.
Implementing intra-partition fast-forwarding adds more complexity to
already very-much-not-trivial cache readers and isn't really critical in
any way since it is not used outside of the tests. Let's use the generic
adapter instead of natively implementing it.
Single-partition reader is less expensive than the one that accepts any
range of partitions, but it doesn't support fast-forwarding to another
partition range properly and therefore cannot be used if that option is
enabled.
This reverts commit b36733971b. That commit made
run_mutation_reader_tests() support mutation_sources that do not implement
streamed_mutation::forwarding::yes. This is wrong since mutation_sources
are not allowed to ignore or otherwise not support that mode. Moreover,
there is absolutely no reason for them to do so since there is a
make_forwardable() adapter that can make any mutation_reader a
forwardable one (at the cost of performance, but that's not always
important).
It is wrong to silently ignore streamed_mutation::forwarding option
which completely changes how the reader is supposed to operate. The best
solution is to use make_forwardable() adapter which changes
non-forwardable reader to a forwardable one.
Since 41ede08a1d "mutation_reader: Allow
range tombstones with same position in the fragment stream" mutation
readers emit fragments in non-decreasing order (as opposed to strictly
increasing), has_monotonic_posiitons() needs to be updated to allow
that.
Current data model employed by mutation readers doesn't have an unique
representation of range tombstones. This complicates testing by making
multiple ways of emitting range tombstones and rows equally valid.
This patch adds an option to verify mutation readers by checking whether
tombstones they emit properly affect the clustered rows regardless of how
exactly the tombstones are emitted. The interface of
flat_mutation_reader_assertions is extended by adding
may_produce_tombstones() that accepts any number of tombstones and
accumulates them. Then, produces_row_with_key() accepts an additional
argument which is the expected timestamp of the range tombstone that
affects that row.
"
Contains several improvements for fast-forwarding and slicing readers. Mainly
for the MC format, but not only:
- Exiting the parser early when going out of the fast-forwarding window [MC-format-only]
- Avoiding reading of the head of the partition when slicing
- Avoiding parsing rows which are going to be skipped [MC-format-only]
"
* 'sstable-mc-optimize-slicing-reads' of github.com:tgrabiec/scylla:
sstables: mc: reader: Skip ignored rows before parsing them
sstables: mc: reader: Call _cells.clear() when row ends rather than when it starts
sstables: mc: mutation_fragment_filter: Take position_in_partition rather than a clustering_row
sstables: mc: reader: Do not call consume_row_marker_and_tombstone() for static rows
sstables: mc: parser: Allow the consumer to skip the whole row
sstables: continuous_data_consumer: Introduce skip()
sstables: continuous_data_consumer: Make position() meaningful inside state_processor::process_state()
sstables: mc: parser: Allocate dynamic_bitset once per read instead of once per row
sstables: reader: Do not read the head of the partition when index can be used
sstables: mc: mutation_fragment_filter: Check the fast-forward window first
sstables: mc: writer: Avoid calling unsigned_vint::serialized_size()
"
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica and we use the base’s backlog of view updates to derive
this delay.
To validate this approach we tested a 3 node Scylla cluster on GCE,
using n1-standard-4 instances with NVMEs. A loader running on a
n1-standard-8 instance run cassandra-stress with 100 threads. With the
delay function d(x) set to 1s, we see no base write timeouts. With the
delay function as defined in the series, we see that backlogs stabilize
at some (arbitrary) point, as predicted, but this stabilization
co-exists with base write timeouts. However, the system overall behaves
better than the current version, with the 100 view update limit, and
also better than the version without such limit or any backpressure.
More work is necessary to further stabilize the system. Namely, we want
to keep delaying until we see the backlog is decreasing. This will
require us to add more delay beyond the stabilization point, which in
turn should minimize the base write timeouts, and will also minimize the
amount of memory the backlog takes at each base replica.
Design document:
https://docs.google.com/document/d/1J6GeLBvN8_c3SbLVp8YsOXHcLc9nOLlRY7pC6MH3JWo
Fixes#2538
"
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
* 'materialized-views/backpressure/v2' of https://github.com/duarten/scylla: (32 commits)
service/storage_proxy: Release mutation as early as possible
service/storage_proxy: Delay replica writes based on view update backlog
service/storage_proxy: Get the backlog of a particular base replica
service/storage_proxy: Add counters for delayed base writes
main: Start and stop the view_update_backlog_broker
service: Distribute a node's view update backlog
service: Advertise view update backlog over gossip
service/storage_proxy: Send view update backlog from replicas
service/storage_proxy: Prepare to receive replica view update backlog
service/storage_proxy: Expose local view update backlog
tests/view_schema_test: Add simple test for db::view::node_update_backlog
db/view: Introduce node_update_backlog class
db/hints: Initialize current backlog
database: Add counter for current view backlog
database: Expose current memory view update backlog
idl: Add db::view::update_backlog
db/view: Add view_update_backlog
database: Wait on view update semaphore for view building
service/storage_proxy: Use near-infinite timeouts for view updates
database: generate_and_propagate_view_updates no longer needs a timeout
...
When delaying a base write, there is no need to hold on to the
mutation if all replicas have already replied.
We introduce mutation_holder::release_mutation(), which frees the
mutations that are no longer needed during the rest of the delay.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica. We use the base’s backlog of view updates to derive
this delay.
If we achieve CL and the backlogs of all replicas involved were last
seen to be empty, then we wouldn't delay the client's reply. However,
it could be that one of the replicas is actually overloaded, and won't
reply for many new such requests. We'll eventually start applying
backpressure to the client via the background's write queue, but in
the meanwhile we may be dropping view updates. To mitigate this we rely
on the backlog being gossiped periodically.
Fixes#2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the view_update_backlog_broker class, which is
responsible for periodically updating the local gossip state with the
current node's view update backlog. It also registers to updates from
other nodes, and updates the local coordinator's view of their view
update backlogs.
We consider the view update backlog received from a peer through the
mutation_done verb to be always fresh, but we consider the one received
through gossip to be fresh only if it has a higher timestamp than what
we currently have recorded.
This is because a node only updates its gossip state periodically, and
also because a node can transitively receive gossip state about a third
node with outdated information.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This lays the groundwork for brokering a node's view update
backlog across the whole cluster. This is needed for when a
coordinator does not contact a given replica for a long time, and uses
a backlog view that is outdated and causes requests to be
unnecessarily delayed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Change the inter-node protocol so we can propagate the view update
backlog from a base replica to the coordinator through the
mutation_done and mutation_failed verbs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In subsequent patches, replicas will reply to the coordinator with
their view update backlog. Before introducing changes to the
messaging_service, prepare the storage_proxy to receive and store
those backlogs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The local view update backlog is the max backlog out of the relative
memory backlog size and the relative hints backlog size.
We leverage the db::view::node_update_backlog class so we can send the
max backlog out of the node's shards.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Expose the base replica's current memory view update backlog, which is
defined in terms of units consumed from the semaphore.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The view update backlog represents the pending view data that a base replica
maintains. It is the maximum of the memory backlog - how much memory pending
view updates are consuming - and the disk backlog - how much view hints are
consuming. The size of a backlog is relative to its maximum size.
We will use this class to represent a base replica's view update
backlog at the coordinator.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View building sends view updates synchronously, which has natural
backpressure. However, they
1) Contribute to the load on the view replicas, and;
2) Add memory pressure to the base replica.
They should thus count towards the current view update backlog, and
consume units from the view update concurrency semaphore.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.
This value should be increases further, but that change depends on
Refs #2538
Refs #3826
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We no longer wait on the semaphore and instead over-subscribe it, so
there's not reason to pass a timeout.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We arrive at an overloaded state when we fail to acquire semaphore
units in the base replica. This can mean clients are working in
interactive mode, we fail to throttle them and consequently should
start shedding load. We want to avoid impacting base table
availability by running out of memory, so we could offload the memory
queue to disk by writing the view updates as hints without attempting
to send them. However, the disk is also a limited resource and in
extreme cases we won’t be able to write hints. A tension exists
between forgetting the view updates, thereby opening up a window for
inconsistencies between base and view, or failing the base replica
write. The latter can fail the whole user write, or if the
coordinator was able to achieve CL, can instead cause inconsistencies
between base tables (we wouldn't want to store a hint, because if the
base replica is still overloaded, we would redo the whole dance).
Between the devil and the deep blue sea, we chose to forget view
updates. As a further simplification, we don't even write hints,
assuming that if clients can’t be throttled (as we'll attempt to do in
future patches), it will only be a matter of time before view updates
can’t be offloaded. We also start acquiring the semaphore units using
consume(), which is non-blocking, but allows for underflow of the
available semaphore units. This is okay, and we expect not to underflow
by much, as we stop generating new view updates.
Refs #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Propagate acquired semaphore units to mutate_MV() to allow the
semaphore to be incrementally signalled as view updates are processed
by view replicas.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Stopping a table with in-flight reads and writes can be happening
concurrently, which rely on table state and we must therefore prevent
its destruction before those operations complete.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The semaphore currently limiting the amount of view updates a given
base replica emits aims to control the load that is imposed on the
cluster, to protect view replicas from being overloaded when there
are bursts of traffic (especially for degenerate cases like an index
with low selectivity).
100 is, however, an arbitrary number. It might allow too much load on
the view replicas, and it might also allow too much memory from the
base shard to be consumed. Conversely, it might allow for too few
updates to be queued in case of a burst, or to absorb updates while a
view replica becomes partitioned.
To deal with the load that is inflicted on the cluster, future patches
will ensure that the rate of base writes obeys the rate at which the
slowest view replica can consume the corresponding view updates.
To protect the current shard from using too much memory for this
queue, we will limit it to 10% of the shard's memory. The goal is to
both protect the shard from being overloaded, but also to allow it to
absorb bursts of writes resulting in large view mutations.
Refs #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Working in terms of frozen_mutations allows us to account more
precisely the memory pending view updates consume at the storage_proxy
layer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This allows an std::move() in its body to work as intended. Also, make
the lambda's argument type explicit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Scylla is at the moment incompatible with the Seastar master branch,
so in order to allow Scylla commits that depend on Seastar patches,
we change the submodule to point to scylla-seastar and use a branch
(master-20181217) to hold these dependent commits.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181217153241.67514-1-duarte@scylladb.com>
The "enable_sstables_mc_format" config item help text wants to remove itself
before release. Since scylla-3.0 did not get enough mc format mileage, we
decided to leave it in, so the notice should be removed.
Fixes#4003.
Message-Id: <20181219082554.23923-1-avi@scylladb.com>
"
Implementation of origin change c000da13563907b99fe220a7c8bde3c1dec74ad5
Modifies network topology calculation, reducing the amount of
maps/sets used by applying the knowledge of how many replicas we
expect/need per dc and sharing endpoint and rack set (since we cannot have
overlaps).
Also includes a transposed origin test to ensure new calculation
matches the old one.
Fixes#2896
"
* 'calle/network_topology' of github.com:scylladb/seastar-dev:
network_topology_test: Add test to verify new algorith results equals old
network_topology_strategy: Simplify calculate_natural_endpoints
token_metadata: Add "get_location" ip to dc+rack accessor
sequenced_set: Add "insert" method, following std::set semantics
Currently filtering happens inside consume_row_end() after the whole
row is parsed. It's much faster to skip without parsing.
This patch moves filtering and range tombstones splitting to
consume_row_start().
_stored_row is no longer needed because in case the filter returns
store_and_finish, the consumer exits with retry_later, and the parser
will call consume_row_start() again when resumed.
Tests:
./build/release/tests/perf/perf_fast_forward_g \
--sstable-format=mc \
--datasets large-part-ds1 \
--run-tests=large-partition-skips
Before:
read skip time (s) frags frag/s mad f/s max f/s min f/s aio (KiB)
1 4096 1.085142 1953 1800 32 1803 1720 4990 159604
After:
read skip time (s) frags frag/s mad f/s max f/s min f/s aio (KiB)
1 4096 0.694560 1953 2812 11 2813 2684 4986 159588
mp_row_consumer_m::consume_row_marker_and_tombstone() is called for
both clustering and static rows, but it dereferences and modifies
_in_progress_row, which is only set when inside a clustering row.
Fixes#3999.
Will allow state_processor to know its position in the
stream.
Currently position() is meaningless inside process_state() because in
some cases it points to the position after the buffer and in some
cases before it. This patch standardizes on the former. This is more
useful than the latter because process_state() trims from the front of
the buffer as it consumes, so the position inside the stream can be
obtained by subtracting the remaining buffer size from position(),
without introducing any new variables.
The size of the bitset is the same for given row kind across the sstable, so we can
allocate it once.
_columns_selector is moved into row_schema structure, which we have
one for each row kind and setup in the constructor.
read_partition() was always called through read_next_partition(), even
if we're at the beginning of the read. read_next_partition() is
supposed to skip to the next partition. It still works when we're
positioned before a partition, it doesn't advance the consumer, but it
clears _index_in_current_partition, because it (correctly) assumes it
corresponds to the partition we're about to leave, not the one we're
about to enter.
This means that index lookups we did in the read initializer will be
disregarded when reading starts, and we'll always start by reading
partition data from the data file. This is suboptimal for reads which
are slicing a large partition and don't need to read the front of the
partition.
Regression introduced in 4b9a34a854.
The fix is to call read_partition() directly when we're positioned at
the beginning of the partition. For that purpose a new flag was
introduced.
test_no_index_reads_when_rows_fall_into_range_boundaries has to be
relaxed, because it assumed that slicing reads will read the head of
the partition.
Refs #3984Fixes#3992
Tested using:
./build/release/tests/perf/perf_fast_forward_g \
--sstable-format=mc \
--datasets large-part-ds1 \
--run-tests=large-partition-slicing-clustering-keys
Before (focus on aio):
offset read time (s) frags frag/s mad f/s max f/s min f/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
4000000 1 0.001378 1 726 5 736 102 6 200 4 2 0 1 1 0 0 0 65.8%
After:
offset read time (s) frags frag/s mad f/s max f/s min f/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
4000000 1 0.001290 1 775 6 788 716 2 136 2 0 0 1 1 0 0 0 69.1%
Otherwise the parser will keep consuming and dropping fragments
needlessly, rather than giving the user a chance to consume
end-of-stream condition, and maybe skip again.
Refs #3984
Rather than adding serialized_size() to the body size before
serializing the field, we can serialize the field to _tmp_bufs at the
beginning and have the body size automatically account for it.
"
Recently some additional issues were discovered related to recent
changes to the way inactive readers are evicted and making shard readers
evictable.
One such issue is that the `querier_cache` is not prepared for the
querier to be immediately evicted by the reader concurrency semaphore,
when registered with it as an inactive read (#3987).
The other issue is that the multishard mutation query code was not
fully prepared for evicted shard readers being re-created, or failing
why being re-created (#3991).
This series fixes both of these issues and adds a unit test which covers
the second one. I am working on a unit test which would cover the second
issue, but it's proving to be a difficult one and I don't want to delay
the fixes for these issues any longer as they also affect 3.0.
Fixes: #3987Fixes: #3991
Tests: unit(release, debug)
"
* 'evictable-reader-related-issues/v2' of https://github.com/denesb/scylla:
multishard_mutation_query: reset failed readers to inexistent state
multishard_mutation_query: handle missing readers when dismantling
multishard_mutation_query: add support for keeping stats for discarded partitions
multishard_mutation_query: expect evicted reader state when creating reader
multishard_mutation_query: pretty-print the reader state in log messages
querier_cache: check that the query wasn't evicted during registering
reader_concurrency_semaphore: use the correct types in the constructor
reader_concurrency_semaphore: add consume_resources()
reader_concurrency_semaphore::inactive_read_handle: add operator bool()
Transposed from origin unit test.
Creates a semi-random topology of racks, dcs, tokens and replication
factors and verifies endpoint calculation equals old algo.
Fixes#2896 (hopefully)
Implementation of origin change c000da13563907b99fe220a7c8bde3c1dec74ad5
Reduces the amount of maps and sets and general complexity of
endpoint calculation by simply mapping dc:s to expected node
counts, re-using endpoint sets and iterate thusly.
Tested with transposed origin unit test comparing old vs. new
algo results. (Next patch)
When attempting to dismantling readers, some of the to-be-dismantled
readers might be in a failed state. The code waiting on the reader to
stop is expecting failures, however it didn't do anything besides
logging the failure and bumping a counter. Code in the lower layers did
not know how to deal with a failed reader and would trigger
`std::bad_variant_access` when trying to process (save or cleanup) it.
To prevent this, reset the state of failed readers to `inexistent_state`
so code in the lower layers doesn't attempt to further process them.
When dismantling the combined buffer and the compaction state we are no
longer guaranteed to have the reader each partition originated from. The
reader might have been evicted and not resumed, or resuming it might
have failed. In any case we can no longer assume the originating reader
of each partition will be present. If a reader no longer exists,
discard the partitions that it emitted.
In the next patches we will add code that will have to discard some of
the dismantled partitions/fragments/bytes. Prepare the
`dismantle_buffer_stats` struct for being able to track the discarded
partitions/fragments/bytes in addition to those that were successfully
dismantled.
Previously readers were created once, so `make_remote_reader()` had a
validation to ensure readers were not attempted at being created more
than once. This validation was done by checking that the reader-state is
either `inexistent` or `successful_lookup`. However with the
introduction of pausing shard readers, it is now possible that a reader
will have to be created and then re-created several times, however this
validation was not updated to expect this.
Update the validation so it also expects the reader-state to be
`evicted`, the state the reader will be if it was evicted while paused.
The reader concurrency semaphore can evict the querier when it is
registered as an inactive read. Make the `querier_cache` aware of this
so that it doesn't continue to process the inserted querier when this
happens.
Also add a unit test for this.
Previously there was a type mismatch for `count` and `memory`, between
the actual type used to store them in the class (signed) and the type
of the parameters in the constructor (unsigned).
Although negative numbers are completely valid for these members,
initializing them to negative numbers don't make sense, this is why they
used unsigned types in the constructor. This restriction can backfire
however when someone intends to give these parameters the maximum
possible value, which, when interpreted as a signed value will be `-1`.
What's worse the caller might not even be aware of this unsigned->signed
conversion and be very suprised when they find out.
So to prevent surprises, expose the real type of these members, trusting
the clients of knowing what they are doing.
Also add a `no_limits` constructor, so clients don't have to make sure
they don't overflow internal types.
The upgrade to node_exporter 0.17 commit
09c2b8b48a ("node_exporter_install: switch
to node_exporter 0.17") caused the service to no longer start. Turns out
node_exported broke backwards compatibility of the command line between
0.15 to 0.16. Fix it up.
While fixing the command line, all the collector that are enabled by
default were removed.
Fixes#3989
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
[ penberg@scylladb.com: edit commit message ]
Message-Id: <20181213114831.27216-1-amnon@scylladb.com>
"
The motivation is to keep code related to each format separate, to make it
easier to comprehend and reduce incremental compilation times.
Also reduces dependency on sstable writer code by removing writer bits from
sstales.hh.
The ka/la format writers are still left in sstables.cc, they could be also extracted.
"
* 'extract-sstable-writer-code' of github.com:tgrabiec/scylla:
sstables: Make variadic write() not picked on substitution error
sstables: Extract MC format writer to mc/writer.cc
sstables: Extract maybe_add_summary_entry() out of components_writer
sstables: Publish functions used by writers in writer.hh
sstables: Move common write functions to writer.hh
sstables: Extract sstable_writer_impl to a header
sstables: Do not include writer.hh from sstables.hh
sstables: mc: Extract bound_kind_m related stuff into mc/types.hh
sstables: types: Extract sstable_enabled_features::all()
sstables: Move components_writer to .cc
tests: sstable_datafile_test: Avoid dependency on components_writer
"
Working on database.hh or any header that is included in database.hh
(of which there is a lot), is a major pain as each change involves the
recompilation of half of our compilation units.
Reduce the impact by removing the `#include "database.hh"` directive
from as many header files as possible. Many headers can make do with
just some forward declarations and don't need to include the entire
headers. I also found some headers that included database.hh without
actually needing it.
Results
Before:
$ touch database.hh
$ ninja build/release/scylla
[1/154] CXX build/release/gen/cql3/CqlParser.o
After:
$ touch database.hh
$ ninja build/release/scylla
[1/107] CXX build/release/gen/cql3/CqlParser.o
"
* 'reduce-dependencies-on-database-hh/v2' of https://github.com/denesb/scylla:
treewide: remove include database.hh from headers where possible
database_fwd.hh: add keyspace fwd declaration
service/client_state: de-inline set_keyspace()
Move cache_temperature into its own header
The previous code was using mp_row_consumer_k_l to be as close to the
tested code as possible.
Given that it is testing for an unhandled exception, there is probably
more value in moving it to a higher level, easier to use, API.
This patch changes it to use read_rows_flat().
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181210235016.41133-1-espindola@scylladb.com>
Many headers don't really need to include database.hh, the include can
be replaced by forward declarations and/or including the actually needed
headers directly. Some headers don't need this include at all.
Each header was verified to be compilable on its own after the change,
by including it into an empty `.cc` file and compiling it. `.cc` files
that used to get `database.hh` through headers that no longer include it
were changed to include it themselves.
Embedding the expire timer for a write response in the
abstract_write_response_handler simplifies the code as it allows
removing the rh_entry type.
It will also make the timeout easily accessible inside the handler,
for future patches.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181213111818.39983-1-duarte@scylladb.com>
* https://github.com/espindola/scylla espindola/add-composite-tests:
Remove newline from exception messages.
Fix end marker exception message.
Add tests for broken start and end composite markers.
They are inconsistent with other uses of malformed_sstable_exception
and incompatible with adding " in sstable ..." to the message.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Introduced in 7e15e43.
Exposed by perf_fast_forward:
running: large-partition-skips on dataset large-part-ds1
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read skip time (s) frags frag/s (...)
1 0 5.268780 8000000 1518378
1 1 31.695985 4000000 126199
Message-Id: <1544614272-21970-1-git-send-email-tgrabiec@scylladb.com>
If write(v, out, x) doesn't match any overload, the variadic write()
will be picked, with Rest = {}. The compiler will print error messages
about unable to find write(v, out), which totally obscures the
original cause of mismatch.
Make it picked only when there are at least two write() parameters so
that debugging compilation errors is actually possible.
This moves all MC-related writing code to mc/writer.cc:
- m_format_write_helpers.hh is dropped
- m_format_write_helpers_impl.hh is dropped
- sstable_writer_m is moved out of sstables.cc
sstable_writer_m is renamed to sstables::mc::writer
They are common for sstable writers of different formats.
Note that writer.hh is supposed to be included only by writer
implementations, not writer users.
=== How the the partition level repair works
- The repair master decides which ranges to work on.
- The repair master splits the ranges to sub ranges which contains around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the 100 partitions.
- If the checksum matches, the data in this sub range is synced.
- If the checksum mismatches, repair master fetches the data from all the peers
and sends back the merged data to peers.
=== Major problems with partition level repair
- A mismatch of a single row in any of the 100 partitions causes 100
partitions to be transferred. A single partition can be very large. Not to
mention the size of 100 partitions.
- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
same data twice
=== Row level repair
Row level checksum and synchronization: detect row level mismatch and transfer
only the mismatch
=== How the row level repair works
- To solve the problem of reading data twice
Read the data only once for both checksum and synchronization between nodes.
We work on a small range which contains only a few mega bytes of rows,
We read all the rows within the small range into memory. Find the
mismatch and send the mismatch rows between peers.
We need to find a sync boundary among the nodes which contains only N bytes of
rows.
- To solve the problem of sending unnecessary data.
We need to find the mismatched rows between nodes and only send the delta.
The problem is called set reconciliation problem which is a common problem in
distributed systems.
For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = { row2, row3}
Node3 has set3 = {row1, row2, row4}
To repair:
Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
=== How to implement repair with set reconciliation
- Step A: Negotiate sync boundary
class repair_sync_boundary {
dht::decorated_key pk;
position_in_partition position
}
Reads rows from disk into row buffers until the size is larger than N
bytes. Return the repair_sync_boundary of the last mutation_fragment we
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary.
- Step B: Get missing rows from peer nodes so that repair master contains all the rows
Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
data is synced, goto Step A. If not, request the full hashes from peers.
At this point, the repair master knows exactly what rows are missing. Request the
missing rows from peer nodes.
Now, local node contains all the rows.
- Step C: Send missing rows to the peer nodes
Since local node also knows what peer nodes own, it sends the missing rows to
the peer nodes.
=== How the RPC API looks like
- repair_range_start()
Step A:
- request_sync_boundary()
Step B:
- request_combined_row_hashes()
- reqeust_full_row_hashes()
- request_row_diff()
Step C:
- send_row_diff()
- repair_range_stop()
=== Performance evaluation
We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows to each of the 3 nodes. Each node has 241 GiB of data.
We tested 3 cases below.
1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows.
Time to repair:
old = 87 min
new = 70 min (rebuild took 50 minutes)
improvement = 19.54%
2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
old = 43 min
new = 24 min
improvement = 44.18%
3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.
Time to repair:
old: 211 min
new: 44 min
improvement: 79.15%
Bytes sent on wire for repair:
old: tx= 162 GiB, rx = 90 GiB
new: tx= 1.15 GiB, tx = 0.57 GiB
improvement: tx = 99.29%, rx = 99.36%
It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.
In this test case, repair master needs to receives 2 million rows and
sends 4 million rows. Here are the details: Each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
Repair master sends 1 million rows from repair master and 1 million rows
received from repair slave 1 to repair slave 2. Repair master sends
sends 1 million rows from repair master and 1 million rows received from
repair slave 2 to repair slave 1.
In the result, we saw the rows on wire were as expected.
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
Fixes#3033
This patch introduces repair_meta class that is the core class for the
row level repair.
For each range to repair, repair_meta objects are created on both repair
master and repair slaves. It stores the meta data for the row level
repair algorithms, e.g, the current sync boundary, the buffer used to
hold the rows the peers are working on, the reader to read data from
sstable and the writer to write data to sstable.
This patch also implements the RPC verbs for row level repair, for
example, REPAIR_ROW_LEVEL_START/REPAIR_ROW_LEVEL_STOP to starts/stops
row level repair for a range, REPAIR_GET_SYNC_BOUNDARY to get sync
boundary peers want to work on, REPAIR_GET_ROW_DIFF to get missing rows
from repair slaves and REPAIR_PUT_ROW_DIFF to pus missing rows to repair
slaves.
repair_writer uses multishard_writer to apply the mutation_fragments to
sstable. The repair master needs one such writer for each of the repair
slave. The repair slave needs one writer for the repair master.
repair_reader is used to read data from disk. It is simply a local
flat_mutation_reader reader for the repair master. It is more
complicated for the repair slave.
The repair slaves have to follow what repair master read from disk.
For example,
Assume repair master has 2 shards and repair slave has 3 shards
Repair master on shard 0 asks repair slave on shard 0 to read range [0,100).
Repair master on shard 1 asks repair slave on shard 1 to read range [0,100).
Repair master on shard 0 will only read the data that belongs to shard 0
within range [0,100). Since master and slave have different shard count,
repair slave on shard 0 has to use the multi shard reader to collect
data on all the shards. It can not pass range [0, 100) to the multi
shard reader, otherwise it will read more data than the repair master.
Instead, repair slave uses a sharder using sharding configuration of the
repair master, to generate the sub ranges belong to shard 0 of repair
master.
If repair master and slave has the same sharding configuration, a simple
local reader is enough for repair slave.
repair_row is the in-memory representation of "row" that the row level
repair works on. It represents a mutation_fragment that is read from the
flat_mutation reader. The hash of a repair_row is the combination of the
mutation_fragment hash and partition_key hash.
Get a random uint64_t number as the seed for the repair row hashing.
The seed is passed to xx_hasher.
We add the randomization when hashing rows so that when we run repair
for the next time the same row produces different hashing number.
It is used to find the common difference detection algorithms supported
by repair master and repair slaves.
It is up to repair master to choose what algorithm to use.
It returns a vector of row level repair difference detection algorithms
supported by this node.
We are going to implement the "send_full_set" in the following patches.
Move generating_reader from stream_session.cc to flat_mutation_reader.cc.
It will be used by repair code soon.
Also introduce a helper make_generating_reader to hide the
implementation of generating_reader.
This patch adds the RPC verbs that are needed by the row level repair.
The usage of those verbs are in the following patches.
All the verbs for row level repair are sent by the repair master.
Repair master asks repair slaves to create repair meta objects, a.k.a,
repair_meta object, to store the repair meta data needed by row level
repair algorithm. The repair meta object is identified by the IP address
of the repair master and a uint32 number repair_meta_id chosen by repair
master. When repair master restarts or is out of the cluster, repair
slaves will detect it and remove all existing repair_meta for the repair
master. When repair slave restarts, the existing repair_meta on the
slave will be gone.
The sync boundary used in the verbs is the position_in_partition of the
last mutation_fragment. In each repair round, peers work on
(last_sync_boundary, current_sync_boundary]
Represent a position of a mutation_fragment read from a flat mutation
reader. Repair nodes negotiate a small sub range identified by two
repair_sync_boundary to work on in each round.
The new row level repair code will access clustering_key_prefix and it
uses std::optional everywhere. Convert position_in_partition to use
std::optional.
This is a backport of CASSANDRA-11038.
Before this, a restarted node will be reported as new node with NEW_NODE
cql notification.
To fix, only send NEW_NODE notification when the node was not part of
the cluster
Fixes: #3979
Tests: pushed_notifications_test.py:TestPushedNotifications.restart_node_test
Message-Id: <453d750b98b5af510c4637db25b629f07dd90140.1544583244.git.asias@scylladb.com>
This patch adds compatibility for Cassandra's "chunk_size_in_kb", as
well as it keeps Scylla's "chunk_size_kb" compression parameter.
Fixes#3669
Tests: unit (release)
v2: use variable instead of array
v3: fix commited files
Signed-off-by: Juliana Oliveira <juliana@scylladb.com>
Message-Id: <20181211215840.GA7379@shenzou.localdomain>
Cassandra supports a "CREATE CUSTOM INDEX" to create a secondary index
with a custom implementation. The only custom implementation that Cassandra
supports is SASI. But Scylla doesn't support this, or any other custom
index implementation. If a CREATE CUSTOM INDEX statement is used, we
shouldn't silently ignore the "CUSTOM" tag, we should generate an error.
This patch also includes a regression test that "CREATE CUSTOM INDEX"
statements with valid syntax fail (before this patch, they succeeded).
Fixes#3977
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181211224545.18349-2-nyh@scylladb.com>
Different nodes can concurrently create the distributed system
keyspace on boot, before the "if not exists" clause can take effect.
However, the resulting schema mutations will be different since
different nodes use different timestamps. This patch forces the
timestamps to be the same across all nodes, so we save some schema
mismatches.
This fixes a bug exposed by ca5dfdf, whereby the initialization of the
distributed system keyspace is done before waiting for schema
agreement. While waiting for schema agreement in
storage_service::join_token_ring(), the node still hasn't joined the
ring and schemas can't be pulled from it, so nodes can deadlock. A
similar situation can happen between a seed node and a non-seed node,
where the seed node progresses to a different "wait for schema
agreement" barrier, but still can't make progress because it can't
pull the schema from the non-seed node still trying to join the ring.
Finally, it is assumed that changes to the schema of the current
distributed system keyspace tables will be protected by a cluster
feature and a subsequent schema synchronization, such that all nodes
will be at a point where schemas can be transferred around.
Fixes#3976
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181211113407.20075-1-duarte@scylladb.com>
"
This series contains several optimizations of the MC format sstable writer, mainly:
- Avoiding output_stream when serializing into memory (e.g. a row)
- Faster serialization of primitive types when serializing into memory
I measured the improvement in throughput (frag/s) using perf_fast_forward for
datasets with a single large partition with many small rows:
- 10% for a row with a single cell of 8 bytes
- 10% for a row with a single cell of 100 bytes
- 9% for a row with a single cell of 1000 bytes
- 13% for a row with 6 cells of 100 bytes
"
* tag 'avoid-output-stream-in-sstable-writer-v2' of github.com:tgrabiec/scylla:
bytes_ostream: Optimize writing of fixed-size types
sstables: mc: Write temporary data to bytes_ostream rather than file_writer
sstables: mc: Avoid double-serialization of a range tombstone marker
sstables: file_writer: Generalize bytes& writer to accept bytes_view
sstables: Templetize write() functions on the writer
sstables: Turn m_format_write_helpers.cc into an impl header
sstables: De-futurize file_writer
bytes_ostream: Implement clear()
bytes_ostream: Make initial chunk size configurable
"
Carry out simplifications of db::extensions: less magical types, de-inline
complex functions, and reduce #include dependencies
Tests: unit(release)
"
* tag 'extensions-simplify/v1' of https://github.com/avikivity/scylla:
extensions: remove unneeded includes
extensions: deinline extension accessors
extensions: return concrete types from the extension accessors
extensions: remove dependency on cql layer
Returning "auto" makes it harder to understand what the function is returning,
and impossible to de-inline.
Return a vector of pointers instead. The caller should iterate immediately, in
any case, and since the previous return value was a range of references to const
unique_ptrs, nothing else could be done with it anyway.
Inlining write() allows the writing code to be optimized for
fixed-size types. In particular, memcpy() calls and loops will be
eliminated.
Saw 4% improvement in throughput in perf_fast_forward for tiny rows.
Currently temporary data is serialized into a file_writer, because
that's what write() functions used to expect, which goes through an
output_stream, a data_sink, into an in-memory data sink implementation
which collects the temporary_buffers.
Going through those abstractions is relatively expensive if we don't
write much, because each time we begin to write after a flush() of the
file_writer the output stream has to allocate a new buffer, which
means a large allocation for small amount of data.
We could avoid that and write into bytes_ostream directly, which will
keep its buffer across clear().
write() functions which are used both to write directly into the data
file and to a temporary arena were templatized to accept a Writer to
which both file_writer and bytes_ostream conform.
I need to templatize functions defined in it and want to avoid
explicit instantiations.
There is only one compilation unit in which this is used
(sstables.cc). I think in the long term we should move all those
"helpers" into sstables/mc/writer.{cc,hh} together with their only
user, the sstable_writer_m class from sstables.cc.
The extensions class reaches into cql's property_definitions class to grab
a map<sstring, sstring> type. This generates a few unneeded dependencies.
Reduce dependencies by defining the map type ourselves; if cql's property_definitions
changes in an incompatible way, it will have to adapt, rather than the extensions
class.
These are the current uninteresting cases I found when looking at
malformed_sstable_exception. The existing code is working, just not
being tested.
* https://github.com/espindola/scylla.git espindola/espindola/broken-sst:
Add a broken sstable test.
Add a test with mismatched schema.
db::config is a global class; changes in any module can cause changes
in db::config. Therefore, it is a cause of needless recompilation.
Remove some of these dependencies by having consumers of db::config
declare an intermediate config struct that is contains only
configuration of interest to them, and have their caller fill it out
(in the case of auth, it already followed this scheme and the patchset
only moves the translation function).
In addition, some outright pointless inclusions of db/config.hh are
removed.
The result is somewhat shorter compile times, and fewer needless
recompiles.
* https://github.com/avikivity/scylla unconfig-1/v1:
config: remove inclusions of db/config.hh from header files
repair: remove unneeded config.hh inclusion
batchlog_manager: remove dependency on db::config
auth: remove permissions_cache dependency on db::config
auth: remove auth::service dependency on db::config
auth: remove unneeded db/config.hh includes
"
This series optimises the read path by replacing some usages of
std::vector by utils::small_vector. The motivation for this change was
an observation that memory allocation functions are pointed out by the
profiler as the ones where we spent most time and while they have a
large number of callers storage allocation for some vectors was close to
the top. The gains are not huge, since the problem is a lot of things
adding up and not a single slow thing, but we need to start with
something.
Unfortunately, the performance of boost::container::small_vector is
quite disappointing so a new implementation of a small_vector was
introduced.
perf_simple_query -c4 --duration 60, medians:
./perf_before ./perf_after diff
read 343086.80 360720.53 5.1%
Tests: unit(release, small_vector in debug)
"
* tag 'small_vector/v2.1' of https://github.com/pdziepak/scylla:
partition_slice: use small_vector for column_ids
mutation_fragment_merger: use small_vector
auth: use small_vector in resource
auth: avoid list-initialisation of vectors
idl: serialiser: add serialiser for utils::small_vector
idl: serialiser: deduplicate vector serialisers
utils: introduce small_vector
intrusive_set_external_comparator: make iterator nothrow move constructible
mutation_fragment_merger: value-initialise iterator
"
This is a backport of CASSANDRA-8236.
Before this patch, scylla sends the node UP event to cql client when it
sees a new node joins the cluster, i.e., when a new node's status
becomes NORMAL. The problem is, at this time, the cql server might not
be ready yet. Once the client receives the UP event, it tries to
connect to the new node's cql port and fails.
To fix, a new application_sate::RPC_READY is introduced, new node sets
RPC_READY to false when it starts gossip in the very beginning and sets
RPC_READY to true when the cql server is ready.
The RPC_READY is a bad name but I think it is better to follow Cassandra.
Nodes with or without this patch are supposed to work together with no
problem.
Refs #3843
"
* 'asias/node_up_down.upstream.v4.1' of github.com:scylladb/seastar-dev:
storage_service: Use cql_ready facility
storage_service: Handle application_state::RPC_READY
storage_service: Add notify_cql_change
storage_service: Add debug log in notify_joined
storage_service: Add extra check in notify_joined
storage_service: Add notify_joined
storage_service: Add debug log in notify_up
storage_service: Add extra check in notify_up
storage_service: Add notify_up
storage_service: Make notify_left log debug level
storage_service: Introduce notify_left
storage_service: Add debug log in notify_down
storage_service: Introduce notify_down
storage_service: Add set_cql_ready
gossip: Add gossiper::is_cql_ready
gms: Add endpoint_state::is_cql_ready
gms: Add application_state::RPC_READY
gms: Introduce cql_ready in versioned_value
At this point the cql_ready facility is ready. To use it, advertise the
RPC_READY application state in the following cases:
- When a node boots, set it to false
- When cql server is ready, set it to true
- When cql server is down, set it to false
- New scylla node always send application_state::RPC_READY = false when
the node boots and send application_state::RPC_READY = true when cql
server is up
- Old scylla node that does not support the application_state::RPC_READY
never has application_state::RPC_READY in the endpoint_state, we can
only think their cql server is up, so we return true here if
application_state::RPC_READY is not present
_ck_blocks_header is a 64-bit variable, so the mask should be 64 bits too.
Otherwise, a shift in the range 32-63 will produce wrong results.
Fix by using a 64-bit mask.
Found by Fedora 29's ubsan.
Fixes#3973.
Message-Id: <20181209120549.21371-1-avi@scylladb.com>
"
Refs #3929
Enables re-use of commitlog segments.
First, ensures we never succeed playing back a commitlog
segment with name not matching the ID:s in the actual
file data, by determining expected id based on file name.
This will also handle partially written re-used files, as
each chunk headers CRC is dependent on the ID, and will
fail once we hit any left-overs.
Second part renamed and puts files into a recycle list
instead of actually deleting them when finished.
Allocating new files will the prioritize this list
before creating a new file.
Note that since consumtion and release of segments can
be somewhat unbalanced, this does not really guarantee
we will use recycled files even in all cases when it
might be possible, simply because of timing. It does
however give a good chance of it.
We limit recycled files based on the max disk size
setting, thus we can potentially grow disk size
more than without depending on timing, but not
uncontrolled.
While all this theoretially might improve disk
writes in some cases, it is far from any magic bullet.
No real performance testing has been done yet, only
functional.
"
* 'calle/commitlog-reuse' of github.com:scylladb/seastar-dev:
commitlog: Recycle used segments instead of delete + new file
commitlog: Terminate all segments with a zero chunk
commitlog_replay: Enforce file name based id matching
Refs #3929
When deleting a segment, IFF we have not yet filled up all reserves,
instead of actually deleting the file, put it on a "recycle" list.
Next segment allocation will instead of creating a new one simply
rename the segment and reuse the file and its allocated space.
We rename the file twice: Once on adding to recycle list, with special
prefix so we don't mix up actual replayable segments and these. Second
when we actually re-use the file (also to ensure consecutive names).
Note that we limit the amount of recyclables, so a really stressed
application which somehow fills up the replenish queue might
cause us to still drop the segments. Could skip this but risk
getting to many files on disk.
Replay should be safe, since all entries are guarded by CRC based
on the file ID (i.e. file name). Thus replaying a recycled segment
will simply cause a CRC error in the main header and be ignored (see
previous patch).
Segments that are fully synced will have terminating zero-header (see
previous patch) so we know when to stop processing a recycled file.
If a file is the result of a mid-write crash, we will generate a CRC
processing error as "normally" in this case, when hitting partially
written block or coming to an old/new chunk boundary.
v2:
* Sync dir on rename
* auto -> const sstring&
* Allow recycling files as long as we're within disk space limits
v3:
* Use special names for files waiting for reuse
Writes a final chunk header of zero to the file on close, to mark
end-of-segment.
This allows us to gracefully stop replay processing of a segment file
even if it was not zeroed from the beginning (maybe recycled - hint
hint).
When reading the header chunk of a commitlog file, check the stored id
value against the id derived from the file name, and ignore if
mismatched. This is a prerequisite for re-using renamed commitlog files,
as we can then fail-fast should one such be left on disk, instead of
trying to replay it.
We also check said id via the CRC check for each chunk parsed. If we
find a chunk with
mismatched id, we will get a CRC error for the chunk, and replay will
terminate (albeit not gracefully).
The newer version of node_exporter comes with important bug fixes, that
is especially important for I3.metal is not supported with the older
version of node_exporter.
The dashboards can now support both the new and the old version of
node_exporter.
Fixes#3927
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20181210085251.23312-1-amnon@scylladb.com>
"
Make major compaction aware of compaction strategy, by using an
optimal approach which suits the strategy needs.
Refs #1431.
"
* 'compaction_strategy_aware_major_compaction_v2' of github.com:raphaelsc/scylla:
tests: add test for compaction-strategy-aware major compaction
compaction: implement major compaction heuristic for leveled strategy
compaction: introduce notion of compaction-strategy-aware major compaction
auth::service already has its own configuration and a function to create it
from db::config; just move it to the caller. This reduces dependencies on the
global db::config class.
permissions_cache already has its own configuration and a function to create it
from db::config; just move it to the caller. This reduces dependencies on the
global db::config class.
Extract configuration into a new struct batchlog_manager_config and have the
callers populate it using db::config. This reduces dependencies on global objects.
Instead, distribute those inclusions to .cc files that require them. This
reduces rebuilds when config.hh changes, and makes it easier to locate files
that need config disaggregation.
rh_entry address is captured inside timeout's callback lambda, so the
structure should not be moved after it is created. Change the code to
create rh_entry in-place instead of moving it into the map.
Fixes#3972.
Message-Id: <20181206164043.GN25283@scylladb.com>
The results vector should be populated vertically, not horizontally.
Responsible for assertion failure with --cache-enabled:
void result_collector::add(test_result_vector): Assertion `rs.size() == results.size()' failed.
Introduced in 3fc78a25bf.
Message-Id: <1544105835-24530-2-git-send-email-tgrabiec@scylladb.com>
Major compaction for leveled strategy will now create a run of
non-overlapping sstables at the highest level. Until now, a single
sstable would be created at level 0 which was very suboptimal because
all data would need to climb up the levels again, making it a very
expensive I/O process.
Refs #1431.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's only the very first step which introduces the machinery for making
major compaction aware of all strategies. By the time being, default
implementation is used for them all which only suits size tiered.
Refs #1431.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Commit cc6c383249 has fixed an issue with
incorrectly tracking max_local_deletion_time and the check in
validate_max_local_deletion_time was called to work around old files.
This fix relaxes conditions for enforcing defaut max_local_deletion_time
so that they don't apply to SSTables in 'mc' format because the original
problem has been resolved before 'mc' format have been introduced.
This is needed to be able to read correct values from
Cassandra-generated SSTables that don't have a Scylla.db component.
Its presence or absence is used as an indicator of possibly affected
files.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
For tiny index files (< 8 bytes long) it could turn to zero and trigger
an assertion in prepare_summary().
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
storage_service keeps a bunch of "feature" variables, indicating cluster-wide
supported features, and has the ability to wait until the entire cluster supports
a given feature.
The propagation of features depends on gossip, but gossip is initialized after
storage_service, so the current code late-initializes the features. However, that
means that whoever waits on a feature between storage_service initialization and
gossip initialization loses their wait entry. In #3952, we have proof that this
in fact happens.
Fix this by removing the circular dependency. We now store features in a new
service, feature_service, that is started before both gossip and storage_service.
Gossip updates feature_service while storage_service reads for it.
Fixes#3953.
* https://github.com/avikivity/3953/v4.1:
storage_service: deinline enable_all_features()
gossiper: keep features registered
tests/gossip: switch to seastar::thread
storage_service: deinline init/deinit functions
gossiper: split feature storage into a new feature_service
gossiper: maybe enable features after start_gossiping()
storage_service: fix gap when feature::when_enabled() doesn't work
storage_service::register_features() reassigns to feature variables in
storage_service. This means that any call to feature::when_enabled() will be
orphaned when the feature is assigned.
Now that feature lifetimes are not tied to gossip, we can move the feature
initialization to the constructor and eliminate the gap. When gossip is started
it will evaluate application_states and enable features that the cluster agrees on.
Since we may now start with features already registered, we need to enable
features immediately after gossip is started. This case happens in a cluster
that already is fully upgraded on startup. Before this series, features were
only added after this point.
Feature lifetime is tied to storage_service lifetime, but features are now managed
by gossip. To avoid circular dependency, add a new feature_service service to manage
feature lifetime.
To work around the problem, the current code re-initializes features after
gossip is initialized. This patch does not fix this problem; it only makes it
possible to solve it by untyping features from gossip.
Gossiper unregisters enabled features as an optimization. However that makes
decoupling features from gossiper harder. Disable this optimization; since the
number of features is small and normal access is to a single feature at a time,
there is no significant performance or memory loss.
In Scylla we have three implementations of vector-like structures
std::vector, utils::chunked_vector and utils::small_vector. Which one is
used is largerly an implementation detail and all should be serialised
by the IDL infrastructure in exactly the same way. To make sure that
it's indeed the case let's make them share the serialiser
implementation.
small_vector is a variation of std::vector<> that reserves a configurable
amount of storage internally, without the need for memory allocation.
This can bring measurable gains if the expected number of elements is
small. The drawback is that moving such small_vector is more expensive
and invalidates iterators as well as references which disqualifies it in
some cases.
"
Multishard combining readers, running concurrently, with limited
concurrency and no timeout may deadlock, due to inactive shard readers
sitting on permits. To avoid this we have to make sure that all shard
readers belonging to a multishard combining readers, that are not
currently active, can be evicted to free up their permits, ensuring that
all readers can make progress.
Making inactive shard readers evictable is the solution for this
problem, however the original series introducing this solution
(414b14a6bd) did not go all they way and
left some loose ends. These loose ends are tied up by this mini-series.
Namely, two issues remained:
* The last reader to reach EOS was not paused (made evictable).
* Readers created/resumed as part of a read-ahead were not paused
immediately after finishing the read-ahead.
This series fixes both of these.
Fixes: #3865
Tests: unit(release, debug)
"
* 'fix-multishard-reader-deadlock/v1' of https://github.com/denesb/scylla:
multishard_combining_reader: pause readers after reading ahead
multishard_combining_reader: pause *all* EOS'd readers
Readers created or resumed just to read ahead should be paused right
after, to avoid consuming all available permits on the shards they
operate on, causing a deadlock.
"
This series of patches ensures that all the Python code base is python3 compliant
and consistent by applying the following logic:
- python3 classifier on setup.py to explicitly state our python compatibility matrix
- add UTF-8 encoding header
- correct every shebang to the same /usr/bin/env python3
- shebang is only added on scripts meant to be executed on their own (removed otherwise)
- migrate some leftover scripts from python2 to python3 with minimal QA
This work is important to prepare for a more drastic change on Python code styling
using the black formatter and the setting up of automated QA checks on Python code base.
"
* 'python3_everywhere' of https://github.com/numberly/scylla:
scylla-housekeeping: fix python3 compat and shebang
dist/ami/files/scylla_install_ami: python3 shebang
dist/docker/redhat/docker-entrypoint.py: add encoding comment
fix_system_distributed_tables.py: fix python3 compat and shebang
gen_segmented_compress_params.py: add encoding comment
idl-compiler.py: python3 shebang
scylla-gdb.py: python3 shebang
configure.py: python3 shebang
tools/scyllatop/: add / normalize python3 shebang
scripts/: add / normalize python3 shebang
dist/common/scripts: add / normalize python3 shebang
test.py: add encoding comment
setup.py: add python3 classifiers
"
This patchset extends a number of existing tests to check SSTables
statistics for 'mc' format and fixes an issue discovered with the help
of one of the tests.
Tests: unit {release}
"
* 'projects/sstables-30/check-stats/v2' of https://github.com/argenet/scylla:
tests: Run sstable_timestamp_metadata_correcness_with_negative with all SSTables versions.
tests: Run sstable_tombstone_histogram_test for all SSTables versions.
tests: Run min_max_clustering_key_test on all SSTables versions.
tests: Expand test_sstable_max_local_deletion_time_2 to run for all SSTables versions.
tests: Run test_sstable_max_local_deletion_time on all SSTables versions.
tests: Extend test checking tombstones histogram to cover all SSTables versions.
sstables: Properly track row-level tombstones when writing SSTables 3.x.
tests: Run min_max_clustering_key_test_2 for all SSTables versions.
tests: Make reusable_sst() helper accept SSTables version parameter.
Previously the last shard reader to reach EOS wasn't paused. This is a
mistake and can contribute to causing deadlocks when the number of
concurrently active readers on any shard is limited.
ForwardIterators are default constructible, but they have to be
value-initialised to compare equal to other value-initialised instances
of that iterator.
Otherwise errors cannot be made sense of, since error are reported
always to stdout. Without test output we don't know what they're
referring to.
This change makes the output always go to stdout, in addition to other
reportes, if any.
Message-Id: <1544020084-16492-1-git-send-email-tgrabiec@scylladb.com>
UTF-8 string is now validated by boost::locale::conv::utf_to_utf, it
actually does string conversions which is more than necessary. As
observed on Arm server, UTF-8 validation can become bottleneck under
heavy loads.
This patch introduces a brand new SIMD implementation supporting both
NEON and SSE, as well as a naive approach to handle short strings.
The naive approach is 3x faster than boost utf_to_utf, whilst SIMD
method outperforms naive approach 3x ~ 5x on Arm and x86. Details at
https://github.com/cyb70289/utf8/.
UTF-8 unit test is added to check various corner cases.
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Message-Id: <1543978498-12123-1-git-send-email-yibo.cai@arm.com>
If the compaction manager is started, compactions may start (this is
regardless of whether or not we trigger them). The problem with that is
that they start at a time in which we are flushing the commitlog and the
initialization procedure waits for the commitlog to be fully flushed and
the resulting memtables flushed before we move on.
Because there are no incoming writes, the amount of shares in memtable
flushes decrease as memory used decreases and that can cause the startup
procedure to take a long time.
We have recently started to bump the shares manually for manual flushes.
While that guarantees that we will not drive the shares to zero, I will
make the argument that we can do better by making sure that those things
are, at this point, running alone: user experience is affected by
startup times and the bump we give to user-triggered operations will
only do so much. Even if we increase the shares a lot flushes will still
be fighting for resources with compactions and startup will take longer
than it could.
By making sure that flushes are this point running alone we improve the
user experience by making sure the startup is as fast as it can be.
There is a similar problem at the drain level, which is also fixed in this
series.
Fixes#3958
* git@github.com:glommer/scylla.git faster-restart
compaction_manager: delay initialization of the compaction manager.
drain: stop compactions early
In dtest, we have
self.check_rows_on_node(node1, 2000)
self.check_rows_on_node(node2, 2000)
which introduce the following cluster operations:
1) Initially:
- node1 up
- node2 up
2) self.check_rows_on_node(node1, 2000)
- node2 down
- node2 up (A: node2 will call gossiper::real_mark_alive when node2 boots
up to mark node1 up)
3) self.check_rows_on_node(node2, 2000)
- node1 down (B: node1 will send shutdown gossip message to node2, node2
will mark node1 down)
- node1 up (C: when node1 is up, node2 will call
gossiper::real_mark_alive)
Since there is no guarantee the order of Operation A and Operation B, it
is possible node2 will mark node1 as status=shutdown and mark node1 is
UP.
In Operation C, node2 will call gossiper::real_mark_alive to mark node1
up, but since node2 might think node1 is already up, node2 will exit
early in gossiper::real_mark_alive and not log "InetAddress 127.0.0.1 is
now UP, status={}"
As a result, dtest fails to see node2 reports node1 is up when it boots
node1 and fail the test.
TimeoutError: 23 Nov 2018 10:44:19 [node2] Missing: ['127.0.0.1.* now UP']
In the log we can see node1 marked as DOWN and UP almost at the same time on node2:
INFO 2018-11-23 22:31:29,999 [shard 0] gossip - InetAddress 127.0.0.1 is now DOWN, status = shutdown
INFO 2018-11-23 22:31:30,006 [shard 0] gossip - InetAddress 127.0.0.1 is now UP, status = shutdown
Fixes#3940
Tests: dtest with 20 consecutive succesful runs
Message-Id: <996dc325cbcc3f94fc0b7569217aa65464eaaa1c.1543213511.git.asias@scylladb.com>
Fixes a build failure when only the scylla binary was selected for
building like this:
./configure.py --with scylla
In this case the rule for gen_crc_combine_table was missing, but it is
needed to build crc_combine_table.o
Message-Id: <1544010138-21282-1-git-send-email-tgrabiec@scylladb.com>
Both of these have the same problem. They remove the to-be-evicted
entries from `_entries` but they don't unregister the `entry` from the
`read_concurrency_semaphore`. This results in the
`reader_concurrency_semaphore` being left with a dangling pointer to the
entries will trigger segfault when it tries to evict the associated
inactive reads.
Also add a unit test for `evict_all_for_table()` to check that it works
properly (`evict_one()` is only used in tests, so no dedicated test for
it).
Fixes: #3962
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <57001857e3791c6385721b624d33b667ccda2e7d.1544010868.git.bdenes@scylladb.com>
"
Before the reader was just ignoring such columns but this creates a risk of data loss.
Refs #2598
"
* 'haaawk/2598/v3' of github.com:scylladb/seastar-dev:
sstables: Add test_sstable_reader_on_unknown_column
sstables: Exception on sstable's column not present in schema
sstables: store column name in column_translation::column_info
sstables: Make test_dropped_column_handling test dropped columns
Since we merged relocatable package, build_deb.sh/build_rpm.sh only does
packaging using prebuilt binary taken from relocatable package, won't compile
anything.
So passing --jobs option to build_deb.sh/build_rpm.sh becomes meaningless,
we can drop it.
Note that we still can specify --jobs option on reloc/build_reloc.sh, it
runs "ninja-build -jN" to compile Scylla, then generate relocatable package.
See #3956
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181204205652.25138-1-syuu@scylladb.com>
drain suffers from the same problem as startup suffers now: memtables
are flushed as part of the drain routine, and because there are no
incoming writes the shares the controller assign to flushes go down over
time, slowing down the process of drain.
This patch reorders things so that we stop compactions first, and flush
later. It guarantees that when flush do happen it will have the full
bandwidth to work with.
There is a comment in the code saying we should stop compactions
forcefully instead of waiting for them to finish. I consider this
orthogonal to this patch therefore I am not touching this. Doing so will
make the drain operation even faster but can be done later. Even when we
do it, having the flushes proceed alone instead of during compactions
will make it faster.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
If the compaction manager is started, compactions may start (this is
regardless of whether or not we trigger them). The problem with that is
that they start at a time in which we are flushing the commitlog and the
initialization procedure waits for the commitlog to be fully flushed and
the resulting memtables flushed before we move on.
Because there are no incoming writes, the amount of shares in memtable
flushes decrease as memory used decreases and that can cause the startup
procedure to take a long time.
We have recently started to bump the shares manually for manual flushes.
While that guarantees that we will not drive the shares to zero, I will
make the argument that we can do better by making sure that those things
are, at this point, running alone: user experience is affected by
startup times and the bump we give to user-triggered operations will
only do so much. Even if we increase the shares a lot flushes will still
be fighting for resources with compactions and startup will take longer
than it could.
By making sure that flushes are this point running alone we improve the
user experience by making sure the startup is as fast as it can be.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Fixes the condition which determines whether a row ttl should be used for a cell
and adds a test that uses each generated mutation to populate mutation source
and then verifies that it can read back the same mutation.
* seastar-dev.git haaawk/sst3/write-read-test/v3:
Fix use_row_ttl condition
Add test_all_data_is_read_back
gen_crc_combine_table is now executed on every build, so it should not
fail on unsupported archs. The generated file will not contain data,
but this is fine since it should not be used.
Another problem is that u32 and u64 aliases were not visible in the #else
branch in crc_combine.cc
Message-Id: <1543864425-5650-1-git-send-email-tgrabiec@scylladb.com>
This tests that a source after being populated with a mutation
returns exactly the same mutation when read.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
When the auth service is requested to stop during bootstrap,
it might have still not reached schema agreement.
Currently, waiting for this agreement is done in an infinite loop,
without taking abort_source into account.
This patch introduces checking if abort was requested
and breaking the loop in such case, so auth service can terminate.
Tests:
unit (release)
dtest (bootstrap_test.py:TestBootstrap.shutdown_wiped_node_cannot_join_test)
Message-Id: <1b7ded14b7c42254f02b5d2e10791eb767aae7fc.1543914769.git.sarna@scylladb.com>
Previous condition was wrong and was using row ttl too often.
We also have to change test_dead_row_marker to compare
resulting sstable with sstable generated by Origin not
by sstableupgrade.
This is because sstableupgrade transmits information about deleted row
marker automatically to cells in that row.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"
This is a small step in fixing issue #2347. It is mostly tests and
testing infrastructure, but it does include a fix for a case where we
were missing the filename in the malformed_sstable_exception.
"
* 'espindola/sstable-corruption-v2' of https://github.com/espindola/scylla:
Add a filename to a malformed_sstable_exception.
Try to read the full sst in broken_sst.
Convert tests to SEASTAR_THREAD_TEST_CASE.
Check the exception message.
Move some tests to broken_sstable_test.cc
"
This series attempts to solve the regressions recently discovered in
performance of multi-partition range-scans. Namely that they:
* Flood the reader concurrency semaphore's queues, trampling other
reads.
* Behave very badly when too many of them is running concurrently
(trashing).
* May deadlock if enough of them is running without a timeout.
The solution for these problems is to make inactive shard readers
evictable. This should address all three issues listed above, to varying
degrees:
* Shard readers will now not cling onto their permits for the entire
duration of the scan, which might be a lot of time.
* Will be less affected by infinite concurrency (more than the node can
handle) as each scan now can make progress by evicting inactive shard
readers belonging to other scans.
* Will not deadlock at all.
In addition to the above fix, this series also bundles two further
improvements:
* Add a mechanism to `reader_concurrecy_semaphore` to be notified of
newly inserted evictables.
* General cleanups and fixes for `multishard_combining_reader` and
`foreign_reader`.
I can unbundle these mini series and send them separately, if the
maintainers so prefer, altough considering that this series will have to
be backported to 3.0, I think this present form is better.
Fixes: #3835
"
* 'evictable-inactive-shard-readers/v7' of https://github.com/denesb/scylla: (27 commits)
tests/multishard_mutation_query_test: test stateless query too
tests/querier_cache: fail resource-based eviction test gracefully
tests/querier_cache: simplify resource-based eviction test
tests/mutation_reader_test: add test_multishard_combining_reader_next_partition
tests/mutation_reader_test: restore indentation
tests/mutation_reader_test: enrich pause-related multishard reader test
multishard_combining_reader: use pause-resume API
query::partition_slice: add clear_ranges() method
position_in_partition: add region() accessor
foreign_reader: add pause-resume API
tests/mutation_reader_test: implement the pause-resume API
query_mutations_on_all_shards(): implement pause-resume API
make_multishard_streaming_reader(): implement the pause-resume API
database: add accessors for user and streaming concurrency semaphores
reader_lifecycle_policy: extend with a pause-resume API
query_mutations_on_all_shards(): restore indentation
query_mutations_on_all_shards(): simplify the state-machine
multishard_combining_reader: use the reader lifecycle policy
multishard_combining_reader: add reader lifecycle policy
multishard_combining_reader: drop unnecessary `reader_promise` member
...
Currently when this test fails, resources are not released in the
correct order, which results in ASAN complaining about use-after-free
in debug builds. This is due to the BOOST_REQUIRE macro aborting the
test when the predicate fails, not allowing for correct destruction
order to take place.
To avoid this ugly failure, that adds noise and might cause a developer
investigating into the failure starting on the wrong path, use the more
mild BOOST_CHECK family of test macros. These will allow the test to run
to completion even when the predicate fails, allowing for the correct
destruction of the resources.
Now that we have an accessor for all concurrency semaphores, we don't
need the tricks of creating a dummy keyspace to get them. Use the
accessors instead.
Test the interaction of the multishard reader with the foreign reader
w.r.t next_partition(). next_partition() is a special operation, as it
its execution is deferred until the next cross-shard operations. Give it
some extra stress-testing.
Refactor the multishard combining reader to make use of the new
pause-resume API to pause inactive shard readers.
Make the pause-resume API mandatory to implement, as by now all existing
clients have adapted it.
Allows for clearing any custom partition ranges, effectively resetting
them to the default ones. Useful for code that needs to set several
different specific partition ranges, one after the other, but doesn't
want to remember the last key it set a range for to be able to clear the
previous range with `clear_range()`.
Allowing for pausing the reader and later resume it. Pausing the reader
waits on the ongoing read ahead (if any), executes any pending
`next_partition()` calls and than detaches the shard reader's buffer.
The paused shard reader is returned to the client.
Resuming the reader consists of getting the previously detached reader
back, or one that has the same position as the old reader had.
This API allows for making the inactive shard readers of the
`multishard_combining_reader` evictable.
The API is private, it's only accessible for classes knowing the full
definition of the `foreign_reader` (which resides in a .cc file).
This API provides a way for the mulishard reader to pause inactive shard
readers and later resume them when they are needed again. This allows
for these paused shard readers to be evicted when the node is under
pressure.
How the readers are made evictable while paused is up to the clients.
Using this API in the `multishard_combining_reader` and implementing it
in the clients will be done in the next patches.
Provide default implementation for the new virtual methods to facilitate
gradual adoption.
The `read_context` which handles creating, saving and looking-up the
shard readers has to deal with its `destroy_reader()` method called any
time, even before some other method finished its work. For example it is
valid for a reader to be requested to be destroyed, even before the
contexts finishes creating it.
This means that state transitions that take time can be interleaved with
another state transition request. To deal with this the read context
uses `future_` states, states that mark an ongoing state transitions.
This allows for state transition request that arrive in the middle of
another state transition to be attached as a continuation to the ongoing
transition, and to be executed after that finishes. This however
resulted in complex code, that has to handle readers being in all sorts
of different states, when the `save_readers()` method is called.
To avoid all this complexity, exploit the fact that `destroy_reader()`
receives a future<> as its argument, which resolves when all previous
state transitions have finished. Use a gate to wait on all these futures
to resolve. This way we don't need all those transitional states,
instead in `save_readers()` we only need to wait on the gate to close.
Thus the number of states `save_readers()` has to consider drops
drastically.
This has the theoretical drawback of the process of saving the readers
having to wait on each of the readers to stop, but in practice the
process finishes when the last reader is saved anyway, so I don't expect
this to result in any slowdown.
Currently `multishard_combining_reader` takes two functors, one for
creating the readers and optionally one for destroying them.
A bag of functors (`std::function`) however make for a terrible
interface, and as we are about to add some more customization points,
it's time to use something more formal: policy based design, a
well-known design pattern.
As well as merging the job of the two functors into a single policy
class, also widen the area of responsibility of the policy to include
keeping alive any resource the shard readers might need on their home
shard. Implementing a proper reader cleanup is now not optional either.
This patch only adds the `reader_managing_policy` interface,
refactoring the multishard reader to use it will be done in the next patch.
The `reader_promise` member of the `shard_reader` was used to
synchronize a foreground request to create the underlying reader with an
ongoing background request with the same goal. This is however
unnecessary. The underlying reader is created in the background only as
part of a read ahead. In this case there is no need for extra
synchronization point, the foreground reader create request can just
wait for the read ahead to finish, for which there already exists a
mean. Furthermore, foreground reader create requests are always followed
by a `fill_buffer()` request, so by waiting on the read ahead we ensure
that the following `fill_buffer()` call will not block.
Shard readers used to track pending `next_partition()` calls that they
couldn't execute, because their underlying reader wasn't created yet.
These pending calls were then executed after the reader was created.
However the only situation where a shard reader can receive a
`next_partition()` call, before its underlying reader wasn't created is
when `next_partition()` is called on the multishard reader before a
single fragment is read. In this case we know we are at a partition
boundary and thus this call has no effect, therefore it is safe to
ignore it.
Foreign reader doesn't execute `next_partition()` calls straight away,
when this would require interaction with the remote reader. Instead
these calls are "remembered" and executed on the next occasion the
foreign reader has to interact with the remote reader. This was
implemented with a counter that counts the number of pending
`next_partition()` calls.
However when `next_partition()` is called multiple times, without
interleaving calls to `operator()()` or `fast_forward_to()`, only the
first such call has effect. Thus it doesn't make sense to count these
calls, it is enough to just set a flag if there was at least one such
call.
It doesn't make sense for the multishard reader anyway, as it's only
used by the row-cache. We are about to introduce the pausing of inactive
shard readers, and it would require complex data structures and code
to maintain support for this feature that is not even used. So drop it.
As we are about to add multiple sources of evictable readers, we need a
more scalable solution than a single functor being passed that opaquely
evicts a reader when called.
Add a generic way to register and unregister evictable (inactive)
readers to the semaphore. The readers are expected to be registered when
they become evictable and are expected to be unregistered when they
cease to become evictable. The semaphore might evict any reader that is
registered to it, when it sees fit.
This also solves the problem of notifying the semaphore when new readers
become evictable. Previously there was no such mechanism, and the
semaphore would only evict any such new readers when a new permit was
requested from it.
It is reasonable for parse() to throw when it finds something wrong
with the format. This seems to be the best spot to add the filename
and rethrow.
Also add a testcase to make sure we keep handling this error
gracefully.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this patch we use data_consume_rows to try to read the entire
sstable. The patch also adds a test with a particular corruption that
would not be found without parsing the file.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This makes the tests a bit more strict by also checking the message
returned by the what() function.
This shows that some of the tests are out of sync with which errors
they check for. I will hopefully fix this in another pass.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
sstable_test.cc was already a bit too big and there is potential for
having a lot of tests about broken sstables.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These loops have the structure :
while (true) {
switch (state) {
case state1:
...
break;
case state2:
if (...) { ... break; } else {... continue; }
...
}
break;
}
There a couple things I find a bit odd on that structure:
* The break refers to the switch, the continue to the loop.
* A while (true) loop always hits a break or a continue.
This patch uses early returns to simplify the logic to
while (true) {
switch (state) {
case state1:
...
return
case state2:
if (...) { ... return; }
...
}
}
Now there are no breaks or continues.
Tests: unit (release)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181126171726.84629-1-espindola@scylladb.com>
"
zlib's crc32_combine() is not very efficient. It is faster to re-combine
the buffer using crc32(). It's still substantial amount of work which
could be avoided.
This patch introduces a fast implementation of crc32_combine() which
uses a different algorithm than zlib. It also utilizes intrinsics for
carry-less multiplication instruction to perform the computation faster.
The details of the algorithm can be found in code comments.
Performance results using perf_checksum and second buffer of length 64 KiB:
zlib CRC32 combine: 38'851 ns
libdeflate CRC32: 4'797 ns
fast_crc32_combine(): 11 ns
So the new implementation is 3500x faster than zlib's, and 417x faster than
re-checksumming the buffer using libdeflate.
Tested on i7-5960X CPU @ 3.00GHz
Performance was also evaluated using sstable writer benchmark:
perf_fast_forward --populate --sstable-format=mc --data-directory /tmp/perf-mc \
--value-size=10000 --rows 1000000 --datasets small-part
It yielded 9% improvement in median frag/s (129'055 vs 117'977).
Refs #3874
"
* tag 'fast-crc32-combine-v2' of github.com:tgrabiec/scylla:
tests: perf_checksum: Test fast_crc32_combine()
tests: Rename libdeflate_test to checksum_utils_test
tests: libdeflate: Add more tests for checksum_combine()
tests: libdeflate: Check both libdeflate and default checksummers
sstables: Use fast_crc_combine() in the default checksummer
utils/gz: Add fast implementation of crc32_combine()
utils/gz: Add pre-computed polynomials
utils/gz: Import Barett reduction implementation from libdeflate
utils: Extract clmul() from crc.hh
Before this patch we were writing offset map enteies in unspecified
order, the one returned by std::unorderd_map. Cassandra writes them
sorted by metadata_type. Use the same order for improved
compatibility.
Fixes#3955.
Message-Id: <1543846649-22861-1-git-send-email-tgrabiec@scylladb.com>
"
Due to an XFS heuristic, if all files are in one (or a few) directories,
then block allocation can become very slow. This is because XFS divides
the disk into a few allocation groups (AGs), and each directory allocates
preferentially from a single AG. That AG can become filled long before
the disk is full.
This patchset works around the problem by:
- creating sstable component files in their own temporary, per-sstable sub-directory,
- moving the files back into the canonical location right after begin created, and finally
- removing the temp sub-directory when the sstable is sealed.
- In addition, any temporary sub-directories that might have been left over if scylla
crashed while creating sstables are looked up and removed when populating the table.
Fixes: #3167
Tests: unit (release)
"
* 'issues/3167/v7' of https://github.com/bhalevy/scylla:
distributed_loader::populate_column_family: lookup and remove temp sstable directories
database: directly use std::experimental::filesystem::path for lister::path
database: use std::experimental::filesystem::path for lister::path
sstable: use std::experimental::filesystem rather than boost
sstable::seal_sstable: fixup indentation
sstable: create sstable component files in a subdirectory
sstable::new_sstable_component_file: pass component_type rather than filename
sstable: cleanup filename related functions
sstable: make write_crc, write_digest, and new_sstable_component_file private methods
zlib's crc32_combine() is not very efficient. It is faster to re-combine
the buffer using crc32(). It's still substantial amount of work which
could be avoided.
This patch introduces a fast implementation of crc32_combine() which
uses a different algorithm than zlib. It also utilizes intrinsics for
carry-less multiplication instruction to perform the computation faster.
The details of the algorithm can be found in code comments.
Performance results using perf_checksum and second buffer of length 64 KiB:
zlib CRC32 combine: 38'851 ns
libdeflate CRC32: 4'797 ns
fast_crc32_combine(): 11 ns
So the new implementation is 3500x faster than zlib's, and 417x faster than
re-checksumming the buffer using libdeflate.
Tested on i7-5960X CPU @ 3.00GHz
Performance was also evaluated using sstable writer benchmark:
perf_fast_forward --populate --sstable-format=mc --data-directory /tmp/perf-mc \
--value-size=10000 --rows 1000000 --datasets small-part
It yielded 9% improvement in median frag/s (129'055 vs 117'977).
gen_crc_combine_table.cc will be run during build to produce tables
with precomputed polynomials (4 x 256 x u32). The definitions will
reside in:
build/<mode>/gen/utils/gz/crc_combine_table.cc
It takes 20ms to generate on my machine.
The purpose of those polynomials will be explained in crc_combine.cc
As we are about to extend the functionality of the reader concurrency
semaphore, adding more method implementations that need to go to a .cc
file, it's time we create a dedicated file, instead of keep shoving them
into unrelated .cc files (mutation_reader.cc).
Use standard convention of the rest of the code base. Type definitions
first, then data members and finally member functions.
As we are about to add more members, its especially important to make
the growing class have a familiar member arrangement.
We would like to get rid of boost::filesystem and gradually replace it with
std::experimental::filesystem.
TODO: using namespace fs = std::experimental::filesystem,
use fs::path directly, rather than lister::path
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When writing the sstable, create a temporary directory
for creating all components so that each sstable files' will be
assigned a different allocaton groups on xfs.
Files are immediately renamed to their default location after creation.
Temp directory is removed when the sstable is sealed.
Additional work to be introduced in the following patches:
- When populating tables, temp directories need to be looked up and removed.
Fixes#3167
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for per-sstable sub directory.
Also, these functions get most of their parameters from the sst at hand so they might
as well be first class members.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a reference to a docker image that contains an "official" toolchain
for building Scylla. In addition, add a script that allows easy usage of
the image, and some documentation.
Message-Id: <20181202120829.21218-1-avi@scylladb.com>
git, python, sudo packages are installed by default on normal Fedora
installation but not in Docker image, we need to install it by this
script.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181201020834.24961-1-syuu@scylladb.com>
"
This patchset addresses two recently discovered bugs both triggered by
summary regeneration:
Tests: unit {release}
+
Validated with debug build of Scylla (ASAN) that no use-after-free
occurs when re-generating Summary.db.
"
* 'projects/sstables-30/summary-regeneration/v1' of https://github.com/argenet/scylla:
tests: Add test reading SSTables in 'mc' format with missing summary.
sstables: When loading, read statistics before summary.
database: Capture io_priority_class by reference to avoid dangling ref.
In case if summary is missing and we attempt to re-generate it,
statistics must be already read to provide us with values stored in
serialization header to facilitate clustering prefixes deserialization.
Fixes#3947
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
The original reference points to a thread-local storage object that
guaranteed to outlive the continuation, but copying it make the
subsequent calls point to a local object and introduces a use-after-free
bug.
Fixes#3948
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This test checks that sstable reader throws an exception
when sstable contains a column that's not present in the schema.
It also checks that dropped columns do not cause exceptions.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Previously such column was ignored but it's better to be explicit
about this situation.
Refs #2598
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"
This series adds proper handling of filtering queries with LIMIT.
Previously the limit was erroneously applied before filtering,
which leads to truncated results.
To avoid that, paged filtering queries now use an enhanced pager,
which remembers how many rows dropped and uses that information
to fetch for more pages if the limit is not yet reached.
For unpaged filtering queries, paging is done internally as in case
of aggregations to avoid returning keeping huge results in memory.
Also, previously, all limited queries used the page size counted
from max(page size, limit). It's not good for filtering,
because with LIMIT 1 we would then query for rows one-by-one.
To avoid that, filtered queries ask for the whole page and the results
are truncated if need be afterwards.
Tests: unit (release)
"
* 'fix_filtering_with_limit_2' of https://github.com/psarna/scylla:
tests: add filtering with LIMIT test
tests: split filtering tests from cql_query_test
cql3: add proper handling of filtering with LIMIT
service/pager: use dropped_rows to adjust how many rows to read
service/pager: virtualize max_rows_to_fetch function
cql3: add counting dropped rows in filtering pager
Before it was testing missing columns.
It's better to test dropped columns because they should be ignored
while for missing columns some sources will throw.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"
This patchset adds support to scylla_io_setup for
multiple data directories as well as commitlog,
hints, and saved_caches directories.
Refs #2415
Tests: manual testing with scylla-ccm generated scylla.yaml
"
* 'projects/multidev/v3' of https://github.com/bhalevy/scylla:
scylla_io_setup: assume default directories under /var/lib/scylla
scylla_io_setup: add support for commitlog, hints, and saved_caches directory
scylla_io_setup: support multiple data directories
Previously, limit was erroneously applied before filtering,
which might have resulted in truncated results.
Now, both paged and unpaged queries are filtered first,
and only after that properly trimmed so only X rows are returned
for LIMIT X.
Fixes#3902
Filtering pager may drop some rows and as a result return less
than what was fetched from the replica. To properly adjust how
many rows were actually read, dropped_rows variable is introduced.
Regular pagers use max_rows to figure out how many rows to fetch,
but filtering pager potentially needs the whole page to be fetched
in order to filter the results.
If a specific directory is not configure in scylla.yaml, scylla assumes
a default location under /var/lib/scylla.
Hard code these locations in scylla_io_setup until we have a better way
to probe scylla about it.
Be permissive and ignore the default directories if they don't not exist
on disk and silently ignore them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Counter for dropped rows is added to the filtering pager.
This metrics can be used later to implement applying LIMIT
to filtering queries properly.
Dropped rows are returned on visitor::accept_partition_end.
"
This series fixes#3891 by amending the way restrictions
are checked for filtering. Previous implementation that returned
false from need_filtering() when multi-column restrictions
were present was incorrect.
Now, the error is going to be returned from restrictions filter layer,
and once multi-column support is implemented for filtering, it will
require no further changes.
Tests: unit (release)
"
* 'fix_multi_column_filtering_check_3' of https://github.com/psarna/scylla:
tests: add multi-column filtering check
cql3: remove incorrect multi-column check
cql3: check filtering restrictions only if applicable
cql3: add pk/ck_restrictions_need_filtering()
need_filtering() incorrectly returned false if multi-column restrictions
were present. Instead, these restrictions should be allowed to need
filtering.
Fixes#3891
"
This miniseries ensures that system tables are not checked
for having view updates, because they never do.
What's more, distributed system table is used in the process,
so it's unsafe to query the table while streaming it.
Tests: unit (release), dtest(update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_decommission_node_2_test)
"
* 'fix_checking_if_system_tables_need_view_updates_3' of https://github.com/psarna/scylla:
streaming: don't check view building of system tables
database: add is_internal_keyspace
streaming: remove unused sstable_is_staging bool class
System tables will never need view building, and, what's more,
are actually used in the process of view build checking.
So, checking whether system tables need a view update path
is simplified to returning 'false'.
"
The series fixed#3565 and #3566
"
* 'gleb/write_failure_fixes' of github.com:scylladb/seastar-dev:
storage_proxy: store hint for CL=ANY if all nodes replied with failure
storage_proxy: complete write request early if all replicas replied with success of failure
storage_proxy: check that write failure response comes from recognized replica
storage_proxy: move code executed on write timeout into separate function
Sync Debian variants dependencies with dist/debian/control.mustache
(before merging relocatable), use scylla 3rdparty packages.
Since we use 3rdparty repo on seastar/install-dependencies.sh, drop repo
setup part from this script.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181031122800.11802-1-syuu@scylladb.com>
Current code assumes that request failed if all replicas replied with
failure, but this is not true for CL=ANY requests. Take it into account.
Fixed: #3565
Currently if write request reaches CL and all replicas replied, but some
replied with failures, the request will wait for timeout to be retired.
Detect this case and retire request immediately instead.
Fixes#3566
As far as I can tell the old sstable reading code required reading the
data into a contiguous buffer. The function data_consume_rows_at_once
implemented the old behavior and incrementally code was moved away
from it.
Right now the only use is in two tests. The sstables used in those
tests are already used in other tests with data_consume_rows.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181127024319.18732-2-espindola@scylladb.com>
"
Compression is not deterministic so instead of binary comparing the sstable files we just read data back
and make sure everything that was written down is still present.
Tests: unit(release)
"
* 'haaawk/binary-compare-of-compressed-sstables/v3' of github.com:scylladb/seastar-dev:
sstables: Remove compressed parameter from get_write_test_path
sstables: Remove unused sstable test files
sstables: Ensure compare_sstables isn't used for compressed files
sstables: Don't binary compare compressed sstables
sstables: Remove debug printout from test_write_many_partitions
"
consistency_level.hh is rather heavyweighy in both its contents and what it
includes. Reduce the number of inclusion sites and split the file to reduce
dependencies.
"
* tag 'cl-header/v2' of https://github.com/avikivity/scylla:
consistency_level: simplify validation API
Split consistency_level.hh header
database: remove unneeded consistency_level.hh include
cql: remove unneeded includes of consistency_level.hh
It has two unrelated users: cql for validation, and storage_proxy for
complicated calculations. Split the simple stuff into a new header to reduce
dependencies.
Not all compaction operations submitted through compaction manager sets a callback
for releasing references of exhausted sstables in compaction manager itself.
That callback lives in compaction descriptor which is passed to table::compaction().
Let's make the call conditional to avoid bad function call exceptions.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181126235616.10452-1-raphaelsc@scylladb.com>
There's no _M_t._M_head_impl any more in the standard library.
We now have std_unique_ptr wrapper which abstracts this fact away so
use that.
Message-Id: <20181126174837.11542-1-tgrabiec@scylladb.com>
"
One part of the improvement comes from replacing zlib's CRC32 with the one
from libdeflate, which is optimized for modern architecture and utilizes the
PCLMUL instruction.
perf_checksum test was introduced to measure performance of various
checksumming operations.
Results for 514 B (relevant for writing with compression enabled):
test iterations median mad min max
crc_test.perf_deflate_crc32_combine 58414 16.711us 3.483ns 16.708us 16.725us
crc_test.perf_adler_combine 165788278 6.059ns 0.031ns 6.027ns 7.519ns
crc_test.perf_zlib_crc32_combine 59546 16.767us 26.191ns 16.741us 16.801us
---
crc_test.perf_deflate_crc32_checksum 12705072 83.267ns 4.580ns 78.687ns 98.964ns
crc_test.perf_adler_checksum 3918014 206.701ns 23.469ns 183.231ns 258.859ns
crc_test.perf_zlib_crc32_checksum 2329682 428.787ns 0.085ns 428.702ns 510.085ns
Results for 64 KB (relevant for writing with compression disabled):
test iterations median mad min max
crc_test.perf_deflate_crc32_combine 25364 38.393us 17.683ns 38.375us 38.545us
crc_test.perf_adler_combine 169797143 5.842ns 0.009ns 5.833ns 6.901ns
crc_test.perf_zlib_crc32_combine 26067 38.663us 95.094ns 38.546us 40.523us
---
crc_test.perf_deflate_crc32_checksum 202821 4.937us 14.426ns 4.912us 5.093us
crc_test.perf_adler_checksum 44684 22.733us 206.263ns 22.492us 25.258us
crc_test.perf_zlib_crc32_checksum 18839 53.049us 36.117ns 53.013us 53.274us
The new CRC32 implementation (deflate_crc32) doesn't provide a fast
checksum_combine() yet, it delegates to zlib so it's as slow as the latter.
Because for CRC32 checksum_combine() is several orders of magnitude slower
than checksum(), we avoid calling checksum_combine() completely for this
checksummer. We still do it for adler32, which has combine() which is faster
than checksum().
SStable write performance was evaluated by running:
perf_fast_forward --populate --data-directory /tmp/perf-mc \
--rows=10000000 -c1 -m4G --datasets small-part
Below is a summary of the average frag/s for a memtable flush. Each result is
an average of about 20 flushes with stddev of about 4k.
Before:
[1] MC,lz4: 330'903
[2] LA,lz4: 450'157
[3] MC,checksum: 419'716
[4] LA,checksum: 459'559
After:
[1'] MC,lz4: 446'917 ([1] + 35%)
[2'] LA,lz4: 456'046 ([2] + 1.3%)
[3'] MC,checksum: 462'894 ([3] + 10%)
[4'] LA,checksum: 467'508 ([4] + 1.7%)
After this series, the performance of the MC format writer is similar to that
of the LA format before the series.
There seems to be a small but consistent improvement for LA too. I'm not sure
why.
"
* tag 'improve-mc-sstable-checksum-libdeflate-v3' of github.com:tgrabiec/scylla:
tests: perf: Introduce perf_checksum
tests: Add test for libdeflate CRC32 implementation
sstables: compress: Use libdeflate for crc32
sstables: compress: Rename crc32_utils to zlib_crc32_checksummer
licenses: Add libdeflate license
Integrate libdeflate with the build system
Add libdeflate submodule
sstables: Avoid checksum_combine() for the crc32 checksummer
sstables: compress: Avoid unnecessary checksum_combine()
sstables: checksum_utils: Add missing include
Improves memtable flush performance by 10% in a CPU-bound case.
Unlike the zlib implementation, libdeflate is optimized for modern
CPUs. It utilizes the PCLMUL instruction.
checksum_combine() is much slower than re-feeding the buffer to
checksum() for the zlib CRC32 checksummer.
Introduce Checksum::prefer_combine() to determine this and select
more optimal behavior for given checksummer.
Improves performance of memtable flush with compression enabled by 30%.
class_registry's staticness brings has the usual problem of
static classes (loss of dependency information) and prevents us
from librarifying Scylla since all objects that define a registration
must be linked in.
Take a first step against this staticness by defining a nonstatic
variant. The static class_registry is then redefined in terms of the
nonstatic class. After all uses have been converted, the static
variant can be retired.
Message-Id: <20181126130935.12837-1-avi@scylladb.com>
"
Previously we were checking for schema incompatibility between current schema and sstable
serialization header before reading any data. This isn't the best approach because
data in sstable may be already irrelevant due to column drop for example.
This patchset moves the check after actual data is read and verified that it has
a timestamp new enough to classify it as nonobsolete.
Fixes#3924
"
* 'haaawk/3924/v3' of github.com:scylladb/seastar-dev:
sstables: Enable test_schema_change for MC format
sstables3: Throw error on schema mismatch only for live cells
sstables: Pass column_info to consume_*_column
sstables: Add schema_mismatch to column_info
sstables: Store column data type in column_info
sstables: Remove code duplication in column_translation
BYPASS CACHE was mistakenly documenting an earlier version of the patch.
Correct it to document th committed version.
Message-Id: <20181126125810.9344-1-avi@scylladb.com>
This family of test_write_many_partitions_* tests writes
sstables down from memtable using different compressions.
Then it compares the resulting file with a blueprint file
and reads the data back to check everything is there.
Compression is not deterministic so this patch makes the
tests not compare resulting compressed sstable file with blueprint
file and instead only read data back.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Previously we were throwing exception during the creation of
column_translation. This wasn't always correct because sometimes
column for which the mismatch appeared was already dropped and
data present in sstable should be ignored anyway.
Fixes#3924
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"
Some queries are very unlikely to hit cache. Usually this includes
range queries on large tables, but other patterns are possible.
While the database should adapt to the query pattern, sometimes the
user has information the database does not have. By passing this
information along, the user helps the database manage its resources
more optimally.
To do this, this patch introduces a BYPASS CACHE clause to the
SELECT statement. A query thus marked will not attempt to read
from the cache, and instead will read from sstables and memtables
only. This reduces CPU time spent to query and populate the cache,
and will prevent the cache from being flooded with data that is
not likely to be read again soon. The existing cache disabled path
is engaged when the option is selected.
Tests: unit (release), manual metrics verification with ccm with and without the
BYPASS CACHE clause.
Ref #3770.
"
* tag 'cache-bypass/v2' of https://github.com/avikivity/scylla:
doc: document SELECT ... BYPASS CACHE
tests: add test for SELECT ... BYPASS CACHE
cql: add SELECT ... BYPASS CACHE clause
db: add query option to bypass cache
* tag 'perf-ffwd-dataset-population-v2' of github.com:tgrabiec/scylla:
tests: perf_fast_forward: Measure performance of dataset population
tests: perf_fast_forward: Record the dataset on which test case was run
tests: perf_fast_forward: Introduce the concept of a dataset
tests: perf_fast_forward: Introduce make_compaction_disabling_guard()
tests: perf_fast_forward: Initialize output manager before population
tests: perf_fast_forward: Handle empty test parameter set
tests: perf_fast_forward: Extract json_output_writer::write_common_test_group()
tests: perf_fast_forward: Factor out access to cfg to a single place per function
tests: perf_fast_forward: Extract result_collector
tests: perf_fast_forward: Take writes into account in AIO statistics
tests: perf_fast_forward: Reorder members
tests: perf_fast_forward: Add --sstable-format command line option
The BYPASS CACHE clause instructs the database not to read from or populate the
cache for this query. The new keywords (BYPASS and CACHE) are not reserved.
do_process_buffer had two unreachable default cases and a long
if-else-if chain.
This converts the the if-else-if chain to a switch and a helper
function.
This moves the error checking from run time to compile time. If we
were to add a 128 bit integer for example, gcc would complain about it
missing from the switch.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181125221451.106067-1-espindola@scylladb.com>
"
This new compaction approach consists of releasing exhausted fragments[1] of a run[2] a
compaction proceeds, so decreasing considerably the space requirement.
These changes will immediately benefit leveled strategy because it already works with
the run concept.
[1] fragment is a sstable composing a run; exhausted means sstable was fully consumed
by compaction procedure.
[2] run is a set of non-overlapping sstables which roughly span the
entire token range.
Note:
Last patch includes an example compaction strategy showing how to work with the interface.
unit tests: all modes passing
dtests: compaction ones passing
"
* 'sstable_run_based_compaction_v10' of github.com:raphaelsc/scylla:
tests: add example compaction strategy for sstable run based approach
sstables/compaction: propagate sstable replacement to all compaction of a CF
sstables: store cf pointer in compaction_info
tests/sstable_test: add test for compaction replacement of exhausted sstable
sstables: add sstable's on closed handling
tests/sstables: add test for sstable run based compaction
sstables/compaction_manager: prevent partial run from being selected for compaction
compaction: use same run identifier for sstables generated by same compaction
sstables: introduce sstable run
sstables/compaction_manager: release reference to exhausted sstable through callback
sstables/compaction: stop tracking exhausted input sstable in compaction_read_monitor
database: do not keep reference to sstable in selector when done selecting
compaction: share sstable set with incremental reader selector
sstables/compaction: release space earlier of exhausted input sstables
sstables: make partitioned sstable set's incremental selector resilient to changes in the set
database: do not store reference to sstable in incremental selector
tests/sstables: add run identifier correctness test
sstables: use a random uuid for sstables without run identifier
sstables: add run identifier to scylla metadata
This is needed for parallel compaction to work with sstable run based approach.
That's because regular compaction clones a set containing all sstables of its
column family. So compaction A can potentially hold a reference to a compacting
sstable of compaction B, so preventing compacting B from releasing its exhausted
sstable.
So all replacements are propagated to all compactions of a given column family,
and compactions in turn, including the one which initiated the propagation,
will do the replacement.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
motivation is that we need a more efficient way to find compactions
that belong to a given column family in compaction list.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Make sure that compaction is capable of releasing exhausted sstable space
early in the procedure.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Motivation is that it will be useful for catching regression on compaction
when releasing early exhausted sstables. That's because sstable's space
is only released once it's closed. So this will allow us to write a test
case and possibly use it for entities holding exhausted sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Filter out sstable belonging to a partial run being generated by an ongoing
compaction. Otherwise, that could lead to wrong decisions by the compaction
strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
SSTables composing the same run will share the same run identifier.
Therefore, a new compaction strategy will be able to get all sstables belong
to the same run from sstable_set, which now keeps track of existing runs.
Same UUID is passed to writers of a given compaction. Otherwise, a new UUID
is picked for every sstable created by compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
sstable run is a structure that will hold all sstables that has the same
run identifier. All sstables belonging to the same run will not overlap
with one another.
It can be used by compaction strategy to work on runs instead of individual
sstables.
sstable_set structure which holds all sstables for a given column family
will be responsible for providing to its user an interface to work with
runs instead of individual sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's important for the reference to sstable to not be kept throughout
the compaction procedure, which would break the goal of releasing
space during compaction.
Manager passes a callback to compaction which calls it whenever
there's sstable replacement.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Motivation is that we want to release space for exhausted sstable and that
will only happen when all references to it are gone *and* that backlog
tracker takes the early replacement into account.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When compacting, we'll create all readers at once and will not select
again from incremental selector, meaning the selector will keep all
respective sstables in current_sstables, preventing compaction from
releasing space as it goes on.
The change is about refreshing sstable set's selector such that it
will not hold a reference to an exhausted sstable whatsoever.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
By doing that, we'll be able to release exhausted sstable from both
simulteaneously.
That's achieved by sharing set containing input sstables with the incremental
reader selector and removing exhausted sstables from shared set when the
time has come.
Step towards reducing disk requirement for compaction by making it delete
sstable which all data is in a sealed new sstable. For that to happen,
all references must be gone.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, compaction only replace input sstables at end of compaction,
meaning compaction must be finished for all the space of those sstables
to be released.
What we can do instead is to delete earlier some input sstable under
some conditions:
1) SStable data should be committed to a new, sealed output sstable,
meaning it's exhausted.
2) Exhausted sstable mustn't overlap with a non-exhausted sstable
because a tombstone in the exhausted could have been purged and the
shadowed data in non-exhausted could be ressurected if system
crashes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The motivation is that compaction may remove a sstable from the set while the
incremental selector is alive, and for that to work, we need to invalidate
the iterators stored by the selector. We could have added a method to notify
it, but there will be a case where the one keeping the set cannot forward
the notification to the selector. So it's better for the selector to take
care of itself. Change counter approach is used which allows the selector
to know when to invalidate the iterators.
After invalidation, selector will move the iterator back into its right
place by looking for lower bound for current pos.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Use sstable generation instead to keep track of read sstables.
The motivation is that we'll not keep reference to sstables, so allowing
their space on disk to be released as soon they get exhausted.
Generation is used because it guarantees uniqueness of the sstable.
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Older sstables must have an identifier for them to be associated
with their own run.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It identifies a run which a particular sstable belongs to.
Existing sstables will have a random uuid associated with it
in memory.
UUID is the correct choice because it allows sstables to be
exported without having conflicts when using identifier generated
by different nodes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
These switches are fully covered. We can be sure they will stay this
way because of -Werror and gcc's -Wswitch warning.
We can also be sure that we never have an invalid enum value since the
state machine values are not read from disk.
The patch also removes a superfluous ';'.
Message-Id: <20181124020128.111083-1-espindola@scylladb.com>
This field is true when there's a mismatch
between column type in serialization header and
current schema.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The reason for that is that it's not available in sstable format mc,
so we can no longer rely on it in common code for the currently
supported formats.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181121170057.20900-1-raphaelsc@scylladb.com>
A dataset represents a table with data, populated in certain way, with
certain characteristics of the schema and data.
Before this change, datasets were implicitly defined, with population
hard-coded inside the populate() function.
This change gathers logic related to datasets into classes, in order to:
- make it easier to define new datasets.
- be able to measure performance of dataset population in a
standardized way.
- being able to express constraints on datasets imposed by different
test cases. Test cases are matched with possible datasets based
on the abstract interface they accept (e.g. clustered_ds,
multipartition_ds), and which must be implemented by a compatible
dataset. To facilitate this matching, test function is now wrapped
into a dataset_acceptor object, with an automatically-generated can_run()
virtual method, deduced by make_test_fn().
- be able to select tests to run based on the dataset name.
Only tests which are compatible with that dataset will be run.
Extracts the result collection and reporting logic out of
run_test_case(). Will be needed in population tests, for which we
don't want the looping logic.
This series adds a generic test for schema changes that generates
various schema and data before and after an ALTER TABLE operation. It is
then used to check correctness of mutation::upgrade() and sstable
readers and lead to the discovery of #3924 and #3925.
Fixes#3925.
* https://github.com/pdziepak/scylla.git schema-change-test/v3.1
schema_builder: make member function names less confusing
converting_mutation_partition_applier: fix collection type changes
converting_mutation_partition_applier: do not emit empty collections
sstable: use format() instead of sprint()
tests/random-utils: make functions and variables inline
tests: add models for schemas and data
tests: generate schema changes
tests/mutation: add test for schema changes
tests/sstable: add test for schema changes
for_each_schema_change() is used for testing reading an sstable that was
written with a different schema. Because of #3924, for now the mc format
is not verified this way.
This patch adds for_each_schema_change() functions which generates
schemas and data before and after some modification to the schema (e.g.
adding a column, changing its type). It can be used to test schema
upgrades.
This patch introduces a model of Scylla schemas and data, implemented
using simple standard library primitives. It can be used for testing the
actuall schemas, mutation_partitions, etc. used by the schema by
comparing the results of various actions.
The initial use case for this model was to test schema changes, but
there is no reason why in the future it cannot be extended to test other
things as well.
packer 1.3.2 no longer supported enhanced_networking directive, we need
to use new directives("sriov_support" and "ena_support") to build with
new version.
packer provides automatic configuration file fixing tool, so new
scylla.json is generated by following command:
./packer/packer fix scylla.json
Fixes#3938
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181123053719.32451-1-syuu@scylladb.com>
appending_hash is used for computing hashes that become part of the
binary interface. They cannot change between Scylla version and the same
data needs to always result in the same hash.
At the moment, appending_hash<bytes_ostream> doesn't fulfil those
requirements since it leaks information how the underlying buffer is
fragmented. Fortunately, it has no users so it doesn't casue any
compatibility issues.
Moreover, bytes_ostream is usually used as an output of some
serialisation routine (e.g. frozen_mutation_fragment or CQL response).
Those serialisation formats do not guarantee that there is a single
representation of a given data and therefore are not fit to be hashed by
appending_hash. Removing appending_hash<bytes_ostream> may help
preventing such incorrect uses.
Message-Id: <20181122163823.12759-1-pdziepak@scylladb.com>
* seastar b924495...1fbb633 (3):
> rpc: Reduce code duplication
> tests: perf: Make do_not_optimize() take the argument by const&
> doc: Fix import paths in the tutorial
The format message was using the new stlye formatting markers ("{}")
which are understood by format() but not by sprint() (the latter is
basically deprecated).
This patch changes the behaviour of the schema upgrade code so that if
all cells and the tombstons of a collection are removed during the upgrade
the collection is not emitted (as opposed to emitting an empty one).
Both behaviours are valid, but the new one makes it more consistent with
how atomic cells are upgraded and how schema upgrades work for sstable
readers.
ALTER TABLE allows changing the type of a collection to a compatible
one. This includes changes from a fixed-sized type to a variable-sized
one. If that happens the atomic_cells representing collection elements
need to be rewritten so that the value size is included. The logic for
rewritting atomic cells already exists (for those that are not
collection members) and is reused in this patch.
Fixes#3925.
Right now, schema_builder member functions have names that very poorly
convey the actions that are performed for them. This is made even worse
by some overloads which drastically change the semantics. For example:
schema_builder()
.with_column("v1", /* ... */)
.without_column("v1", removal_timestamp);
Creates a column "v1" and adds an information that there was a column
with that name that was removed at 'removal_timestamp'.
schema_builder()
.with_coulmn("v1")
.without_column(utf8_type->decompose("v1"));
This adds column "v1" and then immediately removes it.
In order to clean up this mess the names were changes so that:
* with_/without_ functions only add informations to the schema (e.g.
info that a column was removed, but without removing a column of that
name if one exists)
* functions which names start with a verb actually perform that action,
e.g. the new remove_column() removes the column (and adds information
that it used to exist) as in the second example.
Currently if hints directory contains unexpected directories Scylla fails to
start with unhandled std::invalid_argument exception. Make the manager
ignore malformed files instead and try to proceed anyway.
Message-Id: <20181121134618.29936-2-gleb@scylladb.com>
We scan hints directory in two places: to search for files to replay and
to search for directories to remove after resharding. The code that
translates directory name to a shard is duplicated. It is simple now, so
not a bit issue but in case it grows better have it in one place.
Message-Id: <20181121134618.29936-1-gleb@scylladb.com>
"
Tested with perf_fast_forward from:
github.com/tgrabiec/scylla.git perf_fast_forward-for-sst3-opt-write-v1
Using the following command line:
build/release/tests/perf/perf_fast_forward_g --populate --sstable-format=mc \
--data-directory /tmp/perf-mc --rows=10000000 -c1 -m4G \
--datasets small-part
The average reported flush throughput was (stdev for the avergages is around 4k):
- for mc before the series: 367848 frag/s
- for lc before the series: 463458 frag/s (= mc.before +25%)
- for mc after the series: 429276 frag/s (= mc.before +16%)
- for lc after the series: 466495 frag/s (= mc.before +26%)
Refs #3874.
"
* tag 'sst3-opt-write-v2' of github.com:tgrabiec/scylla:
sstables: mc: Avoid serialization of promoted index when empty
sstables: mc: Avoid double serialization of rows
tests: sstable 3.x: Do not compare Statistics component
utils: Introduce memory_data_sink
schema: Optimize column count getters
sstables: checksummed_file_data_sink_impl: Bypass output_stream
The old code was serializing the row twice. Once to get the size of
its block on disk, which is needed to write the block length, and then
to actually write the block.
This patch avoids this by serializing once into a temporary buffer and
then appending that buffer to the data file writer.
I measured about 10% improvement in memtable flush throughput with
this for the small-part dataset in perf_fast_forward.
The Statistics component recorded in the test was generated using a
buggy verion of Scylla, and is not correct. Exposed by fixing the bug
in the way statistics are generated.
Rather than comparing binary content, we should have explicit checks
for statistics.
"
Enables sstable compression with LZ4 by default, which was the
long-time behavior until a regression turned off compression by
default.
Fixes#3926
"
* 'restore-default-compression/v2' of https://github.com/duarten/scylla:
tests/cql_query_test: Assert default compression options
compress: Restore lz4 as default compressor
tests: Be explicit about absence of compression
Indicate the default scylla directories, rather than Cassandra's.
Provide links to Scylladocumentation where possible,
update links to Casandra documentation otherwise.
Clean up a few typos.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20181119141912.28830-1-bhalevy@scylladb.com>
This avoids a difference between little and big endian sytems. We
now also calculate a full murmur hash for tokens with less than 8
bytes, however in practice the token size is always 8.
Message-Id: <20181120214733.43800-1-mike.munday@ibm.com>
The boost multiprecision library that I am compiling against seems
to be missing an overload for the cast to a string. The easy
workaround seems to be to call str() directly instead.
This also fixes#3922.
Message-Id: <20181120215709.43939-1-mike.munday@ibm.com>
Fixes a regression introduced in
74758c87cd, where tables started to be
created without compression by default (before they were created with
lz4 by default).
Fixes#3926
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
* seastar d59fcef...b924495 (2):
> build: Fix protobuf generation rules
> Merge "Restructure files" from Jesse
Includes fixup patch from Jesse:
"
Update Seastar `#include`s to reflect restructure
All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
The build environment may not installed ninja-build before running
install-dependencies.sh, so do it after running the script.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181031110737.17755-1-syuu@scylladb.com>
* seastar a44cedf...d59fcef (10):
> dns: Set tcp output stream buffer size to zero explicitly
> tests: add libc-ares to travis dependencies
> tests: add dns_test to test suite
> build: drop bundled c-ares package
> prometheus: replace the instance label with an optional one
> build: Refactor C++ dialect detection
> build: add libatomic to install-depenencies.sh
> core: use std::underlying_type for open_flags
> core: introduce open_flags::operator&
> core: Fix build for `gnu++14`
Currently, when advance_and_await() fails to allocate the new gate
object, it will throw bad_alloc and leave the phased_barrier object in
an invalid state. Calling advance_and_await() again on it will result
in undefined behavior (typically SIGSEGV) beacuse _gate will be
disengaged.
One place affected by this is table::seal_active_memtable(), which
calls _flush_barrier.advance_and_await(). If this throws, subsequent
flush attempts will SIGSEGV.
This patch rearranges the code so that advance_and_await() has strong
exception guarantees.
Message-Id: <1542645562-20932-1-git-send-email-tgrabiec@scylladb.com>
In (almost) all SSTable write paths, we need to inform the monitor that
the write has failed as well. The monitor will remove the SSTable from
controller's tracking at that point.
Except there is one place where we are not doing that: streaming of big
mutations. Streaming of big mutations is an interesting use case, in
which it is done in 2 parts: if the writing of the SSTable fails right
away, then we do the correct thing.
But the SSTables are not commited at that point and the monitors are
still kept around with the SSTables until a later time, when they are
finally committed. Between those two points in time, it is possible that
the streaming code will detect a failure and manually call
fail_streaming_mutations(), which marks the SSTable for deletions. At
that point we should propagate that information to the monitor as well,
but we don't.
Fixes#3732 (hopefully)
Tests: unit (release)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181114213618.16789-1-glauber@scylladb.com>
In a homogeneous cluster this will reduce number of internal cross-shard hops
per request since RPC calls will arrive to correct shard.
Message-Id: <20181118150817.GF2062@scylladb.com>
In commit a33f0d6, we changed the way we handle arrays during the write
and parse code to avoid reactor stalls. Some potentially big loops were
transformed into futurized loops, and also some calls to vector resizes
were replaced by a reserve + push_back idiom.
The latter broke parsing of the estimated histogram. The reason being
that the vectors that are used here are already initialized internally
by the estimated_histogram object. Therefore, when we push_back, we
don't fill the array all the way from index 0, but end up with a zeroed
beginning and only push back some of the elements we need.
We could revert this array to a resize() call. After all, the reason we
are using reserve + push_back is to avoid calling the constructor member
for each element, but We don't really expect the integer specialization
to do any of that.
However, to avoid confusion with future developers that may feel tempted
to converted this as well for the sake of consistency, it is safer to
just make sure these arrays are zeroed.
Fixes#3918
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181116130853.10473-1-glauber@scylladb.com>
Remove the timeout argument to
db::view::view_builder::wait_until_built(), a test-only function to
wait until a given materialized view has finished building.
This change is motivated by the fact that some tests running on slow
environments will timeout. Instead of incrementally increasing the
timeout, remove it completely since tests are already run under an
exterior timeout.
Fixes#3920
Tests: unit release(view_build_test, view_schema_test)
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181115173902.19048-1-duarte@scylladb.com>
map_reduce_column_families_locally iterate over all tables (column
family) in a shard.
If the number of tables is big it can cause latency spikes.
This patch replaces the current loop with a do_for_each allowing
preepmtion inside the loop.
Fixes#3886
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20181115154825.23430-1-amnon@scylladb.com>
In a previous patch I fixed most TTLs in the view_complex_test.cc tests
from low numbers to 100 seconds. I missed one. This one never caused
problems in practice, but for good form, let's fix it too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181115160234.26478-1-nyh@scylladb.com>
After this patch, the Materialized Views and Secondary Index features
are considered generally-available and no longer require passing an
explicit "--experimental=on" flag to Scylla.
The "--experimental=on" flag and the db::config::check_experimental()
function remain unused, as we graduated the only two features which used
this flag. However, we leave the support for experimental features in
the code, to make it easier to add new experimental features in the future.
Another reason to leave the command-line parameter behind is so existing
scripts that still use it will not break.
Fixes#3917
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181115144456.25518-1-nyh@scylladb.com>
"
This series enables filtering support for CONTAINS restriction.
"
* 'enable_filtering_for_contains_2' of https://github.com/psarna/scylla:
tests: add CONTAINS test case to filtering tests
cql3: enable filtering for CONTAINS restriction
cql3: add is_satisfied_by(bytes_view) for CONTAINS
Several of the tests in tests/view_complex_test.cc set a cell with a
TTL, and then skip time ahead artificially with forward_jump_clocks(),
to go past the TTL time and check the cell disappeared as expected.
The TTLs chosen for these tests were arbitrary numbers - some had 3 seconds,
some 5 seconds, and some 60 seconds. The actual number doesn't matter - it
is completely artificial (we move the clock with forward_jump_clocks() and
never really wait for that amount of time) and could very well be a million
seconds. But *low* numbers, like the 3 seconds, present a problem on extremely
overcomitted test machines. Our eventually() function already allows for
the possibility that things can hang for up to 8 seconds, but with a 3 second
TTL, we can find ourselves with data being expired and the test failing just
after 3 seconds of wall time have passed - while the test intended that the
dataq will expire only when we explicitly call forward_jump_clocks().
So this patch changes all the TTLs in this test to be the same high number -
100 seconds. This hopefully fixes#3918.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181115125607.22647-1-nyh@scylladb.com>
With the use of Docker image, some extra options needed to be exposed
to provide extended functionality when starting the image. The flags
added by this commit are:
- cluster-name: name of the Scylla cluster. cluster_name option in
scylla.yaml.
- rpc-address: IP address for client connections (CQL). rpc_address
option in scylla.yaml.
- endpoint-snitch: The snitch used to discover the cluster topology.
endpoint_snitch option in scylla.yaml.
- replace-address-first-boot: Replace a Scylla node by its IP.
replace_address_first_boot option in scylla.yaml.
Signed-off-by: Yannis Zarkadas <yanniszarkadas@gmail.com>
[ penberg@scylladb.com: fix up merge conflicts ]
Message-Id: <20181108234212.19969-2-yanniszarkadas@gmail.com>
test.py:26:1: F401 'signal' imported but unused
test.py:27:1: F401 'shlex' imported but unused
test.py:28:1: F401 'threading' imported but unused
test.py:173:1: E305 expected 2 blank lines after class or function definition,
found 1
test.py:181:34: E241 multiple spaces after ','
test.py:183:34: E241 multiple spaces after ','
test.py:209:24: E222 multiple spaces after operator
test.py:240:5: E301 expected 1 blank line, found 0
test.py:249:23: W504 line break after binary operator
test.py:254:9: E306 expected 1 blank line before a nested definition, found 0
test.py:263:13: F841 local variable 'out' is assigned to but never used
test.py:264:33: E128 continuation line under-indented for visual indent
test.py:265:33: E128 continuation line under-indented for visual indent
test.py:266:33: E128 continuation line under-indented for visual indent
test.py:274:64: F821 undefined name 'e'
test.py:278:53: F821 undefined name 'e'
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20181104115255.22547-1-ultrabug@gentoo.org>
gen_segmented_compress_params.py:52:47: E226 missing whitespace around
arithmetic operator
gen_segmented_compress_params.py:56:64: E226 missing whitespace around
arithmetic operator
gen_segmented_compress_params.py:60:36: E226 missing whitespace around
arithmetic operator
gen_segmented_compress_params.py:60:48: E226 missing whitespace around
arithmetic operator
gen_segmented_compress_params.py:70:35: E226 missing whitespace around
arithmetic operator
gen_segmented_compress_params.py:70:48: E226 missing whitespace around
arithmetic operator
gen_segmented_compress_params.py:99:43: E226 missing whitespace around
arithmetic operator
gen_segmented_compress_params.py:106:18: E225 missing whitespace around
operator
gen_segmented_compress_params.py:120:5: E303 too many blank lines (2)
gen_segmented_compress_params.py:200:30: E261 at least two spaces before
inline comment
gen_segmented_compress_params.py:200:31: E262 inline comment should start with
'# '
gen_segmented_compress_params.py:218:76: E261 at least two spaces before
inline comment
gen_segmented_compress_params.py:219:59: E703 statement ends with a semicolon
gen_segmented_compress_params.py:219:60: E261 at least two spaces before
inline comment
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20181104115753.4701-1-ultrabug@gentoo.org>
fix_system_distributed_tables.py:28:20: E203 whitespace before ':'
fix_system_distributed_tables.py:29:20: E203 whitespace before ':'
fix_system_distributed_tables.py:30:20: E203 whitespace before ':'
fix_system_distributed_tables.py:31:20: E203 whitespace before ':'
fix_system_distributed_tables.py:33:20: E203 whitespace before ':'
fix_system_distributed_tables.py:34:23: E203 whitespace before ':'
fix_system_distributed_tables.py:35:23: E203 whitespace before ':'
fix_system_distributed_tables.py:39:20: E203 whitespace before ':'
fix_system_distributed_tables.py:40:20: E203 whitespace before ':'
fix_system_distributed_tables.py:41:20: E203 whitespace before ':'
fix_system_distributed_tables.py:42:20: E203 whitespace before ':'
fix_system_distributed_tables.py:43:20: E203 whitespace before ':'
fix_system_distributed_tables.py:44:20: E203 whitespace before ':'
fix_system_distributed_tables.py:45:20: E203 whitespace before ':'
fix_system_distributed_tables.py:46:20: E203 whitespace before ':'
fix_system_distributed_tables.py:47:20: E203 whitespace before ':'
fix_system_distributed_tables.py:48:20: E203 whitespace before ':'
fix_system_distributed_tables.py:52:20: E203 whitespace before ':'
fix_system_distributed_tables.py:53:20: E203 whitespace before ':'
fix_system_distributed_tables.py:54:20: E203 whitespace before ':'
fix_system_distributed_tables.py:55:20: E203 whitespace before ':'
fix_system_distributed_tables.py:56:20: E203 whitespace before ':'
fix_system_distributed_tables.py:57:20: E203 whitespace before ':'
fix_system_distributed_tables.py:58:20: E203 whitespace before ':'
fix_system_distributed_tables.py:59:20: E203 whitespace before ':'
fix_system_distributed_tables.py:60:20: E203 whitespace before ':'
fix_system_distributed_tables.py:61:20: E203 whitespace before ':'
fix_system_distributed_tables.py:62:20: E203 whitespace before ':'
fix_system_distributed_tables.py:66:19: E203 whitespace before ':'
fix_system_distributed_tables.py:67:19: E203 whitespace before ':'
fix_system_distributed_tables.py:72:19: E203 whitespace before ':'
fix_system_distributed_tables.py:73:19: E203 whitespace before ':'
fix_system_distributed_tables.py:74:19: E203 whitespace before ':'
fix_system_distributed_tables.py:78:19: E203 whitespace before ':'
fix_system_distributed_tables.py:79:19: E203 whitespace before ':'
fix_system_distributed_tables.py:80:19: E203 whitespace before ':'
fix_system_distributed_tables.py:84:19: E203 whitespace before ':'
fix_system_distributed_tables.py:85:19: E203 whitespace before ':'
fix_system_distributed_tables.py:89:19: E203 whitespace before ':'
fix_system_distributed_tables.py:90:19: E203 whitespace before ':'
fix_system_distributed_tables.py:91:19: E203 whitespace before ':'
fix_system_distributed_tables.py:95:22: E203 whitespace before ':'
fix_system_distributed_tables.py:96:22: E203 whitespace before ':'
fix_system_distributed_tables.py:99:1: E302 expected 2 blank lines, found 0
fix_system_distributed_tables.py:103:72: E201 whitespace after '['
fix_system_distributed_tables.py:103:82: E202 whitespace before ']'
fix_system_distributed_tables.py:105:43: E201 whitespace after '['
fix_system_distributed_tables.py:105:53: E202 whitespace before ']'
fix_system_distributed_tables.py:111:16: E713 test for membership should be
'not in'
fix_system_distributed_tables.py:118:20: E713 test for membership should be
'not in'
fix_system_distributed_tables.py:135:25: E722 do not use bare 'except'
fix_system_distributed_tables.py:138:5: E722 do not use bare 'except'
fix_system_distributed_tables.py:144:1: E305 expected 2 blank lines after
class or function definition, found 0
fix_system_distributed_tables.py:145:47: E251 unexpected spaces around keyword
/ parameter equals
fix_system_distributed_tables.py:145:49: E251 unexpected spaces around keyword
/ parameter equals
fix_system_distributed_tables.py:160:1: W391 blank line at end of file
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20181104113001.22783-1-ultrabug@gentoo.org>
dist/docker/redhat/docker-entrypoint.py:20:1: E722 do not use bare 'except'
dist/docker/redhat/commandlineparser.py:13:13: E128 continuation line
under-indented for visual indent
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20181104120134.9598-1-ultrabug@gentoo.org>
With the number of unit tests approaching one hundred, the output of
test.py becomes more challenging to read.
If some test fails, we will only get the details after all the tests
complete, but some tests take way longer than others.
With the coloured status, it is much simpler to immediately locate
failing tests. Developer can cancel others and repeat the failing
ones.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <63a99a2fb70fdc33fd6eeb8e18fee977a47bd278.1541541184.git.vladimir@scylladb.com>
"
This series adds DEFAULT UNSET and DEFAULT NULL keyword support
to INSERT JSON statement, as stated in #3909.
Tests: unit (release)
"
* 'add_json_default_unset_2' of https://github.com/psarna/scylla:
tests: add DEFAULT UNSET case to JSON cql tests
tests: split JSON part of cql query test
cql3: add DEFAULT UNSET to INSERT JSON
When inserting a JSON, additional DEFAULT UNSET or DEFAULT NULL
keywords can be appended.
With DEFAULT UNSET, values omitted in JSON will not be changed
at all. With DEFAULT NULL (default), omitted values will be
treated as having a 'null' value.
Fixes#3909
"
Fix for #3897
"Ec2MultiRegionSnitch: prints a cryptic error when a Public IP is not
available"
Ec2MultiRegionSnitch naturally requires a Public IP to be available and
therefore it's expected to refuse to work without it.
However the error message that is printed today is a total disaster and
has to be fixed ASAP to be something much more human readable.
This series adds a human readable preabmle that will let a poor user
understand what should he/she do.
"
* 'improve-ec2-multi-region-snitch-error-message-when-pulic-address-is-not-available-v2' of https://github.com/vladzcloudius/scylla:
locator: ec2_multi_region_snitch::start(): print a human readable error if Public IP may not be retrieved
locator: ec2_multi_region_snitch::start(): rework on top of seastar::thread
During streaming, there are cases when we should invoke the view write
path. In particular, if we're streaming because of repair or if a view
has not yet finished building and we're bootstrapping a new node.
The design constraints are:
1) The streamed writes should be visible to new writes, but the
sstable should not participate in compaction, or we would lose the
ability to exclude the streamed writes on a restart;
2) The streamed writes must not be considered when generating view
updates for them;
3) Resilient to node restarts;
4) Resilient to concurrent stream sessions, possibly streaming mutations for overlapping ranges.
We achieve this by writing the streamed writes to an sstable in a
different folder, call it "staging". We achieve 1) by publishing the
sstable to the column family sstable set, but excluding it from
compactions. We do these steps upon boot, by looking at the staging
directory, thus achieving 3).
Fixes#3275
* 'streaming_view_to_staging_sstables_9' of https://github.com/psarna/scylla: (29 commits)
tests: add materialized views test
tests: add view update generator to cql test env
main: add registering staging sstables read from disk
database: add a check if loaded sstable is already staging
database: add get_staging_sstable method
streaming: stream tables with views through staging sstables
streaming: add system distributed keyspace ref to streaming
streaming: add view update generator reference to streaming
main: add generating missed mv updates from staging sstables
storage_service: move initializing sys_dist_ks before bootstrap
db/view: add view_update_from_staging_generator service
db/view: add view updating consumer
table: add stream_view_replica_updates
table: split push_view_replica_updates
table: add as_mutation_source_excluding
table: move push_view_replica_updates to table.cc
database: add populating tables with staging sstables
database: add creating /staging directory for sstables
database: add sstable-excluding reader
table: add move_sstable_from_staging_in_thread function
...
Right now materialized_views_test.cc contains view updating tests,
but the intention is to move mv-related tests from cql_query_test
here and use it for all future unit testing of MV.
Staging sstables are loaded before regular ones. If the process
fails midway, an sstable can be linked both in the regular directory
and in staging directory. In such cases, the sstable remains
in staging and will be moved to the regular directory
by view update streamer service.
This method can be used to check if sstable is staging,
i.e. it shouldn't be compacted and it will not be used
for generating view updates from other staging tables,
and return proper shared_sstable pointer if it is.
While streaming to a table with paired views, staging sstables
are used. After the table is written to disk, it's used to generate
all required view updates. It's also resistant to restarts as it's
stored on a hard drive in staging/ directory.
Refs #3275
Bootstrapping process may need system distributed keyspace
to generate view updates, so initializing sys_dist_ks
is moved before the bootstrapping process is launched.
When generating view updates from a staging sstable, this sstable
should not be used in the process. Hence, a reader that skips a single
sstable is added.
After materialized view updates are generated, the sstable
should be moved from staging/ to a regular directory.
It's expected to be called from seastar::async thread context.
When moving sstables between directories, this helper function
will create links and update generation and dir accordingly.
It's expected to be called in thread context.
Staging sstables are not part of the compaction process to ensure
than each sstable can be easily excluded from view generation process
that depends on the mentioned sstable.
"
It appears that in case when there are any static columns in serialization header,
Cassandra would write a (possibly empty) static row to every partition
in the SSTables file.
This patchset alings Scylla's logic with that of Cassandra.
Note that Scylla optimizes the case when no partition contains a static
row because it keeps track of updated columns that Scylla currently does
not do - see #3901 for details.
Fixes#3900.
"
* 'projects/sstables-30/write-all-static-rows/v1' of https://github.com/argenet/scylla:
tests: Test writing empty static rows for partitions in tables with static columns.
sstables: Ignore empty static rows on reading.
sstables: Write empty static rows when there are static columns in the table.
MC format lacks ancestors metadata, so we need to workaround it by using
ancestors in metadata collector, which is only available for a sstable
written during this instance. It works fine here because we only want
to know if a sstable recently compacted has an ancestor which wasn't
yet deleted.
Fixes#3852.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <20181102154951.22950-1-raphaelsc@scylladb.com>
Public IP is required for Ec2MultiRegionSnitch. If it's not available
different snitch should be used.
This patch would result in a readable error message to be printed
instead of just a cryptic message with HTTP response body.
Fixes#3897
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"
The update to libfmt 5.2.1 brought with it a subtle change - calls to
sprint("%s", 3) now throw a format_error instead of returning "3". To
prevent such hidden (or not so hidden) bugs from lurking, convert all calls
to the modern fmt syntax.
Such conversion has several benefits:
- prevent the bug from biting us
- as fmt is being standardized, we can later move to std::format()
- commonality with the logger format syntax (indeed, we may move the logger
to use libfmt itself)
During the conversion, some bugs were caught and fixed. These are presented in
individual patches in the patchset.
Most of the conversion was scripted, using https://github.com/avikivity/unsprint.
Some sprint() calls remain, as they were too complex for the script. They
will be converted later.
"
* tag 'fmt-1/v1' of https://github.com/avikivity/scylla:
toplevel: convert sprint() to format()
repair: convert sprint() to format()
tests: convert sprint() to format()
tracing: convert sprint() to format()
service: convert sprint() to format()
exceptions: convert sprint() to format()
index: convert sprint() to format()
streaming: convert sprint() to format()
streaming: progress_info: fix format string
api: convert sprint() to format()
dht: convert sprint() to format()
thrift: convert sprint() to format()
locator: convert sprint() to format()
gms: convert sprint() to format()
db: convert sprint() to format()
transport: convert sprint() to format()
utils: convert sprint() to format()
sstables: convert sprint() to format()
auth: convert sprint() to format()
cql3: convert sprint() to format()
row_cache: fix bad format string syntax
repair: fix bad format string syntax
tests: fix bad format string syntax
dht: fix bad format string syntax
sstables: fix bad format string syntax
utils: estimated_histogram: convert generated format strings to fmt
tests: perf_fast_forward: rename "format" variable
tests: perf_fast_forward: massage result of sprint() into std::string
utils: i_filter: rename "format" variable
system_keyspace: simplify complicated sprint()
cql: convert Cql.g sprint()s to fmt
types: get rid of PRId64 formatting
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() returns std::string(), but the new format() returns an sstring. Usually
an sstring is wanted but in this case an sstring will fail as it is added to
an std::string.
Fix the failure (after spring->format conversion) by converting to an std::string.
"
Use perftune.py for tuning disks:
- Distribute/pin disks' IRQs:
- For NVMe drives: evenly among all present CPUs.
- For non-NVMe drives: according to chosen tuning mode.
- For all disks used by scylla:
- Tune nomerges
- Tune I/O scheduler.
It's important to tune NIC and disks together in order to keep IRQ
pinning in the same mode.
Disk are detected and tuned based on the current content of
/etc/scylla/scylla.yaml configuration file.
"
Fixes#3831.
* 'use_perftune_for_disks-v3' of https://github.com/vladzcloudius/scylla:
dist: change the sysconfig parameter name to reflect the new semantics
scylla_util.py::sysconfig_parser: introduce has_option()
dist: scylla_setup and scylla_sysconfig_setup: change paremeters names to reflect new semantics
dist: don't distribute posix_net_conf.sh any more
dist: use perftune.py to tune disks and NIC
* seastar c1e0e5d...c02150e (5):
> prometheus: pass names as query parameter instead of part of the URL
> treewide: convert printf() style formatting to fmt
> print: add fmt_print()
> build: Remove experimental CMake support
> Merge "Correct and clean-up `signal_test`" from Jesse
It achieves 2.0x speedup on intel E5 and 1.1x to 2.5x speedup on
various arm64 microarchitectures.
The algorithm cuts data into blocks of 1024 bytes and calculates crc
for each block, which is furthur divided into three subblocks of 336
bytes(42 uint64) each, and 16 remaining bytes(2 uint64).
For each iteration, three independent crc are caculated for one uint64
from each subgroup. It increases IPC(instructions per cycle) much.
After subblocks are done, three crc and remaining two uint64 are
combined using carry-less multiplication to reach the final result
for one block of 1024 bytes.
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Message-Id: <1541042759-24767-1-git-send-email-yibo.cai@arm.com>
We tune NIC and disks together now. Change the sysconfig parameter to
reflect this new semantics.
However if we detect an old parameter name in the scylla-server we would
still update it thereby keeping the support for old installations.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Change the name of the corresponding parameter (--setup-nic) to reflect
the fact that we tune not just NIC now but rather NIC and disks together.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Tune disks using perftune.py together with NIC.
This is needed because disk(s) and NIC tuning has to be
performed using the mode (for non-NVMe disks).
We tune disks based on the current content of /etc/scylla/scylla.yaml.
Don't use scylla-blocktune for optimizing disks' performance
any more.
Unite the decision to optimize the NIC and disks tuning.
Optimize or not optimize them both together.
Disable disk tuning for DPDK and "virtio" modes for now.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This patchset addresses two issues with static rows support in SSTables
3.x. ('mc' format):
1. Since collections are allowed in static rows, we need to check for
complex deletion, set corresponding flag and write tombstones, if any.
2. Column indices need to be partitioned for static columns the same way
they are partitioned for regular ones.
* github.com/argenet/scylla.git projects/sstables-30/columns-proper-order-followup/v1:
sstables: Partition static columns by atomicity when reading/writing
SSTables 3.x.
sstables: Use std::reference_wrapper<> instead of a helper structure.
sstables: Check for complex deletion when writing static rows.
tests: Add/fix comments to
test_write_interleaved_atomic_and_collection_columns.
tests: Add test covering inverleaved atomic and collection cells in
static row.
It is possible to have collections in a static row so we need to check
for collection-wide tombstones like with clustering rows.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Collections are permitted in static rows so same partitioning as for
regular columns is required.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Current scylla.spec fails build on Fedora 27, since python2-pystache is
new package name that renamed on Fedora 28.
But Fedora 28's python2-pystache has tag "Provides: pystache",
so we can depends on old package name, this way we can build scylla.spec both
on Fedora 27/28.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181028175450.31156-1-syuu@scylladb.com>
When a node reshards (i.e., restarts with a different number of CPUs), and
is in the middle of building a view for a pre-existing table, the view
building needs to find the right token from which to start building on all
shards. We ran the same code on all shards, hoping they would all make
the same decision on which token to continue. But in some cases, one
shard might make the decision, start building, and make progress -
all before a second shard goes to make the decision, which will now
be different.
This resulted, in some rare cases, in the new materialized view missing
a few rows when the build was interrupted with a resharding.
The fix is to add the missing synchronization: All shards should make
the same decision on whether and how to reshard - and only then should
start building the view.
Fixes#3890Fixes#3452
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181028140549.21200-1-nyh@scylladb.com>
"
In Cassandra, row columns are stored in a BTree that uses the following
ordering on them:
- all atomic columns go first, then all multi-cell ones
- columns of both types (atomic and multi-cell) are
lexicographically ordered by name regarding each other
Scylla needs to store columns and their respective indices using the
same ordering as well as when reading them back.
Fixes#3853
Tests: unit {release}
+
Checked that the following SSTables are dumped fine using Cassandra's
sstabledump:
cqlsh:sst3> CREATE TABLE atomic_and_collection3 ( pk int, ck int, rc1 text, rc2 list<text>, rc3 text, rc4 list<text>, rc5 text, rc6 list<text>, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''};
cqlsh:sst3> INSERT INTO atomic_and_collection3 (pk, ck, rc1, rc4, rc5) VALUES (0, 0, 'hello', ['beautiful','world'], 'here');
<< flush >>
sstabledump:
[
{
"partition" : {
"key" : [ "0" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 96,
"clustering" : [ 0 ],
"liveness_info" : { "tstamp" : "1540599270139464" },
"cells" : [
{ "name" : "rc1", "value" : "hello" },
{ "name" : "rc5", "value" : "here" },
{ "name" : "rc4", "deletion_info" : { "marked_deleted" : "1540599270139463", "local_delete_time" : "1540599270" } },
{ "name" : "rc4", "path" : [ "45e22cb0-d97d-11e8-9f07-000000000000" ], "value" : "beautiful" },
{ "name" : "rc4", "path" : [ "45e22cb1-d97d-11e8-9f07-000000000000" ], "value" : "world" }
]
}
]
}
]
"
* 'projects/sstables-30/columns-proper-order/v1' of https://github.com/argenet/scylla:
tests: Test interleaved atomic and multi-cell columns written to SSTables 3.x.
sstables: Re-order columns (atomic first, then collections) for SSTables 3.x.
sstables: Use a compound structure for storing information used for reading columns.
In Cassandra, row columns are stored in a BTree that uses the following
ordering on them:
- all atomic columns go first, then all multi-cell ones
- columns of both types (atomic and multi-cell) are
lexicographically ordered by name regarding each other
Since schema already has all columns lexicographically sorted by name,
we only need to stably partition them by atomicity for that.
Fixes#3853
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This representation makes it easier to operate with compound structures
instead of separate values that were stored in multiple containers.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Before this fix, write_missing_columns() helper would always deal with
regular columns even when writing static rows.
This would cause errors on reading those files.
Now, the missing columns are written correctly for regular and static
rows alike.
* github.com/argenet/scylla.git projects/sstables-30/fix-writing-static-missing-columns/v1:
schema: Add helper method returning the count of columns of specified
kind.
sstables: Honour the column kind when writing missing columns in 'mc'
format.
tests: Add test for a static row with missing columns (SStables 3.x.).
Previously, we've been writing the wrong missing columns indices for
static rows because write_missing_columns() explicitly used regular
columns internally.
Now, it takes the proper column kind into account.
Fixes#3892
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
* seastar d152f2d...c1e0e5d (6):
> scripts: perftune.py: properly merge parameters from the command line and the configuration file
> fmt: update to 5.2.1
> io_queue: only increment statistics when request is admitted
> Adds `read_first_line.cc` and `read_first_line.hh` to CMake.
> fstream: remove default extent allocation hint
> core/semaphore: Change the access of semaphore_units main ctor
Due to a compile-time fight between fmt and boost::multiprecision, a
lexical_cast was added to mediate.
sprint("%s", var) no longer accepts numeric values, so some sprint()s were
converted to format() calls. Since more may be lurking we'll need to remove
all sprint() calls.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
This patchset adddresses two problems with shadowable deletions handling
in SSTables 3.x. ('mc' format).
Firstly, we previously did not set a flag indicating the presence of
extended flags byte with HAS_SHADOWABLE_DELETION bitmask on writing.
This would break subsequent reading and cause all types of failures up
to crash.
Secondly, when reading rows with this extended flag set, we need to
preserve that information and create a shadowable_tombstone for the row.
Tests: unit {release}
+
Verified manually with 'hexdump' and using modified 'sstabledump' that
second (shadowable) tombstone is written for MV tables by Scylla.
+
DTest (materialized_views_test.py:TestMaterializedViews.hundred_mv_concurrent_test)
that originally failed due to this issue has successfully passed locally.
"
* 'projects/sstables-30/shadowable-deletion/v4' of https://github.com/argenet/scylla:
tests: Add tests writing both regular and shadowable tombstones to SSTables 3.x.
tests: Add test covering writing and reading a shadowable tombstone with SSTables 3.x.
sstables: Support Scylla-specific extension for writing shadowable tombstones.
sstables: Introduce a feature for shadowable tombstones in Scylla.db.
memtable: Track regular and shadowable tombstones separately in encoding_stats_collector.
sstables: Error out when reading SSTables 3.x with Cassandra shadowable deletion.
sstables: Support checking row extension flags for Cassandra shadowable deletion.
Even when we're using a full clustering range, need_skip() will return
true when we start a new partition and advance_context() will be
called with position_in_partition::before_all_clustered_rows(). We
should detect that there is no need to skip to that position before
the call to advance_to(*_current_partition_key), which will read the
index page.
Fixes#3868.
Message-Id: <1539881775-8578-1-git-send-email-tgrabiec@scylladb.com>
"
This patchset adds support generating .rpm/.deb from relocatable
package.
"
* 'reloc_rpmdeb_v5' of https://github.com/syuu1228/scylla:
configure.py: run create-relocatable-package.py everytime
configure.py: add SCYLLA-RELEASE-FILE/SCYLLA-VERSION-FILE targets
configure.py: use {mode} instead of $mode on scylla-package.tar.gz build target
dist/ami: build relocatable .rpm when --localrpm specified
dist/debian: use relocatable package to produce .deb
dist/redhat: use relocatable package to produce .rpm
install-dependencies.sh: add libsystemd as dependencies
install.sh: drop hardcoded distribution name, add --target option to specify distribution
build: add script to build relocatable package
build: compress relocatable package
build: add files on relocatable package to support generating .rpm/.deb
Right now we don't have dependencies for dist/, ninja not able to detect
changes under the directory.
To update relocatable package even only change is under dist/, we need
to run create-relocatable-package.py everytime.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
To re-generate scylla version files when it removed, since these files
required for relocatable package.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Since debian packaging system requires source package to compress tar
file, so let's use .gz compression.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
After the new in-memory representation of cells was introduced there was
a regression in atomic_cell_or_collection::operator<< which stopped
printing the content of the cell. This makes debugging more incovenient
are time-consuming. This patch fixes the problem. Schema is propagated
to the atomic_cell_or_collection printer and the full content of the
cell is printed.
Fixes#3571.
Message-Id: <20181024095413.10736-1-pdziepak@scylladb.com>
Limit message size according to the configuration, to avoid a huge message from
allocating all of the server's memory.
We also need to limit memory used in aggregate by thrift, but that is left to
another patch.
Fixes#3878.
Message-Id: <20181024081042.13067-1-avi@scylladb.com>
Compaction mode fails if more than one shard is used because it doesn't
make sure sstables used as input for compaction only contain local keys.
Therefore, sstable generated by compaction has less keys than expected
because non-local keys are purged out.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181022225153.12029-1-raphaelsc@scylladb.com>
It is very useful for investigations in scylla issues, and we have
been moving those scripts manually when needed. Make it officially
part of the scylla package.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181023184400.23187-1-glauber@scylladb.com>
scyllatop uses a log file, if opening the file fails, the user should
get a clear response not an exception trace.
The same is true for connecting to scylla
After this patch the following:
$ scyllatop.py -L /usr/lib/scyllatop.log
scyllatop failed opening log file: '/usr/lib/scyllatop.log' With an error: [Errno 13] Permission denied: '/usr/lib/scyllatop.log'
Fixes#3860
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20181021065525.22749-1-amnon@scylladb.com>
The original SSTables 'mc' format, as defined in Cassandra, does not provide
a way to store shadowable deletion in addition to regular row deletion
for materialized views.
It is essential to store it because of known corner-case issues that
otherwise appear.
For this to work, we introduce a Scylla-specific extended flag to be set
in SSTables in 'mc' format that indicates a shadowable tombstone is
written after the regular row tombstone.
This is deemed to be safe because shadowable tombstones are specific to
materialized views and MV tables are not supposed to be imported or
exported.
Note that a shadowable tombstone can be written without a regular
tombstone as well as along with it.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This is used to indicate that the SSTables being read may contain a
Scylla-specific HAS_SCYLLA_SHADOWABLE_TOMBSTONE extended flag set.
If feature is not disabled, we should not honour this flag.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This flag can be only used in MV tables that are not supposed to be
imported to Scylla.
Since Scylla representation of shadowable tombstones differs from that
of Cassandra, such SSTables are rejected on read and Scylla never sets
this flag on writing.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"
Hinted handoff should not overpower regular flows like READs, WRITEs or
background activities like memtable flushes or compactions.
In order to achieve this put its sending in the STEAMING CPU scheduling
group and its commitlog object into the STREAMING I/O scheduling group.
Fixes#3817
"
* 'hinted_handoff_scheduling_groups-v2' of https://github.com/vladzcloudius/scylla:
db::hints::manager: use "streaming" I/O scheduling class for reads
commitlog::read_log_file(): set the a read I/O priority class explicitly
db::hints::manager: add hints sender to the "streaming" CPU scheduling group
Commit 1d34ef38a8 "cql3: make pagers use
time_point instead of duration" has unintentionally altered the timeout
semantics for aggregate queries. Such requests fetch multiple pages before
sending a response to the client. Originally, each of those fetches had
a timeout-duration to finish, after the problematic commit the whole
request needs to complete in a single timeout-duration. This,
unsurprisingly, makes some queries that were successful before fail with
a timeout. This patch restores the original behaviour.
Fixes#3877.
Message-Id: <20181022125318.4384-1-pdziepak@scylladb.com>
his series attempts to make fragments per second results reported by
perf_fast_forward more stable. That includes running each test case
multiple time and reporting median, median average deviation, maximum
and minimum value. That should allow to relatively easily assess how
repeatable the presented results are. Moreover, since perf_fast_forward
does IO operation it is important that they do not introduce any
excessive noise to the results. The location of the data directory
is made configurable so that the user can choose less noisy disk or a
ramdisk.
* github.com/pdziepak/scylla.git stabilise-perf_fast_forward/v3:
tests/perf_fast_forward: make fragments/s measurements more stable
tests/perf_fast_forward: make data directory location configurable
network_topology_strategy test creates a ring with hundreds of tokens (and one
token per node). Then, for each token, it calls get_primary_ranges(), which in
turn walks the token ring. However, because the each datacenter occupies a
disjoint token range, this walk practically has to walk the entire ring until
it collects enough endpoints for each datacenter. The whole thing takes 15 minutes.
Speed this up by randomizing the token<->dc relationship. This is more realistic,
and switches the algorithm to be O(token count), and now it completes in less
than a minute (still not great, but better).
Message-Id: <20181022154026.19618-1-avi@scylladb.com>
perf_fast_forward populates perf_fast_forward_output with some data and
then runs performance tests that read it. That makes the disk a
significant factor in the final result and may make the results less
repeatable. This patch adds a flag that allows setting the location
of the data directory so that the user can opt for a less noisy disk
or a ramdisk.
perf_fast_forward performs various operations, many of which involve
sstable reads and verifies the metrics that there weren't any
unnecessary IO operations. It also provides fragments per seconds
measurements for the tests it runs. However, since some of the tests are
very short and involve IO those values vary a lot what makes them not
very useful.
This commit attempts to stabilise those results. Each test case is run
multiple time (by default for a second, but at least 3 times) and shows
median, median absolute deviation, maximum and minimum value. This
should allow assessing whether the changes in the results are just noise
or a real regression or improvement.
Currently, restricting_mutation_reader::fill_buffer justs reads
lower-layer reader's fragments one by one without doing any further
transformations. This change just swaps the parent-child buffers in a
single step, as suggested in #3604, and, hence, removing any possible
per-fragment overhead.
I couldn't find any test that exercises restricting_mutation_reader as
a mutation source, so I added test_restricted_reader_as_mutation_source
in mutation_reader_test.
Tests: unit (release), though these 4 tests are failing regardless of
my changes (they fail on master for me as well): snitch_reset_test,
sstable_mutation_test, sstable_test, sstable_3_x_test.
Fixes: #3604
Signed-off-by: George Kollias <georgioskollias@gmail.com>
Message-Id: <1540052861-621-1-git-send-email-georgioskollias@gmail.com>
"
This patchset fixes#3803. When a select statement with filtering
is executed and the column that is needed for the filtering is not
present in the select clause, rows that should have been filtered out
according to this column will still be present in the result set.
Tests:
1. The testcase from the issue.
2. Unit tests (release) including the
newly added test from this patchset.
"
* 'issues/3803/v10' of https://github.com/eliransin/scylla:
unit test: add test for filtering queries without the filtered column
cql3 unit test: add assertion for the number of serialized columns
cql3: ensure retrieval of columns for filtering
cql3: refactor find_idx to be part of statement restrictions object
cql3: add prefix size common functionality to all clustering restrictions
cql3: rename selection metadata manipulation functions
Test the usecase where the column that the filtering operates on
is not a part of the select clause. The expected result is a set
containing the columns of the select clause with the additional
columns for filtering marked as non serializable.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
The result sets that the assertions are performed against
are result sets before serialization to the user and therefore
contain also columns that will not be serialized and sent as
the query's final result. The patch adds an assertion on the
number of columns that will be present in the final serialized
result.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
When a query that needs filtering is executed, the columns
that the coordinator is filtering by have to be retrieved.The
columns should be retrieved even if they are not used for
ordering or named in the actual select clause.
If the columns are missing from the result set, then any
filtering that restricts the missing column will not take
place.
Fixes#3803
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
find_idx calculates the index that will be used in the statement if
indexes are to be used. In the static form it requires redundant
information (the schema is already contained within the statement
restrictions object). In addition find_idx will need to be used for
filtering in order not to include redundant selectors in the selection
objects. This change refactors find_idx to run under the statement
restrictions object and changes it's scope from private to public.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Up untill now, knowing the prefix size, which is used to determine
if a filtering is needed was implemented only for a single column
clustering restrictions. The patch adds a function to calculate the
prefix size for all types of clustering key restrictions given the
schema.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Every call of a tracing::global_trace_state_ptr object instead of a
tracing::tracing_state_ptr or a call to tracing::global_trace_state_ptr::get()
creates a new tracing session (span) object.
This should never be done unless query handling moves to a different shard.
Fixes#3862
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20181018003500.10030-1-vladz@scylladb.com>
Previously we were making assumptions about missing columns
(the size of its value, whether it's a collection or a counter) but
they didn't have to be always true. Now we're using column type
from serialization header to use the right values.
Fixes#3859
* seastar-dev.git haaawk/projects/sstables-30/handling-dropped-columns/v4:
sstables 3: Correctly handle dropped columns in column_translation
sstables 3: Add test for dropped columns handling
"
Refs #3828
(Probably fixes it)
We found a few flaws in a way we enable hints replaying.
First of all it was allowed before manager::start() is complete.
Then, since manager::start() is called after messaging_service is
initialized there was a time window when hints are rejected and this
creates an issue for MV.
Both issues above were found in the context of #3828.
This series fixes them both.
Tested {release}:
dtest: materialized_views_test.py:TestMaterializedViews.write_to_hinted_handoff_for_views_test
dtest: hintedhandoff_additional_test.py
"
* 'hinted_handoff_dont_create_hints_until_started-v1' of https://github.com/vladzcloudius/scylla:
hinted handoff: enable storing hints before starting messaging_service
db::hints::manager: add a "started" state
db::hints::manager: introduce a _state
get_ranges() is supposed to return ranges in sorted order. However, a35136533d
broke this and returned the range that was supposed to be last in the second
position (e.g. [0, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9]). The broke cleanup, which
relied on the sort order to perform a binary search. Other users of the
get_ranges() family did not rely on the sort order.
Fixes#3872.
Message-Id: <20181019113613.1895-1-avi@scylladb.com>
When messaging_service is started we may immediately receive a mutation
from another node (e.g. in the MV update context). If hinted handoff is not
ready to store hints at that point we may fail some of MV updates.
We are going to resolve this by start()ing hints::managers before we
start messaging_service and blocking hints replaying until all relevant
objects are initialized.
Refs #3828
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Hinting is allowed after "started" before "stopping".
Hints that attempted to be stored outside this time frame are going to
be dropped.
Refs #3828
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Introduce a multi-bit state field. In this patch it replaces the _stopping
boolean. We are going to add more states in the following patches.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Previously we were making assumptions about missing columns
(the size of its value, whether it's a collection or a counter) but
they didn't have to be always true. Now we're using column type
from serialization header to use the right values.
Fixes#3859
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
In the past the addition of non serializable columns was being used
only for post ordering of result sets.The newly added ALLOW FILTERING
feature will need to use these functions to other post processing operations
i.e filtering. The renaming accounts for the new and existing uses for the
function.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
* seastar 4669469...d152f2d (5):
> build: don't link with libgcc_s explicitly
> scheduling: add std::hash<seastar::scheduling_group>
> prometheus: Allow preemption between each metric
> Merge "improve memory detection in containers" from Juliana
> Merge "perf_tests: produce json reports" from Paweł
"
Hints are stored on disk by a hints::manager, ensuring they are
eventually sent. A hints::resource_manager ensures the hints::managers
it tracks don't consume more than their allocated resources by
monitoring disk space and disabling new hints if needed. This series
fixes some bugs related to the backlog calculation, but mainly exposes
the backlog through a hints::manager so upper layers can apply flow
control.
Refs #2538
"
* 'hh-manager-backlog/v3' of https://github.com/duarten/scylla:
db/hints/manager: Expose current backlog
db/hints/manager: Move decision about blocking hints to the manager
db/hints/resource_manager: Correctly account resources in space_watchdog
db/hints/resource_manager: Replace timer with seastar::thread
db/hints/resource_manager: Ensure managers are correctly registered
db/hints/resource_manager: Fix formatting
db/hints: Disallow moving or copying the managers
The space_watchdog enables or disables hints for the managers
associated with a particular device. We encapsulate this decision
inside the hints::managers by introducing the update_backlog()
function.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
A db::hints::resource_manager manages the resources for one or two
db::hints::managers. Each of these can be using the same or different
devices. The db::hints::space_watchdog periodically checks whether
each manager is within their resource allocation, and if not disables
it.
The watchdog iterates over the managers and accounts for the total
size they are using. This is wrong, since it can account in the same
variable the size consumed by managers using different devices.
We fix this while taking advantage of the fact that on_timer is now
called in the context of a seastar::thread, instead of using future
combinators.
Fixes#3821
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Registering a manager for a new device used
std::unordered_map::emplace(), which may not insert the specified
value if one with the same key has already been added. This could
happen if both managers were using the same device and the fiber
deferred in-between adding them.
Found during code reading. Could cause hints to not be disabled for an
overloaded manager.
Fixes#3822
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Disable the copy and move ctors and assignment operators for both the
hints::manager and the hints::resource_manager.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Without that, we don't know where to look for the problems
Before:
compaction failed: sstables::malformed_sstable_exception (Too big ttl: 3163676957)
After:
compaction_manager - compaction failed: sstables::malformed_sstable_exception (Too big ttl: 4294967295 in sstable /var/lib/scylla/data/system_traces/events-8826e8e9e16a372887533bc1fc713c25/mc-832-big-Data.db)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181016181004.17838-1-glauber@scylladb.com>
On receiving a mutation_fragment or a mutation triggered by a streaming
operation, we pass an enum stream_reason to notify the receiver what
the streaming is used for. So the receiver can decide further operation,
e.g., send view updates, beyond applying the streaming data on disk.
Fixes#3276
Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com>
Handle the before_all_keys and after_all_keys token_kind
at the highest layer before calling into the virtual
i_partitioner::tri_compare that is not set up to handle these cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20181015165612.29356-1-bhalevy@scylladb.com>
Int types in json will be serialized to int types in C++. They will then
only be able to handle 4GB, and we tend to store more data than that.
Without this patch, listsnapshots is broken in all versions.
Fixes: #3845
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181012155902.7573-1-glauber@scylladb.com>
Introduce uppermost_bound() method instead of upper_bound() in mutation_fragment_filter and clustering_ranges_walker.
For now, this has been only used to produce the final range tombstone
for sliced reads inside consume_partition_end().
Usage of the upper bound of the current range causes problems of two
kinds:
1. If not all the slicing ranges have been traversed with the
clustering range walker, which is possible when the last read
mutation fragment was before some of the ranges and reading was limited
to a specific range of positions taken from index, the emitted range
tombstone will not cover the untraversed slices.
2. At the same time, if all ranges have been walked past, the end
bound is set to after_all_clustered_rows and the emitted RT may span
more data than it should.
To avoid both situations, the uppermost bound is used instead, which
refers to the upper bound of the last range in the sequence.
* github.com/scylladb/seastar-dev.git haaawk/projects/sstables-30/enable-mc-with-sstable-mutation-test/v2
sstables: Use uppermost_bound() instead of upper_bound() in
mutation_fragment_filter.
tests: Enable sstable_mutation_test for SSTables 'mc' format.
Rebased by Piotr J.
For now, this has been only used to produce the final range tombstone
for sliced reads inside consume_partition_end().
Usage of the upper bound of the current range causes problems of two
kinds:
1. If not all the slicing ranges have been traversed with the
clustering range walker, which is possible when the last read
mutation fragment was before some of the ranges and reading was limited
to a specific range of positions taken from index, the emitted range
tombstone will not cover the untraversed slices.
2. At the same time, if all ranges have been walked past, the end
bound is set to after_all_clustered_rows and the emitted RT may span
more data than it should.
To avoid both situations, the uppermost bound is used instead, which
refers to the upper bound of the last range in the sequence.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
* seastar-dev.git haaawk/sst3/test_clustering_slices/v8:
sstables: Extract on_end_of_stream from consume_partition_end
sstables: Don't call consume_range_tombstone_end in
consume_partition_end
sstables: Change the way fragments are returned from consumer
Split range tombstone (if present) on every consume_row_end call
and store both range tombstone and row in different fields called
_stored_row and _stored_tombstone instead of using single field
called _stored.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The new function will be called when the stream of data is finished
while old consume_partition_end will be called when partition
is finished but stream is not done yet.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Readers for SST3 return a bit more precise range tombstones
when reader is slicing. Namely, SST2 readers return whole
range tombstones that overlap with slicing range but SST3
trim those range tombstones to slicing range.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
There is a mismatch between row markers used in SSTables 2.x (ka/la) and
liveness_info used by SSTables 3.x (mc) in that a row marker can be
written as a deleted cell but liveness_info cannot.
To handle this, for a dead row marker the corresponding liveness_info is
written as expiring liveness_info with a fake TTL set to 1.
This approach is adapted from the solution for CASSANDRA-13395 that
exercised similar issue during SSTables upgrades.
* github.com/argenet/scylla.git projects/sstables-30/dead-row-marker/v7:
sstables: Introduce TTL limitation and special 'expired TTL' value.
sstables: Write dead row marker as expired liveness info.
tests: Add test covering dead row marker writing to SSTables 3.x.
'Consumer function' parameter for distribute_reader_and_consume_on_shards()
captures schema_ptr (which is a seastar::shared_ptr), but the function
is later copied on another shard at which point schema_ptr is also copied
and its counter is incremented by the wrong shard. The capture is not
even used, so lets just drop it.
Fixes#3838
Message-Id: <20181011075500.GN14449@scylladb.com>
The single-range overload, when used by
make_multishard_streaming_reader(), has to create a reader that is
forwardable. Otherwise the multishard streaming reader will not produce
any output as it cannot fast-forward its shard readers to the ranges
produced by the generator.
Also add a unit test, that is based on the real-life purpose the
multishard streaming reader was designed for - serving partition
from a shard, according to a sharding configuration that is different
than the local one. This is also the scenario that found the buf in the
first place.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <bf799961bfd535882ede6a54cd6c4b6f92e4e1c1.1539235034.git.bdenes@scylladb.com>
Make sure that read I/O in the context of HH sending do not overpower I/O
in the context of queries, memtable flushes or compactions.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This allows to distinguish expired liveness info from yet-to-expire one
and convert it into a dead row marker on read.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This allows to store expired liveness info in SSTables 3.x format
without introducing a possible conflict with real TTL values.
As per Cassandra, TTL cannot exceed 20 years so taking the maximum value
as a special value for indicating expired liveness info is safe.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Fixes#3787
Message service streaming sink was created using direct call to
rpc::client::make_sink. This in turn needs a new socker, which it
creates completely ignoring what underlying transport is active for the
client in question.
Fix by retaining the tls credential pointer in the client wrapper, and
using this in a sink method to determine whether to create a new tls
socker, or just go ahead with a plain one.
Message-Id: <20181010003249.30526-1-calle@scylladb.com>
"
This series changes hinted handoff to work with `frozen_mutation`s
instead of naked `mutation`s. Instead of unfreezing a mutation from
the commitlog entry and then freezing it again for sending, now we'll
just keep the read, frozen mutation.
Tests: unit(release)
"
* 'hh-manager-cleanup/v1' of https://github.com/duarten/scylla:
db/hints/manager: Use frozen_mutation instead of mutation
db/hints/manager: Use database::find_schema()
db/commitlog/commitlog_entry: Allow moving the contained mutation
service/storage_proxy: send_to_endpoint overload accepting frozen_mutation
service/storage_proxy: Build a shared_mutation from a frozen_mutation
service/storage_proxy: Lift frozen_mutation_and_schema
service/storage_proxy: Allow non-const ranges in mutate_prepare()
A simple case for SI paging is added to secondary_index_test suite.
This commit should be followed by more complex testing
and serves as an example on how to extract paging state and use it
across CQL queries.
Message-Id: <b22bdb5da1ef8df399849a66ac6a1f377e6a650a.1539090350.git.sarna@scylladb.com>
write_stats is referenced from write handler which is available in
send_to_live_endpoints already. No need to pass it down.
Message-Id: <20181009133017.GA14449@scylladb.com>
Currently, when stopping a reader fails, it simply won't be attempted to
be saved, and it will be left in the `_readers` array as-is. This can
lead to an assertion failure as the reader state will contain futures
that were already waited upon, and that the cleanup code will attempt to
wait on again. To prevent this, when stopping a reader fails, reset it
to nonexistent state, so that the cleanup code doesn't attempt to do
anything with it.
Refs: #3830
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <a1afc1d3d74f196b772e6c218999c57c15ca05be.1539088164.git.bdenes@scylladb.com>
It is useful to have this counter to investigate the reason for read
repairs. Non zero value means that writes were lost after CL is reached
and RR is expected.
Message-Id: <20181009120900.GF22665@scylladb.com>
The pager::state() function returns a valid paging object even
if the pager itself is exhausted. It may also not contain the partition
key, so using it unconditionally was a bug - now, in case there is no
partition key present, paging state will contain an empty partition key.
Fixes#3829
Message-Id: <28401eb21ab8f12645c0a33d9e92ada9de83e96b.1539074813.git.sarna@scylladb.com>
This will be used by the `make_multishard_streaming_reader()` in the
next patch. This method will create a multishard combining reader which
needs its shard readers to take a single range, not a vector of ranges
like the existing overload.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cc6f2c9a8cf2c42696ff756ed6cb7949b95fe986.1538470782.git.bdenes@scylladb.com>
It might take long time for get_all_ranges_with_sources_for and
get_all_ranges_with_strict_sources_for to calculate which cause reactor
stall. To fix, run them in a thread and yield. Those functions are used in
the slow path, it is ok to yield more than needed.
Fixes#3639
Message-Id: <63aa7794906ac020c9d9b2984e1351a8298a249b.1536135617.git.asias@scylladb.com>
A materialized views can provide a filter so as to pick up only a subset
of the rows from the base table. Usually, the filter operates on columns
from the base table's primary key. If we use a filter on regular (non-key)
columns, things get hairy, and as issue #3430 showed, wrong: merely updating
this column in the base table may require us to delete, or resurrect, the
view row. But normally we need to do the above when the "new view key column"
was updated, when there is one. We use shadowable tombstones with one
timestamp to do this, so it cannot take into account the two timestamp from
those two columns (the filtered column and the new key column).
So in the current code, filtering by a non-key column does not work correctly.
In this patch we provide two test cases (one involving TTLs, and one involves
only normal updates), which demonstrate vividly that it does *not* work
correctly. With normal updates, trying to resurect a view row that has
previously disappeared, fails. With TTLs, things are even worse, and the view
row fails to disappear when the filtered column is TTLed.
In Cassandra, the same thing doesn't work correctly as well (see
CASSANDRA-13798 and CASSANDRA-13832) so they decided to refuse creating
a materialized view filtering a non-key column. In this patch we also
do this - fail the creation of such an unsupported view. For this reason,
the two tests mentioned above are commented out in a "#if", with, instead,
a trivial test verifying a failure to create such a view.
Note that as explained above, when the filtered column and new view key
column are *different* we have a problem. But when they are the *same* - namely
we filter by a non-key base column which actually *is* a key in the view -
we are actually fine. This patch includes additional test cases verifying
that this case is really fine and provides correct results. Accordingly,
this case is *not* forbidden in the view creation code.
Fixes#3430.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181008185633.24616-1-nyh@scylladb.com>
"
This patchset fixes a bug in SSTables 3.x reading when fast-forwarding
is enabled. It is possible that a mutation fragment, row or RT marker,
is read and then stored because it falls outside the current
fast-forwarding range.
If the reader is further fast-forwarded but the
row still falls outside of it, the reader would still continue reading
and get the next fragment, if any, that would clobber the currently
stored one. With this fix, the reader does not attempt to read on
after storing the current fragment.
Tests: unit {release}
"
* 'projects/sstables-30/row-skipped-on-double-ff/v2' of https://github.com/argenet/scylla:
tests: Add test for reading rows after multiple fast-forwarding with SSTables 3.x.
sstables: mp_row_consumer_m to notify reader on end of stream when storing a mutation fragment.
sstables: In mp_row_consumer_m::push_mutation_fragments(), return the called helper's value.
Fixes#3798Fixes#3694
Tests:
unit(release), dtest([new] cql_tests.py:TruncateTester.truncate_after_restart_test)
* tag 'fix-gossip-shard-replication-v1' of github.com:tgrabiec/scylla:
gms/gossiper: Replicate enpoint states in add_saved_endpoint()
gms/gossiper: Make reset_endpoint_state_map() have effect on all shards
gms/gossiper: Replicate STATUS change from mark_as_shutdown() to other shards
gms/gossiper: Always override states from older generations
writetime() or ttl() selections of non-frozen collections can work, as they
are single cells. Relax the check to allow them, and only forbid non-frozen
collections.
Fixes#3825.
Tests: cql_query_test (release).
Message-Id: <20181008123920.27575-1-avi@scylladb.com>
Uncomment existing declare() calls and implement tests. Because the
data_value(bytes) constructor is explicit, we add explicit conversion to
data_value in impl_min_function_for<> and impl_max_function_for<>.
Fixes#3824.
Message-Id: <20181008084127.11062-1-avi@scylladb.com>
We found on some Debian environment Ubuntu .deb build fails with
gpg error because lack of Ubuntu GPG key, so we need to install it before
start pbuilder.
Same as on Ubuntu, it needs to install Debian GPG key.
Fixes#3823
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181008072246.13305-1-syuu@scylladb.com>
Instead of unfreezing a mutation from the commitlog and then freezing
it again to send, just keep the read frozen mutation.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Instead of using find_column_family() and repeatedly asking for
column_family::schema(), use database::find_schema() instead.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Add an overload to send_to_endpoint() which accepts a frozen_mutation.
The motivation is to allow better accounting of pending view updates,
but this change also allows some callers to avoid unfreezing already
frozen mutations.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Lift frozen_mutation_and_schema to frozen_mutation.hh, since other
subsystems using frozen_mutations will likely want to pass it around
together with the schema.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
This patchset implements ALTER TABLE ADD/DROP for multiple columns.
Fixes: #2907Fixes: #3691
Tests: schema_change_test
"
* 'projects/cql3/alter-table-multi/v3' of https://github.com/bhalevy/scylla:
cql3: schema_change_test: add test_multiple_columns_add_and_drop
cql3: allow adding or dropping multiple columns in ALTER TABLE statement
cql3: alter_table_statement: extract add/alter/drop per-column code into functions
cql3: testing for MVs for alter_table_statement::type::drop is not per column
cql3: schema_change_test: add test_static_column_is_dropped
So we don't attempt to send mutations to unreachable endpoints and
instead store a hint for them, we now check the endpoint status and
populate dead_endpoints accordingly in
storage_proxy::send_to_endpoint().
Fixes#3820
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181007100640.2182-1-duarte@scylladb.com>
No column can be dropped from a table with materialized views
so the respective exception can ignore and omit the dropped column name.
In preparation for refactoring the respective code, moving the per-column
code to member functions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test dropping of static column defined in CREATE TABLE, and
adding and dropping of a static column using ALTER TABLE.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Without it, the reader will attempt to read further and may clobber the
stored fragment with the next one read, if any.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Unpaged queries are those for which the client didn't enable paging,
and we already account for them in
indexed_table_select_statement::do_execute().
Remove the second increment in read_posting_list().
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181003121811.11750-1-duarte@scylladb.com>
We had two commented out tests based on Cassandra's MV unit tests, for
the case that the view's filter (the "SELECT" clause used to define the
view) filtered by a non-primary-key column. These tests used to fail
because of problems we had in the filtering code, but they now succeed,
so we can enable them. This patch also adds some comments about what
the tests do, and adds a few more cases to one of the tests.
Refs #3430.
However, note that the success of these tests does not really prove that
the non-PK-column filtering feature works fully correctly and that issue
forbidding it, as explained in
https://issues.apache.org/jira/browse/CASSANDRA-13798. We can probably
fix this feature with our "virtual cells" mechanism, but will need to add
a test to confirm the possible problem and its (probably needed fix).
We do not add such a test in this patch.
In the meantime, issue #3430 should remain open: we still *allow* users
to create MV with such a filter, and, as the tests in this patch show,
this "mostly" works correctly. We just need to prove and/or fix what happens
with the complex row liveness issues a la issue #3362.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181004213637.32330-1-nyh@scylladb.com>
Lack of this may result in non-zero shards on some nodes still seeing
STATUS as NORMAL for a node which shut down, in some cases.
mark_as_shutdown() is invoked in reaction to an RPC call initiated by
the node which is shutting down. Another way a node can learn about
other node shutting down is via gossiping with a node which knows
this. In that case, the states will be replicated to non-zero
shards. The node which learnt via mark_as_shutdown() may also
eventually propagate this to non-zero shards, e.g. when it gossips
about it with other nodes, and its local version number at the time of
mark_as_shudown() was smaller than the one used to set the STATE by
the shutting down node.
Application states of each node are versioned per-node with a pair of
generation number (more significant) and value version. Generation
number uniquely identifies the life time of a scylla
process. Generation number changes after restart. Value versions start
from 0 on each restart. When a node gets updates for application
states, it merges them with its view on given node. Value updates with
older versions are ignored.
Gossiper processes updates only on shard 0, and replicates value
updates to other shards. When it sees a value with a new generation,
it correclty forgets all previous values. However, non-zero shards
don't forget values from previous generations. As a result,
replication will fail to override the values on non-zero shards when
generation number changes until their value version exceeds the
version prior to the restart.
This will result in incorrect STATUS for non-seed nodes on non-zero
shards. When restarting a non-seed node, it will do a shadow gossip
round before setting its STATUS to NORMAL. In the shadow round it will
learn from other nodes about itself, and set its STATUS to shutdown on
all shards with a high value version. Later, when it sets its status
to NORMAL, it will override it only on shard 0, because on other
shards the version of STATUS is higher.
This will cause CQL truncate to skip current node if the coordinator
runs on non-zero shards.
The fix is to override the entries on remote shards in the same way we
do on shard 0. All updates to endpoint states should be already
serialized on shard 0, and remote shards should see them in the same
order.
Introduced in 2d5fb9dFixes#3798Fixes#3694
The Antlr3 exception class has a null dereference bug that crashes
the system when trying to extract the exception message using
ANTLR_Exception<...>::displayRecognitionError(...) function. When
a parsing error occurs the CqlParser throws an exception which in
turn processesed for some special cases in scylla to generate a custom
message. The default case however, creates the message using
displayRecognitionError, causing the system to crash.
The fix is a simple workaround, making sure the pointer is not null
before the call to the function. A "proper" fix can't be implemented
because the exception class itself is implemented outside scylla
in antlr headers that resides on the host machine os.
Tested manualy 2 testcases, a typo causing scylla to crash and
a cql comment without a newline at the end also caused scylla to crash.
Ran unit tests (release).
Fixes#3740Fixes#3764
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <cfc7e0d758d7a855d113bb7c8191b0fd7d2e8921.1538566542.git.eliransin@scylladb.com>
shutdown_announce_in_ms specifies a period of time that a node which
is shutting down waits to allow its state to propagate to other nodes.
However, we were setting _enabled to false before waiting, which
will make the current node ignore gossip messages.
Message-Id: <1538576996-26283-1-git-send-email-tgrabiec@scylladb.com>
The as_json_function class is not registered as a function, but we can
still keep it cql3/functions, as per its namespace, to reduce the size
of select_statement.cc.
Message-Id: <20181002132637.30233-1-penberg@scylladb.com>
"
This patchset enables very simple column type conversions.
It covers only handling variable and fixed size type differences.
Two types still have to be compatiple on bits level to be able to convert a field from one to the other.
"
* 'haaawk/sst3/column_type_schema_change/v4' of github.com:scylladb/seastar-dev:
Fix check_multi_schema to actually check the column type change
Handle very basic column type conversions in SST3
Enable check_multi_schema for SST3
Field 'e' was supposed to be read as blob but the test had a bug
and the read schema was treating that field as int. This patch
changes that and makes the test really check column type change.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
After this change very simple schema changes of column type
will work. This change makes sure that variable size and fixed
size types can be converted to each other but only if their bit
representation can be automatically converted between those types.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We are already maintaining a statistic of the number of pending view updates
sent but but not yet completed by view replicas, so let's expose it.
As all per-table statistics, also this one will only be exposed if the
"--enable-keyspace-column-family-metrics" option is on.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
mutate_MV usually calls send_to_endpoint() to push view update to remote
view replicas. This function gets passed a statistics object,
service::storage_proxy_stats::write_stats and, in particular, updates
its "writes" statistic which counts the number of ongoing writes.
In the case that the paired view replica happens to be the *same* node,
we avoid calling send_to_endpoint() and call mutate_locally() instead.
That function does not take a write_stats object, so the "writes" statistic
doesn't get incremented for the duration of the write. So we should do
this explicitly.
Co-authored-by: Nadav Har'El <nyh@scylladb.com>
Co-authored-by: Duarte Nunes <duarte@scylladb.com>
Currently we diff schemas based on table/view name, and if the names
match, then we detect altered schemas by comparing the schema
mutations. This fails to detect transitions which involve dropping and
recreating a schema with the same name, if a node receives these
notifications simultaneously (for example, if the node was temporarily
down or partitioned).
Note that because the ID is persisted and created when executing a
create_table_statement, then even if a schema is re-created with the
exact same structure as before, we will still considered it altered
because the mutations will differ.
This also stops schema pulling from working, since it relies on schema
merging.
The solution is to diff schemas using their ID, and not their name.
Keyspaces and user types are also susceptible to this, but in their
case it's fine: these are values with no identity, and are just
metadata. Dropping and recreating a keyspace can be views as dropping
all tables from the keyspace, altering it, and eventually adding new
tables to the keyspace.
Note that this solution doesn't apply to tables dropped and created
with the same ID (using the `WITH ID = {}` syntax). For that, we would
need to detect deltas instead of applying changes and then reading the
new state to find differences. However, this solution is enough,
because tables are usually created with ID = {} for very specific,
peculiar reasons. The original motivation meant for the new table to
be treated exactly as the old, so the current behavior is in fact the
desired one.
Tests: unit(release), dtests(schema_test, schema_management_test)
Fixes#3797
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181001230932.47153-2-duarte@scylladb.com>
The linker uses an opt-in system for non-executable stack: if all object files
opt into a non-executable stack, the binary will have a non-executable stack,
which is very desirable for security. The compiler cooperates by opting into
a non-executable stack whenever possible (always for our code).
However, we also have an assembly file (for fast power crc32 computations).
Since it doesn't opt into a non-executable stack, we get a binary with
executable stack, which Gentoo's build system rightly complains about.
Fix by adding the correct incantation to the file.
Fixes#3799.
Reported-by: Alexys Jacob <ultrabug@gmail.com>
Message-Id: <20181002151251.26383-1-avi@scylladb.com>
* seastar 5712816...71e914e (12):
> Merge "rpc shard to shard connection" from Gleb
> Merge "Fix memory leaks when stoppping memcached" from Tomasz
> scripts: perftune.py: prioritize I/O schedulers
> alien: fix the size of local item[]
> seastar-addr2line: don't invoke addr2line multiple times
> reactor: use labels for different io_priority_class:s
> util/spinlock: fix bad namespacing of <xmmintrin.h>
> Merge "scripts: perftune.py: support different I/O schedulers" from Vlad
> timer: Do not require callback to be copyable
> core/reactor: Fix hang on shutdown with long task quota
> build: use 'ppa:scylladb/ppa' instead of URL for sourceline
> net/dns: add net::dns::get_srv_records() helper
We allow tables to be created with the ID property, mostly for
advanced recovery cases. However, we need to validate that the ID
doesn't match an existing one. We currently do this in
database::add_column_family(), but this is already too late in the
normal workflow: if we allow the schema change to go through, then
it is applied to the system tables and loaded the next time the node
boots, regardless of us throwing from database::add_column_family().
To fix this, we perform this validation when announcing a new table.
Note that the check wasn't removed from database::add_column_family();
it's there since 2015 and there might have been other reasons to add
it that are not related to the ID property.
Refs #2059
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181001230142.46743-1-duarte@scylladb.com>
"
This series adds support for range generator functors to multi range
reader. A range generator functor can lazily generate an uknown amount
of ranges on-the-fly for the reader to read.
The range generator support was added by refactoring
`flat_multi_range_mutation_reader` to work in terms of a generator
functor. The existing overload taking a `dht::partition_range_vector`
is adapted to the generator interface behind the scenes.
"
* 'multi-range-reader-generator/v9' of https://github.com/denesb/scylla:
tests/flat_mutation_reader_test: extend multi-range reader tests
make_flat_multi_range_reader: add documentation
make_flat_multi_range_reader: add generator overload
flat_multi_range_reader: refactor to work in terms of generator
make_flat_multi_range_reader(): better handle the 0 range case
flat_mutation_reader: add move_buffer_content_to()
flat_multi_range_mutation_reader: drop fwd_mr ctor parameter
To be used by sstable_writer for stats collection.
Note that this patch is factored out so it can be verified with no
other change in functionality.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
Indexed select statement consists of two queries - the view query
used to extract base keys and the base query that uses those keys
to return base rows.
The main idea of this series is to replace raw proxy.query() call
during the view query to one that uses a pager.
Additionally, paging info from the view query needs to be returned
to the client, in order to be used later for requesting new pages.
"
* 'paging_indexes_7' of https://github.com/psarna/scylla:
tests: add test for secondary index with paging
cql3: remove execute(primary_keys) from select statement
cql3: add incremental base queries to index query
storage_proxy: make get_restricted_ranges public
cql3: add base query handling function to indexed statement
cql3: add generating base key from index keys
cql3: add paging state generation function
cql3: move getting index view schema to prepare stage
pager: make state() defined for exhausted pagers
cql3: add maybe_set_paging_state function
cql3: rename set_has_more_pages to set_paging_state
pager: add setters for partition/clustering keys
cql3: add paging to read_posting_list
cql3: add non-const get_result_metadata method
cql3: make find_index_* functions return paging state
cql3: make read_posting_list return future<rows>
cql3: make pagers use time_point instead of duration
Commit d6b0c4dda4 changed the built-in default
murmur3_ignore_msb_bits to 12 (from 0) and removed the scylla.yaml default.
Removal of the scylla.yaml default was a mistake for two reasons:
- if someone downgrades a cluster, keeping scylla.yaml derived from the
master branch, they will experience resharding since the built-in default,
which has changed, will take effect. While that scenario is not supported,
it already happened and caused much consternation.
- if, in the future, we wish to change the default, we will cause resharding
again. Embedding the default in scylla.yaml allows us to change the default
for new clusters while allowing upgraded clusters to retain older values.
Therefore, this patch restores murmur3_ignore_msb_bits in scylla.yaml. Future
changes to the configuration item should change both scylla.yaml and the
built-in default.
Message-Id: <20180930090053.21136-1-avi@scylladb.com>
Reusable buffers are meant to be used when protocol or third-party
library limiations force us to allocate large contiguous buffers. There
isn't much that can be done about this so there is little point in
warning about that.
Fixes#3788.
Message-Id: <20180928085141.6469-1-pdziepak@scylladb.com>
In commit 4a0b561376, "storage_service:
Get rid of moving operation", we removed remove_from_moving() in
update_normal_tokens(). However, remove_from_moving() calls
invalidate_cached_rings(). We should call invalidate_cached_rings() in
update_normal_tokens(), otherwise we will get wrong token range to
address map in the token_metadata cache.
This issue exists in master only. It is not in any of the releases.
Message-Id: <c03f2ed478cfdb84494f36dce9a8cfc05ed9e0cd.1538288364.git.asias@scylladb.com>
Allows creating a multi range reader from an arbitrary callable that
return std::optional<dht::partition_range>. The callable is expected to
return a new range on each call, such that passing each successive range
to `flat_mutation_reader::fast_forward_to` is valid. When exhausted the
callable is expected to return std::nullopt.
Instead of working with a dht::partition_range_vector directly, work
with an abstract generator that returns a pointer to the next range on
each invocation. When exhausted it returns nullptr. This opens up the
possibility to create multi range readers from a generator functor that
creates ranges lazily. This is indeed what the next path does.
Previously, when the passed in range of partition ranges contained 0
ranges, an empty reader was returned. This means that the returned
reader was forwardable or not depending on the number of passed in
ranges. This is inconsistent and can lead to nasty surprises.
To solve this problem add `forwardable_empty_mutation_reader`, a
specialized reader that delays creating the underlying reader until
fast_forward_to() is called on it, and thus a range is available.
When `make_flat_multi_range_mutation_reader()` is called with
`mutation_reader::forwarding::no` a simple empty reader is created, like
before.
`move_buffer_content_to()` makes it possible to implement more efficient
wrapping readers, readers that wrap another flat mutation reader but do
no transformation to the underlying fragment stream.
These readers, when filling their buffers, can simply fill the
underlying reader's buffer, then move its content into their own. When
the reader's own buffer is empty, this is very efficient, as it can be
done by simply swapping the buffers, avoiding the work of moving the
fragments one-by-one.
The factory function creating this reader ensures that the passed-in
ranges vector has more then one range, which effectively makes the
`fwd_mr` constructor parameter have no effect. The underlying reader
will always be created with `mutation_reader::forwarding::yes` as it has
to be able to fast-forward between the ranges.
When validating assignment between two types, it's possible one of
them is wrapped in a reverse_type, if it comes, for example, from the
type associated with a clustering column. When checking for weak
assignment the types are correctly unwrapped, but not when checking
for an exact match, which this patch fixes.
Technically, the receiver is never a reversed_type for the current
callers, but this is the morally correct implementation, as the type
being reversed or not plays no role in assignment.
Tests: unit(release)
Fixes#3789
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180927223201.28152-1-duarte@scylladb.com>
Right now, with specialized execute() that takes primary keys
for indexed_table_select_statement, the original execute()
method implemented in select_statement is not used anywhere,
so it's removed.
Base queries that are part of index queries are allowed to be short,
which can result in wasted work - e.g. when we query all replicas
in parallel, but have to discard most of the result, since the first
one (in token order) resulted in a short read.
Thus, we start by quering 1 range, check if the read is short,
and if not, continue by querying 2x more ranges than before.
Refs #2960
Searching for index view schema for an indexed statement can be done
once in prepare stage, so it's moved to indexed_table_select_statement
prepare method.
If service::pager is exhausted, state() function used to return
a nullptr instead of a pointer to a valid paging state and the
documented return type in this case was 'unspecified'.
Sometimes a paging state may be needed anyway, even if the pager
is already exhausted - thus, state() return value becomes defined
after this commit. Exhausted pagers will return a valid object
to a state with _remaining field set to 0.
We have found issues when a flush is requested outside the usual
memtable flush loop and because there is not a lot of data the
controller will not have a high amount of shares.
To prevent this, this patch guarantees some minimum amount of shares
when extraneous operations (nodetool flush, commitlog-driven flush, etc)
are requested.
Another option would be to add shares instead of guarantee a minimum.
But in my view the approach I am taking here has two main advantages:
1) It won't cause spikes when those operations are requested
2) It is cumbersome to add shares in the current infrastructure, as just
adding backlog can cause shares to spike. Consider this example:
Backlog is within the first range of very low backlog (~0.2). Shares
for this would be around ~20. If we want to add 200 shares, that is
equivalent to a backlog of 0.8. Once we add those two backlogs
together, we end up with 1 (max backlog).
Fixes#3761
Tests: unit (release)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180927131904.8826-1-glauber@scylladb.com>
Instead of returning a coordinator result and making a caller parse it
later, read_posting_list now extracts rows by itself.
This change is later needed when querying is replaced with a pager.
A standard way for passing a timeout parameter is specifying
a time_point, while pagers used to take a duration in order
to compute time points on the fly. This patch adds a timeout
parameter, which is a time_point, to fetch_page().
This patchset addresses multiple errors in normalizing_reader
implementation found during review.
I have decided to not make a clustering key full inside
before_key()/after_key() helpers. The reason is that for this they
would need schema to be passed as another parameter so existing
methods don't suit. OTOH, introducing new members for a class using
for testing purposes only seems an overkill.
* github.com/argenet/scylla.git projects/sstables-30/normalizing_reader_fixes/v1:
range_tombstone: Add constructor accepting position_in_partition_views
for range bounds.
tests: Make sure range tombstone is properly split over rows with
non-full keys.
tests: Multiple fixes for draining and clearing range tombstones in
normalizing_reader.
We only wait from the last test case, so if an individual test is executed,
a memory leak may be reported.
Fix by waiting from all test cases.
Message-Id: <20180926203723.18026-1-avi@scylladb.com>
When executing a prepared select statement with a multicolumn IN, the
system returned incorrect results due to a memory violation (a bytes view
referring to an out of scope bytes object).
Added test for the prepared statement results correctness.
Tests:
1. unit (release) with the new test.
2. Python script.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <36c9cf9ed3fe72e3b4801e3cd120678429ce218a.1537947897.git.eliransin@scylladb.com>
"
This patchset fixes several issues in SSTables 3.x ('mc') writing and
parsing and extends existing SSTables unit tests to cover the new
format.
The only test enabled temporarily is check_multi_schema because it
turned out that reading SSTables 3.x with a different schema has not
been implemented in full. This will be addressed in a separate patchset.
This patchset depends on the "Support SSTables 3.x in Scylla runtime"
patchset.
Tests: unit {release}
"
* 'projects/sstables-30/unit-tests/v3' of https://github.com/argenet/scylla: (25 commits)
tests: Enable existing SSTables tests for 'mc' format.
tests: Fix test_wrong_range_tombstone_order for 'mc' format.
tests: Extend reader assertions to check clustering keys made full.
tests: Disable test_old_format_non_compound_range_tombstone_is_read for 'mc' format.
tests: Disable check_multi_schema for 'mc' format.
tests: Fix test_promoted_index_read for 'mc' format by using normalizing_reader.
tests: Fix promoted_index_read to not rely on a specific index length
tests: Add 'mc' files for test_wrong_range_tombstone_order
tests: Add 'mc' files for test_wrong_counter_shard_order
tests: Add 'mc' files for summary_test
tests: Add 'mc' files for test_promoted_index_read
tests: Add 'mc' files for test_partition_skipping
tests: Add 'mc' files for large_partition tests (promoted_index_read, sub_partition_read, sub_partitions_read
tests: Add 'mc' files for test_counter_read
tests: Add 'mc' files for test_broken_promoted_index_is_skipped
tests: SSTables 'mc' files for sliced_mutation_reads_test.
tests: Introduce normalizing_reader helper for SSTables tests.
mutation_fragment: Add range_tombstone_stream::empty() method.
sstables: Make key full when setting a range tombstone start from end open marker.
sstables: For 'mc' format, use excl_start when split an RT over a row with a full key.
...
"
This patchset makes it possible to use SSTables 'mc' format, commonly
referred to as 'SSTables 3.x', when running Scylla instance.
Several bugs found on this way are fixed. Also, a configuration option
is introduced to allow running Scylla either with 'mc' or 'la' format
as default.
Tests: unit {release}
+ tested Scylla with both 'la' and 'mc' formats to work fine:
cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}; [3/1890]
cqlsh> USE test;
cqlsh:test> CREATE TABLE cfsst3 (pk int, ck int, rc int, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''};
cqlsh:test> INSERT INTO cfsst3 (pk, ck, rc) VALUES ( 4, 7, 8);
<<flush>>
cqlsh:test> DELETE from cfsst3 WHERE pk = 4 and ck> 3 and ck < 8;
<<flush>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 2, 3);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 4, 6);
cqlsh:test> SELECT * FROM cfsst3 ;
pk | ck | rc
----+----+------
2 | 3 | null
4 | 6 | null
(2 rows)
<<Scylla restart>>
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 5, 7);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 6, 8);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 7, 9);
cqlsh:test> INSERT INTO cfsst3 (pk, ck) VALUES ( 8, 10);
cqlsh:test> SELECT * from cfsst3 ;
pk | ck | rc
----+----+------
5 | 7 | null
8 | 10 | null
2 | 3 | null
4 | 6 | null
7 | 9 | null
6 | 8 | null
(6 rows)
"
* 'projects/sstables-30/try-runtime/v8' of https://github.com/argenet/scylla:
database: Honour enable_sstables_mc_format configuration option.
sstables: Support SSTables 'mc' format as a feature.
db: Add configuration option for enabling SSTables 'mc' format.
tests: Add test for reading a complex column with zero subcolumns (SST3).
sstables: Fix parsing of complex columns with zero subcolumns.
sstables: Explicitly cast api::timestamp_type to uint64_t when delta-encoding.
sstables: Use parser_type instead of abstract_type::parse_type in column_translation.
bytes: Add helper for turning bytes_view into sstring_view.
sstables: Only forward the call to fast_forwarding_to in mp_row_consumer_m if filter exists.
sstables: Fix string formatting for exception messages in m_format_read_helpers.
sstables: Don't validate timestamps against the max value on parsing.
sstables: Always store only min bases in serialization_header.
sstables: Support 'mc' version parsing from filename.
SST3: Make sure we call consume_partition_end
This test is not applicable to the 'mc' format as it covers a backward
compatibility case which may only occur with SSTables generated by older
Scylla versions in 'ka' format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This is a helper flat_mutation_reader that wraps another reader and
splits range tombstones over rows before emitting them.
It is used to produce the same mutation streams for both old (ka/la) and
new (mc) SSTables formats in unit tests.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This fixes the monotonicity issue as otherwise the range tombstone
emitted after such clustering row has a start position that should be
ordered before that of the row.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
If we go past the slice to be read with a range tombstone being opened
we need to emit an RT corresponding to this slice.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Only enable SSTables 'mc' format if the entire cluster supports it and
it is enabled in the configuration file.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This flag will only be used for testing purposes until Scylla 3.o
release and will be removed once SSTables 'mc' testing is completed.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Before this fix, a complex column with zero subcolumns would be
incorrecty parsed as it would re-read the deletion time twice.
Now, this case is handled properly.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
abstract_type::parse_type() only deals with simple types and fails to
parse wrapped types such as
org.apache.cassandra.db.marshal.FrozenType(org.apache.cassandra.db.marshal.ListType(org.apache.cassandra.db.marshal.UTF8Type))
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
It may happen that we hit the end of partition and then get
fast_forward_to() called in which case we attempt to call it from an
already destroyed object. We need to check the _mf_filter value before
doing so.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Before this fix, the code was a potential undefined behaviour and crash
because it would add a large value to a const char* and try to create a
std::string out of it.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Internally, timestamps are represented as signed integers (int64_t) but
stored as unsigned ones. So it is quite possible to store data with
timestamp that is represented as a number larger than the max value of
int64_t type.
One such example is api::min_timestamp() that is used when generating
system schema tables ("keyspaces"). When cast to uint64_t, it turns into
a large value.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
There previously was an inconsistency in treating min values stored in a
serialization_header. They are written to or read from a Statistics.db
as deltas against fixed bases, but when we parse timeouts from the data
file, we need the full bases, not just deltas.
This inconsistency causes wrong timestamp values if we write an sstable
and then read from it using one and the same sstable object because we
turn min values into bases on write and then don't adjust them back
because we already have them in memory.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
SStable format mc doesn't write ancestors to metadata, so resharding
will not work with this new format because it relies on ancestors to
replace new unshared sstables with old shared ones.
Fix is about not relying on ancestors metadata for this operation.
Fixes#3777.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180922211933.1987-1-raphaelsc@scylladb.com>
Today I realised that although we have per-table metrics, they are not
*really* available by default. I was suprised to find that we don't have
(as far as I can tell) a document explaining why it is so, or how to enable
them anyway. Moreover, the more I investigated this issue, the more I
realised how little I know on Scylla's metrics - how they are calculated,
how they are collected, their different types, and so on.
So I sat down to figure out everything I wanted to learn about Scylla metrics,
and then wrote it all down in a new document, docs/metrics.md.
There are some missing pieces in this document marked by TODO, and probably
additional missing pieces that I'm not aware of, but I think this is already
a good start and can be (and should be) improved-on later.
We really need to have more of these documents describing various Scylla
subsystems to new developers - what each subsystem does, why it does what
it does, where is the code, and so on. I am facing these problems every
day as a seasoned developer - I can't even imagine what our new developers
face when trying to understand a subsystem they are not yet familiar with.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180920131103.20590-1-nyh@scylladb.com>
There is a bad interaction between may_need_paging() and query result
size limiter. The former is trying to avoid the complexity of paged
queries when the number of returned rows is going to be smaller than the
page size. The latter uses the fact that paged queries need not return
all requested rows to limit the size of a query results. Since
may_need_paging() may turn a paged query into non-paged one as a side
effect it disables the oversized result protection.
This patch limits the cases when may_need_paging() disables paging to
the situations when we know for sure that query result size limiter
won't be needed, i.e.: the result is not going to contain more than one
row. If the client knows for sure that the paging is not needed and
the performance impact is worthwhile it can disable paging on its side.
Otherwise, let's default to the safer behaviour.
Fixes#3620.
Message-Id: <20180925134431.24329-1-pdziepak@scylladb.com>
"
This series changes commitlog write path so that it uses fragmented
buffers and therefore avoids large allocations. This is done by first
switching the code to use seastar memory_output_stream interface, which
can handle fragmented buffer without any additional actions from the
user code needed and then making it use buffers of fixed size 128 kB.
Tests: unit(release, debug) dtest(commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup commitlog_test.py:TestCommitLog.test_commitlog_replay_with_alter_table)
"
* tag 'fragmented-commitlog-writes/v3' of https://github.com/pdziepak/scylla:
commitlog: switch to fragmented buffers
commitlog: drop buffer pools
commitlog: drop recovery from bad alloc
utils: drop data_output
commitlog: use memory_output_stream
serialization_visitors: add support for memory_output_stream
utils: fragmented_temporary_buffer::view: add remove_prefix()
utils: fragmented_temporary_buffer: add empty() and size_bytes()
utils: fragmented_temporary_buffer: add get_ostream()
idl: serializer: don't assume Iterator::value_type is bytes_view
idl: serializer: create buffer view from streams
utils: crc: accept FragmentRange
When an sstable is deleted, this work is done as a background task
since it cannot be done from the destructor. If we don't wait for
that background task, it is detected as a leak by ASAN.
Fix by waiting for background tasks in every test.
A more complete fix would involve having a factory class create
sstables and assume the responsibility for background tasks, and
something similar to with_cql_test_env(), but that is deferred until later.
Tests: sstable_3_x_test (debug).
Message-Id: <20180923111745.8313-1-avi@scylladb.com>
Currently, both scylla-housekeeping-daily/-restart services mistakenly
specify repo file path as "@@REPOFILES@@", witch is copied from .in
template, need to be replace with actual path.
Fixes#3776
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180921031605.9330-1-syuu@scylladb.com>
Starting with kernel 4.17 XFS will support the lazytime mount option.
That will be beneficial for Scylla as updating times synchronously is
one of our current sources of stalls.
Fortunately, older kernels are able to parse the option and just ignore
it. We verified that to be the case in a 4.15 kernel on ubuntu.
Therefore, just add the option unconditionally.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180920170017.13215-1-glauber@scylladb.com>
File extensions can also produce errors that checked file wants to
intercept and act upon. The patch changes the order in which files are
wrapped to make checked file the outermost wrapped to be able to handle
exception generated by all inner wrappers.
Message-Id: <20180920124430.GD2326@scylladb.com>
Currently timeout is opt-in, that is, all methods that even have it
default it to `db::no_timeout`. This means that ensuring timeout is used
where it should be is completely up to the author and the reviewrs of
the code. As humans are notoriously prone to mistakes this has resulted
in a very inconsistent usage of timeout, many clients of
`flat_mutation_reader` passing the timeout only to some members and only
on certain call sites. This is small wonder considering that some core
operations like `operator()()` only recently received a timeout
parameter and others like `peek()` didn't even have one until this
patch. Both of these methods call `fill_buffer()` which potentially
talks to the lower layers and is supposed to propagate the timeout.
All this makes the `flat_mutation_reader`'s timeout effectively useless.
To make order in this chaos make the timeout parameter a mandatory one
on all `flat_mutation_reader` methods that need it. This ensures that
humans now get a reminder from the compiler when they forget to pass the
timeout. Clients can still opt-out from passing a timeout by passing
`db::no_timeout` (the previous default value) but this will be now
explicit and developers should think before typing it.
There were suprisingly few core call sites to fix up. Where a timeout
was available nearby I propagated it to be able to pass it to the
reader, where I couldn't I passed `db::no_timeout`. Authors of the
latter kind of code (view, streaming and repair are some of the notable
examples) should maybe consider propagating down a timeout if needed.
In the test code (the wast majority of the changes) I just used
`db::no_timeout` everywhere.
Tests: unit(release, debug)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>
"
In SSTables 3.x, the 'ancestors' field of compaction metadata is no
longer stored in the Statistics.db file
The newly added test has previously failed due to this inconsistency.
Tests: unit {release}
"
* 'projects/sstables-30/empty_clustering_key/v1' of https://github.com/argenet/scylla:
tests: Add test for reading table with empty clustering key from SSTables 3.x.
tests: Update Statistics.db files for SSTables 3.x write tests.
sstables: Do not parse ancestors from compaction metadata for SSTables 3.x
Those files have been generated with 'ancestors' field in compaction
metadata and so were invalid.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Adjust the script to the new schema of system_traces.sessions. Two
new columns have been added:
- request_size: int
- response_size: int
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20180919005504.12498-1-vladz@scylladb.com>
So far commitlog was using contiguous buffers for storing the data that
is about to be written to disk. It was able to coalesce small writes so
that multiple small mutations would use the same buffer, but if a
muation was large the commitlog would attempt to allocate a single,
appropriately large buffer. This excessively stresses the memory
allocator and may cause memory fragmentation to become an issue. The
solution is to use fixed-size buffers of 128 kB, which is the standard
buffer size in Scylla and keep large values fragmented.
Buffer pools were added in 7191a130bb
"Commitlog: recycle buffers to reduce fragmentation." They introduce a
lot of complexity and will become unnecessary once the code is switched
to use fixed-size 128kB buffers.
If a node cannot allocate a 128 kB it is already in a very bad shape, so
there isn't much value in trying to recover by attempting smaller
allocations and it just adds more complexity to the segment allocation.
It actually may be better to let some requests fail and give the node a
chance to recover rather than trying to use every last byte of free
memory and end up with bad_alloc in a noexcept context.
"
Some of the write tests were missing the read after write validation
which has now been added for better coverage.
Tests: unit {release}
"
* 'projects/sstables-30/more-enriched-tests/v1' of https://github.com/argenet/scylla:
tests: Enrich test_write_adjacent_range_tombstones_with_rows with read after write
tests: Enrich test_write_many_range_tombstones with read after write
tests: Enrich test_write_mixed_rows_and_range_tombstones with read after write
tests: Enrich test_write_non_adjacent_range_tombstones with read after write
tests: Enrich test_write_adjacent_range_tombstones with read after write
tests: Enrich test_write_simple_range_tombstone with read after write.
tests: Enrich test_write_deleted_column with read after write.
"
This patchset fixes the bug in SSTables 3.x parser that did not properly
handle deleted counter cells.
A write test is enriched to validate read after write so that this case
is covered.
Tests: unit {release}
"
* 'projects/sstables-30/fix-deleted-counters-read/v1' of https://github.com/argenet/scylla:
tests: Read after write in test_write_counter_table.
sstables: Fix deleted counter cells processing in SSTables 3.x parser.
When a query with multicolumn inequality is issued on clustering columns
having mixed order (ASC and DESC together), if the ranges are not
broken to none overlapping lexicographically monotonic ones, the node
return incorrect rows. This is due to the search nature
(prefix comparison). The solution is to break the range imposed
by the restriction into several single column restrictions OR-ed
together that will be logically equivalent and preserve the
monotonicity assumption. This commit also fixes incorrect results
returned by a multicolumn query on an all descending columns.
A unit test have been added to account for both issues fixed.
Fixes#2050
Tests: Unit test, manual tests of the use case in the issue.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <3b96620a3bd8b0614359a3b0757f324d45189dbb.1536478193.git.eliransin@scylladb.com>
scripts/create-relocatable-package.py:24:1: F401 'shutil' imported but unused
scripts/create-relocatable-package.py:24:1: F401 'tempfile' imported but unused
scripts/create-relocatable-package.py:24:16: E401 multiple imports on one line
scripts/create-relocatable-package.py:26:1: E302 expected 2 blank lines, found 1
scripts/create-relocatable-package.py:47:1: E305 expected 2 blank lines after class or function definition, found 1
scripts/create-relocatable-package.py:93:6: E225 missing whitespace around operator
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180917152520.5032-1-ultrabug@gentoo.org>
The test in question is `restricted_reader_timeout`.
Use `eventually_true()` instead of `sleep()` to wait on the timeout
expiring, making the test more robust on overloaded machines.
Also fix graceful failing, another longstanding issue with this test.
The readers created for the test need different destruction logic
depending whether the test failed or succeeded. Previously this was
dealt with by using the logic that worked in case of success and using
asserts to abort when the test failed, thus avoiding developers
investigating the invalid memory accesses happening due to the wrong
destruction logic.
The solution is to use BOOST_CHECK() macro in the check that validates
whether timeout works as expected. This allows for execution to continue
even if the test failed, and thus allows for running the proper cleanup
code even when the test failed.
Fixes: #3719
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <911921dffc924f1b0a3e86408757467e9be2b65b.1537169933.git.bdenes@scylladb.com>
test_partial_delete_selected_column() does a long string of various
updates and deletes, each specifies a different timestamp. In one
of these updates, the timestamp was forgotten. This means that the
server picks the current time, a large number.
As the test is currently written, it doesn't matter which timestamp
was chosen, the test would still succeed (if timestamp >= 15, and it
must be since the timestamp is the time from the epoch).
But the intention was probably to use timestamp = 15, so let's make
this intention clear.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180905095552.11883-2-nyh@scylladb.com>
We recently saw a failure in test_partial_delete_selected_column() but
this is a very long test doing many operations and comparisons of their
results, and without BOOST_TEST_PASSPOINT() we can't know which of them
really failed.
So let's sprinkle BOOST_TEST_PASSPOINT() calls between the different parts
of test_partial_delete_selected_column(). If this test ever fails again,
we'll know where.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180905095552.11883-1-nyh@scylladb.com>
On Fedora 28, creating an instance of `std::random_device` opens a file
descriptor for `/dev/urandom` (observed via `strace`).
By declaring static thread-local instances of `std::random_device`,
these descriptors will be open (barring optimization by the compiler)
for the entire duration of the Scylla process's life.
However, the `std::random_device` instance is only necessary for
initializing the `RandomNumberEngine` for generating salts. With this
change, the file-descriptor is closed immediately after the engine is
initialized.
I considered generalizing this pattern of initialization into a
function, but with only two uses (and simple ones) I think this would
only obscure things.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Tests: unit (release)
Message-Id: <f1b985d99f66e5e64d714fd0f087e235b71557d2.1536697368.git.jhaberku@scylladb.com>
multishard_mutation_reader starts read-aheads on the
shards-to-be-read-soon. When doing this it didn't check whether the
respective shards had an ongoing read-ahead already. This lead to a
single shard executing multiple concurrent read-aheads. This is damaging
for multiple reasons:
* Can lead to concurrent access of the remote reader's data members.
* The `shard_reader` was designed around a single read-ahead and
thus will synchronise foreground reads with only the last one.
The practical implications of this seen so far was that queries reading
a large number of rows (large enough to reliably trigger the
bug) would stop the read early, due the `combined_mutation_reader`'s
internal accounting being messed up by concurrent access.
Also add a unit test. Instead of coming up with a very specific, and
very contrived unit test, use the test-case that detected this bug in
the first place: count(*) on a table with lots of rows (>1000). This
unit-test should serve well for detecting any similar bugs in the
future.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <ff1c49be64e2fb443f9aa8c5c8d235e682442248.1536746388.git.bdenes@scylladb.com>
Currently the `trace_state` is moved into the `querier` object's
constructor when one has to be created. Since the trace_state is used
below this lines this had the effect that on the first page of the
query, when a querier object has to be created, tracing would not work
inside the `querier_cache` which received a move-from `trace_state` (a
nullptr effectively).
Change the move to a copy so the other half of the function doesn't use
a moved-from `trace_state`.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4987419781aa287141aa9dc8ce99c5068b564c84.1536739052.git.bdenes@scylladb.com>
As our support for reading SSTables 3.x rows is nearly complete, the
write tests can be extended to read data after write.
This patchset adds reading to a handful of write tests.
* https://github.com/argenet/scylla/tree/projects/sstables-30/enrich-write-tests/v6:
tests: Factor out the helper building SSTables path for write tests.
tests: Add validate_read() helper to use in SSTables 3.x write tests.
tests: Preserve tmpdir in SSTables 3.x write tests upon comparison.
tests: Read SSTables for write_static_row test after validating write.
tests: Read SSTables for write_composite_partition_key test after
validating write.
tests: Read SSTables for write_composite_clustering_key test after
validating write.
tests: Read SSTables for write_wide_partitions test after validating
write.
tests: Read SSTables for write_ttled_column test after validating
write.
tests: Read SSTables for write_collection_wide_update test after
validating write.
tests: Read SSTables for write_collection_incremental_update test
after validating write.
tests: Read SSTables for write_missing_columns_large_set test after
validating write.
tests: Read SSTables for write_multiple_partitions test after
validating write.
tests: Read SSTables for write_multiple_rows test after validating
write.
tests: Read SSTables for write_different_types test after validating
write.
tests: Read SSTables for write_empty_clustering_values test after
validating write.
tests: Read SSTables for write_large_clustering_keys test after
validating write.
tests: Read SSTables for write_user_defined_type_table test after
validating write.
tests: Read SSTables for write_deleted_row test after validating
write.
sstables: Fix SSTables 3.x parsing: check use_row_ttl() for TTLed
columns.
tests: Read SSTables for write_ttled_row test after validating write.
Read SSTables for write_compact_table test after validating write.
tests: Read SSTables for tests of many partitions after validating
write.
"
This series contains miscellaneous improvements to the stateful range
scans. These improvements are either things that I forgot to include in
the original series (tracing), was requested by other developers
(comments) or I discovered them while reading the code (lockup and
cleanup).
"
* 'multishard_mutation_query_fixes/v1' of https://github.com/denesb/scylla:
multishard_mutation_query: add some tracing
multishard_mutation_query: add comment to `read_context`
multishard_mutation_query: always cleanup readers properly
multishard_mutation_query: fix possible deadlock when creating a reader fails
Add tracing for the following events:
1) Dismantling of the combined buffer.
2) Dismantling of the compaction state.
3) Cleaning up the readers.
(1) and (2) can possibly have adverse effects on the performance of the
query and hence it is important that details about the dismantled
fragments is exposed in the tracing data.
(3) is less critical but still good to know how much readers were
created by the read (in case they aren't saved). Since normally (in
strateful queries) this will always be 0 only trace this when it is
non-zero (and is interesting).
Currently the reader cleanup code, which ensures the readers and their
dependent objects are destroyed in the corect order and a single
smp::submit_to() message, are only run when the readers are attempted to
be saved. However proper cleanup is needed not only then, but also when
the query is not stateful. Rename the current `cleanup()` method to
`stop()`, make it public and call it from a `finally()` block after the
page is finalized to ensure readers are properly cleaned up at all
times.
Also make sure that failures in `stop()` are never propagated so that
a failure in the cleanup doesn't fail the read itself.
This covers five tests, including three for compressed tables:
- write_many_partitions_deflate
- write_many_partitions_lz4
- write_many_partitions_snappy
- write_many_live_partitions
- write_many_deleted_partitions
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Failing to create a reader (`do_make_remote_reader()`) can lead to a
deadlock if the reader is in any of the future_*_state states, as the
`then()` block is not executed and hence the promise of the first
future in the chain is not set. Avoid this by changing the `then()` to a
`then_wrapped()` and using `set_exception()` and `set_value()`
accordingly, such that the future is resolved on both the happy and
error path.
The non-TLS RPC server has an rpc::resource_limits configuration that limits
its memory consumption, but the TLS server does not. That means a many-node
TLS configuration can OOM if all nodes gang up on a single replica.
Fix by passing the limits to the TLS server too.
Fixes#3757.
Message-Id: <20180907192607.19802-1-avi@scylladb.com>
When measuring_output_stream is used to calculate result's element size
it incorrectly takes into account not only serialized element size, but
a placeholder that ser::qr_partition__rows/qr_partition__static_row__cells
constructors puts in the beginning. Fix it by taking starting point in a
stream before element serialization and subtracting it afterwords.
Fixes#3755
Message-Id: <20180906153609.GJ2326@scylladb.com>
The code uses incorrect output stream in case only digest is requested
and thus getting incorrect data size. Failing to correctly account
for static row size while calculating digest may cause digest mismatch
between digest and data query.
Fixes#3753.
Message-Id: <20180905131219.GD2326@scylladb.com>
Commit e664f9b0c6 transitioned internal
CQL queries in the auth. sub-system to be executed with finite time-outs
instead of infinite ones.
It should have also modified the functions in `auth/roles-metadata.cc`
to have finite time-outs.
This change fixes some previously failing dtests, particularly around
repair. Without this change, the QUORUM query fails to terminate when
the necessary consistency level cannot be achieved.
Fixes#3736.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <e244dc3e731b4019f3be72c52a91f23ee4bb68d1.1536163859.git.jhaberku@scylladb.com>
The foreground reads metric is derived from the number of live read
executors minus the number of background reads. Background reads are
counted down when their resolver times out. However, a read executor
may still be around for a while, resulting in such reads being
accounted as foreground.
Usually, the gap in which this happens is short, because executor
reference holders timeout quickly as well. It's not always the case
though. For instance, local read executor doesn't time out quickly
when the target shard has an overloaded CPU, and it takes a while
before the request goes through all the queues, even if IO is not
involved. Observed in #3628.
Fixes#3734.
Another problem is that all reads which received CL responses are
accounted as background, until all replicas respond, but if such read
needs reconciliation, it's still practically a foreground read and
should be accounted as such. Found during code review.
Fixes#3745.
This patch fixes both issues by rearranging accounting to track
foreground reads instead of background reads, and considering all
reads as foreground until the resulting promise is resolved.
Message-Id: <1535999620-25784-1-git-send-email-tgrabiec@scylladb.com>
"
This patchset adds proper support for sliced reads of partitions
containing range tombstones.
Given the SSTables 3.x repesentation of range tombstones by separate
start and end markers, we refer to the index for the information about
the currently opened range tombstone, if any, when skipping to the next
promoted index block.
Note that for this we have to take the promoted index block immediately
preceding the one we are jumping to.
Tests: unit {release}
"
* 'projects/sstables-30/range-tombstones-slicing/v3' of https://github.com/argenet/scylla:
tests: Test filtering and forwarding on a partition with interleaved rows and RTs.
tests: Add tests for reading wide partitions with range tombstones.
sstables: Support slicing for range tombstones.
sstables: Set/reset range tombstone start from end open marker.
sstables: Fix end_open_marker population in promoted index blocks.
sstables: Add need_skip() helper to data_consume_context.
sstables: For end_open_marker, return both position in partition and deletion time.
When we skip through a wide partition using promoted index, we may land
to a position that lies in the middle of a range tombstone so we need to
be aware of it. For this, we check if the previous promoted block has an
end open marker and either set the range tombstone start using it or
reset if missing.
Note several things about the implementation.
Firstly, we have to peek back at the previous promoted index block for the
end open marker, and so we have to always preserve one more promoted
index block when we read the next batch so that we can stil access it.
Secondly, we use the previous promoted block end position to build
position in partition for the range tombstone start.
Lastly, we don't have a notion of end open marker in older consumers
that work with SSTables of ka/la formats so we only call the
corresponding methods if the consumer supports them.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
We should not access the internal object stored in std::optional when
passing the end_open_marker, moreover that it can be disengaged.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This methods tells whether we will need to skip to reach the input
position or not.
It can be used for skipping with index when reading SSTables 3.x because
we only want to to set/reset the open range tombstone bound when we
actually move to another promoted index block.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
When a joining node announcing join status through gossip, other
existing nodes will send writes to the joining node. At this time, it
is possible the joining node hasn't learnt the tokens of other nodes
that causes the error like below:
token_metadata - sorted_tokens is empty in first_token_index!
storage_proxy - Failed to apply mutation from 127.0.4.1#0:
std::runtime_error (sorted_tokens is empty in first_token_index!)
To fix, wait for the token range setup before announcing the join
status.
Fixes: #3382
Tests: 60 run of materialized_views_test.py:TestMaterializedViews.add_dc_during_mv_update_test
Message-Id: <01abb21ae3315ae275297e507c5956e5774557ef.1536128531.git.asias@scylladb.com>
Change the validity timeout from 1s to 1h in order to avoid false alarms
on busy systems: for a short value there is a chance that
(loading_cache.size() == num_loaders) check is going to run after some elements
of the cache have already been evicted.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20180904193026.7304-1-vladz@scylladb.com>
Prior to this fix, the end_open_marker has been only accessible as a
plain deletion_time structure. Now it also contains the start position
of a promoted index block so that it can be used for setting range
tombstone open bound.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
When slice::is_satisfied_by() restriction check is performed
on raw data represented as bytes, it should always use a regular
type comparator, not a reversed one. Reversed types are used to
preserve descending clustering order, but comparison with constants
should be used with a regular underlying type comparator (for x < 1
to actually mean 'lesser than 1' instead of 'bigger than 1, because
the clustering order is reversed').
Fixes#3741
Message-Id: <3e25fc66688c9253287f2c4f31ede8339b9bbe23.1535981852.git.sarna@scylladb.com>
When /etc/systemd/system/scylla-server.service.d/capabilities.conf is
not installed, we don't have /etc/systemd/system/scylla-server.service.d/,
need to create it.
Fixes#3738
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180904015841.18433-1-syuu@scylladb.com>
This ensures that row::external_memory_usage() is invariant to
insertion order of cells.
It should be so, so that accounting of a clustering_row, merged from
multiple MVCC versions by the partition_snapshot_flat_reader on behalf
of a memtable flush, doesn't give a greater result than what is used
by the memtable region. Overaccounting leads to assertion failure in
~flush_memory_accounter.
Fixes#3625 (hopefully).
Message-Id: <1535982513-19922-1-git-send-email-tgrabiec@scylladb.com>
"
This series extends the query statefullness, introduced by f8613a841 to
point queries, to range scans as well. This means that queriers will be
saved and reused for range scans too.
This series builds heavily on the infrastructure introduced by stateful
point queries, namely the querier object and the querier_cache. It also
builds on another critical piece of infrastructure, the
multishard_combining_reader, introduced by 2d126a79b.
To make the range scan on a given node suspendable and resumable we move
away from the current code in
`storage_proxy::query_nonsingular_mutations_locally()` and use a
multishard_combining_reader to execute the read. When the page is filled
this reader is dismantled and its shard readers are saved in the
querier cache.
There are of course a lot more details to it but this is the gist of it.
Tests: unit(release, debug), dtest(paging_test.py, paging_additional_test.py)
"
* '1865/range-scans/v7.1' of https://github.com/denesb/scylla: (33 commits)
query_pagers: generate query_uuid for range-scans as well
storage_proxy: use preferred/last replicas
storage_proxy: add preferred/last replicas to the signature of query_partition_key_range_concurrent
db::consistency_level::filter_for_query() add preferred_endpoints
storage_proxy: use query_mutations_from_all_shards() for range scans
tests: add unit test for multishard_mutation_query()
tests/mutation_assertions.hh: add missing include
multishard_mutation_query: add badness counters
database: add query_mutations_on_all_shards()
mutation_compactor: add detach_state()
flat_mutation_reader: add unpop_mutation_fragment()
Move reconcilable_result_builder declaration to mutation_query.hh
mutation_source_test: add an additional REQUIRE()
mutation: add missing assert to mutation from reader
querier: add shard_mutation_querier
querier: prepare for multi-ranges
tests/querier_cache: add tests specific for multiple entry-types
querier: split querier into separate data and mutation querier types
querier: move consume_page logic into a free function
querier: move all matching related logic into free functions
...
Add badness counters that allow tracking problems. The following
counters are added:
1) multishard_query_unpopped_fragments
2) multishard_query_unpopped_bytes
3) multishard_query_failed_reader_stops
4) multishard_query_failed_reader_saves
The first pair of counters observe the amount of work range scan queries
have to undo on each page. It is normal for these counters to be
non-zero, however sudden spikes in their values can indicate problems.
This undoing of work is needed for stateful range-scans to work.
When stateful queries are enabled the `multishard_combining_reader` is
dismantled and all unconsumed fragments in its and any of its
intermediate reader's buffers are pushed back into the originating shard
reader's buffer (via `unpop_mutation_fragment()`). This also includes
the `partition_start`, the `static_row` (if there is one) and all
extracted and active `range_tombstone` fragments. This together can
amount to a substantial amount of fragments.
(1) counts the amount of fragments moved back, while (2) counts the
number of bytes. Monitoring size and quantity separately allows for
detecting edge cases like moving many small fragments or just a few huge
ones. The counters count the fragments/bytes moved back to readers
located on the shard they belong to.
The second pair of counters are added to detect any problems around
saving readers. Since the failure to save a reader will not fail the
read itself, it is necessary to add visibility to these failures by
other means.
(3) counts the number of times stopping a shard reader (waiting
on pending read-aheads and next-partitions) failed while (4)
counts the number of times inserting the reader into the `querier_cache`
failed.
Contrary to the first two counters, which will almost certainly never be
zero, these latter two counters should always be zero. Any other value
indicates problems in the respective shards/nodes.
This method allows for querying a range or ranges on all shards of the
node. Under the hood it uses the multishard_combining_reader for
executing the query.
It supports paging and stateful queries (saving and reusing the readers
between pages). All this is transparent to the client, who only needs to
supply the same query::read_command::query_uuid through the pages of the
query (and supply correct start positions on each page, that match the
stop position of the last page).
Allow the state of the compaction to be detached. The detached state is
a set of mutation fragments, which if replayed through a new compactor
object will result in the latter being in the same state as the previous
one was.
This allows for storing the compaction state in the compacted reader by
using `unpop_mutation_fragment()` to push back the fragments that
comprise the detached state into the reader. This way, if a new
compaction object is created it can just consume the reader and continue
where the previous compaction left off.
This is the inverse of `pop_mutation_fragment()`. Allow fragments to be
pushed back into the buffer of the reader to undo a previous consumtion
of the fragments.
test_streamed_mutation_forwarding_is_consistent_with_slicing already has
a REQUIRE() for the mutation read with the slicing reader. Add another
one for the forwarding reader. This makes it more consistent and also
helps finding problems with either the forwarding or slicing reader.
read_mutation_from_flat_mutation_reader's internal adapter can build a
single mutation only and hence can consume only a single partition.
If more than one partitions are pushed down from the producer the
adaptor will very likely crash. To avoid unnecessary investigations add
an assert() to fail early and make it clear what the real problem is.
All other consume_ methods have an assert() already for their
invariants so this is just following suit.
The querier to be used for saving shard readers belonging to a
multishard range scan. This querier doesn't provide a `consume_page`
method as it doesn't support reading from it directly. It is more
of a storage to allow caching the reader and any objects it depends on.
In the next patch a querier will be added that reads multiple ranges as
opposed to a single range that data and mutation queriers read.
To keep `querier_cache` code seamless regarding this difference change all
range-matching logic to work in terms of `dht::partition_ranges_view`.
This allows for cheap and seamless way of having a single code-base for
the insert/lookup logic. Code actually matching ranges is updated to be
able to handle both singular and multi-ranges while maintaining backward
compatibility.
Instead of hiding what compaction method the querier uses (and only
expose it via rejecting 'can_be_used_for_page()`) make it very explicit
that these are really two different queriers. This allows using
different indexes for the two queriers in `querier_cache` and
eliminating the possibility of picking up a querier with the wrong
compaction method (read kind).
This also makes it possible to add new querier type(s) that suit the
multishard-query's needs without making a confusing mess of `querier` by
making it a union of all querying logic.
Splitting the queriers this way changes what happens when a lookup finds
a querier of the wrong kind (e.g. emit_only_live::yes for an
emit_only_live::no command). As opposed to dropping the found (but
wrong) querier the querier will now simply not be found by the lookup.
This is a result of using separate search indexes for the different
mutation kinds. This change should have no practical implications.
Splitting is done by making querier templated on `emit_only_live_rows`.
It doesn't make sense to duplicate the entire querier as the two share
99% of the code.
In preparation of the now single querier being split into multiple more
specialized ones. Make it possible for the multiple queriers sharing the
same implementation. Also, the code can now be reused by outside code as
well, not just queriers.
So that they can be used for multiple querier classes easily, without
inheritance. The functions are not visible from the header.
Also update the comments on `querier` to w.r.t. the disappeared
checking functions. Change the language to be more general. In practice
these checks are never done by client code, instead they are done by the
`querier_cache`.
In preparations for introducing support multiple entry types in the
querier_cache move all insert/lookup related logic into free functions.
Later these functions will be templated so they can handle multiple
entry types with the same code.
Requiring the caller of lookup() to pass in a `create_fun()` was not
such a good idea in hindsight. It leads to awkward call sites and even
more awkward code when trying to find out whether the lookup was
successfull or not.
Returning an optional gives calling code much more flexibility and makes
the code cleaner.
Add a dismantler functor parameter. When the multishard reader is
destroyed this functor will be called for each shard reader, passing a
future to a `stopped_foreign_reader`. This future becomes available when
the shard reader is stopped, that is, when it finished all in-progress
read-aheads and/or pending next partition calls.
The intended use case for the dismantler functor is a client that needs
to be notified when readers are destroyed and/or has to have access to
any unconsumed fragments from the foreign readers wrapping the shard
readers.
Extend `remote_reader_factory` interface so that it accepts all standard
mutation reader creation parameters. This allows factory lambdas to be
truly stateless, not having to capture any standard parameters that is
needed for creating the reader.
Standard parameters are those accepted by
`mutation_source::make_reader()`.
"
After we fixed reloading flow it enabled situations when items are no longer cached but
still held in the underlying loading_shared_values object. Since loading_cache::size() returns
the size of its loading_shared_values object and loading_cache::begin()/end()/find() are returning
iterators based on loading_shared_values iterators these APIs may return very weird values, e.g.
size() may return the same value after one of the items have been removed using remove(key) API.
This series fixes this by switching mentioned above APIs to work on top of lru_list object instead
of loading_shared_values.
"
* 'loading_cache_fix_api_semantics-v1' of https://github.com/vladzcloudius/scylla:
loading_cache: make iterator work on top of lru_list iterators instead of loading_shared_values'
loading_cache: make size() return the size of lru_list instead of loading_shared_values
A relocatable package contains the Scylla (and iotune)
executables (in a bin/ directory), any libraries they may need (lib/)
the configuration file defaults (conf/) and supporting scripts (dist/).
The libraries are picked up from the host; including libc and the dynamic
linker (ld.so).
We also provide a thunk script that forces the library path
(LD_LIBRARY_PATH) to point at our libraries, and overrides the
interpreter to point at our ld.so.
With these files, it is possible to run a fully functional Scylla
instance on any Linux distribution. This is similar to chroot or
containers, except that we run in the same namespace as the host.
The packages are created by running
ninja build/release/scylla-package.tar
or
ninja --mode debug build/debug/scylla-package.tar
Message-Id: <20180828065352.30730-1-avi@scylladb.com>
Reloading may hold value in the underlying loading_shared_values while
the corresponding cache values have already been deleted.
This may create weird situations like this:
<populate cache with 10 entries>
cache.remove(key1);
for (auto& e : cache) {
std::out << e << std::endl;
}
<all 10 entries are printed, including the one for "key1">
In order to avoid such situations we are going to make the loading_cache::iterator
to be a transform_iterator of lru_list::iterator instead of loading_shared_values::iterator
because lru_list contains entries only for cached items.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
reloading flow may hold the items in the underlying loading_shared_values
after they have been removed (e.g. via remove(key) API) thereby loading_shared_values.size()
doesn't represent the correct value for the loading_cache. lru_list.size() on the other hand - does.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
* seastar 12f18ce...5712816 (6):
> tests: add signal_test to test list
> Merge "Enhancements for memory_output_stream" from Paweł
> seastar-addr2line: don't print an empty line between backtrace lines
> seastar-addr2line: add --verbose option
> seastar-addr2line: make prefix matching non-greedy
> future: make available() const
When we load new SSTables, we use the directory information from the
entry descriptor to build information about those SSTables. When the
descriptor is created by flush_upload_dir, the sstable directory used in
the descriptor contains the `upload` part. Therefore, we will try to
load SSTables that are in the upload directory when we already moved
them out and fail.
Since the generation also changes, we have been historically fixing the
generation manually, but not the SSTable directory. The reason for that
is that up until recently, the SSTable directory was passed statically
to open_sstables, ignoring whatever the entry descriptor said. Now that
the sstable directory is also derived from the entry descriptor, we
should fix that too.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180829165326.12183-1-glauber@scylladb.com>
Additional tests for cases surrounding issue #3362, where base rows
disappear (or not) and view rows need to disappear (or not) as well.
These new tests focus on checking that view_updates::do_delete_old_entry()
is correct.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180829131914.16042-2-nyh@scylladb.com>
In previous patches, we gave up on an old (and broken) attempt to track
the timestamps of many unselected base-table columns through one row marker
in the view table - and replaced them by "virtual cells", one per unselected
cell.
The do_delete_old_entry() function still contains old code which maintained
that row marker, and is no longer needed. That old code is no only no longer
needed, it also no longer did anything because all columns now appear in
the view (as virtual columns) so the code ignored them when calculating the
row marker.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180829131914.16042-1-nyh@scylladb.com>
"
When a view's partition key contains only columns from the base's partition
key (and not an additional one), the liveness - existance or disappearance -
of a view-table row is tied to the liveness of the base table row. And
that, in turn, depends not only on selected columns (base-table columns
SELECTed to also appear in the view) but also on unselected columns.
This means that we may need to keep a view row alive even without data,
just because some unselected column is alive in the base table. Before this
patch set we tried to build a single "row marker" in the view column which
tried to summarize the liveness information in all unselected columns.
But this proved unworkable, as explained in issue #3362 and as will be
demonstrated in unit tests at the end of this series.
Because we can't replace several unselected cells by one row marker, what
we do in this series is to add for each for the unselected cells a "virtual
cell" which contains the cell's liveness information (timestamp, deletion,
ttl) but not its value. For collections, we can't represent the entire
collection by one virtual cell, and rather need a collection of virtual
cells.
Fixes#3362
"
* 'virtual-cols-v3' of https://github.com/nyh/scylla:
Materialized Views: test that virtual columns are not visible
Materialized Views: unit test reproducing fixed issue #3362
Materialized Views: no need for elaborate row marker calculations
Materialized Views: add unselected columns as virtual columns
Materialized Views: fill virtual columns
Do not allow selecting a virtual column
schema: persist "view virtual" columns to a separate system table
schema: add "view virtual" flag to schema's column_definition
Add "empty" type name to CQL parser, but only for internal parsing
"
Previous work (71471bb322) converted the CQL layer to inheriting
execution stages, paving the way to multiple users sharing the front-end.
This patchset does the same thing to the back-end, converting more execution
stages to preserve the caller's scheduling_group. Since RPC now (8c993e0728)
assigns the correct scheduling group within the replica, we can extend that
work so a statement is executed with the same scheduling group all the way
to sstable parsing, even if we cross nodes in the process. This improves
performance isolation and paves the way to multi-user SLA guarantees.
"
* tag 'inherit-sched_group/v1' of https://github.com/avikivity/scylla:
database: make database's mutation apply stage inherit its scheduling group from the caller
database: make database::_mutation_query_stage inherit the scheduling group
database: make database::_data_query_stage inheriting its caller's scheduling_group
storage_proxy: make _mutate_stage inherit its caller's scheduling_group
"This series introduces a few improvements related to a reload flow.
From now on the callback may assume that the "key" parameter value
is kept alive till the end of its execution in the reloading flow.
It may also safely evict as many items from the cache as needed."
Fixes#3606
* 'loading_cache_improve_reload-v1' of https://github.com/vladzcloudius/scylla:
utils::loading_cache: hold a shared_value_ptr to the value when we reload
utils::loading_cache::on_timer(): remove not needed capture of "this"
utils::loading_cache::on_timer(): use chunked_vector for storing elements we want to reload
"
Fix loading_cache_test flakiness by retrying assertions.
Tests: unit(loading_cache_test(debug, release))
Fixes#3723
"
* 'loading-cache-test-flake/v4' of https://github.com/duarten/scylla:
tests/loading_cache_test: Unflake test_loading_cache_loading_reloading
tests/loading_cache_test: Use eventually() instead of open-coding it
tests/mutation_reader_test: Extract eventually_true() to eventually.hh
tests/cql_test_env: Lift eventually() to its own header file
* seastar 9bb1611...12f18ce (17):
> correctly configure I/O Scheduler for usage with the YAML file
> Added support for user-defined signal handlers
> Added reactor method to modify blocked_reactor_notify_ms
> configure.py: Use the user-specified compiler for dialect detection
> seastar-addr2line: clear current trace when omitting already seen trace
> seastar-addr2line: fix redirecting output to a file
> seastar-addr2line: don't require a space before the addresses
> tests: Ensure test thread is always joined
> README.md: Add cute badges
> iotune: adjust num-io-queues recommendation
> dns: add SRV record lookup
> reactor: define max_aio_per_queue for C++14
> reactor,alien: silence GCC warnings
> core,json,net: silence GCC warnings
> fstream: "using data_sink_impl::put" to silence gcc warning
> Merge 'Ensure Seastar compiles in C++14 mode' from Jesse
> Revert "foreign_ptr: allow waiting for the destruction of the managed ptr"
Implement and test support for reading range tombstones in SSTables 3.
Does not yet support reads which are using slicing or fast forwarding.
From github.com/scylladb/seastar-dev.git haaawk/sstables3/tombstones_v11:
Piotr Jastrzebski (5):
sstables: Add consumer_m::consume_range_tombstone
sstables: Support null columns in ck
sstables: Support reading range_tombstones
sstables: Test reading range_tombstones
sstables: Add test for RT with non-full key
Vladimir Krivopalov (2):
sstables: Add operator<< overload for bound_kind_m.
keys: Add clustering_key_prefix::make_full helper.
The `loading_cache_test::test_loading_cache_loading_reloading` test
case is flaky, and fails in both debug and release mode. In an
over-provisioned environment, it's possible that when the reactor
runs, the timers for the `sleep()` and for reloading the
`loading_cache` are both expired, and continuations are scheduled with
an arbitrary order, causing the test to fail.
Fixes#3723
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This error is transient, since as soon as the node is up we will be able
to send the migration request. Downgrade it to a warning to reduce anxiety
among people who actually read the logs (like QA).
The message is also badly worded as no one can guess what a migration
request is, but that is left to another patch.
Fixes#3706.
Message-Id: <20180821070200.18691-1-avi@scylladb.com>
Now scylla-ami is not submodule of scylla repo, it will works as
independent repository just like scylla-jmx and scylla-tools, provides
.rpm package to install AMI scripts on AMI.
Most files are gone from dist/ami/files, but scylla_install_ami copied
from scylla-ami, since it requires to install scylla .rpms, cannot
pacakge in scylla-ami rpm.
On scylla_install_ami, we dropped ixgbevf/ena drivers code, we will
provide 'scylla-ixgbevf' and 'scylla-ena' DKMS .rpm instead.
It will automatically build kernel modules for current kernel.
A repo of the driver packages is on
https://copr.fedorainfracloud.org/coprs/scylladb/scylla-ami-drivers/
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180821201101.4631-1-syuu@scylladb.com>
"
Right now, simple_memory_input_stream takes Iterator as a template
parameter. That iterator is supposed to point to fragments in a
underlying fragmented buffer. This makes no sense, since simple streams
deal only with contiguous buffer.
This series removes any assumption that simple_memory_input_stream has
iterator_type member from Scylla so that it can be removed.
"
* tag 'prepare-simple-stream-no-iterator/v1' of https://github.com/pdziepak/scylla:
idl: deserialized_bytes_proxy do not assume presence of iterator_type
idl-compiler: specify return type of with_serialized_stream() lambdas
"
This series is a refactor of password management, motivated by a
combination of correctness bugs, improving testability, improving
clarity, and adding documentation.
Tests: unit (release)
"
* 'jhk/passwords_refactor/v2' of https://github.com/hakuch/scylla:
auth: Clean up implementation comments
auth: Remove unnecessary local variable
auth: Allow different random engines for salt
auth: Correct modulo bias in salt generation
auth: Extract random byte generation for salt
auth: Split out test for best supported scheme
auth: Rename function to use full words
auth: Add domain-specific exception for passwords
auth: Document passwords interface
auth: Move passsword stuff to its own namespace
auth: Identify password hashing errors correctly
auth: Add unit tests for password handling
auth: Move password handling to its own files
auth: Construct `std::random_device` instances once
There could be soft pressure, but soft-pressure flusher may not be
able to make progress (Refs #3716). It will keep trying to flush empty
memtables, which block on earlier flushes to complete, and thus
allocate continuations in memory. Those continuations accumulate in
memory and can cause OOM.
flush will take longer to complete. Due to scheduling group isolation,
the soft-pressure flusher will keep getting the CPU.
This causes bad_alloc and crashes of dtest:
limits_test.py:TestLimits.max_cells_test
Fixes#3717
Message-Id: <1535102520-23039-1-git-send-email-tgrabiec@scylladb.com>
The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.
The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.
I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.
The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, so it further defers in time the
removal of the flushed memtable from the memtable list.
This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.
Fixes#3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>
* dist/ami/files/scylla-ami c7e5a70...b7db861 (2):
> scylla-ami-setup.service: run only on first startup
> Use fstab to mount RAID volume on every reboot
Since the Linux system abort booting when it fails to mount fstab entries,
user may not able to see an error message when we use fstab to mount
/var/lib/scylla on AMI.
Instead of abort booting, we can just abort to start scylla-server.service
when RAID volume is not mounted, using RequiresMountsFor directive of systemd
unit file.
See #3640
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180824185511.17557-1-syuu@scylladb.com>
Like the two preceeding patches, convert the mutation apply stage
to an inheriting_concrete_scheduling_group. This change has two
added benefits: we get rid of a thread_local, and we drop a
with_scheduling_group() inside an execution stage which just creates a bunch
of continuations and somewhat undoes the benefit of the execution stage.
Now (8c993e0728) that replica-side operations run under the correct
scheduling group, we can inherit the scheduling_group for _data_query_stage
from the caller. By itself this doesn't do much, but it will later allow us
to have multiple groups for statement executions.
Right now, storage_proxy's mutate_stage violates isolation by running
in a plain execution_stage without a scheduling_group. This means do_mutate()
will run under the main scheduling_group, at least until we reach the database
apply execution stage, which is correct.
Fix by moving to an inheriting execution stage; this works because the
messaging service will tell RPC to set the correct execution stage for us. We
could explicitly specify statement_scheduling_group, but inheriting the
scheduling group allows us to have multiple statment scheduling groups, later.
deserialized_bytes_proxy assumes that the provided input stream has
iterator_type that represents the iterator pointing to the next
fragment of the fragmented underlying buffyer. This makes little sense
if the input stream is a contiguous one (i.e.
simple_memory_input_stream) so let's not make such assumptions.
IDL-generated code uses with_serialized_stream() to optimise for cases
when the underlying buffer is not fragmented. The provided lambda will
be called with wither simple or fragmented stream as an argument. The
consequence of this is that both instantations of generic lambda need to
return the same type. This is a problem if the type is deduced and
depends on the provided input stream (e.g. different type for fragmented
and simple streams). The solution is to explictly specify the return
type as the type returned by deserialising general utils::input_stream.
This way each instantation of lambda can return whatever it wants as
long as it is convertible to the type that the serialiser would return
if utils::input_stream was given.
memtable flushes for system and regular region groups run under the
memtable_scheduling_group, but the controller adjusts shares based on
the occupancy of the regular region group.
It can happen that regular is not under pressure, but system is. In
this case the controller will incorrectly assign low shares to the
memtable flush of system. This may result in high latency and low
throughput for writes in the system group.
I observed writes to the sytem keyspace timing out (on scylla-2.3-rc2)
in the dtest: limits_test.py:TestLimits.max_cells_test, which went
away after this.
Fixes#3717.
Message-Id: <1535016026-28006-1-git-send-email-tgrabiec@scylladb.com>
Being the single user of fnv1a, this allows us to get rid of it. As
the TODO inside fnv1a_hasher.hh indicates, and judging by any
independent benchmark, fnv1a is very slow. As we have added xx_hash
since then, and we know it to be fast, use it instead.
Tests: unit(release/cell_locker_test)
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180823081715.26089-1-duarte@scylladb.com>
This method fills non-full clustering key with trailing empty values to
make it full.
This can be used for clustering keys of rows in a compact table as,
unlike in regular tables, they can be non-full.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This patch changes how list of tokens returned from the storage_service
API.
Instead of create a vector and construct a json object of it, use the
streaming capabilities of the http.
This is important for large cluster and prevent large allocations.
Fixes#3701
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20180820195631.26792-1-amnon@scylladb.com>
Fixes a regression introduced in
9e88b60ef5, which broke the lookup for
prefetched values of lists when a clustering key is specified.
This is the code that was removed from some list operations:
std::experimental::optional<clustering_key> row_key;
if (!column.is_static()) {
row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
}
...
auto&& existing_list = params.get_prefetched_list(m.key().view(), row_key, column);
Put it back, in the form of common code in the update_parameters class.
Fixes#3703
* https://github.com/duarten/scylla cql-list-fixes/v1:
tests/cql_query_test: Test multi-cell static list updates with ckeys
cql3/lists: Fix multi-cell static list updates in the presence of ckeys
keys: Add factory for an empty clustering_key_prefix_view
This patch fixes a regression introduced in
9e88b60ef5, which broke the lookup for
prefetched values of lists when a clustering key is specified.
This is the code that was removed from some list operations:
std::experimental::optional<clustering_key> row_key;
if (!column.is_static()) {
row_key = clustering_key::from_clustering_prefix(*params._schema, prefix);
}
...
auto&& existing_list = params.get_prefetched_list(m.key().view(), row_key, column);
Put it back, in the form of common code in the update_parameters class.
Fixes#3703
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patchset adds support for skipping inside wide partitions using
index for sliced queries. This can significantly reduce disk I/O for
queries that only need to read a small amount of data from a wide
partition.
Other changes include general code clean-up and simplification.
* github.com/argenet/scylla.git tree/projects/sstables-30/skip_using_index/v6:
sstables: Support resetting data_consume_rows_context_m to
indexable_element::cell.
tests: Add tests to cover skipping with index through SSTables 3.x.
sstables: Support skipping inside wide partitions using index.
to_string: Add operator<< overload for std::optional.
sstables: Use std::optional instead of std::experimental::optional.
We see occasional bad_alloc failures in release mode; this is due
to the random mutation generator generating large mutations.
Reduce the mutation count to 300. I tested 100 runs and all passed,
so it reduces the false positive rate to < 1%.
After ac27d1c93b if a read executor has just enough targets to
achieve request's CL and a connection to one of them will be dropped
during execution ReadFailed error will be returned immediately and
client will not have a chance to issue speculative read (retry). The
patch changes the code to not return ReadFailed error immediately, but
wait for timeout instead and give a client chance to issue speculative
read in case read executor does not have additional targets to send
speculative reads to by itself.
Fixes#3699.
Message-Id: <20180819131646.GK2326@scylladb.com>
This fix adds proper support for skipping inside wide partitions using
index for sliced reads. This significantly reduces disk I/O for filtered
queries.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
The constraint is no longer relevant, since Casandra removed
it in version 2.2. In addition the mechanism for handling this
case is already implemented and is identical in case of
clustering keys with single column EQ,= and IN relations.
(Cartesian product of singular ranges).
A unit test for this test case was added.
Fixes#1735
Tests:
1. Unit Tests.
2. Manual testing with the case described in the issue.
3. dtest: ql_additional_tests.py:TestCQL.composite_row_key_test
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <83b43fdc1ca0e0cc287f66f11816fc71b8bd2925.1534430405.git.eliransin@scylladb.com>
LIMIT should restrict the output result and not the query whose result
set is aggregated. when using aggregate the output is guarantied to
be only one row long. since LIMIT accepts only none negative numbers,
it has no effect and can be ignored.
Fixes#2028
Tests: The issue described Testcase , UnitTests.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <6c235376c81f052020e2ed23d0a3d071b36d4415.1534416997.git.eliransin@scylladb.com>
In the previous patches, we added "virtual columns" to materialized views
to solve row liveness issues (issue #3362). Here we add a test that confirms
that although these virtual columns exist in the view, they should not be
visible to the user. They cannot be explicitly SELECTed from the view table,
and a "SELECT *" will skip them.
Refs #3362.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch includes several tests reproducing issue #3362 - the effect
of unselected columns on view-table row liveness - and confirming
that it was fixed.
We found two example scenarios to demonstrate the bug. One scenario,
test_3362_with_ttls(), involves an unselected column with a TTL. The other,
test_3362_no_ttls() demonstrates the same bug without using TTL, and using
explicit updates and deletions instead. These two tests are heavily
commented, to explain what they test, and why.
In addition to these two basic tests, we also include similar tests
involving multiple items in a collection column, instead of multiple
separate columns, which demonstrate the same problem exists there (and
why, unfortunately, the "virtual columns" we add in that case need to
be collections too).
We also test that the virtual columns - and the problems they fix -
work not only on columns originally created with the view, but also
with unselected columns added later with ALTER TABLE on the base table.
Refs #3362.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Now that we have separate virtual cells to represent unselected columns
in a materialized view, we no longer need the elaborate row-marker liveness
calculations which aimed (but failed) to do the same thing. So that code
can be removed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When a view's partition key contains only columns from the base's partition
key (and not an additional one), the liveness (existance or disappearance)
of a view-table row is tied to the liveness of the base table row - and
that depends not only on selected columns (base-table columns SELECTed to
also appear in the view) but also on unselected columns.
This means that we may need to keep a view row alive even without data,
just because some unselected column is alive in the base table. Before this
patch we tried to build a single "row marker" in the view column which
summarizes the liveness information in all unselected columns, but this
proved unworkable, as explained in issue #3362 and as will be demonstrated
in unit tests in a later patch.
Because we can't replace several unselected cells by one row marker, what
we do in this patch is to add for each for the unselected cell a "virtual
cell" which contains the cell's liveness information (timestamp, deletion,
ttl) but not its value. For collections, we can't represent the entire
collection by one virtual cell, and rather need a collection of virtual
cells.
This patch just adds the virtual columns to the view schema. Code in
the previous patch, when it notices the virtual columns in the view's
schema, added the appropriate content into these columns.
We may need to add virtual columns to a view when first created, but also
when an unselected column is added to the base table with "ALTER TABLE",
so both are supported in this patch.
Fixes#3362.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The add_cells_to_view() function usually adds selected cells from the base
table to the view mutation. For issue #3362, we sometimes want to also
add unselected cells as "virtual" cells - truncated versions of the
base-table cells just without the values.
This patch contains the code to fill the virtual columns' data using the
regular columns from the base table.
This patch does not yet actually *add* any virtual columns to the schema,
so until that is done (in the next patch), this patch will not yet cause
any behavior change. This is important for bisectability.
Refs #3362.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
For issue #3362, we will need to add to a materialized view also unselected
base-table columns as "virtual columns". We need these columns to exist
to keep view rows alive, but we don't want the user to be able to see
them.
In this patch we prevent SELECTing the virtual columns of the view,
and also exclude the virtual columns from a "SELECT *" on a view.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the previous patch, we added a "view virtual" flag on columns. In this
patch we add persistance to this flag: I.e., writing it to the on-disk
schema table and reading it back on startup. But the implementation is
not as simple as adding a flag:
In the on-disk system tables, we have a "columns" table listing all the
columns in the database and their types. Cqlsh's "DESCRIBE MATERIALIZED
VIEW" works by reading this "columns" table, and listing all of the
requested view's columns. Therefore, we cannot add "virtual columns" -
which are columns not added by the user and not intended to be seen -
to this list.
We therefore need to create in this patch a separate list for virtual
columns, in a new table "view_virtual_columns". This table is essentially
identical to the existing "columns" table, just separate. We need to write
each column to the appropriate table (columns with the view_virtual flag to
"view_virtual_columns", columns without it to the old "columns"), read
from both on startup, and remember to delete columns from both when a table
is dropped.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In this patch we add a flag, "view virtual", that we can mark on on a
column defined in a schema. In following patches, we will add such virtual
columns to materialized views to allow view rows to remain alive despite
having no data (refs #3362).
After this patch, the "view virtual" flag exists in our in-memory
representation of the schema, but not persisted to disk - we will
fix this in the next patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Even before this patch, Scylla supported the "empty" type (a column with
no content) but only internally - i.e., in code but not in CQL syntax.
The "empty" type was used in dense tables without regular columns, and a
special optimization in db::cql_type_parser::parse() allowed this type
name to be parsed when reading the schema tables, without allowing the
"empty" type to be used by users in CQL statements.
However, parse() only supported "empty" itself, and more complex types
like list<empty> were not recognized by parse(). In the following patches,
we plan to add to virtual columns to materialized views, with types empty,
list<empty> or map<something, empty>. We need all these types to work, and
before this patch, they don't. But we want all of these types to only work
internally - when Scylla's code creates these hidden columns; we do not
want to add the "empty" type to CQL's syntax.
This is what we do in this patch: The CQL parser's comparator_type rule
now has a parameter, "internal", used to differenciate internal calls
via db::cql_type_parser::parse() from calls from CQL query parsing.
If a user tries something like:
CREATE TABLE e (pk empty PRIMARY KEY);
He will get the error:
Invalid (reserved) user type name empty
Note that here, as usual, unknown types are treated as "user types",
and "empty" is not allowed as a user type name - we "reserve" it in case
one day in the future we will want to allow users a direct syntax to
create empty columns. We already have, following Cassandra, a bunch of
other names reserved from being user type names, including "byte",
"complex", and others (see _reserved_type_names()), and using "empty"
as a type name will result in a similar error message.
Just like all other type names, the name "empty" is not a reserved
keyword in other senses: a user can create a table or a column with
the name "empty", just like he can create one with the name "int".
Refs #3362.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
_value_views is the authoritative data structure for the
client-specified values. Indeed, the ctor called
transport::request::read_options() leaves _values completely empty.
In query_options::prepare() we were, however, using _values to
associated values to the client-specified column names, and not
_value_views. Fix this by using _value_views instead.
As for the reasons we didn't see this bug earlier, I assume it's
because very few drivers set the 0x04 query options flag, which means
column names are omitted. This is the right thing to do since most
drivers have enough information to correctly position the values.
Fixes#3688
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814234605.14775-1-duarte@scylladb.com>
A raw value can be in one of three states: a valid value, an unset
value, a null value. When translating raw_values to their views, we
were treating both unset and null values are null raw_value_views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814231051.14385-1-duarte@scylladb.com>
We need to validate before calling query_options::prepare() whether
the set of prepared statement values sent in the query matches the
amount of names we need to bind, otherwise we risk an out-of-bounds
access if the client also specified names together with the values.
Refs #3688
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814225607.14215-1-duarte@scylladb.com>
When the list of values in the IN list of a single column contains
duplicates, multiple executors are activated since the assumption
is that each value in the IN list corresponds to a different partition.
this results in the same row appearing in the result number times
corresponding to the duplication of the partition value.
Added queries for the in restriction unitest and fixed with a bad result check.
Fixes#2837
Tests: Queries as in the usecase from the GitHub issue in both forms ,
prepared and plain (using python driver),Unitest.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <ad88b7218fa55466be7bc4303dc50326a3d59733.1534322238.git.eliransin@scylladb.com>
Change the test so that services are correctly teared down, by the
correct order (e.g., storage_service access the messaging_service when
stopping).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180814112111.8521-2-duarte@scylladb.com>
This makes the function useable in more contexts due to
flexibility (including in tests), since the state is not captured and
the characteristics of salt generation can be customized to the caller's
needs.
Instead of reducing the large value via `%`, which can produce
non-uniformly distributed values when the range is small, we specify the
range in the distribution, which is uniform by construction.
The `generate_salt` function invokes this function internally now.
This change means that `generate_salt` is now thread-safe and therefore
does not have to be invoked by a single thread only when starting the
`password_authenticator`.
This further means that `generate_salt` does not need to be part of the
public interface of the module, and can be moved to the implementation
file.
While the `password_authenticator` is a complex component with lots of
dependencies, password hashing and checking itself is a process with
limited logical state and dependencies, which makes it easy to isolate
and test.
`std::random_device` has a lot of implementation-specific behavior, and
as a result we cannot assume much about its performance characteristics.
We initialize thread-specific static instances of `std::random_device`
once so that we don't have the overhead of invoking the ctor during
every invocation of `gensalt`.
* seastar d40faff...8ad870f (9):
> reactor: switch indentation
> properly configure I/O Scheduler when --max-io-requests is passed
> IOTune: tell users that the evaluation will take a while
> exceptions: fix compilation with static libstdc++
> apps/iotune: print out which config file updated
> foreign_ptr: allow waiting for the destruction of the managed ptr
> Merge "Improve UX for backtraces read from stdin" from Botond
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Compressing debug section reduces build size by 30% with no
significant increase in build time.
Results on a 4-core system (ninja release, size in MB):
before:
18056 build
real 59m43.138s
user 229m3.180s
sys 6m49.460s
after:
12387 build
real 60m30.112s
user 232m8.962s
sys 6m49.364s
Presumably, the difference in debug mode is even greater.x
Message-Id: <20180811180444.30578-1-avi@scylladb.com>
On previous commit we moved debian/scylla-server.service to
debian/scylla-server.scylla-server.service to explicitly specify
subpackage name, but it doesn't work for dh_installinit without '--name'
option.
Result of that current scylla-server .deb package missing
scylla-server.service, so we need to rename the service to original
file name.
Fixes#3675
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180810221944.24837-1-syuu@scylladb.com>
"
This miniseries fixes ALLOW FILTERING support for prepared statements
by passing correct query options to the filter instead of empty ones.
"
* 'pass_query_options_to_restrictions_filter' of https://github.com/psarna/scylla:
tests: add testing prepared statements with ALLOW FILTERING
cql3: pass query options to restrictions filter
"
This series addresses SELECT/INSERT JSON support issues, namely
handling null values properly and parsing decimals from strings.
It also comes with updated cql tests.
Tests: unit (release)
"
* 'json_fixes_3' of https://github.com/psarna/scylla:
cql3: remove superfluous null conversions in to_json_string
tests: update JSON cql tests
cql3: enable parsing decimal JSON values from string
cql3: add missing return for dead cells
cql3: simplify parsing optional JSON values
cql3: add handling null value in to_json
cql3: provide to_json_string for optional bytes argument
Some types checked when passed bytes argument was empty, and if so,
returned "null" as a JSON string. Now, with to_json_string(bytes_opt)
it's not needed anymore. Also, some types returned "null" instead
of signaling a deserialization error.
Tests are updated to check for recently fixed issues, i.e.
* proper handling of null values
* parsing decimal values from string
Refs #3664
Refs #3666
Refs #3667
Query options may contain bound values needed for checking filtering
restrictions. Previously, empty query_options{} were used, which
caused prepared statements to fail.
Fixes#3677
When emplace_back() fails, value is already moved-from into a
temporary, which breaks monotonicity expected from
apply_monotonically(). As a result, writes to that cell will be lost.
The fix is to avoid the temporary by in-place construction of
cell_and_hash. To do that, appropriate cell_and_hash constructor was
added.
Found by mutation_test.cc::test_apply_monotonically_is_monotonic with
some modifications to the random mutation generator.
Introduced in 99a3e3a.
Fixes#3678.
Message-Id: <1533816965-27328-1-git-send-email-tgrabiec@scylladb.com>
Caused assert failure when collection cells were so large as to
require fragmentation. Currently collection cells are not fragmented,
and deserialization asserts that.
Message-Id: <1533817077-27583-1-git-send-email-tgrabiec@scylladb.com>
When we use str.format() to pass variables on the message it will always
causes Exception like "KeyError: 'red'", since the message contains color
variables but it's not passed to str.format().
To avoid the error we need to pass all format variables to colorprint()
and run str.format() inside the function.
Fixes#3649
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180803015216.14328-1-syuu@scylladb.com>
When murmur3_ignore_msb_bits was introduced in 1.7, we set its default zero
(to avoid resharding on upgrade) and set it to 12 in the scylla.yaml template
(to make sure we get the right value for new clusters).
Now, however, things have changed:
- clusters installed before 1.7 are a small minority
- they should have resharded long ago
- resharding is much better these days
- we have more migrations from Cassandra compared to old clusters
To allow clusters that migrated using their cassandra.yaml, and to clean up
the default scylla.yaml, make the default 12.
Users upgrading from pre-1.7 clusters will need to update their scylla.yaml,
or to reshard (which is a good idea anyway).
Fixes#3670.
Message-Id: <20180808063003.26046-1-avi@scylladb.com>
Currently scylla_ec2_check exits silently when EC2 instance is optimized
for Scylla, it's not clear a result of the check, need to output
message.
Note that this change effects AMI login prompt too.
Fixes#3655
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180808024256.9601-1-syuu@scylladb.com>
These tests check the correctness of resulting compacted SSTables based
on the files produced by compacting input files with Cassandra.
Note that output files are not identical to those generated by Cassandra
because Scylla compaction does not yet optimise delta-encoded values
using serialization header.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <3fa05ce72352292d1026ce80ac87552889d10d96.1533667535.git.vladimir@scylladb.com>
Since our scripts were converted to Python, we can no longer
source them from a shell. Execute them directly instead. Also,
we now need to import configuration variables ourselves, since
scylla_prepare, being an independent process, won't do it for
us.
Fixes#3647
Message-Id: <20180802153017.11112-1-avi@scylladb.com>
"
Store sizes of the request and the response for each traces query.
In the example below I traced the cassandra-stress write workload with a default schema using the probabilistic tracing.
Here is an entry created for one of queries:
cassandra@cqlsh> SELECT parameters FROM system_traces.sessions where session_id=30c3a8ea-96bb-11e8-8a97-000000000000;
parameters
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
{'consistency_level': 'LOCAL_ONE', 'page_size': '5000', 'param[0]': 'f749eb03d6a995d8b3496075da8f20aa9228c5db12401e8a37000fa5baa13531...', 'param[1]': '845809b53a9aff7eef8f85308eaef79e03c696653ca23957f1ed5d539dc00463...', 'param[2]': 'd303585def93a5d40e41ceb12880ad3ede3d9f6308a1b1c5e42e911a191f1de1...', 'param[3]': 'be77c7da059d4b52687cd9b3eaa7d04cdfe7b5e38e84a8eea318299a01c7845f...', 'param[4]': '32faaaea1b3d73d9d628a4945b69a8531740348d49ee30c03f697dd2d63e8dee...', 'param[5]': '50503850374d34323330', 'query': 'UPDATE "standard1" SET "C0" = ?,"C1" = ?,"C2" = ?,"C3" = ?,"C4" = ? WHERE KEY=?', 'serial_consistency_level': 'SERIAL'}
(1 rows)
cassandra@cqlsh> SELECT request_size,response_size FROM system_traces.sessions where session_id=30c3a8ea-96bb-11e8-8a97-000000000000;
request_size | response_size
--------------+---------------
239 | 4
(1 rows)
Now let's try to read the same keyspace1.standard1 entry (based on the "key" value in "param[5]") from cqlsh and trace it using TRACING ON.
cassandra@cqlsh> TRACING ON
Now Tracing is enabled
cassandra@cqlsh> SELECT * from keyspace1.standard1 where key=0x50503850374d34323330;
key | C0 | C1 | C2 | C3 |
C4
------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+-
-----------------------------------------------------------------------
0x50503850374d34323330 | 0xf749eb03d6a995d8b3496075da8f20aa9228c5db12401e8a37000fa5baa135315430 | 0x845809b53a9aff7eef8f85308eaef79e03c696653ca23957f1ed5d539dc00463e10e | 0xd303585def93a5d40e41ceb12880ad3ede3d9f6308a1b1c5e42e911a191f1de12924 | 0xbe77c7da059d4b52687cd9b3eaa7d04cdfe7b5e38e84a8eea318299a01c7845fb8a2 |
0x32faaaea1b3d73d9d628a4945b69a8531740348d49ee30c03f697dd2d63e8dee5dde
(1 rows)
Tracing session: 639ca0a0-96bb-11e8-8a97-000000000000
activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+---------------+----------------
Execute CQL3 query | 2018-08-02 21:20:20.906000 | 192.168.1.138 | 0
Parsing a statement [shard 0] | 2018-08-02 21:20:20.906358 | 192.168.1.138 | --
Processing a statement [shard 0] | 2018-08-02 21:20:20.906405 | 192.168.1.138 | 47
Creating read executor for token -5698461774438220979 with all: {192.168.1.138} targets: {192.168.1.138} repair decision: NONE [shard 0] | 2018-08-02 21:20:20.906445 | 192.168.1.138 | 87
read_data: querying locally [shard 0] | 2018-08-02 21:20:20.906448 | 192.168.1.138 | 90
Start querying the token range that starts with -5698461774438220979 [shard 0] | 2018-08-02 21:20:20.906452 | 192.168.1.138 | 94
Querying is done [shard 0] | 2018-08-02 21:20:20.906509 | 192.168.1.138 | 151
Done processing - preparing a result [shard 0] | 2018-08-02 21:20:20.906533 | 192.168.1.138 | 175
Request complete | 2018-08-02 21:20:20.906186 | 192.168.1.138 | 186
cassandra@cqlsh> TRACING OFF
Disabled Tracing.
cassandra@cqlsh> SELECT request_size,response_size FROM system_traces.sessions where session_id=639ca0a0-96bb-11e8-8a97-000000000000;
request_size | response_size
--------------+---------------
82 | 369
(1 rows)
"
* 'tracing_request_response_size-v2' of https://github.com/vladzcloudius/scylla:
tracing: move all tracing related API functions to a cold path
tracing: store a query response size
tracing: store request size
In previous versions of Fedora, the `crypt_r` function returned
`nullptr` when a requested hashing algorithm was not supported.
This is consistent with the documentation of the function in its man
page.
As of Fedora 28, the function's behavior changes so that the encrypted
text is not `nullptr` on error, but instead the string "*0".
The info pages for `crypt_r` clarify somewhat (and contradict the man
pages):
Some implementations return `NULL` on failure, and others return an
_invalid_ hashed passphrase, which will begin with a `*` and will
not be the same as SALT.
Because of this change of behavior, users running Scylla on a Fedora 28
machine which was upgraded from a previous release would not be able to
authenticate: an unsupported hashing algorithm would be selected,
producing encrypted text that did not match the entry in the table.
With this change, unsupported algorithms are correctly detected and
users should be able to continue to authenticate themselves.
Fixes#3637.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <bcd708f3ec195870fa2b0d147c8910fb63db7e0e.1533322594.git.jhaberku@scylladb.com>
This patch completes what was started in a4282c2c6e
Make trace_state_ptr to be a wrapper class around lw_shared_ptr<trace_state> that
hints that bool(trace_state_ptr) is likely to return FALSE.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Add a new "response_size" column to system_traces.sessions and store a size of an uncompressed response
for a traced query.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Add a new column "request_size" to system_traces.sessions and store
the uncompressed request frame data size.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"
There is an exception safety problem in imr::utils::object. If multiple
memory allocations are needed and one of them fails the main object is
going to be freed (as expected). However, at this stage it is not
constructed yet, so when LSA asks its migrator for the size it may get
a meaningless value. The solution is to remember the size until object
is fully created and use sized deallocation in case of failures.
Fixes#3618.
Tests: unit(release, debug/imr_test)
"
Each IMR type needs its own LSA migrator. It is possible that user will
provide a migrator for a different type than the one which instance is
being created. This patch adds compile-time detection of that bug.
Each member of a structure may require different deserialisation
context. They are provided by context_for<Tag>() method of the context
used to deserialise the structure itself.
imr::utils::object needs to add backpointer to the structure it manages
so that it can be used in the LSA memory. This is done by creating a
structure that has two members: the backpointer and the actual structure
that imr::utils::object is to manage. imr::utils::object_context creates
approperiate deserialisation context for it.
context_for() is called for each member of a structure. object_context
implementation of context_for() always created a deserialisation context
for the underlying structure regardless which member that was, so it was
done also for backpointer. This is wrong since the context may read the
object on its creation.
The fix is to use no_context_t for the backpointer.
imr::utils::object::make() handles creation of IMR objects. They are
created in three phases:
1. The size of the object and all additional needed memory allocations
is determined
2. All needed buffers are allocated
3. Data is written to the allocated space
When IMR objects are deallocated LSA asks their migrator for the size.
Migrator may read some parts of the object to figure out its size. This
is a problem if there is allocation failure in make() at point 2.
If one of required allocations fails, the buffers that were already
acquired need to be freed. However, since the object hasn't been fully
created yet migrator won't return a valid value.
The solution for this is to remember object size until all allocations
are completed. This way the LSA won't need to ask migrators for it in
case of failure. imr::alloc::object_allocator already does that but
imr::utils::object doesn't. This patch fixes that.
For some reason the doc entry for large_partitions was outdated.
It contained incorrect ORDERING information and wrong usage example,
since large_partitions' schema changed multiple times during
the reviewing process.
Message-Id: <1910f270419536ebccffde163ec1bfc67d273306.1533128957.git.sarna@scylladb.
com>
We need the mapping between dht::token_range to
std::vector<inet_address> and inet_address to dht::token_range_vector in
various places. Currently, we use std::unordered_multimap and convert to
std::unordered_map. It is better to use std::unordered_map in the first
place. The changes like below:
- Change from
std::unordered_multimap<dht::token_range, inet_address>
to
std::unordered_map<dht::token_range, std::vector<inet_address>>
- Change from
std::unordered_multimap<inet_address, dht::token_range>
to
std::unordered_map<inet_address, dht::token_range_vector>
Message-Id: <b8ecc41775e46ec064db3ee07510c404583390aa.1533106019.git.asias@scylladb.com>
The calculation consists of several parts with preemption point between
them, so a table can be added while calculation is ongoing. Do not
assume that table exists in intermediate data structure.
Fixes#3636
Message-Id: <20180801093147.GD23569@scylladb.com>
"
This series replaces infinite time-outs in internal distributed
(non-local) CQL queries with finite ones.
The implementation of tracing, which also performs internal queries,
already has finite time-outs, so it is unchanged.
Fixes#3603.
"
* 'jhk/finite_time_outs/v2' of https://github.com/hakuch/scylla:
Use finite time-outs for internal auth. queries
Use finite query time-outs for `system_distributed`
The moving operation changes a node's token to a new token. It is
supported only when a node has one token. The legacy moving operation is
useful in the early days before the vnode is introduced where a node has
only one token. I don't think it is useful anymore.
In the future, we might support adjusting the number of vnodes to reblance
the token range each node owns.
Removing it simplifies the cluster operation logic and code.
Fixes#3475
Message-Id: <144d3bea4140eda550770b866ec30e961933401d.1533111227.git.asias@scylladb.com>
"
When we are out of memtable space (real of virtual), lsa will defer running
our mutation application and run it later when memory is in fact available.
However, it will run it in the main group, giving the write more shares than it
would otherwise get.
This patchset fixes the problem by running those deferred mutation applications
in the correct scheduling group.
Fixes#3638
"
* tag '3638/v2' of https://github.com/avikivity/scylla:
database: tag dirty memory managers with scheduling groups
logalloc: run releaser() in user-provided scheduling group
dirty memory managers run code on behalf of their callers
in a background fiber, so provide that background fiber with
the scheduling group appropriate to their caller.
- system: main (we want to let system writes through quickly)
- dirty: statement (normal user writes)
- streaming: streaming (streaming writes)
* seastar 6b97e00...d40faff (10):
> tutorial: update build as needed for newer pandoc
> core: fix __libc_free return type signature
> future-utils: when_all: avoid calling member function on an uninitialized data member
> future-util: reduce continuations in when_all (variadic version)
> future-utils: remove allocation in when_all() if all futures are available
> future: reduce allocations in when_all()
> future: fill missing futurize::from_tuple() functions
> future: expose more types in continuation_base
> log: predict logger::is_enabled() as false
> README: add Resources section with infomation about the mailing list etc.
Let the user specify which scheduling group should run the
releaser, since it is running functions on the user's behalf.
Perhaps a cleaner interface is to require the user to call
a long-running function for the releaser, and so we'd just
inherit its scheduling group, but that's a much bigger change.
Both the Prometheus and the API servers are used for maintenance
operations, similarly to streaming. Run them under the streaming
scheduling group to prevent them from impacting normal operations,
and rename the streaming scheduling group to reflect the more
generic role.
This helps to prevent spikes from Prometheus or API requests from
interfering with the normal workload. Using an existing group is
preferable to creating a new group because in the worst case, all
the non-main-workload groups compete with the main workload.
Consolidating them allows us to give them significant shares in
total without increasing competition in the worst case.
The group's label is unchanged to preserve compatibility with
dashboards.
A nice side effect is that repair, which is initiated by API calls,
gets placed into the maintenance group naturally. Compaction tasks
which are run by compaction manager are not changed.
Message-Id: <20180714160723.23655-1-avi@scylladb.com>
Most queries run without tracing (and those that run with tracing
are not sensitive to a few cycles), so mark the tracing paths as
cold.
Message-Id: <20180723133000.30482-1-avi@scylladb.com>
This will allow continuous integration to use the optimal number
of compiler jobs, without having to resort to complex calculations
from its scripting environment.
Message-Id: <20180722172050.13148-1-avi@scylladb.com>
"
This series adds some optimisations to the paging logic, that attempt to
close the performance gap between paged and not paged queries. The
former are more complex so always are going to be slower, but the
performance loss was unacceptably large.
Fixes#3619.
Performance with paging:
./perf_paging_before ./perf_paging_after diff
read 271246.13 312815.49 15.3%
Without paging:
./perf_nopaging_before ./perf_nopaging_after diff
read 343732.17 342575.77 -0.3%
Tests: unit(release), dtests(paging_test.py, paging_additional_test.py)
"
* tag 'optimise-paging/v1' of https://github.com/pdziepak/scylla:
cql3: select statement: don't copy metadata if not needed
cql3: query_options: make simple getter inlineable
cql3: metadata: avoid copying column information
query_pager: avoid visiting result_view if not needed
query::result_view: add get_last_partition_and_clustering_key()
query::result_reader: fix const correctness
tests/uuid: add more tests including make_randm_uuid()
utils: uuid: don't use std::random_device()
std::random_device() uses the relatively slow /dev/urandom, and we rarely if
ever intend to use it directly - we normally want to use it to seed a faster
random_engine (a pseudo-random number generator).
In many places in the code, we first created a random_device variable, and then
using it created a random_engine variable. However, this practice created the
risk of a programmer accidentally using the random_device object, instead of the
random_engine object, because both have the same API; This hurts performance.
This risk materialized in just two places in the code, utils/uuid.cc and
gms/gossiper.cc. A patch for to uuid.cc was sent previously by Pawel and is
not included in this patch, and the fix for gossiper.{cc,hh} is included here.
To avoid risking the same mistake in the future, this patch switches across the
code to an idiom where the random_device object is not *named*, so cannot be
accidentally used. We use the following idiom:
std::default_random_engine _engine{std::random_device{}()};
Here std::random_device{}() creates the random device (/dev/urandom) and pulls
a random integer from it. It then uses this seed to create the random_engine
(the pseudo-random number generator). The std::random_device{} object is
temporary and unnamed, and cannot be unintentionally used directly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180726154958.4405-1-nyh@scylladb.com>
The column-related metadata is shared by all requests done with the same
perpared query. However, metadata class contains also some additional
flags and paging state which may differ. This patch allows sharing
column information among multiple instances of the metadata class.
query::result_visitor provides get_last_partition_and_clustering_key()
which allows getting those without iterating through the whole result.
Moreover, row count may be precomputed in the result, if it isn't there
is query::result_view::count_partitions_and_rows() for getting it.
Paging needs to get last partition and clustering key (if the latter
exists). Previously, this was done by result_view visitor but that is
suboptimal. Let's add a direct getter for those.
mock outputs files owned by root. This causes attempts
by scripts that want to junk the working directory (typically
continuous integration) to fail on permission errors.
Fixup those permissions after the fact.
Message-Id: <20180719163553.5186-1-avi@scylladb.com>
Count operations which were started on one shard and
were performed on another, due to non-shard-aware driver
and/or RPC.
Message-Id: <20180723155118.8545-1-avi@scylladb.com>
"
This mini-series covers a regression caused by newest versions
of jsoncpp library, which changed the way of quoting UTF-8 strings.
Tests: unit (release)
"
* 'add_json_quoting_3' of https://github.com/psarna/scylla:
tests: add JSON unit test
types: use value_to_quoted_string in JSON quoting
json: add value_to_quoted_string helper function
Ref #3622.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
After open-source-parsers/jsoncpp@42a161f commit jsoncpp's version
of valueToQuotedString no longer fits our needs, because too many
UTF-8 characters are unnecessarily escaped. To remedy that,
this commit provides our own string quoting implementation.
Reported-by: Nadav Har'El <nyh@scylladb.com>
Refs #3622
Previously CQL grammar wrongfully required INSERT JSON queries
to provide a list of columns, even though they are already
present in JSON itself.
Unfortunately, tests were written with this false assumption as well,
so they're are updated.
Message-Id: <33b496cba523f0f27b6cbf5539a90b6feb20269e.1532514111.git.sarna@scylladb.com>
Compactions start and end all the time, especially with many shards,
and don't contribute much to understanding what is going on these
days. Compaction throughput is available through the metrics and
other information is available via the compaction history table.
Demote compaction start and end messages to DEBUG level to keep
the log clean. Cleaning and resharding compactions are kept as
INFO, at least for now, since they are manual operations and
therefore rarer.
Message-Id: <20180724132859.14109-1-avi@scylladb.com>
"
This series follows up ALLOW FILTERING support series and depends on
this one: https://groups.google.com/d/msg/scylladb-dev/Qxt3_MP03jI/5ZhRTJ3gBwAJ
The following optimizations regarding clustering key prefix and filtering are
applied:
* if clustering key restrictions require filtering, but they still
contain any part of the prefix, this prefix can be used to narrow
down the query by using it in computing clustering bounds
* if an indexed query has partition key restrictions and any clustering
key restrictions that form a prefix, then from now on this prefix
will be used to narrow down the index query
"
Ref #3611.
* 'use_prefix_with_filtering_and_si_4' of https://github.com/psarna/scylla:
tests: add prefix cases to indexed filtered queries tests
cql3: use ck prefix in filtered queries
cql3: use clustering key prefix in index queries
cql3: add conversion to ck longest prefix restrictions
cql3: add prefix_size method to ck restrictions
If an indexed query has partition+clustering key restrictions as well
and at least some of these restrictions create a prefix, this prefix
is used in the index query to narrow down the number of rows read.
Refs #3611
sstable close is an asychronous operation launched in the background,
so we can't wait for it. If the test ends before all operations are
complete, the background operations are detected as leaks.
We need either a proper close(), or maybe a sstables::quiesce() that
waits until there are no sstables alive on the shard, but until then,
a hack.
"
This patchset authored by Piotr fixes ck filtering and fast forwarding in SSTables 3.x.
For now only clustering rows are supported and range tombstones will come next.
Test: unit {release}
"
* 'projects/sstables-30/filtering/v5' of https://github.com/argenet/scylla:
sstables: Minor clean-up and renaming to clustering_ranges_walker.
sstables: Add test for filtering and forwarding
sstables: Fix schema for static row tests
sstables: Fix ck filtering and fast forwarding
sstables: Introduce mutation_fragment_filter
"
This series changes the native CQL3 protocl layer so that it works with
fragmented buffers instead of a single temporary_buffer per request.
The main part is fragmented_temporary_buffer which represents a
fragmented buffer consisting of multiple temporary_buffers. It provides
helpers for reading fragmented buffer from an input_stream, interpreting
the data in the fragmented buffer as well as view that satisfy
FragmentRange concept.
There are still situations where a fragmented buffer is linearised. That
includes decompressing client requests (this uses reusable buffers in a
similar way to the code that sends compressed responses), CQL statement
restrictions and values that are hard-coded in prepared statements
(hopefully, the values in those cases will be small), value validation
in some cases (blobs are not validated, irrelevant for many fixed-size
small types, but may be a problem for large text cells) as well as
operations on collections.
Tests: unit(release), dtests(cql_prepared_test.py, cql_tests.py, cql_additional_tests.py)
"
* tag 'fragmented-cql3-receive/v1' of https://github.com/pdziepak/scylla: (23 commits)
types: bytes_view: override fragmented validate()
cql3: value_view: switch to fragmented_temporary_buffer::view
types: add validate that accepts fragmented_temporary_buffer::view
cql3 query_options: add linearize()
cql3: query_options: use bytes_ostream for temporaries
cql3: operation: make make_cell accept fragmented_temporary_buffer::view
atomic_cell: accept fragmented_temporary_buffer::view values
cql3: avoid ambiguity in a call to update_parameters::make_cell()
transport: switch to fragmented_temporary_buffer
transport: extract compression buffers from response class
tests/reusable_buffer: test fragmented_temporary_buffer support
utils: reusable_buffer: support fragmented_temporary_buffer
tests: add test for fragmented_temporary_buffer
util fragment_range: add general linearisation functions
utils: add fragmented_temporary_buffer
tests: add basic test for transport requests and responses
tests/random-utils: print seed
tests/random-utils: generate sstrings
cql3: add value_view printer and equality comparison
transport: move response outside of cql_server class
...
"
This patchset adds support for reading Index.db files written in
SSTables 3.x ('mc') format.
Note that the offsets map introduced in SSTables 3.x is neither used nor
read yet. It is located last in promoted index and so current parsers
just ignore it for the time being.
Later it should be used to perform binary search of a desired promoted
index block in large partition, thus reducing the complexity from linear
to logarithmic.
Tests: unit {release}
"
* 'projects/sstables-30/index_reader/v5' of https://github.com/argenet/scylla:
sstables: Add getter for end_open_marker to index_reader.
tests: Add test reading index for a partition comprised of RT markers of boundary types.
tests: Add test for reading index of a partition comprised of only range tombstones.
tests: Use std::adjacent_find in index_reader_assertions::has_monotonic_positions()
tests: Read rows only index
sstables: Do not seek through the promoted index for static row positions.
sstables: Read promoted index stored in SSTables 3.x ('mc') format.
sstables: Make promoted_index_block support clustering keys for both ka/la and mc formats.
utils: Add overloaded_functor helper.
position_in_partition: Add a constructor from range_tag_t{}, bound_kind and clustering_key_prefix.
sstables: Support reading signed vints in continuous_data_consumer.
sstables: Factor out the code building a vector of fixed clustering values lengths.
sstables: Remove unused includes from index_entry.hh
tests: Add test for reading SSTables 3.x index file with empty promoted index.
tests: Rename sstable_assertions.hh -> tests/index_reader_assertions.hh
sstables: Support parsing index entries from SSTables 3.x format.
sstables: move bound_kind_m to header
- Renamed _current to _current_range to better reflect its nature as
there are other similarly named members (_current_start and
_current_end).
- Don't use a temporary variable for incrementing the change counter.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This class encapsulates the logic related to
clustering key filtering and fast forwarding.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Not only this is easier to read and understand, but it also doesn't
force the promoted_index_block class to support copying which is
heavyweight and otherwise not needed.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
The overloaded_functor class template can be used to encompass multiple
lambdas accepting different types into a single callable object that can
be used with any of those types.
One application is visitors for std::variant where different handling is
required for different types.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This facilitates position_in_partition creation when parsing range tombstones bounds from SSTables files.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This code will be re-used in promoted_index_blocks_parser to parse
clustering key prefixes from SSTables 3.x format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
The previous name of the file is moreover confusing as we have several
sstable_assertions classes throughout tests but this header only
contains a class for index reader assertions.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
With this patch, index_reader is capable of reading index_entries from
both 'ka'/'la' and 'mc' formats.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"
Queries that use secondary index and have a full partition key restriction
or full primary key restriction should not require filtering - it's
sufficient to add these restrictions to the index query.
This also adds secondary index tests to cover this case.
Tests: unit (release)
"
* 'si_and_pk_restrictions_2' of https://github.com/psarna/scylla:
tests: add index + partition key test
cql3: make index+primary key restrictions filtering-independent
cql3: use primary key restrictions in filtering index queries
cql3: add is_all_eq to primary key restrictions
cql3: add explicit conversion between key restrictions
cql3: add apply_to() method to single column restriction
cql3: make primary key restrictions' values unambiguous
Now that verb categorizations also affect scheduling, getting them
correct is more important. The first three patches in this series
improve the infrastructure a little, and the forth fixes some
categorization errors wrt. repair/streaming verbs.
* https://github.com/avikivity/scylla msg-idx-sanity/v1:
messaging: choose connection index via a look-up table
messaging: convert do_get_rpc_client_idx into a switch
messaging: remove default when computing rpc client index
messaging: categorize more streaming/repair verbs as streaming
For example, to bootstrap a 50th node in a cluster
[shard 0] range_streamer - Bootstrap with
[127.0.0.8, 127.0.0.2, 127.0.0.24, 127.0.0.21, 127.0.0.49, 127.0.0.44,
127.0.0.9, 127.0.0.7, 127.0.0.47, 127.0.0.15, 127.0.0.5, 127.0.0.30,
127.0.0.14, 127.0.0.12, 127.0.0.36, 127.0.0.11, 127.0.0.48, 127.0.0.28,
127.0.0.33, 127.0.0.10, 127.0.0.41, 127.0.0.4, 127.0.0.40, 127.0.0.3,
127.0.0.6, 127.0.0.43, 127.0.0.22, 127.0.0.26, 127.0.0.42, 127.0.0.25,
127.0.0.17, 127.0.0.37, 127.0.0.23, 127.0.0.13, 127.0.0.38, 127.0.0.1,
127.0.0.18, 127.0.0.20, 127.0.0.39, 127.0.0.27, 127.0.0.34, 127.0.0.32,
127.0.0.19, 127.0.0.16, 127.0.0.31, 127.0.0.45, 127.0.0.29, 127.0.0.35,
127.0.0.46]
for keyspace=keyspace1 started, nodes_to_stream=49, nodes_in_parallel=49
the new node will get data from 49 existing nodes.
Currently, it will stream from all the 49 existing nodes at the same
time. It is not a good idea to stream from all the nodes in parallel
which can overwhelm the bootstrap node, i.e., 49 nodes sending, 1 node
receiving.
To fix this, limit the nr of nodes to stream in parallel. We should have
a better control over the memory usage and parallelism. But for now,
limit the nr of nodes to a maximum of 16 as a starter. With this limit,
each shard can work with as many as 16 remote nodes in parallel, I think
this has enough parallelism for streaming in terms of performance.
This change have effect on the bootstrap/decommission/removenode node
operations, and do not have effect on repair.
Refs #2782
Message-Id: <980610dc97490d4f16281a0c3203b9bee73e04e4.1531989557.git.asias@scylladb.com>
"
Use chunked_vector instead of vector. It won't have compatibility issues
because chunked_vector and vector have the same on wire format.
Refs #278
"
* 'asias/gossip_memory_v2' of github.com:scylladb/seastar-dev:
gossip: Reduce continuous memory usage
to_string: Add std::list and utils::chunked_vector support
serializer: Add chunked_vector support
If full partition key (or full primary key) is used in an indexed
query, it should not require filtering, because queries like that
can be efficiently narrowed down with stricter index restrictions.
If both index and partition key is used in a query, it should not
require filtering, because indexed query can be narrowed down
with partition key information. This commit appends partition key
restrictions to index query.
The default implementation linearises the buffer and calls
validate(bytes_view). This is bad and not needed for bytes_type which
doesn't do any validation anyway.
Some code in the CQL3 layer requires bytes_view and it is fairly
reasonable to assume that it won't deal with large buffers (e.g.
statement restrictions). query_options already has make_temporary()
which takes ownership of a cql3::raw_value so that the rest of the code
can use cql3::raw_value_view. This patch adds similar linearize()
function which, if necessary, linearises a cql3::raw_value_view and
returns a bytes_view with lifetime tied to the life or query_options.
bytes_ostream is going to be more efficient than
std::vector<std::vector<char>> since it can put multiple small values in
a single buffer thus reducing the number of memory allocations.
Using initializer lists in calls like foo({}) is ambiguous if foo() has
multiple overloads with more than one accepting a type that is
default-constructible. update_parameters::make_cell() is about to get an
overload that accepts fragmented_temporary_buffer::view as a value, so
let's make sure its call site won't be ambiguous.
The logic responsible for reading requests was operating on
temporary_buffer<char> and bytes_view. This required all request
messages to be linearised to a contiguous buffer, possibly causing large
allocations. Changing to fragmented_temporary_buffer mostly alleviates this
problem unless the reader code explicitly asks for a contiguous bytes_view.
reusable_buffer already supports bytes_ostream which is often used for
handling data sent from Scylla. This patch adds support for
fragmented_temporary_buffer which is going to be mainly used for data
received by Scylla.
Seastar output_streams produce temporary_buffer<char>s.
fragmented_temporary_buffer represents a single fragmented buffer that
consists of, possibly multiple, temporary_buffer<char>s.
"
The problem happens under the following circumstances:
- we have a partially populated partition in cache, with a gap in the middle
- a read with no clustering restrictions trying to populate that gap
- eviction of the entry for the lower bound of the gap concurrent with population
The population may incorrectly mark the range before the gap as continuous.
This may result in temporary loss of writes in that clustering range. The
problem heals by clearing cache.
Caught by row_cache_test::test_concurrent_reads_and_eviction, which has been
failing sporadically.
The problem is in ensure_population_lower_bound(), which returns true if
current clustering range covers all rows, which means that the populator has a
right to set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts since before all
clustering rows. Otherwise, we're populating since _last_row and should
consult it.
Fixes#3608.
"
* 'tgrabiec/fix-violation-of-continuity-on-concurrent-read-and-eviction' of github.com:tgrabiec/scylla:
row_cache: Fix violation of continuity on concurrent eviction and population
position_in_partition: Introduce is_before_all_clustered_rows()
This series contains a couple of fixes to the bookkeeping of the view
build process, which could cause data to be left behind in the system
tables.
* git@github.com:duarten/scylla.git materialized-views/view-build-fixes/v1:
Duarte Nunes (3):
db/system_keyspace: Add function to remove view build status of a
shard
db/view: Don't have shard 0 clear other shard's status on drop
db/view: Restrict writes to the distributed system keyspace to shard 0
This series contains a couple of fixes to the adjusting of clustering
keys in the build_progress_virtual_reader, some of which could
potentially cause heap overflows when querying the legacy system table.
* git@github.com:duarten/scylla.git materialized-views/build-progress-virtual-reader-fixes/v1:
Duarte Nunes (3):
db/view/build_progress_virtual_reader: Use correct schema to adjust ck
db/view/build_progress_virtual_reader: Fix full ck detection
db/view/build_progress_virtual_reader: Also adjust end RT bound
"
This series fixes two issues related to bad_allocs and keys which require
linearization (larger than 12.8 KiB). With such keys, comparators may throw if
memory allocation fails. This may cause lookups in partition and rows trees to
fail with bad_alloc.
The first issue (#3583) was that partition version merging
(mutation_partition::apply_monotonically()) was not taking into account that
lookups may fail. If we fail, the partition which is being applied may be
incorrectly left with the clustering range since the begging of the range up
to the current row marked as continuous, if the current row has the continuity
flag set, because we've moved all of the preceding rows into the target, and
the correct lower bound row is no longer there in the source. This may mark
some discontinuous ranges as continuous. Merging is retried by
allocating_section, and there will be no problem if it eventually succeeds,
original continuity will be reflected in the sum. The problem will persist if
it doesn't eventually succeed, when we're really out of memory.
The user-perceivable effect of this would be temporary loss of writes in the
clustering range which was marked as continuous but shouldn't. Introduced in
2.2-rc1.
The second issue (#3585) is that the code which inserts partitions in memtable
and cache will leak the entry if boost::intrusive_set::insert() throws. This
will also cause SIGSEGV when cache tries to evict from such a leaked entry.
"
* tag 'tgrabiec/fix-bad-continuity-on-oom-in-apply-v2' of github.com:tgrabiec/scylla:
managed_bytes: Mark read_linearize() as an allocation point
tests: Relax expectation about continuity after failed merging
tests: mutation_partition: Verify continuity is consistent on bad_alloc on merging
tests: Switch to seastar's allocation failure injector
mutation_partition: Introduce set_continuity()
clustering_interval_set: Introduce contained_in()
clustering_interval_set: Introduce add() overload accepting another interval set
mutation_partition: Fix merging to not leave the source with broader continuity on bad_alloc
mutation_partition: Preserve continuity in case row merging with no tracker throws
memtable, cache: Fix exception safety of partition entry insertions
ensure_population_lower_bound() returned true if current clustering
range covers all rows, which means that the populator has a right to
set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts since before all
clustering rows. Otherwise we're populating since _last_row, and
should consult it.
The fix introduces a new flag, set when starting to populte, which
indicates if we're populating from the beginning of the range or
not. We cannot simply check if _last_row is set in
ensure_population_lower_bound() because _last_row can be set and then
become empty again.
Fixes#3608
Currently we check that the sum of continuities is exactly the same as
expected on failure. Relax this to require that continuity is not
broader, since in some bad_alloc scenarios, or preemption, we will
have to mark some ranges as discontinuous.
When clustering keys are larger than 12.8 KiB they may get fragmented
and key comparator will need to linearize them on comparison. This may
cause lookups in the rows tree to fail with bad_alloc. Partition
version merging (mutation_partition::apply_monotonically()) was not
taking this into account. If we fail on lookup, the partition which is
being applied may be incorrectly left with the clustering range since
the begging up to the current row marked as continuous, if the current
row has the continuity flag set, because we've moved all of the
preceding rows into the target, and the correct lower bound row is no
longer there in the source. This may mark some discontinuous ranges as
continuous.
Merging is retried by allocating_section, and there will be no problem
if it eventually suceeds, original continity will be reflected in the
sum. The problem will persist if it doesn't eventually succeed, when
we're really out of memory.
To protect against this, we could reset the continuity flag of the
current row in the source when exiting on exception.
Fixes#3583
Example:
p: row{key=A, cont=0} row{key=C, cont=1}
this: row{key=C, cont=0}
When we get to processing key=C, key=A was already moved to this, so p
has stale continuity on key=C, which marks (-inf,C) as continuous,
whereas it should mark only (A, C). That's not a problem if merging
succeeds, but if exception happens at this point, we will violate the
invariant which says that the sum of p and this should yield the same
logical partition. It wouldn't because continuity of the sum is
calculated as a set union, and (-inf, A) would be incorrectly turned
into a continuous range.
This is not a problem currently because continuity is always full when
there is no tracker (memtables), so won't change anyway, and when
there is a tracker (cache) we never merge but overwrite instead, so
there is no memory allocation and thus no possibility for failure. But
better be safe.
boost::intrusive::set::insert() may throw if keys require
linearization and that fails, in which case we will leak the entry.
When this happens in cache, we will also violate the invariant for
entry eviction, which assumes all tracked entries are linked, and
cause a SEGFAULT.
Use the non-throwing and faster insert_before() instead. Where we
can't use insert_before(), use alloc_strategy_unique_ptr<> to ensure
that entry is deallocated on insert failure.
Fixes#3585.
"
Partition keys are currently stored in serialized form in the
system.large_partitions table. This is an obstacle to operators
who usually can't deserialize partition keys in their heads.
Improve the situation by deserializing the partition key for them.
"
* tag 'pkey-print/v1' of https://github.com/avikivity/scylla:
large_partition_handler: output friendly partition key
keys: schema-aware printing of a partition_key
* seastar aac6cf1...6b97e00 (5):
> Merge "changes to fix travis CI builds" from Kefu
> tls.cc: Make "close" timeout delay exception proof
> core/sharded: mark foreign_ptr::get_owner_shard() const
> core/memory: Expose counter of large allocations
> tests: add test for multi-fragmented net::packet
Fixes#3461.
Ref scylladb/seastar#474.
In case population of the vector throws, the vector object would not
be destroyed. It's a managed object, so in addition to causing a leak,
it would corrupt memory if later moved by the LSA, because it would
try to fixup forward references to itself.
Caused sporadic failures and crashes of row_cache_test, especially
with allocation failure injector enabled.
Introduced in 27014a23d7.
Message-Id: <1531757764-7638-1-git-send-email-tgrabiec@scylladb.com>
Gossip SYN and ACK uses std::vector to store a list of gossip_digest,
the larger the cluster, the more continuous memory is needed. To reduce
the memory pressure which might cause std::bad_alloc, switch the std::vector
to chunked_vector.
In addition, change add_local_application_state to use std::list instead
of std::vector.
Refs #2782
Use abstract_type::to_string() to prettify partition key components.
Manually tested by setting --compaction-large-partition-warning-threshold-mb
to zero and inspecting the output for compound and non-compound partition
keys.
Add a with_schema() helper to decorate a partition key with its
schema for pretty-printing purposes, and matching operator<<.
This is useful to print partition keys where the operator, who
may not be familiar with the encoding, may see them.
`query_partition_key_range()` does the final result merging and trimming
(if necessary) to make sure we don't send more rows to the client than
requested. This merging and trimming is done by a continuation attached
to the `query_partition_key_range_concurrent()` which does the actual
querying. The continuations captures via value the `row_limit` and
`partition_limit` fields of the `query::read_command` object of the
query. This has an unexpected consequence. The lambda object is
constructed after the call to `query_partition_key_range_concurrent()`
returns. If this call doesn't defer, any modifications done to the read
command object done by `query_partition_key_range_concurrent()` will be
visible to the lambda. This is undesirable because
`query_partition_key_range_concurrent()` updates the read command object
directly as the vnodes are traversed which in turn will result in the
lambda doing the final trimming according to a decremented `row_limits`,
which will cause the paging logic to declare the query as exhausted
prematurely because the page will not be full.
To avoid all this make a copy of the relevant limit fields before
`query_partition_key_range_concurrent()` is called and pass these copies
to the continuation, thus ensuring that the final trimming will be done
according to the original page limits.
Spotted while investigating a dtest failure on my 1865/range-scans/v2
branch. On that branch the way range scans are executed on replicas is
completely refactored. These changes appearantly reduce the number of
continuations in the read path to the point where an entire page can be
filled without deferring and thus causing the problem to surface.
Fixes#3605.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f11e80a6bf8089d49ba3c112b25a69edf1a92231.1531743940.git.bdenes@scylladb.com>
Since the messaging service will assign a scheduling group based
on the client index, it's more important now to get the verbs categorized
correctly.
Re-categorize REPLICATION_FINISHED, REPAIR_CHECKSUM_RANGE, and most
importantly STREAM_MUTATION_FRAGMENTS to the repair/streaming oriented
connections so we get the correct scheduling.
A default means that when adding new verbs, we may forget to
categorize a verb correctly. Without the default, the compiler
will complain due to -Wswitch.
"
While Cassandra supports multiple data directories, we have been
historically supporting just one. The one-directory model suits us
better because of the I/O Scheduler and so far we have seen very few
requests -- if any, to support this.
Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.
For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.
In this design:
- We scan all data directories for existing data.
- resharding only happens within a particular data directory.
- snapshot details are accumulated with data for all directories that
host snapshots for the tables we are examining
- snapshots are created with files in its own directories, but the
manifest file goes to the main directory. For this one, note that in
Cassandra the same thing happens, except that there is no "main"
directory. Still the manifest file is still just in one of them.
- SSTables are flushed into the main directory.
- Compactions write data into the main directory
Despite the restrictions, one example of usage of this is recovery. If
we have network attached devices for instance, we can quickly attach a
network device to an existing node and make the data immediately
available as it is compacted back to main storage.
Tests: unit (release)
"
* 'multi-data-file-v2' of github.com:glommer/scylla:
database: change ident
database: support multiple data directories
database: allow resharing to specify a directory
database: support multiple directories in get_snapshot_details
database: move get_snapshot_info into a seastar::thread
snapshots: always create the snapshot directory
sstables: pass sstable dir with entry descriptor
database: make nodetool listsnapshots print correct information
sstables: correctly create descriptors for snapshots
"
This work is on top of Gleb's rpc streaming which is merged recently.
What this series does is to replace scylla streaming service's data plane to
use the new rpc streaming instead of the old rpc verb to send the mutations for
scylla streaming. Other parts of scylla streaming, the control plane, are not
changed.
In my test, to bootstrap a new node to the existing one node cluster, smp 2,
scylla stores data on ramdisk to minimize disk io impact.
I saw x2 improvment in streaming bandwidth.
Before:
[shard 0] stream_session - [Stream #2ae92320-5fc8-11e8-911a-000000000000]
Streaming plan for Bootstrap-ks3-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1570312 KiB, 109521.02 KiB/s
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 14.338 seconds
After:
[shard 0] stream_session - [Stream #e5589ac0-5fc7-11e8-b463-000000000000]
Streaming plan for Bootstrap-ks3-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1546875 KiB, 220415.36 KiB/s
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 7.018 seconds
Tests: dtest update_cluster_layout_tests.py
Fixes: #3591
"
* tag 'asias/scylla_streaming_with_rpc_streaming_v8' of github.com:scylladb/seastar-dev:
streaming: Add rpc streaming support
storage_service: Introduce STREAM_WITH_RPC_STREAM feature
streaming: Add estimate_partitions to send_info
messaging_service: Add streaming with rpc streaming support
messaging_service: Add streaming_domain
database: Add add_sstable_and_update_cache
database: Add make_streaming_sstable_for_write
This allows to remove the requirement to hold the key value inside the
_load callback if its value is needed in the asynchronous continuation
inside the callback in the context of a reload.
This also resolves the use-after-free issue when a _load() callback removes
the item for a given key.
See a9b72db34d.1528794135.git.bdenes%40scylladb.com
for a discussion about this.
In addition this patch makes the loading_cache more robust for any existing
and potential situations when cached entries are being removed from inside the
callback. This is achieved by extending the idea implemented by Duarte in the
"utils/loading_cache: Avoid using invalidated iterators" by capturing timestamped_val_ptr
(which is essentially a lw_shared_ptr to an intrusive set entry which holds both the key
and the cached value) instead of a naked pointer.
Tests {debug, release}:
- Unit tests:
- loading_cache_test
- view_build_test
- auth_test
- auth_resource_test
- dtest:
- auth_test.py
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The list of elements that needs to be reloaded may be rather large.
Use chunked_vector in order to make the allocator's life easier.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Assign a scheduling_group for each RPC service. Assignement is
done by connection (get_rpc_client_idx()) - all verbs on the
same connection are assigned the same group. While this may seem
arbitrary, it avoids priority inversion; if two verbs on the same
connection have different scheduling groups, the verb with the low
shares may cause a backlog and stall the connection, including
following requests from verbs that ought to have higher shares.
The scheduling_group parameters are encapsulated in different
classes as they are passed around to avoid adding dependencies.
Message-Id: <20180708140433.6426-1-avi@scylladb.com>
This patch addresses several issues.
1. The class no longer uses placement-new trick for move-assignment.
It was incorrect to use because the class contains const refererences
and re-initializing the same region of memory would result in undefined
behaviour on accessing these members.
2. Use boost::iterator_range for tracking the current range of
cr_ranges. It is easier to deal with and avoids possible bugs like
assigning only one of two iterators
Message-Id: <4096182c4ee2fb1157e135c487c41012b266ba69.1531440684.git.vladimir@scylladb.com>
This patch changes scylla streaming to use the recently added rpc
streaming feature provided by seastar to send mutation fragments for
scylla streaming instead of the rpc verbs.
It also changes the receiver to write to the sstable file directly,
skipping writing to memtable.
Preparation for adding rpc streaming in scylla streaming.
- register_stream_mutation_fragments is used to register the rpc
streaming verb
- make_sink_and_source_for_stream_mutation_fragments is used to get the
sink and source object for the sender
- make_sink_for_stream_mutation_fragments is used to get a sink object
for the receiver
Since we can write mutations to sstable directly in streaming, we need
to add those sstables to the system so it can be seen by the query.
Also we need to update the cache so the query refects the latest data.
This will be used to create sstable for streaming receiver to write the
mutations received from network to sstable file instead of writing to
memtable.
Since some AMIs using consistent network device naming, primary NIC
ifname is not 'eth0'.
But we hardcoded NIC name as 'eth0' on scylla_ec2_check, we need to add
--nic option to specify custom NIC ifname.
Fixes#3584
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180712142446.15909-1-syuu@scylladb.com>
Reduces size of index_entry from 384 bytes to 64 bytes by using
indirection for the optional promoted index instead of embedding it.
Improves query time from 9ms to 4ms in a micro benchmark with a very
large index page.
Message-Id: <1531406354-10089-1-git-send-email-tgrabiec@scylladb.com>
When periodically reloading the values in the loading_cache, we would
iterate over the list of entries and call the load() function for
those which need to be reloaded.
For some concrete caches, load() can remove the entry from the LRU set,
and can be executed inline from the parallel_for_each(). This means we
could potentially keep iterating using an invalidated iterator.
Fix this by using a temporary container to hold those entries to be
reloaded.
Spotted when reading the code.
Also use if constexpr and fix the comment in the function containing
the changes.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180712124143.13638-1-duarte@scylladb.com>
The continuation attached to _load() needs the key of the loaded entry
to check whether it was disposed during the load. However if _load()
invalidates the entry the continuation's capture line will access
invalid memory while trying to obtain the key.
To avoid this save a copy of the key before calling _load() and pass it
to both _load() and the continuation.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b571b73076ca863690f907fbd3fb4ff54e597b28.1531393608.git.bdenes@scylladb.com>
"
If there is a lot of partitions in the index page, index_list may grow large
and require large contiguous blocks of memory, because it's based on
std::vector. That puts pressure on the memory allocator, and if memory is
fragmented, may not be possible to satisfy without a lot of eviction. Switch
to chunked_vector to avoid this.
Refs #3597
"
* 'tgrabiec/avoid-large-alloc-in-index-reader' of github.com:tgrabiec/scylla:
sstables: Switch index_list to chunked_vector to avoid large allocations
utils: chunked_vector: Do not require T to be default-constructible for clear()
utils: chunked_vector: Implement front()
Referring to a function parameter via "global" no longer works on
python 3. We should be using "nonlocal", which is absent on python 2
though. To make the script work on both, inline next().
Message-Id: <1531317984-29224-1-git-send-email-tgrabiec@scylladb.com>
"
Previous series on ALLOW FILTERING introduced it for regular queries,
but it's also possible to have an indexed query which requires
filtering. This series contains minor fixes that allow treating
indexed+filtered queries properly. The most important part is having
more selective approach of extracting values from restrictions
in read_posting_list() helper function. Before ALLOW FILTERING,
restrictions contained only a single entry that matched the indexed
column, but it's not the case with filtering (and it won't be the case
with multiple indexing support).
This series also comes with test cases for indexed+filtered queries.
Tests: unit (release)
"
* 'allow_filtering_and_si_3' of https://github.com/psarna/scylla:
tests: add filtering indexed queries tests
cql3: use single restriction value in index creation
cql3: add secondary index condition to need_filtering
cql3: add value_for method
cql3: add missing inline declarations to restrictions
cql3: make index detection more specific
index: add target_column getter to index
As an optimization, the virtual reader doesn't change the underlying
key if it is not full, and hence doesn't include the extra clustering
key. However, this detection is broken because it checked for 3
clustering columns, instead of 2.
This patch fixes that by obtaining the clustering key size from the
underlying schema instead of hardcoding the size.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The virtual reader adjusts clustering keys obtained from the
underlying, scylla-specific schema, and potentially sheds the extra
clustering key that's absent from the Cassandra-compatible schema.
This patches ensures we use the correct schema to iterator over the
key.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Writing to the distributed system keyspace should be confined to a
single shard of each host, namely shard 0. We were violating this
constraint by having all shards set the host status to "started". This
could be problematic when the build finishes quickly or there's a
concurrent view drop, such that a write done by shard 0 can have a
smaller timestamp than one done by some other shard.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Shard 0 can clear the in-progress build status of all shards when a
view finishes building, because we are ensured all writes to the
system table have completed with earlier timestamps.
This is not the case when dropping a view. A drop can happen
concurrently with the build, in which case shard 0 may process the
notification before another shard receives it, and before that shard
writes to the system table.
Fix this by ensuring each shard clears its own status on drop.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a function that clears the view build in-progress
status for the current shard, similar to the existing one that clears
it across all shards.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Tests covering ALLOW FILTERING usage while using secondary indexes
as well are added to cql_query_test.
Tests are based on Cassandra's test suite for filtering secondary
indexes + some more simple cases.
ALLOW FILTERING support caused index-related restrictions to possibly
have more values. In order to remain correct, only those restrictions
which match the indexed columns should be used.
In order to extract value from a restriction for just one column,
value_for(column_name, options) method is implemented.
It's needed because once ALLOW FILTERING support was introduced,
index-related restrictions may contain more than 1 value.
In order to prevent future compilation errors, externally defined
class methods from single column primary key restrictions are explicitly
marked inline.
Conditions that detect if restrictions need an indexed query weren't
specific enough to work properly with mixed index-filtering queries,
because they would overly eager assume that partition/clustering key
restrictions have a backing index.
If there is a lot of partitions in the index page, index_list may grow
large and require large contiguous blocks of memory. That puts
pressure on the memory allocator, and if memory is fragmented, may not
be possible to satisfy without a lot of eviction.
resize(), used by clear(), requires T to be default-constructible in
case the vector is expanded. It's not actually needed for clearing,
and there will be users which use clear() with
non-default-constructible T, so implement clear() without using
resize().
Drop scylla_lib.sh since all bash scripts depends on the library is
already converted to python3, and all scylla_lib.sh features are
implemented on scylla_util.py.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180711114756.21823-1-syuu@scylladb.com>
"Converted more scripts to python3."
* 'script_python_conversion2_v2' of https://github.com/syuu1228/scylla:
dist/common/scripts/scylla_util.py: make run()/out() functions shorter
dist/ami: install python34 to run scylla_install_ami
dist/common/scripts/scylla_ec2_check: move ec2 related code to class aws_instance
dist/common/scripts: drop class concolor, use colorprint()
dist/ami/files/.bash_profile: convert almost all lines to python3
dist/common/scripts: convert node_exporter_install to python3
dist/common/scripts: convert scylla_stop to python3
dist/common/scripts: convert scylla_prepare to python3
Python 3 doesn't have 'long' anymore, so commands using it fail with
newer GDB. long on python2 is the same as int on python3, both are
arbitrary-precision. On python2 int is fixed-precision, but seems to
be still enough (64 bit), so use that instead.
Message-Id: <1531215600-31899-1-git-send-email-tgrabiec@scylladb.com>
As noticed by Tomasz Grabiec, we test a future's available() after
having already waited for it with when_all(), which is pointless.
The code after the wrong if() exchanges the contents of a token-range
between this node and several other live neighbors; We can't do this
exchange if either this node is broken or there is no other live neighbor.
So this is what we needed to test. so !available() should have been failed().
Also the test for live_neighbors_checksum.empty() added in commit 7c873f0d1f
is unnecessary - we build live_neighbors and live_neighbors_checksum
together, so if one of them is empty, so is the other.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180710114940.26027-1-nyh@scylladb.com>
Original series that introduced filtering logged a warning
when collection restrictions appeared. Instead, an exception
should be thrown until collection restrictions are supported
for ALLOW FILTERING clauses.
Message-Id: <ddaf342d4d6766fadb756f66e5afa0b99ce054f8.1531220558.git.sarna@scylladb.com>
std::function's move constructor is not noexcept, so observer's move
constructor and assignment operator also cannot be. Switch to Seastar's
noncopyable_function which provides better guarantees.
Tests: observer_tests (release)
Message-Id: <20180710073628.30702-1-avi@scylladb.com>
"
Implement and test support for reading deleted cells in SSTables 3.
"
* 'haaawk/sstables3/read-deleted-cells-v2' of ssh://github.com/scylladb/seastar-dev:
sstables: Test reading deleted cells from SST3
sstables: Support deleted cells in reading SST3
test_uncompressed_compound_ck_read: fix comment
utils: add observer/observable templates
An observable is used to decouple an information producer from a consumer
(in the same way as a callback), while allowing multiple consumers (called
observers) to coexist and to manage their lifetime separately.
Two classes are introduced:
observable: a producer class; when an observable is invoked all observers
receive the information
observer: a consumer class; receives information from a observable
Modelled after boost::signals2, with the following changes
- all signals return void; information is passed from the producer to
the consumer but not back
- thread-unsafe
- modern C++ without preprocessor hacks
- connection lifetime is always managed rather than leaked by default
- renamed to avoid the funky "slot" name
Message-Id: <20180709172726.5079-1-avi@scylladb.com>
Incorrect column_kind was passed, which may cause wrong type to be
used for comparison if schema contains static columns. Affects only
tests.
Spotted during code review.
Message-Id: <1531144991-2658-1-git-send-email-tgrabiec@scylladb.com>
We were considering the token ranges in the size_estimates system
table to be inclusive, which is incorrect and incompatible with
Cassandra.
While we ignore the inclusiveness of the partition_range bounds when
selecting sstables, we do take it into account in
estimated_keys_for_range(). We would thus select the correct sstables,
but could over-estimate the range size nonetheless.
Tests: virtual_reader_test(release)
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180709115919.5106-1-duarte@scylladb.com>
Currently rpc::closed_error is not counted towards replica failure
during read and thus read operation waits for timeout even if one
of the nodes dies. Fix this by counting rpc::closed_error towards
failed attempts.
Fixes#3590.
Message-Id: <20180708123522.GC28899@scylladb.com>
It's not very helpful in normal operation, and generates much noise,
especially when there are many tables.
Message-Id: <20180708070051.8508-1-avi@scylladb.com>
Require a timeout parameter for storage_proxy::mutate_begin() and
all its callers (all the way to thrift and cql modification_statement
and batch_statement).
This should fix spurious debug-mode test failures, where overcommit
and general debug slowness result in the default timeouts being
exceeded. Since the tests use infinite timeouts, they should not
time out any more.
Tests: unit (release), with an extra patch that aborts
when a non-infinite timeout is detected.
Message-Id: <20180707204424.17116-1-avi@scylladb.com>
Current sysconfig_parser.get() returns parameter including double quote,
it will cause problem by append text using sysconfig_parser.set().
Fixes#3587
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180706172219.16859-1-syuu@scylladb.com>
"
This series adds the last missing part of the HH feature list (as in the design doc) - rebalancing;
and finally removes the "experimental" tag from the HH.
"
* 'hinted_handoff_rebalance-v3' of https://github.com/vladzcloudius/scylla:
main: remove the "experimental" tag from the hinted handoff feature
db::hints::manager: implement rebalance() method
There is duplicated code on both scylla_ec2_check and class aws_instance
on scylla_util.py, so drop these code from scylla_ec2_check and use
class aws_instance.
In bash local variable declaration is a separate operation with its own exit status
(always 0) therefore constructs like
local var=`cmd`
will always result in the 0 exit status ($? value) regardless of the actual
result of "cmd" invocation.
To overcome this we should split the declaration and the assignment to be like this:
local var
var=`cmd`
Fixes#3508
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1529702903-24909-3-git-send-email-vladz@scylladb.com>
* seastar d7f35d7...216d499 (10):
> temporary_buffer: Add clone method()
> temporary_buffer: Make move-assignment operator noexcept.
> deleter: Make move-assignment operator noexcept.
> reactor: don't become inefficient when max_task_backlog is exceeded
> reactor: switch cumulative time metrics resolution from nanoseconds to milliseconds
> preempt: annotate for branch prediction
> tests: silence "-Werror=sign-compare" warnings
> Merge "Support one I/O Scheduler per device" from Glauber
> rpc: make rpc server scheduling aware
> Add SEASTAR_USER_CFLAGS and SEASTAR_ENABLE_WERROR
Rebalance hints segments that need to be sent among all present shards.
Ensure that after rebalancing the difference between the number of segments
of any two shards is not greater than 1.
Try to minimize the amount of "file rename" operations in order to achieve the needed result.
Note: "Resharding" is a particular case of rebalancing.
Tests: dtest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
In original series cell iterator for regular cells
was erroneously taken by copy instead of by reference,
which will result in iterating over the first value indefinitely.
Also, the same iterator was not updated for collections,
which is fixed too.
Message-Id: <83297adf8121de4fd37257c87f250d61ea9ec80b.1530892191.git.sarna@scylladb.com>
"
This series addresses issue #3575 by adding 3 ALLOW FILTERING related
metrics to help profile queries:
* number of read request that required filtering
* total number of rows read that required filtering
* number of rows read that required filtering and matched
Tests: unit (release)
"
* 'allow_filtering_metrics_4' of https://github.com/psarna/scylla:
cql3: publish ALLOW FILTERING metrics
cql3: add updating ALLOW FILTERING metrics
cql3: define ALLOW FILTERING metrics
The following metrics are defined for ALLOW FILTERING:
* number of read request that required filtering
* total number of rows read that required filtering
* number of rows read that required filtering and matched
While Cassandra supports multiple data directories, we have been
historically supporting just one. The one-directory model suits us
better because of the I/O Scheduler and so far we have seen very few
requests -- if any, to support this.
Still, the infrastructure needed to support multiple directories can be
beneficial so I am trying to bring this in.
For simplicity, we will treat the first directory in the list as the
main directory. By being able to still associate one singular directory
with a table, most of the code doesn't have to change and we don't have
to worry about how to distribute data between the directories.
In this design:
- We scan all data directories for existing data.
- resharding only happens within a particular data directory.
- snapshot details are accumulated with data for all directories that
host snapshots for the tables we are examining
- snapshots are created with files in its own directories, but the
manifest file goes to the main directory. For this one, note that in
Cassandra the same thing happens, except that there is no "main"
directory. Still the manifest file is still just in one of them.
- SSTables are flushed into the main directory.
- Compactions write data into the main directory
Signed-off-by: Glauber Costa <glauber@scylladb.com>
resharding assumes that all SSTables will be in cf->dir(), but in
reality we will soon have tables in other places. We can specify a
directory in get_all_shared_sstables and specify that directory from the
resharding process.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
I am about to add another level of identation and this code already
shifts right too much. It is not performance critical, so let's use a
thread for that. seastar::threads did not exist when this was first
written.
Also remove one unused continuation from inside the inner scan,
simplifying its code.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We currently don't always create the snapshot directory as an
optimization. We have a test at sync time handling this use case.
This works well when all SSTables are created in the same directory, but
if we have more than one data directory than it may not work if we don't
have SSTables in all data directories.
We can fix it by unconditionally creating the directory.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We have been assuming that all SSTables for a table will be in the same
directory, and we pass the directory name to make_descriptor only
because that's the way in ka and la to find out the keyspace and table
names.
However, SSTables for a given column family could be spread into
multiple directories. So let's pass it down with the descriptor so we
can load from the right place.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
nodetool listsnapshots is currently printing zero sizes for all snapshots
The reason for that is that we are moving the snapshot directory name in
the capture list, which can be evaluated by the compiler to happen
before we use it as the function parameter.
Fixes#3572
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Our regular expression for parsing SSTable files tests for the directory
for the la file format, since that file format does not include the
ks/cf pair in the file name itself.
However, the regular expression does not cover the case in which the
SSTable files are coming from snapshots. This patch extends the regex so
they are also covered.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
This is a series of patches which make it possible for a human to examine
contents of cache or memtables from GDB.
"
* 'tgrabiec/gdb-cache-printers' of github.com:tgrabiec/scylla:
gdb: Add pretty printer for managed_vector
gdb: Add pretty printer for rows
gdb: Add mutation_partition pretty printer
gdb: Add pretty printer for partition_entry
gdb: Add pretty printer for managed_bytes
gdb: Add iteration wrapper for intrusive_set_external_comparator
gdb: Add iteration wrapper for boost intrusive set
A report that C-Ares returned some errors tells the user nothing.
Improve the error message by including the name of the configuration
variable and its value.
Message-Id: <20180705084959.10872-1-avi@scylladb.com>
"
The main idea of this series is to provide a filtering_visitor
as a specialised result_set_builder::visitor implementation
that keeps restriction info and applies it on query results.
Also, since allow_filtering checking is not correct now (e.g. #2025)
on select_statement level, this series tries to fix any issues
related to it.
Still in TODO:
* handling CONTAINS relation in single column restriction filtering
* handling multi-column restrictions - especially EQ, which can be
split into multiple single-column restrictions
* more tests - it's never enough; especially esoteric cases
like filtering queries which also use secondary indexes,
paging tests, etc.
Tests: unit (release)
"
* 'allow_filtering_6' of https://github.com/psarna/scylla:
tests: add allow_filtering tests to cql_query_test
cql3: enable ALLOW FILTERING
service: add filtering_pager
cql3: optimize filtering partition keys and static rows
cql3: add filtering visitor
cql3: move result_set_builder functions to header
cql3: amend need_filtering()
cql3: add single column primary key restrictions getters
cql3: expose single column primary key restrictions
cql3: add needs_filtering to primary key restrictions
cql3: add simpler single_column_restriction::is_satisfied_by
Enables 'ALLOW FILTERING' queries by transfering control
to result_set_builder::filtering_visitor.
Both regular and primary key columns are allowed,
but some things are left unimplemented:
- multi-column restrictions
- CONTAINS queries
Fixes#2025
If any restriction on partition key or static row part fails,
it will be so for every row that belongs to a partition.
Hence, full check of the rest of the rows is skipped.
In order to filter results of an 'ALLOW FILTERING' query,
a visitor that can take optional filter for result_builder
is provided. It defaults to nop_filter, which accepts
all rows.
Previous implementation of need_filtering() was too eager to assume
that index query should be used, whereas sometimes a query should
just be filtered.
"
This series fixes the "LCS data-loss bug" where full scans (and
everything that uses them) would miss some small percentage (> 0.001%)
of the keys. This could easily lead to permanent data-loss as compaction
and decomission both use full scans.
aeffbb673 worked around this bug by disabling the incremental reader
selectors (the class identified as the source of the bug) altogether.
This series fixes the underlying issue and reverts aeffbb673.
The root cause of the bug is that the `incremental_reader_selector` uses
the current read position to poll for new readers using
`sstable_set::incremental_selector::select()`. This means that when the
currently open sstables contain no partitions that would intersect with
some of the yet unselected sstables, those sstables would be ignored.
Solve the problem by not calling `select()` with the current read
position and always pass the `next_position` returned in the previous
call. This means that the traversal of the sstable-set happens at a pace
defined by the sstable-set itself and this guarantees that no sstable
will be jumped over. When asked for new readers the
`incremental_reader_selector` will now iteratively call `select()` using
the `next_position` from the previous `select()` call until it either
receives some new, yet unselected sstables, or `next_position` surpasses
the read position (in which case `select()` will be tried again later).
The `sstable_set::incremental_selector` was not suitable in its present
state to support calling `select()` with the `next_position` from a
previous call as in some cases it could not make progress due to
inclusiveness related ambiguities. So in preparation to the above fix
`sstable_set` was updated to work in terms of ring-position instead of
tokens. Ring-position can express positions in a much more fine-grained
way then token, including positions after/before tokens and keys. This
allows for a clear expression of `next_position` such that calling
`select()` with it guarantees forward progress in the token-space.
Tests: unit(release, debug)
Refs: #3513
"
* 'leveled-missing-keys/v4' of https://github.com/denesb/scylla:
tests/mutation_reader_test: combined_mutation_reader_test: use SEASTAR_THREAD_TEST_CASE
tests/mutation_reader_test: refactor combined_mutation_reader_test
tests/mutation_reader_test: fix reader_selector related tests
Revert "database: stop using incremental selectors"
incremental_reader_selector: don't jump over sstables
mutation_reader: reader_selector: use ring_position instead of token
sstables_set::incremental_selector: use ring_position instead of token
compatible_ring_position: refactor to compatible_ring_position_view
dht::ring_position_view: use token_bound from ring_position
i_partitioner: add free function ring-position tri comparator
mutation_reader_merger::maybe_add_readers(): remove early return
mutation_reader_merger: get rid of _key
We break build_ami.sh since we dropped Ubuntu support, scylla_current_repo
command does not finishes because of less argument ('--target' with no
distribution name, since $TARGET is always blank now).
It need to hardcoded as centos.
Fixes#3577
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180705035251.29160-1-syuu@scylladb.com>
When test.py is run with --jenkins flag Boost UTF is asked to generate
an XML file with the test results. This automatically disables the
human-readable output printed to stdout. There is no real reason to do
so and it is actually less confusing when the Boost UTF messages are in
the test output together with Scylla logger messages.
Message-Id: <20180704172913.23462-1-pdziepak@scylladb.com>
Function to reevaluate postponed compaction was called too early for strategies that
don't allow parallel compaction (only leveled strategy (LCS) at this moment).
Such strategies must first have the ongoing compaction deregistered before reevaluating
the postponed ones. Manager uses task list of ongoing compaction to decides if there's
ongoing compaction for a given column family. So compaction could stop making progress
at all *if and only if* we stop flushing new data.
So it could happen that a column family would be left with lots of pending compaction,
leading the user to think all compacting is done, but after reboot, there will be
lots of compaction activity.
We'll both improve method to detect parallel compaction here and also add a call to
reevaluate postponed compaction after compaction is done.
Fixes#3534.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180702185327.26615-1-raphaelsc@scylladb.com>
Make combined_mutation_reader_test more interesting:
* Set the levels on the sstables
* Arrange the sstables so that they test for the "jump over sstables"
bug.
* Arrange the sstables so that they test for the "gap between sstables".
While at it also make the code more compact.
Passing the current read position to the
`incremental_selector::select()` can lead to "jumping" through sstables.
This can happen when the currently open sstables have no partition that
intersects with a yet unselected sstable that has an intersecting range
nevertheless, in other words there is a gap in the selected sstables
that this unselected one completely fits into. In this case the
unselected sstable will be completely omitted from the read.
The solution is to not to avoid calling `select()` with a position that
is larger than the `next_position` returned from the previous `select()`
call. Instead, call `select()` repeatedly with the `next_position` from
the previous call, until either at least one new sstable is selected or
the current read position is surpassed. This guarantess that no sstables
will be jumped over. In other words, advance the incremental selector in
a pace defined by itself thus guaranteeing that no sstable will be
jumped over.
sstable_set::incremental selector was migrated to ring position, follow
suit and migrate the reader_selector to use ring_position as well. Above
correctness this also improves efficiency in case of dense tables,
avoiding prematurely selecting sstables that share the token but start
at different keys, altough one could argue that this is a niche case.
Currently `sstable_set::incremental_selector` works in terms of tokens.
Sstables can be selected with tokens and internally the token-space is
partitioned (in `partitioned_sstable_set`, used for LCS) with tokens as
well. This is problematic for severeal reasons.
The sub-range sstables cover from the token-space is defined in terms of
decorated keys. It is even possible that multiple sstables cover
multiple non-overlapping sub-ranges of a single token. The current
system is unable to model this and will at best result in selecting
unnecessary sstables.
The usage of token for providing the next position where the
intersecting sstables change [1] causes further problems. Attempting to
walk over the token-space by repeatedly calling `select()` with the
`next_position` returned from the previous call will quite possibly lead
to an infinite loop as a token cannot express inclusiveness/exclusiveness
and thus the incremental selector will not be able to make progress when
the upper and lower bounds of two neighbouring intervals share the same
token with different inclusiveness e.g. [t1, t2](t2, t3].
To solve these problems update incremental_selector to work in terms of
ring position. This makes it possible to partition the token-space
amoing sstables at decorated key granularity. It also makes it possible
for select() to return a next_position that is guaranteed to make
progress.
partitioned_sstable_set now builds the internal interval map using the
decorated key of the sstables, not just the tokens.
incremental_selector::select() now uses `dht::ring_position_view` as
both the selector and the next_position. ring_position_view can express
positions between keys so it can also include information about
inclusiveness/exclusiveness of the next interval guaranteeing forward
progress.
[1] `sstable_set::incremental_selector::selection::next_position`
"
In the same way that drivers can route requests to a coordinator that
is also a replica of the data used by the request, we can allow
drivers to route requests directly to the shard. This patchset
adds and documents a way for drivers to know which shard a connection
is connected to, and how to perform this routing.
"
* tag 'shard-info-alt/v1' of https://github.com/avikivity/scylla:
doc: documented protocol extension for exposing sharding
transport: expose more information about sharding via the OPTIONS/SUPPORTED messages
dht: add i_partitioner::sharding_ignore_msb()
Use query::is_single_partition() to check whether the queried ranges are
singular or not. The current method of using
`dht::partition_range::is_singular()` is incorrect, as it is possible to
build a singular range that doesn't represent a single partition.
`query::is_single_partition()` correctly checks for this so use it
instead.
Found during code-review.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f671f107e8069910a2f84b14c8d22638333d571c.1530675889.git.bdenes@scylladb.com>
Use is_debian()/is_ubuntu() to detect target distribution, also install
pystache by path since package name is different between Fedora and
CentOS.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180703193224.4773-1-syuu@scylladb.com>
Primary key restrictions sometimes require filtering. These functions
return true if ALLOW FILTERING needs to be enabled in order to satisfy
these restrictions.
Currently restriction::is_satisfied_by() accepts only keys and rows
as arguments. In this commit, a version that only takes bytes of data
is provided.
This simpler version applies to single_column_restriction only,
because it compares raw bytes underneath anyway. For other restriction
types, simplified is_satisfied_by is not defined.
compatible_ring_position's sole purpose is to allow creating
boost::icl::interval_map with dht::ring_position as the key and list of
sstables as the value. This function is served equally well if
compatible_ring_position wraps a `dht::ring_position_view` instead of a
`dht::ring_position` with the added benefit of not having to copy the
possibly heavy `dht::decorated_key` around. It also makes it possible
to do lookups with `dht::ring_position_view` which is much more
versatile and allows avoiding copies just to make lookups.
The only downside is that `dht::ring_position_view` requires the
lifetime of the "viewed" object to be taken care of. This is not a
concern however, as so long as an interval is present in the map the
represented sstable is guaranteed to be alive to, as the interval map
participates in the ownership of the stored sstables.
Rename compatible_ring_position to compatible_ring_position_view to
reflect the changes.
While at it upgrade the std::experimental::optional to std::optional.
Currently dht::ring_position_view's dht::token constructor takes the
token bound in the form of a raw `uint8_t`. This allows for passing a
weight of "0" which is illegal as single token does not represent a
single ring position but an interval as arbitrary number of keys can
have the same token. dht::ring_position uses an enum in its dht::token
constructor. Import that same enum into the dht::ring_position_view
scope and take a `token_bound` instead of `uint8_t`.
This is especially important as in later patches the internal weight of
the ring_position_view will be exposed and illegal values can cause all
sorts of problems.
Gentoo Linux was not supported by the node_health_check script
which resulted in the following error message displayed:
"This s a Non-Supported OS, Please Review the Support Matrix"
This patch adds support for Gentoo Linux while adding a TODO note
to add support for authenticated clusters which the script does
not support yet.
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20180703124458.3788-1-ultrabug@gentoo.org>
After the transition to the new in-memory representation in
aab6b0ee27 'Merge "Introduce new in-memory
representation for cells" from Paweł'
atomic_cell_or_collection::external_memory_usage() stopped accounting
for the externally stored data. Since, it wasn't covered by the unit
tests the bug remained unnotices until now.
This series fixes the memory usage calculation and adds proper unit
tests.
* https://github.com/pdziepak/scylla.git fix-external-memory-usage/v1:
tests/mutation: properly mark atomic_cells that are collection members
imr::utils::object: expose size overhead
data::cell: expose size overhead of external chunks
atomic_cell: add external chunks and overheads to
external_memory_usage()
tests/mutation: test external_memory_usage()
A bug was found recently (#3564) in the paging logic, where the code
assumed the queried ranges list is non-empty. This assumption is
incorrect as there can be valid (if rare) queries that can result in the
ranges list to be empty. Add a unit test that executes such a query with
paging enabled to detect any future bugs related to assumptions about
the ranges list being non-empty.
Refs: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f5ba308c4014c24bb392060a7e72e7521ff021fa.1530618836.git.bdenes@scylladb.com>
do_fetch_page() checks in the beginning whether there is a saved query
state already, meaning this is not the first page. If there is not it
checks whether the query is for a singulular partitions or a range scan
to decide whether to enable the stateful queries or not. This check
assumed that there is at least one range in _ranges which will not hold
under some circumstances. Add a check for _ranges being empty.
Fixes: #3564
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cbe64473f8013967a93ef7b2104c7ca0507afac9.1530610709.git.bdenes@scylladb.com>
Debug mode is so slow that generating 1000 mutations is too much for it.
High memory use can also confuse the santitizers that track each allocation.
Reduce mutation count from 1000 to 10 in debug mode.
"
Added NIC / Disk existance check, --force-raid mode on
scylla_raid_setup.
"
* 'scylla_setup_fix4' of https://github.com/syuu1228/scylla:
dist/common/scripts/scylla_raid_setup: verify specified disks are unused
dist/common/scripts/scylla_raid_setup: add --force-raid to construct raid even only one disk is specified
dist/common/scripts/scylla_setup: don't accept disk path if it's not block device
dist/common/scripts/scylla_raid_setup: verify specified disk paths are block device
dist/common/scripts/scylla_sysconfig_setup: verify NIC existance
Currently only scylla_setup interactive mode verifies selected disks are
unused, on non-interactive mode we get mdadm/mkfs.xfs program error and
python backtrace when disks are busy.
So we should verify disks are unused also on scylla_raid_setup, print
out simpler error message.
scylla_install_pkg is initially written for one-liner-installer, but now
it only used for creating AMI, and it just few lines of code, so it should be
merge into scylla_install_ami script.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180612150106.26573-2-syuu@scylladb.com>
`_key` is only used in a single place and this does not warrant storing
it in a member. Also get rid of current_position() which was used to
query `_key`.
"
I found problems on previously submmited patchset 'scylla_setup fixes'
and 'more fixes for scylla_setup', so fixed them and merged into one
patchset.
Also added few more patches.
"
* 'scylla_setup_fix3' of https://github.com/syuu1228/scylla:
dist/common/scripts/scylla_setup: allow input multiple disk paths on RAID disk prompt
dist/common/scripts/scylla_raid_setup: skip constructing RAID0 when only one disk specified
dist/common/scripts/scylla_raid_setup: fix module import
dist/common/scripts/scylla_setup: check disk is used in MDRAID
dist/common/scripts/scylla_setup: move unmasking scylla-fstrim.timer on scylla_fstrim_setup
dist/common/scripts/scylla_setup: use print() instead of logging.error()
dist/common/scripts/scylla_setup: implement do_verify_package() for Gentoo Linux
dist/common/scripts/scylla_coredump_setup: run os.remove() when deleting directory is symlink
dist/common/scripts/scylla_setup: don't include the disk on unused list when it contains partitions
dist/common/scripts/scylla_setup: skip running rest of the check when the disk detected as used
dist/common/scripts/scylla_setup: add a disk to selected list correctly
dist/common/scripts/scylla_setup: fix wrong indent
dist/common/scripts: sync instance type list for detect NIC type to latest one
dist/common/scripts: verify systemd unit existance using 'systemctl cat'
"
If a coordinator sends write requests with ID=X and restarts it may get a reply to
the request after it restarts and sends another request with the same ID (but to
different replicas). This condition will trigger an assert in a coordinator. Drop
the assertion in favor of a warning and initialize handler id in a way to make
this situation less likely.
Fixes: #3153
"
* 'gleb/write-handler-id' of github.com:scylladb/seastar-dev:
storage_proxy: initialize write response id counter from wall clock value
storage_proxy: drop virtual from signal(gms::inet_address)
storage_proxy: do not assert on getting an unexpected write reply
Initializing write response id to the same value on each reboot may
cause stale id to be taken for active one if node restarts after
sending only a couple of write request and before receiving replies.
On next reboot it will start assigning id's from the same value and
receiving old replies will confuse it. Mitigate this by assigning
initial id to wall clock value in milliseconds. It will not solve the
problem completely, but will mitigate it.
When nodetool repair is used with the combination of the "-pr" (primary
range) and "-local" (only repair with nodes in the same DC) options,
Scylla needs to define the "primary ranges" differently: Rather than
assign one node in the entire cluster to be the primary owner of every
token, we need one node in each data-center - so that a "-local"
repair will cover all the tokens.
Fixes#3557.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180701132445.21685-1-nyh@scylladb.com>
In theory we should not get write reply from a node we did not send
write to, but in practice stale reply can be received if node reboot
between sending write and getting a reply. Do not assert, but log the
warning instead and ignore the reply.
Fixes: #3153
Introduced in 5b59df3761.
It is incorrect to erase entries from the memtable being moved to
cache if partition update can be preempted because a later memtable
read may create a snapshot in the memtable before memtable writes for
that partition are made visible through cache. As a result the read
may miss some of the writes which were in the memtable. The code was
checking for presence of snapshots when entering the partition, but
this condition may change if update is preempted. The fix is to not
allow erasing if update is preemptible.
This also caused SIGSEGVs because we were assuming that no such
snapshots will be created and hence were not invalidating iterators on
removal of the entries, which results in undefined behavior when such
snapshots are actually created.
Fixes SIGSEGV in dtest: limits_test.py:TestLimits.max_cells_test
Fixes#3532
Message-Id: <1530129009-13716-1-git-send-email-tgrabiec@scylladb.com>
"
Partition snapshots go away when the last read using the snapshot is done.
Currently we will synchronously attempt to merge partition versions on this event.
If partitions are large, that may stall the reactor for a significant amount of time,
depending on the size of newer versions. Cache update on memtable flush can
create especially large versions.
The solution implemented in this series is to allow merging to be preemptable,
and continue in the background. Background merging is done by the mutation_cleaner
associated with the container (memtable, cache). There is a single merging process
per mutation_cleaner. The merging worker runs in a separate scheduling group,
introduced here, called "mem_compaction".
When the last user of a snapshot goes away the snapshot is slided to the
oldest unreferenced version first so that the version is no longer reachable
from partition_entry::read(). The cleaner will then keep merging preceding
(newer) versions into it, until it merges a version which is referenced. The
merging is preemtable. If the initial merging is preempted, the snapshot is
enqueued into the cleaner, the worker woken up, and merging will continue
asynchronously.
When memtable is merged with cache, its cleaner is merged with cache cleaner,
so any outstanding background merges will be continued by the cache cleaner
without disruption.
This reduces scheduling latency spikes in tests/perf_row_cache_update
for the case of large partition with many rows. For -c1 -m1G I saw
them dropping from >23ms to 1-2ms. System-level benchmark using scylla-bench
shows a similar improvement.
"
* tag 'tgrabiec/merge-snapshots-gradually-v4' of github.com:tgrabiec/scylla:
tests: perf_row_cache_update: Test with an active reader surviving memtable flush
memtable, cache: Run mutation_cleaner worker in its own scheduling group
mutation_cleaner: Make merge() redirect old instance to the new one
mvcc: Use RAII to ensure that partition versions are merged
mvcc: Merge partition version versions gradually in the background
mutation_partition: Make merging preemtable
tests: mvcc: Use the standard maybe_merge_versions() to merge snapshots
Tests testing different aspects of `foreign_reader` and
`multishard_combining_reader` are designed to run with a certain minimum
shard count. Running them with any shard count below this minimum makes
them useless at best but can even fail them.
Refuse to run these tests when the shard count is below the required
minimum to avoid an accidental and unnecessary investigation into a
false-positive test failure.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d24159415b6a9d74eafb8355b6e3fba98c1ff7ff.1530274392.git.bdenes@scylladb.com>
The index_reader class public interface has been amended to only deal
with the upper bound cursor along with advancing the lower bound.
Since the class users can only explicitly operate with the lower bound
cursor (take data file position, advance to the next partition, etc), it
no longer makes sense to specify that the method operates on the lower
bound cursor in its name.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
The only case when it needs to be called is when an index_reader is
advanced to a specific partition as part of sstable_reader
initialisation.
Instead, we're passing an optional upper_bound parameter that is used to
call advance_upper_past() internally if partition is found.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"
When frozen mutation gets deserialised current implementation copies
its value 3 times: from IDL buffer to bytes object, from bytes object to
atomic_cell and then atomic_cell is copied again. Moreover, the value
gets linearised which may cause a large allocation.
All of that is very wasteful. This patch devirtualises and reworks IDL
reading code so that when used with partition_builder the cell value is
copied only once and without linearisation: from the IDL buffer to the
final atomic_cell.
perf_simple_query -c4, medians of 30 results:
./perf_before ./perf_after diff
read 310576.54 316273.90 1.8%
write 359913.15 375579.44 4.4%
microbenchmark, perf_idl:
BEFORE
test iterations median mad min max
frozen_mutation.freeze_one_small_row 2142435 462.431ns 0.125ns 462.306ns 467.659ns
frozen_mutation.unfreeze_one_small_row 1640949 601.422ns 0.082ns 601.340ns 605.279ns
frozen_mutation.apply_one_small_row 1538969 645.993ns 0.405ns 645.588ns 656.510ns
AFTER
test iterations median mad min max
frozen_mutation.freeze_one_small_row 2139548 455.525ns 0.631ns 454.894ns 456.707ns
frozen_mutation.unfreeze_one_small_row 1760139 566.157ns 0.003ns 566.153ns 584.339ns
frozen_mutation.apply_one_small_row 1582050 610.951ns 0.060ns 610.891ns 613.044ns
Tests: unit(release)
"
* tag 'avoid-copy-unfreeze/v2' of https://github.com/pdziepak/scylla:
mutation_partition_view: use column_mapping_entry::is_atomic()
schema: column_mapping_entry: cache abstract_type::is_atomic()
schema: column_mapping_entry: reduce logic duplication
mutation_partition_view: do not linearise or copy cell value
atomic_cell: allow passing value via ser::buffer_view
mutation_partition_view: pass cell by value to visitor
mutation_partition_view: devirtualise accept()
storage_proxy: use mutation_partition_view::{first, last}_row_key()
mutation_partition_view: add last_row_key() and first_row_key() getters
IDL deserialisation code calls is_atomic() for each cell. An additional
indirection and a virtual call can be avoided by caching that value in
column_mapping_entry. There is already very similar optimisation done
for column_definitions.
User-defined constructors often make it more likely that a careless
developer will forget to update one of them when adding a new member to
a structure. The risk of that happening can be reduced by reducing code
duplication with delegating constructors.
mutation_partition_view needs to create an atomic_cell from
IDL-serialised data. Then that cell is passed to the visitor. However,
because generic mutation_partition_visitor interface was used, the cell
was passed by constant reference which forced the visitor to needlessly
copy it.
This patch takes advantage of the fact that mutation_partition_view is
devirtualised now and adjust the interfaces of its visitors so that the
cell can be passed without copying.
There are only two types of visitors used and only one of them appears
in the hot path. They can be devirtualised without too much effort,
which also enables future custom interface specialisations specific to
mutation_partition_views and its users, not necessairly in the scope of
more general mutation_partition_visitor.
Some users (e.g. reconciliation code) need only to know the clustering
key of the first or the last row in the partition. This was done with a
full visitor visiting every single cell of the partition, which is very
wasteful. This patch adds direct getters for the needed information.
It is only being used by index_reader internally and never exposed so
should not be listed in commonly used types.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
An index entry may or may not have a promoted index. All the optional
fields are better scoped under the same class to avoid lots of separate
optional fields and give better representation.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
In the removenode operation, if the message servicing is stopped, e.g., due
to disk io error isolation, the node can keep retrying the
REPLICATION_FINISHED verb infinitely.
Scylla log full of such message was observed:
[shard 0] storage_service - Fail to send REPLICATION_FINISHED to $IP:0:
seastar::rpc::closed_error (connection is closed)
To fix, limit the number of retires.
Tests: update_cluster_layout_tests.py
Fixes#3542
Message-Id: <638d392d6b39cc2dd2b175d7f000e7fb1d474f87.1529927816.git.asias@scylladb.com>
Currently, enabling scylla-fstrim.timer is part of 'enable-service', it
will be enabled even --no-fstrim-setup specified (or input 'No' on interactive setup prompt).
To apply --no-fstrim-setup we need to enabling scylla-fstrim.timer in
scylla_fstrim_setup instead of enable-service part of scylla_setup.
Fixes#3248
On current implementation, we are checking the partition is mounted, but
a disk contains the partition marked as unused.
To avoid the problem, we should skip a disk which contains partitions.
Fixes#3545
When a disk path typed on the RAID setup prompt, the script mistakenly
splits the input for each character,
like ['/', 'd', 'e', 'v', '/', 's', 'd', 'b'].
To fix the issue we need to use selected.append() instead of
selected +=.
See #3545
The presence of a plain reference prohibits the bound_view class from
being copyable. The trick employed to work around that was to use
'placement new' for copy-assigning bound_view objects, but this approach
is ill-formed and causes undefined behaviour for classes that have const
and/or reference members.
The solution is to use a std::reference_wrapper instead.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <a0c951649c7aef2f66612fc006c44f8a33713931.1530113273.git.vladimir@scylladb.com>
"
We need a multishard_writer which gets mutation fragments from a producer
(e.g., from the network using the rpc streaming) and consumes the mutation
fragments with a consumer (e.g., write to sstable).
The multishard_writer will take care of the mutation fragments do not belong to
current shard.
This multishard_writer will be used in the new scylla streaming.
"
* 'asias/multishard_writer_v10.1' of github.com:scylladb/seastar-dev:
tests: Add multishard_writer_test to test.py
tests: Add test for multishard_writer
multishard_writer: Introduce multishard_writer
tests: Allow random_mutation_generator to generate mutations belong to remote shrard
The multishard_writer class gets mutation_fragments generated from
flat_mutation_reader and consumes the mutation_fragments with
multishard_writer::_consumer. If the mutation_fragment does not belong to the
shard multishard_writer is on, it will forward the mutation_fragment to the
correct shard. Future returned by multishard_writer() becomes ready
when all the mutation_fragments are consumed.
Tests: tests/multishard_writer_test.cc
Tests: dtest update_cluster_layout_tests.py
Fixes#3497
The worker is responsible for merging MVCC snapshots, which is similar
to merging sstables, but in memory. The new scheduling group will be
therefore called "memory compaction".
We should run it in a separate scheduling group instead of
main/memtables, so that it doesn't disrupt writes and other system
activities. It's also nice for monitoring how much CPU time we spend
on this.
If memtable snapshot goes away after memtable started merging to
cache, it would enqueue the snapshots for cleaning on the memtable's
cleaner, which will have to clean without deferrring when the memtable
is destroyed. That may stall the reactor. To avoid this, make merge()
cause the old instance of the cleaner to redirect to the new instance
(owned by cache), like we do for regions. This way the snapshots
mentioned earlier can be cleaned after memtable is destroyed,
gracefully.
Before this patch, maybe_merge_versions() had to be manually called
before partition snapshot goes away. That is error prone and makes
client code more complicated. Delegate that task to a new
partition_snapshot_ptr object, through which all snapshots are
published now.
"
With DateTiered and TimeWindow, there is a read optimization enabled
which excludes sstables based on overlap with recorded min/max values
of clustering key components. The problem is that it doesn't take into
account partition tombstones and static rows, which should still be
returned by the reader even if there is no overlap in the query's
clustering range. A read which returns no clustering rows can
mispopulate cache, which will appear as partition deletion or writes
to the static row being lost. Until node restart or eviction of the
partition entry.
There is also a bad interaction between cache population on read and
that optimization. When the clustering range of the query doesn't
overlap with any sstable, the reader will return no partition markers
for the read, which leads cache populator to assume there is no
partition in sstables and it will cache an empty partition. This will
cause later reads of that partition to miss prior writes to that
partition until it is evicted from cache or node is restarted.
Disable until a more elaborate fix is implemented.
Fixes#3552Fixes#3553
"
* tag 'tgrabiec/disable-min-max-sstable-filtering-v1' of github.com:tgrabiec/scylla:
tests: Add test for slicing a mutation source with date tiered compaction strategy
tests: Check that database conforms to mutation source
database: Disable sstable filtering based on min/max clustering key components
Fixes#3546
Both older origin and scylla writes "known" compressor names (i.e. those
in origin namespace) unqualified (i.e. LZ4Compressor).
This behaviour was not preserved in the virtualization change. But
probably should be.
Message-Id: <20180627110930.1619-1-calle@scylladb.com>
When snapshots go away, typically when the last reader is destroyed,
we used to merge adjacent versions atomically. This could induce
reactor stalls if partitions were large. This is especially true for
versions created on cache update from memtables.
The solution is to allow this process to be preempted and move to the
background. mutation_cleaner keeps a linked list of such unmerged
snapshots and has a worker fiber which merges them incrementally and
asynchronously with regards to reads.
This reduces scheduling latency spikes in tests/perf_row_cache_update
for the case of large partition with many rows. For -c1 -m1G I saw
them dropping from 23ms to 2ms.
drop_column_family now waits for both writes and reads in progress.
It solves possible liveness issues with row cache, when column_family
could be dropped prematurely, before the read request was finished.
Phaser operation is passed inside database::query() call.
There are other places where reading logic is applied (e.g. view
replicas), but these are guarded with different synchronization
mechanisms, while _pending_reads_phaser applies to regular reads only.
Fixes#3357
Reported-by: Duarte Nunes <duarte@scylladb.com>
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Message-Id: <d58a5ee10596d0d62c765ee2114ac171b6f087d2.1529928323.git.sarna@scylladb.com>
With DateTiered and TimeWindow, there is a read optimization enabled
which excludes sstables based on overlap with recorded min/max values
of clustering key components. The problem is that it doesn't take into
account partition tombstones and static rows, which should still be
returned by the reader even if there is no overlap in the query's
clustering range. A read which returns no clustering rows can
mispopulate cache, which will appear as partition deletion or writes
to the static row being lost. Until node restart or eviction of the
partition entry.
There is also a bad interaction between cache population on read and
that optimization. When the clustering range of the query doesn't
overlap with any sstable, the reader will return no partition markers
for the read, which leads cache populator to assume there is no
partition in sstables and it will cache an empty partition. This will
cause later reads of that partition to miss prior writes to that
partition until it is evicted from cache or node is restarted.
Disable until a more elaborate fix is implemented.
Fixes#3552Fixes#3553
2018-06-26 18:54:44 +02:00
2300 changed files with 106939 additions and 34407 deletions
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) for general questions and help.
Use the [ScyllaDB user mailing list](https://groups.google.com/forum/#!forum/scylladb-users) or the [Slack workspace](http://slack.scylladb.com) for general questions and help.
Scylla depends on the system package manager for its development dependencies.
Running `./install_dependencies.sh` (as root) installs the appropriate packages based on your Linux distribution.
Running `./install-dependencies.sh` (as root) installs the appropriate packages based on your Linux distribution.
On Ubuntu and Debian based Linux distributions, some packages
required to build Scylla are missing in the official upstream:
- libthrift-dev and libthrift
- antlr3-c++-dev
Try running ```sudo ./scripts/scylla_current_repo``` to add Scylla upstream,
and get the missing packages from it.
### Build system
**Note**: Compiling Scylla requires, conservatively, 2 GB of memory per native thread, and up to 3 GB per native thread while linking.
**Note**: Compiling Scylla requires, conservatively, 2 GB of memory per native
thread, and up to 3 GB per native thread while linking. GCC >= 8.1.1. is
required.
Scylla is built with [Ninja](https://ninja-build.org/), a low-level rule-based system. A Python script, `configure.py`, generates a Ninja file (`build.ninja`) based on configuration options.
@@ -43,11 +54,9 @@ The full suite of options for project configuration is available via
$ ./configure.py --help
```
The most important options are:
The most important option is:
-`--mode={release,debug,all}`: Debug mode enables [AddressSanitizer](https://github.com/google/sanitizers/wiki/AddressSanitizer) and allows for debugging with tools like GDB. Debugging builds are generally slower and generate much larger object files than release builds.
-`--{enable,disable}-dpdk`: [DPDK](http://dpdk.org/) is a set of libraries and drivers for fast packet processing. During development, it's not necessary to enable support even if it is supported by your platform.
- `--enable-dpdk`: [DPDK](http://dpdk.org/) is a set of libraries and drivers for fast packet processing. During development, it's not necessary to enable support even if it is supported by your platform.
Source files and build targets are tracked manually in `configure.py`, so the script needs to be updated when new files or targets are added or removed.
@@ -55,6 +64,30 @@ To save time -- for instance, to avoid compiling all unit tests -- you can also
@@ -83,7 +116,7 @@ The `-c1 -m1G` arguments limit this Seastar-based test to a single system thread
### Preparing patches
All changes to Scylla are submitted as patches to the public mailing list. Once a patch is approved by one of the maintainers of the project, it is committed to the maintainers' copy of the repository at https://github.com/scylladb/scylla.
All changes to Scylla are submitted as patches to the public [mailing list](mailto:scylladb-dev@googlegroups.com). Once a patch is approved by one of the maintainers of the project, it is committed to the maintainers' copy of the repository at https://github.com/scylladb/scylla.
Detailed instructions for formatting patches for the mailing list and advice on preparing good patches are available at the [ScyllaDB website](http://docs.scylladb.com/contribute/). There are also some guidelines that can help you make the patch review process smoother:
@@ -112,6 +145,8 @@ The usual is "Tests: unit (release)", although running debug tests is encouraged
5. When answering review comments, prefer inline quotes as they make it easier to track the conversation across multiple e-mails.
6. The Linux kernel's [Submitting Patches](https://www.kernel.org/doc/html/v4.19/process/submitting-patches.html) document offers excellent advice on how to prepare patches and patchsets for review. Since the Scylla development process is derived from the kernel's, almost all of the advice there is directly applicable.
### Finding a person to review and merge your patches
You can use the `scripts/find-maintainer` script to find a subsystem maintainer and/or reviewer for your patches. The script accepts a filename in the git source tree as an argument and outputs a list of subsystems the file belongs to and their respective maintainers and reviewers. For example, if you changed the `cql3/statements/create_view_statement.hh` file, run the script as follows:
@@ -164,6 +199,29 @@ On a development machine, one might run Scylla as
To interact with scylla it is recommended to build our versions of
cqlsh and nodetool. They are available at
https://github.com/scylladb/scylla-tools-java and can be built with
```bash
$ sudo ./install-dependencies.sh
$ ant jar
```
cqlsh should work out of the box, but nodetool depends on a running
scylla-jmx (https://github.com/scylladb/scylla-jmx). It can be build
with
```bash
$ mvn package
```
and must be started with
```bash
$ ./scripts/scylla-jmx
```
### Branches and tags
Multiple release branches are maintained on the Git repository at https://github.com/scylladb/scylla. Release 1.5, for instance, is tracked on the `branch-1.5` branch.
@@ -254,7 +312,7 @@ In this example, `10.0.0.2` will be sent up to 16 jobs and the local machine wil
When a compilation is in progress, the status of jobs on all remote machines can be visualized in the terminal with `distccmon-text` or graphically as a GTK application with `distccmon-gnome`.
One thing to keep in mind is that linking object files happens on the coordinating machine, which can be a bottleneck. See the next section speeding up this process.
One thing to keep in mind is that linking object files happens on the coordinating machine, which can be a bottleneck. See the next sections speeding up this process.
### Using the `gold` linker
@@ -264,6 +322,24 @@ Linking Scylla can be slow. The gold linker can replace GNU ld and often speeds
$ sudo alternatives --config ld
```
### Using split dwarf
With debug info enabled, most of the link time is spent copying and
relocating it. It is possible to leave most of the debug info out of
the link by writing it to a side .dwo file. This is done by passing
`-gsplit-dwarf` to gcc.
Unfortunately just `-gsplit-dwarf` would slow down `gdb` startup. To
avoid that the gold linker can be told to create an index with
`--gdb-index`.
More info at https://gcc.gnu.org/wiki/DebugFission.
Both options can be enable by passing `--split-dwarf` to configure.py.
Note that distcc is *not* compatible with it, but icecream
(https://github.com/icecc/icecream) is.
### Testing changes in Seastar with Scylla
Sometimes Scylla development is closely tied with a feature being developed in Seastar. It can be useful to compile Scylla with a particular check-out of Seastar.
@@ -277,3 +353,8 @@ $ git remote add local /home/tsmith/src/seastar
seastar::metrics::description("number of operations via Alternator API"),{op(CamelCaseName)}),
#define OPERATION_LATENCY(name, CamelCaseName) \
seastar::metrics::make_histogram("op_latency", \
seastar::metrics::description("Latency histogram of an operation via Alternator API"),{op(CamelCaseName)},[this]{returnapi_operations.name.get_histogram(1,20);}),
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.