Consider 3 nodes in the cluster, n1, n2, n3 with gossip generation
number g1, g2, g3.
n1, n2, n3 running scylla version with commit
0a52ecb6df (gossip: Fix max generation
drift measure)
One year later, the user wants to upgrade n1, n2, n3 to a new version.
When n3 does a rolling restart with the new version, n3 will use a
generation number g3'. Because g3' - g2 > MAX_GENERATION_DIFFERENCE and
g3' - g1 > MAX_GENERATION_DIFFERENCE, n1 and n2 will reject n3's
gossip update and mark n3 as down.
Such unnecessary marking of node down can cause availability issues.
For example:
DC1: n1, n2
DC2: n3, n4
When n3 and n4 restart, n1 and n2 will mark n3 and n4 as down, which
causes the whole DC2 to be unavailable.
To fix, we can start the new node with a gossip generation within
MAX_GENERATION_DIFFERENCE of the generations known to the other nodes.
Once all the nodes run the version with commit
0a52ecb6df, the option is no longer
needed.
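A minimal sketch of the workaround's idea; the function name, the capping strategy, and the constant are assumptions of this sketch, not the actual Scylla option:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: if a time-based generation would drift too far from
// what the peers last saw, start with a capped value peers still accept.
constexpr int32_t MAX_GENERATION_DIFFERENCE = 86400 * 365; // one year, as in gossip

int32_t choose_startup_generation(int32_t now_based, int32_t max_known_peer_generation) {
    if (now_based - max_known_peer_generation > MAX_GENERATION_DIFFERENCE) {
        // stay within the window the peers will accept
        return max_known_peer_generation + MAX_GENERATION_DIFFERENCE;
    }
    return now_based;
}
```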
Fixes #5164
(cherry picked from commit 743b529c2b)
[tgrabiec: resolved major conflicts in config.hh]
User reported an issue that after a node restart, the restarted node
is marked as DOWN by other nodes in the cluster while the node is up
and running normally.
Consider the following:
- n1, n2, n3 in the cluster
- n3 shuts itself down
- n3 sends the shutdown verb to n1 and n2
- n1 and n2 set n3's status to SHUTDOWN and force its heartbeat version to
INT_MAX
- n3 restarts
- n3 sends gossip shadow round requests to n1 and n2, in
storage_service::prepare_to_join
- n3 receives the response from n1; in gossiper::handle_ack_msg, since
_enabled = false and _in_shadow_round == false, n3 will apply the
application state in fiber 1. Fiber 1 finishes before fiber 2 and
sets _in_shadow_round = false
- n3 receives the response from n2; in gossiper::handle_ack_msg, since
_enabled = false and _in_shadow_round == false, n3 will apply the
application state in fiber 2. Fiber 2 yields
- n3 finishes the shadow round and continues
- n3 resets gossip endpoint_state_map with
gossiper.reset_endpoint_state_map()
- n3 resumes fiber 2 and applies the application state about n3 into
endpoint_state_map; at this point endpoint_state_map contains
information about n3 itself, learned from n2
- n3 calls gossiper.start_gossiping(generation_number, app_states, ...)
with the new generation number generated correctly in
storage_service::prepare_to_join, but
maybe_initialize_local_state(generation_nbr) will not set the new
generation and heartbeat if the endpoint_state_map already contains an
entry for the node itself
- n3 continues with the old generation and heartbeat learned in fiber 2
- n3 continues the gossip loop; in gossiper::run,
hbs.update_heart_beat() sets the heartbeat to a number starting
from 0
- n1 and n2 will not get updates from n3 because they use the same
generation number but n1 and n2 have a larger heartbeat version
- n1 and n2 will mark n3 as down even if n3 is alive.
To fix, always use the new generation number.
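The essence of the fix, as a toy model (simplified types and names; not the actual gossiper code):

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model: previously, initialization kept a stale entry for the local
// node (learned during the shadow round) and hence a stale generation.
// The fix is to always install the freshly generated generation number.
void initialize_local_state(std::map<std::string, int>& endpoint_generation,
                            const std::string& self, int new_generation) {
    // buggy version: endpoint_generation.emplace(self, new_generation);
    // (emplace is a no-op when an entry for `self` already exists)
    endpoint_generation[self] = new_generation; // always overwrite
}
```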
Fixes: #5800
Backports: 3.0 3.1 3.2
(cherry picked from commit 62774ff882)
When qualifying columns to be fetched for filtering, we also check
if the target column is not used as an index - in which case there's
no need of fetching it. However, the check was incorrectly assuming
that any restriction is eligible for indexing, while it's currently
only true for EQ. The fix makes a more specific check and contains
many dynamic casts, but these will hopefully be gone once our
long-planned "restrictions rewrite" is done.
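A hedged sketch of the more specific check (class names are illustrative, not Scylla's actual restriction hierarchy):

```cpp
#include <cassert>

// Only an EQ restriction can currently be served by a secondary index,
// so only then may we skip fetching the column for filtering.
struct restriction { virtual ~restriction() = default; };
struct eq_restriction : restriction {};
struct slice_restriction : restriction {};

bool restriction_supported_by_index(const restriction& r) {
    // the dynamic_cast mirrors the "many dynamic casts" the message mentions
    return dynamic_cast<const eq_restriction*>(&r) != nullptr;
}
```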
This commit comes with a test.
Fixes #5708
Tests: unit(dev)
(cherry picked from commit 767ff59418)
SimpleStrategy creates a list of endpoints by iterating over the set of
all configured endpoints for the given token, until we reach keyspace
replication factor.
There is a trivial coding bug: we first add at least one endpoint
to the list, and only then compare the list size with the replication
factor. With RF=0 this check never yields true.
Fix by moving the RF check before at least one endpoint is added to the
list.
Cassandra never had this bug since it uses a less fancy while()
loop.
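The fixed loop shape, sketched with simplified types (not the actual SimpleStrategy code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Check the RF bound *before* adding an endpoint, so that RF=0 yields an
// empty endpoint list instead of always returning at least one endpoint.
std::vector<int> calculate_endpoints(const std::vector<int>& token_ring, size_t rf) {
    std::vector<int> endpoints;
    for (int ep : token_ring) {
        if (endpoints.size() >= rf) {   // the buggy version checked this after push_back
            break;
        }
        endpoints.push_back(ep);
    }
    return endpoints;
}
```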
Fixes #5962
Message-Id: <20200306193729.130266-1-kostja@scylladb.com>
(cherry picked from commit ac6f64a885)
Reclaim consults the _regions vector, so we don't want it moving around while
allocating more capacity. For that we take the reclaim lock. However, that
can cause a false-positive OOM during startup:
1. all memory is allocated to LSA as part of priming (2baa16b371)
2. the _regions vector is resized from 64k to 128k, requiring a segment
to be freed (plenty are free)
3. but reclaiming_lock is taken, so we cannot reclaim anything.
To fix, resize the _regions vector outside the lock.
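The shape of the fix, sketched with simplified types (the real code manages LSA segments, not a plain mutex and ints):

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Grow the vector's capacity before taking the reclaim lock, so the
// allocation (which may need a segment to be freed) never happens while
// reclaim is forbidden.
void add_region(std::vector<int>& regions, std::mutex& reclaim_lock, int region) {
    regions.reserve(regions.size() + 1); // may allocate: done outside the lock
    std::lock_guard<std::mutex> guard(reclaim_lock);
    regions.push_back(region);           // capacity already reserved, no allocation here
}
```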
Fixes #6003.
Message-Id: <20200311091217.1112081-1-avi@scylladb.com>
(cherry picked from commit c020b4e5e2)
By default, `/usr/lib/rpm/find-debuginfo.sh` will tamper with
the binary's build-id when stripping its debug info, as it is passed
the `--build-id-seed <version>.<release>` option.
To prevent that we need to set the following macros:
unset `_unique_build_ids`
set `_no_recompute_build_ids` to 1
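In spec-file terms the two macros would look roughly like this (exact placement depends on the packaging layout):

```
# keep build-ids stable: don't let find-debuginfo.sh recompute them
%undefine _unique_build_ids
%define _no_recompute_build_ids 1
```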
Fixes #5881
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 25a763a187)
This patch fixes a bug that appears because of an incorrect interaction
between counters and hinted handoff.
When a counter is updated on the leader, it sends mutations to other
replicas that contain all counter shards from the leader. If consistency
level is achieved but some replicas are unavailable, a hint with
mutation containing counter shards is stored.
When a hint's destination node is no longer a replica for it, the hint is
attempted to be sent to all of its current replicas. Previously,
storage_proxy::mutate was used for that purpose. It was incorrect
because that function treats mutations for counter tables as mutations
containing only a delta (by how much to increase/decrease the counter).
These two types of mutations have different serialization formats, so in
this case a "shards" mutation is reinterpreted as a "delta" mutation,
which can cause data corruption.
This patch backports `storage_proxy::mutate_hint_from_scratch`
function, which bypasses special handling of counter mutations and
treats them as regular mutations - which is the correct behavior for
"shards" mutations.
Refs #5833.
Backports: 3.1, 3.2, 3.3
Tests: unit(dev)
(cherry picked from commit ec513acc49)
"
Throw an error in case we hit an invalid time UUID
rather than hitting an assert.
Fixes #5552
(Ref #5588 that was dequeued and fixed here)
Test: UUID_test, cql_query_test(debug)
"
* 'validate-time-uuid' of https://github.com/bhalevy/scylla:
cql3: abstract_function_selector: provide assignment_testable_source_context
test: cql_query_test: add time uuid validation tests
cql3: time_uuid_fcts: validate timestamp arg
cql3: make_max_timeuuid_fct: delete outdated FIXME comment
cql3: time_uuid_fcts: validate time UUID
test: UUID_test: add tests for time uuid
utils: UUID: create_time assert nanos_since validity
utils/UUID_gen: make_nanos_since
utils: UUID: assert UUID.is_timestamp
(cherry picked from commit 3343baf159)
Conflicts:
cql3/functions/time_uuid_fcts.hh
tests/cql_query_test.cc
Avoid following UBSAN error:
repair/row_level.cc:2141:7: runtime error: load of value 240, which is not a valid value for type 'bool'
Fixes#5531
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 474ffb6e54)
Docker restricts the number of processes in a container to some
limit it calculates. This limit turns out to be too low on large
machines, since we run multiple links in parallel, and each link
runs many threads.
Remove the limit by specifying --pids-limit -1. Since dbuild is
meant to provide a build environment, not a security barrier,
this is okay (the container is still restricted by host limits).
I checked that --pids-limit is supported by old versions of
docker and by podman.
Fixes #5651.
Message-Id: <20200127090807.3528561-1-avi@scylladb.com>
(cherry picked from commit 897320f6ab)
table::flush_streaming_mutations dates back to the days when streamed
data went to memtables. After switching to the new streaming, data goes
to sstables directly, so the sstables generated in
table::flush_streaming_mutations will be empty.
It is unnecessary to invalidate the cache if no sstables are added. To
avoid unnecessary cache invalidation, which pokes holes in the cache, skip
calling _cache.invalidate() if the sstable set is empty.
The steps are:
- STREAM_MUTATION_DONE verb is sent when streaming is done with old or
new streaming
- table::flush_streaming_mutations is called in the verb handler
- cache is invalidated for the streaming ranges
In summary, this patch will avoid a lot of cache invalidation for
streaming.
Backports: 3.0 3.1 3.2
Fixes: #5769
(cherry picked from commit 5e9925b9f0)
This assert, added by 060e3f8, is supposed to make sure the invariant of
append() is respected, in order to prevent building an invalid row.
The assert however proved to be too harsh, as it converts any bug
causing out-of-order clustering rows into cluster unavailability.
Downgrade it to on_internal_error(). This will still prevent corrupt
data from spreading in the cluster, without the unavailability caused by
the assert.
Fixes: #5786
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200211083829.915031-1-bdenes@scylladb.com>
(cherry picked from commit 3164456108)
Since dpkg does not re-install conffiles that were removed by the user,
we currently lose dependencies.conf and sysconfdir.conf on rollback.
To prevent this, we need to stop running
'rm -rf /etc/systemd/system/scylla-server.service.d/' on 'remove'.
Fixes #5734
(cherry picked from commit 43097854a5)
We would sometimes produce an unnecessary extra 0xff prefix byte.
The new encoding matches what cassandra does.
This was both an efficiency and a correctness issue, as using a varint in
a key could produce different tokens.
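An illustrative minimal two's-complement encoding in the spirit of Cassandra's varint (a sketch, not Scylla's actual serializer):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Big-endian two's-complement bytes with redundant leading 0x00/0xff bytes
// stripped, but only while the sign bit of the following byte is preserved.
// The old bug amounted to keeping an unnecessary extra 0xff prefix byte.
std::vector<uint8_t> encode_varint(int64_t v) {
    std::vector<uint8_t> out;
    for (int shift = 56; shift >= 0; shift -= 8) {
        out.push_back(static_cast<uint8_t>(v >> shift));
    }
    while (out.size() > 1) {
        uint8_t lead = out[0], next = out[1];
        bool redundant_zero = (lead == 0x00) && !(next & 0x80);
        bool redundant_ff   = (lead == 0xff) &&  (next & 0x80);
        if (!(redundant_zero || redundant_ff)) {
            break; // stripping more would flip the encoded sign
        }
        out.erase(out.begin());
    }
    return out;
}
```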
Fixes #5656
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
(cherry picked from commit c89c90d07f)
We use eventually() in tests to wait for eventually consistent data
to become consistent. However, we see spurious failures indicating
that we wait too little.
Increasing the timeout has a negative side effect in that tests that
fail will now take longer to do so. However, this cost is negligible
compared to that of false-positive failures, since those throw away large
test efforts and sometimes require a person to investigate the problem,
only to conclude it is a false positive.
This patch therefore makes eventually() more patient, by a factor of
32.
Fixes #4707.
Message-Id: <20200130162745.45569-1-avi@scylladb.com>
(cherry picked from commit ec5b721db7)
We need to add '~' to handle rcX versions correctly on Debian variants
(merged at ae33e9f), but when we moved to the relocatable package we
mistakenly dropped the code, so add the code again.
Fixes #5641
(cherry picked from commit dd81fd3454)
Consider this:
1) Write partition_start of p1
2) Write clustering_row of p1
3) Write partition_end of p1
4) Repair is stopped due to error before writing partition_start of p2
5) Repair calls repair_row_level_stop() to tear down which calls
wait_for_writer_done(). A duplicate partition_end is written.
To fix, track the partition_start and partition_end written, and avoid
unpaired writes.
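The pairing discipline can be sketched like this (a simplified stand-in for the repair row-level writer):

```cpp
#include <cassert>

// Track whether a partition is open so that the teardown path
// (wait_for_writer_done) never emits an unpaired partition_end.
struct repair_writer {
    bool partition_open = false;
    int partition_ends_written = 0;

    void write_partition_start() { partition_open = true; }

    void maybe_write_partition_end() {
        if (partition_open) {           // only close a partition we opened
            ++partition_ends_written;
            partition_open = false;
        }
    }
};
```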
Backports: 3.1 and 3.2
Fixes: #5527
(cherry picked from commit 401854dbaf)
Build progress virtual reader uses Scylla-specific
scylla_views_builds_in_progress table in order to represent
legacy views_builds_in_progress rows. The Scylla-specific table contains
additional cpu_id clustering key part, which is trimmed before returning
it to the user. That may cause duplicated clustering row fragments to be
emitted by the reader, which may cause undefined behaviour in consumers.
The solution is to keep track of previous clustering keys for each
partition and drop fragments that would cause duplication. That way if
any shard is still building a view, its progress will be returned,
and if many shards are still building, the returned value will indicate
the progress of a single arbitrary shard.
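The dedup step, sketched on plain keys (simplified stand-ins for mutation fragments):

```cpp
#include <cassert>
#include <string>
#include <vector>

// After trimming the cpu_id clustering key part, consecutive fragments may
// share a clustering key; keep only the first fragment per key so the
// reader stays monotonic within a partition.
std::vector<std::string> drop_duplicate_clustering_rows(const std::vector<std::string>& trimmed_keys) {
    std::vector<std::string> out;
    for (const auto& key : trimmed_keys) {
        if (out.empty() || out.back() != key) { // key differs from the previous one: emit
            out.push_back(key);
        }
    }
    return out;
}
```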
Fixes #4524
Tests:
unit(dev) + custom monotonicity checks from <tgrabiec@scylladb.com>
(cherry picked from commit 85a3a4b458)
We should wait for the future returned from add_local_application_state() to
resolve before issuing a new calculation; otherwise two
add_local_application_state() calls may run simultaneously for the same state.
Fixes #4838.
Message-Id: <20190812082158.GE17984@scylladb.com>
(cherry picked from commit 00c4078af3)
The Debian package build script runs relocate_python_scripts.py for scyllatop,
but mistakenly forgets to install tools/scyllatop/*.py.
We need to install them by using scylla-server.install.
Fixes #5518
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20191227025750.434407-1-syuu@scylladb.com>
Similar to trace_state, keep a shared_ptr<tracing> _local_tracing_ptr
in one_session_records when constructed, so it can be used
during shutdown.
Fixes #5243
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 7aef39e400)
On aarch64, asan detected a use-after-move. It doesn't happen on x86_64,
likely due to different argument evaluation order.
Fix by evaluating full_slice before moving the schema.
Note: I used "auto&&" and "std::move()" even though full_slice()
returns a reference. I think this is safer in case full_slice()
changes, and works just as well with a reference.
Fixes #5419.
(cherry picked from commit 85822c7786)
In the system.types table a user type refers to another by name. When
a user type is modified, only its entry in the table is changed.
At runtime a user type has direct pointers to the types it uses. To
handle the discrepancy we need to recreate any dependent types when an
entry in system.types changes.
Fixes #5049
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
(cherry picked from commit 5af8b1e4a3)
We were missing calls to underlying_type in a few locations and so the
insert would think the given literal was invalid and the select would
refuse to fetch a UDT field.
Fixes #4672
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190708200516.59841-1-espindola@scylladb.com>
(cherry picked from commit 4e7ffb80c0)
This commit makes sure that single-partition readers for
read-before-write do not have fast-forwarding enabled,
as it may lead to huge read amplification. The observed case was:
1. Creating an index.
CREATE INDEX index1 ON myks2.standard1 ("C1");
2. Running cassandra-stress in order to generate view updates.
cassandra-stress write no-warmup n=1000000 cl=ONE -schema \
'replication(factor=2) compaction(strategy=LeveledCompactionStrategy)' \
keyspace=myks2 -pop seq=4000000..8000000 -rate threads=100 -errors
skip-read-validation -node 127.0.0.1;
Without disabling fast-forwarding, single-partition readers
were turned into scanning readers in cache, which resulted
in reading 36GB (sic!) on a workload which generates less
than 1GB of view updates. After applying the fix, the number
dropped down to less than 1GB, as expected.
Refs #5409
Fixes #4615
Fixes #5418
(cherry picked from commit 79c3a508f4)
Since 90d6c0b, cache will abort when trying to detach partition
entries while they're updated. This should never happen. It can happen
though, when the update fails on bad_alloc, because the cleanup guard
invalidates the cache before it releases partition snapshots (held by
"update" coroutine).
Fix by destroying the coroutine first.
Fixes #5327.
Tests:
- row_cache_test (dev)
Message-Id: <1574360259-10132-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit e3d025d014)
Merged patch set by Piotr Dulikowski:
This change corrects condition on which a row was considered expired by its
TTL.
The logic that decides when a row becomes expired was inconsistent with the
logic that decides if a single cell is expired. A single cell becomes expired
when expiry_timestamp <= now, while a row became expired when
expiry_timestamp < now (notice the strict inequality). For rows inserted
with TTL, this caused non-key cells to expire (change their values to null)
one second before the row disappeared. Now, row expiry logic uses non-strict
inequality.
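The corrected predicates, spelled out (signatures are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Both cells and rows now expire when expiry_timestamp <= now (non-strict).
bool cell_expired(int64_t expiry_timestamp, int64_t now) {
    return expiry_timestamp <= now;
}

bool row_expired(int64_t expiry_timestamp, int64_t now) {
    // previously: expiry_timestamp < now, so the row outlived its
    // already-expired cells by one second
    return expiry_timestamp <= now;
}
```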
Fixes #4263,
Fixes #5290.
Tests:
unit(dev)
python test described in issue #5290
(cherry picked from commit 9b9609c65b)
Assume n1 and n2 in a cluster with generation number g1, g2. The
cluster runs for more than 1 year (MAX_GENERATION_DIFFERENCE). When n1
reboots with generation g1' which is time based, n2 will see
g1' > g2 + MAX_GENERATION_DIFFERENCE and reject n1's gossip update.
To fix, check the generation drift with generation value this node would
get if this node were restarted.
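The corrected check, sketched (function name and constant value are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the corrected drift check (backport of CASSANDRA-10969):
// compare the remote generation against the generation *this* node would
// get if it restarted now, instead of against its possibly year-old one.
constexpr int64_t MAX_GENERATION_DIFFERENCE = 86400 * 365;

bool accept_remote_generation(int64_t remote_generation, int64_t local_would_be_generation) {
    return remote_generation <= local_would_be_generation + MAX_GENERATION_DIFFERENCE;
}
```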
This is a backport of CASSANDRA-10969.
Fixes #5164
(cherry picked from commit 0a52ecb6df)
The dependencies are provided by the frozen toolchain. If a dependency
is missing, we must update the toolchain rather than rely on build-time
installation, which is not reproducible (as different package versions
are available at different times).
Luckily "dnf install" does not update an already-installed package. Had
that been the case, none of our builds would have been reproducible, since
packages would be updated to the latest version as of the build time rather
than the version selected by the frozen toolchain.
So, to prevent missing packages in the frozen toolchain translating to
an unreproducible build, remove the support for installing dependencies
from reloc/build_reloc.sh. We still parse the --nodeps option in case some
script uses it.
Fixes #5222.
Tests: reloc/build_reloc.sh.
(cherry picked from commit cd075e9132)
gdb searches for libthread_db.so using its canonical name, libthread_db.so.1, rather
than the file name libthread_db-1.0.so, so use that name to store the file in the
archive.
Fixes #4996.
(cherry picked from commit d77171e10e)
Usually, a reconcilable_result holds very few partitions (1 is common),
since the page size is limited to 1MB. But if we have paging disabled or
if we are reconciling a range full of tombstones, we may see many more.
This can cause large allocations.
Change to chunked_vector to prevent those large allocations, as they
can be quite expensive.
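A toy illustration of the chunked-vector idea (not the real chunked_vector implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Storage is split into fixed-size chunks, so growing never needs one huge
// contiguous allocation; each allocation stays ChunkSize elements at most.
template <typename T, size_t ChunkSize = 1024>
struct toy_chunked_vector {
    std::vector<std::vector<T>> chunks;

    void push_back(T value) {
        if (chunks.empty() || chunks.back().size() == ChunkSize) {
            chunks.emplace_back();
            chunks.back().reserve(ChunkSize); // small, bounded allocation
        }
        chunks.back().push_back(std::move(value));
    }

    size_t size() const {
        size_t n = 0;
        for (const auto& c : chunks) {
            n += c.size();
        }
        return n;
    }
};
```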
Fixes #4780.
(cherry picked from commit 093d2cd7e5)
begin() of a const vector should return a const_iterator, to avoid
giving the caller the ability to mutate it.
This slipped through since iterator's constructor does a const_cast.
Noticed by code inspection.
(cherry picked from commit df6faae980)
Prerequisite for #4780.
Scylla currently crashes if we run manual operations like nodetool
compact with the controller disabled. While we neither like nor
recommend running with the controller disabled, due to some corner cases
in the controller algorithm we are not yet at the point in which we can
deprecate this and are sometimes forced to disable it.
The reason for the crash is that manual operations invoke
_backlog_of_shares, which returns the backlog needed to
create a certain number of shares. That scans the existing control
points, but when we run without the controller there are no control
points and we crash.
Backlog doesn't matter if the controller is disabled, and the return
value of this function is immaterial in this case. So to avoid the
crash, we return early if the controller is disabled.
Fixes #5016
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit c9f2d1d105)
"
Fixes #4540
This series adds proper handling of aggregation for paged indexed queries.
Before this series, returned results were presented to the user in a per-page
partial manner, while they should have been returned as a single aggregated
value.
Tests: unit(dev)
"
* 'add_proper_aggregation_for_paged_indexing_for_3.1' of https://github.com/psarna/scylla:
test: add 'eventually' block to index paging test
tests: add indexing+paging test case for clustering keys
tests: add indexing + paging + aggregation test case
tests: add query_options to cquery_nofail
cql3: make DEFAULT_COUNT_PAGE_SIZE constant public
cql3: add proper aggregation to paged indexing
cql3: add a query options constructor with explicit page size
cql3: enable explicit copying of query_options
cql3: split execute_base_query implementation
Without 'eventually', the test is flaky because the index can still
be out of date while its conditions are checked.
Fixes #4670
(cherry picked from commit ebbe038d19)
Indexed queries used to erroneously return partial per-page results
for aggregation queries. This test case used to reproduce the problem
and now ensures that there would be no regressions.
Refs #4540
Aggregated and paged filtering needs to aggregate the results
from all pages in order to avoid returning partial per-page
results. It's a little bit more complicated than regular aggregation,
because each paging state needs to be translated between the base
table and the underlying view. The routine keeps fetching pages
from the underlying view, which are then used to fetch base rows,
which go straight to the result set builder.
Fixes #4540
MV backpressure code frees mutations for delayed client replies early,
to save memory. The commit 2d7c026d6e that
introduced the logic claimed to do so only when all replies are received,
but this was not the case. Fix the code to free only when all replies
have really been received.
Fixes #5242
Message-Id: <20191113142117.GA14484@scylladb.com>
(cherry picked from commit 552c56633e)
For internal use, there already exists a query_options constructor
that copies data from another query_options with overwritten paging
state. This commit adds an option to overwrite page size as well.
In order to handle aggregation queries correctly, the function that
returns base query results is split into two, so it's possible to
access raw query results, before they're converted into end-user
CQL message.
Currently the NIC selection prompt in scylla_setup just proceeds with setup
when the user presses Enter at the prompt.
The prompt should ask for the NIC name again until the user inputs a correct NIC name.
Fixes #4517
Message-Id: <20190617124925.11559-1-syuu@scylladb.com>
(cherry picked from commit 7320c966bc)
Calculating the select statement for given view_info structure
used to work fine, but once local indexes were introduced, a subtle
bug appeared: the legacy token column does not exist in local indexes
and a valid clustering key column was omitted instead.
That results in potentially incorrect partition slices being used later
in read-before-write.
There's a long term plan for removing select_statement from
view info altogether, but nonetheless the bug needs to be fixed first.
(cherry picked from commit 9e98b51aaa)
Fixes #5241
Message-Id: <cb2e863e8e993e00ec7329505f737a9ce4b752ae.1572432826.git.sarna@scylladb.com>
There is a loop that collects the results of the checksum calculations and
logs any errors. The error logging includes `checksums[0]`, which corresponds
to the checksum calculation on the local node. This violates the
assumption of the code following the loop, which assumes that the future
of `checksums[0]` is intact after the loop terminates. However this is
only true when the checksum calculation is successful and is false when
it fails, as in this case the loop extracts the error and logs it. When
the code after the loop checks again whether said calculation failed, it
will get a false negative and will go ahead and attempt to extract the
value, triggering an assert failure.
Fix by making sure that even in the case of failed checksum calculation,
the result of `checksum[0]` is extracted only once.
Fixes: #5238
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191029151709.90986-1-bdenes@scylladb.com>
(cherry picked from commit e48f301e95)
Commit 93270dd changed gc_clock to be 64-bit, to fix the Y2038
problem. While 64-bit tombstone::deletion_time is serialized in a
compatible way, TTLs (gc_clock::duration) were not.
This patchset reverts TTL serialization to the 32-bit serialization
format, and also allows opting-in to the 64-bit format in case a
cluster was installed with the broken code. Only Scylla 3.1.0 is
vulnerable.
Fixes #4855
Tests: unit (dev)
(cherry picked from commit e621db591e)
"
Fixes#5134, Eviction concurrent with preempted partition entry update after
memtable flush may allow stale data to be populated into cache.
Fixes#5135, Cache reads may miss some writes if schema alter followed by a
read happened concurrently with preempted partition entry update.
Fixes#5127, Cache populating read concurrent with schema alter may use the
wrong schema version to interpret sstable data.
Fixes#5128, Reads of multi-row partitions concurrent with memtable flush may
fail or cause a node crash after schema alter.
"
* tag 'fix-cache-issues-with-schema-alter-and-eviction-v2' of github.com:tgrabiec/scylla:
tests: row_cache: Introduce test_alter_then_preempted_update_then_memtable_read
tests: row_cache_stress_test: Verify all entries are evictable at the end
tests: row_cache_stress_test: Exercise single-partition reads
tests: row_cache_stress_test: Add periodic schema alters
tests: memtable_snapshot_source: Allow changing the schema
tests: simple_schema: Prepare for schema altering
row_cache: Record upgraded schema in memtable entries during update
memtable: Extract memtable_entry::upgrade_schema()
row_cache, mvcc: Prevent locked snapshots from being evicted
row_cache: Make evict() not use invalidate_unwrapped()
mvcc: Introduce partition_snapshot::touch()
row_cache, mvcc: Do not upgrade schema of entries which are being updated
row_cache: Use the correct schema version to populate the partition entry
delegating_reader: Optimize fill_buffer()
row_cache, memtable: Use upgrade_schema()
flat_mutation_reader: Introduce upgrade_schema()
(cherry picked from commit 8ed6f94a16)
Currently, the population stat is not increased for entries that are
evicted immediately on insert, however the code that does the eviction
still decreases the population stat, leading to an imbalance and in some
cases the underflow of the population stat. To fix, unconditionally
increase the population stat upon inserting an entry, regardless of
whether it is immediately evicted or not.
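The counting discipline of the fix, as a toy model:

```cpp
#include <cassert>

// Count every inserted entry, even one evicted right away, so the matching
// decrement on eviction can never underflow the population stat.
struct population_counter {
    long population = 0;

    void on_insert(bool immediately_evicted) {
        ++population;              // unconditionally, as the fix requires
        if (immediately_evicted) {
            on_evict();
        }
    }

    void on_evict() { --population; }
};
```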
Fixes: #5123
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191001153215.82997-1-bdenes@scylladb.com>
(cherry picked from commit 00b432b61d)
"
Fix races that may lead to use-after-free events and file system level exceptions
during shutdown and drain.
The root cause of the use-after-free events in question is that space_watchdog blocks on
end_point_hints_manager::file_update_mutex(), and we need to make sure this mutex is alive as long as
it's accessed, even if the corresponding end_point_hints_manager instance
is destroyed in the context of manager::drain_for().
File system exceptions may occur when space_watchdog attempts to scan a
directory while it's being deleted from the drain_for() context.
In case of such an exception, new hints generation is going to be blocked,
including for materialized views, till the next space_watchdog round (in 1s).
Issues that are fixed are #4685 and #4836.
Tested as follows:
1) Patched the code in order to trigger the race with (a lot) higher
probability and running slightly modified hinted handoff replace
dtest with a debug binary for 100 times. Side effect of this
testing was discovering of #4836.
2) Using the same patch as above tested that there are no crashes and
nodes survive stop/start sequences (they were not without this series)
in the context of all hinted handoff dtests. Ran the whole set of
tests with dev binary for 10 times.
"
Fixes #4685
Fixes #4836
* 'hinted_handoff_race_between_drain_for_and_space_watchdog_no_global_lock-v2' of https://github.com/vladzcloudius/scylla:
hinted handoff: fix a race on a directory removal between space_watchdog and drain_for()
hinted handoff: make taking file_update_mutex safe
db::hints::manager::drain_for(): fix alignment
db::hints::manager: serialize calls to drain_for()
db::hints: cosmetics: indentation and missing method qualifier
(cherry picked from commit 3cb081eb84)
* seastar 7dfcf334c4...75488f6ef2 (2):
> net: socket::{set,get}_reuseaddr() should not be virtual
> Merge "fix some tcp connection bugs and add reuseaddr option to a client socket" from Gleb
Prerequisite for #4943.
Affects single-partition reads only.
Refs #5113
When executing a query on the replica we do several things in order to
narrow down the sstable set we read from.
For tables which use LeveledCompactionStrategy, we store sstables in
an interval set and we select only sstables whose partition ranges
overlap with the queried range. Other compaction strategies don't
organize the sstables and will select all sstables at this stage. The
reasoning behind this is that for non-LCS compaction strategies the
sstables' ranges will typically overlap and using interval sets in
this case would not be effective and would result in quadratic (in
sstable count) memory consumption.
The assumption of overlap does not hold if the sstables come from
repair or streaming, which generate non-overlapping sstables.
At a later stage, for single-partition queries, we use the sstables'
bloom filter (kept in memory) to drop sstables which surely don't
contain given partition. Then we proceed to sstable indexes to narrow
down the data file range.
Tables which don't use LCS will do unnecessary I/O to read index pages
for single-partition reads if the partition is outside of the
sstable's range and the bloom filter is ineffective (Refs #5112).
This patch fixes the problem by consulting sstable's partition range
in addition to the bloom filter, so that the non-overlapping sstables
will be filtered out with certainty and not depend on bloom filter's
efficiency.
It's also faster to drop sstables based on the keys than the bloom
filter.
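The combined filter, sketched with a simplified sstable stand-in:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Consult the sstable's own token range first, then the bloom filter.
// For non-overlapping sstables (e.g. from repair or streaming) the range
// check excludes them with certainty, independent of bloom filter quality.
struct sstable_meta {
    int64_t first_token;
    int64_t last_token;
    std::function<bool(int64_t)> bloom_may_contain;
};

bool may_contain_partition(const sstable_meta& sst, int64_t token) {
    if (token < sst.first_token || token > sst.last_token) {
        return false;                       // certain exclusion, and cheaper too
    }
    return sst.bloom_may_contain(token);    // probabilistic, may false-positive
}
```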
Tests:
- unit (dev)
- manual using cqlsh
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190927122505.21932-1-tgrabiec@scylladb.com>
(cherry picked from commit b0e0f29b06)
The method sstable::estimated_keys_for_range() was severely
under-estimating the number of partitions in an sstable for a given
token range.
The first reason is that it underestimated the number of sstable index
pages covered by the range, by one. In extreme, if the requested range
falls into a single index page, we will assume 0 pages, and report 1
partition. The reason is that we were using
get_sample_indexes_for_range(), which returns entries with the keys
falling into the range, not entries for pages which may contain the
keys.
A single page can have a lot of partitions though. By default, there
is a 1:20000 ratio between summary entry size and the data file size
covered by it. If partitions are small, that can be many hundreds of
partitions.
Another reason is that we underestimate the number of partitions in an
index page. We multiply the number of pages by:
(downsampling::BASE_SAMPLING_LEVEL * _components->summary.header.min_index_interval)
/ _components->summary.header.sampling_level
Using defaults, that means multiplying by 128. In the cassandra-stress
workload a single partition takes about 300 bytes in the data file and
summary entry is 22 bytes. That means a single page covers 22 * 20'000
= 440'000 bytes of the data file, which contains about 1'466
partitions. So we underestimate by an order of magnitude.
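The arithmetic from the paragraph above, spelled out:

```cpp
#include <cassert>
#include <cstdint>

// With a 22-byte summary entry, a 1:20000 summary-to-data ratio, and
// ~300-byte partitions, one index page covers far more than the 128
// partitions the old formula assumed.
constexpr int64_t summary_entry_bytes = 22;
constexpr int64_t summary_to_data_ratio = 20000;
constexpr int64_t partition_bytes = 300;

constexpr int64_t data_bytes_per_page = summary_entry_bytes * summary_to_data_ratio; // 440'000
constexpr int64_t partitions_per_page = data_bytes_per_page / partition_bytes;       // ~1'466
```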
Underestimating the number of partitions will result in too small
bloom filters being generated for the sstables which are the output of
repair or streaming. This will make the bloom filters ineffective
which results in reads selecting more sstables than necessary.
The fix is to base the estimation on the number of index pages which
may contain keys for the range, and multiply that by the average key
count per index page.
Fixes #5112.
Refs #4994.
The output of test_key_count_estimation:
Before:
count = 10000
est = 10112
est([-inf; +inf]) = 512
est([0; 0]) = 128
est([0; 63]) = 128
est([0; 255]) = 128
est([0; 511]) = 128
est([0; 1023]) = 128
est([0; 4095]) = 256
est([0; 9999]) = 512
est([5000; 5000]) = 1
est([5000; 5063]) = 1
est([5000; 5255]) = 1
est([5000; 5511]) = 1
est([5000; 6023]) = 128
est([5000; 9095]) = 256
est([5000; 9999]) = 256
est(non-overlapping to the left) = 1
est(non-overlapping to the right) = 1
After:
count = 10000
est = 10112
est([-inf; +inf]) = 10112
est([0; 0]) = 2528
est([0; 63]) = 2528
est([0; 255]) = 2528
est([0; 511]) = 2528
est([0; 1023]) = 2528
est([0; 4095]) = 5056
est([0; 9999]) = 10112
est([5000; 5000]) = 2528
est([5000; 5063]) = 2528
est([5000; 5255]) = 2528
est([5000; 5511]) = 2528
est([5000; 6023]) = 5056
est([5000; 9095]) = 7584
est([5000; 9999]) = 7584
est(non-overlapping to the left) = 0
est(non-overlapping to the right) = 0
Tests:
- unit (dev)
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190927141339.31315-1-tgrabiec@scylladb.com>
(cherry picked from commit b93cc21a94)
compaction_manager::perform_sstable_upgrade() fails when it feeds
the compaction mechanism with shared sstables. Shared sstables should
be ignored when performing an upgrade, and instead left for reshard to
pick them up in parallel. Whenever a shared sstable is brought up, either
on restart or via refresh, the reshard procedure kicks in.
Reshard picks the highest supported format so the upgrade for
shared sstable will naturally take place.
Fixes #5056.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190925042414.4330-1-raphaelsc@scylladb.com>
(cherry picked from commit 571fa94eb5)
"
Currently affects only counter tables.
Introduced in 27014a2.
mutation_partition(s, mp) is incorrect because it uses s to interpret
mp, while it should use mp_schema.
We may hit this if the current node has a newer schema than the
incoming mutation. This can happen during table schema altering when we receive the
mutation from a node which hasn't processed the schema change yet.
This is undefined behavior in general. If the alter was adding or
removing columns, this may result in corruption of the write where
values of one column are inserted into a different column.
Fixes #5095.
"
* 'fix-schema-alter-counter-tables' of https://github.com/tgrabiec/scylla:
mvcc: Fix incorrect schema version being used to copy the mutation when applying
mutation_partition: Track and validate schema version in debug builds
tests: Use the correct schema to access mutation_partition
(cherry picked from commit 83bc59a89f)
1. Add assert in remove_response_handler to make crashes like in #5032 easier to understand.
2. Look up the view_update_write_response_handler id before calling timeout_cb and tolerate it not being found.
Just log a warning if this happens.
Fixes #5032
(cherry picked from commit 06b9818e98)
When the toppartitions operation gathers results, it copies partition
keys with their schema_ptr:s. When these schema_ptr:s are copied
or destroyed, they can cause leaks or premature frees of the schema
on their original shard, since reference count operations are not atomic.
Fix that by converting the schema_ptr to a global_schema_ptr during
transportation.
Fixes #5104 (direct bug)
Fixes #5018 (schema prematurely freed, toppartitions previously executed on that node)
Fixes #4973 (corrupted memory pool of the same size class as schema, toppartitions previously executed on that node)
Tests: new test added that fails with the existing code in debug mode,
manual toppartitions test
(cherry picked from commit 5b0e48f25b)
std::regex_match of the leading path may run out of stack
with long paths in debug build.
Use rfind instead to look up the last '/' in the pathname
and skip it if found.
Fixes #4464
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190505144133.4333-1-bhalevy@scylladb.com>
(cherry picked from commit d9136f96f3)
When a node is restarted, there is a race between gossip starting (other
nodes will mark this node up again and send requests) and the tokens being
replicated to other shards. Here is an example:
- n1, n2
- n2 is down, n1 thinks n2 is down
- n2 starts again, n2 starts gossip service, n1 thinks n2 is up and sends
reads/writes to n2, but n2 hasn't replicated the token_metadata to all
the shards.
- n2 complains:
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
token_metadata - sorted_tokens is empty in first_token_index!
storage_proxy - Failed to apply mutation from $ip#4: std::runtime_error
(sorted_tokens is empty in first_token_index!)
The code path looks like below:
0 storage_service::init_server
1 prepare_to_join()
2 add gossip application state of NET_VERSION, SCHEMA and so on.
3 _gossiper.start_gossiping().get()
4 join_token_ring()
5 _token_metadata.update_normal_tokens(tokens, get_broadcast_address());
6 replicate_to_all_cores().get()
7 storage_service::set_gossip_tokens() which adds the gossip application state of TOKENS and STATUS
The race described above is between line 3 and line 6.
To fix, we can replicate the token_metadata early, after it is filled
with the tokens read from the system table, before gossip starts. That way,
when other nodes think this restarting node is up, the tokens are
already replicated to all the shards.
In addition, this patch also fixes the issue that other nodes might see
a node missing the TOKENS and STATUS application states in gossip if that
node failed in the middle of a restart, i.e., it was killed
after line 3 and before line 7. As a result we could not replace the
node.
Tests: update_cluster_layout_tests.py
Fixes: #4709
Fixes: #4723
(cherry picked from commit 3b39a59135)
A recent fix to #3767 limited the number of ranges that
can be returned from query_ranges_to_vnodes_generator. Combined
with a large number of token ranges, this can lead to
infinite recursion. The algorithm multiplies the number of requested
tokens by a factor of 2 (actually a shift left by one) in each
recursion iteration. As long as the requested number of ranges is
greater than 0, the recursion is implicit, and each call is scheduled
separately since the call is inside a continuation of a map reduce.
But if the number of iterations is large enough (~32), the
counter for requested ranges zeroes out, and from that moment on
two things will happen:
1. The counter will remain 0 forever (0*2 == 0)
2. The map reduce future will be immediately available and this
will result in the continuation being invoked immediately.
The latter causes the recursive call to be a "regular" recursive call,
i.e. through the stack and not the task queue of the scheduler, and
the former causes this recursion to be infinite.
The combination creates a stack that keeps growing and eventually
overflows resulting in undefined behavior (due to memory overrun).
This patch prevents the problem from happening: it limits the growth of
the concurrency counter to twice the last number of tokens returned
by query_ranges_to_vnodes_generator, and also makes sure it does not
get stuck at zero.
Testing: * Unit test in dev mode.
* Modified add 50 dtest that reproduce the problem
Fixes #4944
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190922072838.14957-1-eliransin@scylladb.com>
(cherry picked from commit 280715ad45)
This reverts commit 7f64a6ec4b.
Fixes #5011
The reverted commit exposes #3760 for all schemas, not only those
which have UDTs.
The problem is that table schema deserialization now requires keyspace
to be present. If the replica hasn't received schema changes which
introduce the keyspace yet, the write will fail.
(cherry picked from commit 8517eecc28)
Stopping services, which occurs in the destructor of deferred_action,
should not throw, or it will end the program with
terminate(). View builder breaks a semaphore during its shutdown,
which results in propagating a broken_semaphore exception,
which in turn results in throwing an exception during stop().get().
In order to fix that issue, semaphore exceptions are explicitly
ignored, since they're expected to appear during shutdown.
Fixes #4875
Fixes #4995.
(cherry picked from commit 23c891923e)
Currently when an error happens during the receive and distribute phase
it is swallowed and we just return a -1 status to the remote. We only
log errors that happen during responding with the status. This means
that when streaming fails, we only know that something went wrong, but
the node on which the failure happened doesn't log anything.
Fix by also logging errors happening in the receive and distribute
phase. Also mention the phase in which the error happened in both error
log messages.
Refs: #4901
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190903115735.49915-1-bdenes@scylladb.com>
(cherry picked from commit 783277fb02)
The previous code was not exception safe and would eventually cause a
file to be destroyed without being closed, causing an assert failure.
Unfortunately it doesn't seem to be possible to test this without
error injection, since using an invalid directory fails before this
code is executed.
Fixes #4948
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190904002314.79591-1-espindola@scylladb.com>
(cherry picked from commit 000514e7cc)
on_down() iterates over _view_update_handlers_list, but it yields during iteration,
and while it yields, elements in that list can be removed, resulting in a
use-after-free.
Prevent this by registering iterators that can be potentially invalidated, and
any time we remove an element from the list, check whether we're removing an element
that is being pointed to by a live iterator. If that is the case, advance the iterator
so that it points at a valid element (or at the end of the list).
Fixes #4912.
Tests: unit (dev)
(cherry picked from commit 301246f6c0)
Before ede1d248af, running "tools/toolchain/dbuild -it -- bash" was
a nice way to play in the toolchain environment, for example to start
a debugger. But that commit caused containers to run in detached mode,
which is incompatible with interactive mode.
To restore the old behavior, detect that the user wants interactive mode,
and run the container in non-detached mode instead. Add the --rm flag
so the container is removed after execution (as it was before ede1d248af).
Fixes #4930.
Message-Id: <20190506175942.27361-1-avi@scylladb.com>
(cherry picked from commit db536776d9)
Introduced in c96ee98.
We call update_schema_version() after features are enabled and we
recalculate the schema version. This method is not updating gossip
though. The node will still use its database::version() to decide on
syncing, so it will not sync and will stay inconsistent in gossip until the
next schema change.
We should call update_schema_version_and_announce() instead so that
the gossip state is also updated.
There is no actual schema inconsistency, but the joining node will
think there is and will wait indefinitely. Making a random schema
change would unblock it.
Fixes #4647.
Message-Id: <1566825684-18000-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit ac5ff4994a)
row::append_cell() has a precondition that the new cell column id needs
to be larger than that of any other already existing cell. If this
precondition is violated the row will end up in an invalid state. This
patch adds an assertion to make sure we fail early in such cases.
(cherry picked from commit 060e3f8ac2)
"
Streamed view updates parasitized on writing io priority, which is
reserved for user writes - it's now properly bound to streaming
write priority.
Verified manually by checking appropriate io metrics: scylla_io_queue_total_bytes{class="streaming_write" ...} vs scylla_io_queue_total_bytes{class="query" ...}
Tests: unit(dev)
"
Fixes #4615.
* 'assign_proper_io_priority_to_streaming_view_updates' of https://github.com/psarna/scylla:
db,view: wrap view update generation in stream scheduling group
database: assign proper io priority for streaming view updates
(cherry picked from commit 2c7435418a)
The loop over view update handlers used a guard in order to ensure
that the object is not prematurely destroyed (thus invalidating
the iterator), but the guard itself was not in the right scope.
Fixed by replacing the 'for' loop with a 'while' loop, which moves
the iterator increment inside the scope in which it's still
guarded and valid.
Fixes #4866
(cherry picked from commit 526f4c42aa)
Our current relocation works by invoking the dynamic linker with the
executable as an argument. This confuses gdb since the kernel records
the dynamic linker as the executable, not the real executable.
Switch to install-time relocation with patchelf: when installing the
executable and libraries, all paths are known, and we can update the
path to the dynamic loader and to the dynamic libraries.
Since patchelf itself is dynamically linked, we have to relocate it
dynamically (with the old method of invoking it via the dynamic linker).
This is okay since it's a one-time operation and since we don't expect
to debug core dumps of patchelf crashes.
We lose the ability to run scylla directly from the uninstalled
tarball, but since the nonroot installer is already moving in the
direction of requiring install.sh, that is not a great loss, and
certainly the ability to debug is more important.
dh_strip barfs on some binaries which were treated with patchelf,
so exclude them from dh_strip. This doesn't lose any functionality,
since these binaries didn't have debug information to begin with
(they are already-stripped Fedora executables).
Fixes #4673.
(cherry picked from commit 698b72b501)
Backport notes:
- 3.1 doesn't call install.sh from the debian packager, so add an adjust_bin
and call it from the debian rules file directly
- adjusted install.sh for 3.1 prefix (/usr) compared to master prefix (/opt/scylladb)
"Commit e3f7fe4 added file owner validation to prevent Scylla from
crashing when it tries to touch a file it doesn't own. However, under
docker, we cannot expect to pass this check since user IDs are from
different namespaces: the process runs in a container namespace, but the
data files usually come from a mounted volume, and so their uids are
from the host namespace.
So we need to relax the check. We do this by reverting b1226fb, which
causes Scylla to run as euid 0 in docker, and by special-casing euid 0
in the ownership verification step.
Fixes #4823."
* 'docker-euid-0' of git://github.com/avikivity/scylla:
main: relax file ownership checks if running under euid 0
Revert "dist/docker/redhat: change user of scylla services to 'scylla'"
(cherry picked from commit 595434a554)
Make the reader recreation logic more robust, by moving away from
deciding which fragments have to be dropped based on a bunch of
special cases, instead replacing this with a general logic which just
drops all already seen fragments (based on their position). Special
handling is added for the case when the last position is a range
tombstone with a non-full prefix starting position. Reproducer unit
tests are added for both cases.
Refs #4695
Fixes #4733
(cherry picked from commit 0cf4fab2ca)
Command line arguments are parsed twice in Scylla: once in main and once in Seastar's app_template::run.
The first parse is there to check if the "--version" flag is present --- in this case the version is printed
and the program exits. The second parsing is correct; however, most of the arguments were improperly treated
as positional arguments during the first parsing (e.g., "--network host" would treat "host" as a positional argument).
This happened because the arguments weren't known to the command line parser.
This commit fixes the issue by moving the parsing code to after the arguments are registered.
Resolves #4141.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
(cherry picked from commit f155a2d334)
We were using segment::_closed to decide whether _file was already
closed. Unfortunately they are not exactly the same thing. As far as
I understand it, segments can be closed and reused without actually
closing the file.
Found with a seastar patch that asserts on destroying an open
append_challenged_posix_file_impl.
Fixes #4745.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190721171332.7995-1-espindola@scylladb.com>
(cherry picked from commit 636e2470b1)
"
scylla_setup is currently broken for OEL. This happens because the
OS detection code checks for RHEL and Fedora. CentOS reports itself
as RHEL, but OEL does not.
"
Fixes #4842.
* 'unbreakable' of github.com:glommer/scylla:
scylla_setup: be nicer about unrecognized OS
scylla_util: recognize OEL as part of the RHEL family
(cherry picked from commit 1cf72b39a5)
"
Not emitting partition_end for a partition is incorrect. SStable
writer assumes that it is emitted. If it's not, the sstable will not
be written correctly. The partition index entry for the last partition
will be left partially written, which will result in errors during
reads. Also, statistics and sstable key ranges will not include the
last partition.
It's better to catch this problem at the time of writing, and not
generate bad sstables.
Another way of handling this would be to implicitly generate a
partition_end, but I don't think that we should do this. We cannot
trust the mutation stream when invariants are violated, we don't know
if this was really the last partition which was supposed to be
written. So it's safer to fail the write.
Enabled for both mc and la/ka.
Passing --abort-on-internal-error on the command line will switch to
aborting instead of throwing an exception.
The reason we don't abort by default is that it may bring the whole
cluster down and cause unavailability, while it may not be necessary
to do so. It's safer to fail just the affected operation,
e.g. repair. However, failing the operation with an exception leaves
little information for debugging the root cause. So the idea is that the
user would enable aborts on only one of the nodes in the cluster to
get a core dump and not bring the whole cluster down.
"
* 'catch-unclosed-partition-sstable-write' of https://github.com/tgrabiec/scylla:
sstables: writer: Validate that partition is closed when the input mutation stream ends
config, exceptions: Add helper for handling internal errors
utils: config_file: Introduce named_value::observe()
(cherry picked from commit 95c0804731)
If a node is a seed node, it cannot be started with the
replace-address-first-boot or replace-address flag.
The issue is that, as a seed node, it will generate new tokens instead of
replacing the existing ones the user expects it to replace when supplying
the flags.
This patch will throw a bad_configuration_error exception
in this case.
Fixes #3889
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 399d79fc6f)
It shouldn't rely on argument evaluation order, which is UB.
Fixes #4718.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 0e732ed1cf)
Fixes a segfault when querying for an empty keyspace.
Also, fixes an infinite loop on smp > 1. Queries to
system.size_estimates table which are not single-partition queries
caused Scylla to go into an infinite loop inside
multishard_combining_reader::fill_buffer. This happened because
multishard_combining_reader assumes that shards return rows belonging
to separate partitions, which was not the case for
size_estimates_mutation_reader.
Fixes #4689.
(cherry picked from commit 14700c2ac4)
Avoid including the lengthy stream_session.hh in messaging_service.
More importantly, fix the build because currently messaging_service.cc
and messaging_service.hh do not include stream_mutation_fragments_cmd.
I am not sure why it builds on my machine. Spotted this when backporting
the "streaming: Send error code from the sender to receiver" to 3.0
branch.
Refs: #4789
(cherry picked from commit 49a73aa2fc)
In case of error on the sender side, the sender does not propagate the
error to the receiver. The sender will close the stream. As a result,
the receiver will get nullopt from the source in
get_next_mutation_fragment and pass mutation_fragment_opt with no value
to the generating_reader. In turn, the generating_reader generates end
of stream. However, the last element that the generating_reader has
generated can be any type of mutation_fragment. This makes the sstable
that consumes the generating_reader violate the mutation_fragment
stream rule.
To fix, we need to propagate the error. However, RPC streaming does not
support propagating errors in the framework; the user has to send an error
code explicitly.
Fixes: #4789
(cherry picked from commit bac987e32a)
(cherry picked from commit 288371ce75)
Currently, if there is a fragment in _ready and _out_of_range was set
after the row end was consumed, push_ready_fragments() would return
without emitting partition_end.
This is problematic once we make consume_row_start() emit
partiton_start directly, because we will want to assume that all
fragments for the previous partition are emitted by then. If they're
not, then we'd emit partition_start before partition_end for the
previous partition. The fix is to make sure that
push_ready_fragments() emits everything.
Fixes #4786
(cherry picked from commit 9b8ac5ecbc)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Data listener reads are implemented as flat_mutation_readers, which
take a reference to the listener and then execute asynchronously.
The listener can be removed between the time when the reference is
taken and actual execution, resulting in a dangling pointer
dereference.
Fix by using a weak_ptr to avoid writing to a destroyed object. Note that writes
don't need protection because they execute atomically.
Fixes #4661.
Tests: unit (dev)
(cherry picked from commit e03c7003f1)
If we had an error while reading, then we would have failed to close
the reader, which in turn can cause memory corruption. Make the
closing more robust by using then_wrapped (which doesn't skip on
exception) and log the error for analysis.
Fixes #4761.
(cherry picked from commit b272db368f)
streaming_reader_lifecycle_policy::create_reader() was ignoring the
partition_slice passed to it and always creating the reader for the
full slice.
That's wrong because create_reader() is called when recreating a
reader after it's evicted. If the reader stopped in the middle of
partition we need to start from that point. Otherwise, fragments in
the mutation stream will appear duplicated or out of order, violating
assumptions of the consumers.
This was observed to result in repair writing incorrect sstables with
duplicated clustering rows, which results in
malformed_sstable_exception on read from those sstables.
Fixes #4659.
In v2:
- Added an overload without partition_slice to avoid changing existing users which never slice
Tests:
- unit (dev)
- manual (3 node ccm + repair)
Backport: 3.1
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1563451506-8871-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 7604980d63)
"
disable_sstable_write needs to acquire _sstable_deletion_sem to properly synchronize
with background deletions done by on_compaction_completion to ensure no sstables will
be created or deleted during reshuffle_sstables after
storage_service::load_new_sstables disables sstable writes.
Fixes #4622
Test: unit(dev), nodetool_additional_test.py migration_test.py
"
* 'scylla-4622-fix-disable-sstable-write' of https://github.com/bhalevy/scylla:
table: document _sstables_lock/_sstable_deletion_sem locking order
table: disable_sstable_write: acquire _sstable_deletion_sem
table: uninline enable_sstable_write
table: reshuffle_sstables: add log message
(cherry picked from commit 43690ecbdf)
Start n1, n2
Create ks with rf = 2
Run repair on n2
Stop n2 in the middle of repair
n1 will notice n2 is DOWN, gossip handler will remove repair instance
with n2 which calls remove_repair_meta().
Inside remove_repair_meta(), we have:
```
1 return parallel_for_each(*repair_metas, [repair_metas] (auto& rm) {
2 return rm->stop();
3 }).then([repair_metas, from] {
4 rlogger.debug("Removed all repair_meta for single node {}", from);
5 });
```
Since 3.1, we start 16 repair instances in parallel which will create 16
readers. The reader semaphore allows only 10 concurrent readers.
At line 2, it calls
```
6 future<> stop() {
7 auto gate_future = _gate.close();
8 auto writer_future = _repair_writer.wait_for_writer_done();
9 return when_all_succeed(std::move(gate_future), std::move(writer_future));
10 }
```
The gate protects the reader to read data from disk:
```
11 with_gate(_gate, [] {
12 read_rows_from_disk
13 return _repair_reader.read_mutation_fragment() --> calls reader() to read data
14 })
```
So line 7 won't return until all the 16 readers return from the call of
reader().
The problem is, the reader won't release the reader semaphore until the
reader is destroyed!
So, even if 10 out of the 16 readers have finished reading, they won't
release the semaphore. As a result, the stop() hangs forever.
To fix in the short term, we can delete the reader, i.e., drop the
repair_meta object once it is stopped.
Refs: #4693
(cherry picked from commit 8774adb9d0)
Given a list of ranges to stream, stream_transfer_task will create a
reader with the ranges and create an rpc stream connection on all the shards.
When the user provides ranges to repair with the -st and -et options, e.g.,
using scylla-manager, such ranges can belong to only one shard; repair
will pass such ranges to streaming.
As a result, only one shard will have data to send while the rpc stream
connections are created on all the shards, which can cause the kernel to
run out of ports on some systems.
To mitigate the problem, do not open the connection if the ranges do not
belong to the shard at all.
Refs: #4708
(cherry picked from commit 64a4c0ede2)
Now it accepts the 'z' or 'Z' timezone, denoting UTC+00:00.
Fixes #4641.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
(cherry picked from commit 4417e78125)
When scylla is started for the first time with PasswordAuthenticator
enabled, it can happen that the record of the default superuser
is created in the table with can_login and is_superuser
set to null. This happens because the module in charge of creating
the row is the role manager, while the module in charge of setting the
default password salted hash value is the password authenticator.
Those two modules are started together. If the password
authenticator finishes its initialization first, then in the
period until the role manager completes its initialization, the row
contains those null columns, and any login attempt in this period
will cause a memory access violation since those columns are not
expected to ever be null. This patch removes the race by starting
the password authenticator and authorizer only after the role manager
has finished its initialization.
Tests:
1. Unit tests (release)
2. Auth and cqlsh auth related dtests.
Fixes #4226
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <20190714124839.8392-1-eliransin@scylladb.com>
(cherry picked from commit 997a146c7f)
In scylla-debuginfo package, we have /usr/lib/debug/opt/scylladb/libreloc/libthread_db-1.0.so-666.development-0.20190711.73a1978fb.el7.x86_64.debug
but we actually do not have libthread_db.so.1 in /opt/scylladb/libreloc
since it does not appear in the ldd output for the scylla binary.
To debug threads, we need to add the library to the relocatable package manually.
Fixes #4673
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190711111058.7454-1-syuu@scylladb.com>
(cherry picked from commit 842f75d066)
Since commit bb56653 (repair: Sync schema from follower nodes before
repair), the behaviour of handling a down node during repair has
changed. That is, if a repair follower is down, repair will fail to sync
schema with it and the repair of the range will be skipped. This means
a range cannot be repaired unless all the nodes holding the replicas are up.
To fix, we filter out the nodes that are down, mark the repair as
partial, and repair with the nodes that are still up.
Tests: repair_additional_test:RepairAdditionalTest.repair_with_down_nodes_2b_test
Fixes: #4616
Backports: 3.1
Message-Id: <621572af40335cf5ad222c149345281e669f7116.1562568434.git.asias@scylladb.com>
(cherry picked from commit 39ca044dab)
This fixes a possible cause of #4614.
From the backtrace in that issue, it looks like a file is being closed
twice. The first point in the backtrace where that seems likely is in
the MC writer.
My first idea was to add a writer::close and make it the responsibility
of the code using the writer to call it. That way we would move work
out of the destructor.
That is a bit hard since the writer is destroyed from
flat_mutation_reader::impl::~consumer_adapter and that would need to
get a close function too.
This patch instead just fixes an exception safety issue. If
_index_writer->close() throws, _index_writer is still valid and
~writer will try to close it again.
If the exception was thrown after _completed.set_value(), that would
explain the assert about _completed.set_value() being called twice.
With this patch the path outside of the destructor now moves the
writer to a local variable before trying to close it.
Fixes #4614
Message-Id: <20190710171747.27337-1-espindola@scylladb.com>
(cherry picked from commit 281f3a69f8)
The view builder is started only if it's enabled in config,
via the view_building=true variable. Unfortunately, stopping
the builder was unconditional, which may result in failed
assertions during shutdown. To remedy this, view building
is stopped only if it was previously started.
Fixes #4589
(cherry picked from commit efa7951ea5)
This series makes sure new schema is propagated to repair master and
follower nodes before repair.
Fixes #4575
* dev.git asias/repair_pull_schema_v2:
migration_manager: Add sync_schema
repair: Sync schema from follower nodes before repair
(cherry picked from commit 269e65a8db)
The repair_rows in row_list are sorted. It is only possible for the
current repair_row to share the same partition key with the last
repair_row inserted into repair_rows_on_wire. So there is no need to search
from the beginning of repair_rows_on_wire, which avoids quadratic complexity.
To fix, look at the last item in repair_rows_on_wire.
Fixes #4580
Message-Id: <08a8bfe90d1a6cf16b67c210151245879418c042.1561001271.git.asias@scylladb.com>
(cherry picked from commit b99c75429a)
This patch set fixes repair nodes using different schema versions and
optimizes the hashing, thanks to the fact that all nodes now use the same
schema version.
Fixes: #4549
* seastar-dev.git asias/repair_use_same_schema.v3:
repair: Use the same schema version for repair master and followers
repair: Hash column kind and id instead of column name and type name
(cherry picked from commit cd1ff1fe02)
"
Fixes #4569
This series fixes the infinite paging for indexed queries issue.
Before this fix, paging with indexes tended to end up in an infinite loop
of returning pages with 0 results but the has_more_pages flag set to true,
which confused the drivers.
Tests: unit(dev)
Branches: 3.0, 3.1
"
* 'fix_infinite_paging_for_indexed_queries' of https://github.com/psarna/scylla:
tests: add test case for finishing index paging
cql3: fix infinite paging for indexed queries
(cherry picked from commit 9229afe64f)
Recently, in merge commit 2718c90448,
we added the ability to cancel pending view-update requests when we detect
that the target node went down. This is important for view updates because
these have a very long timeout (5 minutes), and we wanted to make this
timeout even longer.
However, the implementation caused a race: Between *creating* the update's
request handler (create_write_response_handler()) and actually starting
the request with this handler (mutate_begin()), there is a preemption point
and we may end up deleting the request handler before starting the request.
So mutate_begin() must gracefully handle the case of a missing request
handler, and not crash with a segmentation fault as it did before this patch.
Eventually the lifetime management of request handlers could be refactored
to avoid this delicate fix (which requires more comments to explain than
code), or even better, it would be more correct to cancel individual writes
when a node goes down, not drop the entire handler (see issue #4523).
However, for now, let's not make such invasive changes and just fix the bug
that we set out to fix.
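A minimal sketch of the defensive lookup (names and types are hypothetical; the real code deals with seastar futures and response-handler ids):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical simplified registry of in-flight write handlers.
struct response_handler { std::string target; };

struct handler_registry {
    std::unordered_map<int, response_handler> handlers;

    // mutate_begin()-style entry point: between handler creation and this
    // call there is a preemption point, so the handler may already have
    // been cancelled (e.g. because the target node went down).
    bool begin(int handler_id) {
        auto it = handlers.find(handler_id);
        if (it == handlers.end()) {
            return false;   // handler gone: report failure, don't crash
        }
        // ... start the request using it->second ...
        return true;
    }
};
```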
Fixes #4386.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190620123949.22123-1-nyh@scylladb.com>
(cherry picked from commit 6e87bca65d)
The code that decides whether a query should use indexing was buggy - a partition key index might have influenced the decision even if the whole partition key was passed in the query (which effectively means that indexing is not necessary).
Fixes #4539
Closes https://github.com/scylladb/scylla/pull/4544
Merged from branch 'fix_deciding_whether_a_query_uses_indexing' of git://github.com/psarna/scylla
tests: add case for partition key index and filtering
cql3: fix deciding if a query uses indexing
(cherry picked from commit 6aab1a61be)
When a column is not present in the select clause, but used for
filtering, it usually needs to be fetched from replicas.
Sometimes it can be avoided, e.g. if primary key columns form a valid
prefix - then, they will be optimized out before filtering itself.
However, clustering key prefix can only be qualified for this
optimization if the whole partition key is restricted - otherwise
the clustering columns still need to be present for filtering.
This commit also fixes tests in the cql_query_test suite, because they now
expect more values: columns fetched for filtering will be present as
well (only internally; clients receive only the data they asked for).
Fixes #4541
Message-Id: <f08ebae5562d570ece2bb7ee6c84e647345dfe48.1560410018.git.sarna@scylladb.com>
(cherry picked from commit adeea0a022)
Most tests await the result of cql_test_env::execute_cql(). Most
would also benefit from reporting errors with top-level location
included.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
(cherry picked from commit a9849ecba7)
Consider
master: row(pk=1, ck=1, col=10)
follower1: row(pk=1, ck=1, col=20)
follower2: row(pk=1, ck=1, col=30)
When repair runs, master fetches row(pk=1, ck=1, col=20) and row(pk=1,
ck=1, col=30) from follower1 and follower2.
Then repair master sends row(pk=1, ck=1, col=10) and row(pk=1, ck=1,
col=30) to follower1, follower1 will write the row with the same
pk=1, ck=1 twice, which violates uniqueness constraints.
To fix, we apply a row with the same pk and ck onto the previous row.
We only need this on the repair follower because the rows can come from
multiple nodes. On the repair master, we have an sstable writer per
follower, so the rows fed into each sstable writer come from only a
single node.
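The follower-side merge can be sketched like this (simplified stand-in types; `std::max` on the value stands in for the real mutation merge):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical simplified row: (pk, ck) key plus a value; merging two rows
// with the same key stands in for applying one mutation onto another.
struct row { int pk; int ck; int col; };

// Rows arrive sorted by (pk, ck). If the incoming row has the same key as
// the previous one, fold it into that row instead of emitting a duplicate,
// which would violate the (pk, ck) uniqueness constraint on disk.
void apply_row(std::vector<row>& out, const row& r) {
    if (!out.empty() && out.back().pk == r.pk && out.back().ck == r.ck) {
        out.back().col = std::max(out.back().col, r.col); // merge stand-in
    } else {
        out.push_back(r);
    }
}
```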
Tests: repair_additional_test.py:RepairAdditionalTest.repair_same_row_diff_value_3nodes_test
Fixes: #4510
Message-Id: <cb4fbba1e10fb0018116ffe5649c0870cda34575.1560405722.git.asias@scylladb.com>
(cherry picked from commit 9079790f85)
On the repair follower node, only the decorated_key_with_hash and the
mutation_fragment inside repair_row are used in apply_rows() to apply
the rows to disk. Allow repair_row to be partially initialized and, to be
safe, throw if an uninitialized member is accessed.
Message-Id: <b4e5cc050c11b1bafcf997076a3e32f20d059045.1560405722.git.asias@scylladb.com>
(cherry picked from commit 912ce53fc5)
Before this patch the mc sstables writer was ignoring
empty cellpaths. This is wrong behaviour because
it is possible to have an empty key in a map. In such a case,
our writer creates a broken sstable that we can't read back.
This is because a complex cell expects a cellpath for each
simple cell it has. When the writer ignores an empty cellpath
it writes nothing, whereas it should write a length
of zero to the file so that we know there's an empty cellpath.
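A sketch of the length-prefixed idea (a toy format, not the actual mc serialization):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy length-prefixed serialization of cell paths (map keys). The bug was
// skipping empty paths entirely; the reader then misparses the cell.
// Writing an explicit zero length keeps writer and reader in sync.
void write_cell_path(std::vector<uint8_t>& out, const std::string& path) {
    out.push_back(static_cast<uint8_t>(path.size()));  // 0 for an empty key
    out.insert(out.end(), path.begin(), path.end());
}

std::vector<std::string> read_cell_paths(const std::vector<uint8_t>& in,
                                         size_t count) {
    std::vector<std::string> paths;
    size_t pos = 0;
    for (size_t i = 0; i < count; ++i) {
        size_t len = in[pos++];
        paths.emplace_back(in.begin() + pos, in.begin() + pos + len);
        pos += len;
    }
    return paths;
}
```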
Fixes #4533
Tests: unit(release)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <46242906c691a56a915ca5994b36baf87ee633b7.1560532790.git.piotr@scylladb.com>
(cherry picked from commit a41c9763a9)
On branch-3.1 / master, we are getting following error:
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/data: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/hints: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/commitlog: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
ERROR 2019-06-11 10:58:49,156 [shard 0] database - /var/lib/scylla/view_hints: File not owned by current euid: 0. Owner is: 999
ERROR 2019-06-11 10:58:49,156 [shard 0] init - Failed owner and mode verification: std::runtime_error (File not owned by current euid: 0. Owner is: 999)
It seems that owner verification of the data directory fails because
the scylla-server process is running as root while the data directory is
owned by scylla, so we should run the services as the scylla user.
Fixes #4536
Message-Id: <20190611113142.23599-1-syuu@scylladb.com>
(cherry picked from commit b1226fb15a)
Fixes #4525
req_param uses boost::lexical_cast to convert text to the target type.
However, lexical_cast does not handle textual booleans,
so param=true causes not only wrong values but also
exceptions.
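The workaround can be sketched in plain C++ (`param_to_bool` is a hypothetical helper; the real fix targets the boost::lexical_cast call sites):

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Sketch of the problem class: parse a request parameter into bool.
// boost::lexical_cast<bool>("true") throws, because it only accepts "0"/"1".
// Handle textual booleans explicitly before falling back to numeric parsing.
bool param_to_bool(const std::string& s) {
    if (s == "true") return true;
    if (s == "false") return false;
    std::istringstream in(s);
    bool v = false;
    if (!(in >> v)) {
        throw std::invalid_argument("not a boolean: " + s);
    }
    return v;
}
```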
Message-Id: <20190610140511.15478-1-calle@scylladb.com>
(cherry picked from commit 26702612f3)
If a port value is passed as a string, cluster.connect() fails
with Python 3.4.
Let's fix this by explicitly declaring the 'port' argument as 'int'.
Fixes #4527
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190606133321.28225-1-vladz@scylladb.com>
(cherry picked from commit 20a610f6bc)
The relocatable Python is built from Fedora packages. Unfortunately TLS
certificates are in a different location on Debian variants, which
causes "node_exporter_install" to fail as follows:
Traceback (most recent call last):
File "/usr/lib/scylla/libexec/node_exporter_install", line 58, in <module>
data = curl('https://github.com/prometheus/node_exporter/releases/download/v{version}/node_exporter-{version}.linux-amd64.tar.gz'.format(version=VERSION), byte=True)
File "/usr/lib/scylla/scylla_util.py", line 40, in curl
with urllib.request.urlopen(req) as res:
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 543, in _open
'_open', req)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 1360, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/opt/scylladb/python3/lib64/python3.7/urllib/request.py", line 1319, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>
Unable to retrieve version information
node exporter setup failed.
Fix the problem by overriding the SSL_CERT_FILE environment variable to
point to the correct location of the TLS bundle.
Message-Id: <20190604175434.24534-1-penberg@scylladb.com>
(cherry picked from commit eb00095bca)
perf_fast_forward is used to detect performance regressions. The two
main metrics used for this are fragments per second and the number of
IO operations. The former is a median of several runs, but the
latter is just the actual number of asynchronous IO operations performed
in the run that happened to be picked as the median frag/s-wise. There is
not always a direct correlation between frag/s and aio, and the aio count
can vary, which makes it hard to compare.
In order to make this easier a new metric was introduced: "average aio"
which reports the average number of asynchronous IO operations performed
in a run. This should produce much more stable results and therefore
make the comparison more meaningful.
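The two aggregations can be sketched as:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch: frag/s is reported as the median over runs, but the aio count of
// the single median run is noisy; averaging aio over all runs is stabler.
double median(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    size_t n = v.size();
    return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

double average(const std::vector<double>& v) {
    double sum = 0;
    for (double x : v) sum += x;
    return sum / v.size();
}
```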
Message-Id: <20190430134401.19238-1-pdziepak@scylladb.com>
(cherry picked from commit 51e98e0e11)
Unlike CentOS, Debian variants have a python3 package in the official
repository, so we don't have to use relocatable python3 on these
distributions.
However, the official python3 version differs between distributions,
which may cause issues.
Also, our scripts and packaging implementation increasingly presuppose
the existence of relocatable python3, which causes issues on Debian
variants.
Switching to relocatable python3 on Debian variants avoids these issues
and makes it easier to manage Scylla python3 environments across multiple
distributions.
Fixes #4495
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190531112707.20082-1-syuu@scylladb.com>
(cherry picked from commit 25112408a7)
With incremental compaction, new sstables may have already replaced old
sstables at any point, meaning that a new sstable is in use by the table
and an old sstable is already deleted while the compaction itself is
UNFINISHED.
Therefore, we should *NEVER* delete a new sstable unconditionally for an
interrupted compaction, or data loss could happen.
To fix it, we'll only delete new sstables that didn't replace anything
in the table, meaning they are unused.
Found the problem while auditing the code.
Fixes #4479.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190506134723.16639-1-raphaelsc@scylladb.com>
(cherry picked from commit ef5681486f)
Users may want to know which versions of packages are used for the AMI;
it's good to have them in the AMI tags and description.
To do this, we need to download the .rpm from the specified .repo and
extract version information from the .rpm.
Fixes #4499
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190520123924.14060-2-syuu@scylladb.com>
(cherry picked from commit a55330a10b)
Unlike CentOS, Debian variants have a python3 package in the official
repository, so we don't have to use relocatable python3 on these
distributions.
However, the official python3 version differs between distributions,
which may cause issues.
Also, our scripts and packaging implementation increasingly presuppose
the existence of relocatable python3, which causes issues on Debian
variants.
Switching to relocatable python3 on Debian variants avoids these issues
and makes it easier to manage Scylla python3 environments across multiple
distributions.
Fixes #4495
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190526105138.677-1-syuu@scylladb.com>
(cherry picked from commit 4d119cbd6d)
"
Before this patchset, empty counters were incorrectly persisted in the
MC format: no value was written to disk for them. The correct way
is to still write a header that indicates the counter is empty.
We also need to make sure that reading wrongly persisted empty
counters works, because customers may have sstables with wrongly
persisted empty counters.
Fixes #4363
"
* 'haaawk/4363/v3' of github.com:scylladb/seastar-dev:
sstables: add test for empty counters
docs: add CorrectEmptyCounters to sstable-scylla-format
sstables: Add a feature for empty counters in Scylla.db.
sstables: Write header for empty counters
sstables: Remove unused variables in make_counter_cell
sstables: Handle empty counter value in read path
(cherry picked from commit 899ebe483a)
The calculation is done in a non-preemptible loop over all tables, so if
the number of tables is very large it may take a while, since we also build
a string for the gossiper state. Make the loop preemptible and also make
the string calculation more efficient by preallocating memory for it.
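A sketch of both changes (`maybe_yield()` is a placeholder for seastar's preemption check, left as a comment here):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch of the two optimizations: reserve the output string up front to
// avoid repeated reallocation, and insert a (hypothetical) yield point every
// few tables so the loop is preemptible. In Scylla the yield would be a
// seastar preemption check; here it is a no-op stand-in.
std::string build_state_string(const std::vector<std::string>& tables) {
    size_t total = 0;
    for (const auto& t : tables) total += t.size() + 1;
    std::string out;
    out.reserve(total);              // single allocation instead of many
    size_t i = 0;
    for (const auto& t : tables) {
        out += t;
        out += ',';
        if (++i % 128 == 0) { /* maybe_yield(); */ }  // preemption point
    }
    return out;
}
```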
Message-Id: <20190516132748.6469-3-gleb@scylladb.com>
(cherry picked from commit 31bf4cfb5e)
invoke_on_all() copies the provided function for each shard it is executed
on, so by moving the stats map into the capture we copy it for each shard
too. Avoid this by putting it into the top-level object, which is already
captured by reference.
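A toy model of why the by-value capture is costly (`invoke_on_all_model` is a stand-in for seastar's invoke_on_all, and `payload` stands in for the stats map):

```cpp
#include <cassert>

// Global copy counter for the demonstration payload (stands in for the
// per-shard stats map).
static int g_copies = 0;
struct payload {
    payload() = default;
    payload(const payload&) { ++g_copies; }
};

// Toy model of invoke_on_all(): it copies the callable once per "shard".
template <typename Func>
void invoke_on_all_model(int shards, Func fn) {
    for (int i = 0; i < shards; ++i) {
        Func per_shard = fn;   // the per-shard copy of the function
        per_shard();
    }
}

int copies_when_captured_by_value(int shards) {
    g_copies = 0;
    payload p;
    invoke_on_all_model(shards, [p] { (void)&p; });  // p copied per shard
    return g_copies;
}

int copies_when_captured_by_reference(int shards) {
    g_copies = 0;
    payload p;   // owned by an enclosing object captured by reference
    invoke_on_all_model(shards, [&p] { (void)&p; });
    return g_copies;   // the payload itself is never copied
}
```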
Message-Id: <20190516132748.6469-2-gleb@scylladb.com>
(cherry picked from commit 4517c56a57)
"
Commit d0f9e00 changed the representation of the gc_clock::duration
from int32_t to int64_t.
Mutation hashing uses appending_hash<gc_clock::time_point>, which by
default feeds duration::count() into the hasher. duration::rep changed
from int32_t to int64_t, which changes the value of the hash.
This affects schema digest and query digests, resulting in mismatches
between nodes during a rolling upgrade.
Fixes #4460.
Refs #4485.
"
* tag 'fix-gc_clock-digest-v2.1' of github.com:tgrabiec/scylla:
tests: Add test which verifies that schema digest stays the same
tests: Add sstables for the schema digest test
schema_tables, storage_service: Make schema digest insensitive to expired tombstones in empty partition
db/schema_tables: Move feed_hash_for_schema_digest() to .cc file
hashing: Introduce type-erased interface for the hasher
hashing: Introduce C++ concept for the hasher
hashers: Rename hasher to cryptopp_hasher
gc_clock: Fix hashing to be backwards-compatible
(cherry picked from commit 82b91c1511)
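The backward-compatible hashing idea above can be sketched as follows (`gc_duration` and the std::hash stand-in are simplifications of the real hasher):

```cpp
#include <chrono>
#include <cstdint>
#include <functional>

// Sketch of the compatibility issue: the duration's count() changed type
// from int32_t to int64_t, and feeding the raw count into the hasher then
// changes the digest. Narrowing back to int32_t before hashing keeps old
// and new nodes agreeing (gc_clock second values fit in 32 bits here).
using gc_duration = std::chrono::duration<int64_t>;

size_t legacy_compatible_hash(gc_duration d) {
    return std::hash<int32_t>{}(static_cast<int32_t>(d.count()));
}

// What a not-yet-upgraded node computes from its 32-bit representation.
size_t old_node_hash(int32_t seconds) {
    return std::hash<int32_t>{}(seconds);
}
```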
Since the other build_*.sh scripts run inside an extracted relocatable
package, they have SCYLLA-PRODUCT-FILE at the top of the directory,
but build_ami.sh does not run in such a condition, so we need to run
SCYLLA-VERSION-GEN first and then refer to build/SCYLLA-PRODUCT-FILE.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190509110621.27468-1-syuu@scylladb.com>
(cherry picked from commit 19a973cd05)
AWS just released their new instances, the i3en instances. The instance
type is already verified to work well with Scylla; the only adjustments
we need are to advertise that we support it and to pre-fill the disk
information according to the performance numbers obtained by running the
instance.
Fixes #4486
Branches: 3.1
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190508170831.6003-1-glauber@scylladb.com>
(cherry picked from commit a23531ebd5)
The endpoint_filter() function assumes that each bucket of
std::unordered_multimap contains elements with the same key only, so
its size can be used to know how many elements with a particular key
are there. But this is not the case: elements with different keys may
share a bucket. Fix it by counting keys in another way.
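The difference is easy to demonstrate with the standard container API (a sketch, not the actual endpoint_filter() code):

```cpp
#include <string>
#include <unordered_map>

// Sketch of the bug: a bucket may hold elements with different keys (hash
// collisions), so bucket_size(bucket(k)) can over-count. count(k) walks the
// bucket but compares keys, giving the right answer.
size_t elements_with_key(const std::unordered_multimap<std::string, int>& m,
                         const std::string& key) {
    // Wrong: m.bucket_size(m.bucket(key)) -- may include colliding keys.
    return m.count(key);   // counts only elements whose key compares equal
}
```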
Fixes #3229
Message-Id: <20190501133127.GE21208@scylladb.com>
(cherry picked from commit 95c6d19f6c)
The setup script asks the user whether or not housekeeping should
be called, and the first time the script is executed this decision
is respected.
However, if the script is invoked again, that decision is not respected.
This is because the check has the form:
if (housekeeping_cfg_file_exists) {
version_check = ask_user();
}
if (version_check) { do_version_check() } else { dont_do_it() }
When it should have the form:
if (housekeeping_cfg_file_exists) {
version_check = ask_user();
if (version_check) { do_version_check() } else { dont_do_it() }
}
(Thanks python)
This is problematic in systems that are not connected to the internet, since
housekeeping will fail to run and crash the setup script.
Fixes #4462
Branches: master, branch-3.1
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190502034211.18435-1-glauber@scylladb.com>
(cherry picked from commit 47d04e49e8)
Newer versions of RHEL ship the os-release file with newlines at the
end, which our script was not prepared to handle. As such, scylla_setup
would fail.
This patch makes our OS detection robust against that.
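The underlying fix is just trimming trailing whitespace before comparing; the real script is Python, but the same idea can be sketched in C++:

```cpp
#include <string>

// Sketch: strip trailing newline/whitespace from an os-release value so
// "centos\n" and "centos" compare equal during OS detection.
std::string rstrip(std::string s) {
    while (!s.empty() && (s.back() == '\n' || s.back() == '\r' ||
                          s.back() == ' ' || s.back() == '\t')) {
        s.pop_back();
    }
    return s;
}
```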
Fixes #4473
Branches: master, branch-3.1
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190502152224.31307-1-glauber@scylladb.com>
(cherry picked from commit 99c00547ad)
Until this patch, dropping columns from a table was completely forbidden
if this table has any materialized views or secondary indexes. However,
this is excessively harsh, and not compatible with Cassandra which does
allow dropping columns from a base table which has a secondary index on
*other* columns. This incompatibility was raised in the following
Stackoverflow question:
https://stackoverflow.com/questions/55757273/error-while-dropping-column-from-a-table-with-secondary-index-scylladb/55776490
In this patch, we allow dropping a base table column if none of its
materialized views *needs* this column. Columns selected by a view
(as regular or key columns) are needed by it, of course, but when
virtual columns are used (namely, there is a view with same key columns
as the base), *all* columns are needed by the view, so unfortunately none
of the columns may be dropped.
After this patch, when a base-table column cannot be dropped because one
of the materialized views needs it, the error message will look like:
exceptions::invalid_request_exception: Cannot drop column a from base
table ks.cf: a materialized view cf_a_idx_index needs this column.
This patch also includes extensive testing for the cases where dropping
columns are now allowed, and not allowed. The secondary-index tests are
especially interesting, because they demonstrate that now usually (when
a non-key column is being indexed) dropping columns will be allowed,
which is what originally bothered the Stackoverflow user.
Fixes #4448.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190429214805.2972-1-nyh@scylladb.com>
There are several places where IN restrictions are not currently supported,
especially in queries involving a secondary index. However, when the IN
restriction has just a single value, it is nothing more than an equality
restriction and can be converted into one and be supported. So this patch
does exactly this.
Note that Cassandra does this conversion since August 2016, and therefore
supports the special case of single-value IN even where general IN is not
supported. So it's important for Cassandra compatibility that we do this
conversion too.
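The conversion can be sketched as follows (a simplified, hypothetical restriction type, not the actual cql3 classes):

```cpp
#include <string>
#include <vector>

// Hypothetical simplified restriction: op is "IN" or "EQ".
struct restriction {
    std::string op;
    std::vector<int> values;
};

// A single-value IN is semantically an equality; converting it lets code
// paths that support EQ but not general IN (e.g. via a secondary index)
// accept the query.
restriction simplify(restriction r) {
    if (r.op == "IN" && r.values.size() == 1) {
        r.op = "EQ";
    }
    return r;
}
```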
This patch also includes a test with two queries involving a secondary
index that were previously disallowed because of the "IN" on the primary
key or the indexed column - and are now allowed when the IN restriction
has just a single value. A third query tested is not related to secondary
indexes, but confirms we don't break multi-column single-value IN queries.
Fixes #4455.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190428160317.23328-1-nyh@scylladb.com>
The shard readers of the multishard reader assumed that the positions in
the data stream are strictly monotonic. This assumption is invalid:
range tombstones can have positions that they share with other range
tombstones and/or a clustering row. The effect of this false assumption
was that when the shard reader was evicted such that the last seen
fragment was a range tombstone, when recreated it would skip any unseen
fragments that have the same position as that of the last seen range
tombstone.
Fixes: #4418
Branches: master, 3.0, 2019.1
Tests: unit(dev)
* https://github.com/denesb/scylla.git
multishard_reader_handle_non_strictly_monotonous_positions/v4:
multishard_combining_reader: shard_reader::remote_reader extract
fill-buffer logic into do_fill_buffer()
multishard_combining_reader: reorder
shard_reader::remote_reader::do_fill_buffer() code
position_in_partition_view: add region() accessor
multishard_combining_reader: fix handling of non-strictly monotonous
positions
flat_mutation_reader: add flat_mutation_reader_from_mutations()
overload with range and slice
flat_mutation_reader: add make_flat_mutation_reader_from_fragments()
overload with range and slice
tests: add unit test for multishard reader correctly handling
non-strictly monotonous positions
Currently null and missing values are treated differently. Missing
values throw no_such_column. Null values return nullptr, std::nullopt
or throw null_column_value.
The API is a bit confusing, since a function returning a std::optional
either returns std::nullopt or throws depending on why there is no
value.
With this patch series only get_nonnull throws and there is only one
exception type.
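A sketch of the unified accessors (a simplified stand-in for result_set_row, not the actual class):

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Sketch of the unified accessors: missing and null columns are treated
// the same. get() reports both as std::nullopt; only get_nonnull() throws,
// and with a single exception type.
struct result_row {
    // nullopt value == column present but null; absent key == column missing
    std::unordered_map<std::string, std::optional<int>> cells;

    std::optional<int> get(const std::string& name) const {
        auto it = cells.find(name);
        if (it == cells.end()) return std::nullopt;  // missing
        return it->second;                           // value or null
    }

    int get_nonnull(const std::string& name) const {
        auto v = get(name);
        if (!v) throw std::runtime_error("null or missing column: " + name);
        return *v;
    }
};
```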
* https://github.com/espindola/scylla.git espindola/merge-null-and-missing-v2:
query-result-set: merge handling of null and missing values
Remove result_set_row::has
Return a reference from get_nonnull
No reason to copy if we don't have to. Now that get_nonnull doesn't
copy, replace a raw use of get_data_value with it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Now that the various get methods return nullptr or std::nullopt on
missing values, we don't need to do double lookups.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Nothing seems to differentiate a missing and a null value. This patch
then merges the two exception types and now the only method that
throws is get_nonnull. The other methods return nullptr or
std::nullopt as appropriate.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
After 7c87405, schema sync includes system_schema.view_virtual_columns in the
schema digest. Old nodes don't know about this table and will not include it
in the digest calculation. As a result, there will be schema disagreement
until the whole cluster is upgraded.
Also, the order in which tables were hashed changed in 7c87405, which
causes digests to differ in some schemas.
Fixes #4457.
"
* tag 'fix-disagreement-during-upgrade-v2' of github.com:tgrabiec/scylla:
db/schema_tables: Include view_virtual_columns in the digest only when all nodes do
storage_service: Introduce the VIEW_VIRTUAL_COLUMNS cluster feature
db/schema_tables: Hash schema tables in the same order as on 3.0
db/schema_tables: Remove table name caching from all_tables()
treewide: Propagate schema_features to db::schema::all_tables()
enum_set: Introduce full()
service/storage_service: Introduce cluster_schema_features()
schema: Introduce schema_features
schema_tables: Propagate storage_service& to merge_schema()
gms/feature: Introduce a more convenient when_enabled()
gms/feature: Mark all when_enabled() overloads as const
Currently, we use --sig-proxy to forward signals to the container. However, this
requires the container's co-operation, which usually doesn't exist. For example,
docker run --sig-proxy fedora:29 bash -c "sleep 5"
does not respond to Ctrl-C.
This is a problem for continuous integration. If a build is aborted, Jenkins will
first attempt to gracefully terminate the processes (SIGINT/SIGTERM) and then give
up and use SIGKILL. If the graceful termination doesn't work, we end up with an
orphan container running on the node, which can then consume enough memory and CPU
to harm the following jobs.
To fix this, trap signals and handle them by killing the container. Also trap
shell exit, and kill the container unconditionally there, since if Jenkins
happens to kill the "docker wait" process the regular paths will not be taken.
We lose a lot by running the container asynchronously with the dbuild shell
script, so we need to add it back:
- log display: via the "docker logs" command
- auto-removal of the container: add a "docker rm -f" command on signal
or normal exit
Message-Id: <20190424130112.794-1-avi@scylladb.com>
To be able to support this new overload, the reader is made
partition-range aware. It will now correctly only return fragments that
fall into the partition-range it was created with. For completeness'
sake and to be able to test it, also implement
`fast_forward_to(const dht::partition_range)`. Slicing is done by
filtering out non-overlapping fragments from the initial list of
fragments. Also add a unit test that runs it through the mutation_source
test suite.
After 7c87405, schema sync includes system_schema.view_virtual_columns
in the schema digest. Old nodes don't know about this table and will
not include it in the digest calculation. As a result, there will be
schema disagreement until the whole cluster is upgraded.
Fix this by taking the new table into account only when the whole
cluster is upgraded.
The table should not be used for anything before this happens. This is
not currently enforced, but should be.
Fixes #4457.
Needed for determining whether all nodes in the cluster are aware of the
new schema table. Only when all nodes are aware of it can we take it
into account when calculating the schema digest; otherwise there would be
permanent schema disagreement during a rolling upgrade.
The commit 7c87405 also indirectly changed the order of schema tables
during hash calculation (index table should be taken after all other
tables). This shows up when there is an index created and any of {user
defined type, function, or aggregate}.
Refs #4457.
It can be invoked with a lambda without the ceremony of creating a
class deriving from gms::feature::listener.
The returned registration object controls the listener's scope.
There was nothing keeping the verify lambda alive after the return. It
worked most of the time since the only state kept by the lambda was
a pointer to cql_test_env.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190426203823.15562-1-espindola@scylladb.com>
* seastar e84d2647c...4cdccae53 (4):
> Merge "future: Move some code out of line" from Rafael
> tests: socket_test: Add missing virtual and override
> build: Don't pass -Wno-maybe-uninitialized to clang
> Merge "expose file_permssions for creating files and dirs in API" from Benny
To be able to run the mutation-source test suite with this reader. In
the next patch, this reader will be used in testing another reader, so
it is important to make sure it works correctly first.
The shard readers under a multishard reader are paused after every
operation executed on them. When paused they can be evicted at any time.
When this happens, they will be re-created lazily on the next
operation, with a start position such that they continue reading from
where the evicted reader left off. This start position is determined
from the last fragment seen by the previous reader. When this position
is a clustering position, the reader will be recreated such that it reads
the clustering range (from the half-read partition): (last-ckey, +inf).
This can cause problems if the last fragment seen by the evicted reader
was a range-tombstone. Range tombstones can share the same clustering
position with other range tombstones and potentially one clustering row.
This means that when the reader is recreated, it will start from the
next clustering position, ignoring any unread fragments that share the
same position as the last seen range tombstone.
To fix, ensure that on each fill-buffer call, the buffer contains all
fragments for the last position. To this end, when the last fragment in
the buffer is a range tombstone (with pos x), we continue reading until
we see a fragment with a position y that is greater. This way it is
ensured that we have seen all fragments for pos x and it is safe to
resume the read, starting from after position x.
In order to avoid schema disagreements during upgrades (which may lead
to deadlocks), system distributed keyspace initialization is moved
right before starting the bootstrapping process, after the schema
agreement checks already succeeded.
Fixes #3976
Message-Id: <932e642659df1d00a2953df988f939a81275774a.1556204185.git.sarna@scylladb.com>
To make it available, we'll need to make the usage of level metadata
optional (it is used to deal with the interval map's fragmentation issue
when level 0 falls behind), and also introduce an interface for compaction
strategies to implement make_sstable_set(), which instantiates the
partitioned sstable set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190424232948.668-1-raphaelsc@scylladb.com>
Instead of app-template::run_deprecated() and at_exit() hooks, use
app_template::run() and RAII (via defer()) to stop services. This makes it
easier to add services that do support shutdown correctly.
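A minimal sketch of the defer() idea, in the spirit of seastar's util/defer.hh (this stand-in uses std::function; the real helper is templated):

```cpp
#include <functional>
#include <utility>
#include <vector>

// Run a callable when the scope unwinds, so each service registers its
// stop right after a successful start, and teardown happens in reverse
// start order even if a later start throws.
class deferred {
    std::function<void()> _f;
public:
    explicit deferred(std::function<void()> f) : _f(std::move(f)) {}
    deferred(const deferred&) = delete;
    ~deferred() { _f(); }
};

std::vector<int> start_stop_order() {
    std::vector<int> stopped;
    {
        // start service A ...
        deferred stop_a([&] { stopped.push_back(1); });
        // start service B ...
        deferred stop_b([&] { stopped.push_back(2); });
    }   // B stops before A, the reverse of start order
    return stopped;
}
```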
Ref #2737
Message-Id: <20190420175733.29454-1-avi@scylladb.com>
compact_and_evict() gets memory_to_release in bytes while the
reclamation step is in segments.
Broken in f092decd90.
It doesn't make much difference with the current default step of 1
segment since we cannot reclaim less than that, so shouldn't cause
problems in practice.
Message-Id: <1556013920-29676-1-git-send-email-tgrabiec@scylladb.com>
Currently stop() returns a ready future immediately. This is not a problem,
since the calculation loop holds a shared pointer to the local service, so
it will not be destroyed until the calculation completes, and the global
database object db, which is also used by the calculation, is never
destroyed. But the latter is just a workaround for a shutdown sequence that
cannot handle it and will be changed one day. Make the cache hitrate
calculation service ready for that.
Message-Id: <20190422113538.GR21208@scylladb.com>
The non_system_filter lambda is defined static, which means it is initialized
only once, so the 'this' that it captures will belong to the shard
where the function runs first. During service destruction the function
may run on a different shard and access another shard's service, which
may already be freed.
Fixes #4425
Message-Id: <20190421152139.GN21208@scylladb.com>
This patch adds node_exporter to the docker image.
It installs it and creates and runs a service for it.
After this patch node_exporter will run and be part of the Scylla
Docker image.
Fixes #4300
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20190421130643.6837-1-amnon@scylladb.com>
To prepare for a seastar change that adds an optional file_permissions
parameter to touch_directory and recursive_touch_directory.
This change messes up the call to io_check since the compiler can't
derive the Func&& argument. Therefore, use a lambda function instead
to wrap the call to {recursive_,}touch_directory.
Ref #4395
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190421085502.24729-1-bhalevy@scylladb.com>
When we start the LSA reclamation it can be that
segment_pool::_free_segments is 0 under some conditions and
segment_pool::_current_emergency_reserve_goal is set to 1. The
reclamation step is 1 segment, and compact_and_evict_locked() frees 1
segment back into the segment_pool. However,
segment_pool::reclaim_segments() doesn't free anything to the standard
allocator because the condition _free_segments >
_current_emergency_reserve_goal is false. As a result,
tracker::impl::reclaim() returns 0 as the amount of released memory,
tracker::reclaim() returns
memory::reclaiming_result::reclaimed_nothing and the seastar allocator
thinks it's a real OOM and throws std::bad_alloc.
The fix is to change compact_and_evict() to make sure that reserves
are met, by releasing more if they're not met at entry.
This change also allows us to drop the variant of allocate_segment()
which accepts the reclamation step as a means to refill reserves
faster. This is now not needed, because compact_and_evict() will look
at the reserve deficit to increase the amount of memory to reclaim.
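The reserve-aware sizing can be sketched as (a hypothetical helper; the real code operates on the segment pool's internal state):

```cpp
#include <cstddef>

// Sketch of the fix: compact_and_evict() must release enough segments to
// both satisfy the caller and refill the emergency reserve; otherwise a
// freed segment can be swallowed whole by the reserve and the reclaimer
// reports "reclaimed nothing", which looks like a real OOM.
size_t segments_to_release(size_t requested, size_t free_segments,
                           size_t reserve_goal) {
    size_t deficit =
        reserve_goal > free_segments ? reserve_goal - free_segments : 0;
    return requested + deficit;   // meet the reserve, then the request
}
```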
Fixes #4445
Message-Id: <1555671713-16530-1-git-send-email-tgrabiec@scylladb.com>
* seastar eb03ba5cd...e84d2647c (14):
> Fix hardcoded python paths in shebang line
> Disable -Wmaybe-uninitialized everywhere
> app_template: allow opting out of automatic SIGINT/SIGTERM handling
> build: Restore DPDK machine inference from cflags
> http: capture request content for POST requests
> Merge "Simplify future_state and promise" from Rafael
> temporary_buffer: fix memleak on fast path
> perftune.py: allow explicitly giving a CPU mask to be used for binding IRQs
> perftune.py: fix the sanity check for args.tune
> perftune.py: identify fast-path hardware queues IRQs of Mellanox NICs
> memory: malloc_allocator should be always available
> Merge "Using custom allocator in the posix network stack" from Elazar
> memory: Tell reclaimers how much should be reclaimed
> net/ipv4_addr: add std::hash & operator== overloads
To prevent running the entrypoint script with another python3 package, like
python36 from EPEL, move /opt/scylladb/python3/bin to the top of $PATH.
It won't happen on this container image itself, but may occur when a user
tries to extend the image.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190417165806.12212-1-syuu@scylladb.com>
perftune.py executes hwloc-calc; that command is now provided as a
relocatable binary, placed under /opt/scylladb/bin.
So we need to add the directory to PATH when calling
subprocess.check_output(), but our utility function already does that,
so switch to it.
Fixes#4443
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190418124345.24973-1-syuu@scylladb.com>
When we added product name customization, we mistakenly defined the
parameter in each package build script.
The number of scripts keeps increasing since we recently added the
relocatable python3 package, so we should define it in a single place.
We should also save the parameter in the relocatable package, just like
the version-release parameters.
So move the definition to SCYLLA-VERSION-GEN, save it to
build/SCYLLA-PRODUCT-FILE, then archive it into the relocatable package.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190417163335.10191-1-syuu@scylladb.com>
Seastar now supports two RPC compression algorithms: the original LZ4 one
and LZ4_FRAGMENTED. The latter uses the lz4 streaming interface, which
allows it to process large messages without fully linearising them. Since
RPC requests used by Scylla often contain user-provided data that can
potentially be very large, LZ4_FRAGMENTED is a better choice for
the default compression algorithm.
Message-Id: <20190417144318.27701-1-pdziepak@scylladb.com>
Since we moved to relocatable .rpms, Scylla is now able to run on Amazon
Linux 2.
However, is_redhat_variant() in scylla_util.py does not work on Amazon
Linux 2, since it does not have /etc/redhat-release.
So we need to switch to /etc/os-release and use ID_LIKE to detect Red Hat
variants/Debian variants.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190417115634.9635-1-syuu@scylladb.com>
Switch to relocatable python3 instead of EPEL's python3 in docker-entrypoint.py.
Also drop unneeded dependencies, since we switched to the relocatable scylla
image.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190417111024.6604-1-syuu@scylladb.com>
"
Previously we weren't validating elements of collections so it
was possible to add non-UTF-8 string to a column with type
list<text>.
Tests: unit(release)
Fixes#4009
"
* 'haaawk/4009/v5' of github.com:scylladb/seastar-dev:
types: Test correct map validation
types: Test correct in clause validation
types: Test correct tuple validation
types: Test correct set validation
types: Test correct list validation
types: Add test_tuple_elements_validation
types: Add test_in_clause_validation
types: Add test_map_elements_validation
types: Add test_set_elements_validation
types: Add test_list_elements_validation
types: Validate input when tuples
types: Validate input when parsing a set
types: Validate input when parsing a map
types: Validate input when parsing a list
types: Implement validation for tuple
types: Implement validation for set
types: Implement validation for map
types: Implement validation for list
types: Add cql_serialization_format parameter to validate
1. All nodes in the cluster have to support MC_SSTABLE_FEATURE
2. When a node observes that the whole cluster supports MC_SSTABLE_FEATURE
then it should start using the MC format.
3. Once all shards start to use MC, the node should broadcast that
unbounded range tombstones are now supported by the cluster.
4. Once the whole cluster supports unbounded range tombstones we can
start accepting them on the CQL level.
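Step 2 above can be sketched as a simple all-nodes check (Python sketch with illustrative names; the real implementation gates on gossip-propagated feature sets):

```python
def cluster_supports(node_features, feature):
    # A node starts using the MC format only once every node in the
    # cluster advertises MC_SSTABLE_FEATURE.
    return all(feature in fs for fs in node_features.values())

nodes = {"n1": {"MC_SSTABLE_FEATURE"}, "n2": {"MC_SSTABLE_FEATURE"}}
assert cluster_supports(nodes, "MC_SSTABLE_FEATURE")
nodes["n3"] = set()        # an older node joins without the feature
assert not cluster_supports(nodes, "MC_SSTABLE_FEATURE")
```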
tests: unit(release)
Fixes #4205 Fixes #4113
* seastar-dev.git dev/haaawk/enable_mc/v11:
system_keyspace: Add scylla_local
system_keyspace: add accessors for SCYLLA_LOCAL
storage_service: add _sstables_format field
feature: add when_enabled callbacks
system_keyspace: add storage_service param to setup
Add sstable format helper methods
Register feature listeners in storage_service
Add service::read_sstables_format
Use read_sstables_format in main.cc
Use _sstables_format to determine current format
Add _unbounded_range_tombstones_feature
Update supported features on format change
Currently, we use --sig-proxy to forward signals to the container. However, this
requires the container's co-operation, which usually doesn't exist. For example,
docker run --sig-proxy fedora:29 bash -c "sleep 5"
Does not respond to ctrl-C.
This is a problem for continuous integration. If a build is aborted, Jenkins will
first attempt to gracefully terminate the processes (SIGINT/SIGTERM) and then give
up and use SIGKILL. If the graceful termination doesn't work, we end up with an
orphan container running on the node, which can then consume enough memory and CPU
to harm the following jobs.
To fix this, trap signals and handle them by killing the container. Also trap
shell exit, and even kill the container unconditionally, since if Jenkins happens
to kill the "docker wait" process the regular paths will not be taken.
Message-Id: <20190415084040.12352-1-avi@scylladb.com>
This series addresses two issues in the hinted handoff that should
complete fixing the infamous #4231.
In particular the second patch removes the requirement to manually
delete hints files after upgrading to 3.0.4.
Tested with manual unit testing.
* https://github.com/vladzcloudius/scylla.git hinted_handoff_drop_broken_segments-v3:
hinted handoff: disable "reuse_segments"
commitlog: introduce a segment_error
hinted handoff: discard corrupted segments
* seastar 6f73675...eb03ba5 (11):
> tests: tests C++14 dialect in continuous integration
> rpc/compressor/lz4: fix std:variant related compiler errors
> tests: futures_test: allow project to compile with C++14
> Merge "io_queue: make io_priority_class namespace global" from Benny
> future::then_wrapped: use std::terminate instead of abort
> reactor: make metric about task quota violations less sensitive
> Merge "Add LZ4_FRAGMENTED compressor for RPC" from Paweł
> Fix build issues with Clang 7
> Merge "file_stat follow_symlink option and related fixes" from Benny
> doc/tutorial.md: reword mention of seastar::thread premption on get()
> tests: semaphore_test: relax timeouts
Fixes#4272.
Resharding wasn't preserving the sstable run structure, which depends
on all fragments sharing the same run identifier. So let's make
resharding run aware, meaning that a run will be created for each
shard involved.
tests: release mode.
Fixes#4428.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190415193556.16435-1-raphaelsc@scylladb.com>
This change drops the hit count from the name of the node, because it
prevents coalescing of nodes which are shared parents for paths with
different counts. This lack of coalescing makes the flamegraph a lot
less useful.
Message-Id: <1555348576-26382-1-git-send-email-tgrabiec@scylladb.com>
All schema changes made to the node locally are serialized on a
semaphore which lives on shard 0. For historical reasons, they don't
queue but rather try to take the lock without blocking and retry on
failure with a random delay from the range [0, 100 us]. Contenders
which do not originate on shard 0 will have an extra disadvantage as
each lock attempt will be longer by the across-shard round trip
latency. If there is constant contention on shard 0, contenders
originating from other shards may keep losing the race to take the lock.
Schema merge executed on behalf of a DDL statement may originate on
any shard. Same for the schema merge which is coming from a push
notification. Schema merge executed as part of the background schema
pull will originate on shard 0 only, where the application state
change listeners run. So if there are constant schema pulls, DDL
statements may take a long time to get through.
The fix is to serialize merge requests fairly, by using the blocking
semaphore::wait(), which is fair.
We don't have to back-off any more, since submit_to() no longer has a
global concurrency limit.
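The difference between try-lock-with-random-backoff and a fair, FIFO-ordered wait can be sketched with a ticket lock (Python sketch; not the actual seastar semaphore):

```python
import itertools
import threading

class TicketLock:
    # Fair (FIFO) lock: contenders are served strictly in arrival order,
    # like the blocking semaphore::wait(), so a merge request originating
    # on a remote shard cannot keep losing to a stream of shard-0 ones.
    def __init__(self):
        self._tickets = itertools.count()
        self._now_serving = 0
        self._cv = threading.Condition()

    def acquire(self):
        with self._cv:
            my_ticket = next(self._tickets)   # queue position fixed here
            while self._now_serving != my_ticket:
                self._cv.wait()

    def release(self):
        with self._cv:
            self._now_serving += 1
            self._cv.notify_all()

lock = TicketLock()
lock.acquire()   # ticket 0 is served immediately
lock.release()
lock.acquire()   # ticket 1: FIFO order, no retry loop, no random backoff
lock.release()
```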
Fixes#4436.
Message-Id: <1555349915-27703-1-git-send-email-tgrabiec@scylladb.com>
Currently, each instantiation of `random_mutation_generator::impl` will
generate a new random seed for itself. Although these are printed,
mapping each printed seed back to the exact source location where it
has to be substituted in is non-trivial. This makes reproducing random
test failures very hard. To solve this problem, use
`tests::random::get_int()` to produce the random seed of the
`random_mutation_generator::impl` instances. This way the seeds of all
the mutation generators will be derived from a single "master" seed that
is easily replaced after a test failure, hopefully also leading to
easily reproducible random test failures.
I checked that after substituting in a previously generated master
random seed, all derived seeds were exactly the same.
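The seed-derivation scheme can be sketched like this (Python sketch with hypothetical names; `tests::random::get_int()` is the C++ analogue of the master draw):

```python
import random

MASTER_SEED = 1234  # the single number to substitute after a failure

def derived_seeds(master_seed, n):
    # Each generator instance draws its seed from one master engine
    # instead of seeding itself independently.
    master = random.Random(master_seed)
    return [master.randint(0, 2**32 - 1) for _ in range(n)]

# Substituting the same master seed reproduces every derived seed.
assert derived_seeds(MASTER_SEED, 3) == derived_seeds(MASTER_SEED, 3)
```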
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <0471415938fc27485975ef9213d37d94bff20fd5.1555329062.git.bdenes@scylladb.com>
"
These are patches I wrote while working on UDF/UDA, but IMHO they are
independent improvements and are ready for review.
Tests: unit (debug) dtest (release)
I checked that all tests in
nosetests -v user_types_test.py sstabledump_test.py cqlsh_tests/cqlsh_tests.py
now pass.
"
* 'espindola/udf-uda-refactoring-v3' of https://github.com/espindola/scylla:
Refactor user type merging
cql_type_parser::raw_builder: Allow building types incrementally
cql3: delete dead code
Include missing header
return a const reference from return_type
delete unused var
Add a test on nested user types.
auto_bootstrap: false provides negligible gains for new clusters and
it is extremely dangerous everywhere else. We have seen a couple of
times in which users, confused by this, added this flag by mistake
and added nodes with it. While they were pleased by the extremely fast
times to add nodes, they were later displeased to find their data
missing.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190414012028.20767-1-glauber@scylladb.com>
This requires introduction of storage_service::get_known_features
and using it with check_knows_remote_features.
Otherwise a node joining the existing cluster won't be able to
join because it does not support unbounded range tombstones yet.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We have to run this script with python2, since we dropped EPEL from the
dependencies, and the script is the installer for the rpms, so we cannot
use relocatable python3 for it.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190411151858.2292-1-syuu@scylladb.com>
Although curl is widely available, there is no reason to depend on it.
There are mainly three users, as indicated by grep:
1) scylla-housekeeping
2) scripts within the AMI
3) docker image
The AMI has its own RPM and it already depends on curl. While we could
get rid of the curl dependency there too, we can do that later. Docker
is its own thing and it only needs it at build time anyway.
For the main scylla repo, this patch changes scylla-housekeeping so as
not to depend on the curl binary and use urllib directly instead. We can
then remove curl from our dependency list.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190411125642.9754-1-glauber@scylladb.com>
date_type_impl::less() invokes `compare_unsigned()` to compare the
underlying raw byte values. `compare_unsigned()` is a tri-comparator,
however `date_type_impl::less()` implicitly converted the returned
value to bool. In effect, `date_type_impl::less()` would *always* return
`true` when the two compared values were not equal.
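The conversion bug can be illustrated in a few lines of Python (a sketch of the pattern, not the actual C++ code):

```python
def compare_unsigned(a: bytes, b: bytes) -> int:
    # tri-comparator: negative / zero / positive
    return (a > b) - (a < b)

def less_buggy(a, b):
    # The bug: implicit tri-value -> bool conversion. Any non-zero
    # result, including "a is greater", reads as True.
    return bool(compare_unsigned(a, b))

def less_fixed(a, b):
    return compare_unsigned(a, b) < 0

assert less_buggy(b"\x02", b"\x01") is True   # wrong: 2 is not less than 1
assert less_fixed(b"\x02", b"\x01") is False
assert less_fixed(b"\x01", b"\x02") is True
```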
Found while working on a unit test which employs a randomly generated
schema to test a component.
Fixes#4419.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <8a17c81bad586b3772bf3d1d1dae0e3dc3524e2d.1554907100.git.bdenes@scylladb.com>
Currently the partition header will always be reported as different when
comparing two mutations. This is because they are prepended with the
"expected: " and "... but got: " texts. This generates unnecessary
noise. Inject a new line between the prefix and the partition-header
proper. This way the partition header will only show up in the diff when
there is an actual difference. The "expected: " and "... but got: "
phrases are still shown as different on the top of the diff but this is
fine as one can immediately see that they are not part of the data and
additionally they help the reader in determining which part of the diff
is the expected one and which is the actual one.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <29e0f413d248048d7db032224a3fd4180bf1b319.1554909144.git.bdenes@scylladb.com>
The problem happens after a schema change because we fail to properly
remove an ongoing compaction, which stopped being tracked, from the list
that is used to calculate backlog. As a result, a compaction read
monitor (which ceases to exist after compaction ends) may be used after
being freed.
Fixes#4410.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190409024936.23775-1-raphaelsc@scylladb.com>
If we discover that a current segment is corrupted there is nothing we
can do about it.
This patch does the following:
1) Drops the corrupted segment and moves to the next one.
2) Logs such events as ERRORs.
3) Introduces a new metric that accounts for such events.
Fixes#4364
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Introduce a common base class for all errors that indicate that the current
segment has "issues".
This allows a laconic "catch" clause for all such errors.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Hinted handoff doesn't utilize this feature (which was developed with a
commitlog in mind).
Since it's enabled by default we need to explicitly disable it.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The comparison of tables before and after mutation is now done by a
generic diff_rows function. The same function will be used for user
defined functions and user defined aggregates.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Before this patch raw_builder would always start with an empty list of
user types. This means that every time a type is added to a keyspace,
every type in that keyspace needs to be recreated.
With this patch we pass a keyspace_metadata instead of just the
keyspace name and can construct new user types on top of previous
ones.
This will be used in the followup patch, where only new types are
created.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
abstract_function.hh uses function, which is defined in function.hh,
so it should include it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We define data_type as
using data_type = shared_ptr<const abstract_type>;
Since it is a shared_ptr, it cannot be copied into another thread
since that would create a race condition incrementing the reference
counter.
In particular, before this patch it is not legal to call
return_type from another thread.
With this patch read only access from another thread is possible.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
There were many calls to read_keyspace_mutation. One in each function
that prepares a mutation for some other schema change.
With this patch they are all moved to a single location.
Tests: unit (dev, debug)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190328024440.26201-1-espindola@scylladb.com>
Fedora 28's python magic used to return an x-sharedlib mime type for .so files.
Fedora 29 changed that to x-pie-executable, so the libraries are no longer
relocated.
Let's be more permissive and relocate everything that starts with application/.
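The relaxed check amounts to a prefix match (Python sketch; `should_relocate` is an illustrative name, not the script's actual function):

```python
def should_relocate(mime_type):
    # Any application/* mime type is a relocation candidate, covering
    # both Fedora 28's x-sharedlib and Fedora 29's x-pie-executable
    # answers for .so files.
    return mime_type.startswith("application/")

assert should_relocate("application/x-sharedlib")       # Fedora 28 .so
assert should_relocate("application/x-pie-executable")  # Fedora 29 .so
assert not should_relocate("text/x-shellscript")
```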
Fixes#4396
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190404140929.7119-1-glauber@scylladb.com>
This series fixes row level repair shutdown related issues we saw with
dtests, e.g., use-after-free of the repair meta object and failure to stop
a table during shutdown.
Fixes: #4044 Fixes: #4314 Fixes: #4333 Fixes: #4380
Tests: repair_additional_test.py:RepairAdditionalTest.repair_abort_test
repair_additional_test.py:RepairAdditionalTest.repair_kill_2_test
* seastar-dev.git asias/repair.fix.shutdown.v1:
repair: Wait for pending repair_meta operation before removing it
repair: Check shutdown in row level repair
repair: Remove repair meta when node is dead
  repair: Remove all row level repair during shutdown
* seastar 63d8607...6f73675 (5):
> Merge "seastar-addr2line: improve the context of backtraces" from Botond
> log: fix std::system_error ostream operator to print full error message
> Revert "threads: yield on get if we had run for too long."
> core/queue: Document concurrency constraints
> core/memory: Make small pools use the full span size
Fixes#4407.
Fixes#4316.
"
Calculation of IO properties is slightly wrong for i3.metal, because we get
the number of disks wrong. The reason for that is our check for ephemeral nvme
disks, which pre-dates the time in which root devices were exposed as nvme
devices (nitro and metal instances).
"
toolchain updated with python3-psutil
* 'ec2fixes' of github.com:glommer/scylla:
scylla_util.py: do not include root disks in ephemeral list
scylla-python3: include the psutil module
fix typo in scylla_ec2_check
They are of type db::system_distributed_keyspace and
db::view::view_update_generator.
n1 is in normal status
n2 boots up and _sys_dist_ks or _view_update_generator are not
initialized
n1 runs stream, n2 is the follower.
n2 uses the _sys_dist_ks or _view_update_generator
"Assertion `local_is_initialized()' failed" is observed
Fixes#4360
Message-Id: <4ae13e1640ac8707a9ba0503a2744f6faf89ecf4.1554330030.git.asias@scylladb.com>
"
With these changes we avoid a std::vector<data_value> copy, which is
nice in itself, but also makes it possible to call get_list from other
shards.
"
* 'espindola/result-set-v3' of https://github.com/espindola/scylla:
Avoid copying a std::vector in get_list
query-result-set: add and use a get_ptr method
They are of type db::system_distributed_keyspace and db::view::view_update_generator.
n1 is in normal status
n2 boots up and _sys_dist_ks or _view_update_generator are not initialized
n1 runs repair, n2 is the follower.
n2 uses the _sys_dist_ks or _view_update_generator
"Assertion `local_is_initialized()' failed" is observed
Fixes#4360
Message-Id: <6616c21078c47137a99ba71baf82594ba709597c.1553742487.git.asias@scylladb.com>
For now this is just an optimization. But it also avoids copying
data_type, which will allow this to be used across shards.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This moves a copy up the call stack and makes it possible to avoid it
completely by passing a reference type to get_nonnull.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
When --abort-on-lsa-bad-alloc is enabled we want to abort whenever
we think we can be out of memory.
We covered failures due to bad_alloc thrown from inside of the
allocation section, but did not cover failures from reservations done
at the beginning of with_reserve(). Fix by moving the trap into
reserve().
Message-Id: <1553258915-27929-1-git-send-email-tgrabiec@scylladb.com>
Nitro instances (and metal ones) put their root device on nvme (as a
protocol; it is still EBS). Our algorithm so far has relied on parsing
the nvme devices to figure out which ones are ephemeral, but it will
break for those instances.
Out of our supported instances so far, the i3.metal is the only one
in which this breaks.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Using a new python3 module has never been this easy! So we'll
unapologetically use psutil and not even worry about whether or not
CentOS supports it (it doesn't).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
* seastar 5572de7...63d8607 (6):
> test: verify that negative sleep time doesn't cause infinite sleep
> httpd: Change address handling to use socket_address
> dns: Change "unspecififed" address search type to retrive first avail
> Allow when_all and when_all_succeed to take function arguments
> when_all: abort if memory allocation fails
> inet_address: Add missing constructor impl.
We saw a dtest fail to stop a node like:
```
ERROR: repair_one_missing_row_test (repair_additional_test.RepairAdditionalTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/asias/src/cloudius-systems/scylla-dtest/repair_additional_test.py", line 2521, in repair_one_missing_row_test
return RepairAdditionalBase._repair_one_missing_row_test(self)
File "/home/asias/src/cloudius-systems/scylla-dtest/repair_additional_test.py", line 1842, in _repair_one_missing_row_test
self.check_rows_on_node(node2, nr_rows)
File "/home/asias/src/cloudius-systems/scylla-dtest/repair_additional_test.py", line 34, in check_rows_on_node
node.stop(wait_other_notice=True)
File "/home/asias/src/cloudius-systems/scylla-ccm/ccmlib/scylla_node.py", line 496, in stop
raise NodeError("Problem stopping node %s" % self.name)
NodeError: Problem stopping node node1
```
The problem is:
1) repair_meta is created:
repair_meta -> repair_writer::create_writer() -> t.stream_in_progress()
repair_meta -> repair_reader::repair_reader -> cf.read_in_progress()
2) repair_meta is stored in the _repair_metas map.
3) Repair is shut down, but repair_meta is not removed from the _repair_metas map.
4) Shutdown of the database waits on the utils::phased_barrier.
To fix, we should stop and remove all the repair_metas from the _repair_metas map.
Tests: 30 successful runs of the repair_kill_2_test
Fixes: #4044
Repair follower nodes will create repair meta object when repair master
node starts a repair. Normally, the repair meta object is removed when
repair master finishes the repair and sends the verb
REPAIR_ROW_LEVEL_STOP to all the followers to remove the repair meta
object. In case the repair master is killed suddenly, no one will remove
the repair meta object.
To prevent keeping this repair meta object forever, we should remove
such objects when gossip detects a node is dead with the gossip
listener.
Fixes: #4380
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
We remove the repair_meta object in remove_repair_meta upon receiving the
stop row level repair rpc verb. It is possible that there is a pending
operation on the repair_meta. To avoid use after free, we should not remove
the repair_meta object until all the pending operations are done.
Use a gate to protect it.
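The gate pattern used here can be sketched in Python (an illustrative model of seastar::gate, not the actual implementation):

```python
import threading

class Gate:
    # close() blocks until every pending operation has left the gate, so
    # the repair_meta object is only removed once nothing can still be
    # using it; new entries are rejected after close().
    def __init__(self):
        self._pending = 0
        self._closed = False
        self._cv = threading.Condition()

    def enter(self):
        with self._cv:
            if self._closed:
                raise RuntimeError("gate closed")
            self._pending += 1

    def leave(self):
        with self._cv:
            self._pending -= 1
            self._cv.notify_all()

    def close(self):
        with self._cv:
            self._closed = True
            while self._pending:
                self._cv.wait()

gate = Gate()
gate.enter()   # a pending repair_meta operation
gate.leave()   # ...finishes
gate.close()   # now it is safe to destroy the repair_meta
```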
Fixes: #4333 Fixes: #4314
Tests: 50 successful runs of repair_additional_test.py:RepairAdditionalTest.repair_kill_2_test
ignore_ready_future in load_new_sstables broke
migration_test:TestMigration_with_*.migrate_sstable_with_counter_test_expect_fail dtests.
The java.io.NotSerializableException in nodetool was caused by exceptions that
were too long.
This fix prints the problematic file names onto the node system log
and includes the cause in the resulting exception so as to provide the user
with information about the nature of the error.
Fixes#4375
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190331154006.12808-1-bhalevy@scylladb.com>
"
To make the offline installer easier we need to minimize dependencies as
much as possible.
Python dependencies were already dropped when Glauber added relocatable
python3; now it's time to drop the rest of the command line tools used by
the scylla setup tools.
(Even though the scripts are converted to python3, they still execute some
external commands, so those commands should be distributed with the offline
installer.)
Note that some CLI tools, such as the NTP and RAID stuff, haven't been added,
since those tools have daemons, not just a CLI.
To use such stuff in offline mode, users have to install them manually.
But both NTP setup and RAID setup are optional; users can still run Scylla
w/o them.
"
Toolchain updated to docker.io/scylladb/scylla-toolchain:fedora-29-20190401
for changes in install-dependencies.sh; also updates to gnutls 3.6.7 security
release.
* 'reloc_clitools_v5' of https://github.com/syuu1228/scylla:
reloc: add relocatable CLI tools for scylla setup scripts
dist/redhat: drop systemd-libs from dependency
dist/redhat: drop file from dependency since it seems unused
dist/redhat: drop pciutils from dependency since it only used in DPDK mode
Truncate would make the disk usage stat go wild because it isn't updated
when sstables are removed in table::discard_sstables(). Let's update
the stat after sstables are removed from the sstable set.
Fixes#3624.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190328154918.25404-1-raphaelsc@scylladb.com>
Since we don't use DPDK mode by default, and the mode is not officially
supported, drop pciutils from the package dependencies.
Users who want to use DPDK mode need to install the package
manually.
* seastar 05efbce...5572de7 (5):
> posix_file_impl::list_directory: do not ignore symbolic link file type
> prometheus: yield explicitly after each metric is processed
> thread: add maybe_yield function
> metrics: add vector overload of add_group()
> memory: tone down message for memory allocator
Currently the scylla-python3 package name is hardcoded; we need to support
package name renaming just like for the other scylla packages.
This is required to release the enterprise version.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190329003941.12289-1-syuu@scylladb.com>
"
The file must either be owned by the process uid,
or the process must have both read and write access to it,
so that it can be (hard) linked when the sysctl
fs.protected_hardlinks is enabled.
Fixes#3117
"
* 'projects/valid_owner_and_mode/v3-rebased' of https://github.com/bhalevy/scylla:
storage_service: handle load_new_sstables exception
init: validate file ownership and mode.
treewide: use std::filesystem
Files and directories must be owned by the process uid.
Files must have read access and directories must have
read, write, and execute access.
Refs #3117
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than {std::experimental,boost,seastar::compat}::filesystem
On Sat, 2019-03-23 at 01:44 +0200, Avi Kivity wrote:
> The intent for seastar::compat was to allow the application to choose
> the C++ dialect and have seastar follow, rather than have seastar choose
> the types and have the application follow (as in your patch).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When generating view updates for base mutations when no pre-existing
data exists, we were forgetting to apply the tracked tombstones.
Fixes#4321
Tests: unit(dev)
* https://github.com/duarten/scylla materialized-views/4321/v1.1:
db/view: Apply tracked tombstones for new updates
tests/view_schema_test: Add reproducer for #4321
When generating view updates for base mutations when no pre-existing
data exists, we were forgetting to apply the tracked tombstones.
Fixes#4321
Tests: unit(dev)
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
Variable length integers are used extensively by the SSTables mc
format. The current deserialisation routine is quite naive in that
it reads each byte separately. Since those vints usually appear inside
much larger buffers, we can optimise for such cases: read 8 bytes at once
and then mask out the unneeded parts (as well as fix their byte order,
because big-endian).
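The core idea of the optimisation can be sketched as follows (Python sketch; heavily simplified, as real mc vints also carry length bits in the first byte, and reading a full 8-byte word is why the deserialiser needs to know the full buffer length):

```python
import struct

def decode_be_prefix(buf, length):
    # Read a full 8-byte big-endian word in one go, then shift away the
    # bytes that belong to the next value, instead of consuming the
    # encoded integer byte by byte.
    word, = struct.unpack_from(">Q", buf, 0)  # needs >= 8 readable bytes
    return word >> (8 * (8 - length))

buf = b"\x01\x02\x03\x04\x05\x06\x07\x08"
assert decode_be_prefix(buf, 2) == 0x0102
assert decode_be_prefix(buf, 4) == 0x01020304
```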
Tests: unit(dev).
perf_vint (average time per element when deserializing 1000 vints):
before:
vint.deserialize 69442000 14.400ns 0.000ns 14.399ns 14.400ns
after:
vint.deserialize 241502000 4.140ns 0.000ns 4.140ns 4.140ns
perf_fast_forward (data on /tmp):
large-partition-single-key-slice on dataset large-part-ds1:
before:
range time (s) iterations frags frag/s mad f/s max f/s min f/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> [0, 1] 0.000278 8792 2 7190 119 7367 1960 3 104 2 0 0 1 1 0 0 1 100.0%
-> [1, 100) 0.000344 96 99 288100 4335 307689 193809 2 108 2 0 0 1 1 0 0 1 100.0%
-> (100, 200] 0.000339 13254 100 295263 2824 301734 222725 2 108 2 0 0 1 1 0 0 1 100.0%
after:
range time (s) iterations frags frag/s mad f/s max f/s min f/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> [0, 1] 0.000236 10001 2 8461 59 8718 2261 3 104 2 0 0 1 1 0 0 1 100.0%
-> [1, 100) 0.000285 89 99 347500 2441 355826 215745 2 108 2 0 0 1 1 0 0 1 100.0%
-> (100, 200] 0.000293 14369 100 341302 1512 350123 222049 2 108 2 0 0 1 1 0 0 1 100.0%
"
* tag 'optimise-vint/v2' of https://github.com/pdziepak/scylla:
sstable: pass full length of buffer to vint deserialiser
vint: optimise deserialisation routine
vint: drop deserialize_type structure
tests/vint: reduce test dependencies
tests/perf: add performance test for vint serialisation
"
This series introduces a rudimentary sstables manager
that will be used for making and deleting sstables, and for
tracking thereof.
The motivation for having a sstables manager is detailed in
https://github.com/scylladb/scylla/issues/4149.
The gist of it is that we need a proper way to manage the life
cycle of sstables to solve potential races between compaction
and various consumers of sstables, so they don't get deleted by
compaction while being used.
In addition, we plan to add global statistics methods like returning
the total capacity used by all sstables.
This patchset changes the way class sstable gets the large_data_handler.
Rather than passing it separately for writing the sstable and when deleting
sstables, we provide the large_data_handler when the sstable object is
constructed and then use it when needed.
Refs #4149
"
* 'projects/sstables_manager/v3' of https://github.com/bhalevy/scylla:
sstables: provide large_data_handler to constructor
sstables_manager: default_sstable_buffer_size need not be a function
sstables: introduce sstables_manager
sstables: move shareable_components def to its own header
tests: use global nop_lp_handler in test_services
sstables: compress.hh: add missing include
sstables: reorder entry_descriptor constructor params
sstables: entry_descriptor: get rid of unused ctor
sstables: make load_shared_components a method of sstable
sstables: remove default params from sstable constructor
database: add table::make_sstable helper
distributed_loader: pass column_family to load_sstables_with_open_info
distributed_loader: no need for forward declaration of load_sstables_with_open_info
distributed_loader: reshard: use default params for make_sstable
The goal of the sstables manager is to track and manage the sstables life-cycle.
There is an sstables manager instance per database and it is passed to each
column family (and test environment) on construction.
All sstables created, loaded, and deleted pass through the sstables manager.
The manager will make sure consumers of sstables are in sync so that sstables
will not be deleted while in use.
Refs #4149
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The goal is to construct sstables only via make_sstable,
which will be moved to class sstables_manager in a later patch.
Defining the default values in both interfaces is unneeded
and may lead to them going out of sync.
Therefore, have only make_sstable provide the default parameter values.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In most cases we make a sstable based on the table schema
and soon - large_data_handler.
Encapsulate that in a make_sstable method.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
Most of the binaries we link in a debug build are linked with -s, so
the only impact is build/debug/scylla, which grows by 583 MiB when
using --compress-exec-debuginfo=0.
On the other hand, not having to recompress all the debug info from
all the used object files is a pretty big win when debugging an issue.
For example, linking build/debug/scylla goes from
56.01s user 15.86s system 220% cpu 32.592 total
to
27.39s user 19.51s system 991% cpu 4.731 total
Note how the cpu time is "only" 2x better, but given that compressing
debug info is a long serial task, the wall time is 6.8x better.
Tests: unit (debug)
"
* 'espindola/dont-compress-debug-v5' of https://github.com/espindola/scylla:
configure: Add a --compress-exec-debuginfo option
configure: Move some flags from cxx_ld_flags to cxxflags
configure: rename per mode opt to cxx_ld_flags
configure: remove per mode libs
configure: remove sanitize_libs and merge sanitize into opt
configure: split a ld_flags_{mode} out of cxxflags_{mode}
Function messaging_service::get_rpc_client() is supposed to either return
an existing client or create one and return it. The function is supposed to
be atomic, so after checking that the requested client does not exist it is
safe to assume emplace() will succeed. But we have seen bugs that made the
function non-atomic. Let's add an assert that will help catch such bugs
more easily if they happen again in the future.
Message-Id: <20190326115741.GX26144@scylladb.com>
"
Fixes #4348
v2 changes:
* added a unit test
This miniseries fixes decimal/varint serialization - it did not update
the output iterator in all cases, which may lead to overwriting decimal data
if any other value follows them directly in the same buffer (e.g. in a tuple).
It also comes with a reproducing unit test covering both decimals and varints.
Tests: unit (dev)
dtest: json_test.FromJsonUpdateTests.complex_data_types_test
json_test.FromJsonInsertTests.complex_data_types_test
json_test.ToJsonSelectTests.complex_data_types_test
"
* 'fix_varint_serialization_2' of https://github.com/psarna/scylla:
tests: add test for unpacking decimals
types: fix varint and decimal serialization
Varint and decimal types serialization did not update the output
iterator after generating a value, which may lead to corrupted
sstables - variable-length integers were properly serialized,
but if anything followed them directly in the buffer (e.g. in a tuple),
their value would be overwritten.
Fixes #4348
Tests: unit (dev)
dtest: json_test.FromJsonUpdateTests.complex_data_types_test
json_test.FromJsonInsertTests.complex_data_types_test
json_test.ToJsonSelectTests.complex_data_types_test
Note that dtests still do not succeed 100% due to formatting differences
in compared results (e.g. 1.0e+07 vs 1.0E7), but it's no longer a query
correctness issue.
I noticed a test failure with
Mutation inequality is not symmetric for ...
And the difference between the two mutations was that one atomic_cell
was live and the other wasn't.
Looking at the code I found a few cases where the comparison was not
symmetrical. This patch fixes them.
This patch will not fix the test, as it will now fail with a
"Mutations differ" error, but that is probably an independent issue.
Ref #3975.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190325194647.54950-1-espindola@scylladb.com>
When we call perftune.py in order to get a particular mode's cpu set
(e.g. mode=sq_split) it may fail and print an error message to stderr because
there are too few CPUs for a particular configuration mode (e.g. when
there are only 2 CPUs and the mode is sq_split).
We already treat these situations correctly however we let the
corresponding perftune.py error message get out into the syslog.
This is definitely confusing, stressful and annoying.
Let's not let these messages out.
Fixes #4211
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190325220018.22824-1-vladz@scylladb.com>
Compilation fails with fmt release 5.3.0 when we print a bytes_view
using "{}" formatter.
The compiler's complaint is: "error: static assertion failed: mismatch between char-types of context and argument"
Resolve this by explicitly using the operator<<() across the whole
operator<<(std::ostream& os, const result_message::rows& msg) function.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20190325203628.5902-1-vladz@scylladb.com>
has_scylla_component is always false before loading the sstable.
Also, return an exceptional future rather than throwing.
Hit with the following dtests:
counter_tests.TestCounters.upgrade_test
counter_tests.TestCountersOnMultipleNodes.counter_consistency_node_*_test
resharding_test.ReshardingTest_nodes?_with_*CompactionStrategy.resharding_counter_test
update_cluster_layout_tests.TestUpdateClusterLayout.increment_decrement_counters_in_threads_nodes_restarted_test
Fixes #4306
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190326084151.18848-1-bhalevy@scylladb.com>
"
This series removes the usage of the static gossiper object in init.cc
and storage_service.
Follow up series will remove more in other components. This is the
effort to clean up the component dependencies and have better shutdown
procedure.
Tests: tests/gossip_test, tests/cql_query_test, tests/sstable_mutation_test, dtests.
"
* tag 'asias/storage_service_gossiper_dep_v5' of github.com:cloudius-systems/seastar-dev:
storage_service: Do not use the global gms::get_local_gossiper()
storage_service: Pass gossiper object to storage_service
gms: Remove i_failure_detector.hh
gossip: Get rid of the gms::get_local_failure_detector static object
dht: Do not use failure_detector::is_alive in failure_detector_source_filter
tests: Fix stop snitch in gossip_test.cc
gossiper: Do not use value_factory from storage_service object
gossiper: Use cfg options from _cfg instead of get_local_storage_service
gossiper: Pass db::config object to gossiper class
init: Pass gossiper object to init_ms_fd_gossiper
* seastar 33baf62...caa98f8 (8):
> Merge "Add file_accessible and file_stat methods" from Benny
> future::then: use std::terminate instead of abort
> build: Allow cooked dependencies with configure.py
> tests: Show a test's output when it fails
> posix_file_impl: Bypass flush() call iff opened with O_DSYNC
> posix_file_impl: Propagate and keep open_flags
> open_flags: Add O_DSYNC value
> build: Forward variables to CMake correctly
"
Both cql3_type and abstract_type are normally used inside
shared_ptr. This creates a problem when an abstract_type needs to refer
to a cql3_type as that creates a cycle.
To avoid warnings from asan, we were using a std::unordered_map to
store one of the edges of the cycle. This avoids the warning, but
wastes even more memory.
Even before this series cql3_type was a fairly lightweight
structure. This patch pushes in that direction and now cql3_type is a
struct with a single member variable, a data_type.
This avoids the reference cycle and is easier to understand IMHO.
The one corner case is varchar. In the old system cql3_type::varchar
and cql3_type::text don't compare equal, but they both map to the same
data_type.
In the new system they would compare equal, so we avoid the confusion
by just removing the cql3_type::varchar variable.
Tests: unit (dev)
"
* 'espindola/merge-cq3-type-and-type-v3' of https://github.com/espindola/scylla:
Turn cql3_type into a trivial wrapper over data_type
Delete cql3_type::varchar
Simplify db::cql_type_parser::parse
Add a test for the varchar column representation
The crash observed in issue #4335 happens because
delete_large_data_entries is passed a deleted name.
Normally we don't get a crash, but a garbage name and we fail to
delete entries from system.large_*.
Adding a test for the fix found another issue, which the second patch
in this series fixes.
Tests: unit (dev)
Fixes #4335.
* https://github.com/espindola/scylla guthub/fix-use-after-free-v4:
large_data_handler: Fix a use after destruction
large_data_handler: Make a variable non static
Allow large_data_handler to be stopped twice
Allow table to be stopped twice
Test that large data entries are deleted
"
Validate the to-be-loaded sstables in the open_sstable phase and handle any exceptions before calling cf.get_row_cache().invalidate.
Currently, if an exception is thrown from distributed_loader::open_sstable, cf._sstables_opened_but_not_loaded may be left partially populated.
Fixes #4306
Tests: unit (dev)
- next-gating dtests (dev)
- migration_test:TestMigration_with_2_1_x.migrate_sstable_with_counter_test
migration_test:TestMigration_with_2_1_x.migrate_sstable_with_counter_test_expect_fail
- with bypassing exception in distributed_loader::flush_upload_dir
to trigger the exception in table::open_sstable
"
* 'issues/4306/v3' of https://github.com/bhalevy/scylla:
table: move sstable counters validation from load_sstable to open_sstable
distributed_loader::load_new_sstables: handle exceptions in open_sstable
"
Aligned way to build relocatable rpm with existing relocatable packages.
"
* 'relocatable-python3-fix-v3' of https://github.com/syuu1228/scylla:
reloc: allow specify rpmbuild dir
reloc/python3: archive package version number on build_reloc.sh
reloc/python3: archive rpm build script in the relocatable package, build rpm using the script
relloc/python3: fix PyYAML package name
reloc: rename python3 relocatable package filename to align same style with other packages
reloc: move relocatable python build scripts to reloc/python3 and dist/redhat/python3
When a view replica becomes unavailable, updates to it are stored as
hints at the paired base replica. This on-disk queue of pending view
updates grows as long as there are view updates and the view replica
remains unavailable. Currently, we take that relative queue size into
account when calculating the delay for new base writes, in the context
of the backpressure algorithm for materialized views.
However, the way we're calculating that on-disk backlog is wrong,
since we calculate it per-device and then feed it to all the hints
managers for that device. This means that normal hints will show up as
backlog for the view hints manager, which in turn introduces delays.
This can make the view backpressure mechanism kick-in even if the
cluster uses no materialized views.
There's yet another way in which considering the view hints backlog is
wrong: a view replica that is unavailable for some period of time can
cause the backlog to grow to a point where all base writes are applied
the maximum delay of 1 second. This turns a single-node failure into
cluster unavailability.
The fix to both issues is to simply not take this on-disk backlog into
account for the backpressure algorithm.
Fixes #4351
Fixes #4352
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190321170418.25953-1-duarte@scylladb.com>
Since we archive the rpm/deb build script in the relocatable package and
build rpm/deb using that script, align the python relocatable package too.
Also added SCYLLA-RELOCATABLE-FILE, SCYLLA-RELEASE-FILE and SCYLLA-VERSION-FILE
since these files are required for relocatable package.
Debugging continuations is challenging. There is no support from gdb for
finding out which continuation was this continuation called from, nor
what other continuations are attached to it. GDB's `bt` command is of
limited use: at best a handful of continuations will appear in the
backtrace - those that were ready. This series attempts to fill part of
this void and provides a command that answers the latter question: what
continuations are attached to this one?
`scylla fiber` allows for walking a continuation chain, printing each
continuation. It is supposed to be the seastar equivalent of `bt`.
The continuation chain is walked starting from an arbitrary task,
specified by the user. The command will print all continuations attached
to the specified task.
This series also contains some loosely related cleanup of existing
commands and code in `scylla-gdb.py`.
* https://github.com/denesb/scylla.git scylla-fiber-gdb-command/v4:
scylla-gdb.py: fix static_vector
scylla-gdb.py: std_unique_ptr: add get() method
scylla-gdb.py: fix existing documentation
scylla-gdb.py: fix tasks and task-stats commands
scylla-gdb.py: resolve(): add cache parameter
scylla-gdb.py: scylla_ptr: move actual logic into analyze()
scylla-gdb.py: scylla_ptr: make analyze() usable for outside code
scylla-gdb.py: scylla_ptr: accept any valid gdb expression as input
scylla-gdb.py: add scylla fiber command
Store the failure_detector object inside gossiper object.
- No more the global object sharded<failure_detector>
- No need to initialize sharded<failure_detector> manually which
simplifies the code in tests/cql_test_env.cc and init.cc.
Switch failure_detector_source_filter to use get_local_gossiper::is_alive
directly since we are going to remove the static
gms::get_local_failure_detector object soon.
Pass the nodes that are down to the filter directly, to avoid the
range_streamer depending on gossiper at all.
This area is hard to test since we only issue deletes during
compaction and we wait for deletes only during shutdown.
That is probably worth it, seeing that two independent bugs would have
been found by this test.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The default is the old behavior, but it is now possible to configure
with --compress-exec-debuginfo=0 to get faster links but larger
binaries.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
It is the same name used in the build.ninja file.
A followup patch will add cxxflags and move compiler only flags there.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These are flags we want to pass to both compilation and linking. There
is nothing special about the fact that they are sanitizer related.
With {sanitize} being passed to the link, we don't need {sanitize_libs}.
We do need to make sure -fno-sanitize=vptr is the last one in the
command line. Before we were implicitly getting it from seastar, but
it is bad practice to get some sanitizer flags from seastar but not
others.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
This series adds moving sstables uploaded via `nodetool refresh` to
staging/ directory if they require generating view updates from them.
Previous behavior (leaving these sstables in upload/ directory until
view updates are generated) might have caused sstables with
conflicting names to be mistakenly overwritten by the user.
Fixes #4047
Tests: unit (dev)
dtest: backup_restore_tests.py + backup_restore_tests.py modified with
having materialized view definitions
"
* 'use_staging_directory_for_uploaded_sstables_awaiting_view_updates' of https://github.com/psarna/scylla:
sstables: simplify requires_view_building
loader: move uploaded view pending sstables to staging
Current code captures a reference to rpc::client in a continuation, but
there is no guaranty that the reference will be valid when continuation runs.
Capture shared pointer to rpc::client instead.
Fixes #4350.
Message-Id: <20190314135538.GC21521@scylladb.com>
When we're populating a partition range and the population range ends
with a partition key (not a token) which is present in sstables and
there was a concurrent memtable flush, we would abort on the following
assert in cache::autoupdating_underlying_reader:
utils::phased_barrier::phase_type creation_phase() const {
assert(_reader);
return _reader_creation_phase;
}
That's because autoupdating_underlying_reader::move_to_next_partition()
clears the _reader field when it tries to recreate a reader but it finds
the new range to be empty:
if (!_reader || _reader_creation_phase != phase) {
if (_last_key) {
auto cmp = dht::ring_position_comparator(*_cache._schema);
auto&& new_range = _range.split_after(*_last_key, cmp);
if (!new_range) {
_reader = {};
return make_ready_future<mutation_fragment_opt>();
}
Fix by not asserting on _reader. creation_phase() will now be
meaningful even after we clear the _reader. The meaning of
creation_phase() is now "the phase in which the reader was last
created or 0", which makes it valid in more cases than before.
If the reader was never created we will return 0, which is smaller
than any phase returned by cache::phase_of(), since cache starts from
phase 1. This shouldn't affect current behavior, since we'd abort() if
called for this case, it just makes the value more appropriate for the
new semantics.
Tests:
- unit.row_cache_test (debug)
Fixes #4236
Message-Id: <1553107389-16214-1-git-send-email-tgrabiec@scylladb.com>
In commit 71bf757b2c, we call
wait_for_gossip_to_settle() which takes some time to complete in
storage_service::prepare_to_join().
tests/cql_query_test calls init_server with do_bind == false, which in
turn calls storage_service::prepare_to_join(). Since in the test there
is only one node, there is no point waiting for gossip to settle.
To make the cql_query_test fast again, do not call
wait_for_gossip_to_settle if do_bind is false.
Before this patch, cql_query_test takes forever to complete.
After it takes 10s.
Tests: tests/cql_query_test
Message-Id: <3ae509e0a011ae30eef3f383c6a107e194e0e243.1553147332.git.asias@scylladb.com>
"
This series adds support for local indexing, i.e. when the index table
resides on the same partition as base data.
It addresses the performance issue of having an indexed query
that also specifies a partition key: the index will be queried
locally.
"
* 'add_local_indexing_11' of https://github.com/psarna/scylla: (30 commits)
tests: add cases for local index prefix optimization
tests: add create/drop local index test case
tests: add non-standard names cases to local index tests
tests: add multi pk case for local index tests
tests: add test for malformed local index definitions
tests: add local index paging test
tests: add local indexing test
cql3: add CREATE INDEX syntax for local indexes
cql3: use serialization function to create index target string
index: add serialization function for index targets
index: use proper local index target when adding index
index: add parsing target column name from local index targets
db: add checking for local index in schema tables
index: add checking if serialized target implies local index
index: enable parsing multi-key targets
index: move target parser code to .cc file
json: add non-throwing overload for to_json_value
cql3: add checking for local indexes in has_supporting_index()
cql3: move finding index restrictions to prepare stage
cql3: add picking an index by score
...
When a materialized view was created, the verification code artificially
forbade creating a view without a clustering key column. However, there
is no real reason to forbid this. In the trivial case, the original base
table might not have had a clustering key, and the view might want to use
the exact same key. In a more complex case, a view may want to have all the
primary key columns as *partition* key columns, and that should be fine.
The patch also includes a regression test, which failed before this patch,
and succeeds with it (we test that we can create materialized views in both
aforementioned scenarios, and these materialized views work as expected).
Duarte raised the opinion that the "trivial" case of a view table with
a key identical to that of the base should be disallowed. However, this
should be done, if at all (I think it shouldn't), in a follow-up patch,
which will implement the non-triviality requirement consistently (e.g.,
require view primary key to be different from base's, regardless of
the existence or non-existence of clustering columns).
Fixes #4340.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190320122925.10108-1-nyh@scylladb.com>
When we are tracing requests, we would like to know everything that
happened to a query that can contribute to it having increased
latencies.
We insert some of those latencies explicitly due to throttling, but we
do not log that into tracing.
In the case of storage proxy, we do have a log message at trace level
but that is rarely used: trace messages are too heavy of a hammer, there
is no way to specify specific queries, etc.
The correct place for that is CQL tracing. This patch moves that message
to CQL tracing. We also add a matching tracepoint assuring us that no
delay happened if that's the case.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190320163350.15075-1-glauber@scylladb.com>
"
Static compact tables are tables with compact storage and no
clustering columns.
Before this patch, Scylla was writing rows of static compact tables as
clustered rows instead of as static rows. That's because in our in-memory
model such tables have regular rows and no static row. In Cassandra's
schema (since 3.x), those tables have columns which are marked as
static and there are no regular columns.
This worked fine as long as Scylla was writing and reading those
sstables. But when importing sstables from Cassandra, our reader was
skipping the static row, since it's not present in our schema, and
returning no rows as a result. Also, Cassandra, and Scylla tools,
would have problems reading those sstables.
Fix this by writing rows for such tables the same way as Cassandra
does. In order to support rolling downgrade, we do that only when all
nodes are upgraded.
Fixes #4139.
Tests:
- unit (dev)
"
* tag 'static-compact-mc-fix-v3.1' of github.com:tgrabiec/scylla:
tests: sstables: Test reading of static compact sstable generated by Cassandra
tests: sstables: Add test for writing and reading of static compact tables
sstables: mc: Write static compact tables the same way as Cassandra
sstable: mc: writer: Set _static_row_written inside write_static_row()
sstables: Add sstable::features()
sstables: mc: writer: Prepare write_static_row() for working with any column_kind
storage_service: Introduce the CORRECT_STATIC_COMPACT feature flag
sstables: mc: writer: Build indexed_columns together with serialization_header
sstables: mc: writer: De-optimize make_serialization_header()
sstable: mc: writer: Move attaching of mc-specific components out of generic code
Both cql3_type and abstract_type are normally used inside
shared_ptr. This creates a problem when an abstract_type needs to refer
to a cql3_type as that creates a cycle.
To avoid warnings from asan, we were using a std::unordered_map to
store one of the edges of the cycle. This avoids the warning, but
wastes even more memory.
Even before this patch cql3_type was a fairly lightweight
structure. This patch pushes in that direction and now cql3_type is a
struct with a single member variable, a data_type.
This avoids the reference cycle and is easier to understand IMHO.
Tests: unit (dev)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
varchar is just an alias for text. Handle that conversion directly in
the parser and delete the cql3_type::varchar variable.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Since its first version, db::cql_type_parser::parse had special cases
for native and user defined types.
Those are not necessary, as the general parser has no problem handling
them.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The value computed is not static since
f254664fe6, but unfortunately that was
missed in that commit.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The path leading to the issue was:
The sstable name is allocated and passed to maybe_delete_large_data_entries by reference
auto name = sst->get_filename();
return large_data_handler.maybe_delete_large_data_entries(*sst->get_schema(), name, sst->data_size());
A future is created with a reference to it
large_partitions = with_sem([&s, &filename, this] {
return delete_large_data_entries(s, filename, db::system_keyspace::LARGE_PARTITIONS);
});
The semaphore blocks.
The filename is destroyed.
delete_large_data_entries is called with a destroyed filename.
The reason this did not reproduce trivially in a debug build was that
the sstable itself was in the stack and the destructed value was read
as an internal value, and so asan had nothing to complain about.
Unfortunately we also had no tests that the entry in
system.large_rows was actually deleted.
This patch passes the name by value. It might create up to 3 copies of
it. If that is too inefficient it can probably be avoided with a
do_with in maybe_delete_large_data_entries.
Fixes #4335
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Flags that we want to pass to gcc during compilation and linking are
in cxx_ld_flags_{mode}.
With this patch, we no longer pass
-I. -I build/{mode}/gen
to the link, which should have no impact.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The repair reader depends on the table object being alive, while it is
reading. However, for local reads, there was no synchronization between
the lifecycle of the repair reader and that of the table. In some cases
this can result in use-after-free. Solve by using the table's existing
mechanism for lifecycle extension: `read_in_progress()`.
For the non-local reader, when the local node's shard configuration is
different from the remote one's, this problem is already solved, as the
multishard streaming reader already pins table objects on the used
shards. This creates an inconsistency that might be surprising (in a bad
way). One reader takes care of pinning needed resources while the other
one doesn't. I was torn on how to reconcile this, and decided to go
with the simplest solution, explicitly pinning the table for local
reads, that is, conserving the inconsistency. It was suggested that this
inconsistency be remedied by building resource pinning into the local
reader as well [1], but there is opposition to this [2]. Adding a wrapper
reader which does just the resource pinning seems excessive, both in
code and runtime overhead.
Spotted while investigating repair-related crashes which occurred during
interrupted repairs.
Fixes: #4342
[1] https://github.com/scylladb/scylla/issues/4342#issuecomment-474271050
[2] https://github.com/scylladb/scylla/issues/4342#issuecomment-474331657
Tests: none, this is a trivial fix for a not-yet-seen-in-the-wild bug.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <8e84ece8343468960d4e161467ecd9bb10870c27.1553072505.git.bdenes@scylladb.com>
When loading tables uploaded via `nodetool refresh`, they used to be
left in upload/ directory if view updates would need to be generated
from them. Since view update generation is asynchronous, sstables
left in the directory could erroneously get overwritten if the user
decides to upload another batch of sstables and some of the names
collide.
To remedy this, uploaded sstables that need view updates are moved
to staging/ directory with a unique generation number, where they
await view update generation.
Fixes #4047
Mounting /sys/fs/cgroup inside the image causes docker cgroup to not
be mounted internally. Therefore, hosts cannot limit resources on
Scylla. This patch removes the cgroup volume mount, allowing folders
under /sys/fs/cgroup to be created inside docker.
Message-Id: <20190320122053.GA20256@shenzou.localdomain>
db::schema_tables::ALL and db::schema_tables::all_tables() are both supposed
to list the same schema tables - the former is the list of their names, and
the latter is the list of their schemas. This code duplication makes it easy
to forget to update one of them, and indeed recently the new
"view_virtual_columns" was added to all_tables() but not to ALL.
What this patch does is to make ALL a function instead of constant vector.
The newly named all_table_names() function uses all_tables() so the list
of schema tables only appears once.
So that nobody worries about the performance impact, all_table_names()
caches the list in a per-thread vector that is only prepared once per thread.
Because after this patch all_table_names() has the "view_virtual_columns"
that was previously missing, this patch also fixes#4339, which was about
virtual columns in materialized views not being propagated to other nodes.
Unfortunately, to test the fix for #4339 we need a test with multiple
nodes, so we cannot test it here in a unit test, and will instead use
the dtest framework, in a separate patch.
Fixes #4339
Branches: 3.0
Tests: all unit tests (release and debug mode), new dtest for #4339. The unit test mutation_reader_test failed in debug mode but not in release mode, but this probably has nothing to do with this patch (?).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190320063437.32731-1-nyh@scylladb.com>
The materialized views flow control mechanism works by adding a certain
delay to each client request, designed to slow down the client to the
rate at which we can complete the background view work. Until now we could observe
this mechanism only indirectly, in whether or not it succeeded to keep the
view backlog bounded; But we had no way to directly observe the delay that
we decided to add. In fact, we had a bug where this delay was constantly
zero, and we didn't even notice :-)
So in this patch we add a new metric,
scylla_storage_proxy_coordinator_last_mv_flow_control_delay
The metric is a floating point number, in units of seconds.
This metric is somewhat peculiar in that it always contains the *last* delay
used for some request - unlike other metrics it doesn't measure the "current"
value of something. Moreover, it can jump wildly because there is no
guarantee that each request's delay will be identical (in particular,
different requests may involve different base replicas which have different
view backlogs, so decide on different delays). In the future we may want
to supplement this metric with some sort of delay histogram. But even
this simple metric is already useful to debug certain scenarios and
understand if the materialized-views flow control is working or not.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190227133630.26328-1-nyh@scylladb.com>
Scylla built using the frozen toolchain needs to be debugged
on a system with matching libraries. It's easiest if it's also done on the same image.
Install gdb in the image so that it's always out there when we need it.
Fixes #4329
Message-Id: <1553072393-9145-1-git-send-email-tgrabiec@scylladb.com>
In order to create a local index, the syntax used is:
CREATE INDEX t ON ((p1, p2, p3), v);
where (p1, p2, p3) are partition key columns (all of them),
and v is the indexed column.
With global indexes, target column name is always the same as the string
kept in 'options[target]' field. It's not the case for local indexes,
and so a proper extracting function is used to get the value.
When (re)creating a local index, the target string needs to be used
to parse out the actual indexed column:
"(base_pk_part1,base_pk_part2,base_pk_part3),actual_indexed_column".
This column is later used to determine if an index should be applied
to a SELECT statement.
With local indexes it's not sufficient to check if a single
restriction is supported by an index in order to decide
that it can be used, because local indexes can be leveraged
only when the full partition key is properly restricted.
(It also serves as a great example why restrictions code
would greatly benefit from a facelift! :) )
Index restrictions that match a given index were recomputed
during execution stage, which is redundant and prone to errors.
Now, used index restrictions are cached in a prepare statement.
Instead of choosing the first index that we find (in column def order),
the index with highest score is picked. Currently local indexes
score higher than global ones if restrictions allow local indexing
to be applied.
When computing paging state for local indexes, the partition
and clustering keys are different than with global ones:
- partition key is the same as base's
- clustering key starts with the indexed column
It already accepts several arguments that can be extracted from 'this',
and more will be added in the future.
New parameters include lambdas prepared during prepare stage
that define how to extract partition/clustering key ranges depending
on which index is used, so keeping it a static function would result
in an unbounded number of parameters with complex types, which would
in turn make the function header almost illegible for a reader.
Hence, read_posting_list becomes a member function with easy access
to any data prepared during prepare stage.
Instead of having just one column definition, index target is now
a variant of either single column definition or a vector of them.
The vector is expected to be used when part of a target definition
is enclosed in parentheses:
$ CREATE INDEX ON t((p),v);
or
$ CREATE INDEX ON t((p1,p2), v);
etc.
This feature will allow providing (possibly composite) base partition key
to CREATE INDEX statement, which will result in creating a local index.
When the index is local, its partition key in the underlying materialized
view is the same as the base's, and the indexed column is the first
clustering key. This implementation ensures that view and base rows
will reside on the same partition, while querying the indexed column
will be possible by putting it as a first clustering key part.
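The key layout described above can be sketched as (a hypothetical helper, for illustration only):

```python
def local_index_view_keys(base_pk, base_ck, indexed_column):
    # view partition key == base partition key (same-node placement);
    # the indexed column becomes the first clustering key part,
    # followed by the base clustering key
    return list(base_pk), [indexed_column] + list(base_ck)
```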
Our guidelines dictate that each header is self-sufficient, i.e.
after including it into an empty .cc file, the .cc file can be compiled
without having to include any other header file.
Currently we don't have any tool to check that a header is
self-sufficient. This patch aims to remedy that by adding a target to check
each header, as well as a target to check all the headers.
For each header a target is generated that does the equivalent of
including the header into an empty .cc file, then compiling the
resulting .cc file. This target is called {header_name}.o, so for
the header `myheader.hh` this will be `build/dev/myheader.hh.o`
(if the dev build-mode is used).
Also a target, `checkheaders`, is added which validates all headers in
the project. This currently fails as we have many headers that are not
self-sufficient.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <fdf550dc71203417252f1d8144e7a540eec074a1.1552636812.git.bdenes@scylladb.com>
The scylla fiber command traverses a continuation chain, given an
arbitrary task pointer.
Example (cropped for brevity):
(gdb) scylla fiber this
#0 (task*) 0x0000600000550360 0x000000000468ac40 vtable for seastar...
#1 (task*) 0x0000600000550300 0x00000000046c3778 vtable for seastar...
#2 (task*) 0x00006000018af600 0x00000000046c37a0 vtable for seastar...
#3 (task*) 0x00006000005502a0 0x00000000046c37f0 vtable for seastar...
#4 (task*) 0x0000600001a65e10 0x00000000046c6b10 vtable for seastar...
scylla fiber can be passed any expression that evaluates to a task
pointer. C++ variables, raw addresses and GDB variables (e.g. $1) all
work.
The command works by scanning the task object for pointers. If a pointer
is found it is dereferenced. If successful, it checks that the pointer
points to a vtable whose class is a known task.
If this succeeds the found task is saved, the scan then recursively
proceeds to scan the newly found task until a task with no further
attached continuations is found.
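The scanning loop can be modelled roughly like this (a Python sketch with hypothetical callbacks standing in for GDB memory access; the real command is part of scylla-gdb.py):

```python
def walk_fiber(task_ptr, read_pointers, is_known_task):
    # read_pointers(task) yields every pointer-sized word in the task
    # object; is_known_task(ptr) checks that the pointed-to vtable
    # belongs to a known task type. Note: a cycle in the chain would
    # loop forever in this simplified sketch.
    chain = [task_ptr]
    current = task_ptr
    while True:
        next_task = None
        for candidate in read_pointers(current):
            if is_known_task(candidate):
                next_task = candidate
                break
        if next_task is None:
            return chain          # no further attached continuation
        chain.append(next_task)
        current = next_task
```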
Instead of a formatted message intended for humans, return a
`pointer_metadata` object suitable for use by code. The
formatting of the pointer metadata into the human-readable message is
now done by the `pointer_metadata.__str__()` method, at the call site.
Also make `analyze()` a class method, making it possible to call it
without having to create a `scylla_ptr` command instance, which could
confuse GDB.
Allow callers to prevent the resolved name from being saved. Useful when
one is just probing addresses but doesn't want to flood the cache with
useless symbols.
These two commands have been broken for some time, roughly since the CPU
scheduler was merged. Fix them and move the task queue parsing code into
a common method, which is now used by both commands.
Some commands are documented, but not in the standard python way.
Refactor these commands so they use the standard python way of
self-documenting. In addition to being more pythonic, this makes these
documentation strings discoverable by GDB, so they appear in the
`help scylla` output.
Add a `get()` method that retrieves the wrapped pointer without
dereferencing it. All existing methods are refactored to use this new
method to obtain the pointer instead of directly accessing the members.
This way only a single method has to be fixed if the object
implementation changes.
Tomek and I recently had a discussion about whether or not a commitlog
replay would be safe after we dropped or truncated a table that is not
flushed (durable, but with auto_snapshot being false).
While we agreed that it would be safe, we both agreed we would feel
better with a unit test covering that.
This patch adds such a test (by the way, it passes).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190318223811.6862-1-glauber@scylladb.com>
On some environments dh_strip fails at libreloc/ld.so, so it's better to
skip it too, just like libprotobuf.so.15.
The error message is:
dh_strip -Xlibprotobuf.so.15 --dbg-package=scylla-server-dbg
strip:debian/scylla-server/opt/scylladb/libreloc/ld.so[.gnu.build.attributes]: corrupt GNU build attribute note: bad description size: Bad value
dh_strip: strip --remove-section=.comment --remove-section=.note --strip-unneeded debian/scylla-server/opt/scylladb/libreloc/ld.so returned exit code 1
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190319005153.26506-1-syuu@scylladb.com>
n1, n2, n3 in the cluster,
shutdown n1, n2, n3
start n1, n2
start n3; we saw features being enabled using the system table while n1 and n2 were already up and running in the cluster.
INFO 2019-02-27 09:24:41,023 [shard 0] gossip - Feature check passed. Local node 127.0.0.3 features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}, Remote common_features = {CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
INFO 2019-02-27 09:24:41,025 [shard 0] storage_service - Starting up server gossip
INFO 2019-02-27 09:24:41,063 [shard 0] gossip - Node 127.0.0.1 does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
INFO 2019-02-27 09:24:41,063 [shard 0] gossip - Node 127.0.0.2 does not contain SUPPORTED_FEATURES in gossip, using features saved in system table, features={CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS, DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS, LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT, RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3, STREAM_WITH_RPC_STREAM, TRUNCATION_TABLE, WRITE_FAILURE_REPLY, XXHASH}
The problem is we enable the features too early in the start up process.
We should enable features after gossip is settled.
Fixes #4289
Message-Id: <04f2edb25457806bd9e8450dfdcccc9f466ae832.1551406991.git.asias@scylladb.com>
This brings the version check up-to-date with README.md and HACKING.md,
which were updated by commit fa2b03 ("Replace std::experimental types
with C++17 std version.") to say that minimum GCC 8.1.1 is required.
Tests: manually run configure.py with various `--compiler` values.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Message-Id: <20190318130543.24982-1-dejan@scylladb.com>
Static compact tables are tables with compact storage and no
clustering columns.
Before this patch, Scylla was writing rows of static compact tables as
clustered rows instead of static rows. That's because in our in-memory
model such tables have regular rows and no static row. In Cassandra's
schema (since 3.x), those tables have columns which are marked as
static and there are no regular columns.
This worked fine as long as Scylla was writing and reading those
sstables. But when importing sstables from Cassandra, our reader was
skipping the static row, since it's not present in the schema, and
returning no rows as a result. Also, Cassandra and Scylla tools
would have problems reading those sstables.
Fix this by writing rows for such tables the same way as Cassandra
does. In order to support rolling downgrade, we do that only when all
nodes are upgraded.
Fixes #4139.
When enabled on all nodes, sstable writers will start to produce
correct MC-format sstables for compact storage tables by writing rows
into the static row (like C*) rather than into the regular row.
We only do that when all nodes are upgraded, in order to support
rolling downgrade. After all nodes are upgraded, regular rolling
downgrade will no longer be possible.
Refs #4139
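The write-path decision can be sketched as (illustrative Python; the real logic lives in the C++ sstable writers, and the names here are made up):

```python
def place_row(is_static_compact, all_nodes_upgraded, row):
    # Write compact-storage rows into the static row (as Cassandra does)
    # only when the whole cluster supports it; otherwise keep the old
    # clustered-row format so that rolling downgrade remains possible.
    if is_static_compact and all_nodes_upgraded:
        return ('static_row', row)
    return ('clustered_row', row)
```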
The set of columns in both must match, so it's better to build them
together. Later the logic for choosing columns will become more
complicated, and this patch will allow avoiding duplication.
So that it's easier to make it use schema_v3 conditionally in later
patches. It's not on the hot path, so it shouldn't matter that we
don't reserve the vectors.
Three nodes in the cluster node1, node2, node3
Shutdown the whole cluster
Start node1
Start node2, node2 sees empty remote common_features.
gossip - Feature check passed. Local node 127.0.0.2 features =
{CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
Remote common_features = {}
The problem is that node3 hasn't started yet, so node1 sees node3 with
empty features. In get_supported_features(), an empty common feature
set is returned if any node's feature set is empty. To fix, we should
fall back to using the features saved in the system table.
Start node3, node3 sees empty remote common_features.
gossip - Feature check passed. Local node 127.0.0.3 features =
{CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
Remote common_features = {}
The problem is that node3 hasn't inserted its own features into the
gossip endpoint_state_map. get_supported_features() returns the common
features of all nodes in endpoint_state_map. To fix, we should fall
back to using the features stored in the system table for such a node
in this case.
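The fallback logic can be sketched as follows (a hypothetical Python model of get_supported_features(), not the actual C++ code):

```python
def get_supported_features(endpoint_states, system_table_features):
    # Compute the intersection of features across all known nodes,
    # falling back to the features saved in the system table when a
    # node's gossiped feature set is empty (e.g. it has not started yet
    # or has not inserted itself into endpoint_state_map).
    common = None
    for node, features in endpoint_states.items():
        if not features:
            features = system_table_features.get(node, set())
        common = set(features) if common is None else common & set(features)
    return common or set()
```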
Fixes #4225
Fixes #4341
* dev asias/fix_check_knows_remote_features.upstream.v4.1:
gossiper: Remove unused register_feature and unregister_feature
gossiper: Remove unused wait_for_feature_on_all_node and
wait_for_feature_on_node
gossiper: Log feature is enabled only if the feature is not enabled
previously
gossiper: Fix empty remote common_features in
check_knows_remote_features
Three nodes in the cluster node1, node2, node3
Shutdown the whole cluster
Start node1
Start node2, node2 sees empty remote common_features.
gossip - Feature check passed. Local node 127.0.0.2 features =
{CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
Remote common_features = {}
The problem is that node3 hasn't started yet, so node1 sees node3 with
empty features. In get_supported_features(), an empty common feature
set is returned if any node's feature set is empty. To fix, we should
fall back to using the features saved in the system table.
Start node3, node3 sees empty remote common_features.
gossip - Feature check passed. Local node 127.0.0.3 features =
{CORRECT_COUNTER_ORDER, CORRECT_NON_COMPOUND_RANGE_TOMBSTONES, COUNTERS,
DIGEST_MULTIPARTITION_READ, INDEXES, LARGE_PARTITIONS,
LA_SSTABLE_FORMAT, MATERIALIZED_VIEWS, MC_SSTABLE_FORMAT,
RANGE_TOMBSTONES, ROLES, ROW_LEVEL_REPAIR, SCHEMA_TABLES_V3,
STREAM_WITH_RPC_STREAM, WRITE_FAILURE_REPLY, XXHASH},
Remote common_features = {}
The problem is that node3 hasn't inserted its own features into the
gossip endpoint_state_map. get_supported_features() returns the common
features of all nodes in endpoint_state_map. To fix, we should fall
back to using the features stored in the system table for such a node
in this case.
Fixes #4225
We saw the log "Feature FOO is enabled" more than once, like below. It
is better to log it only when the feature was not previously enabled.
gossip - InetAddress 127.0.0.1 is now UP, status = NORMAL
gossip - Feature CORRECT_COUNTER_ORDER is enabled
gossip - Feature CORRECT_NON_COMPOUND_RANGE_TOMBSTONES is enabled
gossip - Feature COUNTERS is enabled
gossip - Feature DIGEST_MULTIPARTITION_READ is enabled
gossip - Feature INDEXES is enabled
gossip - Feature LARGE_PARTITIONS is enabled
gossip - Feature LA_SSTABLE_FORMAT is enabled
gossip - Feature MATERIALIZED_VIEWS is enabled
gossip - Feature MC_SSTABLE_FORMAT is enabled
gossip - Feature RANGE_TOMBSTONES is enabled
gossip - Feature ROLES is enabled
gossip - Feature ROW_LEVEL_REPAIR is enabled
gossip - Feature SCHEMA_TABLES_V3 is enabled
gossip - Feature STREAM_WITH_RPC_STREAM is enabled
gossip - Feature TRUNCATION_TABLE is enabled
gossip - Feature WRITE_FAILURE_REPLY is enabled
gossip - Feature XXHASH is enabled
gossip - Feature CORRECT_COUNTER_ORDER is enabled
gossip - Feature CORRECT_NON_COMPOUND_RANGE_TOMBSTONES is enabled
gossip - Feature COUNTERS is enabled
gossip - Feature DIGEST_MULTIPARTITION_READ is enabled
gossip - Feature INDEXES is enabled
gossip - Feature LARGE_PARTITIONS is enabled
gossip - Feature LA_SSTABLE_FORMAT is enabled
gossip - Feature MATERIALIZED_VIEWS is enabled
gossip - Feature MC_SSTABLE_FORMAT is enabled
gossip - Feature RANGE_TOMBSTONES is enabled
gossip - Feature ROLES is enabled
gossip - Feature ROW_LEVEL_REPAIR is enabled
gossip - Feature SCHEMA_TABLES_V3 is enabled
gossip - Feature STREAM_WITH_RPC_STREAM is enabled
gossip - Feature TRUNCATION_TABLE is enabled
gossip - Feature WRITE_FAILURE_REPLY is enabled
gossip - Feature XXHASH is enabled
gossip - InetAddress 127.0.0.2 is now UP, status = NORMAL
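The fix amounts to logging only on the first enable transition, roughly (an illustrative sketch, not the gossiper's actual C++ code):

```python
class Feature:
    def __init__(self, name):
        self.name = name
        self.enabled = False

def enable_feature(feature, log):
    # Log "Feature X is enabled" only when the feature was not
    # previously enabled; repeated enables are silent no-ops.
    if not feature.enabled:
        feature.enabled = True
        log('Feature %s is enabled' % feature.name)
```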
It worked accidentally: the pattern was just expanded by bash to match
files in the current directory, not correctly recognized by tar.
We need to use the full file name instead.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190312172243.5482-2-syuu@scylladb.com>
We get the following link error when running reloc/build_reloc.sh in
dbuild; we need to enable DPDK on Seastar:
g++: error: /usr/lib64/librte_cfgfile.so: No such file or directory
g++: error: /usr/lib64/librte_cmdline.so: No such file or directory
g++: error: /usr/lib64/librte_ethdev.so: No such file or directory
g++: error: /usr/lib64/librte_hash.so: No such file or directory
g++: error: /usr/lib64/librte_kvargs.so: No such file or directory
g++: error: /usr/lib64/librte_mbuf.so: No such file or directory
g++: error: /usr/lib64/librte_eal.so: No such file or directory
g++: error: /usr/lib64/librte_mempool.so: No such file or directory
g++: error: /usr/lib64/librte_mempool_ring.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_bnxt.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_e1000.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_ena.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_enic.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_fm10k.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_qede.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_i40e.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_ixgbe.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_nfp.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_ring.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_sfc_efx.so: No such file or directory
g++: error: /usr/lib64/librte_pmd_vmxnet3_uio.so: No such file or directory
g++: error: /usr/lib64/librte_ring.so: No such file or directory
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190312172243.5482-1-syuu@scylladb.com>
Since we removed dist/common/bin/scyllatop we are getting a build error
on .deb package build (1bb65a0888).
To fix the error we need to create a symlink for /usr/bin/scyllatop.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190316162105.28855-1-syuu@scylladb.com>
"
Refuse to accept SSTables that were created with a partitioner
different from the one used by the Scylla server.
Fixes #4331
"
* 'haaawk/4331/v4' of github.com:scylladb/seastar-dev:
sstables: Add test for sstable::validate_partitioner
sstables: Add sstable::validate_partitioner and use it
Make sure the exception is thrown when Scylla
tries to load an SSTable created with an incompatible partitioner.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The Scylla server can't read sstables that were created
with a partitioner different from the one being used by Scylla.
We should make sure that Scylla identifies such mismatch
and refuses to use such SSTables.
We can use the partitioner information stored in the validation
metadata (Statistics.db file) for each SSTable and compare it against
the partitioner used by Scylla.
Fixes #4331
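The validation can be sketched as (illustrative Python; the real check is sstable::validate_partitioner in C++, and the exception type here is made up):

```python
def validate_partitioner(sstable_partitioner, local_partitioner):
    # Compare the partitioner recorded in the sstable's validation
    # metadata (Statistics.db) against the one Scylla is running with,
    # and refuse to load the sstable on a mismatch.
    if sstable_partitioner != local_partitioner:
        raise ValueError(
            'SSTable was created with partitioner %s but Scylla uses %s'
            % (sstable_partitioner, local_partitioner))
```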
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
A previous version of the patch that introduced these calls had no
limit on how far behind the large data recording could get, and
maybe_record_large_cells returned null.
The final version switched to a semaphore, but unfortunately these
calls were not updated.
Tests: unit (dev)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190314195856.66387-1-espindola@scylladb.com>
The vint deserialiser can be more performant if it is allowed to
overread (i.e. read more memory than the value it is deserialising).
In the case of sstable reads those vints will usually be in the middle
of a much larger buffer, so let's pass the whole length of the buffer
and enable this optimisation.
At the moment, vint deserialisation uses a naive approach, reading
each byte separately. In practice, vints most often appear inside
larger buffers. That means we can read 8 bytes at a time, then figure
out the unneeded parts and mask them out. This way we avoid a loop and
do fewer memory loads, which are much more expensive than arithmetic
operations (even when they hit the cache).
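The masking trick can be sketched in Python as follows (an illustration of the idea only, assuming a Cassandra-style unsigned vint where the number of leading 1-bits in the first byte gives the count of extra bytes; the actual Scylla code is C++ and differs in detail):

```python
def vint_len(first_byte):
    # number of leading 1-bits in the first byte = number of extra bytes;
    # a value encoded in `size` bytes carries 7*size value bits
    n = 0
    while n < 8 and first_byte & (0x80 >> n):
        n += 1
    return n + 1

def deserialize_vint_naive(buf, pos):
    # reference implementation: one byte at a time
    size = vint_len(buf[pos])
    value = buf[pos] & (0xff >> size) if size <= 8 else 0
    for i in range(1, size):
        value = (value << 8) | buf[pos + i]
    return value, size

def deserialize_vint_fast(buf, pos):
    # fast path: one 8-byte load, shift away the overread tail, then
    # mask out the length-marker bits; needs 8 readable bytes at `pos`
    size = vint_len(buf[pos])
    if size > 8 or pos + 8 > len(buf):
        return deserialize_vint_naive(buf, pos)
    chunk = int.from_bytes(buf[pos:pos + 8], 'big')
    chunk >>= 8 * (8 - size)                  # drop bytes past the vint
    return chunk & ((1 << (7 * size)) - 1), size
```

The fast path falls back to the byte-by-byte loop when fewer than 8 bytes remain, which is why passing the whole buffer length matters: it lets the common case take the loop-free masking path.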
Deserialisation function returns a structure containing both the value
and its length in the input buffer. In the vast majority of cases
the caller will already know the length, and having this structure will
make it harder for the compiler to emit good code, especially if the
function is not inlined.
In practice I've seen the structure causing register pressure problems
that lead to spilling variables to memory.
In order to allow yielding when handling endpoint lifecycle changes,
notifiers now run in threaded context.
Implementations which used this assumption before are supplemented
with assertions that they indeed run in seastar::async mode.
Fixes #4317
Message-Id: <45bbaf2d25dac314e4f322a91350705fad8b81ed.1552567666.git.sarna@scylladb.com>
* seastar e640314...463d24e (3):
> Merge 'Handle IOV_MAX limit in posix_file_impl' from Paweł
> core: remove unneeded 'exceptional future ignored' report
> tests/perf: support multiple iterations in a single test run
Issue #4234 asks for a large collection detector. Discussing the issue
Benny pointed out that it is probably better to have a generic large
cell detector as it makes a natural progression on what we already
warn on (large partitions and large rows).
This patch series implements that. It is on top of
shutdown-order-patches-v7 which is currently on next.
With the changes to use a semaphore this patch series might be getting
a bit big. Let me know if I should split it.
* https://github.com/espindola/scylla espindola/large-cells-on-top-of-shutdown-v5:
db: refactor large data deletion code
db: Rename (maybe_)?update_large_partitions
db: refactor a try_record helper
large_data_handler: assert it is not used after stop()
db: don't use _stopped directly
sstables: delete dead error handling code.
large_data_handler: Remove const from a few functions
large_data_handler: propagate a future out of stop()
large_data_handler: Run large data recording in parallel
Create a system.large_cells table
db: Record large cells
Add a test for large cells
This is analogous to the system.large_rows table, but holds individual
cells, so it also needs the column name.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this changes the futures returned by large_data_handler will not
normally wait for entries to be written to system.large_rows or
system.large_partitions.
We use a semaphore to bound how behind system.large_* table updates
can get.
This should avoid delaying sstable writes in the common case, which
is more relevant once we warn of large cells, since the default
threshold will be just 1MB.
Note that there is no ordering between the various maybe_record_* and
maybe_delete_large_data_entries requests. This means that we can end
up with a stale entry that is only removed once the TTL expires.
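The semaphore-bounded background update can be modelled roughly like this (a threading-based Python sketch; Scylla itself uses a seastar semaphore and futures, and these names are illustrative):

```python
import threading

class LargeDataHandler:
    """Sketch: system.large_* updates run in the background, bounded
    by a semaphore so they cannot fall arbitrarily far behind."""

    def __init__(self, max_in_flight=8):
        self._sem = threading.Semaphore(max_in_flight)
        self._pending = []

    def maybe_record(self, write_fn):
        # Blocks only when max_in_flight updates are already queued;
        # otherwise returns immediately without waiting for the write.
        self._sem.acquire()

        def run():
            try:
                write_fn()          # update a system.large_* table
            finally:
                self._sem.release()

        t = threading.Thread(target=run)
        t.start()
        self._pending.append(t)     # caller does not join here

    def stop(self):
        # unlike the per-write path, stop() waits for everything
        for t in self._pending:
            t.join()
```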
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These will use a member semaphore variable in a followup patch, so they
cannot be const.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
maybe_delete_large_data_entries handles exceptions internally, so the
code this patch deletes would never run.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This should have been changed in the patch
db: stop the commit log after the tables during shutdown
But unfortunately I missed it then.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We had almost identical error handling for large_partitions and
large_rows. Refactor in preparation for large_cells.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This renames it to record_large_partitions, which matches
record_large_rows. It also changes the signature to be closer to
record_large_rows.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The code for deleting entries from system.large_partitions was almost
a duplicate of the code for deleting entries from system.large_rows.
This patch unifies the two, which also improves the error message when
we fail to delete entries from system.large_partitions.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
There is no guarantee that rpc streaming makes progress within some
time period. Remove the keep-alive timer in streaming to avoid killing
the session when the rpc streaming is just slow.
The keep alive timer is used to close the session in the following case:
n2 (the rpc streaming sender) streams to n1 (the rpc streaming receiver)
kill -9 n2
We need this because we do not kill the session when gossip thinks a
node is down, since the node being down might only be temporary, and
it is a waste to drop the work that has already been done, especially
when the stream session takes a long time.
Since in range_streamer we do not stream all data in a single stream
session (we stream 10% of the data at a time, and we have retry
logic), I think it is fine to kill a stream session when gossip thinks
a node is down. This patch changes to close all stream sessions with
a node that gossip thinks is down.
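The new behaviour can be sketched as (a hypothetical Python model, not the C++ stream manager):

```python
class StreamSession:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

class StreamManager:
    def __init__(self):
        self._sessions = {}  # peer -> list of stream sessions

    def add(self, peer, session):
        self._sessions.setdefault(peer, []).append(session)

    def on_node_down(self, peer):
        # Close every stream session with the node gossip marked down;
        # range_streamer retries and streams ~10% of the data per
        # session, so aborting here loses little work.
        for s in self._sessions.pop(peer, []):
            s.close()
```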
Message-Id: <bdbb9486a533eee25fcaf4a23a946629ba946537.1551773823.git.asias@scylladb.com>
"
This series allows canceling view update requests when a node is
discovered DOWN. View updates are sent in the background with long
timeout (5 minutes), and in case we discover that the node is
unavailable, there's no point in waiting that long for the request
to finish. What's more, waiting for these requests occurs on shutdown,
which may result in waiting 5 minutes until Scylla properly shuts down,
which is bad for both users and dtests.
This series implements storage_proxy as a lifecycle subscriber,
so it can react to membership changes. It also keeps track of all
"interruptible" writes per endpoint, so once a node is detected as DOWN,
an artificial timeout can be triggered for all aforementioned write
requests.
Fixes #3826
Fixes #3966
Fixes #4028
"
* 'write_hints_for_view_updates_on_shutdown_4' of https://github.com/psarna/scylla:
service: remove unused stop_hints_manager
storage_proxy: add drain_on_shutdown implementation
main: register storage proxy as lifecycle subscriber
storage_proxy: add endpoint_lifecycle_subscriber interface
storage_proxy: register view update handlers for view write type
storage_proxy: add intrusive list of view write handlers
storage_proxy: add view_update_write_response_handler
Complex timestamp tests were ported from dtest and contained a potential
race - rows were updated with TTL 1 and then checked if the row exists
in both base and view replicas in an eventually() loop.
During this loop however, TTL of 1 second might have already passed
and the row could have been deleted from base.
This patch changes the mentioned TTL to 30 seconds, making the tests
extremely unlikely to be flaky.
Message-Id: <6b43fe31850babeaa43465eb771c0af45ee6e80d.1552041571.git.sarna@scylladb.com>
This series contains several improvements to perf_fast_forward that
either address some of the problems seen in the automated runs or help
understanding the results.
The main problem was that test small-partition-slicing had a preparation
stage disproportionately long compared to the actual testing phase.
While the fragments-per-second results weren't affected by that, it
restricted the number of iterations of the test that we were able to
run, and the test, whose single iteration is short (and more prone to
noise), was executed only four times. This was solved by sharing the
preparation stage among all iterations, thus enabling the test to be
run many times and improving the stability of the results.
Another improvement is the ability to dump all test results and
process them to produce histograms. This allows us to see what the
distribution of particular statistics looks like and whether there are
any anomalies.
Refs #4278.
* https://github.com/pdziepak/scylla.git more-perf_fast_forward/v1:
tests/perf_fast_forward: print number of iterations of each test
tests/perf_fast_forward: reuse keys in small partition slicing test
tests/perf_fast_forward: extract json result file writing logic
tests/perf_fast_forward: add an option to dump all results
tests/perf_fast_forward: add script for analysing full results
When storage proxy is shutting down, all interruptible writes
can be timed out in order not to wait for them. Instead, the mechanism
will fall back to storing hints and/or not progressing with view
building.
In order to be able to iterate over view update write response handlers,
an intrusive list of them is added to storage proxy. This way
iteration can easily yield without invalidating iterators, and all
logic is moved to the slow path.
View update write response handler inherits from a regular write
response handler, but it's also possible to link it intrusively
in order to be able to induce timeouts on them later.
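The handler tracking can be modelled roughly as (a Python sketch; the real code uses an intrusive list of C++ response handlers, and these names are illustrative):

```python
class ViewUpdateHandler:
    def __init__(self, endpoint):
        self.endpoint = endpoint
        self.timed_out = False

    def timeout(self):
        # the write falls back to hints and/or view building retry
        self.timed_out = True

class StorageProxy:
    def __init__(self):
        self._view_handlers = []  # models the intrusive list

    def register(self, handler):
        self._view_handlers.append(handler)

    def on_down(self, endpoint):
        # Induce an artificial timeout for every pending view update
        # write to the node, instead of waiting out the full 5-minute
        # request timeout (e.g. during shutdown).
        for h in self._view_handlers:
            if h.endpoint == endpoint:
                h.timeout()
```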
perf_fast_forward with flag --dump-all-results reports the results of
every test iteration that was executed. This patch introduces a python
script that can analyse those results (in json format) and present them
in a more human-friendly way.
For now, the only option is to plot histograms of selected statistics.
perf_fast_forward runs each test case multiple times and reports a
summary of those results (median, min, max, and median absolute
deviation).
While very convenient, the summary may hide some important information
(e.g. the distribution of the results). This patch adds an option to
report results of every single executed iteration.
"
Fixes #4245
Breaks up "perform_cleanup" into a parameterized "rewrite_sstables"
and implements upgrade + scrub in terms of it.
Both run as a "regular" compaction, but ignore the normal criteria
for compaction and select obsolete/all tables.
We also ensure all previous compactions are done so we can guarantee
all tables are rewritten post invocation of command.
"
* 'calle/upgrade_sstables' of github.com:scylladb/seastar-dev:
api::storage_service: Implement "scrub"
api/storage_service: Implement "upgradesstables"
api::storage_service: Add keyspace + tables helper
compaction_manager: Add perform_sstable_scrub
compaction_manager: Add perform_sstable_upgrade
compaction_manager: break out rewrite_sstables from cleanup
table: parameterize cleanup_sstables
"
Currently any large partitions found during shutdown are not
recorded. The reason is that the database commit log is already off,
so there is nowhere to record them.
One possible solution is to have an independent system database. With
that the regular db is shutdown first and writes can continue to the
system db.
That is a pretty big change. It would also not allow us to record
large partitions in any system tables.
This patch series instead tries to stop the commit log later. With
that any large partitions are recorded to the log and moved to a
sstable on the next startup.
"
* 'espindola/shutdown-order-patches-v7' of https://github.com/espindola/scylla:
db: stop the commit log after the tables during shutdown
db: stop the compaction manager earlier
db: Add a stop_database helper
db: Don't record large partitions in system tables
Fixes #4245
Implemented as a compaction barrier (forcing previous compactions to
finish) + parameterized "cleanup", with sstable list based on
parameters.
Truncating a table is very slow if the system is under pressure. Because
in that case we mostly just want to get rid of the existing data, it
shouldn't take this long. The problem happens because truncate has to
wait for memtable flushes to end, twice. This is regardless of whether
or not the table being truncated has any data.
1. The first time is when we call truncate itself:
if auto_snapshot is enabled, we will flush the contents of this table
first and we are expected to be slow. However, even if auto_snapshot is
disabled we will still do it -- which is a bug -- if the table is marked
as durable. We should just not flush in this case and it is a silly bug.
2. The second time is when we call cf->stop(). Stopping a table will
wait for a flush to finish. At this point, regardless of which path
(Durable or non-durable) we took in the previous step we will have no
more data in the table. However, calling `flush()` still needs to
acquire a flush_permit, which means we will wait for whichever
memtable is flushing at that very moment to end.
If the system is under pressure and a memtable flush will take many
seconds, so will truncate. Even if auto_snapshots are enabled, we
shouldn't have to flush twice. The first flush should already put us
in a state in which the next one is immediate (maybe holding on to the
permit, maybe destroying the memtable_list already at that point,
since no other memtables should be created).
If auto_snapshots are not enabled, the whole thing should just be
instantaneous.
This patchset fixes that by removing the flush need when !auto_snapshot,
and special casing the flush of an empty table.
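The fixed control flow can be sketched as (illustrative Python with hypothetical table methods; the real code is database/table C++ code):

```python
def truncate(table, auto_snapshot):
    # Only snapshotting needs the data durable on disk; otherwise just
    # dropping the in-memory data is enough and truncate is near
    # instantaneous even when the system is under flush pressure.
    if auto_snapshot:
        table.flush()
        table.snapshot()
    else:
        table.clear_memtables()
    table.discard_sstables()
```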
Fixes #4294
* git@github.com:glommer/scylla.git slowtruncate-v2:
database: immediately flush tables with no memtables.
truncate: do not flush memtables if auto_snapshot is false.
This commit rewrites the logic of table creation at startup of the auth
mechanism to be race proof. This is done by simply ignoring the
already_exists exception as done in system_distributed_keyspace.
The old creation logic tested for existence of the column family and
right after that called announce_new_column_family with the newly
created table schema. The problem was that it does not prevent
a race since the announcement itself is a fiber and the created table
can still be gossiped from another node, causing the announce
function to throw an already_exists exception that in turn crashes
scylla.
Message-Id: <20190306075016.28131-1-eliransin@scylladb.com>
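The race-proof pattern described above can be sketched as follows, with a stand-in exception type and a callback in place of the real announce path (names are illustrative): attempt the creation unconditionally and swallow already_exists when another node's announcement won the race.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Stand-in for Scylla's already_exists exception.
struct already_exists : std::runtime_error {
    explicit already_exists(const std::string& cf)
        : std::runtime_error("table " + cf + " already exists") {}
};

// Instead of check-then-create (racy), just attempt the creation and
// ignore already_exists: the table exists either way, which is all we
// wanted.
void ensure_table(const std::function<void()>& announce_new_column_family) {
    try {
        announce_new_column_family();
    } catch (const already_exists&) {
        // Another node gossiped the table concurrently; nothing to do.
    }
}
```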
This allows for system.large_partitions to be updated if a large
partition is found while writing the last sstables.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
We want to finish all large data logging in stop_system, so stopping
the compaction manager should be the first thing stop_system does.
The make_ready_future<>() will be removed in a followup patch.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
According to the cql definitions, if no ORDER BY clause is present,
records should be returned ordered by the clustering keys. Since the
backend returns the ranges according to their order of appearance
in the request, the bounds should be sorted before sending it to the
backend. This kind of sorting is needed in queries that generate more
than one bound to be read; examples of such queries are:
1. a SELECT query with an IN clause.
2. a SELECT query on a mixed-order tuple of columns (see #2050).
The assumption this commit makes is the correctness of the bounds
list, that is, the bounds are non-overlapping. If this weren't true, multiple
occurrences of the same record could have been returned for certain queries.
Tests:
1. Unit tests release
2. All dtests that require #2050 and #2029
Fixes #2029
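The sorting described above can be sketched as follows, with an illustrative pair type standing in for a clustering bound range. Since the bounds are assumed non-overlapping, comparing lower bounds is a total order.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Illustrative stand-in for a clustering range: [start, end],
// assumed non-overlapping as the commit requires.
using bound_range = std::pair<int, int>;

// Sort the bounds into clustering order before handing them to the
// backend, so rows come back ordered even without an ORDER BY clause.
void sort_bounds(std::vector<bound_range>& bounds) {
    // Non-overlapping ranges are totally ordered by their lower bound.
    std::sort(bounds.begin(), bounds.end(),
              [](const bound_range& a, const bound_range& b) {
                  return a.first < b.first;
              });
}
```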
"
Fixes#3708
This series adds JSON serialization and deserialization procedures
to tuples and user defined types.
Tests: unit (dev)
"
* 'add_tuple_and_udt_json_support_2' of https://github.com/psarna/scylla:
tests: add test cases for JSON and UDT
types: add JSON support to UDT
tests: add JSON tuple tests
types: add JSON support for tuples
Right now we flush memtables if the table is durable (which in practice
it almost always is).
We are truncating, so we don't want the data. We should only flush if
auto_snapshot is true.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
If a table has no data, it may still take a long time to flush. This is
because before we even try to flush, we need to acquire a permit, and
that can take a while if there is a long running flush already queued.
We can special case the situation in which there is no data in any of
the memtables owned by table and return immediately.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
* seastar ab54765...e640314 (10):
> net: enable IP_BIND_ADDRESS_NO_PORT before binding a socket during connection
> core: show address in error message for posix_listen failures
> fmt: remove submodule
> tests: fix loopback socket close() to not fail when the peer's side is already closed
> Merge "Add suffixes to target names" from Jesse
> temporary_buffer: improve documentation for alignment param requirements
> docs: Fix dependencies for split tutorial target
> deleter: prevent early memory free caused by deleter append.
> doc/tutorial.md: introduce memory allocation foreign_ptr
> Fix CLI help message (network & DPDK options)
Toolchain and configure.py updated for fmt submodule removal.
fuzzy_test performs some checks that are expected to fail and whose
failure does not influence the outcome of the test. For this it uses the
`BOOST_WARN_*` family of macros. These will just log a warning when their
predicate fails. This can however confuse someone looking at the logs
trying to determine the cause of a failure. Since these checks are
performed primarily to provide an aid in debugging failures, replace them
with a conditional debug-level log message.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <f550a9d9ab1b5b4aeb4f81860cbd3d924fc86898.1551792035.git.bdenes@scylladb.com>
The `test_abandoned_read` verifies that an abandoned read does a proper
cleanup. One of the things checked is that after the querier TTL
expires, the saved queriers are cleaned-up. This check however had a
very tight timing. The TTL was 2s and the test waited 2s before it did
the check, which is wrapped in an `eventually_true()` (max +1s).
The TTL timer scans the queriers with a period of TTL/2 so a querier
can live 1.5*TTL time. This means that the 2s + 1s wait time is just on
the limit and with some bad luck (and a slow machine) it can fail.
Reduce the TTL in this test to 1s to relax the dependence on timing.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <ed0d45b5a07960b83b391d289cade9b9f60c7785.1551787638.git.bdenes@scylladb.com>
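The worst-case lifetime arithmetic above can be written out explicitly: the reaper scans with period ttl/2, so a querier registered just after a scan survives its full ttl plus up to one more half-period.

```cpp
#include <chrono>

// A querier whose TTL expires right after a reaper scan waits for the
// next scan before being removed: worst case 1.5 * ttl.
std::chrono::milliseconds max_querier_lifetime(std::chrono::milliseconds ttl) {
    return ttl + ttl / 2;
}
```

With a 2s TTL this gives 3s, which together with the test's 2s + 1s wait explains why the original timing was "just on the limit".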
This change amends the functionality of the result generation;
it changes the behaviour to return the expected results vector sorted
in the expected order of appearance in the result set. Then the
result set is validated for both, content and also order.
Tests: unit tests (Release)
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Whenever a query with an IN clause on clustering keys is executed,
assuming only one partition, the rows are ordered according to the
clustering keys. This commit adds the order validation to the content
validation whenever possible (which means removing the
ignore order part).
Tests: unit tests (Release)
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
According to the cql definitions, if no ORDER BY clause is present,
records should be returned ordered by the clustering keys. Since the
backend returns the ranges according to their order of appearance
in the request, the bounds should be sorted before sending it to the
backend. This kind of sorting is needed in queries that generate more
than one bound to be read; examples of such queries are:
1. a SELECT query with an IN clause.
2. a SELECT query on a mixed-order tuple of columns (see #2050).
The assumption this commit makes is the correctness of the bounds
list, that is, the bounds are non-overlapping. If this weren't true, multiple
occurrences of the same record could have been returned for certain queries.
Tests:
1. Unit tests release
2. All dtests that require #2050 and #2029
Fixes #2029
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
"
Futurize split_range_to_single_shard to fix reactor stall.
Fixes: #3846
"
* tag 'asias/split_range_to_single_shard/v4' of github.com:scylladb/seastar-dev:
partitioner: Futurize split_range_to_single_shard
tests: Use SEASTAR_THREAD_TEST_CASE for partitioner_test.cc
table::load_sstable: fix missing arg in old format counters exception
Properly catch and log the exception in load_new_sstables.
Abort when the exception is caught to keep current behavior.
Seen with migration_test:TestMigration_with_2_1_x.migrate_sstable_with_counter_test
without enable_dangerous_direct_import_of_cassandra_counters.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190301091235.2914-1-bhalevy@scylladb.com>
"
This fixes #3988.
We already have a system.large_partitions, but only a warning for
large rows. These patches close the gap by also recording large rows
into a new system.large_rows.
"
* 'espindola/large-row-add-table-v6' of https://github.com/espindola/scylla:
Add a testcase for large rows
Populate system.large_rows.
Create a system.large_rows table
Extract a key_to_str helper
Don't call record_large_rows if stopped
Add a delete_large_rows_entries method to large_data_handler
db::large_data_handler::(maybe_)?record_large_rows: Return future<> instead of void
Rename maybe_delete_large_partitions_entry
Rename log_large_row to record_large_rows
Rename maybe_log_large_row to maybe_record_large_rows
"
This series contains minor improvements to commitlog log messages that
have helped investigating #4231, but are not specific to that bug.
"
* tag 'improve-commitlog-logs/v1' of https://github.com/pdziepak/scylla:
commitlog: use consistent chunk offsets in logs
commitlog: provide more information in logs
commitlog: remove unnecessary comment
Logs in commitlog writer use offset in the file of the chunk header to
identify chunks. However, the replayer is using offset after the header
for the same purpose. This causes unnecessary confusion suggesting that
the replayer is reading at the wrong position.
This patch changes the replayer so that it reports chunk header offsets.
This commit adds some more information to the logs, motivated by
experience investigating #4231.
* size of each write
* position of each write
* log message for final write
"
This series fixes a problem in the commitlog cycle() function that
confused in-memory and on-disk size of chunks it wrote to disk. The
former was used to decide how much data needs to be actually written,
and the latter was used to compute the offset of the next chunk. If two
chunk writes happened concurrently, the one positioned earlier in
the file could corrupt the header of the next one.
Fixes #4231.
Tests: unit(dev), dtest(commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup,test_commitlog_replay_with_alter_table)
"
* tag 'fix-commitlog-cycle/v1' of https://github.com/pdziepak/scylla:
commitlog: write the correct buffer size
utils/fragmented_temporary_buffer_view: add remove suffix
Introduced in 2a437ab427.
regular_compaction::select_sstable_writer() creates the sstable writer
when the first partition is consumed from the combined mutation
fragment stream. It gets the schema directly from the table
object. That may be a different schema than the one used by the
readers if there was a concurrent schema alter during that small time
window. As a result, the writing consumer attached to readers will
interpret fragments using the wrong version of the schema.
One effect of this is storing values of some columns under a different
column.
This patch replaces all column_family::schema() accesses with accesses
to the _schema member, which is obtained once per compaction and is
the same schema which readers use.
Fixes#4304.
Tests:
- manual tests with hard-coded schema change injection to reproduce the bug
- build/dev/scylla boot
- tests/sstable_mutation_test
Message-Id: <1551698056-23386-1-git-send-email-tgrabiec@scylladb.com>
Commitlog files contain multiple chunks. Each chunk starts as a single
(possibly, fragmented buffer). The size of that buffer in memory may be
larger than the size in the file.
cycle() was incorrectly using the in-memory size to write the whole
buffer to the file. That sometimes caused data corruption, since a
smaller on-file size was used to compute the offset of the next chunk
and there could be multiple chunk writes happening at the same time.
This patch solves the issue by ensuring that only the actual on-file
size of the chunk is written.
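The invariant the fix restores can be sketched as follows (a simplified model, not the real commitlog code): the number of bytes written for a chunk and the amount the next chunk's offset advances must both be the on-disk size, never the possibly larger in-memory buffer size.

```cpp
#include <cassert>
#include <cstddef>

struct chunk_write {
    size_t bytes_written;
    size_t next_chunk_offset;
};

// Write one chunk at file_offset. The in-memory buffer may be larger
// than what actually goes to disk; the bug was using the in-memory
// size as the write length while the on-disk size positioned the next
// chunk, letting concurrent writes overlap.
chunk_write write_chunk(size_t file_offset, size_t on_disk_size,
                        size_t in_memory_size) {
    assert(in_memory_size >= on_disk_size);
    (void)in_memory_size; // must NOT be used as the write length
    return {on_disk_size, file_offset + on_disk_size};
}
```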
This patch adds fragmented_temporary_buffer_view::remove_suffix(). It is
also necessary to adjust remove_prefix() since now the total size of all
fragments may be larger than the size of the view if both those
operations are performed.
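The interplay of the two operations can be modeled with a simplified view (a sketch, not the real fragmented_temporary_buffer_view): because the fragments may hold more bytes in total than the view's logical size, remove_prefix() and remove_suffix() only adjust the logical window.

```cpp
#include <algorithm>
#include <cstddef>

// Simplified model: the view covers [offset, offset + size) of the
// underlying fragments; the fragments themselves are untouched.
struct fragmented_view {
    size_t offset = 0; // bytes skipped at the front
    size_t size = 0;   // logical size of the view

    void remove_prefix(size_t n) {
        n = std::min(n, size);
        offset += n;
        size -= n;
    }
    void remove_suffix(size_t n) {
        size -= std::min(n, size); // shrink from the back only
    }
};
```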
"
This series heavily refactors `auth_test` in anticipation of
the last patch, which fixes a bug and which should be backported.
Branches: branch-3.0, branch-2.3
"
Fixes #4284
* 'jhk/check_can_login/v2' of https://github.com/hakuch/scylla:
auth: Reject logins from disallowed roles
tests: Restrict the scope of a variable
tests: Simplify boolean assertions in `auth_test`
tests: Abstract out repeated assertion checking
tests: Do not use the `auth` namespace
tests: Validate authentication correctly
tests: Ensure test roles are created and dropped
tests: Use `static` variables in `auth_test`
tests: Remove non-useful test
4 nodes in the cluster
n1, n2 in dc1
n3, n4 in dc2
dc1 RF=2, dc2 RF=2.
If we run
nodetool repair -hosts 127.0.0.1,127.0.0.3 -dc "dc1,dc2" multi
on n1.
The -hosts option will be ignored and only the -dc option
will be used to choose which hosts to repair. In this case, n1 to n4
will be repaired.
If the user wants to select specific hosts to repair with, there is no need
to specify the -dc option; using the -hosts option is enough.
Reject the combination so as not to surprise the user.
In https://issues.apache.org/jira/browse/CASSANDRA-9876, the same logic
is introduced as well.
Refs #3836
Message-Id: <e95ac1099f98dd53bb9d6534316005ea3577e639.1551406529.git.asias@scylladb.com>
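The rejection described above could look like the following sketch (hypothetical validation helper, not the actual repair code): selecting specific hosts and restricting to data centers are mutually exclusive, so fail fast instead of silently ignoring -hosts.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Reject the -hosts/-dc combination up front, as the commit describes.
void validate_repair_options(const std::vector<std::string>& hosts,
                             const std::vector<std::string>& dcs) {
    if (!hosts.empty() && !dcs.empty()) {
        throw std::invalid_argument(
            "Cannot combine -hosts and -dc; use -hosts alone to pick hosts");
    }
}
```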
Scylla Manager communicates through SSH, so this patch adds SSH server
to Scylla's docker image in order for it to be configurable by Scylla
Manager.
Message-Id: <20190301161428.GA12148@shenzou.localdomain>
"
This series aims to fix inconsistencies in recent view update generation series (435447998).
First of all, it checks view row marker liveness instead of that of a base row marker
when deciding if optimizations can be applied or not.
Secondly, tests based on creating mutations directly are removed. Instead:
- dtest case which detected inconsistencies in previous series is ported to be a unit test
- the above case is also expanded to cover views with regular base column in their key
- additional test for TTL and timestamps is added and it's based on CQL
Tests: unit (dev)
dtest: materialized_views_test.TestMaterializedViews.test_no_base_column_in_view_pk_complex_timestamp_without_flush
Fixes: #4271
"
* 'fix_virtual_columns_liveness_checks_in_update_optimization_5' of https://github.com/psarna/scylla:
tests: add view update optimization case for TTL
database: add view_stats getter
tests: port complex timestamp view test from dtest
db,view: fix virtual columns liveness checks
tests: remove update generating test case
There are additional validation steps that the server executes in
addition to simply invoking the authenticator, so we adapt the tests to
also perform that validation.
We also eliminate lots of code duplication.
Since the role manager and authenticator work in tandem, the test cases
should use the wrapper for `auth::service` to create and drop users
instead of just doing it through the authenticator.
Password handling is verified in its own test suite, and this test not
only makes a number of assumptions about implementation details, but
also tries to verify a hashing scheme (bcrypt) which is not supported on
most Linux distributions.
These defines are global, so they can be in the mode-agnostic cxxflags
rather than the mode-specific cxxflags_{mode}.
Message-Id: <20190228081247.20116-1-avi@scylladb.com>
This test was useful in discovering corner cases for TTLs of virtual
columns, so it's ported to unit test suite from dtest.
The test is also extended with a mirrored case for base regular column
that *is* included in view pk.
When looking for optimization paths, columns selected in a view
are checked against multiple conditions - unfortunately virtual
columns were erroneously skipped from that check, which resulted
in ignoring their TTLs. That can lead to overoptimizing
and not including vital liveness info into view rows,
which can then result in row disappearing too early.
This test case should have been based on CQL instead of creating
artificial update scenarios. It also contains invalid cases
regarding base and view row marker, so it's removed here
and replaced with CQL-based test in this same series.
gnutls requires a configuration file, and the configuration file must match
the one used by the library. Since we ship our own version of the library with
the relocatable package, we must also ship the configuration file.
Luckily, it is possible to override the location of the configuration file via
an environment variable, so all we need to do is to copy the file to the archive
and provide the environment variable in the thunk that adjusts the library path.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190227110529.14146-1-avi@scylladb.com>
Currently, we only allocate memory for concurrent unit test runs. This can cause
CPU overcommit when running test.py on machines with a lot of memory but few cores.
This overcommit can cause timeouts in tests that are time-sensitive (bad practice,
but can happen) and makes the desktop sluggish.
Improve by allocating at least one logical core per running test.
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190227132516.22147-1-avi@scylladb.com>
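The scheduling change above amounts to capping concurrency by both resources. A sketch of that computation (illustrative, not test.py's actual code, which is Python):

```cpp
#include <algorithm>
#include <cstddef>

// Concurrency is the minimum of what memory allows and what the
// logical core count allows (at least one core per running test),
// never less than one test.
size_t max_parallel_tests(size_t total_mem, size_t mem_per_test,
                          size_t logical_cores) {
    size_t by_mem = mem_per_test ? total_mem / mem_per_test : logical_cores;
    return std::max<size_t>(1, std::min(by_mem, logical_cores));
}
```

On a 64 GiB, 8-core box with 2 GiB per test, memory alone would allow 32 concurrent tests; the core cap brings that down to 8.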
Collect /etc/redhat-release as well as os-release from relevant
hosts. The problem with os-release is that it doesn't contain the
minor version of the EL OS family. Since this is only present in
Red Hat distributions and derivatives, it will not be collected
in Debian derivatives.
Another approach is to use lsb_release -a but it will not provide
anything more useful than os-release on Debian and lsb needs to be
installed on EL derivatives first.
Fixes #4093
Message-Id: <20190225204727.20805-4-dyasny@scylladb.com>
hostname -i produces garbled output on new systems with ipv6
enabled; better to use the clean hostname instead for the file
names.
Message-Id: <20190225204727.20805-3-dyasny@scylladb.com>
The script relies on hostname -i for host address, which can be
wrong in some systems. This patch checks for where the defined
CQL_PORT is listening, and uses the correct IP address instead.
Message-Id: <20190225204727.20805-2-dyasny@scylladb.com>
"
This series restructures the SASL code that was previously internal
to the `password_authenticator` so that it can be used in other contexts.
"
* 'jhk/restructure_sasl/v1' of https://github.com/hakuch/scylla:
auth: Rename SASL challenge class for "PLAIN"
auth: Make a ctor `explicit`
auth: Move `sasl_challenge` to its own file
auth: Decouple SASL code from its parent class
"
This series adds a fuzzer-type unit test for range scans, which
generates a semi-random dataset and executes semi-random range scans
against it, validating the result.
This test aims to cover a wide range of corner cases with the help of
randomness. Data and queries against it are generated in such a way that
various corner cases and their combinations are likely to be covered.
The infrastructure under range-scans has gone through massive changes in
the last year, growing in complexity and scope. The correctness of range
scans is critical for the correct functioning of any Scylla cluster, and
while the current unit tests served well in detecting any major problems
(mostly while developing), they are too simplistic and can only be
relied on to check the correctness of the basic functionality. This test
aims to extend coverage drastically, testing cases that the author of
the range-scan code or that of the existing unit tests didn't even think
exist, by relying on some randomness.
Fixes: #3954 (deprecates really)
"
* 'more-extensive-range-scan-unit-tests/v2' of https://github.com/denesb/scylla:
tests/multishard_mutation_query_test: add fuzzy test
tests/multishard_mutation_query_test: refactor read_all_partitions_with_paged_scan()
tests/test_table: add advanced `create_test_table()` overload
tests/test_table: make `create_test_table()` customizable
query: add trim_clustering_row_ranges_to()
tests/test_table: add keyspace and table name params
tests/test_table: s/create_test_cf/create_test_table/
tests: move create_test_cf() to tests/test_table.{hh,cc}
tests/multishard_mutation_query_test: drop many partition test
tests/multishard_mutation_query_test: drop range tombstone test
"
This miniseries hides virtual columns' writetime and ttl
from the user.
Tests: unit (dev)
Fixes #4288
"
* 'hide_virtual_columns_writetime_and_ttl_2' of https://github.com/psarna/scylla:
tests: add test for hiding virtual columns from WRITETIME
cql3: hide virtual columns from WRITETIME() and TTL()
schema: add column_definition::is_hidden_from_cql
Virtual columns should not be visible to the user,
so they are now hidden not only from directly selecting them,
but also via WRITETIME() and TTL() keywords.
Fixes #4288
test_distributed_loader_with_pending_delete issues a dma write, but violates
the unwritten contract to temporary_buffer::aligned(), which requires that
size be a multiple of alignment. As a result the test fails spuriously.
Instead of playing with the alignment, rewrite that snippet to use the
easier-to-use make_file_output_stream().
Introduced in 1ba88b709f.
Branches: master.
Message-Id: <20190226181850.3074-1-avi@scylladb.com>
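The alignment contract above is the usual round-up-to-a-multiple rule. A sketch of that rounding, assuming the alignment is a power of two:

```cpp
#include <cstddef>

// Round size up to the next multiple of alignment (power of two), as
// required for a DMA write buffer length.
size_t align_up(size_t size, size_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}
```

Using make_file_output_stream() sidesteps this entirely, since the stream takes care of alignment internally.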
It now records large rows when they are first written to an sstable
and removes them when the sstable is deleted.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This is analogous to the system.large_partitions table, but holds
individual rows, so it also needs the clustering key of the large
rows.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The implementations of large_data_handler should only be called if
large_data_handler hasn't been stopped yet.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These functions will record into tables in a followup patch, so they
will need to return a future.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
From the log it looks like these checks were added in 2014 because of
a broken clang.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this patch HACKING suggests using just ./configure.py and passing
the mode to ninja. It also expands on the characteristics of each mode
and mentions the dev mode.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190208020444.19145-1-espindola@scylladb.com>
The materialized-views flow control carefully calculates an amount of
microseconds to delay a client to slow it down to the desired rate -
but then a typo (std::min instead of std::max) causes this delay to
be zeroed, which in effect completely nullifies the flow control
algorithm.
Before this fix, experiments suggested that view flow control was
not having any effect and view backlog not bounded at all. After this
fix, we can see the flow control having its desired effect, and the
view backlog converging.
Fixes #4143.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190226161452.498-1-nyh@scylladb.com>
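The one-character bug above is easy to see in isolation (a sketch, not the actual flow-control code): clamping a computed delay to be non-negative needs std::max; std::min against zero wipes out every positive delay, disabling the flow control entirely.

```cpp
#include <algorithm>
#include <chrono>

using namespace std::chrono;

// Correct clamp: keep positive delays, floor negatives at zero.
// The bug used std::min here, which always returned 0 for positive
// delays and nullified the algorithm.
microseconds clamp_delay(microseconds computed) {
    return std::max(computed, microseconds(0));
}
```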
" from Asias
Some of the function names are not updated after we change the rpc verb
names. Rename them to make them consistent with the rpc verb names.
* seastar-dev.git asias/row_level_repair_rename_consistent_with_rpc_verb/v1:
repair: Rename request_sync_boundary to get_sync_boundary
repair: Rename request_full_row_hashes to get_full_row_hashes
repair: Rename request_combined_row_hash to get_combined_row_hash
repair: Rename request_row_diff to get_row_diff
repair: Rename send_row_diff to put_row_diff
repair: Update function name in docs/row_level_repair.md
The shard-aware drivers can cause a huge amount of connections to be created
when there are tens of thousands of clients. While normally the shard-aware
drivers are beneficial, in those cases they can consume too much memory.
Provide an option to disable shard awareness from the server (it is likely to
be easier to do this on the server than to reprovision those thousands of
clients).
Tests: manual test with wireshark.
Message-Id: <20190223173331.24424-1-avi@scylladb.com>
In resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test, we saw:
4 nodes in the tests
n1, n2, n3, n4 are started
n1 is stopped
n1 is changed to use different shard config
n1 is restarted ( 2019-01-27 04:56:00,377 )
The backtrace happened on n2 right after n1 restarts:
0 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature STREAM_WITH_RPC_STREAM is enabled
1 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature WRITE_FAILURE_REPLY is enabled
2 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature XXHASH is enabled
3 WARN 2019-01-27 04:56:05,177 [shard 0] gossip - Fail to send EchoMessage to 127.0.58.1: seastar::rpc::closed_error (connection is closed)
4 INFO 2019-01-27 04:56:05,205 [shard 0] gossip - InetAddress 127.0.58.1 is now UP, status =
5 Segmentation fault on shard 0.
6 Backtrace:
7 0x00000000041c0782
8 0x00000000040d9a8c
9 0x00000000040d9d35
10 0x00000000040d9d83
11 /lib64/libpthread.so.0+0x00000000000121af
12 0x0000000001a8ac0e
13 0x00000000040ba39e
14 0x00000000040ba561
15 0x000000000418c247
16 0x0000000004265437
17 0x000000000054766e
18 /lib64/libc.so.6+0x0000000000020f29
19 0x00000000005b17d9
The theory is: migration_manager::maybe_schedule_schema_pull is scheduled
while n1 still has the SCHEMA application_state. When n1 restarts, n2 gets
a new application state from n1 which does not have SCHEMA yet. When
migration_manager::maybe_schedule_schema_pull wakes up from the 60-second
sleep, n1 has a non-empty endpoint_state but an empty application_state
for SCHEMA. We dereference the nullptr application_state and abort.
In commit da80f27f44, we fixed the problem by
checking the pointer before dereference.
To prevent this from happening in the first place, we'd better add
application_state::SCHEMA when gossip starts. This way, peer nodes
always see application_state::SCHEMA when a node restarts.
Tests: resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test
Fixes #4148
Fixes #4258
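The defensive check from commit da80f27f44 can be sketched as follows, with a map standing in for the real endpoint_state (hypothetical shape, not the gossiper's actual types): a restarted peer may briefly have an endpoint_state with no SCHEMA entry, so the pointer must be tested before use.

```cpp
#include <map>
#include <string>

// Stand-in: application states keyed by name.
using endpoint_state = std::map<std::string, std::string>;

// May return nullptr: a freshly restarted node has not gossiped
// SCHEMA yet.
const std::string* get_schema_version(const endpoint_state& es) {
    auto it = es.find("SCHEMA");
    return it == es.end() ? nullptr : &it->second;
}

bool maybe_pull_schema(const endpoint_state& es) {
    const std::string* v = get_schema_version(es);
    if (!v) {
        return false; // previously: dereferenced nullptr and aborted
    }
    return true;      // safe to compare versions and schedule a pull
}
```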
Split the update_schema_version_and_announce() into
update_schema_version() and announce_schema_version(). This is going to
be used in storage_service::prepare_to_join(), where we want to first
update the schema version, then start gossip, and finally announce the
schema version.
It is sometimes useful to force reinstallation of node_exporter,
for example during upgrade or if something is wrong with the current
installation.
This patch adds a --force command line option.
If --force is given to node_exporter_install, it will reinstall
node_exporter at the latest version, regardless of whether it was already
installed.
The symbolic link in /usr/bin/node_exporter will be set to the installed
version, so if there are other installed versions, they will remain.
Examples:
$ sudo ./dist/common/scripts/node_exporter_install
node_exporter already installed, you can use `--force` to force reinstallation
$ sudo ./dist/common/scripts/node_exporter_install --force
node_exporter already installed, reinstalling
Fixes #4201
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20190225151120.21919-1-amnon@scylladb.com>
Let's add a PRODUCT variable, similar to build_rpm.sh, for example, so
that we can override package names for enterprise AMIs.
Message-Id: <20190225063319.19516-1-penberg@scylladb.com>
"
After adcb3ec20c ("row_cache: read is not
single-partition if inter-partition forwarding is enabled") we have
noticed a regression in the results of some perf_fast_forward tests.
This was caused by those tests not disabling partition-level
fast-forwarding even though it was not needed and the commit in question
fixed an incorrect optimisation in such cases.
However, after solving that issue it has also become apparent that
mutation_reader_merger performs worse when the fast-forwarding is
disabled. This was attributed to logic responsible for dropping readers
as soon as they have reached the end of stream (which cannot be done if
fast-forwarding is enabled). This problem was mitigated by avoiding a
scan of the list and removing readers in small batches.
Fixes #4246.
Fixes #4254.
Tests: unit(dev)
"
* tag 'perf_fast_forward-fix-regression/v1' of https://github.com/pdziepak/scylla:
mutation_reader_merger: drop unneeded readers in small batches
mutation_reader_merger: track readers by iterators and not pointers
tests/perf_fast_forward: disable partition-level fast-forwarding if not needed
* seastar 2313dec...ab54765 (10):
> Fix C++-17-only uses of static_assert() with a single parameter.
> README.md: fix out-of-date explanation of C++ dialect
> net: fix tcp load balancer accounting leak while moving socket to other shard
> Revert "deleter: prevent early memory free caused by deleter append."
> deleter: prevent early memory free caused by deleter append.
> Solve seastar.unit.thread failure in debug mode
> Fix iovec-based read_dma: use make_readv_iocb instead of make_read_iocb
> build: Fix the required version of `fmt`
> app_template: fix use after move in app constructor
> build: Rename CMake variable for private flags
Fixes #4269.
* 'jhk/define_debug/v1' of https://github.com/hakuch/scylla:
build: Remove the `DEBUG_SHARED_PTR` pp variable
build: Prefer the Seastar version of a pp variable
Scylla currently prints a welcome message when it starts, with the
Scylla version, but this is not printed to the regular log so in some
cases (e.g., Jenkins runs) we do not see it in the log. So let's add
a regular INFO-level log message with the same information.
Also, Scylla currently doesn't print any specific log message when it
normally completes its shutdown. In some cases, users may end up
wondering whether Scylla hung in the middle of the shutdown, or in
fact exited normally. Refs #4238. So in this patch we add a "shutdown
complete" message as the very last message in a successfull shutdown.
We print Scylla's version also in the shutdown message, which may be
useful to see in the logs when shutting down one version of Scylla
and starting a different version.
Finally, we also add a log message when initialization is complete,
which may also be useful to understand whether Scylla hung during
initialization.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190217140659.19512-1-nyh@scylladb.com>
It was observed that destroying readers as soon as they are not needed
negatively affects performance of relatively small reads. We don't want
to keep them alive for too long either, since they may own a lot of
memory, but deferring the destruction slightly and removing them in
batches of 4 seems to solve the problem for the small reads.
mutation_reader_merger uses a std::list of mutation_reader to keep them
alive while the rest of the logic operates on non-owning pointers.
This means that when it is a time to drop some of the readers that are
no longer needed, the merger needs to scan the list looking for them.
That's not ideal.
The solution is to make the logic use iterators to elements in that
list, which allows for O(1) removal of an unneeded reader. Iterators to
list are just pointers to the node and are not invalidated by unrelated
additions and removals.
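The two changes above can be sketched together on a stand-in reader type (illustrative, not mutation_reader_merger's real code): the merger holds std::list iterators, which stay valid under unrelated insertions and removals, and erases dead readers deferred in batches of 4.

```cpp
#include <cstddef>
#include <list>
#include <vector>

struct reader { int id; };

struct merger {
    std::list<reader> readers;
    std::vector<std::list<reader>::iterator> to_drop;

    void mark_done(std::list<reader>::iterator it) {
        to_drop.push_back(it);     // O(1): no scan of the list needed
        if (to_drop.size() >= 4) { // deferred, batched destruction
            for (auto i : to_drop) {
                readers.erase(i);  // O(1) erase via stored iterator
            }
            to_drop.clear();
        }
    }
};
```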
Several of the test cases in perf_fast_forward do not need
partition-level fast-forwarding. However, since the defaults are used to
construct most of the readers, fast-forwarding is enabled regardless.
This showed an apparent regression in the perf_fast_forward results
after adcb3ec20c ("row_cache: read is not
single-partition if inter-partition forwarding is enabled") which
disabled an optimisation that was invalid when partition-level
fast-forwarding was requested.
This patch ensures that all single-partition reads that do not need
partition-level fast-forwarding keep it disabled.
"
Currently we keep the entries in a circular_buffer, which uses
a contiguous storage. For large partitions with many promoted index
entries this can cause OOM and sstable compaction failure.
A similar problem exists for the offset vector built
in write_promoted_index().
This change solves the problem by serializing promoted index entries
and the offset vector on the fly directly into a bytes_ostream, which
uses fragmented storage.
The serialization of the first entry is deferred, so that
serialization is avoided if there will be less than 2
entries. Promoted index is not added for such partitions.
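The deferral of the first entry can be sketched as follows, with a plain string standing in for the fragmented bytes_ostream (a simplified model, not the sstable writer's actual code): the first entry is held back unserialized, and only once a second entry arrives do both get appended to the output.

```cpp
#include <cstddef>
#include <optional>
#include <string>

struct promoted_index_writer {
    std::optional<std::string> first; // deferred, not yet serialized
    std::string out;                  // stand-in for bytes_ostream
    size_t count = 0;

    void add_entry(const std::string& e) {
        ++count;
        if (count == 1) {
            first = e;     // may stay alone: then no promoted index
            return;
        }
        if (first) {
            out += *first; // serialize on the fly, fragment-friendly
            first.reset();
        }
        out += e;
    }
};
```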
There still remains a problem that large-enough promoted index can cause OOM.
Refs #4217
Tests:
- unit (release)
- scylla-bench write
Branches: 3.0
"
* tag 'fix-large-alloc-for-promoted-index-v3' of github.com:tgrabiec/scylla:
sstables: mc: writer: Avoid large allocations for maintaining promoted index
sstables: mc: writer: Avoid double-serialization of the promoted index
"
The delete_atomically function is required to delete a set of sstables
atomically, i.e. either delete all of them or none. Deleting only
some sstables in the set might result in data resurrection in case
sstable A, holding a tombstone that covers a mutation in sstable B, is
deleted while sstable B remains.
This patchset introduces a log file holding a list of SSTable TOC files
to delete for recovering a partial delete_atomically operation.
A new subdirectory called `pending_delete` is created in the sstables
dir, holding in-flight logs.
The logs are created with a temporary name (using a .tmp suffix)
and renamed to the final .log name once ready. This indicates
the commit point for the operation.
When populating the column family, all files in the pending_delete
sub-directory are examined. Temporary log files are just removed,
and committed log files are read, replayed, and deleted.
Fixes #4082
Tests: unit (dev), database_test (debug)
"
* 'projects/delete_atomically_recovery/v5' of https://github.com/bhalevy/scylla:
tests: database_test: add test_distributed_loader_with_pending_delete
distributed_loader: replay and cleanup pending_delete log files
distributed_loader: populated_column_family: separate temp sst dirs cleanup phase
docs: add sstables-directory-structure.md
sstables: commit sstables to delete_atomically into a pending_delete log file
sstables: delete_atomically: delete sstables in a thread
sstables: component_basename: reuse with sstring component
sstables: introduce component_basename
database: maybe_delete_large_partitions_entry: do not access sstable and do not mask exceptions
sstables: add delete_sstable_and_maybe_large_data_entries
sstables: call remove_by_toc_name in dtor if marked_for_deletion
Scan the table's pending_delete sub-directory if it exists.
Remove any temporary pending_delete log files to roll back the respective
delete_atomically operation.
Replay completed pending_delete log files to roll forward the respective
delete_atomically operation, and finally delete the log files.
Cleanup of temporary sstable directories and pending_delete
sstables is done in a preliminary scan phase when populating the
column family, so that we won't attempt to load the to-be-deleted sstables.
Fixes #4082
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In preparation for replaying pending_delete log files,
we would like to first remove any temporary sst dirs
and later handle pending_delete log files, and only
then populate the column family.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To facilitate recovery of a delete_atomically operation that crashed mid
way, add a replayable log file holding the committed sstables to delete.
It will be used by populate_column_family to replay the atomic deletion.
1. Write the toc names of sstables to be deleted into a temporary file.
2. Once flushed and closed, rename the temp log file into the final name
and flush the pending_delete directory.
3. Delete the sstables.
4. Remove the pending_delete log file
and flush the pending_delete directory.
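The four steps above, plus the boot-time replay, can be sketched in Python (hypothetical file layout; the real code operates on SSTable TOC files and also fsyncs the pending_delete directory at the commit points):

```python
import os, tempfile

def delete_atomically(sst_dir, toc_names):
    """Sketch of the commit protocol for atomically deleting a set of sstables."""
    log_dir = os.path.join(sst_dir, "pending_delete")
    os.makedirs(log_dir, exist_ok=True)
    # 1. Write the toc names of the sstables to delete into a temporary file.
    fd, tmp_path = tempfile.mkstemp(dir=log_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(toc_names))
        f.flush()
        os.fsync(f.fileno())
    # 2. Rename the temp log file to its final .log name: the commit point.
    log_path = tmp_path[:-len(".tmp")] + ".log"
    os.rename(tmp_path, log_path)
    # 3. Delete the sstables.
    for toc in toc_names:
        path = os.path.join(sst_dir, toc)
        if os.path.exists(path):
            os.remove(path)
    # 4. Remove the pending_delete log file.
    os.remove(log_path)

def replay_pending_deletes(sst_dir):
    """On boot: roll back uncommitted .tmp logs, roll forward committed .log files."""
    log_dir = os.path.join(sst_dir, "pending_delete")
    if not os.path.isdir(log_dir):
        return
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        if name.endswith(".tmp"):
            os.remove(path)               # partial write: roll back
        elif name.endswith(".log"):
            with open(path) as f:         # committed: finish the deletion
                for toc in f.read().splitlines():
                    p = os.path.join(sst_dir, toc)
                    if os.path.exists(p):
                        os.remove(p)
            os.remove(path)
```

Crash-safety rests on the rename in step 2 being atomic: a crash before it leaves only a .tmp file (rolled back), a crash after it leaves a committed .log file (rolled forward).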
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
component_basename returns just the basename for the component filename
without the leading sstdir path.
To be used for delete_atomically's pending_delete log file.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
1. We would like to be able to call maybe_delete_large_partitions_entry
from the sstable destructor path in the future so the sstable might go away
while the large data entries are being deleted.
2. We would like the caller to handle any exception on this path,
especially in the preparation part, before calling delete_large_partitions_entry().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be called by delete_atomically,
rather than passing a vector to delete_sstables.
This way, there is no need to build the `sstables_to_delete_atomically` vector.
To be replaced in the future with a sstable method once we
provide the large_data_handler upon construction.
Handle exceptions from remove_by_toc_name or maybe_delete_large_partitions_entry
by merely logging an error. There is nothing else we can do at this point.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
No need to call delete_sstables, which works on a list of sstables
(by toc name).
Also, add FIXME comment about not calling
large_data_handler.maybe_delete_large_partitions_entry
on this path.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
checksummed_file_writer does not override allocate_buffer(), so it inherits
data_source_impl's default allocate_buffer, which does not care about alignment.
The buffer is then passed to the real file_data_sink_impl, and thence to the file
itself, which cannot complete the write since it is not properly aligned.
This doesn't fail in release mode, since the Seastar allocator will supply a
properly aligned buffer even if not asked to do so. The ASAN allocator usually
does supply an aligned buffer, but not always, which causes the test to fail.
Fix by forwarding the allocate_buffer() function to the underlying data_source.
Fixes #4262.
Branches: branch-3.0
Message-Id: <20190221184115.6695-1-avi@scylladb.com>
Limits are stored as uint32_t everywhere, but in some places
int32_t was used, which created inconsistencies when comparing
the value to std::numeric_limits<Type>::max().
To resolve the inconsistencies, the types are unified to uint32_t,
and instead of explicitly calling the numeric limit max,
the already existing constant query::max_rows is used.
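The kind of mismatch being fixed can be illustrated with a quick sketch, using Python's ctypes to stand in for the C++ fixed-width types (`stored` is a hypothetical limit field, not a name from the patch):

```python
import ctypes

# numeric_limits<uint32_t>::max(), i.e. the query::max_rows sentinel
UINT32_MAX = ctypes.c_uint32(-1).value
assert UINT32_MAX == 4294967295

# A limit meant to be "unlimited" (uint32_t max) stored in an int32_t field
# wraps to -1, so a later comparison against the uint32_t max never matches:
stored = ctypes.c_int32(UINT32_MAX).value
assert stored == -1
assert stored != UINT32_MAX

# Keeping the field as uint32_t everywhere makes the comparison consistent:
stored_u = ctypes.c_uint32(UINT32_MAX).value
assert stored_u == UINT32_MAX
```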
Fixes #4253
Message-Id: <4234712ff61a0391821acaba63455a34844e489b.1550683120.git.sarna@scylladb.com>
We've seen schema application failing with marshal_exception
here. That's not enough information to figure out what the
problem is. Knowing which table and column are affected would make
diagnosis much easier in certain cases.
This patch wraps errors in query::deserialization_error with more
information.
Example output:
query::deserialization_error (failed on column system_schema.tables#bloom_filter_fp_chance \
(version: c179c1d7-9503-3f66-a5b3-70e72af3392a, id: 0, index: 0, type: org.apache.cassandra.db.marshal.DoubleType):\
seastar::internal::backtraced<marshal_exception> (marshaling error: read_simple - not enough bytes (expected 8, got 3)
Message-Id: <20190221113219.13018-1-tgrabiec@scylladb.com>
This patch removes the log message about "compaction_manager - Asked to stop"
at the very end of Scylla runs. This log message is confusing because it
only has the "asked to stop" part, without a final "stopped", and may
lead a user to incorrectly fear that the shutdown hung - when it in fact
finished just fine.
The database object holds a compaction_manager and stop()s it when the
database is stop()ed - and that is the very last thing our shutdown does.
However, much earlier, as the *first* shutdown operation (i.e., the last
at_exit() in main.cc), we already stop() the compaction manager.
The second stop() call does nothing, but unfortunately prints the log
message just before checking if it has anything to stop. So this patch
just moves the log message to after the check.
Fixes #4238.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190217142657.19963-1-nyh@scylladb.com>
"
Fixes #4256
This miniseries fixes a problem with inserting NULL values through
the INSERT JSON interface.
Tests: unit (dev)
"
* 'fix_insert_json_with_null' of https://github.com/psarna/scylla:
tests: add test for INSERT JSON with null values
cql3: add missing value erasing to json parser
Can be useful in diagnosing problems with application of schema
mutations.
do_merge_schema() is called on every change of schema of the local
node.
create_table_from_mutations() is called on schema merge when a table
was altered or created using mutations read from local schema tables
after applying the change, or when loading schema on boot.
Message-Id: <20190221093929.8929-2-tgrabiec@scylladb.com>
"
cryptopp's config.h has the following pragma:
#pragma GCC diagnostic ignored "-Wunused-function"
It is not wrapped in a push/pop. Because of that, including cryptopp
headers disables that warning on scylla code too.
This patch series introduces a single .cc file that has to include
cryptopp headers.
"
* 'avoid-cryptopp-v3' of https://github.com/espindola/scylla:
Avoid including cryptopp headers
Delete dead code
cryptopp's config.h has the following pragma:
#pragma GCC diagnostic ignored "-Wunused-function"
It is not wrapped in a push/pop. Because of that, including cryptopp
headers disables that warning on scylla code too.
The issue has been reported as
https://github.com/weidai11/cryptopp/issues/793
To work around it, this patch uses a pimpl to have a single .cc file
that has to include cryptopp headers.
While at it, it also reduces the differences and code duplication
between the md5 and sha1 hashers.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This code would have had to be refactored by the next patch. Since it is
commented out, just delete it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
This series addresses the issue of redundant view updates,
generated for columns that were not selected for given materialized view.
Cases covered (quote:)
* If a base row has a live row marker, then we can avoid generating
view updates if only unselected columns change;
* If a base row has no live row marker, then we can avoid generating
view updates if unselected columns are updated, unless they are newly
created, deleted, or they have a TTL.
Additionally, this series includes caching selected columns and is_index information
to avoid unnecessary CPU cycles spent on recomputing these two.
Fixes #3819
"
* 'send_less_view_updates_if_not_necessary_4' of https://github.com/psarna/scylla:
tests: add cases for view update generation optimizations
view: minimize generated view updates for unselected columns
view: cache is_index for view pointer
index: make non-pointer overload of is_index function
index: avoid copying when checking for is_index
In some cases generating view updates for columns that were not
selected in the CREATE VIEW statement is redundant - it is the case
when the update will not influence row liveness in any way.
Currently, these cases are optimized out:
- the row marker is live and only unselected columns were updated;
- the row marker is not live and only unselected columns were updated,
and in the process nothing was created or deleted and there was
no TTL involved;
It's wasteful to keep querying the index manager every time to check
whether a view is backing a secondary index, so this value is cached
at construction time.
At the same time, this value is not simply passed to view_info
when being created in secondary index manager, in order to
decouple materialized view logic from secondary indexes as much as
possible (the sole existence of is_index() is bad enough).
allocate_segment() can fail even though we're not out of memory, when
it's invoked inside an allocating section with the cache region
locked. That section may later succeed when retried after memory
reclamation.
We should ignore bad_alloc thrown inside the allocating section body and
fail only when the whole section fails.
Fixes #2924
Message-Id: <1550597493-22500-1-git-send-email-tgrabiec@scylladb.com>
"
Fixes #3574
This series adds missing multi-column restrictions filtering to CQL.
The underlying infrastructure already allows checking multi-column
restrictions in a reasonable way, so this series consists of mostly
adding simple interfaces and parameters.
Also, unit test cases for multi-column restrictions are provided.
Tests: unit (dev)
"
* 'add_multi_column_restrictions_filtering_3' of https://github.com/psarna/scylla:
tests: add multi-column filtering tests
cql3: add multi-column restrictions filtering
cql3: add specified is_satisfied_by to multi-column restriction
cql3: rewrite raw loop in is_satisfied_by to boost::any_of
cql3: fix is_satisfied_by for multi-column restrictions
cql3: add missing include to multi-column restriction
Multi-column restrictions need only the schema, clustering key and query
options in order to decide if they are satisfied, so an overloaded
function that takes a reduced number of parameters is added.
A multi-column restriction should be satisfied by the value
if any of the ranges contains it, not all of them.
Example: SELECT * FROM t WHERE (a,b) IN ((1,2),(1,3))
will operate on two singular ranges: [(1,2),(1,2)] and [(1,3),(1,3)].
It's sufficient for a value to be inside any of these two in order
to satisfy the restriction.
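The fixed any-of semantics can be sketched as (illustrative names, tuples standing in for clustering keys):

```python
def satisfied_by(value, ranges):
    """True if `value` is contained in at least one inclusive [lo, hi] range."""
    return any(lo <= value <= hi for lo, hi in ranges)

# SELECT * FROM t WHERE (a,b) IN ((1,2),(1,3)) yields two singular ranges;
# Python tuples compare lexicographically, like clustering keys.
ranges = [((1, 2), (1, 2)), ((1, 3), (1, 3))]
assert satisfied_by((1, 2), ranges)      # inside the first range
assert satisfied_by((1, 3), ranges)      # inside the second range
assert not satisfied_by((1, 4), ranges)  # inside neither, so not satisfied
```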
"
As part of implementing sstables manager and fixing issue related
to updating large_data_handler on all delete paths, we want to funnel
all sstable creations, loading, and deletions through a manager.
The patchset lays out test infrastructure to funnel these operations
through class sstables::test_env.
In the process, it cleans up numerous call sites in the existing
unit tests that evolved over time.
Refs #4198
Refs #4149
Tests: unit (dev)
"
* 'projects/test_env/v3' of https://github.com/bhalevy/scylla:
tests: introduce sstables::test_env
tests: perf_sstable: rename test_env
tests: sstable_datafile_test: use useable_sst
tests: sstable_test: add write_and_validate_sst helper
tests: sstable_test: add test_using_reusable_sst helper
tests: sstable_test: use reusable_sst where possible
tests: sstable_test: add test_using_working_sst helper
tests: sstable_3_x_test: make_test_sstable
tests: run_sstable_resharding_test: use default parameters to make_sstable
tests: sstables::test::make_test_sstable: reorder params
tests: test_setup: do_with_test_directory is unused
tests: move sstable_resharding_strategy_tests to sstable_reharding_test
tests: move create_token_from_key helpers to test_services
tests: move column_family_for_tests to test_services
dht: move declaration of default_partitioner from sstable_datafile_test to i_partitioner.hh
Allow the --mode argument to ./configure.py and ./test.py to be repeated. This
is to allow continuous integration to configure only debug and release, leaving dev
to developers.
Message-Id: <20190214162736.16443-1-avi@scylladb.com>
Currently, we keep the entries in a circular_buffer, which uses
a contiguous storage. For large partitions with many promoted index
entries this can cause OOM and sstable compaction failure.
A similar problem exists for the offset vector built
in write_promoted_index().
This change solves the problem by serializing promoted index entries
and the offset vector on the fly directly into a bytes_ostream, which
uses fragmented storage.
The serialization of the first entry is deferred, so that
serialization is avoided entirely if there are fewer than 2
entries; no promoted index is added for such partitions.
There still remains a problem that a large-enough promoted index can cause OOM.
Refs #4217
"
This series introduces PER PARTITION LIMIT to CQL.
Protocol and storage are already capable of applying per-partition limits,
so for nonpaged queries the changes are superficial - a variable is parsed
and passed down.
For paged queries and filtering the situation is a little bit more complicated
due to corner cases: results for one partition can be split over 2 or more pages,
filtering may drop rows, etc. To solve these, another variable is added to paging
state - the number of rows already returned from last served partition.
Note that the "last" partition may stretch over any number of pages, not just
the last one, which is especially the case when filtering is involved.
As a result, per-partition-limiting queries are not eligible for page generator
optimization, because they may need to have their results locally filtered
for extraneous rows (e.g. when the next page asks for per-partition limit 5,
but we already received 4 rows from the last partition, so need just 1 more
from last partition key, but 5 from all next ones).
Tests: unit (dev)
Fixes #2202
"
* 'add_per_partition_limit_3' of https://github.com/psarna/scylla:
tests: remove superficial ignore_order from filtering tests
tests: add filtering with per partition key limit test
tests: publish extract_paging_state and count_rows_fetched
tests: fix order of parameters in with_rows_ignore_order
cql3,grammar: add PER PARTITION LIMIT
idl,service: add persistent last partition row count
cql3: prevent page generator usage for per-partition limit
cql3: add checking for previous partition count to filtering
pager: add adjusting per-partition row limit
cql3: obey per partition limit for filtering
cql3: clean up unneeded limit variables
cql3: obey per partition limit for select statement
cql3: add get_per_partition_limit
cql3: add per_partition_limit to CQL statement
Python 3.6 is the first version to accept bytes in json.loads(),
which causes the following error on older Python 3 versions:
Traceback (most recent call last):
File "/usr/lib/scylla/scylla-housekeeping", line 175, in <module>
args.func(args)
File "/usr/lib/scylla/scylla-housekeeping", line 121, in check_version
raise e
File "/usr/lib/scylla/scylla-housekeeping", line 116, in check_version
versions = get_json_from_url(version_url + params)
File "/usr/lib/scylla/scylla-housekeeping", line 55, in get_json_from_url
return json.loads(data)
File "/usr/lib64/python3.4/json/__init__.py", line 312, in loads
s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
To support those older Python versions, convert the bytes read to UTF-8
strings before calling json.loads().
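A minimal sketch of the fix (the wrapper name is hypothetical; the real change is inside scylla-housekeeping's get_json_from_url):

```python
import json

def loads_compat(data):
    """json.loads() accepts bytes only since Python 3.6; decode bytes first."""
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    return json.loads(data)

# Works the same whether the HTTP layer returned bytes or str:
assert loads_compat(b'{"version": "3.0"}') == {"version": "3.0"}
assert loads_compat('{"version": "3.0"}') == {"version": "3.0"}
```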
Fixes #4239
Branches: master, 3.0
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20190218112312.24455-1-amnon@scylladb.com>
When reporting a failure, expected rows were mixed up with received
rows. Also, the message assumed it received more rows, but it can
just as well be fewer, so now it reports a "different number" of rows.
In order to process paged queries with per-partition limits properly,
the paging state needs to keep additional information: the row count
of the last partition returned in the previous page.
That's necessary because the end of previous page and the beginning
of current one might consist of rows with the same partition key
and we need to be able to trim the results to the number indicated
by per-partition limit.
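The trimming logic can be sketched in Python (all names hypothetical): given the per-partition limit, the last partition key served, and how many of its rows previous pages already returned, drop the extraneous rows at the page boundary.

```python
def trim_page(rows, per_partition_limit, last_pk, rows_served_from_last_pk):
    """rows: list of (partition_key, row) pairs in partition order."""
    out = []
    current_pk = last_pk
    count = rows_served_from_last_pk   # rows of last_pk already on earlier pages
    for pk, row in rows:
        if pk != current_pk:
            current_pk, count = pk, 0  # a new partition starts counting afresh
        if count < per_partition_limit:
            out.append((pk, row))
            count += 1
    return out

# Previous pages already returned 4 rows of partition "p1"; the limit is 5,
# so only 1 more "p1" row may be returned, but up to 5 of the next partition.
page = [("p1", 5), ("p1", 6), ("p2", 1), ("p2", 2)]
assert trim_page(page, 5, "p1", 4) == [("p1", 5), ("p2", 1), ("p2", 2)]
```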
Paged queries that induce per-partition limits cannot use
page generator optimization, as sometimes the results need
to be filtered for extraneous rows on page breaks.
Filtering now needs to take into account per partition limits as well,
and for that it's essential to be able to compare partition keys
and decide which rows should be dropped - if previous page(s) contained
rows with the same partition key, these need to be taken into
consideration too.
For filtering pagers, per partition limit should be set
to page size every time a query is executed, because some rows
may potentially get dropped from results.
Part of the code is already implemented (counters and hinted-handoff).
Part of the code will probably never be (triggers). And the rest is
the code that estimates the number of rows per range to determine query
parallelism, but we implemented exponential growth algorithms instead.
Message-Id: <20190214112226.GE19055@scylladb.com>
"
get_restricted_ranges() is inefficient since it calculates all
vnodes that cover the requested key ranges in advance, but callers often
use only the first one. Replace the function with a generator interface
that generates the requested number of vnodes on demand.
"
* 'gleb/query_ranges_to_vnodes_generator' of github.com:scylladb/seastar-dev:
storage_proxy: limit amount of precaclulated ranges by query_ranges_to_vnodes_generator
storage_proxy: remove old get_restricted_ranges() interface
cql3/statements/select_statement: convert index query interface to new query_ranges_to_vnodes_generator interface
tests: convert storage_proxy test to new query_ranges_to_vnodes_generator interface
storage_proxy: convert range query path to new query_ranges_to_vnodes_generator interface
storage_proxy: introduce new query_ranges_to_vnode_generator interface
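The shape of the change can be sketched with a Python generator (hypothetical names and a simplified integer token ring): instead of precomputing every vnode subrange up front, the caller pulls only as many as it needs.

```python
def ranges_to_vnodes(ring_tokens, start, end):
    """Lazily split the key range [start, end) into per-vnode subranges."""
    lo = start
    for token in sorted(ring_tokens):
        if token <= lo:
            continue          # vnode boundary before the queried range
        if token >= end:
            break             # past the queried range; stop splitting
        yield (lo, token)
        lo = token
    yield (lo, end)

gen = ranges_to_vnodes([10, 20, 30, 40], 5, 35)
assert next(gen) == (5, 10)   # callers often need only the first vnode range;
assert next(gen) == (10, 20)  # the remaining ones are never computed
```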
Give the constant 1024*1024 introduced in an earlier commit a name,
"batch_memory_max", and move it from view.cc to view_builder.hh.
It now resides next to the pre-existing constant that controlled how
many rows were read in each build step, "batch_size".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190217100222.15673-1-nyh@scylladb.com>
* seastar 11546d4...2313dec (6):
> Deprecate thread_scheduling_group in favor of scheduling_group
> Merge "Fixes for Doxygen documentation" from Jesse
> future: optionally type-erase future::then() and future::then_wrapped
> build: Allow deprecated declarations internally
> rpc: fix insertion of server connections into server's container
> rpc: split BOOST_REQUIRE with long conditions into multiple
read_exactly(), when given a stream that does not contain the amount of data
requested, will loop endlessly, allocating more and more memory as it does, until
it fails with an exception (at which point it will release the memory).
Fix by returning an empty result, like input_stream::read_exactly() (which it
replaces). Add a test case that fails without a fix.
Affected callers are the native transport, commitlog replay, and internal
deserialization.
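The fixed behavior can be sketched in Python (a hypothetical stream API; the real code is Seastar's input-stream machinery): when the underlying stream is exhausted before n bytes arrive, return an empty buffer instead of looping and accumulating forever.

```python
def read_exactly(chunks, n):
    """chunks: an iterator of bytes objects, i.e. the underlying stream."""
    buf = bytearray()
    while len(buf) < n:
        chunk = next(chunks, b"")
        if not chunk:           # stream exhausted before n bytes arrived:
            return b""          # return an empty result, don't loop endlessly
        buf += chunk
    return bytes(buf[:n])

assert read_exactly(iter([b"ab", b"cd"]), 3) == b"abc"
assert read_exactly(iter([b"ab"]), 5) == b""  # short stream: empty result
```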
Fixes #4233.
Branches: master, branch-3.0
Tests: unit(dev)
Message-Id: <20190216150825.14841-1-avi@scylladb.com>
When yum-utils is already installed on Fedora, 'yum install dnf-utils'
causes a conflict and fails.
We should show a descriptive message instead of just emitting a dnf error
message.
Fixes #4215
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190215221103.2379-1-syuu@scylladb.com>
When bootstrapping, a node should wait for schema agreement
with its peers before it can join the ring. This is to ensure it can
immediately accept writes. Failing to reach schema agreement before
joining is not fatal, as the node can pull unknown schemas on writes
on-demand. However, if such a schema contains references to UDFs, the
node will reject writes using it, due to #3760.
To ensure that schema agreement is reached before joining the ring,
`storage_service::join_token_ring()` has two checks. First it checks that
at least one peer was connected previously. For this it compares
`database::get_version()` with `database::empty_version`. The (implied)
assumption is that this will become something other than
`database::empty_version` only after having connected (and pulled
schemas from) at least one peer. This assumption doesn't hold anymore,
as we now set the version earlier in the boot process.
The second check verifies that we have the same schema version as all
known, live peers. This check assumes (since 3e415e2) that we have
already "met" all (or at least some) of our peers and if there is just
one known node (us) it concludes that this is a single-node cluster,
which automatically has schema agreement.
It's easy to see how these two checks will fail. The first fails to
ensure that we have met our peers, and the second wrongfully concludes
that we are a one-node cluster, and hence have schema agreement.
To fix this, modify the first check. Instead of relying on the presence
of a non-empty database version, supposedly implying that we already
talked to our peers, explicitly make sure that we have really talked to
*at least* one other node, before proceeding to the second check, which
will now do the correct thing, actually checking the schema versions.
Fixes: #4196
Branches: 3.0, 2.3
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <40b95b18e09c787e31ba6c5519fb64d68b4ca32e.1550228389.git.bdenes@scylladb.com>
The included testcase used to crash because during database::stop() we
would try to update system.large_partition.
There doesn't seem to be any order in which we can stop the existing
services in cql_test_env that makes this possible.
This patch then adds another step when shutting down a database: first
stop updating system.large_partition.
This means that during shutdown any memtable flush, compaction or
sstable deletion will not be reflected in system.large_partition. This
is hopefully not too bad since the data in the table is TTLed.
This seems to impact only tests, since main.cc calls _exit directly.
Tests: unit (release,debug)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190213194851.117692-1-espindola@scylladb.com>
-Og is advertised as debug-friendly optimization, both in compile time
and debug experience. It also cuts sstable_mutation_test run time in half:
Changing -O0 to -Og
Before:
real 16m49.441s
user 16m34.641s
sys 0m10.490s
After:
real 8m38.696s
user 8m26.073s
sys 0m10.575s
Message-Id: <20190214205521.19341-1-avi@scylladb.com>
In preparation for providing a default large_data_handler in
a test-standard way.
The buffer_size parameter is reordered and now has the same default value
as make_sstable()'s.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For fixing issue #3362 we added in materialized views, in some cases,
"virtual columns" for columns which were not selected into the view.
Although these columns nominally exist in the view's schema, they must
not be visible to the user, and in commit
3f3a76aa8f we prevented a user from being
able to SELECT these columns.
In this patch we also prevent the user from being able to use these
column names (which shouldn't exist in the view) in WHERE restrictions.
Fixes #4216
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190212162014.18778-1-nyh@scylladb.com>
The bulk materialized-view building process (when adding a materialized
view to a table with existing data) currently reads the base table in
batches of 128 (view_builder::batch_size) rows. This is clearly better
than reading entire partitions (which may be huge), but still, 128 rows
may grow pretty large when we have rows with large strings or blobs,
and there is no real reason to buffer 128 rows when they are large.
Instead, when the rows we read so far exceed some size threshold (in this
patch, 1MB), we can operate on them immediately instead of waiting for
128.
As a side-effect, this patch also solves another bug: At worst case, all
the base rows of one batch may be written into one output view partition,
in one mutation. But there is a hard limit on the size of one mutation
(commitlog_segment_size_in_mb, by default 32MB), so we cannot allow the
batch size to exceed this limit. By not batching further after 1MB,
we avoid reaching this limit when individual rows do not reach it but
128 of them did.
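The batching condition can be sketched as (hypothetical names, mirroring the patch's batch_size of 128 rows and 1 MB memory cap):

```python
BATCH_SIZE = 128            # rows per build step
BATCH_MEMORY_MAX = 1 << 20  # flush earlier once buffered rows exceed 1 MB

def build_batches(rows):
    """rows: iterable of (row, size_in_bytes) pairs; yields lists of rows."""
    batch, batch_bytes = [], 0
    for row, size in rows:
        batch.append(row)
        batch_bytes += size
        # Flush when either the row-count or the memory threshold is reached.
        if len(batch) >= BATCH_SIZE or batch_bytes >= BATCH_MEMORY_MAX:
            yield batch
            batch, batch_bytes = [], 0
    if batch:
        yield batch

# 10 rows of 512 KB each: flushed every 2 rows instead of waiting for 128.
batches = list(build_batches((f"r{i}", 512 * 1024) for i in range(10)))
assert [len(b) for b in batches] == [2, 2, 2, 2, 2]
```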
Fixes #4213.
This patch also includes a unit test reproducing #4213, and demonstrating
that it is now solved.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190214093424.7172-1-nyh@scylladb.com>
Fixes #4222
Iff an extension creation callback returns null (not an exception),
we treat this as "I'm not needed" and simply ignore it.
Message-Id: <20190213124311.23238-1-calle@scylladb.com>
The way the `pkg-config` executable works on Fedora and Ubuntu is
different, since on Fedora `pkg-config` is provided by the `pkgconf`
project.
In the build directory of Seastar, `seastar.pc` and `seastar-testing.pc`
are generated. `seastar` is a requirement of `seastar-testing`.
When pkg-config is invoked like this:
pkg-config --libs build/release/seastar-testing.pc
the version of `pkg-config` on Fedora resolves the reference to
`seastar` in `Requires` to the `seastar.pc` in the same directory.
However, the version of `pkg-config` on Ubuntu 18.04 does not:
Package seastar was not found in the pkg-config search path.
Perhaps you should add the directory containing `seastar.pc'
to the PKG_CONFIG_PATH environment variable
Package 'seastar', required by '/seastar-testing', not found
To address the divergent behavior, we set the `PKG_CONFIG_PATH` variable
to point to the directory containing `seastar.pc`. With this change, I
was able to configure Scylla on both Fedora 29 and Ubuntu 18.04.
Fixes #4218
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <d7164bde2790708425ac6761154d517404818ecd.1550002959.git.jhaberku@scylladb.com>
"
Fixes #4083
Instead of a sharded collection in system.local, use a
dedicated system table (system.truncated) to store
truncation positions. This makes query/update simpler
and easier on query memory.
The code also migrates any existing truncation
positions on startup and clears the old data.
"
* 'calle/truncation' of github.com:scylladb/seastar-dev:
truncation_migration_test: Add rudimentary test
system_keyspace: Add waitable for trunc. migration
cql_test_env: Add separate config w. feature disable
cql_test_env: Add truncation migration to init
cql_assertions: Add null/non-null tests
storage_service: Add features disabling for tests
Add system.truncated documentation in docs
commitlog_replay: Use dedicated table for truncation
storage_service: Add "truncation_table" feature
Fixes #4083
Instead of a sharded collection in system.local, use a
dedicated system table (system.truncated) to store
truncation positions. This makes query/update simpler
and easier on query memory.
The code also migrates any existing truncation
positions on startup and clears the old data.
* seastar 428f4ac...11546d4 (9):
> reactor: Fix an infinite loop caused the by high resolution timer not being monitored
> build: Add back `SEASTAR_SHUFFLE_TASK_QUEUE`
> build: Unify dependency versions
> future-util: optimize parallel_for_each() with single element
> core/sharded.hh: fix doxygen for "Multicore" group
> build: switch from travis-ci to circleci
> perftune.py: fix irqbalance tuning on Ubuntu 18
> build: Make the use of sanitizers transitive
> net: ipv6: fix ipv6 detection and tests by binding to loopback
"
Recently, there has been a series of incidents of the multishard
combining reader deadlocking, when the concurrency of reads was
severely restricted and there was no timeout for the read.
Several fixes have been merged (414b14a6b, 21b4b2b9a, ee193f1ab,
170fa382f) but eliminating all occurrences of deadlocks proved to be a
whack-a-mole game. After the last bug report I have decided that instead
of trying to plug new holes as we find them, I'll try to make holes
impossible to appear in the first place. To translate this into the
multishard reader, instead of sprinkling new `reader.pause()` calls all
over the place in the multishard reader to solve the newly found
deadlocks, make the pausing of readers fully automatic on the shard
reader level. Readers are now always kept in a paused state, except when
actually used. This eliminates the entire class of deadlock bugs.
This patch-set also aims at simplifying the multishard reader code, as
well as the code of the existing `lifecycle_policy` implementations.
This effort resulted in:
* mutation_reader.cc: no change in SLOC, although it now also contains
logic that used to be duplicated in every `lifecycle_policy`
implementation;
* multishard_mutation_query.cc: 150 SLOC removed;
* database.cc: 30 SLOC removed;
Also the code is now (hopefully) simpler, safer and has a clearer
structure.
Fixes #4050 (main issue)
Fixes #3970
Fixes #3998 (deprecates really)
"
* 'simplify-and-fix-multishard-reader/v3.1' of https://github.com/denesb/scylla:
query_mutations_on_all_shards(): make states light-weight
query_mutations_on_all_shards(): get rid of read_context::paused_reader
query_mutations_on_all_shards(): merge the dismantling and ready_to_save states into saving state
query_mutations_on_all_shards(): pause looked-up readers
query_mutation_on_all_shards(): remove unnecessary indirection
shard_reader: auto pause readers after being used
reader_concurrency_semaphore::inactive_read_handle: fix handle semantics
shard_reader: make reader creation sync
shard_reader: use semaphore directly to pause-resume
shard_reader: recreate_reader(): fix empty range case
foreign_reader: rip out the now unused private API
shard_reader: move away from foreign_reader
multishard_combining_reader: make shard_reader a shared pointer
multishard_combining_reader: move the shard reader definition out
multishard_combining_reader: disentangle shard_reader
Previously the different states a reader can be in were all separate
structs, and were joined together by a variant. When this was designed
this made sense as states were numerous and quite different. By this
point however the number of states has been reduced to 4, with 3 of them
being almost the same. Thus it makes sense to merge these states into a
single struct and keep track of the current state with an enum field.
This can theoretically increase the chances of mistakes, but in practice
I expect the opposite, due to the simpler (and smaller) code. Also, all
the important checks that verify that a reader is in the state expected
by the code are left in place.
A byproduct of this change is that the amount of cross-shard writes is
greatly reduced. Whereas previously the whole state object had to be
rewritten on state change, now a single enum value has to be updated.
Cross shard reads are reduced as well to the read of a few foreign
pointers, all state-related data is now kept on the shard where the
associated reader lives.
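The variant-to-enum refactoring described above can be sketched in plain C++ (all names here are invented for illustration, not the actual Scylla types):

```cpp
#include <cassert>

// Illustrative sketch only: the states collapse into one struct holding
// the union of their (now almost identical) fields, plus an enum tag.
enum class reader_state { inactive, used, saving, done };

struct reader_meta {
    reader_state state = reader_state::inactive;
    bool has_reader = false;
};

// A state change is now a single enum write instead of rewriting a whole
// variant alternative -- hence far fewer cross-shard writes.
inline void transition(reader_meta& rm, reader_state to) {
    rm.state = to;
}
```

The important invariant checks mentioned above would assert on the enum field at the entry of each operation.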
These two states are now the same, with the artificial distinction that
all readers are promoted to the ready_to_save state after the compaction
state and the combined buffer are dismantled. From a practical
perspective this distinction is meaningless, so merge the two states
into a single `saving` state.
At the beginning of each page, all saved readers from the previous pages
(if any) are looked up, so they can be reused. Some of these saved
readers can end up not being used at all for the current page, in which
case they will needlessly sit on their permit for the duration of
filling the page. Avoid this by immediately pausing all looked-up
readers. This also allows a nice unification of the reader saving logic, as
now *all* readers will be in a paused state when `save_reader()` is
called. Previously, looked-up, but not used readers were an exception to
this, requiring extra logic to handle both cases. This logic can now be
removed.
Previously it was the responsibility of the layer above (multishard
combining reader) to pause readers, which happened via an explicit
`pause()` call. This proved to be a very bad design as we kept finding
spots where the multishard reader should have paused the reader to avoid
potential deadlocks (due to starved reader concurrency semaphores), but
didn't.
This commit moves the responsibility of pausing the reader into the
shard reader. The reader is now kept in a paused state, except when it
is actually used (a `fill_buffer()` or `fast_forward_to()` call is
executing). This is fully transparent to the layer above.
As a side note, the shard reader now also hides when the reader is
created. This also used to be the responsibility of the multishard
reader, and although it caused no problems so far, it can be considered
a leak of internal details. The shard reader now automatically creates
the remote reader the first time it is used.
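The automatic pausing described above is naturally expressed as an RAII guard; a minimal sketch (invented names; the real shard reader registers with the `reader_concurrency_semaphore` instead of flipping a bool):

```cpp
#include <cassert>

// Illustrative sketch: a reader that is kept paused except while it is
// actually being used.
struct auto_pausing_reader {
    bool paused = true;
    int fills = 0;

    struct resume_guard {
        auto_pausing_reader& r;
        explicit resume_guard(auto_pausing_reader& ar) : r(ar) { r.paused = false; }
        ~resume_guard() { r.paused = true; } // re-pause no matter how we exit
    };

    void fill_buffer() {
        resume_guard g(*this); // resumed only for the duration of the call
        ++fills;
    }
};
```

Because the guard re-pauses in its destructor, the layer above cannot forget to pause, which is exactly the class of deadlock bug being eliminated.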
The code has been reorganized, such that there is now a clear separation
of responsibilities. The multishard combining reader handles the
combining of the output of the shard readers, as well as issuing
read-aheads. The shard reader handles read-ahead and creating the
remote reader when needed, as well as transferring the results of remote
reads to the "home" shard. The remote reader
(`shard_reader::remote_reader`, new in this patch) handles
pausing-resuming as well as recreating the reader after it was evicted.
Layers don't access each other's internals (like they used to).
After this commit, the reader passed to `destroy_reader()` will always
be in paused state.
Reader creation happens through the `reader_lifecycle_policy` interface,
which offers a `create_reader()` method. This method accepts a shard
parameter (among others) and returns a future. Its implementation is
expected to go to the specified shard and then return with the created
reader. The method is expected to be called from the shard where the
shard reader (and consequently the multishard reader) lives. This API,
while reasonable enough, has a serious flaw. It doesn't make batching
possible. For example, if the shard reader issues a call to the remote
shard to fill the remote reader's buffer, but finds that it was evicted
while paused, it has to come back to the local shard just to issue the
recreate call. This makes the code both convoluted and slow.
Change the reader creation API to be synchronous, that is, callable from
the shard where the reader has to be created, allowing for simple call
sites and batching.
This change requires that implementations of the lifecycle policy update
any per-reader data-structure they have from the remote shard. This is
not a problem however, as these data-structures are usually partitioned,
such that they can be accessed safely from a remote shard.
Another, very pleasant, consequence of this change is that now all
methods of the lifecycle interface are sync and thus calls to them
cannot overlap anymore.
This patch also removes the
`test_multishard_combining_reader_destroyed_with_pending_create_reader`
unit test, which is not useful anymore.
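The batching win of the synchronous API can be sketched in plain C++ (names invented; the real code runs this on the remote shard via seastar):

```cpp
#include <cassert>

// Illustrative sketch: with a synchronous create_reader(), one call
// executed on the remote shard can detect eviction, recreate the reader
// and fill its buffer in the same hop, instead of bouncing back to the
// home shard just to issue a recreate call.
struct remote_state {
    bool evicted = true;
    bool has_reader = false;
    int buffered_fragments = 0;
};

// Conceptually runs "on the remote shard": one batched step.
inline void fill_buffer_batched(remote_state& s) {
    if (s.evicted || !s.has_reader) {
        s.has_reader = true;   // stand-in for the sync create_reader() call
        s.evicted = false;
    }
    s.buffered_fragments += 1; // then fill immediately, same hop
}
```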
For now just emulate the old interface inside shard reader. We will
overhaul the shard reader after some further changes to minimize
noise.
The shard reader relies on the `reader_lifecycle_policy` for pausing and
resuming the remote reader. The lifecycle policy's API was designed to
be as general as possible, allowing for any implementation of
pause/resume. However, in practice, we have a single implementation of
pause/resume: registering/unregistering the reader with the relevant
`reader_concurrency_semaphore`, and we don't expect any new
implementations to appear in the future.
Thus, the generic API of the lifecycle policy is needlessly abstract,
making its implementations needlessly complex. We can instead make this
very concrete and have the lifecycle policy just return the relevant
semaphore, removing the need for every implementor of the lifecycle
policy interface to have a duplicate implementation of the very same
logic.
For now just emulate the old interface inside shard reader. We will
overhaul the shard reader after some further changes to minimize noise.
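A minimal sketch of the interface simplification (invented names; the real semaphore is seastar's `reader_concurrency_semaphore`):

```cpp
#include <cassert>

// Illustrative sketch: instead of every lifecycle_policy implementing
// pause()/resume() itself, it just exposes the relevant semaphore and the
// shard reader registers/unregisters the inactive reader with it directly.
struct concurrency_semaphore_like {
    int inactive_readers = 0;
    void register_inactive() { ++inactive_readers; }
    void unregister_inactive() { --inactive_readers; }
};

struct lifecycle_policy_like {
    concurrency_semaphore_like sem;
    // The whole pause/resume "interface" collapses to this accessor.
    concurrency_semaphore_like& semaphore() { return sem; }
};
```

With this shape, the duplicated pause/resume logic lives once in the shard reader, and implementors only answer "which semaphore?".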
If the shard reader is created for a singular range (has a single
partition), and then it is evicted after reaching EOS, when recreated we
would have to create a reader that reads an empty range, since the only
partition the range has was already read. Since it is not possible to
create a reader with an empty range, we just didn't recreate the reader
in this case. This is incorrect however, as the code might still attempt
to read from this reader, if only due to a bug, and would trigger a
crash. The correct fix is to create an empty reader that will
immediately be at EOS.
Drop all the glue code, needed in the past so the shard reader can be
implemented on top of foreign reader. As the shard reader moved away
from foreign reader, this glue code is not needed anymore.
In the past, shard reader wrapped a foreign reader instance, adding
functionality required by the multishard reader on top. This has worked
well to a certain degree, but after the addition of pause-resume of
shard reader, the cooperation with foreign reader became more-and-more a
struggle. It has now gotten to a point, where it feels like shard reader
is fighting foreign reader as much as it reuses it. This manifested
itself in the ever growing amount of glue code, and hacks baked into
foreign reader (which is supposed to be of general use), specific to
the usage in the multishard reader.
It is time we don't force this code-reuse anymore and instead implement
all the required functionality in shard reader directly.
Some members of shard reader have to be accessed even after it is
destroyed. This is required by background work that might still be
pending when the reader is destroyed. This was solved by creating a
special `state` struct, which contained all the members of the shard
reader that had to be accessed even after it was destroyed. This state
struct was managed through a shared pointer, that each continuation that
was expected to outlive the reader, held a copy of. This however created
a minefield, where each line of the code had to be carefully audited to
access only fields that will be guaranteed to remain valid.
Fix this mess by making the whole class a shared pointer, with
`enable_shared_from_this`. Now each continuation just has to make sure
to keep `this` alive and code can now access all members freely (well,
almost).
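The lifetime fix can be sketched with standard shared pointers (invented names; Scylla uses seastar's `enable_shared_from_this`, which works the same way for this purpose):

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Illustrative sketch: background continuations keep the whole object
// alive via shared_from_this(), so they may safely touch any member,
// not just a hand-picked "state" struct.
struct shard_reader_like : std::enable_shared_from_this<shard_reader_like> {
    int some_member = 0;

    std::function<void()> start_background_work() {
        auto self = shared_from_this(); // continuation co-owns the object
        return [self] { self->some_member = 42; };
    }
};
```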
Shard reader started its life as a very thin layer above foreign reader,
with just some convenience methods added. As usual, by now it has grown
into a hairy monster, its class definition out-growing even that of the
multishard reader itself. It is time shard reader is moved into the
top-level scope, improving the readability of both classes.
Currently shard reader has a reference to the owning multishard reader
and it freely accesses its members. This resulted in a mess, where it's
not clear what exactly shard reader depends on. Disentangle this mess,
by making the shard reader self-sufficient, passing all it depends on
into its constructor.
All tests that involve writing to a base table and then reading from the
view table must use the eventually() function to account for the fact that
the view update is asynchronous, and may be visible only some time after
writing the base table. Forgetting an eventually() can cause the test
to become flaky and sometimes fail because the expected data is not *yet*
in the view. Botond noticed these failures in practice in two subtests
(test_partition_key_filtering_with_slice and
test_clustering_key_in_restrictions).
This patch fixes both tests, and I also reviewed the entire source file
view_schema_test.cc and found additional places missing an eventually()
(and also places that unnecessarily used eventually() to read from the
base table), and fixed those as well.
Fixes #4212
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190212121140.14679-1-nyh@scylladb.com>
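The retry idea behind eventually() can be sketched as follows (a simplified stand-in; the real test helper also sleeps between attempts):

```cpp
#include <cassert>
#include <functional>

// Illustrative sketch: retry a check until it passes or attempts run
// out, because materialized-view updates are asynchronous and the data
// may appear in the view only some time after the base write.
inline bool eventually(std::function<bool()> check, int max_attempts = 10) {
    for (int i = 0; i < max_attempts; ++i) {
        if (check()) {
            return true;
        }
        // the real implementation sleeps/yields here before retrying
    }
    return false;
}
```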
There is no value in having a default value for the encoding_stats parameter
of write_components(). If anything it weakens the tests by encouraging
not using the real encoding stats which is not what the actual sstable
write path in Scylla does.
This patch removes the default value and makes most of the tests provide
real encoding statistics. The ones that do not are those that have no
easy way of obtaining those (and those stats are not that important for
the test itself) or there is a reason for not using those
(sstable_3_x_test::test_sstable_write_large_row uses row size thresholds
based on size with default-constructed encoding_stats).
Message-Id: <20190212124356.14878-1-pdziepak@scylladb.com>
Refs #4085
Changes commitlog descriptor to both accept "Recycled-Commitlog..."
file names, and preserve said name in the descriptor.
This ensures we pick up the not-yet-used recycled segments left
from a crash for replay. The replay in turn will simply ignore
the recycled files, and post actual replay they will be deleted
as needed.
Message-Id: <20190129123311.16050-1-calle@scylladb.com>
create-relocatable-package.py currently (refs #4194) builds a compressed
tar file, but does so using a painfully slow Python implementation of gzip,
which is a problem considering the huge size (around 2 gigabytes) of Scylla's
executable. On my machine, running it for a release build of Scylla takes a
whopping 6 minutes.
Just replacing the Python compression with a pipe to an external "gzip"
process speeds up the run to just 2 minutes. But gzip is still not optimal,
using only one thread even when on a many-core machine. If we switch to
"pigz", a parallel implementation of "gzip", all cores are used and on
my machine the compression speeds up to just 23 seconds - that's 15
times faster than before this patch.
So this patch has create-relocatable-package.py use an external pigz process.
"pigz" is now required on the build system (if you want to create packages),
so is added to install-dependencies.sh.
[avi: update toolchain]
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190212090333.3970-1-nyh@scylladb.com>
In commit ec66dd6562, in non-interactive
runs of scylla_setup all options were unintentionally set to "false",
regardless of the options passed on the scylla_setup command line. This
can lead to all sorts of wrong behaviors, and in particular one test
setup assumed it was enabling the Scylla service (which was previously
the default) but after this commit, it no longer did.
This patch restores the previous behavior: Non-interactive invocations
of scylla_setup adhere to the defaults and the command-line options,
rather than blindly choosing "false".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190211214105.32613-1-nyh@scylladb.com>
Do not recalculate too many ranges in advance: it requires a large
allocation and usually means that a consumer of the interface is going
to do too much work in parallel.
Fixes: #3767
In case salted_hash was NULL, we'd access uninitialized memory when dereferencing
the optional in get_as<>().
Protect against that by using get_opt() and failing authentication if we see a NULL.
Fixes #4168.
Tests: unit (release)
Branches: 3.0, 2.3
Message-Id: <20190211173820.8053-1-avi@scylladb.com>
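The NULL-handling fix can be sketched with std::optional (simplified stand-in; the real code uses get_opt() on a CQL result row and a real hash check):

```cpp
#include <cassert>
#include <optional>
#include <string>

// Illustrative sketch: dereferencing an empty optional (the NULL
// salted_hash case) is undefined behaviour; checking first and failing
// authentication is the safe path.
inline bool check_password(const std::optional<std::string>& salted_hash,
                           const std::string& candidate) {
    if (!salted_hash) {   // NULL column -> reject instead of crashing
        return false;
    }
    return *salted_hash == candidate; // stand-in for real hash verification
}
```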
"Fuzzy test" executes semi-random range-scans against semi-random data.
By doing so we hope to achieve a coverage of edge cases that would
be very hard to achieve by "conventional" unit tests.
Fuzzy test generates a table with a population of partitions that are
combinations of all of:
* Size of static row: none, tiny, small and large;
* Number of clustering rows: none, few, several, and lots;
* Size of clustering rows: tiny, small and large;
* Number of range deletions: few, several and lots;
* Number of rows covered by a range deletion: few, several;
As well as a partition with an extremely large static row, an extreme
number of rows and rows of extreme size.
To avoid writing an excess amount of data, the size limit of pages is
reduced to 1KB (from the default 1MB) and the row count limit of pages
is reduced to 1000 (from the default of 10000).
The test then executes range-scans against this population. For each
range scan, a random partition range is generated, that is guaranteed to
contain at least one partition (to avoid executing mostly empty scans),
as well as a random partition-slice (row ranges). The data returned by
the query is then thoroughly validated against the population
description returned by the `create_test_table()` function.
As this test has a large degree of randomness to it, covering a
quasi-infinite input-space, it can (theoretically) fail at any time.
As such I took great care in making such failures deterministically
reproducible, based on a single random seed, which is logged to the
output in case of a failure, together with instructions on how to repeat
the particular run. The test also uses extensive logging to aid
investigations. For logging, seastar's logging mechanism is used, as
`BOOST_TEST_MESSAGE` produces unintelligible output when running with
-c > 1. Log messages are carefully tagged, so that the test produces the
least amount of noise by default, while being very explicit about what's
happening when run with `debug` or especially `trace` log levels.
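The single-seed reproducibility scheme can be sketched like this (invented helper; the real test derives all its distributions from one logged seed):

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Illustrative sketch: all randomness flows from one seed, so a failing
// run can be replayed deterministically by feeding the logged seed back.
inline std::vector<int> generate_population(uint32_t seed, int n) {
    std::mt19937 eng(seed);
    std::uniform_int_distribution<int> rows(0, 1000);
    std::vector<int> sizes;
    for (int i = 0; i < n; ++i) {
        sizes.push_back(rows(eng));
    }
    return sizes;
}
```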
The existing `read_all_partitions_with_paged_scan()` implementation was
tailored to the existing, simplistic test cases. Refactor it so that it
can be used in much more complex test cases:
* Allow specifying the page's `max_size`.
* Allow specifying the query range.
* Allow specifying the partition slice's ck ranges.
* Fix minor bugs in the paging logic.
To avoid churn, a backward-compatible overload is added, that retains
the old parameter set.
This overload provides a middle ground between the very generic, but
hard-to-use "expert version" and the very restrictive and simplistic
"beginner version". It allows the user to declaratively describe the
to-be-generated population in terms of a bunch of
`std::uniform_int_distribution` objects (e.g. number of rows, size of
rows, etc.).
This allows for generating a random population in a controlled way, with
a minimum amount of boiler-plate code on the user side.
Allow the user to specify the population of the table in a generic and
flexible way. This patch essentially rewrites the `create_test_table()`
implementation from scratch, so that it populates the table using the
partition generator passed in by the user. Backward compatibility is
kept, by providing a `create_test_table()` overload that is identical to
the previous API. This overload is now implemented on top of the generic
overload.
The get_restricted_ranges() function takes the key ranges provided by
the query and divides them on vnode boundaries. It iterates over all
ranges and calculates all vnodes, but its users are usually interested
in only one vnode, since most likely it will be enough to populate a
page. If it is not enough, they will ask for more. This patch replaces
the function with a new interface that allows generating vnode ranges
on demand instead of precalculating all of them.
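The on-demand idea can be sketched as a tiny generator over integer token positions (all types invented; real vnode boundaries come from the token ring, not fixed-width slots):

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <utility>

// Illustrative sketch: yield the next vnode-bounded sub-range lazily,
// instead of splitting the whole query range up front into one big
// vector. Here vnode boundaries are pretended to be multiples of
// vnode_width for simplicity.
struct range_splitter {
    long pos, end, vnode_width;

    // Returns the next [start, end) sub-range, or nullopt when done.
    std::optional<std::pair<long, long>> next() {
        if (pos >= end) {
            return std::nullopt;
        }
        long boundary = std::min(end, (pos / vnode_width + 1) * vnode_width);
        auto r = std::make_pair(pos, boundary);
        pos = boundary;
        return r;
    }
};
```

A consumer that fills a page can stop calling next() as soon as the page is full, which is the allocation and work saving described above.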
Right now Cassandra SSTables with counters cannot be imported into
Scylla. The reason for that is that Cassandra changed their counter
representation in their 2.1 version and kept transparently supporting
both representations. We do not support their old representation, nor
is there a sane way to figure out, by looking at the data, which one is
in use.
For safety, we made the decision long ago not to import any tables
with counters: if a counter was generated in an older Cassandra, we
would misrepresent it.
In this patch, I propose we offer a non-default way to import SSTables
with counters: we can gate it with a flag, and trust that the user knows
what they are doing when flipping it (at their own peril). Cassandra
2.1 is by now pretty old; many users can safely say they've never used
anything older.
While there are tools like sstableloader that can be used to import
those counters, there are often situations in which directly importing
SSTables is either better, faster, or worse: the only option left. I
argue that having a flag that allows us to import them when we are sure
it is safe is better than having no option at all.
With this patch I was able to successfully import Cassandra tables with
counters that were generated in Cassandra 2.1, reshard and compact their
SSTables, and read the data back to get the same values in Scylla as in
Cassandra.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190210154028.12472-1-glauber@scylladb.com>
This is convenient to test scylla directly by invoking build/dev/scylla.
This needs to be done under docker because the shared objects scylla
looks for may not exist in the host system.
During quick development we may not want to go through the trouble of
packaging relocatable scylla every time to test changes.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190209021033.8400-1-glauber@scylladb.com>
"
The code reading counter cells from sstables verifies that there are no
unsupported local or remote shards. The latter are detected by checking
if all shards are present in the counter cell header (only remote shards
do not have entries there). However, the logic responsible for doing
that was incorrectly computing the total number of counter shards in a
cell if the header was larger than a single counter shard. This resulted
in incorrect complaints that remote shards are present.
Fixes #4206
Tests: unit(release)
"
* tag 'counter-header-fix/v1' of https://github.com/pdziepak/scylla:
tests/sstables: test counter cell header with large number of shards
sstables/counters: fix remote counter shard detection
The logic responsible for reading counters from sstables was getting
confused by large headers. The size of the header depends directly on
the number of shards. This test checks that we can handle cells with a
large number of counter shards properly.
Each counter cell has a header with an entry for each local and global
shard. The detection of remote shards is done by checking if there are
any counter shards that do not have an entry in the header. This is done
by computing the number of counter shards in a cell and comparing it to
the number of header entries. However, the computation was wrong and
included the size taken by the header itself. As a result, if the header
was as big or larger than a single counter shard Scylla incorrectly
complained about remote shards.
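The arithmetic of the bug can be sketched directly (sizes below are made up for illustration; the real layout is the sstable counter cell format):

```cpp
#include <cassert>

// Illustrative sketch: the number of shards in a counter cell must be
// computed from the cell size *minus* the header. Counting the header
// bytes as shard bytes inflates the count, making it exceed the number
// of header entries and triggering false "remote shard" complaints.
inline int shard_count_wrong(int cell_size, int /*header_size*/, int shard_size) {
    return cell_size / shard_size;              // header bytes counted as shards
}
inline int shard_count_fixed(int cell_size, int header_size, int shard_size) {
    return (cell_size - header_size) / shard_size;
}
```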
The interpreter as it is right now has a bug: I incorrectly assumed that
all the shared libraries that python dynamically links would be in
lib-dynload. That is not true, and at least some of them are in
site-packages.
With that, we were loading system libraries for some shared objects.
The approach taken to fix this is to just check if we're seeing a shared
library and relocate everything we see: we will end up relocating the
ones in lib64 too, but that not only should be okay, it is probably even
more fool-proof.
While doing that I noticed that I had forgotten to incorporate one
piece of previous feedback from Avi (that we're leaving temporary files
behind).
So I'm fixing that as well.
[avi: update toolchain]
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190208115501.7234-1-glauber@scylladb.com>
I was playing with the python3 interpreter trying to get pip to work,
just to see how far we can go. We don't really need pip, but I figured
it would be a good stress test to make sure that the process is working
and robust.
And it didn't really work, because although pip will correctly install
things into $relocatable_root/local/lib, sys.path will still refer to a
hardcoded /usr/local. While this should not affect Scylla, since we
expect to have all our modules in our path anyway -- and that path is
searched before /usr/local -- it is still dangerous to make an absolute
reference like this.
Unfortunately, /usr/local/ is included unconditionally by site.py,
which is executed when the interpreter is started, and there is no
environment variable I found to change that (the help string refers to
PYTHONNOUSERSITE, but I found no mention of that in site.py whatsoever).
There is a way to tell site.py not to bother to add user sites, by
passing the -s flag, which this patch does.
Aside from doing that, we also enhance PYTHONPATH to include a reference
to ./local/{lib,lib64}/python<version>/site-packages.
After applying this patch, I was able to build an interpreter containing
only python3-pip and python3-setuptools, and build the relocatable
environment from there.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190206052104.25927-1-glauber@scylladb.com>
This algorithm was already duplicated in two places
(service/pager/query_pagers.cc and mutation_reader.cc). Soon it will be
used in a third place. Instead of triplicating, move it into a function
that everybody can use.
In the next patches `create_test_cf()` will be made much more powerful
and as such generally useful. Move it into its own files so other tests
can start using it as well.
tmpdir is a helper class representing a temporary directory.
Unfortunately, it suffers for some problems such as lack of proper
encapsulation and weak typing. This has caused bugs in the past when the
user code accidentally modified the member variable with the path to the
directory.
This patch modernises tmpdir and updates its users. The path is stored
in a std::filesystem::path and available read-only to the class users.
mkdtemp and boost are replaced by standard solution.
The users are updated to use path more (when it didn't involve too many
changes to their code) and to stop using lw_shared_ptr to store the
tmpdir when it wasn't necessary.
tmpdir intentionally doesn't provide any helpers for getting the path as
a string in order to discourage weak types.
Message-Id: <20190207145727.491-1-pdziepak@scylladb.com>
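The modernised tmpdir shape can be sketched like this (a simplified stand-in; the real class lives in the test helpers and generates a properly unique directory name):

```cpp
#include <cassert>
#include <filesystem>

// Illustrative sketch: the path is stored as a std::filesystem::path and
// handed out read-only, so user code cannot accidentally reassign it
// (the bug that made the destructor remove the wrong directory).
class tmpdir_like {
    std::filesystem::path _path;
public:
    tmpdir_like()
        : _path(std::filesystem::temp_directory_path() /
                "scylla-test-sketch") { // real code generates a unique name
        std::filesystem::create_directories(_path);
    }
    ~tmpdir_like() { std::filesystem::remove_all(_path); }
    // const reference: callers can read the path, never reassign it
    const std::filesystem::path& path() const { return _path; }
};
```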
* seastar c3be06d...428f4ac (13):
> build: make the "dist" test respect the build type
> Merge 'Add support for docker --cpuset-cpus' from Juliana
> Merge "Add support for Coroutines TS" from Paweł
> Merge "Modernize dependency management" from Avi
> future: propagate broken_promise exception to abandoned continuations
> net/inet_address: avoid clang Wmissing-braces
> build: Default to the "Release" type if unspecified
> rpc: log an exception that may happen while processing an RPC message
> Add a --split-dwarf option to configure.py
> build: Fix the `StdFilesystem` module
> Compress debug info by default
> Add an option for building with split dwarf
> Dockerfile: install stow
Default constructed extremum_tracker has an uninitialised _default_value,
which basically makes it never correct to default-construct one. Since
this class is a mechanism and not a value it doesn't really need to be a
regular type, so let's drop the default constructor.
Message-Id: <20190207162430.7460-1-pdziepak@scylladb.com>
This series contains several fixes and improvements as well as new tests
for sstable code dealing with statistics.
* https://github.com/pdziepak/scylla.git sstable-stats-fixes/v1-rebased:
sstables: compaction: don't access moved-from vector of sstables
memtable: move encoding_stats_collector implementation out of header
sstables: seal_statistics(): pass encoding_stats by constant reference
sstables/mc/writer: don't assume all schema columns are present
tests/sstable3: improvements to file compare
tests: extract mutation data model
tests/data_model: add support for expiring atomic cells
tests/data_model: allow specifying timestamp for row markers
tests/memtable: test column tracking for encoding stats
sstables: use correct source of statistics in get_encoding_stats_for_compaction()
utils/extremum_tracking: preserve "not-set" status on merge
sstables/metadata_collector: move the default values to the global tracker
tests/sstables: test for reading serialisation header
tests/sstables: pass encoding stats to write_components()
tests/sstable: test merging encoding_stats
Fixes #4202.
By default write_components() uses a safe default for encoding_stats
which indicates that all columns are present. This may hide some bugs, so
let's pass the real thing in the tests that this may matter.
column_stats is a per-partition tracker, while metadata_collector is the
global one. The statistics gathered by column_stats are merged into the
metadata_collector. In order to ensure that we get proper default values
in case no value of particular kind (e.g. no TTLs) was seen they need to
be set on the global tracker, not the per-partition one.
extremum_tracker allows choosing a default value that's going to be used
only if no "real" values were provided. Since it is never compared with
the actual input values it can be anything. For instance, if the minimum
tracker default value is 0 and there was one update with the value 1 the
detected minimum is going to be 1 (the default is ignored).
However, this doesn't work when the trackers are merged, since that
process always leaves the destination tracker in the "set" state
regardless of whether any of the merged trackers has ever seen a value.
This is fixed by this patch, by properly preserving _is_set state on
merge.
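The fix can be sketched with a minimal minimum-tracker (invented names; the real class is utils/extremum_tracking's templated tracker):

```cpp
#include <cassert>

// Illustrative sketch: merging must preserve the "not set" state; the
// destination only becomes "set" if the source has actually seen a value.
struct min_tracker {
    int value = 0;        // meaningless until is_set
    bool is_set = false;

    void update(int v) {
        if (!is_set || v < value) {
            value = v;
        }
        is_set = true;
    }
    void merge(const min_tracker& other) {
        if (other.is_set) {   // the fix: skip never-updated trackers
            update(other.value);
        }
    }
    int get(int default_value) const { return is_set ? value : default_value; }
};
```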
The sstable class is responsible for many more things than it should
be. In particular, it takes care of both writing and reading sstables.
The problem this causes is that it is very easy to confuse the two.
This is what has happened in get_encoding_stats_for_compaction().
Originally, it was using _c_stats as a source of the statistics, which
is used only during the write and per-partition. Needless to say, the
returned encoding_stats were bogus.
The correct source of those statistics is get_stats_metadata().
This patch introduces some improvement to file comparison:
- exception flags are set so that any error triggers an exception and
guarantees that they are not silently ignored
- std::ios_base::binary flag is passed to open()
- istreambuf_iterator is used instead of istream_iterator. It is better
suited for comparing binary data.
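The improved comparison can be sketched as follows (a simplified stand-in for the test helper; helper names are invented):

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Illustrative sketch: open in binary mode, make a failed open throw
// instead of being silently ignored, and walk the raw bytes with
// istreambuf_iterator (which, unlike istream_iterator, does not skip
// whitespace and so is suited to binary data).
inline std::string read_file(const std::string& path) {
    std::ifstream f;
    f.exceptions(std::ifstream::failbit); // failed open throws
    f.open(path, std::ios_base::binary);
    return std::string(std::istreambuf_iterator<char>(f),
                       std::istreambuf_iterator<char>());
}

inline bool files_equal(const std::string& a, const std::string& b) {
    return read_file(a) == read_file(b);
}
```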
The interface tmpdir::path isn't properly encapsulated and its users can
modify the path even though they really shouldn't. This can happen
accidentally, in cql_test_env a reference to tmpdir::path was created
and later assigned to in one of the code paths. This caused the tmpdir
destructor to remove the wrong directory at program exit.
This patch solves the problem by avoiding referencing tmpdir::path, a
copy is perfectly acceptable considering that this is tests-only code.
Message-Id: <20190206173046.26801-1-pdziepak@scylladb.com>
rm -rf build/* was meant to start rpm building from a clean state, but
it also deletes the built scylla binaries, so it was not a good idea.
Instead of rm -rf build/*, we can check file existence in the cloned
directory; if it looks good we can reuse it.
Also, we need to run git pull in each package repo since it may not
include the latest commit.
Fixes #4189
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190206101755.2056-1-syuu@scylladb.com>
"
Fixes #4193
Fixes #3795
This series enables handling IN restrictions for regular columns,
which is needed by both filtering and indexing mechanisms.
Tests: unit (release)
"
* 'allow_non_key_in_restrictions' of https://github.com/psarna/scylla:
tests: add filtering with IN restriction test
cql3: remove unused can_have_only_one_value function
cql3: allow non-key IN restrictions
We still allow the delete of rows from system.large_partition to run
in parallel with the sstable deletion, but now we return a future that
waits for both.
Tests: unit (release)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190205001526.68774-1-espindola@scylladb.com>
Fixes #4010
Unless the user sets this explicitly, we should try to avoid
deprecated protocol versions. While gnutls should do this for
connections initiated thusly, clients such as drivers etc. might
use obsolete versions.
Message-Id: <20190107131513.30197-1-calle@scylladb.com>
"
get_compaction_history can return a lot of records, which add up to a
big http reply.
This series makes sure it will not create large allocations when
returning the results.
It adds an api to the query_processor to use paged queries with a
consumer function that returns a future, this way we can use the http
stream after each record.
This implementation will prevent large allocations and stalls.
Fixes #4152
"
* 'amnon/compaction_history_stream_v7' of github.com:scylladb/seastar-dev:
tests/query_processor_test: add query_with_consumer_test
system_keyspace, api: stream get_compaction_history
query_processor: query and for_each_cql_result with future
"
Virtual columns are MV-specific columns that contribute to the
liveness of view rows. However, we were not adding those columns when
creating an index's underlying MV, causing indexes to miss base rows.
Fixes #4144
Branches: master, branch-3.0
"
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
* 'sec-index/virtual-columns/v1' of https://github.com/duarten/scylla:
tests/secondary_index_test: Add reproducer for #4144
index/secondary_index_manager: Add virtual columns to MV
"
We would like to deploy Scylla in constrained environments where
internet access is not permitted. In those environments it is not
possible to acquire the dependencies of Scylla from external repos, and
the packages have to be shipped alongside their dependencies.
In older distributions, like CentOS7, there isn't a python3 interpreter
available. And while we can package one from EPEL, this tends to break
in practice when installing the software on older patchlevels (for
instance, installing into RHEL7.3 when the latest is RHEL7.5).
The reason for that, as we saw in practice, is that EPEL may
not respect RHEL patchlevels and have the python interpreter depending
on newer versions of some system libraries.
virtualenv can be used to create isolated python environments, but it
is not designed for full isolation and I hit at least two roadblocks in
practice:
1) It doesn't copy the files, linking some instead. There is an
--always-copy option but it is broken (for years) in some
distributions.
2) Even when the above works, it still doesn't copy some files, relying
on the system files instead (one sad example was the subprocess
module that was just kept in the system and not moved to the
virtualenv)
This patch solves that problem by creating a python3 environment in a
directory with the modules that Scylla uses, and nothing else. It is
essentially doing what virtualenv should do but doesn't. Once this
environment is assembled the binaries are then made relocatable the same
way the Scylla binary is.
One difference (for now) between the Scylla binary relocation process
and ours is that we steer away from LD_LIBRARY_PATH: the environment
variable is inherited by any child process stemming from the caller,
which means that we are unable to use the subprocess module to call
system binaries like mkfs (which our scripts do a lot). Instead, we rely
on RUNPATH to tell the binary where to search for its libraries.
Once we generate an archive with the python3 interpreter, we then
package it as an rpm with barely any dependencies. The dependencies listed
are:
$ rpm -qpR scylla-relocatable-python3-3.6.7-1.el7.x86_64.rpm
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PartialHardlinkSets) <= 4.0.4-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rpmlib(PayloadIsXz) <= 5.2-1
And the total size of that rpm, with all modules scylla needs is 20MB.
The Scylla rpm now has a far more modest dependency list:
$ rpm -qpR scylla-server-666.development-0.20190121.80b7c7953.el7.x86_64.rpm | sort | uniq
/bin/sh
curl
file
hwloc
kernel >= 3.10.0-514
mdadm
pciutils
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rpmlib(PayloadIsXz) <= 5.2-1
scylla-conf
scylla-relocatable-python3 <== our python3 package.
systemd-libs
util-linux
xfsprogs
I have tested this end to end by generating RPMs from our master branch,
then installing them in a clean CentOS7.3 installation without even
using yum, just rpm -Uhv <package_list>
Then I called scylla_setup to make sure all python scripts were working
and started Scylla successfully.
"
* 'scylla-python3-v5' of github.com:glommer/scylla:
Create a relocatable python3 interpreter
spec file: fix python3 dependency list.
fixup scripts before installing them to their final location
automatically relocate python scripts
make scyllatop relocatable
use relative paths for installing scylla and iotune binaries
This patch adds a unit test for querying with a consumer function.
Querying with a consumer uses paging; the test covers scenarios where the
number of rows is below and above the page size, and it also tests the
option to stop in the middle of reading.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
get_compaction_history can return a big chunk of data.
To prevent large memory allocations, get_compaction_history now reads
each compaction_history record and uses the HTTP stream to send it.
Fixes #4152
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
query and for_each_cql_result accept a function that reads a row and
returns a stop_iteration.
This implementation of those functions takes a function that returns a
future<stop_iteration>, allowing preemption between calls.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
We would like to deploy Scylla in constrained environments where
internet access is not permitted. In those environments it is not
possible to acquire the dependencies of Scylla from external repos and
the packages have to be sent along with their dependencies.
In older distributions, like CentOS 7, there isn't a python3 interpreter
available. And while we can package one from EPEL, this tends to break in
practice when installing the software in older patchlevels (for
instance, installing into RHEL7.3 when the latest is RHEL7.5).
The reason for that, as we saw in practice, is that EPEL may
not respect RHEL patchlevels and have the python interpreter depending
on newer versions of some system libraries.
virtualenv can be used to create isolated python environments, but it is
not designed for full isolation and I hit at least two roadblocks in
practice:
1) It doesn't copy the files, linking some instead. There is an
--always-copy option but it is broken (for years) in some
distributions.
2) Even when the above works, it still doesn't copy some files, relying
on the system files instead (one sad example was the subprocess
module that was just kept in the system and not moved to the
virtualenv)
This patch solves that problem by creating a python3 environment in a
directory with the modules that Scylla uses, and nothing else. It is
essentially doing what virtualenv should do but doesn't. Once this
environment is assembled the binaries are then made relocatable the same
way the Scylla binary is.
One difference (for now) between the Scylla binary relocation process
and ours is that we steer away from LD_LIBRARY_PATH: the environment
variable is inherited by any child process stemming from the caller,
which means that we are unable to use the subprocess module to call
system binaries like mkfs (which our scripts do a lot). Instead, we rely
on RUNPATH to tell the binary where to search for its libraries.
In terms of the python interpreter, PYTHONPATH does not need to be set
for this to work as the python interpreter will include the lib
directory in its PYTHONPATH. To confirm this, we executed the following
code:
bin/python3 -c "import sys; print('\n'.join(sys.path))"
with the interpreter unpacked to both /home/centos/glaubertmp/test/ and
/tmp. It yields respectively:
/home/centos/glaubertmp/test/lib64/python36.zip
/home/centos/glaubertmp/test/lib64/python3.6
/home/centos/glaubertmp/test/lib64/python3.6/lib-dynload
/home/centos/glaubertmp/test/lib64/python3.6/site-packages
and
/tmp/python/lib64/python36.zip
/tmp/python/lib64/python3.6
/tmp/python/lib64/python3.6/lib-dynload
/tmp/python/lib64/python3.6/site-packages
This was tested by moving the .tar.gz generated on my Fedora28 laptop to
a CentOS machine without python3 installed. I could then invoke
./scylla_python_env/python3 and use the interpreter to call 'ls' through
the subprocess module.
I have also tested that we can successfully import all the modules we listed
for installation and that we can read a sample yaml file (since PyYAML depends
on the system's libyaml, we know that this works)
Time to build:
real 0m15.935s
user 0m15.198s
sys 0m0.382s
Final archive size (uncompressed): 81MB
Final archive size (compressed)  : 25MB
Signed-off-by: Glauber Costa <glauber@scylladb.com>
--
v3:
- rewrite in python3
- do not use temporary directories, add directly to the archive. Only the python binary
has to be materialized
- Use --cacheonly for repoquery, and also repoquery --list in a second step to grab the file list
v2:
- do not use yum, resolve dependencies from installed packages instead
- move to scripts as Avi wants this not only for old offline CentOS
The dependency list as it was did not reflect the fact that scyllatop is
now written in python3.
Some packages, like urwid, should use the python3 version. CentOS
doesn't really have an urwid package for python3, not even in EPEL. So
this officially marks the point in which we can't build packages that
will install in CentOS7 anyway.
Luckily, we will soon be providing our own python3 interpreter. But for
now, as a first step, simplify the dependency list by removing the
CentOS/Fedora conditional and listing the full python3 list
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Before installing python files to their final location in install.sh,
replace them with a thunk so that they can work with our python3
interpreter. The way the thunk works, they will also work without our
python3 interpreter so unconditionally fixing them up is always safe.
I opted in this patch to fix them up only at install time, to simplify
developers' lives: they won't have to worry about this at all.
Note about the rpm .spec file: since we are relying on specific format
for the shebangs, we shouldn't let rpmbuild mess with them. Therefore,
we need to disable a global variable that controls that behavior (by
default, Fedora rpmbuild rewrites all shebangs to /usr/bin/python3).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Given a python script at $DIR/script.py, this copies the script to
$DIR/libexec/script.py.bin, fixes its shebang to use /usr/bin/env instead
of an absolute path for the interpreter and replaces the original script
with a thunk that calls into that script.
PYTHONPATH is adjusted so that the original directory containing the script
can also serve as a source of modules, as would be originally intended.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now the binary we distribute with scyllatop calls into
/usr/lib/scylla/scyllatop/scyllatop.py unconditionally. Calling that is
all that this binary does.
This poses a problem to our relocatable process, since we don't want
to be referring to absolute paths (and moreover, that is calling python
whereas it should be calling python3).
The scyllatop.py file includes a python3 shebang and is executable.
Therefore, it is best to just create a link to that file and execute it
directly.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The build times I got with a clean ccache were:
ninja dev 10806.89s user 678.29s system 2805% cpu 6:49.33 total
ninja release 28906.37s user 1094.53s system 2378% cpu 21:01.27 total
ninja debug 18611.17s user 1405.66s system 2310% cpu 14:26.52 total
With this version -gz is not passed to seastar's configure. It should
probably be seastar's configure responsibility to do that and I will
send a separate patch to do it.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190204162112.7471-1-espindola@scylladb.com>
"uuid" was captured by reference in a continuation. This works 99.9% of the
time because the continuation is not actually delayed (and assuming we begin
the checks with non-truncated (system) cf:s, it works).
But if we do delay continuation, the resulting cf map will be
borked.
Fixes #4187.
Message-Id: <20190204141831.3387-1-calle@scylladb.com>
The multishard reader has to combine the output of all shards into a
single fragment stream. To do that, each time a `partition_start` is
read it has to check if there is another partition, from another shard,
that has to be emitted before this partition. Currently for this it
uses the partitioner. At every partition start fragment it checks if the
token falls into the current shard sub-range. The shard sub-range is the
continuous range of tokens, where each token belongs to the same shard.
If the partition doesn't belong to the current shard sub-range, the
multishard reader assumes the following shard sub-range, of the next shard,
will have data, and moves over to it. This assumption, however, only holds
for very dense tables, and fails miserably on less dense tables, resulting
in the multishard reader effectively iterating over the shard sub-ranges
(4096 in the worst case) only to find data in just a few of them. This
resulted in high user-perceived latency when scanning a sparse table.
This patch replaces this algorithm with one based on a shard heap. The
shards are now organized into a min-heap, by the next token they have
data for. When a partition start fragment is read from the current
shard, its token is compared to the smallest token in the shard heap. If
smaller, we continue to read from the current shard. Otherwise we move
to the shard with the smallest token. When constructing the reader, or
after fast-forwarding we don't know what first token each reader will
produce. To avoid reading in a partition from each reader, we assume
each reader will produce the first token from the first shard sub-range
that overlaps with the query range. This algorithm performs much better
on sparse tables, while also being slightly better on dense tables.
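The heap-based merge can be illustrated with a small Python sketch (a plain k-way merge over per-shard token streams; the names are illustrative and this is not Scylla's C++ implementation):

```python
import heapq

def merge_shard_streams(shard_streams):
    # shard_streams: one iterator per shard, each yielding (token, partition)
    # pairs in token order. Keep a min-heap keyed by each shard's next token
    # and always emit from the shard holding the globally smallest token.
    heap = []
    for shard_id, it in enumerate(shard_streams):
        head = next(it, None)
        if head is not None:
            heap.append((head[0], shard_id, head[1], it))
    heapq.heapify(heap)
    while heap:
        token, shard_id, partition, it = heapq.heappop(heap)
        yield token, partition
        head = next(it, None)
        if head is not None:
            heapq.heappush(heap, (head[0], shard_id, head[1], it))
```

The point of the heap is that a shard with no data in the scanned range simply never surfaces, so sparse tables no longer force a walk over every shard sub-range.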
I did only a very rough measurement using CQL tracing. I populated a
table with four rows on a 64-shard machine, then scanned the entire
table.
Time to scan the table (microseconds):
before 27'846
after 5'248
Fixes: #4125
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <d559f887b650ab8caa79ad4d45fa2b7adc39462d.1548846019.git.bdenes@scylladb.com>
"
This is a first step in fixing #3988.
"
* 'espindola/large-row-warn-only-v4' of https://github.com/espindola/scylla:
Rename large_partition_handler
Print a warning if a row is too large
Remove defaut parameter value
Rename _threshold_bytes to _partition_threshold_bytes
keys: add schema-aware printing for clustering_key_prefix
Three error messages were supposed to include a column name, but a "{}"
was missing in the format string, so the given column name didn't actually appear
in the error message. So this patch adds the missing {}'s.
Fixes #4183.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190203112100.13031-1-nyh@scylladb.com>
Adds new columns to the "Page spans" table, named "large [B]" and
"[spans]", which show how much memory is allocated in spans of a given
size. Excludes spans used by small pools.
Useful in determining the size of the large allocations which
consume the memory.
Example output:
Page spans:
index size [B] free [B] large [B] [spans]
0 4096 4096 4096 1
1 8192 32768 0 0
2 16384 16384 0 0
3 32768 98304 2785280 85
4 65536 65536 1900544 29
5 131072 524288 471597056 3598
...
31 8796093022208 0 0 0
Large allocations: 484675584 [B]
Message-Id: <1548956406-7601-1-git-send-email-tgrabiec@scylladb.com>
Commit 976324bbb8 changed to use
get_application_state_ptr to get a pointer to the application_state. It
may return nullptr, which is dereferenced unconditionally.
In resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test, we saw:
4 nodes in the tests
n1, n2, n3, n4 are started
n1 is stopped
n1 is changed to use different shard config
n1 is restarted ( 2019-01-27 04:56:00,377 )
The backtrace happened on n2 right after n1 restarts:
0 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature STREAM_WITH_RPC_STREAM is enabled
1 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature WRITE_FAILURE_REPLY is enabled
2 INFO 2019-01-27 04:56:05,175 [shard 0] gossip - Feature XXHASH is enabled
3 WARN 2019-01-27 04:56:05,177 [shard 0] gossip - Fail to send EchoMessage to 127.0.58.1: seastar::rpc::closed_error (connection is closed)
4 INFO 2019-01-27 04:56:05,205 [shard 0] gossip - InetAddress 127.0.58.1 is now UP, status =
5 Segmentation fault on shard 0.
6 Backtrace:
7 0x00000000041c0782
8 0x00000000040d9a8c
9 0x00000000040d9d35
10 0x00000000040d9d83
11 /lib64/libpthread.so.0+0x00000000000121af
12 0x0000000001a8ac0e
13 0x00000000040ba39e
14 0x00000000040ba561
15 0x000000000418c247
16 0x0000000004265437
17 0x000000000054766e
18 /lib64/libc.so.6+0x0000000000020f29
19 0x00000000005b17d9
We do not know when this backtrace happened, but according to logs from n3 and n4:
INFO 2019-01-27 04:56:22,154 [shard 0] gossip - InetAddress 127.0.58.2 is now DOWN, status = NORMAL
INFO 2019-01-27 04:56:21,594 [shard 0] gossip - InetAddress 127.0.58.2 is now DOWN, status = NORMAL
We can be sure the backtrace on n2 happened before 04:56:21 - 19 seconds (the
delay before gossip notices a peer is down), so the abort time is around 04:56:0X.
The migration_manager::maybe_schedule_schema_pull that triggers the backtrace
must be scheduled before n1 is restarted, because it dereferences the
application_state pointer after it sleeps for 60 seconds, so the time
maybe_schedule_schema_pull is called is around 04:55:0X, which is before n1 is
restarted.
So my theory is: migration_manager::maybe_schedule_schema_pull is scheduled; at
this time n1 has SCHEMA application_state. When n1 restarts, n2 gets new
application state from n1 which does not have SCHEMA yet. When
maybe_schedule_schema_pull wakes up from the 60-second sleep, n1 has a
non-empty endpoint_state but empty application_state for SCHEMA. We
dereference the nullptr application_state and abort.
Fixes: #4148
Tests: resharding_test.py:ReshardingTest_nodes4_with_SizeTieredCompactionStrategy.resharding_by_smp_increase_test
Message-Id: <9ef33277483ae193a49c5f441486ee6e045d766b.1548896554.git.asias@scylladb.com>
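The defensive pattern implied by the fix can be sketched in Python, with None standing in for nullptr (an illustrative sketch, not the real migration_manager code):

```python
def schema_version_of(endpoint_state):
    # The SCHEMA application state can be missing right after a peer
    # restarts, so check the pointer (None here) before dereferencing it.
    schema = endpoint_state.get("SCHEMA")  # may be absent -> None
    if schema is None:
        return None
    return schema["value"]
```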
When a repair failed, we saw logs like:
repair - Checksum of range (8235770168569320790, 8235957818553794560] on
127.0.0.1 failed: std::bad_alloc (std::bad_alloc)
It is hard to tell which keyspace and table has failed.
To fix, log the keyspace and table name. It is useful to know when debugging.
Fixes #4166
Message-Id: <8424d314125b88bf5378ea02a703b0f82c2daeda.1548818669.git.asias@scylladb.com>
Stop calling .remove_suffix on empty string_view.
ck_bview can be empty because this function can be
called for a half open range tombstone.
It is impossible to write such range tombstones to LA/KA SSTables
so we should throw a proper exception instead of allowing
undefined behaviour.
Refs #4113
Tests: unit(release)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c3738916953e4b10812aed95e645c739b4c29462.1548777086.git.piotr@scylladb.com>
All of our python scripts are there and they are all installed
automatically into /usr/lib/scylla. By keeping scylla-housekeeping
separate we are just complicating our build process.
This would be just a minor annoyance but this broke the new relocatable
process for python3 that I am trying to put together because I forgot to
add the new location as a source for the scripts.
Therefore, I propose we start being more diligent with this and keeping
all scripts together for the future.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190123191732.32126-2-glauber@scylladb.com>
When a file in the `seastar` directory changes, we want to minimize the
amount of Scylla artifacts that are re-built while ensuring that all
changes in Seastar are reflected in Scylla correctly.
For compiling object files, we change Seastar to be an "order only"
dependency so that changes to Seastar don't trigger unnecessary builds.
For linking, we add an "implicit" dependency on Seastar so that Scylla
is re-linked when Seastar changes.
With these changes, modifying a Seastar header file will trigger the
recompilation of the affected Scylla object files, and modifying a
Seastar source file will trigger linking only.
Fixes #4171
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <0ab43d79ce0d41348238465d1819d4c937ac6414.1548906335.git.jhaberku@scylladb.com>
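The two dependency kinds look like this in ninja syntax (an illustrative fragment, not the actual build.ninja rules):

```
# order-only dependency (after ||): Seastar being rebuilt does not by itself
# force scylla.o to recompile; header changes are still caught via depfiles.
build scylla.o: cxx scylla.cc || seastar
# implicit dependency (after |): relink scylla whenever Seastar's library changes.
build scylla: link scylla.o | libseastar.a
```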
A fully expired sstable is not added to the compacting set, meaning it's not actually
compacted, but it's kept in a list of sstables which incremental compaction
uses to check if any sstable can be replaced.
Incremental compaction was unconditionally removing the expired sstable from the
compacting set, which led to a segfault because an end iterator was given.
The fix changes sstable_set::erase() to follow the standard behavior for
erase functions, which works even if the target element is not present.
Fixes #4085.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190130163100.5824-1-raphaelsc@scylladb.com>
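The erase convention being adopted is the one Python's set.discard follows; a minimal sketch with hypothetical names, not the real sstable_set:

```python
class sstable_set:
    # Minimal sketch: erase() is a no-op when the element is absent,
    # instead of erasing through an end iterator and crashing.
    def __init__(self):
        self._sstables = set()

    def insert(self, sst):
        self._sstables.add(sst)

    def erase(self, sst):
        self._sstables.discard(sst)  # safe even if sst was never inserted
```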
"
This series enhances perf_simple_query error reporting by adding an
option of producing a json file containing the results. The format of
that file is very similar to the results produced by perf_fast_forward
in order to ease integration with any tools that may want to interpret
them.
In addition to that perf_simple_query now prints to the standard output
median, median absolute deviation, minimum and maximum of the partial
results, so that there is no need for external scripts to compute those
values.
"
* tag 'perf_simple_query-json/v1' of https://github.com/pdziepak/scylla:
perf_simple_query: produce json results
perf_simple_query: calculate and print statistics
perf: time_parallel: return results of each iteration
perf_simple_query: take advantage of threads in main()
"
Recently it was discovered that the memtable reader
(partition_snapshot_reader to be more precise) can violate mutation
fragment monotonicity, by re-emitting range tombstones when those overlap
with more than one ck range of the partition slice.
This was fixed by 7049cd9, however after that fix was merged a much
simpler fix was proposed by Tomek, one that doesn't involve nearly as
many changes to the partition snapshot reader and hence poses less risk
of breaking it.
This mini-series reverts the previous fix, then applies the new, simpler
one.
Refs: #4104
"
* 'partition-snapshot-reader-simpler-fix/v2' of https://github.com/denesb/scylla:
partition_snapshot_reader: don't re-emit range tombstones overlapping multiple ck ranges
Revert "partition_snapshot_reader: don't re-emit range tombstones overlapping multiple ck ranges"
Committer: Avi Kivity <avi@scylladb.com>
Branch: next
Switch to the CMake-ified Seastar
This change allows Scylla to be compiled against the `master` branch of
Seastar.
The necessary changes:
- Add `-Wno-error` to prevent a Seastar warning from terminating the
build
- The new Seastar build system generates the pkg-config files (for
example, `seastar.pc`) at configure time, so we don't need to invoke
Ninja to generate them
- The `-march` argument is no longer inherited from Seastar (correctly),
so it needs to be provided independently
- Define `SEASTAR_TESTING_MAIN` so that the definition of an entry
point is included for all unit test compilation units
- Independently link Scylla against Seastar's compiled copy of fmt in
its build directory
- All test files use the (now public) Seastar testing headers
- Add some missing Seastar headers to source files
[avi: regenerate frozen toolchain, adjust seastar submodule]
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <02141f2e1ecff5cbcd56b32768356c3bf62750c4.1548820547.git.jhaberku@scylladb.com>
When entering a new ck range (of the partition-slice), the partition
snapshot reader will apply to its range tombstones stream all the
tombstones that are relevant to the new ck range. When the partition has
range tombstones that overlap with multiple ck ranges, these will be
applied to the range tombstone stream when entering any of the ck ranges
they overlap with. This will result in the violation of the monotonicity
of the mutation fragments emitted by the reader, as these range
tombstones will be re-emitted on each ck range, if the ck range has at
least one clustering row they apply to.
For example, given the following partition:
rt{[1,10]}, cr{1}, cr{2}, cr{3}...
And a partition-slice with the following ck ranges:
[1,2], [3, 4]
The reader will emit the following fragment stream:
rt{[1,10]}, cr{1}, cr{2}, rt{[1,10]}, cr{3}, ...
Note how the range tombstone is emitted twice. In addition to violating
the monotonicity guarantee, this can also result in an explosion of the
number of emitted range tombstones.
Fix by trimming range tombstones to the start of the current ck range,
thus ensuring that they will not violate mutation fragment monotonicity
guarantees.
Refs: #4104
This is a much simpler fix for the above issue, than the already
committed one (7049cd9). The latter is reverted by the previous
patch and this patch applies the simpler fix.
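The trimming idea from the example above can be sketched in Python, treating tombstones and ck ranges as closed [start, end] intervals (illustrative only; the real code works on mutation fragments):

```python
def emit_tombstones(tombstones, ck_ranges):
    # When entering a ck range, trim every overlapping range tombstone to the
    # range's start, so a tombstone spanning several ck ranges is never
    # re-emitted from a position before the current one.
    out = []
    for r_start, r_end in ck_ranges:
        for t_start, t_end in tombstones:
            if t_end < r_start or t_start > r_end:
                continue  # tombstone does not overlap this ck range
            out.append((max(t_start, r_start), t_end))
    return out
```

For the rt{[1,10]} example with ranges [1,2] and [3,4], this emits [1,10] and then [3,10]: the second fragment starts after the first, so the stream stays monotonic.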
docs/metrics.md so far explained just the REST API for retrieving current
metrics from a single Scylla node. In this patch, I add basic explanations
on how to use the Prometheus and Grafana tools included in the
"scylla-grafana-monitoring" project.
It is true that technically, what is being explained here doesn't come
with the Scylla project and requires the separate scylla-grafana-monitoring
to be installed as well. Nevertheless, most Scylla developers will need this
knowledge eventually, and surprisingly it appears it was never documented
anywhere accessible to newbie developers, and I think metrics.md is the
right place to introduce it.
In fact, I myself wasn't aware until today that Prometheus actually had
its own Web UI on port 9090, and that it is probably more useful for
developers than Grafana is.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Reviewed-by: Botond Denes <bdenes@scylladb.com>
Message-Id: <20190129114214.17786-1-nyh@scylladb.com>
"
An error in validating CONTAINS restrictions against collections caused
only the first restriction to be taken into account due to returning
prematurely.
This miniseries provides a fix for that as well as a matching test case.
Tests: unit (release)
Fixes #4161
"
* 'fix_multiple_contains_for_one_column' of https://github.com/psarna/scylla:
tests: enable CONTAINS tests for filtering
cql3: remove premature return from is_satisfied_by
cql3: restore indentation
Tests for filtering with CONTAINS restrictions were not enabled,
so they are now. Also, another case for having two CONTAINS restrictions
for a single column is added.
Refs #4161
The function which checked whether a CONTAINS restriction is satisfied
by a collection erroneously returned prematurely after checking
just the first restriction - which works fine for the usual case,
but fails if there are multiple CONTAINS restrictions present
for a column.
Fixes #4161
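The fixed check can be sketched as follows (a hypothetical helper, not the actual cql3 code):

```python
def satisfies_contains(collection, restrictions):
    # Every CONTAINS restriction for the column must match; the bug was an
    # early return after evaluating only the first restriction.
    for wanted in restrictions:
        if wanted not in collection:
            return False
    return True
```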
The value is already passed by cql_table_large_partition_handler, so
the default was just for nop_large_partition_handler.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
For reporting large rows we have to be able to print clustering keys
in addition to partition keys.
Refs #3988.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Use the ":z" suffix to tell Docker to relabel file objects on shared
volumes. This fixes accessing the filesystem via dbuild when SELinux is enabled.
Message-Id: <20190128160557.2066-1-penberg@scylladb.com>
"
Cleanup of temporary sstable directories in distributed_loader::populate_column_family
is completely broken and untested. This code path was never executed since
populate_column_family doesn't currently list subdirectories at all.
This patchset fixes this code path and scans subdirectories in populate_column_family.
Also, a unit test is added for testing the cleanup of incomplete (unsealed) sstables.
Fixes: #4129
"
* 'projects/sst-temp-dir-cleanup/v3' of https://github.com/bhalevy/scylla:
tests: add test_distributed_loader_with_incomplete_sstables
tests: single_node_cql_env::do_with: use the provided data_file_directories path if available
tests: single_node_cql_env::_data_dir is not used
distributed_loader: populate_column_family should scan directories too
sstables: fix is_temp_dir
distributed_loader: populate_column_family: ignore directories other than sstable::is_temp_dir
distributed_loader: remove temporary sstable directories only on shard 0
distributed_loader: push future returned by rmdir into futures vector
"
This series prevents view building from falling back to storing hints.
Instead, it will try to send hints to an endpoint as if it has
consistency level ONE, and in case of failure retry the whole
building step. Then, view building will never be marked as finished
prematurely (because of pending hints), which will help avoid
creating inconsistencies when decommissioning a node from the cluster.
Tests:
unit (release)
dtest (materialized_views_test.py.*)
Fixes #3857
Fixes #4039
"
* 'do_not_mark_view_as_built_with_hints_7' of https://github.com/psarna/scylla:
db,view: add updating view_building_paused statistics
database: add view_building_paused metrics
table: make populate_views not allow hints
db,view: add allow_hints parameter to mutate_MV
storage_proxy: add allow_hints parameter to send_to_endpoint
View building uses populate_views to generate and send view updates.
This procedure will now not allow hints to be used to acknowledge
the write. Instead, the whole building step will be retried on failure.
Fixes #3857
Fixes #4039
Despite the name, this option also controls if a warning is issued
during memtable writes.
Warning during memtable writes is useful, but the option name also
exists in Cassandra, so probably the best we can do is update the
description.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190125020821.72815-1-espindola@scylladb.com>
We need to follow the changes of the rpm package build procedure on the
-jmx/-tools/-ami packages, since it has been changed when we merged the
relocatable package.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190127204436.13959-1-syuu@scylladb.com>
On Scylla 3rdparty tools, we add /opt/scylladb/lib to LD_LIBRARY_PATH.
We use same directory for relocatable binaries, including libc.so.6.
Once we install both the scylla-env package and the relocatable version of the scylla-server package, the loader tries to load libc from /opt/scylladb/lib and the entire distribution becomes unusable.
We might be able to use Obsoletes or Conflicts tags on the .rpm/.deb to avoid
installing the new Scylla package alongside scylla-env, but it's better & safer
not to share the same directory for different purposes.
Fixes #3943
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190128023757.25676-1-syuu@scylladb.com>
Make it easier to work interactively by not reporting surprising times.
There are also reports that dtest fails with incorrect timezones, but those
are probably bugs in dtest.
Message-Id: <20190127134754.1428-1-avi@scylladb.com>
Virtual columns are MV-specific columns that contribute to the
liveness of view rows. However, we were not adding those columns when
creating an index's underlying MV, causing indexes to miss base rows.
Fixes #4144
Branches: master, branch-3.0
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Test removal of sstables with temporary TOC file,
with and without temporary sstable directory.
Temporary sstable directories may be empty or still have
leftover components in them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
1. fs::canonical requires that the path exists,
and there is no need for fs::canonical here.
2. fs::path::extension returns the leading dot.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
populate_column_family currently lists only regular files, ignoring all directories.
A later patch in this series allows it to also list directories so as to clean up
the temporary sstable directories, yet valid sub-directories, like staging|upload|snapshots,
may still exist and need to be ignored.
Other kinds of handling, like validating recognized sub-directories and halting on
unrecognized sub-directories, are possible, yet out of scope for this patch(set).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Similar to calling remove_sstable_with_temp_toc later on in
populate_column_family(), we need only one thread to do the
cleanup work and the existing convention is that it's shard 0.
Since lister::rmdir is checking remove_file of all entries
(recursively) and the dir itself, doing that concurrently would fail.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
Use int64_t in data::cell for expiry / deletion time.
Extend time_overflow unit tests in cql_query_test to use
select statements with and without bypass cache to access deeper
into the system.
Refs #3353
"
* 'projects/gc_clock_64_fixes/v1' of https://github.com/bhalevy/scylla:
tests: extend time_overflow unit tests
data::cell: use int64_t for expiry and deletion time
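The motivation for widening the fields can be checked with a quick calculation: a signed 32-bit seconds-since-epoch value runs out in 2038, while int64_t does not (illustrative sketch):

```python
import datetime

# Largest time representable with a signed 32-bit seconds-since-epoch value.
limit32 = datetime.datetime(1970, 1, 1) + datetime.timedelta(seconds=2**31 - 1)
print(limit32)  # 2038-01-19 03:14:07
# An expiry or deletion time beyond this point (e.g. a long TTL written
# today) needs a wider type, hence int64_t.
```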
From day 1, scylla_setup can be run either interactively or through
command line parameters. Still, one of the requests we are asked the
most from users is whether we can provide them with a version of
scylla_setup that they can call from their scripts.
This probably happens because once you call a script interactively,
it may not be totally obvious that a different mode is available.
Even when we do tell users about that possibility, the request number
two is then "which flags do I pass?"
The solution I am proposing is to just tell users the answers to those
questions at the end of an interactive session. After this patch, we
print the following message to the console:
ScyllaDB setup finished.
scylla_setup accepts command line arguments as well! For easily provisioning in an environment similar to this one, type:
scylla_setup --no-raid-setup --nic eth0 --no-kernel-check \
--no-verify-package --no-enable-service --no-ntp-setup \
--no-node-exporter --no-fstrim-setup
Also, to avoid the time-consuming I/O tuning you can add --no-io-setup and copy the contents of /etc/scylla.d/io*
Only do that if you are moving the files into machines with the exact same hardware
Notes on the implementation: it is unfortunate for these purposes that
all our options are negated. Most conditionals are branching on true
conditions, so although I could write this:
args.no_option = not interactive_ask_service(...)
if not args.no_option:
...
I opted in this patch to write:
option = interactive_ask_service(...)
args.no_option = not option
if option:
...
There is an extra line and we have to update args separately, but it
is harder to get confused by the double negation in the conditional.
Let me know if there are disagreements here.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190124153832.21140-1-glauber@scylladb.com>
"
I've recently had to work in the types.hh/types.cc files and had a
very unpleasant experience with the incremental build on every change
to types.hh. It took ~30 min on my machine, which is almost as much
as a clean build.
I looked around and it turns out that types.hh contains the whole
hierarchy of the types. At the same time, many places access the
types only through abstract_type, which is the root of the
hierarchy.
This patchset extracts user_type_impl, tuple_type_impl,
map_type_impl, set_type_impl, list_type_impl and
collection_type_impl from types.hh and places each of them
in a separate header.
The result of this is that change in user_type_impl causes now
incremental build of ~6 min instead of ~30 min.
Change to tuple_type_impl causes incremental build of ~7.5 min
instead of ~30 min and change to map_type_impl triggers incremental
build that takes ~20 min instead of ~30 min.
Tests: unit(release)
"
* 'haaawk/types_build_speedup_2/rfc/2' of github.com:scylladb/seastar-dev:
Stop including types/list.hh in cql3/tuples.hh
Stop including types/set.hh into cql3/sets.hh
Move collection_type_impl out of types.hh to types/collection.hh
Move set_type_impl out of types.hh to types/set.hh
Move list_type_impl out of types.hh to types/list.hh
Move map_type_impl out of types.hh to types/map.hh
Move tuple_type_impl from types.hh to types/tuple.hh
Decouple database.hh from types/user.hh
Allow to use shared_ptr with incomplete type other than sstable
Move user_type_impl out of types.hh to types/user.hh
This commit declares shared_ptr<user_types_metadata> in
database.hh where user_types_metadata is an incomplete type so
it requires
"Allow to use shared_ptr with incomplete type other than sstable"
to compile correctly.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
When seastar/core/shared_ptr_incomplete.hh is included in a header
then it causes problems with all declarations of shared_ptr<T> with
incomplete type T that end up in the same compilation unit.
The problem happens when we have a compilation unit that includes
two headers a.hh and b.hh such that a.hh includes
seastar/core/shared_ptr_incomplete.hh and b.hh declares
shared_ptr<T> with incomplete type T. At the same time, this
compilation unit does not use the declared shared_ptr<T>, so it should
compile and work, but it does not, because shared_ptr_incomplete.hh
is included and it forces instantiation of:
template <typename T>
T*
lw_shared_ptr_accessors<T,
void_t<decltype(lw_shared_ptr_deleter<T>{})>>::to_value(lw_shared_ptr_counter_base*
counter) {
return static_cast<T*>(counter);
}
for each declared shared_ptr<T> with incomplete type T, even the ones
that are never used.
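The pattern being enabled can be sketched with std::shared_ptr and invented type names (the commit itself is about Seastar's lw_shared_ptr, for which this support is what "Allow to use shared_ptr with incomplete type other than sstable" adds): a header may declare a smart pointer to an incomplete type, as long as nothing forces instantiation of code that needs the complete type in units that never use it.

```cpp
#include <memory>
#include <string>
#include <utility>

// Invented stand-in names, not Scylla code.
struct user_types_metadata_like;  // incomplete here, as in database.hh

struct keyspace_metadata_like {
    std::shared_ptr<user_types_metadata_like> utm;  // declaration only: fine
};

struct user_types_metadata_like {  // completed later, e.g. in types/user.hh
    std::string name;
};

std::shared_ptr<user_types_metadata_like> make_utm(std::string name) {
    auto p = std::make_shared<user_types_metadata_like>();
    p->name = std::move(name);
    return p;
}
```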
Following commit "Decouple database.hh from types/user.hh"
moves user_types_metadata type out of database.hh and instead
declares shared_ptr<user_types_metadata> in database.hh where
user_types_metadata is incomplete. Without this commit
the compilation of the following one fails with:
In file included from ./sstables/sstables.hh:34,
from ./db/size_estimates_virtual_reader.hh:38,
from db/system_keyspace.cc:77:
seastar/include/seastar/core/shared_ptr_incomplete.hh: In
instantiation of ‘static T*
seastar::internal::lw_shared_ptr_accessors<T,
seastar::internal::void_t<decltype
(seastar::lw_shared_ptr_deleter<T>{})>
>::to_value(seastar::lw_shared_ptr_counter_base*) [with T =
user_types_metadata]’:
seastar/include/seastar/core/shared_ptr.hh:243:51: required from
‘static void seastar::internal::lw_shared_ptr_accessors<T,
seastar::internal::void_t<decltype
(seastar::lw_shared_ptr_deleter<T>{})>
>::dispose(seastar::lw_shared_ptr_counter_base*) [with T =
user_types_metadata]’
seastar/include/seastar/core/shared_ptr.hh:300:31: required from
‘seastar::lw_shared_ptr<T>::~lw_shared_ptr() [with T =
user_types_metadata]’
./database.hh:1004:7: required from ‘static void
seastar::internal::lw_shared_ptr_accessors_no_esft<T>::dispose(seastar::lw_shared_ptr_counter_base*)
[with T = keyspace_metadata]’
seastar/include/seastar/core/shared_ptr.hh:300:31: required from
‘seastar::lw_shared_ptr<T>::~lw_shared_ptr() [with T =
keyspace_metadata]’
./db/size_estimates_virtual_reader.hh:233:67: required from here
seastar/include/seastar/core/shared_ptr_incomplete.hh:38:12: error:
invalid static_cast from type ‘seastar::lw_shared_ptr_counter_base*’
to type ‘user_types_metadata*’
return static_cast<T*>(counter);
^~~~~~~~~~~~~~~~~~~~~~~~
[131/415] CXX build/release/distributed_loader.o
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Currently nop_large_partition_handler is only used in tests, but it
can also be used to avoid self-reporting.
Tests: unit(Release)
I also tested starting scylla with
--compaction-large-partition-warning-threshold-mb=0.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190123205059.39573-1-espindola@scylladb.com>
"
This series is a first small step towards rewriting
CQL restrictions layer. Primary key restrictions used to be
a template that accepts either partition_key or clustering_key,
but the implementation is already based on virtual inheritance,
so in multiple cases these templates need specializations.
Refs #3815
"
* 'detemplatize_primary_key_restrictions_2' of https://github.com/psarna/scylla:
cql3: alias single_column_primary_key_restrictions
cql3: remove KeyType template from statement_restrictions
cql3: remove template from primary_key_restrictions
cql3: remove forwarding_primary_key_restrictions
In preparation for detemplatizing this class, it's aliased with
single_column_partition_key restrictions and
single_column_clustering_key_restrictions accordingly.
Partition key restrictions and clustering key restrictions
currently require virtual function specializations and have
lots of distinct code, so there's no value in having
primary_key_restrictions<KeyType> template.
libdeflate's build places some object files in the source directory, which is
shared between the debug and release build. If the same object file (for the two
modes) is written concurrently, or if one more reads it while the other writes it,
it will be corrupted.
Fix by not building the executables at all. They aren't needed, and we already
placed the libraries' objects in the build directory (which is unshared). We only
need the libraries anyway.
Fixes #4130.
Branches: master, branch-3.0
Message-Id: <20190123145435.19049-1-avi@scylladb.com>
Commit fd422c954e aimed to fix
issue #3803. In that issue, if a query SELECTed only certain columns but
did filtering (ALLOW FILTERING) over other unselected columns, the filtering
didn't work. The fix involved adding the columns being filtered to the set
of columns we read from disk, so they can be filtered.
But that commit included an optimization: If you have clustering keys
c1 and c2, and the query asks for a specific partition key and c1 < 3 and
c2 > 3, the "c1 < 3" part does NOT need to be filtered because it is already
done as a slice (a contiguous read from disk). The committed code erroneously
concluded that both c1 and c2 don't need to be filtered, which was wrong
(c2 *does* need to be read and filtered).
In this patch, we fix this optimization. Previously, we used the "prefix
length", which in the above example was 2 (both c1 and c2 were filtered)
but we need a new and more elaborate function,
num_prefix_columns_that_need_not_be_filtered(), to determine we can only
skip filtering of 1 (c1) and cannot skip the second.
Fixes#4121. This patch also adds a unit test to confirm this.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190123131212.6269-1-nyh@scylladb.com>
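The rule num_prefix_columns_that_need_not_be_filtered() implements can be sketched with invented, simplified types (the real function inspects actual restriction objects): a leading run of single-value (EQ) clustering restrictions is served by the contiguous slice read, at most one final slice restriction can end that run, and every clustering column after it still needs filtering. For "c1 < 3 AND c2 > 3" only c1 qualifies.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch, not the Scylla signature.
enum class ck_restriction { eq, slice, other };

std::size_t num_prefix_columns_that_need_not_be_filtered(const std::vector<ck_restriction>& r) {
    std::size_t n = 0;
    for (auto rest : r) {
        if (rest == ck_restriction::eq) {
            ++n;           // still a contiguous read; keep extending the prefix
            continue;
        }
        if (rest == ck_restriction::slice) {
            ++n;           // a slice is still contiguous, but ends the prefix
        }
        break;             // everything after here must be filtered
    }
    return n;
}
```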
If docker sees the Dockerfile hasn't changed it may reuse an old image, not
caring that context files and dependent images have in fact changed. This can
happen for us if install-dependencies.sh or the base Fedora image changed.
To make sure we always get a correct image, add --no-cache to the build command.
Message-Id: <20190122185042.23131-1-avi@scylladb.com>
Done in a separate step so we can update the toolchain first.
dnf-utils is used to bring us repoquery, which we will use to derive the
list of files in the python packages.
patchelf is needed so we can add a DT_RUNPATH section to the interpreter
binary.
the python modules, as well as the python3 interpreter are taken from
the current RPM spec file.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
[avi: regenerate frozen toolchain image]
Message-Id: <20190123011751.14440-1-glauber@scylladb.com>
"
This series prepares for the integration of the `master` branch of
Seastar back into Scylla.
A number of changes to the existing build are necessary to integrate
Seastar correctly, and these are detailed in the individual change
messages.
I tested with and without DPDK, in release and debug mode.
The actual switch is a separate patch.
"
* 'jhk/seastar_cmake/v4' of https://github.com/hakuch/scylla:
build: Fix link order for DPDK
tests: Split out `sstable_datafile_test`
build: Remove unnecessary inclusion
tests: Fix use-after-free errors in static vars
build: Remove Seastar internals
build: Only use Seastar flags from pkg-config
build: Query Seastar flags using pkg-config
build: Change parameters for `pkg_config` function
"
Cache cf mappings when breaking in the middle of a segment sending so
that the sender has them the next time it wants to send this segment
for where it left off before.
Also add the "discard" metric so that we can track hints that are being
discarded in the send flow.
"
Fixes #4122
* 'hinted_handoff_cache_cf_mappings-v1' of https://github.com/vladzcloudius/scylla:
hinted handoff: cache column family mappings for segments that were not sent out in full
hinted handoff: add a "discarded" metric
Each `*_test.cc` file must be compiled separately so that there is only
one definition of `main`.
This change correctly defines an independent `sstable_datafile_test`
from `sstable_datafile_test.cc` and adds that test to the existing
suite.
We don't need to re-specify Seastar internals in Scylla's build, since
everything private to Seastar is managed via pkg-config.
We can eliminate all references to ragel and generated ragel header
files from Seastar.
We can also simplify the dependence on generated Seastar header files by
ensuring that all object files depend on Seastar being built first.
Some Seastar-specific flags were manually specified as Ninja rules, but
we want to rely exclusively on Seastar for its necessary flags.
The pkg-config file generated by the latest version of Seastar is
correct and allows us to do this, but the version generated by Scylla's
current check-out of Seastar does not. Therefore, we have to manually
adjust the pkg-config results temporarily until we update Seastar.
Previously, we manually parsed the pkg-config file. We now use
pkg-config itself to get the correct build flags.
This means that we will get the correct behavior for variable expansion,
and fields like `Requires`, `Requires.private`, and `Libs.private`.
Previously, these fields were ignored.
We will try to send a particular segment later (in 1s) from the place
where we left off if it wasn't sent out in full before. However, we may miss
some of the column family mappings when we get back to sending this file and
start sending from some entry in the middle of it (where we left off)
if we didn't save the column family mappings we cached while reading this segment
from its beginning.
This happens because commitlog doesn't save the column family information
in every entry but rather once for each unique column family (version) per
"cycle" (see the commitlog::segment description for more info).
Therefore we have to assume that a particular column family mapping
appears once in the whole segment (worst case). And therefore, when we
decide to resume sending a segment we need to keep the column family
mappings we accumulated so far and drop them only after we are done with
this particular segment (sent it out in full).
Fixes#4122
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
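The caching idea can be sketched with invented names (not the actual hinted handoff code): per-segment send state, including the column family mappings discovered so far and the offset to resume from, is kept across send attempts and dropped only once the segment has been sent out in full.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch of the per-segment cache.
struct segment_send_state {
    std::map<std::uint32_t, std::string> cf_mappings;  // cf id -> name, seen once per "cycle"
    std::size_t resume_offset = 0;                     // where the next attempt picks up
};

class segment_send_cache {
    std::map<std::string, segment_send_state> _per_segment;
public:
    segment_send_state& state_for(const std::string& segment) {
        return _per_segment[segment];                  // created on first attempt
    }
    void segment_done(const std::string& segment) {
        _per_segment.erase(segment);                   // sent in full: safe to drop
    }
    bool tracking(const std::string& segment) const {
        return _per_segment.count(segment) != 0;
    }
};
```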
Account the amount of hints that were discarded in the send path.
This may happen, for instance, due to a schema change or because a hint
is too old.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"
wrap around on 2038-01-19 03:14:07 UTC. Such dates are valid deletion
times starting 2018-01-19 with the 20 years long maximum ttl.
This patchset extends gc_clock::duration::rep to int64_t and adds
respective unit tests for the max_ttl cases.
Fixes#3353
Tests: unit (release)
"
* 'projects/gc_clock_64/v2' of https://github.com/bhalevy/scylla:
tests: cql_query_test add test_time_overflow
gc_clock: make 64 bit
sstables: mc: use int64_t for local_deletion_time and ttl
sstables: add capped_tombstone_deletion_time stats counter
sstables: mc: cap partition tombstone local_deletion_time to max
sstables: add capped_local_deletion_time stats counter
sstables: mc: metadata collector: cap local_deletion_time at max
sstables: mc: use proper gc_clock types for local_deletion_time and ttl
db: get default_time_to_live as int32_t rather than gc_clock::rep
sstables: safely convert ttl and local_deletion_time to int32_t
sstables: mc: move liveness_info initialization to members
sstables: mc: move parsing of liveness_info deltas to data_consume_rows_context_m
sstables: mc: define expired_liveness_ttl as signed int32_t
sstables: mc: change write_delta_deletion_time to receive tombstone rather than deletion_time
sstables: mc: use gc_clock types for writing delta ttl and local_deletion_time
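The arithmetic behind the 2038 wrap-around can be made concrete with a sketch (the example write date of 2019-01-01 is illustrative, a year after the commit's cutoff): with the 20-year maximum TTL, the resulting deletion time no longer fits in a 32-bit seconds counter.

```cpp
#include <cstdint>

// 2^31 - 1 seconds after the Unix epoch is 2038-01-19 03:14:07 UTC, so an
// int32_t seconds counter wraps there, while int64_t has ample headroom.
constexpr int64_t int32_max = 2147483647;                   // 2038-01-19 03:14:07 UTC
constexpr int64_t seconds_2019_01_01 = 1546300800;          // 2019-01-01 00:00:00 UTC
constexpr int64_t max_ttl = int64_t{20} * 365 * 24 * 3600;  // the 20-year maximum TTL
constexpr int64_t deletion_time = seconds_2019_01_01 + max_ttl;
constexpr bool overflows_int32 = deletion_time > int32_max; // true: needs 64 bits
```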
Commit 019a2e3a27 marked some arguments as required, which improved
the usability of scylla_setup.
The problem is that when we call scylla_setup in interactive mode,
no argument should be required. After the aforementioned commit
scylla_setup will either complain that the required arguments were
not passed if zero arguments are present, or skip interactive mode
if one of the mandatory ones is present.
This patch fixes that by checking whether or not we were invoked with
no command line arguments and lifting the requirements for mandatory
arguments in that case.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190122003621.11156-1-glauber@scylladb.com>
deletion_time struct as int32_t deletion_time that cannot hold long
time values. Cap local_deletion_time to max_local_deletion_time and
log a warning about that.
This corresponds to Cassandra's MAX_DELETION_TIME.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
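The capping can be sketched with an invented helper name (not the actual Scylla code): 64-bit values past int32_t max - 1, which mirrors Cassandra's MAX_DELETION_TIME, are capped, and the caller logs a warning when that happens.

```cpp
#include <cstdint>
#include <limits>

// The on-disk deletion_time field is int32_t; cap anything larger.
constexpr int32_t max_local_deletion_time = std::numeric_limits<int32_t>::max() - 1;

int32_t cap_local_deletion_time(int64_t seconds_since_epoch, bool& capped) {
    capped = seconds_since_epoch > max_local_deletion_time;
    return capped ? max_local_deletion_time
                  : static_cast<int32_t>(seconds_since_epoch);
}
```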
max local_deletion_time_tracker in stats is int32_t so just track the limit
of (max int32_t - 1) if time_point is greater than the limit.
This corresponds to Cassandra's MAX_DELETION_TIME.
Refs #3353
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
mc format only writes delta local_deletion_time of tombstones.
Conventional deletion_time is written only for the partition header.
Restructure the code to pass a tombstone to write_delta_deletion_time
rather than struct deletion_time to prepare for using 64-bit deletion times.
The tombstone uses gc_clock::time_point while struct
deletion_time is limited to int32_t local_deletion_time.
Note that for "live" tombstones we encode <api::missing_timestamp,
no_deletion_time> as was previously evaluated by to_deletion_time().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When the reclaim request was satisfied from the pool there's no need
to call compact_and_evict_locked(). This allows us to avoid calling
boost::range::make_heap(), which is a tiny performance difference, as
well as some confusing log messages.
Message-Id: <1548091941-8534-1-git-send-email-tgrabiec@scylladb.com>
We can invoke pkg-config with multiple options, and we specify the
package name first since this is the "target" of the pkg-config query.
Supporting multiple options is necessary for querying Seastar's
pkg-config file with `--static`, which we anticipate in a future change.
The system won't work properly if IOTune is not run. While it is fair
to skip this step because it takes long (indeed, it is common to provision
io.conf manually precisely to skip it), first-time users don't know
this and can get the impression that this is a totally optional step.
Except the node won't boot up without it.
As a user nicely put recently in our mailing list:
"...in this case, it would be even simpler to forbid answering "no"
to this not-so-optional step :)"
We should not forbid saying no to IOTune, but we should warn the user
about the consequences of doing so.
Fixes#4120
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20190121144506.17121-1-glauber@scylladb.com>
Recently we had a bug (#4096) due to a component
(`multishard_mutation_query()`) assuming that all reads used the
semaphore obtainable via `database::user_read_concurrency_sem()`.
This problem revealed that it is plain wrong to allow access to the
shard-global semaphores residing in the database object. Instead all
code wishing to access the relevant semaphore for some read, should do
so via the relevant `table` object, thus guaranteeing that it will get
the correct semaphore, configured for that table.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4f3a6780eb3240822db34aba7c1ba0a675a96592.1547734212.git.bdenes@scylladb.com>
Scylla starts doing IO much earlier than it starts the cql/thrift servers.
The IO may cause an error that will try to stop all servers, but since they
are still not running it will do nothing, and the servers will be started
later anyway. Fix it by checking that the node is not isolated before starting
the servers.
Message-Id: <20190110152830.GE3172@scylladb.com>
Many areas of the code are littered with unneeded templates. This patchset replaces
some of them, where the template parameter is a function object, with an std::function
or noncopyable_function (with a preference towards the latter; but it is not always
possible). As the template is compiled for each instantiation (if the function
object is a lambda) while a function is compiled only once, there are significant
savings in compile time and bloat.
text data bss dec hex filename
85160690 42120 284910 85487720 5187068 scylla.before
84824762 42120 284910 85151792 5135030 scylla.after
* https://github.com/avikivity/scylla detemplate/v2:
api/commitlog: de-template acquire_cl_metric()
database: de-template do_parse_schema_tables
database: merge for_all_partitions and for_all_partitions_slow
hints: de-template scan_for_hints_dirs()
schema_tables: partially de-template make_map_mutation()
distributed_loader: de-template
tests: commitlog_test: de-template
tests: cql_auth_query_test: de-template
test: de-template eventually() and eventually_true()
tests: flush_queue_test: de-template
hint_test: de-template
tests: mutation_fragment_test: de-template
test: mutation_test: de-template
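The trade-off behind the series can be illustrated with std::function (Seastar's noncopyable_function is analogous but move-only; the helper names below are invented): the template version is stamped out once per distinct lambda type at each call site, while the std::function version is compiled exactly once and call sites only pay for a thin wrapper.

```cpp
#include <functional>

// Instantiated once per lambda type: N call sites with N lambdas produce
// N copies of this code.
template <typename Func>
int apply_twice_templated(Func f, int x) {
    return f(f(x));
}

// Compiled exactly once, regardless of how many distinct callables are passed.
int apply_twice(const std::function<int(int)>& f, int x) {
    return f(f(x));
}
```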
"
This miniseries fixes the behaviour of distributed loader,
which now unconditionally mutates new sstables found in /upload
dir to LCS level 0 first, and only after that proceeds with
either queueing them for update generation or moving them
to data directory.
"
* 'restore_always_mutating_sstables_level_0' of https://github.com/psarna/scylla:
distributed_loader: restore indentation
distributed_loader: restore always mutating to level 0
Fix runtime error: signed integer overflow
introduced by 2dc3776407
Delta-encoded values may wrap around if the encoded value is
less than the base value. This could happen in two places:
In the mc-format serialization header itself, where the base values are implicit
Cassandra epoch time, and in the sstables data files, where the base values
are taken from the encoding_stats (later written to the serialization_header).
In these cases, when the calculation is done using signed integer/long we may see
"runtime error: signed integer overflow" messages in debug mode
(with -fsanitize=undefined / -fsanitize=signed-integer-overflow).
Overflow here is expected and harmless since we guarantee neither that
the base values in the serialization header are greater than or equal
to Cassandra's epoch, nor that the delta-encoded values are always
greater than or equal to the respective base values in the
serialization header.
To prevent these warnings, the subtraction/addition should be done with unsigned
(two's complement) arithmetic and the result converted to the signed type.
Note that, to keep the code simple where possible, we also rely on implicit
conversion of signed integers to unsigned when one of the added values is unsigned
and the other is signed.
Fixes: #4098
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190120142950.15776-1-bhalevy@scylladb.com>
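The technique can be sketched as follows (not the actual Scylla helpers): do the subtraction and re-addition in uint64_t, where wrap-around is well defined, then convert back to the signed type. The round trip is exact even when the value is smaller than the base. (Out-of-range unsigned-to-signed conversion is implementation-defined before C++20, but is two's complement on the platforms Scylla targets.)

```cpp
#include <cstdint>

// Delta-encode with well-defined unsigned wrap-around.
int64_t delta_encode(int64_t value, int64_t base) {
    return static_cast<int64_t>(static_cast<uint64_t>(value) - static_cast<uint64_t>(base));
}

// Decode by re-adding the base in unsigned arithmetic.
int64_t delta_decode(int64_t delta, int64_t base) {
    return static_cast<int64_t>(static_cast<uint64_t>(base) + static_cast<uint64_t>(delta));
}
```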
The internal test_propagation template is instantiated many times. Replace
with an ordinary function to reduce bloat. Call sites adjusted to have a
uniform signature.
distributed_loader has several large templates that can be converted to normal
function with the help of noncopyable_function<>, reducing code bloat.
One of the lambdas used as an actual argument was adjusted, because the de-templated
callee only accepts functions returning a future, while the original accepted both
functions returning a future and functions returning void (similar to future::then).
This long slow-path function is called four times, so de-templating it is an
easy win. We use std::function instead of noncopyable_function because the
function is copied within the parallel_for_each callback. The original code
uses a move, which is incorrect, but did not fail because moving the lambdas
that were used as the actual arguments is equivalent to a copy.
Otherwise read test results for subsequent datasets will overwrite each other.
Also, rename the population test case to not include the dataset name, which
is now redundant.
Message-Id: <1547822942-9690-1-git-send-email-tgrabiec@scylladb.com>
When entering a new ck range (of the partition-slice), the partition
snapshot reader will apply to its range tombstones stream all the
tombstones that are relevant to the new ck range. When the partition has
range tombstones that overlap with multiple ck ranges, these will be
applied to the range tombstone stream when entering any of the ck ranges
they overlap with. This will result in the violation of the monotonicity
of the mutation fragments emitted by the reader, as these range
tombstones will be re-emitted on each ck range, if the ck range has at
least one clustering row they apply to.
For example, given the following partition:
rt{[1,10]}, cr{1}, cr{2}, cr{3}...
And a partition-slice with the following ck ranges:
[1,2], [3, 4]
The reader will emit the following fragment stream:
rt{[1,10]}, cr{1}, cr{2}, rt{[1,10]}, cr{3}, ...
Note how the range tombstone is emitted twice. In addition to violating
the monotonicity guarantee, this can also result in an explosion of the
number of emitted range tombstones.
Fix by applying only those range tombstones to the range tombstone
stream, that have a position strictly greater than that of the last
emitted clustering row (or range tombstone), when entering a new ck
range.
Fixes: #4104
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <e047af76df75972acb3c32c7ef9bb5d65d804c82.1547916701.git.bdenes@scylladb.com>
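The fix can be sketched with clustering positions modeled as plain ints and a tombstone's position taken to be its start bound (a simplification; the real code compares position_in_partition values): on entering a new ck range, forward only tombstones positioned strictly after the last emitted fragment, so a tombstone overlapping several ranges is applied exactly once.

```cpp
#include <vector>

// Hypothetical sketch with int positions.
struct rt {
    int start;
    int end;
};

std::vector<rt> tombstones_for_range(const std::vector<rt>& rts, int range_start,
                                     int range_end, int last_emitted_position) {
    std::vector<rt> out;
    for (const auto& t : rts) {
        bool overlaps = t.start <= range_end && t.end >= range_start;
        // Strictly greater: a tombstone already applied in an earlier ck range
        // is not re-emitted when we enter the next range it overlaps.
        if (overlaps && t.start > last_emitted_position) {
            out.push_back(t);
        }
    }
    return out;
}
```

With the commit's example, rt{[1,10]} is applied when entering [1,2] but not again when entering [3,4] after cr{2} was emitted.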
At the moment there are inefficiencies in how
collection_type_impl::mutation::compact_and_expire() handles tombstones.
If there is a higher-level tombstone that covers the collection one
(including cases where there is no collection tombstone) it will be
applied to the collection tombstone and present in the compaction
output. This also means that the collection tombstone is never dropped
if fully covered by a higher-level one.
This patch fixes both those problems. After the compaction the
collection tombstone is either unchanged or removed if covered by a
higher-level one.
Fixes #4092.
Message-Id: <20190118174244.15880-1-pdziepak@scylladb.com>
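The fixed behaviour can be sketched with simplified tombstone types (invented names): after compaction the collection tombstone is either unchanged, or dropped when a higher-level (row/partition) tombstone covers it; the higher-level tombstone itself is no longer merged into the compaction output.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical sketch; real tombstones also carry deletion time.
struct tomb {
    int64_t timestamp;
};

std::optional<tomb> compact_collection_tombstone(std::optional<tomb> collection,
                                                 std::optional<tomb> higher_level) {
    if (collection && higher_level && collection->timestamp <= higher_level->timestamp) {
        return std::nullopt;  // fully covered: drop it from the output
    }
    return collection;        // otherwise: unchanged
}
```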
Use std::function instead of a template parameter. Likely doesn't gain
anything, because the template was always instantiated with the same type
(the result of std::bind() with the same signatures), but still good practice.
std::function was used instead of noncopyable_function because
sharded::map_reduce0() copies the input function.
"
Use input sstables stats metadata to re-calculate encoding_stats.
Fixes#3971.
"
* 'projects/compaction-encoding-stats/v3' of https://github.com/bhalevy/scylla:
compaction: mc: re-calculate encoding_stats based on column stats
memtable: extract encoding_stats_collector base class to encoding_stats header file
The compilation fails on -Warray-bounds, even though the branch is never taken:
inlined from ‘managed_bytes::managed_bytes(bytes_view)’ at ./utils/managed_bytes.hh:195:22,
inlined from ‘managed_bytes::managed_bytes(const bytes&)’ at ./utils/managed_bytes.hh:162:77,
inlined from ‘dht::token dht::bytes_to_token(bytes)’ at dht/random_partitioner.cc:68:57,
inlined from ‘dht::token dht::random_partitioner::get_token(bytes)’ at dht/random_partitioner.cc:85:39:
/usr/include/c++/8/bits/stl_algobase.h:368:23: error: ‘void* __builtin_memmove(void*, const void*, long unsigned int)’ offset 16 from the object at ‘<anonymous>’ is out of the bounds of referenced subobject ‘managed_bytes::small_blob::data’ with type ‘signed char [15]’ at offset 0 [-Werror=array-bounds]
__builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Work around by disabling the diagnostic locally.
Message-Id: <1547205350-30225-1-git-send-email-tgrabiec@scylladb.com>
Many areas of the code are littered with unneeded templates. This patchset replaces
some of them, where the template parameter is a function object, with an std::function
or noncopyable_function (with a preference towards the latter; but it is not always
possible). As the template is compiled for each instantiation (if the function
object is a lambda) while a function is compiled only once, there are significant
savings in compile time and bloat.
text data bss dec hex filename
85160690 42120 284910 85487720 5187068 scylla.before
84824762 42120 284910 85151792 5135030 scylla.after
* https://github.com/avikivity/scylla detemplate/v1:
api/commitlog: de-template acquire_cl_metric()
database: de-template do_parse_schema_tables
database: merge for_all_partitions and for_all_partitions_slow
hints: de-template scan_for_hints_dirs()
schema_tables: partially de-template make_map_mutation()
distributed_loader: de-template
tests: commitlog_test: de-template
tests: cql_auth_query_test: de-template
test: de-template eventually() and eventually_true()
tests: flush_queue_test: de-template
hint_test: de-template
tests: mutation_fragment_test: de-template
test: mutation_test: de-template
When introducing view update generation path for sstables
in /upload directory, mutating these sstables was moved
to regular path only. It was wrong, because sstables that
need view updates generated from them may still need
to be downgraded to LCS level 0, so they won't disrupt
LCS assumptions after being loaded.
Reported-by: Nadav Har'El <nyh@scylladb.com>
The internal test_propagation template is instantiated many times. Replace
with an ordinary function to reduce bloat. Call sites adjusted to have a
uniform signature.
Use noncopyable_function instead of a template parameter. Likely doesn't gain
anything, because the template was always instantiated with the same type
(the result of std::bind() with the same signatures), but still good practice.
The multishard mutation query used the semaphore obtained from
`database::user_read_concurrency_sem()` to pause-resume shard readers.
This presented a problem when `multishard_mutation_query()` was reading
from system tables. In this case the readers themselves would obtain
their permits from the system read concurrency semaphore. Since the
pausing of shard readers used the user read semaphore, pausing failed to
fulfill its objective of alleviating pressure on the semaphore the reads
obtained their permits from. In some cases this led to a deadlock
during system reads.
To ensure the correct semaphore is used for pausing-resuming readers,
obtain the semaphore from the `table` object. To avoid looking up the
table on every pause or resume call, cache the semaphores when readers
are created.
Fixes: #4096
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <c784a3cd525ce29642d7216fbe92638fa7884e88.1547729119.git.bdenes@scylladb.com>
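The caching described above can be sketched with invented names and a trivial stand-in for the semaphore (the real objects are Scylla's table and its reader concurrency semaphore): each table knows which semaphore its reads use, and a shard reader caches a pointer to it at creation time, so pause/resume always touch the semaphore the read's permit came from, with no table lookup per call.

```cpp
// Hypothetical sketch, not the Scylla types.
struct read_semaphore {
    int available_units;
};

struct table_like {
    read_semaphore* sem;  // user or system semaphore, configured per table
    read_semaphore& read_concurrency_semaphore() { return *sem; }
};

struct shard_reader {
    read_semaphore* _sem;  // cached at creation; no table lookup on pause/resume
    explicit shard_reader(table_like& t) : _sem(&t.read_concurrency_semaphore()) {}
    void pause()  { ++_sem->available_units; }  // hand the units back while paused
    void resume() { --_sem->available_units; }  // take them again
};
```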
d2dbbba139 converted scyllatop's interpreter to Python 3, but neglected to do
the actual conversion. This patch does so, by running 2to3 over all files and adding
an additional bytes->string decode step in prometheus.py. Superfluous 2to3 changes
to print() calls were removed.
Message-Id: <20190117124121.7409-1-avi@scylladb.com>
"
Before this series the limit was applied per page instead
of globally, which might have resulted in returning too many
rows.
To fix that:
1. restrictions filter now has a 'remaining' parameter
in order to stop accepting rows after enough of them
have already been accepted
2. pager passes its row limit to restrictions filter,
so no more rows than necessary will be served to the client
3. results no longer need to be trimmed on select_statement
level
Tests: unit (release)
"
* 'fix_filtering_limit_with_paging_3' of https://github.com/psarna/scylla:
tests: add filtering+limit+paging test case
tests: allow null paging state in filtering tests
cql3: fix filtering with LIMIT with regard to paging
Previously the utility to extract paging state asserted
that the state exists, but in future tests it would be useful
to be able to call this function even if it would return null.
Previously the limit was erroneously applied per page
instead of being accumulated, which might have caused returning
too many rows. As of now, LIMIT is handled properly inside
restrictions filter.
Fixes #4100
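The 'remaining' budget idea can be sketched with an invented interface: the filter carries a global budget, seeded by the pager with its row limit, so the limit accumulates across pages instead of being applied afresh on every page.

```cpp
#include <cstddef>

// Hypothetical sketch of the filter's limit handling.
struct restrictions_filter {
    std::size_t remaining;  // rows the whole query may still accept

    bool accept(bool row_matches) {
        if (!row_matches || remaining == 0) {
            return false;   // filtered out, or limit already reached globally
        }
        --remaining;
        return true;
    }
};
```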
When compacting several sstables, get and merge their encoding_stats
for encoding the result.
Introduce sstable::get_encoding_stats_for_compaction to return encoding_stats
based on the sstable's column stats.
Use encoding_stats_collector to keep track of the minimum encoding_stats
values of all input sstables.
Fixes#3971
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
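The collector can be sketched with a reduced set of fields (the real encoding_stats uses Scylla-specific clock types): it keeps the minimum of each statistic over all input sstables, which is what the mc serialization header wants as delta-encoding bases for the compaction output.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical reduced sketch of encoding_stats and its collector.
struct encoding_stats {
    int64_t min_timestamp;
    int64_t min_local_deletion_time;
    int64_t min_ttl;
};

struct encoding_stats_collector {
    encoding_stats stats{
        std::numeric_limits<int64_t>::max(),
        std::numeric_limits<int64_t>::max(),
        std::numeric_limits<int64_t>::max(),
    };
    // Fold one input sstable's stats into the running minimums.
    void update(const encoding_stats& s) {
        stats.min_timestamp = std::min(stats.min_timestamp, s.min_timestamp);
        stats.min_local_deletion_time = std::min(stats.min_local_deletion_time, s.min_local_deletion_time);
        stats.min_ttl = std::min(stats.min_ttl, s.min_ttl);
    }
};
```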
Number of rows sent and received
- tx_row_nr
- rx_row_nr
Bytes of rows sent and received
- tx_row_bytes
- rx_row_bytes
Number of row hashes sent and received
- tx_hashes_nr
- rx_hashes_nr
Number of rows read from disk
- row_from_disk_nr
Bytes of rows read from disk
- row_from_disk_bytes
Message-Id: <d1ee6b8ae8370857fe45f88b6c13087ea217d381.1547603905.git.asias@scylladb.com>
"
This series adds generating view updates from sstables added through
/upload directory if their tables have accompanying materialized views.
Said sstables are left in /upload directory until updates are generated
from them and are treated just like staging sstables from /staging dir.
If there are no views for a given table, sstables are simply moved
from the /upload dir to the data dir without any changes.
Tests: unit (release)
"
* 'add_handling_staging_sstables_to_upload_dir_5' of https://github.com/psarna/scylla:
all: rename view_update_from_staging_generator
distributed_loader: fix indentation
service: add generating view updates from uploaded sstables
init: pass view update generator to storage service
sstables: treat sstables in upload dir as needing view build
sstables,table: rename is_staging to requires_view_building
distributed_loader: use proper directory for opening SSTable
db,view: make throttling optional for view_update_generator
"
This series addresses the problem mentioned in issue 4032, which is a race
between creating a view and streaming sstables to a node. Before this patch
the following scenario is possible:
- sstable X arrives from a streaming session
- we decide that view updates won't be generated from an sstable X
by the view builder
- new view is created for the table that owns sstable X
- view builder doesn't generate updates from sstable X, even though the table
has accompanying views - which is an inconsistency
This race is fixed by making the view builder wait for all ongoing streams,
just like it does for reads and writes. It's implemented with a phaser.
Tests:
unit (release)
dtest (not merged yet: materialized_views_test.TestMaterializedViews.stream_from_repair_during_build_process_test)
"
* 'add_stream_phasing_2' of https://github.com/psarna/scylla:
repair: add stream phasing to row level repair
streaming: add phasing incoming streams
multishard_writer: add phaser operation parameter
view: wait for stream sessions to finish before view building
table: wait for pending streams on table::stop
database: add pending streams phaser
SSTables loaded into the system via the /upload directory may sometimes
need view updates generated from them (if their table has accompanying
views).
Fixes #4047
In some cases, sstables put in the upload dir should have view updates
generated from them. In order to avoid moving them across directories
(which then involves handling failure paths), upload dir will also be
treated as a valid directory where staging sstables reside.
Regular sstables that are not needed for view updates will be
immediately moved from upload/ dir as before.
Previous implementation assumes that each SSTable resides directly
in table::datadir directory, while what should actually be used
is directory path from SSTable descriptor.
This patch prevents a regression when adding staging sstables support
for upload/ dir.
Currently registering new view updates is throttled by a semaphore,
which makes sense during stream sessions in order to avoid overloading
the queue. Still, registration also occurs during initialization,
where it makes little sense to wait on a semaphore, since view update
generator might not have started at all yet.
"
Clean up various cases related to updating metadata stats and encoding
stats, in preparation for 64-bit gc_clock (#3353).
Fixes #4026
Fixes #4033
Fixes #4035
Fixes #4041
Refs #3353
"
* 'projects/encoding-stats-fixes/v6' of https://github.com/bhalevy/scylla:
sstables: remove duplicated code in data_consume_rows_context CELL_VALUE_BYTES
sstables: mc: use api::timestamp_type in write_liveness_info
sstables: mc: sstable_write encoding_stats are const
mp_row_consumer_k_l::consume_deleted_cell rename ttl param to local_deletion_time
memtable: don't use encoding_stats epochs as default
memtable: mc: update min_ttl encoding stats for dead row marker
memtable: mc: add comment regarding updating encoding stats of collection tombstones
sstables: metadata_collector: add update tombstone stats
sstables: assert that delete_time is not live when updating stats
sstables: move update_deletion_time_stats to metadata collector
sstables: metadata_collector: introduce update_local_deletion_time_and_tombstone_histogram
sstables: mc: write_liveness_info and write_collection should update tombstone_histogram
sstables: update_local_deletion_time for row marker deletion_time and expiration
Presence checker is constructed and destroyed in the standard
allocator context, but the presence check was invoked in the LSA
context. If the presence checker allocates and caches some managed
objects, there will be alloc-dealloc mismatch.
That is the case with LeveledCompactionStrategy, which uses
incremental_selector.
Fix by invoking the presence check in the standard allocator context.
Fixes #4063.
Message-Id: <1547547700-16599-1-git-send-email-tgrabiec@scylladb.com>
In order to allow other services to wait for incoming streams
to finish, row level repair uses stream phasing when creating
new sstables from incoming data.
Fixes scylladb#4032
During streaming, there's a race between streamed sstables
and view creation, which might result in some sstables not being
used to generate view updates, even though they should.
That happens when the decision about view update path for a table
is done before view creation, but after already receiving some sstables
via streaming. These will not be used in view building even though
they should.
Hence, a phaser is used to make the view builder wait for all ongoing
stream sessions for a table to finish before proceeding with build steps.
Refs #4032
"
This mini-series adds counters for the inactive reads registered in the
reader concurrency semaphore.
"
* 'reader-concurrency-semaphore-counters/v6' of https://github.com/denesb/scylla:
tests/querier_cache: use stats to get the no. of inactive reads
reader_concurrency_semaphore: add counters for inactive reads
The existence of LZ4_compress_default is a property of the lz4
library, not seastar.
With this patch scylla does its own configure check instead of
depending on the one done by seastar.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190114013737.5395-1-espindola@scylladb.com>
"
Currently the logic is scattered between types.*, cql3_types.* and
sstables/mc/writer.cc.
This patchset places all the logic in types.* and makes sure we
correctly add "frozen<...>" and "FrozenType(...)" to the names of
tuples and UDTs.
Fixes #4087
Tests: unit(release)
"
* 'haaawk/4087_v1' of github.com:scylladb/seastar-dev:
Add comment explaining tuple type name creation
Add "FrozenType(...)" to UDT name only when it's frozen
Move "FrozenType(...)" addition to UDT name to user_type_impl
Add "frozen<...>" to tuple CQL name only when it's frozen
Move "frozen<...>" addition to tuple CQL name to tuple_type_impl
Merge make_cql3_tuple_type into tuple_type_impl::as_cql3_type
Add "frozen<...>" to UDT CQL name only when it's frozen
Move "frozen<...>" addition to UDT CQL name to user_type_impl
Why default to an artificial minimum when you can do better
with zero effort? Track the actual minima in the memtable instead.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Update min ttl with expired_liveness_ttl (although its value of max int32
is not expected to affect the minimum).
Fixes #4041
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When the row flag has_complex_deletion is set, some collection columns may have
deletion tombstones and some may not. We don't strictly need to update the
stats in that case, as it will not affect the encoding_stats anyway.
Fixes #4035
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that continuous_data_consumer::position() is meaningful (since
36dd660), we can use our position in the stream to calculate offsets
instead of duplicating state machine in offset calculations.
The value of position() - data.size() always holds the current offset
in the stream.
Message-Id: <1547219234-21182-1-git-send-email-tgrabiec@scylladb.com>
Put package names one per line. This makes it easier to review changes,
and to backport changes to this file. No content changes.
Message-Id: <20190112091024.21878-1-avi@scylladb.com>
Do not allow write access to the sstable list via this accessor. Luckily
there are no violations, and now we enforce it.
Message-Id: <20190111151049.16953-1-avi@scylladb.com>
To keep format compatibility we never wrap the tuple type name
in "org.apache.cassandra.db.marshal.FrozenType(...)",
even when the tuple is frozen.
This patch adds a comment in tuple_type_impl::make_name that
explains the situation.
For more details see #4087
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
At the moment Scylla supports only frozen UDTs but
the code should be able to handle non-frozen UDTs as well.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
At the moment Scylla supports only frozen tuples but
the code should be able to handle non-frozen tuples as well.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
At the moment Scylla supports only frozen UDTs but
the code should be able to handle non-frozen UDTs as well.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Race condition takes place when one of the sstables selected by snapshot
is deleted by compaction. Snapshot fails because it tries to link a
sstable that was previously unlinked by compaction's sstable deletion.
Fixes #4051.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190110194048.26051-1-raphaelsc@scylladb.com>
sstable_file_io_extensions() returns an array of pointers to extensions,
but do_for_each() may defer and the array will be destroyed. The patch
keeps it alive until do_for_each() completes.
Message-Id: <20190110125656.GC3172@scylladb.com>
make_sstable_reader needs to deal with single-key and scanning reads, and
with restricting and non-restricting (in terms of read concurrency) readers.
Right now it does this combinatorically - there are separate cases for
restricting single-key reads, non-restricting single-key reads, restricting
scans, and non-restricting scans.
This makes further changes more complicated, so separate the two concepts.
The patch splits the code into two stages; the first selects between a single-key
and a scan, and the second selects between a restricting and non-restricting read.
This slightly pessimizes non-restricting reads (a mutation_source is created and
immediately destroyed), but that's not the common case.
Tests: unit(release)
Message-Id: <20190109175804.9352-1-avi@scylladb.com>
compare_row_marker_for_merge compares deletion_time also for row markers
that have missing timestamps. This happened to succeed due to implicit
initialization to 0. However, we prefer the initialization to be explicit
and allow calling row_marker::deletion_time() in all states.
Fixes #4068
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190110102949.17896-1-bhalevy@scylladb.com>
Serialization header stores column types for all
columns in sstable. If any of them is a UDT then it
has to be wrapped into
"org.apache.cassandra.db.marshal.FrozenType(...)".
This patch adds a test case to verify that.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Serialization header stores type names of all
columns in a table. Including partition key columns,
clustering key columns, static columns and regular columns.
If one of those types is a user defined type then we need to
wrap its name into
"org.apache.cassandra.db.marshal.FrozenType(...)".
Fixes #4073
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This renames some variables and functions to make it clear that they
refer to partitions and not rows.
Old versions of sstablemetadata used to refer to a row histogram, but
current versions now mention a partition histogram instead.
This patch doesn't change the exposed API names.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181229223311.4184-2-espindola@scylladb.com>
Both build_rpm.sh and build_deb.sh fail at the beginning of the script
when the relocatable package does not exist; we need to prevent this and
show a user-friendly message.
Fixes #4071
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190109094353.16690-1-syuu@scylladb.com>
Compaction manager holds a reference to all sstables being cleaned until the
very end, and that becomes a problem because the disk space of cleaned
sstables cannot be reclaimed while the respective file descriptors remain open.
Fixes #3735.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181221000941.15024-1-raphaelsc@scylladb.com>
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.
Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.
Scylla now requires GCC 8 to compile.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
test_fast_forwarding_across_partitions_to_empty_range uses an uninitialized
string to populate an sstable, but this can be invalid utf-8 so that sstable
cannot be sstabledumped.
Make it valid by using make_random_string().
Fixes #4040.
Message-Id: <20190107193240.14409-1-avi@scylladb.com>
In c++17 there are standard ways of requesting aligned memory, so
seastar doesn't need to provide one.
This patch is in preparation for removing with_alignment from seastar.
Tests: unit (debug)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190107191019.22295-1-espindola@scylladb.com>
The current timeout is way too small for debug builds. Currently
jenkins runs avoid the problem by increasing the timeout by 100x. This
patch increases it by 10x, which seems to be sufficient to run the
tests on most desktop machines.
Message-Id: <20190107191413.22531-1-espindola@scylladb.com>
A non-privileged user may not belong to the "wheel" group; for example,
Debian variants use the "sudo" group instead of "wheel".
To make sudo work in all environments we should allow sudo for
"ALL" instead of "wheel".
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20190107173410.23140-1-syuu@scylladb.com>
When building something other than Scylla (like scylla-tools-java or scylla-jmx)
it is convenient to run it from some other directory. To do that, allow running
dbuild from any directory (so we locate tools/toolchain/image relative to the
dbuild script rather than use a fixed path) and mount the current directory
since it's likely the user will want to access files there.
Message-Id: <20190107165824.25164-1-avi@scylladb.com>
Start a new document with an overview of isolation in Scylla, i.e.,
scheduling groups, I/O priority classes, controllers, etc.
As all documents in docs/, this is a document for developers (not users!)
who need to understand how isolation between different pieces of Scylla
(e.g., queries, compaction, repair, etc.) works, which scheduling groups
and I/O classes we have and why, etc.
The document is still very partial and includes a lot of TODOs on
places where the explanation needs to be expanded. In particular it
needs an accurate explanation (and not just a name) of what kind of
work is done under each of the groups and classes, and an explanation
of how we set up RPC to use which scheduling groups for the code it
executes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190103183232.21348-1-nyh@scylladb.com>
Now that we added stats for the inactive reads, the tests don't need
the `reader_concurrency_semaphore::inactive_reads()` method, instead
they can rely on the stats to check the number of inactive reads.
While we keep ordinary hints in a directory parallel to the data directory,
we decided to keep the materialized view hints in a subdirectory of the data
directory, named "view_pending_updates". But during boot, we expect all
subdirectories of data/ to be keyspace names, and when we notice this one,
we print a warning:
WARN: database - Skipping undefined keyspace: view_pending_updates
This spurious warning annoyed users. But moreover, we could have bigger
problems if the user actually tries to create a keyspace with that name.
So in this patch, we move the view hints to a separate top-level directory,
which defaults to /var/lib/scylla/view_hints, but as usual can be configured.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190107142257.16342-1-nyh@scylladb.com>
"Add features that are useful for continuous integration pipelines (and
also ordinary developers):
- sudo support, with and without a tty, as our packaging scripts require it
- install ccache package to allow reducing incremental build times
- dependencies needed to build scylla-jmx and scylla-tools-java"
* tag 'toolchain-ci/v1' of https://github.com/avikivity/scylla:
tools: toolchain: update image for ant, maven, ccache, sudo
tools: toolchain: dbuild: pass-through supplementary groups
tools: toolchain: defeat PAM
tools: toolchain: improve sudo support
tools: toolchain: break long line in dbuild
tools: toolchain: prepare sudoers file
tools: toolchain: install ccache
install-dependencies.sh: add maven and ant
"This patchset reduces inclusions of database.hh, particularly in header
files. It reduces the number of objects depending on database.hh from 166
to 116.
Tests: unit(release), playing a little with tracing"
* tag 'database.hh/v1' of https://github.com/avikivity/scylla:
streaming: stream_session: remove include of db/view/view_update_from_staging_generator.hh
sstables: writer.hh: add some forward declarations
table_helper: remove database.hh include
table_helper: de-inline insert() and setup_keyspace()
table_helper: de-template setup_keyspace()
table_helper: simplify template body of table_helper::insert()
schema_tables: remove #include of database.hh
cql_type_parser: remove dependency on user_types_metadata
thrift: add missing include of sleep.hh
cql3: ks_prop_defs: remove #include "database.hh"
To be able to verify the golden version with sstabledump.
These files were generated by running sstable_3_x_test and keeping its
generated output files.
Refs #4043
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190103112511.23488-2-bhalevy@scylladb.com>
Various small improvements to docs/logging.md:
1. Describe the options to log to stdout or syslog and their defaults.
2. Mention the possibility of using nodetool instead of REST API.
3. Additional small tweaks to formatting.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190106111851.26700-1-nyh@scylladb.com>
Add a new document about logging in Scylla, and how to change the log levels
when running Scylla and during the run.
It needs more developer-oriented information (e.g., how to create new logger
subsystems in the code) but I think it's a good start.
Some of the text is based on Glauber's writeup for the Scylla website on
changing log levels at runtime.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190106103606.26032-1-nyh@scylladb.com>
This header, which is easily replaced with a forward declaration,
introduces a dependency on database.hh everywhere. Remove it and scatter
includes of database.hh in source files that really need it.
Move most of the body into a non-template overload to reduce dependencies
in the header (and template bloat). The function is not on any fast path,
and noncopyable_function will likely not even allocate anything.
A default parameter of type T (or lw_shared_ptr<T>) requires that T be
defined. Remove the dependency by redefining the default parameter
as an overload, for T = user_types_metadata.
The workload in #3844 has these characteristics:
- very small data set size (a few gigabytes per shard)
- large working set size (all the data, enough for high cache miss rate)
- high overwrite rate (so a compaction results in 12X data reduction)
As a result, the compaction backlog controller assigns very few shares to
compaction (low data set size -> low backlog), so compaction proceeds very slowly.
Meanwhile, we have tons of cache misses, and each cache miss needs to read from a
large number of sstables (since compaction isn't progressing). The end result is
a high read amplification, and in this test, timeouts.
While we could declare that the scenario is very artificial, there are other
real-world scenarios that could trigger it. Consider a 100% write load
(population phase) followed by 100% read. Towards the end of the last compaction,
the backlog will drop more and more until compaction slows to a crawl, and until
it completes, all the data (for that compaction) will have to be read from its
input sstables, resulting in read amplification.
We should probably have read amplification affect the backlog, but for now the
simpler solution is to increase the minimum shares to 50 so that compaction
always makes forward progress. This will result in higher-than-needed compaction
bandwidth in some low write rate scenarios so we will see fluctuations in request
rate (what the controller was designed to avoid), but these fluctuations will be
limited to 5%.
Since the base class backlog_controller has a fixed (0, 0) point, remove it
and add it to derived classes (setting it to (0, 50) for compaction).
Fixes #3844 (or at least improves it).
Message-Id: <20181231162710.29410-1-avi@scylladb.com>
Prevent PAM from enforcing security and preventing sudo from working. This is
done by replacing the default configuration (designed for workstations) with
one that uses pam_permit for everything.
Add tools needed to build scylla-jmx and scylla-tools-java. While
not requirements of this repository, it's nicer if a single setup
can be used to build and run everything.
We also install pystache as it's used by packaging scripts.
The current implementation breaks the invariant that
_size_bytes = reduce(_fragments, &temporary_buffer::size)
In particular, this breaks algorithms that check the individual
segment size.
Correctly implement remove_suffix() by destroying superfluous
temporary_buffer's and by trimming the last one, if needed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190103133523.34937-1-duarte@scylladb.com>
Simplify the fix for memory-based eviction introduced by 918d255, so
there is no need to massage the counters.
Also add a check to `test_memory_based_cache_eviction` which checks for
the bug fixed. While at it also add a check to
`test_time_based_cache_eviction` for the fix to time based eviction
(e5a0ea3).
Tests: tests/querier_cache:debug
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <c89e2788a88c2a701a2c39f377328e77ac01e3ef.1546515465.git.bdenes@scylladb.com>
"
This series adds staging SSTables support to row level repair.
It was introduced for streaming sessions before, but since row level
repair doesn't leverage sessions at all, it's added separately.
Tests:
unit (release)
dtest (repair_additional_test.py:RepairAdditionalTest,
excluding repair_abort_test, which fails for me locally on master)
"
* 'add_staging_sstables_generation_to_row_level_repair_2' of https://github.com/psarna/scylla:
repair: add staging sstables support to row level repair
main,repair: add params to row level repair init
streaming,view: move view update checks to separate file
When a node is bootstrapping, it will receive data from other nodes
via streaming, including materialized views. Regardless of whether these
views are built on other nodes or not, building them on newly
bootstrapped nodes has no effect - updates were either already streamed
completely (if view building has finished) or will be propagated
via view building, if the process is still ongoing.
So, marking all views as 'built' for the bootstrapped node prevents it
from spawning superfluous view building processes.
Fixes #4001
Message-Id: <fd53692c38d944122d1b1013fdb0aedf517fa409.1546498861.git.sarna@scylladb.com>
Currently queriers evicted due to their TTL expiring are not
unregistered from the `reader_concurrency_semaphore`. This can cause a
use-after-free when the semaphore tries to evict the same querier at
some later point in time, as the querier entry it has a pointer to is
now invalid.
Fix by unregistering the querier from the semaphore before destroying
the entry.
Refs: #4018
Refs: #4031
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4adfd09f5af8a12d73c29d59407a789324cd3d01.1546504034.git.bdenes@scylladb.com>
Checking if view update path should be used for sstables
is going to be reused in row level repair code,
so relevant functions are moved to a separate header.
In insert_querier(), we may evict older queriers to make room for the new one.
However, we forgot to unregister the evicted queriers from
reader_concurrency_semaphore. As a result, when reader_concurrency_semaphore
eventually wanted to evict something, it saw an inactive_read_handle that was
not connected to a querier_cache::entry, and crashed on use-after-free.
Fix by evicting through the inactive_read_handle associated with the querier
to be evicted. This removes traces of the querier from both
reader_concurrency_semaphore and querier_cache. We also have to massage the
statistics since querier_inactive_read::evict() updates different counters.
Fixes #4018.
Tests: unit(release)
Reviewed-by: Botond Denes <bdenes@scylladb.com>
Message-Id: <20190102175023.26093-1-avi@scylladb.com>
The check in consume_range_tombstone was too late. Before getting to
it, we would fail an assert when calling to_bound_kind.
This moves the check earlier and adds a testcase.
Tests: unit (release)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This moves the predicate functions to the start of the file, renames
is_in_bound_kind to is_bound_kind for consistency with to_bound_kind
and defines all predicates in a similar fashion.
It also uses the predicates to reduce code duplication.
Tests: unit (release)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
It is useful to adjust the command line when running the docker image,
for example to attach a data volume or a ccache directory. Add a mechanism
to do that.
Message-Id: <20181228163306.19439-1-avi@scylladb.com>
"
Instead of allocating a contiguous temporary_buffer when reading
mutations during commitlog - or hint - replay, use fragmented
buffers instead.
Refs #4020
"
* 'commitlog/fragmented-read/v1' of https://github.com/duarten/scylla:
db/commitlog: Use fragmented buffers to read entries
db/commitlog: Implement skip in terms of input buffer skipping
tests/fragmented_temporary_buffer_test: Add unit test for remove_suffix()
utils/fragmented_temporary_buffer: Add remove_suffix
tests/fragmented_temporary_buffer_test: Add unit test for skip()
utils/fragmented_temporary_buffer: Allow skipping in the input stream
Static class_registries hinder librarification by requiring linking with
all object files (instead of a library from which objects are linked on
demand) and reduce readability by hiding dependencies and by their
horrible syntax. Hide them behind a non-static, non-template tracing
backend registry.
Message-Id: <20181229121000.7885-1-avi@scylladb.com>
This simplifies the code and allows us to get rid of the overload of
advance() taking a temporary_buffer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
distributed_loader is a sizeable fraction of database.cc, so moving it
out reduces compile time and improves readability.
Message-Id: <20181230200926.15074-1-avi@scylladb.com>
lz4 1.8.3 was released with a fix for data corruption during compression. While
the release notes indicate we aren't vulnerable, be cautious and update anyway.
Message-Id: <20181230144716.7238-1-avi@scylladb.com>
"This series contains a couple of fixes to the
view_update_from_staging_generator, the object responsible for
generating view updates from sstables written through streaming.
Fixes #4021"
* 'materialized-views/staging-generator-fixes/v2' of https://github.com/duarten/scylla:
db/view/view_update_from_staging_generator: Break semaphore on stop()
db/view/view_update_from_staging_generator: Restore formatting
db/view/view_update_from_staging_generator: Avoid creating more than one fiber
If view_update_from_staging_generator::maybe_generate_view_updates()
is called before view_update_from_staging_generator::start(), as can
happen in main.cc, then we can potentially create more than one fiber,
which leads to corrupted state and conflicting operations.
To avoid this, use just one fiber and be explicit about notifying it
that more work is needed, by leveraging a condition-variable.
Fixes #4021
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
A sharded<database> is not very useful for accessing data since data is
usually distributed across many nodes, while a sharded<database>
contains only a single node's view. So it is really only used for
accessing replicated metadata, not data. As such only the local shard
is accessed.
Use that to simplify query_processor a little by replacing sharded<database>
with a plain database.
We can probably be more ambitious and make all accesses, data and metadata,
go through storage_proxy, but this is a start.
"
* tag 'qp-unshard-database/v1' of https://github.com/avikivity/scylla:
query_processor: replace sharded<database> with the local shard
commitlog_replayer: don't use query_processor
client_state: change set_keyspace() to accept a single database shard
legacy_schema_migrator: initialize with database reference
system_keyspace is an implementation detail for most of its users, not
part of the interface, as it's only used to store internal data. Therefore,
including it in a header file causes unneeded dependencies.
This patch removes a dependency between views and system_keyspace.hh
by moving view_name and view_build_progress into a separate header file,
and using forward declarations where possible. This allows us to
remove an inclusion of system_keyspace.hh from a header file (the last
one), so that further changes to system_keyspace.hh will cause fewer
recompilations.
Message-Id: <20181228215736.11493-1-avi@scylladb.com>
query_processor uses storage_proxy to access data, and the local
database object to access replicated metadata. While it seems strange
that the database object is not used to access data, it is logical
when you consider that a sharded<database> only contains this node's
data, not the cluster data.
Take advantage of this to replace sharded<database> with a single database
shard.
During normal writes, query processing happens before commitlog, so
logically replaying the commitlog shouldn't need it. And in
fact the dependency on query_processor can be eliminated; all it needs
is the local node's database.
Provide legacy_schema_migrator with a sharded<database> so it doesn't need
to use the one from query_processor. We want to replace query_processor's
sharded<database> with just a local database reference in order to simplify
it, and this is standing in the way.
When streaming, sstables for which we need to generate view updates
are placed in a special staging directory. However, we only need to do
this for tables that actually have views.
Refs #4021
Message-Id: <20181227215412.5632-1-duarte@scylladb.com>
The version of boost in Fedora 29 has a use-after-free bug that is only
exposed when ./test.py is run with the --jenkins flag. To patch it,
use a fixed version from the copr repository scylladb/toolchain.
Message-Id: <20181228150419.29623-1-avi@scylladb.com>
Implementation of the nodetool toppartitions query, which samples the most frequent PKs
in read/write operations over a period of time.
Content:
- data_listener classes: mechanism that interfaces with mutation readers in database and table classes,
- toppartition_query and toppartition_data_listener classes to implement toppartition-specific query (this
interfaces with data_listeners and the REST api),
- REST api for toppartitions query.
Uses Top-k structure for handling stream summary statistics (based on implementation in C*, see #2811).
What's still missing:
- JMX interface to nodetool (interface customization may be required),
- Querying #rows and #bytes (currently, only #partitions is supported).
Fixes #2811
* https://github.com/avikivity/scylla rafie_toppartitions_v7.1:
top_k: whitespace and minor fixes
top_k: map template arguments
top_k: std::list -> chunked_vector
top_k: support for appending top_k results
nodetool toppartitions: refactor table::config constructor
nodetool toppartitions: data listeners
nodetool toppartitions: add data_listeners to database/table
nodetool toppartitions: fully_qualified_cf_name
nodetool toppartitions: Toppartitions query implementation
nodetool toppartitions: Toppartitions query REST API
nodetool toppartitions: nodetool-toppartitions script
A Python script mimicking the nodetool toppartitions utility, utilizing the Scylla REST API.
Examples:
$ ./nodetool-toppartitions --help
usage: nodetool-toppartitions [-h] [-k LIST_SIZE] [-s CAPACITY]
keyspace table duration
Samples database reads and writes and reports the most active partitions in a
specified table
positional arguments:
keyspace Name of keyspace
table Name of column family
duration Query duration in milliseconds
optional arguments:
-h, --help show this help message and exit
-k LIST_SIZE The number of the top partitions to list (default: 10)
-s CAPACITY The capacity of stream summary (default: 256)
$ ./nodetool-toppartitions ks test1 10000
READ
Partition Count
30 2
20 2
10 2
WRITE
Partition Count
30 1
20 1
10 1
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
toppartitions_query installs a toppartitions_data_listener on all database shards, waits for
the designated period, then uninstalls the listeners and collects the top-k read/write partition keys.
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
Add data_listeners member to database.
Adds data_listeners* to table::config, to be used by table methods to invoke listeners.
Install on_read() listener in table::make_reader().
Install on_write() listener in database::apply_in_memory().
Tests: Unit (release)
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
Mechanism that interfaces with mutation readers in database and table classes, to
allow tracking most frequent partition keys in read and write operation.
Basic design is specified in #2811.
Tracking top #rows and #bytes will be supported in the future.
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
Replaced std::list with chunked_vector. Because chunked_vector requires
a noexcept move constructor from its value type, change the bad_boy type
in the unit test not to throw in the move constructor.
Signed-off-by: Rafi Einstein <rafie@scylladb.com>
"
partition_snapshots created in the memtable will keep a reference to
the memtable (as region*) and to memtable::_cleaner. As long as the
reader is alive, the memtable will be kept alive by
partition_snapshot_flat_reader::_container_guard. But after that
nothing prevents it from being destroyed. The snapshot can outlive the
read if mutation_cleaner::merge_and_destroy() defers its destruction
for later. When the read ends after memtable was flushed, the snapshot
will be queued in the cache's cleaner, but internally will reference
memtable's region and cleaner. This will result in a use-after-free
when the snapshot resumes destruction.
The fix is to update the snapshot's region and cleaner references at the
time of queueing to point to the cache's region and cleaner.
When the memtable is destroyed without being moved to cache there is no
problem, because the snapshot would be queued into the memtable's cleaner,
which is drained of all snapshots on destruction.
Introduced in f3da043 (in >= 3.0-rc1)
Fixes #4030.
Tests:
- mvcc_test (debug)
"
* tag 'fix-snapshot-merging-use-after-free-v1.1' of github.com:tgrabiec/scylla:
tests: mvcc: Add test_snapshot_merging_after_container_is_destroyed
tests: mvcc: Introduce mvcc_container::migrate()
tests: mvcc: Make mvcc_partition move-constructible
tests: mvcc: Introduce mvcc_container::make_not_evictable()
tests: mvcc: Allow constructing mvcc_container without a cache_tracker
mutation_cleaner: Migrate partition_snapshots when queueing for background cleanup
mvcc: partition_snapshot: Introduce migrate()
mutation_cleaner: impl: Store a back-reference to the owning mutation_cleaner
Some test cases will need many containers to simulate memtable ->
cache transitions, but there can be only one cache_tracker per shard
due to metrics. Allow constructing a container without a cache_tracker
(and thus non-evictable).
partition_snapshots created in the memtable will keep a reference to
the memtable (as region*) and to memtable::_cleaner. As long as the
reader is alive the memtable will be kept alive by
partition_snapshot_flat_reader::_container_guard. But after that,
nothing prevents it from being destroyed. The snapshot can outlive the
read if mutation_cleaner::merge_and_destroy() defers its destruction
for later. When the read ends after memtable was flushed, the snapshot
will be queued in the cache's cleaner, but internally will reference
memtable's region and cleaner. This will result in a use-after-free
when the snapshot resumes destruction.
The fix is to update the snapshot's region and cleaner references at the
time of queueing to point to the cache's region and cleaner.
When the memtable is destroyed without being moved to cache there is no
problem, because the snapshot would be queued into the memtable's cleaner,
which is drained of all snapshots on destruction.
Introduced in f3da043.
Fixes #4030.
Snapshots which outlive the memtable will need to have their
_region and _cleaner references updated.
The snapshot can be destroyed after the memtable when it is queued in
the mutation_cleaner.
rpc::source cannot be abandoned until EOS is reached, but the current code
does not obey this: if an error code is received, it throws an exception
that aborts the reading loop. Fix it by moving the exception throwing out
of the loop.
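The pattern can be sketched in Python (the real code is Seastar C++; this only illustrates draining the source to EOS before rethrowing, with hypothetical message shapes):

```python
def drain_source(source):
    """Consume messages until end-of-stream; defer error reporting.

    The source must be read all the way to EOS even if an error frame
    arrives, so the exception is raised only after the loop exits.
    """
    error = None
    results = []
    for msg in source:
        if msg.get("error") is not None:
            # Remember the error but keep draining the stream.
            error = msg["error"]
        else:
            results.append(msg["data"])
    if error is not None:
        raise RuntimeError(error)
    return results
```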
Fixes: #4025
Message-Id: <20181227135051.GC29458@scylladb.com>
We saw failure in dtest concurrent_schema_changes_test.py:
TestConcurrentSchemaChanges.changes_while_node_down_test test.
======================================================================
ERROR: changes_while_node_down_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 432, in changes_while_node_down_test
self.make_schema_changes(session, namespace='ns2')
File "/home/asias/src/cloudius-systems/scylla-dtest/concurrent_schema_changes_test.py", line 86, in make_schema_changes
session.execute('USE ks_%s' % namespace)
File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state).result()
File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
raise self._final_exception
ConnectionShutdown: Connection to 127.0.0.1 is closed
The test:
session = self.patient_cql_connection(node2)
self.prepare_for_changes(session, namespace='ns2')
node1.stop()
self.make_schema_changes(session, namespace='ns2') --> ConnectionShutdown exception throws
The problem is that, after receiving the DOWN event, the python
Cassandra driver will call Cluster:on_down which checks if this client
has any connections to the node being shut down. If there are any
connections, the Cluster:on_down handler will exit early, so the session
to the node being shut down will not be removed.
If we shut down the cql server first, the connection count will be zero
and the session will be removed.
Fixes: #4013
Message-Id: <7388f679a7b09ada10afe7e783d7868a58aac6ec.1545634941.git.asias@scylladb.com>
When creating a sstable from which to generate view updates, we held
on to a table reference across defer points. In case there's a
concurrent schema drop, the table object might be destroyed and we
will incur a use-after-free. Solve this by holding on to a shared
pointer and pinning the table object.
Refs #4021
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181227105921.3601-1-duarte@scylladb.com>
On at least one system, using the container's /tmp as provided by docker
results in spurious EINVALs during aio:
INFO 2018-12-27 09:54:08,997 [shard 0] gossip - Feature ROW_LEVEL_REPAIR is enabled
unknown location(0): fatal error: in "test_write_many_range_tombstones": storage_io_error: Storage I/O error: 22: Invalid argument
seastar/tests/test-utils.cc(40): last checkpoint
The setup is overlayfs over xfs.
To avoid this problem, pass through the host's /tmp to the container.
Using --tmpfs would be better, but it's not possible to guess a good size
as the amount of temporary space needed depends on build concurrency.
Message-Id: <20181227101345.11794-1-avi@scylladb.com>
Image fedora-29-20181219 was broken due to the following chain of events:
- we install gnutls, which currently is at version 3.6.5
- gnutls 3.6.5 introduced a dependency on nettle 3.4.1
- the gnutls rpm does not include a version requirement on nettle,
so an already-installed nettle will not be upgraded when gnutls is
installed
- the fedora:29 image which we use as a baseline has nettle installed
- docker does not pull the latest tag in FROM statements during
"docker build"
- my build machine already had a fedora:29 image, with nettle 3.4
installed (the repository's image has 3.4.1, but docker doesn't
automatically pull if an image with the required tag exists)
As a result, the image ended up having gnutls 3.6.5 and nettle 3.4, which
are incompatible.
To fix, update all packages after installation, to get a self-consistent
package set even if dependencies are not perfect, and regenerate
the image.
Message-Id: <20181226135711.24074-1-avi@scylladb.com>
The '-t' flag to 'docker run' passes the tty from the caller environment
to the container, which is nice for interactive jobs, but fails if there
is no tty, such as in a continuous integration environment.
Given that, the '-i' flag doesn't make sense either as there isn't any
input to pass.
Remove both, and replace with --sig-proxy=true which allows SIGTERM to
terminate the container instead of leaving it alive. This reduces the
chances of the build stopping but leaving random containers around.
Message-Id: <20181222105837.22547-1-avi@scylladb.com>
The function pending_collection is only called when
cdef->is_multi_cell() is true, so the throw is dead.
This patch converts it to an assert.
Message-Id: <20181207022119.38387-1-espindola@scylladb.com>
"
=== How the partition level repair works
- The repair master decides which ranges to work on.
- The repair master splits the ranges to sub ranges which contains around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the 100 partitions.
- If the checksum matches, the data in this sub range is synced.
- If the checksum mismatches, repair master fetches the data from all the peers
and sends back the merged data to peers.
=== Major problems with partition level repair
- A mismatch of a single row in any of the 100 partitions causes 100
partitions to be transferred. A single partition can be very large. Not to
mention the size of 100 partitions.
- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
same data twice
=== Row level repair
Row level checksum and synchronization: detect row level mismatch and transfer
only the mismatch
=== How the row level repair works
- To solve the problem of reading data twice
Read the data only once for both checksum and synchronization between nodes.
We work on a small range which contains only a few megabytes of rows.
We read all the rows within the small range into memory, find the
mismatches, and send the mismatched rows between peers.
We need to find a sync boundary among the nodes, such that the range up to
it contains only N bytes of rows.
- To solve the problem of sending unnecessary data.
We need to find the mismatched rows between nodes and only send the delta.
This is the set reconciliation problem, which is a common problem in
distributed systems.
For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = { row2, row3}
Node3 has set3 = {row1, row2, row4}
To repair:
Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
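The three-node example above can be checked with plain set arithmetic (a sketch only; the actual repair exchanges row hashes rather than whole rows):

```python
set1 = {"row1", "row2", "row3"}   # Node1 (repair master)
set2 = {"row2", "row3"}           # Node2
set3 = {"row1", "row2", "row4"}   # Node3

# The union is what every node should end up with after repair.
union = set1 | set2 | set3

# Step B: the master fetches what it is missing from each peer.
fetch_from_node2 = set2 - set1    # empty
fetch_from_node3 = set3 - set1    # {"row4"}

# Step C: the master sends each peer the rows that peer lacks.
send_to_node2 = union - set2      # {"row1", "row4"}
send_to_node3 = union - set3      # {"row3"}
```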
=== How to implement repair with set reconciliation
- Step A: Negotiate sync boundary
class repair_sync_boundary {
    dht::decorated_key pk;
    position_in_partition position;
};
Reads rows from disk into row buffers until the size is larger than N
bytes. Return the repair_sync_boundary of the last mutation_fragment we
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary.
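Step A can be sketched as taking the minimum of the per-node boundaries (a simplification with hypothetical names; real boundaries compare tokens and positions with schema-aware comparators):

```python
# Each node reads rows until its buffer exceeds N bytes and reports the
# (token, position-in-partition) of the last fragment it read. The
# smallest reported boundary is safe for all nodes, since every node
# has read at least up to that point.
def negotiate_sync_boundary(per_node_boundaries):
    return min(per_node_boundaries)

boundaries = [(17, 3), (12, 9), (12, 4)]  # one boundary per node
current_sync_boundary = negotiate_sync_boundary(boundaries)  # (12, 4)
```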
- Step B: Get missing rows from peer nodes so that repair master contains all the rows
Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
data is synced, goto Step A. If not, request the full hashes from peers.
At this point, the repair master knows exactly what rows are missing. Request the
missing rows from peer nodes.
Now, local node contains all the rows.
- Step C: Send missing rows to the peer nodes
Since local node also knows what peer nodes own, it sends the missing rows to
the peer nodes.
=== What the RPC API looks like
- repair_range_start()
Step A:
- request_sync_boundary()
Step B:
- request_combined_row_hashes()
- request_full_row_hashes()
- request_row_diff()
Step C:
- send_row_diff()
- repair_range_stop()
=== Performance evaluation
We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instances. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows into each of the 3 nodes. Each node has 241 GiB of data.
We tested 3 cases below.
1) 0% synced: one of the nodes has zero data. The other two nodes have 1 billion identical rows.
Time to repair:
old = 87 min
new = 70 min (rebuild took 50 minutes)
improvement = 19.54%
2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
old = 43 min
new = 24 min
improvement = 44.18%
3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.
Time to repair:
old: 211 min
new: 44 min
improvement: 79.15%
Bytes sent on wire for repair:
old: tx = 162 GiB, rx = 90 GiB
new: tx = 1.15 GiB, rx = 0.57 GiB
improvement: tx = 99.29%, rx = 99.36%
It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.
In this test case, the repair master needs to receive 2 million rows and
sends 4 million rows. Here are the details: Each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
The repair master sends 1 million rows from the repair master and 1 million
rows received from repair slave 1 to repair slave 2. The repair master
sends 1 million rows from the repair master and 1 million rows received
from repair slave 2 to repair slave 1.
In the result, we saw the rows on wire were as expected.
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
Fixes: #3033
Tests: dtests/repair_additional_test.py
"
* 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits)
repair: Enable row level repair
repair: Add row_level_repair
repair: Add docs for row level repair
repair: Add repair_init_messaging_service_handler
repair: Add repair_meta
repair: Add repair_writer
repair: Add repair_reader
repair: Add repair_row
repair: Add fragment_hasher
repair: Add decorated_key_with_hash
repair: Add get_random_seed
repair: Add get_common_diff_detect_algorithm
repair: Add shard_config
repair: Add suportted_diff_detect_algorithms
repair: Add repair_stats to repair_info
repair: Introduce repair_stats
flat_mutation_reader: Add make_generating_reader
storage_service: Introduce ROW_LEVEL_REPAIR feature
messaging_service: Add RPC verbs for row level repair
repair: Export the repair logger
...
We had an issue building the offline installer on RHEL because of repository
differences.
This fix enables building the offline installer on both CentOS and RHEL.
It also introduces --releasever <ver>, to build the offline installer for a
specific minor version of CentOS or RHEL.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181212032129.29515-1-syuu@scylladb.com>
Sometimes one wants to just compile all the source files in the
project, because, for example, one just moved code or files around and
there is no need to link and run anything, just check that everything
still compiles.
Since linking takes up a considerable amount of time it is worthwhile to
have a specific target that caters for such needs.
This patch introduces a ${mode}-objects target for each mode (e.g.
release-objects) that only runs the compilation step for each source
file but does not link anything.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <eaad329bf22dfaa3deff43344f3e65916e2c8aaf.1545045775.git.bdenes@scylladb.com>
In one test the types in the schema don't match the types in the
statistics file. In another a column is missing.
The patch also updates the exceptions to have more human readable
messages.
Tests: unit (release)
Part of issue #3960.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181219233046.74229-1-espindola@scylladb.com>
Validate an ascii string by ORing all its bytes together and checking that
the high bit (bit 7) is 0. Compared with the original std::any_of(), which
checks the string byte by byte, this new approach validates the input 8
bytes at a time, in two independent streams. Performance is much higher
for normal cases, though slightly slower when the string is very short.
See the table below.
Speed(MB/s) of ascii string validation
+---------------+-------------+---------+
| String length | std::any_of | u64 x 2 |
+---------------+-------------+---------+
| 9 bytes | 1691 | 1635 |
+---------------+-------------+---------+
| 31 bytes | 2923 | 3181 |
+---------------+-------------+---------+
| 129 bytes | 3377 | 15110 |
+---------------+-------------+---------+
| 1039 bytes | 3357 | 31815 |
+---------------+-------------+---------+
| 16385 bytes | 3448 | 47983 |
+---------------+-------------+---------+
| 1048576 bytes | 3394 | 31391 |
+---------------+-------------+---------+
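A Python sketch of the same idea (the committed code is C++ reading u64 words; this is only an illustration):

```python
import struct

def is_ascii(buf: bytes) -> bool:
    """OR the input together and test the high bit of every byte.

    Two independent 8-byte accumulators let a superscalar CPU run the
    ORs in parallel; the tail shorter than 16 bytes is ORed bytewise.
    """
    acc1 = acc2 = 0
    n = len(buf) - len(buf) % 16
    for i in range(0, n, 16):
        acc1 |= struct.unpack_from("<Q", buf, i)[0]
        acc2 |= struct.unpack_from("<Q", buf, i + 8)[0]
    tail = 0
    for b in buf[n:]:
        tail |= b
    # A byte is ASCII iff its most significant bit (0x80) is clear.
    return (acc1 | acc2) & 0x8080808080808080 == 0 and tail & 0x80 == 0
```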
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Message-Id: <1544669646-31881-1-git-send-email-yibo.cai@arm.com>
"
Some files use db/config.hh just to access extensions. Reduce dependencies
on this global and volatile file by providing another path to access extensions.
Tests: unit(release)
"
* tag 'unconfig-2/v1' of https://github.com/avikivity/scylla:
hints: reduce dependencies on db/config.hh
commitlog: reduce dependencies on db/config.hh
cql3: reduce dependencies on db/config.hh
database: provide accessor to db::extensions
Rather than forcing callers to go through get_config(), provide a
direct accessor. This reduces dependencies on config.hh, and will
allow separation of extensions from configuration.
When the next pending fragments are after the start of the new range,
we know there is no need to skip.
Caught by perf_fast_forward --datasets large-part-ds3 \
--run-tests=large-partition-slicing
Refs #3984
Message-Id: <1545308006-16389-1-git-send-email-tgrabiec@scylladb.com>
Currently, if something throws in the mutation sending loop while streaming,
the sink is not closed. Also, while close() is running, the code does not
hold onto the sink object. close() is async, so the sink should be kept
alive until it completes. The patch uses do_with() to hold onto the sink
while close() is running, and runs close() on the error path too.
Fixes #4004.
Message-Id: <20181220155931.GL3075@scylladb.com>
Mutation readers allow fast-forwarding the ranges from which the data is
being read. The main user of this feature is cache which, when reading
from the underlying reader, may want to skip some data it already has.
Unsurprisingly, this adds more complexity to the implementation of the
readers and more edge cases the developers need to take care of.
While most of the readers were at least to some extent checked in this
area, those tests were usually quite isolated (e.g. one test doing
inter-partition fast-forwarding, another doing intra-partition
fast-forwarding) and as a consequence didn't cover many corner cases.
This patch adds a generic test for fast-forwarding and slicing that
covers more complicated scenarios when those operations are combined.
Needless to say, that did uncover some problems, but fortunately none of
them is user-visible.
Fixes #3963.
Fixes #3997.
Tests: unit(release, debug)
* https://github.com/pdziepak/scylla.git test-fast-forwarding/v4.1:
tests/flat_mutation_reader_assertions: accumulate received tombstones
tests/flat_mutation_reader_assertions: add more test messages
tests/flat_mutation_reader_assertions: relax has_monotonic_positions()
check
tests/mutation_readers: do not ignore streamed_mutation::forwarding
Revert "mutation_source_test: add option to skip intra-partition
fast-forwarding tests"
memtable: it is not a single partition read if partition
fast-forwarding is enabled
sstables: add more tracing in mp_row_consumer_m
row_cache: use make_forwardable() to implement
streamed_mutation::forwarding
row_cache: read is not single-partition if inter-partition forwarding
is enabled
row_cache: drop support for streamed_mutation::forwarding::yes
entirely
sstables/mp_row_consumer: position_range end bound is exclusive
mutation_fragment_filter: handle streamed_mutation::forwarding::yes
properly
tests/mutation_reader: reduce sleeping time
tests/memtable: fix partition_range use-after-free
tests/mutation: fix partition range use-after-free
flat_mutation_reader_from_mutations: add overload that accepts a slice
and partition range
flat_mutation_reader_from_mutations: fix empty range case
flat_mutation_reader_from_mutations: destroy all remaining mutations
tests/mutation_source: drop dropped column handling test
tests/mutation_source: add test for complex fast_forwarding and
slicing
While we already had tests that verified inter- and intra-partition
fast-forwarding as well as slicing, they had quite limited scope and
didn't combine those operations. The new test is meant to extensively
test these cases.
Schema changes are now covered by for_each_schema_change() function.
Having some additional tests in run_mutation_source_tests() is
problematic when it is used to test intermediate mutation readers
because schema changes may be irrelevant for them, which makes the test
a waste of time (might be a problem in debug mode) and requires those
intermediate readers to use a more complex underlying reader that supports
schema changes (again, problem in a very slow debug mode).
If the reader is fast-forwarded to another partition range, mutation_ may
be left with some partial mutations. Make sure that those are properly
destroyed.
It is in very bad taste to sleep anywhere in the code. The test should be
fixed to explicitly test various orderings between concurrent
operations, but before that happens let's at least reduce how much
those sleeps slow it down by changing them from milliseconds to
microseconds.
Implementing intra-partition fast-forwarding adds more complexity to
already very-much-not-trivial cache readers and isn't really critical in
any way since it is not used outside of the tests. Let's use the generic
adapter instead of natively implementing it.
Single-partition reader is less expensive than the one that accepts any
range of partitions, but it doesn't support fast-forwarding to another
partition range properly and therefore cannot be used if that option is
enabled.
This reverts commit b36733971b. That commit made
run_mutation_reader_tests() support mutation_sources that do not implement
streamed_mutation::forwarding::yes. This is wrong since mutation_sources
are not allowed to ignore or otherwise not support that mode. Moreover,
there is absolutely no reason for them to do so since there is a
make_forwardable() adapter that can make any mutation_reader a
forwardable one (at the cost of performance, but that's not always
important).
It is wrong to silently ignore streamed_mutation::forwarding option
which completely changes how the reader is supposed to operate. The best
solution is to use make_forwardable() adapter which changes
non-forwardable reader to a forwardable one.
Since 41ede08a1d "mutation_reader: Allow
range tombstones with same position in the fragment stream" mutation
readers emit fragments in non-decreasing order (as opposed to strictly
increasing), has_monotonic_positions() needs to be updated to allow
that.
The current data model employed by mutation readers doesn't have a unique
representation of range tombstones. This complicates testing by making
multiple ways of emitting range tombstones and rows equally valid.
This patch adds an option to verify mutation readers by checking whether
tombstones they emit properly affect the clustered rows regardless of how
exactly the tombstones are emitted. The interface of
flat_mutation_reader_assertions is extended by adding
may_produce_tombstones() that accepts any number of tombstones and
accumulates them. Then, produces_row_with_key() accepts an additional
argument which is the expected timestamp of the range tombstone that
affects that row.
"
Contains several improvements for fast-forwarding and slicing readers. Mainly
for the MC format, but not only:
- Exiting the parser early when going out of the fast-forwarding window [MC-format-only]
- Avoiding reading of the head of the partition when slicing
- Avoiding parsing rows which are going to be skipped [MC-format-only]
"
* 'sstable-mc-optimize-slicing-reads' of github.com:tgrabiec/scylla:
sstables: mc: reader: Skip ignored rows before parsing them
sstables: mc: reader: Call _cells.clear() when row ends rather than when it starts
sstables: mc: mutation_fragment_filter: Take position_in_partition rather than a clustering_row
sstables: mc: reader: Do not call consume_row_marker_and_tombstone() for static rows
sstables: mc: parser: Allow the consumer to skip the whole row
sstables: continuous_data_consumer: Introduce skip()
sstables: continuous_data_consumer: Make position() meaningful inside state_processor::process_state()
sstables: mc: parser: Allocate dynamic_bitset once per read instead of once per row
sstables: reader: Do not read the head of the partition when index can be used
sstables: mc: mutation_fragment_filter: Check the fast-forward window first
sstables: mc: writer: Avoid calling unsigned_vint::serialized_size()
"
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica and we use the base’s backlog of view updates to derive
this delay.
To validate this approach we tested a 3 node Scylla cluster on GCE,
using n1-standard-4 instances with NVMEs. A loader running on a
n1-standard-8 instance ran cassandra-stress with 100 threads. With the
delay function d(x) set to 1s, we see no base write timeouts. With the
delay function as defined in the series, we see that backlogs stabilize
at some (arbitrary) point, as predicted, but this stabilization
co-exists with base write timeouts. However, the system overall behaves
better than the current version, with the 100 view update limit, and
also better than the version without such limit or any backpressure.
More work is necessary to further stabilize the system. Namely, we want
to keep delaying until we see the backlog is decreasing. This will
require us to add more delay beyond the stabilization point, which in
turn should minimize the base write timeouts, and will also minimize the
amount of memory the backlog takes at each base replica.
Design document:
https://docs.google.com/document/d/1J6GeLBvN8_c3SbLVp8YsOXHcLc9nOLlRY7pC6MH3JWo
Fixes #2538
"
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
* 'materialized-views/backpressure/v2' of https://github.com/duarten/scylla: (32 commits)
service/storage_proxy: Release mutation as early as possible
service/storage_proxy: Delay replica writes based on view update backlog
service/storage_proxy: Get the backlog of a particular base replica
service/storage_proxy: Add counters for delayed base writes
main: Start and stop the view_update_backlog_broker
service: Distribute a node's view update backlog
service: Advertise view update backlog over gossip
service/storage_proxy: Send view update backlog from replicas
service/storage_proxy: Prepare to receive replica view update backlog
service/storage_proxy: Expose local view update backlog
tests/view_schema_test: Add simple test for db::view::node_update_backlog
db/view: Introduce node_update_backlog class
db/hints: Initialize current backlog
database: Add counter for current view backlog
database: Expose current memory view update backlog
idl: Add db::view::update_backlog
db/view: Add view_update_backlog
database: Wait on view update semaphore for view building
service/storage_proxy: Use near-infinite timeouts for view updates
database: generate_and_propagate_view_updates no longer needs a timeout
...
When delaying a base write, there is no need to hold on to the
mutation if all replicas have already replied.
We introduce mutation_holder::release_mutation(), which frees the
mutations that are no longer needed during the rest of the delay.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica. We use the base’s backlog of view updates to derive
this delay.
If we achieve CL and the backlogs of all replicas involved were last
seen to be empty, then we won't delay the client's reply. However,
it could be that one of the replicas is actually overloaded, and won't
reply for many new such requests. We'll eventually start applying
backpressure to the client via the background's write queue, but in
the meanwhile we may be dropping view updates. To mitigate this we rely
on the backlog being gossiped periodically.
Fixes #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the view_update_backlog_broker class, which is
responsible for periodically updating the local gossip state with the
current node's view update backlog. It also registers to updates from
other nodes, and updates the local coordinator's view of their view
update backlogs.
We consider the view update backlog received from a peer through the
mutation_done verb to be always fresh, but we consider the one received
through gossip to be fresh only if it has a higher timestamp than what
we currently have recorded.
This is because a node only updates its gossip state periodically, and
also because a node can transitively receive gossip state about a third
node with outdated information.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This lays the groundwork for brokering a node's view update
backlog across the whole cluster. This is needed for when a
coordinator does not contact a given replica for a long time, and uses
a backlog view that is outdated and causes requests to be
unnecessarily delayed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Change the inter-node protocol so we can propagate the view update
backlog from a base replica to the coordinator through the
mutation_done and mutation_failed verbs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In subsequent patches, replicas will reply to the coordinator with
their view update backlog. Before introducing changes to the
messaging_service, prepare the storage_proxy to receive and store
those backlogs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The local view update backlog is the max backlog out of the relative
memory backlog size and the relative hints backlog size.
We leverage the db::view::node_update_backlog class so we can send the
max backlog out of the node's shards.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Expose the base replica's current memory view update backlog, which is
defined in terms of units consumed from the semaphore.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The view update backlog represents the pending view data that a base replica
maintains. It is the maximum of the memory backlog - how much memory pending
view updates are consuming - and the disk backlog - how much view hints are
consuming. The size of a backlog is relative to its maximum size.
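As a rough sketch (hypothetical names, not the actual class), the combined backlog described above is the larger of the two relative sizes:

```cpp
#include <algorithm>
#include <cassert>

// Illustrative sketch: each backlog is expressed as a fraction of its
// maximum size, and the node's view update backlog is the larger of the
// relative memory backlog and the relative disk (hints) backlog.
struct backlog {
    double current;
    double max;
    double relative() const { return max > 0 ? current / max : 0.0; }
};

double view_update_backlog(backlog memory, backlog hints) {
    return std::max(memory.relative(), hints.relative());
}
```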
We will use this class to represent a base replica's view update
backlog at the coordinator.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View building sends view updates synchronously, which has natural
backpressure. However, they
1) Contribute to the load on the view replicas, and;
2) Add memory pressure to the base replica.
They should thus count towards the current view update backlog, and
consume units from the view update concurrency semaphore.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.
This value should be increased further, but that change depends on
Refs #2538
Refs #3826
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We no longer wait on the semaphore and instead over-subscribe it, so
there's no reason to pass a timeout.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We arrive at an overloaded state when we fail to acquire semaphore
units in the base replica. This can mean clients are working in
interactive mode, we fail to throttle them and consequently should
start shedding load. We want to avoid impacting base table
availability by running out of memory, so we could offload the memory
queue to disk by writing the view updates as hints without attempting
to send them. However, the disk is also a limited resource and in
extreme cases we won’t be able to write hints. A tension exists
between forgetting the view updates, thereby opening up a window for
inconsistencies between base and view, or failing the base replica
write. The latter can fail the whole user write, or if the
coordinator was able to achieve CL, can instead cause inconsistencies
between base tables (we wouldn't want to store a hint, because if the
base replica is still overloaded, we would redo the whole dance).
Between the devil and the deep blue sea, we chose to forget view
updates. As a further simplification, we don't even write hints,
assuming that if clients can’t be throttled (as we'll attempt to do in
future patches), it will only be a matter of time before view updates
can’t be offloaded. We also start acquiring the semaphore units using
consume(), which is non-blocking, but allows for underflow of the
available semaphore units. This is okay, and we expect not to underflow
by much, as we stop generating new view updates.
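A minimal sketch of the consume() semantics described above (not Seastar's actual semaphore): consuming never blocks and may drive the available unit count negative, and returning units restores it.

```cpp
#include <cassert>
#include <cstdint>

// Minimal illustrative counter, not Seastar's semaphore: consume() never
// blocks and may drive the count of available units negative
// (over-subscription); signal() returns units.
class units_counter {
    int64_t _available;
public:
    explicit units_counter(int64_t initial) : _available(initial) {}
    void consume(int64_t n) { _available -= n; } // may underflow below zero
    void signal(int64_t n) { _available += n; }
    bool overloaded() const { return _available < 0; }
    int64_t available() const { return _available; }
};
```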
Refs #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Propagate acquired semaphore units to mutate_MV() to allow the
semaphore to be incrementally signalled as view updates are processed
by view replicas.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Stopping a table can happen concurrently with in-flight reads and
writes, which rely on table state; we must therefore prevent the
table's destruction before those operations complete.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The semaphore currently limiting the amount of view updates a given
base replica emits aims to control the load that is imposed on the
cluster, to protect view replicas from being overloaded when there
are bursts of traffic (especially for degenerate cases like an index
with low selectivity).
100 is, however, an arbitrary number. It might allow too much load on
the view replicas, and it might also allow too much memory from the
base shard to be consumed. Conversely, it might allow for too few
updates to be queued in case of a burst, or to absorb updates while a
view replica becomes partitioned.
To deal with the load that is inflicted on the cluster, future patches
will ensure that the rate of base writes obeys the rate at which the
slowest view replica can consume the corresponding view updates.
To protect the current shard from using too much memory for this
queue, we will limit it to 10% of the shard's memory. The goal is to
both protect the shard from being overloaded, but also to allow it to
absorb bursts of writes resulting in large view mutations.
Refs #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Working in terms of frozen_mutations allows us to account more
precisely the memory pending view updates consume at the storage_proxy
layer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This allows an std::move() in its body to work as intended. Also, make
the lambda's argument type explicit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Scylla is at the moment incompatible with the Seastar master branch,
so in order to allow Scylla commits that depend on Seastar patches,
we change the submodule to point to scylla-seastar and use a branch
(master-20181217) to hold these dependent commits.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181217153241.67514-1-duarte@scylladb.com>
The "enable_sstables_mc_format" config item help text wants to remove itself
before release. Since scylla-3.0 did not get enough mc format mileage, we
decided to leave it in, so the notice should be removed.
Fixes #4003.
Message-Id: <20181219082554.23923-1-avi@scylladb.com>
"
Implementation of origin change c000da13563907b99fe220a7c8bde3c1dec74ad5
Modifies network topology calculation, reducing the amount of
maps/sets used by applying the knowledge of how many replicas we
expect/need per dc and sharing endpoint and rack set (since we cannot have
overlaps).
Also includes a transposed origin test to ensure new calculation
matches the old one.
Fixes #2896
"
* 'calle/network_topology' of github.com:scylladb/seastar-dev:
network_topology_test: Add test to verify new algorithm results equal old
network_topology_strategy: Simplify calculate_natural_endpoints
token_metadata: Add "get_location" ip to dc+rack accessor
sequenced_set: Add "insert" method, following std::set semantics
Currently filtering happens inside consume_row_end() after the whole
row is parsed. It's much faster to skip without parsing.
This patch moves filtering and range tombstones splitting to
consume_row_start().
_stored_row is no longer needed because in case the filter returns
store_and_finish, the consumer exits with retry_later, and the parser
will call consume_row_start() again when resumed.
Tests:
./build/release/tests/perf/perf_fast_forward_g \
--sstable-format=mc \
--datasets large-part-ds1 \
--run-tests=large-partition-skips
Before:
read skip time (s) frags frag/s mad f/s max f/s min f/s aio (KiB)
1 4096 1.085142 1953 1800 32 1803 1720 4990 159604
After:
read skip time (s) frags frag/s mad f/s max f/s min f/s aio (KiB)
1 4096 0.694560 1953 2812 11 2813 2684 4986 159588
mp_row_consumer_m::consume_row_marker_and_tombstone() is called for
both clustering and static rows, but it dereferences and modifies
_in_progress_row, which is only set when inside a clustering row.
Fixes #3999.
Will allow state_processor to know its position in the
stream.
Currently position() is meaningless inside process_state() because in
some cases it points to the position after the buffer and in some
cases before it. This patch standardizes on the former. This is more
useful than the latter because process_state() trims from the front of
the buffer as it consumes, so the position inside the stream can be
obtained by subtracting the remaining buffer size from position(),
without introducing any new variables.
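The arithmetic described above, in a trivial illustrative form: with position() standardized to point just past the current buffer, the position inside the stream is the buffer-end position minus what is still unconsumed.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative helper (not the actual parser API): recover the stream
// position of the consumer from the standardized position() value.
std::size_t stream_position(std::size_t position_after_buffer,
                            std::size_t remaining_in_buffer) {
    return position_after_buffer - remaining_in_buffer;
}
```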
The size of the bitset is the same for a given row kind across the
sstable, so we can allocate it once.
_columns_selector is moved into the row_schema structure, of which we
have one for each row kind, set up in the constructor.
read_partition() was always called through read_next_partition(), even
if we're at the beginning of the read. read_next_partition() is
supposed to skip to the next partition. It still works when we're
positioned before a partition: it doesn't advance the consumer, but it
clears _index_in_current_partition, because it (correctly) assumes it
corresponds to the partition we're about to leave, not the one we're
about to enter.
This means that index lookups we did in the read initializer will be
disregarded when reading starts, and we'll always start by reading
partition data from the data file. This is suboptimal for reads which
are slicing a large partition and don't need to read the front of the
partition.
Regression introduced in 4b9a34a854.
The fix is to call read_partition() directly when we're positioned at
the beginning of the partition. For that purpose a new flag was
introduced.
test_no_index_reads_when_rows_fall_into_range_boundaries has to be
relaxed, because it assumed that slicing reads will read the head of
the partition.
Refs #3984
Fixes #3992
Tested using:
./build/release/tests/perf/perf_fast_forward_g \
--sstable-format=mc \
--datasets large-part-ds1 \
--run-tests=large-partition-slicing-clustering-keys
Before (focus on aio):
offset read time (s) frags frag/s mad f/s max f/s min f/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
4000000 1 0.001378 1 726 5 736 102 6 200 4 2 0 1 1 0 0 0 65.8%
After:
offset read time (s) frags frag/s mad f/s max f/s min f/s aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
4000000 1 0.001290 1 775 6 788 716 2 136 2 0 0 1 1 0 0 0 69.1%
Otherwise the parser will keep consuming and dropping fragments
needlessly, rather than giving the user a chance to consume
end-of-stream condition, and maybe skip again.
Refs #3984
Rather than adding serialized_size() to the body size before
serializing the field, we can serialize the field to _tmp_bufs at the
beginning and have the body size automatically account for it.
"
Recently some additional issues were discovered related to recent
changes to the way inactive readers are evicted and making shard readers
evictable.
One such issue is that the `querier_cache` is not prepared for the
querier to be immediately evicted by the reader concurrency semaphore,
when registered with it as an inactive read (#3987).
The other issue is that the multishard mutation query code was not
fully prepared for evicted shard readers being re-created, or failing
while being re-created (#3991).
This series fixes both of these issues and adds a unit test which covers
the second one. I am working on a unit test which would cover the first
issue, but it's proving to be a difficult one and I don't want to delay
the fixes for these issues any longer as they also affect 3.0.
Fixes: #3987
Fixes: #3991
Tests: unit(release, debug)
"
* 'evictable-reader-related-issues/v2' of https://github.com/denesb/scylla:
multishard_mutation_query: reset failed readers to inexistent state
multishard_mutation_query: handle missing readers when dismantling
multishard_mutation_query: add support for keeping stats for discarded partitions
multishard_mutation_query: expect evicted reader state when creating reader
multishard_mutation_query: pretty-print the reader state in log messages
querier_cache: check that the query wasn't evicted during registering
reader_concurrency_semaphore: use the correct types in the constructor
reader_concurrency_semaphore: add consume_resources()
reader_concurrency_semaphore::inactive_read_handle: add operator bool()
Transposed from origin unit test.
Creates a semi-random topology of racks, dcs, tokens and replication
factors and verifies endpoint calculation equals old algo.
Fixes #2896 (hopefully)
Implementation of origin change c000da13563907b99fe220a7c8bde3c1dec74ad5
Reduces the amount of maps and sets and the general complexity of
endpoint calculation by simply mapping DCs to expected node
counts, re-using endpoint sets, and iterating accordingly.
Tested with transposed origin unit test comparing old vs. new
algo results. (Next patch)
When attempting to dismantle readers, some of the to-be-dismantled
readers might be in a failed state. The code waiting on the reader to
stop is expecting failures, however it didn't do anything besides
logging the failure and bumping a counter. Code in the lower layers did
not know how to deal with a failed reader and would trigger
`std::bad_variant_access` when trying to process (save or cleanup) it.
To prevent this, reset the state of failed readers to `inexistent_state`
so code in the lower layers doesn't attempt to further process them.
When dismantling the combined buffer and the compaction state we are no
longer guaranteed to have the reader each partition originated from. The
reader might have been evicted and not resumed, or resuming it might
have failed. In any case we can no longer assume the originating reader
of each partition will be present. If a reader no longer exists,
discard the partitions that it emitted.
In the next patches we will add code that will have to discard some of
the dismantled partitions/fragments/bytes. Prepare the
`dismantle_buffer_stats` struct for being able to track the discarded
partitions/fragments/bytes in addition to those that were successfully
dismantled.
Previously readers were created once, so `make_remote_reader()` had a
validation to ensure readers were not created more than once. This
validation was done by checking that the reader-state is either
`inexistent` or `successful_lookup`. However, with the introduction of
pausing shard readers, it is now possible that a reader will have to be
created and then re-created several times; this validation was not
updated to expect that.
Update the validation so it also expects the reader-state to be
`evicted`, the state the reader will be if it was evicted while paused.
The reader concurrency semaphore can evict the querier when it is
registered as an inactive read. Make the `querier_cache` aware of this
so that it doesn't continue to process the inserted querier when this
happens.
Also add a unit test for this.
Previously there was a type mismatch for `count` and `memory`, between
the actual type used to store them in the class (signed) and the type
of the parameters in the constructor (unsigned).
Although negative numbers are completely valid for these members,
initializing them to negative numbers doesn't make sense, which is why
they used unsigned types in the constructor. This restriction can
backfire, however, when someone intends to give these parameters the
maximum possible value, which, when interpreted as a signed value, will
be `-1`. What's worse, the caller might not even be aware of this
unsigned->signed conversion and be very surprised when they find out.
So to prevent surprises, expose the real type of these members,
trusting the clients to know what they are doing.
Also add a `no_limits` constructor, so clients don't have to make sure
they don't overflow internal types.
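The pitfall can be reproduced with a simplified, hypothetical class (not the actual semaphore code):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical reproduction of the pitfall described above: an unsigned
// constructor parameter feeding a signed member silently turns "the
// maximum possible value" into -1.
struct resource_limits {
    int64_t count;
    // The unsigned parameter was meant to rule out negative initial
    // values...
    explicit resource_limits(uint64_t c)
        : count(static_cast<int64_t>(c)) {} // ...but the max value wraps to -1
};
```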
The upgrade to node_exporter 0.17 commit
09c2b8b48a ("node_exporter_install: switch
to node_exporter 0.17") caused the service to no longer start. Turns out
node_exporter broke backwards compatibility of the command line between
0.15 and 0.16. Fix it up.
While fixing the command line, all the collectors that are enabled by
default were removed.
Fixes #3989
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
[ penberg@scylladb.com: edit commit message ]
Message-Id: <20181213114831.27216-1-amnon@scylladb.com>
"
The motivation is to keep code related to each format separate, to make it
easier to comprehend and reduce incremental compilation times.
Also reduces dependency on sstable writer code by removing writer bits from
sstables.hh.
The ka/la format writers are still left in sstables.cc; they could also be extracted.
"
* 'extract-sstable-writer-code' of github.com:tgrabiec/scylla:
sstables: Make variadic write() not picked on substitution error
sstables: Extract MC format writer to mc/writer.cc
sstables: Extract maybe_add_summary_entry() out of components_writer
sstables: Publish functions used by writers in writer.hh
sstables: Move common write functions to writer.hh
sstables: Extract sstable_writer_impl to a header
sstables: Do not include writer.hh from sstables.hh
sstables: mc: Extract bound_kind_m related stuff into mc/types.hh
sstables: types: Extract sstable_enabled_features::all()
sstables: Move components_writer to .cc
tests: sstable_datafile_test: Avoid dependency on components_writer
"
Working on database.hh or any header that is included in database.hh
(of which there are a lot) is a major pain, as each change involves the
recompilation of half of our compilation units.
Reduce the impact by removing the `#include "database.hh"` directive
from as many header files as possible. Many headers can make do with
just some forward declarations and don't need to include the entire
headers. I also found some headers that included database.hh without
actually needing it.
Results
Before:
$ touch database.hh
$ ninja build/release/scylla
[1/154] CXX build/release/gen/cql3/CqlParser.o
After:
$ touch database.hh
$ ninja build/release/scylla
[1/107] CXX build/release/gen/cql3/CqlParser.o
"
* 'reduce-dependencies-on-database-hh/v2' of https://github.com/denesb/scylla:
treewide: remove include database.hh from headers where possible
database_fwd.hh: add keyspace fwd declaration
service/client_state: de-inline set_keyspace()
Move cache_temperature into its own header
The previous code was using mp_row_consumer_k_l to be as close to the
tested code as possible.
Given that it is testing for an unhandled exception, there is probably
more value in moving it to a higher level, easier to use, API.
This patch changes it to use read_rows_flat().
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181210235016.41133-1-espindola@scylladb.com>
Many headers don't really need to include database.hh, the include can
be replaced by forward declarations and/or including the actually needed
headers directly. Some headers don't need this include at all.
Each header was verified to be compilable on its own after the change,
by including it into an empty `.cc` file and compiling it. `.cc` files
that used to get `database.hh` through headers that no longer include it
were changed to include it themselves.
Embedding the expire timer for a write response in the
abstract_write_response_handler simplifies the code as it allows
removing the rh_entry type.
It will also make the timeout easily accessible inside the handler,
for future patches.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181213111818.39983-1-duarte@scylladb.com>
* https://github.com/espindola/scylla espindola/add-composite-tests:
Remove newline from exception messages.
Fix end marker exception message.
Add tests for broken start and end composite markers.
They are inconsistent with other uses of malformed_sstable_exception
and incompatible with adding " in sstable ..." to the message.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Introduced in 7e15e43.
Exposed by perf_fast_forward:
running: large-partition-skips on dataset large-part-ds1
Testing scanning large partition with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
read skip time (s) frags frag/s (...)
1 0 5.268780 8000000 1518378
1 1 31.695985 4000000 126199
Message-Id: <1544614272-21970-1-git-send-email-tgrabiec@scylladb.com>
If write(v, out, x) doesn't match any overload, the variadic write()
will be picked, with Rest = {}. The compiler will print error messages
about being unable to find write(v, out), which totally obscures the
original cause of the mismatch.
Make it picked only when there are at least two write() parameters so
that debugging compilation errors is actually possible.
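A simplified sketch of the technique (an illustrative serializer writing into a string, not the sstable writer): declaring two explicit value parameters keeps the variadic overload out of single-argument overload resolution, so a genuinely unmatched type fails at its own write() call rather than recursing into a confusing write(v, out) error.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Illustrative single-value overloads.
inline void write(std::string& out, int x) { out += std::to_string(x); }
inline void write(std::string& out, const char* s) { out += s; }

// Variadic overload: viable only with at least two value arguments,
// because T1 and T2 are declared explicitly.
template <typename T1, typename T2, typename... Rest>
void write(std::string& out, T1&& a, T2&& b, Rest&&... rest) {
    write(out, std::forward<T1>(a));
    write(out, std::forward<T2>(b));
    (write(out, std::forward<Rest>(rest)), ...);
}
```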
This moves all MC-related writing code to mc/writer.cc:
- m_format_write_helpers.hh is dropped
- m_format_write_helpers_impl.hh is dropped
- sstable_writer_m is moved out of sstables.cc
sstable_writer_m is renamed to sstables::mc::writer
They are common for sstable writers of different formats.
Note that writer.hh is supposed to be included only by writer
implementations, not writer users.
=== How the partition level repair works
- The repair master decides which ranges to work on.
- The repair master splits the ranges into sub-ranges which contain around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the 100 partitions.
- If the checksum matches, the data in this sub range is synced.
- If the checksum mismatches, repair master fetches the data from all the peers
and sends back the merged data to peers.
=== Major problems with partition level repair
- A mismatch of a single row in any of the 100 partitions causes 100
partitions to be transferred. A single partition can be very large. Not to
mention the size of 100 partitions.
- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
same data twice
=== Row level repair
Row level checksum and synchronization: detect row level mismatch and transfer
only the mismatch
=== How the row level repair works
- To solve the problem of reading data twice
Read the data only once for both checksum and synchronization between nodes.
We work on a small range which contains only a few megabytes of rows.
We read all the rows within the small range into memory, find the
mismatch, and send the mismatched rows between peers.
We need to find a sync boundary among the nodes which contains only N bytes of
rows.
- To solve the problem of sending unnecessary data.
We need to find the mismatched rows between nodes and only send the delta.
This is the set reconciliation problem, which is a common problem in
distributed systems.
For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = { row2, row3}
Node3 has set3 = {row1, row2, row4}
To repair:
Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
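The worked example above can be expressed directly with plain set operations (illustrative code, not the repair implementation): the master fetches each peer's difference against its own set, then sends each peer the union minus what that peer already has.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

using row_set = std::set<std::string>;

// Rows in a but not in b (e.g. "set3 - set1").
row_set difference(const row_set& a, const row_set& b) {
    row_set out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(out, out.begin()));
    return out;
}

// Union of two row sets (e.g. "set1 + set2 + set3", built pairwise).
row_set set_union(const row_set& a, const row_set& b) {
    row_set out = a;
    out.insert(b.begin(), b.end());
    return out;
}
```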
=== How to implement repair with set reconciliation
- Step A: Negotiate sync boundary
class repair_sync_boundary {
    dht::decorated_key pk;
    position_in_partition position;
};
Reads rows from disk into row buffers until the size is larger than N
bytes, and returns the repair_sync_boundary of the last mutation_fragment
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary.
- Step B: Get missing rows from peer nodes so that repair master contains all the rows
Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
data is synced, goto Step A. If not, request the full hashes from peers.
At this point, the repair master knows exactly what rows are missing. Request the
missing rows from peer nodes.
Now, local node contains all the rows.
- Step C: Send missing rows to the peer nodes
Since local node also knows what peer nodes own, it sends the missing rows to
the peer nodes.
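Step A's boundary negotiation can be sketched as follows (std::string and int stand in for dht::decorated_key and position_in_partition): each node reports the boundary where its buffer grew past N bytes, and the smallest reported boundary becomes the current_sync_boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct sync_boundary {
    std::string pk;  // stands in for dht::decorated_key
    int position;    // stands in for position_in_partition
    bool operator<(const sync_boundary& o) const {
        return pk != o.pk ? pk < o.pk : position < o.position;
    }
};

// Pick the smallest boundary reported by all nodes.
sync_boundary choose_current_sync_boundary(
        const std::vector<sync_boundary>& reported) {
    return *std::min_element(reported.begin(), reported.end());
}
```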
=== How the RPC API looks
- repair_range_start()
Step A:
- request_sync_boundary()
Step B:
- request_combined_row_hashes()
- request_full_row_hashes()
- request_row_diff()
Step C:
- send_row_diff()
- repair_range_stop()
=== Performance evaluation
We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows to each of the 3 nodes. Each node has 241 GiB of data.
We tested 3 cases below.
1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows.
Time to repair:
old = 87 min
new = 70 min (rebuild took 50 minutes)
improvement = 19.54%
2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
old = 43 min
new = 24 min
improvement = 44.18%
3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.
Time to repair:
old: 211 min
new: 44 min
improvement: 79.15%
Bytes sent on wire for repair:
old: tx = 162 GiB, rx = 90 GiB
new: tx = 1.15 GiB, rx = 0.57 GiB
improvement: tx = 99.29%, rx = 99.36%
It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.
In this test case, repair master needs to receive 2 million rows and
sends 4 million rows. Here are the details: Each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
Repair master sends 1 million rows of its own and the 1 million rows
received from repair slave 1 to repair slave 2, and 1 million rows of
its own and the 1 million rows received from repair slave 2 to repair
slave 1.
In the result, we saw the rows on wire were as expected.
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
Fixes #3033
This patch introduces repair_meta class that is the core class for the
row level repair.
For each range to repair, repair_meta objects are created on both repair
master and repair slaves. It stores the meta data for the row level
repair algorithms, e.g., the current sync boundary, the buffer used to
hold the rows the peers are working on, the reader to read data from
sstable and the writer to write data to sstable.
This patch also implements the RPC verbs for row level repair, for
example, REPAIR_ROW_LEVEL_START/REPAIR_ROW_LEVEL_STOP to start/stop
row level repair for a range, REPAIR_GET_SYNC_BOUNDARY to get the sync
boundary peers want to work on, REPAIR_GET_ROW_DIFF to get missing rows
from repair slaves, and REPAIR_PUT_ROW_DIFF to push missing rows to
repair slaves.
repair_writer uses multishard_writer to apply the mutation_fragments to
sstable. The repair master needs one such writer for each of the repair
slaves. The repair slave needs one writer for the repair master.
repair_reader is used to read data from disk. It is simply a local
flat_mutation_reader reader for the repair master. It is more
complicated for the repair slave.
The repair slaves have to follow what repair master read from disk.
For example,
Assume repair master has 2 shards and repair slave has 3 shards
Repair master on shard 0 asks repair slave on shard 0 to read range [0,100).
Repair master on shard 1 asks repair slave on shard 1 to read range [0,100).
Repair master on shard 0 will only read the data that belongs to shard 0
within range [0,100). Since master and slave have different shard count,
repair slave on shard 0 has to use the multi shard reader to collect
data on all the shards. It can not pass range [0, 100) to the multi
shard reader, otherwise it will read more data than the repair master.
Instead, repair slave uses a sharder using sharding configuration of the
repair master, to generate the sub ranges belong to shard 0 of repair
master.
If repair master and slave have the same sharding configuration, a simple
local reader is enough for the repair slave.
repair_row is the in-memory representation of "row" that the row level
repair works on. It represents a mutation_fragment that is read from the
flat_mutation_reader. The hash of a repair_row is the combination of the
mutation_fragment hash and the partition_key hash.
Get a random uint64_t number as the seed for the repair row hashing.
The seed is passed to xx_hasher.
We add the randomization when hashing rows so that when we run repair
the next time, the same row produces a different hash.
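The seeding idea can be sketched like this (std::hash stands in for xx_hasher, and the combine step is illustrative): mixing a per-repair random seed into the row hash makes the same row hash to a different value on the next run, while staying stable within one run.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Illustrative seeded row hash; the real code feeds the seed to
// xx_hasher. The combine step is a Boost-style hash mix.
uint64_t seeded_row_hash(uint64_t seed, const std::string& row) {
    uint64_t h = std::hash<std::string>{}(row);
    return h ^ (seed + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2));
}
```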
It is used to find the common difference detection algorithms supported
by repair master and repair slaves.
It is up to repair master to choose what algorithm to use.
It returns a vector of row level repair difference detection algorithms
supported by this node.
We are going to implement the "send_full_set" in the following patches.
Move generating_reader from stream_session.cc to flat_mutation_reader.cc.
It will be used by repair code soon.
Also introduce a helper make_generating_reader to hide the
implementation of generating_reader.
This patch adds the RPC verbs that are needed by the row level repair.
The usage of those verbs are in the following patches.
All the verbs for row level repair are sent by the repair master.
Repair master asks repair slaves to create repair meta objects, a.k.a,
repair_meta object, to store the repair meta data needed by row level
repair algorithm. The repair meta object is identified by the IP address
of the repair master and a uint32 number repair_meta_id chosen by repair
master. When repair master restarts or is out of the cluster, repair
slaves will detect it and remove all existing repair_meta for the repair
master. When repair slave restarts, the existing repair_meta on the
slave will be gone.
The sync boundary used in the verbs is the position_in_partition of the
last mutation_fragment. In each repair round, peers work on
(last_sync_boundary, current_sync_boundary]
Represent a position of a mutation_fragment read from a flat mutation
reader. Repair nodes negotiate a small sub range identified by two
repair_sync_boundary to work on in each round.
The new row level repair code will access clustering_key_prefix and it
uses std::optional everywhere. Convert position_in_partition to use
std::optional.
This is a backport of CASSANDRA-11038.
Before this, a restarted node would be reported as a new node with a
NEW_NODE cql notification.
To fix, only send the NEW_NODE notification when the node was not
already part of the cluster.
Fixes: #3979
Tests: pushed_notifications_test.py:TestPushedNotifications.restart_node_test
Message-Id: <453d750b98b5af510c4637db25b629f07dd90140.1544583244.git.asias@scylladb.com>
This patch adds compatibility for Cassandra's "chunk_size_in_kb" while
keeping Scylla's "chunk_size_kb" compression parameter.
Fixes #3669
Tests: unit (release)
v2: use variable instead of array
v3: fix committed files
Signed-off-by: Juliana Oliveira <juliana@scylladb.com>
Message-Id: <20181211215840.GA7379@shenzou.localdomain>
Cassandra supports a "CREATE CUSTOM INDEX" to create a secondary index
with a custom implementation. The only custom implementation that Cassandra
supports is SASI. But Scylla doesn't support this, or any other custom
index implementation. If a CREATE CUSTOM INDEX statement is used, we
shouldn't silently ignore the "CUSTOM" tag, we should generate an error.
This patch also includes a regression test that "CREATE CUSTOM INDEX"
statements with valid syntax fail (before this patch, they succeeded).
Fixes #3977
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181211224545.18349-2-nyh@scylladb.com>
Different nodes can concurrently create the distributed system
keyspace on boot, before the "if not exists" clause can take effect.
However, the resulting schema mutations will be different since
different nodes use different timestamps. This patch forces the
timestamps to be the same across all nodes, so we avoid some schema
mismatches.
This fixes a bug exposed by ca5dfdf, whereby the initialization of the
distributed system keyspace is done before waiting for schema
agreement. While waiting for schema agreement in
storage_service::join_token_ring(), the node still hasn't joined the
ring and schemas can't be pulled from it, so nodes can deadlock. A
similar situation can happen between a seed node and a non-seed node,
where the seed node progresses to a different "wait for schema
agreement" barrier, but still can't make progress because it can't
pull the schema from the non-seed node still trying to join the ring.
Finally, it is assumed that changes to the schema of the current
distributed system keyspace tables will be protected by a cluster
feature and a subsequent schema synchronization, such that all nodes
will be at a point where schemas can be transferred around.
Fixes #3976
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181211113407.20075-1-duarte@scylladb.com>
"
This series contains several optimizations of the MC format sstable writer, mainly:
- Avoiding output_stream when serializing into memory (e.g. a row)
- Faster serialization of primitive types when serializing into memory
I measured the improvement in throughput (frag/s) using perf_fast_forward for
datasets with a single large partition with many small rows:
- 10% for a row with a single cell of 8 bytes
- 10% for a row with a single cell of 100 bytes
- 9% for a row with a single cell of 1000 bytes
- 13% for a row with 6 cells of 100 bytes
"
* tag 'avoid-output-stream-in-sstable-writer-v2' of github.com:tgrabiec/scylla:
bytes_ostream: Optimize writing of fixed-size types
sstables: mc: Write temporary data to bytes_ostream rather than file_writer
sstables: mc: Avoid double-serialization of a range tombstone marker
sstables: file_writer: Generalize bytes& writer to accept bytes_view
sstables: Templetize write() functions on the writer
sstables: Turn m_format_write_helpers.cc into an impl header
sstables: De-futurize file_writer
bytes_ostream: Implement clear()
bytes_ostream: Make initial chunk size configurable
"
Carry out simplifications of db::extensions: less magical types, de-inline
complex functions, and reduce #include dependencies
Tests: unit(release)
"
* tag 'extensions-simplify/v1' of https://github.com/avikivity/scylla:
extensions: remove unneeded includes
extensions: deinline extension accessors
extensions: return concrete types from the extension accessors
extensions: remove dependency on cql layer
Returning "auto" makes it harder to understand what the function is returning,
and impossible to de-inline.
Return a vector of pointers instead. The caller should iterate immediately, in
any case, and since the previous return value was a range of references to const
unique_ptrs, nothing else could be done with it anyway.
Inlining write() allows the writing code to be optimized for
fixed-size types. In particular, memcpy() calls and loops will be
eliminated.
Saw 4% improvement in throughput in perf_fast_forward for tiny rows.
Currently temporary data is serialized into a file_writer, because
that's what write() functions used to expect, which goes through an
output_stream, a data_sink, into an in-memory data sink implementation
which collects the temporary_buffers.
Going through those abstractions is relatively expensive if we don't
write much, because each time we begin to write after a flush() of the
file_writer the output stream has to allocate a new buffer, which
means a large allocation for small amount of data.
We could avoid that and write into bytes_ostream directly, which will
keep its buffer across clear().
write() functions which are used both to write directly into the data
file and to a temporary arena were templatized to accept a Writer to
which both file_writer and bytes_ostream conform.
I need to templatize functions defined in it and want to avoid
explicit instantiations.
There is only one compilation unit in which this is used
(sstables.cc). I think in the long term we should move all those
"helpers" into sstables/mc/writer.{cc,hh} together with their only
user, the sstable_writer_m class from sstables.cc.
The extensions class reaches into cql's property_definitions class to grab
a map<sstring, sstring> type. This generates a few unneeded dependencies.
Reduce dependencies by defining the map type ourselves; if cql's property_definitions
changes in an incompatible way, it will have to adapt, rather than the extensions
class.
These are the current uninteresting cases I found when looking at
malformed_sstable_exception. The existing code is working, just not
being tested.
* https://github.com/espindola/scylla.git espindola/espindola/broken-sst:
Add a broken sstable test.
Add a test with mismatched schema.
db::config is a global class; changes in any module can cause changes
in db::config. Therefore, it is a cause of needless recompilation.
Remove some of these dependencies by having consumers of db::config
declare an intermediate config struct that contains only
configuration of interest to them, and have their caller fill it out
(in the case of auth, it already followed this scheme and the patchset
only moves the translation function).
In addition, some outright pointless inclusions of db/config.hh are
removed.
The result is somewhat shorter compile times, and fewer needless
recompiles.
* https://github.com/avikivity/scylla unconfig-1/v1:
config: remove inclusions of db/config.hh from header files
repair: remove unneeded config.hh inclusion
batchlog_manager: remove dependency on db::config
auth: remove permissions_cache dependency on db::config
auth: remove auth::service dependency on db::config
auth: remove unneeded db/config.hh includes
"
This series optimises the read path by replacing some usages of
std::vector by utils::small_vector. The motivation for this change was
an observation that memory allocation functions are pointed out by the
profiler as the ones where we spent most time and while they have a
large number of callers storage allocation for some vectors was close to
the top. The gains are not huge, since the problem is a lot of things
adding up and not a single slow thing, but we need to start with
something.
Unfortunately, the performance of boost::container::small_vector is
quite disappointing so a new implementation of a small_vector was
introduced.
perf_simple_query -c4 --duration 60, medians:
./perf_before ./perf_after diff
read 343086.80 360720.53 5.1%
Tests: unit(release, small_vector in debug)
"
* tag 'small_vector/v2.1' of https://github.com/pdziepak/scylla:
partition_slice: use small_vector for column_ids
mutation_fragment_merger: use small_vector
auth: use small_vector in resource
auth: avoid list-initialisation of vectors
idl: serialiser: add serialiser for utils::small_vector
idl: serialiser: deduplicate vector serialisers
utils: introduce small_vector
intrusive_set_external_comparator: make iterator nothrow move constructible
mutation_fragment_merger: value-initialise iterator
"
This is a backport of CASSANDRA-8236.
Before this patch, scylla sends the node UP event to the cql client when
it sees a new node join the cluster, i.e., when a new node's status
becomes NORMAL. The problem is that, at this time, the cql server might
not be ready yet. Once the client receives the UP event, it tries to
connect to the new node's cql port and fails.
To fix, a new application_state::RPC_READY is introduced: a new node sets
RPC_READY to false when it starts gossip in the very beginning and sets
RPC_READY to true when the cql server is ready.
RPC_READY is a bad name, but I think it is better to follow Cassandra.
Nodes with or without this patch are supposed to work together with no
problem.
Refs #3843
"
* 'asias/node_up_down.upstream.v4.1' of github.com:scylladb/seastar-dev:
storage_service: Use cql_ready facility
storage_service: Handle application_state::RPC_READY
storage_service: Add notify_cql_change
storage_service: Add debug log in notify_joined
storage_service: Add extra check in notify_joined
storage_service: Add notify_joined
storage_service: Add debug log in notify_up
storage_service: Add extra check in notify_up
storage_service: Add notify_up
storage_service: Make notify_left log debug level
storage_service: Introduce notify_left
storage_service: Add debug log in notify_down
storage_service: Introduce notify_down
storage_service: Add set_cql_ready
gossip: Add gossiper::is_cql_ready
gms: Add endpoint_state::is_cql_ready
gms: Add application_state::RPC_READY
gms: Introduce cql_ready in versioned_value
At this point the cql_ready facility is ready. To use it, advertise the
RPC_READY application state in the following cases:
- When a node boots, set it to false
- When cql server is ready, set it to true
- When cql server is down, set it to false
- A new scylla node always sends application_state::RPC_READY = false when
the node boots and sends application_state::RPC_READY = true when the cql
server is up
- An old scylla node that does not support application_state::RPC_READY
never has it in the endpoint_state; we can only assume its cql server is
up, so we return true if application_state::RPC_READY is not present
_ck_blocks_header is a 64-bit variable, so the mask should be 64 bits too.
Otherwise, a shift in the range 32-63 will produce wrong results.
Fix by using a 64-bit mask.
Found by Fedora 29's ubsan.
Fixes #3973.
Message-Id: <20181209120549.21371-1-avi@scylladb.com>
"
Refs #3929
Enables re-use of commitlog segments.
First, it ensures we never succeed in playing back a commitlog
segment whose name does not match the IDs in the actual
file data, by determining the expected id based on the file name.
This also handles partially written re-used files, as
each chunk header's CRC depends on the ID, and will
fail once we hit any left-overs.
The second part renames files and puts them into a recycle list
instead of actually deleting them when finished.
Allocating new files will then prioritize this list
before creating a new file.
Note that since consumption and release of segments can
be somewhat unbalanced, this does not really guarantee
we will use recycled files even in all cases when it
might be possible, simply because of timing. It does
however give a good chance of it.
We limit recycled files based on the max disk size
setting, thus we can potentially grow disk size
more than without depending on timing, but not
uncontrolled.
While all this theoretically might improve disk
writes in some cases, it is far from any magic bullet.
No real performance testing has been done yet, only
functional.
"
* 'calle/commitlog-reuse' of github.com:scylladb/seastar-dev:
commitlog: Recycle used segments instead of delete + new file
commitlog: Terminate all segments with a zero chunk
commitlog_replay: Enforce file name based id matching
Refs #3929
When deleting a segment, IFF we have not yet filled up all reserves,
instead of actually deleting the file, put it on a "recycle" list.
Next segment allocation will instead of creating a new one simply
rename the segment and reuse the file and its allocated space.
We rename the file twice: Once on adding to recycle list, with special
prefix so we don't mix up actual replayable segments and these. Second
when we actually re-use the file (also to ensure consecutive names).
Note that we limit the amount of recyclables, so a really stressed
application which somehow fills up the replenish queue might
cause us to still drop the segments. We could skip this, but we would
risk getting too many files on disk.
Replay should be safe, since all entries are guarded by CRC based
on the file ID (i.e. file name). Thus replaying a recycled segment
will simply cause a CRC error in the main header and be ignored (see
previous patch).
Segments that are fully synced will have terminating zero-header (see
previous patch) so we know when to stop processing a recycled file.
If a file is the result of a mid-write crash, we will generate a CRC
processing error as "normally" in this case, when hitting partially
written block or coming to an old/new chunk boundary.
v2:
* Sync dir on rename
* auto -> const sstring&
* Allow recycling files as long as we're within disk space limits
v3:
* Use special names for files waiting for reuse
Writes a final chunk header of zero to the file on close, to mark
end-of-segment.
This allows us to gracefully stop replay processing of a segment file
even if it was not zeroed from the beginning (maybe recycled - hint
hint).
When reading the header chunk of a commitlog file, check the stored id
value against the id derived from the file name, and ignore if
mismatched. This is a prerequisite for re-using renamed commitlog files,
as we can then fail-fast should one such be left on disk, instead of
trying to replay it.
We also check said id via the CRC check for each chunk parsed. If we
find a chunk with
mismatched id, we will get a CRC error for the chunk, and replay will
terminate (albeit not gracefully).
The newer version of node_exporter comes with important bug fixes. This
is especially important for I3.metal, which is not supported by the
older version of node_exporter.
The dashboards can now support both the new and the old version of
node_exporter.
Fixes #3927
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20181210085251.23312-1-amnon@scylladb.com>
"
Make major compaction aware of compaction strategy, by using an
optimal approach which suits the strategy's needs.
Refs #1431.
"
* 'compaction_strategy_aware_major_compaction_v2' of github.com:raphaelsc/scylla:
tests: add test for compaction-strategy-aware major compaction
compaction: implement major compaction heuristic for leveled strategy
compaction: introduce notion of compaction-strategy-aware major compaction
auth::service already has its own configuration and a function to create it
from db::config; just move it to the caller. This reduces dependencies on the
global db::config class.
permissions_cache already has its own configuration and a function to create it
from db::config; just move it to the caller. This reduces dependencies on the
global db::config class.
Extract configuration into a new struct batchlog_manager_config and have the
callers populate it using db::config. This reduces dependencies on global objects.
Instead, distribute those inclusions to .cc files that require them. This
reduces rebuilds when config.hh changes, and makes it easier to locate files
that need config disaggregation.
rh_entry address is captured inside timeout's callback lambda, so the
structure should not be moved after it is created. Change the code to
create rh_entry in-place instead of moving it into the map.
Fixes #3972.
Message-Id: <20181206164043.GN25283@scylladb.com>
The results vector should be populated vertically, not horizontally.
Responsible for assertion failure with --cache-enabled:
void result_collector::add(test_result_vector): Assertion `rs.size() == results.size()' failed.
Introduced in 3fc78a25bf.
Message-Id: <1544105835-24530-2-git-send-email-tgrabiec@scylladb.com>
Major compaction for leveled strategy will now create a run of
non-overlapping sstables at the highest level. Until now, a single
sstable would be created at level 0 which was very suboptimal because
all data would need to climb up the levels again, making it a very
expensive I/O process.
Refs #1431.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's only the very first step which introduces the machinery for making
major compaction aware of all strategies. For the time being, a default
implementation is used for them all, which only suits size tiered.
Refs #1431.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Commit cc6c383249 has fixed an issue with
incorrectly tracking max_local_deletion_time and the check in
validate_max_local_deletion_time was called to work around old files.
This fix relaxes the conditions for enforcing the default
max_local_deletion_time so that they don't apply to SSTables in 'mc'
format, because the original problem was resolved before the 'mc' format
was introduced.
This is needed to be able to read correct values from
Cassandra-generated SSTables that don't have a Scylla.db component.
Its presence or absence is used as an indicator of possibly affected
files.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
For tiny index files (< 8 bytes long) it could turn to zero and trigger
an assertion in prepare_summary().
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
storage_service keeps a bunch of "feature" variables, indicating cluster-wide
supported features, and has the ability to wait until the entire cluster supports
a given feature.
The propagation of features depends on gossip, but gossip is initialized after
storage_service, so the current code late-initializes the features. However, that
means that whoever waits on a feature between storage_service initialization and
gossip initialization loses their wait entry. In #3952, we have proof that this
in fact happens.
Fix this by removing the circular dependency. We now store features in a new
service, feature_service, that is started before both gossip and storage_service.
Gossip updates feature_service while storage_service reads from it.
Fixes #3953.
* https://github.com/avikivity/3953/v4.1:
storage_service: deinline enable_all_features()
gossiper: keep features registered
tests/gossip: switch to seastar::thread
storage_service: deinline init/deinit functions
gossiper: split feature storage into a new feature_service
gossiper: maybe enable features after start_gossiping()
storage_service: fix gap when feature::when_enabled() doesn't work
storage_service::register_features() reassigns to feature variables in
storage_service. This means that any call to feature::when_enabled() will be
orphaned when the feature is assigned.
Now that feature lifetimes are not tied to gossip, we can move the feature
initialization to the constructor and eliminate the gap. When gossip is started
it will evaluate application_states and enable features that the cluster agrees on.
Since we may now start with features already registered, we need to enable
features immediately after gossip is started. This case happens in a cluster
that already is fully upgraded on startup. Before this series, features were
only added after this point.
Feature lifetime is tied to storage_service lifetime, but features are now managed
by gossip. To avoid circular dependency, add a new feature_service service to manage
feature lifetime.
To work around the problem, the current code re-initializes features after
gossip is initialized. This patch does not fix this problem; it only makes it
possible to solve it by untying features from gossip.
Gossiper unregisters enabled features as an optimization. However that makes
decoupling features from gossiper harder. Disable this optimization; since the
number of features is small and normal access is to a single feature at a time,
there is no significant performance or memory loss.
In Scylla we have three implementations of vector-like structures
std::vector, utils::chunked_vector and utils::small_vector. Which one is
used is largely an implementation detail and all should be serialised
by the IDL infrastructure in exactly the same way. To make sure that
it's indeed the case let's make them share the serialiser
implementation.
small_vector is a variation of std::vector<> that reserves a configurable
amount of storage internally, without the need for memory allocation.
This can bring measurable gains if the expected number of elements is
small. The drawback is that moving such small_vector is more expensive
and invalidates iterators as well as references which disqualifies it in
some cases.
"
Multishard combining readers, running concurrently, with limited
concurrency and no timeout may deadlock, due to inactive shard readers
sitting on permits. To avoid this we have to make sure that all shard
readers belonging to a multishard combining reader, that are not
currently active, can be evicted to free up their permits, ensuring that
all readers can make progress.
Making inactive shard readers evictable is the solution for this
problem, however the original series introducing this solution
(414b14a6bd) did not go all the way and
left some loose ends. These loose ends are tied up by this mini-series.
Namely, two issues remained:
* The last reader to reach EOS was not paused (made evictable).
* Readers created/resumed as part of a read-ahead were not paused
immediately after finishing the read-ahead.
This series fixes both of these.
Fixes: #3865
Tests: unit(release, debug)
"
* 'fix-multishard-reader-deadlock/v1' of https://github.com/denesb/scylla:
multishard_combining_reader: pause readers after reading ahead
multishard_combining_reader: pause *all* EOS'd readers
Readers created or resumed just to read ahead should be paused right
after, to avoid consuming all available permits on the shards they
operate on, causing a deadlock.
"
This series of patches ensures that all the Python code base is python3 compliant
and consistent by applying the following logic:
- python3 classifier on setup.py to explicitly state our python compatibility matrix
- add UTF-8 encoding header
- correct every shebang to the same /usr/bin/env python3
- shebang is only added on scripts meant to be executed on their own (removed otherwise)
- migrate some leftover scripts from python2 to python3 with minimal QA
This work is important to prepare for a more drastic change on Python code styling
using the black formatter and the setting up of automated QA checks on Python code base.
"
* 'python3_everywhere' of https://github.com/numberly/scylla:
scylla-housekeeping: fix python3 compat and shebang
dist/ami/files/scylla_install_ami: python3 shebang
dist/docker/redhat/docker-entrypoint.py: add encoding comment
fix_system_distributed_tables.py: fix python3 compat and shebang
gen_segmented_compress_params.py: add encoding comment
idl-compiler.py: python3 shebang
scylla-gdb.py: python3 shebang
configure.py: python3 shebang
tools/scyllatop/: add / normalize python3 shebang
scripts/: add / normalize python3 shebang
dist/common/scripts: add / normalize python3 shebang
test.py: add encoding comment
setup.py: add python3 classifiers
"
This patchset extends a number of existing tests to check SSTables
statistics for 'mc' format and fixes an issue discovered with the help
of one of the tests.
Tests: unit {release}
"
* 'projects/sstables-30/check-stats/v2' of https://github.com/argenet/scylla:
tests: Run sstable_timestamp_metadata_correcness_with_negative with all SSTables versions.
tests: Run sstable_tombstone_histogram_test for all SSTables versions.
tests: Run min_max_clustering_key_test on all SSTables versions.
tests: Expand test_sstable_max_local_deletion_time_2 to run for all SSTables versions.
tests: Run test_sstable_max_local_deletion_time on all SSTables versions.
tests: Extend test checking tombstones histogram to cover all SSTables versions.
sstables: Properly track row-level tombstones when writing SSTables 3.x.
tests: Run min_max_clustering_key_test_2 for all SSTables versions.
tests: Make reusable_sst() helper accept SSTables version parameter.
Previously the last shard reader to reach EOS wasn't paused. This is a
mistake and can contribute to causing deadlocks when the number of
concurrently active readers on any shard is limited.
ForwardIterators are default constructible, but they have to be
value-initialised to compare equal to other value-initialised instances
of that iterator.
Otherwise errors cannot be made sense of, since errors are always
reported to stdout. Without the test output we don't know what they're
referring to.
This change makes the output always go to stdout, in addition to other
reporters, if any.
Message-Id: <1544020084-16492-1-git-send-email-tgrabiec@scylladb.com>
UTF-8 strings are currently validated by boost::locale::conv::utf_to_utf,
which actually does string conversion, more than is necessary. As
observed on an Arm server, UTF-8 validation can become a bottleneck
under heavy loads.
This patch introduces a brand new SIMD implementation supporting both
NEON and SSE, as well as a naive approach to handle short strings.
The naive approach is 3x faster than boost utf_to_utf, whilst SIMD
method outperforms naive approach 3x ~ 5x on Arm and x86. Details at
https://github.com/cyb70289/utf8/.
UTF-8 unit test is added to check various corner cases.
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Message-Id: <1543978498-12123-1-git-send-email-yibo.cai@arm.com>
If the compaction manager is started, compactions may start (this is
regardless of whether or not we trigger them). The problem with that is
that they start at a time in which we are flushing the commitlog and the
initialization procedure waits for the commitlog to be fully flushed and
the resulting memtables flushed before we move on.
Because there are no incoming writes, the amount of shares in memtable
flushes decrease as memory used decreases and that can cause the startup
procedure to take a long time.
We have recently started to bump the shares manually for manual flushes.
While that guarantees that we will not drive the shares to zero, I will
make the argument that we can do better by making sure that those things
are, at this point, running alone: user experience is affected by
startup times and the bump we give to user-triggered operations will
only do so much. Even if we increase the shares a lot flushes will still
be fighting for resources with compactions and startup will take longer
than it could.
By making sure that flushes are at this point running alone, we improve the
user experience by making sure the startup is as fast as it can be.
There is a similar problem at the drain level, which is also fixed in this
series.
Fixes #3958
* git@github.com:glommer/scylla.git faster-restart
compaction_manager: delay initialization of the compaction manager.
drain: stop compactions early
In dtest, we have
self.check_rows_on_node(node1, 2000)
self.check_rows_on_node(node2, 2000)
which introduce the following cluster operations:
1) Initially:
- node1 up
- node2 up
2) self.check_rows_on_node(node1, 2000)
- node2 down
- node2 up (A: node2 will call gossiper::real_mark_alive when node2 boots
up to mark node1 up)
3) self.check_rows_on_node(node2, 2000)
- node1 down (B: node1 will send shutdown gossip message to node2, node2
will mark node1 down)
- node1 up (C: when node1 is up, node2 will call
gossiper::real_mark_alive)
Since there is no guarantee of the order of Operation A and Operation B,
it is possible that node2 will mark node1 as status=shutdown and mark
node1 as UP.
In Operation C, node2 will call gossiper::real_mark_alive to mark node1
up, but since node2 might think node1 is already up, node2 will exit
early in gossiper::real_mark_alive and not log "InetAddress 127.0.0.1 is
now UP, status={}"
As a result, dtest fails to see node2 reports node1 is up when it boots
node1 and fail the test.
TimeoutError: 23 Nov 2018 10:44:19 [node2] Missing: ['127.0.0.1.* now UP']
In the log we can see node1 marked as DOWN and UP almost at the same time on node2:
INFO 2018-11-23 22:31:29,999 [shard 0] gossip - InetAddress 127.0.0.1 is now DOWN, status = shutdown
INFO 2018-11-23 22:31:30,006 [shard 0] gossip - InetAddress 127.0.0.1 is now UP, status = shutdown
Fixes #3940
Tests: dtest with 20 consecutive successful runs
Message-Id: <996dc325cbcc3f94fc0b7569217aa65464eaaa1c.1543213511.git.asias@scylladb.com>
Fixes a build failure when only the scylla binary was selected for
building like this:
./configure.py --with scylla
In this case the rule for gen_crc_combine_table was missing, but it is
needed to build crc_combine_table.o
Message-Id: <1544010138-21282-1-git-send-email-tgrabiec@scylladb.com>
Both of these have the same problem. They remove the to-be-evicted
entries from `_entries` but they don't unregister the `entry` from the
`reader_concurrency_semaphore`. This results in the
`reader_concurrency_semaphore` being left with a dangling pointer to the
entries, which will trigger a segfault when it tries to evict the
associated inactive reads.
Also add a unit test for `evict_all_for_table()` to check that it works
properly (`evict_one()` is only used in tests, so no dedicated test for
it).
Fixes: #3962
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <57001857e3791c6385721b624d33b667ccda2e7d.1544010868.git.bdenes@scylladb.com>
"
Before the reader was just ignoring such columns but this creates a risk of data loss.
Refs #2598
"
* 'haaawk/2598/v3' of github.com:scylladb/seastar-dev:
sstables: Add test_sstable_reader_on_unknown_column
sstables: Exception on sstable's column not present in schema
sstables: store column name in column_translation::column_info
sstables: Make test_dropped_column_handling test dropped columns
Since we merged the relocatable package, build_deb.sh/build_rpm.sh only
does packaging using a prebuilt binary taken from the relocatable
package, and won't compile anything.
So passing the --jobs option to build_deb.sh/build_rpm.sh becomes
meaningless, and we can drop it.
Note that we can still specify the --jobs option on reloc/build_reloc.sh;
it runs "ninja-build -jN" to compile Scylla, then generates the
relocatable package.
See #3956
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181204205652.25138-1-syuu@scylladb.com>
drain suffers from the same problem as startup suffers now: memtables
are flushed as part of the drain routine, and because there are no
incoming writes, the shares the controller assigns to flushes go down over
time, slowing down the process of drain.
This patch reorders things so that we stop compactions first, and flush
later. It guarantees that when flush do happen it will have the full
bandwidth to work with.
There is a comment in the code saying we should stop compactions
forcefully instead of waiting for them to finish. I consider this
orthogonal to this patch therefore I am not touching this. Doing so will
make the drain operation even faster but can be done later. Even when we
do it, having the flushes proceed alone instead of during compactions
will make it faster.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
If the compaction manager is started, compactions may start (this is
regardless of whether or not we trigger them). The problem with that is
that they start at a time in which we are flushing the commitlog and the
initialization procedure waits for the commitlog to be fully flushed and
the resulting memtables flushed before we move on.
Because there are no incoming writes, the amount of shares in memtable
flushes decrease as memory used decreases and that can cause the startup
procedure to take a long time.
We have recently started to bump the shares manually for manual flushes.
While that guarantees that we will not drive the shares to zero, I will
make the argument that we can do better by making sure that those things
are, at this point, running alone: user experience is affected by
startup times and the bump we give to user-triggered operations will
only do so much. Even if we increase the shares a lot flushes will still
be fighting for resources with compactions and startup will take longer
than it could.
By making sure that flushes are at this point running alone, we improve the
user experience by making sure the startup is as fast as it can be.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Fixes the condition which determines whether a row ttl should be used for a cell
and adds a test that uses each generated mutation to populate mutation source
and then verifies that it can read back the same mutation.
* seastar-dev.git haaawk/sst3/write-read-test/v3:
Fix use_row_ttl condition
Add test_all_data_is_read_back
gen_crc_combine_table is now executed on every build, so it should not
fail on unsupported archs. The generated file will not contain data,
but this is fine since it should not be used.
Another problem is that u32 and u64 aliases were not visible in the #else
branch in crc_combine.cc
Message-Id: <1543864425-5650-1-git-send-email-tgrabiec@scylladb.com>
This tests that a source after being populated with a mutation
returns exactly the same mutation when read.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
When the auth service is requested to stop during bootstrap,
it might still not have reached schema agreement.
Currently, waiting for this agreement is done in an infinite loop,
without taking abort_source into account.
This patch introduces checking if abort was requested
and breaking the loop in such case, so auth service can terminate.
Tests:
unit (release)
dtest (bootstrap_test.py:TestBootstrap.shutdown_wiped_node_cannot_join_test)
Message-Id: <1b7ded14b7c42254f02b5d2e10791eb767aae7fc.1543914769.git.sarna@scylladb.com>
The previous condition was wrong and used row ttl too often.
We also have to change test_dead_row_marker to compare
resulting sstable with sstable generated by Origin not
by sstableupgrade.
This is because sstableupgrade transmits information about deleted row
marker automatically to cells in that row.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"
This is a small step in fixing issue #2347. It is mostly tests and
testing infrastructure, but it does include a fix for a case where we
were missing the filename in the malformed_sstable_exception.
"
* 'espindola/sstable-corruption-v2' of https://github.com/espindola/scylla:
Add a filename to a malformed_sstable_exception.
Try to read the full sst in broken_sst.
Convert tests to SEASTAR_THREAD_TEST_CASE.
Check the exception message.
Move some tests to broken_sstable_test.cc
"
This series attempts to solve the regressions recently discovered in
performance of multi-partition range-scans. Namely that they:
* Flood the reader concurrency semaphore's queues, trampling other
reads.
* Behave very badly when too many of them are running concurrently
(thrashing).
* May deadlock if enough of them are running without a timeout.
The solution for these problems is to make inactive shard readers
evictable. This should address all three issues listed above, to varying
degrees:
* Shard readers will now not cling onto their permits for the entire
duration of the scan, which might be a lot of time.
* Will be less affected by infinite concurrency (more than the node can
handle) as each scan now can make progress by evicting inactive shard
readers belonging to other scans.
* Will not deadlock at all.
In addition to the above fix, this series also bundles two further
improvements:
* Add a mechanism to `reader_concurrency_semaphore` to be notified of
newly inserted evictables.
* General cleanups and fixes for `multishard_combining_reader` and
`foreign_reader`.
I can unbundle these mini series and send them separately, if the
maintainers so prefer, although considering that this series will have to
be backported to 3.0, I think this present form is better.
Fixes: #3835
"
* 'evictable-inactive-shard-readers/v7' of https://github.com/denesb/scylla: (27 commits)
tests/multishard_mutation_query_test: test stateless query too
tests/querier_cache: fail resource-based eviction test gracefully
tests/querier_cache: simplify resource-based eviction test
tests/mutation_reader_test: add test_multishard_combining_reader_next_partition
tests/mutation_reader_test: restore indentation
tests/mutation_reader_test: enrich pause-related multishard reader test
multishard_combining_reader: use pause-resume API
query::partition_slice: add clear_ranges() method
position_in_partition: add region() accessor
foreign_reader: add pause-resume API
tests/mutation_reader_test: implement the pause-resume API
query_mutations_on_all_shards(): implement pause-resume API
make_multishard_streaming_reader(): implement the pause-resume API
database: add accessors for user and streaming concurrency semaphores
reader_lifecycle_policy: extend with a pause-resume API
query_mutations_on_all_shards(): restore indentation
query_mutations_on_all_shards(): simplify the state-machine
multishard_combining_reader: use the reader lifecycle policy
multishard_combining_reader: add reader lifecycle policy
multishard_combining_reader: drop unnecessary `reader_promise` member
...
Currently when this test fails, resources are not released in the
correct order, which results in ASAN complaining about use-after-free
in debug builds. This is due to the BOOST_REQUIRE macro aborting the
test when the predicate fails, not allowing for correct destruction
order to take place.
To avoid this ugly failure, which adds noise and might send a developer
investigating the failure down the wrong path, use the milder
BOOST_CHECK family of test macros. These will allow the test to run
to completion even when the predicate fails, allowing for the correct
destruction of the resources.
Now that we have an accessor for all concurrency semaphores, we don't
need the tricks of creating a dummy keyspace to get them. Use the
accessors instead.
Test the interaction of the multishard reader with the foreign reader
w.r.t. next_partition(). next_partition() is a special operation, as
its execution is deferred until the next cross-shard operation. Give it
some extra stress-testing.
Refactor the multishard combining reader to make use of the new
pause-resume API to pause inactive shard readers.
Make the pause-resume API mandatory to implement, as by now all existing
clients have adapted it.
Allows for clearing any custom partition ranges, effectively resetting
them to the default ones. Useful for code that needs to set several
different specific partition ranges, one after the other, but doesn't
want to remember the last key it set a range for to be able to clear the
previous range with `clear_range()`.
Allow for pausing the reader and later resuming it. Pausing the reader
waits on the ongoing read ahead (if any), executes any pending
`next_partition()` calls and then detaches the shard reader's buffer.
The paused shard reader is returned to the client.
Resuming the reader consists of getting the previously detached reader
back, or one that has the same position as the old reader had.
This API allows for making the inactive shard readers of the
`multishard_combining_reader` evictable.
The API is private, it's only accessible for classes knowing the full
definition of the `foreign_reader` (which resides in a .cc file).
This API provides a way for the multishard reader to pause inactive shard
readers and later resume them when they are needed again. This allows
for these paused shard readers to be evicted when the node is under
pressure.
How the readers are made evictable while paused is up to the clients.
Using this API in the `multishard_combining_reader` and implementing it
in the clients will be done in the next patches.
Provide default implementation for the new virtual methods to facilitate
gradual adoption.
The `read_context` which handles creating, saving and looking-up the
shard readers has to deal with its `destroy_reader()` method being
called at any time, even before some other method has finished its work.
For example it is valid for a reader to be requested to be destroyed
even before the context finishes creating it.
This means that state transitions that take time can be interleaved with
another state transition request. To deal with this the read context
uses `future_` states, states that mark an ongoing state transition.
This allows state transition requests that arrive in the middle of
another state transition to be attached as a continuation to the ongoing
transition, and to be executed after that finishes. This however
resulted in complex code, that has to handle readers being in all sorts
of different states, when the `save_readers()` method is called.
To avoid all this complexity, exploit the fact that `destroy_reader()`
receives a future<> as its argument, which resolves when all previous
state transitions have finished. Use a gate to wait on all these futures
to resolve. This way we don't need all those transitional states,
instead in `save_readers()` we only need to wait on the gate to close.
Thus the number of states `save_readers()` has to consider drops
drastically.
This has the theoretical drawback of the process of saving the readers
having to wait on each of the readers to stop, but in practice the
process finishes when the last reader is saved anyway, so I don't expect
this to result in any slowdown.
Currently `multishard_combining_reader` takes two functors, one for
creating the readers and optionally one for destroying them.
A bag of functors (`std::function`) however makes for a terrible
interface, and as we are about to add some more customization points,
it's time to use something more formal: policy based design, a
well-known design pattern.
As well as merging the job of the two functors into a single policy
class, also widen the area of responsibility of the policy to include
keeping alive any resource the shard readers might need on their home
shard. Implementing a proper reader cleanup is now not optional either.
This patch only adds the `reader_managing_policy` interface,
refactoring the multishard reader to use it will be done in the next patch.
The `reader_promise` member of the `shard_reader` was used to
synchronize a foreground request to create the underlying reader with an
ongoing background request with the same goal. This is however
unnecessary. The underlying reader is created in the background only as
part of a read ahead. In this case there is no need for an extra
synchronization point; the foreground reader create request can just
wait for the read ahead to finish, for which there already exists a
means. Furthermore, foreground reader create requests are always followed
by a `fill_buffer()` request, so by waiting on the read ahead we ensure
that the following `fill_buffer()` call will not block.
Shard readers used to track pending `next_partition()` calls that they
couldn't execute, because their underlying reader wasn't created yet.
These pending calls were then executed after the reader was created.
However the only situation where a shard reader can receive a
`next_partition()` call before its underlying reader was created is
when `next_partition()` is called on the multishard reader before a
single fragment is read. In this case we know we are at a partition
boundary and thus this call has no effect, therefore it is safe to
ignore it.
Foreign reader doesn't execute `next_partition()` calls straight away,
when this would require interaction with the remote reader. Instead
these calls are "remembered" and executed on the next occasion the
foreign reader has to interact with the remote reader. This was
implemented with a counter that counts the number of pending
`next_partition()` calls.
However when `next_partition()` is called multiple times, without
interleaving calls to `operator()()` or `fast_forward_to()`, only the
first such call has effect. Thus it doesn't make sense to count these
calls, it is enough to just set a flag if there was at least one such
call.
It doesn't make sense for the multishard reader anyway, as it's only
used by the row-cache. We are about to introduce the pausing of inactive
shard readers, and it would require complex data structures and code
to maintain support for this feature that is not even used. So drop it.
As we are about to add multiple sources of evictable readers, we need a
more scalable solution than a single functor being passed that opaquely
evicts a reader when called.
Add a generic way to register and unregister evictable (inactive)
readers to the semaphore. The readers are expected to be registered when
they become evictable and are expected to be unregistered when they
cease to be evictable. The semaphore might evict any reader that is
registered to it, when it sees fit.
This also solves the problem of notifying the semaphore when new readers
become evictable. Previously there was no such mechanism, and the
semaphore would only evict any such new readers when a new permit was
requested from it.
It is reasonable for parse() to throw when it finds something wrong
with the format. This seems to be the best spot to add the filename
and rethrow.
Also add a testcase to make sure we keep handling this error
gracefully.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this patch we use data_consume_rows to try to read the entire
sstable. The patch also adds a test with a particular corruption that
would not be found without parsing the file.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This makes the tests a bit more strict by also checking the message
returned by the what() function.
This shows that some of the tests are out of sync with which errors
they check for. I will hopefully fix this in another pass.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
sstable_test.cc was already a bit too big and there is potential for
having a lot of tests about broken sstables.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These loops have the structure:
    while (true) {
        switch (state) {
        case state1:
            ...
            break;
        case state2:
            if (...) { ... break; } else { ... continue; }
            ...
        }
        break;
    }
There are a couple of things I find a bit odd about that structure:
* The break refers to the switch, the continue to the loop.
* A while (true) loop always hits a break or a continue.
This patch uses early returns to simplify the logic to:
    while (true) {
        switch (state) {
        case state1:
            ...
            return;
        case state2:
            if (...) { ... return; }
            ...
        }
    }
Now there are no breaks or continues.
Tests: unit (release)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181126171726.84629-1-espindola@scylladb.com>
"
zlib's crc32_combine() is not very efficient. It is faster to re-combine
the buffer using crc32(). It's still substantial amount of work which
could be avoided.
This patch introduces a fast implementation of crc32_combine() which
uses a different algorithm than zlib. It also utilizes intrinsics for
carry-less multiplication instruction to perform the computation faster.
The details of the algorithm can be found in code comments.
Performance results using perf_checksum and second buffer of length 64 KiB:
zlib CRC32 combine: 38'851 ns
libdeflate CRC32: 4'797 ns
fast_crc32_combine(): 11 ns
So the new implementation is 3500x faster than zlib's, and 417x faster than
re-checksumming the buffer using libdeflate.
Tested on i7-5960X CPU @ 3.00GHz
Performance was also evaluated using sstable writer benchmark:
perf_fast_forward --populate --sstable-format=mc --data-directory /tmp/perf-mc \
--value-size=10000 --rows 1000000 --datasets small-part
It yielded 9% improvement in median frag/s (129'055 vs 117'977).
Refs #3874
"
* tag 'fast-crc32-combine-v2' of github.com:tgrabiec/scylla:
tests: perf_checksum: Test fast_crc32_combine()
tests: Rename libdeflate_test to checksum_utils_test
tests: libdeflate: Add more tests for checksum_combine()
tests: libdeflate: Check both libdeflate and default checksummers
sstables: Use fast_crc_combine() in the default checksummer
utils/gz: Add fast implementation of crc32_combine()
utils/gz: Add pre-computed polynomials
utils/gz: Import Barett reduction implementation from libdeflate
utils: Extract clmul() from crc.hh
Before this patch we were writing offset map entries in unspecified
order, the one returned by std::unordered_map. Cassandra writes them
sorted by metadata_type. Use the same order for improved
compatibility.
Fixes #3955.
Message-Id: <1543846649-22861-1-git-send-email-tgrabiec@scylladb.com>
"
Due to an XFS heuristic, if all files are in one (or a few) directories,
then block allocation can become very slow. This is because XFS divides
the disk into a few allocation groups (AGs), and each directory allocates
preferentially from a single AG. That AG can become filled long before
the disk is full.
This patchset works around the problem by:
- creating sstable component files in their own temporary, per-sstable sub-directory,
- moving the files back into the canonical location right after being created, and finally
- removing the temp sub-directory when the sstable is sealed.
- In addition, any temporary sub-directories that might have been left over if scylla
crashed while creating sstables are looked up and removed when populating the table.
Fixes: #3167
Tests: unit (release)
"
* 'issues/3167/v7' of https://github.com/bhalevy/scylla:
distributed_loader::populate_column_family: lookup and remove temp sstable directories
database: directly use std::experimental::filesystem::path for lister::path
database: use std::experimental::filesystem::path for lister::path
sstable: use std::experimental::filesystem rather than boost
sstable::seal_sstable: fixup indentation
sstable: create sstable component files in a subdirectory
sstable::new_sstable_component_file: pass component_type rather than filename
sstable: cleanup filename related functions
sstable: make write_crc, write_digest, and new_sstable_component_file private methods
zlib's crc32_combine() is not very efficient. It is faster to re-combine
the buffer using crc32(). It's still substantial amount of work which
could be avoided.
This patch introduces a fast implementation of crc32_combine() which
uses a different algorithm than zlib. It also utilizes intrinsics for
carry-less multiplication instruction to perform the computation faster.
The details of the algorithm can be found in code comments.
Performance results using perf_checksum and second buffer of length 64 KiB:
zlib CRC32 combine: 38'851 ns
libdeflate CRC32: 4'797 ns
fast_crc32_combine(): 11 ns
So the new implementation is 3500x faster than zlib's, and 417x faster than
re-checksumming the buffer using libdeflate.
Tested on i7-5960X CPU @ 3.00GHz
Performance was also evaluated using sstable writer benchmark:
perf_fast_forward --populate --sstable-format=mc --data-directory /tmp/perf-mc \
--value-size=10000 --rows 1000000 --datasets small-part
It yielded 9% improvement in median frag/s (129'055 vs 117'977).
gen_crc_combine_table.cc will be run during build to produce tables
with precomputed polynomials (4 x 256 x u32). The definitions will
reside in:
build/<mode>/gen/utils/gz/crc_combine_table.cc
It takes 20ms to generate on my machine.
The purpose of those polynomials will be explained in crc_combine.cc
As we are about to extend the functionality of the reader concurrency
semaphore, adding more method implementations that need to go to a .cc
file, it's time we create a dedicated file, instead of continuing to shove them
into unrelated .cc files (mutation_reader.cc).
Use standard convention of the rest of the code base. Type definitions
first, then data members and finally member functions.
As we are about to add more members, it's especially important to make
the growing class have a familiar member arrangement.
We would like to get rid of boost::filesystem and gradually replace it with
std::experimental::filesystem.
TODO: declare namespace fs = std::experimental::filesystem, and
use fs::path directly, rather than lister::path
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When writing the sstable, create a temporary directory
for creating all components so that each sstable's files will be
assigned a different allocation group on xfs.
Files are immediately renamed to their default location after creation.
Temp directory is removed when the sstable is sealed.
Additional work to be introduced in the following patches:
- When populating tables, temp directories need to be looked up and removed.
Fixes #3167
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for per-sstable sub directory.
Also, these functions get most of their parameters from the sst at hand so they might
as well be first class members.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a reference to a docker image that contains an "official" toolchain
for building Scylla. In addition, add a script that allows easy usage of
the image, and some documentation.
Message-Id: <20181202120829.21218-1-avi@scylladb.com>
The git, python, and sudo packages are installed by default on a normal
Fedora installation but not in the Docker image, so we need to install
them in this script.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181201020834.24961-1-syuu@scylladb.com>
"
This patchset addresses two recently discovered bugs both triggered by
summary regeneration:
Tests: unit {release}
+
Validated with debug build of Scylla (ASAN) that no use-after-free
occurs when re-generating Summary.db.
"
* 'projects/sstables-30/summary-regeneration/v1' of https://github.com/argenet/scylla:
tests: Add test reading SSTables in 'mc' format with missing summary.
sstables: When loading, read statistics before summary.
database: Capture io_priority_class by reference to avoid dangling ref.
In case the summary is missing and we attempt to re-generate it,
statistics must already be read to provide us with values stored in the
serialization header to facilitate clustering prefixes deserialization.
Fixes#3947
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
The original reference points to a thread-local storage object that is
guaranteed to outlive the continuation, but copying it makes the
subsequent calls point to a local object and introduces a use-after-free
bug.
Fixes #3948
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This test checks that sstable reader throws an exception
when sstable contains a column that's not present in the schema.
It also checks that dropped columns do not cause exceptions.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Previously such column was ignored but it's better to be explicit
about this situation.
Refs #2598
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"
This series adds proper handling of filtering queries with LIMIT.
Previously the limit was erroneously applied before filtering,
which leads to truncated results.
To avoid that, paged filtering queries now use an enhanced pager,
which remembers how many rows dropped and uses that information
to fetch for more pages if the limit is not yet reached.
For unpaged filtering queries, paging is done internally, as in the case
of aggregations, to avoid keeping huge results in memory.
Also, previously, all limited queries used a page size counted
as min(page size, limit). That's not good for filtering,
because with LIMIT 1 we would then query for rows one-by-one.
To avoid that, filtered queries ask for the whole page and the results
are truncated if need be afterwards.
Tests: unit (release)
"
* 'fix_filtering_with_limit_2' of https://github.com/psarna/scylla:
tests: add filtering with LIMIT test
tests: split filtering tests from cql_query_test
cql3: add proper handling of filtering with LIMIT
service/pager: use dropped_rows to adjust how many rows to read
service/pager: virtualize max_rows_to_fetch function
cql3: add counting dropped rows in filtering pager
Before it was testing missing columns.
It's better to test dropped columns because they should be ignored
while for missing columns some sources will throw.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"
This patchset adds support to scylla_io_setup for
multiple data directories as well as commitlog,
hints, and saved_caches directories.
Refs #2415
Tests: manual testing with scylla-ccm generated scylla.yaml
"
* 'projects/multidev/v3' of https://github.com/bhalevy/scylla:
scylla_io_setup: assume default directories under /var/lib/scylla
scylla_io_setup: add support for commitlog, hints, and saved_caches directory
scylla_io_setup: support multiple data directories
Previously, limit was erroneously applied before filtering,
which might have resulted in truncated results.
Now, both paged and unpaged queries are filtered first,
and only after that properly trimmed so only X rows are returned
for LIMIT X.
Fixes #3902
Filtering pager may drop some rows and as a result return less
than what was fetched from the replica. To properly adjust how
many rows were actually read, dropped_rows variable is introduced.
Regular pagers use max_rows to figure out how many rows to fetch,
but filtering pager potentially needs the whole page to be fetched
in order to filter the results.
If a specific directory is not configured in scylla.yaml, scylla assumes
a default location under /var/lib/scylla.
Hard code these locations in scylla_io_setup until we have a better way
to probe scylla about it.
Be permissive and silently ignore the default directories if they don't
exist on disk.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Counter for dropped rows is added to the filtering pager.
This metric can be used later to implement applying LIMIT
to filtering queries properly.
Dropped rows are returned on visitor::accept_partition_end.
"
This series fixes #3891 by amending the way restrictions
are checked for filtering. Previous implementation that returned
false from need_filtering() when multi-column restrictions
were present was incorrect.
Now, the error is going to be returned from restrictions filter layer,
and once multi-column support is implemented for filtering, it will
require no further changes.
Tests: unit (release)
"
* 'fix_multi_column_filtering_check_3' of https://github.com/psarna/scylla:
tests: add multi-column filtering check
cql3: remove incorrect multi-column check
cql3: check filtering restrictions only if applicable
cql3: add pk/ck_restrictions_need_filtering()
need_filtering() incorrectly returned false if multi-column restrictions
were present. Instead, these restrictions should be allowed to need
filtering.
Fixes #3891
"
This miniseries ensures that system tables are not checked
for having view updates, because they never do.
What's more, distributed system table is used in the process,
so it's unsafe to query the table while streaming it.
Tests: unit (release), dtest(update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_decommission_node_2_test)
"
* 'fix_checking_if_system_tables_need_view_updates_3' of https://github.com/psarna/scylla:
streaming: don't check view building of system tables
database: add is_internal_keyspace
streaming: remove unused sstable_is_staging bool class
System tables will never need view building, and, what's more,
are actually used in the process of view build checking.
So, checking whether system tables need a view update path
is simplified to returning 'false'.
"
The series fixes #3565 and #3566
"
* 'gleb/write_failure_fixes' of github.com:scylladb/seastar-dev:
storage_proxy: store hint for CL=ANY if all nodes replied with failure
storage_proxy: complete write request early if all replicas replied with success or failure
storage_proxy: check that write failure response comes from recognized replica
storage_proxy: move code executed on write timeout into separate function
Sync Debian variants dependencies with dist/debian/control.mustache
(before merging relocatable), use scylla 3rdparty packages.
Since we use 3rdparty repo on seastar/install-dependencies.sh, drop repo
setup part from this script.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181031122800.11802-1-syuu@scylladb.com>
Current code assumes that request failed if all replicas replied with
failure, but this is not true for CL=ANY requests. Take it into account.
Fixes: #3565
Currently if write request reaches CL and all replicas replied, but some
replied with failures, the request will wait for timeout to be retired.
Detect this case and retire request immediately instead.
Fixes #3566
As far as I can tell the old sstable reading code required reading the
data into a contiguous buffer. The function data_consume_rows_at_once
implemented the old behavior and incrementally code was moved away
from it.
Right now the only use is in two tests. The sstables used in those
tests are already used in other tests with data_consume_rows.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181127024319.18732-2-espindola@scylladb.com>
"
Compression is not deterministic so instead of binary comparing the sstable files we just read data back
and make sure everything that was written down is still present.
Tests: unit(release)
"
* 'haaawk/binary-compare-of-compressed-sstables/v3' of github.com:scylladb/seastar-dev:
sstables: Remove compressed parameter from get_write_test_path
sstables: Remove unused sstable test files
sstables: Ensure compare_sstables isn't used for compressed files
sstables: Don't binary compare compressed sstables
sstables: Remove debug printout from test_write_many_partitions
"
consistency_level.hh is rather heavyweight in both its contents and what it
includes. Reduce the number of inclusion sites and split the file to reduce
dependencies.
"
* tag 'cl-header/v2' of https://github.com/avikivity/scylla:
consistency_level: simplify validation API
Split consistency_level.hh header
database: remove unneeded consistency_level.hh include
cql: remove unneeded includes of consistency_level.hh
It has two unrelated users: cql for validation, and storage_proxy for
complicated calculations. Split the simple stuff into a new header to reduce
dependencies.
Not all compaction operations submitted through the compaction manager set a callback
for releasing references of exhausted sstables in compaction manager itself.
That callback lives in compaction descriptor which is passed to table::compaction().
Let's make the call conditional to avoid bad function call exceptions.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181126235616.10452-1-raphaelsc@scylladb.com>
There's no _M_t._M_head_impl any more in the standard library.
We now have std_unique_ptr wrapper which abstracts this fact away so
use that.
Message-Id: <20181126174837.11542-1-tgrabiec@scylladb.com>
"
One part of the improvement comes from replacing zlib's CRC32 with the one
from libdeflate, which is optimized for modern architecture and utilizes the
PCLMUL instruction.
perf_checksum test was introduced to measure performance of various
checksumming operations.
Results for 514 B (relevant for writing with compression enabled):
test iterations median mad min max
crc_test.perf_deflate_crc32_combine 58414 16.711us 3.483ns 16.708us 16.725us
crc_test.perf_adler_combine 165788278 6.059ns 0.031ns 6.027ns 7.519ns
crc_test.perf_zlib_crc32_combine 59546 16.767us 26.191ns 16.741us 16.801us
---
crc_test.perf_deflate_crc32_checksum 12705072 83.267ns 4.580ns 78.687ns 98.964ns
crc_test.perf_adler_checksum 3918014 206.701ns 23.469ns 183.231ns 258.859ns
crc_test.perf_zlib_crc32_checksum 2329682 428.787ns 0.085ns 428.702ns 510.085ns
Results for 64 KB (relevant for writing with compression disabled):
test iterations median mad min max
crc_test.perf_deflate_crc32_combine 25364 38.393us 17.683ns 38.375us 38.545us
crc_test.perf_adler_combine 169797143 5.842ns 0.009ns 5.833ns 6.901ns
crc_test.perf_zlib_crc32_combine 26067 38.663us 95.094ns 38.546us 40.523us
---
crc_test.perf_deflate_crc32_checksum 202821 4.937us 14.426ns 4.912us 5.093us
crc_test.perf_adler_checksum 44684 22.733us 206.263ns 22.492us 25.258us
crc_test.perf_zlib_crc32_checksum 18839 53.049us 36.117ns 53.013us 53.274us
The new CRC32 implementation (deflate_crc32) doesn't provide a fast
checksum_combine() yet, it delegates to zlib so it's as slow as the latter.
Because for CRC32 checksum_combine() is several orders of magnitude slower
than checksum(), we avoid calling checksum_combine() completely for this
checksummer. We still do it for adler32, whose combine() is faster
than its checksum().
SStable write performance was evaluated by running:
perf_fast_forward --populate --data-directory /tmp/perf-mc \
--rows=10000000 -c1 -m4G --datasets small-part
Below is a summary of the average frag/s for a memtable flush. Each result is
an average of about 20 flushes with stddev of about 4k.
Before:
[1] MC,lz4: 330'903
[2] LA,lz4: 450'157
[3] MC,checksum: 419'716
[4] LA,checksum: 459'559
After:
[1'] MC,lz4: 446'917 ([1] + 35%)
[2'] LA,lz4: 456'046 ([2] + 1.3%)
[3'] MC,checksum: 462'894 ([3] + 10%)
[4'] LA,checksum: 467'508 ([4] + 1.7%)
After this series, the performance of the MC format writer is similar to that
of the LA format before the series.
There seems to be a small but consistent improvement for LA too. I'm not sure
why.
"
* tag 'improve-mc-sstable-checksum-libdeflate-v3' of github.com:tgrabiec/scylla:
tests: perf: Introduce perf_checksum
tests: Add test for libdeflate CRC32 implementation
sstables: compress: Use libdeflate for crc32
sstables: compress: Rename crc32_utils to zlib_crc32_checksummer
licenses: Add libdeflate license
Integrate libdeflate with the build system
Add libdeflate submodule
sstables: Avoid checksum_combine() for the crc32 checksummer
sstables: compress: Avoid unnecessary checksum_combine()
sstables: checksum_utils: Add missing include
Improves memtable flush performance by 10% in a CPU-bound case.
Unlike the zlib implementation, libdeflate is optimized for modern
CPUs. It utilizes the PCLMUL instruction.
checksum_combine() is much slower than re-feeding the buffer to
checksum() for the zlib CRC32 checksummer.
Introduce Checksum::prefer_combine() to determine this and select
the more optimal behavior for a given checksummer.
Improves performance of memtable flush with compression enabled by 30%.
class_registry's staticness has the usual problem of
static classes (loss of dependency information) and prevents us
from librarifying Scylla, since all objects that define a registration
must be linked in.
Take a first step against this staticness by defining a nonstatic
variant. The static class_registry is then redefined in terms of the
nonstatic class. After all uses have been converted, the static
variant can be retired.
Message-Id: <20181126130935.12837-1-avi@scylladb.com>
"
Previously we were checking for schema incompatibility between the current schema and the sstable
serialization header before reading any data. This isn't the best approach because
the data in the sstable may already be irrelevant, for example due to a column drop.
This patchset moves the check to after the actual data is read and verified to have
a timestamp new enough to classify it as non-obsolete.
Fixes #3924
"
* 'haaawk/3924/v3' of github.com:scylladb/seastar-dev:
sstables: Enable test_schema_change for MC format
sstables3: Throw error on schema mismatch only for live cells
sstables: Pass column_info to consume_*_column
sstables: Add schema_mismatch to column_info
sstables: Store column data type in column_info
sstables: Remove code duplication in column_translation
The BYPASS CACHE documentation mistakenly described an earlier version of the patch.
Correct it to document the committed version.
Message-Id: <20181126125810.9344-1-avi@scylladb.com>
This family of test_write_many_partitions_* tests writes
sstables down from memtable using different compressions.
Then it compares the resulting file with a blueprint file
and reads the data back to check everything is there.
Compression is not deterministic, so this patch makes the
tests stop comparing the resulting compressed sstable file with the blueprint
file and instead only read the data back.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Previously we were throwing an exception during the creation of
column_translation. This wasn't always correct, because sometimes the
column for which the mismatch appeared had already been dropped and the
data present in the sstable should be ignored anyway.
Fixes #3924
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"
Some queries are very unlikely to hit cache. Usually this includes
range queries on large tables, but other patterns are possible.
While the database should adapt to the query pattern, sometimes the
user has information the database does not have. By passing this
information along, the user helps the database manage its resources
more optimally.
To do this, this patch introduces a BYPASS CACHE clause to the
SELECT statement. A query thus marked will not attempt to read
from the cache, and instead will read from sstables and memtables
only. This reduces CPU time spent to query and populate the cache,
and will prevent the cache from being flooded with data that is
not likely to be read again soon. The existing cache disabled path
is engaged when the option is selected.
Tests: unit (release), manual metrics verification with ccm with and without the
BYPASS CACHE clause.
Ref #3770.
"
* tag 'cache-bypass/v2' of https://github.com/avikivity/scylla:
doc: document SELECT ... BYPASS CACHE
tests: add test for SELECT ... BYPASS CACHE
cql: add SELECT ... BYPASS CACHE clause
db: add query option to bypass cache
* tag 'perf-ffwd-dataset-population-v2' of github.com:tgrabiec/scylla:
tests: perf_fast_forward: Measure performance of dataset population
tests: perf_fast_forward: Record the dataset on which test case was run
tests: perf_fast_forward: Introduce the concept of a dataset
tests: perf_fast_forward: Introduce make_compaction_disabling_guard()
tests: perf_fast_forward: Initialize output manager before population
tests: perf_fast_forward: Handle empty test parameter set
tests: perf_fast_forward: Extract json_output_writer::write_common_test_group()
tests: perf_fast_forward: Factor out access to cfg to a single place per function
tests: perf_fast_forward: Extract result_collector
tests: perf_fast_forward: Take writes into account in AIO statistics
tests: perf_fast_forward: Reorder members
tests: perf_fast_forward: Add --sstable-format command line option
The BYPASS CACHE clause instructs the database not to read from or populate the
cache for this query. The new keywords (BYPASS and CACHE) are not reserved.
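For illustration, a query using the documented clause might look like this (keyspace and table names are hypothetical):

```cql
SELECT * FROM ks.large_table WHERE token(pk) > ? AND token(pk) <= ? BYPASS CACHE;
```

The clause goes at the end of the SELECT statement; since BYPASS and CACHE are not reserved, existing schemas using them as identifiers keep working.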
do_process_buffer had two unreachable default cases and a long
if-else-if chain.
This converts the if-else-if chain to a switch and a helper
function.
This moves the error checking from run time to compile time. If we
were to add a 128 bit integer for example, gcc would complain about it
missing from the switch.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181125221451.106067-1-espindola@scylladb.com>
"
This new compaction approach consists of releasing exhausted fragments[1] of a run[2] as
compaction proceeds, considerably decreasing the space requirement.
These changes will immediately benefit leveled strategy because it already works with
the run concept.
[1] A fragment is an sstable composing a run; exhausted means the sstable was fully consumed
by the compaction procedure.
[2] A run is a set of non-overlapping sstables which roughly span the
entire token range.
Note:
Last patch includes an example compaction strategy showing how to work with the interface.
unit tests: all modes passing
dtests: compaction ones passing
"
* 'sstable_run_based_compaction_v10' of github.com:raphaelsc/scylla:
tests: add example compaction strategy for sstable run based approach
sstables/compaction: propagate sstable replacement to all compaction of a CF
sstables: store cf pointer in compaction_info
tests/sstable_test: add test for compaction replacement of exhausted sstable
sstables: add sstable's on closed handling
tests/sstables: add test for sstable run based compaction
sstables/compaction_manager: prevent partial run from being selected for compaction
compaction: use same run identifier for sstables generated by same compaction
sstables: introduce sstable run
sstables/compaction_manager: release reference to exhausted sstable through callback
sstables/compaction: stop tracking exhausted input sstable in compaction_read_monitor
database: do not keep reference to sstable in selector when done selecting
compaction: share sstable set with incremental reader selector
sstables/compaction: release space earlier of exhausted input sstables
sstables: make partitioned sstable set's incremental selector resilient to changes in the set
database: do not store reference to sstable in incremental selector
tests/sstables: add run identifier correctness test
sstables: use a random uuid for sstables without run identifier
sstables: add run identifier to scylla metadata
This is needed for parallel compaction to work with the sstable-run-based approach.
That's because regular compaction clones a set containing all sstables of its
column family. So compaction A can potentially hold a reference to a compacting
sstable of compaction B, preventing compaction B from releasing its exhausted
sstable.
So all replacements are propagated to all compactions of a given column family,
and the compactions in turn, including the one which initiated the propagation,
will perform the replacement.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The motivation is that we need a more efficient way to find compactions
that belong to a given column family in the compaction list.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Make sure that compaction is capable of releasing exhausted sstable space
early in the procedure.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The motivation is that it will be useful for catching regressions in compaction
when releasing exhausted sstables early. That's because an sstable's space
is only released once it's closed. So this will allow us to write a test
case and possibly use it for entities holding an exhausted sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Filter out sstables belonging to a partial run being generated by an ongoing
compaction. Otherwise, that could lead to wrong decisions by the compaction
strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
SSTables composing the same run will share the same run identifier.
Therefore, a new compaction strategy will be able to get all sstables belonging
to the same run from sstable_set, which now keeps track of existing runs.
The same UUID is passed to all writers of a given compaction. Otherwise, a new UUID
is picked for every sstable created by compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
An sstable run is a structure that will hold all sstables that have the same
run identifier. All sstables belonging to the same run will not overlap
with one another.
It can be used by compaction strategy to work on runs instead of individual
sstables.
sstable_set structure which holds all sstables for a given column family
will be responsible for providing to its user an interface to work with
runs instead of individual sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's important so that the reference to the sstable is not kept throughout
the compaction procedure, which would defeat the goal of releasing
space during compaction.
Manager passes a callback to compaction which calls it whenever
there's sstable replacement.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The motivation is that we want to release the space of an exhausted sstable, and that
will only happen when all references to it are gone *and* the backlog
tracker takes the early replacement into account.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When compacting, we'll create all readers at once and will not select
again from incremental selector, meaning the selector will keep all
respective sstables in current_sstables, preventing compaction from
releasing space as it goes on.
The change is about refreshing sstable set's selector such that it
will not hold a reference to an exhausted sstable whatsoever.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
By doing that, we'll be able to release an exhausted sstable from both
simultaneously.
That's achieved by sharing set containing input sstables with the incremental
reader selector and removing exhausted sstables from shared set when the
time has come.
A step towards reducing the disk requirement of compaction by making it delete
sstables whose data is all in a sealed new sstable. For that to happen,
all references must be gone.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, compaction only replaces input sstables at the end of compaction,
meaning compaction must be finished for all the space of those sstables
to be released.
What we can do instead is to delete some input sstables earlier, under
certain conditions:
1) The sstable's data should be committed to a new, sealed output sstable,
meaning it's exhausted.
2) An exhausted sstable mustn't overlap with a non-exhausted sstable,
because a tombstone in the exhausted one could have been purged and the
shadowed data in the non-exhausted one could be resurrected if the system
crashes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The motivation is that compaction may remove a sstable from the set while the
incremental selector is alive, and for that to work, we need to invalidate
the iterators stored by the selector. We could have added a method to notify
it, but there will be a case where the one keeping the set cannot forward
the notification to the selector. So it's better for the selector to take
care of itself. A change-counter approach is used, which allows the selector
to know when to invalidate the iterators.
After invalidation, the selector will move the iterator back into its right
place by looking for the lower bound of the current position.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Use sstable generation instead to keep track of read sstables.
The motivation is that we'll not keep references to sstables, allowing
their space on disk to be released as soon as they get exhausted.
Generation is used because it guarantees uniqueness of the sstable.
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Older sstables must have an identifier for them to be associated
with their own run.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It identifies the run which a particular sstable belongs to.
Existing sstables will have a random uuid associated with them
in memory.
UUID is the correct choice because it allows sstables to be
exported without conflicts between identifiers generated
by different nodes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
These switches are fully covered. We can be sure they will stay this
way because of -Werror and gcc's -Wswitch warning.
We can also be sure that we never have an invalid enum value since the
state machine values are not read from disk.
The patch also removes a superfluous ';'.
Message-Id: <20181124020128.111083-1-espindola@scylladb.com>
This field is true when there's a mismatch
between column type in serialization header and
current schema.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The reason for that is that it's not available in sstable format mc,
so we can no longer rely on it in common code for the currently
supported formats.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181121170057.20900-1-raphaelsc@scylladb.com>
A dataset represents a table with data, populated in a certain way, with
certain characteristics of the schema and data.
Before this change, datasets were implicitly defined, with population
hard-coded inside the populate() function.
This change gathers logic related to datasets into classes, in order to:
- make it easier to define new datasets.
- be able to measure performance of dataset population in a
standardized way.
- be able to express constraints on datasets imposed by different
test cases. Test cases are matched with possible datasets based
on the abstract interface they accept (e.g. clustered_ds,
multipartition_ds), which must be implemented by a compatible
dataset. To facilitate this matching, the test function is now wrapped
into a dataset_acceptor object, with an automatically-generated can_run()
virtual method, deduced by make_test_fn().
- be able to select tests to run based on the dataset name.
Only tests which are compatible with that dataset will be run.
Extracts the result collection and reporting logic out of
run_test_case(). Will be needed in population tests, for which we
don't want the looping logic.
This series adds a generic test for schema changes that generates
various schemas and data before and after an ALTER TABLE operation. It is
then used to check the correctness of mutation::upgrade() and sstable
readers, and led to the discovery of #3924 and #3925.
Fixes #3925.
* https://github.com/pdziepak/scylla.git schema-change-test/v3.1
schema_builder: make member function names less confusing
converting_mutation_partition_applier: fix collection type changes
converting_mutation_partition_applier: do not emit empty collections
sstable: use format() instead of sprint()
tests/random-utils: make functions and variables inline
tests: add models for schemas and data
tests: generate schema changes
tests/mutation: add test for schema changes
tests/sstable: add test for schema changes
for_each_schema_change() is used for testing reading an sstable that was
written with a different schema. Because of #3924, for now the mc format
is not verified this way.
This patch adds a for_each_schema_change() function which generates
schemas and data before and after some modification to the schema (e.g.
adding a column, changing its type). It can be used to test schema
upgrades.
This patch introduces a model of Scylla schemas and data, implemented
using simple standard library primitives. It can be used for testing the
actual schemas, mutation_partitions, etc. used by Scylla, by
comparing the results of various actions.
The initial use case for this model was to test schema changes, but
there is no reason why in the future it cannot be extended to test other
things as well.
packer 1.3.2 no longer supports the enhanced_networking directive; we need
to use the new directives ("sriov_support" and "ena_support") to build with
the new version.
packer provides automatic configuration file fixing tool, so new
scylla.json is generated by following command:
./packer/packer fix scylla.json
Fixes #3938
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181123053719.32451-1-syuu@scylladb.com>
appending_hash is used for computing hashes that become part of the
binary interface. They cannot change between Scylla versions, and the same
data needs to always result in the same hash.
At the moment, appending_hash<bytes_ostream> doesn't fulfil those
requirements, since it leaks information about how the underlying buffer is
fragmented. Fortunately, it has no users, so it doesn't cause any
compatibility issues.
Moreover, bytes_ostream is usually used as an output of some
serialisation routine (e.g. frozen_mutation_fragment or CQL response).
Those serialisation formats do not guarantee that there is a single
representation of a given data and therefore are not fit to be hashed by
appending_hash. Removing appending_hash<bytes_ostream> may help
prevent such incorrect uses.
Message-Id: <20181122163823.12759-1-pdziepak@scylladb.com>
* seastar b924495...1fbb633 (3):
> rpc: Reduce code duplication
> tests: perf: Make do_not_optimize() take the argument by const&
> doc: Fix import paths in the tutorial
The format message was using the new style formatting markers ("{}"),
which are understood by format() but not by sprint() (the latter is
basically deprecated).
This patch changes the behaviour of the schema upgrade code so that if
all cells and the tombstones of a collection are removed during the upgrade,
the collection is not emitted (as opposed to emitting an empty one).
Both behaviours are valid, but the new one makes it more consistent with
how atomic cells are upgraded and how schema upgrades work for sstable
readers.
ALTER TABLE allows changing the type of a collection to a compatible
one. This includes changes from a fixed-sized type to a variable-sized
one. If that happens, the atomic_cells representing collection elements
need to be rewritten so that the value size is included. The logic for
rewriting atomic cells already exists (for those that are not
collection members) and is reused in this patch.
Fixes #3925.
Right now, schema_builder member functions have names that very poorly
convey the actions performed by them. This is made even worse
by some overloads which drastically change the semantics. For example:
schema_builder()
.with_column("v1", /* ... */)
.without_column("v1", removal_timestamp);
Creates a column "v1" and adds information that there was a column
with that name that was removed at 'removal_timestamp'.
schema_builder()
.with_column("v1")
.without_column(utf8_type->decompose("v1"));
This adds column "v1" and then immediately removes it.
In order to clean up this mess, the names were changed so that:
* with_/without_ functions only add information to the schema (e.g.
info that a column was removed, but without removing a column of that
name if one exists)
* functions which names start with a verb actually perform that action,
e.g. the new remove_column() removes the column (and adds information
that it used to exist) as in the second example.
Currently, if the hints directory contains unexpected directories, Scylla fails to
start with an unhandled std::invalid_argument exception. Make the manager
ignore malformed files instead and try to proceed anyway.
Message-Id: <20181121134618.29936-2-gleb@scylladb.com>
We scan the hints directory in two places: to search for files to replay and
to search for directories to remove after resharding. The code that
translates a directory name to a shard is duplicated. It is simple now, so
not a big issue, but in case it grows it's better to have it in one place.
Message-Id: <20181121134618.29936-1-gleb@scylladb.com>
"
Tested with perf_fast_forward from:
github.com/tgrabiec/scylla.git perf_fast_forward-for-sst3-opt-write-v1
Using the following command line:
build/release/tests/perf/perf_fast_forward_g --populate --sstable-format=mc \
--data-directory /tmp/perf-mc --rows=10000000 -c1 -m4G \
--datasets small-part
The average reported flush throughput was (stdev for the averages is around 4k):
- for mc before the series: 367848 frag/s
- for la before the series: 463458 frag/s (= mc.before +25%)
- for mc after the series: 429276 frag/s (= mc.before +16%)
- for la after the series: 466495 frag/s (= mc.before +26%)
Refs #3874.
"
* tag 'sst3-opt-write-v2' of github.com:tgrabiec/scylla:
sstables: mc: Avoid serialization of promoted index when empty
sstables: mc: Avoid double serialization of rows
tests: sstable 3.x: Do not compare Statistics component
utils: Introduce memory_data_sink
schema: Optimize column count getters
sstables: checksummed_file_data_sink_impl: Bypass output_stream
The old code was serializing the row twice. Once to get the size of
its block on disk, which is needed to write the block length, and then
to actually write the block.
This patch avoids this by serializing once into a temporary buffer and
then appending that buffer to the data file writer.
I measured about 10% improvement in memtable flush throughput with
this for the small-part dataset in perf_fast_forward.
The Statistics component recorded in the test was generated using a
buggy version of Scylla, and is not correct. Exposed by fixing the bug
in the way statistics are generated.
Rather than comparing binary content, we should have explicit checks
for statistics.
"
Enables sstable compression with LZ4 by default, which was the
long-time behavior until a regression turned off compression by
default.
Fixes #3926
"
* 'restore-default-compression/v2' of https://github.com/duarten/scylla:
tests/cql_query_test: Assert default compression options
compress: Restore lz4 as default compressor
tests: Be explicit about absence of compression
Indicate the default Scylla directories, rather than Cassandra's.
Provide links to Scylla documentation where possible,
update links to Cassandra documentation otherwise.
Clean up a few typos.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20181119141912.28830-1-bhalevy@scylladb.com>
This avoids a difference between little and big endian systems. We
now also calculate a full murmur hash for tokens with fewer than 8
bytes; however, in practice the token size is always 8.
Message-Id: <20181120214733.43800-1-mike.munday@ibm.com>
The boost multiprecision library that I am compiling against seems
to be missing an overload for the cast to a string. The easy
workaround seems to be to call str() directly instead.
This also fixes #3922.
Message-Id: <20181120215709.43939-1-mike.munday@ibm.com>
Fixes a regression introduced in
74758c87cd, where tables started to be
created without compression by default (before they were created with
lz4 by default).
Fixes #3926
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
* seastar d59fcef...b924495 (2):
> build: Fix protobuf generation rules
> Merge "Restructure files" from Jesse
Includes fixup patch from Jesse:
"
Update Seastar `#include`s to reflect restructure
All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
The build environment may not have ninja-build installed before running
install-dependencies.sh, so install it after running the script.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181031110737.17755-1-syuu@scylladb.com>
* seastar a44cedf...d59fcef (10):
> dns: Set tcp output stream buffer size to zero explicitly
> tests: add libc-ares to travis dependencies
> tests: add dns_test to test suite
> build: drop bundled c-ares package
> prometheus: replace the instance label with an optional one
> build: Refactor C++ dialect detection
> build: add libatomic to install-depenencies.sh
> core: use std::underlying_type for open_flags
> core: introduce open_flags::operator&
> core: Fix build for `gnu++14`
Currently, when advance_and_await() fails to allocate the new gate
object, it will throw bad_alloc and leave the phased_barrier object in
an invalid state. Calling advance_and_await() again on it will result
in undefined behavior (typically SIGSEGV) because _gate will be
disengaged.
One place affected by this is table::seal_active_memtable(), which
calls _flush_barrier.advance_and_await(). If this throws, subsequent
flush attempts will SIGSEGV.
This patch rearranges the code so that advance_and_await() has strong
exception guarantees.
Message-Id: <1542645562-20932-1-git-send-email-tgrabiec@scylladb.com>
In (almost) all SSTable write paths, we need to inform the monitor that
the write has failed as well. The monitor will remove the SSTable from
controller's tracking at that point.
Except there is one place where we are not doing that: streaming of big
mutations. Streaming of big mutations is an interesting use case, in
which it is done in 2 parts: if the writing of the SSTable fails right
away, then we do the correct thing.
But the SSTables are not committed at that point and the monitors are
still kept around with the SSTables until a later time, when they are
finally committed. Between those two points in time, it is possible that
the streaming code will detect a failure and manually call
fail_streaming_mutations(), which marks the SSTable for deletions. At
that point we should propagate that information to the monitor as well,
but we don't.
Fixes #3732 (hopefully)
Tests: unit (release)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181114213618.16789-1-glauber@scylladb.com>
In a homogeneous cluster this will reduce the number of internal cross-shard hops
per request, since RPC calls will arrive at the correct shard.
Message-Id: <20181118150817.GF2062@scylladb.com>
In commit a33f0d6, we changed the way we handle arrays during the write
and parse code to avoid reactor stalls. Some potentially big loops were
transformed into futurized loops, and also some calls to vector resizes
were replaced by a reserve + push_back idiom.
The latter broke parsing of the estimated histogram. The reason being
that the vectors that are used here are already initialized internally
by the estimated_histogram object. Therefore, when we push_back, we
don't fill the array all the way from index 0, but end up with a zeroed
beginning and only push back some of the elements we need.
We could revert this array to a resize() call. After all, the reason we
are using reserve + push_back is to avoid calling the constructor member
for each element, but we don't really expect the integer specialization
to do any of that.
However, to avoid confusion with future developers who may feel tempted
to convert this as well for the sake of consistency, it is safer to
just make sure these arrays are zeroed.
Fixes #3918
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181116130853.10473-1-glauber@scylladb.com>
Remove the timeout argument to
db::view::view_builder::wait_until_built(), a test-only function to
wait until a given materialized view has finished building.
This change is motivated by the fact that some tests running on slow
environments will timeout. Instead of incrementally increasing the
timeout, remove it completely since tests are already run under an
exterior timeout.
Fixes #3920
Tests: unit release(view_build_test, view_schema_test)
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181115173902.19048-1-duarte@scylladb.com>
map_reduce_column_families_locally iterates over all tables (column
families) in a shard.
If the number of tables is big, it can cause latency spikes.
This patch replaces the current loop with a do_for_each, allowing
preemption inside the loop.
Fixes #3886
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20181115154825.23430-1-amnon@scylladb.com>
In a previous patch I fixed most TTLs in the view_complex_test.cc tests
from low numbers to 100 seconds. I missed one. This one never caused
problems in practice, but for good form, let's fix it too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181115160234.26478-1-nyh@scylladb.com>
After this patch, the Materialized Views and Secondary Index features
are considered generally-available and no longer require passing an
explicit "--experimental=on" flag to Scylla.
The "--experimental=on" flag and the db::config::check_experimental()
function remain unused, as we graduated the only two features which used
this flag. However, we leave the support for experimental features in
the code, to make it easier to add new experimental features in the future.
Another reason to leave the command-line parameter behind is so existing
scripts that still use it will not break.
Fixes #3917
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181115144456.25518-1-nyh@scylladb.com>
"
This series enables filtering support for CONTAINS restriction.
"
* 'enable_filtering_for_contains_2' of https://github.com/psarna/scylla:
tests: add CONTAINS test case to filtering tests
cql3: enable filtering for CONTAINS restriction
cql3: add is_satisfied_by(bytes_view) for CONTAINS
Several of the tests in tests/view_complex_test.cc set a cell with a
TTL, and then skip time ahead artificially with forward_jump_clocks(),
to go past the TTL time and check the cell disappeared as expected.
The TTLs chosen for these tests were arbitrary numbers - some had 3 seconds,
some 5 seconds, and some 60 seconds. The actual number doesn't matter - it
is completely artificial (we move the clock with forward_jump_clocks() and
never really wait for that amount of time) and could very well be a million
seconds. But *low* numbers, like the 3 seconds, present a problem on extremely
overcommitted test machines. Our eventually() function already allows for
the possibility that things can hang for up to 8 seconds, but with a 3 second
TTL, we can find ourselves with data being expired and the test failing just
after 3 seconds of wall time have passed - while the test intended that the
data will expire only when we explicitly call forward_jump_clocks().
So this patch changes all the TTLs in this test to be the same high number -
100 seconds. This hopefully fixes #3918.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181115125607.22647-1-nyh@scylladb.com>
With the use of the Docker image, some extra options need to be exposed
to provide extended functionality when starting the image. The flags
added by this commit are:
- cluster-name: name of the Scylla cluster. cluster_name option in
scylla.yaml.
- rpc-address: IP address for client connections (CQL). rpc_address
option in scylla.yaml.
- endpoint-snitch: The snitch used to discover the cluster topology.
endpoint_snitch option in scylla.yaml.
- replace-address-first-boot: Replace a Scylla node by its IP.
replace_address_first_boot option in scylla.yaml.
Signed-off-by: Yannis Zarkadas <yanniszarkadas@gmail.com>
[ penberg@scylladb.com: fix up merge conflicts ]
Message-Id: <20181108234212.19969-2-yanniszarkadas@gmail.com>
test.py:26:1: F401 'signal' imported but unused
test.py:27:1: F401 'shlex' imported but unused
test.py:28:1: F401 'threading' imported but unused
test.py:173:1: E305 expected 2 blank lines after class or function definition,
found 1
test.py:181:34: E241 multiple spaces after ','
test.py:183:34: E241 multiple spaces after ','
test.py:209:24: E222 multiple spaces after operator
test.py:240:5: E301 expected 1 blank line, found 0
test.py:249:23: W504 line break after binary operator
test.py:254:9: E306 expected 1 blank line before a nested definition, found 0
test.py:263:13: F841 local variable 'out' is assigned to but never used
test.py:264:33: E128 continuation line under-indented for visual indent
test.py:265:33: E128 continuation line under-indented for visual indent
test.py:266:33: E128 continuation line under-indented for visual indent
test.py:274:64: F821 undefined name 'e'
test.py:278:53: F821 undefined name 'e'
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20181104115255.22547-1-ultrabug@gentoo.org>
gen_segmented_compress_params.py:52:47: E226 missing whitespace around
arithmetic operator
gen_segmented_compress_params.py:56:64: E226 missing whitespace around
arithmetic operator
gen_segmented_compress_params.py:60:36: E226 missing whitespace around
arithmetic operator
gen_segmented_compress_params.py:60:48: E226 missing whitespace around
arithmetic operator
gen_segmented_compress_params.py:70:35: E226 missing whitespace around
arithmetic operator
gen_segmented_compress_params.py:70:48: E226 missing whitespace around
arithmetic operator
gen_segmented_compress_params.py:99:43: E226 missing whitespace around
arithmetic operator
gen_segmented_compress_params.py:106:18: E225 missing whitespace around
operator
gen_segmented_compress_params.py:120:5: E303 too many blank lines (2)
gen_segmented_compress_params.py:200:30: E261 at least two spaces before
inline comment
gen_segmented_compress_params.py:200:31: E262 inline comment should start with
'# '
gen_segmented_compress_params.py:218:76: E261 at least two spaces before
inline comment
gen_segmented_compress_params.py:219:59: E703 statement ends with a semicolon
gen_segmented_compress_params.py:219:60: E261 at least two spaces before
inline comment
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20181104115753.4701-1-ultrabug@gentoo.org>
fix_system_distributed_tables.py:28:20: E203 whitespace before ':'
fix_system_distributed_tables.py:29:20: E203 whitespace before ':'
fix_system_distributed_tables.py:30:20: E203 whitespace before ':'
fix_system_distributed_tables.py:31:20: E203 whitespace before ':'
fix_system_distributed_tables.py:33:20: E203 whitespace before ':'
fix_system_distributed_tables.py:34:23: E203 whitespace before ':'
fix_system_distributed_tables.py:35:23: E203 whitespace before ':'
fix_system_distributed_tables.py:39:20: E203 whitespace before ':'
fix_system_distributed_tables.py:40:20: E203 whitespace before ':'
fix_system_distributed_tables.py:41:20: E203 whitespace before ':'
fix_system_distributed_tables.py:42:20: E203 whitespace before ':'
fix_system_distributed_tables.py:43:20: E203 whitespace before ':'
fix_system_distributed_tables.py:44:20: E203 whitespace before ':'
fix_system_distributed_tables.py:45:20: E203 whitespace before ':'
fix_system_distributed_tables.py:46:20: E203 whitespace before ':'
fix_system_distributed_tables.py:47:20: E203 whitespace before ':'
fix_system_distributed_tables.py:48:20: E203 whitespace before ':'
fix_system_distributed_tables.py:52:20: E203 whitespace before ':'
fix_system_distributed_tables.py:53:20: E203 whitespace before ':'
fix_system_distributed_tables.py:54:20: E203 whitespace before ':'
fix_system_distributed_tables.py:55:20: E203 whitespace before ':'
fix_system_distributed_tables.py:56:20: E203 whitespace before ':'
fix_system_distributed_tables.py:57:20: E203 whitespace before ':'
fix_system_distributed_tables.py:58:20: E203 whitespace before ':'
fix_system_distributed_tables.py:59:20: E203 whitespace before ':'
fix_system_distributed_tables.py:60:20: E203 whitespace before ':'
fix_system_distributed_tables.py:61:20: E203 whitespace before ':'
fix_system_distributed_tables.py:62:20: E203 whitespace before ':'
fix_system_distributed_tables.py:66:19: E203 whitespace before ':'
fix_system_distributed_tables.py:67:19: E203 whitespace before ':'
fix_system_distributed_tables.py:72:19: E203 whitespace before ':'
fix_system_distributed_tables.py:73:19: E203 whitespace before ':'
fix_system_distributed_tables.py:74:19: E203 whitespace before ':'
fix_system_distributed_tables.py:78:19: E203 whitespace before ':'
fix_system_distributed_tables.py:79:19: E203 whitespace before ':'
fix_system_distributed_tables.py:80:19: E203 whitespace before ':'
fix_system_distributed_tables.py:84:19: E203 whitespace before ':'
fix_system_distributed_tables.py:85:19: E203 whitespace before ':'
fix_system_distributed_tables.py:89:19: E203 whitespace before ':'
fix_system_distributed_tables.py:90:19: E203 whitespace before ':'
fix_system_distributed_tables.py:91:19: E203 whitespace before ':'
fix_system_distributed_tables.py:95:22: E203 whitespace before ':'
fix_system_distributed_tables.py:96:22: E203 whitespace before ':'
fix_system_distributed_tables.py:99:1: E302 expected 2 blank lines, found 0
fix_system_distributed_tables.py:103:72: E201 whitespace after '['
fix_system_distributed_tables.py:103:82: E202 whitespace before ']'
fix_system_distributed_tables.py:105:43: E201 whitespace after '['
fix_system_distributed_tables.py:105:53: E202 whitespace before ']'
fix_system_distributed_tables.py:111:16: E713 test for membership should be
'not in'
fix_system_distributed_tables.py:118:20: E713 test for membership should be
'not in'
fix_system_distributed_tables.py:135:25: E722 do not use bare 'except'
fix_system_distributed_tables.py:138:5: E722 do not use bare 'except'
fix_system_distributed_tables.py:144:1: E305 expected 2 blank lines after
class or function definition, found 0
fix_system_distributed_tables.py:145:47: E251 unexpected spaces around keyword
/ parameter equals
fix_system_distributed_tables.py:145:49: E251 unexpected spaces around keyword
/ parameter equals
fix_system_distributed_tables.py:160:1: W391 blank line at end of file
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20181104113001.22783-1-ultrabug@gentoo.org>
dist/docker/redhat/docker-entrypoint.py:20:1: E722 do not use bare 'except'
dist/docker/redhat/commandlineparser.py:13:13: E128 continuation line
under-indented for visual indent
Signed-off-by: Alexys Jacob <ultrabug@gentoo.org>
Message-Id: <20181104120134.9598-1-ultrabug@gentoo.org>
With the number of unit tests approaching one hundred, the output of
test.py becomes more challenging to read.
If some test fails, we will only get the details after all the tests
complete, but some tests take way longer than others.
With the coloured status, it is much simpler to immediately locate
failing tests. The developer can cancel the others and repeat the failing
ones.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <63a99a2fb70fdc33fd6eeb8e18fee977a47bd278.1541541184.git.vladimir@scylladb.com>
"
This series adds DEFAULT UNSET and DEFAULT NULL keyword support
to INSERT JSON statement, as stated in #3909.
Tests: unit (release)
"
* 'add_json_default_unset_2' of https://github.com/psarna/scylla:
tests: add DEFAULT UNSET case to JSON cql tests
tests: split JSON part of cql query test
cql3: add DEFAULT UNSET to INSERT JSON
When inserting a JSON, additional DEFAULT UNSET or DEFAULT NULL
keywords can be appended.
With DEFAULT UNSET, values omitted in JSON will not be changed
at all. With DEFAULT NULL (default), omitted values will be
treated as having a 'null' value.
Fixes #3909
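The two defaulting modes can be sketched in isolation. This is a toy model with hypothetical names (`row`, `apply_json_insert`), not Scylla's actual INSERT JSON code path:

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Illustrative model only: a column maps to a value, and std::nullopt
// models a CQL null (i.e. a deleted cell).
using row = std::map<std::string, std::optional<std::string>>;

enum class json_default { null_, unset };

// Apply an INSERT JSON: 'provided' holds only the columns present in the
// JSON document; 'defaults' decides the fate of the omitted columns.
void apply_json_insert(row& r, const row& provided,
                       const std::vector<std::string>& all_columns,
                       json_default defaults) {
    for (const auto& col : all_columns) {
        auto it = provided.find(col);
        if (it != provided.end()) {
            r[col] = it->second;        // column present in the JSON
        } else if (defaults == json_default::null_) {
            r[col] = std::nullopt;      // DEFAULT NULL: treat as null
        }                               // DEFAULT UNSET: leave untouched
    }
}
```

With DEFAULT UNSET a previously written cell survives an insert that omits it; with DEFAULT NULL it is overwritten with a null.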
"
Fix for #3897
"Ec2MultiRegionSnitch: prints a cryptic error when a Public IP is not
available"
Ec2MultiRegionSnitch naturally requires a Public IP to be available and
therefore it's expected to refuse to work without it.
However the error message that is printed today is a total disaster and
has to be fixed ASAP to be something much more human readable.
This series adds a human readable preamble that will let a poor user
understand what he/she should do.
"
* 'improve-ec2-multi-region-snitch-error-message-when-pulic-address-is-not-available-v2' of https://github.com/vladzcloudius/scylla:
locator: ec2_multi_region_snitch::start(): print a human readable error if Public IP may not be retrieved
locator: ec2_multi_region_snitch::start(): rework on top of seastar::thread
During streaming, there are cases when we should invoke the view write
path. In particular, if we're streaming because of repair or if a view
has not yet finished building and we're bootstrapping a new node.
The design constraints are:
1) The streamed writes should be visible to new writes, but the
sstable should not participate in compaction, or we would lose the
ability to exclude the streamed writes on a restart;
2) The streamed writes must not be considered when generating view
updates for them;
3) Resilient to node restarts;
4) Resilient to concurrent stream sessions, possibly streaming mutations for overlapping ranges.
We achieve this by writing the streamed writes to an sstable in a
different folder, call it "staging". We achieve 1) by publishing the
sstable to the column family sstable set, but excluding it from
compactions. We do these steps upon boot, by looking at the staging
directory, thus achieving 3).
Fixes #3275
* 'streaming_view_to_staging_sstables_9' of https://github.com/psarna/scylla: (29 commits)
tests: add materialized views test
tests: add view update generator to cql test env
main: add registering staging sstables read from disk
database: add a check if loaded sstable is already staging
database: add get_staging_sstable method
streaming: stream tables with views through staging sstables
streaming: add system distributed keyspace ref to streaming
streaming: add view update generator reference to streaming
main: add generating missed mv updates from staging sstables
storage_service: move initializing sys_dist_ks before bootstrap
db/view: add view_update_from_staging_generator service
db/view: add view updating consumer
table: add stream_view_replica_updates
table: split push_view_replica_updates
table: add as_mutation_source_excluding
table: move push_view_replica_updates to table.cc
database: add populating tables with staging sstables
database: add creating /staging directory for sstables
database: add sstable-excluding reader
table: add move_sstable_from_staging_in_thread function
...
Right now materialized_views_test.cc contains view updating tests,
but the intention is to move mv-related tests from cql_query_test
here and use it for all future unit testing of MV.
Staging sstables are loaded before regular ones. If the process
fails midway, an sstable can be linked both in the regular directory
and in the staging directory. In such cases, the sstable remains
in staging and will be moved to the regular directory
by view update streamer service.
This method can be used to check if an sstable is staging,
i.e. it shouldn't be compacted and it will not be used
for generating view updates from other staging tables,
and return proper shared_sstable pointer if it is.
While streaming to a table with paired views, staging sstables
are used. After an sstable is written to disk, it is used to generate
all required view updates. This is also resistant to restarts, as the
sstable is stored on disk in the staging/ directory.
Refs #3275
Bootstrapping process may need system distributed keyspace
to generate view updates, so initializing sys_dist_ks
is moved before the bootstrapping process is launched.
When generating view updates from a staging sstable, this sstable
should not be used in the process. Hence, a reader that skips a single
sstable is added.
After materialized view updates are generated, the sstable
should be moved from staging/ to a regular directory.
It's expected to be called from seastar::async thread context.
When moving sstables between directories, this helper function
will create links and update generation and dir accordingly.
It's expected to be called in thread context.
Staging sstables are not part of the compaction process, to ensure
that each sstable can be easily excluded from the view generation process
that depends on the mentioned sstable.
"
It appears that when there are any static columns in the serialization header,
Cassandra would write a (possibly empty) static row to every partition
in the SSTables file.
This patchset aligns Scylla's logic with that of Cassandra.
Note that Cassandra optimizes the case when no partition contains a static
row because it keeps track of updated columns, which Scylla currently does
not do - see #3901 for details.
Fixes #3900.
"
* 'projects/sstables-30/write-all-static-rows/v1' of https://github.com/argenet/scylla:
tests: Test writing empty static rows for partitions in tables with static columns.
sstables: Ignore empty static rows on reading.
sstables: Write empty static rows when there are static columns in the table.
MC format lacks ancestors metadata, so we need to workaround it by using
ancestors in metadata collector, which is only available for a sstable
written during this instance. It works fine here because we only want
to know if a sstable recently compacted has an ancestor which wasn't
yet deleted.
Fixes #3852.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <20181102154951.22950-1-raphaelsc@scylladb.com>
Public IP is required for Ec2MultiRegionSnitch. If it's not available,
a different snitch should be used.
This patch results in a readable error message being printed
instead of just a cryptic message with the HTTP response body.
Fixes#3897
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"
The update to libfmt 5.2.1 brought with it a subtle change - calls to
sprint("%s", 3) now throw a format_error instead of returning "3". To
prevent such hidden (or not so hidden) bugs from lurking, convert all calls
to the modern fmt syntax.
Such conversion has several benefits:
- prevent the bug from biting us
- as fmt is being standardized, we can later move to std::format()
- commonality with the logger format syntax (indeed, we may move the logger
to use libfmt itself)
During the conversion, some bugs were caught and fixed. These are presented in
individual patches in the patchset.
Most of the conversion was scripted, using https://github.com/avikivity/unsprint.
Some sprint() calls remain, as they were too complex for the script. They
will be converted later.
"
* tag 'fmt-1/v1' of https://github.com/avikivity/scylla:
toplevel: convert sprint() to format()
repair: convert sprint() to format()
tests: convert sprint() to format()
tracing: convert sprint() to format()
service: convert sprint() to format()
exceptions: convert sprint() to format()
index: convert sprint() to format()
streaming: convert sprint() to format()
streaming: progress_info: fix format string
api: convert sprint() to format()
dht: convert sprint() to format()
thrift: convert sprint() to format()
locator: convert sprint() to format()
gms: convert sprint() to format()
db: convert sprint() to format()
transport: convert sprint() to format()
utils: convert sprint() to format()
sstables: convert sprint() to format()
auth: convert sprint() to format()
cql3: convert sprint() to format()
row_cache: fix bad format string syntax
repair: fix bad format string syntax
tests: fix bad format string syntax
dht: fix bad format string syntax
sstables: fix bad format string syntax
utils: estimated_histogram: convert generated format strings to fmt
tests: perf_fast_forward: rename "format" variable
tests: perf_fast_forward: massage result of sprint() into std::string
utils: i_filter: rename "format" variable
system_keyspace: simplify complicated sprint()
cql: convert Cql.g sprint()s to fmt
types: get rid of PRId64 formatting
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
sprint() returns std::string(), but the new format() returns an sstring. Usually
an sstring is wanted but in this case an sstring will fail as it is added to
an std::string.
Fix the failure (after the sprint->format conversion) by converting to an std::string.
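A minimal sketch of the fix, using a hypothetical stand-in for seastar's sstring (the real class is far richer); appending it to an std::string needs an explicit conversion at the call site:

```cpp
#include <string>

// Hypothetical stand-in for seastar's string type, for illustration only.
struct sstring {
    std::string data;
    const char* c_str() const { return data.c_str(); }
};

// An sstring cannot be added to an std::string directly, so the fix
// inserts an explicit conversion.
std::string append_formatted(std::string prefix, const sstring& formatted) {
    return prefix + std::string(formatted.c_str());
}
```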
"
Use perftune.py for tuning disks:
- Distribute/pin disks' IRQs:
- For NVMe drives: evenly among all present CPUs.
- For non-NVMe drives: according to chosen tuning mode.
- For all disks used by scylla:
- Tune nomerges
- Tune I/O scheduler.
It's important to tune NIC and disks together in order to keep IRQ
pinning in the same mode.
Disks are detected and tuned based on the current content of the
/etc/scylla/scylla.yaml configuration file.
"
Fixes #3831.
* 'use_perftune_for_disks-v3' of https://github.com/vladzcloudius/scylla:
dist: change the sysconfig parameter name to reflect the new semantics
scylla_util.py::sysconfig_parser: introduce has_option()
dist: scylla_setup and scylla_sysconfig_setup: change parameter names to reflect new semantics
dist: don't distribute posix_net_conf.sh any more
dist: use perftune.py to tune disks and NIC
* seastar c1e0e5d...c02150e (5):
> prometheus: pass names as query parameter instead of part of the URL
> treewide: convert printf() style formatting to fmt
> print: add fmt_print()
> build: Remove experimental CMake support
> Merge "Correct and clean-up `signal_test`" from Jesse
It achieves a 2.0x speedup on Intel E5 and 1.1x to 2.5x speedups on
various arm64 microarchitectures.
The algorithm cuts data into blocks of 1024 bytes and calculates a crc
for each block. Each block is further divided into three subblocks of 336
bytes (42 uint64) each, plus 16 remaining bytes (2 uint64).
In each iteration, three independent crcs are calculated, one on a uint64
from each subblock. This greatly increases IPC (instructions per cycle).
After the subblocks are done, the three crcs and the remaining two uint64s
are combined using carry-less multiplication to reach the final result
for one block of 1024 bytes.
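The block layout (3 x 336 bytes + 16 bytes = 1024 bytes) can be sketched structurally. The toy function below replaces the real CRC math and the carry-less-multiplication merge with plain XOR, so it only shows the three-accumulator loop shape that raises IPC:

```cpp
#include <cstddef>
#include <cstdint>

// Structural toy only: the real patch computes three independent CRC32
// streams and merges them with carry-less multiplication; plain XOR
// stands in here, so this shows the loop shape, not the CRC math.
// Layout of one 1024-byte block (128 uint64 words):
//   three subblocks of 336 bytes (42 words) + a 16-byte tail (2 words).
uint64_t toy_block_checksum(const uint64_t* block) {
    const uint64_t* s1 = block;        // words [0, 42)
    const uint64_t* s2 = block + 42;   // words [42, 84)
    const uint64_t* s3 = block + 84;   // words [84, 126)
    uint64_t c1 = 0, c2 = 0, c3 = 0;
    for (std::size_t i = 0; i < 42; ++i) {
        // Three independent accumulators per iteration: no data dependency
        // between them, which is what raises instructions per cycle (IPC).
        c1 ^= s1[i];
        c2 ^= s2[i];
        c3 ^= s3[i];
    }
    // Fold in the 16-byte tail; the real code combines the three CRCs
    // here with carry-less multiplication instead of XOR.
    return c1 ^ c2 ^ c3 ^ block[126] ^ block[127];
}
```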
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Message-Id: <1541042759-24767-1-git-send-email-yibo.cai@arm.com>
We tune NIC and disks together now. Change the sysconfig parameter to
reflect this new semantics.
However, if we detect an old parameter name in scylla-server, we still
update it, thereby keeping support for old installations.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Change the name of the corresponding parameter (--setup-nic) to reflect
the fact that we now tune not just the NIC but the NIC and disks together.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Tune disks using perftune.py together with NIC.
This is needed because disk and NIC tuning has to be
performed using the same mode (for non-NVMe disks).
We tune disks based on the current content of /etc/scylla/scylla.yaml.
Don't use scylla-blocktune for optimizing disks' performance
any more.
Unify the decision to optimize the NIC and disks:
either optimize both or neither.
Disable disk tuning for DPDK and "virtio" modes for now.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This patchset addresses two issues with static rows support in SSTables
3.x ('mc' format):
1. Since collections are allowed in static rows, we need to check for
complex deletion, set corresponding flag and write tombstones, if any.
2. Column indices need to be partitioned for static columns the same way
they are partitioned for regular ones.
* github.com/argenet/scylla.git projects/sstables-30/columns-proper-order-followup/v1:
sstables: Partition static columns by atomicity when reading/writing
SSTables 3.x.
sstables: Use std::reference_wrapper<> instead of a helper structure.
sstables: Check for complex deletion when writing static rows.
tests: Add/fix comments to
test_write_interleaved_atomic_and_collection_columns.
tests: Add test covering interleaved atomic and collection cells in
static row.
It is possible to have collections in a static row so we need to check
for collection-wide tombstones like with clustering rows.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Collections are permitted in static rows, so the same partitioning as for
regular columns is required.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Current scylla.spec fails to build on Fedora 27, since python2-pystache is
a new package name, renamed on Fedora 28.
But Fedora 28's python2-pystache has the tag "Provides: pystache",
so we can depend on the old package name; this way we can build scylla.spec
on both Fedora 27 and 28.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181028175450.31156-1-syuu@scylladb.com>
When a node reshards (i.e., restarts with a different number of CPUs), and
is in the middle of building a view for a pre-existing table, the view
building needs to find the right token from which to start building on all
shards. We ran the same code on all shards, hoping they would all make
the same decision on which token to continue. But in some cases, one
shard might make the decision, start building, and make progress -
all before a second shard goes to make the decision, which will now
be different.
This resulted, in some rare cases, in the new materialized view missing
a few rows when the build was interrupted with a resharding.
The fix is to add the missing synchronization: All shards should make
the same decision on whether and how to reshard - and only then should
start building the view.
Fixes #3890
Fixes #3452
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181028140549.21200-1-nyh@scylladb.com>
"
In Cassandra, row columns are stored in a BTree that uses the following
ordering on them:
- all atomic columns go first, then all multi-cell ones
- within each kind (atomic and multi-cell), columns are
lexicographically ordered by name
Scylla needs to store columns and their respective indices using the
same ordering, and to use it when reading them back.
Fixes #3853
Tests: unit {release}
+
Checked that the following SSTables are dumped fine using Cassandra's
sstabledump:
cqlsh:sst3> CREATE TABLE atomic_and_collection3 ( pk int, ck int, rc1 text, rc2 list<text>, rc3 text, rc4 list<text>, rc5 text, rc6 list<text>, PRIMARY KEY (pk, ck)) WITH compression = {'sstable_compression': ''};
cqlsh:sst3> INSERT INTO atomic_and_collection3 (pk, ck, rc1, rc4, rc5) VALUES (0, 0, 'hello', ['beautiful','world'], 'here');
<< flush >>
sstabledump:
[
{
"partition" : {
"key" : [ "0" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 96,
"clustering" : [ 0 ],
"liveness_info" : { "tstamp" : "1540599270139464" },
"cells" : [
{ "name" : "rc1", "value" : "hello" },
{ "name" : "rc5", "value" : "here" },
{ "name" : "rc4", "deletion_info" : { "marked_deleted" : "1540599270139463", "local_delete_time" : "1540599270" } },
{ "name" : "rc4", "path" : [ "45e22cb0-d97d-11e8-9f07-000000000000" ], "value" : "beautiful" },
{ "name" : "rc4", "path" : [ "45e22cb1-d97d-11e8-9f07-000000000000" ], "value" : "world" }
]
}
]
}
]
"
* 'projects/sstables-30/columns-proper-order/v1' of https://github.com/argenet/scylla:
tests: Test interleaved atomic and multi-cell columns written to SSTables 3.x.
sstables: Re-order columns (atomic first, then collections) for SSTables 3.x.
sstables: Use a compound structure for storing information used for reading columns.
In Cassandra, row columns are stored in a BTree that uses the following
ordering on them:
- all atomic columns go first, then all multi-cell ones
- within each kind (atomic and multi-cell), columns are
lexicographically ordered by name
Since the schema already has all columns lexicographically sorted by name,
we only need to stably partition them by atomicity.
Fixes #3853
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
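The ordering rule can be sketched with std::stable_partition over a hypothetical column type (not Scylla's schema classes): since the input is already name-sorted, one stable pass by atomicity yields the required order:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical column type for illustration -- not Scylla's schema classes.
struct column {
    std::string name;
    bool atomic;   // false for multi-cell (collection) columns
};

// Input is assumed already sorted lexicographically by name (as the schema
// guarantees). A single stable partition then moves atomic columns to the
// front while preserving the name order inside each group.
void order_for_mc_format(std::vector<column>& columns) {
    std::stable_partition(columns.begin(), columns.end(),
                          [](const column& c) { return c.atomic; });
}
```

For instance, name-sorted columns {rc1 (atomic), rc2 (list), rc3 (atomic), rc4 (list)} come out as rc1, rc3, rc2, rc4.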
This representation makes it easier to operate with compound structures
instead of separate values that were stored in multiple containers.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Before this fix, write_missing_columns() helper would always deal with
regular columns even when writing static rows.
This would cause errors on reading those files.
Now, the missing columns are written correctly for regular and static
rows alike.
* github.com/argenet/scylla.git projects/sstables-30/fix-writing-static-missing-columns/v1:
schema: Add helper method returning the count of columns of specified
kind.
sstables: Honour the column kind when writing missing columns in 'mc'
format.
tests: Add test for a static row with missing columns (SStables 3.x.).
Previously, we've been writing the wrong missing columns indices for
static rows because write_missing_columns() explicitly used regular
columns internally.
Now, it takes the proper column kind into account.
Fixes #3892
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
* seastar d152f2d...c1e0e5d (6):
> scripts: perftune.py: properly merge parameters from the command line and the configuration file
> fmt: update to 5.2.1
> io_queue: only increment statistics when request is admitted
> Adds `read_first_line.cc` and `read_first_line.hh` to CMake.
> fstream: remove default extent allocation hint
> core/semaphore: Change the access of semaphore_units main ctor
Due to a compile-time fight between fmt and boost::multiprecision, a
lexical_cast was added to mediate.
sprint("%s", var) no longer accepts numeric values, so some sprint()s were
converted to format() calls. Since more may be lurking we'll need to remove
all sprint() calls.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
This patchset addresses two problems with shadowable deletions handling
in SSTables 3.x ('mc' format).
Firstly, we previously did not set a flag indicating the presence of
extended flags byte with HAS_SHADOWABLE_DELETION bitmask on writing.
This would break subsequent reading and cause all kinds of failures, up
to a crash.
Secondly, when reading rows with this extended flag set, we need to
preserve that information and create a shadowable_tombstone for the row.
Tests: unit {release}
+
Verified manually with 'hexdump' and using modified 'sstabledump' that
second (shadowable) tombstone is written for MV tables by Scylla.
+
DTest (materialized_views_test.py:TestMaterializedViews.hundred_mv_concurrent_test)
that originally failed due to this issue has successfully passed locally.
"
* 'projects/sstables-30/shadowable-deletion/v4' of https://github.com/argenet/scylla:
tests: Add tests writing both regular and shadowable tombstones to SSTables 3.x.
tests: Add test covering writing and reading a shadowable tombstone with SSTables 3.x.
sstables: Support Scylla-specific extension for writing shadowable tombstones.
sstables: Introduce a feature for shadowable tombstones in Scylla.db.
memtable: Track regular and shadowable tombstones separately in encoding_stats_collector.
sstables: Error out when reading SSTables 3.x with Cassandra shadowable deletion.
sstables: Support checking row extension flags for Cassandra shadowable deletion.
Even when we're using a full clustering range, need_skip() will return
true when we start a new partition and advance_context() will be
called with position_in_partition::before_all_clustered_rows(). We
should detect that there is no need to skip to that position before
the call to advance_to(*_current_partition_key), which will read the
index page.
Fixes #3868.
Message-Id: <1539881775-8578-1-git-send-email-tgrabiec@scylladb.com>
"
This patchset adds support generating .rpm/.deb from relocatable
package.
"
* 'reloc_rpmdeb_v5' of https://github.com/syuu1228/scylla:
configure.py: run create-relocatable-package.py everytime
configure.py: add SCYLLA-RELEASE-FILE/SCYLLA-VERSION-FILE targets
configure.py: use {mode} instead of $mode on scylla-package.tar.gz build target
dist/ami: build relocatable .rpm when --localrpm specified
dist/debian: use relocatable package to produce .deb
dist/redhat: use relocatable package to produce .rpm
install-dependencies.sh: add libsystemd as dependencies
install.sh: drop hardcoded distribution name, add --target option to specify distribution
build: add script to build relocatable package
build: compress relocatable package
build: add files on relocatable package to support generating .rpm/.deb
Right now we don't have dependencies for dist/, so ninja is not able to
detect changes under the directory.
To update the relocatable package even when the only change is under dist/,
we need to run create-relocatable-package.py every time.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Re-generate the scylla version files when they are removed, since these
files are required for the relocatable package.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Since the Debian packaging system requires the source package to be a
compressed tar file, let's use .gz compression.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
After the new in-memory representation of cells was introduced there was
a regression in atomic_cell_or_collection::operator<< which stopped
printing the content of the cell. This makes debugging more inconvenient
and time-consuming. This patch fixes the problem. The schema is propagated
to the atomic_cell_or_collection printer and the full content of the
cell is printed.
Fixes #3571.
Message-Id: <20181024095413.10736-1-pdziepak@scylladb.com>
Limit message size according to the configuration, to avoid a huge message from
allocating all of the server's memory.
We also need to limit memory used in aggregate by thrift, but that is left to
another patch.
Fixes #3878.
Message-Id: <20181024081042.13067-1-avi@scylladb.com>
Compaction mode fails if more than one shard is used because it doesn't
make sure sstables used as input for compaction only contain local keys.
Therefore, the sstable generated by compaction has fewer keys than expected
because non-local keys are purged out.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181022225153.12029-1-raphaelsc@scylladb.com>
It is very useful for investigating scylla issues, and we have
been moving those scripts manually when needed. Make it officially
part of the scylla package.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181023184400.23187-1-glauber@scylladb.com>
scyllatop uses a log file; if opening the file fails, the user should
get a clear error message, not an exception trace.
The same is true for connecting to scylla.
After this patch the following:
$ scyllatop.py -L /usr/lib/scyllatop.log
scyllatop failed opening log file: '/usr/lib/scyllatop.log' With an error: [Errno 13] Permission denied: '/usr/lib/scyllatop.log'
Fixes #3860
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20181021065525.22749-1-amnon@scylladb.com>
The original SSTables 'mc' format, as defined in Cassandra, does not provide
a way to store shadowable deletion in addition to regular row deletion
for materialized views.
It is essential to store it because of known corner-case issues that
otherwise appear.
For this to work, we introduce a Scylla-specific extended flag to be set
in SSTables in 'mc' format that indicates a shadowable tombstone is
written after the regular row tombstone.
This is deemed to be safe because shadowable tombstones are specific to
materialized views and MV tables are not supposed to be imported or
exported.
Note that a shadowable tombstone can be written without a regular
tombstone as well as along with it.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This is used to indicate that the SSTables being read may contain a
Scylla-specific HAS_SCYLLA_SHADOWABLE_TOMBSTONE extended flag set.
If the feature is not enabled, we should not honour this flag.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This flag can only be used in MV tables that are not supposed to be
imported to Scylla.
Since Scylla representation of shadowable tombstones differs from that
of Cassandra, such SSTables are rejected on read and Scylla never sets
this flag on writing.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"
Hinted handoff should not overpower regular flows like READs, WRITEs or
background activities like memtable flushes or compactions.
In order to achieve this, put its sending in the STREAMING CPU scheduling
group and its commitlog object into the STREAMING I/O scheduling group.
Fixes #3817
"
* 'hinted_handoff_scheduling_groups-v2' of https://github.com/vladzcloudius/scylla:
db::hints::manager: use "streaming" I/O scheduling class for reads
commitlog::read_log_file(): set the a read I/O priority class explicitly
db::hints::manager: add hints sender to the "streaming" CPU scheduling group
Commit 1d34ef38a8 "cql3: make pagers use
time_point instead of duration" has unintentionally altered the timeout
semantics for aggregate queries. Such requests fetch multiple pages before
sending a response to the client. Originally, each of those fetches had
a timeout-duration to finish, after the problematic commit the whole
request needs to complete in a single timeout-duration. This,
unsurprisingly, makes some queries that were successful before fail with
a timeout. This patch restores the original behaviour.
Fixes #3877.
Message-Id: <20181022125318.4384-1-pdziepak@scylladb.com>
This series attempts to make the fragments per second results reported by
perf_fast_forward more stable. That includes running each test case
multiple times and reporting the median, median absolute deviation, maximum
and minimum value. That should allow one to relatively easily assess how
repeatable the presented results are. Moreover, since perf_fast_forward
does IO operations it is important that they do not introduce any
excessive noise to the results. The location of the data directory
is made configurable so that the user can choose a less noisy disk or a
ramdisk.
* github.com/pdziepak/scylla.git stabilise-perf_fast_forward/v3:
tests/perf_fast_forward: make fragments/s measurements more stable
tests/perf_fast_forward: make data directory location configurable
network_topology_strategy test creates a ring with hundreds of tokens (and one
token per node). Then, for each token, it calls get_primary_ranges(), which in
turn walks the token ring. However, because each datacenter occupies a
disjoint token range, this walk practically has to walk the entire ring until
it collects enough endpoints for each datacenter. The whole thing takes 15 minutes.
Speed this up by randomizing the token<->dc relationship. This is more realistic,
and switches the algorithm to be O(token count), and now it completes in less
than a minute (still not great, but better).
Message-Id: <20181022154026.19618-1-avi@scylladb.com>
perf_fast_forward populates perf_fast_forward_output with some data and
then runs performance tests that read it. That makes the disk a
significant factor in the final result and may make the results less
repeatable. This patch adds a flag that allows setting the location
of the data directory so that the user can opt for a less noisy disk
or a ramdisk.
perf_fast_forward performs various operations, many of which involve
sstable reads, and verifies via the metrics that there weren't any
unnecessary IO operations. It also provides fragments per second
measurements for the tests it runs. However, since some of the tests are
very short and involve IO, those values vary a lot, which makes them not
very useful.
This commit attempts to stabilise those results. Each test case is run
multiple times (by default for a second, but at least 3 times) and shows
median, median absolute deviation, maximum and minimum value. This
should allow assessing whether the changes in the results are just noise
or a real regression or improvement.
Currently, restricting_mutation_reader::fill_buffer just reads the
lower-layer reader's fragments one by one without doing any further
transformations. This change swaps the parent-child buffers in a
single step, as suggested in #3604, and hence removes any possible
per-fragment overhead.
I couldn't find any test that exercises restricting_mutation_reader as
a mutation source, so I added test_restricted_reader_as_mutation_source
in mutation_reader_test.
Tests: unit (release), though these 4 tests are failing regardless of
my changes (they fail on master for me as well): snitch_reset_test,
sstable_mutation_test, sstable_test, sstable_3_x_test.
Fixes: #3604
Signed-off-by: George Kollias <georgioskollias@gmail.com>
Message-Id: <1540052861-621-1-git-send-email-georgioskollias@gmail.com>
"
This patchset fixes #3803. When a select statement with filtering
is executed and the column that is needed for the filtering is not
present in the select clause, rows that should have been filtered out
according to this column will still be present in the result set.
Tests:
1. The testcase from the issue.
2. Unit tests (release) including the
newly added test from this patchset.
"
* 'issues/3803/v10' of https://github.com/eliransin/scylla:
unit test: add test for filtering queries without the filtered column
cql3 unit test: add assertion for the number of serialized columns
cql3: ensure retrieval of columns for filtering
cql3: refactor find_idx to be part of statement restrictions object
cql3: add prefix size common functionality to all clustering restrictions
cql3: rename selection metadata manipulation functions
Test the use case where the column that the filtering operates on
is not a part of the select clause. The expected result is a set
containing the columns of the select clause with the additional
columns for filtering marked as non serializable.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
The result sets that the assertions are performed against
are result sets before serialization to the user and therefore
contain also columns that will not be serialized and sent as
the query's final result. The patch adds an assertion on the
number of columns that will be present in the final serialized
result.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
When a query that needs filtering is executed, the columns
that the coordinator is filtering by have to be retrieved. The
columns should be retrieved even if they are not used for
ordering or named in the actual select clause.
If the columns are missing from the result set, then any
filtering that restricts the missing column will not take
place.
Fixes #3803
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
find_idx calculates the index that will be used in the statement if
indexes are to be used. In the static form it requires redundant
information (the schema is already contained within the statement
restrictions object). In addition find_idx will need to be used for
filtering in order not to include redundant selectors in the selection
objects. This change refactors find_idx to run under the statement
restrictions object and changes its scope from private to public.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Up until now, knowing the prefix size, which is used to determine
whether filtering is needed, was implemented only for single-column
clustering restrictions. The patch adds a function to calculate the
prefix size for all types of clustering key restrictions given the
schema.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Every use of a tracing::global_trace_state_ptr object where a
tracing::trace_state_ptr would do, or a call to tracing::global_trace_state_ptr::get(),
creates a new tracing session (span) object.
This should never be done unless query handling moves to a different shard.
Fixes #3862
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20181018003500.10030-1-vladz@scylladb.com>
Previously we were making assumptions about missing columns
(the size of their values, whether they are collections or counters), but
these assumptions were not always true. Now we use the column type
from the serialization header to pick the right values.
Fixes #3859
* seastar-dev.git haaawk/projects/sstables-30/handling-dropped-columns/v4:
sstables 3: Correctly handle dropped columns in column_translation
sstables 3: Add test for dropped columns handling
"
Refs #3828
(Probably fixes it)
We found a few flaws in the way we enable hints replaying.
First of all, it was allowed before manager::start() is complete.
Then, since manager::start() is called after messaging_service is
initialized, there was a time window when hints are rejected, and this
creates an issue for MV.
Both issues above were found in the context of #3828.
This series fixes them both.
Tested {release}:
dtest: materialized_views_test.py:TestMaterializedViews.write_to_hinted_handoff_for_views_test
dtest: hintedhandoff_additional_test.py
"
* 'hinted_handoff_dont_create_hints_until_started-v1' of https://github.com/vladzcloudius/scylla:
hinted handoff: enable storing hints before starting messaging_service
db::hints::manager: add a "started" state
db::hints::manager: introduce a _state
get_ranges() is supposed to return ranges in sorted order. However, a35136533d
broke this and returned the range that was supposed to be last in the second
position (e.g. [0, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9]). This broke cleanup, which
relied on the sort order to perform a binary search. Other users of the
get_ranges() family did not rely on the sort order.
Fixes #3872.
Message-Id: <20181019113613.1895-1-avi@scylladb.com>
When messaging_service is started we may immediately receive a mutation
from another node (e.g. in the MV update context). If hinted handoff is not
ready to store hints at that point, we may fail some of the MV updates.
We are going to resolve this by start()ing hints::managers before we
start messaging_service and blocking hints replaying until all relevant
objects are initialized.
Refs #3828
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Hinting is allowed after "started" and before "stopping".
Any hint that we attempt to store outside this time frame is going to
be dropped.
Refs #3828
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Introduce a multi-bit state field. In this patch it replaces the _stopping
boolean. We are going to add more states in the following patches.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Previously we were making assumptions about missing columns
(the size of their values, whether they are collections or counters), but
these assumptions were not always true. Now we use the column type
from the serialization header to pick the right values.
Fixes #3859
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
In the past, the addition of non-serializable columns was used
only for post-ordering of result sets. The newly added ALLOW FILTERING
feature will need to use these functions for other post-processing operations,
i.e. filtering. The renaming accounts for the new and existing uses of the
function.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
* seastar 4669469...d152f2d (5):
> build: don't link with libgcc_s explicitly
> scheduling: add std::hash<seastar::scheduling_group>
> prometheus: Allow preemption between each metric
> Merge "improve memory detection in containers" from Juliana
> Merge "perf_tests: produce json reports" from Paweł
"
Hints are stored on disk by a hints::manager, ensuring they are
eventually sent. A hints::resource_manager ensures the hints::managers
it tracks don't consume more than their allocated resources by
monitoring disk space and disabling new hints if needed. This series
fixes some bugs related to the backlog calculation, but mainly exposes
the backlog through a hints::manager so upper layers can apply flow
control.
Refs #2538
"
* 'hh-manager-backlog/v3' of https://github.com/duarten/scylla:
db/hints/manager: Expose current backlog
db/hints/manager: Move decision about blocking hints to the manager
db/hints/resource_manager: Correctly account resources in space_watchdog
db/hints/resource_manager: Replace timer with seastar::thread
db/hints/resource_manager: Ensure managers are correctly registered
db/hints/resource_manager: Fix formatting
db/hints: Disallow moving or copying the managers
The space_watchdog enables or disables hints for the managers
associated with a particular device. We encapsulate this decision
inside the hints::managers by introducing the update_backlog()
function.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
A db::hints::resource_manager manages the resources for one or two
db::hints::managers. Each of these can be using the same or different
devices. The db::hints::space_watchdog periodically checks whether
each manager is within their resource allocation, and if not disables
it.
The watchdog iterates over the managers and accounts for the total
size they are using. This is wrong, since it can account in the same
variable the size consumed by managers using different devices.
We fix this while taking advantage of the fact that on_timer is now
called in the context of a seastar::thread, instead of using future
combinators.
Fixes #3821
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Registering a manager for a new device used
std::unordered_map::emplace(), which may not insert the specified
value if one with the same key has already been added. This could
happen if both managers were using the same device and the fiber
deferred in-between adding them.
Found during code reading. Could cause hints to not be disabled for an
overloaded manager.
Fixes #3822
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Disable the copy and move ctors and assignment operators for both the
hints::manager and the hints::resource_manager.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Without that, we don't know where to look for the problems.
Before:
compaction failed: sstables::malformed_sstable_exception (Too big ttl: 3163676957)
After:
compaction_manager - compaction failed: sstables::malformed_sstable_exception (Too big ttl: 4294967295 in sstable /var/lib/scylla/data/system_traces/events-8826e8e9e16a372887533bc1fc713c25/mc-832-big-Data.db)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181016181004.17838-1-glauber@scylladb.com>
On receiving a mutation_fragment or a mutation triggered by a streaming
operation, we pass an enum stream_reason to notify the receiver what
the streaming is used for, so the receiver can decide on further operations,
e.g., sending view updates, beyond applying the streamed data to disk.
Fixes #3276
Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com>
Handle the before_all_keys and after_all_keys token_kind
at the highest layer before calling into the virtual
i_partitioner::tri_compare that is not set up to handle these cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20181015165612.29356-1-bhalevy@scylladb.com>
Int types in json will be serialized to int types in C++. They will then
only be able to handle 4GB, and we tend to store more data than that.
Without this patch, listsnapshots is broken in all versions.
Fixes: #3845
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181012155902.7573-1-glauber@scylladb.com>
Introduce uppermost_bound() method instead of upper_bound() in mutation_fragment_filter and clustering_ranges_walker.
For now, this has been only used to produce the final range tombstone
for sliced reads inside consume_partition_end().
Usage of the upper bound of the current range causes problems of two
kinds:
1. If not all the slicing ranges have been traversed with the
clustering range walker, which is possible when the last read
mutation fragment was before some of the ranges and reading was limited
to a specific range of positions taken from index, the emitted range
tombstone will not cover the untraversed slices.
2. At the same time, if all ranges have been walked past, the end
bound is set to after_all_clustered_rows and the emitted RT may span
more data than it should.
To avoid both situations, the uppermost bound is used instead, which
refers to the upper bound of the last range in the sequence.
* github.com/scylladb/seastar-dev.git haaawk/projects/sstables-30/enable-mc-with-sstable-mutation-test/v2
sstables: Use uppermost_bound() instead of upper_bound() in
mutation_fragment_filter.
tests: Enable sstable_mutation_test for SSTables 'mc' format.
Rebased by Piotr J.
For now, this has been only used to produce the final range tombstone
for sliced reads inside consume_partition_end().
Usage of the upper bound of the current range causes problems of two
kinds:
1. If not all the slicing ranges have been traversed with the
clustering range walker, which is possible when the last read
mutation fragment was before some of the ranges and reading was limited
to a specific range of positions taken from index, the emitted range
tombstone will not cover the untraversed slices.
2. At the same time, if all ranges have been walked past, the end
bound is set to after_all_clustered_rows and the emitted RT may span
more data than it should.
To avoid both situations, the uppermost bound is used instead, which
refers to the upper bound of the last range in the sequence.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
* seastar-dev.git haaawk/sst3/test_clustering_slices/v8:
sstables: Extract on_end_of_stream from consume_partition_end
sstables: Don't call consume_range_tombstone_end in
consume_partition_end
sstables: Change the way fragments are returned from consumer
Split range tombstone (if present) on every consume_row_end call
and store both range tombstone and row in different fields called
_stored_row and _stored_tombstone instead of using single field
called _stored.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The new function will be called when the stream of data is finished
while the old consume_partition_end will be called when a partition
is finished but the stream is not done yet.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Readers for SST3 return a bit more precise range tombstones
when the reader is slicing. Namely, SST2 readers return whole
range tombstones that overlap with the slicing range, but SST3 readers
trim those range tombstones to the slicing range.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
There is a mismatch between row markers used in SSTables 2.x (ka/la) and
liveness_info used by SSTables 3.x (mc) in that a row marker can be
written as a deleted cell but liveness_info cannot.
To handle this, for a dead row marker the corresponding liveness_info is
written as expiring liveness_info with a fake TTL set to 1.
This approach is adapted from the solution for CASSANDRA-13395, which
hit a similar issue during SSTables upgrades.
* github.com/argenet/scylla.git projects/sstables-30/dead-row-marker/v7:
sstables: Introduce TTL limitation and special 'expired TTL' value.
sstables: Write dead row marker as expired liveness info.
tests: Add test covering dead row marker writing to SSTables 3.x.
'Consumer function' parameter for distribute_reader_and_consume_on_shards()
captures schema_ptr (which is a seastar::shared_ptr), but the function
is later copied to another shard, at which point schema_ptr is also copied
and its counter is incremented by the wrong shard. The capture is not
even used, so let's just drop it.
Fixes #3838
Message-Id: <20181011075500.GN14449@scylladb.com>
The single-range overload, when used by
make_multishard_streaming_reader(), has to create a reader that is
forwardable. Otherwise the multishard streaming reader will not produce
any output as it cannot fast-forward its shard readers to the ranges
produced by the generator.
Also add a unit test, that is based on the real-life purpose the
multishard streaming reader was designed for - serving partition
from a shard, according to a sharding configuration that is different
than the local one. This is also the scenario that found the bug in the
first place.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <bf799961bfd535882ede6a54cd6c4b6f92e4e1c1.1539235034.git.bdenes@scylladb.com>
Make sure that read I/O in the context of HH sending does not overpower I/O
in the context of queries, memtable flushes or compactions.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This allows distinguishing expired liveness info from yet-to-expire info
and converting it into a dead row marker on read.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This makes it possible to store expired liveness info in SSTables 3.x format
without introducing a possible conflict with real TTL values.
As in Cassandra, TTL cannot exceed 20 years, so taking the maximum value
as a special value for indicating expired liveness info is safe.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Fixes #3787
Message service streaming sink was created using a direct call to
rpc::client::make_sink. This in turn needs a new socket, which it
creates completely ignoring what underlying transport is active for the
client in question.
Fix by retaining the tls credentials pointer in the client wrapper, and
using it in a sink method to determine whether to create a new tls
socket, or just go ahead with a plain one.
Message-Id: <20181010003249.30526-1-calle@scylladb.com>
"
This series changes hinted handoff to work with `frozen_mutation`s
instead of naked `mutation`s. Instead of unfreezing a mutation from
the commitlog entry and then freezing it again for sending, now we'll
just keep the read, frozen mutation.
Tests: unit(release)
"
* 'hh-manager-cleanup/v1' of https://github.com/duarten/scylla:
db/hints/manager: Use frozen_mutation instead of mutation
db/hints/manager: Use database::find_schema()
db/commitlog/commitlog_entry: Allow moving the contained mutation
service/storage_proxy: send_to_endpoint overload accepting frozen_mutation
service/storage_proxy: Build a shared_mutation from a frozen_mutation
service/storage_proxy: Lift frozen_mutation_and_schema
service/storage_proxy: Allow non-const ranges in mutate_prepare()
A simple case for SI paging is added to secondary_index_test suite.
This commit should be followed by more complex testing
and serves as an example on how to extract paging state and use it
across CQL queries.
Message-Id: <b22bdb5da1ef8df399849a66ac6a1f377e6a650a.1539090350.git.sarna@scylladb.com>
write_stats is referenced from the write handler, which is already available
in send_to_live_endpoints. No need to pass it down.
Message-Id: <20181009133017.GA14449@scylladb.com>
Currently, when stopping a reader fails, no attempt is made to save it,
and it is left in the `_readers` array as-is. This can
lead to an assertion failure as the reader state will contain futures
that were already waited upon, and that the cleanup code will attempt to
wait on again. To prevent this, when stopping a reader fails, reset it
to nonexistent state, so that the cleanup code doesn't attempt to do
anything with it.
Refs: #3830
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <a1afc1d3d74f196b772e6c218999c57c15ca05be.1539088164.git.bdenes@scylladb.com>
It is useful to have this counter to investigate the reason for read
repairs. A non-zero value means that writes were lost after CL was reached
and RR is expected.
Message-Id: <20181009120900.GF22665@scylladb.com>
The pager::state() function returns a valid paging object even
if the pager itself is exhausted. It may also not contain the partition
key, so using it unconditionally was a bug - now, in case there is no
partition key present, paging state will contain an empty partition key.
Fixes #3829
Message-Id: <28401eb21ab8f12645c0a33d9e92ada9de83e96b.1539074813.git.sarna@scylladb.com>
This will be used by the `make_multishard_streaming_reader()` in the
next patch. This method will create a multishard combining reader which
needs its shard readers to take a single range, not a vector of ranges
like the existing overload.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <cc6f2c9a8cf2c42696ff756ed6cb7949b95fe986.1538470782.git.bdenes@scylladb.com>
It might take a long time for get_all_ranges_with_sources_for and
get_all_ranges_with_strict_sources_for to calculate, which causes reactor
stalls. To fix, run them in a thread and yield. Those functions are used in
the slow path, so it is ok to yield more than needed.
Fixes #3639
Message-Id: <63aa7794906ac020c9d9b2984e1351a8298a249b.1536135617.git.asias@scylladb.com>
A materialized view can provide a filter so as to pick up only a subset
of the rows from the base table. Usually, the filter operates on columns
from the base table's primary key. If we use a filter on regular (non-key)
columns, things get hairy, and as issue #3430 showed, wrong: merely updating
this column in the base table may require us to delete, or resurrect, the
view row. But normally we need to do the above when the "new view key column"
was updated, when there is one. We use shadowable tombstones with one
timestamp to do this, so it cannot take into account the two timestamps from
those two columns (the filtered column and the new key column).
So in the current code, filtering by a non-key column does not work correctly.
In this patch we provide two test cases (one involving TTLs, and one involving
only normal updates), which demonstrate vividly that it does *not* work
correctly. With normal updates, trying to resurrect a view row that has
previously disappeared, fails. With TTLs, things are even worse, and the view
row fails to disappear when the filtered column is TTLed.
In Cassandra, the same thing doesn't work correctly either (see
CASSANDRA-13798 and CASSANDRA-13832) so they decided to refuse creating
a materialized view filtering a non-key column. In this patch we also
do this - fail the creation of such an unsupported view. For this reason,
the two tests mentioned above are commented out in a "#if", with, instead,
a trivial test verifying a failure to create such a view.
Note that as explained above, when the filtered column and new view key
column are *different* we have a problem. But when they are the *same* - namely
we filter by a non-key base column which actually *is* a key in the view -
we are actually fine. This patch includes additional test cases verifying
that this case is really fine and provides correct results. Accordingly,
this case is *not* forbidden in the view creation code.
Fixes #3430.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181008185633.24616-1-nyh@scylladb.com>
"
This patchset fixes a bug in SSTables 3.x reading when fast-forwarding
is enabled. It is possible that a mutation fragment, row or RT marker,
is read and then stored because it falls outside the current
fast-forwarding range.
If the reader is further fast-forwarded but the
row still falls outside of it, the reader would still continue reading
and get the next fragment, if any, that would clobber the currently
stored one. With this fix, the reader does not attempt to read on
after storing the current fragment.
Tests: unit {release}
"
* 'projects/sstables-30/row-skipped-on-double-ff/v2' of https://github.com/argenet/scylla:
tests: Add test for reading rows after multiple fast-forwarding with SSTables 3.x.
sstables: mp_row_consumer_m to notify reader on end of stream when storing a mutation fragment.
sstables: In mp_row_consumer_m::push_mutation_fragments(), return the called helper's value.
Fixes #3798
Fixes #3694
Tests:
unit(release), dtest([new] cql_tests.py:TruncateTester.truncate_after_restart_test)
* tag 'fix-gossip-shard-replication-v1' of github.com:tgrabiec/scylla:
gms/gossiper: Replicate endpoint states in add_saved_endpoint()
gms/gossiper: Make reset_endpoint_state_map() have effect on all shards
gms/gossiper: Replicate STATUS change from mark_as_shutdown() to other shards
gms/gossiper: Always override states from older generations
writetime() or ttl() selections of frozen collections can work, as they
are single cells. Relax the check to allow them, and only forbid non-frozen
collections.
Fixes #3825.
Tests: cql_query_test (release).
Message-Id: <20181008123920.27575-1-avi@scylladb.com>
Uncomment existing declare() calls and implement tests. Because the
data_value(bytes) constructor is explicit, we add explicit conversion to
data_value in impl_min_function_for<> and impl_max_function_for<>.
Fixes #3824.
Message-Id: <20181008084127.11062-1-avi@scylladb.com>
We found that on some Debian environments the Ubuntu .deb build fails with
a gpg error because of a missing Ubuntu GPG key, so we need to install it
before starting pbuilder.
Likewise, on Ubuntu we need to install the Debian GPG key.
Fixes #3823
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20181008072246.13305-1-syuu@scylladb.com>
Instead of unfreezing a mutation from the commitlog and then freezing
it again to send, just keep the read frozen mutation.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Instead of using find_column_family() and repeatedly asking for
column_family::schema(), use database::find_schema() instead.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Add an overload to send_to_endpoint() which accepts a frozen_mutation.
The motivation is to allow better accounting of pending view updates,
but this change also allows some callers to avoid unfreezing already
frozen mutations.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Lift frozen_mutation_and_schema to frozen_mutation.hh, since other
subsystems using frozen_mutations will likely want to pass it around
together with the schema.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
This patchset implements ALTER TABLE ADD/DROP for multiple columns.
Fixes: #2907
Fixes: #3691
Tests: schema_change_test
"
* 'projects/cql3/alter-table-multi/v3' of https://github.com/bhalevy/scylla:
cql3: schema_change_test: add test_multiple_columns_add_and_drop
cql3: allow adding or dropping multiple columns in ALTER TABLE statement
cql3: alter_table_statement: extract add/alter/drop per-column code into functions
cql3: testing for MVs for alter_table_statement::type::drop is not per column
cql3: schema_change_test: add test_static_column_is_dropped
So that we don't attempt to send mutations to unreachable endpoints and
instead store a hint for them, we now check the endpoint status and
populate dead_endpoints accordingly in
storage_proxy::send_to_endpoint().
Fixes #3820
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181007100640.2182-1-duarte@scylladb.com>
Since no column can be dropped from a table with materialized views,
the respective exception can omit the dropped column name.
In preparation for refactoring the respective code, moving the per-column
code to member functions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Test dropping of static column defined in CREATE TABLE, and
adding and dropping of a static column using ALTER TABLE.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Without it, the reader will attempt to read further and may clobber the
stored fragment with the next one read, if any.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Unpaged queries are those for which the client didn't enable paging,
and we already account for them in
indexed_table_select_statement::do_execute().
Remove the second increment in read_posting_list().
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181003121811.11750-1-duarte@scylladb.com>
We had two commented out tests based on Cassandra's MV unit tests, for
the case that the view's filter (the "SELECT" clause used to define the
view) filtered by a non-primary-key column. These tests used to fail
because of problems we had in the filtering code, but they now succeed,
so we can enable them. This patch also adds some comments about what
the tests do, and adds a few more cases to one of the tests.
Refs #3430.
However, note that the success of these tests does not really prove that
the non-PK-column filtering feature works fully correctly; there is a known
issue with such filtering, as explained in
https://issues.apache.org/jira/browse/CASSANDRA-13798. We can probably
fix this feature with our "virtual cells" mechanism, but will need to add
a test to confirm the possible problem and its (probably needed) fix.
We do not add such a test in this patch.
In the meantime, issue #3430 should remain open: we still *allow* users
to create MV with such a filter, and, as the tests in this patch show,
this "mostly" works correctly. We just need to prove and/or fix what happens
with the complex row liveness issues a la issue #3362.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20181004213637.32330-1-nyh@scylladb.com>
Lack of this may result in non-zero shards on some nodes still seeing
STATUS as NORMAL for a node which shut down, in some cases.
mark_as_shutdown() is invoked in reaction to an RPC call initiated by
the node which is shutting down. Another way a node can learn about
other node shutting down is via gossiping with a node which knows
this. In that case, the states will be replicated to non-zero
shards. The node which learnt via mark_as_shutdown() may also
eventually propagate this to non-zero shards, e.g. when it gossips
about it with other nodes, if its local version number at the time of
mark_as_shutdown() was smaller than the one used to set the STATE by
the shutting-down node.
Application states of each node are versioned per-node with a pair of
generation number (more significant) and value version. Generation
number uniquely identifies the lifetime of a scylla
process and changes after restart. Value versions start
from 0 on each restart. When a node gets updates for application
states, it merges them with its view on given node. Value updates with
older versions are ignored.
Gossiper processes updates only on shard 0, and replicates value
updates to other shards. When it sees a value with a new generation,
it correctly forgets all previous values. However, non-zero shards
don't forget values from previous generations. As a result,
replication will fail to override the values on non-zero shards when
generation number changes until their value version exceeds the
version prior to the restart.
This will result in incorrect STATUS for non-seed nodes on non-zero
shards. When restarting a non-seed node, it will do a shadow gossip
round before setting its STATUS to NORMAL. In the shadow round it will
learn from other nodes about itself, and set its STATUS to shutdown on
all shards with a high value version. Later, when it sets its status
to NORMAL, it will override it only on shard 0, because on other
shards the version of STATUS is higher.
This will cause CQL truncate to skip the current node if the coordinator
runs on non-zero shards.
The fix is to override the entries on remote shards in the same way we
do on shard 0. All updates to endpoint states should be already
serialized on shard 0, and remote shards should see them in the same
order.
Introduced in 2d5fb9d
Fixes #3798
Fixes #3694
The Antlr3 exception class has a null dereference bug that crashes
the system when trying to extract the exception message using
ANTLR_Exception<...>::displayRecognitionError(...) function. When
a parsing error occurs, the CqlParser throws an exception which is in
turn processed for some special cases in scylla to generate a custom
message. The default case, however, creates the message using
displayRecognitionError, causing the system to crash.
The fix is a simple workaround, making sure the pointer is not null
before the call to the function. A "proper" fix can't be implemented
because the exception class itself is implemented outside scylla,
in antlr headers that reside on the host machine OS.
Tested manually with 2 test cases: a typo that caused scylla to crash, and
a cql comment without a newline at the end, which also caused a crash.
Ran unit tests (release).
Fixes #3740
Fixes #3764
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Message-Id: <cfc7e0d758d7a855d113bb7c8191b0fd7d2e8921.1538566542.git.eliransin@scylladb.com>
shutdown_announce_in_ms specifies a period of time that a node which
is shutting down waits to allow its state to propagate to other nodes.
However, we were setting _enabled to false before waiting, which
will make the current node ignore gossip messages.
Message-Id: <1538576996-26283-1-git-send-email-tgrabiec@scylladb.com>
The as_json_function class is not registered as a function, but we can
still keep it cql3/functions, as per its namespace, to reduce the size
of select_statement.cc.
Message-Id: <20181002132637.30233-1-penberg@scylladb.com>
"
This patchset enables very simple column type conversions.
It covers only handling of variable- and fixed-size type differences.
The two types still have to be compatible at the bit level to be able to convert a field from one to the other.
"
* 'haaawk/sst3/column_type_schema_change/v4' of github.com:scylladb/seastar-dev:
Fix check_multi_schema to actually check the column type change
Handle very basic column type conversions in SST3
Enable check_multi_schema for SST3
Field 'e' was supposed to be read as blob but the test had a bug
and the read schema was treating that field as int. This patch
changes that and makes the test really check column type change.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
After this change very simple schema changes of column type
will work. This change makes sure that variable size and fixed
size types can be converted to each other but only if their bit
representation can be automatically converted between those types.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We are already maintaining a statistic of the number of pending view updates
sent but not yet completed by view replicas, so let's expose it.
As with all per-table statistics, this one will only be exposed if the
"--enable-keyspace-column-family-metrics" option is on.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
mutate_MV usually calls send_to_endpoint() to push view update to remote
view replicas. This function gets passed a statistics object,
service::storage_proxy_stats::write_stats and, in particular, updates
its "writes" statistic which counts the number of ongoing writes.
In the case that the paired view replica happens to be the *same* node,
we avoid calling send_to_endpoint() and call mutate_locally() instead.
That function does not take a write_stats object, so the "writes" statistic
doesn't get incremented for the duration of the write. So we should do
this explicitly.
Co-authored-by: Nadav Har'El <nyh@scylladb.com>
Co-authored-by: Duarte Nunes <duarte@scylladb.com>
Currently we diff schemas based on table/view name, and if the names
match, then we detect altered schemas by comparing the schema
mutations. This fails to detect transitions which involve dropping and
recreating a schema with the same name, if a node receives these
notifications simultaneously (for example, if the node was temporarily
down or partitioned).
Note that because the ID is persisted and created when executing a
create_table_statement, then even if a schema is re-created with the
exact same structure as before, we will still consider it altered
because the mutations will differ.
This also stops schema pulling from working, since it relies on schema
merging.
The solution is to diff schemas using their ID, and not their name.
Keyspaces and user types are also susceptible to this, but in their
case it's fine: these are values with no identity, and are just
metadata. Dropping and recreating a keyspace can be viewed as dropping
all tables from the keyspace, altering it, and eventually adding new
tables to the keyspace.
Note that this solution doesn't apply to tables dropped and created
with the same ID (using the `WITH ID = {}` syntax). For that, we would
need to detect deltas instead of applying changes and then reading the
new state to find differences. However, this solution is enough,
because tables are usually created with ID = {} for very specific,
peculiar reasons. The original motivation was for the new table to
be treated exactly as the old, so the current behavior is in fact the
desired one.
Tests: unit(release), dtests(schema_test, schema_management_test)
Fixes#3797
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181001230932.47153-2-duarte@scylladb.com>
The linker uses an opt-in system for non-executable stack: if all object files
opt into a non-executable stack, the binary will have a non-executable stack,
which is very desirable for security. The compiler cooperates by opting into
a non-executable stack whenever possible (always for our code).
However, we also have an assembly file (for fast power crc32 computations).
Since it doesn't opt into a non-executable stack, we get a binary with
executable stack, which Gentoo's build system rightly complains about.
Fix by adding the correct incantation to the file.
Fixes#3799.
Reported-by: Alexys Jacob <ultrabug@gmail.com>
Message-Id: <20181002151251.26383-1-avi@scylladb.com>
* seastar 5712816...71e914e (12):
> Merge "rpc shard to shard connection" from Gleb
> Merge "Fix memory leaks when stoppping memcached" from Tomasz
> scripts: perftune.py: prioritize I/O schedulers
> alien: fix the size of local item[]
> seastar-addr2line: don't invoke addr2line multiple times
> reactor: use labels for different io_priority_class:s
> util/spinlock: fix bad namespacing of <xmmintrin.h>
> Merge "scripts: perftune.py: support different I/O schedulers" from Vlad
> timer: Do not require callback to be copyable
> core/reactor: Fix hang on shutdown with long task quota
> build: use 'ppa:scylladb/ppa' instead of URL for sourceline
> net/dns: add net::dns::get_srv_records() helper
We allow tables to be created with the ID property, mostly for
advanced recovery cases. However, we need to validate that the ID
doesn't match an existing one. We currently do this in
database::add_column_family(), but this is already too late in the
normal workflow: if we allow the schema change to go through, then
it is applied to the system tables and loaded the next time the node
boots, regardless of us throwing from database::add_column_family().
To fix this, we perform this validation when announcing a new table.
Note that the check wasn't removed from database::add_column_family();
it's there since 2015 and there might have been other reasons to add
it that are not related to the ID property.
Refs #2059
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181001230142.46743-1-duarte@scylladb.com>
To be used by sstable_writer for stats collection.
Note that this patch is factored out so it can be verified with no
other change in functionality.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2018-10-01 13:01:00 +03:00
**Note**: Compiling Scylla requires, conservatively, 2 GB of memory per native thread, and up to 3 GB per native thread while linking.
**Note**: Compiling Scylla requires, conservatively, 2 GB of memory per native
thread, and up to 3 GB per native thread while linking. GCC >= 8.1.1 is
required.
Scylla is built with [Ninja](https://ninja-build.org/), a low-level rule-based system. A Python script, `configure.py`, generates a Ninja file (`build.ninja`) based on configuration options.
@@ -43,9 +45,7 @@ The full suite of options for project configuration is available via
$ ./configure.py --help
```
The most important options are:
-`--mode={release,debug,all}`: Debug mode enables [AddressSanitizer](https://github.com/google/sanitizers/wiki/AddressSanitizer) and allows for debugging with tools like GDB. Debugging builds are generally slower and generate much larger object files than release builds.
The most important option is:
-`--{enable,disable}-dpdk`: [DPDK](http://dpdk.org/) is a set of libraries and drivers for fast packet processing. During development, it's not necessary to enable support even if it is supported by your platform.
@@ -57,6 +57,29 @@ To save time -- for instance, to avoid compiling all unit tests -- you can also
Unit tests live in the `/tests` directory. Like with application source files, test sources and executables are specified manually in `configure.py` and need to be updated when changes are made.
@@ -83,7 +106,7 @@ The `-c1 -m1G` arguments limit this Seastar-based test to a single system thread
### Preparing patches
All changes to Scylla are submitted as patches to the public mailing list. Once a patch is approved by one of the maintainers of the project, it is committed to the maintainers' copy of the repository at https://github.com/scylladb/scylla.
All changes to Scylla are submitted as patches to the public [mailing list](mailto:scylladb-dev@googlegroups.com). Once a patch is approved by one of the maintainers of the project, it is committed to the maintainers' copy of the repository at https://github.com/scylladb/scylla.
Detailed instructions for formatting patches for the mailing list and advice on preparing good patches are available at the [ScyllaDB website](http://docs.scylladb.com/contribute/). There are also some guidelines that can help you make the patch review process smoother:
@@ -112,6 +135,8 @@ The usual is "Tests: unit (release)", although running debug tests is encouraged
5. When answering review comments, prefer inline quotes as they make it easier to track the conversation across multiple e-mails.
6. The Linux kernel's [Submitting Patches](https://www.kernel.org/doc/html/v4.19/process/submitting-patches.html) document offers excellent advice on how to prepare patches and patchsets for review. Since the Scylla development process is derived from the kernel's, almost all of the advice there is directly applicable.
### Finding a person to review and merge your patches
You can use the `scripts/find-maintainer` script to find a subsystem maintainer and/or reviewer for your patches. The script accepts a filename in the git source tree as an argument and outputs a list of subsystems the file belongs to and their respective maintainers and reviewers. For example, if you changed the `cql3/statements/create_view_statement.hh` file, run the script as follows:
@@ -164,6 +189,29 @@ On a development machine, one might run Scylla as
To interact with scylla it is recommended to build our versions of
cqlsh and nodetool. They are available at
https://github.com/scylladb/scylla-tools-java and can be built with
```bash
$ ./install-dependencies.sh
$ ant jar
```
cqlsh should work out of the box, but nodetool depends on a running
scylla-jmx (https://github.com/scylladb/scylla-jmx). It can be built
with
```bash
$ mvn package
```
and must be started with
```bash
$ ./scripts/scylla-jmx
```
### Branches and tags
Multiple release branches are maintained on the Git repository at https://github.com/scylladb/scylla. Release 1.5, for instance, is tracked on the `branch-1.5` branch.
@@ -254,7 +302,7 @@ In this example, `10.0.0.2` will be sent up to 16 jobs and the local machine wil
When a compilation is in progress, the status of jobs on all remote machines can be visualized in the terminal with `distccmon-text` or graphically as a GTK application with `distccmon-gnome`.
One thing to keep in mind is that linking object files happens on the coordinating machine, which can be a bottleneck. See the next section on speeding up this process.
One thing to keep in mind is that linking object files happens on the coordinating machine, which can be a bottleneck. See the next sections on speeding up this process.
### Using the `gold` linker
@@ -264,6 +312,24 @@ Linking Scylla can be slow. The gold linker can replace GNU ld and often speeds
$ sudo alternatives --config ld
```
### Using split dwarf
With debug info enabled, most of the link time is spent copying and
relocating it. It is possible to leave most of the debug info out of
the link by writing it to a side .dwo file. This is done by passing
`-gsplit-dwarf` to gcc.
Unfortunately just `-gsplit-dwarf` would slow down `gdb` startup. To
avoid that the gold linker can be told to create an index with
`--gdb-index`.
More info at https://gcc.gnu.org/wiki/DebugFission.
Both options can be enabled by passing `--split-dwarf` to configure.py.
Note that distcc is *not* compatible with it, but icecream
(https://github.com/icecc/icecream) is.
### Testing changes in Seastar with Scylla
Sometimes Scylla development is closely tied with a feature being developed in Seastar. It can be useful to compile Scylla with a particular check-out of Seastar.
@@ -277,3 +343,8 @@ $ git remote add local /home/tsmith/src/seastar
throw exceptions::invalid_request_exception(sprint("Invalid list literal for %s: bind variables are not supported inside collection literals", *receiver->name));
throw exceptions::invalid_request_exception(format("Invalid list literal for {}: bind variables are not supported inside collection literals", *receiver->name));
throw exceptions::invalid_request_exception(sprint("Invalid map literal for %s: bind variables are not supported inside collection literals", *receiver->name));
throw exceptions::invalid_request_exception(format("Invalid map literal for {}: bind variables are not supported inside collection literals", *receiver->name));
throw exceptions::invalid_request_exception(sprint("Invalid map literal for %s: key %s is not of type %s", *receiver.name, *entry.first, *key_spec->type->as_cql3_type()));
throw exceptions::invalid_request_exception(format("Invalid map literal for {}: key {} is not of type {}", *receiver.name, *entry.first, key_spec->type->as_cql3_type()));
throw exceptions::invalid_request_exception(sprint("Invalid map literal for %s: value %s is not of type %s", *receiver.name, *entry.second, *value_spec->type->as_cql3_type()));
throw exceptions::invalid_request_exception(format("Invalid map literal for {}: value {} is not of type {}", *receiver.name, *entry.second, value_spec->type->as_cql3_type()));
check_false(previous_position == -1, "Clustering columns may not be skipped in multi-column relations. "
"They should appear in the PRIMARY KEY order. Got %s", to_string());
throw exceptions::invalid_request_exception(sprint("Clustering columns must appear in the PRIMARY KEY order in multi-column relations: %s", to_string()));
throw exceptions::invalid_request_exception(format("Clustering columns must appear in the PRIMARY KEY order in multi-column relations: {}", to_string()));
}
names.emplace_back(&def);
previous_position = def.position();