Spotted during code review.
If it doesn't defer, we may execute then_wrapped() body before we
change the state. Fix by moving then_wrapped() body after state changes.
(cherry picked from commit 443e5aef5a)
The problem was that "s" would not be marked as synced-with if it came from
shard != 0.
As a result, mutation using that schema would fail to apply with an exception:
"attempted to mutate using not synced schema of ..."
The problem could surface when altering schema without changing
columns and restarting one of the nodes so that it forgets past
versions.
Fixes#1258.
Will be covered by dtest:
SchemaManagementTest.test_prepared_statements_work_after_node_restart_after_altering_schema_without_changing_columns
(cherry picked from commit 8703136a4f)
Currently we only do that when column set changes. When prepared
statements are executed, paramaters like read repair chance are read
from schema version stored in the statement. Not invalidating prepared
statements on changes of such parameters will appear as if alter took
no effect.
Fixes#1255.
Message-Id: <1462985495-9767-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 13d8cd0ae9)
(cherry picked from commit 734cfa949a)
There is a problem in the implementation of leveled compaction strategy that
prevents level 1 from being compacted into level 2, and so forth. As a result,
all sstables will only belong to either level 0 or 1. One of the consequences
is level 1 being overwhelmed by a huge amount of sstables.
The root of the problem is a conditional statement in the code that prevents a
single sstable, with level > 0, from being compacted into a subsequent level
that is empty or has no overlapping sstables.
Fixes#1180.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <9a4bffdb0368dea77b49c23687015ff5832299ab.1460508373.git.raphaelsc@scylladb.com>
(cherry picked from commit c7b728e716)
Our current throttling code releases one requests per 1MB of memory available
that we have. If we are below the memory limit, but not by 1MB or more, then
we will keep getting to unthrottle, but never really do anything.
If another memtable is close to the flushing point, those requests may be
exactly the ones that would make it flush. Without them, we'll freeze the
database.
In general, we need to always release at least one request to make sure that
progress is always achieved.
This fixes#1144
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 9c87ae3496)
This is usually not a problem for the main memtable list - although it can be,
depending on settings, but shows up easily for the streaming memtables list.
We would like to have at least two memtables, even if we have to cut it short.
If we don't do that, one memtable will have use all available memory and we'll
force throttling until the memtable gets totally flushed.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 2c5dfe08c1)
throttle_state is currently a nested member of database, but there is no
particular reason - aside from the fact that it is currently only ever
referenced by the database for us to do so.
We'll soon want to have some interaction between this and the column family, to
allow us to flush during throttle. To make that easier, let's unnest it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 1daede7396)
This is a preparation patch so we can move the throttling infrastructure inside
the memtable_list. To do that, the region group will have to be passed to the
throttler so let's just go ahead and store it.
In consequence of that, all that the CF has to tell us is what is the current
schema - no longer how to create a new memtable.
Also, with a new parameter to be passed to the memtable_list the creation code
gets quite big and hard to follow. So let's move the creation functions to a
helper.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 39def369ce)
After commit a843aea547, a gate was introduced to make sure that
an asynchronous operation is finished before column family is
destroyed. A sstable testcase was not stopping column family,
instead it just removed column family from compaction manager.
That could cause an user-after-free if column family is destroyed
while the asynchronous operation is running. Let's fix it by
stopping column family in the test.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ed910ec459c1752148099e6dc503e7f3adee54da.1461177411.git.raphaelsc@scylladb.com>
(cherry picked from commit eb51c93a5a)
"This patchset is a backport of the atomic sstable deletion patchset, which
waits until all shards agree to delete an sstable set before deleting it,
avoiding the resurrecting data problem.
The first four patches are identical to master, the last patch is new.
Fixes#1181"
Since seastar is limited to 128 cross-shard calls per shard-pair,
long-duration smp calls can lead to deadlocks.
Prevent such calls by returning immediately from shard 0 (which manages
the deletions), and calling back to the requesting shard when the deletion
completes.
If sstables A, B are compacted, A and B must be deleted atomically.
Otherwise, if A has data that is covered by a tombstone in B, and that
tombstone is deleted, and if B is deleted while A is not, then the data
in A is resurrected.
Fixes#1181.
(cherry picked from commit a843aea547)
A shared sstable must be compacted by all shards before it can be deleted.
Since we're stoping, that's not going to happen. Cancel those pending
deletions to let anyone waiting on them to continue.
(cherry picked from commit e43dbac836)
When we compact a set of sstables, we have to remove the set atomically,
otherwise we can resurrect data if the following happens:
insert data to sstable A
insert tombstone to sstable B
compact A+B -> C (removing both data and tombstone)
delete B only
read data from A
Since an sstable may be shared by multiple shard, and each shard performs
compaction at a different time, we need to defer deletion of an sstable
set until all shards agree that the set can be deleted.
An additional atomicity issue exists because posix does not provide a way
to atomically delete multiple files. This issue is not addressed by this
patch.
(cherry picked from commit 2ba584db8d)
Broken by f15c380a4f.
This resulted in empty collection being returned in the results
instead of no collection.
Fixes org.apache.cassandra.cql3.validation.entities.CollectionsTest
from cassandra-unit-tests.
(cherry picked from commit c69d0a8e87)
This series contains 1.0 backports of the following series:
* Commit 9b98278 ("Merge "Be able to boot without a Summary" from Glauber")
* Commit 60352f8 ("Merge "Fixes for the reading of missing Summary" from Glauber")
The backport was done by Glauber because the original commits don't work
as-is due to I/O error handling differences in master and 1.0.
Fixes#1170
When we recreate the summary from a missing Summary, we should make
sure it is generated sanely, and that it resembles the Summary that
would have otherwise been there.
In this tests we'll grab one of the Summary tests we've been doing,
and just apply them to the non-existent Summary file. We expect
the same results on those cases. Plus, a new test is added with some
sanity checking.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Now that we can boot without a Summary file, we can just as easily boot
with a broken one.
Suggested by Nadav, and it is actually very easy to do, so do it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Spotted by Avi post-merge
1) Need to close the file
2) Should be using the parameter pc instead of the default_class
1.0 backport: general_disk_error is non-existent. Replace it with just
propagating the exception
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This shouldn't be a problem in practice, because if read_toc() fails,
the users will just tend to discard the sstable object altogether, and
not insist on using it.
However, if somebody does try to keep using it, a subsequent read_toc() could
theoretically have some components filled up leading the new reader to believe
the toc was populated successfully.
It is easier to just clear the _components set and never worry about it, than
trying to reason about whether or not that could happen.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are cases in which a Summary file will not be present, and imported
SSTables will have just the Index and Data files. In earlier versions of
Cassandra, a Summary didn't exist, so one may not be generated when migrating.
In Issue #1170, we can see an example of tables generated by CQLSSTableWriter,
and they lack a Summary. Cassandra is robust against this and can cope
perfectly with the Summary not existing. I will argue that we should do the
same.
1.0 backport: open_checked_file_dma -> open_file_dma
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We do that by bailing immediately if we detect that the components
map is already populated. This allow us to call read_toc() earlier
if we need to - for instance, to inquire about the existence of the
Summary - without the need to re-read the components again later.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
for prepare_summary we can just pass the min interval as a parameter and
avoid having the schema do yet another hop. For sealing the summary, it
is completely unused and we can do away with it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This is done so we can use other consumers. An example of that, is regeneration
of the Summary from an existing Index.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Because just creating an SSTable object does not generate any I/O,
get_sstable_key_range should be an instance method. The main advantage
of doing that is that we won't have to read the summary twice. The way
we're doing it currently, if happens to be a shard-relevant table we'll
call load() - which reads the summary again.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are times in which we read the Summary file twice. That actually happens
every time during normal boot (it doesn't during refresh). First during
get_sstable_key_range and then again during load().
Every summary will have at least one entry, so we can easily test for whether
or not this is properly initialized.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Using leveled compaction strategy, only a few sstables will contain a
given key, so we need to filter out the rest. Using the summary entries
to filter keys works if the key is before the first summary entry,
but does not work if it is after the last summary entry, because the last
summary entry does not represent the last key; so sstables that are
are towards the beginning of the ring are read even if they do not contain
the key, greatly reducing read performance.
Fix by consulting the summary's first_key/last_key entries before consulting
the summary entry array.
(cherry picked from commit 715794cce6)
* seastar aa281bd...0225940 (10):
> memory: avoid exercising the reclaimers for oversized requests
> memory: fix live objects counter underflow due to cross-cpu free
> core/reactor: Don't abort in allocate_aligned_buffer() on allocation failure
> scripts/posix_net_conf.sh: added a support for bonding interfaces
> scripts/posix_net_conf.sh: move the NIC configuration code into a separate function
> scripts/posix_net_conf.sh: implement the logic for selecting default MQ mode
> scripts/posix_net_conf.sh: forward the interface name as a parameter
> http/routes: Remove request failure logging to stderr
> lowres_clock: Initialize _now when the clock is created
> apps/iotune: fix broken URL
Reproduced by dtest paging_test.py:TestPagingData.static_columns_paging_test.
Broken by f15c380a4f, where the
calcualtion of has_ck_selector got broken, in such a way that present
clustering restrictions were treated as if not present, which resulted
in static row being returned when it shouldn't.
While at it, unify the check between query_compacted() and
do_compact() by extracting it to a function.
(cherry picked from commit c2b955d40b)
The first erase_and_dispose(), which removes rows between last
position and beginning of the next range, can invalidate end()
iterator of the range. Fix by looking up end after erasing.
mutation_partition::range() was split into lower_bound() and
upper_bound() to allow for that.
This affects for example queries with descending order where the
selected clustering range is empty and falls before all rows.
Exposed by f15c380a4f, which is now
calling do_compact() during query.
Reproduced by dtest paging_test.py:TestPagingData.static_columns_paging_test
(cherry picked from commit a1539fed95)
Currently data query digest includes cells and tombstones which may have
expired or be covered by higher-level tombstones. This causes digest
mismatch between replicas if some elements are compacted on one of the
nodes and not on others. This mismatch triggers read-repair which doesn't
resolve because mutations received by mutation queries are not differing,
they are compacted already.
The fix adds compacting step before writing and digesting query results by
reusing the algorithm used by mutation query. This is not the most optimal
way to fix this. The compaction step could be folded with the query writing,
there is redundancy in both steps. However such change carries more risk,
and thus was postponed.
perf_simple_query test (cassandra-stress-like partitions) shows regression
from 83k to 77k (7%) ops/s.
Fixes#1165.
(cherry picked from commit f15c380a4f)
With big rows I see contention in XFS allocations which cause reactor
thread to sleep. Commitlog is a main offender, so enlarge extent to
commitlog segment size for big files (commitlog and sstable Data files).
Message-Id: <20160404110952.GP20957@scylladb.com>
(cherry picked from commit 70575699e4)
Until recently, we believed that range tombstones we read from sstables will
always be for entire rows (or more generalized clustering-key prefixes),
not for arbitrary ranges. But as we found out, because Cassandra insists
that range tombstones do not overlap, it may take two overlapping row
tombstones and convert them into three range tombstones which look like
general ranges (see the patch for a more detailed example).
Not only do we need to accept such "split" range tombstones, we also need
to convert them back to our internal representation which, in the above
example, involves two overlapping tombstones. This is what this patch does.
This patch also contains a test for this case: We created in Cassandra
an sstable with two overlapping deletions, and verify that when we read
it to Scylla, we get these two overlapping deletions - despite the
sstable file actually having contained three non-overlapping tombstones.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <b7c07466074bf0db6457323af8622bb5210bb86a.1459399004.git.glauber@scylladb.com>
(cherry picked from commit 99ecda3c96)
This is a rewrite of Glauber's earlier patch to do the same thing, taking
into account Avi's comments (do not use a class, do not throw from the
constructor, etc.). I also verified that the actual use case which was
broken in #1136 was fixed by this patch.
Currently, we have no support for range tombstones because CQL will not
generate them as of version 2.x. Thrift will, but we can safely leave this for
the future.
However, we have seen cases during a real migration in which a pure-CQL
Cassandra would generate range tombstones in its SSTables.
Although we are not sure how and why, those range tombstones were of a special
kind: their end and next's start range were adjacent, which means that in
reality, they could very well have been written as a single range tombstone for
an entire clustering key - which we support just fine.
This code will attempt to fix this problem temporarily by merging such ranges
if possible. Care must be taken so that we don't end up accepting a true
generic range tombstone by accident.
Fixes#1136
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1459333972-20345-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit 0fc9a5ee4d)
As Nadav noticed in his bug report, check_marker is creating its error messages
using characters instead of numbers - which is what we intended here in the
first place.
That happens because sprint(), when faced with an 8-byte type, interprets this
as a character. To avoid that we'll use uint16_t types, taking care not to
sign-extend them.
The bug also noted that one of the error messages is missing a parameter, and
that is also fixed.
Fixes#1122
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <74f825bbff8488ffeb1911e626db51eed88629b1.1459266115.git.glauber@scylladb.com>
(cherry picked from commit 23808ba184)
While Seastar in general can accept any parameter for its I/O queues, Scylla
in particular shouldn't run with them disabled. Such will be the status when
the max-io-requests parameter is not enabled.
On top of that, we would like to have enough depth per I/O queue not to allow
for shard-local parallelism. Therefore, we will require a minimum per-queue
capacity of 4. In machines where the disk iodepth is not enough to allow for 4
concurrent requests per shard, one should reduce the number of I/O queues.
For --max-io-requests, we will check the parameter itself. However, the
--num-io-queues parameter is not mandatory, and given enough concurrent
requests, Seastar's default configuration can very well just be doing the right
thing. So for that, we will check the final result of each I/O queue.
As it is the case with other checks of the sorts, this can be overridden by
the --developer-mode switch.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <63bf7e91ac10c95810351815bb8f5e94d75592a5.1458836000.git.glauber@scylladb.com>
(cherry picked from commit e750a94300)
After 4e52b41a4, remove_by_toc_name() became aware of temporary TOC
files, however, it doesn't consider that some components may be
missing if temporary TOC is present.
When creating a new sstable, the first thing we do is to write all
components into temporary TOC, so content of a temporary TOC isn't
reliable until it is renamed.
Solution is about implementing the following flow (described by Avi):
"Flow should be:
- remove all components in parallel
- forgive ENOENT, since the compoent may not have been written;
otherwise deletion error should be raised
- fsync the directory
- delete the temporary TOC
"
This problem can be reproduced by running compaction without disk
space, so compaction would fail and leave a partial sstable that would
be marked for deletion. Afterwards, remove_by_toc_name() would try to
delete a component that doesn't exist because it looked at the content
of temporary TOC.
Fixes#1095.
Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com>
Message-Id: <0cfcaacb43cc5bad3a8a7ea6c1fa6f325c5de97d.1459194263.git.raphaelsc@scylladb.com>
(cherry picked from commit d515a7fd85)
We had a problem reading certain existing Cassandra sstables into
Scylla.
Our consume_range_tombstone() function assumes that the start and end
columns have a certain "end of component" markers, and want to verify
that assumption. But because of bugs in older versions of Cassandra,
see https://issues.apache.org/jira/browse/CASSANDRA-7593, sometimes the
"end of component" was missing (set to 0). CASSANDRA-7593 suggested
this problem might exist on the start column, so we allowed for that,
but now we discovered a case where also the end column is set to 0 -
causing the test in consume_range_tombstone() to fail and the sstable
read to fail - causing Scylla to no be able to import that sstable from
Cassandra. Allowing for an 0 also on the end column made it possible
to read that sstable, compact it, and so on.
Fixes#1125.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1459173964-23242-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit a05577ca41)
Problem found by dtest which loads sstables with generation 1 and 2 into an
empty column family. The root of the problem is that reshuffle procedure
changes new sstables to start from generation 2 at least. So reshuffle could
try to set generation 1 to 2 when generation 2 exists.
This problem can be fixed by starting from generation 1 instead, so reshuffle
would handle this case properly.
Fixes#1099.
Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com>
Message-Id: <88c51fbda9557a506ad99395aeb0a91cd550ede4.1458917237.git.raphaelsc@scylladb.com>
(cherry picked from commit e6e5999282)
Fixes#797
To make sure an inopportune crash after truncate does not leave
sstables on disk to be considered live, and thus resurrect data,
after a truncate, use delete function that renames the TOC file to
make sure we've marked sstables as dead on disk when we finish
this discard call.
Message-Id: <1458575440-505-2-git-send-email-calle@scylladb.com>
Rebase to 1.0:
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Note: "normal" remove_by_toc_name must now be prepared for and check
if the TOC of the sstable is already moved to temp file when we
get to the juicy delete parts.
Message-Id: <1458575440-505-1-git-send-email-calle@scylladb.com>
For the rebase to 1.0:
Signed-off-by: Glauber Costa <glauber@scylladb.com>
A STREAM_MUTATION_DONE message will signal the receiver that the sender
has completed the sending of streams mutations. When the receiver finds
it has zero task to send and zero task to receive, it will finish the
stream_session, and in turn finish the stream_plan if all the
stream_sessions are finished. We should call receive_task_completed only
after the flush finishes so that when stream_plan is finshed all the
data is on disk.
Fixes repair_disjoint_data_test issue with Glauber's "[PATCH v4 0/9] Make
sure repairs do not cripple incoming load" serries
======================================================================
FAIL: repair_disjoint_data_test
(repair_additional_test.RepairAdditionalTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "scylla-dtest/repair_additional_test.py",
line 102, in repair_disjoint_data_test
self.check_rows_on_node(node1, 3000)
File "scylla-dtest/repair_additional_test.py",
line 33, in check_rows_on_node
self.assertEqual(len(result), rows, len(result))
AssertionError: 2461
(cherry picked from commit c2eff7e824)
The repair code as it is right now is a bit convoluted: it resorts to detached
continuations + do_for_each when calling sync_ranges, and deals with the
problem of excessive parallelism by employing a semaphore inside that range.
Still, even by doing that, we still generate a great number of
checksum requests because the ranges themselves are processed in parallel.
It would be better to have a single-semaphore to limit the overall parallelism
for all requests.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit f49e965d78)
Theoretically, because we can have a lot of pending streaming memtables, we can
have the database start throttling and incoming connections slowing down during
streaming.
Turns out this is actually a very easy condition to trigger. That is basically
because the other side of the wire in this case is quite efficient in sending
us work. This situation is alleviated a bit by reducing parallelism, but not
only it does't go away completely, once we have the tools to start increasing
parallelism again it will become common place.
The solution for this is to limit the streaming memtables to a fraction of the
total allowed dirty memory. Using the nesting capability built in in the LSA
regions, we will make the streaming region group a child of the main region
group. With that, we can throttle streaming requests separately, while at the
same time being able to control the total amount of dirty memory as well.
Because of the property, it can still be the case that incoming requests will
throttle earlier due to streaming - unless we allow for more dirty memory to be
used during repairs - but at least that effect will be limited.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 34a9fc106f)
The repair process will potentially send ranges containing few mutations,
definitely not enough to fill a memtable. It wants to know whether or not each
of those ranges individually succeeded or failed, so we need a future for each.
Small memtables being flushed are bad, and we would like to write bigger
memtables so we can better utilize our disks.
One of the ways to fix that, is changing the repair itself to send more
mutations at a single batch. But relying on that is a bad idea for two reasons:
First, the goals of the SSTable writer and the repair sender are at odds. The
SSTable writer wants to write as few SSTables as possible, while the repair
sender wants to break down the range in pieces as small as it can and checksum
them individually, so it doesn't have to send a lot of mutations for no reason.
Second, even if the repair process wants to process larger ranges at once, some
ranges themselves may be small. So while most ranges would be large, we would
still have potentially some fairly small SSTables lying around.
The best course of action in this case is to coalesce the incoming streams
write-side. repair can now choose whatever strategy - small or big ranges - it
wants, resting assure that the incoming memtables will be coalesced together.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 455d5a57d2)
Keeping the mutations coming from the streaming process as mutations like any
other have a number of advantages - and that's why we do it.
However, this makes it impossible for Seastar's I/O scheduler to differentiate
between incoming requests from clients, and those who are arriving from peers
in the streaming process.
As a result, if the streaming mutations consume a significant fraction of the
total mutations, and we happen to be using the disk at its limits, we are in no
position to provide any guarantees - defeating the whole purpose of the
scheduler.
To implement that, we'll keep a separate set of memtables that will contain
only streaming mutations. We don't have to do it this way, but doing so
makes life a lot easier. In particular, to write an SSTable, our API requires
(because the filter requires), that a good estimate on the number of partitions
is informed in advance. The partitions also need to be sorted.
We could write mutations directly to disk, but the above conditions couldn't be
met without significant effort. In particular, because mutations can be
arriving from multiple peer nodes, we can't really sort them without keeping a
staging area anyway.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 5fa866223d)
Streaming has currently one class, that can be used to contain the read
operations being generated by the streaming process. Those reads come from two
places:
- checksums (if doing repair)
- reading mutations to be sent over the wire.
Depending on the amount of data we're dealing with, that can generate a
significant chunk of data, with seconds worth of backlog, and if we need to
have the incoming writes intertwined with those reads, those can take a long
time.
Even if one node is only acting as a receiver, it may still read a lot for the
checksums - if we're talking about repairs, those are coming from the
checksums.
However, in more complicated failure scenarios, it is not hard to imagine a
node that will be both sending and receiving a lot of data.
The best way to guarantee progress on both fronts, is to put both kinds of
operations into different classes.
This patch introduces a new write class, and rename the old read class so it
can have a more meaningful name.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 10c8ca6ace)
The column family still has to teach the memtable list how to allocate a new memtable,
since it uses CF parameters to do so.
After that, the memtable_list's constructor takes a seal and a create function and is complete.
The copy constructor can now go, since there are no users left.
The behavior of keeping a reference to the underlying memtables can also go, since we can now
guarantee that nobody is keeping references to it (it is not even a shared pointer anymore).
Individual memtables are, and users may be keeping references to them individually.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 635bb942b2)
Each list can have a different active memtable. The column family method keeps
existing, since the two separate sets of memtable are just an implementation
detail to deal with the problem of streaming QoS: *the* active memtable keeps
being the one from the main list.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 6ba95d450f)
memtable_list is currently just an alias for a vector of memtables. Let's move
them to a class on its own, exporting the relevant methods to keep user code
unchanged as much as possible.
This will help us keeping separate lists of memtables.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit af6c7a5192)
scylla_io_seup requires the scylla-server env to be setup to run
correctly. previously scylla_io_setup was encapsulated in
scylla-io.service that assured this.
extracting CPUSET,SMP from SCYLLA_ARGS as CPUSET is needed for invoking
io_tune
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <d49af9cb54ae327c38e451ff76fe0322e64a5f00.1458747527.git.shlomi@scylladb.com>
(cherry picked from commit 6a18634f9f)
Vlad and I were working on finding the root of the problems with
refresh. We found that refresh was deleting existing sstable files
because of a bug in a function that was supposed to return the maximum
generation of a column family.
The intention of this function is to get generation from last element
of column_family::_sstables, which is of type std::map.
However, we were incorrectly using std::map::end() to get last element,
so garbage was being read instead of maximum generation.
If the garbage value is lower than the minimum generation of a column
family, then reshuffle_sstables() would set generation of all existing
sstables to a lower value. That would confuse our mechanism used to
delete sstables because sstables loaded at boot stage were touched.
Solution to this problem is about using rbegin() instead of end() to
get last element from column_family::_sstables.
The other problem is that refresh will only load generations that are
larger than or equal to X, so new sstables with lower generation will
not be loaded. Solution is about creating a set with generation of
live SSTables from all shards, and using this set to determine whether
a generation is new or not.
The last change was about providing an unused generation to reshuffle
procedure by adding one to the maximum generation. That's important to
prevent reshuffle from touching an existing SSTable.
Tested 'refresh' under the following scenarios:
1) Existing generations: 1, 2, 3, 4. New ones: 5, 6.
2) Existing generations: 3, 4, 5, 6. New ones: 1, 2.
3) Existing generations: 1, 2, 3, 4. New ones: 7, 8.
4) No existing generation. No new generation.
5) No existing generation. New ones: 1, 2.
I also had to adapt existing testcase for reshuffle procedure.
Fixes#1073.
Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com>
Message-Id: <1c7b8b7f94163d5cd00d90247598dd7d26442e70.1458694985.git.raphaelsc@scylladb.com>
(cherry picked from commit 370b1336fe)
On scylla_setup interactive mode we are using lsblk to list up candidate
block devices for RAID, and -p option is to print full device paths.
Since Ubuntu 14.04LTS version of lsblk doesn't supported this option, we
need to use non-full path name and complete paths before passes it to
scylla_raid_setup.
Fixes#1030
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1458325411-9870-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 6edd909b00)
Since upstart does not have same behavior as systemd, we need to run scylla_io_setup and scylla_ami_setup in scylla-server.conf's pre-start stanza.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
(cherry picked from commit 7828023599)
--local-pkg and --unstable arguments didn't handled on Ubuntu, support it.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
(cherry picked from commit 93bf7bff8e)
"apt-get -y install mdadm" shows up a dialog to select install mode of postfix, this will block scylla-ami-setup.service forever since it is running as background task, we need to prevent it.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
(cherry picked from commit 0c83b34d0c)
This introduces Ubuntu AMI.
Both CentOS AMI and Ubuntu AMI are need to build on same distribution, so build_ami.sh script automatically detect current distribution, and selects base AMI image.
Fixes#998
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
(cherry picked from commit b097ed6d75)
* dist/ami/files/scylla-ami 84bcd0d...89e7436 (3):
> Merge "iotune packaging fix for scylla-ami" from Takuya
> Ubuntu AMI support on scylla_install_ami
> scylla_ami_setup is not POSIX sh compatible, change shebang to /bin/bash
Currently we execute all statements in parallel, but some statements
depend on order, in particular list append/prepend. Fix by executing
sequentially.
Fixes cql_additional_tests.py:TestCQL.batch_and_list_test dtest.
Fixes#1075.
Message-Id: <1458672874-4749-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 5f44afa311)
Fixes the following assertion failure:
row_cache_alloc_stress: tests/row_cache_alloc_stress.cc:120: main(int, char**)::<lambda()>::<lambda()>: Assertion `mt->occupancy().used_space() < memory::stats().free_memory()' failed.
memory::stats()::free_memory() may be much lower than the actual
amount of reclaimable memory in the system since LSA zones will try to
keep a lot of free segments to themselves. Fix by using actual amount
of reclaimable memory in the check.
(cherry picked from commit a4e3adfbec)
The test injects allocation failures at every allocation site during
apply(). Only allocations throug allocation_strategy are instrumented,
but currently those should include all allocations in the apply() path.
The target and source mutations are randomized.
(cherry picked from commit 2fbb55929d)
The problem was that verify_row() was returning a future which was not
waited on. Fix by running the code in a thread.
(cherry picked from commit 19b3df9f0f)
We cannot leave partially applied mutation behind when the write
fails. It may fail if memory allocation fails in the middle of
apply(). This for example would violate write atomicity, readers
should either see the whole write or none at all.
This fix makes apply() revert partially applied data upon failure, by
the means of ReversiblyMergeable concept. In a nut shell the idea is
to store old state in the source mutation as we apply it and swap back
in case of exception. At cell level this swapping is inexpensive, just
rewiring pointers. For this to work, the source mutation needs to be
brought into mutable form, so frozen mutations need to be unfrozen. In
practice this doesn't increase amount of cell allocations in the
memtable apply path because incoming data will usually be newer and we
will have to copy it into LSA anyway. There are extra allocations
though for the data structures which holds cells.
I didn't see significant change in performance of:
build/release/tests/perf/perf_simple_query -c1 -m1G --write --duration 13
The score fluctuates around ~77k ops/s.
Fixes#283.
(cherry picked from commit dc290f0af7)
Currently only "set" storage could store empty cells, but not the
"vector" one because there empty cell has the meaning of being
missing. To implement rolback, we need to be able to distinguish empty
cells from missing ones. Solve by making vector storage use a bitmap
for presence checking instead of emptiness. This adds 4 bytes to
vector storage.
(cherry picked from commit d5e66a5b0d)
It is needed for noexcept destruction, which we need for exception
safety in higher layers.
According to [1], erase() only throws if key comparison throws, and in
our case it doesn't.
[1] http://en.cppreference.com/w/cpp/container/unordered_map/erase
(cherry picked from commit 22d193ba9f)
* seastar 6a207e1...9f2b868 (10):
> memory: set free memory to non-zero value in debug mode
> Merge "Increase IOTune's robustness by including a timeout" from Glauber
> shared_future: add companion class, shared_promise
> rpc: fix client connection stopping
> semaphore: allow wait() and signal() after broken()
> run reactor::stop() only once
> sharded: fix start with reference parameter
> core: add asserts to rwlock
> util/defer: Fix cancel() not being respected
> tcp: Do not return accept until the connection is connected
We did the clean up in idl/gossip_digest.idl.hh, but the patch to clean
up gms/application_state.hh was never merged.
To maintain compatibility with previous version of scylla, we can not
change application_state.hh, instead change idl to be sync with
application_state.hh.
Message-Id: <3a78b159d5cb60bc65b354d323d163ce8528b36d.1458557948.git.asias@scylladb.com>
(cherry picked from commit 39992dd559)
Take a reference of messaging_service object inside
send_message_timeout_and_retry to make sure it is not freed during the
life time of send_message_timeout_and_retry operation.
(cherry picked from commit b8abd88841)
Messaging service stop() method calls stop() on all clients. If
remove_rpc_client_one() is called while those stops are running
client::stop() will be called twice which not suppose to happen. Fix it
by ignoring client remove request during messaging service shutdown.
Fixes#1059
Message-Id: <1458639452-29388-2-git-send-email-gleb@scylladb.com>
(cherry picked from commit 357c91a076)
The version number ordering rules are different for rpm and deb. Use
tilde ('~') for the latter to ensure a release candidate is ordered
_before_ a final version.
Message-Id: <1458627524-23030-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit ae33e9fe76)
Commit 6a3872b355 fixed some use-after-free
bugs but introduced a new one because of a typo:
Instead of capturing a reference to the long-living io-class object, as
all the code does, one place in the code accidentally captured a *copy*
of this object. This copy had a very temporary life, and when a reference
to that *copy* was passed to sstable reading code which assumed that it
lives at least as long as the read call, a use-after-free resulted.
Fixes#1072
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1458595629-9314-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit 2eb0627665)
Defer registering services to the API server until commitlog has been
replayed to ensure that nobody is able to trigger sstable operations via
'nodetool' before we are ready for them.
Message-Id: <1458116227-4671-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 972fc6e014)
Fix the validation error message to look like this:
Scylla version 666.development-20160316.49af399 starting ...
WARN 2016-03-17 12:24:15,137 [shard 0] config - Option partitioner is not (yet) used.
WARN 2016-03-17 12:24:15,138 [shard 0] init - NOFILE rlimit too low (recommended setting 200000, minimum setting 10000; you may run out of file descriptors.
ERROR 2016-03-17 12:24:15,138 [shard 0] init - Bad configuration: invalid 'listen_address': eth0: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> > (Invalid argument)
Exiting on unhandled exception of type 'bad_configuration_error': std::exception
Instead of:
Exiting on unhandled exception of type 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> >': Invalid argument
Fixes#1051.
Message-Id: <1458210329-4488-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 69dacf9063)
When NR_CPU >= 8, we disabled cpu0 for AMI on scylla_sysconfig_setup.
But scylla_io_setup doesn't know that, try to assign NR_CPU queues, then scylla fails to start because queues > cpus.
So on this fix scylla_io_setup checks sysconfig settings, if '--smp <n>' specified on SCYLLA_ARGS, use n to limit queue size.
Also, when instance type is not supported pre-configured parameters, we need to passes --cpuset parameters to iotune. Otherwise iotune will run on a different set of CPUs, which may have different performance characteristics.
Fixes#996, #1043, #1046
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1458221762-10595-2-git-send-email-syuu@scylladb.com>
(cherry picked from commit 4cc589872d)
_closed_occupancy will be used when a region is removed from its region
group, make sure that it is accurate.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
(cherry picked from commit 338fd34770)
For this verb(), we don't call get_session - and it doesn't look like we will.
We currently have no debug message for this one, which makes it harder to debug
the stream of messages. Print it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit a3ebf640c6)
Whenever we call get_session, that will print a debug message about the arrival
of this new verb. Because we also print that explicitly in PREPARE_DONE, that
message gets duplicated.
That confuses poor developers who are, for a while, left wondering why is it that
the sender is sender the message twice.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 0ab4275893)
Our sstables::mutation_reader has a specialization in which start and end
ranges are passed as futures. That is needed because we may have to read the
index file for those.
This works well under the assumption that every time a mutation_reader will be
created it will be used, since whoever is using it will surely keep the state
of the reader alive.
However, that assumption is no longer true - for a while. We use a reader
interface for reading everything from mutations and sstables to cache entries,
and when we create an sstable mutation_reader, that does not mean we'll use it.
In fact we won't, if the read can be serviced first by a higher level entity.
If that happens to be the case, the reader will be destructed. However, since
it may take more time than that for the start and end futures to resolve, by
the time they are resolved the state of the mutation reader will no longer be
valid.
The proposed fix for that is to only resolve the future inside
mutation_reader's read() function. If that function is called, we can have a
reasonable expectation that the caller object is being kept alive.
A second way to fix this would be to force the mutation reader to be kept alive
by transforming it into a shared pointer and acquiring a reference to itself.
However, because the reader may turn out not to be used, the delayed read
actually has the advantage of not even reading anything from the disk if there
is no need for it.
Also, because sstables can be compacted, we can't guarantee that the sst object
itself , used in the resolution of start and end can be alive and that has the
same problem. If we delay the calling of those, we will also solve a similar
problem. We assume here that the outter reader is keeping the SSTable object
alive.
I must note that I have not reproduced this problem. What goes above is the
result of the analysis we have made in #1036. That being the case, a thorough
review is appreciated.
Fixes#1036
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <a7e4e722f76774d0b1f263d86c973061fb7fe2f2.1458135770.git.glauber@scylladb.com>
(cherry picked from commit 6a3872b355)
Asking to read from byte 100 when a file has 50 bytes is an obvious error.
But what if we ask to read from byte 50? What if we ask to read 0 bytes at
byte 50? :-)
Before this patch, code which asked to read from the EOF position would
get an exception. After this patch, it would simply read nothing, without
error. This allows, for example, reading 0 bytes from position 0 on a file
with 0 bytes, which apparently happened in issue #1039...
A read which starts at a position higher than the EOF position still
generates an exception.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1458137867-10998-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit 02ba8ffbe8)
The uncompression code reads the compressed chunks containing the bytes
pos through pos + len - 1. This, however, is not correct when len==0,
and pos + len - 1 may even be -1, causing an out-of-range exception when
calling locate() to find the chunks containing this byte position.
So we need to treat len==0 specially, and in this case we don't read
anything, and don't need to locate() the chunks to read.
Refs #1039.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1458135987-10200-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit 73297c7872)
This fixes gossip test shutdown similar to what commit 13ce48e ("tests:
Fix stop of storage_service in cql_test_env") did for CQL tests:
gossip_test: /home/penberg/scylla/seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local() [with Service = net::messaging_service]: Assertion `local_is_initialized()' failed.
Running 1 test case...
[snip]
unknown location(0): fatal error in "test_boot_shutdown": signal: SIGABRT (application abort requested)
seastar/tests/test-utils.cc(32): last checkpoint
Message-Id: <1458126520-20025-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 2f519b9b34)
The cf can be deleted after the cf deletion check. Handle this case as
well.
Use "warn" level to log if cf is missing. Although we can handle the
case, but it is good to distingush where the receiver of streaming
applied all the stream mutations or not. We believe that the cf is
missing because it was dropped, but it could be missing because of a bug
or something we didn't anticipated here.
Related patch: "streaming: Handle cf is deleted when sending
STREAM_MUTATION_DONE"
Fixes simple_add_new_node_while_schema_changes_test failure.
Message-Id: <c4497e0500f50e0a3422efb37e73130765c88c57.1458090598.git.asias@scylladb.com>
(cherry picked from commit 2d50c71ca3)
It is a legacy API from c*. Since we can wait for the
update_pending_ranges to complete, we can wait for it directly instead
of calling block_until_update_pending_ranges_finished to do so.
Also, change do_update_pending_ranges to be private.
Message-Id: <ac79b2879ec08fdcd3b2278ff68962cc71492f12.1458040608.git.asias@scylladb.com>
The verb is just for reporting and debugging purposes, but it is better
not to register it until it can return a meaningful value. Besides, it
really belongs to the migration manager subsystem anyway.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1458037053-14836-1-git-send-email-pdziepak@scylladb.com>
Streaming is used by bootstrap and repair. Streaming uses storage_proxy
class to apply the frozen_mutation and db/column_family class to
invalidate row cache. Defer the initalization just before repair and
bootstrap init.
Message-Id: <8e99cf443239dd8e17e6b6284dab171f7a12365c.1458034320.git.asias@scylladb.com>
Register the REPAIR_CHECKSUM_RANGE messaging service verb handler after
we have replayed the commitlog to avoid responding with bogus checksums.
Message-Id: <1458027934-8546-1-git-send-email-penberg@scylladb.com>
"At the momment, the migration_listener callbacks returns void, it is impossible
to wait for the callbacks to complete. Make the callbacks runs inside seastar
thread, so if we need to wait for the callback, we can make it call
foo_operation().get() in the callback. It is easier than making the callbacks
return future<>.
Fixes #1000."
If a keyspace is created after we calcuate the pending ranges during
bootstrap. We will ignore the keyspace in pending ranges when handling
write request for that keyspace which will casue data lose if rf = 1.
Fixes#1000
At the momment, the callbacks returns void, it is impossible to wait for
the callbacks to complete. Make the callbacks runs inside seastar
thread, so if we need to wait for the callback, we can make it call
foo_operation().get() in the callback. It is easier than making the
callbacks return future<>.
Some network equipment that does TCP session tracking tend to drop TCP
sessions after a period of inactivity. Use keepalive mechanism to
prevent this from happening for our inter-node communication.
Message-Id: <20160314173344.GI31837@scylladb.com>
* seastar 88cc232...0739576 (4):
> rpc: allow configuring keepalive for rpc client
> net: add keepalive configuration to socket interface
> iotune: refuse to run if there is not enough space available
> rpc: make client connection error more clear
Defer registering migration manager RPC verbs after commitlog has has
been replayed so that our own schema is fully loaded before other other
nodes start querying it or sending schema updates.
Message-Id: <1457971028-7325-1-git-send-email-penberg@scylladb.com>
The same shard may create an sstables::sstable object for the same SStable
that doesn't belong to it more than once and mark it
for deletion (e.g. in a 'nodetool refresh' flow).
In that case the destructor of sstables::sstable accounted
the deletion requests from the same shard more than once since it was a simple
counter incremented each time there was a deletion request while it should
account request from the same shard as a single request. This is because
the removal logic waited for all shards to agree on a removal of a specific
SStable by comparing the counter mentioned above to the total
number of shards and once they were equal the SStable files were actually removed.
This patch fixes this by replacing the counter by an std::unordered_set<unsigned>
that will store a shard ids of the shards requesting the deletion
of the sstable object and will compare the size() of this set
to smp::count in order to decide whether to actually delete the corresponding
SStable files.
Fixes#1004
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1457886812-32345-1-git-send-email-vladz@cloudius-systems.com>
When we are about to write a new sstable, we check if the sstable exists
by checking if respective TOC exists. That check was added to handle a
possible attempt to write a new sstable with a generation being used.
Gleb was worried that a TOC could appear after the check, and that's indeed
possible if there is an ongoing sstable write that uses the same generation
(running in parallel).
If TOC appear after the check, we would again crap an existing sstable with
a temporary, and user wouldn't be to boot scylla anymore without manual
intervention.
Then Nadav proposed the following solution:
"We could do this by the following variant of Raphael's idea:
1. create .txt.tmp unconditionally, as before the commit 031bf57c1
(if we can't create it, fail).
2. Now confirm that .txt does not exist. If it does, delete the .txt.tmp
we just created and fail.
3. continue as usual
4. and at the end, as before, rename .txt.tmp to .txt.
The key to solving the race is step 1: Since we created .txt.tmp in step 1
and know this creation succeeded, we know that we cannot be running in
parallel with another writer - because such a writer too would have tried to
create the same file, and kept it existing until the very last step of its
work (step 4)."
This patch implements the solution described above.
Let me also say that the race is theoretical and scylla wasn't affected by
it so far.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ef630f5ac1bd0d11632c343d9f77a5f6810d18c1.1457818331.git.raphaelsc@scylladb.com>
Since calculate_pending_ranges will modify token_metadata, we need to
replicate to other shards. With this patch, when we call
calculate_pending_ranges, token_metadata will be replciated to other
non-zero shards.
In addition, it is not useful as a standalone class. We can merge it
into the storage_service. Kill one singleton class.
Fixes#1033
Refs #962
Message-Id: <fb5b26311cafa4d315eb9e72d823c5ade2ab4bda.1457943074.git.asias@scylladb.com>
"This series adds more information (i.e. keys and tombstones) to the
query result digest in order to ensure correctness and increase the
chances of early detection of disagreement between replicas.
The digest is no longer computed by hashing query::result but build
using the query result builder. That is necessary since the query
result itself doesn't contain all information required to compute
the digest. Another consequence of this is that now replicas asked
for a result need to send both the result and the digest to
the coordinator as it won't be able to compute the digest itself.
Unfortunately, these patches change our on wire communication:
1) hash computation is different
2) format of query::result is changed (and it is made non-final)
Fixes #182."
Query result digest is used to verify that all replicas have the same
data. Therefore, it needs to contain more information than the query
result itself in order to ensure proper detection of disagreements.
Generally, adding clustering keys to the digest regardless of whether
the client asked for them will guarantee correctness. However, adding
tombstones as well improves the chances of early detection of nodes
containing stale data.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Result digest is going to be computed in query result builder and
require information not available in the query resylt. That's why the
digest now needs to be sent to the other nodes together with the result
as they won't be able compute it on their own.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Currently, if there is a disagreement between replicas we get mutations
from all of them, merge this mutations and send the result to the
client, difference between the result and the mutation sent by a
particular replica is sent back to repair it.
Unfortunately, that may not suffice to provide user with correct results
in case of disagreements.
Consider the following scenario:
create table cf(p int, c int, r int, primary key(p, c));
node1:
p=0, c=1, r=1 (timestamp = 1)
p=0, c=2, r=2 (timestamp = 2)
node2:
p=0, c=1, r=tombstone (timestamp = 2)
p=0, c=2, r=1 (timestamp = 1)
query:
select r from cf limit 1;
Let's assume there are no row markers. node1 will send only outdated
cell (p=0, c=1, r=1) while node2 will send both tombstone for c=1 and
outdated cell (p=0, c=2, r=1). A disagreement will be detected, the
replies will be merged and the coordinator will respond to the client
with result r=1, while the correct answer is r=2.
The solution proposed in this patch is to attempt to detect cases when
the problem may occur and retry queries with larger limit which result
in replicas providing more information.
The detection logic is simple: the partition key and clustering key of
the last row in the reconciled result are compared with the partition
keys and clustering keys of the last rows of replies from replicas
(except short reads). If the (pk, ck) of the replica last row is smaller
than the (pk, ck) of the reconciled result the query is retried.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Fix bootstrap_test.py:TestBootstrap.failed_bootstap_wiped_node_can_join_test
Logs on node 1:
INFO 2016-03-11 15:53:43,287 [shard 0] gossip - FatClient 127.0.0.2 has been silent for 30000ms, removing from gossip
INFO 2016-03-11 15:53:43,287 [shard 0] stream_session - stream_manager: Close all stream_session with peer = 127.0.0.2 in on_remove
WARN 2016-03-11 15:53:43,498 [shard 0] stream_session - [Stream #4e411ba0-e75e-11e5-81f8-000000000000] stream_transfer_task: Fail to send STREAM_MUTATION_DONE to 127.0.0.2:0: std::runtime_error ([Stream #4e411ba0-e75e-11e5-81f8-000000000000] GOT STREAM_ MUTATION_DONE 127.0.0.1: Can not find stream_manager)
terminate called without an active exception
Backtrace on node 1:
#0 0x00007fb74723da98 in raise () from /lib64/libc.so.6
#1 0x00007fb74723f69a in abort () from /lib64/libc.so.6
#2 0x00007fb74ab84aed in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3 0x00007fb74ab82936 in ?? () from /lib64/libstdc++.so.6
#4 0x00007fb74ab82981 in std::terminate() () from /lib64/libstdc++.so.6
#5 0x00007fb74ab82be9 in __cxa_rethrow () from /lib64/libstdc++.so.6
#6 0x0000000000f3521e in streaming::stream_transfer_task::<lambda()>::<lambda(auto:44)>::operator()<std::__exception_ptr::exception_ptr> (ep=..., __closure=0x7ffce74d8630) at streaming/stream_transfer_task.cc:169
#7 do_void_futurize_apply<const streaming::stream_transfer_task::start()::<lambda()>::<lambda(auto:44)>&, std::__exception_ptr::exception_ptr> (func=...) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1142
#8 futurize<void>::apply<const streaming::stream_transfer_task::start()::<lambda()>::<lambda(auto:44)>&, std::__exception_ptr::exception_ptr> (func=...) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1190
#9 future<>::<lambda(auto:7&&)>::operator()<future<> > ( fut=fut@entry=<unknown type in /home/asias/src/cloudius-systems/scylla/build/release/scylla, CU 0xec84d00, DIE 0xee2561d>, __closure=__closure@entry=0x7ffce74d8630) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1014
Message-Id: <1457684884-4776-2-git-send-email-asias@scylladb.com>
Currently, if sstable::write_components() is called to write a new sstable
using the same generation of a sstable that exists, a temporary TOC will
be unconditionally created. Afterwards, the same sstable::write_components()
will fail when it reaches sstable::create_data(). The reason is obvious
because data component exists for that generation (in this scenario).
After that, user will not be able to boot scylla anymore because there is
a generation with both a TOC and a temporary TOC. We cannot simply remove a
generation with TOC and temporary TOC because user data will be lost (again,
in this scenario). After all, the temporary TOC was only created because
sstable::write_components() was wrongly called with the generation of a
sstable that exists.
Solution proposed by this patch is to trigger exception if a TOC file
exists for the generation used.
Some SSTable unit tests were also changed to guarantee that we don't try
to overwrite components of an existing sstable.
Refs #1014.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <caffc4e19cdcf25e4c6b9dd277d115422f8246c4.1457643565.git.raphaelsc@scylladb.com>
Since commit 2f56577 ("sstables: more efficient read of compressed data
file"), the compressed_file_input_stream uses a file_input_stream to
efficiently read the compressed data at chunks some desired size (128 KB
is our default) instead of at smaller compressed chunks.
However, I had a bug where I mis-calculated the desired length of the
read (giving the *end byte* instead of the length!) and as a result
file_input_stream did not know where the read was supposed to stop, and
always read 128 KB buffers. The results were not incorrect, because the
sstable reader stops when it needs to, even if given too much data. But
it was inefficient because too much data was read in the last buffer.
With this patch, the length is correctly given to the input stream, and
it can read a much smaller buffer at the end of the read, not the full
128 KB. I tested that this actually happens.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1457633616-15193-1-git-send-email-nyh@scylladb.com>
"As described in issue #1014, we have found ourselves in a situation where
SSTables can be written too early, and that causes problems for the existing
SSTables. While this shouldn't happen - and Pekka's recent patch to move
populate() a lot earlier in initialization should fix that, when that did
happen what we had was not enough to prevent it from overwriting existing
tables.
We should do a lot better job protecting against that.
Also, some of the exceptions that are generated at totally inconclusive. This
series also aims at making some of the exceptions more descriptive."
This patch makes sure that every time we need to create a new generation number -
the very first step in the creation of a new SSTable, the respective CF is already
initialized and populated. Failure to do so can lead to data being overwritten.
Extensive details about why this is important can be found
in Scylla's Github Issue #1014
Nothing should be writing to SSTables before we have the chance to populate the
existing SSTables and calculate what should the next generation number be.
However, if that happens, we want to protect against it in a way that does not
involve overwriting existing tables. This is one of the ways to do it: every
column family starts in an unwriteable state, and when it can finally be written
to, we mark it as writeable.
Note that this *cannot* be a part of add_column_family. That adds a column family
to a db in memory only, and if anybody is about to write to a CF, that was most
likely already called. We need to call this explicitly when we are sure we're ready
to issue disk operations safely.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The standard C++ exception messages that will be thrown if there is anything
wrong writing the file, are suboptimal: they barely tell us the name of the failing
file.
Use a specialized create function so that we can capture that better.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Deletion of previous stale, temporary SSTables is done by Shard0. Therefore,
let's run Shard0 first. Technically, we could just have all shards agree on the
deletion and just delete it later, but that is prone to races.
Those races are not supposed to happen during normal operation, but if we have
bugs, they can. Scylla's Github Issue #1014 is an example of a situation where
that can happen, making existing problems worse. So running a single shard
first and getting making sure that all temporary tables are deleted provides
extra protection against such situations.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are no longer using the in_flight_seals gate, but forgot to remove it.
To guarantee that all seal operations will have finished when we're done,
we are using the memtable_flush_queue, which also guarantees order. But
that gate was never removed.
The FIXME code should also be removed, since such interface does exist now.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We already have a function that wraps this, re-use it. This FIXME is still
relevant, so just move it there. Let's not lose it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We use memory usage as a threshold these days, and nowhere is _mutation_count
checked. Get rid of it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
While looking at initialization code I felt like my head is going to
explode. Moving initialization into a thread makes things a little bit
better. Only lightly tested.
Message-Id: <20160310163142.GE28529@scylladb.com>
Make sure http_response.hh that is pulled by locator/ec2_snitch.hh is
built. The commit is similar to what commit 6ccf8f8 ("build: make sure
to ask seastar to build http/request_parser.hh, and depend on it") did
for request_parser.hh.
Fixes the following build error on CentOS:
In file included from ./locator/ec2_multi_region_snitch.hh:41:0,
from locator/ec2_multi_region_snitch.cc:39:
./locator/ec2_snitch.hh:24:40: fatal error: http/http_response_parser.hh: No such file or directory
Spotted by Shlomi.
Message-Id: <1457612266-315-1-git-send-email-penberg@scylladb.com>
We start services like gossiper before system keyspace is initialized
which means we can start writing too early. Shuffle code so that system
keyspace is initialized earlier.
Refs #1014
Message-Id: <1457593758-9444-1-git-send-email-penberg@scylladb.com>
If we do
- Decommission a node
- Stop a node
we will shutdown gossip more than once in:
- storage_service::decommission
- storage_service::drain_on_shutdown
Fix by checking if it is already stopped and back off if so.
If we do
- Decommission a node
- Stop a node
we will shutdown messaging_service more than once in:
- storage_service::decommission
- storage_service::drain_on_shutdown
Fixes#1005
Refs #1013
This fix a dtest failure in debug build.
update_cluster_layout_tests.TestUpdateClusterLayout.simple_decommission_node_1_test/
/data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:802:35:
runtime error: member call on null pointer of type 'struct
future_state'
core/future.hh:334:49: runtime error: member access within null
pointer of type 'const struct future_state'
ASAN:SIGSEGV
=================================================================
==4557==ERROR: AddressSanitizer: SEGV on unknown address
0x000000000000 (pc 0x00000065923e bp 0x7fbf6ffac430 sp 0x7fbf6ffac420
T0)
#0 0x65923d in future_state<>::available() const
/data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:334
#1 0x41458f1 in future<>::available()
/data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:802
#2 0x41458f1 in then_wrapped<parallel_for_each(Iterator, Iterator,
Func&&)::<lambda(parallel_for_each_state&)> [with Iterator =
std::__detail::_Node_iterator<std::pair<const net::msg_addr,
net::messaging_service::shard_info>, false, true>; Func =
net::messaging_service::stop()::<lambda(auto:39&)> [with auto:39 =
std::unordered_map<net::msg_addr, net::messaging_service::shard_info,
net::msg_addr::hash>]::<lambda(std::pair<const net::msg_addr,
net::messaging_service::shard_info>&)>]::<lambda(future<>)>, future<>
> /data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:878
Attempt to print std::nested_exception currently results in exception
to leak outside the printer. Fix by capturing all exception in the
final catch block.
For nested exception, the logger will print now just
"std::nested_exception". For nested exceptions specifically we should
log more, but that is a separate problem to solve.
Message-Id: <1457532215-7498-1-git-send-email-tgrabiec@scylladb.com>
Now, get_ranges_for_endpoint will unwrap the first range. With t0 t1 t2
t3, the first range (t3,t0] will be splitted as (min,t0] and (t3,max].
Skippping the range (t3,max] we will get the correct ownership number as
if the first range were not splitted.
Fixes#928
Message-Id: <2e30ebd53f3dba3cc5e0cf36d5541c354b0e30ca.1457506704.git.asias@scylladb.com>
In the preparation phase of streaming, we check that remote node has all
the cf_id which are needed for the entire streaming process, including the
cf_id which local node will send to remote node and wise versa.
So, at later time, if the cf_id is missing, it must be that the cf_id is
deleted. It is fine to ingore no_such_column_family exception. In this
patch, we change the code to ignore at server side to avoid sending the
exception back, to avoid handle exception in an IDL compatiable way.
One thing we can improve is that the sender might know the cf is deleted
later than the receiver does. In this case, the sender will send some
more mutations if we send back the no_such_column_family back to the
sender. However, since we do not throw exceptions in the receiver stream
mutation handler, it will not cause a lot of overhead, the receiver will
just ignore the mutation received.
Fixes#979
It is possible that a cf is deleted after we make the cf reader. Avoid
sending them to avoid the unnecessary overhead to send them on the wire and
the peer node to drop the received mutations.
"Hook streaming with gossip callback so we can abort
the stream_session in such case:
- a node is restarted
- a node is removed from the cluster
Fixes #1001."
Before this patch, reading large ranges from a compressed data file involved
two inefficiencies:
1. The compressed data file was read one compressed chunk at a time.
Such a chunk is around 30 KB in size, well below our desired sstable
read-ahead size (sstable_buffer_size = 128 KB).
2. Because the compressed chunks have variable length (the uncompressed
chunk has a fixed length) they are not aligned to disk blocks, so
consecutive chunks have overlapping blocks which were unnecessarily
read twice.
The fix for both issues is to build the compressed_file_input_stream on
an existing file_input_stream, instead of using direct file IO to read the
individual chunks. file_input_stream takes care of doing the appropriate
amount of read-ahead, and the compressed_file_input_stream layer does the
decompression of the data read from the underlying layer.
Fixes#992.
Historical note: Implementing compressed_file_input_stream on top of
file_input_stream was already tried in the past, and rejected. The problem
at that time was that compressed_file_input_stream's constructor did not
specify the *end* of the range to read, so that when we wanted to read
only a small range we got too much read-ahead beyond the exactly one
compressed chunk that we needed to read. Following the fix to issue #964,
we now know on every streaming read also the intended *end* of the stream,
so we can now use this to stop reading at the end of the last required
chunk, even when we use a read-ahead buffer much larger than a chunk.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1457304335-8507-1-git-send-email-nyh@scylladb.com>
We try to be robust against files disappearing (due to any kind of corruption)
inside the data directory.
But if the data directory itself goes missing, that's a situation that we don't
handle correctly. We will keep accepting writes normally, but when we try to
flush the memtable to disk, we'll fail with a system error.
Having the CF directory disappearing is not a common thing. But it is also one
that we can easily protect against, by touching all CF directories we know
about on startup.
Fixes#999
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <ed66373dccca11742150a6d08e21ece3980227d3.1457379853.git.glauber@scylladb.com>
- Start a node
- Inject data
- Start another node to bootstrap
- Before the second node finishes streaming, kill the second node
- After a while the node will be removed from the cluster becusue it does
not manage to join the cluster.
- At this time, messaging_service might keep retrying the
stream_mutations unncessarily.
To fix, check if the peer node is still a known node in the gossip.
If the peer node of a stream_session is restarted or removed we should
abort the streaming. It is better to hook gossip callback in the stream
manager than in each streamm_session.
In due time we will have to fix this, but as an interim step, let's use
a "better" magic number.
The problem with 100, is that as soon as the partitions start to go bigger,
we're using too much memory. Since this is multiplied by the number of token
ranges, and happens in every shard, the final number can become really big,
and the amount of resources we use go up proportionally.
This means that even we are mistaken about the new number (we probably are),
in this case it is better to err on the side of a more conservative resource
usage.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <97158f3db5734916cee4ccf12eaa66e7402570bb.1457448855.git.glauber@scylladb.com>
When we do a streaming read that knows the expected *end* position of the
read, we can use a large read-ahead buffer, and at the same time, stop
reading at exactly the intended end (or small rounding of it to the DMA
block size) and not waste resources blindly reading a large amount of data
after the end just to fill the read-ahead buffer.
The sstable reading code, both for reading the data file and the index file,
created a file input stream without specifiying its end, thereby losing
this optimization - so when a large buffer was used, we would get a large
over-read. This patch fixes this, so sstable data file and index file are
read using a file input stream which is a ware of its end.
Fixes#964.
Note that this patch does not change the behavior when reading a
*compressed* data file. For compressed read, we did not have the problem
of over-read in the first place, because chunks are read one by one.
But we do have other sources of inefficiencies there (stemming, again,
from the fact that the compressed chunks are read one by one), and I
opened a separate issue #992 for that.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1457219304-12680-1-git-send-email-nyh@scylladb.com>
Fixes#967
Frozen lists are just atomic cells. However, old code inserted the
frozen data directly as an atomic_cell_or_collection, which in turn
meant it lacked the header data of a cell. When in turn it was
handled by internal serialization (freeze), since the schema said
is was not a (non-frozen) collection, we tried to look at frozen
list data as cell header -> most likely considered dead.
Message-Id: <1457432538-28836-1-git-send-email-calle@scylladb.com>
background_reads collectd counter was not always properly decremented.
Fix it and streamline background read repair error handling.
Message-Id: <20160307182255.GI4849@scylladb.com>
Currently it is waited upon only if background read repair check is
needed and this cause unhandled exception warning to be printed if
it enters failed state. Fix this by always waiting on it, but doing
anything beyond ignoring an exception only if check is needed.
Message-Id: <1457351304-28721-1-git-send-email-gleb@scylladb.com>
Currently write acknowledgements handling does not take bootstrapping
node into account for CL=EACH_QUORUM. The patch fixes it.
Fixes#994
Message-Id: <20160307121620.GR2253@scylladb.com>
During bootstrapping additional copies of data has to be made to ensure
that CL level is met (see CASSANDRA-833 for details). Our code does
that, but it does not take into account that bootstraping node can be
dead which may cause request to proceed even though there is no
enough live nodes for it to be completed. In such a case request neither
completes nor timeouts, so it appear to be stuck from CQL layer POV. The
patch fixes this by taking into account pending nodes while checking
that there are enough sufficient live nodes for operation to proceed.
Fixes#965
Message-Id: <20160303165250.GG2253@scylladb.com>
From Vlad:
This series modifies the 'database' class to use the internal
_enable_incremental_backups value (initialized with
'incremental_backups' configuration value) instead of using the
'incremental_backups' configuration value directly.
Then we update this internal value in runtime from 'nodetool
enable/disablebackup' API callback so that newly created keyspaces and
column families use the newly configured incremental backup
configuration.
From Vlad:
This series fixes the first part of issue #909 (the second part has a
separate github issue #965) which is a discrepancy between a
storage_service::token_metadata and a gossiper::endpoint_state_map
contents on non-zero shards.
In region destructor, after active segments is freed pointer to it is
left unchanged. This confuses the remaining parts of the destructor
logic (namely, removal from region group) which may rely on the
information in region_impl::_active.
In this particular case the problem was that code removing from the
region group called region_impl::occupancy() which was
dereferencing _active if not null.
Fixes#993.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1457341670-18266-1-git-send-email-pdziepak@scylladb.com>
'nodetool enable/disablebackup' callback was modifying only the
existing keyspaces and column families configurations.
However new keyspaces/column families were using
the original 'incremental_backups' configuration value which could
be different from the value configured by 'nodetool enable/disablebackup'
user command.
This patch updates the database::_enable_incremental_backups per-shard
value in addition to updating the existing keyspaces and column families
configurations.
Fixes#845
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Store the "incremental_backups" configuration value in the database
class (and use it when creating a keyspace::config) in order to be
able to modify it in runtime.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
If storage_service::token_metadata is not distributed together with
gossiper::endpoint_state_map there may be a situation when a non-zero
shard sees a new value in token_metadata (e.g. newly added node's
token ranges) while still seeing an old gossiper::endpoint_state_map
contents (e.g. a mentioned above newly added node may not be present,
thus causing gossiper::is_alive() to return FALSE for that node, while
the node is actually alive and kicking).
To avoid this discrepancy we will always update a token_metadata together
with an endpoint_state_map when we distribute new token_metadata data
among shards.
Fixes#909
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
We will need to access it from a storage_service class when replicate
token_metadata.
Rename _shadow_endpoint_state_map -> shadow_endpoint_state_map
according to our coding convention.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
If timeout happens after cl promise is fulfilled, but before
continuation runs it removes all the data that cl continuation needs
to calculate result. Fix this by calculating result immediately and
returning it in cl promise instead of delaying this work until
continuation runs. This has a nice side effect of simplifying digest
mismatch handling and making it exception free.
Fixes#977.
Message-Id: <1457015870-2106-3-git-send-email-gleb@scylladb.com>
Read executor may ask for more than one data reply during digest
resolving stage, but only one result is actually needed to satisfy
a query, so no need to store all of them.
Message-Id: <1457015870-2106-2-git-send-email-gleb@scylladb.com>
In digest resolver for cl to be achieved it is not enough to get correct
number of replies, but also to have data reply among them. The condition
in digest timeout does not check that, fortunately we have a variable
that we set to true when cl is achieved, so use it instead.
Message-Id: <1457015870-2106-1-git-send-email-gleb@scylladb.com>
Ubuntu 14.04LTS package is broken now because iotune does not statically linked against libstdc++, so this patch fixed it.
Requires seastar patch to add --static-stdc++ on configure.py.
Fixes#982
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456995050-22007-1-git-send-email-syuu@scylladb.com>
1) As explained in commit 697b16414a (gossip: Make gossip message
handling async), in each gossip round we can make talking to the 1-3
peer nodes in parallel to reduce latency of gossip round.
2) Gossip syn message uses one way rpc message, but now the returned
future of the one way message is ready only when message is dequeued for
some reason (sent or dropped). If we wait for the one way syn messge to
return it might block the gossip round for a unbounded time. To fix, do
not wait for it in the gossip round. The downside is there will be no
back pressure to bound the syn messages, however since the messages are
once per second, I think it is fine.
Message-Id: <ea4655f121213702b3f58185378bb8899e422dd1.1456991561.git.asias@scylladb.com>
Currently schema changes are only logged at coordinator node which
initiates the change. It would be helpful in post morten analysis to
also see when and how schema changes are resolved when applied on
other nodes.
Message-Id: <1456953095-1982-1-git-send-email-tgrabiec@scylladb.com>
Use the existing "feed_hash" mechanism to find a checksum of the
content of a mutation, instead of serializing the mutation (with freeze())
and then finding the checksum of that string.
The serialized form is more prone to future changes, and not really
guaranteed to provide equal hashes for mutations which are considered
"equal".
Fixes#971
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1456958676-27121-1-git-send-email-nyh@scylladb.com>
While is is formally better to take a local lock first and
then first contend for a global, in this case it is arguably
better to ensure we get a gate exception synchronously (early)
instead of potentially in a continuation. Old version might
cause us to do a gate::leave even while never entered.
And since we should really only have one active (contending)
segment per shard anyway, it should not matter.
Message-Id: <1456931988-5876-1-git-send-email-calle@scylladb.com>
Fixes#865
(Some) gcc 5 (5.3.0 for me) on ubuntu will generate errors on
compilation of this code (compiling logalloc_test). The memcpy
to inline storage seems to confuse the compiler.
Simply change to std::copy, which shuts the compiler up.
Any decent stl should convert primitive std::copy to memcpy
anyway, but since it is also the inline (small storage),
it should not matter which way.
Message-Id: <1456931988-5876-4-git-send-email-calle@scylladb.com>
"This series implements describe_schema_versions so that we nodetool
describecluster can return proper schema information for the whole
cluster. It involves adding new verb SCHEMA_CHECK which is used to get
schema version for a given node and a simple map-reduce that using that
verb gets info from the whole cluster.
This fixes#677, fixes#684, and fixes #472."
Useful for determining order of events in logs of different nodes, or
for estimating how much time passed between two events.
Fixes#941.
Example log:
INFO 2016-03-01 18:30:37,688 [shard 0] gossip - Waiting for gossip to settle before accepting client requests...
INFO 2016-03-01 18:30:45,689 [shard 0] gossip - No gossip backlog; proceeding
INFO 2016-03-01 18:30:45,689 [shard 0] storage_service - Starting listening for CQL clients on localhost:9042...
Message-Id: <1456853532-28800-1-git-send-email-tgrabiec@scylladb.com>
Start with coarse control:
1) converting the run_with_write_api_lock operations:
join_ring, start_gossiping, stop_gossiping, start_rpc_server,
stop_rpc_server, start_native_transport, stop_native_transport,
decommission, remove_node, drain, move, rebuild
to use run_with_api_lock which uses a flag to indicate current operation
in progress.
If one of the above operation is in progress when admin issues another
opeartion we return a "try again" exception to avoid running two
operations in parallel.
2) converting the run_with_read_api_lock to use no lock.
Fixes#850.
Message-Id: <00782b601028ed87437e5decae382f72dff634f6.1456758391.git.asias@scylladb.com>
Currently when reading a view to an object the stream stored has the
same bound as the containing stream, not the bounds of the object
itself. The serializer of the view assumes that the stream has the
bounds of the object itself.
Fixes dtest failure in
paging_test.py:TestPagingSize.test_undefined_page_size_default
Fixes#963.
Message-Id: <1456854556-32088-1-git-send-email-tgrabiec@scylladb.com>
The segment->segment_manager pointer has, until now, been a raw pointer,
which in a way is sensible, since making circular shared pointer
relations is in general bad. However, since the code and life cycle
of segments has evolved quite a bit since that initial relation
was defined, becoming both more and then suddenly, in a sense,
less, asynchronous over time, the usage of the relation is in fact
more consistent with a shared pointer, in that a segment needs to
access its manager to properly do things like write and flush.
These two ops in particular depend on accessing the segment manager
in a way that might be fine even using raw pointers, if it was not
again for that little annoying thing of continuation reordering.
So, lets just make the relation a shared pointer, solving the issue
of whether the manager is alive when a segment accesses it. If it
has been "released" (shut down), the existing mechanisms (gate)
will then trigger and prevent any actual _actions_ from taking
place. And we don't have to complicate anything else even more.
Only "big" change is that we need to explicitly orphan all
segments in commitlog destructor (segment_manager is essentially
a p-impl).
This fixes some spurious crashes in nightly unit tests.
Fixes#966.
Message-Id: <1456838735-17108-1-git-send-email-calle@scylladb.com>
We cannot use shared_ptr *instances* for checking duplicate column
definitions because they are never equal. Store column definition name
in the unordered_map instead.
Fixes cql_additional_tests.py:TestCQL.identifier_test.
Spotted by Shlomi.
Message-Id: <1456840506-13941-1-git-send-email-penberg@scylladb.com>
When the first time the keep alive timer fires, the _last_stream_bytes
btyes will be zero since it is the first time we update it. The keep
alive timer will be rearmed and fired again. The second time, we find
there is no progress, we close the session. The total idle time will be
2 * keep alive timer.
To make the idle time to close the session be more precise, we reduce
the interval to check the progess and close the session by checking last
time the progress is made.
Message-Id: <c959cffce0cc738a3d73caaf71d2adb709d46863.1456831616.git.asias@scylladb.com>
Checking schema::is_dense() is not enough to know whether row marker
should be inserted or not as there may be compact storage tables that
are not considered dense (namely, a table with now clustering key).
Row marker should only be insterted if schema::is_cql3_table() is true.
Fixes#931.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1456834937-1630-1-git-send-email-pdziepak@scylladb.com>
corrupt_segment() is meant to write some garbage at arbitrary position
in the commitlog segment. That position is not necessairly properly
aligned for uint32_t.
Silences ubsan complaints about unaligned write.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1456827726-21288-1-git-send-email-pdziepak@scylladb.com>
Unlike CentOS/Fedora, scylla_io_setup is calling from pre-start section of scylla-server upstart job, not from separated job.
This is because Upstart does not provide same behavior as After / Requires directives on systemd.
Fixes#954.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456825805-4195-1-git-send-email-syuu@scylladb.com>
This patch change the way optional vector are implemented.
Now a vector of optional would be handle like any other non primitive
types, with a single method add() that would return a writer to the
optional.
The writer to the optional would have a skip and write method like
simple optional field.
For basic types the write method would get the value as a parameter, for
composite type, it would return a writer to the type.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1456796143-3366-2-git-send-email-amnon@scylladb.com>
To report disk usage, scylla was only taking into account size of
sstable data component. Other components such as index and filter
may be relatively big too. Therefore, 'nodetool status' would
report an innacurate disk usage. That can be fixed by taking into
account size of all sstable components.
Fixes#943.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <08453585223570006ac4d25fe5fb909ad6c140a5.1456762244.git.raphaelsc@scylladb.com>
Fixes#482
See code comment. Reserve segment allocation count sum can temporarily
overflow due to continuation delay/reordering, if we manage to reach the
on_timer code before finally clauses from previous reserve allocation
invocation has processed. However, since these are benign overflows
(just indicating even more that we don't need to do anything right now)
simply capping the count should be fine.
Avoids assert in boost irange.
Message-Id: <1456740679-4537-1-git-send-email-calle@scylladb.com>
"This patchset fixes#950, run scylla-io-setup before scylla-server on anycase, and installs example /etc/scylla.d/io.conf by default to prevent error on 'EnvironmentFile=/etc/scylla.d/*.conf'."
With this change, you can define your own prefix of AMI name in variable.json.
example:
{
"access_key": "xxx",
"secret_key": "xxx",
"subnet_id": "xxx",
"security_group_id": "xxx",
"region": "us-east-1",
"associate_public_ip_address": "true",
"instance_type": "c4.xlarge",
"ami_prefix": "takuya-"
}
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456329247-5109-1-git-send-email-syuu@scylladb.com>
Prevent error on 'EnvironmentFile=/etc/scylla.d/*.conf'.
Parameters are commented out, and the file will replace when scylla starts, by scylla-io-setup.service.
"The series includes Amnon's unmerged support for optional<> in idl-compiler.
Depends on seastar patch "[PATCH seastar] simple_input_stream: Introduce begin()".
The query result footprint for cassandra-stress mutation as reported
by tests/memory-footprint increased by 18% from 285 B to 337 B.
perf_simple_query shows slight regression in throughput (-8%):
build/release/tests/perf/perf_simple_query -c4 -m1G --partitions 100000
Before: ~433k tps
After: ~400k tps"
We require SSE 4.2 (for commitlog CRC32), verify it exists early and bail
out if it does not.
We need to check early, because the compiler may use newer instructions
in the generated code; the earlier we check, the lower the probability
we hit an undefined opcode exception.
Message-Id: <1456665401-18252-1-git-send-email-avi@scylladb.com>
Before:
ERROR [shard 0] storage_service - Format of host-id =
marshal_exception (marshalling error) is incorrect ???
Exiting on unhandled exception of type 'marshal_exception': marshalling error
After:
ERROR [shard 0] storage_service - Unable to parse 127.0.0.3 as host-id
Exiting on unhandled exception of type 'std::runtime_error': Unable to
parse 127.0.0.3 as host-id
Message-Id: <1456737987-32353-1-git-send-email-asias@scylladb.com>
It is used by
nodetool status
If an api operation inside storage_service takes a long time to finish
, which holds the lock, it will block nodetool status for a long time.
I think it is safe to get the load map even if other operations are in-flight.
Refs: #850
Message-Id: <1456737987-32353-2-git-send-email-asias@scylladb.com>
"This series:
1) Log total bytes sent/recevied when a stream plan completes.
It is useful in test code.
2) Fix http://scylla_ip:10000/stream_manager API"
Otherwise we will leak it, and region destructor will fail:
row_cache_test: utils/logalloc.cc:1211: virtual logalloc::region_impl::~region_impl(): Assertion `seg->is_empty()' failed.
Fixes regression in row_cache_test.
Since invalidate() may allocate, we need to take the region lock to
keep m.partitions references valid around whole clear_and_dispose(),
which relies on that.
The query result footprint for cassandra-stress mutation as reported
by tests/memory-footprint increased by 18% from 285 B to 337 B.
perf_simple_query shows slight regression in throughput (-8%):
build/release/tests/perf/perf_simple_query -c4 -m1G --partitions 100000
Before: ~433k tps
After: ~400k tps
By initilizing them to 0 we can catch unclosed frames at
deserialization time. It's better than leaving frame size undefined,
which may cause errors much later in deserialization process and thus
would make it harder to identifiy the real cause.
This patch adds optional writer support an optional field can be either
skip or set.
For vector of optional, a write_empty method will
add 1 to the vector count and mark the optional as false.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
For each stream_session, we pretend we are sending/receiving one file,
to make it compatible with nodetool. For receiving_files, the file name
is "rxnofile". For sending_files, the file name is "txnofile".
stream_manager::update_all_progress_info is introduced to update the
progress info of all the stream_sessions in the node. We need this
because streaming mutations are received on all the cores, but the
stream_session object is only on one of the cores. It adds overhead if
we update progress info in stream_session object whenever we receive a
streaming mutation. So, what we do now is when we really need the
progress info, we update the progress info in stream_session object.
With http://127.0.0.$i:10000/stream_manager/, it looks like below when
decommission node 3 in a 3 nodes cluster.
=========== GET NODE 1
[{"plan_id": "935a2cc0-dc6b-11e5-bdbf-000000000000", "description":
"Unbootstrap", "sessions": [{"receiving_files": [{"value": {"direction":
"IN", "file_name": "rxnofile", "session_index": 0, "total_bytes":
16876296, "peer": "127.0.0.3", "current_bytes": 16876296}, "key":
"rxnofile"}], "receiving_summaries": [{"files": 1, "total_size": 0,
"cf_id": "869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0,
"state": "PREPARING", "connecting": "127.0.0.3", "peer": "127.0.0.3"}]}]
=========== GET NODE 2
[{"plan_id": "935a2cc0-dc6b-11e5-bdbf-000000000000", "description":
"Unbootstrap", "sessions": [{"receiving_files": [{"value": {"direction":
"IN", "file_name": "rxnofile", "session_index": 0, "total_bytes":
16755552, "peer": "127.0.0.3", "current_bytes": 16755552}, "key":
"rxnofile"}], "receiving_summaries": [{"files": 1, "total_size": 0,
"cf_id": "869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0,
"state": "PREPARING", "connecting": "127.0.0.3", "peer": "127.0.0.3"}]}]
=========== GET NODE 3
[{"plan_id": "935a2cc0-dc6b-11e5-bdbf-000000000000", "description":
"Unbootstrap", "sessions": [{"sending_files": [{"value": {"direction":
"OUT", "file_name": "txnofile", "session_index": 0, "total_bytes":
16876296, "peer": "127.0.0.1", "current_bytes": 16876296}, "key":
"txnofile"}], "sending_summaries": [{"files": 1, "total_size": 0,
"cf_id": "869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0,
"state": "PREPARING", "connecting": "127.0.0.1", "peer":
"127.0.0.1"},{"sending_files": [{"value": {"direction": "OUT",
"file_name": "txnofile", "session_index": 0, "total_bytes": 16755552,
"peer": "127.0.0.2", "current_bytes": 16755552}, "key": "txnofile"}],
"sending_summaries": [{"files": 1, "total_size": 0, "cf_id":
"869d8630-dc6b-11e5-bdbf-000000000000"}], "session_index": 0, "state":
"PREPARING", "connecting": "127.0.0.2", "peer": "127.0.0.2"}]}]
The problem is that a generic functions (eg. skip()) which call
deserialize() overloads based on their template parameter only see
deserilize() overloads which were declared at the time skip() was
declared and not those which are available at the time of
instantiation. This forces all serializers to be declared before
serialization_visitors.hh is first included. Serializers included
later will fail to compile. This becomes problematic to ensure when
serializers are included from headers.
Template class specialization lookup doesn't suffer from this
limitation. We can use that to solve the problem. The IDL compiler
will now generate template class specializations with read/write
static methods. In addition to that, default serializer() and
deserialize() implementations are delegating to serializer<>
specialization so that API and existing code doesn't have to change.
Message-Id: <1456423066-6979-1-git-send-email-tgrabiec@scylladb.com>
"Gleb has recently noted that our query reads are not even being registered
with the I/O queue.
Investigating what is happening, I found out that while the priority that
make_reader receives was not being properly passed downwards to the SSTable
reader. The reader code is also used by compaction class, and that one is fine.
But the CQL reads are not.
On top of that, there are also some other places where the tag was not properly
propagated, and those are patched."
add_local_application_state is used in various places. Before this
patch, it can only be called on cpu zero. To make it safer to use, use
invoke_on() to foward the code to run on cpu zero, so that caller can
call it on any cpu.
Refs: #795
Message-Id: <d69b81c5561622078dbe887d87209c4ea2e3bf46.1456315043.git.asias@scylladb.com>
Gleb saw once:
scylla: gms/gossiper.cc:1393:
gms::gossiper::add_local_application_state(gms::application_state,
gms::versioned_value):: mutable: Assertion
`endpoint_state_map.count(ep_addr)' failed.
The assert is about we can not find the entry in endpoint_state_map of
the node itself. I can not really find any place we could call
add_local_application_state before we call gossiper::start_gossiping()
where it inserts broadcast address into endpoint_state_map.
I can not reproduce issue, let's log the error so we can narrow down
which application state triggered the assert.
Refs: #795
Message-Id: <f4433be0a0d4f23470a5e24e528afdb67b74c7ef.1456315043.git.asias@scylladb.com>
Fixes#934 - faulty assert in discard_sstables
run_with_compaction_disabled clears out a CF from compaction
mananger queue. discard_sstables wants to assert on this, but looks
at the wrong counters.
pending_compactions is an indicator on how much interested parties
want a CF compacted (again and again). It should not be considered
an indicator of compactions actually being done.
This modifies the usage slightly so that:
1.) The counter is always incremented, even if compaction is disallowed.
The counters value on end of run_with_compaction_disabled is then
instead used as an indicator as to whether a compaction should be
re-triggered. (If compactions finished, it will be zero)
2.) Document the use and purpose of the pending counter, and add
method to re-add CF to compaction for r_w_c_d above.
3.) discard_sstables now asserts on the right things.
Message-Id: <1456332824-23349-1-git-send-email-calle@scylladb.com>
We call a mutation source during the query path without any consideration
for attaching a priority. This is incorrect, and queries called through this
facility will end up in the default class.
Fix this by attaching the query priority class here.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Not all SSTable readers will end up getting the right tag for a priority
class. In particular, the range reader, also used for the memtables complete
ignores any priority class.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are situations when a memtable is already flushed but the memtable
reader will continue to be in place, relaying reads to the underlying
table.
For that reason, the "memtables don't need a priority class" argument
gets obviously broken. We need to pass a priority class for its reader
as well.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Fixes#937
In fixing #884, truncation not truncating memtables properly,
time stamping in truncate was made shard-local. This however
breaks the snapshot logic, since for all shards in a truncate,
the sstables should snapshot to the same location.
This patch adds a required function argument to truncate (and
by extension drop_column_family) that produces a time stamp in
a "join" fashion (i.e. same on all shards), and utilizes the
joinpoint type in caller to do so.
Message-Id: <1456332856-23395-2-git-send-email-calle@scylladb.com>
Lets operations working on all shards "join" and acquire
the same value of something, with that value being based on
whenever all shards reach the join.
Obvious use case: time stamp after one set of per-shard ops, but
before final ones.
The generation of the value is guaranteed to happen on the shards
that created the join point.
Based on the join-ops in CF::snapshot, but abstracted and made
caller responsibility. Primary use case is to help deal with
the join-problem of truncation.
Message-Id: <1456332856-23395-1-git-send-email-calle@scylladb.com>
The gocql driver assumes that there's a result metadata section in the
PREPARED message. Technically, Scylla is not at fault here as the CQL
specification explicitly states in Section 4.2.5.4. ("Prepared") that the
section may be empty:
- <result_metadata> is defined exactly as <metadata> but correspond to the
metadata for the resultSet that execute this query will yield. Note that
<result_metadata> may be empty (have the No_metadata flag and 0 columns, See
section 4.2.5.2) and will be for any query that is not a Select. There is
in fact never a guarantee that this will non-empty so client should protect
themselves accordingly. The presence of this information is an
However, Cassandra always populates the section so lets do that as well.
Fixes#912.
Message-Id: <1456317082-31688-1-git-send-email-penberg@scylladb.com>
* dist/ami/files/scylla-ami 398b1aa...d4a0e18 (3):
> Sort service running order (scylla-ami-setup.service -> scylla-io-setup.service -> scylla-server.service)
> Drop --ami and --disk-count parameters
> dist: pass the number of disks to set io params
In each gossip round, i.e., gossiper::run(), we do:
1) send syn message
2) peer node: receive syn message, send back ack message
3) process ack message in handle_ack_msg
apply_state_locally
mark_alive
send_gossip_echo
handle_major_state_change
on_restart
mark_alive
send_gossip_echo
mark_dead
on_dead
on_join
apply_new_states
do_on_change_notifications
on_change
4) send back ack2 message
5) peer node: process ack2 message
apply_state_locally
At the moment, syn is "wait" message, it times out in 3 seconds. In step
3, all the registered gossip callbacks are called which might take
significant amount of time to complete.
In order to reduce the gossip round latency, we make syn "no-wait" and
do not run the handle_ack_msg insdie the gossip::run(). As a result, we
will not get a ack message as the return value of a syn message any
more, so a GOSSIP_DIGEST_ACK message verb is introduced.
With this patch, the gossip message exchange is now async. It is useful
when some nodes are down in the cluster. We will not delay the gossip
round, which is supposed to run every second, 3*n seconds (n = 1-3,
since it talks to 1-3 peer nodes in each gossip round) or even
longer (considering the time to run gossip callbacks).
Later, we can make talking to the 1-3 peer nodes in parallel to reduce
latency even more.
Refs: #900
We will soon switch to use no-wait message for gossip. GOSSIP_DIGEST_SYN
will no longer return GOSSIP_DIGEST_ACK message. So we need a standalone
verb for GOSSIP_DIGEST_ACK.
The problem is we initialize _last_interpret when failure_detector
object is constructed. When interpret() runs for the first time, the
_last_interpret value is not the last time we run interpret() but the
time we initialize failure_detector object.
Fix by initializing _last_interpret inside interpret().
[Thu Feb 18 02:40:04 2016] INFO [shard 0] storage_service - Node 127.0.0.1 state jump to normal
[Thu Feb 18 02:40:04 2016] INFO [shard 0] storage_service - NORMAL: node is now in normal status
[Thu Feb 18 02:40:04 2016] INFO [shard 0] gossip - Waiting for gossip to settle before accepting client requests...
[Thu Feb 18 02:40:12 2016] INFO [shard 0] gossip - No gossip backlog; proceeding
Starting listening for CQL clients on 127.0.0.1:9042...
[Thu Feb 18 02:40:12 2016] INFO [shard 0] gossip - Node 127.0.0.2 is now part of the cluster
[Thu Feb 18 02:40:12 2016] INFO [shard 0] gossip - InetAddress 127.0.0.2 is now UP
[Thu Feb 18 02:40:13 2016] INFO [shard 0] gossip - do_gossip_to_live_member: Favor newly added node 127.0.0.2
[Thu Feb 18 02:40:13 2016] WARN [shard 0] failure_detector - Not marking nodes down due to local pause of 9091 > 5000 (milliseconds)
"Add scylla-io-setup.service to configure max-io-requests and num-io-queues on first boot.
Moved SCYLLA_IO configuration code from scylla_sysconfig_setup to scylla-io-setup.service, revert commits related it.
On scylla-io-setup.service, autodetect Amazon EC2 instead of using AMI variable on sysconfig."
While serialization vector it is sometimes required to rollback some of
the serialized elements.
vector_position is the equivalent to the bytes_ostream position struct.
It holds information about the current position in a serialized vector,
the position in the bufffer and the current number of elements
serialized.
It will allow to rollback to the current point.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1456041750-1505-2-git-send-email-amnon@scylladb.com>
Fixes#927.
The new visiting code builds cell instances using
atomic_cell::make_*() factory methods, which won't work in LSA context
because they depend on managed_bytes storage to be linearized. It may
not be since large blob support. This worked before because we created
cells from views before which works in all contexts.
Fix by constructing them in standard allocator context.
Message-Id: <1456234064-13608-2-git-send-email-tgrabiec@scylladb.com>
The line:
boost::apply_visitor(atomic_cell_or_collection_visitor(std::move(visitor), id, col), cell);
is executed in a loop, so visitor could be used after being
moved-from. This may not always be allowed for some visitors. Also,
vistors may keep state, which should be preserved for the whole
visitation.
This doesn't fix any issue right now.
Message-Id: <1456234064-13608-1-git-send-email-tgrabiec@scylladb.com>
The 'clear' function explicitly clears the screen and repaints it which
causes really annoying flicker. Use 'erase' to make scyllatop more
pleasant on the eyes.
Message-Id: <1456229348-2194-1-git-send-email-penberg@scylladb.com>
It is often the case that the there is useful debugging information
printed by the test before it hangs. It is annoying to see just "TIMED
OUT" in jenkins. Print the output always when it is available.
In addition to that, we should not interpret all exceptions thrown
from communicate() as timeouts. For example, currently ^C sent to the
script misleadingly results in "TIMED OUT" to be printed.
Message-Id: <1456174992-21909-1-git-send-email-tgrabiec@scylladb.com>
Every native scalar function is already tagged whether they're pure or
not but because we don't implement the is_pure() function, all functions
end up being advertised as pure. This means that functions like now()
that are *not* pure, end up being evaluated only once.
Fixes#571.
Message-Id: <1456227171-461-1-git-send-email-penberg@scylladb.com>
In same cases we may have a lot of empty partitions whose tombstones
expired, and there is no point in including them in the results.
This was found to cause performance issues for workloads using batch
updates. system.batchlog table would accumulate a lot of deletes over
time. It has gc_grace_seconds set to 0 so most of the tombstones would
be expired. mutation queries done by batchlog manager were still
returning all partitions present in memtables which caused mutation
queries result to be inflated. This in turn was causing
mutation_result_merger to take a long time to process them.
We don't support the 'CREATE TYPE' statement for now. The user-visible
error message, however, is unreadable because our CQL parser doesn't
even recognize the statement.
cqlsh:ks1> CREATE TYPE config (url text);
SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message=" : cannot match to any predicted input...
Implement just enough of 'CREATE TYPE' parsing to be able to report a
human readable error message if someone tries to execute such
statements:
cqlsh:ks1> CREATE TYPE config (url text);
ServerError: <ErrorMessage code=0000 [Server error] message="User-defined types are not supported yet">
Message-Id: <1456148719-9473-2-git-send-email-penberg@scylladb.com>
As explained in commit 0ff0c55 ("transport: server: 'short' should be
unsigned"), "short" type is always unsigned in the CQL binary protocol.
Therefore, drop the read_unsigned_short() variant altogether and just
use read_short() everywhere.
Message-Id: <1456133171-1433-1-git-send-email-penberg@scylladb.com>
This may result in errors during reading like the following one:
runtime error: Unexpected marker. Found k, expected \x01\n)'
The error above happened when executing limits.py:max_key_length_test dtest.
After change the exception will happen during writing and will be clearer.
Refs #807.
This patch doesn't deal with the problem of ensuring that we will
never hit those errors, which is very desirable. We shouldn't ack a
write if we can't persist it to sstables.
Message-Id: <1456130045-2364-1-git-send-email-tgrabiec@scylladb.com>
Change the name used with class_registrator from "EverywhereReplicationStrategy"
(used in the initial patch from CASSANDRA-826 JIRA) to "EverywhereStrategy"
as it is in the current DCE code.
With this change one will be able to create an instance of
everywhere_replication_strategy class by giving either
an "org.apache.cassandra.locator.EverywhereStrategy" (full name) or
an "EverywhereStrategy" (short name) as a replication strategy name.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1456081258-937-1-git-send-email-vladz@cloudius-systems.com>
This strategy would ignore an RF configuration and would
always try to replicate on all cluster nodes.
This means that its get_replication_factor() would return a
number of currently "known" nodes in the cluster and
if a cluster is currently bootstrapping this value obviously may
change in time for the same key. Therefore using this strategy
should be done with caution.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1456074333-15014-3-git-send-email-vladz@cloudius-systems.com>
Return a number of currently known endpoints when
it's needed in a fast path flow.
Calling a get_all_endpoints().size() for that matter
would not be fast enough because of the unordered_set->vector transformation
we don't need.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1456074333-15014-2-git-send-email-vladz@cloudius-systems.com>
Current algorithm is O(N^2) where N is the column count. This causes
limits.py:TestLimits.max_columns_and_query_parameters_test to timeout
because CREATE TABLE statement takes too long.
This change replaces it with an algorithm of O(N)
complexity. _defined_names are already sorted so if any duplicates
exist, they must be next to each other.
Message-Id: <1456058447-5080-1-git-send-email-tgrabiec@scylladb.com>
When find_uuid() fails Scylla would terminate with:
Exiting on unhandled exception of type 'std::out_of_range': _Map_base::at
But we are supposed to ignore directories for unknown column
families. The try {} catch block is doing just that when
no_such_column_family is thrown from the find_column_family() call
which follows find_uuid(). Fix by converting std::out_of_range to
no_such_column_family.
Message-Id: <1456056280-3933-1-git-send-email-tgrabiec@scylladb.com>
When there is a lot of chunks we may get stack overflow.
This seems to fix issue #906, a memory corruption during schema
merge. I suspect that what causes corruption there is overflowing of
the stack allocated for the seastar thread. Those stacks don't have
red zones which would catch overflow.
Message-Id: <1456056288-3983-1-git-send-email-tgrabiec@scylladb.com>
before:
^CINFO [shard 0] compaction_manager - Asked to stop
INFO [shard 0] compaction_manager - compaction task handler stopped due to shutdown
INFO [shard 0] compaction_manager - compaction task handler stopped due to shutdown
INFO [shard 1] compaction_manager - Asked to stop
INFO [shard 2] compaction_manager - Asked to stop
INFO [shard 1] compaction_manager - compaction task handler stopped due to shutdown
INFO [shard 2] compaction_manager - compaction task handler stopped due to shutdown
INFO [shard 3] compaction_manager - Asked to stop
INFO [shard 1] compaction_manager - compaction task handler stopped due to shutdown
INFO [shard 2] compaction_manager - compaction task handler stopped due to shutdown
INFO [shard 3] compaction_manager - compaction task handler stopped due to shutdown
INFO [shard 3] compaction_manager - compaction task handler stopped due to shutdown
after:
^CINFO [shard 0] compaction_manager - Asked to stop
INFO [shard 0] compaction_manager - Stopped
INFO [shard 1] compaction_manager - Asked to stop
INFO [shard 2] compaction_manager - Asked to stop
INFO [shard 3] compaction_manager - Asked to stop
INFO [shard 1] compaction_manager - Stopped
INFO [shard 2] compaction_manager - Stopped
INFO [shard 3] compaction_manager - Stopped
`compaction_manager - compaction task handler stopped due to shutdown` is still printed
in debug level
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <535d5ad40102571a3d5d36257342827989e8f0f4.1455835407.git.raphaelsc@scylladb.com>
"This series switches mutation_partition_serializer, mutation_partition_view
and frozen_mutation to the IDL-based serialization format.
canonical_mutations and frozen_schemas are still not converted.
Quick test with 4 node ccm cluster and cassandra-stress doesn't show any
problem, unsurprisingly, as frozen_mutation_test obviously still passes."
Allows mixing data_output with other output stream like
seastar::simple_output_stream which is useful when switching to the new
IDL-based serializers.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
bytes can always be trivially converted to bytes_view. Conversion in the
other direction requires a copy.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch allows using both auto-generated serializers or writer-based
serialization for non-stub [[writable]] types.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This helps having C++ code properly indented both in the compiler source
code and in the auto-generated files.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Test auto-generated and writer-based serialization as well as
deserialization of simple compound type, vectors and variants.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
To implement nodetool's "--start-token"/"--end-token" feature, we need
to be able to repair only *part* of the ranges held by this node.
Our REST API already had a "ranges" option where the tool can list the
specific ranges to repair, but using this interface in the JMX
implementation is inconvenient, because it requires the *Java* code
to be able to intersect the given start/end token range with the actual
ranges held by the repaired node.
A more reasonable approach, which this patch uses, is to add new
"startToken"/"endToken" options to the repair's REST API. What these
options do is is to find the node's token ranges as usual, and only
then *intersect* them with the user-specified token range. The JMX
implementation becomes much simpler (in a separate patch for scylla-jmx)
and the real work is done in the C++ code, where it belongs, not in
Java code.
With the additional scylla-jmx patch to use the new REST API options
provided here, this fixes#917.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1455807739-25581-1-git-send-email-nyh@scylladb.com>
'nodetool cleanup' must wait for termination of cleanup, however,
cleanup is handled asynchronously. To solve that, a mechanism is
added here to wait for termination of a cleanup. This mechanism is
about using promise to notificate waiter of cleanup completion.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <6dc0a39170f3f51487fb8858eb443573548d8bce.1455655016.git.raphaelsc@scylladb.com>
"writers are used to stream objects programmatically and not from objects.
visitors views are used to retrieve information from serialized objects without
deserialize them entirely, but to skip to the position in the buffer with the
relevant information and deserialize only it."
This patch adds static assert to the generated code that verify that a
declare type in the idl matches the parameter type.
Accepted type ignores reference and const when comparison.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a writer object to classes in the idl.
It adds attribute support to classes. Writer will be created for classes
that are marked as writable.
For the writers, the code generator creates two kind of struct.
states, that holds the write state (manly the place holders for all
current objects, and vectors) and nodes that represent the current
position in the writing state machine.
To write an object create a writer:
For example creating a writer for mutation, if out is a bytes_ostream
writer_of_mutation w(out);
Views are used to read from buffer without deserialize an entire
object.
This patch adds view creation to the idl-compiler. For each view a
read_size function is created that will be used when skipping through
buffers.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The skip template function is used when skipping data types.
By default is uses a deserializer to calculate the size.
A specific implementation save unneeded deserialization. For fix sized
object the skip function would become an expression allowing the
compiler to drop the function altogether.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The serialization_visitors.hh contains helper classes for the readers and writers
visitors class.
place_holder is a wrapper around bytes_stream place holder.
frame is used to store size in bytes.
empty_frame is used with final object (that do not store their size)
from the code that uses it, it looks the same, but in practice it does
not store any data.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Reader and writer can use the bytes_ostream as a raw bytes stream,
handling the bytes encoding and streaming on their own.
To fully support this functionality, place holder should support it as
well.
This patch adds a get_stream method that return a simple_output_stream
writer can use it using their own serialization function.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a stub support for boost::variant. Currently variant are
not serialized, this is added just so non stub classes will be able to
compile.
deserialize for chrono::time_point and deserializer for chrono::duration
unknown variant:
Planning for situations where variant could be expanded, there may be
situation that a variant will return an unknown value.
In those cases the data and index will be paseed to the reader, that
can decide what to do with it.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
* dist/ami/files/scylla-ami b3b85be...398b1aa (3):
> Import AMI initialization code from scylla-server repo
> Use long options on scylla_raid_setup and scylla_sysconfig_setup
> Wait more longer to finishing AMI setup
* seastar b25a958...1bbb02f (6):
> native-stack: fix arp request missing under loopback connection
> apps: iotune: fix compilation with g++ 4.9
> simple-stream: Add copy constructor
> tcp: don't need to choose another core since only one core
> Merge "Fix undefined behaviors related to reactor shutdown" from Tomasz
> rpc: do not wait for data to be send before reporting timeout
From Avi:
This patchset introduces a linearization context for managed_bytes objects.
Within this context, any scattered managed_bytes (found only in lsa regions,
so limited to memtable and cache) are auto-linearized for the lifetime of
the context. This ensures that key and value lookups can use fast
contiguous iterators instead of using slow discontiguous iterators (or
crashing, as is the case now).
To avoid scattered keys (and values, though those are already protected)
from being accessed, run the update procedure in a managed_bytes linearization
context.
Fixes#807.
Avi says:
"Something like unordered_set<unsigned long> is error prone, because ints
tend to mix up (also, need to use a sized type, unsigned long varies among
machines)."
With that in mind, it's better if we keep track of compacting sstables in
a unordered_set<shared_sstable>.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <249f0fd4cfcf786cf3c37a79978f7743d07f48ad.1455120811.git.raphaelsc@scylladb.com>
* seastar 353b1a1...0f759f0 (11):
> tutorial: add a link to future API documentation
> sleep: document
> tutorial: fix typos
> gate: add check() method
> tutorial: introduce seastar::gate
> doc: explain how to test the native stack without dpdk
> doc: separate the mini-tutorial into its own file
> doc: move DPDK build instructions to its own file
> doc: split building instructions into separate files
> doc: fix/modernize git commands in contributing.md
> doc: how-to on contributing & guidelines
We want the format of query results to be eventually defined in the
IDL and be independent of the format we use in memory to represent
collections. This change is a step in this direction.
The change decouples format of collection cells in query results from
our in-memory representation. We currently use collection_mutation_view,
after the change we will use CQL binary protocol format. We use that because
it requires less transformations on the coordinator side.
One complication is that some list operations need to retrieve keys
used in list cells, not only values. To satisfy this need, new query
option was added called "collections_as_maps" which will cause lists
and sets to be reinterpreted as maps matching their underlying
representation. This allows the coordinator to generate mutations
referencing existing items in lists.
List operations and prefetching were not handling static columns
correctly. One issue was that prefetching was attaching static column
data to row data using ids which might overlap with clustered columns.
Another problem was that list operations were always constructing
clustering key even if they worked on a static column. For static
columns the key would be always empty and lookup would fail.
The effect was that list operations which depend on curent state had
no effect. Similar problem could be observed on C* 2.1.9, but not on 2.2.3.
Fixes#903.
This puts knowledge about which cql_serialization_formats have the
same collection format into one place,
cql_serialization_format::collection_format_unchanged().
The validation was wrongly assuming that empty thrift key, for which
the original C* code guards against, can only correspond to empty
representation of our partition_key. This no longer holds after:
commit 095efd01d6
"keys: Make from_exploded() and components() work without schema"
This was responsible for dtest failure:
cql_additional_tests.TestCQL:column_name_validation_test
serialize() and from_bytes() is a low level interface, which in this
case can be replaced with a partition_key static factory method
resulting in cleaner code.
This reverts commit dadd097f9c.
That commit caused serialized forms of varint and decimal to have some
excess leading zeros. They didn't affect deserialization in any way but
caused computed tokens to differ from the Cassandra ones.
Fixes#898.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1455537278-20106-1-git-send-email-pdziepak@scylladb.com>
"Fixes #884Fixes#895
Also at seastar-dev: calle/truncate_more
1.) Change truncation records to be stored with IDL serialization
2.) Fix db::serializers encoding of replay_position
3.) Detect attempted reading of Origin truncation records, and instead
of crashing, ignore and warn.
4.) Change truncation time stamps to be generated per-shard, _after_
CF flush is done, otherwise data in memtables at flush would be
retained/replayed on next start. Retain the highest time stamp
generated.
Note for (3): This patch set does _not_ clear out origin records
automatically. This because I feel that is a somewhat drastic and
irreversible thing to do. If we want to avail the user of a means
to get rid of the (3) warning, we should probably tell him to either
use cqlsh, or add an API call for this, so he can do it explicitly.
"
When shutting down a node gracefully, this patch asks all ongoing repairs
started on this node to stop as soon as possible (without completing
their work), and then waits for these repairs to finish (with failure,
usually, because they didn't complete).
We need to do this, because if the repair loop continues to run while we
start destructing the various services it relies on, it can crash (as
reported in #699, although the specific crash reported there no longer
occurs after some changes in the streaming code). Additionally, it is
important that to stop the ongoing repair, and not wait for it to complete
its normal operation, because that can take a very long time, and shutdown
is supposed to not take more than a few seconds.
Fixes#699.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1455218873-6201-1-git-send-email-nyh@scylladb.com>
We can't move-from in the loop because the subject will be empty in
all but the first iteration.
Fixes crash during node stratup:
"Exiting on unhandled exception of type 'runtime_exception': runtime error: Invalid token. Should have size 8, has size 0"
Fixes update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_node_1_test (and probably others)
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
* seastar 14c9991...353b1a1 (2):
> scripts: posix_net_conf.sh: Change the way we learn NIC's IRQ numbers
> gate: protect against calling close() more than once
When scylla stopped an ongoing compaction, the event was reported
as an error. This patch introduces a specialized exception for
compaction stop so that the event can be handled appropriately.
Before:
ERROR [shard 0] compaction_manager - compaction failed: read exception:
std::runtime_error (Compaction for keyspace1/standard1 was deliberately
stopped.)
After:
INFO [shard 0] compaction_manager - compaction info: Compaction for
keyspace1/standard1 was stopped due to shutdown.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <1f85d4e5c24d23a1b4e7e0370a2cffc97cbc6d44.1455034236.git.raphaelsc@scylladb.com>
"This series changes the on-wire definitions of keys to be of the following form:
class partition_key {
std::vector<bytes> exploded();
};
Keys are therefore collections of components. The components are serialized according
to the format specified in the CQL binary protocol. No bit depends now on how we store keys in memory.
Constructing keys from components currently requires a schema reference,
which makes it not possible to deserialize or serialize the keys automatically
by RPC. To avoid those complications, compound_type was changed so that
it can be constructed and components can be iterated over without schema.
Because of this, partition_key size increased by 2 bytes."
For simplicity, we want to have keys serializable and deserializable
without schema for now. We will serialize keys in a generic form of a
vector of components where the format of components is specified by
CQL binary protocol. So conversion between keys and vector of
components needs to be possible to do without schema.
We may want to make keys schema-dependent back in the future to apply
space optimizations specific to column types. Existing code should
still pass schema& to construct and access the key when possible.
One optimization had to be reverted in this change - avoidance of
storing key length (2 bytes) for single-component partition keys. One
consequence of this, in addition to a bit larger keys, is that we can
no longer avoid copy when constructing single-component partition keys
from a ready "bytes" object.
I haven't noticed any significant performance difference in:
tests/perf/perf_simple_query -c1 --write
It does ~130K tps on my machine.
Like we did in commit d54c77d5d0,
make the remaining functions in abstract_replication_strategy return
non-wrap-around ranges.
This fixes:
ERROR [shard 0] stream_session - [Stream #f0b7fda0-cf3e-11e5-b6c4-000000000000]
stream_transfer_task: Fail to send to 127.0.0.4:0: std::runtime_error (Not implemented: WRAP_AROUND)
in streaming.
Message-Id: <514d2a9a1d3b868d213464c8858ac5162c0338d8.1455093643.git.asias@scylladb.com>
The conversion to bytes_view can fail if the key is scattered; so defer that
conversion until later. In a later patch we will intervene before the
conversion to ensure the data is linearized.
Using a partition_key_view can save an allocation in some cases. We will
make use of it when we linearize a partition_key; during the process we
are given a simple byte pointer, and constructing a partition_key from that
requires an allocation.
A large managed_bytes blob can be scattered in lsa memory. Usually this is
fine, but someone we want to examine it in place without copying it out, but
using contiguous iterators for efficiency.
For this use case, introduce with_linearized_managed_bytes(Func),
which runs a function in a "linearization context". Within the linearization
context, reads of managed_bytes object will see temporarily linearized copies
instead of scattered data.
Fixes#884
Time stamps for truncation must be generated after flush, either by
splitting the truncate into two (or more) for-each-shard operations,
or simply by doing time stamping per shard (this solution).
We generate TS on each shard after flushing, and then rely on the
actual stored value to be the highest time point generated.
This should however, from batch replay point of view, be functionally
equivalent. And not a problem.
Since the table is written from all shards, and we possibly might
have conflicting time stamps, we define the trucated_at time
as the highest time point. I.e. conservative.
Truncation records are not portable between us and Origin.
We need to detect and ensure we neither try to use, and more to the
point, don't crash because of data format error when loading, origin
records from a migrated system.
This problem was seen by Tzach when doing a migration from an origin
setup.
Updated record storage to use IDL-serialized types + added versioning
and magic marking + odd-size-checking to ensure we load only correct
data. The code will also deal with records from an older version of
scylla.
gcc 4.9 complains about the type{ val, val } construction of
type with implicit default constructor, i.e. member = initial
declarations. gcc 5 does not (and possibly rightly so).
However, we still (implicitly) claim to support gcc 4.9 so
why not just change this particular instance.
Message-Id: <1454921328-1106-1-git-send-email-calle@scylladb.com>
connection::_pending_requests_gate is responsible for keeping connection
objects alive as long as there are outstanding requests and is closed
in connection::proccess() when needed. Closing it in connection::shutdown()
as well may cause the gate to be closed twice what is a bug.
Fixes#690.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1454596390-23239-1-git-send-email-pdziepak@scylladb.com>
While pkgconfig is supposed to be a distribution and version neutral way
of detecting packages, it doesn't always work this way. The sd_notify()
manual page documents that sd_notify is available via the libsystemd
package, but on centos 7.0 it is only available via the libsystemd-daemon
package (on centos 7.1+ it works as expected).
Fix by allowing for alternate version of package names, testing each one
until a match is found.
Fixes#879.
Message-Id: <1454858862-5239-1-git-send-email-avi@scylladb.com>
Currently, only the shard where the stream_plan is created on will send
streaing mutations. To utilize all the available cores, we can make each
shard send mutations which it is responsbile for. On the receiver side,
we do not forward the mutations to the shard where the stream_session is
created, so that we can avoid unnecessary forwarding.
Note: the downside is that it is now harder to:
1) to track number of bytes sent and received
2) to update the keep alive timer upon receive of the STREAM_MUTATION
To fix, we now store the sent/recieved bytes info on all shards. When
the keep alive timer expires, we check if any progress has been made.
Hopefully, this patch will make the streaming much faster and in turn
make the repair/decommission/adding a node faster.
Refs: https://github.com/scylladb/scylla/issues/849
Tested with decommission/repair dtest.
Message-Id: <96b419ab11b736a297edd54a0b455ffdc2511ac5.1454645370.git.asias@scylladb.com>
The is_reversed function uses a variable length array, which isn't
spec-abiding C++. Additionally, the Clang compiler doesn't allow them
with non-POD types, so this function wouldn't compile.
After reading through the function it seems that the array wasn't
necessary as the check could be calculated inline rather than
separately. This version should be more performant (since it no longer
requires the VLA lookup performance hit) while taking up less memory in
all but the smallest of edge-cases (when the clustering_key_size *
sizeof(optional<bool>) < sizeof(size_type) - sizeof(uint32_t) +
sizeof(bool).
This patch uses relation_order_unsupported it assure that the exception
order is consistent with the preivous version. The throw would
otherwise be moved into the initial for-loop.
There are two derrivations in behavior:
The first is the initial assert. It however should not change the apparent
behavior besides causing orderings() to be looked up 2x in debug
situations.
The second is the conversion of is_reversed_ from an optional to a bool.
The result is that the final return value is now well-defined to be
false in the release-condition where orderings().size() == 0, rather
than be the ill-defined *is_reversed_ that was there previously.
Signed-off-by: Erich Keane <erich.keane@verizon.net>
Message-Id: <1454546285-16076-4-git-send-email-erich.keane@verizon.net>
Clang enforces that a union's constexpr CTOR must initialize
one of the members. The spec is seemingly silent as to what
the rule on this is, however, making this non-constexpr results in clang
accepting the constructor.
Signed-off-by: Erich Keane <erich.keane@verizon.net>
Message-Id: <1454604300-1673-1-git-send-email-erich.keane@verizon.net>
PHI_FACTOR is a constexpr variable that is defined using std::log.
Though G++ has a constexpr version of std::log, this itself is not spec
complaint (in fact, Clang enforces this). See C++ Spec 26.8 for the
definition of std::log and 17.6.5.6 for the rule regarding adding
constexpr where it isn't specified.
This patch replaces the std::log statement with a version from math.h
that contains the exact value (M_LOG10El).
Signed-off-by: Erich Keane <erich.keane@verizon.net>
Message-Id: <1454603285-32677-1-git-send-email-erich.keane@verizon.net>
Array of integral types on little endian machine can be memcpyed into/out
of a buffer instead of serialized/deserialized element by element.
Message-Id: <20160204155425.GC6705@scylladb.com>
It is much easier to see what is going on this way otherwise graphs for
bg mutations and overall mutations are very close with usual scaling for
many workloads.
Message-Id: <20160204083452.GH6705@scylladb.com>
Refs #860
Refs #802
An sstable file set with any component missing is interpreted as a
critical error during boot. Currently sstable removal procedure could
leave the files in a non-bootable state if the process crashed after
TOC was removed but before all components were removed as well.
To solve this problem, start the removal by renaming the TOC file to a
so called "temporary TOC". Upon boot such kind of TOC file is
interpreted as an sstable which is safe to remove. This kind of TOC
was added before to deal with a similar scenario but in the opposite
direction - when writing a new sstable.
Fixes#868.
Registerring exit hooks while reactor is already iterating over exit
hooks is not allowed and currently leads to undefined behavior
observed in #868. While we should make the failure more user friendly,
registering exit hooks concurrently with shutdown will not be allowed.
We don't expect exit hooks to be registered after exit starts because
this would violate the guarantee which says that exit hooks are
executed in reverse order of registration. Starting exit sequence in
the middle of initialization sequence would result in use after free
errors. Btw, I'm not sure if currently there's anything which prevents
this
To solve this problem, move the exit hook to initilization
sequence. In case of tests, the cleanup has to be called explicitly.
Fixes#563.
Refs #584
CQLv2 encodes batch query_options in v1 format, not v2+.
CQLv1 otoh has no batch support at all.
Make read_options use explicit version format if needed.
v2: Ensure we preserve cql protocol version in query_opts
Message-Id: <1454514510-21706-1-git-send-email-calle@scylladb.com>
Currently, we wait for ongoing compaction during shutdown, but
that may take 'forever' if compacting huge sstables with a slow
disk. Compaction of huge sstables will take a considerable amount
of time even with fast disks. Therefore, all ongoing compaction
should be stopped during shutdown.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <3370f17ce4274df417ea60651f33fc5d4de91199.1454441286.git.raphaelsc@scylladb.com>
Task is stopped by closing gate and forcing it to exit via gate
exception. The problem is that task->compacting_cf may be set to
the column family being compacted, and compaction_manager::remove
would see it and try to stop the same task again, which would
lead to problems. The fix is to clean task->compacting_cf when
stopping task.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <3473e93c1a107a619322769d65fa020529b5501b.1454441286.git.raphaelsc@scylladb.com>
The problem is that on the follower side, we set up _session_info too
late, after received PREPARE_DONE_MESSAGE message. The initiator can
send STREAM_MUTATION before sending PREPARE_DONE_MESSAGE message.
To fix, we set up _session_info after we received the prepare_message on
both initiator and follower.
Fixes#869
scylla: streaming/session_info.cc:44: void
streaming::session_info::update_progress(streaming::progress_info):
Assertion `peer == new_progress.peer' failed.
Message-Id: <6d945ba1e8c4fc0949c3f0a72800c9448ba27761.1454476876.git.asias@scylladb.com>
Change the partition_checksum structure to be better suited for the
new serializers:
1. Use std::array<> instead of a C array, as the latters are not
supported by the new serializers.
2. Use an array of 32 bytes, instead of 4 8-byte integers. This will
guarantee that no byte-swapping monkey-business will be done on
these checksums.
The checksum XOR and equality-checking methods still temporarily
cast the bytes to 8-byte chunks, for (hopefully) better performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1454364900-3076-1-git-send-email-nyh@scylladb.com>
Seems like boost::intrusive::set<>::comp() is not accessible on some
versions of boost. Replace by the equivalent
boost::intrusive::set<>::key_comp().
Fixes#858.
Message-Id: <1454326483-29780-1-git-send-email-avi@scylladb.com>
"Make the streaming use standalone tcp connection and send more mutations in
parallel.
It is supposed to help: "Decommission not fully utilizing hardware #849""
The idea behind the current 10 stream_mutations per core limitation is
to avoid streaming overwhelms the TCP connection and starves normal cql
verbs if the streaming mutations are big and takes long time to
complete.
Now that we use a standalone connection for streaming verbs, we can
increase the limitation.
Hopefully, this will fix#849.
In streaming, the amount of data needs to be streamed to peer nodes
might be large.
In order to avoid the streaming overwhelms the TCP connection used by
user CQL verbs and starves the user CQL queries, we use a standalone TCP
connection for streaming verbs.
* dist/ami/files/scylla-ami e284bcd...b2724be (2):
> Revert "Run scylla.yaml construction only once"
> Move AMI dependent part of scylla_prepare to scylla-ami-setup.service
"This moves AMI dependent part of scylla_prepare to scylla-ami repo, make it scylla-ami-setup.service which is independent systemd unit.
Also, it stopped calling scylla_sysconfig_setup on scylla_setup (called on AMI creation time), call it from scylla-ami-setup instead."
Install scylla-ami-setup.service, stop calling scylla_sysconfig_setup on AMI.
scylla-ami-setup.service will call it instead.
Only works with scylla-ami fix.
Fixes#857
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
It actually doesn't called unconditionally now, (since $LOCAL_PKG is always empty) so we can safely remove this.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
To simplify streaming verb handler.
- Use get_session instead of open coded logic to get get_coordinator and
stream_session in all the verb handlers
- Use throw instead of assert for error handling
- init_receiving_side now returns a shared_ptr<stream_result_future>
It is 1:1 mapping between session_info and stream_session. Putting
session_info inside stream_session, we can get rid of the
stream_coordinator::host_streaming_data class.
Generated sstables may imply either fully or partially written.
Compaction is interrupted if it was deriberately asked to stop (stop API)
or it was forced to do so in event of a failure, ex: out of disk space.
There is a need to explicitly delete sstables generated by a compaction
that was interrupted. Otherwise, such sstables will waste disk space and
even worsen read performance, which degrades as number of generations
to look at increases.
Fixes#852.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <49212dbf485598ae839c8e174e28299f7127f63e.1453912119.git.raphaelsc@scylladb.com>
storage_service::get_local_ranges returns sorted ranges, which are
not overlapping nor wrap-around. As a result, there is no need for
the consumer to do anything.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
supervisor_notify() calls periodically, to log message on systemd.
So raise(SIGSTOP) will called multiple times, upstart doesn't expected that.
We need to call it just one time.
Fixes#846
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Upstart job able to specify ulimit like systemd, so drop ubuntu's scylla_run and merge with redhat one.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Not all the idls are used by the messaging service, this patch removes
the auto-generated single include file that holds all the files and
replaes it with individual include of the generated fiels.
The patch does the following:
* It removes from the auto-generated inc file and clean the configure.py
from it.
* It places an explicit include for each generated file in
messaging_serivce.
* It add dependency of the generated code in the idl-compiler, so a
change in the compiler will trigger recreation of the generated files.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1453900241-13053-1-git-send-email-amnon@scylladb.com>
I saw the following Boost format string related warning during commitlog
replay:
INFO [shard 0] commitlog_replayer - Replaying node3/commitlog/CommitLog-1-72057594289748293.log, node3/commitlog/CommitLog-1-90071992799230277.log, node3/commitlog/CommitLog-1-108086391308712261.log, node3/commitlog/CommitLog-1-251820357.log, node3/commitlog/CommitLog-1-54043195780266309.log, node3/commitlog/CommitLog-1-36028797270784325.log, node3/commitlog/CommitLog-1-126100789818194245.log, node3/commitlog/CommitLog-1-18014398761302341.log, node3/commitlog/CommitLog-1-126100789818194246.log, node3/commitlog/CommitLog-1-251820358.log, node3/commitlog/CommitLog-1-18014398761302342.log, node3/commitlog/CommitLog-1-36028797270784326.log, node3/commitlog/CommitLog-1-54043195780266310.log, node3/commitlog/CommitLog-1-72057594289748294.log, node3/commitlog/CommitLog-1-90071992799230278.log, node3/commitlog/CommitLog-1-108086391308712262.log
WARN [shard 0] commitlog_replayer - error replaying: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::io::too_many_args> > (boost::too_many_args: format-string referred to less arguments than were passed)
While inspecting the code, I noticed that one of the error loggers is
missing an argument. As I don't know how the original failure triggered,
I wasn't able to verify that that was the only one, though.
Message-Id: <1453893301-23128-1-git-send-email-penberg@scylladb.com>
Any stream, no matter initialized by us or initialized by a peer node,
can send and receive data. We should audit incoming/outgoing bytes in
the all streams.
The main motivation behind this change is to make get_ranges() easier for
consumers to work with the returned ranges, e.g. binary search to find a
range in which a token is contained. In addition, a wrap-around range
introduces corner cases, so we should avoid it altogether.
Suppose that a node owns three tokens: -5, 6, 8
get_ranges() would return the following ranges:
(8, -5], (-5, 6], (6, 8]
get_ranges() will now return the following ranges:
(-inf, -5], (-5, 6], (6, 8], (8, +inf)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <4bda1428d1ebbe7c8af25aa65119edc5b97bc2eb.1453827605.git.raphaelsc@scylladb.com>
We register engine().at_exit() callbacks when we initialize the services. We
do not really call the callbacks at the moment due to #293.
It is pretty hard to see the whole picture in which order the services
are shutdown. Instead of for each services to register a at_exit()
callbacks, I proposal to have a single at_exit() callback which do the
shutdown for all the services. In cassandra, the shutdown work is done
in storage_service::drain_on_shutdown callbacks.
In this patch, the drain_on_shutdown is executed during shutdown.
As a result, the proper gossip shutdown is executed and fixes#790.
With this patch, when Ctrl-C on a node, it looks like:
INFO [shard 0] storage_service - Drain on shutdown: starts
INFO [shard 0] gossip - Announcing shutdown
INFO [shard 0] storage_service - Node 127.0.0.1 state jump to normal
INFO [shard 0] storage_service - Drain on shutdown: stop_gossiping done
INFO [shard 0] storage_service - CQL server stopped
INFO [shard 0] storage_service - Drain on shutdown: shutdown rpc and cql server done
INFO [shard 0] storage_service - Drain on shutdown: shutdown messaging_service done
INFO [shard 0] storage_service - Drain on shutdown: flush column_families done
INFO [shard 0] storage_service - Drain on shutdown: shutdown commitlog done
INFO [shard 0] storage_service - Drain on shutdown: done
Time a node waits after sending gossip shutdown message in milliseconds.
Reduces ./cql_query_test execution time
from
real 2m24.272s
user 0m8.339s
sys 0m10.556s
to
real 1m17.765s
user 0m3.698s
sys 0m11.578
row_cache::update() does not explicitly invalidate the entries it failed
to update in case of a failure. This could lead to inconsistency between
row cache and sstables.
In paractice that's not a problem because before row_cache::update()
fails it will cause all entries in the cache to be invalidated during
memory reclaim, but it's better to be safe and explicitly remove entries
that should be updated but it was not possible to do so.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1453829681-29239-1-git-send-email-pdziepak@scylladb.com>
The API needs to be available at an early stage of the initialization,
on the other hand not all the specific APIs are available at that time.
This patch breaks the API initialization into stages, in each stage
additional commands will be available.
While setting that the api header files was broken into api_init.hh that
is relevent to the main and to api.hh which holds the different
api helper functions.
Fixes#754
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1453822331-16729-2-git-send-email-amnon@scylladb.com>
Last series accidently broke batch mode.
With new, fancy, potentitally blocking ways, we need to treat
batch mode differently, since in this case, sync should always
come _after_ alloc-write.
Previous patch caused infinite loop. Broke jenkins.
Message-Id: <1453821077-2385-1-git-send-email-calle@scylladb.com>
Also check closed status in allocate, since alloc queue waiting could
lead to us re-allocating in a segment that gets closed in between
queue enter and us running the continuation.
Message-Id: <1453811471-1858-1-git-send-email-calle@scylladb.com>
invocation of sstable_test "./test.py --name sstable_test --mode
release --jenkins a"
ran ... --log_sink=a.release.sstable_test -c1.boost.xml" which caused
the test to fail "with error code -11" fix that.
In addition boost test printout was bad fix that as well
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <3af8c4b55beae673270f5302822d7b9dbba18c0f.1453809032.git.shlomi@scylladb.com>
"Adds flush + write thresholds/limits that, when reached, causes
operations to wait before being issued.
Write ops waiting also causes further allocations to queue up,
i.e. limiting throughput.
Adds getters for some useful "backlog" measurements:
* Pending (ongoing) writes/flush
* Pending (queued, wating) allocations
* Num times write/flush threshold has been exceeded (i.e. waits occured)
* Finished, dirty segments
* Unused (preallocated) segments"
* seastar 97f418a...bdb273a (6):
> rpc: alias rpc::type to boost::type
> Fix warning_supported to properly work with Clang
> rpc: change 'overflow' to 'underflow' in input stream processing
> rpc: log an error that caused connection to be closed.
> rpc: clarify deserialization error message
> rpc: do not append new line in a logger
Configured on start (for now - and dummy values at that).
When shard write/flush count reaches limit, and incoming ops will queue
until previous ones finish.
Consequently, if an allocation op forces a write, which blocks, any
other incoming allocations will also queue up to provide back pressure.
"After the patch, all of our relevant I/O is placed on a specific priority class.
The ones which are not are left into the Seastar's default priority, which will
effectively work as an idle class.
Examples of such I/O are commitlog replay and initial SSTable loading. Since they
will happen during initialization, they will run uncontended, and do not justify
having a class on their own."
It is wrong to get a stream plan id like below:
utils::UUID plan_id = gms::get_local_gossiper().get_host_id(ep);
We should look at all stream_sessions with the peer in question.
There are only two messages: prepare_message and outgoing_file_message.
Actually only the prepare_message is the message we send on wire.
Flatten the namespace.
After this patch, our I/O operations will be tagged into a specific priority class.
The available classes are 5, and were defined in the previous patch:
1) memtable flush
2) commitlog writes
3) streaming mutation
4) SSTable compaction
5) CQL query
Signed-off-by: Glauber Costa <glauber@scylladb.com>
After the introduction of the Fair I/O Queueing mechanism in Seastar,
it is possible to add requests to a specific priority class, that will
end up being serviced fairly.
This patch introduces a Priority Manager service, that manages the priority
each class of request will get. At this moment, having a class for that may
sound like an overkill. However, the most interesting feature of the Fair I/O
queue comes from being able to adjust the priorities dynamically as workloads
changes: so we will benefit from having them all in the same place.
This is designed to behave like one of our services, with the exception that
it won't use the distributed interface. This is mainly because there is no
reason to introduce that complexity at this point - since we can do thread local
registration as we have been doing in Seastar, and because that would require us
to change most of our tests to start a new service.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
SSTables already have a priority argument wired to their read path. However,
most of our reads do not call that interface directly, but employ the services
of a mutation reader instead.
Some of those readers will be used to read through a mutation_source, and those
have to patched as well.
Right now, whenever we need to pass a class, we pass Seastar's default priority
class.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
All the SSTable read path can now take an io_priority. The public functions will
take a default parameter which is Seastar's default priority.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
All variants of write_component now take an io_priority. The public
interfaces are by default set to Seastar's default priority.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The only user for the default size is data_read, sitting at row.cc.
That reader wants to read and process a chunk all at once. So there's
really no reason to use the default buffer size - except that this code
is old.
We should do as we do in other single-key / single-range readers and
try to read all at once if possible, by looking at the size we received
as a parameter. Cleaning up the data_stream_at interface then comes as
a nice side effect.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Its definition as a lambda function is inconvenient, because it does not allow
us to use default values for parameters.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Its definition as a lambda function is inconvenient, because it does not allow
us to use default values for parameters.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the existing code, when we fail to reach one of the replicas of some
range being repaired, we would give up, and not continue to repair the
living replicas of this range. The thinking behind this was since the
repair should be considered failed anyway, there's no point in trying
to do a half-job better.
However, in a discussion I had with Shlomi, he raised the following
alternative thinking, which convinced me: In a large cluster, having
one node or another temporarily dead has a high probability. In that
case, even if the if the repair is doomed to be considered "failed",
we want it at least to do as much as it possibly can to repair the
data on the living part of the cluster. This is what this patch does:
If we can only reach some of the replicas of a given range, the repair
will be considered failed (as before), but we will still repair the
reachable replicas of this range, if they have different checksums.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1453724443-29320-1-git-send-email-nyh@scylladb.com>
Unlike streaming in c*, scylla does not need to open tcp connections in
streaming service for both incoming and outgoing messages, seastar::rpc
does the work. There is no need for a standalone stream_init_message
message in the streaming negotiation stage, we can merge the
stream_init_message into stream_prepare_message.
The helper is used only once in init_sending_side and in
init_receiving_side we do not use create_and_register to create
stream_result_future. Kill the trivial helper to make the code more
consistent.
In addition, rename variables "future" and "f" to sr (streaming_result).
So we have:
- init_sending_side
called when the node initiates a stream_session
- init_receiving_side
called when the node is a receiver of a stream_session initiated by a peer
In scylla, in each stream_coordinator, there will be only one
stream_session for each remote peer. Drop the code supporting multiple
stream_sessions in host_streaming_data.
We now have
shared_ptr<stream_session> _stream_session
instead of
std::map<int, shared_ptr<stream_session>> _stream_sessions
- from
We can get it form the rpc::client_info
- session_index
There will always be one session in stream_coordinator::host_streaming_data with a peer.
- is_for_outgoing
In cassandra, it initiates two tcp connections, one for incoming stream and one for outgoing stream.
logger.debug("[Stream #{}] Sending stream init for incoming stream", session.planId());
logger.debug("[Stream #{}] Sending stream init for outgoing stream", session.planId());
In scylla, it only initiates one "connection" for sending, the peer initiates another "connection" for receiving.
So, is_for_outgoing will also be true in scylla, we can drop it.
- keep_ss_table_level
In scylla, again, we stream mutations instead of sstable file. It is
not relevant to us.
- int connections_per_host
Scylla does not create connections per stream_session, instead it uses
rpc, thus connections_per_host is not relevant to scylla.
- bool keep_ss_table_level
- int repaired_at
Scylla does not stream sstable files. They are not relevant to scylla.
"This series:
- Add more debug info to stream session
- Fail session if we fail to send COMPLETE_MESSAGE
- Handle message retry logic for verbs used by streaming
See commit log for details."
"The series do the following:
It adds the code generation
Perform the needed changes in the current classes so each would have getter for
each of its serializable value and a constructor from the serialized values.
It adds a schema definition that cover gossip_diget_ack
It changes the messaging_service to use the generated code.
An overall explanation of the solution with a description of the schema IDL can
be found on the wiki page:
https://github.com/scylladb/scylla/wiki/Serializer-Deserializer-Code-generation
"
Serializer requires class to be defined, so it has to be in .h file. It
also does not support nested types yet, so move it outside of containing
class.
python3 needs to install pyparsing excplicitely. This adds the
installation of python3-pyparsing to the require dependencies in the
README.md
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
the serializer
This patch adds a specific template initialization so that the rpc would
use the serializer and deserializer that are auto-generated.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds the serializer and serializer_imple files. They holds
the functions that are not auto generated: primitives and templates (map
and vector) It also holds the include to the auto-generated code.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds rules and the idl schema to configure, which will call
the code generation to create the serialization and deserialization
functions.
There is also a rule to create the header file that include the auto
generated header files.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This is a definition example for gossip_digest_ack with all its sub
classes.
It can be used by the code generator to create the serializer and
deserializer functions.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
inet_address uses uint32_t to store the ip address, but its constructor
is int32_t.
So this patch adds such a constructor.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch contains two changes, it make the constructor with parameters
public. And it removes the dependency in messaging_service.hh from the
header file by moving some of the code to the .cc file.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The code generation takes a schema file and create two files from it,
one with a dist.hh extension containing the forward declarations and a second
with dist.impl.hh with the actual implementation.
Because the rpc uses templating for the input and output streams. The
generated functions are templates.
For each class, struct or enum, two functions are created:
serialize - that gets the output buffer as template parameter and
serialize the object to it. There must be a public way to get to each of
the parameters in the class (either a getter or the parameter should be
public)
deserialize - that gets an input buffer, and return the deserialize
object (and by reference the number of char it read).
To create the return object, the class must have a public constructor
with all of its parameters.
The solution description can be found here:
https://github.com/scylladb/scylla/wiki/Serializer-Deserializer-Code-generation
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When a verb timeout, if we resend the message again, the peer could receive
the message more than once. This would confuse the receiver. Currently, only
the streaming code use the retry logic.
- In case of rpc:timeout_error:
Instead of doing timeout in a relatively short time and resending a few
times, we make the timeout big enough and let the tcp to do the resend.
Thus, we can avoid resending the message more than once, of course, the
receiver would not receive the message more than once.
- In case of rpc::closed_error:
There are two cases:
1) Failing to establish a connection.
For instance, the peer is down. It is safe to resend since we know for
sure the receiver hasn't received the message yet.
2) The connection is established.
We can not figure out if the remote peer have received the message
already or not upon receiving the rpc::closed_error exception.
Currently, we still sleep & resend the message again, so the receiver
might receive the message more than once. We do not have better choice
in this case, if we want the resend to recover the sending error due to
temporary network issue, since failing the whole stream_session due to
failing to send a single message is not wise.
NOTE: If the duplicated message is received when the stream_session is done,
it will be ignored since it can not find the stream_manager anymore.
For message like, STREAM_MUTATION, it is ok to receive twice (we apply the
mutation twice).
TODO: For other messages which uses the retry logic, we need
to make sure it is ok to receive more than once.
- Add debug for the peer address info
- Add debug in stream_transfer_task and stream_receive_task
- Add debug when cancel the keep_alive timer
- Add debug for has_active_sessions in stream_result_future::maybe_complete
Compaction manager was initially created at utils because it was
more generic, and wasn't only intended for compaction.
It was more like a task handler based on futures, but now it's
only intended to manage compaction tasks, and thus should be
moved elsewhere. /sstables is where compaction code is located.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"This series moves the "backup" logic into the sstable::write_components()
methods, adds a support for enabling backup for sstables flushed in the
compaction flow (in addition to a regular flushing flow which had this support
already) and enables the "incremental_backups" configuration option."
I fixed up a merge conflict with commit 5e953b5 ("Merge "Add support to
stop ongoing compaction" from Raphael").
"stop compaction is about temporarily interrupting all ongoing compaction
of a given type.
That will also be needed for 'nodetool stop <compaction_type>'.
The test was about starting scylla, stressing it, stopping compaction using
the API and checking that scylla was able to recover.
Scylla will print a message as follow for each compaction that was stopped:
ERROR [shard 0] compaction_manager - compaction failed: read exception:
std::runtime_error (Compaction for keyspace1/standard1 was deliberately stopped.)
INFO [shard 0] compaction_manager - compaction task handler sleeping for 20 seconds"
Enable the incremental_backups/--incremental-backups option.
When enabled there will be a hard link created in the
<column family directory>/backup directory for every flushed
sstable.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Enable incremental backup when sstables are flushed if
incremental backup has been requested.
It has been enabled in the regular flushing flow before but
wasn't in the compaction flow.
This patch enables it in both places and does it using a
backup capability of sstable::write_components() method(s).
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
When 'backup' parameter is TRUE - create backup hard
links for a newly written sstables in <sstable dir>/backups/
subdirectory.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Only difference from previous sleep is that we will
explicitly delete the objects if the process terminates
before tasks are run. I.e. make ASas happier.
Message-Id: <1453295521-29580-1-git-send-email-calle@scylladb.com>
Arguments buffer_size and true were accidently inverted.
GCC wasn't complaning because implicit conversion of bool to
int, and vice-versa, is valid.
However, this conversion is not very safe because we could
accidentaly invert parameters.
This should fix the last problem with sstable_test.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <9478cd266006fdf8a7bd806f1c612ec9d1297c1f.1453301866.git.raphaelsc@scylladb.com>
test_cassandra_hash also sort of expects exceptions. ASas causes false
positives here as well with seastar::thread, so do it with normal cont.
Message-Id: <1453295521-29580-2-git-send-email-calle@scylladb.com>
Since 581271a243 "sstables: ignore data
belonging to dropped columns" we silently drop cells if there is no
column in the current schema that they belong to or their timestamp is
older than the column dropped_at value. Originally this check was
applied to row markers as well which caused them to be always dropped
since there is no column in the schema representing these markers.
This patch makes sure that the check whether colum is alive is performed
only if the cell is not a row marker.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1453289300-28607-1-git-send-email-pdziepak@scylladb.com>
stop_compaction is implemented by calling stop_compaction() of
compaction manager for each database.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's needed for nodetool stop, which is called to stop all ongoing
compaction. The implementation is about informing an ongoing compaction
that it was asked to stop, so the compaction itself will trigger an
exception. Compaction manager will catch this exception and re-schedule
the compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compaction_info makes more sense because this structure doesn't
only store stats about ongoing compaction. Soon, we will add
information to it about whether or not an user asked to stop the
respective ongoing compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
From Paweł:
These patches contain fixes for date and timestamp types:
- date and timestamp are considered compatible types
- date type is added to abstract_type::parse_type()
When this code was originally written, we used to operate on a generic
output_stream. We created a file output stream, and then moved it into
the generic object.
Many patches and reworks later, we now have a file_writer object, but
that pattern was never reworked.
So in a couple of places we have something like this:
f = file_object acquired by open_file_dma
auto out = file_writer(std::move(f), 4096);
auto w = make_shared<file_writer>(std::move(out));
The last statement is just totally redundant. make_shared can create
an object from its parameters without trouble, so we can just pass
the parameter list directly to it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <c01801a1fdf37f8ea9a3e5c52cd424e35ba0a80d.1453219530.git.glauber@scylladb.com>
* Match origin log messages
- Demote per-file printouts to "debug" level.
* Print an all-files stat summary for whole replay (begin/summary)
- At info level, like origin
Prompted by dtest that expects origin log output.
Message-Id: <1453216558-18359-1-git-send-email-calle@scylladb.com>
"We need to be able to replay mutations created using older versions of
the table's schema. frozen_mutation can be only read using the version
it was serialized with, and there is no guarantee that the node will
know this version at the time of replay. Currently versions are kept
in-memory so a node forgets all past versions when it restarts. This
was not implemented yet, replay would fail with exception if the
version is unknown."
* Match origin log messages
- Demote per-file printouts to "debug" level.
* Print an all-files stat summary for whole replay (begin/summary)
- At info level, like origin
Prompted by dtest that expects origin log output.
v2:
* Fixed broken + operator
* Use map_reduce instead of easily readable code
We need to be able to replay mutations created using older versions of
the table's schema. frozen_mutation can be only read using the version
it was serialized with, and there is no guarantee that the node will
know this version at the time of replay. Currently versions kept
in-memory so a node forgets all past versions when it restarts.
To solve this, let's store canonical_mutations which, like data in
sstables, can be read using any later schema version of given table.
From Paweł:
"This series contains some more fixes for issues related to alter table,
namely: incorrect parsing of collection information in comparator, missing
schema::_raw._collections in equality check, missing compatibility
information for utf8->blob, ascii->blob and ascii->utf8 casts."
test_password_authenticator_operations causes ASan failures, in a way
that I am 99% sure is fully false positive, caused by a combo of
seastar threads, exception throwing and externals.
In lieu of actually identifying what ASan flaw causes this and
potentially cure it, for now, lets just re-write the test in question
to not use seastar::async, but normal continuation. Less easy to read,
but passes ASan.
Message-Id: <1453205136-10308-1-git-send-email-calle@scylladb.com>
test_password_authenticator_operations causes ASan failures, in a way
that I am 99% sure is fully false positive, caused by a combo of
seastar threads, exception throwing and externals.
In lieu of actually identifying what ASan flaw causes this and
potentially cure it, for now, lets just re-write the test in question
to not use seastar::async, but normal continuation. Less easy to read,
but passes ASan.
When compacting sstable, mutation that doesn't belong to current shard
should be filtered out. Otherwise, mutation would be duplicated in
all shards that share the sstable being compacted.
sstable_test will now run with -c1 because arbitrary keys are chosen
for sstables to be compacted, so test could fail because of mutations
being filtered out.
fixes#527.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <1acc2e8b9c66fb9c0c601b05e3ae4353e514ead5.1453140657.git.raphaelsc@scylladb.com>
Don't demand the messaging_service version to be the same on both
sides of the connection in order to use internal addresses.
Upstream has a similar change for CASSANDRA-6702 in commit a7cae32 ("Fix
ReconnectableSnitch reconnecting to peers during upgrade").
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1452686729-32629-1-git-send-email-vladz@cloudius-systems.com>
The correct format of collection information in comparator is:
o.a.c.db.m.ColumnToCollection(<name1>:<type1>, <name2>:<type2>, ...)
not:
o.a.c.db.m.ColumnToCollection(<name1>:<type1>),
o.a.c.db.m.ColumnToCollection(<name2>:<type2>) ...
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The upstream of origin adds the version to the application_state in the
get_endpoints in the failure detector.
In our implementation we return an object to the jmx proxy and the proxy
do the string formatting.
This patch adds the version to the return object which is both useful as
an API and will allow the jmx proxy to add it to its output when we move
forward with the jmx version.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1448962889-19611-1-git-send-email-amnon@scylladb.com>
Representation format is an implementation detail of
partition_key. Code which compares a value to representation makes
assumptions about key's representation. Compare keys to keys instead.
Message-Id: <1453136316-18125-1-git-send-email-tgrabiec@scylladb.com>
We cannot stop the stream manager because it's accessible via the API
server during shutdown, for example, which can cause a SIGSEGV.
Spotted by ASan.
Message-Id: <1453130811-22540-1-git-send-email-penberg@scylladb.com>
Fix various issues in set_messaging_service() that caused
heap-buffer-overflows when JMX proxy connects to Scylla API:
- Off-by-one error in 'num_verb' definition
- Call to initializer list std::vector constructor variant that caused
the vector to be two elements long.
- Missing verb definitions from the Swagger definition that caused
response vector to be too small.
Spotted by ASan.
Message-Id: <1453125439-16703-1-git-send-email-penberg@scylladb.com>
Wait for batchlog removal before completing a query otherwise batchlog
removal queries may accumulate. Still ignore an error if it happens
since it is not critical, but log it.
Message-Id: <20160118095642.GB6705@scylladb.com>
"This series makes sure that Scylla rejects adding a collections if
its column name is the same as a collection that existed before and
their types are incompatible.
Fixes#782"
Historically the purpose of the metric is to show how much memory is
in standard allocations. After zones were introduced, this would also
include free space in lsa zones, which is almost all memory, and thus
the metric lost its original meaning. This change brings it back to
its original meaning.
Message-Id: <1452865125-4033-1-git-send-email-tgrabiec@scylladb.com>
"This patch is intended to add support to column family cleanup, which will
make 'nodetool cleanup' possible.
Why is this feature needed? Remove irrelevant data from a node that loses part
of its token range to a newly added node."
"This series adds support for multiple schema versions to the commit log.
All segments contain column mappings of all schema versions used by the
mutations contained in the segment, which are necessary in order to be
able to read frozen mutations and upgrade them to the current schema
version."
While MUTATION and MUTATION_DONE are asynchronous by nature (when a MUTATION
completes, it sends a MUTATION_DONE message instead of responding
synchronously), we still want them to be synchronous at the server side
wrt. the RPC server itself. This is because RPC accounts for resources
consumed by the handler only while the handler is executing; if we return
immediately, and let the code execute asynchronously, RPC believes no
resources are consumed and can instantiate more handlers than the shard
has resources for.
Fix by changing the return type of the handlers to future<no_wait_type>
(from a plain no_wait_type), and making that future complete when local
processing is over.
Ref #596.
Message-Id: <1453048967-5286-1-git-send-email-avi@scylladb.com>
Theoretically, one could want to repair a single host *and* all the hosts
in one or more other data centers which don't include this host. However,
Cassandra's "nodetool repair" explicitly does not allow this, and fails if
given a list of data centers (via the "-dc" option) which doesn't include
the host starting the repair. So we need to behave like "nodetool repair"
and fail in this case too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1453037016-25775-1-git-send-email-nyh@scylladb.com>
Cassandra disallows adding a column with the same name as a collection
that existed in the past in that table if the types aren't compatible.
To enforce that Scylla needs to keep track of all collections that ever
existed in the column family.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
* seastar a8183c1...e93cd9d (2):
> rpc: make sure we serialize on _resources_available sempahore
> rpc: fix support for handlers returning future<no_wait_type>
Since we switched to use mk-build-deps, it only resolves build-time dependencies.
We also need to install install-time dependencies.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
This patch adds a function which waits for the background cleanup work
which is started from sstable destructors.
We wait for those cleanups on reactor exit so that unit tests don't
leak. This fixes erratic ASAN complaint about memory leak when running
schema_change_test in debug mode:
Indirect leak of 64 byte(s) in 1 object(s) allocated from:
0x7fab24413912 in operator new(unsigned long) (/lib64/libasan.so.2+0x99912)
0x1776aeb in make_unique<continuation<future<T>::then_wrapped(Func&&) [with Func = future<T>::handle_exception(Func&&) [with Func = sstables::sstable::~sstable()::<lambda(auto:52)>; T = {}]::<lambda(auto:5&&)>; Result = future<>; T = {}]::<lambda(auto:2&&)> >, future<T>::then_wrapped(Func&&) [with Func = future<T>::handle_exception(Func&&) [with Func = sstables::sstable::~sstable()::<lambda(auto:52)>; T = {}]::<lambda(auto:5&&)>; Result = future<>; T = {}]::<lambda(auto:2&&)> > /usr/include/c++/5.1.1/bits/unique_ptr.h:765
0x1752b69 in schedule<future<T>::then_wrapped(Func&&) [with Func = future<T>::handle_exception(Func&&) [with Func = sstables::sstable::~sstable()::<lambda(auto:52)>; T = {}]::<lambda(auto:5&&)>; Result = future<>; T = {}]::<lambda(auto:2&&)> > /home/tgrabiec/src/scylla2/seastar/core/future.hh:513
0x1711365 in schedule<future<T>::then_wrapped(Func&&) [with Func = future<T>::handle_exception(Func&&) [with Func = sstables::sstable::~sstable()::<lambda(auto:52)>; T = {}]::<lambda(auto:5&&)>; Result = future<>; T = {}]::<lambda(auto:2&&)> > /home/tgrabiec/src/scylla2/seastar/core/future.hh:690
0x16d0474 in then_wrapped<future<T>::handle_exception(Func&&) [with Func = sstables::sstable::~sstable()::<lambda(auto:52)>; T = {}]::<lambda(auto:5&&)>, future<> > /home/tgrabiec/src/scylla2/seastar/core/future.hh:880
0x1696e9c in handle_exception<sstables::sstable::~sstable()::<lambda(auto:52)> > /home/tgrabiec/src/scylla2/seastar/core/future.hh:1012
0x1638ba8 in sstables::sstable::~sstable() sstables/sstables.cc:1619
The leak is about allocations related to close() syscall tasks invoked
from sstable destructor, which were not waited for.
Message-Id: <1452783887-25244-1-git-send-email-tgrabiec@scylladb.com>
This reverts commit f0d68e4 ("main: start the http server in the first
step"). The service layer is not ready to serve clients before it's
fully up and running which causes early startup crashes everywhere.
Message-Id: <1452768015-22763-1-git-send-email-penberg@scylladb.com>
If authentication is disabled, nobody calls login() to set the current
user. There's untranslated code in client_state constructor to do just
that.
Fixes "You have not logged in" errors when USE statement is executed
with authentication disabled.
Message-Id: <1452759946-13998-1-git-send-email-penberg@scylladb.com>
"Add implementation of cassandra password authenticator, and user
password checking to CQL connections.
User/pwd are stored in system_auth table. Passwords are hashed
using glibc 'crypt_r'.
The latter is worth noting, as this is a difference compared to origin;
Origin uses Java bcrypt library for salt/hash, i.e. blowfish hashing.
Most glibc variants do _not_ have support for blowfish. To be 100%
compatible with imported origin tables we might need to add
bcrypt/blowfish sources into scylla (no packaged libs available afaict)
The code currently first attempts to use blowfish, if we happen to run
centos or Openwall, which has it compiled in. Otherwise we will fall
back to sha512, sha256 or even md5 depending on lib support.
To use:
* scylla.conf: authenticator=PasswordAuthenticator
* cqlsh -u cassandra -p cassandra
Not implemented (yet):
* "Authorizer", thus no KS/CF access checking
* CQL create/alter/delete user (create_user_statement etc). I.e. there is
only a single user name; default "cassandra:cassandra" user/pwd combo"
It's needed to keep the iterators valid in case eviciton is triggered
somehwere in between. It probably isn't because destructors should not
allocate, but better be safe.
Currently for wrap around the "begin" iterator would not meet with the
"end" iterator, invoking undefined behavior in erase_and_dispose()
which results in a crash.
Fixes#785
User db storage + login/pwd db using system tables.
Authenticator object is a global shard-shared singleton, assumed
to be completely immutable, thus safe.
Actual login authentication is done via locally created stateful object
(sasl challenge), that queries db.
Uses "crypt_r" for password hashing, vs. origins use of bcrypt.
Main reason is that bcrypt does not exist as any consistent package
that can be consumed, so to guarantee full compatibility we'd have
to include the source. Not hard, but at least initially more work than
worth.
Fixes#614
* Use warning threshold from config
* Don't throw exceptions. We're only supposed to warn.
* Try to actually estimate mutation data payload size, not
number of mutations.
Message-Id: <1452615759-23213-1-git-send-email-calle@scylladb.com>
Each segment chunk should contain column mappings for all schema
versions used by the mutations it contains. In order to avoid
duplication db::commitlog::segment remembers all schema versions already
written in current chunk.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Current commitlog interface requires writers to specify the size of a
new entry which cannot depend on the segment to which the entry is
written.
If column mappings are going to be stored in the commitlog that's not
enough since we don't know whether column mapping needs to be written
until we known in which segment the entry is going to be stored.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Refs #752
Paged aggregate queries will re-use the partition_slice object,
thus when setting a specific ck range for "last pk", we will hit
an exception case.
Allow removing entries (actually only the one), and overwriting
(using schema equality for keys), so we maintain the interface
while allowing the pager code to re-set the ck range for previous
page pass.
[tgrabiec: commit log cleanup, fixed issue ref]
Message-Id: <1452616259-23751-1-git-send-email-calle@scylladb.com>
Fixes#614
* Use warning threshold from config
* Don't throw exceptions. We're only supposed to warn.
* Try to actually estimate mutation data payload size, not
number of mutations.
Fixes#752
We set row limit for query to be min of page size/remaining in limit,
but if we have a multinode query we might end up with more rows than asked
for, so must do this again in post-processing.
Refs #792
Paged aggregate queries will re-use the partition_slice object,
thus when setting a specific ck range for "last pk", we will hit
an exception case.
Allow removing entries (actually only the one), and overwriting
(using schema equality for keys), so we maintain the interface
while allowing the pager code to re-set the ck range for previous
page pass.
v2:
* Changed to schema-equality checks so we sort of maintain a
sane api and behaviour, even with the 1-entry map
v3:
* Renamed remove "contains" in specific_ranges, and made the calling
code use more map-like logic, again to keep things cleaner
Fixes#752
We set row limit for query to be min of page size/remaining in limit,
but if we have a multinode query we might end up with more rows than asked
for, so must do this again in post-processing.
Message-Id: <1452606935-12899-2-git-send-email-calle@scylladb.com>
According to specification
(here https://wiki.apache.org/cassandra/InternodeEncryption)
when the internode encryption is set to `dc` the data passed between
DCs should be encrypted and similarly, when it's set to `rack`
the inter-rack traffic should encrypted.
Currently Scylla would encrypt the traffic inside a local DC in the
first case and inside the local RACK in the later one.
This patch fixes the encryption logic to follow the specification
above.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1452501794-23232-1-git-send-email-vladz@cloudius-systems.com>
With Docker we might be running on a filesystem that does not support DMA
(aufs; or tmpfs on boot2docker), so let --developer-mode allow running
on those file systems.
Message-Id: <1452593083-25601-1-git-send-email-avi@scylladb.com>
yum command think "development-xxxx.xxxx" is older than "0.x", so nightly package mistakenly update with release version.
To prevent this problem, we should add greater number prior to "development".
Also same on Ubuntu package.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1452592877-29721-1-git-send-email-syuu@scylladb.com>
Use systemd Type=notify to tell systemd about startup progress.
We can now use 'systemctl status scylla-server' to see where we are
in service startup, and 'systemctl start scylla-server' will wait until
either startup is complete, or we fail to start up.
read_entry did not verify that current chunk has enough data left
for a minimal entry. Thus we could try to read an entry from the slack
left in a chunk, and get lost in the file (pos > next, skip very much
-> eof). And also give false errors about corruption.
Message-Id: <1452517700-599-1-git-send-email-calle@scylladb.com>
This will add support for an user to clean up an entire keyspace
or some of its column families.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Cleanup is a procedure that will discard irrelevant keys from
all sstables of a column family, thus saving disk space.
Scylla will clean up a sstable by using compaction code, in
which this sstable will be the only input used.
Compaction manager was changed to become aware of cleanup, such
that it will be able to schedule cleanup requests and also know
how to handle them properly.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The implementation is about storing generation of compacting sstables
in an unordered set per column family, so before strategy is called,
compaction manager will filter out compacting sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, compaction strategy is the responsible for both getting the
sstables selected for compaction and running compaction.
Moving the code that runs compaction from strategy to manager is a big
improvement, which will also make possible for the compaction manager
to keep track of which sstables are being compacted at a moment.
This change will also be needed for cleanup and concurrent compaction
on the same column family.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Cleanup is about rewriting a sstable discarding any keys that
are irrelevant, i.e. keys that don't belong to current node.
Parameter cleanup was added to compact_sstables.
If set to true, irrelevant code such as the one that updates
compaction history will be skipped. Logic was also added to
discard irrelevant keys.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That code will be used by column family cleanup, so let's put
that code into a function. This change also improves the code
readability.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"Our domain objects have schema version dependent format, for efficiency
reasons. The data structures which map between columns and values rely on
column ids, which are consecutive integers. For example, we store cells in a
vector where index into the vector is an implicit column id identifying table
column of the cell. When columns are added or removed the column ids may
shift. So, to access mutations or query results one needs to know the version
of the schema corresponding to it.
In case of query results, the schema version to which it conforms will always
be the version which was used to construct the query request. So there's no
change in the way query result consumers operate to handle schema changes. The
interfaces for querying needed to be extended to accept schema version and do
the conversions if necessary.
Shard-local interfaces work with a full definition of schema version,
represented by the schema type (usually passed as schema_ptr). Schema versions
are identified across shards and nodes with a UUID (table_schema_version
type). We maintain schema version registry (schema_registry) to avoid fetching
definitions we already know about. When we get a request using unknown schema,
we need to fetch the definition from the source, which must know it, to obtain
a shard-local schema_ptr for it.
Because mutation representation is schema version dependent, mutations of
different versions don't necessarily commute. When a column is dropped from
schema, the dropped column is no longer representable in the new schema. It is
generally fine to not hold data for dropped columns, the intent behind
dropping a column is to lose the data in that column. However, when merging an
incoming mutation with an existing mutation both of which have different
schema versions, we'd have to choose which schema should be considered
"latest" in order not to loose data. Schema changes can be made concurrently
in the cluster and initiated on different nodes so there is not always a
single notion of latest schema. However, schema changes are commutative and by
merging changes nodes eventually agree on the version. For example adding
column A (version X) on one node and adding column B (version Y) on another
eventually results in a schema version with both A and B (version Z). We
cannot tell which version among X and Y is newer, but we can tell that version
Z is newer than both X and Y. So the solution to the problem of merging
conflicting mutations could be to ensure that such merge is performed using
the schema which is superior to schemas of both mutations.
The approach taken in the series for ensuring this is as follows. When a node
receives a mutation of an unknown schema version it first performs a schema
merge with the source of that mutation. Schema merge makes sure that current
node's version is superior to the schema of incoming mutation. Once the
version is synced with, it is remembered as such and won't be synced with on
later mutations. Because of this bookkeeping, schema versions must be
monotonic; we don't want table altering to result in any earlier version
because that would cause nodes to avoid syncing with them. The version is a
cryptographically-secure hash of schema mutations, which should fulfill this
purpose in practice.
TODO: It's possible that the node is already performing a sync triggered by
broadcasted schema mutations. To avoid triggering a second sync needlessly, the
schema merging should mark incoming versions as being synced with.
Each table shard keeps track of its current schema version, which is
considered to be superior to all versions which are going to be applied to it.
All data sources for given column family within a shard have the same notion
of current schema version. Individual entries in cache and memtables may be at
earlier versions but this is hidden behind the interface. The entries are
upgraded to current version lazily on access. Sstables are immutable, so they
don't need to track current version. Like any other data source, they can be
queried with any schema version.
Note, the series triggered a bug in demangler:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68700"
* seastar d0bf6f8...ad3577b (9):
> httpd: close connection before deleting it
> reactor: support for non-O_DIRECT capable filesystems
> tests: modernize linecount
> IO queues: destruct within reactor's destructor
> tests: Use dnsdomainname in mkcert.gmk
> tests: memcached: workaround a possible race between flush_all and read
> apps: memcached: reduce the error during the expiration time translation
> timer: add missing #include
> core: do not call open_file_dma directly
Fixes#757.
The MOTD banner now printed upon .bash_profile execution,
if scylla is running, ends with a 'tput sgr0'. That command
appends an extra '[m' at the beginning of the output of any
following command. The automation scripts don't like this.
So let's add an 'echo' at the end of that path to add a newline,
avoiding the condition described above, and another one at the
'ScyllaDB is not started' path, for symmetry. I'm doing this
as it seems easier than having to develop heuristics to know
whether to remove or not that character.
CC: Shlomi Livne <slivne@scylladb.com>
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
Message-Id: <1452216044-28374-1-git-send-email-lmr@scylladb.com>
read_entry did not verify that current chunk has enough data left
for a minimal entry. Thus we could try to read an entry from the slack
left in a chunk, and get lost in the file (pos > next, skip very much
-> eof). And also give false errors about corruption.
Exceptions originated by an unimplemented to_string() methods
may interrupt the query() flow if not intercepted. Don't let it
happen.
Fixes issue #768
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
We want the statements to be removed before we ack the schema change,
otherwise it will race with all future operations.
Since the subscriber will be invoked on each shard, there is no need
to broadcast to all shards, we can just handle current shard.
Replicates https://issues.apache.org/jira/browse/CASSANDRA-7910 :
"Prepare a statement with a wildcard in the select clause.
2. Alter the table - add a column
3. execute the prepared statement
Expected result - get all the columns including the new column
Actual result - get the columns except the new column"
Currently the notify_*() method family broadcasts to all shards, so
schema merging code invokes them only on shard 0, to avoid doubling
notifications. We can simplify this by making the notify_*() methods
per-instance and thus shard-local.
If the schema_builder is constructed from an existing schema we need to
make sure that the original column ids of regular and static columns are
*not* used since they may become invalid if columns are added or
removed.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When a column is dropped its name and deletion timestamp are added
to schema::_raw._dropped_columns to prevent data resurrection in case a
column with the same name is added. To reduce the number of lookups in
_dropped_columns this patch makes each instance of column_definition
to caches this information (i.e. timestamp of the latest removal of a
column with the same name).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Knowing which columns were dropped (and when) is important to prevent
the data from the dropped ones reappearing if a new column is added with
the same name.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
We want the node's schema version to change whenever
table_schema_version of any table changes. The latter is calculated by
hashing mutations so we should also use mutation hash when calculating
schema digest.
We need to track which schema version were synced with on current node
to avoid triggering the sync on every mutation. We need to sync before
mutating to be able to apply the incoming mutation using current
node's schema, possibly applying irreverdible transformations to it to
make it conform.
There is one current schema for given column_family. Entries in
memtables and cache can be at any of the previous schemas, but they're
always upgraded to current schema on access.
The verb belongs to a seaprate client to avoid potential deadlocks
should the throttling on connection level be introduced in the
future. Another reason is to reduce latency for version requests as it
can potentially block many requests.
The intent is to make data returned by queries always conform to a
single schema version, which is requested by the client. For CQL
queries, for example, we want to use the same schema which was used to
compile the query. The other node expects to receive data conforming
to the requested schema.
Interface on shard level accepts schema_ptr, across nodes we use
table_schema_version UUID. To transfer schema_ptr across shards, we
use global_schema_ptr.
Because schema is identified with UUID across nodes, requestors must
be prepared for being queried for the definition of the schema. They
must hold a live schema_ptr around the request. This guarantees that
schema_registry will always know about the requested version. This is
not an issue because for queries the requestor needs to hold on to the
schema anyway to be able to interpret the results. But care must be
taken to always use the same schema version for making the request and
parsing the results.
Schema requesting across nodes is currently stubbed (throws runtime
exception).
Schema is tracked in memtable and cache per-entry. Entries are
upgraded lazily on access. Incoming mutations are upgraded to table's
current schema on given shard.
Mutating nodes need to keep schema_ptr alive in case schema version is
requested by target node.
We must use canonical_mutation form to allow for changes in the schema
of schema tables. The node which deserializes schema mutations may not
have the same version of the schema tables so we cannot use
frozen_mutation, which is a schema dependent form.
frozen_schema will transfer schema definition across nodes with schema
mutations. Because different nodes may have different versions of
schema tables, we cannot use frozen_mutations to transfer these
because frozen_mutation can only be read using the same version of the
schema it was frozen with. To solve this problem, new from of mutation
is introduced called canonical_mutation, which can be read using any
version of the schema.
This patch fixes a regression introduced by
a commit ca935bf "tests: Fix gossip_test".
database service initializes a replication_strategy
object and a replication_strategy requires a snitch
service to be initialized.
A snitch service requires a broadcast address to be
set.
If any of the above is not initialized we are going
to hit the corresponding assert().
Set a snitch to a SimpleSnitch and a broadcast
address to 127.0.0.1.
Fixes issue #770
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1452421748-9605-1-git-send-email-vladz@cloudius-systems.com>
The goal is to provide various test cases with a way of iterating over
many combinations of mutaitons. It's good to have this in one place to
avoid duplication and increased coverage.
Right now in some places we use column_id, and in some places
size_t. Solve it by using column_count_type whose meaning is "an
integer sufficiently large for indexing columns". Note that we cannot
use column_id because it has more meaning to it than that.
The version needs to change value not only on structural changes but
also temporal. This is needed for nodes to detect if the version they
see was already synchronized with or not even if it has the same
structure as the past versions. We also need to end up with the same
version on all nodes when schema changes are commuted.
For regular mutable schemas version will be calculated from underlying
mutations when schema is announced. For static schemas of system
keyspace it is calculated by hashing scylla version and column id,
because we don't have mutations at the time of building the schema.
We will be able to reuse the code in frozen_schema. We need to read
data in mutation form so that we can construct the correct
schema_table_version, and attach the mutations to schema_ptr.
It simplifies add_table_to_schema_mutation() interface.
The current code is also a bit confusing, partition_key is created
with the keyspaces() schema and used in mutations destined for the
columnfamilies() schema. It works, the types are the same, but looks a
bit scary.
For static and regular (row) columns it is very convenient in some
cases to utilize the fact that columns ordered by ids are also ordered
by name. It currently holds, so make schema export this guarantee and
enable consumers to rely on.
The static schema::row_column_ids_are_ordered_by_name field is about
allowing code external to schema to make it very explicit (via
static_assert) that it relies on this guarantee, and be easily
discoverable in case we would have to relax this.
With 10 sstables/shard and 50 shards, we get ~10*50*50 messages = 25,000
log messages about sstables being ignored. This is not reasonable.
Reduce the log level to debug, and move the message to database.cc,
because at its original location, the containing function has nothing to
do with the message itself.
Reviewed-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Message-Id: <1452181687-7665-1-git-send-email-avi@scylladb.com>
Wait for the future returned by the http server start process to resolve,
so we know it is started. If it doesn't, we'll hit the or_terminate()
further down the line and exit with an error code.
Message-Id: <1452092806-11508-3-git-send-email-avi@scylladb.com>
Because our shutdown process is crippled (refs #293), we won't shutdown the
snitch correctly, and the sharded<> instance can assert during shutdown.
This interferes with the next patch, which adds orderly shutdown if the http
server fails to start.
Leak it intentionally to work around the problem.
Message-Id: <1452092806-11508-2-git-send-email-avi@scylladb.com>
Make the following tests pass:
bootstrap_test.py:TestBootstrap.shutdown_wiped_node_cannot_join_test
bootstrap_test.py:TestBootstrap.killed_wiped_node_cannot_join_test
1) start node2
2) wait for cql connection with node2 is ready
3) stop node2
4) delete data and commitlog directory for node2
5) start node2
In step 5), node2 will do the bootstrap process since its data,
including the system table is wiped. It will think itself is a completly
new node and can possiblly stream from wrong node and violate
consistency.
To fix, we reject the boot if we found the node was in SHUTDOWN or
STATUS_NORMAL.
CASSANDRA-9765
Message-Id: <47bc23f4ce1487a60c5b4fbe5bfe9514337480a8.1452158975.git.asias@scylladb.com>
Implement the wait for gossip to settle logic in the bootup process.
CASSANDRA-4288
Fixes:
bootstrap_test.py:TestBootstrap.shutdown_wiped_node_cannot_join_test
1) start node2
2) wait for cql connection with node2 is ready
3) stop node2
4) delete data and commitlog directory for node2
5) start node2
In step 5, sometimes I saw in shadow round of node2, it gets node2's
status as BOOT from other nodes in the cluster instead of NORMAL. The
problem is we do not wait for gossip to settle before we start cql server,
as a result, when we stop node2 in step 3), other nodes in the cluster
have not got node2's status update to NORMAL.
The previous SSL enablement patches do make uses of these
options but they are still marked as Unused.
Change this and also update the db/config.hh documentation
accordingly.
Syntax is now:
client_encryption_options:
enabled: true
certificate: <path-to-PEM-x509-cert> (default conf/scylla.crt)
keyfile: <path-to-PEM-x509-key> (default conf/scylla.key)
Fixes: #756.
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1452032073-6933-1-git-send-email-benoit@scylladb.com>
Compaction fixes from Raphael:
There were two problems causing issue 676:
1) max_purgeable was being miscalculated (fixed by b7d36af).
2) empty row not being removed by mutation_partition::do_compact
Testcase is added to make sure that a tombstone will be purged under
certain conditions.
do_compact() wasn't removing an empty row that is covered by a
tombstone. As a result, an empty partition could be written to a
sstable. To solve this problem, let's make trim_rows remove a
row that is considered to be empty. A row is empty if it has no
tombstone, no marker and no cells.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"This is another version of the repair overhaul, to avoid streaming *all* the
data between nodes by sending checksums of token ranges and only streaming
ranges which contain differing data."
Support the "hosts" and "dataCenters" parameters of repair. The first
specifies the known good hosts to repair this host from (plus this host),
and the second asks to restrict the repair to the local data center (you
must issue the repair to a node in the data center you want to repair -
issuing the command to a data center other than the named one returns
an error).
For example these options are used by nodetool commands like:
nodetool repair -hosts 127.0.0.1,127.0.0.2 keyspace
nodetool repair -dc datacenter1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The existing repair code always streamed the entire content of the
database. In this overhaul, we send "repair_checksum_range" messages to
the other nodes to verify whether they have exactly the same data as
this node, and if they do, we avoid streaming the identical code.
We make an attempt to split the token ranges up to contain an estimated
100 keys each, and send these ranges' checksums. Future versions of this
code will need to improve this estimation (and make this "100" a parameter)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a function sync_range() for synchronizing all partitions
in a given token range between a set of replicas (this node and a list of
neighbors).
Repair will call this function once it has decided that the data the
replicas hold in this range is not identical.
The implementation streams all the data in the given range, from each of
the neighbors to this node - so now this node contains the most up-to-date
data. It then streams the resulting data back to all the neighbors.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a new type of message, "REPAIR_CHECKSUM_RANGE" to scylla's
"messaging_service" RPC mechanism, for the use of repair:
With this message the repair's master host tells a slave host to calculate
the checksum of a column-family's partitions in a given token range, and
return that checksum.
The implementation of this message uses the checksum_range() function
defined in the previous patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds functions for calculating the checksum of all the
partitions in a given token range in the given column-family - either
in the current shard, or across all shards in this node.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a mechanism for calculating a checksum for a set of
partitions. The repair process will use these checksums to compare the
data held by different replicas.
We use a strong checksum (SHA-256) for each individual partition in the set,
and then a simple XOR of those checksums to produce a checksum for the
entire set. XOR is good enough for merging strong checksums, and allows us
to independently calculate the checksums of different subsets of the
original sets - e.g., each shard can calculate its own checksum and we
can XOR the resulting checksums to get the final checksum.
Apache Cassandra uses a very similar checksum scheme, also using SHA-256
and XOR. One small difference in the implementation is that we include the
partition key in its checksum, while Cassandra don't, which I believe to
have no real justification (although it is very unlikely to cause problems
in practice). See further discussion on this in
https://issues.apache.org/jira/browse/CASSANDRA-10728.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
A cut-and-paste accident in query::to_partition_range caused the wrong
end's inclusiveness to be tested.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Everything except alter_table_statement::announce_migration() is
translated. announce_migration() has to wait for multi schema support to
be merged.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
We have an API that wraps open_file_dma which we use in some places, but in
many other places we call the reactor version directly.
This patch changes the latter to match the former. It will have the added benefit
of allowing us to make easier changes to these interfaces if needed.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <29296e4ec6f5e84361992028fe3f27adc569f139.1451950408.git.glauber@scylladb.com>
This exception was not caught properly as a std::exception
by report_failed_future call to report_exception because the
superclass std::exception was not initialized.
Fixes#669.
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Recently, Scylla was changed to mandate the use of XFS
for its data directories, unless the flag --developer-mode true
is provided. So during the AMI setup stage, if the user
did not provide extra disks for the setup scripts to prepare,
the scylla service will refuse to start. Therefore, the
message in scylla_prepare has to be changed to an actual
error message, and the file name, to be changed to
something that reflects the event that happened.
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
Avi Kivity (1):
Merge "rpc negotiation fixes" from Gleb
Gleb Natapov (3):
rpc: fix peer address printing during logging
rpc: make server send negotiation frame back before closing connection on error
rpc: fix documentation for negotiation procedure.
Nadav Har'El (1):
fix operator<<() for std::vector<T>
Tomasz Grabiec (1):
core/byteorder: Add missing include
scylla-ami.sh moved some ami specific files. This parts have been
dropped when converging scylla-ami into scylla_install. Fixing that.
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
The script imports the /etc/sysconfig/scylla-server for configuration
settings (NR_PAGES). The /etc/sysconfig/scylla-server iincludes an AMI
param which is of string value and called as a last step in
scylla_install (after scylla_bootparam_setup has been initated).
The AMI variable is setup in scylla_install and is used in multiple
scripts. To resolve the conflict moving the import of
/etc/sysconfig/scylla-server after the AMI variable has been compared.
Fixes: #744
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Switch to CentOS 7 as the Docker base image. It's more stable and
updated less frequently than Fedora. As a bonus, it's Thrift package
doesn't pull the world as a dependency which reduces image size from 700
MB to 380 MB.
Suggested by Avi.
Message-Id: <1451911969-26647-1-git-send-email-penberg@scylladb.com>
If requests are delayed downstream from the cql server, and the client is
able to generate unrelated requests without limit, then the transient memory
consumed by the requests will overflow the shard's capacity.
Fix by adding a semaphore to cap the amount of transient memory occupied by
requests.
Fixes#674.
compatible: can be cast, keeps sort order
value-compatible: can be cast, may change sort order
frozen: values participate in sort order
unfrozen: only sort keys participate in sort order
Fixes#740.
"Messaging service inherits from seastar::async_sharded_service which
guaranties that sharded<>::stop() will not complete until all references
to the service will not go away. It was done specifically to avoid using
more verbose gate interface in sharded<> services since it turned out
that almost all of them need one eventually. Unfortunately patches to
add redundant gate to messaging_service sneaked pass my review. The series
reverts them."
When 'nodetool clearsnapshot' is given no parameters it should
remove all existing snapshots.
Fixes issue #639
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
service::storage_service::clear_snapshot() was built around _db.local()
calls so it makes more sense to move its code into the 'database' class
instead of calling _db.local().bla_bla() all the time.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Current service initialization is a total mess in cql_test_env. Start
the service the same order as in main.cc.
Fixes#715, #716
'./test.py --mode release' passes.
It's just a waste of time to find them manually
when compiling ScyllaDB on a fresh install: add them.
Also fix the ninja-build name.
Signed-off-by: Benoît Canet <benoit@scylladb.com>
After upgrading an AMI and trying to stop and start a machine the
/var/lib/scylla/coredump is not created. Create the directory if it does
not exist prior to generating core
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
There's two untranslated grammar subrules in the 'relation' rule. We
have all the necessary AST classes translated, so translate the
remaining subrules.
Refs #534.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
messaging_service will use private ip address automatically to connect a
peer node if possible. There is no need for the upper level like
streaming to worry about it. Drop it simplifies things a bit.
The stream info was creted but was left out of the stream state. This
patch adds the created stream_info to the stream state vector.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
* seastar 8b2171e...de112e2 (4):
> build: disable -fsanitize=vptr harder
> Merge "Make RPC more robust against protocol changes" from Gleb
> new when_all implementation
> apps: memcached: Don't fiddle with the expiration time value in a usable range
Commit 2ba4910 ("main: verify that the NOFILE rlimit is sufficient")
added a recommendation to set NOFILE rlimit to 200k. Update our release
binaries to do the same.
"This series solve an issue with the load broadcaster that reports negative
values due to an integer wrap around. While fixing this issue an additional
change was made so that the load_map would return doubles and not formatted
string. This is a better API, safer and better documented."
Since commit 16596385ee, long_token() is already checking
t.is_minimum(), so the comment which explains why it does not (for
performance) is no longer relevant. And we no longer need to check
t._kind before calling long_token (the check we do here is the same
as is_minimum).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Check the list of column families passed as an option to repair, to
provide the user with a more meaningful exception when a non-existant
column family is passed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This was a plain bug - ranges_opt is supposed to parse the option into
the vector "var", but took the vector by value.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Support the "columnFamilies" parameter of repair, allowing to repair
only some of the column families of a keyspace, instead of all of them.
For example, using a command like "nodetool repaire keyspace cf1 cf2".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The script scylla_coredump_setup was introduced in
9b4d0592, and added to the scylla rpm spec file, as a
post script. However, calling yum when there's one
yum instance installng scylla server will cause a deadlock,
since yum waits for the yum lock to be released, and the
original yum process waits for the script to end.
So let's remove this from the script. Debian shouldn't be
affected, since it was never added to the debian build
rules (to the best of my knowlege, after analyzing 9b4d0592),
hence I did not remove it. It should cause the same problem
with apt-get in case it was used.
CC: Takuya ASADA <syuu@scylladb.com>
[ penberg: Rebase and drop statement about 'abrt' package not in Fedora. ]
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
A default value was not set for the "incremental" and "parallelism"
repair parameters, so Scylla can wrongly decide that they have an
unsupported value.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The repair API use to have an undocumented parameter list similiar to
origin.
This patch changes the way repair is getting its parameters.
Instead of a one undocumented string it now lists all the different
optional parameters in the swagger file and accept them explicitely.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This change set the http server to start as the first step in the boot
order.
It is helpfull if some other step takes a long time or stuck.
Fixes#725
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Actually check that a snapshot directory with a given tag
exists instead of just checking that a 'snapshot' directory
exists.
Fixes issue #689
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
In origin the storage_serivce report the load map as a formatted string.
As an API a better option is to report the load map as double and let
the JMX proxy do the formatting.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The map_reduce0 convert the result value to the init value type. In
load_bradcaster 0 is of type int.
This result with an int wrap around and negative results.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Add partial support for the "incremental" option (only support the
"false" setting, i.e., not incremental repair) and the "parallelism"
option (the choice of sequential or parallel repair is ignored - we
always use our own technique).
This is needed because scylla-jmx passes these options by default
(e.g., "incremental=false" is passed to say this is *not* incremental
repair, and we just need to allow this and ignore it).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When throwing an "unsupported repair options" exception to the caller
(such as "nodetool repair"), also list which options were not recognized.
Additionally, list the options when logging the repair operation.
This patch includes an operator<< implementation for pretty-printing an
std::unordered_map. We may want to move it later to a more central
location - even Seastar (like we have a pretty-printer for std::vector
in core/sstring.hh).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
max_purgeable was being incorrectly calculated because the code
that creates vector of uncompacted sstables was wrong.
This value is used to determine whether or not a tombstone can
be purged.
Operand < is supposed to be used instead in the callback passed
as third parameter to boost::set_difference.
This fix is a step towards closing the issue #676.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The start_native_transport() function in storage_service expects the
'enabled' option to be defined. If the option is not defined, it means
that encryption is implicitly disabled.
Fixes#718.
"Adds support for TLS/SSL encrypted (and cert verified)
connections for message service
* Modify config option to match "native" style cerificate management
* Add SSL options to messaging service and generate SSL server/client
endpoints when required
* Add config option handling to init/main"
* Massage user options in main
* Use them in storage_service, and if needed, load certificates etc
and pass to transport/cql server.
Conflicts:
service/storage_service.cc
Optional credentials argument determine if SSL or normal
server socket is created.
Note: This does not follow the pattern of "socket as argument", simply
because this is a distributed object, so only trivial or immutable
objects should be passed to it.
Describe scylla version of option.
Note, for test usage, the below should be workable:
server_encryption_options:
internode_encryption: all
certificate: seastar/tests/test.crt
truststore: seastar/tests/catest.pem
keyfile: seastar/tests/test.key
Since the seastar test suite contains a snakeoil cert + trust
combo
* Accept port + credentials + option for what to encrypt
* If set, enable a SSL listener at ssl_port
* Check outgoing connections by IP to determine if
they should go to SSL/normal endpoint
Requires seastar RPC patch
Note: currently, the connections created by messaging service
does _not_ do certificate name verification. While DNS lookup
is probably not that expensive here, I am not 100% sure it is
the desired behaviour.
Normal trust is however verified.
* Mark option used
* Make sub-options adapted to seastar-tls useable values (i.e. x509)
Syntax is now:
server_encryption_options:
internode_encryption: <none, all, dc, rack>
certificate: <path-to-PEM-x509-cert> (default conf/scylla.crt)
keyfile: <path-to-PEM-x509-key> (default conf/scylla.key)
truststore: <path-to-PEM-trust-store-file> (default empty,
use system trust)
"When a node gain or regain responsibility for certain token ranges, streaming
will be performed, upon receiving of the stream data, the row cache
is invalidated for that range.
Refs #484."
The describe_ring method in storage_service did not report the start and
end tokens.
Also for rpc addresses that are not the local address, it returned the
value representation (including the version) and not just the adress.
Fixes#695
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Use steady_clock instead of high_resolution_clock where monotonic
clock is required. high_resolution_clock is essentially a
system_clock (Wall Clock) therefore may not to be assumed monotonic
since Wall Clock may move backwards due to time/date adjustments.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Use steady_clock instead of high_resolution_clock where monotonic
clock is required. high_resolution_clock is essentially a
system_clock (Wall Clock) therefore may not to be assumed monotonic
since Wall Clock may move backwards due to time/date adjustments.
Fixes issue #638
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
"Fixes stream_session hangs:
1) if the sending node is gone, the receiving peer will wait forever
2) if the node which should send COMPLETE_MESSAGE to the peer node is gone,
the peer node will wait forever"
This patch fixes a bug where the *first* run of "nodetool repair" always
returned immediately, instead of waiting for the repair to complete.
Repair operations are asynchronous: Starting a repair returns a numeric
id, which can then be used to query for the repair's completion, and this
is what "nodetool repair" does (through our JMX layer). We started with
the repair ID "0", the next one is "1", and so on.
The problem is that "nodetool repair", when it sees 0 being returned,
treats it not as a regular repair ID, but rather as an answer that
there is nothing to repair - printing a message to that effect and *not*
waiting for the repair (which was correctly started) to complete.
The trivial fix is to start our repair IDs at 1, instead of 0.
We currently do not return 0 in any case (we don't know there is nothing
to repair before we actually start the work, and parameter errors
cause an exception, not a return of 0).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
If we can't open the file, we will fail with a misterious error. It is a costumary
scenario, though, since people who are unaware or have just forgotten about seastar's
restriction of direct io access may put those files in tmpfs and other mount points.
We have a direct_io check that is designed exactly for this purpose, so as to give
the user a better error message. This patch makes use of it.
Fixes#644
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It is hard-coded as 30 seconds at the moment.
Usage:
$ scylla --ring-delay-ms 5000
Time a node waits to hear from other nodes before joining the ring in
milliseconds.
Same as -Dcassandra.ring_delay_ms in cassandra.
The midpoint() algorithm to find a token between two tokens doesn't
work correctly in case of wraparound. The code tried to handle this
case, but did it wrong. So this patch fixes the midpoint() algorithm,
and adds clearer comments about why the fixed algorithm is correct.
This patch also modifies two midpoint() tests in partitioner_test,
which were incorrect - they verified that midpoint() returns some expected
values, but expected values were wrong!
We also add to the test a more fundemental test of midpoint() correctness,
which doesn't check the midpoint against a known value (which is easy to
get wrong, like indeed happened); Rather we simply check that the midpoint
is really inside the range (according to the token ordering operator).
This simple test failed with the old implementation of midpoint() and
passes with the new one.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The problem is that we set the session state to WAIT_COMPLETE in
send_complete_message's continuation, the peer node might send
COMPLETE_MESSAGE before we run the continuation, thus we set the wrong
status in COMPLETE_MESSAGE's handler and will not close the session.
Before:
GOT STREAM_MUTATION_DONE
receive task_completed
SEND COMPLETE_MESSAGE to 127.0.0.2:0
GOT COMPLETE_MESSAGE, from=127.0.0.2, connecting=127.0.0.3, dst_cpu_id=0
complete: PREPARING -> WAIT_COMPLETE
GOT COMPLETE_MESSAGE Reply
maybe_completed: WAIT_COMPLETE -> WAIT_COMPLETE
After:
GOT STREAM_MUTATION_DONE
receive task_completed
maybe_completed: PREPARING -> WAIT_COMPLETE
SEND COMPLETE_MESSAGE to 127.0.0.2:0
GOT COMPLETE_MESSAGE, from=127.0.0.2, connecting=127.0.0.3, dst_cpu_id=0
complete: WAIT_COMPLETE -> COMPLETE
Session with 127.0.0.2 is complete
If the session is idle for 10 minutes, close the session. This can
detect the following hangs:
1) if the sending node is gone, the receiving peer will wait forever
2) if the node which should send COMPLETE_MESSAGE to the peer node is
gone, the peer node will wait forever
Fixes simple_kill_streaming_node_while_bootstrapping_test.
Get from address from cinfo. It is needed to figure out which stream
session this mutation is belonged to, since we need to update the keep
alive timer for this stream session.
Currently, if the node is actually down, although the streaming_timeout
is 10 seconds, the sending of the verb will return rpc_closed error
immediately, so we give up in 20 * 5 = 100 seconds. After this change,
we give up in 10 * 30 = 300 seconds at least, and 10 * (30 + 30) = 600
seconds at most.
It is oneway message at the moment. If a COMPLETE_MESSAGE is lost, no
one will close the session. The first step to fix the issue is to try to
retransmit the message.
$NAME is full name of distribution, for script it is too long.
$ID is shortened one, which is more useful.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
The helper function for summing statistic over the column family are
template function that infer the return type acording to the type of the
Init param.
In the API the return value should be int64_t, passing an integer would
cause a number wrap around.
A partial output from the nodetool cfstats after the fix
nodetool cfstats keyspace1
Keyspace: keyspace1
Read Count: 0
Read Latency: NaN ms.
Write Count: 4050000
Write Latency: 0.009178098765432099 ms.
Pending Flushes: 0
Table: standard1
SSTable count: 12
Space used (live): 1118617445
Space used (total): 23336562465
Fixes#682
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
From Calle:
Fixes#589
Query should not return dangling static row in partition without any
regular/ck columns if a CK restriction is applied.
Refs #650
Fixes bug in CK range code for paging, and removes CK use for tables with not
clustering -> way simpler code. Also removed lots of workaround code no longer
required.
Note that this patch set does not fully fix #650/paging since bug #663 causes
duplicate rows. Still almost there though.
Boost::date_time doesn't accept some of the date and time formats that
the origin do (e.g. 2013-9-22 or 2013-009-22).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Refs #640
* Remove use of cluster key range for tables without CK
Checking CK existance once and use the info allows us to remove some
stupid complexity in checking for "last key" match
* With fix for #589 we can also remove some superfluous code to
compensate for that issue, and make "partition end" simper
* Remove extra row in CK case. Not needed anymore
End result is that pager now more or less only relies on adapted query
ranges.
timestamp_from_string() is used by both timestamp and date types, so it
is better to move the try { } catch { } to the functions itself instead
of expecting its callers to catch exceptions.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
On AMI, scylla-server fails to systemctl restart because scylla_prepare tries to mount /var/lib/scylla even it's already mounted.
This patch fixes the issue.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
* seastar b44d729...51154f7 (6):
> semaphore: add with_semaphore()
> scripts: posix_net_conf.sh: don't transform wide CPU mask
> resource: fix build for systems without HWLOC
> build: link libasan before all other libraries
> Use sys_membarrier() when available
> build: add missing library (boost_filesystem)
Fixes#589
If we got no rows, but have live static columns, we should only
give them back IFF we did not have any CK restrictions.
If ck:s exist, and we have a restriction on them, we either have maching
rows, or return nothing, since cql does not allow "is null".
This patch introduces a test for reading keys from a single sstable with
the range begining and end being the keys present in the index summary.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When choosing a relevant range of buckets it wasn't taken into account
whether the range bounds are inclusive or not. That may have resulted in
more buckets being read than necessary which was a condition not
expected by the code responsible from looking for a relevant keys inside
the buckets.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
If a sstable doesn't belong to current shard, mark_for_deletion
should be called for the deletion manager to still work.
It doesn't mean that the sstable will be deleted, but that the
sstable is not relevant to the current shard, thus it can be
deleted by the deletion manager in the future.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When a node gain or regain responsibility for certain token ranges,
streaming will be performed, upon receiving of the stream data, the
row cache is invalidated for that range.
Refs #484.
From Paweł:
"This series fixes sstables::key_reader not respecting range inclusiveness
if the bounds were the keys that were present in the index summary.
Fixes #663."
This patch introduces a test for reading keys from a single sstable with
the range begining and end being the keys present in the index summary.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When choosing a relevant range of buckets it wasn't taken into account
whether the range bounds are inclusive or not. That may have resulted in
more buckets being read than necessary which was a condition not
expected by the code responsible from looking for a relevant keys inside
the buckets.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
"This series adds support for nodetool command 'drain'. The general idea
of this command is to close all connection (both with clienst and other
nodes) and flush all memtables to disk.
Fixes #662."
"Merge AMI scripts to dist/common/scripts, make it usable on non-AMI
environments. Provides a script to do all settings automatically, which
able to run as one-liner like this:
curl http://url_to_scylla_install | sudo bash -s -- -d /dev/xvdb,/dev/xvdc -n eth0 -l ./
Also enables coredump, save it to /var/lib/scylla/coredump"
When the server is shutting down a flag _stopping is set and listeners
are aborted using abort_accept(), which causes accept() calls to return
failed futures. However, accept handler just checks that the flag
_stopping is set and returns which causes a failed future to be
destroyed and a warning is printed.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The underlying data source for cache should not be the same memtable
which is later used to update the cache from. This fixes the following
assertion failure:
row_cache_test_g: utils/logalloc.hh:289: decltype(auto) logalloc::allocating_section::operator()(logalloc::region&, Func&&) [with Func = memtable::make_reader(schema_ptr, const partition_range&)::<lambda()>]: Assertion `r.reclaiming_enabled()' failed.
The problem is that when memtable is merged into cache their regions
are also merged, so locking cache's region locks the memtable region
as well.
Currently sourcing for the second time causes an exception from
pretty printer registration:
Traceback (most recent call last):
File "./scylla-gdb.py", line 41, in <module>
gdb.printing.register_pretty_printer(gdb.current_objfile(), build_pretty_printer())
File "/usr/share/gdb/python/gdb/printing.py", line 152, in register_pretty_printer
printer.name)
RuntimeError: pretty-printer already registered: scylla
Fixes the case where background activity needed to complete CL=ONE writes
is queued up in the storage proxy, and the client adds new work faster
than it can be cleared.
The last two loops were incorrectly inside the first one. That's a
bug because a new sstable may be emplaced more than once in the
sstable list, which can cause several problems. mark_for_deletion
may also be called more than once for compacted sstables, however,
it is idempotent.
Found this issue while auditing the code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"* seastar 294ea30...b44d729 (5):
> Merge "Properly distribute IO queues" from Glauber
> reactor: allow more poll time in virtualized environments
> reactor: fix idle-poll limit
> reactor: use a vector of unique_ptr for the IO queues
> io queues: make the queues really part of the reactor"
With consistency level less then ALL mutation processing can move to
background (meaning client was answered, but there is still work to
do on behalf of the request). If background request rate completion
is lower than incoming request rate background request will accumulate
and eventually will exhaust all memory resources. This patch's aim is
to prevent this situation by monitoring how much memory all current
background request take and when some threshold is passed stop moving
request to background (by not replying to a client until either memory
consumptions moves below the threshold or request is fully completed).
There are two main point where each background mutation consumes memory:
holding frozen mutation until operation is complete in order to hint it
if it does not) and on rpc queue to each replica where it sits until it's
sent out on the wire. The patch accounts for both of those separately
and limits the former to be 10% of total memory and the later to be 6M.
Why 6M? The best answer I can give is why not :) But on a more serious
note the number should be small enough so that all the data can be
sent out in a reasonable amount of time and one shard is not capable to
achieve even close to a full bandwidth, so empirical evidence shows 6M
to be a good number.
I am sure it's a compiler issue but I am not ready to give up and
upgrade just yet:
sstables/compaction.cc:307:55: error: converting to ‘std::unordered_map<int, long int>’ from initializer list would use explicit constructor ‘std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::unordered_map(std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::size_type, const hasher&, const key_equal&, const allocator_type&) [with _Key = int; _Tp = long int; _Hash = std::hash<int>; _Pred = std::equal_to<int>; _Alloc = std::allocator<std::pair<const int, long int> >; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::size_type = long unsigned int; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::hasher = std::hash<int>; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::key_equal = std::equal_to<int>; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::allocator_type = std::allocator<std::pair<const int, long int> >]’
stats->start_size, stats->end_size, {});
Test was failing because _qp (distributed<cql3::query_processor>) was stopped
before _db (distributed<database>).
Compaction manager is member of database, and when database is stopped,
compaction manager is also stopped. After a2fb0ec9a, compaction updates the
system table compaction history, and that requires a working query context.
We cannot simply move _qp->stop() to after _db->stop() because the former
relies on migration_manager and storage_proxy. So the most obvious fix is to
clean the global variable that stores query context after _qp was stopped.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The previous patch added message_service read()/write() support for all
types which know how to serialize themselves through our "old" serialization
API (serialize()/deserialize()/serialized_size()).
So we no longer need the almost 200 lines of repetitive code in
messaging_service.{cc,hh} which defined these read/write templates
separately for a dozen different types using their *serialize() methods.
We also no longer need the helper functions read_gms()/write_gms(), which
are basically the same code as that in the template functions added in the
previous patch.
Compilation is not significantly slowed down by this patch, because it
merely replaces a dozen templates by one template that covers them all -
it does not add new template complexity, and these templates are anyway
instantiated only in messaging_service.cc (other code only calls specific
functions defined in messaging_service.cc, and does not use these templates).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently, messaging_service only supports sending types for which a read/
write function has been explicitly implemented in messageing_service.hh/cc.
Some types already have serialization/deserialization methods inside them,
and those could have been used for the serialization without having to write
new functions for each of these types. Many of these types were already
supported explicitly in messaging_service.{cc,hh}, but some were forgot -
for example, dht::token.
So this patch adds a default implemention of messaging_service write()/read()
which will work for any type which has these serialization methods.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
* seastar 5b9e3da...294ea30 (9):
> Merge "IO queues" from Glauber
> reactor: increment check_direct_io_support to also deal with files
> Merge "SSL/TLS initial certificate validation" from Calle
> tutorial.md: remove inaccurate statements about x86
> build: verify that the installed compiler is up to date
> build: complain if fossil version of gnutls is installed
> build: fix debian naming of gnutls-devel package
> build: add configure-time check for gnutls-devel
> tutorial.md: introduction to asynchrnous programming
send_to_live_endpoints() is never waited upon, it does its job in the
background. This patch formalize that by changing return value to void
and also refactoring code so that frozen_mutation shared pointer is not
held more that it should: currently it is held until send_mutation()
completes, but since send_mutation() does not use frozen_mutation
asynchronously this is not necessary.
Replace db_clock::now_in_usec() and db_clock::now() * 1000 accesses
where the intent is to create a new auto-generate cell timestamp with
a call to new_timestamp(). Now the knowledge of how to create timestamps
is in a single place.
"get_compactions returns progress information for each compaction
running in the system. It can be accessed using swagger UI.
'nodetool compactionstats' is not working yet because of some
pending work in the nodetool side."
Apparently, link hook copy constructor is a no-op and move contructor
doesn't exist so the code is correct, but that explicit move makes code
needlessly confusing.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
That's important for compaction stats API that will need stats
data of each ongoing compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This list will store compaction_stats for each ongoing compaction.
That's why register and deregister methods are provided.
This change is important for compaction stats API that needs data
of each ongoing compaction, such as progress, ks, cf, etc.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"This patchset will make Scylla update the system table
COMPACTION_HISTORY whenever a compaction job finishes.
Functions were added to both update and retrieve the
content of this system table. Compaction history API
is also enabled in this series."
When compaction job finishes, call function to update the system
table COMPACTION_HISTORY. That's also needed for the compaction
history API.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This method is intended to return content of the system table
COMPACTION_HISTORY as a vector of compaction_history_entry.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If a sstable doesn't belong to current shard, mark_for_deletion
should be called for the deletion manager to still work.
It doesn't mean that the sstable will be deleted, but that the
sstable is not relevant to the current shard, thus it can be
deleted by the deletion manager in the future.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There is a check whose intent was to detect wrap around during walk of
the ring tokens by comparing the split point with minimum token, which
is supposed to be inserted by the ring iterator. It assumed that when
we encounter it, the range is a wrap around. It doesn't hold when
minimum token is part of the token metadata or set of tokens is empty.
In such case, a full range would be split into 3 overlapping full
ranges. The fix is to drop the assumption and instead ensure that
ranges do not wrap around by unwrapping them if necessary.
Fixes#655.
The default move assignment operator calls boost::intrusive::set's move
assignment operator, which leaks, because it does not believe it owns
the data.
Fix by providing a custom implementation.
All components of prefixable compound type are preceeded by their
length what makes them not byte order comparable.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Frozen collection type names must be wrapped in FrozenType so that we
are able to store the types correctly in system tables.
This fixes#646 and fixes#580.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
The test_assignement() function is invoked via the Cassandra unit tests
so we might as well implement it.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
It is 30 seconds instead of 5 seconds by default. To align with c*.
Pleas note, after this a node will takes at least 30 seconds to complete
a bootstrap.
Originally, large allocation test case attempted to allocate an object
as big as halft of the space used by the lsa. That failed when the test
was executed with lower amount of memory available mainly due to the
memory fragmentation caused by previous test cases.
This patches reduces the size of the large allocation to 3/8 of the
total space used by the lsa which is still a lot but seems to make the
test pass even with as little memory as 64MB per shard.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
If we get a core dump from a user, it is important to be able to
identify its version. Copy the release string into the heap (which is
copied into the code dump), so we can search for it using the "strings"
or "ident" commands.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
This fixes compile error:
In function `logalloc::segment_zone::segment_zone()':
/home/lmr/Code/scylla/utils/logalloc.cc:412: undefined reference to `logalloc::segment_zone::minimum_size'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
blob_storage defined with attribute packed which makes its alignment
requirement equal 1. This means that its members may be unaligned.
GCC is obviously aware of that and will generate appropriate code
(and not generate ubsan checks). However, there are few places where
members of blob_storage are accessed via pointers, these have to be
wrapped by unaligned_cast<> to let the compiler know that the location
pointed to may be not aligned properly.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Form Paweł:
This series fixes support for clustering keys which trailing components
are null. The solution is to use clustering_key_prefix instead of
clustering_key everywhere.
Fixes#515.
Schemas using compact storage can have clustering keys with the trailing
components not set and effectively being a clustering key prefixes
instead of full clustering keys.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
In case of non-compound dense tables the column name is just the value
of the clustering key (which has only one component). Current code just
casts clustering_key to bytes_view which works because there is no
additional metadata in single element clustering keys.
However, that may change when the internal representation of clustering
key is changed so explicitly extract the proper component.
This change will become necessary when clustering_key is replaced by
clustering_key_prefix.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
In case of schemas that use compact storage it is possible that trailing
components of clustering keys are not set.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When this tool was written, we were still using /var/lib/cassandra as a default
location. We should update it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
* seastar 5dc22fa...c5e595b (3):
> memory: be less strict about NUMA bindings
> reactor: let the resource code specify the default memory reserve
> resource: reserve even more memory when hwloc is compiled in
Fixes#642
"This series attempts to make LSA more friendly for large (i.e. bigger
than LSA segment) allocations. It is achieved by introducing segment
zones – large, contiguous areas of segments and using them to allocate
segments instead of calling malloc() directly.
Zones can be shrunk when needed to reclaim memory and segments can be
migrated either to reduce number of zone or to defragment one in order
to be able to shrink it. LSA tries to keep all segments at the lower
addresses and reclaims memory starting from the zones in the highest
parts of the address space."
Also:
[PATCH scylla v1 0/7] gossip mark node down fix + cleanup
[PATCH scylla v1 0/2] Refuse decommissioned node to rejoin
[PATCH scylla] storage_service: Fix added node not showing up in nodetool in status joining
When replacing a node, we might ignore the tokens so that the tokens is
empty. In this case, we will have
std::unordered_map<inet_address, std::unordered_set<token>> = {ip, {}}
passed to token_metadata::update_normal_tokens(std::unordered_map<inet_address,
std::unordered_set<token>>& endpoint_tokens)
and hit the assert
assert(!tokens.empty());
1) Start node 1, node 2, node 3
2) Stop node 3
3) Start node 4 to replace node 3
4) Kill node 4 (removal of node 3 in system.peers is not flushed to disk)
5) Start node 4 (will load node 3's token and host_id info in bootup)
This makes
"Token .* changing ownership from 127.0.0.3 to 127.0.0.4"
messages printed again in step 5) which are not expected, which fails the dtest
FAIL: replace_first_boot_test (replace_address_test.TestReplaceAddress)
----------------------------------------------------------------------
Traceback (most recent call last):
File "scylla-dtest/replace_address_test.py",
line 220, in replace_first_boot_test
self.assertEqual(len(movedTokensList), numNodes)
AssertionError: 512 != 256
In commit 56df32ba56 (gossip: Mark node as
dead even if already left). A node liveness check is missed.
Fix it up.
Before: (mark a node down multiple times)
[Tue Dec 8 12:16:33 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:16:33 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
[Tue Dec 8 12:16:34 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:16:34 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
[Tue Dec 8 12:16:35 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:16:35 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
[Tue Dec 8 12:16:36 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:16:36 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
After: (mark a node down only one time)
[Tue Dec 8 12:28:36 2015] INFO [shard 0] gossip - InetAddress 127.0.0.3 is now DOWN
[Tue Dec 8 12:28:36 2015] DEBUG [shard 0] storage_service - endpoint=127.0.0.3 on_dead
The only reason we needed it is to make
_application_state[key] = value
work.
With the current default constructor, we increase the version number
needlessly. To fix and to be safe, remove the default constructor
completely.
Backport: CASSANDRA-8801
a53a6ce Decommissioned nodes will not rejoin the cluster.
Tested with:
topology_test.py:TestTopology.decommissioned_node_cant_rejoin_test
The get_token_endpoint API should return a map of tokens to endpoints,
including the bootstrapping ones.
Use get_local_storage_service().get_token_to_endpoint_map() for it.
$ nodetool -p 7100 status
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 127.0.0.1 12645 256 ? eac5b6cf-5fda-4447-8104-a7bf3b773aba rack1
UN 127.0.0.2 12635 256 ? 2ad1b7df-c8ad-4cbc-b1f1-059121d2f0c7 rack1
UN 127.0.0.3 12624 256 ? 61f82ea7-637d-4083-acc9-567e0c01b490 rack1
UJ 127.0.0.4 ? 256 ? ced2725e-a5a4-4ac3-86de-e1c66cecfb8d rack1
Fixes#617
Originally, lsa allocated each segment independently what could result
in high memory fragmentation. As a result many compaction and eviction
passes may be needed to release a sufficiently big contiguous memory
block.
These problems are solved by introduction of segment zones, contiguous
groups of segments. All segments are allocated from zones and the
algorithm tries to keep the number of zones to a minimum. Moreover,
segments can be migrated between zones or inside a zone in order to deal
with fragmentation inside zone.
Segment zones can be shrunk but cannot grow. Segment pool keeps a tree
containing all zones ordered by their base addresses. This tree is used
only by the memory reclamer. There is also a list of zones that have
at least one free segments that is used during allocation.
Segment allocation doesn't have any preferences which segment (and zone)
to choose. Each zone contains a free list of unused segments. If there
are no zones with free segments a new one is created.
Segment reclamation migrates segments from the zones higher in memory
to the ones at lower addresses. The remaining zones are shrunk until the
requested number of segments is reclaimed.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
A dynamic bitset implementation that provides functions to search for
both set and cleared bits in both directions.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Currently test case "Testing reading when memory can't be reclaimed."
assumes that the allocation section used by row cache upon entering
will require more free memory than there is available (inc. evictable).
However, the reserves used by allocation section are adjusted
dynamically and depend solely on previous events. In other words there
is no guarantee that the reserve would be increased so much that the
allocation will fail.
The problem is solved by adding another allocation that is guaranteed
to be bigger than all evictable and free memory.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Scattering of blobs from Avi:
This patchset converts the stack to scatter managed_bytes in lsa memory,
allowing large blobs (and collections) to be stored in memtable and cache.
Outside memtable/cache, they are still stored sequentially, but it is assumed
that the number of transient objects is bounded.
The approach taken here is to scatter managed_bytes data in multiple
blob_storage objects, but to linearize them back when accessing (for
example, to merge cells). This allows simple access through the normal
bytes_view. It causes an extra two copies, but copying a megabyte twice
is cheap compared to accessing a megabyte's worth of small cells, so
per-byte throughput is increased.
Testing show that lsa large object space is kept at zero, but throughput
is bad because Scylla easily overwhelms the disk with large blobs; we'll
need Glauber's throttling patches or a really fast disk to see good
throughput with this.
Add linearize() and unlinearize() methods that allow making an
atomic_cell_or_collection object temporarily contiguous, so we can examine
it as a bytes_view.
Instead of allocating a single blob_storage, chain multiple blob_storage
objects in a list, each limited not to exceed the allocation_strategy's
max_preferred_allocation_size. This allows lsa to allocate each blob_storage
object as an lsa managed object that can be migrated in memory.
Also provide linearize()/scatter() methods that can be used to temporarily
consolidate the storage into a single blob_storage. This makes the data
contiguous, so we can use a regular bytes_view to examine it.
This adds the implementation for the index_summary_off_heap_memory for a
single column family and for all of them.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Similiar to origin, off heap memory, memory_footprint is the size of
queus multiply by the structure size.
memory_footprint is used by the API to report the memory that is taken
by the summary.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Our premier allocation_strategy, lsa, prefers to limit allocations below
a tenth of the segment size so they can be moved around; larger allocations
are pinned and can cause memory fragmentation.
Provide an API so that objects can query for this preferred size limit.
For now, lsa is not updated to expose its own limit; this will be done
after the full stack is updated to make use of the limit, or intermediate
steps will not work correctly.
2015-12-06 16:23:42 +02:00
494 changed files with 27965 additions and 11389 deletions
Use class or struct similar to the object you need the serializer for.
Use namespace when applicable.
##keywords
* class/struct - a class or a struct like C++
class/struct can have final or stub marker
* namespace - has the same C++ meaning
* enum class - has the same C++ meaning
* final modifier for class - when a class mark as final it will not contain a size parameter. Note that final class cannot be extended by future version, so use with care
* stub class - when a class is mark as stub, it means that no code will be generated for this class and it is only there as a documentation.
* version attributes - mark with [[version id ]] mark that a field is available from a specific version
* template - A template class definition like C++
##Syntax
###Namespace
```
namespace ns_name { namespace-body }
```
* ns_name: either a previously unused identifier, in which case this is original-namespace-definition or the name of a namespace, in which case this is extension-namespace-definition
* namespace-body: possibly empty sequence of declarations of any kind (including class and struct definitions as well as nested namespaces)
* identifier: the name of the enumeration that's being declared.
* enum-base: colon (:), followed by a type-specifier-seq that names an integral type (see the C++ standard for the full list of all possible integral types).
* enumerator-list: comma-separated list of enumerator definitions, each of which is either simply an identifier, which becomes the name of the enumerator, or an identifier with an initializer: identifier = integral value.
Note that though C++ allows constexpr as an initialize value, it makes the documentation less readable, hence is not permitted.
* type: Any valid C++ type, following the C++ notation. note that there should be a serializer for the type, but deceleration order is not mandatory
* member-access: is the way the member can be access. If the member is public it can be the name itself. if not it could be a getter function that should be followed by braces. Note that getter can (and probably should) be const methods.
* attributes: Attributes define by square brackets. Currently are use to mark a version in which a specific member was added [ [ version version-number] ] would mark that the specific member was added in the given version number.
###template
`template < parameter-list > class-declaration`
* parameter-list - a non-empty comma-separated list of the template parameters.
* class-decleration - (See class section) The class name declared become a template name.
##IDL example
Forward slashes comments are ignored until the end of the line.
"description":"If the value is the string 'true' with any capitalization, repair only the first range returned by the partitioner.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"parallelism",
"description":"Repair parallelism, can be 0 (sequential), 1 (parallel) or 2 (datacenter-aware).",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"incremental",
"description":"If the value is the string 'true' with any capitalization, perform incremental repair.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"jobThreads",
"description":"An integer specifying the parallelism on each node.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"ranges",
"description":"An explicit list of ranges to repair, overriding the default choice. Each range is expressed as token1:token2, and multiple ranges can be given as a comma separated list.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"startToken",
"description":"Token on which to begin repair",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"endToken",
"description":"Token on which to end repair",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"columnFamilies",
"description":"Which column families to repair in the given keyspace. Multiple columns families can be named separated by commas. If this option is missing, all column families in the keyspace are repaired.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"dataCenters",
"description":"Which data centers are to participate in this repair. Multiple data centers can be listed separated by commas.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"hosts",
"description":"Which hosts are to participate in this repair. Multiple hosts can be listed separated by commas.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"trace",
"description":"If the value is the string 'true' with any capitalization, enable tracing of the repair.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -1964,6 +2044,20 @@
}
}
},
"map_string_double":{
"id":"map_string_double",
"description":"A key value mapping between a string and a double",
"properties":{
"key":{
"type":"string",
"description":"The key"
},
"value":{
"type":"double",
"description":"The value"
}
}
},
"maplist_mapper":{
"id":"maplist_mapper",
"description":"A key value mapping, where key and value are list",
throwexceptions::invalid_request_exception(sprint("Cannot rename unknown column %s in table %s",from,column_family()));
}
if(schema->get_column_definition(to->name())){
throwexceptions::invalid_request_exception(sprint("Cannot rename column %s to %s in table %s; another column of that name already exist",from,to,column_family()));
}
if(def->is_part_of_cell_name()){
throwexceptions::invalid_request_exception(sprint("Cannot rename non PRIMARY KEY part %s",from));
}
if(def->is_indexed()){
throwexceptions::invalid_request_exception(sprint("Cannot rename column %s because it is secondary indexed",from));
throwexceptions::invalid_request_exception(sprint("Table names shouldn't be more than %d characters long (got \"%s\")",schema::NAME_LENGTH,cf_name.c_str()));
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.