Commit Graph

560 Commits

Author SHA1 Message Date
Tomasz Grabiec
2db8626dbf database: Ignore spaces in initial_token list
Currently we get boost::lexical_cast on startup if inital_token has a
list which contains spaces after commas, e.g.:

  initial_token: -1100081313741479381, -1104041856484663086, ...

Fixes #1664.
Message-Id: <1473840915-5682-1-git-send-email-tgrabiec@scylladb.com>

(cherry picked from commit a498da1987)
2016-09-14 12:03:41 +03:00
Avi Kivity
e296fef581 Fix bad backport (259b2592d4) 2016-07-15 14:18:50 +03:00
Avi Kivity
5ee6a00b0f db: don't over-allocate memory for mutation_reader
column_family::make_reader() doesn't deal with sstables directly, so it
doesn't need to reserve memory for them.

Fixes #1453.
Message-Id: <1468429143-4354-1-git-send-email-avi@scylladb.com>

(cherry picked from commit d3c87975b0)
2016-07-15 14:11:01 +03:00
Avi Kivity
64df5f3f38 db: estimate queued read size more conservatively
There are plenty of continuations involved, so don't assume it fits in 1k.
Message-Id: <1468429516-4591-1-git-send-email-avi@scylladb.com>

(cherry picked from commit 23edc1861a)
2016-07-15 14:09:47 +03:00
Avi Kivity
259b2592d4 db: do not create column family directories belonging to foreign keyspaces
Currently, for any column family, we create a directory for it in all
keyspace directories.  This is incredibly awkward.

Fix by iterating over just the keyspace's column families, not all
column families in existence.

Fixes #1457.
Message-Id: <1468495182-18424-1-git-send-email-avi@scylladb.com>

(cherry picked from commit 1048e1071b)
2016-07-15 14:08:46 +03:00
Avi Kivity
8547f34d60 mutation_reader: make restricting_mutation_reader even more restricting
While limiting the number of concurrently executing sstable readers reduces
our memory load, the queued readers, although consuming a small amount of
memory, can still grow without bounds.

To limit the damage, add two limits on the queue:
 - a timeout, which is equal to the read timeout
 - a queue length limit, which is equal to 2% of the shard memory divided
   by an estimate of the queued request size (1kb)

Together, these limits bound the amount of memory needed by queued disk
requests in case the disk can't keep up.
Message-Id: <1467206055-30769-1-git-send-email-avi@scylladb.com>

(cherry picked from commit 9ac730dcc9)
2016-06-29 17:29:00 +03:00
Avi Kivity
00692d891e db: add statistics about queued reads
Fixes #1398.

(cherry picked from commit f03cd6e913)
2016-06-27 19:43:16 +03:00
Avi Kivity
94aa879d19 db: restrict replica read concurrency
Since reading mutations can consume a large amount of memory, which, moreover,
is not predicatable at the time the read is initiated, restrict the number
of reads to 100 per shard.  This is more than enough to saturate the disk,
and hopefully enough to prevent allocation failures.

Restriction is applied in column_family::make_sstable_reader(), which is
called either on a cache miss or if the cache is disabled.  This allows
cached reads to proceed without restriction, since their memory usage is
supposedly low.

Reads from the system keyspace use a separate semaphore, to prevent
user reads from blocking system reads.  Perhaps we should select the
semaphore based on the source of the read rather than the keyspace,
but for now using the keyspace is sufficient.

Fixes #1398.

(cherry picked from commit edeef03b34)
2016-06-27 19:43:07 +03:00
Duarte Nunes
ffeef2f072 database: Actually decrease query_state limit
query_state expects the current row limit to be updated so it
can be enforced across partition ranges. A regression introduced
in e4e8acc946 prevented that from
happening by passing a copy of the limit to querying_reader.

This patch fixes the issue by having column_family::query update
the limit as it processes partitions from the querying_reader.

Fixes #1338

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1465804012-30535-1-git-send-email-duarte@scylladb.com>
(cherry picked from commit c896309383)
2016-06-21 10:03:22 +03:00
Nadav Har'El
ad50d83302 Rewriting shared sstables only after all shards loaded sstables
After commit faa4581, each shard only starts splitting its shared sstables
after opening all sstables. This was important because compaction needs to
be aware of all sstables.

However, another bug remained: If one shard finishes loading its sstables
and starts the splitting compactions, and in parallel a different shard is
still opening sstables - the second shard might find a half-written sstable
being written by the first shard, and abort on a malformed sstable.

So in this patch we start the shared sstable rewrites - on all shards -
only after all shards finished loading their sstables. Doing this is easy,
because main.cc already contains a list of sequential steps where each
uses invoke_on_all() to make sure the step completes on all shards before
continuing to the next step.

Fixes #1371

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1466426641-3972-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit 3372052d48)
2016-06-20 18:20:01 +03:00
Nadav Har'El
dececbc0b9 Rewrite shared sstables only after entire CF is read
Starting in commit 721f7d1d4f, we start "rewriting" a shared sstable (i.e.,
splitting it into individual shards) as soon as it is loaded in each shard.

However as discovered in issue #1366, this is too soon: Our compaction
process relies in several places that compaction is only done after all
the sstables of the same CF have been loaded. One example is that we
need to know the content of the other sstables to decide which tombstones
we can expire (this is issue #1366). Another example is that we use the
last generation number we are aware of to decide the number of the next
compaction output - and this is wrong before we saw all sstables.

So with this patch, while loading sstables we only make a list of shared
sstables which need to be rewritten - and the actual rewrite is only started
when we finish reading all the sstables for this CF. We need to do this in
two cases: reboot (when we load all the existing sstables we find on disk),
and nodetool referesh (when we import a set of new sstables).

Fixes #1366.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1466344078-31290-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit faa45812b2)
2016-06-19 17:11:14 +03:00
Nadav Har'El
0a2d4204bd Rewrite shared sstables soon after startup
Several shards may share the same sstable - e.g., when re-starting scylla
with a different number of shards, or when importing sstables from an
external source. Sharing an sstable is fine, but it can result in excessive
disk space use because the shared sstable cannot be deleted until all
the shards using it have finished compacting it. Normally, we have no idea
when the shards will decide to compact these sstables - e.g., with size-
tiered-compaction a large sstable will take a long time until we decide
to compact it. So what this patch does is to initiate compaction of the
shared sstables - on each shard using it - so that a soon as possible after
the restart, we will have the original sstable is split into separate
sstables per shard, and the original sstable can be deleted. If several
sstables are shared, we serialize this compaction process so that each
shard only rewrites one sstable at a time. Regular compactions may happen
in parallel, but they will not not be able to choose any of the shared
sstables because those are already marked as being compacted.

Commit 3f2286d0 increased the need for this patch, because since that
commit, if we don't delete the shared sstable, we also cannot delete
additional sstables which the different shards compacted with it. For one
scylla user, this resulted in so much excessive disk space use, that it
literally filled the whole disk.

After this patch commit 3f2286d0, or the discussion in issue #1318 on how
to improve it, is no longer necessary, because we will never compact a shared
sstable together with any other sstable - as explained above, the shared
sstables are marked as "being compacted" so the regular compactions will
avoid them.

Fixes #1314.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1465406235-15378-1-git-send-email-nyh@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 721f7d1d4f)
2016-06-16 14:01:33 +03:00
Tomasz Grabiec
74b8f63e8f row_cache: Make stronger guarantees in clear/invalidate
Correctness of current uses of clear() and invalidate() relies on fact
that cache is not populated using readers created before
invalidation. Sstables are first modified and then cache is
invalidated. This is not guaranteed by current implementation
though. As pointed out by Avi, a populating read may race with the
call to clear(). If that read started before clear() and completed
after it, the cache may be populated with data which does not
correspond to the new sstable set.

To provide such guarantee, invalidate() variants were adjusted to
synchronize using _populate_phaser, similarly like row_cache::update()
does.

(cherry picked from commit 170a214628)

Conflicts:
	database.cc
2016-06-16 14:01:33 +03:00
Glauber Costa
30d54cef38 database: add a comment explaining the choice of function in CF stop
We have recently commited a fix to a broken streaming bug that involved
reverting column_family::stop() back to calling the custom seal functions
explicitly for both memtables and streaming memtables.

We here add a comment to explain why that had to be done.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <fe94b5883e9c29adc7fc9ee9f498894c057e7b64.1464293167.git.glauber@scylladb.com>
2016-05-29 11:28:15 +03:00
Glauber Costa
46f60f52d9 database: do not use implicitly stated seal function when closing the CF
In commit 4981362f57, I have introduced a regression that was thankfully
caught by our dtest infrastructure.

That patch is a preparation patch for the active reclaim patchset that is to
come, and it consolidated all the flushes using the memtable_list's seal_fn
function instead of calling the seal function explicitly.

The problem here is that the streaming memtables have the delayed mechanism,
about which the memtable_list is unaware. Calling memtable_list's
seal_active_memtable() for the streaming memtables calls the delayed version,
that does not guarantee flush. If we're lucky, we will indeed flush after the
timer expires, but if we're not we'll just stop the CF with data not flushed.

There are two options to fix this: the first is to teach the memtable_list about
the delayed/forced mechanism, and the second is to just call the correct
function explicitly during shutdown, and then when the time comes to add
continuations to the result of the seal, add them here as well.

Although the second option involves a bit more work and duplication, I think it
is better in the sense that the delayed / forced mechanism really is something
that belong to the streaming only. Being this the only user, I don't think it
justifies complicating the memtable_list with this concept.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <b26017c825ccf585f39f58c4ab3787d78e551f5f.1464126884.git.glauber@scylladb.com>
2016-05-25 08:21:24 +03:00
Pekka Enberg
ceb29f9d32 Merge "Introduce upload dir for sstable migration" from Raphael
"This change is intended to make migration process safer and easier.
 All column families will now have a directory called upload.
 With this feature, users may choose to copy migrated sstables to upload
 directory of respective column families, and run 'nodetool refresh'.
 That's supposed to be the preferred option from now on."
2016-05-24 16:36:47 +03:00
Avi Kivity
9637c2232c Merge "Move the JMX timer polling logic to Scylla" from Amnon 2016-05-24 13:07:52 +03:00
Raphael S. Carvalho
c2fa3b796d db: fix read consistency after refresh
If sstable loaded by refresh covers a row that is cached by the
column family, read query may fail to return consistent data.
What we should do is to clear cache for the column family being
loaded with new sstables.

Fixes #1212.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <a08c9885a5ceb0b2991e40337acf5b7679580a66.1464072720.git.raphaelsc@scylladb.com>
2016-05-24 12:11:41 +03:00
Raphael S. Carvalho
e5f0314afd db: introduce upload directory for sstable migration
This change is intended to make migration process safer and easier.
All column families will now have a directory called upload.
With this feature, users may choose to copy migrated sstables to upload
directory of respective column families, and call 'nodetool refresh'.
That's supposed to be the preferred option from now on.

For each sstable in upload directory, refresh will do the following:
1) Mutate sstable level to 0.
2) Create hard links to its components in column family dir, using
a new generation. We make it safe by creating a hard link to temporary
TOC first.
3) Remove all of its components in upload directory.

This new code runs after refresh checked for new sstables in the column
family directory. Otherwise, we could have a generation conflict.
Unlike the first step, this new step runs with sstable write enabled.
It's easier here because we know exactly which sstables are new.

After that, refresh will load new sstables found in column family
and upload directories.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-05-20 17:26:21 -03:00
Raphael S. Carvalho
74c8a87777 sstables: fix statistics rewrite
It's not working because it tries to overwrite existing statistics
file with exclusive flag.
It's fixed by writing new statistics into temporary file and
renaming it into place.

If Scylla failed in middle of rewrite, a temporary file is left
over. So boot code was adjusted to delete a temporary file created
by this rewrite procedure.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-05-20 17:24:15 -03:00
Raphael S. Carvalho
ee0f66eef6 db: fix migration of sstables with level greater than 0
Refresh will rewrite statistics of any migrated sstable with level
> 0. However, this operation is currently not working because O_EXCL
flag is used, meaning that create will fail.

It turns out that we don't actually need to change on-disk level of
a sstable by overwriting statistics file.
We can only set in-memory level of a sstable to 0. If Scylla reboots
before all migrated sstables are compacted, leveled strategy is smart
enough to detect sstables that overlap, and set their in-memory level
to 0.

Fixes #1124.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-05-17 11:08:08 -03:00
Amnon Heiman
750f30cf07 column_family: Change histogram to
timed_rate_moving_average_and_histogram

As part of moving the derived statistic in to scylla, this replaces the
histogram object in the column_family to
timed_rate_moving_average_and_histogram.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-05-17 11:53:15 +03:00
Glauber Costa
17b9203719 database: invert order of elements
So that the sizes of the region can be initialized first

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <dc3df186a977b492d83c0a397f206c2db940aa37.1463448522.git.glauber@scylladb.com>
2016-05-17 11:28:39 +03:00
Glauber Costa
2ff6d38d0c database: use a single constructor for the column family
We've been keeping two constructors for the column family to allow for a
version without the commitlog. But it's by now quite complicated to maintain
the two, because changes always have to be made in two places.

This patch adds a private constructor that does the actual construction, and
have the public constructors to call it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <dd3cb0b9c20ad154a6131bad6ece619f70ed5025.1463448522.git.glauber@scylladb.com>
2016-05-17 11:28:39 +03:00
Glauber Costa
8fede5b98e memtables: isolate logic for disk writes disabled
When we have disk writes disabled, we exit immediately from the flush
function. We can just encode that separately and pass a different function
in the memtable_list creation. That simplifies the memtable flush a bit.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <908e3b5eb2c6ee84b8ad7b31c3673be5531a087c.1463448522.git.glauber@scylladb.com>
2016-05-17 11:28:38 +03:00
Glauber Costa
4981362f57 memtables: always seal through memtable_list seal function
I would like to be able to apply a function at the end of every flush, that is
common for both memtables and streaming memtables. For instance, to unthrottle
current waiters. Right now some calls to seal_active_memtable are open coded,
calling the column family's function directly, for both the main memtable list
and the streaming list.

This patch moves all the current open code callers to call the respective
memtable_list function.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0c780254f3c4eb03e2bcd856b83941cf49a84b85.1463448522.git.glauber@scylladb.com>
2016-05-17 11:28:37 +03:00
Piotr Jastrzebski
dcba6f5c45 Pass clustering_row_ranges to mutation readers.
This will allow readers to reduce the amount of data read.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-05-16 14:36:57 +02:00
Piotr Jastrzebski
23c23abe53 Make memtable mutation_reader slice using clustering ranges.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-05-16 11:46:41 +02:00
Piotr Jastrzebski
484d2ecd0a Slice data with clustering key range in sstable reader
Add additional parameters to mp_row_consumer to be able to fetch
only cells for given clustering key ranges

This will be used in row_cache when it will work on clustering key
level instead of partition key level.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-05-16 11:46:30 +02:00
Pekka Enberg
d93d46e721 Merge "ALTER KEYSPACE" from Calle
"Implementation of ALTER KEYSPACE.
Fixes #429"
2016-05-10 22:07:06 +03:00
Piotr Jastrzebski
240a185727 Stop scanning keyspace data directory when populating.
Iterate over column families and check/create directories for them
instead of scanning keyspace data directory and filtering directories
against column families that exist in system tables for this keyspace.

Fixes #1008

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <26da66eec67a1ab1318917a66161915cdef924ab.1462890592.git.piotr@scylladb.com>
2016-05-10 17:35:55 +03:00
Calle Wilund
6ef7885ae3 database: Implement update_keyspace
Reloads keyspace metadata and replaces in existing keyspace. 
Note: since keyspace metadata, and consequently, replication 
strategy now becomes volatile, keyspace::metadata now returns
shared pointer by value (i.e. keep-alive). 
Replication strategy should receive the same treatment, but
since it is extensively used, but never kept across a 
continuation, I've just added a comment for now.
2016-05-10 14:31:30 +00:00
Avi Kivity
80302d98dd database: silence atomic deletion cancellation logs during compaction
Those logs are expected during shutdown.
2016-05-07 20:37:48 +03:00
Raphael S. Carvalho
5aeeb0b3e8 compaction: add support to parallel compaction on the same column family
It was noticed that small sstables will accumulate for a column family because
scylla was limited to two compaction per shard, and a column family could have
at most one compaction running at a given shard. With the number of sstables
increasing rapidly, read performance is degraded.

At the moment, our compaction manager works by running two compaction task
handlers that run in parallel to the rest of the system. Each task handler
gets to run when needed, gets a column family from compaction manager queue,
runs compaction on it, and goes to sleep again. That's basically its cycle.
Compaction manager only allows one instance of a column family to be on its
queue, meaning that it's impossible for a column family to be compacted in
parallel. One compaction starts after another for a given column family.

To solve the problem described, we want to concurrently run compaction jobs
of a column family that have different "size tier" (or "weight").
For those unfamiliar, compaction job contains a list of sstables that will be
compacted together.
The "size tier" of a compaction job is the log of the total size of the input
sstables. So a compaction job only gets to run if its "size tier" is not the
same of an ongoing compaction. There is no point in compacting concurrently at
the same "size tier", because that slows down both compactions.

We will no longer queue column families in compaction manager. Instead, we
create a new fiber to run compaction on demand.
This fiber that runs asynchronously will do the following:
1) Get a compaction job from compaction strategy.
2) Calculate "size tier" of compaction job.
3) Run compaction job if its "size tier" is not the same of an ongoing
compaction for the given column family.
As before, it may decide to re-compact a column family based on a stat stored
in column family object.

Ran all compaction-related dtests.

Fixes #1216.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d30952ff136192a522bde4351926130addec8852.1462311908.git.raphaelsc@scylladb.com>
2016-05-04 11:46:09 +03:00
Calle Wilund
9130b0de16 database.cc: Fix compilation error with boost 1.55
Message-Id: <1461067254-526-1-git-send-email-calle@scylladb.com>
2016-04-25 12:54:43 +03:00
Pekka Enberg
f6da9bc92b Merge "Additional mutations/queries related collectd metrics" from Vlad
"This series introduces some additional metrics (mostly) in a storage_proxy and
a database level that are meant to create a better picture of how data flows
in the cluster.

First of all where possible counters of each category (e.g. total writes in the storage
proxy level) are split into the following categories:
   - operations performed on a local Node
   - operations performed on remote Nodes aggregated per DC

In a storage_proxy level there are the following metrics that have this "split"
nature (all on a sending side):
   - total writes (attempts/errors)
   - writes performed as a result of a Read Repair logic
   - total data reads (attempts/completed/errors)
   - total digest reads (attempts/completed/errors)
   - total mutations data reads (attempts/completed/errors)

In a batchlog_manager:
   - writes performed as a result of a batchlog replay logic

Thereby if for instance somebody wants to get an idea of how many writes
the current Node performs due to user requested mutations only he/she has
to take a counter of total writes and subtract the writes resulted by Read
Repairs and batchlog replays.

On a receiving side of a storage_proxy we add the two following counters:
   - total number of received mutations
   - total number of forwarded mutations (attempts/errors)

In order to get a better picture of what is going on on a local Node
we are adding two counters on a database level:
   - total number of writes
   - total number of reads

Comparing these to total writes/reads in a storage_proxy may give a good
idea if there is an excessive access to a local DB for example."
2016-04-21 15:58:45 +03:00
Vlad Zolotarov
97e5bfa815 database: add metrics for total writes and reads
This patch adds a counter of total writes and reads
for each shard.

It seems that nothing ensures that all database queries are
ready before database object is destroyed.
Make _stats lw_shared_ptr in order to ensure that the object is
alive when lambda gets to incrementing it.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-04-21 11:28:53 +03:00
Duarte Nunes
c7b3a4b144 udt: Parse user types system table
This patch loads and parses the user types system table during
bootstrap.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-04-20 09:54:06 +02:00
Tomasz Grabiec
45527fcffa Merge branch 'glommer/issue-1144-v5'
From Glauber:

There are current some outstanding issues with the throttling code. It's
easier to see them with the streaming code, but at least one of them is general.

One of them is related to situations in which the amount of memory available
leaves only one memtable fitting in memory. That would only happen with the
general code if we set the memtable cleanup threshold to 100 % - and I don't
even know if it is valid - but will happen quite often with the streaming code.
If that happens, we'll start throttling when that memtable is being written,
but won't be able to put anything else in its place - leading to unnecessary
throttling.

The second, and more serious, happens when we start throttling and the amount
of available memory is not at least 1MB. This can deadlock the database in
the sense that it will prevent any request from continuing, and in turn causing
a flush due to memtable size. It is a good practice anyway to always guarantee
progress.

Fixes #1144
2016-04-18 12:20:13 +02:00
Glauber Costa
9c87ae3496 throttle: always release at least one request if we are below the limit
Our current throttling code releases one requests per 1MB of memory available
that we have. If we are below the memory limit, but not by 1MB or more, then
we will keep getting to unthrottle, but never really do anything.

If another memtable is close to the flushing point, those requests may be
exactly the ones that would make it flush. Without them, we'll freeze the
database.

In general, we need to always release at least one request to make sure that
progress is always achieved.

This fixes #1144

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-14 13:13:15 -04:00
Glauber Costa
2c5dfe08c1 memtable_list: make sure at least two memtables are available
This is usually not a problem for the main memtable list - although it can be,
depending on settings, but shows up easily for the streaming memtables list.

We would like to have at least two memtables, even if we have to cut it short.
If we don't do that, one memtable will have use all available memory and we'll
force throttling until the memtable gets totally flushed.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-14 12:12:50 -04:00
Glauber Costa
1daede7396 unnest throttle_state
throttle_state is currently a nested member of database, but there is no
particular reason - aside from the fact that it is currently only ever
referenced by the database for us to do so.

We'll soon want to have some interaction between this and the column family, to
allow us to flush during throttle. To make that easier, let's unnest it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-14 12:12:50 -04:00
Glauber Costa
39def369ce move information about memtables' region group inside memtable list
This is a preparation patch so we can move the throttling infrastructure inside
the memtable_list. To do that, the region group will have to be passed to the
throttler so let's just go ahead and store it.

In consequence of that, all that the CF has to tell us is what is the current
schema - no longer how to create a new memtable.

Also, with a new parameter to be passed to the memtable_list the creation code
gets quite big and hard to follow. So let's move the creation functions to a
helper.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-14 12:12:50 -04:00
Avi Kivity
a843aea547 db: delete compacted sstables atomically
If sstables A, B are compacted, A and B must be deleted atomically.
Otherwise, if A has data that is covered by a tombstone in B, and that
tombstone is deleted, and if B is deleted while A is not, then the data
in A is resurrected.

Fixes #1181.
2016-04-14 17:14:26 +03:00
Paweł Dziepak
2db70cf912 database: remove throw() specifiers
Most of them are missing std::bad_alloc (which leads to aborts) and they
force the compiler to add unnecessary runtime checks.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-04-11 23:52:13 +01:00
Glauber Costa
8453ff7788 make get_sstable_key_range an instance method
Because just creating an SSTable object does not generate any I/O,
get_sstable_key_range should be an instance method. The main advantage
of doing that is that we won't have to read the summary twice. The way
we're doing it currently, if happens to be a shard-relevant table we'll
call load() - which reads the summary again.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-08 17:14:29 -04:00
Avi Kivity
db03295c8a Merge "Fix query digest mismatch" from Tomasz
"Currently data query digest includes cells and tombstones which may have
expired or be covered by higher-level tombstones. This causes digest
mismatch between replicas if some elements are compacted on one of the
nodes and not on others. This mismatch triggers read-repair which doesn't
resolve because mutations received by mutation queries are not differing,
they are compacted already.

The fix adds compacting step before writing and digesting query results by
reusing the algorithm used by mutation query. This is not the most optimal
way to fix this. The compaction step could be folded with the query writing,
there is redundancy in both steps. However such change carries more risk,
and thus was postponed.

perf_simple_query test (cassandra-stress-like partitions) shows regression
from 83k to 77k (7%) ops/s.

Fixes #1165."
2016-04-08 12:13:29 +03:00
Pekka Enberg
38a54df863 Fix pre-ScyllaDB copyright statements
People keep tripping over the old copyrights and copy-pasting them to
new files. Search and replace "Cloudius Systems" with "ScyllaDB".

Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>
2016-04-08 08:12:47 +03:00
Tomasz Grabiec
f15c380a4f database: Compact mutations when executing data queries
Currently data query digest includes cells and tombstones which may have
expired or be covered by higher-level tombstones. This causes digest
mismatch between replicas if some elements are compacted on one of the
nodes and not on others. This mismatch triggers read-repair which doesn't
resolve because mutations received by mutation queries are not differing,
they are compacted already.

The fix adds compacting step before writing and digesting query results by
reusing the algorithm used by mutation query. This is not the most optimal
way to fix this. The compaction step could be folded with the query writing,
there is redundancy in both steps. However such change carries more risk,
and thus was postponed.

perf_simple_query test (cassandra-stress-like partitions) shows regression
from 83k to 77k (7%) ops/s.

Fixes #1165.
2016-04-07 19:56:58 +02:00
Calle Wilund
ff5df306e3 database: Use disk-marking delete function in discard_sstables
Fixes #797

To make sure an inopportune crash after truncate does not leave
sstables on disk to be considered live, and thus resurrect data,
after a truncate, use delete function that renames the TOC file to
make sure we've marked sstables as dead on disk when we finish
this discard call.
Message-Id: <1458575440-505-2-git-send-email-calle@scylladb.com>
2016-03-24 12:02:08 +02:00