Commit Graph

612 Commits

Author SHA1 Message Date
Raphael S. Carvalho
dffb41f9d8 sstables: remove schema parameter from some sstable methods
schema can now be found in the sstable object itself.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <0fa44fedbe784d924522d7eeca77c16294479c6e.1473959677.git.raphaelsc@scylladb.com>
2016-09-19 13:25:58 +02:00
Calle Wilund
f126cf769a column_family: Ensure flush() waits for all previous flushes + self
Fixes #1577
Message-Id: <1472569952-4066-1-git-send-email-calle@scylladb.com>
2016-09-14 11:00:41 +01:00
Tomasz Grabiec
a498da1987 database: Ignore spaces in initial_token list
Currently we get a boost::lexical_cast failure on startup if initial_token
contains a list with spaces after the commas, e.g.:

  initial_token: -1100081313741479381, -1104041856484663086, ...

Fixes #1664.
Message-Id: <1473840915-5682-1-git-send-email-tgrabiec@scylladb.com>
2016-09-14 11:58:13 +03:00
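A minimal sketch of the trimming idea (the parser below is illustrative, not the actual ScyllaDB configuration code): strip surrounding whitespace from each comma-separated entry before numeric conversion, so a list with spaces after the commas no longer fails.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a comma-separated initial_token list,
// trimming whitespace from each entry before converting it, so values
// like "-1100081313741479381, -1104041856484663086" parse cleanly.
std::vector<int64_t> parse_initial_tokens(const std::string& list) {
    std::vector<int64_t> tokens;
    std::stringstream ss(list);
    std::string item;
    while (std::getline(ss, item, ',')) {
        auto begin = item.find_first_not_of(" \t");
        auto end = item.find_last_not_of(" \t");
        if (begin == std::string::npos) {
            continue; // skip empty entries
        }
        tokens.push_back(std::stoll(item.substr(begin, end - begin + 1)));
    }
    return tokens;
}
```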
Raphael S. Carvalho
b9f67351da db: expose clustering filter info via collectd
That's needed to observe the behavior of the clustering filter, and to
check whether it's worthwhile for a specific workload.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 11:32:23 -03:00
Raphael S. Carvalho
a2dc88889d db: enable clustering optimization only on dtcs
Leveled strategy will not benefit from this optimization because only
a few sstables will contain a given partition key, which means that a
clustering key belonging to a specific partition key can only be in a
few sstables as well.

Date tiered strategy is the one that will actually benefit the
most from this optimization. Size tiered may benefit from it too
if the clustering key isn't overwritten, but it will not use the
clustering optimization.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 11:31:07 -03:00
Raphael S. Carvalho
8d03ccd604 sstables: optimize reads with clustering filter
If the user specifies a clustering filter, it's possible to filter out an
sstable based on its metadata, which tracks the min/max clustering values.

For example, if an sstable stores clustering keys from 'a' through 'c',
it's possible to filter it out if the user asks for data with a
clustering key greater than 'c'.

That's done by comparing each component separately, because
a clustering key may be composite. Further information can be found
here: https://issues.apache.org/jira/browse/CASSANDRA-5514

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:51:50 -03:00
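A hypothetical sketch of the filtering idea described above: an sstable whose stats metadata records min/max clustering values per component can be skipped when the requested range lies entirely outside [min, max]. The names (`sstable_stats`, `may_contain_clustering_range`) are illustrative, not the actual ScyllaDB API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct sstable_stats {
    std::vector<std::string> min_clustering; // per-component minimum
    std::vector<std::string> max_clustering; // per-component maximum
};

// Returns false only when we can prove no row in [lo, hi] is in the sstable.
// Components are compared one at a time, since clustering keys may be
// composite; on a tie the next component decides.
bool may_contain_clustering_range(const sstable_stats& s,
                                  const std::vector<std::string>& lo,
                                  const std::vector<std::string>& hi) {
    for (std::size_t i = 0; i < s.max_clustering.size() && i < lo.size(); ++i) {
        if (lo[i] < s.max_clustering[i]) break;        // overlap possible
        if (lo[i] > s.max_clustering[i]) return false; // query starts past max
    }
    for (std::size_t i = 0; i < s.min_clustering.size() && i < hi.size(); ++i) {
        if (hi[i] > s.min_clustering[i]) break;        // overlap possible
        if (hi[i] < s.min_clustering[i]) return false; // query ends before min
    }
    return true;
}
```

With an sstable covering 'a' through 'c', a request for keys greater than 'c' is ruled out, matching the example in the message.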
Raphael S. Carvalho
004617839d database: check bloom filter of all sstables earlier
All sstables will now have their bloom filters checked in a single pass
before the reader iterates through the candidates. It's possible that
we will need to futurize the procedure if it holds the cpu for too
long. This change is also a step towards the optimization that
will rule out sstables based on the clustering filter.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:50:08 -03:00
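The single-pass idea can be sketched as follows; `filter_has_key` stands in for the real bloom filter lookup and the types are illustrative, not the ScyllaDB ones:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct sstable {
    // Stand-in for the bloom filter: may return true for absent keys
    // (false positives), never false for present ones.
    std::function<bool(uint64_t)> filter_has_key;
};

// Single pass over all candidates: keep only sstables whose bloom filter
// says the partition key may be present, before building any reader.
std::vector<const sstable*> candidates_for(const std::vector<sstable>& all,
                                           uint64_t key_hash) {
    std::vector<const sstable*> out;
    for (const auto& sst : all) {
        if (sst.filter_has_key(key_hash)) {
            out.push_back(&sst);
        }
    }
    return out;
}
```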
Raphael S. Carvalho
1f31223f32 sstables: store schema in sstable object
That will be needed for optimization that will store decorated keys
in the sstable object, and also for a subsequent work that will
detect wrong metadata (min/max column names) by looking at columns
in the schema. As schema is stored in sstable, there's no longer
a need to store ks and cf names in it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:49:17 -03:00
Glauber Costa
dc5d8e33af Revert "row_cache: update sstable histograms on cache hits"
This reverts commit 1726b1d0cc.

Reverting this patch turns our SSTable access counter into a miss counter only.
The estimated histogram always starts its first bucket at 1, so by marking cache
accesses we will be wrongly feeding "1" into the buckets.

Notice that this is not yet ideal: nodetool is supposed to show a histogram of
all reads, and by doing this we are changing its meaning slightly. Workloads
that serve mostly from cache will be distorted towards their misses.

The real solution is to use a different histogram, but we will need to enforce
a newer version of nodetool for that: the current issue is that nodetool expects
an EstimatedHistogram in a specific format on the other side.

Conflicts:
	row_cache.hh

Message-Id: <a599fa9e949766e7c9697450ae34fc28e881e90a.1472742276.git.glauber@scylladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-01 18:07:31 +03:00
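A minimal sketch of why recording cache hits distorts the histogram, as the revert explains. In an EstimatedHistogram-style structure whose first bucket offset is 1, any value below the first offset still lands in bucket 0, so a hit recorded as "0 sstables touched" is indistinguishable from a read that touched 1 sstable. The toy histogram below is illustrative, not the real class:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct estimated_histogram {
    // First bucket starts at 1, as in the EstimatedHistogram wire format.
    std::vector<int64_t> offsets{1, 2, 3, 4, 5};
    std::vector<int64_t> buckets = std::vector<int64_t>(5, 0);

    void add(int64_t v) {
        std::size_t i = 0;
        // Values below offsets[0] are clamped into the first bucket.
        while (i + 1 < offsets.size() && v >= offsets[i + 1]) {
            ++i;
        }
        ++buckets[i];
    }
};
```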
Duarte Nunes
ba374da043 database: Trace sstable accesses
This patch traces when we read from an sstable, be it for a key range or a
single key.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:04:32 +02:00
Duarte Nunes
030db65c62 database: Accept a trace_state_ptr
This patch changes the database and column_family types so a
trace_state_ptr can be passed in when querying. This enables tracing
of the inner components.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:04:28 +02:00
Glauber Costa
1726b1d0cc row_cache: update sstable histograms on cache hits
If we have a cache hit, we still need to update our sstable histogram - noting
that we have touched 0 SSTables.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:14:22 -04:00
Glauber Costa
ce24fd05fe database: keep statistics on SSTables touched per read
That is done for single partition queries only - mimicking what
Cassandra does on that matter.

For this to be correct, we also need to update this histogram on cache
hits - in which case we record the read as having touched 0 SSTables. That
will be done in a separate patch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:14:21 -04:00
Piotr Jastrzebski
3607d99269 Remove clustering_key_filtering_context.
Remove clustering_key_filter_factory and clustering_key_filtering_context.
Use partition_slice directly with a static get_ranges method.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-30 20:31:55 +02:00
Raphael S. Carvalho
108fd1fade database: close file in lister
After listing is done, let's close the file. This fixes no bug;
it's only an improvement.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <2f52d297bcf6a6b6e3429912c28f17e6b37f8842.1471381607.git.raphaelsc@scylladb.com>
2016-08-17 11:01:44 +03:00
Glauber Costa
b361dee488 database: memtables pending flushes tell us nothing
We have two counters that track how many memtable flushes are in progress and
how much memory they are pinning.

The problem is, after we revamped the code to limit the number of flushes
in progress, those counters became useless: as they live past the semaphore,
they are only incremented once we have gotten through it.

One wouldn't notice with CPU-bound workloads, where memtables don't
pile up. But as soon as they do, those counters will always show the same number:
the depth of the semaphore, which doesn't mean much. The problem is poised to
become much worse: once we enable write behind in full and set the semaphore's
depth to one, that's the number we'll see here all the time.

The fix is to move the counters outside the semaphore, which brings back their
old semantics.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <c5ae6903e170f3f356cdda7ed78a4c9ba8d5f024.1471370504.git.glauber@scylladb.com>
2016-08-17 10:54:15 +03:00
Paweł Dziepak
8a386a51bd Merge "Don't cache wide partitions" from Piotr
"When reading a partition, try to read it all,
but once more bytes are read than a given limit
we decide that the partition is wide and we don't cache it.
Instead we retry the read with clustering key filtering
applied."
2016-07-21 10:24:25 +01:00
Piotr Jastrzebski
636a4acfd0 Add flag to configure the max size of a cached partition

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-07-21 09:47:20 +02:00
Tomasz Grabiec
0d26294fac database: Add table name to log message about sealing
Message-Id: <1468917744-2539-1-git-send-email-tgrabiec@scylladb.com>
2016-07-20 10:12:31 +03:00
Tomasz Grabiec
a0832f08d2 schema_tables: Add more logging
Message-Id: <1468917771-2592-1-git-send-email-tgrabiec@scylladb.com>
2016-07-20 10:12:00 +03:00
Duarte Nunes
3518db531e database: Get non-system column_families
This patch adds a utility function that fetches the set of
column_families that do not belong to the system keyspace.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-18 23:58:31 +00:00
Duarte Nunes
4bc00c2055 database: Expose selection of sstables by a range
This patch allows a set of a column_family's sstables to be
selected according to a range of ring_positions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-18 23:58:31 +00:00
Avi Kivity
1048e1071b db: do not create column family directories belonging to foreign keyspaces
Currently, for any column family, we create a directory for it in all
keyspace directories.  This is incredibly awkward.

Fix by iterating over just the keyspace's column families, not all
column families in existence.

Fixes #1457.
Message-Id: <1468495182-18424-1-git-send-email-avi@scylladb.com>
2016-07-14 14:31:05 +03:00
Avi Kivity
23edc1861a db: estimate queued read size more conservatively
There are plenty of continuations involved, so don't assume it fits in 1k.
Message-Id: <1468429516-4591-1-git-send-email-avi@scylladb.com>
2016-07-14 11:42:24 +02:00
Avi Kivity
d3c87975b0 db: don't over-allocate memory for mutation_reader
column_family::make_reader() doesn't deal with sstables directly, so it
doesn't need to reserve memory for them.

Fixes #1453.
Message-Id: <1468429143-4354-1-git-send-email-avi@scylladb.com>
2016-07-14 10:01:42 +02:00
Avi Kivity
24e3026e32 Merge "compaction manager refactoring" from Raphael 2016-07-10 17:16:23 +03:00
Tomasz Grabiec
6a1f9a9b97 db: Improve logging
Message-Id: <1467997671-16570-1-git-send-email-tgrabiec@scylladb.com>
2016-07-10 16:15:03 +03:00
Tomasz Grabiec
c0233c877d db: Avoid out-of-memory when flushing cannot keep up
memtable_list::seal_on_overflow() is called on each mutation to check
whether the current memtable should be flushed. It will call
memtable_list::seal_active_memtable() when that is the case.

The number of concurrent seals has been guarded by a semaphore since
commit 0f64eb7e7d, which allows
at most 4 of them.

If there are 4 flushes already pending, every incoming mutation will
enqueue a new flush task on the semaphore's wait list, without waiting
for it. The wait queue can grow without bounds, eventually leading to
out-of-memory.

The fix is to seal the memtable immediately to satisfy the should_flush()
condition, but limit the concurrency of the actual flushes. This way the
wait queue size on the semaphore is limited by the number of memtables
pending a flush, which is fairly small.

Message-Id: <1467997652-16513-1-git-send-email-tgrabiec@scylladb.com>
2016-07-10 10:53:51 +03:00
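The fix can be sketched as below (illustrative names, not the real Scylla code): sealing swaps in a fresh active memtable immediately, so should_flush() stops firing for every subsequent mutation, and only the expensive flush work is limited by the semaphore, modeled here as a plain counter. The wait queue is then bounded by the number of sealed memtables rather than by the number of incoming writes.

```cpp
#include <deque>

struct memtable_list {
    static constexpr int max_concurrent_flushes = 4; // semaphore depth
    int active_size = 0;          // bytes in the active memtable
    int memtable_size_limit = 1024;
    std::deque<int> sealed;       // sealed memtables awaiting flush
    int flushes_in_progress = 0;

    bool should_flush() const { return active_size >= memtable_size_limit; }

    void apply(int mutation_size) {
        active_size += mutation_size;
        if (should_flush()) {
            // Seal immediately: the next mutation sees a fresh memtable,
            // so should_flush() no longer fires for every write.
            sealed.push_back(active_size);
            active_size = 0;
            start_flushes();
        }
    }

    void start_flushes() {
        while (flushes_in_progress < max_concurrent_flushes && !sealed.empty()) {
            sealed.pop_front();
            ++flushes_in_progress; // the real code would write an sstable here
        }
    }

    void flush_done() {
        --flushes_in_progress;
        start_flushes();
    }
};
```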
Raphael S. Carvalho
e38f66c6fe database: make certain column family functions const qualified
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-07-08 15:05:22 -03:00
Vlad Zolotarov
f2bf453be2 database: revive mutation retry in case of replay_position_reordered_exception
The logic that would retry applying a mutation in case of
a replay_position_reordered_exception error was broken by
commit 0c31f3e626
Author: Glauber Costa <glauber@scylladb.com>
Date:   Wed Apr 20 19:09:21 2016 -0400

    database: move memtable throttler to the LSA throttler

This patch makes it work again.

Fixes #1439

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1467893342-30559-1-git-send-email-vladz@cloudius-systems.com>
2016-07-07 15:00:35 +02:00
Paweł Dziepak
32a5de7a1f db: handle receiving fragmented mutations
If mutations are fragmented during streaming, special care must be
taken so that isolation guarantees are not broken.

Mutations received with the "fragmented" flag set are applied to a memtable
that is used only by that particular streaming task, and the sstables
created by flushing such memtables are not made visible until the task
is complete. Also, if the streaming fails, all data is dropped.

This means that fragmented mutations cannot benefit from coalescing of
writes from multiple streaming plans, hence the separate way of handling
them, so that there is no loss of performance for small partitions.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-07 12:18:35 +01:00
Paweł Dziepak
4031c0ed8f streaming: pass plan_id to column family for apply and flush
plan_id is needed to keep track of the origin of mutations so that if
they are fragmented all fragments are made visible at the same time,
when that particular streaming plan_id completes.

Basically, each streaming plan that sends big (fragmented) mutations is
going to have its own memtables and a list of sstables which will get
flushed and made visible when that plan completes (or dropped if it
fails).

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-07 12:18:35 +01:00
Paweł Dziepak
51ec7a7285 db: wait for ongoing flushes at end of streaming
When flush_streaming_mutations() is called at the end of streaming, it is
supposed to flush all data and then invalidate the affected cache ranges.
However, if there are already some memtable flushes in progress, it won't
wait for them.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-07 12:18:35 +01:00
Glauber Costa
54ce6221a7 allow the dirty memory manager to be used without a database object
Some of our tests don't provide a database object to a CF. Create a default
dirty memory manager object that can be used without a database for them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <872f8c9232ff87d788e271b1db86c814d7a75d9f.1467832713.git.glauber@scylladb.com>
2016-07-07 10:00:43 +01:00
Glauber Costa
b0932ceb04 database: act on LSA pressure notification
Issue 1195 describes a scenario, with a fairly easy reproducer, in which we can
freeze the database. It involves writing simultaneously to multiple CFs, such
that the sum of all the memory they are using is larger than the dirty memory
limit, without any of them individually being larger than the memtable size.
Because we will never reach the individual memtable seal size for any of them,
none of them will initiate a flush, bringing the database to a halt.

The LSA has now gained infrastructure that allows us to be notified when pressure
conditions mount. What we will do in this case is initiate a flush ourselves.

Fixes #1195

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-07-05 17:46:28 -04:00
Glauber Costa
7169b727ea move system tables to its own region
In the spirit of what we are doing for the read semaphore, this patch moves
system writes to their own dirty memory manager. Not only will it make sure
that system tables are not serialized behind user writes on the same
semaphore, it will also put system tables in their own region group.

Moving system tables to their own region group has the advantage that system
requests won't wait, during throttling, behind a potentially big queue of user
requests, since requests are tended to in FIFO order within the same region
group. However, system tables being more controlled and predictable, we can
actually go a step further and give them some extra reservation so they may not
necessarily block even when under pressure (up to 10 MB more).

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-07-05 17:46:28 -04:00
Glauber Costa
c358947284 database: wrap semaphore and region group into a new dirty memory manager
We currently have a semaphore at the column family level that protects us
against too many concurrent sstable flushes. However, storing that semaphore
in the CF, not the database, was an (implementation, not design) mistake.

One comment in particular makes it quite clear:

   // Ideally, we'd allow one memtable flush per shard (or per database object), and write-behind
   // would take care of the rest. But that still has issues, so we'll limit parallelism to some
   // number (4), that we will hopefully reduce to 1 when write behind works.

So I aimed for the shard, but ended up coding it into the CF because that's closer to the
flush point - my bad.

This patch fixes this while paving the way for active reclaim to take place. It wraps the semaphore
and the region group in a new structure, the dirty_memory_manager. The immediate benefit is that we
don't need to be passing both the semaphore and the region group downwards in the DB -> CF path. The
long term benefit is that we now have a one unified structure that can hold shared flush data in all
of the CFs.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-07-05 15:29:04 -04:00
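A rough sketch of the resulting shape (placeholder types; the real ones come from Seastar and the LSA): the semaphore limiting concurrent flushes and the region group tracking dirty memory move together into one object owned per shard, instead of the semaphore living in each column_family.

```cpp
struct flush_semaphore { int free_units = 4; }; // at most 4 concurrent flushes
struct region_group { long dirty_bytes = 0; };  // stand-in for the LSA region group

// One unified structure holding the shared flush state for all CFs on a
// shard, so the DB -> CF path passes a single object downwards.
struct dirty_memory_manager {
    flush_semaphore flush_sem; // was per-column_family; now shared per shard
    region_group region;       // shared dirty-memory accounting
};

struct database {
    dirty_memory_manager dirty_mm; // one per shard, handed to column families
};
```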
Glauber Costa
0c31f3e626 database: move memtable throttler to the LSA throttler
The LSA infrastructure, through the use of its region groups, now has
a built-in throttler mechanism. This patch converts the current throttlers
so that the LSA throttler is used instead.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-07-05 15:05:19 -04:00
Avi Kivity
68e613b313 Rebuild _column_family::_sstables when changing compaction_strategy
The concrete sstable_set type depends on the compaction strategy, so
ask the compaction_strategy to create a new sstable_set object and populate
it.
2016-07-03 13:42:10 +03:00
Avi Kivity
44a6cef4e1 sstable mutation readers: use sstable_set::select()
Apply compaction strategy specific logic to narrow down the set of sstables
used for a query; can speed up reads using LeveledCompactionStrategy
significantly.

Fixes #1185.
2016-07-03 10:50:58 +03:00
Avi Kivity
4cb7618601 Convert column_family::_sstables to sstable_set
Using sstable_set will allow us to filter sstables during a query before
actually creating a reader (this is left to the next patch; here we just
convert the users of the _sstables field).
2016-07-03 10:32:27 +03:00
Avi Kivity
2a46410f4a Change sstable_list from a map to a set
sstable_list is currently a map<generation, sstable>; change it to a set
in preparation for replacing it with sstable_set.  The change simplifies
a lot of code; the only casualty is the code that computes the highest
generation number.
2016-07-03 10:26:57 +03:00
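The one casualty mentioned above, sketched with illustrative types: with the old map<generation, sstable> the highest generation was simply the last key; with a set ordered by generation the lookup becomes an explicit query against the ordering.

```cpp
#include <cstdint>
#include <set>

struct sstable { int64_t generation; };

// Order the set by generation so the highest one is the last element.
struct by_generation {
    bool operator()(const sstable& a, const sstable& b) const {
        return a.generation < b.generation;
    }
};
using sstable_list = std::set<sstable, by_generation>;

int64_t highest_generation(const sstable_list& l) {
    return l.empty() ? 0 : l.rbegin()->generation;
}
```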
Avi Kivity
9ac730dcc9 mutation_reader: make restricting_mutation_reader even more restricting
While limiting the number of concurrently executing sstable readers reduces
our memory load, the queued readers, although consuming a small amount of
memory, can still grow without bounds.

To limit the damage, add two limits on the queue:
 - a timeout, which is equal to the read timeout
 - a queue length limit, which is equal to 2% of the shard memory divided
   by an estimate of the queued request size (1kb)

Together, these limits bound the amount of memory needed by queued disk
requests in case the disk can't keep up.
Message-Id: <1467206055-30769-1-git-send-email-avi@scylladb.com>
2016-06-29 15:17:35 +02:00
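The queue-length limit described above, written out as arithmetic: 2% of shard memory divided by the 1 KiB per-request estimate. The helper name is illustrative, not the actual code.

```cpp
#include <cstddef>

constexpr std::size_t queued_read_size_estimate = 1024; // assumed 1 KiB per request

// 2% of shard memory (1/50) divided by the per-request estimate gives the
// maximum number of queued reads allowed before rejecting.
constexpr std::size_t max_queued_reads(std::size_t shard_memory_bytes) {
    return (shard_memory_bytes / 50) / queued_read_size_estimate;
}
```

For example, a 1 GiB shard would allow roughly twenty thousand queued reads under these assumptions, bounding queue memory to about 20 MB.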
Avi Kivity
f03cd6e913 db: add statistics about queued reads 2016-06-27 17:25:08 +03:00
Avi Kivity
edeef03b34 db: restrict replica read concurrency
Since reading mutations can consume a large amount of memory, which, moreover,
is not predictable at the time the read is initiated, restrict the number
of reads to 100 per shard.  This is more than enough to saturate the disk,
and hopefully enough to prevent allocation failures.

Restriction is applied in column_family::make_sstable_reader(), which is
called either on a cache miss or if the cache is disabled.  This allows
cached reads to proceed without restriction, since their memory usage is
supposedly low.

Reads from the system keyspace use a separate semaphore, to prevent
user reads from blocking system reads.  Perhaps we should select the
semaphore based on the source of the read rather than the keyspace,
but for now using the keyspace is sufficient.
2016-06-27 17:17:56 +03:00
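A sketch of the concurrency restriction under stated assumptions: a per-shard budget of 100 units for user reads and a separate budget for system-keyspace reads, so user load cannot starve system reads. Seastar's semaphore is modeled here as a plain counter, and the selection by keyspace name is the simplification the message itself flags as "sufficient for now".

```cpp
#include <string>

struct read_concurrency {
    int user_available = 100;   // per-shard limit for user reads
    int system_available = 100; // separate budget for the system keyspace

    // Returns true if the read may proceed now; false means it must queue.
    bool try_start(const std::string& keyspace) {
        int& avail = (keyspace == "system") ? system_available : user_available;
        if (avail == 0) {
            return false;
        }
        --avail;
        return true;
    }

    void finish(const std::string& keyspace) {
        ++((keyspace == "system") ? system_available : user_available);
    }
};
```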
Duarte Nunes
aacc7193f2 schema: Replace keyspace's schema_ptr on CF update
This patch ensures we replace the schema_ptr held by its respective
keyspace object when a column family is being updated.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20160623085710.26168-1-duarte@scylladb.com>
2016-06-23 11:11:52 +02:00
Glauber Costa
e08fa7dafa fix potential stale data in cache update
We currently have a problem in update_cache that can be triggered by ordering
issues related to memtable flush termination (not initiation) and/or
update_cache() call duration.

That issue is described in #1364; in short, it happens if a call to
update_cache starts before an ongoing call finishes. There is then a new SSTable
that should be consulted by the presence checker but is not.

The partition checker operates on a stale list because we need to make sure the
SSTable we just wrote is excluded from it.  This patch changes the partition
checker so that all SSTables currently in use are consulted, except for the one
we have just flushed. That provides both the guarantee that we won't check our
own SSTable and access to the most up-to-date SSTable list.

Fixes #1364

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <fa1cee672bba8e21725c6847353552791225295f.1466534499.git.glauber@scylladb.com>
2016-06-23 10:54:44 +02:00
Duarte Nunes
69798df95e query: Limit number of partitions returned
This is required to implement a thrift verb.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-22 09:48:13 +02:00
Paweł Dziepak
0828c88b25 mutation_partition: implement streaming-friendly data_query()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:31:19 +01:00
Paweł Dziepak
b6f78a8e2f sstable: make sstable reads return streamed_mutation
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:50 +01:00