This order is required since 5e1348e741
(storage_service: Use get_local_snitch_ptr in gossip_snitch_info).
This fixes the breakage in the cql_query_test.
Reported-by: Asias He <asias@cloudius-systems.com>
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This reverts commit 52aa0a3f91.
After c9909dd183 this is no longer needed, since a reference to the
handler is not used in the abstract_write_response_handler::wait() continuation.
Conflicts:
service/storage_proxy.cc
"This introduces a very simple cache which caches whole partitions.
There is a reclaimer registerred which clears all caches upon memory pressure.
This is a temporary measure until we implement log-structured allocator and
incremental eviction.
I can see that for small data sets this series imporoves cassandra-stress read
throughput from 2k to 50k tps on muninn/huginn."
The row_cache class is meant to cache data for a given table by wrapping
some underlying data source. It gives out a mutation_reader which
uses in-memory data when possible, or delegates to the underlying reader
and populates the cache on the fly.
Accesses to data in the cache are tracked for eviction purposes by a
separate entity, the cache_tracker. There is one such tracker for the
whole shard.
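A minimal sketch of the read-through idea, assuming simplified, hypothetical
types (the real class wraps a mutation source and hands out a mutation_reader,
and eviction is driven by the cache_tracker):

    #include <functional>
    #include <string>
    #include <unordered_map>

    struct partition_data { std::string rows; };  // stand-in for a cached partition
    using underlying_reader = std::function<partition_data(const std::string&)>;

    class simple_row_cache {
        std::unordered_map<std::string, partition_data> _partitions; // per-partition cache
        underlying_reader _underlying;                               // wrapped data source
    public:
        explicit simple_row_cache(underlying_reader r) : _underlying(std::move(r)) {}

        // Serve from memory if possible, otherwise read from the underlying
        // source and populate the cache on the fly.
        const partition_data& read(const std::string& key) {
            auto it = _partitions.find(key);
            if (it == _partitions.end()) {
                it = _partitions.emplace(key, _underlying(key)).first;
            }
            return it->second;
        }

        // The reclaimer simply drops everything upon memory pressure,
        // mirroring the "clear all caches" behaviour described above.
        void clear() { _partitions.clear(); }
    };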
Currently mutation clustering uses two timers: one expires when the wait for
CL times out and is canceled when CL is achieved; the other expires if some
endpoints do not answer for a long time (CL may already be achieved at
this point and the first timer will be canceled). This is too complicated,
especially since both timers can expire simultaneously. Simplify it by
having only one timer and checking in its callback whether CL was achieved.
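A hedged sketch of the single-timer approach, using illustrative names rather
than the actual storage_proxy handler code: one timer fires at the longer
deadline and its callback tells the two cases apart by checking whether CL was
already reached:

    #include <cstdio>

    struct write_response_handler_sketch {
        int acks_needed;         // responses required to satisfy the consistency level
        int acks_received = 0;
        bool timed_out = false;

        void on_response() {
            ++acks_received;
            // Nothing to cancel any more; the single timer checks CL itself.
        }

        // Single expiry callback replacing the two timers.
        void on_timer() {
            if (acks_received >= acks_needed) {
                // CL already achieved: only some endpoints never answered,
                // so just clean the handler up.
                std::puts("CL achieved, expiring handler for stragglers");
            } else {
                // CL not achieved in time: report a write timeout.
                timed_out = true;
                std::puts("mutation write timeout");
            }
        }
    };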
"
- Introduce a global distributed snitch object.
- Add the corresponding methods in i_endpoint_snitch class needed to work with
this object.
- Added additional check to gossiping_property_file_snitch_test.
"
The single generated file gets corrupted from time to time. Switch to using
multiple files in the hope that this will resolve the issue.
Signed-off-by: Shlomi Livne <shlomi@cloudius-systems.com>
This tests the basic compaction functionality: I created three small
sstables using Cassandra (see commands below); the test compacts them into
one, loads the resulting sstable, and checks its content.
The test also demonstrates a bug (the relevant check is commented out so the
test still succeeds): if a partition had old values and then a newer deletion
(tombstone) in another sstable, both the values and the tombstone are left
behind in the compacted table. This will be fixed (and the check uncommented)
in a later patch; a sketch of the expected semantics follows the commands
below.
The three sstables were created with:
USE try1;
CREATE TABLE compaction (
    name text,
    age int,
    height int,
    PRIMARY KEY (name)
);
INSERT INTO compaction (name, age) VALUES ('nadav', 40);
INSERT INTO compaction (name, age) VALUES ('john', 30);
<flush>
INSERT INTO compaction (name, height) VALUES ('nadav', 186);
INSERT INTO compaction (name, age, height) VALUES ('jerry', 40, 170);
<flush>
DELETE FROM compaction WHERE name = 'nadav';
INSERT INTO compaction (name, age) VALUES ('john', 20);
INSERT INTO compaction (name, age, height) VALUES ('tom', 20, 180);
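As promised above, a minimal sketch (illustrative types only, not the
sstables code) of the rule the commented-out check expects compaction to
enforce: data shadowed by a newer tombstone must not survive.

    #include <optional>
    #include <string>

    struct row { long timestamp; std::string value; };

    // What compaction should emit for one cell: nothing if a newer tombstone
    // covers it, otherwise the cell itself.
    std::optional<row> apply_tombstone(const row& r, std::optional<long> tombstone_ts) {
        if (tombstone_ts && *tombstone_ts >= r.timestamp) {
            return std::nullopt;   // shadowed by the deletion: drop it
        }
        return r;
    }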
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch adds the basic compaction function sstables::compact_sstables,
which takes a list of input sstables and creates several merged sstables
(currently just one). The implementation is pretty simple once we have all
the infrastructure in place (combining reader, writer, and a pipe between
them to reduce context switches).
This is already a working compaction, but not quite complete: we'll need
to add compaction strategies (which sstables to compact, and when), a
better cardinality estimator, sstable management and renaming, and a lot
of other details, and we'll probably still need to change the API.
But we can already write a test for compacting existing sstables (see
the next patch), and I wanted to get this patch out of the way so we can
start working on applying compaction in a real use case.
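To illustrate the core of the operation (a toy sketch, not the actual
sstables::compact_sstables signature or its full merge logic, which also has
to handle tombstones and expiry): a k-way merge of the inputs where, for each
key, the newest version wins.

    #include <map>
    #include <string>
    #include <vector>

    struct cell { long timestamp; std::string value; };
    using sstable_t = std::map<std::string, cell>;   // partitions sorted by key

    sstable_t merge_sstables(const std::vector<sstable_t>& inputs) {
        sstable_t out;
        for (const auto& sst : inputs) {
            for (const auto& [key, c] : sst) {
                auto it = out.find(key);
                if (it == out.end() || it->second.timestamp < c.timestamp) {
                    out[key] = c;                    // the newest timestamp wins
                }
            }
        }
        return out;
    }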
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
The sstable has a lot of data, but surprisingly, an accurate count of the
number of partitions isn't available. We can get a good estimate by looking
at the number of summary entries.
Based on Origin's IndexSummary.getEstimatedKeyCount().
We need this estimate for compaction as long as we can't (yet) get a better
estimate from the cardinality estimator algorithm.
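The idea behind the estimate, as a hedged one-liner (the exact expression in
Origin's IndexSummary.getEstimatedKeyCount() may differ, e.g. in rounding):
the summary samples roughly one key per index interval, so

    #include <cstdint>

    // partitions ~= summary entries * sampling interval
    uint64_t estimated_partition_count(uint64_t summary_entries, uint64_t index_interval) {
        return summary_entries * index_interval;
    }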
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Snitch class semantics are defined to be per-node. To make it so, we
introduce a static member in the i_endpoint_snitch class that holds
a pointer to the relevant snitch instance.
Since the snitch contents are not always purely const, it has to be per
shard, therefore we make it "distributed". All the I/O is going
to take place on a single shard, and if there are changes they are going
to be propagated to the rest of the shards.
The application is responsible for initializing this distributed<snitch>
before it is used for the first time.
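A hedged sketch of this per-shard pattern, with an illustrative my_snitch
class and current Seastar spellings (the real code lives in
locator::i_endpoint_snitch and uses distributed<snitch_ptr>): the application
starts the distributed<> object once, every shard reads its local instance,
and changes made on the I/O shard are propagated with invoke_on_all().

    #include <seastar/core/app-template.hh>
    #include <seastar/core/distributed.hh>
    #include <seastar/core/future.hh>
    #include <iostream>
    #include <string>

    class my_snitch {
        std::string _rack = "unknown";
    public:
        seastar::future<> stop() { return seastar::make_ready_future<>(); }
        void set_rack(std::string r) { _rack = std::move(r); }
        const std::string& rack() const { return _rack; }
    };

    seastar::distributed<my_snitch> snitch;   // one instance per shard

    int main(int argc, char** argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] {
            return snitch.start().then([] {
                // Propagate a change (e.g. a config re-read done on the I/O
                // shard) to every shard-local instance.
                return snitch.invoke_on_all([] (my_snitch& s) {
                    s.set_rack("rack1");
                });
            }).then([] {
                // Shard-local access, which is what code such as
                // create_replication_strategy() would use.
                std::cout << "rack on this shard: " << snitch.local().rack() << "\n";
            }).finally([] {
                return snitch.stop();   // analogue of i_endpoint_snitch::stop_snitch()
            });
        });
    }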
This patch effectively reverts most of the "locator: futurize
snitch creation" patch (a2594015f9) - the part that modified the
code that was creating the snitch instance. Since the snitch is
created explicitly by the application and all the rest of the code
simply assumes that the above global is initialized, we no longer need
all those changes, and the code goes back to being as nice and simple
as it was before that patch.
So, to summarize, this patch does the following:
- Reverts the changes introduced by a2594015f9 that made every creation of a
replication strategy also create a snitch and store it in the strategy object.
More specifically, methods like keyspace::create_replication_strategy() no
longer return a future<>, which allows the code that calls it to be
simplified significantly.
- Introduce the global distributed<snitch_ptr> object:
  - It belongs to the i_endpoint_snitch class.
  - A corresponding interface was added to access both the global and the
    shard-local instances.
- locator::abstract_replication_strategy::create_replication_strategy() no
longer accepts snitch_ptr&& - it gets the corresponding shard-local snitch
instance and passes it to the replication strategy's constructor by itself.
- Adjusted the existing snitch infrastructure to the new semantics:
  - Modified create_snitch() to create and start all per-shard snitch
    instances and update the global variable.
  - Introduced a static i_endpoint_snitch::stop_snitch() function that
    properly stops the global distributed snitch.
  - Added code to gossiping_property_file_snitch that distributes the
    changed data to all per-shard snitch objects.
  - Made all existing snitch classes properly maintain their state in order
    to be able to shut down cleanly.
- Patched both urchin and cql_query_test to initialize a snitch instance before
all other services.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
New in v6:
- Rebased to the current master.
- Extended the commit message a little (the summary).
New in v5:
- database::create_keyspace(): added a missing _keyspaces.emplace()
New in v4:
- Kept database::create_keyspace() returning future<> at Glauber's request,
and added a note to this method that it needs to be changed when Glauber
adds his bits that require this interface.
We left some columns in a FIXME state because we didn't have all the types
needed to represent them implemented. In particular, all collection types were
left behind.
Now that we do, let's refresh the system tables' schemas.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
"We have found a bug when reading an old sstable. Some versions of Cassandra
will not use start_range as a marker, but rather 0.
We need to account for that possibility."
Some versions of Origin will write 0 instead of -1 as the start-of-range marker
for a range tombstone. I've just come across one such table, which ended up
breaking our code. Let's be more flexible in what we accept. We don't really
have a choice.
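A hedged illustration of the added leniency (simplified names and constants,
not the actual sstable parsing code): 0 is accepted as an alternative to the
expected start-of-range marker.

    #include <cstdint>
    #include <stdexcept>

    enum class bound_kind { start, end };

    bound_kind classify_range_marker(int8_t eoc /* end-of-component byte */) {
        if (eoc == -1 || eoc == 0) {   // some Cassandra versions write 0 here
            return bound_kind::start;
        }
        if (eoc == 1) {
            return bound_kind::end;
        }
        throw std::runtime_error("malformed range tombstone marker");
    }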
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The current timeout is 100ms. cassandra-stress often fails for me
because of this, with a "Mutation write timeout" message.
The comment says that the timeout value is based on
DatabaseDescriptor.getWriteRpcTimeout(), which in Origin defaults to 2
seconds, so bump it up.
Code pointers:
DatabaseDescriptor:L844
    public static long getWriteRpcTimeout()
    {
        return conf.write_request_timeout_in_ms;
    }
Config:L74
    public volatile Long write_request_timeout_in_ms = 2000L;
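In other words, the change amounts to something like the following
(hypothetical constant name, shown only to make the before/after values
explicit):

    #include <chrono>

    // Was 100ms; bumped to match Origin's write_request_timeout_in_ms
    // default of 2000ms.
    constexpr std::chrono::milliseconds write_rpc_timeout{2000};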
"This series adds the storage_proxy API with a stab API implementation.
It covers the API defined in StorageProxyMBean, it does not contain the metrics
associate with the storage proxy that will be added in a different series."