"Fixes for commitlog (debug) test failures related to shutdowns.
Note that most of the fixes here are only really related to the tests
failing, not to actual scylla runs. However, at some point we'll
have real shutdown in scylla as well (not just hard exit), at which
point this becomes more relevant there as well.
Main issue was post-flush continuation chains for stats update
remaining unexecuted, due to task reordering, once the commitlog
object itself had been destroyed. This could have been handled by just
making the stats object a shared pointer, but in general it seems more
prudent to enforce having all tasks completed after shutdown.
* Change commitlog shutdown to use gate+wait for all outstanding ops
(flush, write, timer). Thus we can ensure everything is finished
when returning from "shutdown".
* Fix bug with "commitlog::clear" (test method) not doing the intended deed
* Most importantly, fix the tests themselves, cleaning up old crud, and
fixing invalid assumptions (CL behaviour changed quite a bit since tests
were created), and remove races.
Disclaimer: I've _never_ managed to reproduce the debug tests failing
like in jenkins locally (though I managed to provoke other failures),
but at least jenkins runs with this series have been clean. Knock knock."
* Do close + fsync on all segments
* Make sure all pending cycle/sync ops are guarded with a gate, and
explicitly wait for this gate on shutdown to make sure we don't
leave hanging flushes in the task queue.
* Fix bug where "commitlog::clear" did not in fact shut down the CL,
due to "_shutdown" being already set.
Note: This is (at least currently) not an issue for anything else than tests,
since we don't shutdown the normal server "properly", i.e. the CL itself
will not go away, and hanging tasks are ok, as long as the sync-all is done
(which it was previously). But, to make tests predictable, and future-proof
the CL, this is better.
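
The gate + wait idea can be sketched as below. This is a minimal,
thread-based stand-in written against the standard library only; the
class name simple_gate and its members are hypothetical and are not the
commitlog's actual code, which is future-based rather than blocking.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in for a gate: counts in-flight operations and lets
// close() block until every one of them has left.
class simple_gate {
    std::mutex _m;
    std::condition_variable _cv;
    unsigned _pending = 0;
    bool _closed = false;
public:
    // Register an operation (flush, write, timer). Refused once
    // shutdown has started.
    void enter() {
        std::lock_guard<std::mutex> lk(_m);
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        ++_pending;
    }
    // Mark one registered operation as finished.
    void leave() {
        std::lock_guard<std::mutex> lk(_m);
        if (--_pending == 0) {
            _cv.notify_all();
        }
    }
    // Stop accepting new operations and wait for the outstanding ones.
    void close() {
        std::unique_lock<std::mutex> lk(_m);
        _closed = true;
        _cv.wait(lk, [this] { return _pending == 0; });
    }
    unsigned pending() {
        std::lock_guard<std::mutex> lk(_m);
        return _pending;
    }
};
```

Every op calls enter() before it starts and leave() when its last
continuation has run; shutdown calls close(), so nothing can touch the
object after shutdown returns.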
This map will contain the (internal) IPs corresponding to specific Nodes.
The mapping is also stored in the system.peers table.
So, instead of always connecting to the external IP,
messaging_service::get_rpc_client() will query _preferred_ip_cache and
connect to the external IP only if there is no entry for the given Node.
We will call init_local_preferred_ip_cache() at the end of system table init.
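
The lookup-with-fallback described above can be sketched roughly as
follows. Addresses are modelled as plain strings and the struct layout
is an assumption made for illustration; only the fallback rule comes
from the description.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical model: endpoint addresses are plain strings here.
using address = std::string;

struct preferred_ip_cache {
    // external (broadcast) IP -> internal IP
    std::unordered_map<address, address> internal_by_external;

    // Return the cached internal IP for a node, or the external IP
    // itself when the cache has no entry for it.
    address get_preferred_ip(const address& external) const {
        auto it = internal_by_external.find(external);
        return it == internal_by_external.end() ? external : it->second;
    }
};
```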
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
New in v2:
- Improved the _preferred_ip_cache description.
- Code styling issues.
New in v3:
- Make get_internal_ip() public.
- get_rpc_client(): restore the get_preferred_ip() usage dropped
in v2 by mistake during rebase.
get_preferred_ips() returns all preferred_ip's stored in system.peers
table.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
New in v2:
- Get rid of extra std::move().
Fix for (mainly) test failures (use-after-free)
I.e. test case test_commitlog_delete_when_over_disk_limit causes a
use-after-free because the test shuts down before a pending flush is done,
and the segment manager is actually gone -> crash writing stats.
Now, we could make the stats a shared pointer, but we should never
allow an operation to outlive the segment_manager.
In normal op, we _almost_ guarantee this with the shutdown() call,
but technically, we could have a flush continuation trailing somewhere.
* Make sure we never delete segments from segment_manager until they are
fully flushed
* Make test disposal method "clear" be more defensive in flushing and
clearing out segments
When building the in-memory schema for a column family, we were
ignoring compaction strategy class because of a bug in the
existing code. Example: suppose that you create a column family
with leveled compaction strategy. This option would be ignored
and the default strategy (size-tiered) would be used instead.
Found this problem while working on leveled compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
In Cassandra, when you create a new column family, a directory for it
immediately appears under the KS directory.
In the past, we have made a decision to delay that creation until the first
SSTable is created, which works well in general.
There is a problem, however, for backup restoration: the standard procedure to
call loadNewSSTables is to do that in an empty directory. But the directory
simply won't be there until we create the first SSTable: bummer!
In the current incarnation of the code in schema_tables.cc, there is already
some code that runs on CPU0 only. That is a perfect place for the directory
creation. So let's do it.
After this patch, a directory for the CF appears right after the CF creation.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
xfs doesn't like writes beyond eof (exactly at eof is fine), and due
to continuation reordering, we sometimes do that.
Fix by pre-truncating the segment to its maximum size.
Re-check file size overflow after each cycle() call (new buffer),
otherwise we could write more, in the case we are storing a mutation
larger than the current buffer size (current pos + sizeof(mut) < max_size,
but after the cycle required by sizeof(mut) > buf_remain, the former might
not be true anymore).
"This series adds EC2Snitch.
Since both GossipingPropertyFileSnitch and the EC2SnitchXXX snitch family
use the same property file, it was logical to share the corresponding
code. Most of this series does just that... "
Currently, we are calculating truncated_at during truncate() independently for
each shard. It will work if we're lucky, but it is fairly easy to trigger cases
in which each shard will end up with a slightly different time.
The main problem here is that this time is used as the snapshot name when auto
snapshots are enabled. Prior to my last fixes, this would just generate two
separate directories in this case, which is wrong but not severe.
But after the fix, this means that both shards will wait for one another to
synchronize and this will hang the database.
Fix this by making sure that the truncation time is calculated before
invoke_on_all in all needed places.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This function returns the directory containing the configuration
files. It takes the environment variables into account as follows:
- If SCYLLA_CONF is defined, this is the directory
- else if SCYLLA_HOME is defined, then $SCYLLA_HOME/conf is the directory
- else "conf" is the directory, namely the configuration files should be
looked for in ./conf
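
A minimal sketch of that precedence, assuming a plain std::string
return type (the real function may well return a path type instead):

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Sketch only: resolves the configuration directory from the
// environment, in the order SCYLLA_CONF, $SCYLLA_HOME/conf, ./conf.
std::string get_conf_dir() {
    if (const char* conf = std::getenv("SCYLLA_CONF")) {
        return conf;
    }
    if (const char* home = std::getenv("SCYLLA_HOME")) {
        return std::string(home) + "/conf";
    }
    return "conf";
}
```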
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
New in v2:
- Updated get_conf_dir() description.
Since replay is a "node global" operation, we should not attempt to
do it in parallel on each shard. It will just overlap/interfere.
Could just run this on cpu 0, but since this _could_ be a lengthy
operation, each timer callback is round-robined across shards just in case...
Fixes #423
* CF ID now maps to a truncation record comprised of a set of
per-shard RP:s and a high-mark timestamp
* Retrieving RP:s is done in "bulk"
* Truncation time is calculated as max of all shards.
This version of the patch will accept "old" truncation data, though the
result of applying it will most likely not be correct (just one shard)
Record is still kept as a blob, "new" format is indicated by
record size.
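
As a rough illustration of the record shape described above (per-shard
RP:s plus a high-mark timestamp taken as the max over shards). All type
and field names below are hypothetical; the on-disk blob format is not
shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical types; field names are illustrative only.
struct replay_position {
    std::uint64_t segment_id;
    std::uint32_t offset;
};

struct shard_truncation {
    replay_position rp;
    std::int64_t truncated_at; // timestamp observed on that shard
};

struct truncation_record {
    std::vector<replay_position> positions; // one entry per shard
    std::int64_t high_mark;                 // max truncation time over shards
};

// Fold the per-shard results into one record.
truncation_record make_record(const std::vector<shard_truncation>& shards) {
    truncation_record rec{{}, std::numeric_limits<std::int64_t>::min()};
    for (const auto& s : shards) {
        rec.positions.push_back(s.rp);
        rec.high_mark = std::max(rec.high_mark, s.truncated_at);
    }
    return rec;
}
```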
Must ensure we still find a chunk/entry boundary even when run
with a start offset, since file navigation is chunk based.
Was not observed as broken previously because
1.) We did not run with offsets
2.) The exception never reached caller.
Also make the reader silently ignore empty files.
Almost the whole file is (accidentally) indented four spaces to the
right for no reason. Fix that up because it's annoying as hell.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
When we query schema keyspaces after we have applied a delete mutation,
the dropped keyspace does not exist in the "after" result set. Fix the
merge_keyspaces() algorithm to take that into account.
Makes merge_keyspaces() actually call database::drop_keyspace() when a
keyspace is dropped.
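
The "present before, absent after" rule can be sketched as a set
difference over names. This is an illustration under the assumption
that both result sets are reduced to sorted name sets; dropped_names()
is a hypothetical helper, not the actual merge_keyspaces() code.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Names that exist in the "before" result set but not in "after"
// are the ones that were dropped by the applied mutation.
std::vector<std::string> dropped_names(const std::set<std::string>& before,
                                       const std::set<std::string>& after) {
    std::vector<std::string> dropped;
    std::set_difference(before.begin(), before.end(),
                        after.begin(), after.end(),
                        std::back_inserter(dropped));
    return dropped;
}
```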
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
When we query schema tables after we have applied a delete mutation, the
dropped table does not exist in the "after" result set. Fix the
merge_tables() algorithm to take that into account.
Makes merge_tables() actually call database::drop_column_family() when
a table is dropped.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
Align with rest of file (for better or worse). This allows calls from
entity without query_processor handy (i.e. storage_proxy).
Added "minimal" setup method for the "global" state, to facilitate
tests. Doing a full setup either in cql_test_env or after it is created
breaks badly. (Not sure why). So quick workaround.
Updated the current two users (batchlog_manager and commitlog_replayer)
callsites to conform.
Refs #356
Pre-allocates N segments from timer task. N is "adaptive" in that it is
increased (to a max) every time segment acquisition is forced to allocate
a new one instead of picking from the pre-alloc (reserve) list. The idea is that it is
easier to adapt how many segments we consume per timer quanta than the timer
quanta itself.
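
A small sketch of the adaptive N, assuming it grows by one per miss
(the step size is an assumption, as are the class and method names):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical policy object tracking how many reserve segments the
// timer task should keep ready.
class reserve_policy {
    std::size_t _target;
    std::size_t _max;
public:
    reserve_policy(std::size_t initial, std::size_t max)
        : _target(initial), _max(max) {}

    // Called when segment acquisition found the reserve list empty and
    // had to allocate a new segment in the alloc path.
    void on_reserve_miss() {
        if (_target < _max) {
            ++_target;
        }
    }
    // How many segments the timer task should pre-allocate this round.
    std::size_t to_allocate(std::size_t in_reserve) const {
        return in_reserve < _target ? _target - in_reserve : 0;
    }
    std::size_t target() const { return _target; }
};
```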
Also does the disk pressure check and flush from the timer task now. Note that
the check is still done at most once per new segment.
Some logging cleanup/betterment also to make behaviour easier to trace.
Reserve segments start out at zero length, and are still deleted when finished.
This is because otherwise we'd still have to clear the file to be able to
properly parse it later (given that it can be a "half" file due to power fail
etc). This might need revisiting as well.
With this patch, there should be no case (except flush starvation) where
"add_mutation" actually waits for a (potentially) blocking op (disk).
Note that since the amount of reserve is increased as needed, there will
be occasional cases where a new segment is created in the alloc path
until the system finds equilibrium. But this should only be during a brief
warmup.
v2: Fixed timestamp not being reset on reserve acquire
map_reduce() can run the reducer out-of-order which breaks the MD5 hash.
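
To illustrate why reducer order matters for a digest: feeding the same
parts in a different order yields a different hash. FNV-1a is used
below purely as a small, self-contained stand-in for MD5; digest() is a
hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// FNV-1a over the concatenation of all parts. Like MD5, the result
// depends on the order in which the parts are fed in.
std::uint64_t digest(const std::vector<std::string>& parts) {
    std::uint64_t h = 14695981039346656037ULL; // FNV-1a 64-bit offset basis
    for (const auto& p : parts) {
        for (unsigned char c : p) {
            h ^= c;
            h *= 1099511628211ULL; // FNV-1a 64-bit prime
        }
    }
    return h;
}
```

So the per-shard results have to be fed to the hash in a fixed order,
not in completion order.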
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
Fixes #357. [tgrabiec]