It's been linked with various performance issues, either by causing
them or making them worse. One example is #1634, and also recently
I have investigated continuous performance degradation that was also
linked to defrag on idle activity.
Until we can figure out how to reduce its impact, we should disable it.
Signed-off-by: Glauber Costa <glauber@glauber.scylladb>
Message-Id: <20170627201109.10775-1-glauber@scylladb.com>
(cherry picked from commit f3742d1e38)
So that they are not left on disk even though we did a clean shutdown.
First part of the fix is to ensure that closed segments are recognized
as not allocating (_closed flag). Not doing this prevents them from
being collected by discard_unused_segments(). Second part is to
actually call discard_unused_segments() on shutdown after all segments
were shut down, so that those whose position are cleared can be
removed.
Fixes#2550.
Message-Id: <1499358825-17855-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 6555a2f50b)
This patch is based on 6c8b5fc. It moves the check whether a dropped
type is still used by other types or tables from schema_tables to
the drop_type_statement, as delaying this check to after applying the
mutations can leave the keyspace in a broken state.
Fixes#2490
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1497466736-28841-1-git-send-email-duarte@scylladb.com>
It is safe to copy column_mapping accros shards. Such guarantee comes at
the cost of performance.
This patch makes commitlog_entry_writer use IDL generated writer to
serialise commitlog_entry so that column_mapping is not copied. This
also simplifies commitlog_entry itself.
Performance difference tested with:
perf_simple_query -c4 --write --duration 60
(medians)
before after diff
write 79434.35 89247.54 +12.3%
(cherry picked from commit 374c8a56ac)
Also: Fixes#2468.
The previous fix removed the additional insertion of "min rp" per source
shard based on whether we had processed existing CF:s or not (i.e. if
a CF does not exist as sstable at all, we must tag it as zero-rp, and
make whole shard for it start at same zero.
This is bad in itself, because it can cause data loss. It does not cause
crashing however. But it did uncover another, old old lingering bug,
namely the commitlog reader initiating its stream wrongly when reading
from an actual offset (i.e. not processing the whole file).
We opened the file stream from the file offset, then tried
to read the file header and magic number from there -> boom, error.
Also, rp-to-file mapping was potentially suboptimal due to using
bucket iterator instead of actual range.
I.e. three fixes:
* Reinstate min position guarding for unencoutered CF:s
* Fix stream creating in CL reader
* Fix segment map iterator use.
v2:
* Fix typo
Message-Id: <1490611637-12220-1-git-send-email-calle@scylladb.com>
(cherry picked from commit b12b65db92)
Fixes#2173
Per-shard min positions can be unset if we never collected any
sstable/truncation info for it, yet replay segments of that id.
Wrap the lookups to handle "missing data -> default", which should have been
there in the first place.
Message-Id: <1490185101-12482-1-git-send-email-calle@scylladb.com>
(cherry picked from commit c3a510a08d)
Fixes#2098
Replay previously did all segments in parallel on shard 0, which
caused heavy memory load. To reduce this and spread footprint
across shards, instead do X segments per shard, sequential per shard.
v2:
* Fixed whitespace errors
Message-Id: <1489503382-830-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 078589c508)
"If a node is bootstrapped with auto_boostrap disabled, it will not
wait for schema sync before creating global keyspaces for auth and
tracing. When such schema changes are then reconciled with schema on
other nodes, they may overwrite changes made by the user before the
node was started, because they will have higher timestamp.
To prevent that, let's use minimum timestamp so that default schema
always looses with manual modifications. This is what Cassandra does.
Fixes #2129."
* tag 'tgrabiec/prevent-keyspace-metadata-loss-v1' of github.com:scylladb/seastar-dev:
db: Create default auth and tracing keyspaces using lowest timestamp
migration_manager: Append actual keyspace mutations with schema notifications
(cherry picked from commit 6db6d25f66)
* seastar 397685c...c1dbd89 (13):
> lowres_clock: drop cache-line alignment for _timer
> net/packet: add missing include
> Merge "Adding histogram and description support" from Amnon
> reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&'
> Set the option '--server' of tests/tcp_sctp_client to be required
> core/memory: Remove superfluous assignment
> core/memory: Remove dead code
> core/reactor: Use logger instead of cerr
> fix inverted logic in overprovision parameter
> rpc: fix timeout checking condition
> rpc: use lowres_clock instead of high resolution one
> semaphore: make semaphore's clock configurable
> rpc: detect timedout outgoing packets earlier
Includes treewide change to accomodate rpc changing its timeout clock
to lowres_clock.
Includes fixup from Amnon:
collectd api should use the metrics getters
As part of a preperation of the change in the metrics layer, this change
the way the collectd api uses the metrics value to use the getters
instead of calling the member directly.
This will be important when the internal implementation will changed
from union to variant.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>
The per-node limit will be total memory divided by number of shards
instead of just total memory. For example, when Scylla is started with
-c16 -m16G, the commit log will induce flushes on given shard when
unflushed data exceeds on that shard 62MB instead of 1GB.
Fixes#2046.
Message-Id: <1485874534-10939-1-git-send-email-tgrabiec@scylladb.com>
It still has problems:
- while resharding a very large leveled compaction strategy table, a huge
amount of tiny sstables are generated, overwhelming the file descriptor
limits
- there is a large impact on read latency while resharding is going on
(cherry picked from commit cf27d44412)
(forward-ported from branch-1.6)
Commit f0c28e1 ("db/schema_tables: Add schema_functions and
schema_aggregates tables") forgot to add the newly added tables to the
db::schema_tables::ALL list, which is used for authorization checks, for
example.
Fixes the following auth_test.py dtest failures:
('Unable to connect to any servers', {'127.0.0.1': Unauthorized('Error from server: code=2100 [Unauthorized] message="User cathy has no SELECT permission on <table system.schema_functions> or any of its parents"',)})
Message-Id: <1484045277-4997-1-git-send-email-penberg@scylladb.com>
This reverts commit b81a57e8eb.
With exponential range scanning, we should now be able to survive
msb ignore bits of 12, which allows better sharding on large clusters.
This patch adds an utility function that creates a raw select
statement from a set of columns and a where clause. It is intended to
be used to create the prepared select statement used by the view
class.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch builds the mutations to announce a new view. Aside from
including the view schema, we include the base table mutations so
that a node is resilient against receiving create view mutations
before the base table create mutations.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch allows a view schema to be frozen. To unfreeze such a
schema, we add an is_view attribute to the schema idl.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch replaces the add_table_to_schema_mutation() function with
add_table_or_view_to_schema_mutation().
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts update_column_family from schema_tables into
database so it can be used when adding materialized views, in future
patches.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch moves some duplicate code into the
add_column_family_and_create_directory() function. It also saves some
superfluous keyspace lookups and readies the code to be used by
materialized views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes read_table_mutations() so that it can now
read schemas from other tables besides the column families
schema table.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the view class, which will contains functions related
to populating a view, either from the base table's write path or from
the view building mechanism which copies over already existing data in
the base table.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"The current criteria for memtable flush is not being respected. The
problem is demonstrated to happen when the dirty memory group is over
limit, and so is the system table extra allowance. In that situation,
both the normal region and the system table region will be under
pressure and try to flush.
More specifically, because the normal region inherits from the system
region, if the normal region is under pressure (over the soft limit
threshold), the system region will certainly be as well, even though it
has an extra allowance. This is because after virtual dirty, we start
blocking when we reach half the region, but memory itself can grow up to
100 % of the region. So the total amount of memory used will be
certainly bigger than the system pressure threshold, which is now 50 %
plus the allowance.
To fix that, this patch reworks the flush logic so that the regions are
not dependent on each other.
Fixes#1918"
* 'flush-criteria-v6' of github.com:glommer/scylla:
config: get rid of memtable_total_space
database: rework dirty memory hierarchy
system keyspace: write batchlog mutation in user memory
database: remove flush_token
database: abstract pressure condition notification
database: encapsulate semaphore_units into a flush_permit
database: remove friendship declaration
database: simplify flush_one
database: make memtable_list aware in cases it can't flush
Batchlog is a potentially memory-intensive table whose workload is
driven by user needs, not system's. Move it to the user dirty memory
manager.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
column_mapping is not safe to access across shards, because data_type
is not safe to access. One of the manifestation of this is that
abstract_type::is_value_compatible_with() always fails if the two
types belong to different shards.
During replay, column_mapping lives on the replaying shard, and is
used by converting_mutation_partition_applier against the schema on
the target shard. Since types in the mapping will be considered
incompatible with types in the schema, all cells will be dropped.
Fix by using column_mapping in a safe way, by copying it to the target
shard if necessary. Each shard maintains its own cache of column
mappings.
Fixes#1924.
Message-Id: <1481310463-13868-1-git-send-email-tgrabiec@scylladb.com>
The semaphore future may be unavailable for many reasons. Specifically,
if the task quota is depleted right between sem.wait() and the .then()
clause in get_units() the resulting future won't be available.
That is particularly visible if we decrease the task quota, since those
events will be more frequent: we can in those cases clearly see this
counter going up, even though there aren't more requests pending than
usual.
This patch improves the situation by replacing that check. We now verify
whether or not there are waiters in the semaphore.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <113c0d6b43cd6653ce972541baf6920e5765546b.1481222621.git.glauber@scylladb.com>
The problem is that replay will unlink any segments which were on disk
at the time the replay starts. However, some of those segments may
have been created by current node since the boot. If a segment is part
of reserve for example, it will be unlinked by replay, but we will
still use that segment to log mutations. Those mutations will not be
visible to replay after a crash though.
The fix is to record preexisting segents before any new segments will
have a chance to be created and use that as the replay list.
Introduced in abe7358767.
dtest failure:
commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup
Message-Id: <1481117436-6243-1-git-send-email-tgrabiec@scylladb.com>
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, specially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.
The commitlog is supposed to issue replay positions in order, and it
coudl be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.
That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished. The
following sequence of events with its deferring points will ilustrate
it:
on_timer:
return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {
At this point, the segment id is already allocated.
new_segment():
if (_reserve_segments.empty()) {
[ ... ]
return allocate_segment(true).then ...
At this point, we have a new segment that has an id that is higher than
the previous id allocated.
Then we resume the execution from the deferring point in on_timer():
i = _reserve_segments.emplace(i, std::move(s));
The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.
Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.
This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).
Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <49eb7edfcafaef7f1fdceb270639a9a8b50cfce7.1480531446.git.glauber@scylladb.com>