The name "column_family" is both awkward and obsolete. Rename to
the modern and accurate "table".
An alias is kept to avoid huge code churn.
To prevent a One Definition Rule violation, a preexisting "table"
type is moved to a new namespace row_cache_stress_test.
Tests: unit (release)
Message-Id: <20180624065238.26481-1-avi@scylladb.com>
This patchset brings support for writing range tombstones to SSTables
3.x ('mc' format).
In SSTables 3.x, range tombstones are represented by so-called range
tombstone markers (hereafter RT markers) that denote range tombstone
start and end bounds. So each range tombstone is represented in the data
file by two ordered RT markers.
There are also markers that both close the previous range tombstone and
open the new one when two range tombstones are adjacent. This is
done to consume less disk space in such cases.
Range tombstones written as RT markers are naturally non-overlapping.
* github.com:argenet/scylla projects/sstables-30/write-range-tombstones/v6
range_tombstone_stream: Remove an unused boolean flag.
Revert "Add missing enum values to bound_kind."
sstables: Move to_deletion_time helper up and make it static.
sstables: Write end-of-partition byte before flushing the last index
block.
sstables: Add support for writing range tombstones in SSTables 3.x
format.
tests: Add unit test covering simple range tombstone.
tests: Add unit test covering adjacent range tombstones.
tests: Add test to cover non-adjacent RTs.
tests: Add test covering mixed rows and range tombstones.
tests: Add test covering SSTables 3.x with many RTs.
tests: Add unit test covering overlapping RTs and rows.
tests: Add tests writing a range tombstone and a row overlapping with
its start.
tests: Add tests writing a range tombstone and a row overlapping with
its end.
tests: Add function that writes from multiple memtables into SSTables.
tests: Add test where 2nd range tombstone covers the remainder of the
1st one.
tests: Add test writing two non-adjacent range tombstones with same
clustering key prefix at their bounds.
tests: Add test covering overlapped range tombstones.
"
This series addresses issue #3516 and makes the space watchdog
device-aware. This is needed because, since the last MV-related changes,
the space watchdog can be responsible for multiple hints managers, which
means multiple directories, which may mean multiple devices.
Hence, a single static space size limit is no longer enough: the
watchdog should take into account that different managers may work on
different disks, while other managers may share the same device.
Tests: unit (release)
"
* 'enhance_space_watchdog_4' of https://github.com/psarna/scylla:
hints: reserve more space for dedicated storage
hints: add is_mountpoint function
hints: make space_watchdog device-aware
hints: add device_id to manager
hints: add get_device_id function
Reserving 10% of space for hints managers makes sense if the device
is shared with other components (like /data or /commitlog).
But if the hints directory is mounted on dedicated storage, it makes
sense to reserve much more - 90% was chosen as a sane limit.
Whether storage is 'dedicated' or not is decided by a simple check:
is the given hints directory a mount point?
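
The check described above can be sketched as follows. This is a minimal
illustration, not the code from the patch: the body of is_mountpoint() here
uses the common POSIX approach of comparing st_dev of a directory with that
of its parent, which is an assumption, and stat_or_throw() and
hints_space_fraction() are hypothetical helper names.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <system_error>
#include <sys/stat.h>

// Hypothetical helper: stat() a path, throwing on failure.
static struct stat stat_or_throw(const std::string& path) {
    struct stat st{};
    if (::stat(path.c_str(), &st) != 0) {
        throw std::system_error(errno, std::generic_category(), "stat " + path);
    }
    return st;
}

// Assumed implementation: a directory is a mount point if it lives on a
// different device than its parent, or if it is its own parent (the
// filesystem root).
bool is_mountpoint(const std::string& dir) {
    const struct stat self = stat_or_throw(dir);
    const struct stat parent = stat_or_throw(dir + "/..");
    return self.st_dev != parent.st_dev || self.st_ino == parent.st_ino;
}

// Fraction of the device reserved for hints: 90% on dedicated storage,
// 10% when the device is shared with other components.
double hints_space_fraction(bool dedicated) {
    return dedicated ? 0.9 : 0.1;
}
```

A caller would compute the limit as something like
hints_space_fraction(is_mountpoint(hints_dir)) multiplied by the device
capacity.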
Fixes #3516
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Instead of having one static space limit for all directories,
space_watchdog now keeps a per-device limit, shared among
hints managers residing on the same disks.
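
As a rough sketch of the per-device idea, directories can be grouped by the
st_dev value that stat() reports for them, so that one limit is tracked per
group. The get_device_id() body below is an assumed implementation built on
POSIX stat(), and group_by_device() is a hypothetical helper that does not
come from the patch.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <string>
#include <system_error>
#include <vector>
#include <sys/stat.h>
#include <sys/types.h>

// Assumed implementation: the device id is stat.st_dev of the directory.
dev_t get_device_id(const std::string& path) {
    struct stat st{};
    if (::stat(path.c_str(), &st) != 0) {
        throw std::system_error(errno, std::generic_category(), "stat " + path);
    }
    return st.st_dev;
}

// Hypothetical helper: group hints directories by the device they reside
// on, so that a single space limit can be shared by each group.
std::map<dev_t, std::vector<std::string>>
group_by_device(const std::vector<std::string>& dirs) {
    std::map<dev_t, std::vector<std::string>> groups;
    for (const auto& dir : dirs) {
        groups[get_device_id(dir)].push_back(dir);
    }
    return groups;
}
```

Directories that land in the same group would then be charged against the
same limit, while each distinct device gets its own.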
References #3516
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
In order to make space_watchdog device-aware, a device_id field
is added to the hints manager. It is the equivalent of stat.st_dev
and identifies the disk that contains the manager's root directory.
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
In order to distinguish which directories reside on which devices,
get_device_id function is added to resource manager.
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Very often people use the issue tracker to just ask questions. We have
been telling them to close the bug and move the discussion somewhere
else but it would be better if people were already directed to the right
place before they even get it wrong.
This would be easier for everybody.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180621135051.3254-1-glauber@scylladb.com>
This comes in handy when we want to test overlapping range tombstones
because memtable would otherwise de-overlap them internally.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Tests three cases:
- a row lying inside a range tombstone
- a row that has the same clustering key as range tombstone start
- a row that has the same clustering key as range tombstone end
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
These are two RTs where the end bound of one has the same clustering
key as the start bound of the other, but both bounds are exclusive.
In this case those bounds should not (and cannot) be merged into a
single RT boundary when writing RT markers.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
For SSTables 3.x ('mc' format), range tombstones are represented by
their bounds that are written to the data file as so-called RT markers.
For adjacent range tombstones, an RT marker can be of a 'boundary' type
which means it closes the previous range tombstone and opens the new
one.
Internally, sstable_writer_m relies on range_tombstone_stream to both
de-overlap incoming range tombstones and order them so that when they
are drained they can be easily thought of as just pairs of their bounds.
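
The marker layout described above can be sketched as follows. This is an
illustration only: the types are hypothetical stand-ins rather than the
classes used by sstable_writer_m, clustering positions are reduced to
integers, and markers do not record bound inclusiveness. The input is
assumed to be already ordered and de-overlapped. Two tombstones are treated
as adjacent when they touch at the same position and exactly one of the
touching bounds is inclusive, so two exclusive bounds at the same key are
never merged.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified types for illustration; not Scylla's classes.
struct range_tombstone {
    int start;             // clustering position of the start bound
    bool start_inclusive;
    int end;               // clustering position of the end bound
    bool end_inclusive;
    long timestamp;
};

enum class marker_kind { start, end, boundary };

struct rt_marker {
    marker_kind kind;
    int position;
    long closed_timestamp;  // set for end and boundary markers, else 0
    long opened_timestamp;  // set for start and boundary markers, else 0
};

// Tombstones share a boundary marker only if they touch at the same
// position and exactly one of the touching bounds is inclusive.
inline bool adjacent(const range_tombstone& a, const range_tombstone& b) {
    return a.end == b.start && a.end_inclusive != b.start_inclusive;
}

// Input must be sorted by start bound and must not overlap.
std::vector<rt_marker> to_markers(const std::vector<range_tombstone>& rts) {
    std::vector<rt_marker> out;
    bool opened_by_boundary = false;
    for (std::size_t i = 0; i < rts.size(); ++i) {
        const range_tombstone& rt = rts[i];
        assert(rt.start <= rt.end);
        if (!opened_by_boundary) {
            out.push_back({marker_kind::start, rt.start, 0, rt.timestamp});
        }
        if (i + 1 < rts.size() && adjacent(rt, rts[i + 1])) {
            // One marker closes this tombstone and opens the next one.
            out.push_back({marker_kind::boundary, rt.end, rt.timestamp,
                           rts[i + 1].timestamp});
            opened_by_boundary = true;
        } else {
            out.push_back({marker_kind::end, rt.end, rt.timestamp, 0});
            opened_by_boundary = false;
        }
    }
    return out;
}
```

With this sketch, two adjacent tombstones yield three markers (start,
boundary, end) instead of four, which is where the disk space saving comes
from.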
By default, Scylla in Docker runs without security features.
This patch lets the user supply different values for the authenticator
and authorizer classes, which makes it possible to set up a secure Scylla
in Docker.
For example, to run a secure Scylla with password authentication and
authorization:
    docker run --name some-scylla -d scylladb/scylla --authenticator PasswordAuthenticator --authorizer CassandraAuthorizer
Update the Docker documentation with the new command line options.
Signed-off-by: Noam Hasson <noam@scylladb.com>
Message-Id: <20180620122340.30394-1-noam@scylladb.com>
The current .bash_profile prints "Constructing RAID volume..." while
scylla_ami_setup is still running, even when it is running on an
unsupported instance type.
To avoid that, run the instance type check first, then run the rest of
the script.
Fixes #2739
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <20180613111539.30517-1-syuu@scylladb.com>
"
Make sure we properly handle row marker and row tombstone
when reading a row.
Tests: unit (release)
"
* 'haaawk/sstables3/read-liveness-info-v4' of ssh://github.com/scylladb/seastar-dev:
sstable: consume row marker in data_consume_rows_context_m
sstable: Add consumer_m::consume_row_marker_and_tombstone
sstable: add is_set and to_row_marker to liveness_info
* https://github.com/vladzcloudius/scylla.git tracing_prepared_parameters-v6:
cql3::query_options: add get_names() method
tracing::trace_state: hide the internals of params_values
tracing: store queries statements for BATCH
tracing: store the prepared statements parameters values
"
A few fixes for script problems that were found when debugging #3508.
This series fixes that issue.
"
Fixes #3508
* 'ami_scripts_fixes-v1' of https://github.com/vladzcloudius/scylla:
scylla_io_setup: properly define the disk_properties YAML hierarchy
scylla_io_setup: fix a typo: s/write_bandwdith/write_bandwidth/
scylla_io_setup: hardcode the "mountpoint" YAML node to "/var/lib/scylla" for AMIs
scylla_io_setup: print the io_properties.yaml file name and not its handle info
scylla_lib.sh: tolerate perftune.py errors
"
We are seeing some workloads with large datasets where the compaction
controller ends up with a lot of shares. Regardless of whether or not
we'll change the algorithm, this patchset handles a more basic issue,
which is the fact that the current controller doesn't set a maximum
explicitly, so if the input is larger than the maximum it will keep
growing without bounds.
It also pushes the maximum input point of the compaction controller from
10 to 30, allowing us to err on the side of caution for the 2.2 release.
"
* 'tame-controller' of github.com:glommer/scylla:
controller: do not increase shares of controllers for inputs higher than the maximum
controller: adjust constants for compaction controller
"
This mini series fixes some querier-cache related issues discovered
while working on stateful range-scans.
1) A problem in the memory based cache eviction test that is as yet
unexposed (#3529).
2) Possible usage of invalidated iterators in querier_cache (#3424).
3) lookup() possibly returning a querier with the wrong read range
(#3530).
Tests: unit(release)
"
* 'fix-querier-cache-invalid-iterators-master' of https://github.com/denesb/scylla:
querier: find_querier(): return end() when no querier matches the range
querier_cache: restructure entries storage
tests/querier_cache: fix memory based eviction test
batch_statement::verify_batch_size() verifies that the total size of
mutations generated by the batch statement is smaller than certain
configurable thresholds. This is done by a custom mutation_partition
visitor, which violates atomic_cell_view::value() preconditions by
calling it even for dead cells.
The simplest solution is to use
mutation_partition::external_memory_usage() instead.
Message-Id: <20180619131405.12601-1-pdziepak@scylladb.com>
When dropping a table, wait for the column family to quiesce so that
no pending writes compete with the truncate operation, possibly
allowing data to be left on disk.
Fixes #2562
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180618193134.31971-1-duarte@scylladb.com>
Patch f39891a999 fixed #3443,
but also introduced a regression in dtest - a new column
was unconditionally added to the view during ALTER TABLE ADD,
while this should only happen for "include all columns" views.
This patch fixes the regression (spotted by query_new_column_test).
References #3443
Message-Id: <7410d965255a514d78cf0ce941a3236b9d8ddbbd.1529399135.git.sarna@scylladb.com>
When none of the queriers found for the lookup key match the lookup
range, `_entries.end()` should be returned as the search failed. Instead
the iterator returned from the failed `std::find_if()` is returned
which, if the find failed, will be the end iterator returned by the
previous call to `_entries.equal_range()`. This is incorrect because as
long as `equal_range()`'s end iterator is not also `_entries.end()` the
search will always return an iterator to a querier regardless of whether
any of them actually matches the read range.
Fix by returning `_entries.end()` when it is detected that no queriers
match the range.
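
A minimal sketch of the bug and the fix, using a plain
`std::unordered_multimap` and a hypothetical `querier` struct with an integer
range in place of the real querier_cache types:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a cached querier and its read range.
struct querier {
    int range_begin;
    int range_end;
};

using entries = std::unordered_multimap<std::string, querier>;

entries::iterator find_querier(entries& e, const std::string& key,
                               int range_begin, int range_end) {
    auto range = e.equal_range(key);
    auto it = std::find_if(range.first, range.second,
        [&](const entries::value_type& kv) {
            return kv.second.range_begin == range_begin
                && kv.second.range_end == range_end;
        });
    // A failed find_if() returns range.second, which may point at an
    // unrelated entry. Translate it to e.end() so that callers comparing
    // against end() see the failure.
    return it == range.second ? e.end() : it;
}
```

Returning `it` unconditionally would reproduce the bug: callers checking
against `e.end()` could be handed an entry stored under a different key.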
Fixes: #3530
Currently querier_cache uses a `std::unordered_map<utils::UUID, querier>`
to store cache entries and an `std::list<meta_entry>` to store meta
information about the querier entries, like insertion order, expiry
time, etc.
All cache eviction algorithms use the meta-entry list to evict entries
in reverse insertion order (LRU order). To make this possible
meta-entries keep an iterator into the entry map so that given a
meta-entry one can easily erase the querier entry. This however poses a
problem as std::unordered_map can possibly invalidate all its iterators
when new items are inserted. This is a use-after-free waiting to happen.
Another disadvantage of the current solution is that it requires the
meta-entry to use a weak pointer to the querier entry so that in case
that is removed (as a result of a successful lookup) it doesn't try to
access it. This has an impact on all cache eviction algorithms as they
have to be prepared to deal with stale meta-entries. Stale meta-entries
also unnecessarily consume memory.
To solve these problems redesign how querier_cache stores entries
completely. Instead of storing the entries in an `std::unordered_map`
and storing the meta-entries in an `std::list`, store the entries in an
`std::list` and an intrusive-map (index) for lookups. This new design
has several advantages over the old one:
* The entries will now be in insert order, so eviction strategies can
work on the entry list itself, no need to involve additional data
structures for this.
* All data related to an entry is stored in one place, no data
duplication.
* Removing an entry automatically removes it from the index as intrusive
containers support auto unlink. This means there is no need to store
iterators for long terms, risking use-after-free when the container
invalidates its iterators.
Additional changes:
* Modify eviction strategies so that they work with the `entry`
interface rather than the stored value directly.
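
The shape of the new layout can be approximated with standard containers
only. Note that this sketch swaps the intrusive map for a `std::multimap` of
list iterators, so it has to erase index entries by hand and does not get
the auto-unlink behaviour the patch relies on. It is safe only because
`std::list` and `std::multimap` do not invalidate iterators to other
elements on insertion or erasure. All names are hypothetical and the querier
is reduced to an `int` payload.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <map>
#include <string>
#include <utility>

class querier_cache_sketch {
    struct entry {
        std::string key;
        int payload;  // placeholder for the querier and its metadata
    };
    using entry_list = std::list<entry>;

    entry_list _entries;                                    // insertion order
    std::multimap<std::string, entry_list::iterator> _index;  // lookup by key

    void erase_from_index(entry_list::iterator target) {
        auto range = _index.equal_range(target->key);
        for (auto it = range.first; it != range.second; ++it) {
            if (it->second == target) {
                _index.erase(it);
                return;
            }
        }
    }

public:
    void insert(std::string key, int payload) {
        _entries.push_back(entry{key, payload});
        _index.emplace(std::move(key), std::prev(_entries.end()));
    }

    // A successful lookup removes the entry from the cache.
    bool lookup(const std::string& key, int& out) {
        auto it = _index.find(key);
        if (it == _index.end()) {
            return false;
        }
        out = it->second->payload;
        _entries.erase(it->second);
        _index.erase(it);
        return true;
    }

    // Eviction works on the entry list itself: drop the oldest entry.
    bool evict_oldest() {
        if (_entries.empty()) {
            return false;
        }
        erase_from_index(_entries.begin());
        _entries.pop_front();
        return true;
    }

    std::size_t size() const { return _entries.size(); }
};
```

Because every piece of data about an entry lives in the list node, eviction
never meets a stale record; the index only ever points at live entries.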
Ref #3424
Increment the key counter after inserting the first querier into the
cache. Otherwise two queriers with the same key will be inserted and
the test will fail. This problem is exposed by the changes the next
patches make to the querier-cache, but it is fixed first to maintain
bisectability of the code.
Fixes: #3529
This is to stay compliant with the Origin for SSTables 3.x.
It differs from SSTables 2.x (ka/la) as for those the last promoted
index block is pushed first and the end-of-partition byte is written
after.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Right now there is no limit to how much the shares of the controllers
can grow. That is not a big problem for the memtable flush controller,
since it has a natural maximum in the dirty limit.
But the compaction controller, the way it's written today, can grow
forever and end up with a very large value for shares. We'll cap that at
adjust() time by not allowing shares to grow indefinitely.
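
The capping can be pictured with a small piecewise-linear controller. This is
a sketch under assumptions: control_point and shares_for() are hypothetical
names, and the share values used in the usage note are made up; only the
idea of not growing past the last control point comes from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical control point: an input level mapped to a share count.
struct control_point {
    float input;
    float output;
};

// Piecewise-linear interpolation between control points sorted by input.
// Inputs beyond the last point get the last point's output, so shares stop
// growing once the maximum input is reached.
float shares_for(const std::vector<control_point>& points, float input) {
    assert(!points.empty());
    if (input <= points.front().input) {
        return points.front().output;
    }
    if (input >= points.back().input) {
        return points.back().output;  // the cap
    }
    for (std::size_t i = 1; i < points.size(); ++i) {
        if (input <= points[i].input) {
            const control_point& lo = points[i - 1];
            const control_point& hi = points[i];
            const float t = (input - lo.input) / (hi.input - lo.input);
            return lo.output + t * (hi.output - lo.output);
        }
    }
    return points.back().output;
}
```

For example, with the illustrative points {0, 0} and {30, 1000}, an input of
15 yields 500 shares, and any input of 30 or more yields 1000.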
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now the controller adjusts its shares based on how big the backlog
is in comparison to shard memory. We have seen in some tests that if the
dataset becomes too big, this may cause compactions to dominate.
While we may change the input altogether in future versions, I'd like to
propose a quick change for the time being: move the high point from 10x
memory size to 30x memory size. This will cause compactions to increase
in shares more slowly.
While 30 is as magic as the 10 before it, it will allow us to err on
the side of caution, with compactions not becoming aggressive enough to
overly disrupt workloads.
Signed-off-by: Glauber Costa <glauber@scylladb.com>