Paxos may leave an operation running in the background after returning the
result to the caller. Let's add counters for background/foreground paxos
handlers so that it is easier to detect memory-related issues.
Message-Id: <20200510092942.GA24506@scylladb.com>
"
A good portion of the values that one would want to examine with
scylla-tools will be partition or clustering keys. While examining them
was possible before too, especially for single component keys, it
required manually extracting the components from the key so that they
could be examined individually.
This series adds support for working with keys directly, by adding
prefixable and full compound type support.
When passing --prefix-compound or --full-compound, multiple types can be
passed, which will form the compound type.
Example:
$ scylla_types --print --prefix-compound -t TimeUUIDType -t Int32Type 0010d00819896f6b11ea00000000001c571b000400000010
(d0081989-6f6b-11ea-0000-0000001c571b, 16)
Another feature added in this series is validation. For this,
`compound_type::validate()` had to be implemented first. We already call
it in our code, but it currently has a no-op body.
Example:
$ scylla-types --validate --full-compound -t TimeUUIDType -t Int32Type 0010d00819896f6b11ea00000000001c571b0004000000
0010d00819896f6b11ea00000000001c571b0004000000: INVALID - seastar::internal::backtraced<marshal_exception> (marshaling error: compound_type iterator - not enough bytes, expected 4, got 3 Backtrace: 0x1b2e30f
0x85c9d5
0x85cb07
0x85cc7b
0x85cd7c
0x85d2d7
0x844e03
0x84241b
0x84490b
0x844ae5
0x19c0362
0x19c0741
0x19c13d1
0x19c4b44
0x8aeb7a
0x8aeca7
0x19ebc90
0x19fb8d5
0x1a12b49
0x19c4376
0x19c47a6
0x19c4900
0x843373
/lib64/libc.so.6+0x271a2
0x84202d
)
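The two examples above suggest that a serialized compound value is a sequence of components, each preceded by a 2-byte big-endian length. The sketch below decodes and validates under that assumption; the layout is inferred from the example bytes only, and `split_compound` and `decode_timeuuid_int32` are hypothetical names, not part of the tool.

```python
import struct
import uuid


def split_compound(raw):
    """Split a serialized compound value into its raw components.

    Assumed layout (inferred from the examples): [u16 length][bytes] repeated.
    Raises ValueError when a component is shorter than its declared length.
    """
    components = []
    pos = 0
    while pos < len(raw):
        remaining = len(raw) - pos
        if remaining < 2:
            raise ValueError(f"not enough bytes, expected 2, got {remaining}")
        (size,) = struct.unpack_from(">H", raw, pos)
        pos += 2
        chunk = raw[pos:pos + size]
        if len(chunk) != size:
            raise ValueError(f"not enough bytes, expected {size}, got {len(chunk)}")
        components.append(chunk)
        pos += size
    return components


def decode_timeuuid_int32(hex_value):
    """Decode a (TimeUUIDType, Int32Type) compound given as a hex string."""
    first, second = split_compound(bytes.fromhex(hex_value))
    return uuid.UUID(bytes=first), int.from_bytes(second, "big", signed=True)
```

Feeding the first example value to `decode_timeuuid_int32` yields the UUID and the integer 16 shown above, while the truncated value from the validation example fails with "expected 4, got 3".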
Tests: unit(dev)
"
* 'tools-scylla-types-compound-support/v1' of https://github.com/denesb/scylla:
tools/scylla_types: add validation action
tools/scylla_types: add compound_type support
tools/scylla_types: single source of truth for actions
compound_type: implement validate()
compound_type: fix const correctness
tools: mv scylla_types scylla-types
When sending hints from one file, rps_set field in send_one_file_ctx
keeps track of commitlog positions of hints that are being currently
sent, or have failed to be sent. At the end of the operation, if sending
of some hints failed, we will choose position of the earliest hint that
failed to be sent, and will retry sending that file later, starting from
that position. This position is stored in _last_not_complete_rp.
Usually, this set has a bounded size, because we impose a limit of at
most 128 hints being sent concurrently. Because we do not attempt to
send any more hints after a failure is detected, rps_set should not have
more than 128 elements at a time.
Due to a bug, commitlog positions of old hints (older than
gc_grace_seconds of the destination table) were inserted into rps_set
but not removed after checking their age. This could cause rps_set to
grow very large when replaying a file with old hints.
Moreover, if the file mixed expired and non-expired hints (which could
happen if it had hints to two tables with different gc_grace_seconds),
and sending of some non-expired hints failed, then positions of expired
hints could influence the calculation of _last_not_complete_rp, and more hints
than necessary would be resent on the next retry.
This simple patch removes commitlog position of a hint from rps_set when
it is detected to be too old.
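A minimal sketch of the bookkeeping described above, assuming a sequential replay (the real code sends up to 128 hints concurrently). The function name, the tuple layout and the `forget_expired` switch are hypothetical and exist only to contrast the old and new behavior.

```python
def replay_hints(hints, now, send, forget_expired=True):
    """Replay hints in order; return the position to retry from, or None.

    hints: list of (position, written_at, gc_grace_seconds) tuples.
    send:  callable taking a position and returning True on success.
    """
    rps_set = set()  # positions that are in flight or have failed
    for position, written_at, gc_grace_seconds in hints:
        rps_set.add(position)
        if now - written_at > gc_grace_seconds:
            # Expired hint: it is dropped instead of being sent.
            if forget_expired:
                rps_set.discard(position)  # the fix described above
            continue
        if send(position):
            rps_set.discard(position)
        else:
            break  # no more hints are sent after a failure
    return min(rps_set) if rps_set else None
```

With `forget_expired=False` the positions of expired hints stay in the set, so the retry position is pulled back to the earliest expired hint even though nothing before the failed hint needs resending.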
Fixes #6422
"
We inherited from Origin a `caching` table parameter. It's a map of named caching parameters. Before this PR two caching parameters were expected: `keys` and `rows_per_partition`. So far we have been ignoring them. This PR adds a new caching parameter called `enabled` which can be set to `true` or `false` and controls the usage of the cache for the table. By default, it's set to `true` which reflects Scylla behavior before this PR.
This new capability is used to disable caching for CDC Log table. It is desirable because CDC Log entries are not expected to be read often. They also put much more pressure on memory than entries in Base Table. This is caused by the fact that some writes to Base Table can override previous writes. Every write to CDC Log is unique and does not invalidate any previous entry.
Fixes #6098
Fixes #6146
Tests: unit(dev, release), manual
"
* haaawk-dont_cache_cdc:
cdc: Don't cache CDC Log table
table: invalidate disabled cache on memtable flush
table: Add cache_enabled member function
cf_prop_defs: persist caching_options in schema
property_definitions: add get that returns variant
feature: add PER_TABLE_CACHING feature
caching_options: add enabled parameter
We use pystache to parametrize our scylla.spec, but pystache is not
present in Fedora 32. Fortunately rpm provides its own template mechanism,
and this patch switches to using it:
- no longer install pystache
- pass parameters via rpm "-D" options
- use 0/1 for conditionals instead of true/false as per rpm conventions
- sanitize the "product" variable to not contain dashes
- change the .spec file to use rpm templating: %{...} and %if ... %endif
instead of mustache templating
Input SSTables of resharding are deleted at the coordinator shard, not at the
shards they belong to.
We're not acquiring the deletion semaphore before removing those input SSTables
from the SSTable set, so it could happen that resharding deletes those
SSTables while another operation like snapshot, which acquires the semaphore,
finds them deleted.
Let's acquire the deletion semaphore so that the input SSTables will only
be removed from the set, when we're certain that nobody is relying on their
existence anymore.
Now resharding will only delete input SSTables after they're safely removed
from the SSTable set of all shards they belong to.
unit: test(dev).
Fixes #6328.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200507233636.92104-1-raphaelsc@scylladb.com>
The "current compatibility with DynamoDB" section in alternator.md is where
we should list very briefly our state of compatibility - it's not the right
place to explain implementation details or track obscure bugs. I've
significantly shortened the "Tags" section because, in brief, we do
fully support tags and should say that we do.
I moved the two bugs mentioned in the text into the bug tracker:
Refs #6389
Refs #6391
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200507125022.22608-1-nyh@scylladb.com>
Allow examining partition and clustering keys, by adding support for
full and prefix compound types. The members of the compound type are
specified by passing several types with --type on the command line.
The patch implements:
- /storage_service/auto_compaction API endpoint
- /column_family/autocompaction/{name} API endpoint
These APIs allow controlling and querying the status of background
compaction jobs for existing tables.
The implementation introduces the table::_compaction_disabled_by_user flag.
The CompactionManager checks it before submitting a background
compaction job for the corresponding table.
New members
===
table::enable_auto_compaction();
table::disable_auto_compaction();
bool table::is_auto_compaction_disabled_by_user() const
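The interaction between the new table members and the compaction manager can be sketched as follows. This is a simplified Python model rather than the C++ implementation; `CompactionManager.submit` and its return value are assumptions made for illustration.

```python
class Table:
    def __init__(self, name):
        self.name = name
        self._compaction_disabled_by_user = False

    def enable_auto_compaction(self):
        self._compaction_disabled_by_user = False

    def disable_auto_compaction(self):
        self._compaction_disabled_by_user = True

    def is_auto_compaction_disabled_by_user(self):
        return self._compaction_disabled_by_user


class CompactionManager:
    def __init__(self):
        self.pending = []  # names of tables with a queued background job

    def submit(self, table):
        """Queue a background compaction unless the user disabled it."""
        if table.is_auto_compaction_disabled_by_user():
            return False
        self.pending.append(table.name)
        return True
```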
Test
===
Tests: unit(sstable_datafile_test autocompaction_control_test), manual
$ ninja build/dev/test/boost/sstable_datafile_test
$ ./build/dev/test/boost/sstable_datafile_test --run_test=autocompaction_control_test -- -c1 -m2G --overprovisioned --unsafe-bypass-fsync 1 --blocked-reactor-notify-ms 2000000
The test tries to submit a compaction job after toggling
the autocompaction control switch of the table. However, there is
no reliable way to hook a pending compaction task. The code
assumed that the with_scheduling_group() closure would never
preempt execution of the stats check.
Revert
===
Reverts commit c8247ac. In the previous version the execution
sometimes resulted in the following error:
test/boost/sstable_datafile_test.cc(1076): fatal error: in "autocompaction_control_test":
critical check cm->get_stats().pending_tasks == 1 || cm->get_stats().active_tasks == 1 has failed
This version adds a few sstables to the cf, starts
the compaction and waits until it is finished.
API change
===
- `/column_family/autocompaction/` always returned `true` when asked whether autocompaction is disabled (see https://github.com/scylladb/scylla-jmx/blob/master/src/main/java/org/apache/cassandra/db/ColumnFamilyStore.java#L321). Now it answers whether autocompaction is enabled for the specific table, so the logic of the question is inverted. A matching patch to the JMX is required. However, the change is acceptable because all old values were invalid (it always reported that all compactions were disabled).
- `/column_family/autocompaction/` got support for POST/DELETE per table
Fixes
===
Fixes #1488
Fixes #1808
Fixes #440
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
Currently the available actions are documented in several different
places:
* code implementing them
* description
* documentation for --action
* error message that validates value for --action
This is guaranteed to result in incorrect, possibly self-contradicting
documentation. Resolve by generating all documentation from the handler
registry, which now also contains the description of the action.
Also have a separate flag for each action, instead of --action=$ACTION.
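A registry-driven command line of this kind might look like the sketch below. It uses Python's `argparse` purely for illustration; the tool name, the action descriptions and the placeholder handlers are hypothetical. What it shows is that flags, help text and dispatch are all generated from one table.

```python
import argparse


def print_handler(values):
    return ["print " + v for v in values]


def compare_handler(values):
    return ["compare " + " ".join(values)]


# Single source of truth: action name -> (description, handler).
ACTIONS = {
    "print": ("print the values in human readable form", print_handler),
    "compare": ("compare the values", compare_handler),
}


def build_parser():
    parser = argparse.ArgumentParser(
        prog="example-tool",
        description="Supported actions: " + ", ".join(sorted(ACTIONS)),
    )
    # One flag per action; exactly one of them must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    for name, (description, _) in ACTIONS.items():
        group.add_argument("--" + name, dest="action", action="store_const",
                           const=name, help=description)
    parser.add_argument("values", nargs="+")
    return parser


def run(argv):
    args = build_parser().parse_args(argv)
    _, handler = ACTIONS[args.action]
    return handler(args.values)
```

Adding an action then means adding one registry entry; the flag, its help line and the dispatch follow automatically.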
Alternator supports four different write isolation policies, the default
being to do all the writes with LWT, but these policies were only briefly
explained in alternator.md.
This patch significantly expands on this explanation, better explaining
the tradeoffs involved in these four options, and when each might make
sense (if at all).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200506235152.18190-1-nyh@scylladb.com>
The single-key sstable reader uses the clustering ranges from the slice
to determine the upper bound of the disk read-range using the index.
For this it simply uses the end bound of the last clustering range. For
reverse reads, however, the clustering ranges in the slice are in reverse
order, so this will in fact be the upper bound of the smallest range.
Depending on whether the distance between the clustering ranges is big
enough for the sstable reader to use the index to skip between them,
this will lead to either reading too little data or an assert failure.
This patch fixes the problematic function `get_slice_upper_bound()` to
consider reverse reads as well.
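The idea of the fix can be illustrated with a small model. It assumes that ranges are `(start, end)` pairs whose order in the list is reversed for reverse reads while the bounds inside each pair keep their meaning; the function below is a hypothetical stand-in, not the real `get_slice_upper_bound()`.

```python
def slice_upper_bound(ranges, reversed_read=False):
    """Return the largest end bound covered by the slice, or None if empty.

    ranges: list of (start, end) pairs, ascending for forward reads and
    descending for reverse reads.
    """
    if not ranges:
        return None
    # For reverse reads the range holding the largest keys comes first.
    largest = ranges[0] if reversed_read else ranges[-1]
    return largest[1]
```

Taking the last range unconditionally would return 3 for the reversed slice `[(10, 20), (1, 3)]`, which is the "upper bound of the smallest range" problem described above.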
Initially I thought there would be more mishandling of reverse slices,
but actually `mutation_fragment_filter`, the component doing the actual
slicing of rows, is already reverse-slice aware.
A unit test which reproduces the assert failure is also added.
Fixes: #6171
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200507114956.271799-1-bdenes@scylladb.com>
"
Garbage collected SSTables, created by incremental compaction process,
are being added to the SSTable set using a function that invalidates
row cache using the range of the SSTable itself. That's incorrect
because data in GC SSTables come from preexisting SSTables in set,
meaning the state of data isn't changed and so no need for
invalidation at all. Incorrect invalidation like this is a source of
read performance issues. This problem is fixed by including GC
SSTables in the descriptor which is used to specify changes to the
SSTable set, which is the correct thing to do given that a midway
failure could leave the set in an incorrect state.
Fixes #5956.
Fixes #6275.
tests: unit(dev)
"
* 'fix_issue_5956_v4' of github.com:raphaelsc/scylla:
sstables/compaction: Don't invalidate row cache when adding GC SSTable to SSTable set
sstables/compaction: Change meaning of compaction_completion_desc input and output fields
sstables/compaction: Clean up code around garbage_collected_sstable_writer
The shutdown process of compaction manager starts with an explicit call
from the database object. However that can only happen once everything is
already initialized. This works well today, but I am soon to change
the resharding process to operate before the node is fully ready.
One can still stop the database in this case, but reshardings will
have to finish before the abort signal is processed.
This patch passes the existing abort source to the construction of the
compaction_manager and subscribes to it. If the abort source is
triggered, the compaction manager will react to it firing and all
compactions it manages will be stopped.
We still want the database object to be able to wait for the compaction
manager, since the database is the object that owns the lifetime of
the compaction manager. To make that possible we'll use a future
that is returned from stop(): no matter what triggered the abort, either
an early abort during initial resharding or a database-level event like
drain, everything will shut down in the right order.
The abort source is passed to the database, which is responsible for
constructing the compaction manager.
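The subscription pattern described above can be sketched with a toy abort source. Both classes are hypothetical Python stand-ins for the Seastar and Scylla objects, and the plain return value of `stop()` stands in for the future mentioned in the text.

```python
class AbortSource:
    """Toy abort source: runs subscribed callbacks once when aborted."""

    def __init__(self):
        self._callbacks = []
        self._requested = False

    def subscribe(self, callback):
        self._callbacks.append(callback)

    def request_abort(self):
        if self._requested:
            return
        self._requested = True
        for callback in self._callbacks:
            callback()


class CompactionManager:
    def __init__(self, abort_source, running):
        self.running = set(running)  # names of compactions in progress
        self.stopped = False
        abort_source.subscribe(self._do_stop)

    def _do_stop(self):
        # Idempotent, so an early abort and a later stop() do not clash.
        if not self.stopped:
            self.stopped = True
            self.running.clear()

    def stop(self):
        self._do_stop()
        return self.stopped
```

Whichever path fires first, the abort source or an explicit `stop()` from the owner, the manager ends up stopped exactly once.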
Tests: unit (dev), manual start+stop, manual drain + stop
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200506184749.98288-1-glauber@scylladb.com>
This is unrelated to counters, but happens to fix #4209
`tuple::delayed_value::contains_bind_marker` used to check that
ALL terms are bound (not that ANY of them is bound). As a result,
scylla would crash in prepare codepath for collections of tuples.
After this fix `invalid_request_exception` is thrown instead.
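The difference between the old and the new check comes down to `all` versus `any`. A minimal sketch, with a hypothetical `Term` class standing in for the CQL term types:

```python
class Term:
    def __init__(self, is_bind_marker):
        self.is_bind_marker = is_bind_marker

    def contains_bind_marker(self):
        return self.is_bind_marker


def tuple_contains_bind_marker_old(elements):
    # Old behavior: true only when every element is a bind marker.
    return all(e.contains_bind_marker() for e in elements)


def tuple_contains_bind_marker(elements):
    # Fixed behavior: true as soon as any element is a bind marker.
    return any(e.contains_bind_marker() for e in elements)
```

For a tuple such as `(?, 1)` the old check reports no bind marker, so the value is wrongly treated as fully constant.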
* jul-stas-4209-crash-on-counter-shards-set:
boost/tests: test for bound variable in a list of tuple literals
cql3: fix detection of bound variables in tuples
So that nested exceptions are not lost. Also, marshal exceptions, the
ones we have in these places, already have a backtrace, so might as well
use that instead of creating a new one and losing the unwound frames.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200507091405.244544-1-bdenes@scylladb.com>
Add a couple of cql tests regarding conditional batches:
1. Verify that "delete" takes priority over "insert"
when applied to the same row within the same batch.
2. Test that a workaround for the issue works as expected (i.e.
delete only individual cells instead of the full record).
Tests: unit(dev)
Fixes: #6273
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200506201200.176590-1-pa.solodovnikov@scylladb.com>
`tuple::delayed_value::contains_bind_marker` used to check that
ALL terms are bound (not that ANY of them is bound). As a result,
scylla would crash in prepare codepath for collections. After this
fix `invalid_request_exception` is thrown instead.
Fixes #4209
We must unregister the monitor upon destruction to prevent use-after-free
from `compaction_backlog_tracker::backlog` path.
This is similar to ~compaction_read_monitor as implemented
in commit ca284174d0
Fixes #6385
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200506214419.569655-1-bhalevy@scylladb.com>
In commit da3bf20e71 we supposedly enabled
support for Cassandra's "start_native_transport" option which can be set to
0 to run Scylla without listening on the CQL port. This can be useful, for
example, if a user only wants the DynamoDB or Redis APIs but not CQL.
Unfortunately, the option was still marked "Unused", so it wasn't really
enabled as a valid command line option. This patch fixes that, and
documents the start_native_transport option in docs/protocols.md, where
we document the different protocols, ports, and options to configure them.
Fixes #6387.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200506174850.13616-1-nyh@scylladb.com>
fmt, the formatting library we use, detects types with conversion
to std::string_view (and formats them as strings) and types that
support operator<<(std::ostream, const T&) (and performs custom
formatting on them). However, if <fmt/ostream.h> is not included, the
latter is not done.
The problem happens with seastar::sstring, which implements both,
and debug mode, which disables inlining. Some translation units
do include <fmt/ostream.h>, and so generate code to do custom
formatting. exception_utils.cc doesn't, and so generates code
to format via string_view conversion. At link time, the
linker picks one of the generated functions and includes it
in the final binary; it happened to pick one generated outside
exception_utils.cc, using custom formatting.
However, there is also code in fmt to encode which path fmt
chose - string_view or custom. This code is constexpr and so
is evaluated in exception_utils.cc. The result is that the
function to perform formatting of seastar::sstring uses custom
formatting, while the descriptor containing the method used
says it is formatting via string_view. This is enough to cause
a crash.
The problem is limited to debug mode, since in other modes
all this code is inlined, and so is consistent within the
translation unit.
We need a more general fix (hopefully in fmt), but for now a
simple fix is to add the missing include.
Ref https://github.com/fmtlib/fmt/issues/1662
"
CDC has to create CDC streams that are co-located with the corresponding Base Table data. This is not always easy, especially for small vnodes. This PR introduces a new partitioner which allows us to easily find stream ids that belong to a given vnode and shard.
The idea is that a partitioner accepts only keys that are a blob composed of two int64 numbers. The first number is the token of the key.
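A minimal sketch of how such a key could map to a token. The big-endian byte order and both function names are assumptions made for illustration; the text only states that the key holds two int64 numbers and that the first one is the token.

```python
import struct


def make_stream_id(token, second):
    """Build a 16-byte key from two signed 64-bit integers."""
    return struct.pack(">qq", token, second)


def token_from_stream_id(key):
    """Return the token stored in the first int64 of the key."""
    if len(key) != 16:
        raise ValueError("stream id must be exactly 16 bytes, got %d" % len(key))
    token, _ = struct.unpack(">qq", key)
    return token
```

Because the token is read straight from the key instead of being hashed, a stream id can be constructed for any desired token, and therefore for any vnode and shard.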
Tests: unit(dev), dtests(CDC)
"
* haaawk-cdc_partitioner:
cdc: use CDCPartitioner for CDC Log
dht: Add find_first_token_for_shard
dht: use long_token in token::to_int64
cdc: add CDCPartitioner
stream_id: add token_from_bytes static function
i_partitioner: Stop distinguishing whether keys order is preserved
CDC writes are not expected to be read multiple times so it makes little sense
to cache them. Moreover, CDC Log puts much bigger pressure on memory usage than
Base Table because some updates to the Base Table override existing data while
related CDC Log updates are always a new entry in a memtable.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
table::update_cache has two branches of its logic.
One when caching is enabled and the other when it's
disabled. This patch adds unconditional cache invalidation
to the second (disabled caching) branch.
This is done for two purposes. First and foremost, it gives
the guarantee that when we enable the cache later it will be in
the right state and will be ready for usage. This is because
any memtable flush that would logically invalidate the cache,
actually physically does that too now. An additional benefit of this
change is that disabled cache will be cleared during the next
memtable flush that will happen after turning the switch off.
Previously, the cache would also be emptied but it would take
more time before all its elements are removed by eviction.
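The two branches can be modeled with a toy table. This is a sketch under simplifying assumptions (dictionaries instead of memtables, sstables and the row cache; hypothetical method names), meant only to show why a cache that was disabled and later re-enabled never serves stale data.

```python
class Table:
    def __init__(self, cache_enabled=True):
        self.cache_enabled = cache_enabled
        self.cache = {}
        self.sstables = {}

    def flush_memtable(self, memtable):
        self.sstables.update(memtable)
        if self.cache_enabled:
            # Enabled branch: the cache is updated with the flushed data.
            self.cache.update(memtable)
        else:
            # Disabled branch: invalidate unconditionally.
            self.cache.clear()

    def read(self, key):
        if not self.cache_enabled:
            return self.sstables.get(key)
        if key not in self.cache and key in self.sstables:
            self.cache[key] = self.sstables[key]  # populate on miss
        return self.cache.get(key)
```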
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Previously 'WITH CACHING =' was ignored both in
CREATE TABLE and in ALTER TABLE statements.
Now it will be persisted in schema so that
it can be used later to control caching per table.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Using shared_ptr's in `unrecognized_entity_exception` can lead
to cross-cpu deletion of a pointer which will trigger an assert
`_cpu == std::this_thread::get_id()' when shared_ptr is disposed.
Copy `column_identifier` to the exception object and avoid using
an instance of `cql3::relation`: just get a string representation
from it since nothing more is used in associated exception
handling code.
Fixes: #6287
Tests: unit(dev, debug), dtest(lwt_destructive_ddl_test.py:LwtDestructiveDDLTest.test_rename_column)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200506155714.150497-1-pa.solodovnikov@scylladb.com>
When generating view updates, an endpoint can appear both
as a primary paired endpoint for the view update, and as a pending
endpoint (due to range movements). In order not to generate
the same update twice for the same endpoint, the paired endpoint
is removed from the list of pending endpoints if present.
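In sketch form, with a hypothetical function name and endpoints represented as plain strings:

```python
def view_update_endpoints(paired_endpoint, pending_endpoints):
    """Return the paired endpoint and the pending endpoints without it."""
    remaining = [ep for ep in pending_endpoints if ep != paired_endpoint]
    return paired_endpoint, remaining
```

Each endpoint then receives the view update at most once, whether it is the paired endpoint, a pending one, or both.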
Fixes #5459
Tests: unit(dev),
dtest(TestMaterializedViews.add_dc_during_mv_insert_test)
Following up on 91b71a0b1a
We also need to serialize storage_service::true_snapshots_size
with snapshot-modifying operations.
It seems like it was assumed that get_snapshot_details
is done under run_snapshot_list_operation, but the one called
here is the table method, not the api::storage_service::get_snapshot_details.
Fixes #5603
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200506115732.483966-1-bhalevy@scylladb.com>
Both `cql3::column_condition` and `cql3::column_condition::raw`
classes are marked as `final`: it's safe to use lw_shared_ptr
instead of generic `seastar::shared_ptr`.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200428202249.82785-1-pa.solodovnikov@scylladb.com>
This patch adds a comprehensive, hopefully complete, test for the
yet-unimplemented FilterExpression feature. FilterExpression is the
modern syntax which allows filtering the results of Query and Scan requests.
The patch includes 50 tests spanning more than 700 lines of code,
testing (hopefully) all the various FilterExpression features,
sub-cases, syntax peculiarities, and so on.
As usual, all included tests pass when run against DynamoDB
("pytest --aws") and xfail when run against Scylla.
This test should be helpful to understand how to implement
FilterExpression correctly, as well as test the future implementation.
Refs #5038.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200503165639.15320-1-nyh@scylladb.com>
We often have to examine raw values, obtained from various sources, like
sstables, logs and coredumps. For some types it is quite simple to
convert raw hex values to human readable ones manually (integers), for
others it is very hard or simply not practical. This command-line tool
aims to ease working with raw values, by providing facilities to print
them in human readable form and compare them. We can extend it with more
functions as needed.
Examples:
$ scylla_types -a print -t Int32Type b34b62d4
-1286905132
$ scylla_types -a compare -t 'ReversedType(TimeUUIDType)' b34b62d46a8d11ea0000005000237906 d00819896f6b11ea00000000001c571b
b34b62d4-6a8d-11ea-0000-005000237906 > d0081989-6f6b-11ea-0000-0000001c571b
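For reference, the two example results can be reproduced by hand with a few lines of Python. This is a sketch with hypothetical function names. It assumes that Int32Type is a big-endian signed integer and that TimeUUIDType orders values by their embedded timestamp; ties on the timestamp are reported as equal here, which is a simplification.

```python
import uuid


def print_int32(hex_value):
    """Interpret 4 hex-encoded bytes as a big-endian signed integer."""
    return int.from_bytes(bytes.fromhex(hex_value), "big", signed=True)


def compare_reversed_timeuuid(hex_a, hex_b):
    """Compare two time UUIDs by timestamp, with the order reversed."""
    a = uuid.UUID(hex=hex_a)
    b = uuid.UUID(hex=hex_b)
    natural = (a.time > b.time) - (a.time < b.time)  # -1, 0 or 1
    reversed_order = -natural
    return "<=>"[reversed_order + 1]
```

The first UUID carries the earlier timestamp, so it sorts before the second one in natural order and after it under `ReversedType`.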
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200505124914.104827-1-bdenes@scylladb.com>
* seastar 3c2e27811...e708d1df3 (10):
> Merge "Fix a few issues found by clang's asan" from Rafael
> seastar: app_template: allow a description to be provided for the app
> membarrier: fix madvise(MADV_DONTNEED) failure and crash with --lock-memory
Fixes #6346
> rpc::compressor: Fix static init fiasco with names
> fair_queue: express all internal fair_queue quantities as fair_queue_tickets
> net: remove API v1 compatibility layer (variadic future in networking)
> testing: Move parts of the exchanger out of line
> on_internal_error: add overload taking an std::exception_ptr
> tuple_utils: Add a missing include
> Merge "Fix use of uninitialized found by valgrind" from Rafael
A garbage collected SSTable is incorrectly added to the SSTable set with a function
that invalidates the row cache. This problem is fixed by adding the GC SSTable
to the set using the mechanism which replaces old SSTables with new SSTables.
Also, adding GC SSTable to set in a separate call is not correct.
We should make sure that GC SSTable reaches the SSTable set at the same time
its respective old (input) SSTable is removed from the set, and that's done
using a single request call to table.
Fixes #5956.
Fixes #6275.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
input_sstables is renamed to old_sstables and is about old SSTables that should be
deleted and removed from the SSTable set.
output_sstables is renamed to new_sstables and is about new SSTables that should be
added to the SSTable set, replacing the old ones.
This will allow us, for example, to add auxiliary SSTables to the SSTable set using
the same call which replaces input SSTables with output SSTables in compaction.
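The new meaning of the two fields can be sketched as a single replace step. The Python dataclass and `apply_completion` are hypothetical stand-ins for compaction_completion_desc and the table-side call; SSTables are represented by their names.

```python
from dataclasses import dataclass, field


@dataclass
class CompactionCompletionDesc:
    old_sstables: list = field(default_factory=list)  # to remove and delete
    new_sstables: list = field(default_factory=list)  # to add in their place


def apply_completion(sstable_set, desc):
    """Return a new set with old SSTables replaced by new ones in one step."""
    missing = set(desc.old_sstables) - sstable_set
    if missing:
        raise KeyError("not in the SSTable set: %s" % sorted(missing))
    return (sstable_set - set(desc.old_sstables)) | set(desc.new_sstables)
```

Since new_sstables is just "whatever should be added", an auxiliary SSTable can travel in the same descriptor as the regular compaction output.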
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This cleanup allows us to get rid of the ugly compaction::create_new_sstable(),
and reduce complexity by getting rid of observable.
garbage_collected_sstable_writer::data is introduced to allow compaction to
directly communicate with the GC writer, which is stored in mutation_compaction,
making it unreachable after the compaction has started. By making compaction
store GC writer's data and using that same data to create g__c__s__w,
compaction is able to communicate with GC writer without the complexity of
observable utility. This move is important for the subsequent work which
will fix a couple of issues regarding management of GC SSTables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>