Make sure gc'able-tombstone-only sstable is properly generated with data that
comes from regular compaction's input sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compaction prevents data resurrection from happening by checking that there's
no way a data shadowed by a GC'able tombstone will survive alone, after
a failure for example.
Consider the following scenario:
We have two runs A and B, each divided to 5 fragments, A1..A5, B1..B5.
They have the following token ranges:
A: A1=[0, 3] A2=[4, 7] A3=[8, 11] A4=[12, 15] A5=[16,18]
B is the same as A's ranges, offset by 1:
B: B1=[1,4] B2=[5,8] B3=[9,12] B4=[13,16] B5=[17,19]
Let's say we are finished flushing output until position 10 in the compaction.
We are currently working on A3 and B3, so obviously those cannot be deleted.
Because B2 overlaps with A3, we cannot delete B2 either.
Otherwise, B2 could have a GC'able tombstone that shadows data in A3, and after
B2 is gone, dead data in A3 could be resurrected *on failure*.
Now, A2 overlaps with B2 which we couldn't delete yet, so we can't delete A2.
Now A2 overlaps with B1 so we can't delete B1. And B1 overlaps with A1 so
we can't delete A1. So we can't delete any fragment.
The problem with this approach is obvious, fragments can potentially not be
released due to data dependency, so incremental compaction efficiency is
severely reduced.
To fix it, let's not purge GC'able tombstones right away in the mutation
compactor step. Instead, let's have compaction writing them to a separate
sstable run that would be deleted in the end of compaction.
By making sure that tombstone information from all compacting sstables is not
lost, we no longer need to have incremental compaction imposing lots of
restriction on which fragments could be released. Now, any sstable which data
is safe in a new sstable can be released right away. In addition, incremental
compaction will only take place if compaction procedure is working with one
multi-fragment sstable run at least.
Fixes#4531.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
* seastar 1f68be436f...e888b1df9c (8):
> sharded: Make map work with mapper that returns a future
> cmake: Remove FindBoost.cmake
> Reduce noncopyable_function instruction cache footprint
> doc: add Loops section to the tutorial
> Merge "Move file related code out of reactor" from Asias
> Merge "Move the io_queue code out of reactor" from Asias
> cmake: expose seastar_perf_testing lib
> future: class doc: explain why discarding a future is bad
- main.cc now includes new file io_queue.hh
- perf tests now include seastar perf utilities via user, not
system, includes since those are not exported
Merged patch set from Piotr Sarna:
Refs #5046
This commit adds handling "Authorization:" header in incoming requests.
The signature sent in the authorization is recomputed server-side
and compared with what the client sent. In case of a mismatch,
UnrecognizedClientException is returned.
The signature computation is based on boto3 Python implementation
and uses gnutls to compute HMAC hashes.
This series is rebased on a previous HTTPS series in order to ease
merging these two. As such, it depends on the HTTPS series being
merged first.
Tests: alternator(local, remote)
The series also comes with a simple authorization test and a docs update.
Piotr Sarna (6):
alternator: migrate split() function to string_view
alternator: add computing the auth signature
config: add alternator_enforce_authorization entry
alternator: add verifying the auth signature
alternator-test: add a basic authorization test case
docs: update alternator authorization entry
alternator-test/test_authorization.py | 34 ++++++++
configure.py | 1 +
alternator/{server.hh => auth.hh} | 22 ++---
alternator/server.hh | 3 +-
db/config.hh | 1 +
alternator/auth.cc | 88 ++++++++++++++++++++
alternator/server.cc | 112 +++++++++++++++++++++++---
db/config.cc | 1 +
main.cc | 2 +-
docs/alternator/alternator.md | 7 +-
10 files changed, 241 insertions(+), 30 deletions(-)
create mode 100644 alternator-test/test_authorization.py
copy alternator/{server.hh => auth.hh} (58%)
create mode 100644 alternator/auth.cc
Before this change, when populating non-system keyspaces, each data
directory was scanned and for each entry (keyspace directory),
a keyspace was populated. This was done in a serial fashion - populating
of one keyspace was not started until the previous one was done.
Loading keyspaces in such fashion can introduce unnecessary waiting
in case of a large number of keyspaces in one data directory. Population
process is I/O intensive and barely uses CPU.
This change enables parallel loading of keyspaces per data directory.
Populating the next keyspace does not wait for the previous one.
A benchmark was performed measuring startup time, with the following
setup:
- 1 data directory,
- 200 keyspaces,
- 2 tables in each keyspace, with the following schema:
CREATE TABLE tbl (a int, b int, c int, PRIMARY KEY(a, b))
WITH CLUSTERING ORDER BY (b DESC),
- 1024 rows in each table, with values (i, 2*i, 3*i) for i in 0..1023,
- ran on 6-core virtual machine running on i7-8750H CPU,
- compiled in dev mode,
- parameters: --smp 6 --max-io-requests 4 --developer-mode=yes
--datadir $DIR --commitlog-directory $DIR
--hints-directory $DIR --view-hints-directory $DIR
The benchmark tested:
- boot time, by comparing timestamp of the first message in log,
and timestamp of the following message:
"init - Scylla version ... initialization completed."
- keyspace population time, by comparing timestamps of messages:
"init - loading non-system sstables"
and
"init - starting view update generator"
The benchmark was run 5 times for sequential and parallel version,
with the following results:
- sequential: boot 31.620s, keyspace population 6.051s
- parallel: boot 29.966s, keyspace population 4.360s
Keyspace population time decreased by ~27.95%, and overall boot time
by about ~5.23%.
Tests: unit(release)
Fixes#2007
The signature sent in the "Authorization:" header is now verified
by computing the signature server-side with a matching secret key
and confirming that the signatures match.
Currently the secret key is hardcoded to be "whatever" in order
to work with current tests, but it should be replaced
by a proper key store.
Refs #5046
The config entry will be used to turn authorization for alternator
requests on and off. The default is currently off, since the key store
is not implemented yet.
A function for computing the auth signature from user requests
is added, along with helper functions. The implementation
is based on gnutls's HMAC.
Refs #5046
The implementation of string split was based on sstring type for
simplicity, but it turns out that more generic std::string_view
will be beneficial later to avoid unneeded string copying.
Unfortunately boost::split does not cooperate well with string views,
so a simple manual implementation is provided instead.
Schema changes can have big effects on performance, typically it should
be a rare event.
It is usefull to monitor how frequently the schema changed.
This patch adds a counter that increases each time a schema changed.
After this patch the metrics would look like:
scylla_database_schema_changed{shard="0",type="derive"} 2
Fixes#4785
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
We can use the reader::peek() to check if the reader contains any data.
If not, do not open the rpc stream connection. It helps to reduce the
port usage.
Refs: #4943
Both in a single-statement transaction and in a batch
we expect that serial consistency is provided. Move the
check to query_options class and make it available for
reuse.
Keep get_serial_consistency() around for use in
transport/server.cc.
Message-Id: <20191006154532.54856-2-kostja@scylladb.com>
When another node is reported to be down, view updates queued
for it are cancelled, but some of them may already be initiated.
Right now, cancelling such a write resulted in an exception,
but on conceptual level it's not really an exception, since
this behaviour is expected.
Previous version of this patch was based on introducing a special
exception type that was later handled specially, but it's not clear
if it's a good direction. Instead, this patch simply makes this
path non-exceptional, as was originally done by Nadav in the first
version of the series that introduced handling unstarted write
cancellations. Additionally, a message containing the information
that a write is cancelled is logged with debug level.
README.md has 3 fixes applied:
- s/alternator_tls_port/alternator_https_port
- conf directory is mentioned more explicitly
- it now correctly states that the self-signed certificate
warning *is* explicitly ignored in tests
Message-Id: <e5767f7dbea260852fc2fa9b613e1bebf490cc78.1570444085.git.sarna@scylladb.com>
"
Fixes#5134, Eviction concurrent with preempted partition entry update after
memtable flush may allow stale data to be populated into cache.
Fixes#5135, Cache reads may miss some writes if schema alter followed by a
read happened concurrently with preempted partition entry update.
Fixes#5127, Cache populating read concurrent with schema alter may use the
wrong schema version to interpret sstable data.
Fixes#5128, Reads of multi-row partitions concurrent with memtable flush may
fail or cause a node crash after schema alter.
"
* tag 'fix-cache-issues-with-schema-alter-and-eviction-v2' of github.com:tgrabiec/scylla:
tests: row_cache: Introduce test_alter_then_preempted_update_then_memtable_read
tests: row_cache_stress_test: Verify all entries are evictable at the end
tests: row_cache_stress_test: Exercise single-partition reads
tests: row_cache_stress_test: Add periodic schema alters
tests: memtable_snapshot_source: Allow changing the schema
tests: simple_schema: Prepare for schema altering
row_cache: Record upgraded schema in memtable entries during update
memtable: Extract memtable_entry::upgrade_schema()
row_cache, mvcc: Prevent locked snapshots from being evicted
row_cache: Make evict() not use invalidate_unwrapped()
mvcc: Introduce partition_snapshot::touch()
row_cache, mvcc: Do not upgrade schema of entries which are being updated
row_cache: Use the correct schema version to populate the partition entry
delegating_reader: Optimize fill_buffer()
row_cache, memtable: Use upgrade_schema()
flat_mutation_reader: Introduce upgrade_schema()
Merged patch series from Piotr Sarna:
This series adds HTTPS support for Alternator.
The series comes with --https option added to alternator-test, which makes
the test harness run all the tests with HTTPS instead of HTTP. All the tests
pass, albeit with security warnings that a self-signed x509 certificate was
used and it should not be trusted.
Fixes#5042
Refs scylladb/seastar#685
Patches:
docs: update alternator entry on HTTPS
alternator-test: suppress the "Unverified HTTPS request" warning
alternator-test: add HTTPS info to README.md
alternator-test: add HTTPS to test_describe_endpoints
alternator-test: add --https parameter
alternator: add HTTPS support
config: add alternator HTTPS port
* seastar c21a7557f9...1f68be436f (6):
> scheduling: Add per scheduling group data support
> build: Include dpdk as a single object in libseastar.a
> sharded: fix foreign_ptr's move assignment
> build: Fix DPDK libraries linking in pkg-config file
> http server: https using tls support
> Make output_stream blurb Doxygen
The BEGINS_WITH condition in conditional updates (via Expected) requires
that the given operand be either a string or a binary. Any other operand
should result in a validation exception - not a failed condition as we
generate now.
This patch fixes the test for this case so it will succeed against
Amazon DynamoDB (before this patch it fails - this failure was masked by
a typo before commit 332ffa77ea). The patch
then fixes our code to handle this case correctly.
Note that BEGINS_WITH handling of wrong types is now asymmetrical: A bad
type in the operand is now handled differently from a bad type in the
attribute's value. We add another check to the test to verify that this
is the case.
Fixes#5141
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20191006080553.4135-1-nyh@scylladb.com>
make_single_key_reader() currently doesn't actually create
single-partition readers because it doesn't set
mutation_reader::forwarding::no when it creates individual
readers. The readers will default to mutation_reader::forwarding::yes
and actually create scanning readers in preparation for
fast-forwarding across partitions.
Fix by passing mutation_reader::forwarding::no.
Currently, methods of simple_schema assume that table's schema doesn't
change. Accessors like get_value() assume that rows were generated
using simple_schema::_s. Because if that, the column_definition& for
the "v" column is cached in the instance. That column_definiion&
cannot be used to access objects created with a different schema
version. To allow using simple_schema after schema changes,
column_definition& caching is now tagged with the table schema version
of origin. Methods which access schema-dependent objects, like
get_value(), are now accepting schema& corresponding to the objects.
Also, it's now possible to tell simple_schema to use a different
schema version in its generator methods.
Cache update may defer in the middle of moving of partition entry
from a flushed memtable to the cache. If the schema was changed since
the entry was written, it upgrades the schema of the partition_entry
first but doesn't update the schema_ptr in memtable_entry. The entry
is removed from the memtable afterward. If a memtable reader
encounters such an entry, it will try to upgrade it assuming it's
still at the old schema.
That is undefined behavior in general, which may include:
- read failures due to bad_alloc, if fixed-size cells are interpreted
as variable-sized cells, and we misinterpret a value for a huge
size
- wrong read results
- node crash
This doesn't result in a permanent corruption, restarting the node
should help.
It's the more likely to happen the more rows there are in a
partition. It's unlikely to happen with single-row partitions.
Introduced in 70c7277.
Fixes#5128.
If the whole partition entry is evicted while being updated from the
memtable, a subsequent read may populate the partition using the old
version of data if it attempts to do it before cache update advances
past that partition. Partial eviction is not affected because
populating reads will notice that there is a newer snapshot
corresponding to the updater.
This can happen only in OOM situations where the whole cache gets evicted.
Affects only tables with multi-row partitions, which are the only ones
that can experience the update of partition entry being preempted.
Introduced in 70c7277.
Fixes#5134.
invalidate_unwrapped() calls cache_entry::evict(), which cannot be
called concurrently with cache update. invalidate() serializes it
properly by calling do_update(), but evict() doesn't. The purpose of
evict() is to stress eviction in tests, which can happen concurrently
with cache update. Switch it to use memory reclaimer, so that it's
both correct and more realistic.
evict() is used only in tests.
When a read enters a partition entry in the cache, it first upgrades
it to the current schema of the cache. The same happens when an entry
is updated after a memtable flush. Upgrading the entry is currently
performed by squashing all versions and replacing them with a single
upgraded version. That has a side effect of detaching all snapshots
from the partition entry. Partition entry update on memtable flush is
writing into a snapshot. If that snapshot is detached by a schema
upgrade, the entry will be missing writes from the memtable which fall
into continuous ranges in that entry which have not yet been updated.
This can happen only if the update of the entry is preempted and the
schema was altered during that, and a read hit that partition before
the update went past it.
Affects only tables with multi-row partitions, which are the only ones
that can experience the update of partition entry being preempted.
The problem is fixed by locking updated entries and not upgrading
schema of locked entries. cache_entry::read() is prepared for this,
and will upgrade on-the-fly to the cache's schema.
Fixes#5135
The sstable reader which populates the partition entry in the cache is
using the schema of the partition entry snapshot, which will be the
schema of the cache at the time the partition was entered. If there
was a schema change after the cache reader entered the partition but
before it created the sstable reader, the cache populating reader will
interpret sstable fragments using the wrong schema version. That is
more likely if partitions have many rows, and the front of the
partition is populated. With single-row partitions that's unlikely to
happen.
That is undefined behavior in general, which may include:
- read failures due to bad_alloc, if fixed-size cells are
interpreted as variable-sized cells, and we misinterpret
a value for a huge size
- wrong read results
- node crash
This doesn't result in a permanent corruption, restarting the node
should help.
Fixes#5127.
Use move_buffer_content_to() which is faster than fill_buffer_from()
because it doesn't involve popping and pushing the fragments across
buffers. We save on size estimation costs.
Running with --https and a self-signed certificate results in a flood
of expected warnings, that the connection is not to be trusted.
These warnings are silenced, as users runing a local test with --https
usually use self-signed certificates.
The test_describe_endpoints test spawns another client connection
to the cluster, so it needs to be HTTPS-aware in order to work properly
with --https parameter.
Running with --https parameter will result in sending the requests
via HTTPS instead of HTTP. By default, port 8043 is used for a local
cluster. Before running pytest --https, make sure that Scylla
was properly configured to initialize a HTTPS alternator server
by providing the alternator_tls_port parameter.
The HTTPS-based connection runs with verification disabled,
otherwise it would not work with self-signed certificates,
which are useful for tests.
By providing a server based on a TLS socket, it's now possible
to serve HTTPS requests in alternator. The HTTPS server is enabled
by setting its port in scylla.yaml: alternator_tls_port=XXXX.
Alternator TLS relies on the existing TLS configuration,
which is provided by certificate, keyfile, truststore, priority_string
options.
Fixes#5042
The test test_update_expression_function_nesting() fails because DynamoDB
don't allow an expression like list_append(list_append(:val1, :val2), :val3)
but Alternator doesn't check for this (and supports this expression).
The "xfail" message was outdated, suggesting that the test fails because
the "SET" expression isn't supported - but it is. So replace the message
by a more accurate one.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190915104708.30471-1-nyh@scylladb.com>
Merged patch set from Dejan Mircevski implementing some of the
missing operators for Expected: NE, IN, NULL and NOT_NULL.
Patches:
alternator: Factor out Expected operand checks
alternator: Implement NOT_NULL operator in Expected
alternator: Implement NULL operator in Expected
alternator: Fix expected_1_null testcase
alternator: Implement IN operator in Expected
alternator: Implement NE operator in Expected
alternator: Factor out common code in Expected