Consider an unpaged query that consumes all of available memory, despite
fea5067dfa which limits them (perhaps the
user raised the limit, or this is a system query). Eventually we will see a
bad_alloc which will abort the query and destroy this reconcilable_result_builder.
During destruction, we first destroy _memory_accounter, and then _result.
Destroying _memory_accounter resumes some continuations which can then
allocate memory synchronously when increasing the task queue to accommodate
them. We will then crash. Had we not crashed, we would immediately afterwards
release _result, freeing all the memory that we would ever need.
Fix by making _result the last member, so it is freed first.
Fixes #7240.
Introduce new database config option `schema_registry_grace_period`
describing the amount of time in seconds after which unused schema
versions will be cleaned up from the schema registry cache.
Default value is 1 second, the same value as was hardcoded before.
Tests: unit(debug)
Refs: #7225
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200915131957.446455-1-pa.solodovnikov@scylladb.com>
The set contains 3 small optimizations:
- avoid copying of partition key on lookup path
- reduce number of args carried around when creating a new entry
- save one partition key comparison on reader creation
Plus related satellite cleanups.
* https://github.com/xemul/scylla/tree/br-row-cache-less-copies:
row_cache: Revive do_find_or_create_entry concepts
populating reader: Do not copy decorated key too early
populating reader: Less allocator switching on population
populating reader: Fix indentation after previous patch
row_cache: Move missing entry creation into helper
test: Lookup an existing entry with its own helper
row_cache: Do not copy partition tombstone when creating cache entry
row_cache: Kill incomplete_tag
row_cache: Save one key compare on direct hit
"
This PR removes _pending_ranges and _pending_ranges_map in token_metadata.
This removal makes copying of token_metadata faster and reduces the chance of causing a reactor stall.
Refs: #7220
"
* asias-token_metadata_replication_config_less_maps:
token_metadata: Remove _pending_ranges
token_metadata: Get rid of unused _pending_ranges_map
from Asias.
This series follows "repair: Add progress metrics for node ops #6842"
and adds the metrics for the remaining node operations,
i.e., replace, decommission and removenode.
Fixes #1244, #6733
* asias-repair_progress_metrics_replace_decomm_removenode:
repair: Add progress metrics for removenode ops
repair: Add progress metrics for decommission ops
repair: Add progress metrics for replace ops
Change 94995acedb added yielding to abstract_replication_strategy::do_get_ranges.
And 07e253542d used get_ranges_in_thread in compaction_manager.
However, there is nothing to prevent token_metadata, and in particular its
`_sorted_tokens` from changing while iterating over them in do_get_ranges if the latter yields.
Therefore copy the replication strategy `_token_metadata` in `get_ranges_in_thread(inet_address ep)`.
If the caller provides `token_metadata` to get_ranges_in_thread, then the caller
must make sure that we can safely yield while accessing token_metadata (like
in `do_rebuild_replace_with_repair`).
Fixes #7044
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200915074555.431088-1-bhalevy@scylladb.com>
- Remove get_pending_ranges and introduce has_pending_ranges, since the
caller only needs to know if there is a pending range for the keyspace
and the node.
- Remove print_pending_ranges which is only used in logging. If we
really want to log the new pending token ranges, we can log when we
set the new pending token ranges.
This removal of _pending_ranges makes copying of token_metadata faster
and reduces the chance to cause reactor stall.
Refs: #7220
"
The range_tombstone_list provides an abstraction to work with
a sorted list of range tombstones, with methods to add/retrieve
them. However, there's a tombstones() method that just returns
a modifiable reference to the underlying collection (boost::intrusive_set),
which makes it hard to track its exact usage.
This set encapsulates the collection of range tombstones inside
the mentioned ..._list class.
tests: unit(dev)
"
* 'br-range-tombstone-encapsulate-collection' of https://github.com/xemul/scylla:
range_tombstone_list: Do not expose internal collection
range_tombstone_list: Introduce and use pop-and-lock helper
range_tombstone_list: Introduce and use pop_as<>()
flat_mutation_reader: Use range_tombstone_list begin/end API
repair: Mark some partition_hasher methods noexcept
hashers: Mark hash updates noexcept
The histogram constructor has a `counts` parameter defaulted to
`defaultdict(int)`. Due to how default argument values work in
python -- the same value is passed to all invocations -- this results in
all histogram instances sharing the same underlying counts dict. Solve
it the way this is usually solved -- default the parameter to `None` and
when it is `None` create a new instance of `defaultdict(int)` local to
the histogram instance under construction.
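A minimal sketch of the pitfall and the fix described above. The class
names `SharedCountsHistogram` and `Histogram` and the `add()` method are
hypothetical, used only for illustration; they are not the actual code
from the script.

```python
from collections import defaultdict


class SharedCountsHistogram:
    # Buggy form: the default value is evaluated once, when the function
    # is defined, so every instance receives the same dict object.
    def __init__(self, counts=defaultdict(int)):
        self.counts = counts

    def add(self, value):
        self.counts[value] += 1


class Histogram:
    # Fixed form: default to None and create a fresh dict per instance.
    def __init__(self, counts=None):
        self.counts = defaultdict(int) if counts is None else counts

    def add(self, value):
        self.counts[value] += 1


a, b = SharedCountsHistogram(), SharedCountsHistogram()
a.add(1)
shared = a.counts is b.counts            # b observes a's sample

c, d = Histogram(), Histogram()
c.add(1)
independent = c.counts is not d.counts   # d stays empty
```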
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200908142355.1263568-1-bdenes@scylladb.com>
Currently blobs are converted to python bytes objects and printed by
simply converting them to string. This results in hard to read blobs as
the bytes' __str__() attempts to interpret the data as a printable
string. This patch changes this to use bytes.hex() which prints blobs in
hex format. This is much more readable and it is also the format that
scylla uses when printing blobs.
Also the conversion to bytes is made more efficient by using gdb's
gdb.inferior.read_memory() function to read the data.
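To illustrate the readability difference, here is a small standalone
sketch comparing the two conversions. The sample bytes are arbitrary and
the snippet does not use gdb at all; it only shows how Python renders the
same bytes object in each case.

```python
# Sample blob contents, chosen arbitrarily for illustration.
blob = bytes([0x00, 0x61, 0xFF, 0x0A])

as_str = str(blob)   # mixes escape sequences with printable characters
as_hex = blob.hex()  # uniform: two hex digits per byte

print(as_str)
print(as_hex)
```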
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200911085439.1461882-1-bdenes@scylladb.com>
The hugepages and libhugetlbfs-bin packages are only required for DPDK mode,
and installing them unconditionally causes an error in offline mode, so drop them.
Fixes #7182
data::cell targets 8KB as its maximum allocation size to avoid
pressuring the allocator. This 8KB target is used for internal storage
-- values small enough to be stored inside the cell itself -- as well
as for external storage. Externally stored values use 8KB fragment sizes.
The problem is that only the size of data itself was considered when
making the allocations. For example when allocating the fragments
(chunks) for external storage, each fragment stored 8KB of data. But
fragments have overhead: they have next and back pointers. This resulted
in an 8KB + 2 * sizeof(void*) allocation. IMR uses the allocation
strategy mechanism, which works with aligned allocations. As the seastar
allocation only guarantees aligned allocations for power of two sizes,
it ends up allocating a 16KB slot. This results in the mutation fragment
using almost twice as much memory as would be required. This is a huge
waste.
This patch fixes the problem by considering the overhead of both
internal and external storage ensuring allocations are 8KB or less.
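The arithmetic can be sketched as follows. This is a simplified model,
not the allocator's actual logic: it assumes 8-byte pointers (a 64-bit
platform) and assumes every request is rounded up to the next power of
two. The helper `slot_size()` is hypothetical.

```python
POINTER_SIZE = 8               # assumption: 64-bit platform
TARGET = 8 * 1024              # the 8KB allocation target
OVERHEAD = 2 * POINTER_SIZE    # next and back pointers


def slot_size(n):
    """Round n up to the next power of two (simplified allocator model)."""
    return 1 << (n - 1).bit_length()


# Before: 8KB of data plus the pointers spills into a 16KB slot.
before_alloc = TARGET + OVERHEAD
before_slot = slot_size(before_alloc)

# After: the overhead is counted, so the data part shrinks and the
# whole fragment fits in an 8KB slot.
after_data = TARGET - OVERHEAD
after_slot = slot_size(after_data + OVERHEAD)

print(before_alloc, before_slot, after_data, after_slot)
```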
Fixes: #6043
Tests: unit(debug, dev, release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200910171359.1438029-1-bdenes@scylladb.com>
Instead of using the default hasher, hashing specializations should
use the hasher type they were specialized for. It's not a correctness
issue now because the default hasher (xx_hasher) is compatible with
its predecessor (legacy_xx_hasher_without_null_digest), but it's better
to be future-proof and use the correct type in case we ever change the
default hasher in a backward-incompatible way.
Message-Id: <c84ce569d12d9b4f247fb2717efa10dc2dabd75b.1600074632.git.sarna@scylladb.com>
The reader fills up the buffer upon construction, which is not what other
readers do, and is considered to be a waste of cycles, as the reader can be
dropped early.
Refs #1671
test: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200910171134.11287-2-xemul@scylladb.com>
"
Migration manager installs several cluster feature change listeners.
The listeners will call update_schema_version_and_announce() when cluster
features are enabled, which does this:
    return update_schema_version(proxy, features).then([] (utils::UUID uuid) {
        return announce_schema_version(uuid);
    });
It first updates the schema version and then publishes it via
gossip in announce_schema_version(). It is possible that the
announce_schema_version() part of the first schema change will be
deferred and will execute after the other four calls to
update_schema_version_and_announce(). It will install the old schema
version in gossip instead of the more recent one.
The fix is to serialize schema digest calculation and publishing.
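The race and the effect of serialization can be sketched with asyncio.
This is only an illustrative model under stated assumptions: the `state`
dict, `update_and_announce()`, `run()` and the sleep delays are all
hypothetical stand-ins for the real update and gossip steps, and an
`asyncio.Lock` stands in for whatever serialization mechanism the patch
actually uses.

```python
import asyncio


async def update_and_announce(state, version, delay, lock=None):
    async def body():
        state["current"] = version     # models update_schema_version()
        await asyncio.sleep(delay)     # the announce step gets deferred
        state["gossip"] = version      # models announce_schema_version()

    if lock is None:
        await body()
    else:
        async with lock:               # update + announce as one unit
            await body()


async def run(serialize):
    state = {"current": None, "gossip": None}
    lock = asyncio.Lock() if serialize else None
    await asyncio.gather(
        update_and_announce(state, 1, 0.02, lock),  # first change, slow announce
        update_and_announce(state, 2, 0.0, lock),   # later change, fast announce
    )
    return state


racy = asyncio.run(run(serialize=False))
fixed = asyncio.run(run(serialize=True))
print(racy, fixed)
```

In the unserialized run the first call's deferred announce lands last, so
gossip ends up with the older version while the current version is the
newer one.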
Refs #7200
This problem also brought my attention to initialization code, which could be
prone to the same problem.
The storage service computes gossiper states before it starts the
gossiper. Among them is the node's schema version. There are two problems with that.
The first is that computing the schema version and publishing it is not
atomic, so it is not safe against concurrent schema changes or schema
version recalculations. It is not mutually exclusive with
recalculate_schema_version() calls, and we could end up with the old
(and incorrect) schema version being advertised in gossip.
The second problem is that we should not allow the database layer to call
into the gossiper layer before it is fully initialized, as this may
produce undefined behavior.
Maybe we're not doing concurrent schema changes/recalculations now,
but it is easy to imagine that this could change for whatever reason
in the future.
The solution for both problems is to break the cyclic dependency
between the database layer and the storage_service layer by having the
database layer not use the gossiper at all. The database layer
publishes schema version inside the database class and allows
installing listeners on changes. The storage_service layer asks the
database layer for the current version when it initializes, and only
after that installs a listener which will update the gossiper.
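The layering described above can be sketched roughly as follows. The
`ObservableValue` class, its methods, and the `gossip` dict are
hypothetical illustrations of the idea, not the actual Scylla interfaces.

```python
class ObservableValue:
    """Holds a value and notifies listeners on change (illustrative only)."""

    def __init__(self, value):
        self._value = value
        self._listeners = []

    def get(self):
        return self._value

    def observe(self, callback):
        self._listeners.append(callback)

    def set(self, value):
        self._value = value
        for callback in list(self._listeners):
            callback(value)


# "Database layer": owns the schema version, knows nothing about gossip.
schema_version = ObservableValue("v1")

# "Storage service layer": reads the current version first, and only
# then installs a listener that keeps gossip up to date.
gossip = {"schema": schema_version.get()}
schema_version.observe(lambda v: gossip.update(schema=v))

initial = gossip["schema"]
schema_version.set("v2")
updated = gossip["schema"]
```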
Tests:
- unit (dev)
- manual (3 node ccm)
"
* tag 'fix-schema-digest-calculation-race-v1' of github.com:tgrabiec/scylla:
db, schema: Hide update_schema_version_and_announce()
db, storage_service: Do not call into gossiper from the database layer
db: Make schema version observable
utils: updateable_value_source: Introduce as_observable()
schema: Fix race in schema version recalculation leading to stale schema version in gossip
"
This pull request fixes unified relocatable package dependency issues in
build modes other than release, and then adds the unified tarball to the
"dist" build target.
Fixes #6949
"
* 'penberg/build/unified-to-dist/v1' of github.com:penberg/scylla:
configure.py: Build unified tarball as part of "dist" target
unified/build_unified: Use build/<mode>/dist/tar for dependency tarballs
configure.py: Use build/<mode>/dist/tar for unified tarball dependencies
This patch is a proposal for the removal of the redis
classes describing the commands. 'prepare' and 'execute'
class functions have been merged into a function with
the name of the command.
Note: 'command_factory' still needs to be simplified.
Signed-off-by: Etienne Adam <etienne.adam@gmail.com>
Message-Id: <20200913183315.9437-1-etienne.adam@gmail.com>
It's failing like so:
Python Exception <class 'TypeError'> unsupported operand type(s) for +: 'int' and 'str':
It's a regression caused by e4d06a3bbf.
_mask() should use the ref stored in the ctor to dereference _impl.
Fixes #7058.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200908154342.26264-1-raphaelsc@scylladb.com>
The storage service computes gossiper states before it starts the
gossiper. Among them is the node's schema version. There are two problems with that.
The first is that computing the schema version and publishing it is not
atomic, so it is not safe against concurrent schema changes or schema
version recalculations. It is not mutually exclusive with
recalculate_schema_version() calls, and we could end up with the old
(and incorrect) schema version being advertised in gossip.
The second problem is that we should not allow the database layer to call
into the gossiper layer before it is fully initialized, as this may
produce undefined behavior.
The solution for both problems is to break the cyclic dependency
between the database layer and the storage_service layer by having the
database layer not use the gossiper at all. The database layer
publishes schema version inside the database class and allows
installing listeners on changes. The storage_service layer asks the
database layer for the current version when it initializes, and only
after that installs a listener which will update the gossiper.
This also allows us to drop unsafe functions like update_schema_version().
Migration manager installs several feature change listeners:
    if (this_shard_id() == 0) {
        _feature_listeners.push_back(_feat.cluster_supports_view_virtual_columns().when_enabled(update_schema));
        _feature_listeners.push_back(_feat.cluster_supports_digest_insensitive_to_expiry().when_enabled(update_schema));
        _feature_listeners.push_back(_feat.cluster_supports_cdc().when_enabled(update_schema));
        _feature_listeners.push_back(_feat.cluster_supports_per_table_partitioners().when_enabled(update_schema));
    }
They will call update_schema_version_and_announce() when features are enabled, which does this:
    return update_schema_version(proxy, features).then([] (utils::UUID uuid) {
        return announce_schema_version(uuid);
    });
So it first updates the schema version and then publishes it via
gossip in announce_schema_version(). It is possible that the
announce_schema_version() part of the first schema change will be
deferred and will execute after the other four calls to
update_schema_version_and_announce(). It will install the old schema
version in gossip instead of the more recent one.
The fix is to serialize schema digest calculation and publishing.
Refs #7200
It causes gdb to print UUIDs like this:
"a3eadd80-f2a7-11ea-853c-", '0' <repeats 11 times>, "4"
This is quite hard to read, so let's drop the string display hint so that
they are displayed like this:
a3eadd80-f2a7-11ea-853c-000000000004
Much better. Also, technically a UUID is a 128-bit integer anyway, not a
string.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200911090135.1463099-1-bdenes@scylladb.com>
The build_unified.sh script has the same bug as configure.py had: it
looks for the python tarball in
build/<mode>/scylla-python3-package.tar.gz, but it's never generated
there. Fix up the problem by using build/<mode>/dist/tar location for
all dependency tarballs.
The build target for scylla-unified-package.tar.gz incorrectly depends
on "build/<mode>/scylla-python3-package.tar.gz", which is never
generated. Instead, the package is either generated in
"build/release/scylla-python3-package.tar.gz" (for legacy reasons) or
"build/<mode>/dist/tar/scylla-python3-package.tar.gz". This issue
causes building the unified package in other modes to fail.
To solve the problem, let's switch to using the "build/<mode>/dist/tar"
locations for unified tarball dependencies, which is the correct place
to use anyway.
When the test suite is run with Scylla serving in HTTPS mode, using
test/alternator/run --https, two Alternator Streams tests failed.
With this patch fixing a bug in the test, the tests pass.
The bug was in the is_local_java() function, which was supposed to detect
DynamoDB Local (which behaves differently from the real DynamoDB in some
respects). When that detection code makes an HTTPS request and does not
disable checking the server's certificate (which on Alternator is
self-signed), the request fails - but not in the way that the code expected.
So we need to fix is_local_java() to allow the failure mode of the
self-signed certificate. In any case, this is *not* DynamoDB Local, so
the detection function should return false.
Fixes #7214
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200910194738.125263-1-nyh@scylladb.com>
This patch adds regression tests for four recently-fixed issues which did not yet
have tests:
Refs #7157 (LatestStreamArn)
Refs #7158 (SequenceNumber should be numeric)
Refs #7162 (LatestStreamLabel)
Refs #7163 (StreamSpecification)
I verified that all the new tests failed before these issues were fixed, but
now pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200907155334.562844-1-nyh@scylladb.com>
"
This series fixes a bug in `appending_hash<row>` that caused it to ignore any cells after the first NULL. It also adds a cluster feature which starts using the new hashing only after the whole cluster is aware of it. The series comes with tests, which reproduce the issue.
Fixes #4567
Based on #4574
"
* psarna-fix_ignoring_cells_after_null_in_appending_hash:
test: extend mutation_test for NULL values
tests/mutation: add reproducer for #4567
gms: add a cluster feature for fixed hashing
digest: add null values to row digest
mutation_partition: fix formatting
appending_hash<row>: make publicly visible
The test is extended for another possible corner case:
[1, NULL, 2] vs [1, 2, NULL] should have different digests.
Also, a check for legacy behavior is added.
With the new hashing routine, null values are taken into account
when computing the row digest. The previous behavior had a regression
which stopped computing the hash after the first null value
was encountered, but the original behavior was also prone
to errors - e.g. row [1, NULL, 2] was not distinguishable
from [1, 2, NULL], because their hashes were identical.
This hashing is not yet active - it will only be used after
the next commit introduces a proper cluster feature for it.
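The collision described above can be reproduced with a small sketch. It
uses `hashlib.sha256` purely as a stand-in for the real hasher, and the
function names and the byte encoding of cells are hypothetical; the only
point is that feeding an explicit marker for NULL makes the position of
the NULL affect the digest.

```python
import hashlib


def digest_skipping_nulls(row):
    # Models the original behavior: NULL cells contribute nothing.
    h = hashlib.sha256()
    for cell in row:
        if cell is not None:
            h.update(b"v" + str(cell).encode() + b";")
    return h.hexdigest()


def digest_with_nulls(row):
    # Models the new behavior: NULL cells feed a distinct marker.
    h = hashlib.sha256()
    for cell in row:
        if cell is None:
            h.update(b"n;")
        else:
            h.update(b"v" + str(cell).encode() + b";")
    return h.hexdigest()


row_a = [1, None, 2]
row_b = [1, 2, None]

collides = digest_skipping_nulls(row_a) == digest_skipping_nulls(row_b)
distinct = digest_with_nulls(row_a) != digest_with_nulls(row_b)
print(collides, distinct)
```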
appending_hash<row> specialisation is declared and defined in a *.cc file
which means it cannot have a dedicated unit test. This patch moves the
declaration to the corresponding *.hh file.
There was a typo in get_column_defs_for_filtering(): it checked the
wrong pointer before dereferencing. Add a test exposing the NULL
dereference and fix the typo.
Tests: unit (dev)
Fixes #7198.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
In commit 7d86a3b208 (storage_service:
Make replacing node take writes), application state of TOKENS of the
replacing node is added into gossip and propagated to the cluster after
the initial start of the gossip service. This can cause the race below:
1. The replacing node replaces the old dead node with the same IP address
2. The replacing node starts gossip without the TOKENS application state
3. Other nodes in the cluster replace the application states of the old dead node's
version with the new replacing node's version
4. The replacing node dies
5. The replace operation is performed again; the TOKENS application state is
not present and the replace operation fails.
To fix, we can always add TOKENS application state when the
gossip service starts.
Fixes: #7166
Backports: 4.1 and 4.2
The clustering_row class looks like a decorated deletable_row, but
re-implements all its logic (and members). Embed the deletable_row
into clustering_row and keep the non-static row logic in one
class instead of two.