Indexed queries are using paging over the materialized view
table. Results of the view read are then used to issue reads of the
base table. If base table reads are short reads, the page is returned
to the user and paging state is adjusted accordingly so that when
paging is resumed it will query the view starting from the row
corresponding to the next row in the base which was not yet
returned. However, paging state's "remaining" count was not reset, so
if the view read was exhausted the reading will stop even though the
base table read was short.
Fix by restoring the "remaining" count when adjusting the paging state
on short read.
Tests:
- index_with_paging_test
- secondary_index_test
Fixes#9198
Message-Id: <20210818131840.1160267-1-tgrabiec@scylladb.com>
First, it doesn't test the gossiper so
it's unclear why have it at all.
And it doesn't test anything more than what we test
using the cql_test_env either.
For testing gossip there is test/manual/gossip.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211122081305.789375-2-bhalevy@scylladb.com>
"
After this series, compaction will finally stop including database.hh.
tests: unit(debug).
"
* 'stop_including_database_hh_for_compaction' of github.com:raphaelsc/scylla:
compaction: stop including database.hh
compaction: switch to table_state in get_fully_expired_sstables()
compaction: switch to table_state
compaction: table_state: Add missing methods required by compaction
Make compaction procedure switch to table_state. Only function in
compaction.cc still directly using table is
get_fully_expired_sstables(T,...), but subsequently we'll make it
switch to table_state and then we can finally stop including database.hh
in the compaction code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
These are the only methods left for compaction to switch to
table_state, so compaction can finally stop including database.hh
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"
Add a sharded locator::effective_replication_map_factory that holds
shared effective_replication_maps.
To search for e_r_m in the factory, we use a compound `factory_key`:
<replication_strategy type, replication_strategy options, token_metadata ring version>.
Start the sharded factory in main (plus cql_test_env and tools/schema_loader)
and pass a reference to it to storage_proxy and storage_server.
For each keyspace, use the registry to create the effective_replication_map.
When registered, effective_replication_map objects erase themselves
from the factory when destroyed. effective_replication_map then schedules
a background task to clear_gently its contents, protected by the e_r_m_f::stop()
function.
Note that for non-shard 0 instances, if the map
is not found in the registry, we construct it
by cloning the precalculated replication_map
from shard 0 to save the cpu cycles of re-calculating
it time and again on every shard.
Test: unit(dev), schema_loader_test(debug)
DTest: bootstrap_test.py:TestBootstrap.decommissioned_wiped_node_can_join_test update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_schema_changes_with_repair_test (dev)
"
* tag 'effective_replication_map_factory-v7' of https://github.com/bhalevy/scylla:
effective_replication_map: clear_gently when destroyed
database: shutdown keyspaces
test: cql_test_env: stop view_update_generator before database shuts down
effective_replication_map_factory: try cloning replication map from shard 0
tools: schema_loader: start a sharded erm_factory
storage_service: use erm_factory to create effective_replication_map
keyspace: use erm_factory to create effective_replication_map
effective_replication_map: erase from factory when destroyed
effective_replication_map_factory: add create_effective_replication_map
effective_replication_map: enable_lw_shared_from_this
effective_replication_map: define factory_key
keyspace: get a reference to the erm_factory
main: pass erm_factory to storage_service
main: pass erm_factory to storage_proxy
locator: add effective_replication_map_factory
We can't have view updates happening after the database shuts down.
In particular, mutateMV depends on the keyspace effective_replaication_map
and it is going to be released when all keyspaces shut down, in the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used for creating effective_replication_map
when token_metadata changes, and update all
keyspaces with it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It will be used further to create shared copies
of effective_replication_map based on replication_strategy
type and config options.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We're using a coarse resolution when rounding clock time for sstables to
be evenly distributed across time buckets. Let's use a better resolution,
to make sure sstables won't fall into the edges.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211118172126.34545-1-raphaelsc@scylladb.com>
The test checks every 100 * smp::count milliseconds that a shard
had been able to make at least once step. Shards, in turn, take up
to 100 ms sleeping breaks between steps. It seems like on heavily
loaded nodes the checking period is too small and the test
stuck-detector shoots false-positives.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211118154932.25859-1-xemul@scylladb.com>
The previous implementation based on `delivery_queue` had a serious
defect: if receiving a message (`rpc::receive`) blocked, other messages
in the queue had to wait. This would cause, for example, `vote_request`
messages to stop being handled by a server if the server was in the middle
of applying a snapshot.
Now `rpc::receive` returns `void`, not `future<>`. Thus we no longer
need `delivery_queue`: the network message delivery function can simply
call `rpc::receive` directly. Messages which require asynchronous work
to be performed (such as snapshot application) are handled in
`rpc::receive` by spawning a background task. The number of such
background tasks is limited separately for each message type; now if
we exceed that limit, we drop other messages of this type (previously
they would queue up indefinitely and block not only other messages
of this type but different types as well).
Message-Id: <20211116163316.129970-1-kbraun@scylladb.com>
fmt 8 checks format strings at compile time, and requires that
non-compile-time format strings be wrapped with fmt::runtime().
Do that, and to allow coexistence with fmt 7, supply our own
do-nothing version of fmt::runtime() if needed. Strictly speaking
we shouldn't be introducing names into the fmt namespace, but this
is transitional only.
Closes#9640
A member variable is a reference, not a pure value, so std::same_as<>
needs to be given a reference (and clanf 13 insists). However, clang
12 doesn't accept the correct constraint, so use std::convertible_to<>
as a compromise.
Closes#9642
If we get errors/exceptions in delete_segments we can (and probably will) loose track of disk footprint counters. This can in turn, if using hard limits, cause us to block indefinitely on segment allocation since we might think we have larger footprint than we actually do.
Of course, if we actually fail deleting a segment, it is 100% true that we still technically hold this disk footprint (now unreachable), but for cases where for example outside forces (or wacky tests) delete a file behind our backs, this might not be true. One could also argue that our footprint is the segments and file names we keep track of, and the rest is exterior sludge.
In any case, if we have any exceptions in delete_segments, we should recalculate disk footprint based on current state, and restart all new_segment paths etc.
Fixes#9348
(Note: this is based on previous PR #9344 - so shows these commits as well. Actual changes are only the latter two).
Closes#9349
* github.com:scylladb/scylla:
commitlog: Recalculate footprint on delete_segment exceptions
commitlog_test: Add test for exception in alloc w. deleted underlying file
commitlog: Ensure failed-to-create-segment is re-deleted
commitlog::allocate_segment_ex: Don't re-throw out of function
Add schema parameter so that:
* Caller has better control over schema -- especially relevant for
reverse reads where it is not possible to follow the convention of
passing the query schema which is reversed compared to that of the
mutations.
* Now that we don't depend on the mutations for the schema, we can lift
the restriction on mutations not being empty: this leads to safer
code. When the mutations parameter is empty, an empty reader is
created.
Add "make_" prefix to follow convention of similar reader factory
functions.
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211115155614.363663-1-bdenes@scylladb.com>
In the DynamoDB API, a number is encoded in JSON requests as something
like: {"N": "123"} - the type is "N" and the value "123". Note that the
value of the number is encoded as a string, because the floating-point
range and accuracy of DynamoDB differs from what various JSON libraries
may support.
We have a function unwrap_number() which supported the value of the
number being encoded as an actual number, not a string. But we should
NOT support this case - DynamoDB doesn't. In this patch we add a test
that confirms that DynamoDB doesn't, and remove the unnecessary case
from unwrap_number(). The unnecessary case also had a FIXME, so it's
a good opportunity to get rid of a FIXME.
When writing the test, I noticed that the error which DynamoDB returns
in this case is SerializionException instead of the more usual
ValidationException. I don't know why, but let's also change the error
type in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211115125738.197099-1-nyh@scylladb.com>
Add "rows" field to system.large_partitions. Add partitions to the
table when they are too large or have too many rows.
Fixes#9506
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes#9577
The db::config reference is available on the database, which
can be get from the virtual_table itself. The problem is that
it's a const refernece, while system.config will be updateable
and will need non-const reference.
Adding non-const get_config() on the database looks wrong. The
database shouldn't be used as config provider, even the const
one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The series fixes issues:
server may use the wrong configuration after applying a remote snapshot, causing a split-brain situation
assertion ins raft::server_impl::notify_waiters()
snapshot transfer to a server removed from the configuration should be aborted
cluster may become stuck when a follower takes a snapshot after an accepted entry that the leader didn't learn about
* scylla-dev/random-test-fixes-v2:
raft: rename rpc_configuration to configuration in fsm output
raft: test: test case for the issue #9552
raft: fix matching of a snapshotted log on a follower
raft: abort snapshot transfer to a server that was removed from the configuration
raft: fix race between snapshot application and committing of new entries
raft: test: add test for correct last configuration index calculation during snapshot application
raft: do not maintain _last_conf_idx and _prev_conf_idx past snapshot index
raft: correctly truncate the log in a persistence module during snapshot application
"
table_state is being introduced for compaction subsystem, to remove table dependency
from compaction interface, fix layer violations, and also make unit testing
easier as table_state is an abstraction that can be implemented even with no
actual table backing it.
In this series, compaction strategy interfaces are switching to table_state,
and eventually, we'll make compact_sstables() switch to it too. The idea is
that no compaction code will directly reference a table object, but only work
with the abstraction instead. So compaction subdirectory can stop
including database.hh altogether, which is a great step forward.
"
* 'table_state_v5' of https://github.com/raphaelsc/scylla:
sstable_compaction_test: switch to table_state
compaction: stop including database.hh for compaction_strategy
compaction: switch to table_state in estimated_pending_compactions()
compaction: switch to table_state in compaction_strategy::get_major_compaction_job()
compaction: switch to table_state in compaction_strategy::get_sstables_for_compaction()
DTCS: reduce table dependency for task estimation
LCS: reduce table dependency for task estimation
table: Implement table_state
compaction: make table param of get_fully_expired_sstables() const
compaction_manager: make table param of has_table_ongoing_compaction() const
Introduce table_state
Move run_with_compaction_disabled() into compaction manager
run_with_compaction_disabled() living in table is a layer violation as the
logic of disabling compaction for a table T clearly belongs to manager
and table shouldn't be aware of such implementation details.
This makes things less error prone too as there's no longer a need for
coordination between table and manager.
Manager now takes all the responsibility.
* 'move_disable_compaction_to_manager/v6' of https://github.com/raphaelsc/scylla:
compaction: move run_with_compaction_disabled() from table into compaction_manager
compaction_manager: switch to coroutine in compaction_manager::remove()
compaction_manager: add struct for per table compaction state
compaction_manager: wire stop_ongoing_compactions() into remove()
compaction_manager: introduce stop_ongoing_compactions() for a table
compaction_manager: prevent compaction from being postponed when stopping tasks
compaction_manager: extract "stop tasks" from stop_ongoing_compactions() into new function
Last method in compaction_strategy using table. From now on,
compaction strategy no longer works directly with table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
From now on, get_sstables_for_compaction() will use table_state.
With table_state, we avoid layer violations like strategy using
manager and also makes testing easier.
Compaction unit tests were temporarily disabled to avoid a giant
commit which is hard to parse.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
* flat_reader_assertions::produces_range_tombstone() does not actually
check range tombstones beyond the fact that they are in fact range
tombstones (unless non-empty ck_ranges is passed). Fix the immediate
problem, change assertion logic to take split and overlapping range
tombstones into account properly, and also fix several
accidentally-incorrect tests.
Fixes#9470
* Convert the remaining sstable_3_x reader tests to v2, now that they
are more correct and only the actual convertion remains.
This deals with the sstable reader tests that involve range
tombstones.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
That's intended to fix a bad layer violation as table was given the
responsibility of disabling compaction for a given table T, but that
logic clearly belongs to compaction_manager instead.
Additionally, gate will be used instead of counter, as former provides
manager with a way to synchronize with functions running under
run_with_compaction_disabled. so remove() can wait for their
termination.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
flat_reader_assertions::produces_range_tombstone() does not actually
check range tombstones beyond the fact that they are in fact range
tombstones (unless non-empty ck_ranges is passed).
Fixing the immediate problem reveals that:
* The assertion logic is not flexible enough to deal with
creatively-split or creatively-overlapping range tombstones.
* Some existing tests involving range tombstones are in fact wrong:
some assertions may (at least with some readers) refer to wrong
tombstones entirely, while others assert wrong things about right
tombstones.
* Range tombstones in pre-made sstables (such as those read by
sstable_3_x_test) have deletion time drift, and that now has to be
somehow dealt with.
This patch (which is not split into smaller ones because that would
either generate unreasonable amount of work towards ensuring
bisectability or entail "temporarily" disabling problematic tests,
which is cheating) contains the following changes:
* flat_reader_assertions check range tombstones more carefully, by
accumulating both expected and actually-read range tombstones into
lists and comparing those lists when a partition ends (or when the
assertion object is destroyed).
* flat_reader_assertions::may_produce_tombstones() can take
constraining ck_ranges.
* Both flat_reader_assertions and flat_reader_assertions_v2 can be
instructed to ignore tombstone deletion times, to help with tests that
read pre-made sstables.
* Affected tests are changed to reflect reality. Most changes to
tests make sense; the only one I am not completely sure about is in
test_uncompressed_filtering_and_forwarding_range_tombstones_read.
Fixes#9470
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
This PR introduces 4 new virtual tables aimed at replacing nodetool commands, working towards the long-term goal of replacing nodetool completely at least for cluster information retrieval purposes.
As you may have noticed, most of these replacement are not exact matches. This is on purpose. I feel that the nodetool commands are somewhat chaotic: they might have had a clear plan on what command prints what but after years of organic development they are a mess of fields that feel like don't belong. In addition to this, they are centered on C* terminology which often sounds strange or doesn't make any sense for scylla (off-heap memory, counter cache, etc.).
So in this PR I tried to do a few things:
* Drop all fields that don't make sense for scylla;
* Rename/reformat/rephrase fields that have a corresponding concept in scylla, so that it uses the scylla terminology;
* Group information in tables based on some common theme;
With these guidelines in mind lets look at the virtual tables introduced in this PR:
* `system.snapshots` - replacement for `nodetool listnapshots`;
* `system.protocol_servers`- replacement for `nodetool statusbinary` as well as `Thrift active` and `Native Transport active` from `nodetool info`;
* `system.runtime_info` - replacement for `nodetool info`, not an exact match: some fields were removed, some were refactored to make sense for scylla;
* `system.versions` - replacement for `nodetool version`, prints all versions, including build-id;
Closes#9517
* github.com:scylladb/scylla:
test/cql-pytest: add virtual_tables.py
test/cql-pytest: nodetool.py: add take_snapshot()
db/system_keyspace: add versions table
configure.py: move release.cc and build_id.cc to scylla_core
db/system_keyspace: add runtime_info table
db/system_keyspace: add protocol_servers table
service: storage_service: s/client_shutdown_hooks/protocol_servers/
service: storage_service: remove unused unregister_client_shutdown_hook
redis: redis_service: implement the protocol_server interface
alternator: controller: implement the protocol_server interface
transport: controller: implement the protocol_server interface
thrift: controller: implement the protocol_server interface
Add protocol_server interface
db/system_keyspace: add snapshots virtual table
db/virtual_table: remove _db member
db/system_keyspace: propagate distributed<> database and storage_service to register_virtual_tables()
docs/design-notes/system_keyspace.md: add listing of existing virtual tables
docs/guides: add virtual-tables.md
This PR started by realizing that in the memtable reversing reader, it
never happened on tests that `do_refresh_state` was called with
`last_row` and `last_rts` which are not `std::nullopt`.
Changes
- fix memtable test (`tesst_memtable_with_many_versions_conforms_to_mutation_source`), so that there is a background job forcing state refreshes,
- fix the way rt_slice is computed (was `(last_rts, cr_range_snapshot.end]`, now is `[cr_range_snapshot.start, last_rts)`).
Fixes#9486Closes#9572
* github.com:scylladb/scylla:
partition_snapshot_reader: fix indentation in fill_buffer
range_tombstone_list: {lower,upper,}slice share comparator implementation
test: memtable: add full_compaction in background
partition_snapshot_reader: fix obtaining rt_slice, if Reversing and _last_rts was set
range_tombstone_list: add lower_slice
This patch fixes a bug in UpdateItem's ReturnValues=ALL_NEW, which in
some cases returned the OLD (pre-modification) value of some of the
attributes, instead of its NEW value.
The bug was caused by a confusion in our JSON utility function,
rjson::set(), which sounds like it can set any member of a map, but in
fact may only be used to add a *new* member - if a member with the same
name (key) already existed, the result is undefined (two values for the
same key). In ReturnValues=ALL_NEW we did exactly this: we started with
a copy of the original item, and then used set() to override some of the
members. This is not allowed.
So in this patch, we introduce a new function, rjson::replace(), which
does what we previously thought that rjson::set() does - i.e., replace a
member if it exists, or if not, add it. We call this function in
the ReturnValues=ALL_NEW code.
This patch also adds a test case that reproduces the incorrect ALL_NEW
results - and gets fixed by this patch.
In an upcoming patch, we should rename the confusingly-named set()
functions and audit all their uses. But we don't do this in this patch
yet. We just add some comments to clarify what set() does - but don't
change it, and just add one new function for replace().
Fixes#9542
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211104134937.40797-1-nyh@scylladb.com>
Add full compaction in test_memtable_with_many_versions_conforms_to_mutation_source
in background. Without it, some paths in the partition snapshot reader
weren't covered, as the tests always managed to read all range
tombstones and rows which cover a given clustering range from just a
single snapshot. Now, when full_compaction happens in process of reading
from a clustering range, we can force state refresh with non-nullopt
positions of last row and last range tombstone.
Note: this inability to test affected only the reversing reader.
Cleanup and improvements for compaction
* 'compaction_cleanup_and_improvements_v2' of https://github.com/raphaelsc/scylla:
compaction: fix outdated doc of compact_sstables()
table: fix indentation in compact_sstables()
table: give a more descriptive name to compaction_data in compact_sstables()
compaction_manager: rename submit_major_compaction to perform_major_compaction
compaction: fix indentantion in compaction.hh
compaction: move incremental_owned_ranges_checker into cleanup_compaction
compaction: make owned ranges const in cleanup_compaction
compaction: replace outdated comment in regular_compaction
compaction: give a more descriptive name to compaction_data
compaction_manager: simplify creation of compaction_data
for symmetry, let's call it perform_* as it doesn't work like submission
functions which doesn't wait for result, like the one for minor
compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
there's no need for wrapping compaction_data in shared_ptr, also
let's kill unused params in create_compaction_data to simplify
its creation.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This case takes 45+ minutes which is 1.5 times longer then the
second longest case out there. I propose to keep the many-400
case out of debug runs, there's many-100 one which is configured
the same way but uses 4x times less "nodes".
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
There are 3 overlapping problems with the test case. It
has use after move that covers wrond window selection and
relies on a time-since-epoch being aligned with the time
window by chance.
tests: unit(dev)
"
* 'br-twcs-test-fixes' of https://github.com/xemul/scylla:
test, compaction: Do not rely on random timestamp
test, compaction: Fix use after move in twcs reshape
Again, there's a sub-case with sequential time stamps that still
works by chance. This time it's because splitting 256 sstables
into buckets of maximum 8 ones is allowed to have the 1st and the
last ones with less than 8 items in it, e.g. 3, 8, ..., 8, 5. The
exact generation depends on the time-since-epoch at which it
starts.
When all the cases are run altogether this time luckily happens
to be well-aligned with 8-hours and the generated buckets are
filled perfectly. When this particular test-case is run all alone
(e.g. by --run_test or --parallel-cases) then the starting time
becomes different and it gets less than 4 sstables in its first
bucket.
The fix is in adjusting the starting time to be aligned with the
8 hours window.
Actually, the 8 hours appeared in the previous patch, before which
it was 24 hours. Nonetheless, the above reasoning applies to any
size of the time window that's less than 256, so it's still an
independent fix.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>