This pull request introduces support for global secondary indexes based on static columns.
Local secondary indexes based on secondary columns are not planned to be supported and are explicitly forbidden. Because there is only one static row per partition and local indexes require full partition key when querying, such indexes wouldn't be very useful and would only waste resources.
The index table for secondary indexes on static columns, unlike other secondary indexes, do not contain clustering keys from the base table. A static column's value determines a set of full partitions, so the clustering keys would only be unnecessary.
The already existing logic for querying using secondary indexes works after introducing minimal notifications. The view update generation path now works on a common representation of static and clustering rows, but the new representation allowed to keep most of the logic intact.
New cql-pytests are added. All but one of the existing tests for secondary indexes on static columns - ported from Cassandra - now work and have their `xfail` marks lifted; the remaining test requires support for collection indexing, so it will start working only after #2962 is fixed.
Materialized view with static rows as a key are __not__ implemented in this PR.
Fixes: #2963Closes#11166
* github.com:scylladb/scylladb:
test_materialized_view: verify that static columns are not allowed
test_secondary_index: add (currently failing) test for static index paging
test_secondary_index: add more tests for secondary indexes on static columns
cassandra_tests: enable existing tests for static columns
create_index_statement: lift restriction on secondary indexes on static rows
db/view: fetch and process static rows when building indexes
gms/feature_service: introduce SECONDARY_INDEXES_ON_STATIC_COLUMNS cluster feature
create_index_statement: disallow creation of local indexes with static columns
select_statement: prepare paging for indexes on static columns
select_statement: do not attempt to fetch clustering columns from secondary index's table
secondary_index_manager: don't add clustering key columns to index table of static column index
replica/table: adjust the view read-before-write to return static rows when needed
db/view: process static rows in view_update_builder::on_results
db/view: adjust existing view update generation path to use clustering_or_static_row
column_computation: adjust to use clustering_or_static_row
db/view: add clustering_or_static_row
deletable_row: add column_kind parameter to is_live
view_info: adjust view_column to accept column_kind
db/view: base_dependent_view_info: split non-pk columns into regular and static
Adds a test which verifies that static columns are not allowed in
materialized views. Although we added support for static columns in
secondary indexes, which share a lot of code with materialized views,
static columns in materialized views are not yet ready to use.
Currently, when executing queries accelerated by an index on a static
column, paging is unable to break base table partitions across pages and
is forced to return them in whole. This will cause problems if such a
query must return a very large base table partition because it will have
to be loaded into memory.
Fixing this issue will require a more sophisticated approach than what
was done in the PR. For the time being, an xfailing pytest is added
which should start passing after paging is improved.
We need to obtain the Raft ID of the replaced node during the shadow round and
place it in the address map. It won't be placed by the regular gossiping route
if we're replacing using the same IP, because we override the application state
of the replaced node. Even if we replace a node with a different IP, it is not
guaranteed that background gossiping manages update the address map before we
need it, especially in tests where we set ring_delay to 0 and disable
wait_for_gossip_to_settle. The shadow round, on the other hand, performs a
synchronous request (and if it fails during bootstrap, bootstrap will fail -
because we also won't be able to obtain the tokens and Host ID of the replaced
node).
Fetch the Raft ID of the replaced node in `prepare_replacement_info`,
which runs the shadow round. Return it in `replacement_info`. Then
`join_token_ring` passes it to `setup_group0`, which stores it in the
address map. It does that after `join_group0` so the entry is
non-expiring (the replaced node is a member of group 0). Later in the
replace procedure, we call `remove_from_group0` for the replaced node.
`remove_from_group0` will be able to reverse-translate the IP of the
replaced node to its Raft ID using the address map.
Also remove an unconditional 60 seconds sleep from the replace code. Make it
dependent on ring_delay.
Enable the replace tests.
Modify some code related to removing servers from group 0 which depended on
storing IP addresses in the group 0 configuration.
Closes#12172
* github.com:scylladb/scylladb:
test/topology: enable replace tests
service/raft: report an error when Raft ID can't be found in `raft_group0::remove_from_group0`
service: handle replace correctly with Raft enabled
gms/gossiper: fetch RAFT_SERVER_ID during shadow round
service: storage_service: sleep 2*ring_delay instead of BROADCAST_INTERVAL before replace
This change removes sstables.hh from some other headers replacing it
with version.hh and shared_sstable.hh. Also this drops
sstables_manager.hh from some more headers, because this header
propagates sstables.hh via self. That change is pretty straightforward,
but has a recochet in database.hh that needs disk-error-handler.hh.
Without the patch touch sstables/sstable.hh results in 409 targets
recompillation, with the patch -- 299 targets.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12222
There are two similarly named classes: ::test_region_group and
dirty_memory_manager_logalloc::test_region_group. Rename the
former to ::raii_region_group (that's what it's for) and the
latter to ::test_region_group, to reduce confusion.
While deletable_row is used to hold regular columns of a clustering row,
its name or implementation doesn't suggest that it is a requirement. In
fact, some of its methods already take a column_kind parameter which is
used to interpret the kind of columns held in the row.
This commit removes the assumption about the column kind from the
`deletable_row::is_live` method.
This PR hits two goals for "object storage" effort
1. Sstables loader "knows" that sstables components are stored in a Linux directory and uses utils/lister to access it. This is not going to work with sstables over object storage, the loader should be abstracted from the underlying storage.
2. Currently class keyspace and class column_family carry "datadir" and "all_datadirs" on board which are path on local filesystem where sstable files are stored (those usually started with /var/lib/scylla/data). The paths include subsdirs like "snapshots", "staging", etc. This is not going to look nice for obejct storage, the /var/lib/ prefix is excessive and meaningless in this case. Instead, ks and cf should know their "location" and some other component should know the directory where in which the files are stored.
Said that, this PR prepares distributed_loader and sstables_directly to stop using Linux paths explicitly by making both call sstables_manager to list and open sstables object. After it will be possible to teach manager to list sstables from object storage. Also this opens the way to removing paths from keyspace and column_family classes and replacing those with relative "location"s.
Closes#12128
* github.com:scylladb/scylladb:
sstable_directory: Get components lister from manager
sstable_directory: Extract directory lister
sstable_directory: Remove sstable creation callback
sstable_directory: Call manager to make sstables
sstable_directory: Keep error handler generator
sstable_directory: Keep schema_ptr
sstable_directory: Use directory semaphore from manager
sstable_directory: Keep reference on manager
tests: Use sstables creation helper in some cases
sstables_manager: Keep directory semaphore reference
sstables, code: Wrap directory semaphore with concurrency
Currently if a node that is outside of the config tries to add an entry
or modify config transient error is returned and this causes the node
to retry. But the error is not transient. If a node tries to do one of
the operations above it means it was part of the cluster at some point,
but since a node with the same id should not be added back to a cluster
if it is not in the cluster now it will never be.
Return a new error not_a_member to a caller instead.
Message-Id: <Y42mTOx8bNNrHqpd@scylladb.com>
Yet another continuation to previous patch -- IO error handlers
generator is also needed to create sstables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Continuation of one-before-previous patch. In order to create sstable
without external lambda the directory code needs schema.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After previous patch sstables_directory code may no longer require for
semaphore argument, because it can get one from manager. This makes the
directory API shorter and simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstables_directly accesses /var/lib/scylla/data in two ways -- lists
files in it and opens sstables. The latter is abdtracted with the help
of lambdas passed around, but the former (listing) is done by using
directory liters from utils.
Listing sstables components with directlry lister won't work for object
storage, the directory code will need to call some abstraction layer
instead. Opening sstables with the help of a lambda is a bit of
overkill, having sstables manager at hand could make it much simpler.
Said that, this patch makes sstables_directly reference sstables_manager
on start.
This change will also simplify directory semaphore usage (next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Several test cases push sstables creation lambda into
with_sstables_directory helper. There's a ready to use helper class that
does the same. Next patch will make additional use of that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently this is a sharded<semaphore> started/stopped in main and
referenced by database in order to be fed into sstables code. This
semaphore always comes with the "concurrency" parameter that limits the
parallel_for_each parallelizm.
This patch wraps both together into directory_semaphore class. This
makes its usage simpler and will allow extending it in the future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
One of the prerequisites to make sstables reside on object-storage is not to let the rest of the code "know" the filesystem path they are located on (because sometimes they will not be on any filesystem path). This patch makes the methods that can reveal this path back private so that later they can be abstracted out.
Closes#12182
* github.com:scylladb/scylladb:
sstable: Mark some methods private
test: Don't get sstable dir when known
test: Use move_to_quarantine() helper
test: Use sstable::filename() overload without dir name
sstables: Reimplement batch directory sync after move
table, tests: Make use of move_to_new_dir() default arg
sstables: Remove fsync_directory() helper
table: Simplify take_snapshot()'s collecting sstables names
The test enables an error injection inside the Raft upgrade procedure
on one of the nodes which will cause the node to throw an exception
before entering `synchronize` state. Then it restarts other nodes with
Raft enabled, waits until they enter `synchronize` state, puts them in
RECOVERY mode, removes the error-injected node and creates a new Raft
group 0.
As soon as the other nodes enter `synchronize`, the test disabled the
error injection (the rest of the test was outside the `async with
inject_error(...)` block). There was a small chance that we disabled the
error injection before the node reached it. In that case the node also
entered `synchronize` and the cluster managed to finish the upgrade
procedure. We encountered this during next promotion.
Eliminate this possibility by extending the scope of the `async with
inject_error(...)` block, so that the RECOVERY mode steps on the other
nodes are performed within that block.
Closes#12162
The sstable_move_test creates sstables in its own temp directories and
the requests these dirs' paths back from sstables. Test can come with
the paths it has at hand, no need to call sstables for it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Two places in tests move sstable to quarantine subdir by hand. There's
the class sstable method that does the same, so use it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The dir this place currently uses is the directory where the sstable was
created, so dropping this argument would just render the same path.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question accepts boolean bit whether or not it should sync
directories at the end. It's always true but in one case, so there's the
default value for it. Make use of it.
Anticipating the suggestion to replace bool with bool_class -- next
patch will replace it with something else.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.
There are multiple reasons to do so.
One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.
Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.
Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.
Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.
To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).
Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.
Closes#12107
* github.com:scylladb/scylladb:
service/raft: specialized verb for failure detector pinger
db: system_keyspace: de-staticize `{get,set}_raft_server_id`
service/raft: make this node's Raft ID available early in group registry
The method became unused since 70e5252a (table: no longer accept online
loading of SSTable files in the main directory) and the whole concept of
reshuffling sstables was dropped later by 7351db7c (Reshape upload files
and reshard+reshape at boot).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12165
We used GOSSIP_ECHO verb to perform failure detection. Now we use
a special verb DIRECT_FD_PING introduced for this purpose.
There are multiple reasons to do so.
One minor reason: we want to use the same connection as other Raft
verbs: if we can't deliver Raft append_entries or vote messages
somewhere, that endpoint should be marked dead; if we can, the
endpoint should be marked alive. So putting pings on the same
connection as the other Raft verbs is important when dealing with
weird situations where some connections are available but others are
not. Observe that in `do_get_rpc_client_idx`, we put the new verb in
the right place.
Another minor reason: we remove the awkward gossiper `echo_pinger`
abstraction which required storing and updating gossiper generation
numbers. This also removes one dependency from Raft service code to
gossiper.
Major reason 1: the gossip echo handler has a weird mechanism where a
replacing node returns errors during the replace operation to some of
the nodes. In Raft however, we want to mark servers as alive when they
are alive, including a server running on a node that's replacing
another node.
Major reason 2, related to the previous one: when server B is
replacing server A with the same IP, the failure detector will try to
ping both servers. Both servers are mapped to the same IP by the
address map, so pings to both servers will reach server B. We want
server B to respond to the pings destined for server B, but not to
pings destined for server A, so the sender can mark B alive but keep A
marked dead.
To do this, we include the destination's Raft ID in our RPCs. The
destination compares the received ID with its own. If it's different,
it returns a `wrong_destination` response, and the failure detector
knows that the ping did not reach the destination (it reached someone
else).
Yet another reason: removes "Not ready to respond gossip echo
message" log spam during replace.
Raft ID was loaded or created late in the boot procedure, in
`storage_service::join_token_ring`.
Create it earlier, as soon as it's possible (when `system_keyspace`
is started), pass it to `raft_group_registry::start` and store it inside
`raft_group_registry`.
We will use this Raft ID stored in group registry in following patches.
Also this reduces the number of disk accesses for this node's Raft ID.
It's now loaded from disk once, stored in `raft_group_registry`, then
obtained from there when needed.
This moves `raft_group_registry::start` a bit later in the startup
procedure - after `system_keyspace` is started - but it doesn't make
a difference.
In a recent commit 757d2a4, we removed the "xfail" mark from the test
test_manual_requests.py::test_too_large_request_content_length
because it started to pass on more modern versions of Python, with a
urllib3 bug fixed.
Unfortunately, the celebration was premature: It turns out that although
the test now *usually* passes, it sometimes fails. This is caused by
a Seastar bug scylladb/seastar#1325, which I opened #12166 to track
in this project. So unfortunately we need to add the "xfail" mark back
to this test.
Note that although the test will now be marked "xfail", it will actually
pass most of the time, so will appear as "xpass" to people run it.
I put a note in the xfail reason string as a reminder why this is
happening.
Fixes#12143
Refs #12166
Refs scylladb/seastar#1325
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12169
The field was not used for anything. We can keep decommissioned server
in `stopped` field.
In fact it caused us a problem: since recently, we're using
`ScyllaCluster.uninstall` to clean-up servers after test suite finishes
(previously we were using `ScyllaServer.uninstall` directly). But
`ScyllaCluster.uninstall` didn't look into the `decommissioned` field,
so if a server got decommissioned, we wouldn't uninstall it, and it left
us some unnecessary artifacts even for successful tests. This is now
fixed.
Closes#12163
Mainly this PR removes global db::config and feature service that are used by sstables::test_env as dependencies for embedded sstables_manager. Other than that -- drop unused methods, remove nested test_env-s and relax few cases that use two temp dirs at a time for no gain.
Closes#12155
* github.com:scylladb/scylladb:
test, utils: Use only one tempdir
sstable_compaction_test: Dont create nested envs
mutation_reader_test: Remove unused create_sstable() helper
tests, lib: Move globals onto sstables::test_env
tests: Use sstables::test_env.db_config() to access config
features: Mark feature_config_from_db_config const
sstable_3_x_test: Use env method to create sst
sstable_3_x_test: Indentation fix after previous patch
sstable_3_x_test: Use sstable::test_env
test: Add config to sstable::test_env creation
config: Add constexpr value for default murmur ignore bits
There's a do_with_cloned_tmp_directory that makes two temp dirs to toss
sstables between them. Make it go with just one, all the more so it
would resemble existing manipulations aroung staging/ subdir
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The "compact" test case runs in sstables::test_env and additionally
wraps it with another instance provided by do_with_tmp_directory helper.
It's simpler to create the temp dir by hand and use outter env.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a bunch of objects that are used by test_env as sstables_manager
dependencies. Now when no other code needs those globals they better sit
on the test_env next to the manager
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently some places use global test config, but it's going to be
removed soon, so switch to using config from environment
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several cases there that construct sstables_manager by hand
with the help of a bunch of global dependencies. It's nicer to use
existing wrapper.
(indentation left broken until next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
To make callers (tests) construct it with different options. In
particular, one test will soon want to construct it with custom large
data handler of its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
... and use in some places of sstable_compaction_test. This will allow
getting rid of global test_db_config thing later
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the core of dynamic IP address support in Raft, moving out the
IP address sourcing from Raft Group 0 configuration to gossip. At start
of Raft, the raft id <> IP address translation map is tuned into the
gossiper notifications and learns IP addresses of Raft hosts from them.
The series intentionally doesn't contain the part which speeds up the
initial cluster assembly by persisting the translation cache and using
more sources besides gossip (discovery, RPC) to show correctness of the
approach.
Closes#12035
* github.com:scylladb/scylladb:
raft: (rpc) do not throw in case of a missing IP address in RPC
raft: (address map) actively maintain ip <-> raft server id map
The site member is created in ScyllaCluster.start(), for startup failure
this might not be initialized, so check it's present before stop()ing
it. And delete it as it's not running and proper initialization should
call ScyllaCluster.start().
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#11939
A question was raised on what fetch_size (the requested page size
in a paged scan) counts when there is a filter: does it count the
rows before filtering (as scanned from disk) or after filter (as
will be returned to the client)?
This patch adds a test which demonstrates that Cassandra and Scylla
behave differently in this respect: Cassandra counts post-filtering -
so fetch_size results are actually returned, while Scylla currently
counts pre-filtering.
It is arguable which behavior is the "correct" one - we discuss this in
issue #12102. But we have already had several users (such as #11340)
who complained about Scylla's behavior and expected Cassandra's behavior,
so if we decide to keep Scylla's behavior we should at least explain and
justify this decision in our documentation. Until then, let's have this
test which reminds us of this incompatibility. This test currently passes
on Cassandra and fails (xfail) on Scylla.
Refs #11340
Refs #12102
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12103
This patch adds a regression test for the old issue #65 which is about
a multi-column (tuple) clustering-column relation in a SELECT when one
these columns has reversed order. It turns out that we didn't notice,
but this issue was already solved - but we didn't have a regression test
for it. So this patch adds just a regression test. The test confirms that
Scylla now behaves like was desired when that issue was opened. The test
also passes on Cassandra, confirming that Scylla and Cassandra behave
the same for such requests.
Fixes#65
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12130