Commit Graph

52 Commits

Andy Grover
820b7295f0 cleanup: Unused LIST_HEADs
Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-05 16:23:41 -07:00
Zach Brown
3de703757f Fix weird comment editing error
That comment looked very weird indeed until I recognized that I must
have forgotten to delete the first two attempts at starting the
sentence.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-16 12:02:05 -07:00
Andy Grover
cf278f5fa0 scoutfs: Tidy some enum usage
Prefer named to anonymous enums. This helps readability a little.

Use enum as param type if possible (a couple spots).

Remove unused enum in lock_server.c.

Define enum spbm_flags using shift notation for consistency (see the
sketch below).

Rename get_file_block()'s "gfb" parameter to "flags" for consistency.
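
A minimal sketch of the shift notation style; the flag names here are
hypothetical:

    enum spbm_flags {
            SPBM_FLAG_FIRST         = (1 << 0),
            SPBM_FLAG_SECOND        = (1 << 1),
            SPBM_FLAG_THIRD         = (1 << 2),
    };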

Signed-off-by: Andy Grover <agrover@versity.com>
2020-11-30 13:35:44 -08:00
Andy Grover
e6228ead73 scoutfs: Ensure padding in structs remains zeroed
Audit code for structs allocated on stack without initialization, or
using kmalloc() instead of kzalloc().  A minimal sketch of the hazard
follows the list below.

- avl.c: zero padding in avl_node on insert.
- btree.c: Verify item padding is zero, or WARN_ONCE.
- inode.c: scoutfs_inode contains scoutfs_timespecs, which have padding.
- net.c: zero pad in net header.
- net.h: scoutfs_net_addr has padding, zero it in scoutfs_addr_from_sin().
- xattr.c: scoutfs_xattr has padding, zero it.
- forest.c: item_root in forest_next_hint() appears to either be
    assigned-to or unused, so no need to zero it.
- key.h: Ensure padding is zeroed in scoutfs_key_set_{zeros,ones}
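
A minimal sketch of the hazard, with a hypothetical struct:

    struct example {
            __le64 val;
            __u8 flag;
            /* 7 bytes of implicit tail padding follow "flag" */
    };

    struct example ex;
    memset(&ex, 0, sizeof(ex));             /* stack: zero explicitly */

    struct example *p = kzalloc(sizeof(*p), GFP_NOFS); /* heap: not kmalloc() */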

Signed-off-by: Andy Grover <agrover@versity.com>
2020-10-29 14:15:33 -07:00
Zach Brown
edd8fe075c scoutfs: remove lsm code
Remove all the now unused code that deals with lsm: segment IO, the item
cache, and the manifest.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
feaf17c3a5 scoutfs: add conn destroy workq
Lockdep gets angry when we try to destroy an accepted conn workqueue
from within work in a listening conn's workqueue.  It doesn't recognize
that they have a hierarchical relationship that maintains a consistent
order and we can't get at the workqueue lockdep_map to set subclasses.
We add a destroy workqueue which will have its own class.
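
A sketch of the approach, with assumed names; the dedicated workqueue
gets its own lockdep class, so destroying an accepted conn's workqueue
from within it doesn't look like recursion:

    static void destroy_conn_worker(struct work_struct *work)
    {
            struct conn *conn = container_of(work, struct conn,
                                             destroy_work);

            destroy_workqueue(conn->workq);
            kfree(conn);
    }

    /* instead of calling destroy_workqueue() from the listener's work */
    queue_work(destroy_wq, &conn->destroy_work);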

Signed-off-by: Zach Brown <zab@versity.com>
2019-08-20 15:52:13 -07:00
Zach Brown
ec7f60bebb scoutfs: net conn lifetime tracing
Add trace events for network connections.

Signed-off-by: Zach Brown <zab@versity.com>
2019-08-20 15:52:13 -07:00
Zach Brown
ab7bde9e2c scoutfs: replace node_id with rid in networking
Use the client's rid in networking instead of the node_id.

The node_id no longer has to be allocated by the server and sent in the
greeting.  Instead the client sends its rid to the server in its greeting.

The server then uses the client's announced rid just like it used to use
its node_id.  It's used to record clients in the btree and to
identify clients in send and receive processing.

The use of the rid in networking calls makes its way to locking and
compaction which now use the rid to identify clients instead of the
node_id.

Signed-off-by: Zach Brown <zab@versity.com>
2019-08-20 15:52:13 -07:00
Zach Brown
36b0df336b scoutfs: add unmount barrier
Now that a mount's client is responsible for electing and starting a
server we need to be careful about coordinating unmount.  We can't
let unmounting clients leave the remaining mounted clients without
quorum.

The server carefully tracks who is mounted and who is unmounting while
it is processing farewell requests.  It only sends responses to voting
mounts while quorum remains or once all the voting clients are
trying to unmount.

We use a field in the quorum blocks to communicate to the final set of
unmounting voters that their farewells have been processed and that they
can finish unmounting without trying to re-establish quorum.

The commit introduces and maintains the unmount_barrier field in the
quorum blocks.  It is passed to the server from the election, the
server sends it to the client and writes new versions, and the client
compares what it received with what it sees in quorum blocks.
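
A hypothetical sketch of the client-side check, names assumed:

    /* the server advanced the barrier past what it sent us, so our
     * farewell was processed and we can skip re-establishing quorum */
    if (le64_to_cpu(blk->unmount_barrier) > client->greeting_unmount_barrier)
            finish_unmount();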

The commit then has the clients send their unique name to the server
which stores it in persistent mounted client records and compares the
names to the quorum config when deciding which farewell requests
can be responded to.

Now that farewell response processing can block for a very long time it
is moved off into async work so that it doesn't prevent net connections
from being shut down and re-established.  This also makes it easier to
make global decisions based on the count of pending farewell requests.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
fa3e0a31c7 scoutfs: use SO_REUSEADDR for server socket
The server's listening address is fixed by the raft config in the super
block.  If it shuts down and rapidly starts back up it needs to bind to
the currently lingering address.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
0bc0ff9300 scoutfs: add clock sync trace events
Generate unique trace events on the send and recv side of each message
sent between nodes.  This can be used to reasonably efficiently
synchronize the monotonic clock in trace events between nodes given only
their captured trace events.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
a546bd0aab scoutfs: check for newlines in msg.h wrappers
The message formatter adds a newline so callers don't have to.  But
sometimes they do and we get double newlines.  Add a build check that
the format string doesn't end in a newline so that we stop adding these.
And fix up all the current offenders.
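
A sketch of such a check, assuming the wrappers take string literals;
indexing a literal is a compile-time constant, so BUILD_BUG_ON() can
reject a trailing newline:

    #define SCOUTFS_FMT_CHECK(fmt)                                  \
            BUILD_BUG_ON(__builtin_constant_p(fmt) &&               \
                         sizeof(fmt) > 1 &&                         \
                         (fmt)[sizeof(fmt) - 2] == '\n')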

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
74366f0df1 scoutfs: make networking more reliable
The current networking code has loose reliability guarantees.  If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection.  The client resends
requests but no responses are resent.  A client's requests could be
processed twice on the same server.  The server throws away disconnected
client state.

This was fine, sort of, for the simple requests we had implemented so
far.  It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.

This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.

The server keeps track of disconnected clients and restores state if the
same client reconnects.  This required some work around the greetings so
that clients and servers can recognize each other.  Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.

Now that connections between the client and server are preserved we can
resend responses across reconnection.  We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.
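
A sketch of the recv-side duplicate check, field names assumed:

    u64 seq = le64_to_cpu(nh->seq);

    if (seq <= conn->last_recv_seq)
            return 0;               /* duplicate of a resent message */
    conn->last_recv_seq = seq;      /* echoed back so the peer frees */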

When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.

This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
f472c0bc87 scoutfs: add scoutfs_net_response_node()
Today a response can only be sent down the connection that delivered
its request, and only while the request is being processed.  We'll be adding
subsystems that need to send responses asynchronously after initial
request processing.  Give them a call to send a response to a node id
instead of to a node's connection.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
6caa87458b scoutfs: add scoutfs_net_client_node_id()
Some upcoming network request processing paths need access to the
connected client's node_id.  We could add it to the arguments but that'd
be a lot of churn so we'll add an accessor function for now.
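
The accessor might look something like this, assuming the connection
records the node_id during the greeting exchange:

    u64 scoutfs_net_client_node_id(struct scoutfs_net_connection *conn)
    {
            return conn->node_id;
    }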

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
8fedfef1cc scoutfs: remove stale net response data comment
There was a time when responding with an error wouldn't include the
caller's data payload.  That hasn't been the case since we added
compaction network requests which include a reference to the compaction
operation with the error response.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
2cc990406a scoutfs: compact using net requests
Currently compaction is only performed by one thread running in the
server.  Total metadata throughput of the system is limited by only
having one compaction operation in flight at a time.

This refactors the compaction code to have the server send compaction
requests to clients who then perform the compaction and send responses
to the server.  This spreads compaction load out amongst all the clients
and greatly increases total compaction throughput.

The manifest keeps track of compactions that are in flight at a given
level so that we maintain segment count invariants with multiple
compactions in flight.  It also uses the sparse bitmap to lock down
segments that are being used as inputs to avoid duplicating items across
two concurrent compactions.

A server thread still coordinates which segments are compacted.  The
search for a candidate compaction operation is largely unchanged.  It
now has to deal with being unable to process a compaction because its
segments are busy.  We add some logic to keep searching in a level until
we find a compaction that doesn't intersect with current compaction
requests.  If there are none at the level we move up to the next level.

The server will only issue a given number of compaction requests to a
client at a time.  When it needs to send a compaction request it rotates
through the current clients until it finds one that doesn't have the max
in flight.
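
A sketch of the selection loop, with assumed names; a real version
would also remember where it left off so the rotation is fair:

    list_for_each_entry(client, &server->clients, head) {
            if (client->compacts_in_flight < MAX_CLIENT_COMPACTS)
                    return client;
    }
    return NULL;            /* everyone's at the limit, retry later */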

If a client disconnects the server forgets the compactions it had sent
to that client.  If those compactions still need to be processed they'll
be sent to the next client.

The segnos that are allocated for compaction are not reclaimed if a
client disconnects or the server crashes.  This is a known deficiency
that will be addressed with the broader work to add crash recovery to
the multiple points in the protocol where the server and client trade
ownership of persistent state.

The server needs to block as it does work for compaction in the
notify_up and response callbacks.  We move them out from under spin
locks.

The server needs to clean up allocated segnos for a compaction request
that fails.  We let the client send a data payload along with an error
response so that it can give the server the id of the compaction that
failed.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
07eec357ee scoutfs: simplify reliable request delivery
It was a bit of an overreach to try and limit duplicate request
processing in the network layer.  It introduced acks and the necessity
to resync last_processed_id on reconnect.

In testing compaction requests we saw that request processing stopped if
a client reconnected to a new server.  The new server sent low request
ids which the client dropped because they were lower than the ids it got
from the last server.  To fix this we'd need to add smarts to reset
ids when connecting to new servers but not existing servers.

In thinking about this, though, there's a bigger problem.  Duplicate
request processing protection only works up in memory in the networking
connections.  If the server makes persistent changes, then crashes, the
client will resend the request to the new server.  It will need to
discover that the persistent changes have already been made.

So while we protected duplicate network request processing between nodes
that reconnected, we didn't protect duplicate persistent side-effects
of request processing when reconnecting to a new server.  Once you see
that the request implementations have to take this into account then
duplicate request delivery becomes a simpler instance of this same case
and will be taken care of already.  There's no need to implement the
complexity of protecting duplicate delivery between running nodes.

This removes the last_processed_id on the server.  It removes resending
of responses and acks.  Now that ids can be processed out of order we
remove the special known ID of greeting commands.  They can be processed
as usual.  When there are only request and response packets we can
differentiate them with a flag instead of a u8 message type.
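
A sketch of the resulting header, layout assumed:

    #define SCOUTFS_NET_FLAG_RESPONSE (1 << 0)

    struct net_header_sketch {
            __le64 id;              /* matches a response to its request */
            __le16 data_len;
            __u8 cmd;
            __u8 flags;             /* request if clear, response if set */
    };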

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
62d6c11e3c scoutfs: clean up workqueue flags
We had gotten a bit sloppy with the workqueue flags.  We needed _UNBOUND
in some workqueues where we wanted concurrency by scheduling across cpus
instead of waiting for the current (very long running) work on a cpu to
finish.  We add NON_REENTRANT out of an abundance of caution.  It has
gone away in modern kernels and is probably not needed here, but
according to the docs we would want it so we at least document that fact
by using it.
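
A sketch of the flags in use; WQ_NON_REENTRANT only exists on the older
kernels this targets, as noted above:

    wq = alloc_workqueue("scoutfs_net", WQ_UNBOUND | WQ_NON_REENTRANT, 0);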

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
30d5471e4a scoutfs: call net response func outside lock
Today response processing calls a request's response callback from
inside the net spinlock.  This happened to work for the synchronous
blocking request handler who only had to record the result and wake
their waiter.

It doesn't work for server compact response processing which needs to
use IO to commit the result of the compaction.

This lifts the call to the response function out of complete_send() and
into the response processing work function.  Other complete_send()
callers now won't trigger the response function call and can't see
errors, which they all ignored anyway.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
0adbd7e439 scoutfs: have server track connected clients
This extends the notify up and down calls to let the server keep track
of connected clients.

It adds the notion of per-connection info that is allocated for each
connection.  It's passed to the notification callbacks so that callers
can have per-client storage without having to manage allocations in the
callbacks.

It adds the node_id argument to the notification callbacks to indicate
if the call is for the listening socket itself or an accepted client
connection on that listening socket.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
746293987c scoutfs: let server send msg to specific node_id
The current sending interfaces only send a message to the peer of a
given connection.  For the server to send to a specific connected client
it'd have to track connections itself and send to them.

This adds a sending interface that uses the node_id to send to a
specific connected client.  The conn argument is the listening socket
and its accepted sockets are searched for the destination node_id.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
8b3193ea72 scoutfs: server allocates node_id
Today node_ids are randomly assigned.  This adds the risk of failure
from random number generation and still allows for the risk of
collisions.

Switch to assigning strictly advancing node_ids on the server during the
initial connection greeting message exchange.  This simplifies the
system and allows us to derive information from the relative values of
node_ids in the system.

To do this we refactor the greeting code from internal to the net layer
to proper client and server request and response processing.  This lets
the server manage persistent node_id storage and allows the client to
wait for a node_id during mount.

Now that net_connect is sync in the client we don't need the notify_up
callback anymore.  The client can perform those duties when the connect
returns.

The net code still has to snoop on request and response processing to
see when the greetings have been exchanged and allow messages to flow.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
07df8816e3 scoutfs: add trivial seq file for net messages
Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
8ff3ef3131 scoutfs: add trivial seq file for net connections
Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
17dec65a52 scoutfs: add bidirectional network messages
The client and server networking code was a bit too rudimentary.

The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to.  We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.

This refactors sending and receiving in both the client and server code
into shared networking code.  It's built around a connection struct that
then holds the message state.  Both peers on the connection can send
requests and send responses.

The existing code only retransmitted requests down newly established
connections.  Requests could be processed twice.

This adds robust reliability guarantees.  Requests are resent until
their response is received.  Requests are only processed once by a given
peer, regardless of the connection's transport socket.  Responses are
reliably resent until acknowledged.

This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal.  A following commit will remove all
the unused code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
c1b2ad9421 scoutfs: separate client and server net processing
The networking code was really suffering by trying to combine the client
and server processing paths into one file.  The code can be a lot
simpler by giving the client and server their own processing paths that
take their different socket lifecycles into account.

The client maintains a single connection.  Blocked senders work on the
socket under a sending mutex.  The recv path runs in work that can be
canceled after first shutting down the socket.

A long running server work function acquires the listener lock, manages
the listening socket, and accepts new sockets.  Each accepted socket has
a single recv work blocked waiting for requests.  That then spawns
concurrent processing work which sends replies under a sending mutex.
All of this is torn down by shutting down sockets and canceling work
which frees its context.

All this restructuring makes it a lot easier to track what is happening
in mount and unmount between the client and server.  This fixes bugs
where unmount was failing because the monolithic socket shutdown
function was queueing other work while it was draining.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-04 10:47:42 -07:00
Mark Fasheh
a65b28d440 scoutfs: lock impossible ino group for listen lock
Otherwise we get into a problem where the listen lock is conflicting with
regular inode group requests. Since we never drop the listen lock and it (by
design) blocks progress on another node, those inode group requests may
hang.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-19 19:04:41 -05:00
Zach Brown
8a42a4d75a scoutfs: introduce lock names
Instead of locking one resource with ranges we'll have callers map their
logical resources to a tuple name that we'll store in lock resources.
The names still map to ranges for cache reading and cache invalidation
but the ranges aren't exposed to the DLM.  This lets us use the stock
DLM and distribute resources across masters.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
6de2bfc1c5 scoutfs: use the dlm mode/levels directly
We intend to use more of the dlm lock levels.  Let's use its modes
directly so we don't have to maintain a mental map from differently
named modes.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
8d29c82306 scoutfs: sort keys by zone, then inode, then type
Holding a DLM lock protects a range of the key space.  The DLM locks
span inodes or regions of inodes.  We need the sort order in LSM items
to match the DLM range keys so that we can read all the items covered by
a lock into the cache from a region of LSM segments.  If their orders
differed then we'd have to jump around segments to find all the items
covered by a given DLM lock.

Previously we were sorting by type then, within types, by inode.  Now we
want to sort by inode then by type.  But there are structures which
previously had a type but weren't then sorted by inode.  We introduce
zones as the primary sort key.  Inode index and node zones are sorted by
the inode fields and node ids respectively.  Then comes the fs zone
sorted first by inode and then by key type.
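
A sketch of the resulting sort order, with assumed field names and
widths:

    struct key_sketch {
            __u8 zone;              /* primary: inode index, node, or fs */
            __le64 ino;             /* then the inode (or node id) */
            __u8 type;              /* then the item type */
            __le64 offset;          /* finally type-specific fields */
    };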

The bulk of this is the mechanical introduction of the zone field to the
keys, moving the type field down, and a bulk rename of _KEY to _TYPE.
But there are some more substantial changes.

The orphan keys needed to be put in a zone.   They fit in the NODE zone
which is all about resources that nodes hold and would need to be
cleaned up if the node went away.

The key formatting is significantly changed to match the new sort order.
Formatted keys are now generally of the form "zone.primary.type..."

And finally with the keys now properly sorted by inodes we can correctly
construct a single range of item cache keys to invalidate when unlocking
the inode group locks.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
690049c293 scoutfs: add GET_MANIFEST_ROOT network op
We're going to need to be able to sample the current stable manifest
root occasionally.  We're adding it now because we don't yet
have the lock plumbing that would provide the lvb.  Eventually
this call will bubble up into the locking and the root will be
stored in the lock instead of always requested.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
c2f13ccf24 scoutfs: have net.c commit btree blocks
Convert the net server metadata dirtying and committing code to use the
btree instead of the ring.  It has to be careful to set up and tear down
the btree info as it starts up and shuts down the server.

This fixes up some questionable setup/teardown changes made in the
previous patches to convert the manifest and allocator.  We could rebase
the patches to merge those together.  But given that the previous
patches don't work at all without the net updates it might not be worth
the trouble.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Zach Brown
ff5a094833 scoutfs: store allocator regions in btree
Convert the segment allocator to store its free region bitmaps in the
btree.

This is a very straightforward mechanical transformation.  We split the
allocator region into a big-endian index key and the bitmap value
payload.  We're careful to operate on aligned copies of the bitmaps so
that they're long aligned.

We can remove all the funky functions that were needed when writing the
ring.  All we're left with is a call to apply the pending allocations to
dirty btree blocks before writing the btree.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Zach Brown
fc50072cf9 scoutfs: store manifest entries in the btree
Convert the manifest to store entries in persistent btree keys and
values instead of using the rbtree in memory from the ring.

The btree doesn't have a sort function.  It just compares variable
length keys.  The most complicated part of this transformation is
dealing with the fallout of this.  The compare function can't compare
different search keys and item keys so searches need to construct full
synthetic btree keys to search.  It also can't return different
comparisons, like overlapping, so the caller needs to do a bit more work
to use key comparisons to find overlapping segments.  And it can't
compare differently depending on the level of the manifest so we store
the manifest in keys differently depending on whether it's in level 0 or
not.

All mount clients can now see the manifest blocks.  They can query the
manifest directly when trying to find segments to read.  We can get rid
of all the networking calls that were finding the segments for readers.

We change the manifest functions that relied on the ring to
make changes in the manifest persistent.  We don't touch the allocator
or the rest of the manifest server, though, so this commit breaks the
world.  It'll be restored in future patches as we update the segment
allocator and server to work with the btree.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Mark Fasheh
e6f3b3ca8f scoutfs: add lock caching
We refcount our locks and hold them across system calls. If another node
wants access to a given lock we'll mark it as blocking in the bast and queue
a work item so that the lock can later be released. Otherwise locks are
free'd under memory pressure, unmount or after a timer fires.
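
A sketch of the blocking-AST path described above, with assumed names:

    static void scoutfs_bast(void *astarg, int mode)
    {
            struct scoutfs_lock *lck = astarg;

            lck->flags |= SCOUTFS_LOCK_BLOCKING;
            queue_work(lck->sbi->lock_wq, &lck->release_work);
    }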

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:15:11 -05:00
Zach Brown
f7701177d2 scoutfs: throttle addition of level 0 segments
Writers can add level 0 segments much faster (~20x) than compaction can
compact them down into the lower levels.  Without a limit on the number
of level 0 segments, item reading can try to read an extraordinary number
of level 0 segments and wedge the box with nonreclaimable page allocations.
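
A sketch of the throttle, names assumed:

    /* writers wait for compaction to drain level 0 below the limit */
    wait_event(sbi->level0_waitq,
               level0_segment_count(sbi) < LEVEL0_SEGMENT_LIMIT);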

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
5f5729b2a4 scoutfs: add sticky compaction
As we write segments we're not limiting the number of segments they
intersect at the next level.  Compactions are limited to a fanout's
worth of overlapping segments.  This means that we can get a compaction
where the upper level segment overlaps more lower level segments than are
part of the compaction.  In this case we can't write the remaining upper
level items at the lower level because now we can have a level with
segments whose keys intersect.

Instead we detect this compaction case.  We call it sticky because after
merging with the lower level segments the remaining items in the upper
level need to stick to the upper level.  The next time compaction comes
around it'll compact the remaining items with the additional lower
overlapping segments.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Mark Fasheh
e711c15acf scoutfs: use dlm for locking
To actually use it, we first have to copy symbols over from the dlm build
into the scoutfs source directory. Make that happen automatically for us in
the Makefile.

The only users of locking at the moment are mount, unmount and xattr
read/write. Adding more locking calls should be a straight-forward endeavor.

The LVB based server ip communication didn't work out, and LVBs as they are
written don't make sense in a range locking world. So instead, we record the
server ip address in the superblock. This is protected by the listen lock,
which also arbitrates which node will be the manifest server.

We take and drop the dlm lock on each lock/unlock call. Lock caching will
come in a future patch.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-06-23 15:08:02 -05:00
Zach Brown
2bd698b604 scoutfs: set NODELAY and REUSEADDR on net sockets
Add a helper that creates a socket and sets nodelay for all sockets and
set reuseaddr in listening sockets.
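
A sketch of such a helper, assuming this era's four-argument
sock_create_kern() and kernel_setsockopt():

    static int scoutfs_create_sock(struct socket **sockp, bool listener)
    {
            int one = 1;
            int ret;

            ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP,
                                   sockp);
            if (ret)
                    return ret;

            kernel_setsockopt(*sockp, SOL_TCP, TCP_NODELAY,
                              (char *)&one, sizeof(one));
            if (listener)
                    kernel_setsockopt(*sockp, SOL_SOCKET, SO_REUSEADDR,
                                      (char *)&one, sizeof(one));
            return 0;
    }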

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 14:29:05 -07:00
Zach Brown
b7bbad1fba scoutfs: add precise transaction item reservations
We had a simple mechanism for ensuring that a transaction didn't create
more items than would fit in a single written segment.  We calculated
the most dirty items that a holder could generate and assumed that all
holders dirtied that much.

This had two big problems.

The first was that it wasn't accounting for nested holds.
write_begin/end calls the generic inode dirtying path while holding a
transaction.  This ended up deadlocking as the dirty inode waited to be
able to write while the trans hold it took back in write_begin prevented
writeout.

The second was that the worst case (full size xattr) item dirtying is
enormous and meaningfully restricts concurrent transaction holders.
With no currently dirty items you can have fewer than 16 full size xattr
writes.  This concurrency limit only gets worse as the transaction fills
up with dirty items.

This fixes those problems.  It adds precise accounting of the dirty
items that can be created while a transaction is held.  These
reservations are tracked in journal_info so that they can be used by
nested holds.  The precision allows much greater concurrency as
something like a create will try to reserve a few hundred bytes instead
of 64k.  Normal sized xattr operations won't try to reserve the largest
possible space.
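
A sketch of how nested holds might share a reservation through
journal_info, names assumed:

    struct trans_hold {
            unsigned int items;     /* remaining reserved items */
            unsigned int vals;      /* remaining reserved value bytes */
            int count;              /* nesting depth */
    };

    /* an inner hold reuses the outer reservation instead of deadlocking */
    struct trans_hold *hold = current->journal_info;

    if (hold) {
            hold->count++;
            return 0;
    }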

We add some feedback from the item cache to the transaction to issue
warnings if a holder dirties more items than it reserved.

Now that we have precise item/key/value counts (segment space
consumption is a function of all three :/) we can't have a single atomic
counter track transaction holders.  We add a long-overdue trans_info and
put a
proper lock and fields there and much more clearly track transaction
serialization amongst the holders and writer.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:15:13 -07:00
Zach Brown
5f11cdbfe5 scoutfs: add and index inode meta and data seqs
For each transaction we send a message to the server asking for a
unique sequence number to associate with the transaction.  When we
change metadata or data of an inode we store the current transaction seq
in the inode and we index it with index items like the other inode
fields.

The server remembers the sequences it gives out.  When we go to walk the
inode sequence indexes we ask the server for the largest stable seq and
limit results to that seq.  This ensures that we never return seqs that
belong to still-dirty items, so inodes and seqs never appear in the past.

Nodes use the sync timer to regularly cycle through seqs and ensure that
inode seq index walks don't get stuck on their otherwise idle seq.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:12:24 -07:00
Zach Brown
373def02f0 scoutfs: remove trade_time message
This was mostly just a demonstration for how to add messages.  We're
about to add a message that we always send on mount so this becomes
completely redundant.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-18 10:52:04 -07:00
Zach Brown
c678923401 scoutfs: don't try to sync on mount errors
kill_sb tries to sync before calling kill_block_super.   It shouldn't do
this on mount errors that wouldn't have initialized the higher level
systems needed for syncing.
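
A sketch of the guard, with an assumed test for how far mount got:

    static void scoutfs_kill_sb(struct super_block *sb)
    {
            /* mount errors can leave the higher levels uninitialized */
            if (SCOUTFS_SB(sb) && SCOUTFS_SB(sb)->trans_info)
                    scoutfs_sync_fs(sb, 1);
            kill_block_super(sb);
    }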

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 10:48:12 -07:00
Zach Brown
6afeb97802 scoutfs: reference file data with extent items
Our first attempt at storing file data put it in items.  This was easy
to implement but won't be acceptable in the long term.  The cost of the
power of LSM indexing is compaction overhead.  That's acceptable for
fine grained metadata but is totally unacceptable for bulk file data.

This switches to storing file data in separate block allocations which
are referenced by extent items.

The bulk of the change is the mechanics of working with extents.  We
have high level callers which add or remove logical extents and then
underlying mechanisms that insert, merge, or split the items that
the extents are stored in.

We have three types of extent items.  The primary type maps logical file
regions to physical block extents.  The next two store free extents
per-node so that clients don't create lock and LSM contention as they
try and allocate extents.
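
A sketch of the three item shapes, fields assumed:

    /* file extent: maps a logical region to a physical block extent */
    struct file_extent_val {
            __le64 blkno;
            __le64 blocks;
            __u8 flags;
    };

    /* per-node free extents, assumed to be indexed both by position
     * and by size so the allocator can search either way:
     *   key: node_id . FREE_BLKNO  . blkno   ->  blocks
     *   key: node_id . FREE_BLOCKS . blocks  ->  blkno
     */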

To fill those per-node free extents we add messages that communicate free
extents in the form of lists of segment allocations from the server.

We don't do any fancy multi-block allocation yet.  We only allocate
blocks in get_blocks as writes find unmapped blocks.  We do use some
per-task cursors to cache block allocation positions so that these
single block allocations are very likely to merge into larger extents as
tasks stream writes.

This is just the first chunk of the extent work that's coming.  A later
patch adds offline flags and fixes up the change nonsense that seemed
like a good idea here.

The final moving part is that we initiate writeback on all newly
allocated extents before we commit the metadata that references the new
blocks.  We do this with our own dirty inode tracking because the high
level vfs methods are unusably slow in some upstream kernels (they walk
all inodes, not just dirty inodes.)

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 10:48:11 -07:00
Zach Brown
d5a2b0a6db Move towards compaction messages
The compaction code is still directly referencing the super block
and calling sync methods as though it was still standalone.  This is
mostly OK because only the server runs it.  But it isn't quite right
because the sync methods no longer make the rings persistent as they
write the item transaction.  The server is in control of that now.

Eventually we'll have compaction messages being sent between the mount
clients and the server.  Let's take a step in that direction by having
the compaction work call net methods to get its compaction parameters
and finish the compaction.  Eventually these would be marshalled through
request/process/reply code.

But in this first step we know that the compaction code is running on
the server so we can forgo all the messaging and just call in to and out
of compaction.  The net calls just holds the ring consistency locks in
the server and call into the manifest to do the work, commiting the
changes when its done.

This is more careful about segno allocation and freeing.  Compaction
doesn't call the allocator directly.  It gets allocations from the
messages and returns them if it doesn't use them.  We actually now
free segnos as they're removed from the manifest.

With the server controlling compaction we can tear all the fiddly level
count watching code out of the manifest.  Item transactions no longer care
about the level counts and the server always tries compaction after the
manifest is updated instead of having the manifest watch the level counts
and call compaction.

Now that the server owns the rings they should not be torn down as the
super is torn down; net does that now.  And we need to be more careful
to be sure that writes from dirtying and compaction are stable before
killing the super.

With all this in place moving to shared compaction involves adding the
messages and negotiating concurrent compactions in the manifest.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-24 14:02:18 -07:00
Zach Brown
cec3f9468a Further isolate rings and compaction
Each mount was still loading the manifest and allocator rings and
starting compaction, even if they were coordinating segment reads
and writes with the server.

This moves ring and compaction setup and teardown from on mount and
unmount to as the server starts up and shuts down.  Now only the server
has the rings resident and is running compaction.

We had to null some of the super info fields so that we can repeatedly
load and destroy the ring indices over the lifetime of a mount.

We also have to be careful not to call between item transactions and
compaction.   We'll restore this functionality with the server in the
future.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
5eefaf34f8 Server updates ring for level0 segment writes
Transaction commits currently directly modify the ring and super block
as segments are written.  As we introduce shared mounts only the server
can modify the ring and super blocks.

This adds network messages to let mounts write items in a level 0
segment while the server modifies the allocator and manifest.

The item transaction commit now sends a message to the server to get an
allocated segno for its new level0 segment and sends a manifest entry to
the server once the segment is written.  The request and reply handlers
for the functions are straight forward.  The processing paths are simple
wrappers around the allocation and update functions that transaction
writing used to call directly.

Now that the item transactions aren't updating the super, sync can't
work with the super sequence numbers.

The server needs to make both allocations and manifest updates
persistent before it sends replies to the client.  We add the ability
for the server processing paths to queue and wait for commits of the
rings and super block.  We can hopefully get reasonable batching by using
a work struct for the commit.  We update the other processing path
callers that modify the rings to use the new commit mechanism.

We add a few segment and manifest functions to work with manifest
entries that describe segments.  This creates a bit of similar looking
code throughout the segment and manifest code but we'll come back and
clean this up once we see what the final shared support looks like.

scoutfs_seg_alloc() now takes the segno from the caller for the segment
it's allocating and inserting into the cache.  Transaction commit uses
the segno it got from the server while compaction still allocates
locally.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
5487aee6a7 Read items with manifest entries from server
Item reading tries to directly walk the manifest to find segments to
read.  That doesn't work when only the server has read the ring and
loaded the manifest.

This adds a network message to ask the server for the manifest entries
that describe the segments that will be needed to read items.

Previously item reading would walk the manifest and build up native
manifest references in a list that it'd use to read.   To implement the
network message we add request sending, processing, and reply parsing
around those original functions.  Item reading now packs its key range
and sends it to the server.  The server walks the manifest and sends the
entries that intersect with the key range.  Then the reply function
builds up the native manifest references that item reading will use.

The net reply functions needed an argument so that the manifest reading
request could pass in the caller's list that the native manifest
references should be added to.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
b50de90196 Alloc inodes from pool from server
Inode allocation was always modifying the in-memory super block.  This
doesn't work when the server is solely responsible for modifying the
super blocks.  We add network messages to have mounts send a message to
the server to request inodes that they can use to satisfy allocation.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00