Quick Background:
-----------------
In pNFS, typically, a READ or a WRITE request goes to the DS (data
server) instead of the MDS (metadata server). However, the MDS is the
entity that maintains the state for the files. Hence, the DS needs to
make all the required checks on READ and WRITE I/O operations as
determined by the NFSv4.1 protocol.
The fundamental idea is that if the MDS would deny a READ or WRITE
operation on a file for lack of proper access rights, stateid,
open mode, or other relevant state tokens, then the DS should also deny
that request. The DS uses a control protocol message, DS_CHECKSTATE, to
communicate with the MDS in order to conduct the required state checking
before allowing an I/O to proceed to the disk.
The description of the DS_CHECKSTATE RPC procedure for the control
protocol between DS and MDS can be found at:
http://wikis.sun.com/display/NFS/pNFS+Control+Protocol+Specification
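The gating rule described above can be sketched in a few lines of C. This
is only an illustration under stated assumptions: every name prefixed with
sk_ is hypothetical, the MDS round trip is replaced by a stored verdict,
and none of this is the code in the gate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-request context; not a type from the gate. */
typedef struct sk_io_ctx {
	int mds_verdict;	/* 0 if the MDS would allow the I/O */
	int disk_ios;		/* number of I/Os that reached the disk */
} sk_io_ctx_t;

/* Stand-in for the DS_CHECKSTATE round trip to the MDS. */
static int
sk_checkstate(sk_io_ctx_t *ctx)
{
	return (ctx->mds_verdict);
}

/* Stand-in for the actual read or write against storage. */
static int
sk_disk_io(sk_io_ctx_t *ctx)
{
	ctx->disk_ios++;
	return (0);
}

/*
 * DS-side gate: the I/O proceeds only if the state check succeeds.
 * A nonzero verdict is returned unchanged and the disk is not touched.
 */
static int
sk_ds_io(sk_io_ctx_t *ctx)
{
	int err;

	assert(ctx != NULL);
	err = sk_checkstate(ctx);
	if (err != 0)
		return (err);
	return (sk_disk_io(ctx));
}
```

Calling sk_ds_io with a nonzero mds_verdict returns that verdict and leaves
disk_ios at zero, which is the "if the MDS would deny it, so does the DS"
property.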
Summary of changes:
-------------------
On receiving an I/O request from the client, the DS issues DS_CHECKSTATE
to the MDS with the file mode, stateid, filehandle, and client owner.
The MDS then validates the stateid (delegation, lock, open), and then
returns the open mode and the clientid for the client owner. Most of
the changes are related to the infrastructure for collecting
the arguments on the DS side, and processing the arguments on the MDS
such that we can validate the state that the client supplies to the DS
in an I/O request.
Further, the DS and MDS pieces have been "nnodified", which means that
the control protocol routines for CHECKSTATE go via the nnode layer as
and when required. As part of nnodification, my changes create the nnode
infrastructure for state operations on the DS and reuse the existing
infrastructure on the MDS.
The walkthrough below covers the overall checkstate infrastructure, and
the new nnode infrastructure on the DS for state operations. Since most
of the changes are related to the mechanisms w.r.t. state checking, the
few design points are interspersed in the walkthrough. Most of the
design issues arise in the processing of the DS_CHECKSTATE RPC request
on the MDS.
Walkthrough
-----------
-Initiation at the DS:
----------------------
On the DS, the starting point for checkstate is in the heart of the read
and write RPC dispatch routines for the NFSv4.1 protocol. The I/Os received
at the DS are processed via the mds_op_read and mds_op_write calls. The
names of the read and write routines may seem a bit odd, given that we
are executing DS routines, but after Sam's integration of nnodes, the
reads and writes use the same RPC dispatch routines for NFSv4.1 calls,
irrespective of whether the calls execute on the MDS or the DS. In the
future, we may name these routines differently, so that they are
agnostic of the actual entity that executes them.
-Initiation of DS_CHECKSTATE via nnodes:
----------------------------------------
The high-level routine that gets invoked from mds_op_read/mds_op_write
for checking state is nnop_check_stateid. When executed on the DS, due
to the magic of nnodes, this call turns into an over-the-wire
DS_CHECKSTATE RPC procedure call. The starting point for the
over-the-wire routine is dserv_mds_checkstate. If you are curious about
how the translation takes place, look at the following structure in
dserv_server.c, where we set up the nnode state operations for the DS:
static nnode_state_ops_t dserv_nnode_state_ops =
Each nnode has an element for state ops and state data. These elements
are initialized at the time of nnode creation in dserv_nnode_build. At
the time of nnode creation, the proposed changes in this webrev set up
the state ops to point to dserv_nnode_state_ops as described above. Thus
nnop_check_stateid will invoke the nnode's state operation
nso_checkstate, which in turn will map to dserv_mds_checkstate.
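For readers new to nnodes, the indirection above is an ordinary
function-pointer table. The sketch below is a simplified illustration,
not the contents of dserv_server.c: the signatures, the nn_* field names,
the mds_nnode_state_ops table, and the last_backend bookkeeping are all
assumptions made for the example. Only the names nnop_check_stateid,
nso_checkstate, dserv_nnode_state_ops, dserv_mds_checkstate, and
check_stateid come from the code under review.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in; the real nnode_state_ops_t has other members. */
typedef struct nnode_state_ops {
	/* Assumed convention: 0 on success, nonzero on failure. */
	int (*nso_checkstate)(void *state_data, int mode);
} nnode_state_ops_t;

typedef struct nnode {
	const nnode_state_ops_t *nn_state_ops;	/* hypothetical field */
	void *nn_state_data;			/* hypothetical field */
} nnode_t;

/* Records which implementation ran, so the dispatch is observable. */
enum sk_backend { SK_NONE, SK_DS_REMOTE, SK_MDS_LOCAL };
static enum sk_backend last_backend = SK_NONE;

/* DS flavor: stands in for the over-the-wire DS_CHECKSTATE call. */
static int
dserv_mds_checkstate(void *state_data, int mode)
{
	(void) state_data;
	(void) mode;
	last_backend = SK_DS_REMOTE;
	return (0);
}

/* MDS flavor: stands in for the local state check. */
static int
check_stateid(void *state_data, int mode)
{
	(void) state_data;
	(void) mode;
	last_backend = SK_MDS_LOCAL;
	return (0);
}

static const nnode_state_ops_t dserv_nnode_state_ops = {
	.nso_checkstate = dserv_mds_checkstate
};

/* Hypothetical name for the MDS-side table. */
static const nnode_state_ops_t mds_nnode_state_ops = {
	.nso_checkstate = check_stateid
};

/* The caller never needs to know which flavor of nnode it holds. */
static int
nnop_check_stateid(nnode_t *np, int mode)
{
	assert(np != NULL && np->nn_state_ops != NULL);
	return (np->nn_state_ops->nso_checkstate(np->nn_state_data, mode));
}
```

The same nnop_check_stateid call reaches dserv_mds_checkstate on a DS
nnode and check_stateid on an MDS nnode; only the ops pointer installed
at nnode creation differs.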
-Preparation of arguments:
--------------------------
The real work of preparing the arguments happens in
cp_ds_mds_checkstateid. Here is how we construct the current arguments:
client_owner4 - from the sessions pointer that can be accessed via
compound_state_t. The derivation of client_owner4 assumes that the
nfs_client_id4 and client_owner4 are comparable. The language
in the spec suggests that this is indeed the case at the server. See
Section 2.4.1 of NFS v4.1 proposed standard
(http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-29#section-2.4.1).
The derivation is based on the sessions pointer,
which is already cached in the compound_state_t. The hold and release on
the database entry for the sessions pointer happens in the context of
rfs41_dispatch, so we do not need to worry about that here.
mode, stateid - obtained directly from the mds_op_read/mds_op_write NFS
v4.1 dispatch routines.
DS filehandle - obtained via the nnode_state_t itself in
dserv_mds_checkstate. Note that the filehandle has to be re-encoded
because it was decoded during the processing of the NFSv4.1 compound RPC
procedure call to the DS. The DS filehandle is opaque to
the client but has to be understood by the DS; hence, it is decoded at
the DS. However, since it is opaque to the client, the MDS expects it to
arrive in encoded form.
The actual dispatch of the RPC happens in dserv_mds_call, which is a
generic routine for RPC calls related to the control protocol, used by
other RPC procedures as well.
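To make the argument list concrete, here is a minimal sketch of gathering
the four inputs into one request structure. It is an illustration only:
the sk_ types and the builder function are hypothetical, the owner is
reduced to its opaque id (the verifier is omitted), and the buffer sizes
are borrowed from the NFSv4.1 constants NFS4_OPAQUE_LIMIT and NFS4_FHSIZE
rather than from the control protocol definition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define	SK_OWNER_MAX	1024	/* sized after NFS4_OPAQUE_LIMIT */
#define	SK_FH_MAX	128	/* sized after NFS4_FHSIZE */

/* Same shape as an NFSv4 stateid: sequence number plus 12 opaque bytes. */
typedef struct sk_stateid {
	uint32_t seqid;
	uint8_t other[12];
} sk_stateid_t;

/* Hypothetical container for the DS_CHECKSTATE inputs. */
typedef struct sk_checkstate_args {
	uint8_t owner[SK_OWNER_MAX];	/* client owner id */
	size_t owner_len;
	int mode;			/* access mode of the I/O */
	sk_stateid_t stateid;		/* stateid from the READ/WRITE */
	uint8_t fh[SK_FH_MAX];		/* DS filehandle, already encoded */
	size_t fh_len;
} sk_checkstate_args_t;

/* Returns 0 on success, -1 if an input does not fit its buffer. */
static int
sk_build_checkstate_args(sk_checkstate_args_t *out,
    const void *owner, size_t owner_len, int mode,
    const sk_stateid_t *sid, const void *enc_fh, size_t fh_len)
{
	assert(out != NULL && owner != NULL && sid != NULL && enc_fh != NULL);
	if (owner_len > SK_OWNER_MAX || fh_len > SK_FH_MAX)
		return (-1);
	memset(out, 0, sizeof (*out));
	memcpy(out->owner, owner, owner_len);
	out->owner_len = owner_len;
	out->mode = mode;
	out->stateid = *sid;
	memcpy(out->fh, enc_fh, fh_len);
	out->fh_len = fh_len;
	return (0);
}
```

The builder copies its inputs so that the request does not alias the
compound state, which is an assumption of the sketch and not a statement
about how cp_ds_mds_checkstateid manages memory.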
-Processing of the arguments (all the fun happens here):
--------------------------------------------------------
The processing of the DS_CHECKSTATE procedure on the MDS happens in
routine ds_checkstate. Here we decode the ds filehandle, and use the
filehandle to get to the MDS nnode. AHA..big catch. What we have is a DS
filehandle, but what we need is an MDS nnode. Hmm..how do we do that?
Several ways: (a) presumably convert the DS filehandle to an MDS
filehandle and then invoke the existing nnode routines to get to the MDS
nnode; (b) convert the DS filehandle directly to the MDS nnode; or (c)
convert the DS filehandle into a vnode, and convert the vnode into an
nnode. After discussing with Sam, I used the last alternative because it
was the easiest to implement and test. Eventually, this will get fixed
to use alternative (b).
Next, we use the existing v4.0 interface for state checking,
check_stateid. But wait! Checking the state is a state
operation, so we MUST go via the nnode abstraction. Hence, we use the
same nnode call to do checkstate on the MDS that was first used to
trigger checkstate on the DS - nnop_check_stateid. Of course, due to the
magic of nnodes, this time the nnop_check_stateid turns into a local
function call - check_stateid. THANK YOU nnodes!
But wait a minute! The NFSv4.0 check_stateid uses v4.0 bits in the
stateid, but for NFSv4.1 we should be doing checkstate using v4.1 bits.
The current stateid generation for v4.1 happens using the v4.0 bits
(reuse of v4.0 code). Jim has graciously promised to get all that
cleaned up.
The last piece is the return of the clientid. The clientid is returned
as a side effect of nnop_check_stateid, since the state data structures
can be used to reach the client record (rfs4_client_t) that stores
the clientid.
There are two ways to get to the client record. The first is to take the
client_owner4, which is actually interpreted as nfs_client_id4, use it to
index into the client record table (table of rfs4_client_t), and get to the
clientid. Yes, that would work, but the issue is that the current client
implementation gives a different client_owner4 string to the MDS and the
DS at the time of exchange id. Hence there is no way to map the
client_owner4 from the DS into the client record at the MDS. This was
okay for NFSv4.0 but not okay for NFSv4.1. See
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-29, section
11.7.7. This is an issue that the sessions team will be resolving.
The second way is to access the client record via the state data
structures. The pointer to the client record is cached in these
structures. However, there is another issue exposed when we use the
state data structures to access the client record.
Take a scenario where the client reboots after issuing a bunch of writes
to the DS. The current gate code does not expire the state immediately
for opens and locks after exchange id, or when the clientid gets
confirmed (at CREATE_SESSION/next session op). However, on an exchange
id, the current implementation does hide the client record database
entry for that client.
Now consider the event that a checkstate arrives at the MDS from an I/O
that was issued before the client rebooted and before the previous lease
expired, but after the completion of exchange id. Here, the MDS may
return an error for the clientid if the client record is accessed via the
database, but no error if the client record is accessed via the state data
structures. This inconsistency will be fixed eventually after some
changes to the sessions implementation. For now, we use the state data
structures to access the client record.
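The divergence between the two access paths can be reproduced with a toy
model. This is a sketch under assumptions: the sk_ types, the hidden flag,
and the linear table scan are hypothetical simplifications and do not
reflect how the rfs4 database tables are implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy client record; stands in for rfs4_client_t. */
typedef struct sk_client {
	uint64_t clientid;
	int hidden;	/* set when a new exchange id hides the DB entry */
} sk_client_t;

/* Toy open/lock state with a cached pointer to its client record. */
typedef struct sk_state {
	sk_client_t *client;
} sk_state_t;

/* Path 1: database lookup; hidden entries are not visible. */
static sk_client_t *
sk_db_find_client(sk_client_t *tbl, size_t n, uint64_t clientid)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!tbl[i].hidden && tbl[i].clientid == clientid)
			return (&tbl[i]);
	}
	return (NULL);
}

/* Path 2: follow the cached pointer; hiding has no effect here. */
static sk_client_t *
sk_state_find_client(const sk_state_t *sp)
{
	assert(sp != NULL);
	return (sp->client);
}
```

Before the exchange id both paths return the same record. Once the entry
is marked hidden, sk_db_find_client returns NULL while
sk_state_find_client still returns the old record, which is the
inconsistency described above.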
Handling of Errors:
-------------------
On the DS, the errors from the MDS come back as control protocol errors.
These proposed changes map such errors into NFS v4.1 errors for reads
and writes using the get_nfs_status routine. On the MDS, the existing
NFS v4/v4.1 routines return NFSv4/v4.1 errors. The proposed changes map
these errors into the relevant control protocol errors using the
get_ds_status routine.
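A table-driven translation is one plausible shape for these two routines.
The sketch below is not the code in the webrev: the SK_DS* control
protocol codes are invented for illustration, the sk_ function names only
mirror get_nfs_status and get_ds_status, and falling back to a generic
fault for unmapped codes is an assumption. The NFS status numbers are the
standard NFSv4 values.

```c
#include <assert.h>
#include <stddef.h>

/* Standard NFSv4 status values (subset). */
enum {
	SK_NFS4_OK = 0,
	SK_NFS4ERR_SERVERFAULT = 10006,
	SK_NFS4ERR_EXPIRED = 10011,
	SK_NFS4ERR_BAD_STATEID = 10025,
	SK_NFS4ERR_OPENMODE = 10038
};

/* Hypothetical control protocol status values. */
enum {
	SK_DS_OK = 0,
	SK_DSERR_BAD_STATEID = 1,
	SK_DSERR_EXPIRED = 2,
	SK_DSERR_OPENMODE = 3,
	SK_DSERR_FAULT = 4
};

static const struct {
	int ds;
	int nfs;
} sk_status_map[] = {
	{ SK_DS_OK, SK_NFS4_OK },
	{ SK_DSERR_BAD_STATEID, SK_NFS4ERR_BAD_STATEID },
	{ SK_DSERR_EXPIRED, SK_NFS4ERR_EXPIRED },
	{ SK_DSERR_OPENMODE, SK_NFS4ERR_OPENMODE }
};

#define	SK_MAP_LEN	(sizeof (sk_status_map) / sizeof (sk_status_map[0]))

/* DS side: control protocol error -> NFS error for the READ/WRITE reply. */
static int
sk_get_nfs_status(int ds_status)
{
	size_t i;

	for (i = 0; i < SK_MAP_LEN; i++) {
		if (sk_status_map[i].ds == ds_status)
			return (sk_status_map[i].nfs);
	}
	return (SK_NFS4ERR_SERVERFAULT);	/* assumed fallback */
}

/* MDS side: NFS error from the state check -> control protocol error. */
static int
sk_get_ds_status(int nfs_status)
{
	size_t i;

	for (i = 0; i < SK_MAP_LEN; i++) {
		if (sk_status_map[i].nfs == nfs_status)
			return (sk_status_map[i].ds);
	}
	return (SK_DSERR_FAULT);		/* assumed fallback */
}
```

Keeping both directions in one table means a mapped error survives the
MDS-to-DS round trip unchanged, so the client sees the same NFS error it
would have received from the MDS directly.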
Future work:
------------
1) DS_CHECKSTATE is supposed to return the layouts for the file. Sam
recently pushed a utility for layouts. I will soon start studying the
utility to use it to return the layouts via DS_CHECKSTATE.
2) Validating a client's access rights/RPC security via DS_CHECKSTATE.
This is a little tricky since we are dealing with RPC authentication; a
firm design is still pending. I have had some initial discussions with
the team; look out for a design proposal shortly.
3) There is no state caching infrastructure at the DS as of now, so
DS_CHECKSTATE will get called for every I/O. This is an interim solution
because we need the DS_CHECKSTATE functionality in the current gate to
make forward progress on some of the other MDS pieces, while the caching
infrastructure evolves in parallel. Again, look out for design notes on
state caching.