Uncovering the HPC Bottleneck eSeminar

Uncovering the HPC Bottleneck: High Performance Storage and the Lustre(TM) File System



Please view the recorded presentation of this webinar, cosponsored with Ziff Davis Enterprise!

FEATURED SPEAKERS

Larry McIntosh - Technical Field Architect - Sun Microsystems, Inc.
Michael Krieger - VP, Market Experts Group - Ziff Davis Enterprise
Sean Cochrane - North America Chief Architect, Global Systems Practice - Sun Microsystems, Inc.

Please submit your questions for the speakers by adding a comment to this page. We will make sure that no questions are left unanswered!

Q: When will 2.0 be available?
Larry McIntosh: The Lustre Roadmap can be reviewed here: http://wiki.lustre.org/index.php?title=Lustre_Roadmap

Q: Wouldn't SSD drives be a solution? An expensive one, though!
Ken Kutzer: Thanks for your question. There is one on SSDs in the queue for the Q&A at the end.

Q: Are you going to address how Lustre and QFS relate?
Larry McIntosh: QFS is a block-based shared file system. It accesses shared storage through a SAN architecture as the mechanism for sharing the file system between systems. Lustre is a parallel file system for Linux clusters; it does not require a SAN for data access but can fit into one. Lustre stores data in object chunks spread across OSS servers. The LNET piece of Lustre gives clients access to data over the network, much as NFS does, across Ethernet, InfiniBand, Quadrics, etc. So Lustre clients reach their data across these networks and do not have a block connection to the data as QFS clients do. Lustre scales much higher than QFS, but today both the MDS and OSS servers are Linux based.
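The idea of spreading a file's data in chunks across several object servers can be illustrated with a small sketch. This is a simplified round-robin (RAID-0 style) layout, not Lustre's actual implementation; the function name `locate_stripe` and its parameters are hypothetical and chosen only for illustration.

```python
def locate_stripe(offset, stripe_size, stripe_count):
    """Map a byte offset in a file to (object_index, offset_within_object).

    Assumes plain round-robin striping: chunk 0 goes to object 0,
    chunk 1 to object 1, and so on, wrapping after stripe_count objects.
    """
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; sizes and counts must be > 0")
    chunk_number = offset // stripe_size          # which chunk of the file
    object_index = chunk_number % stripe_count    # which object holds that chunk
    round_number = chunk_number // stripe_count   # full rounds already written
    # Each completed round has added one chunk to every object.
    object_offset = round_number * stripe_size + offset % stripe_size
    return object_index, object_offset


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Hypothetical layout: 1 MiB chunks across 4 objects.
    for off in (0, MIB, 4 * MIB + 5):
        print(off, locate_stripe(off, MIB, 4))
```

Because consecutive chunks land on different objects, a large sequential read or write can be served by several servers at once, which is the property the answer above relies on.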

Q: How does Lustre Compare to Parallel NFS in terms of performance, cost and ease of deployment? Do you have any data (that you can share) to back your response?
Larry McIntosh: Lustre is available today, and Sun has done a great job of integrating it within the Sun HPC Software, Linux Edition, which simplifies the deployment effort. In addition, one can work with Sun on a proof of concept (POC). pNFS is a great technology for folks who want a parallel version of NFS; it is based on NFS version 4.1 and the specifications that continue to come out of that work. Sun is an active member of this collaborative working group. According to the roadmap, there will be pNFS support for Lustre. I believe one should require pNFS to have Lustre-like MDS capabilities, and further require pNFS to scale linearly, as we can show Lustre has done from the low end to the very high end of the spectrum.

Q: Is Lustre available only on Linux? If not available on Solaris, is there an equivalent option in Solaris, like with the popular VxFS?
Larry McIntosh: Lustre servers run on Linux today. Sun is working on a ZFS version, but it is not available today. Patchless clients are available for a number of client types, but the servers are all Linux based.

Q: If I need to summarize, if high performance is important for NFS customers, Lustre will provide that. If customers need fast access to their data that might be out on a Veritas FS, QFS is the faster solution? This doesn't work natively on Solaris?
Larry McIntosh: Not yet. Sun is working on this. Lustre is a Linux MDS and OSS based architecture. We are also working on Lustre with ZFS.

Q: Do the OSS's support the SCSI OSD protocol?
Larry McIntosh: The OSSs support Object Based Storage for data storage and retrieval. There is a good white paper that will be pushed out at the end of the presentation that describes this for you.

Q: Is clustered MDS supported?
Larry McIntosh: Not yet. It is on the roadmap: http://wiki.lustre.org/index.php?title=Lustre_Roadmap

Q: Does the OSS support iSCSI?
Larry McIntosh: iSCSI could be used but it is not recommended for the OSS - LNET is used here. For iSCSI I would look at Sun's 7000 Series system.

Q: When we said "fail over" between OSSs, does that mean you need double the number of OSS machines?
Larry McIntosh: A good way to implement failover between OSS servers is to pair them up with shared storage between each pair. That said, you can make one server active on one set of shared data and the other passive on that same data. The passive system can in turn be active on another set of shared data, with its paired OSS passive on that set. So you can achieve the best of both worlds here. In both cases each OSS becomes the failover for the data its partner primarily serves.
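The paired arrangement described above can be sketched as a small model. This is only an illustration of the active/passive cross-pairing logic, not a Lustre configuration; the class `Target`, the node names `oss-a` and `oss-b`, and the target names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A set of shared data with a primary (active) and a backup (passive) server."""
    name: str
    primary: str
    backup: str


def serving_node(target, failed):
    """Return the node that serves the target, or None if both nodes are down."""
    if target.primary not in failed:
        return target.primary
    if target.backup not in failed:
        return target.backup
    return None


def placement(targets, failed=frozenset()):
    """Map each target name to the node currently serving it."""
    return {t.name: serving_node(t, failed) for t in targets}


# Hypothetical pair: each server is active for one target and passive for the other.
EXAMPLE_TARGETS = [
    Target("data0", primary="oss-a", backup="oss-b"),
    Target("data1", primary="oss-b", backup="oss-a"),
]

if __name__ == "__main__":
    print(placement(EXAMPLE_TARGETS))
    print(placement(EXAMPLE_TARGETS, failed={"oss-a"}))
```

In normal operation both servers do useful work, so no machine sits idle; after a failure the surviving server carries both sets of data until its partner returns.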

Q: What is the performance for clients over WAN with RTT ~ 100 ms?
Larry McIntosh: I am hoping you don't get multiple responses to this. I had thought I answered this, but let me make sure it gets answered. At SC08, TACC and Sun demonstrated the use of four Sun Fire X4540 ("Thor") systems with 10 GigE. These systems were connected remotely through a Lustre WAN connection back to TACC's Ranger system, and TACC was able to achieve 2.8 GB/s data throughput across the WAN connection to them. There is a major Lustre WAN project that members of the US TeraGrid participate in that could also provide additional input on this. Finally, Lustre WAN can be reviewed further on the web through publicly posted data: a search for "Lustre WAN and RTT" shows some results from IU and ORNL as well as IU and PSC, etc.
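As a rough back-of-the-envelope aid for the numbers in this answer, the sketch below computes two things: the average throughput per server if the 2.8 GB/s were spread evenly over the four systems (an assumption, since the per-server split was not given), and the bandwidth-delay product for a 10 GigE link at the 100 ms RTT from the question. The bandwidth-delay product is a standard networking estimate of how much data must be in flight to keep a link full; it was not part of the webinar, and the function names are hypothetical.

```python
def per_server_throughput(total_gb_per_s, servers):
    """Average throughput per server, assuming an even spread."""
    if servers <= 0:
        raise ValueError("servers must be > 0")
    return total_gb_per_s / servers


def bandwidth_delay_product_bytes(link_bits_per_s, rtt_ms):
    """Bytes that must be in flight to keep a link busy at a given round-trip time."""
    if link_bits_per_s < 0 or rtt_ms < 0:
        raise ValueError("inputs must be non-negative")
    bits_in_flight = link_bits_per_s * rtt_ms / 1000  # bits sent during one RTT
    return bits_in_flight / 8


if __name__ == "__main__":
    # Figures from the answer: 2.8 GB/s across four servers.
    print(per_server_throughput(2.8, 4), "GB/s per server (even split assumed)")
    # 10 Gbit/s link, 100 ms RTT as in the question.
    print(bandwidth_delay_product_bytes(10_000_000_000, 100) / 1e6, "MB in flight")
```

The second figure suggests why WAN tuning matters: at long round-trip times, a client needs a large amount of outstanding I/O per link before the link itself becomes the limit.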

Q: Why do you need 32GB in the OSS module, when the HA-OSS nodes only have 8GB?
Ken Kutzer: I believe 32 GB and 64 GB are the standard configs with short lead times. 8 GB would be a custom X4540 config.

Q: This may be a bit too technical for today, but: What is the determining criteria that would dictate the necessary ratio of MDSs to OSSs? Or is a single MDS sufficient for max config?
Larry McIntosh: Actually it is a great question. At TACC we took a simple approach and use a failover pair of MDS servers for each file system TACC implemented. These run on our Sun Fire X4600 systems, which was a great system to use at the time. There are newer systems from Sun one could use for this today as well. The Lustre Operations Manual goes into this a bit and gives some further recommendations. TACC has 3 file systems, the largest of which is 1.2 PB in size. So there are 6 MDS servers (3 active/passive pairs) serving all 1.7 PB of online storage. This architecture has also been implemented by Sun at other facilities worldwide and is working rather well. I would not recommend using a single MDS for production systems. For test and development systems perhaps this would be OK.

Q: What is the performance for clients over WAN with RTT ~ 50-100 ms?
Larry McIntosh: We have seen very good performance here. At SC08, TACC and Sun demonstrated four remote Sun Fire X4540 OSS systems running 10 GigE, connected into TACC's Ranger system, and were able to achieve 2.8 GB/s throughput on them. Further information is on the web if one searches for Lustre WAN.

Q: What kind of a purging mechanism do you use for a file system that large?
Larry McIntosh: Actually there is a quota facility within Lustre that one can use. There is also a way to utilize the data movers discussed to push data out to an HSM system such as SAMFS after a certain period of time, if one does not want a hard policy of merely deleting data from the file system. Once the data is in SAMFS, the policy definitions of SAMFS take over. This is where the company's real policy-driven cycles come into play, and SAM can implement these very well.
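A minimal sketch of the "after a certain period of time" selection step might look like the following. It is a generic directory walk, not a Lustre or SAMFS tool, and it assumes that file modification time is the aging criterion (a site could equally use access time). The function name `find_stale_files` and its parameters are hypothetical; the script only lists candidates and deletes nothing.

```python
import os
import time

SECONDS_PER_DAY = 86400


def find_stale_files(root, max_age_days, now=None):
    """Return a sorted list of files under root not modified in max_age_days.

    The result is only a candidate list; what happens next (archive, purge)
    is left to site policy.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                stale.append(path)
    return sorted(stale)
```

A data mover could feed such a list to a copy step before anything is removed, so that the archive holds the data before the scratch copy goes away.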

Q: I have been informed that mission critical data should not be used with Lustre due to possibly having MDS problems. What is your word on this? Is this only dependent on equipment stability, or is the filesystem not entirely resilient?
Larry McIntosh: MDSs can be put into a failover mode of operation with shared storage. Data can also be mirrored. A number of customers have put this architecture in place. I would not recommend putting a production system in place without it, but testing without failover MDS systems would be OK.

Q: Does "LMT" provide tools to determine real-time performance of the Lustre filesystem? eg when the filesystem "slows down"...?
Larry McIntosh: LMT is a great tool, and we anticipate that it may make it into Sun's Linux software stack, or at least remain an optional add-on as it is today. Another tool one should look at for something like this is collectl, which one can get from sourceforge.net.

Q: There was no mention of SSD drives in the various storage arrays. Any thoughts as to when and where they might be used?
Larry McIntosh: We scaled the presentation down to one hour - we had some slides showing this - sorry for that. That said, we have SSDs in our Series 7000 system today. We are also working on Lustre supporting Hybrid Storage Pools with both SSDs and HDDs. Preliminary work shows that Lustre lowers latency and increases bandwidth by utilizing these, even with large sequential I/Os at 1 MB stripe sizes. So we anticipate folks using SSDs as a very fast portion of the storage pool with Lustre, giving even better performance than the traditional HDDs we see today.

Q: What OS are you using for the SAMFS/data mover server, and how exactly do you 'move' the data from a Lustre file system onto a SAMFS file system?
Larry McIntosh: We are running SAM on Solaris. We have shared QFS clients on Linux running both shared QFS and Lustre, so the data mover sees both file systems. There are a number of ways of moving the data, with simple copies as the basic method. Both ARSC and DKRZ in Germany have implemented this; Sun accomplished it via PS contracts for these particular customers. Another way of moving data is with GridFTP or SCP. There was also an implementation outside of Lustre at SDSC, where they do this with Linux systems that have access to both IBM's GPFS and SAMFS. They had used SAMFS to archive data this way in the past.
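The "simple copies" method on a node that mounts both file systems could be sketched as below. This is an illustrative copy-and-verify helper, not the tooling used at ARSC or DKRZ; the function names are hypothetical, and any source and destination paths a site passes in would be its own mount points.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a file's SHA-256 digest, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst (preserving timestamps) and confirm the contents match.

    Returns the path of the copy; raises OSError if the checksums differ.
    """
    copied = shutil.copy2(src, dst)
    if sha256_of(src) != sha256_of(copied):
        raise OSError(f"checksum mismatch after copying {src} to {copied}")
    return copied
```

Verifying the copy before the source is released is a conservative design choice: it costs an extra read of both files but avoids relying on the archive side for integrity.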

Q: Given the basic non-HA OSS configuration, how many times has the Lustre file system at TACC crashed and what was the impact?
Larry McIntosh: I was on a panel discussion at SC08 with TACC and TACC was asked a similar question. TACC expressed that their Lustre FS is very stable. It would seem to have to be since they are also running home directories on Ranger as one of their FSs.

Q: How would cost of a TACC-like system using 1Gb or 10Gb Ethernet compare, especially for those of us with existing gigabit Ethernet infrastructure?
Larry McIntosh: The Top500 is very rich with Ethernet and will continue to be. Lustre runs very well on this infrastructure and is on the Top500 as a file system running on it. The LNET portion of Lustre was built to support varying networks and does so very well from a scalability perspective. That said, Sun can speak to you about either of these approaches. Sun has provisioning working over IB and has a nice blueprint showing how to do this, but for folks who wish to utilize Ethernet this is certainly a very good way to go as an all-encompassing solution with Lustre.

Q: Does Lustre support Oracle database?
Larry McIntosh: Suggest looking at Oracle with QFS. Sun has done some good work here with this.

Q: Do you think that SSD drives would provide any performance kicker?
Larry McIntosh: Yes, we are seeing this in our SSD work with Lustre. We are also seeing this in the Series 7000 systems Sun introduced a few weeks ago, through the Analytics that this system offers.

Q: What is the strength of Sun's HPC architecture vs. IBM's HPC architecture?
Larry McIntosh: That could best be answered by customers such as TACC, TITECH, and DLR/CASE in Germany, and many more who have dealt with both companies. That said, Sun's Constellation system architecture and the newly developed Sun HPC Software, Linux Edition offering really make a complete solution one can beat IBM with, as we have proven at TACC.

Q: Do compute nodes in a cluster use the Lustre client to do their own I/O or is I/O coordinated through some nodes dedicated to doing I/O for the cluster?
Larry McIntosh: The compute clients gain access for I/O themselves. Refer to the white paper URL we pushed out for further details.



The individuals who post here are part of the extended Sun Microsystems community and they might not be employed or in any way formally affiliated with Sun Microsystems. The opinions expressed here are their own, are not necessarily reviewed in advance by anyone but the individual authors, and neither Sun nor any other party necessarily agrees with them.

Copyright 1994-2009 Sun Microsystems, Inc.