Lustre Melts My Brain (Again)

Chances are, if you're working on a big supercomputer, you're interacting with one of two major distributed file systems (there are more, but these are the big two).

If you're on Cray, you're most likely using Lustre.

If you're on IBM, you're probably using GPFS.

We're on Cray.

Without giving away any proprietary details, Lustre is a distributed file system.  A typical configuration comprises:

  • Multiple storage targets.  These are the hard drives (conceptually; the reality is much more complicated).
  • Multiple servers.  Each server hosts several storage targets and performs the IO operations requested by clients.
  • Clients, the computers that access the file system.
  • Metadata servers/targets.  These maintain a coherent file system namespace (associating the correct file data with the correct file name, for example).
This is grossly oversimplified, mind you.  There are also routers that forward IO between different network types (and we've got a bunch of those).
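If you want to see those pieces from a client's point of view, a couple of standard Lustre commands will enumerate them (output varies by site and version; this is just a sketch of where to look):

```shell
# List the Lustre devices this client knows about: MDC entries (its
# connection to the metadata servers) and OSC entries (one per object
# storage target).
lctl dl

# Show usage per target: one line for each MDT and each OST, which is
# a quick way to count the storage targets behind the file system.
lfs df -h
```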

So we've got a user shotgunning so much data down our throats at once that it causes targets to time out.

It took me 2 days to figure that out.  My brain is melted.

Incidentally: one of the many things wrong with Excel is its tendency to simply strip the seconds off any date/time fields in your spreadsheet.  Because if you have 12 data points per minute, why would you need to be able to distinguish between them?

On the bright side, LMT (the Lustre Monitoring Tool) is absolutely invaluable, and well worth the effort of setting it up.

We're at the point where I think the only recourse will be to increase timeouts to accommodate the load, rather than to tweak the number of in-flight IO operations, since the latter is pretty much already maxed out.
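For the curious, the knobs in question look roughly like this on a reasonably modern Lustre (exact parameter paths and defaults vary by version, so treat this as a sketch, not a recipe):

```shell
# Per-OST-connection cap on concurrent RPCs from this client --
# the "in-flight IO operations" knob that's already maxed for us.
lctl get_param osc.*.max_rpcs_in_flight

# The global obd timeout, in seconds.
lctl get_param timeout

# Raising it to ride out sustained load spikes, e.g. to 300 seconds
# (the value here is illustrative, not a recommendation).
lctl set_param timeout=300
```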

Maybe at some point in the future I'll write something about scalability challenges. 
