Writes from multiple processes result in missing or zero-length files
Missing or zero-length files written by a BGL application, together with error messages such as "fclose: Stale NFS file handle", are caused by caching problems on the NFS server. The I/O nodes are rebooted for every run, and these frequent reboots cause the server to send wrong replies to the clients. Unfortunately, caching cannot be turned off at the server.
This problem should disappear once GPFS can be used as the data filesystem.
If you run into this problem with your application in the meantime, you can work around it by calling mpirun with -nofree, so that the partition is not rebooted:
mpirun [-np <p> | -partition <part>] -nofree ......
If a partition has been dynamically allocated with -np and -nofree (which results in the allocation of a partition named RMP....), use this partition name in all subsequent runs:
mpirun -partition RMP... [-nofree] ......
Use llstat to find the name of the partition that is allocated to you.
CAUTION:
Once -nofree has been used on a partition, even a single time, the partition remains allocated to the user (and is accounted to the user), even if -nofree is omitted from later mpirun calls.
After the last of several mpirun calls, the partition must therefore be freed with:
freepartition <partname>
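The sequence described above (several runs on the same partition with -nofree, followed by a final freepartition) could be wrapped in a small shell function along the following lines. This is only a sketch under stated assumptions: the function name run_series, the RUN variable, and the per-run argument strings are hypothetical and not part of the BGL tools. The partition name must be supplied by you, for example as reported by llstat. By default the function only prints the commands (dry run), so it can be tried without touching a partition.

```shell
#!/bin/sh
# Sketch only: run_series and RUN are hypothetical helper names.
# RUN defaults to "echo", so the commands are printed, not executed.
# Start the script with RUN set to the empty string to execute them.
RUN="${RUN-echo}"

# usage: run_series <partname> "<mpirun args for run 1>" "<args for run 2>" ...
run_series() {
    part="$1"
    shift
    status=0
    for args in "$@"; do
        # $args is left unquoted on purpose so that it splits into words.
        $RUN mpirun -partition "$part" -nofree $args || status=$?
    done
    # The partition stays allocated (and accounted) after -nofree,
    # so free it once all runs are done, even if one of them failed.
    $RUN freepartition "$part"
    return "$status"
}
```

Calling run_series with a partition name and one argument string per run prints one mpirun line per run and a closing freepartition line. Freeing the partition inside the function is a deliberate choice, so that a failed run does not leave the partition allocated.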
last change 15.11.2005
