mpich
- 2007-01-19 network problems related to starting parallel programs
These problems make the program hang with no output on stdout or stderr.
- localhost and an external hostname (e.g. dl403k-1.cmb.usc.edu) both appear in the machinefile. My guess is that this fails because the two names are in different network realms.
- A firewall blocking ports in the range 1024-65535 can block mpich communication. See http://www-unix.mcs.anl.gov/mpi/mpich1/docs/mpichman-chp4mpd/node108.htm .
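Both checks above can be scripted before launching a job. The following is a minimal sketch, not part of mpich: the function names are hypothetical, the machinefile is assumed to hold one `host` or `host:ncpus` entry per line, and only the names `localhost` and `127.0.0.1` are treated as loopback. The port probe only tells you something when a process is already listening on that port on the remote machine.

```python
import socket

# Assumption: only these two names count as "loopback" entries.
LOOPBACK_NAMES = {"localhost", "127.0.0.1"}


def machinefile_hosts(text):
    """Return host names from machinefile text, dropping comments and ':n' counts."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # "host:2" -> "host"
        hosts.append(line.split()[0].split(":", 1)[0])
    return hosts


def mixes_loopback_and_external(hosts):
    """True if the list contains both a loopback name and any other host name."""
    has_loopback = any(h in LOOPBACK_NAMES for h in hosts)
    has_external = any(h not in LOOPBACK_NAMES for h in hosts)
    return has_loopback and has_external


def port_is_reachable(host, port, timeout=3.0):
    """Try a TCP connection; False on refusal, timeout, or any other socket error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `mixes_loopback_and_external` returns True for your machinefile, replacing `localhost` with the machine's external hostname is the first thing to try.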
- 2007-01-20 some tips
Mostly it comes down to network quality. Common errors look like the following:
  p5_26056: p4_error: Timeout in establishing connection to remote process: 0
  rm_l_5_26057: (301.274493) net_send: could not write to fd=5, errno = 32
  p1_32309: p4_error: net_recv read: probable EOF on socket: 1
  p3_17484: p4_error: net_recv read: probable EOF on socket: 1
  p4_26051: p4_error: net_recv read: probable EOF on socket: 1
  rm_l_4_26052: (301.379653) net_send: could not write to fd=5, errno = 32
  rm_l_1_32311: (301.715118) net_send: could not write to fd=5, errno = 32
  yh@dl403k-1:~$ p5_26056: (305.281470) net_send: could not write to fd=5, errno = 32
  p3_17484: (311.544807) net_send: could not write to fd=5, errno = 32
  p4_26051: (311.393724) net_send: could not write to fd=5, errno = 32
  p1_32309: (311.736869) net_send: could not write to fd=5, errno = 32
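Output like this often arrives run together, so a small helper for splitting and counting the messages can make the failure pattern easier to see. This is only a sketch: the function names are made up, the process tags are assumed to look like `p5_26056` or `rm_l_5_26057`, and the three categories are my own reading of the messages (errno 32 is EPIPE, "broken pipe", on Linux).

```python
import re
from collections import Counter

# Assumed tag shapes: "p<rank>_<pid>: " and "rm_l_<rank>_<pid>: ".
TAG = re.compile(r"\b((?:rm_l_|p)\d+_\d+): ")


def split_messages(blob):
    """Split run-together p4 output into (process_tag, message) pairs."""
    parts = TAG.split(blob)
    # parts = [text_before_first_tag, tag1, msg1, tag2, msg2, ...]
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


def classify(message):
    """Map one message to a rough failure category."""
    if "Timeout in establishing connection" in message:
        return "connect timeout"
    if "probable EOF on socket" in message:
        return "peer closed socket"
    if "errno = 32" in message:
        return "broken pipe (EPIPE)"
    return "other"


def summarize(blob):
    """Count messages per category."""
    return Counter(classify(msg) for _, msg in split_messages(blob))
```

In the output above, a single connect timeout is followed by EOF and broken-pipe errors on the other processes, which would fit one failed connection taking the rest of the job down.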
- It does not seem to matter whether the copies of the program on different machines are identical. A slight difference in the mpich/mpichpython version is fine as well.
- localhost appears to be more reliable than an ethernet connection.
- When network quality is bad, anything involving stderr (maybe stdout too) or popen (the new rpy module calls this to get RHOME, RVERSION, ...) can take the whole program down. When the network is robust, everything works.
- The mpd version of mpich1 is weird. Although several machines are booted up, each copy of the parallel program appears to run on its own and the copies don't communicate (size of communicator = 1).
- 2007-01-28 stderr/stdout output AND communicator
- The parallel environment seems very volatile under some circumstances (e.g. when using a public network). This might be caused by nodes writing to stderr/stdout before their communicators are initialized.
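One way to test this guess is to hold back all early output until the communicator exists. The class below is a hypothetical helper, not part of mpich or mpichpython; it assumes the program routes its own messages through it and calls `release()` right after the communicator has been set up. Output written directly by other libraries is not captured.

```python
import sys


class DeferredOutput:
    """Buffer messages until release() is called, then write straight through."""

    def __init__(self, stream=None):
        # Default to stderr; any object with write() and flush() works.
        self._stream = stream if stream is not None else sys.stderr
        self._pending = []
        self._released = False

    def write(self, text):
        if self._released:
            self._stream.write(text)
        else:
            self._pending.append(text)

    def release(self):
        """Flush buffered messages in order; later writes go out immediately."""
        self._released = True
        for text in self._pending:
            self._stream.write(text)
        self._pending.clear()
        self._stream.flush()
```

If the job becomes stable once nothing is written before `release()`, that supports the guess; if it stays volatile, the cause is more likely the network itself.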