
mpich

2007-01-19 network problems related to starting parallel programs

These problems make the program hang with nothing written to stdout or stderr.

  1. localhost and an external hostname (e.g. dl403k-1.cmb.usc.edu) both appear in the machinefile. I guess this is because they are in different network realms.
  2. A firewall blocking ports in the range 1024-65535 can block the mpich communication. See http://www-unix.mcs.anl.gov/mpi/mpich1/docs/mpichman-chp4mpd/node108.htm .
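
The first cause can be checked before launching a job. Below is a minimal sketch that flags a machinefile mixing loopback names with external hostnames. The function names are hypothetical, and the entry format is an assumption: one host per line, optionally followed by `:ncpus`, with `#` comments.

```python
# Names treated as loopback; extend this set for a local setup (assumption).
LOOPBACK_NAMES = {"localhost", "localhost.localdomain"}

def is_loopback(host):
    """Return True if the name or IPv4 address refers to the loopback interface."""
    host = host.lower()
    return host in LOOPBACK_NAMES or host.startswith("127.")

def machinefile_hosts(lines):
    """Extract host names from machinefile lines, skipping blanks and # comments."""
    hosts = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        # Assumed entry format: "host" or "host:ncpus".
        host = entry.split()[0].split(":", 1)[0]
        if host:
            hosts.append(host)
    return hosts

def mixes_loopback_and_external(lines):
    """Return True if the machinefile lists both loopback and external hosts."""
    hosts = machinefile_hosts(lines)
    has_loopback = any(is_loopback(h) for h in hosts)
    has_external = any(not is_loopback(h) for h in hosts)
    return has_loopback and has_external
```

To check a real file, pass its lines, for instance `mixes_loopback_and_external(open(path))` with `path` pointing at the machinefile in use.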
2007-01-20 some tips

Mostly, the cause is network quality. Common errors look like this:

p5_26056:  p4_error: Timeout in establishing connection to remote process: 0
rm_l_5_26057: (301.274493) net_send: could not write to fd=5, errno = 32
p1_32309:  p4_error: net_recv read:  probable EOF on socket: 1
p3_17484:  p4_error: net_recv read:  probable EOF on socket: 1
p4_26051:  p4_error: net_recv read:  probable EOF on socket: 1
rm_l_4_26052: (301.379653) net_send: could not write to fd=5, errno = 32
rm_l_1_32311: (301.715118) net_send: could not write to fd=5, errno = 32
yh@dl403k-1:~$ p5_26056: (305.281470) net_send: could not write to fd=5, errno = 32
p3_17484: (311.544807) net_send: could not write to fd=5, errno = 32
p4_26051: (311.393724) net_send: could not write to fd=5, errno = 32
p1_32309: (311.736869) net_send: could not write to fd=5, errno = 32
  1. It doesn't seem to matter whether the copies of the program on different machines are identical. Slightly different versions of mpich/mpichpython across machines are fine as well.
  2. Connections over localhost appear to be more robust than connections over ethernet.
  3. When network quality is bad, anything involving stderr (and maybe stdout) or popen (the new rpy module calls it to get RHOME, RVERSION, ...) can take the whole program down. When the network is robust, everything works.
  4. The mpd version of mpich1 is weird. Though several machines are booted up, the copies of the parallel program appear to run on their own and don't communicate (size of communicator = 1).
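
To see which failure dominates in a log like the one quoted above, the messages can be tallied by kind. This is a rough sketch: the regular expressions are inferred only from the lines shown here, and `summarize` is a hypothetical helper. On Linux, errno 32 is EPIPE (broken pipe), meaning the peer end of the socket was already closed.

```python
import os
import re
from collections import Counter

# Matches lines such as
#   "p5_26056:  p4_error: Timeout in establishing connection to remote process: 0"
#   "rm_l_5_26057: (301.274493) net_send: could not write to fd=5, errno = 32"
LINE_RE = re.compile(
    r"(?P<proc>(?:p|rm_l_)\d+_\d+):\s+(?:\([\d.]+\)\s+)?(?P<msg>.+)"
)
ERRNO_RE = re.compile(r"errno = (\d+)")

def summarize(lines):
    """Count log messages by kind, ignoring process names and timestamps."""
    counts = Counter()
    for line in lines:
        # search() rather than match(), so a leading shell prompt is tolerated.
        m = LINE_RE.search(line)
        if not m:
            continue
        msg = m.group("msg").strip()
        e = ERRNO_RE.search(msg)
        if e:
            # Append the platform's description of the errno value.
            msg += " (" + os.strerror(int(e.group(1))) + ")"
        counts[msg] += 1
    return counts
```

Calling `summarize(open("mpich.log"))` (the file name is only an example) returns a `Counter`, so `most_common()` lists the most frequent failure first.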
2007-01-28 stderr/stdout output and communicator

The parallel environment seems very volatile under some circumstances (e.g. when using a public network). This might be caused by nodes writing to stderr/stdout before their communicators are initialized.
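
If that guess is right, one possible workaround is to hold early messages in memory and only write them once the communicator exists. The sketch below shows the idea in plain Python. The `DeferredOutput` class is hypothetical, not part of mpich or mpichpython, and the in-memory stream only stands in for stderr.

```python
import io
import sys

class DeferredOutput:
    """Buffer messages in memory until mark_ready() is called."""

    def __init__(self, stream=None):
        self._stream = stream if stream is not None else sys.stderr
        self._pending = []
        self._ready = False

    def write(self, message):
        """Write immediately if ready, otherwise keep the message for later."""
        if self._ready:
            self._stream.write(message)
        else:
            self._pending.append(message)

    def mark_ready(self):
        """Call once the communicator is initialized; emits buffered messages."""
        self._ready = True
        for message in self._pending:
            self._stream.write(message)
        self._pending.clear()

# Example with an in-memory stream standing in for stderr.
sink = io.StringIO()
log = DeferredOutput(sink)
log.write("loading data\n")   # buffered, nothing written yet
log.mark_ready()              # would follow communicator initialization
log.write("rank ready\n")     # written directly
```

This only covers output the program itself controls; anything written by child processes (such as those started through popen) would bypass the buffer.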