From owner-chemistry@ccl.net Tue Nov 18 07:23:01 2008
From: "John McKelvey jmmckel^^gmail.com"
To: CCL
Subject: CCL: mpirun failure
Message-Id: <-38116-081118072028-11482-w0OTmlK8tzb5w6a91DEmag*server.ccl.net>
X-Original-From: "John McKelvey"
Date: Tue, 18 Nov 2008 07:20:15 -0500
Sent to CCL by: "John McKelvey" [jmmckel#%#gmail.com]

Folks,

This program, fpi (included in the mpich-1.2.7p1 tarball; it computes pi), runs fine on 1 processor on an SMP box, but with 2 processors I get the output below. I know that 127.0.0.1 is the loopback address. Any hints on how to make this work with mpich-1.2.7p1 would be most appreciated (I have to use this version of mpich).

Many thanks!

John McKelvey

$ mpirun -np 2 fpi
connect to address 127.0.0.1: Connection refused
Trying krb4 rsh...
connect to address 127.0.0.1: Connection refused
trying normal rsh (/usr/bin/rsh)
localhost.localdomain: Connection refused

p0_4885:  p4_error: Child process exited while making connection to remote process on localhost.localdomain: 0
Interrupt
p0_4885: (33.019531) net_send: could not write to fd=4, errno = 32




From owner-chemistry@ccl.net Tue Nov 18 10:46:01 2008
From: "Jozsef Csontos jcsontos.lists .. gmail.com"
To: CCL
Subject: CCL: mpirun failure
Message-Id: <-38117-081118104055-16768-2Qergr/O4i9bYJNCuk4H3w###server.ccl.net>
X-Original-From: Jozsef Csontos
Date: Tue, 18 Nov 2008 16:40:20 +0100
Sent to CCL by: Jozsef Csontos [jcsontos.lists]![gmail.com]

Hi John,

I believe that you use ssh for communication and that you don't have a passwordless connection from your node (CPU0) to your node (CPU1) in your SMP machine. Below I have pasted one of my earlier replies to a similar problem.

> You should work on this issue. (For example: generate a public key with
> "ssh-keygen -t rsa" on the master node, then put the content of the
> generated file into the authorized_keys file on the working nodes:
> cat ~/.ssh/id_rsa.pub | ssh hostname_of_your_working_node "cat - >> ~/.ssh/authorized_keys")
> However, this procedure strongly depends on your cluster configuration.
>
> I hope it helps,
>
> Jozsef

In your case the above means:

A, ssh-keygen -t rsa (just press enter 3 times)
B, cp .ssh/id_rsa.pub .ssh/authorized_keys
C, try it (ssh localhost - the first time you get a keyring question, then you're done)

Or you can try to google:
http://www.google.com/search?hl=en&q=passwordless+ssh&btnG=Google+Search&aq=1&oq=passwordl

Good luck,

Jozsef

John McKelvey jmmckel^^gmail.com wrote:
> Folks,
>
> This pgm fpi [included in the mpich-1.2.7p1 tarball, computes pi] runs
> fine on 1 processor on SMP box but for 2 processors I get the below,
> knowing that 127.0.0.1 means "looping back". Any
> hints how to make this work most appreciated with mpich-1.2.7p1 [I
> have to use this version of mpich.]
>
> Many thanks!
>
> John McKelvey
>
> $mpirun -np 2 fpi
> connect to address 127.0.0.1: Connection refused
> Trying krb4 rsh...
> connect to address 127.0.0.1: Connection refused
> trying normal rsh (/usr/bin/rsh)
> localhost.localdomain: Connection refused
>
> p0_4885: p4_error: Child process exited while making connection to
> remote process on localhost.localdomain: 0
> Interrupt
> p0_4885: (33.019531) net_send: could not write to fd=4, errno = 32
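
A note on step B in the reply: cp replaces whatever is already in .ssh/authorized_keys, so any keys previously listed there are lost. Below is a minimal sketch of an append-only alternative. The function name add_key and its two arguments are made up for illustration and are not part of ssh or mpich; it assumes a POSIX shell with grep, cat, mkdir and chmod available.

```shell
# Hypothetical helper (not part of ssh or mpich): append a public key to an
# authorized_keys file only if that exact line is not already present.
add_key() {
    pub="$1"     # path to the public key, e.g. ~/.ssh/id_rsa.pub
    auth="$2"    # path to the authorized_keys file

    mkdir -p "$(dirname "$auth")"
    touch "$auth"
    chmod 600 "$auth"    # sshd usually expects restrictive permissions here

    # -x matches the whole line, -F treats the key as a fixed string
    if ! grep -qxF -- "$(cat "$pub")" "$auth"; then
        cat "$pub" >> "$auth"
    fi
}
```

In place of step B one would call add_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys and then continue with step C (ssh localhost). Running it twice leaves the file unchanged the second time.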