DAPL errors on Azure RDMA-enabled SLES cluster


I set up two Azure A8 VMs in an availability set running SLES-HPC 12 (following the tutorial here: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-cluster-rdma/).

When I run the Intel MPI pingpong test, I get DAPL errors:

azureuser@sshvm0:~> /opt/intel/impi/5.0.3.048/bin64/mpirun -hosts 10.0.0.4,10.0.0.5 -ppn 1 -n 2 -env I_MPI_FABRICS=shm:dapl -env I_MPI_DYNAMIC_CONNECTION=0 -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 /opt/intel/impi/5.0.3.048/bin64/IMB-MPI1 pingpong
sshvm1:d28:bef0eb40: 12930 us(12930 us):  dapl_rdma_accept: err -1 input/output error
sshvm1:d28:bef0eb40: 12946 us(16 us):  dapl err accept input/output error
[1:10.0.0.5][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c:622] error(0x40000): ofa-v2-ib0: not accept dapl connection request: dat_internal_error()
assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_conn_rc.c @ line 622: 0
internal abort - process 0

I get similar errors when running one of the OSU MPI microbenchmarks (compiled with the Intel MPI compiler).

What is the cause of these errors? How can I fix them and run these microbenchmarks? Help!

I did verify SSH connectivity between the two nodes by running "mpiexec -machinefile machinefile -n 2 hostname".
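For reference, that connectivity check can be sketched roughly as follows. The contents of the machinefile are an assumption here (simply the two node IPs used with -hosts in the mpirun command), and the MACHINEFILE variable is only an illustrative convenience.

```shell
#!/bin/sh
# Assumption: the machinefile just lists the two node IPs, one per line.
MACHINEFILE=${MACHINEFILE:-machinefile}
printf '%s\n' 10.0.0.4 10.0.0.5 > "$MACHINEFILE"

# Each of the two ranks should print the hostname of the node it ran on.
# Only attempt the launch if mpiexec is actually on the PATH.
if command -v mpiexec >/dev/null 2>&1; then
    mpiexec -machinefile "$MACHINEFILE" -n 2 hostname
else
    echo "mpiexec not found in PATH" >&2
fi
```

This only shows that processes can be launched on both nodes over SSH; it does not exercise the DAPL fabric.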

You need to update the RDMA drivers. We have updated the documentation; follow the link below: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-cluster-rdma/

Please go to the section "Update the Linux RDMA drivers for SLES 12".

Please follow the instructions there and update the RDMA drivers. Please update the drivers if you have provisioned VMs in one of the following regions: East US, North Central US, South Central US, or North Europe.
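Once the drivers are updated, one way to confirm the fix is to rerun the pingpong test from the question. The sketch below only assembles that command line, using the upper-case spelling of the I_MPI_* variables. The function name pingpong_cmd is a hypothetical helper, not something from the tutorial; the paths and IPs are copied from the question. Adding I_MPI_DEBUG and comparing against shm:tcp are suggestions for narrowing the problem down, not steps from the documentation.

```shell
#!/bin/sh
# Sketch: rebuild the pingpong command line from the question.
IMPI_BIN=/opt/intel/impi/5.0.3.048/bin64
HOSTS=10.0.0.4,10.0.0.5

# pingpong_cmd FABRICS [DEBUG_LEVEL]   (hypothetical helper)
# Prints the mpirun command line instead of running it.
pingpong_cmd() {
    fabrics=$1
    debug=${2:-}
    cmd="$IMPI_BIN/mpirun -hosts $HOSTS -ppn 1 -n 2 -env I_MPI_FABRICS=$fabrics"
    # The DAPL-specific settings only make sense for the DAPL fabric.
    if [ "$fabrics" = "shm:dapl" ]; then
        cmd="$cmd -env I_MPI_DYNAMIC_CONNECTION=0 -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0"
    fi
    # Optional: ask Intel MPI for extra startup information.
    if [ -n "$debug" ]; then
        cmd="$cmd -env I_MPI_DEBUG=$debug"
    fi
    printf '%s\n' "$cmd $IMPI_BIN/IMB-MPI1 pingpong"
}

pingpong_cmd shm:dapl 5   # RDMA path, with debug output
pingpong_cmd shm:tcp      # TCP path, for comparison
```

If the shm:tcp run works while the shm:dapl run still fails, that points at the RDMA driver or DAPL provider rather than at the MPI installation or SSH setup.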

