Intel® MPI Library User's Guide for Linux* OS
Communication problems with the Intel® MPI Library are usually caused by signal termination (SIGTERM, SIGKILL, or another signal). Such terminations may be due to a host reboot, an unexpected signal, out-of-memory (OOM) manager errors, or other causes.
To deal with such failures, you need to find out the reason for the MPI process termination (for example, by checking the system log files).
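On a typical Linux node, the system logs can be inspected with standard commands. The following is a minimal sketch, assuming a systemd-based distribution (for journalctl) and sufficient privileges to read the kernel log; adjust the time range and search patterns to your failure:

# Check whether and when the node was rebooted
$ last reboot | head
# Look for out-of-memory (OOM) killer activity in the kernel log
$ dmesg -T | grep -i -E "out of memory|killed process"
# Scan recent system log entries for signals, OOM events, or crashes
$ journalctl --since "1 hour ago" | grep -i -E "oom|killed|segfault"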
Example 1. Symptom/error message:
[50:node02] unexpected disconnect completion event from [41:node01]
and/or
================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 20066 RUNNING AT node01
= EXIT CODE: 15
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
================================================================================
Note: The node and the MPI process reported above may not be the ones where the initial failure occurred.
Cause: One of the MPI processes was terminated by a signal (for example, SIGTERM or SIGKILL) on node01. The MPI application was run over the dapl fabric.
Solution: Find out the reason for the MPI process termination. Possible causes include a host reboot, an unexpected signal, or OOM manager errors. Check the system log files.
Example 2. Symptom/error message:
rank = 26, revents = 25, state = 8
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2969: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 25
Fatal error in PMPI_Alltoall: A process has failed, error stack:
PMPI_Alltoall(1062).......: MPI_Alltoall(sbuf=0x9dd7d0, scount=64, MPI_BYTE, rbuf=0x9dc7b0, rcount=64, MPI_BYTE, comm=0x84000000) failed
MPIR_Alltoall_impl(860)...:
MPIR_Alltoall(819)........:
MPIR_Alltoall_intra(360)..:
dequeue_and_set_error(917): Communication error with rank 2
rank = 45, revents = 25, state = 8
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2969: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 84
...
Fatal error in PMPI_Alltoall: A process has failed, error stack:
PMPI_Alltoall(1062).......: MPI_Alltoall(sbuf=MPI_IN_PLACE, scount=-1, MPI_DATATYPE_NULL, rbuf=0x2ba2922b4010, rcount=8192, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Alltoall_impl(860)...:
MPIR_Alltoall(819)........:
MPIR_Alltoall_intra(265)..:
MPIC_Sendrecv_replace(658):
dequeue_and_set_error(917): Communication error with rank 84
...
and/or:
================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 21686 RUNNING AT node01
= EXIT CODE: 15
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
================================================================================
Note: The node and the MPI process reported above may not be the ones where the initial failure occurred.
Cause: One of the MPI processes was terminated by a signal (for example, SIGTERM or SIGKILL). The MPI application was run over the tcp fabric. In such cases, the MPI application may hang.
Solution: Find out the reason for the MPI process termination. Possible causes include a host reboot, an unexpected signal, or OOM manager errors. Check the system log files.
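To verify that basic MPI communication between the affected nodes works at all, you can run a small collective test. This is a sketch only: it assumes the Intel® MPI Benchmarks binary IMB-MPI1 is available in your environment and that node01 and node02 are the hosts involved; substitute your own host names and process counts:

# Run a small Alltoall test across two nodes to exercise inter-node communication
$ mpirun -n 4 -ppn 2 -hosts node01,node02 IMB-MPI1 Alltoall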
Example 3. Symptom/error message:
[mpiexec@node00] control_cb (../../pm/pmiserv/pmiserv_cb.c:773): connection to proxy 1 at host node01 failed
[mpiexec@node00] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@node00] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@node00] main (../../ui/mpich/mpiexec.c:1063): process manager error waiting for completion
Cause: The remote pmi_proxy process was terminated by the SIGKILL (9) signal on node01.
Solution: Find out the reason for the pmi_proxy process termination. Possible causes include a host reboot, an unexpected signal, or OOM manager errors. Check the system log files.
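Since SIGKILL is the signal sent by the Linux OOM killer, it is often worth checking the kernel log on the remote node where pmi_proxy ran. A minimal sketch, assuming SSH access to node01:

# Look on the remote node for evidence that the process was killed by the OOM killer
$ ssh node01 'dmesg -T | grep -i -E "out of memory|killed process"'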
Example 4. Symptom/error message:
Failed to connect to host node01 port 22: No route to host
Cause: One of the MPI compute nodes (node01) is not reachable over the network. In such cases, the MPI application may hang.
Solution: Check the network interfaces on the nodes and make sure the host is accessible.
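For example, basic name resolution and reachability can be checked from the node where mpiexec runs. The commands below are a sketch assuming node01 is the unreachable host:

# Check that the host name resolves to the expected address
$ getent hosts node01
# Check basic IP reachability
$ ping -c 3 node01
# On the affected node itself (via console access), check interface and link state
$ ip addr show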
Example 5. Symptom/error message:
Failed to connect to host node01 port 22: Connection refused
Cause: The MPI remote node access mechanism is SSH, and the SSH service is not running on node01.
Solution: Check the state of the SSH service on the nodes.
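For example, you can verify where the SSH connection fails and check the service state on node01. A minimal sketch, assuming a systemd-based distribution where the service unit is named sshd (on some distributions it is ssh):

# Try an SSH connection with verbose output to see where it fails
$ ssh -v node01 hostname
# On node01 itself (via console access), check and, if needed, restart the SSH service
$ systemctl status sshd
$ sudo systemctl restart sshd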