Communication Problems

Communication problems with the Intel® MPI Library are usually caused by the termination of an MPI process by a signal (SIGTERM, SIGKILL, or another signal). Such terminations may be due to a host reboot, an unexpected signal, out-of-memory (OOM) killer activity, or other causes.

To deal with such failures, you need to find out the reason for the MPI process termination (for example, by checking the system log files).
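
For example, on a typical Linux node, commands such as the following can reveal OOM killer activity or an unexpected reboot. They are shown only as an illustration; the exact tools and log locations vary by distribution:

$ dmesg -T | grep -iE 'out of memory|killed process'
$ journalctl -k --since "1 hour ago" | grep -i oom
$ last -x reboot | head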

Example 1

Symptom/Error Message

[50:node02] unexpected disconnect completion event from [41:node01]

and/or

================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 20066 RUNNING AT node01
= EXIT CODE: 15
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
================================================================================

The node and MPI process reported in this message may not be the ones where the initial failure occurred.

Cause

One of the MPI processes was terminated by a signal (for example, SIGTERM or SIGKILL) on node01. The MPI application was run over the dapl fabric.

Solution

Try to find out the reason for the MPI process termination, such as a host reboot, an unexpected signal, or OOM killer activity. Check the system log files.

Example 2

Symptom/Error Message

rank = 26, revents = 25, state = 8
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c
at line 2969: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 25
Fatal error in PMPI_Alltoall: A process has failed, error stack:
PMPI_Alltoall(1062).......: MPI_Alltoall(sbuf=0x9dd7d0, scount=64, MPI_BYTE, rbuf=0x9dc7b0,
rcount=64, MPI_BYTE, comm=0x84000000) failed
MPIR_Alltoall_impl(860)...:
MPIR_Alltoall(819)........:
MPIR_Alltoall_intra(360)..:
dequeue_and_set_error(917): Communication error with rank 2rank = 45, revents = 25,
state = 8
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c
at line 2969: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 84
...
Fatal error in PMPI_Alltoall: A process has failed, error stack:
PMPI_Alltoall(1062).......: MPI_Alltoall(sbuf=MPI_IN_PLACE, scount=-1, MPI_DATATYPE_NULL,
rbuf=0x2ba2922b4010, rcount=8192, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Alltoall_impl(860)...:
MPIR_Alltoall(819)........:
MPIR_Alltoall_intra(265)..:
MPIC_Sendrecv_replace(658):
dequeue_and_set_error(917): Communication error with rank 84
...

and/or

================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 21686 RUNNING AT node01
= EXIT CODE: 15
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
================================================================================

The node and MPI process reported in this message may not be the ones where the initial failure occurred.

Cause

One of the MPI processes was terminated by a signal (for example, SIGTERM or SIGKILL). The MPI application was run over the tcp fabric. In this case, the MPI application may hang.

Solution

Try to find out the reason for the MPI process termination, such as a host reboot, an unexpected signal, or OOM killer activity. Check the system log files.

Example 3

Symptom/Error Message

[mpiexec@node00] control_cb (../../pm/pmiserv/pmiserv_cb.c:773): connection to proxy
1 at host node01 failed
[mpiexec@node00] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76):
callback returned error status
[mpiexec@node00] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501):
error waiting for event
[mpiexec@node00] main (../../ui/mpich/mpiexec.c:1063): process manager error waiting
for completion

Cause

The remote pmi_proxy process was terminated by the SIGKILL (9) signal on node01.

Solution

Try to find out the reason for the pmi_proxy process termination, such as a host reboot, an unexpected signal, or OOM killer activity. Check the system log files.
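
For example, you can check node01 directly for signs of the kill. The commands below are only an illustration and assume SSH access to the node:

$ ssh node01 "dmesg -T | grep -iE 'oom|killed process'"
$ ssh node01 "last -x reboot | head -3"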

Example 4

Symptom/Error Message

Failed to connect to host node01 port 22: No route to host

Cause

One of the MPI compute nodes (node01) is not reachable on the network. In this case, the MPI application may hang.

Solution

Check the network interfaces on the nodes and make sure the host is accessible.
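
For example, the following commands, shown only as an illustration, can help verify reachability and interface state. Run the last two on node01 itself (for example, from its console), since the node is not reachable over the network:

$ ping -c 3 node01
$ ip addr show
$ ip route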

Example 5

Symptom/Error Message

Failed to connect to host node01 port 22: Connection refused

Cause

The MPI remote node access mechanism is SSH, and the SSH service is not running on node01.

Solution

Check the state of the SSH service on the nodes.
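
For example, on a systemd-based system you can check and start the SSH daemon on node01. The service may be named sshd or ssh depending on the distribution; these commands are only an illustration:

$ systemctl status sshd
$ sudo systemctl start sshd
$ sudo systemctl enable sshd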