Intel® Fortran Compiler 16.0 User and Reference Guide
This topic only applies to Intel® Many Integrated Core Architecture (Intel® MIC Architecture).
To transfer data between the CPU and the coprocessor, use the OFFLOAD_TRANSFER directive with either all in clauses or all out clauses. Without a signal clause the data transfer is synchronous: The next statement is executed only after the data transfer is complete.
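As a minimal sketch of the synchronous form, the program below sends one array to the coprocessor with no signal clause. The program name sync_transfer, the array a, and the size n are hypothetical names chosen for illustration; the directive spelling follows the examples later in this topic. A compiler that does not recognize the !dir$ lines treats them as comments and runs everything on the CPU.

```fortran
program sync_transfer
  implicit none
  integer, parameter :: n = 1024        ! hypothetical array size
  real, allocatable :: a(:)             ! hypothetical data set
  !dir$ attributes offload:mic :: a

  allocate(a(n))
  a = 2.0

  ! No signal clause: the transfer is synchronous, so the CPU does not
  ! move past this directive until the data has been sent.
  !dir$ offload_transfer target(mic:0) in(a)

  ! This statement executes only after the transfer above is complete.
  print *, 'transfer finished, sum(a) =', sum(a)

  deallocate(a)
end program sync_transfer
```

Adding a signal clause to the same directive is what turns this into the asynchronous form described next.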
OFFLOAD_TRANSFER with a signal makes the data transfer asynchronous. The tag specified in the signal clause is an address expression associated with that dataset. The data transfer is initiated and the CPU can continue past the directive statement.
A later directive written with a wait clause causes the activity specified in the directive to begin only after all the data associated with the tag has been received. The data is placed into the variables specified when the data transfer was initiated. These variables must still be accessible.
Alternatively, you can call the non-blocking API OFFLOAD_SIGNALED() to check whether a section of offloaded code has completed running on a specific target device.
The signal and wait clauses, the OFFLOAD_WAIT construct and the OFFLOAD_SIGNALED() API refer to a specific target device, so you must specify target-number in the target() clause.
Querying a signal before it has been initiated results in undefined behavior and a runtime abort of the application. For example, consider a query of signal SIG1 on target device 0 when the signal was actually initiated for target device 1. No signal SIG1 is associated with target device 0, so the application aborts.
If, during an asynchronous offload, a signal is created in one thread (Thread A) and waited for in a different thread (Thread B), you are responsible for ensuring that Thread B does not query the signal before Thread A has initiated the asynchronous offload that sets up the signal. If Thread B queries the signal too early, the application aborts at run time.
If the if-specifier evaluates to false and you use a signal (tag) clause, the signal is undefined and any wait on that signal has undefined behavior.
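The sketch below illustrates the target-matching rule: the signal is initiated on target device 0, and the OFFLOAD_WAIT construct names that same device. The program name wait_on_target, the variables a, b, and sig, and the size n are hypothetical; the clause layout of offload_wait shown here (a target clause plus a wait clause) is an assumption based on the clauses named in this topic, so check it against the directive reference before relying on it.

```fortran
program wait_on_target
  implicit none
  integer, parameter :: n = 1024        ! hypothetical array size
  real, allocatable :: a(:)             ! data sent to the coprocessor
  real :: b(n)                          ! unrelated host-side data
  integer :: sig                        ! tag for the asynchronous transfer
  !dir$ attributes offload:mic :: a

  allocate(a(n))
  a = 1.0

  ! Initiate the signal on target device 0. The CPU continues immediately.
  !dir$ offload_transfer target(mic:0) in(a) signal(sig)

  ! Independent CPU work that does not touch a while it is in transit.
  b = 0.5

  ! Wait on the SAME target device that initiated the signal. Naming a
  ! different device here (for example mic:1) would query a signal that
  ! was never initiated on that device.
  !dir$ offload_wait target(mic:0) wait(sig)

  print *, 'sum(b) =', sum(b)
  deallocate(a)
end program wait_on_target
```

Keeping the target number in one named constant or variable, and using it in both directives, is one way to avoid an accidental mismatch.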
To transfer data asynchronously from the CPU to the coprocessor, use a signal clause in an OFFLOAD_TRANSFER directive with in clauses. The variables listed in the in clauses form a data set. The directive initiates the data transfer of those variables from the CPU to the coprocessor. A subsequent OFFLOAD directive with a wait clause that uses the same value for tag as that used in the signal clause causes the statement controlled by the directive to begin execution on the coprocessor only after the data transfer is complete.
To transfer data asynchronously from the coprocessor to the CPU, use the signal and wait clauses in two different directives. The first offload directive performs the computation, but only initiates the data transfer. The second directive causes a wait for the data transfer to complete.
In the following example, the data transfer of the floating-point array f1 is initiated at line 10, and f2 is initiated at line 12. The offloads do not initiate a computation. Their only purpose is to start transferring f1 and f2 to the coprocessor. At line 14 the CPU initiates the computation of the function foo on the coprocessor. The function uses the data f1 and f2, whose transfer was initiated earlier. The execution of the offloaded region on the coprocessor begins only after the transfer of f1 and f2 completes. The variable result returns the results of the computation.
    01 integer, parameter:: n=4086
    02 real, allocatable :: f1(:), f2(:), result
    03 !dir$ attributes offload:mic :: f1, f2, foo
    04 integer :: signal_1, signal_2
    05 !dir$ attributes align : 64 :: f1
    06 !dir$ attributes align : 64 :: f2
    07 allocate(f1(n))
    08 allocate(f2(n))
    09 f1 = 1.0
    10 !dir$ offload_transfer target (mic:0) in(f1) signal(signal_1)
    11 f2 = 3.14
    12 !dir$ offload_transfer target (mic:0) in(f2) signal(signal_2)
    13 !dir$ offload begin target(mic:0) wait (signal_1, signal_2)
    14 result = foo(n, f1, f2)
    15 !dir$ end offload
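Multiple independent asynchronous data transfers can occur at any time. In the example below, the offload at line 10 sends f1 to the coprocessor, runs foo there without copying f2 (nocopy), and is tagged with signal_1. The offload_transfer at line 13 waits on signal_1 and then copies f2 from the coprocessor back to the CPU.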
    01 program main
    02 integer, parameter:: n=4086
    03 real, allocatable :: f1(:), f2(:), result
    04 !dir$ attributes offload:mic :: f1, f2, foo
    05 integer :: signal_1, signal_2
    06 !dir$ attributes align : 64 :: f1
    07 !dir$ attributes align : 64 :: f2
    08 allocate(f1(n))
    09 allocate(f2(n))
    10 !dir$ offload begin target(mic:0) in (f1 ) nocopy (f2) signal(signal_1)
    11 call foo(N, f1, f2)
    12 !dir$ end offload
    13 !dir$ offload_transfer target(mic:0) wait(signal_1) out (f2)
    14 end program main
The following example double buffers inputs to an offload. The data transfer of the floating-point array in1 is initiated at line 15. That offload does not initiate a computation. Its only purpose is to start transferring in1 to the coprocessor. Within the do loop, either in1 or in2 is transferred to the coprocessor, and computation starts on whichever set has already been transferred. At line 20 the CPU initiates the computation of the function compute on the coprocessor, and tells it to work on in1. At line 24, the CPU initiates the computation of the function compute on the coprocessor, but tells it to work on in2, whose transfer was initiated at line 18.
    01 module M
    02 integer, parameter :: NNN = 100
    03 integer, parameter :: count = 25000000
    04 integer :: arr(NNN)
    05 real :: dd
    06 !dir$ attributes offload:mic::arr, dd
    07 end module M
    08 subroutine do_async_in()
    09 !dir$ attributes offload:mic :: compute
    10 use m
    11 integer i, signal_1, signal_2, iter
    12 real, allocatable :: in1(:), in2(:)
    13 real, allocatable :: out1(:), out2(:)
    14 iter = 10
    15 !dir$ offload_transfer target(mic:0) in(in1 : length(count) alloc_if(.false.) free_if(.false.) ) signal(signal_1)
    16 do i=1, iter
    17 if (mod(i,2) == 1) then   ! odd i first, so signal_1 from line 15 is waited on before signal_2
    18 !dir$ offload_transfer target(mic:0) if(i .ne. iter) in(in2 : length(count) alloc_if(.false.) free_if(.false.) ) signal(signal_2)
    19 !dir$ offload target(mic:0) nocopy(in1) wait(signal_1) out(out1 : length(count) alloc_if(.false.) free_if(.false.) )
    20 call compute(in1, out1)
    21 else
    22 !dir$ offload_transfer target(mic:0) if(i .ne. iter) in(in1 : length(count) alloc_if(.false.) free_if(.false.) ) signal(signal_1)
    23 !dir$ offload target(mic:0) nocopy(in2) wait(signal_2) out(out2 : length(count) alloc_if(.false.) free_if(.false.) )
    24 call compute(in2, out2)
    25 endif
    26 end do
    27 end subroutine do_async_in