Intel® C++ Compiler 16.0 User and Reference Guide
This topic only applies to Intel® Many Integrated Core Architecture (Intel® MIC Architecture).
To transfer data between the CPU and the coprocessor, use the offload_transfer pragma with either all in clauses or all out clauses. Without a signal clause the data transfer is synchronous: the next statement is executed only after the data transfer is complete.
Using offload_transfer with a signal clause makes the data transfer asynchronous. The tag specified in the signal clause is an address expression associated with that data set. The data transfer is initiated and the CPU can continue past the pragma statement.
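As an illustration of the synchronous case, the following minimal sketch sends a statically allocated array to coprocessor 0 and then reuses that copy in a later offload. The names data, sum_data, and sync_transfer_then_sum are hypothetical and chosen only for this illustration.

```c
#include <stdio.h>

#define N 8

/* Make the array and the helper available on both CPU and coprocessor. */
#pragma offload_attribute(push, target(mic))
float data[N];

float sum_data(void)
{
    float s = 0.0f;
    int i;
    for (i = 0; i < N; i++) {
        s += data[i];
    }
    return s;
}
#pragma offload_attribute(pop)

/* Hypothetical helper: fill data on the CPU, send it synchronously,
   then sum it in an offloaded region. */
float sync_transfer_then_sum(void)
{
    float result = 0.0f;
    int i;

    for (i = 0; i < N; i++) {
        data[i] = (float)(i + 1);
    }

    /* No signal clause: the CPU does not proceed to the next
       statement until the transfer of data is complete. */
    #pragma offload_transfer target(mic:0) in(data)

    /* The values are already on the coprocessor, so nocopy reuses them. */
    #pragma offload target(mic:0) nocopy(data) out(result)
    result = sum_data();

    return result;
}
```

Because the transfer is synchronous, no wait clause is needed on the later offload.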
A later pragma written with a wait clause causes the activity specified in the pragma to begin only after all the data associated with the tag has been received. The data is placed into the variables specified when the data transfer was initiated. These variables must still be accessible.
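Alternatively, you can use the non-blocking API _Offload_signaled() to check, without waiting, whether the activity associated with a signal has completed. This API can also determine whether a section of offloaded code has completed running on a specific target device.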
The signal and wait clauses, the offload_wait construct and the _Offload_signaled() API refer to a specific target device, so you must specify target-number in the target() clause.
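The polling approach can be sketched as follows. This is a minimal sketch, assuming that offload.h declares _Offload_signaled(target_number, signal) and that __INTEL_OFFLOAD is defined when offload is enabled. The transfer_done macro and the send_and_poll function are hypothetical names introduced here; the fallback branch exists only so the sketch also builds where offload support is absent.

```c
#include <stdlib.h>

#ifdef __INTEL_OFFLOAD
#include <offload.h>   /* assumed to declare _Offload_signaled() */
#define transfer_done(target, tag) _Offload_signaled((target), (tag))
#else
/* Assumption for this sketch only: without offload support the
   pragmas are ignored, so there is nothing to wait for. */
#define transfer_done(target, tag) ((void)(target), (void)(tag), 1)
#endif

/* Hypothetical helper: start an asynchronous transfer of p to
   coprocessor 0 and poll for completion while doing CPU work.
   Returns the number of polling iterations performed. */
long send_and_poll(float *p, int n)
{
    long polls = 0;

    /* Initiate the transfer; the CPU continues immediately. */
    #pragma offload_transfer target(mic:0) signal(p) \
        in(p : length(n) alloc_if(1) free_if(0))

    /* Query the same target number (0) that the signal was
       initiated on; a different number would be an error. */
    while (!transfer_done(0, p)) {
        polls++;   /* useful CPU work would go here */
    }

    /* Release the coprocessor copy once it is no longer needed. */
    #pragma offload_transfer target(mic:0) \
        nocopy(p : length(n) alloc_if(0) free_if(1))

    return polls;
}
```

The explicit target number in target(mic:0) matches the first argument of the query, which is the requirement described above.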
Querying a signal before it has been initiated is undefined behavior and results in a runtime abort of the application. For example, consider a query of a signal (SIG1) on target device 0, where the signal was actually initiated for target device 1. No signal (SIG1) is associated with target device 0, so the application aborts.
If, during an asynchronous offload, a signal is created in one thread, Thread A, and waited for in a different thread, Thread B, you are responsible for ensuring that Thread B does not query the signal before Thread A has initiated the asynchronous offload that sets up the signal. If Thread B queries the signal too early, the application aborts at run time.
If if-specifier evaluates to false and you use a signal (tag) clause, then the signal is undefined and any wait on this signal has undefined behavior.
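One way to avoid waiting on an undefined signal is to guard the wait with the same condition that guarded the transfer. The following is a minimal sketch; the function transfer_if and the flag use_mic are hypothetical names used only for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: conditionally start an asynchronous transfer
   and wait for it only if it was actually initiated.
   Returns 1 if a wait was performed, 0 otherwise. */
int transfer_if(float *p, int n, int use_mic)
{
    int waited = 0;

    /* When use_mic is false, no transfer is initiated and the
       signal associated with p is undefined. */
    #pragma offload_transfer target(mic:0) if(use_mic) signal(p) \
        in(p : length(n) alloc_if(1) free_if(0))

    /* Apply the same condition to the wait so that an undefined
       signal is never waited on. */
    if (use_mic) {
        #pragma offload_wait target(mic:0) wait(p)
        waited = 1;
    }

    return waited;
}
```

The data is left allocated on the coprocessor (free_if(0)) on the assumption that a later offload uses and frees it.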
To transfer data asynchronously from the CPU to the coprocessor, use a signal clause in an offload_transfer pragma with in clauses. The variables listed in the in clauses form a data set. The pragma initiates the data transfer of those variables from the CPU to the coprocessor. A subsequent offload pragma with a wait clause that uses the same value for tag as that used in the signal clause causes the statement controlled by the pragma to begin execution on the coprocessor only after the data transfer is complete.
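The pattern just described can be sketched as follows. The functions scale and async_in_then_compute are hypothetical and serve only to illustrate pairing a signal clause with a matching wait clause.

```c
#include <stdio.h>

/* Helper compiled for both CPU and coprocessor. */
#pragma offload_attribute(push, target(mic))
void scale(float *v, int n, float k)
{
    int i;
    for (i = 0; i < n; i++) {
        v[i] = v[i] * k;
    }
}
#pragma offload_attribute(pop)

/* Hypothetical helper: send v asynchronously, then run an offload
   that starts only after the transfer has completed. */
void async_in_then_compute(float *v, int n)
{
    /* Initiate the transfer of the data set; the CPU continues. */
    #pragma offload_transfer target(mic:0) signal(v) \
        in(v : length(n) alloc_if(1) free_if(0))

    /* Independent CPU work could be done here. */

    /* Same tag (v) as in the signal clause: the statement below
       begins on the coprocessor only after the transfer is done.
       The results are copied back and the buffer is freed. */
    #pragma offload target(mic:0) wait(v) \
        in(n) \
        out(v : length(n) alloc_if(0) free_if(1))
    scale(v, n, 2.0f);
}
```

The offload itself is synchronous here, so v holds the scaled values on the CPU when the function returns.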
To transfer data asynchronously from the coprocessor to the CPU, use the signal and wait clauses in two different pragmas. The first offload pragma performs the computation, but only initiates the data transfer. The second pragma causes a wait for the data transfer to complete.
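A minimal sketch of the two-pragma pattern follows. The functions square_all and compute_then_async_out are hypothetical names introduced for this illustration.

```c
#include <stdio.h>

/* Helper compiled for both CPU and coprocessor. */
#pragma offload_attribute(push, target(mic))
void square_all(float *a, float *r, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        r[i] = a[i] * a[i];
    }
}
#pragma offload_attribute(pop)

/* Hypothetical helper: compute on the coprocessor, let the results
   come back asynchronously, and wait before using them. */
void compute_then_async_out(float *a, float *r, int n)
{
    /* First pragma: performs the computation and only initiates
       the transfer of r back to the CPU. */
    #pragma offload target(mic:0) signal(r) \
        in(n) \
        in(a : length(n)) \
        out(r : length(n))
    square_all(a, r, n);

    /* The CPU continues here; r must not be read yet. */

    /* Second pragma: wait until the data tagged by r has arrived. */
    #pragma offload_wait target(mic:0) wait(r)

    /* r is now valid on the CPU. */
}
```

Reading r between the two pragmas would race with the transfer, which is why the wait is placed before any use of the results.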
The example below demonstrates various asynchronous data transfers between the CPU and coprocessor.
  1  #include <stdio.h>
  2
  3  __attribute__((target(mic)))
  4  void add_inputs(int N, float *f1, float*f2);
  5
  6  void display_vals(int id, int N, float*f2);
  7
  8  int main()
  9  {
 10     const int N = 5;
 11     float *f1, *f2;
 12     int i, j;
 13
 14     f1 = (float *)_mm_malloc(N*sizeof(float),4096);
 15     f2 = (float *)_mm_malloc(N*sizeof(float),4096);
 16
 17     for (i=0;i<N;i++){
 18        f1[i]=i+1;
 19        f2[i]=0.0;
 20     }
 21
Section 1 below (lines 22-56) demonstrates asynchronous data transfers, using IN and OUT, between the CPU and coprocessor with asynchronous computation. The data transfer of the arrays f1 and f2 is initiated at lines 28-30. The offload_transfer does not initiate a computation. Its only purpose is to start transferring data for f1 and f2 to the coprocessor. At lines 40-44 the CPU initiates the computation, with the function add_inputs, on the coprocessor and continues execution to the offload_wait at line 51. The offloaded function uses the data f1 and f2, whose transfer was initiated earlier on the CPU. The execution of the offloaded region on the coprocessor begins only after the transfers of f1 and f2 are complete and the signal tag (f1) is set accordingly. While the offloaded region executes on the coprocessor, the CPU waits at line 51 pending completion of the computation and data transfer of the results in f2 to the CPU. Execution on the CPU continues beyond line 51 only after the data for f2 is transferred to the CPU and the signal tag (f2) is set accordingly.
 22     //----------- Section 1 --------------------------------------
 23
 24     // Asynchronous transfer IN (to coprocessor) of f1 and f2
 25     //
 26     // CPU issues send and then continues
 27
 28     #pragma offload_transfer target(mic:0) signal (f1) \
 29             in( f1 : length(N) alloc_if(1) free_if(0) ) \
 30             in( f2 : length(N) alloc_if(1) free_if(0) )
 31
 32     // Asynchronous compute and transfer OUT (to CPU) of f2
 33     //
 34     // CPU issues request to perform computation and continues
 35     //
 36     // Coprocessor receives offload request, waits for pre-sent
 37     // data. After receiving data, performs computation and
 38     // transfers (asynchronous) data OUT (to CPU)
 39
 40     #pragma offload target(mic:0) wait(f1) signal (f2) \
 41             in( N ) \
 42             nocopy( f1 : alloc_if(0) free_if(1) ) \
 43             out( f2 : length(N) alloc_if(0) free_if(1) )
 44     add_inputs(N, f1, f2);
 45
 46     // Wait for offload completion
 47     //
 48     // CPU waits for completion of previous offload and
 49     // data transfer out (to CPU) of f2
 50
 51     #pragma offload_wait target(mic:0) wait(f2)
 52
 53
 54     // Show current values
 55     display_vals(1, N, f2);
 56
In the same example, section 2 (lines 57-90) demonstrates multiple asynchronous data transfers, using IN, from the CPU to the coprocessor with synchronous computation and synchronous data transfer, using OUT, from the coprocessor to the CPU. Multiple independent asynchronous data transfers can occur at any time. The offload_transfer pragmas send f1 and f2 to the coprocessor at different times, first f1 in lines 63-64, and then f2 in lines 68-69. The transfers are independent. At lines 81-85 the execution of the offloaded region and the function add_inputs on the coprocessor begins only after the transfers of f1 and f2 are complete and the signal tags (f1 and f2) are both set accordingly. Execution on the CPU waits for the completion of the offloaded computation and data transfer of the results in f2 to the CPU. The data transfer of f2 to the CPU occurs synchronously with the execution of the offloaded region.
 57     //----------- Section 2 --------------------------------------
 58
 59     // Independent asynchronous transfers IN (to coprocessor)
 60     //
 61     // CPU issues send and continues
 62
 63     #pragma offload_transfer target(mic:0) signal (f1) \
 64             in( f1 : length(N) alloc_if(1) free_if(0) )
 65
 66     // CPU issues send and continues
 67
 68     #pragma offload_transfer target(mic:0) signal (f2) \
 69             in( f2 : length(N) alloc_if(1) free_if(0) )
 70
 71     // Wait for independent transfers IN (to coprocessor),
 72     // perform synchronous compute and data transfers out
 73     //
 74     // CPU issues request to perform computation and waits for
 75     // completion
 76     //
 77     // Coprocessor receives offload request, waits for pre-sent
 78     // data. After receiving data, performs computation and
 79     // transfers (synchronous) data OUT (to CPU)
 80
 81     #pragma offload target(mic:0) wait(f1 , f2) \
 82             in( N ) \
 83             nocopy( f1 : alloc_if(0) free_if(1) ) \
 84             out( f2 : length(N) alloc_if(0) free_if(1) )
 85     add_inputs(N, f1, f2);
 86
 87
 88     // Show current values
 89     display_vals(2, N, f2);
 90
Section 3 (lines 91-132) in the example demonstrates an independent asynchronous data transfer (IN) from the CPU to the coprocessor with synchronous data transfer (IN) from the CPU to the coprocessor and computation, followed by an independent asynchronous data transfer (OUT) from the coprocessor to the CPU. The offloaded function uses the data f1 and f2. The transfer of f2 was initiated earlier on the CPU at lines 97-98. The execution of the offloaded region on lines 111-115 on the coprocessor begins only after the transfers of f1 and f2 are complete and the signal tag (f2) is set accordingly for the transfer of f2. After the offloaded region executes on the coprocessor, the computed results of f2 remain on the coprocessor and execution on the CPU continues beyond line 115. At lines 122-123, the CPU initiates an asynchronous data transfer (OUT) from the coprocessor to the CPU for the computed results for f2 and continues execution to line 128, where the CPU waits for the completion of the transfer of f2. Execution on the CPU continues beyond line 128 only after the data for f2 is transferred to the CPU and the signal tag (f2) is set accordingly.
 91     //----------- Section 3 --------------------------------------
 92
 93     // Asynchronous transfer IN (to coprocessor) of f2
 94     //
 95     // CPU issues send and then continues
 96
 97     #pragma offload_transfer target(mic:0) signal(f2) \
 98             in( f2 : length(N) alloc_if(1) free_if(0) )
 99
100     // Synchronous transfer IN (to coprocessor) of f1 with
101     // synchronous compute of f2 where new computed values
102     // of f2 remain on coprocessor
103     //
104     // CPU transfers values IN (to coprocessor) of f1, then issues
105     // request to perform computation and waits for completion
106     //
107     // Coprocessor receives offload request, waits for pre-sent
108     // data for f2. After receiving data, performs the
109     // computation and holds the results in f2 on coprocessor
110
111     #pragma offload target(mic:0) wait(f2) \
112             in( N ) \
113             in ( f1 : length(N) alloc_if(1) free_if(0) ) \
114             nocopy( f2 )
115     add_inputs(N, f1, f2);
116
117
118     // CPU waits for completion of previous offload, then
119     // initiates asynchronous transfer OUT (to CPU) of f2
120     // and continues
121
122     #pragma offload_transfer target(mic:0) signal (f2) \
123             out( f2 : length(N) alloc_if(0) free_if(1) )
124
125
126     // CPU waits for completion of transfer of f2 to the CPU
127
128     #pragma offload_wait target(mic:0) wait(f2)
129
130     // Show current values
131     display_vals(3, N, f2);
132
133  }
134
135  void add_inputs (int N, float *f1, float*f2)
136  {
137     int i;
138
139     for (i=0; i<N; i++){
140        f2[i] = f2[i] + f1[i];
141     }
142  }
143
144  void display_vals (int id, int N, float *f2)
145  {
146     int i;
147
148     printf("\nResults after Offload #%d:\n",id);
149     for (i=0; i<N; i++){
150        printf(" f2[%d]= %f\n",i,f2[i]);
151     }
152  }
The following example double buffers inputs to an offload.
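#pragma offload_attribute(push, target(mic))
int count = 25000000;
int iter = 10;
float *in1, *out1;
float *in2, *out2;
/* compute() is not defined in this excerpt; this prototype is an
   assumed signature added so that the calls below are declared. */
void compute(float *in, float *out);
#pragma offload_attribute(pop)

void do_async_in()
{
    int i;

    /* Start sending the first input buffer. */
    #pragma offload_transfer target(mic:0) \
            in(in1 : length(count) alloc_if(0) free_if(0) ) signal(in1)

    for (i=0; i<iter; i++)
    {
        if (i%2 == 0) {
            /* Start sending in2 for the next iteration, except on the last one. */
            #pragma offload_transfer target(mic:0) if(i!=iter-1) \
                    in(in2 : length(count) alloc_if(0) free_if(0) ) signal(in2)

            /* Compute on in1 once its transfer has completed. */
            #pragma offload target(mic:0) nocopy(in1) wait(in1) \
                    out(out1 : length(count) alloc_if(0) free_if(0) )
            compute(in1, out1);
        } else {
            /* Start sending in1 for the next iteration, except on the last one. */
            #pragma offload_transfer target(mic:0) if(i!=iter-1) \
                    in(in1 : length(count) alloc_if(0) free_if(0) ) signal(in1)

            /* Compute on in2 once its transfer has completed. */
            #pragma offload target(mic:0) nocopy(in2) wait(in2) \
                    out(out2 : length(count) alloc_if(0) free_if(0) )
            compute(in2, out2);
        }
    }
}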