Intel® C++ Compiler 16.0 User and Reference Guide
This topic only applies to Intel® Many Integrated Core Architecture (Intel® MIC Architecture).
Almost all vector intrinsic functions supporting Intel® Initial Many Core Instructions (Intel® IMCI) have the form:
vop v1 {k1}, v2, S(v3/m)
where v1 is a destination operand. The instructions are writemasked, so only those elements with the corresponding bit set in vector mask register k1 are computed and stored into v1. Elements in v1 with the corresponding bit clear in k1 retain their previous values.
This means that the destination vector v1 is also a source vector, so it must be passed to the intrinsic function as an additional parameter.
The 512-bit vector intrinsics work in an element-wise manner: the first element of the first source vector is operated on together with the first element of the second source vector, and the result is stored in the first element of a destination vector, and so on for the remaining seven or 15 elements.
The contents of a 512-bit vector may be treated as either eight or 16 elements, depending on the intrinsic. For example, in the intrinsic functions:
The vector mask register that serves as the writemask for a vector intrinsic determines which element locations are actually operated upon; the mask can disable the operation and update for any combination of element locations.
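The writemask behavior described above can be sketched in plain C. This is only a scalar model, not the intrinsic itself: the function name model_vadd_writemask, the LANES constant, the choice of addition as the operation, and the use of unsigned 32-bit elements (so that overflow wraps) are all assumptions made for illustration.

```c
#include <stdint.h>

#define LANES 16  /* a 512-bit vector viewed as 16 elements of 32 bits */

/* Scalar model of "vop v1 {k1}, v2, v3" with vop taken to be an add.
   v1 is both the destination and the source of the preserved elements.
   Hypothetical helper for illustration; not part of any Intel header. */
void model_vadd_writemask(uint32_t v1[LANES], uint16_t k1,
                          const uint32_t v2[LANES],
                          const uint32_t v3[LANES])
{
    for (int i = 0; i < LANES; i++) {
        if ((k1 >> i) & 1u) {
            /* Bit i is set: element i is computed and stored. */
            v1[i] = v2[i] + v3[i];
        }
        /* Bit i is clear: v1[i] retains its previous value. */
    }
}
```

With k1 equal to 0x0005, for instance, only elements 0 and 2 of v1 are overwritten; the other 14 elements are left as they were.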
Most vector intrinsics have three different vector operands (typically, two sources and one destination) except those instructions that have a single source and thus use only two operands.
In addition, any of the source vectors can be a result of permutation operations on memory registers or vectors.
To simplify usage and to enable compiler optimizations, we provide a pair of intrinsics for each vector instruction: an unmasked variant and a masked variant.
It is important to understand the following points about the variants:
In the unmasked variant, only the source vectors (v2 and v3) are passed as parameters. All elements are operated on, as if every corresponding bit were set to '1' in the default mask register k0. The mask register k0 is not part of the argument list.
_mm512_<vop>(v2, v3)
In the masked variant, two additional registers are passed as parameters - v1_old and k1.
_mm512_mask_<vop>(v1_old, k1, v2, v3)
Those elements in v2 and v3 with the corresponding bit clear (set to '0') in vector mask k1 are not used for the operation. Instead, the corresponding element from v1_old is copied to the result vector. The following piece of code explains this concept:
if (mask[i] == 1)
    Result[i] = v2[i] + v3[i];
else
    Result[i] = v1_old[i];
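The relationship between the two variants can be modeled in scalar C as follows. This is a sketch under stated assumptions: vec16, mask16, model_add, and model_mask_add are hypothetical stand-ins for a 512-bit vector type, a 16-bit mask type, and the _mm512_<vop> / _mm512_mask_<vop> pair, with addition of unsigned 32-bit elements chosen as the example operation.

```c
#include <stdint.h>

/* Hypothetical stand-ins for a 512-bit vector of 16 32-bit elements
   and a 16-bit vector mask; not the compiler's own types. */
typedef struct { uint32_t e[16]; } vec16;
typedef uint16_t mask16;

/* Models the masked form _mm512_mask_<vop>(v1_old, k1, v2, v3),
   with <vop> assumed to be an add. */
vec16 model_mask_add(vec16 v1_old, mask16 k1, vec16 v2, vec16 v3)
{
    vec16 result;
    for (int i = 0; i < 16; i++) {
        /* Bit set: compute from v2 and v3. Bit clear: copy from v1_old. */
        result.e[i] = ((k1 >> i) & 1u) ? v2.e[i] + v3.e[i] : v1_old.e[i];
    }
    return result;
}

/* Models the unmasked form _mm512_<vop>(v2, v3): every element is
   computed, which is the same as the masked form with all bits set.
   The first argument is then never read, so any vector will do. */
vec16 model_add(vec16 v2, vec16 v3)
{
    return model_mask_add(v2, 0xFFFFu, v2, v3);
}
```

Expressing the unmasked form through the masked one is purely a modeling convenience here; it says nothing about how the compiler implements either intrinsic.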
To make the workings of the mask vector k1 clear, here is an example.
Consider an intrinsic that performs an element-by-element addition with carry, where the two source vectors are v1 and v3. The vector carry holds the carry-out values. The vector k2_old supplies elements to the resulting carry vector for positions that are masked off.
For the masked variant of the intrinsic, the vector k1 is a mask of 16 bits. If bit number 3 in k1 is set to '1', then element 3 of the resulting vector is the sum of element 3 of v1 and element 3 of v3, and element 3 of carry is the carry out of that sum.
In addition, if bit number 2 in the mask k1 is '0', then element 2 of the resulting vector is equal to element 2 of v1, and element 2 of carry is equal to element 2 of k2_old.
The code below demonstrates how it works:
for (i = 0; i < 16; i++) {
    res[i] = v1[i];
    (*carry)[i] = k2_old[i];
    if (k1[i] == 1) {
        res[i] = res[i] + v3[i];
        (*carry)[i] = Carry(v1[i] + v3[i]);
    }
}
The v1_old vector is used similarly to the k2_old vector. It supplies elements to the resulting vector wherever the corresponding bit in the mask vector k1 is set to '0'.
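The add-with-carry example can also be written as compilable scalar C. The sketch below follows the behavior described in this topic (it produces a carry out and does not add a carry in). The function name model_mask_add_setcarry, the parameter order, the unsigned 32-bit element type, and the packing of the carry bits into a 16-bit integer are assumptions for illustration only.

```c
#include <stdint.h>

/* Scalar model of the masked add that also reports a carry per element.
   Bit i of k1, k2_old, and *carry corresponds to element i.
   Hypothetical helper; not an Intel intrinsic. */
void model_mask_add_setcarry(uint32_t res[16], uint16_t *carry,
                             const uint32_t v1[16], uint16_t k1,
                             uint16_t k2_old, const uint32_t v3[16])
{
    uint16_t c = k2_old;              /* masked-off bits come from k2_old */

    for (int i = 0; i < 16; i++) {
        res[i] = v1[i];               /* masked-off elements come from v1 */
        if ((k1 >> i) & 1u) {
            uint32_t sum = v1[i] + v3[i];      /* wraps modulo 2^32 */
            uint16_t bit = (uint16_t)(1u << i);
            res[i] = sum;
            if (sum < v1[i])
                c |= bit;             /* unsigned wraparound: carry out */
            else
                c &= (uint16_t)~bit;  /* no carry out of this element */
        }
    }
    *carry = c;
}
```

For example, with every element of v1 equal to 0xFFFFFFFF, v3 equal to 1 in every element except element 3 (which is 0), k1 equal to 0x000B, and k2_old equal to 0x0004, elements 0 and 1 wrap to 0 and set their carry bits, element 2 keeps its v1 value and takes its carry bit from k2_old, and element 3 produces no carry.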